Release v3.10.0: External Embedder Endpoint · proffesor-for-testing/agentic-qe

What's New

AQE can now route its semantic vector layer to an external embedder endpoint instead of loading @huggingface/transformers in-process. Set one env var and AQE talks to any OpenAI-compatible /v1/embeddings server.

# Run any OpenAI-compatible /v1/embeddings server. Example with llama.cpp:
llama-server -m all-MiniLM-L6-v2.Q8_0.gguf --port 8080 --embeddings --pooling mean -c 512

# Point AQE at it:
export AQE_EMBEDDER_ENDPOINT=http://127.0.0.1:8080
# Or a Unix socket for same-host deployments:
export AQE_EMBEDDER_ENDPOINT=unix:/run/embedder.sock
# Optional bearer auth:
export AQE_EMBEDDER_TOKEN=your-token-here

That's it — no behavior change when the env var is unset.
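To make the configuration above concrete, here is a minimal sketch of how a client could interpret the endpoint value and build an OpenAI-style embeddings request. The names `parseEndpoint`, `buildEmbeddingsRequest`, and the `Transport` type are hypothetical and are not taken from AQE's source. The `model` value passed in is likewise an assumption, and AQE's actual parsing rules may differ.

```typescript
// Hypothetical transport descriptor; AQE's real internals may look different.
type Transport =
  | { kind: "in-process" }
  | { kind: "http"; baseUrl: string }
  | { kind: "unix"; socketPath: string };

// Interpret the endpoint value as the notes describe it:
// unset -> unchanged in-process behavior, "unix:<path>" -> Unix socket,
// otherwise an http(s) base URL.
function parseEndpoint(value: string | undefined): Transport {
  const v = (value ?? "").trim();
  if (v === "") return { kind: "in-process" };
  if (v.startsWith("unix:")) {
    const socketPath = v.slice("unix:".length);
    if (socketPath === "") throw new Error("unix: endpoint needs a socket path");
    return { kind: "unix", socketPath };
  }
  if (!/^https?:\/\//i.test(v)) {
    throw new Error(`unsupported endpoint: ${v}`);
  }
  // Drop trailing slashes so "/v1/embeddings" can be appended cleanly.
  return { kind: "http", baseUrl: v.replace(/\/+$/, "") };
}

// Build method, path, headers, and body for POST /v1/embeddings.
// The bearer header is only added when a token is supplied.
function buildEmbeddingsRequest(input: string[], model: string, token?: string) {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token) headers["Authorization"] = `Bearer ${token}`;
  const body = JSON.stringify({ model, input, encoding_format: "float" });
  return { method: "POST" as const, path: "/v1/embeddings", headers, body };
}
```

The same request body would be sent over either transport; only the connection target (TCP host and port versus socket path) changes.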

Why

Two real production pain points:

Co-deployments with ruflo/ruvector load byte-identical model weights in two or more processes, at roughly 45–90 MB of heap per copy of Xenova/all-MiniLM-L6-v2. A shared endpoint means one resident embedder and many warm clients.
Every aqe hooks … invocation is a fresh OS process that pays a full cold model load. Pointing hooks at a long-running embedder server removes that overhead: the cold path drops from ~1 s to about 31 ms end-to-end against a localhost llama-server.

Highlights

OpenAI wire format (encoding_format: 'float' pinned) — verified end-to-end against llama-server with all-MiniLM-L6-v2.Q8_0.gguf.
HTTP and HTTP-over-Unix-socket transports — one protocol, two transports.
An identity fingerprint, computed from a canary embedding, asserts dim === 384 and is persisted to memory.db, so model drift between runs triggers a loud warning on the next boot.
Circuit breaker (3 failures / 60s) with automatic re-probe on recovery — endpoint restarts often coincide with model swaps.
TLS knobs (ca, cert, key, rejectUnauthorized, servername) for self-hosted HTTPS endpoints.
Hard-fail on error — no silent hash fallback. Mixing hash and transformer embeddings in the same HNSW index silently degrades recall forever; the boundary refuses to do that.
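The circuit breaker item can be illustrated with a small sketch. It assumes that "3 failures / 60s" means the breaker opens after three consecutive failures and stays open for 60 seconds before a trial call is allowed. That reading is an assumption, and the `CircuitBreaker` class below is hypothetical rather than AQE's implementation. The clock is injectable so the behavior can be checked without waiting.

```typescript
// Hypothetical breaker: opens after `threshold` consecutive failures and
// rejects calls until `cooldownMs` has elapsed (assumed semantics).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold = 3, cooldownMs = 60_000, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True when a call may go out: breaker closed, or cooldown elapsed.
  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // Counts a failure; at or past the threshold the breaker (re)opens.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }

  // Returns true when this success closes a previously open breaker,
  // which is the point where a caller would re-run the identity probe.
  recordSuccess(): boolean {
    const recovered = this.openedAt !== null;
    this.failures = 0;
    this.openedAt = null;
    return recovered;
  }
}
```

Returning a "recovered" flag from `recordSuccess` is one way to connect recovery to a re-probe: a restarted endpoint may be serving a different model, so the identity check is worth repeating before trusting new vectors.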

Numbers (against real `llama-server` on localhost)

| Path | Time |
| --- | --- |
| Cold (`import + init + probe + embed`) | 30.7 ms |
| Warm (`embed` with cached identity + keep-alive socket) | 1.6 ms |
| In-process cold load (no endpoint, today's behavior) | ~1000 ms |

Compatibility

TEI, vLLM, Ollama, LocalAI, LM Studio, and OpenAI all advertise the OpenAI wire shape. Only llama-server has been verified end-to-end; the rest are expected to work, but each remains unverified until a per-provider integration test lands. The reference template is tests/integration/embedder-endpoint-llamacpp.test.ts.
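Since compatibility hinges on providers returning the same response shape, a client-side check along the following lines could catch a mismatched provider early. This is a sketch only: `extractEmbeddings` and `EmbeddingsResponse` are hypothetical names, the interface lists only the fields used here, and the hard failure mirrors the "no silent fallback" stance in the highlights rather than AQE's actual code.

```typescript
// Subset of an OpenAI-style /v1/embeddings response used by this sketch.
interface EmbeddingsResponse {
  data: { index: number; embedding: number[] }[];
}

// Output size asserted in the notes for all-MiniLM-L6-v2.
const EXPECTED_DIM = 384;

// Validate the payload and return vectors in input order.
// Throws instead of falling back when anything looks wrong.
function extractEmbeddings(payload: unknown, expectedCount: number): number[][] {
  const data = (payload as Partial<EmbeddingsResponse> | null)?.data;
  if (!Array.isArray(data) || data.length !== expectedCount) {
    throw new Error("unexpected embeddings response shape");
  }
  // Defensive: order by the item index rather than trusting array order.
  const sorted = [...data].sort((a, b) => a.index - b.index);
  return sorted.map((item, i) => {
    const v = item.embedding;
    if (!Array.isArray(v) || v.length !== EXPECTED_DIM) {
      throw new Error(`item ${i}: expected ${EXPECTED_DIM} values`);
    }
    if (!v.every((x) => typeof x === "number" && Number.isFinite(x))) {
      throw new Error(`item ${i}: non-numeric embedding value`);
    }
    return v;
  });
}
```

A per-provider integration test could run a check like this against a live server's response before asserting anything about embedding quality.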

Getting Started

npx agentic-qe@3.10.0 init --auto

See CHANGELOG, v3.10.0 release notes, and ADR-097 for full details.

Closes #503.
