v3.10.0: External Embedder Endpoint
What's New
AQE can now route its semantic vector layer to an external embedder endpoint instead of loading @huggingface/transformers in-process. Set one env var and AQE talks to any OpenAI-compatible /v1/embeddings server.
# Run any OpenAI-compatible /v1/embeddings server. Example with llama.cpp:
llama-server -m all-MiniLM-L6-v2.Q8_0.gguf --port 8080 --embeddings --pooling mean -c 512
# Point AQE at it:
export AQE_EMBEDDER_ENDPOINT=http://127.0.0.1:8080
# Or a Unix socket for same-host deployments:
export AQE_EMBEDDER_ENDPOINT=unix:/run/embedder.sock
# Optional bearer auth:
export AQE_EMBEDDER_TOKEN=your-token-here

That's it. There is no behavior change when the env var is unset.
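The two environment variables above could be resolved along these lines. This is a minimal sketch, not AQE's actual code: the `resolveEmbedderTarget` helper, the `EmbedderTarget` type, and the trailing-slash normalization are all hypothetical, and only the variable names and the `http://` / `unix:` forms come from the examples above.

```typescript
// Hypothetical helper and types; AQE's real internals may look different.
type Env = Record<string, string | undefined>;

type EmbedderTarget =
  | { transport: "http"; baseUrl: string; token: string | undefined }
  | { transport: "unix"; socketPath: string; token: string | undefined };

// Returns null when AQE_EMBEDDER_ENDPOINT is unset or blank, which stands in
// for "keep using the in-process embedder".
function resolveEmbedderTarget(env: Env): EmbedderTarget | null {
  const raw = (env["AQE_EMBEDDER_ENDPOINT"] ?? "").trim();
  if (raw === "") return null;

  // An empty token is treated the same as no token (assumption).
  const token = (env["AQE_EMBEDDER_TOKEN"] ?? "").trim() || undefined;

  if (raw.startsWith("unix:")) {
    const socketPath = raw.slice("unix:".length);
    if (socketPath === "") {
      throw new Error("unix: endpoint is missing a socket path");
    }
    return { transport: "unix", socketPath, token };
  }

  if (/^https?:\/\//.test(raw)) {
    // Strip trailing slashes so "/v1/embeddings" can be appended directly.
    return { transport: "http", baseUrl: raw.replace(/\/+$/, ""), token };
  }

  throw new Error(`Unsupported AQE_EMBEDDER_ENDPOINT value: ${raw}`);
}
```

Passing the environment in as a plain object (for example `process.env` in Node) keeps the function easy to test, and rejecting unknown schemes up front avoids a confusing connection error later.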
Why
Two real production pain points:
- Co-deployments with `ruflo`/`ruvector` load byte-identical model weights in two or more processes. ~45–90 MB heap per copy of `Xenova/all-MiniLM-L6-v2`. Shared endpoint → one resident embedder, many warm clients.
- Every `aqe hooks …` invocation is a fresh OS process paying a full cold model load. Pointing hooks at a long-running embedder server eliminates that overhead: the cold path drops from ~1s to 15 ms end-to-end against localhost `llama-server`.
Highlights
- OpenAI wire format (`encoding_format: 'float'` pinned), verified end-to-end against `llama-server` with `all-MiniLM-L6-v2.Q8_0.gguf`.
- HTTP and HTTP-over-Unix-socket transports: one protocol, two transports.
- Identity fingerprint of a canary embedding asserts `dim === 384` and persists to `memory.db` so cross-run model drift fires a loud warning on next boot.
- Circuit breaker (3 failures / 60s) with automatic re-probe on recovery, because endpoint restarts often coincide with model swaps.
- TLS knobs (`ca`, `cert`, `key`, `rejectUnauthorized`, `servername`) for self-hosted HTTPS endpoints.
- Hard-fail on error, with no silent hash fallback. Mixing hash and transformer embeddings in the same HNSW index silently degrades recall forever; the boundary refuses to do that.
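The circuit-breaker bullet can be illustrated with a small state machine. This is a sketch under stated assumptions, not the shipped implementation: it reads "3 failures / 60s" as "open after 3 failures without an intervening success, then allow a trial request once 60 s have passed", and the `CircuitBreaker` class and its method names are hypothetical.

```typescript
// Hypothetical sketch; the semantics of "3 failures / 60s" are assumed here.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly maxFailures: number;
  private readonly cooldownMs: number;

  constructor(maxFailures = 3, cooldownMs = 60_000) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
  }

  // `now` is a millisecond timestamp, passed in so the logic stays testable.
  state(now: number): BreakerState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // True when a request may be attempted (closed, or a half-open trial).
  allow(now: number): boolean {
    return this.state(now) !== "open";
  }

  // Opens (or re-opens) the breaker once the failure threshold is reached.
  recordFailure(now: number): void {
    this.failures += 1;
    if (this.failures >= this.maxFailures) this.openedAt = now;
  }

  // Closes the breaker. Returns true if this success ends an outage, which
  // is the caller's cue to re-run the identity probe before trusting vectors.
  recordSuccess(): boolean {
    const recovered = this.openedAt !== null;
    this.failures = 0;
    this.openedAt = null;
    return recovered;
  }
}
```

Returning a "recovered" flag from `recordSuccess` is one way to tie recovery to a fresh probe, which matters if the server came back with a different model.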
Numbers (against real llama-server on localhost)
| Path | Time |
|---|---|
| Cold (import + init + probe + embed) | 30.7 ms |
| Warm (embed with cached identity + keep-alive socket) | 1.6 ms |
| In-process cold load (no endpoint, today's behavior) | ~1000 ms |
Compatibility
OpenAI shape is what TEI / vLLM / Ollama / LocalAI / LM Studio / OpenAI all advertise. End-to-end verified against llama-server only; the rest are expected to work but each is unverified until a per-provider integration test lands. The reference template is tests/integration/embedder-endpoint-llamacpp.test.ts.
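For reference, the OpenAI embeddings shape that these servers share can be sketched as a request builder plus a strict response parser. The function names are hypothetical and the `model` value is whatever the target server expects; the field names (`input`, `encoding_format`, `data[].embedding`, `data[].index`) follow the OpenAI embeddings format, and the default of 384 dimensions mirrors the `dim === 384` assertion mentioned in the highlights.

```typescript
// Hypothetical helpers illustrating the wire format; not AQE's actual code.
interface EmbeddingsRequest {
  model: string;
  input: string[];
  encoding_format: "float";
}

// Body for POST <endpoint>/v1/embeddings.
function buildEmbeddingsRequest(texts: string[], model: string): EmbeddingsRequest {
  return { model, input: texts, encoding_format: "float" };
}

// Validates a parsed JSON response and returns vectors ordered by `index`.
// Throws instead of falling back, in the spirit of the hard-fail policy.
function parseEmbeddingsResponse(body: unknown, expectedDim = 384): number[][] {
  const data = (body as { data?: unknown } | null)?.data;
  if (!Array.isArray(data)) {
    throw new Error("response is missing a `data` array");
  }

  const rows: Array<{ index: number; vector: number[] }> = [];
  for (const item of data as unknown[]) {
    if (typeof item !== "object" || item === null) {
      throw new Error("malformed embedding item");
    }
    const { index, embedding } = item as { index?: unknown; embedding?: unknown };
    if (typeof index !== "number" || !Array.isArray(embedding)) {
      throw new Error("embedding item lacks `index` or `embedding`");
    }
    if (embedding.length !== expectedDim) {
      throw new Error(`expected dim ${expectedDim}, got ${embedding.length}`);
    }
    if (!embedding.every((x) => typeof x === "number" && Number.isFinite(x))) {
      throw new Error("embedding contains non-finite values");
    }
    rows.push({ index, vector: embedding as number[] });
  }

  rows.sort((a, b) => a.index - b.index);
  return rows.map((row) => row.vector);
}
```

A per-provider integration test would typically send `buildEmbeddingsRequest(...)` to the server and run the JSON reply through a parser like this one, so that a shape or dimension mismatch fails loudly.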
Getting Started
npx agentic-qe@3.10.0 init --auto

See CHANGELOG, v3.10.0 release notes, and ADR-097 for full details.
Closes #503.