v3.10.0: External Embedder Endpoint

@proffesor-for-testing proffesor-for-testing released this 20 May 09:49
2c84e65

What's New

AQE can now route its semantic vector layer to an external embedder endpoint instead of loading @huggingface/transformers in-process. Set one env var and AQE talks to any OpenAI-compatible /v1/embeddings server.

```bash
# Run any OpenAI-compatible /v1/embeddings server. Example with llama.cpp:
llama-server -m all-MiniLM-L6-v2.Q8_0.gguf --port 8080 --embeddings --pooling mean -c 512

# Point AQE at it:
export AQE_EMBEDDER_ENDPOINT=http://127.0.0.1:8080
# Or a Unix socket for same-host deployments:
export AQE_EMBEDDER_ENDPOINT=unix:/run/embedder.sock
# Optional bearer auth:
export AQE_EMBEDDER_TOKEN=your-token-here
```

That's it. When AQE_EMBEDDER_ENDPOINT is unset, behavior is unchanged.
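
As a rough illustration of what this configuration implies on the wire, the TypeScript sketch below turns the two environment variables into an OpenAI-style /v1/embeddings request. The names buildEmbeddingsRequest and Target, the default model string, and the way the unix: prefix is split off are assumptions made for illustration; they are not AQE's actual internals.

```typescript
// Hypothetical sketch: maps AQE_EMBEDDER_ENDPOINT / AQE_EMBEDDER_TOKEN to a request.
// None of these type or function names come from the AQE codebase.
type Target =
  | { transport: "http"; url: string }
  | { transport: "unix"; socketPath: string; path: string };

interface EmbeddingsRequest {
  target: Target;
  headers: Record<string, string>;
  body: string;
}

function buildEmbeddingsRequest(
  env: Record<string, string | undefined>,
  input: string[],
  model: string = "all-MiniLM-L6-v2", // assumed model name, for illustration only
): EmbeddingsRequest | null {
  const endpoint = env["AQE_EMBEDDER_ENDPOINT"];
  // Unset variable: return null so the caller keeps the in-process behavior.
  if (!endpoint) return null;

  const path = "/v1/embeddings";
  // Assumption: "unix:<socket path>" selects HTTP over a Unix domain socket.
  const target: Target = endpoint.startsWith("unix:")
    ? { transport: "unix", socketPath: endpoint.slice("unix:".length), path }
    : { transport: "http", url: endpoint.replace(/\/+$/, "") + path };

  const headers: Record<string, string> = { "Content-Type": "application/json" };
  const token = env["AQE_EMBEDDER_TOKEN"];
  if (token) headers["Authorization"] = "Bearer " + token;

  // OpenAI embeddings wire format, with encoding_format pinned to "float".
  const body = JSON.stringify({ model, input, encoding_format: "float" });
  return { target, headers, body };
}
```

In Node, an "http" target can be handed to fetch with method POST, while a "unix" target would map onto the socketPath option of http.request; both carry the same headers and body.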

Why

Two real production pain points:

  1. Co-deployments with ruflo/ruvector load byte-identical model weights in two or more processes. ~45–90 MB heap per copy of Xenova/all-MiniLM-L6-v2. Shared endpoint → one resident embedder, many warm clients.
  2. Every aqe hooks … invocation is a fresh OS process paying a full cold model load. Pointing hooks at a long-running embedder server eliminates that overhead — cold path drops from ~1s to 15 ms end-to-end against localhost llama-server.

Highlights

  • OpenAI wire format (encoding_format: 'float' pinned) — verified end-to-end against llama-server with all-MiniLM-L6-v2.Q8_0.gguf.
  • HTTP and HTTP-over-Unix-socket transports — one protocol, two transports.
  • An identity fingerprint of a canary embedding asserts dim === 384 and is persisted to memory.db, so cross-run model drift triggers a loud warning on the next boot.
  • Circuit breaker (3 failures / 60s) with automatic re-probe on recovery — endpoint restarts often coincide with model swaps.
  • TLS knobs (ca, cert, key, rejectUnauthorized, servername) for self-hosted HTTPS endpoints.
  • Hard-fail on error — no silent hash fallback. Mixing hash and transformer embeddings in the same HNSW index silently degrades recall forever; the boundary refuses to do that.
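
The circuit-breaker bullet can be pictured with a small state machine. This is a minimal sketch under one reading of "3 failures / 60s": the breaker opens after three consecutive failures, stays open for 60 seconds, then lets a single trial request through. The class name, method names, and the injected clock are hypothetical, and AQE's real breaker may count failures differently.

```typescript
// Hypothetical circuit breaker; the thresholds mirror the "3 failures / 60s" note,
// but the exact semantics are an assumption.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number = 3, cooldownMs: number = 60_000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True if a request may be sent. After the cooldown, the breaker moves to
  // half-open so that one trial request can test the endpoint.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // Returns true when this success closes a previously tripped breaker, which
  // is the moment a caller would re-run the identity probe.
  recordSuccess(): boolean {
    const recovered = this.state === "half-open";
    this.failures = 0;
    this.state = "closed";
    return recovered;
  }

  // A failed trial request, or reaching the threshold, (re)opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Returning a "recovered" flag from recordSuccess is one way to hook in the re-probe: since endpoint restarts often coincide with model swaps, the first success after an outage is a natural point to re-check the canary fingerprint before trusting new vectors.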

Numbers (against real llama-server on localhost)

| Path | Time |
|---|---|
| Cold (import + init + probe + embed) | 30.7 ms |
| Warm (embed with cached identity + keep-alive socket) | 1.6 ms |
| In-process cold load (no endpoint, today's behavior) | ~1000 ms |

Compatibility

The OpenAI wire format is what TEI, vLLM, Ollama, LocalAI, LM Studio, and OpenAI itself all advertise. Only llama-server has been verified end-to-end; the others are expected to work, but each remains unverified until a per-provider integration test lands. The reference template is tests/integration/embedder-endpoint-llamacpp.test.ts.
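
When trying an unverified provider, a strict response check is a cheap first test. The sketch below assumes the standard OpenAI response shape (a data array whose items carry index and embedding) and the 384-dimension expectation for all-MiniLM-L6-v2 mentioned above. The function name parseEmbeddings is hypothetical and is not taken from the AQE test suite.

```typescript
// Hypothetical validator for an OpenAI-style /v1/embeddings response.
// Throws on any mismatch instead of falling back, in the spirit of "hard-fail on error".
function parseEmbeddings(json: unknown, expectedCount: number, dim: number = 384): number[][] {
  const data =
    typeof json === "object" && json !== null ? (json as { data?: unknown }).data : undefined;
  if (!Array.isArray(data)) throw new Error("response has no data array");
  if (data.length !== expectedCount) {
    throw new Error("expected " + expectedCount + " embeddings, got " + data.length);
  }

  const out: number[][] = [];
  for (const raw of data) {
    const item = (raw ?? {}) as { index?: unknown; embedding?: unknown };
    const index = item.index;
    // Each item must name a unique, in-range position in the input batch.
    if (
      typeof index !== "number" ||
      !Number.isInteger(index) ||
      index < 0 ||
      index >= expectedCount ||
      out[index] !== undefined
    ) {
      throw new Error("invalid or duplicate index in response");
    }
    const embedding = item.embedding;
    // Each vector must be a finite float array of exactly the expected dimension.
    if (
      !Array.isArray(embedding) ||
      embedding.length !== dim ||
      !embedding.every((x) => typeof x === "number" && Number.isFinite(x))
    ) {
      throw new Error("embedding " + index + " is not a finite vector of length " + dim);
    }
    out[index] = embedding as number[];
  }
  return out; // vectors ordered by their index field
}
```

A per-provider integration test could send a small batch, pass the parsed JSON through a check like this, and fail fast if the provider returns a different dimension or a base64-encoded payload instead of floats.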

Getting Started

```bash
npx agentic-qe@3.10.0 init --auto
```

See CHANGELOG, v3.10.0 release notes, and ADR-097 for full details.

Closes #503.