Dense embedding provider not served: vectorizer:3.3.0 coerces every collection to BM25-512, blocking semantic search #306

@andrehrferreira


Summary

The running hivehub/vectorizer:3.3.0 image appears to serve only the BM25 sparse provider at dimension 512. Any attempt to provision a dense collection (e.g. FastEmbed / ONNX / 768-dim) is silently coerced to bm25, and the server-side /embed endpoint ignores the requested model and always returns a 512-dim BM25 vector. This blocks true semantic / dense top-k search — the product is described as "designed for semantic search and top-k nearest neighbor queries", but in this image only lexical (BM25) retrieval is available.

Downstream impact: in hivellm/cortex the hybrid retrieval pipeline (vector + keyword + graph) is effectively degraded to keyword-only. Paraphrase / semantic queries that share no vocabulary with the corpus return irrelevant results because the vector lane contributes nothing.
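To illustrate why a keyword-only lane fails on paraphrases, here is a minimal, self-contained BM25 scorer. It is a textbook-style sketch with assumed parameters (`k1`, `b`) and whitespace tokenization, not the scoring code vectorizer actually runs. The sample documents and queries are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a plain BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no shared vocabulary, no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["how to reset your password", "billing and invoice history"]
lexical = bm25_scores("reset password", docs)
paraphrase = bm25_scores("recover forgotten login credentials", docs)
```

The paraphrase query shares no tokens with either document, so every score is zero and the ranking carries no signal. That is the situation the vector lane is supposed to cover.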

Environment

  • Image: hivehub/vectorizer:3.3.0
  • Launch: --auto-generate-jwt-secret
  • GET /stats → {"version":"3.3.0","default_quantization":"sq-8bit", ...}

Reproduction

  1. Create a collection requesting a dense provider at 768-dim:

    POST /collections
    {"name":"denseprobe","dimension":768,"metric":"cosine","embedding_provider":"fastembed"}
    

    201 Collection ... created successfully.

  2. Read it back — the server reports bm25, not the requested provider:

    GET /collections/denseprobe
    → { "dimension": 768, "embedding_provider": "bm25", ... }
    

    The embedding_provider field is accepted but ignored. The same coercion happens for onnx, dense, sentence-transformers, minilm, bge-small, nomic-embed-text-v1.5 — all create "successfully" but report bm25.

  3. Server-side embed ignores the model param and always returns BM25-512:

    POST /embed {"text":"hello","model":"fastembed"}              → {"dimension":512, ...}
    POST /embed {"text":"hello","model":"nomic-embed-text-v1.5"}  → {"dimension":512, ...}
    POST /embed {"text":"hello"}                                  → {"dimension":512, ...}
    
  4. Existing collections created by the normal ingestion path also all report embedding_provider: "bm25", dimension: 512.
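The steps above can be scripted. The sketch below uses only the Python standard library. The base URL, port, and bearer-token header are assumptions (the issue does not state them), and `http_json`, `probe_provider`, and `probe_embed` are hypothetical helper names. The endpoints and JSON fields are the ones shown in the reproduction.

```python
import json
import urllib.request

# Assumption: placeholder address; replace with wherever the container listens.
BASE_URL = "http://localhost:8080"


def http_json(method, path, body=None, base_url=BASE_URL, token=None):
    """Send a JSON request and return the decoded response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumption: JWT is passed as a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(base_url + path, data=data,
                                 method=method, headers=headers)
    with urllib.request.urlopen(req) as resp:
        text = resp.read().decode("utf-8")
    try:
        return json.loads(text)
    except ValueError:
        return {"raw": text}  # tolerate plain-text responses


def probe_provider(call, name, provider, dimension=768):
    """Steps 1 and 2: create a collection, then read back the provider."""
    call("POST", "/collections", {
        "name": name,
        "dimension": dimension,
        "metric": "cosine",
        "embedding_provider": provider,
    })
    info = call("GET", f"/collections/{name}")
    reported = info.get("embedding_provider")
    return {"requested": provider, "reported": reported,
            "coerced": reported != provider}


def probe_embed(call, model=None, text="hello"):
    """Step 3: return the dimension /embed reports for a given model."""
    body = {"text": text}
    if model is not None:
        body["model"] = model
    return call("POST", "/embed", body).get("dimension")
```

Against a live server, `probe_provider(http_json, "denseprobe", "fastembed")` should report `coerced: True` if the behavior described here is present, and `probe_embed(http_json, "nomic-embed-text-v1.5")` should return 512.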

Questions / Ask

  1. Does vectorizer:3.3.0 support a dense embedding provider (FastEmbed, ONNX, or a bundled model such as nomic-embed-text-v1.5 / bge-small) at all? If so:
    • How is it enabled? (env var, config.yml, a model download/mount?) The current image with --auto-generate-jwt-secret and no extra config serves BM25-only.
    • Document the create-collection contract so embedding_provider is honored (or rejected with a clear error) instead of silently coerced to bm25.
  2. If dense is not supported in 3.3.0: this is a feature request to ship a dense provider (768-dim, cosine) so semantic search works end-to-end.
  3. Minimum bug regardless of the above: embedding_provider and /embed model params should not be silently ignored — either honor them or return an explicit unsupported_provider error so callers do not believe they provisioned a dense collection.
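Until the server rejects unsupported providers itself, a caller can approximate item 3 on the client side by comparing what it requested with what the read-back reports. This is a minimal sketch: `UnsupportedProviderError` and `verify_collection` are hypothetical names, and the only fields checked are the two that appear in the reproduction above.

```python
class UnsupportedProviderError(RuntimeError):
    """Raised when the server did not honor the requested collection settings."""


def verify_collection(requested: dict, reported: dict) -> None:
    """Compare the create-collection request with the GET read-back.

    Raises UnsupportedProviderError if embedding_provider or dimension
    differ, so a silent coercion becomes an explicit failure.
    """
    mismatches = {}
    for key in ("embedding_provider", "dimension"):
        if key in requested and reported.get(key) != requested[key]:
            mismatches[key] = (requested[key], reported.get(key))
    if mismatches:
        details = ", ".join(
            f"{key}: requested {want!r}, got {got!r}"
            for key, (want, got) in sorted(mismatches.items())
        )
        raise UnsupportedProviderError(
            f"server did not honor create-collection request ({details})"
        )
```

Calling `verify_collection` right after provisioning turns the silent coercion to `bm25` into a hard failure at setup time, rather than a quality regression discovered later at query time.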

Why it matters

Cortex pins CORTEX_EMBEDDER_DIM=512 specifically because "768 gets Invalid dimension on every insert" against this image — i.e. the deployment was forced down to BM25-512 to avoid insert failures. Unlocking a dense provider would let Cortex (and any other consumer) actually use the semantic lane the database advertises.