Summary
The running hivehub/vectorizer:3.3.0 image appears to serve only the BM25 sparse provider at dimension 512. Any attempt to provision a dense collection (e.g. FastEmbed / ONNX / 768-dim) is silently coerced to bm25, and the server-side /embed endpoint ignores the requested model and always returns a 512-dim BM25 vector. This blocks true semantic / dense top-k search — the product is described as "designed for semantic search and top-k nearest neighbor queries", but in this image only lexical (BM25) retrieval is available.
Downstream impact: in hivellm/cortex the hybrid retrieval pipeline (vector + keyword + graph) is effectively degraded to keyword-only. Paraphrase / semantic queries that share no vocabulary with the corpus return irrelevant results because the vector lane contributes nothing.
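To illustrate why a lexical-only lane fails on paraphrases, here is a minimal sketch of a textbook BM25 scorer. It is not the vectorizer's implementation (which appears to produce a fixed 512-dim vector); the tokenizer, the k1/b defaults, and the sample documents are all assumptions made for illustration. The point is only that a query sharing no terms with a document scores zero against it.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer; real analyzers are more elaborate."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical two-document corpus.
docs = [
    "how to reset a forgotten password",
    "quarterly revenue report for the sales team",
]
lexical = bm25_scores("reset password", docs)                   # shares terms with docs[0]
paraphrase = bm25_scores("recover my login credentials", docs)  # same intent, no shared terms
```

Here `lexical` ranks the first document above the second, while every entry of `paraphrase` is zero, so the ranking for the paraphrased query carries no signal.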
Environment
- Image: hivehub/vectorizer:3.3.0
- Launch: --auto-generate-jwt-secret
- GET /stats → {"version":"3.3.0","default_quantization":"sq-8bit", ...}
Reproduction
- Create a collection requesting a dense provider at 768-dim:
POST /collections
{"name":"denseprobe","dimension":768,"metric":"cosine","embedding_provider":"fastembed"}
→ 201 Collection ... created successfully.
- Read it back. The server reports bm25, not the requested provider:
GET /collections/denseprobe
→ { "dimension": 768, "embedding_provider": "bm25", ... }
The embedding_provider field is accepted but ignored. The same coercion happens for onnx, dense, sentence-transformers, minilm, bge-small, nomic-embed-text-v1.5 — all create "successfully" but report bm25.
- Server-side embed ignores the model param and always returns BM25-512:
POST /embed {"text":"hello","model":"fastembed"} → {"dimension":512, ...}
POST /embed {"text":"hello","model":"nomic-embed-text-v1.5"} → {"dimension":512, ...}
POST /embed {"text":"hello"} → {"dimension":512, ...}
- Existing collections created by the normal ingestion path also all report embedding_provider: "bm25", dimension: 512.
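The steps above can be scripted. The following is a rough sketch using only the Python standard library; the endpoints and field names are the ones quoted in this report, while `BASE_URL`, the helper names, and the absence of an auth header are assumptions that may need adjusting for a given deployment.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:8080"  # hypothetical address; point this at your instance


def call(method: str, path: str, body: Optional[dict] = None, base_url: str = BASE_URL) -> dict:
    """Send a JSON request and decode the JSON response.

    Assumption: no Authorization header is needed; add one if your setup requires it.
    """
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def coercion_problems(requested: dict, reported: dict) -> list:
    """List fields whose reported value differs from what was requested."""
    problems = []
    for field in ("embedding_provider", "dimension"):
        if field in requested and reported.get(field) != requested[field]:
            problems.append(
                f"{field}: requested {requested[field]!r}, server reports {reported.get(field)!r}"
            )
    return problems


def probe_collection(name: str, provider: str, dimension: int = 768) -> list:
    """Create a collection, read it back, and report silently changed fields."""
    requested = {
        "name": name,
        "dimension": dimension,
        "metric": "cosine",
        "embedding_provider": provider,
    }
    call("POST", "/collections", requested)
    reported = call("GET", f"/collections/{name}")
    return coercion_problems(requested, reported)


def embed_dimension(text: str, model: Optional[str] = None) -> int:
    """Return the dimension reported by POST /embed for an optional model name."""
    body = {"text": text}
    if model is not None:
        body["model"] = model
    return call("POST", "/embed", body)["dimension"]
```

Against the responses shown in this report, `coercion_problems` applied to the `denseprobe` request and its read-back yields a single entry, `embedding_provider: requested 'fastembed', server reports 'bm25'`. Running `probe_collection` once per provider name would cover the other coerced values.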
Questions / Ask
- Does vectorizer:3.3.0 support a dense embedding provider (FastEmbed, ONNX, or a bundled model such as nomic-embed-text-v1.5 / bge-small) at all? If so:
  - How is it enabled? (env var, config.yml, a model download/mount?) The current image with --auto-generate-jwt-secret and no extra config serves BM25-only.
  - Document the create-collection contract so embedding_provider is honored (or rejected with a clear error) instead of silently coerced to bm25.
- If dense is not supported in 3.3.0: this is a feature request to ship a dense provider (768-dim, cosine) so semantic search works end-to-end.
- Minimum bug regardless of the above: embedding_provider and /embed model params should not be silently ignored. Either honor them or return an explicit unsupported_provider error so callers do not believe they provisioned a dense collection.
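As a sketch of the "honor or reject" contract requested above: the snippet below is written in Python purely for illustration and makes no claim about how the server is actually implemented. `SUPPORTED_PROVIDERS`, `handle_create`, the 400 status, and the response shape are all hypothetical; only the unsupported_provider error name and the request fields come from this report.

```python
# Assumption: the set of providers this image can actually serve.
SUPPORTED_PROVIDERS = {"bm25"}
DEFAULT_PROVIDER = "bm25"


def handle_create(body: dict) -> tuple:
    """Validate a create-collection request and return (status, payload).

    The provider is defaulted only when the caller did not specify one.
    An explicitly requested but unavailable provider is rejected rather
    than silently replaced.
    """
    requested = body.get("embedding_provider")
    if requested is None:
        provider = DEFAULT_PROVIDER
    elif requested in SUPPORTED_PROVIDERS:
        provider = requested
    else:
        return 400, {
            "error": "unsupported_provider",
            "requested": requested,
            "supported": sorted(SUPPORTED_PROVIDERS),
        }
    return 201, {"name": body.get("name"), "embedding_provider": provider}


# The request from the reproduction would now fail loudly instead of succeeding.
status, payload = handle_create(
    {"name": "denseprobe", "dimension": 768, "metric": "cosine", "embedding_provider": "fastembed"}
)
```

The same rule would apply to the model param of /embed: default when omitted, error when an unavailable model is named.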
Why it matters
Cortex pins CORTEX_EMBEDDER_DIM=512 specifically because "768 gets Invalid dimension on every insert" against this image — i.e. the deployment was forced down to BM25-512 to avoid insert failures. Unlocking a dense provider would let Cortex (and any other consumer) actually use the semantic lane the database advertises.