Dense embedding provider not served: vectorizer:3.3.0 coerces every collection to BM25-512, blocking semantic search #306

## Summary

The running `hivehub/vectorizer:3.3.0` image appears to serve **only the BM25 sparse provider at dimension 512**. Any attempt to provision a dense collection (e.g. FastEmbed / ONNX / 768-dim) is silently coerced to `bm25`, and the server-side `/embed` endpoint ignores the requested model and always returns a 512-dim BM25 vector. This blocks true semantic / dense top-k search — the product is described as "designed for semantic search and top-k nearest neighbor queries", but in this image only lexical (BM25) retrieval is available.

Downstream impact: in [hivellm/cortex](https://github.com/hivellm/cortex) the hybrid retrieval pipeline (vector + keyword + graph) is effectively degraded to **keyword-only**. Paraphrase / semantic queries that share no vocabulary with the corpus return irrelevant results because the vector lane contributes nothing.

## Environment

- Image: `hivehub/vectorizer:3.3.0`
- Launch: `--auto-generate-jwt-secret`
- `GET /stats` → `{"version":"3.3.0","default_quantization":"sq-8bit", ...}`

## Reproduction

1. **Create a collection requesting a dense provider at 768-dim:**
   ```
   POST /collections
   {"name":"denseprobe","dimension":768,"metric":"cosine","embedding_provider":"fastembed"}
   ```
   → `201 Collection ... created successfully`.

2. **Read it back** — the server reports `bm25`, not the requested provider:
   ```
   GET /collections/denseprobe
   → { "dimension": 768, "embedding_provider": "bm25", ... }
   ```
   The `embedding_provider` field is accepted but ignored. The same coercion happens for `onnx`, `dense`, `sentence-transformers`, `minilm`, `bge-small`, `nomic-embed-text-v1.5` — all create "successfully" but report `bm25`.

3. **Server-side embed ignores the model param** and always returns BM25-512:
   ```
   POST /embed {"text":"hello","model":"fastembed"}              → {"dimension":512, ...}
   POST /embed {"text":"hello","model":"nomic-embed-text-v1.5"}  → {"dimension":512, ...}
   POST /embed {"text":"hello"}                                  → {"dimension":512, ...}
   ```

4. **Existing collections** created by the normal ingestion path also all report `embedding_provider: "bm25"`, `dimension: 512`.
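The probes in steps 1 to 3 can be scripted so the coercion is checked mechanically for each provider name. The following is a minimal sketch, not an official client: `send` is a hypothetical transport callable (method, path, JSON body) that the caller wires to their own HTTP client and auth, and the only response fields assumed are `embedding_provider` and `dimension` as shown above.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical transport: (method, path, json_body_or_None) -> parsed JSON response.
# How it talks to the server (base URL, JWT header, ...) is left to the caller.
Send = Callable[[str, str, Optional[Dict[str, Any]]], Dict[str, Any]]


def probe_provider(send: Send, name: str, provider: str, dimension: int = 768) -> List[str]:
    """Create a collection, read it back, and list every requested field the server changed."""
    requested = {
        "name": name,
        "dimension": dimension,
        "metric": "cosine",
        "embedding_provider": provider,
    }
    send("POST", "/collections", requested)
    stored = send("GET", f"/collections/{name}", None)

    problems = []
    for key in ("embedding_provider", "dimension"):
        if stored.get(key) != requested[key]:
            problems.append(
                f"{key}: requested {requested[key]!r}, server reports {stored.get(key)!r}"
            )
    return problems


def probe_embed(send: Send, model: str, expected_dimension: int) -> bool:
    """Return True if /embed answers with the dimension expected for the requested model."""
    response = send("POST", "/embed", {"text": "hello", "model": model})
    return response.get("dimension") == expected_dimension
```

Against the behavior described in this issue, `probe_provider(send, "denseprobe", "fastembed")` would return one entry for `embedding_provider`, and `probe_embed(send, "fastembed", 768)` would return `False`.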

## Questions / Ask

1. Does `vectorizer:3.3.0` support a **dense embedding provider** (FastEmbed, ONNX, or a bundled model such as `nomic-embed-text-v1.5` / `bge-small`) at all? If so:
   - How is it enabled? (env var, `config.yml`, a model download/mount?) The current image with `--auto-generate-jwt-secret` and no extra config serves BM25-only.
   - Please document the create-collection contract, and make sure `embedding_provider` is either honored or rejected with a clear error instead of being silently coerced to `bm25`.
2. If dense is **not** supported in 3.3.0: this is a feature request to ship a dense provider (768-dim, cosine) so semantic search works end-to-end.
3. At minimum, regardless of the answers above, this is a bug: **the `embedding_provider` and `/embed` `model` params should not be silently ignored**. Either honor them or return an explicit `unsupported_provider` error, so callers do not believe they provisioned a dense collection.
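To make the third ask concrete, here is a rough sketch of the requested semantics, written in Python for illustration only. It is not the server's actual code: `check_provider`, its arguments, and the response shape are hypothetical, and the assumption that an omitted provider falls back to `bm25` is inferred from the observed behavior.

```python
from typing import Any, Dict, Iterable, Optional


def check_provider(
    requested: Optional[str],
    supported: Iterable[str],
    default: str = "bm25",
) -> Dict[str, Any]:
    """Honor a supported provider, default only when none was requested, reject the rest."""
    supported_set = set(supported)

    # No provider requested: falling back to the default is unsurprising.
    if requested is None:
        return {"ok": True, "embedding_provider": default}

    # Explicit request for something the build can serve: honor it as-is.
    if requested in supported_set:
        return {"ok": True, "embedding_provider": requested}

    # Explicit request the build cannot serve: fail loudly instead of coercing.
    return {
        "ok": False,
        "error": "unsupported_provider",
        "requested": requested,
        "supported": sorted(supported_set),
    }
```

With this contract, a BM25-only build would answer a `fastembed` request with `unsupported_provider` rather than a `201`, and the same check could apply to the `model` param of `/embed`.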

## Why it matters

Cortex pins `CORTEX_EMBEDDER_DIM=512` specifically because "768 gets `Invalid dimension` on every insert" against this image — i.e. the deployment was forced down to BM25-512 to avoid insert failures. Unlocking a dense provider would let Cortex (and any other consumer) actually use the semantic lane the database advertises.
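Until the server reports this itself, a consumer could detect the mismatch once at startup instead of on every insert. This is a sketch under assumptions: `preflight_dimension` is a hypothetical helper, not part of Cortex, and it only relies on the `dimension` field that `/embed` returns in the reproduction above.

```python
from typing import Any, Dict, Mapping


def preflight_dimension(env: Mapping[str, str], embed_response: Dict[str, Any]) -> None:
    """Raise if the configured embedder dimension differs from what /embed actually serves."""
    configured = int(env["CORTEX_EMBEDDER_DIM"])
    served = embed_response.get("dimension")
    if served != configured:
        raise RuntimeError(
            f"CORTEX_EMBEDDER_DIM={configured} but the server embeds at dimension {served}; "
            "inserts at the configured dimension are likely to be rejected"
        )
```

A caller would pass `os.environ` and the parsed body of one `POST /embed` probe, then abort startup if the check raises.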