lancedb · prrao87 · Apr 21, 2026
diff --git a/.gitignore b/.gitignore
@@ -17,4 +17,10 @@ boost-example
 .DS_Store
 .vscode
 node_modules
-tests/rs/target
+tests/rs/target
+
+# Generated docs-audit workflow state
+workflows/docs-audit/artifacts/**
+!workflows/docs-audit/artifacts/
+workflows/docs-audit/state/**
+!workflows/docs-audit/state/
diff --git a/AGENTS.md b/AGENTS.md
@@ -8,17 +8,6 @@ This is a documentation site for [LanceDB](https://docs.lancedb.com).
 - Best practices for linting, formatting and code complexity for each respective language apply.
 - Write idiomatic code as far as possible
 
-## Running Python code
-
-When running Python code, we have to cater to users of both pip and uv.
-
-- Use 4 spaces to represent a tab (do not use tab characters)
-- Always attempt to first run *any* Python code via the local virtual environment
-  - Look for a local virtual environment (typically in `.venv` or `venv`)
-  - Activate the environment, so that you can run multiple code exampes in the same environment
-- Avoid using `uv run` directly, as you have issues running it in your sandbox
-- Only fall back to the system `python3` to run code if the above steps don't work
-
 ## Generate snippets
 
 - Generate the required code snippets using the provided Makefile: `make snippets`
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -8,13 +8,6 @@ This is a documentation site for [LanceDB](https://docs.lancedb.com).
 - Best practices for linting, formatting and code complexity for each respective language apply.
 - Write idiomatic code as far as possible
 
-## Running Python code
+## Generate snippets
 
-When running Python code, we have to cater to users of both pip and uv.
-
-- Use 4 spaces to represent a tab (do not use tab characters)
-- Always attempt to first run *any* Python code via the local virtual environment
-  - Look for a local virtual environment (typically in `.venv` or `venv`)
-  - Activate the environment, so that you can run multiple code exampes in the same environment
-- Avoid using `uv run` directly, as you have issues running it in your sandbox
-- Only fall back to the system `python3` to run code if the above steps don't work
+- Generate the required code snippets using the provided Makefile: `make snippets`
diff --git a/docs/indexing/fts-index.mdx b/docs/indexing/fts-index.mdx
@@ -62,6 +62,10 @@ The `create_fts_index` method is not available on `AsyncTable`. Use `create_inde
 | `stem` | bool | `True` | Apply stemming (`running` → `run`) |
 | `remove_stop_words` | bool | `True` | Drop common stop words |
 | `ascii_folding` | bool | `True` | Normalize accented characters |
+| `custom_stop_words` | list[str] | `None` | Extra stop words to drop in addition to the language defaults. Requires `remove_stop_words=True`. |
+| `min_ngram_length` | int | `3` | Minimum n-gram length. Applies only when `base_tokenizer="ngram"`. |
+| `max_ngram_length` | int | `3` | Maximum n-gram length. Applies only when `base_tokenizer="ngram"`. |
+| `prefix_only` | bool | `False` | Index only prefix n-grams rather than all substrings. Applies only when `base_tokenizer="ngram"`. |
 
 <Note title="Key parameters">
 - `max_token_length` can filter out base64 blobs or long URLs.

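Reviewer note: the three n-gram parameters added above interact, so a small illustration may help readers of this hunk. The following is a minimal sketch in plain Python, assuming that `prefix_only` restricts n-grams to those starting at the first character of a token. The function name `ngrams` is hypothetical, and this is not LanceDB's tokenizer implementation.

```python
def ngrams(token: str, min_len: int = 3, max_len: int = 3,
           prefix_only: bool = False) -> list[str]:
    """Illustrative n-gram expansion of a single token (hypothetical helper)."""
    if min_len < 1 or max_len < min_len:
        raise ValueError("require 1 <= min_len <= max_len")
    # Assumption: with prefix_only, n-grams may start only at position 0.
    starts = [0] if prefix_only else range(len(token))
    grams = []
    for start in starts:
        for n in range(min_len, max_len + 1):
            # Skip windows that would run past the end of the token.
            if start + n <= len(token):
                grams.append(token[start:start + n])
    return grams


print(ngrams("lance"))                          # all 3-character substrings
print(ngrams("lance", 2, 4, prefix_only=True))  # prefixes of length 2 to 4
```

With the defaults (`3` and `3`), every emitted gram has exactly three characters, which is why widening `max_ngram_length` or enabling `prefix_only` changes index size noticeably.
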
diff --git a/docs/indexing/quantization.mdx b/docs/indexing/quantization.mdx
@@ -16,8 +16,10 @@ Use quantization when:
 LanceDB currently exposes multiple quantized vector index types, including:
 - `IVF_PQ` -- Inverted File index with Product Quantization (default). See the [vector indexing guide](/indexing/vector-index) for `IVF_PQ` examples.
 - `IVF_RQ` -- Inverted File index with **RaBitQ** quantization (binary, 1 bit per dimension). Requires vector dimensions divisible by `8`. See [below](#rabitq-quantization) for details.
+- `IVF_HNSW_SQ` -- IVF partitions with an **HNSW graph per partition** plus **Scalar Quantization**. Offers a strong balance of recall, latency, and index size for most workloads.
+- `IVF_HNSW_PQ` -- IVF partitions with an **HNSW graph per partition** plus **Product Quantization**. Prefer it when PQ-level compression matters and you still want HNSW-style search within each partition.
 
-`IVF_PQ` is the default indexing option in LanceDB and works well in many cases. However, in cases where more drastic compression is needed, RaBitQ is also a reasonable option.
+These index types combine two independent choices: how each partition is searched (a flat scan for `IVF_*`, an HNSW graph for `IVF_HNSW_*`) and which quantizer compresses the vectors (`PQ`, `RQ`, or `SQ`). `IVF_PQ` is the default and works well in many cases. For more drastic compression, RaBitQ (`IVF_RQ`) is a reasonable option. For higher recall at low latency, the HNSW-backed variants are usually the better choice. Use the ["Choose the Right Index"](/indexing/vector-index#choose-the-right-index) table on the vector indexing page as the canonical decision tool.
 
 ## RaBitQ quantization
 

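Reviewer note: the naming scheme described in this hunk (partition search strategy plus quantizer) can be made concrete with a small parser. This is a sketch only: `describe_index` and `rabitq_code_bytes` are hypothetical helpers, not LanceDB APIs, and the byte count covers just the 1-bit-per-dimension codes, ignoring any per-vector metadata an actual index may store.

```python
QUANTIZERS = {
    "PQ": "product quantization",
    "RQ": "RaBitQ",
    "SQ": "scalar quantization",
}


def describe_index(index_type: str) -> tuple[str, str]:
    """Split a quantized index name into (partition search, quantizer)."""
    parts = index_type.upper().split("_")
    if parts[0] != "IVF" or parts[-1] not in QUANTIZERS:
        raise ValueError(f"not a quantized IVF index type: {index_type!r}")
    middle = parts[1:-1]
    if not middle:
        search = "flat"   # IVF_*: each partition is scanned flatly
    elif middle == ["HNSW"]:
        search = "hnsw"   # IVF_HNSW_*: one HNSW graph per partition
    else:
        raise ValueError(f"unrecognized index type: {index_type!r}")
    return search, QUANTIZERS[parts[-1]]


def rabitq_code_bytes(dim: int) -> int:
    """Bytes needed for 1-bit-per-dimension codes; dim must be divisible by 8."""
    if dim <= 0 or dim % 8 != 0:
        raise ValueError("IVF_RQ requires a vector dimension divisible by 8")
    return dim // 8


print(describe_index("IVF_HNSW_SQ"))
print(rabitq_code_bytes(768))
```
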
diff --git a/docs/indexing/vector-index.mdx b/docs/indexing/vector-index.mdx
@@ -208,6 +208,10 @@ Compare ANN results against a flat-scan ground truth to compute recall@k. This i
 Flat search is $O(n)$ — reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
 </Warning>
 
+<Note title="Multivector distance constraint">
+Multivector indexing currently requires `distance_type="cosine"`; `l2` is rejected at index-creation time. Because the index cannot serve any other metric, non-cosine queries on a multivector column must call `bypass_vector_index()` and fall back to a flat scan. See [Multivector Search](/search/multivector-search) for the full rules.
+</Note>
+
 ## Example: Construct an HNSW Index
 
 ### Index Configuration
@@ -248,6 +252,13 @@ Binary vectors are useful for hash-based retrieval, fingerprinting, or any scena
 - `metric`: the `hamming` distance is used for similarity search
 - The dimension of binary vectors must be a multiple of 8. For example, a 128-dimensional vector is stored as a uint8 array of size 16.
 
+<Warning>
+**`IVF_FLAT` + `hamming` is the only supported path for binary vectors.**
+
+- `hamming` distance is only valid on packed binary (uint8) data; it is rejected on float vector columns.
+- Quantized index types (`IVF_PQ`, `IVF_RQ`, `IVF_SQ`, `IVF_HNSW_PQ`, `IVF_HNSW_SQ`) do not accept binary inputs; their `distance_type` is restricted to `l2`, `cosine`, or `dot`.
+</Warning>
+
 ### 1. Create Table and Schema
 
 <CodeGroup>

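Reviewer note: the binary-vector rules in this hunk (dimension a multiple of 8, packed `uint8` storage, `hamming` distance) are easy to check by hand. Below is a minimal standard-library sketch. The helper names `pack_bits` and `hamming` are hypothetical, and the most-significant-bit-first packing order is an assumption; the Hamming distance itself does not depend on that order.

```python
def pack_bits(bits: list[int]) -> bytes:
    """Pack a 0/1 list into bytes, 8 dimensions per byte (MSB first, assumed)."""
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dimension must be a multiple of 8")
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            if bit not in (0, 1):
                raise ValueError("bits must be 0 or 1")
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two packed vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same packed length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


vec_a = pack_bits([1, 0] * 64)   # 128 dimensions
vec_b = pack_bits([0, 1] * 64)
print(len(vec_a), hamming(vec_a, vec_b))
```

A 128-dimensional vector packs into 16 bytes, matching the example in the context lines above.
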
diff --git a/docs/search/full-text-search.mdx b/docs/search/full-text-search.mdx
@@ -156,6 +156,7 @@ The tokenizer is customizable, you can specify how the tokenizer splits the text
 - `stem`: true
 - `remove_stop_words`: true
 - `ascii_folding`: true
+- `custom_stop_words`: `None` (pass a `list[str]` to drop additional words beyond the language defaults; requires `remove_stop_words=True`)
 
 For example, for language with accents, you can specify the tokenizer to use `ascii_folding` to remove accents, e.g. 'é' to 'e':
 

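Reviewer note: the dependency between `custom_stop_words` and `remove_stop_words` stated in this hunk can be sketched as follows. This is an illustration in plain Python, not LanceDB's tokenizer: `filter_tokens` and the tiny `DEFAULT_STOP_WORDS` set are hypothetical stand-ins, and raising an error when the requirement is not met is an assumption (the real library may behave differently).

```python
# Tiny stand-in for the language defaults; the real list is much longer.
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "of", "the"})


def filter_tokens(tokens, remove_stop_words=True, custom_stop_words=None):
    """Drop default and custom stop words from a token list (illustrative)."""
    if custom_stop_words and not remove_stop_words:
        # Assumption: the requirement is modelled here as a hard error.
        raise ValueError("custom_stop_words requires remove_stop_words=True")
    if not remove_stop_words:
        return list(tokens)
    # Custom words are added to the defaults; they do not replace them.
    stop = DEFAULT_STOP_WORDS | {w.lower() for w in (custom_stop_words or [])}
    return [t for t in tokens if t.lower() not in stop]


print(filter_tokens(["the", "lance", "table"], custom_stop_words=["table"]))
```
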
diff --git a/workflows/docs-audit/AGENTS.md b/workflows/docs-audit/AGENTS.md
@@ -0,0 +1,33 @@
+# AGENTS.md
+
+This workspace orchestrates a docs-gap audit across external local repos. It does not own or vendor the source code from those repos.
+
+## Working model
+
+- Deterministic scripts live in `scripts/`.
+- Area manifests live in `manifests/`.
+- Codex prompt templates live in `prompts/`.
+- Run state lives in `state/`.
+- Generated run artifacts live in `artifacts/`.
+
+## Rules for future agents
+
+- Do not copy large code snapshots from the watched repos into this workspace.
+- Keep the deterministic layer deterministic: refresh, extract, fingerprint, select, and update state.
+- Keep semantic reasoning page-scoped and artifact-backed.
+- Reports must describe only what is missing from the docs.
+- When adding a new docs area, prefer a new manifest over changes to the core runner.
+- Keep evidence compact and user-facing where possible.
+- Preserve the distinction between deterministic outputs (`page_bundles`, metadata, state) and LLM outputs (`llm_outputs`, final report).
+
+## Expected output shape
+
+A completed run should leave behind:
+
+- `metadata.json`
+- `selected_pages.json`
+- `page_bundles/*.json`
+- `llm_outputs/*`
+- `report.md`
+
+The report should be concise, grouped by page or subsection, and should not contain implementation plans or doc-fix patches.
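
Reviewer note: the "Expected output shape" list above lends itself to a mechanical check. The sketch below is one possible way to verify a run directory using only the standard library. `missing_outputs` is a hypothetical helper, and the assumption that all five outputs sit directly under a single run directory is mine; the new file does not say where a run's outputs are placed.

```python
from pathlib import Path

REQUIRED_FILES = ("metadata.json", "selected_pages.json", "report.md")


def missing_outputs(run_dir) -> list[str]:
    """Return the expected outputs that are absent from run_dir."""
    run_dir = Path(run_dir)
    missing = [name for name in REQUIRED_FILES
               if not (run_dir / name).is_file()]
    # page_bundles/*.json: at least one JSON bundle must exist.
    bundles = run_dir / "page_bundles"
    if not (bundles.is_dir() and any(bundles.glob("*.json"))):
        missing.append("page_bundles/*.json")
    # llm_outputs/*: at least one entry of any kind must exist.
    llm_outputs = run_dir / "llm_outputs"
    if not (llm_outputs.is_dir() and any(llm_outputs.iterdir())):
        missing.append("llm_outputs/*")
    return missing
```

An empty list means the run left behind everything in the expected shape; otherwise the returned names identify what is absent.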