Merged
8 changes: 7 additions & 1 deletion .gitignore
@@ -17,4 +17,10 @@ boost-example
.DS_Store
.vscode
node_modules
tests/rs/target
tests/rs/target

# Generated docs-audit workflow state
workflows/docs-audit/artifacts/**
!workflows/docs-audit/artifacts/
workflows/docs-audit/state/**
!workflows/docs-audit/state/
11 changes: 0 additions & 11 deletions AGENTS.md
@@ -8,17 +8,6 @@ This is a documentation site for [LanceDB](https://docs.lancedb.com).
- Best practices for linting, formatting and code complexity for each respective language apply.
- Write idiomatic code as far as possible

## Running Python code

When running Python code, we have to cater to users of both pip and uv.

- Use 4 spaces to represent a tab (do not use tab characters)
- Always attempt to first run *any* Python code via the local virtual environment
- Look for a local virtual environment (typically in `.venv` or `venv`)
- Activate the environment, so that you can run multiple code examples in the same environment
- Avoid using `uv run` directly, as you have issues running it in your sandbox
- Only fall back to the system `python3` to run code if the above steps don't work

## Generate snippets

- Generate the required code snippets using the provided Makefile: `make snippets`
11 changes: 2 additions & 9 deletions CLAUDE.md
@@ -8,13 +8,6 @@ This is a documentation site for [LanceDB](https://docs.lancedb.com).
- Best practices for linting, formatting and code complexity for each respective language apply.
- Write idiomatic code as far as possible

## Running Python code
## Generate snippets

When running Python code, we have to cater to users of both pip and uv.

- Use 4 spaces to represent a tab (do not use tab characters)
- Always attempt to first run *any* Python code via the local virtual environment
- Look for a local virtual environment (typically in `.venv` or `venv`)
- Activate the environment, so that you can run multiple code examples in the same environment
- Avoid using `uv run` directly, as you have issues running it in your sandbox
- Only fall back to the system `python3` to run code if the above steps don't work
- Generate the required code snippets using the provided Makefile: `make snippets`
4 changes: 4 additions & 0 deletions docs/indexing/fts-index.mdx
@@ -62,6 +62,10 @@ The `create_fts_index` method is not available on `AsyncTable`. Use `create_inde
| `stem` | bool | `True` | Apply stemming (`running` → `run`) |
| `remove_stop_words` | bool | `True` | Drop common stop words |
| `ascii_folding` | bool | `True` | Normalize accented characters |
| `custom_stop_words` | list[str] | `None` | Extra stop words to drop in addition to the language defaults. Requires `remove_stop_words=True`. |
| `min_ngram_length` | int | `3` | Minimum n-gram length. Applies only when `base_tokenizer="ngram"`. |
| `max_ngram_length` | int | `3` | Maximum n-gram length. Applies only when `base_tokenizer="ngram"`. |
| `prefix_only` | bool | `False` | Index only prefix n-grams rather than all substrings. Applies only when `base_tokenizer="ngram"`. |
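
To make the three n-gram options concrete, here is a minimal pure-Python sketch of character n-gram generation. It is an illustration only, not LanceDB's tokenizer: the function name `char_ngrams` is hypothetical, the output order is arbitrary, and it assumes `prefix_only` means keeping only the n-grams that start at the first character of a token.

```python
def char_ngrams(token: str, min_len: int = 3, max_len: int = 3,
                prefix_only: bool = False) -> list[str]:
    """Return character n-grams of `token` with lengths in [min_len, max_len].

    Hypothetical helper for illustration; not part of the LanceDB API.
    """
    if min_len < 1 or max_len < min_len:
        raise ValueError("require 1 <= min_len <= max_len")
    grams = []
    for n in range(min_len, max_len + 1):
        # With prefix_only, only the n-gram starting at position 0 is kept.
        last_start = 0 if prefix_only else len(token) - n
        for start in range(last_start + 1):
            if start + n <= len(token):
                grams.append(token[start:start + n])
    return grams


# All 3-grams of "lance": every substring of length 3.
print(char_ngrams("lance"))                        # ['lan', 'anc', 'nce']
# Prefix n-grams of length 2..4: useful for prefix-style matching.
print(char_ngrams("lance", 2, 4, prefix_only=True))  # ['la', 'lan', 'lanc']
```

Under these assumptions, widening the length range or disabling `prefix_only` increases the number of indexed terms per token, which is the trade-off the three parameters control.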

<Note title="Key parameters">
- `max_token_length` can filter out base64 blobs or long URLs.
4 changes: 3 additions & 1 deletion docs/indexing/quantization.mdx
@@ -16,8 +16,10 @@ Use quantization when:
LanceDB currently exposes multiple quantized vector index types, including:
- `IVF_PQ` -- Inverted File index with Product Quantization (default). See the [vector indexing guide](/indexing/vector-index) for `IVF_PQ` examples.
- `IVF_RQ` -- Inverted File index with **RaBitQ** quantization (binary, 1 bit per dimension). Requires vector dimensions divisible by `8`. See [below](#rabitq-quantization) for details.
- `IVF_HNSW_SQ` -- IVF partitions with an **HNSW graph per partition** plus **Scalar Quantization**. Strong recall/latency/size trade-off for most workloads.
- `IVF_HNSW_PQ` -- IVF partitions with an **HNSW graph per partition** plus **Product Quantization**. Prefer when PQ-level compression matters and you still want HNSW-style in-partition search.

`IVF_PQ` is the default indexing option in LanceDB and works well in many cases. However, in cases where more drastic compression is needed, RaBitQ is also a reasonable option.
Two axes are being combined here: whether partitions are searched flatly or via an HNSW graph (`IVF_*` vs. `IVF_HNSW_*`), and which quantizer compresses the vectors (`PQ`, `RQ`, or `SQ`). `IVF_PQ` is the default and works well in many cases. For more drastic compression, RaBitQ (`IVF_RQ`) is a reasonable option. For higher recall at low latency, the HNSW-backed variants are usually the right pick. The ["Choose the Right Index"](/indexing/vector-index#choose-the-right-index) table on the vector indexing page is the canonical decision tool.

## RaBitQ quantization

11 changes: 11 additions & 0 deletions docs/indexing/vector-index.mdx
@@ -208,6 +208,10 @@ Compare ANN results against a flat-scan ground truth to compute recall@k. This i
Flat search is $O(n)$ — reserve `bypass_vector_index()` for sampled recall measurements or small tables, not production queries.
</Warning>
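
The recall@k comparison described above can be sketched in plain Python. This is a minimal illustration, assuming you already have two ranked lists of row IDs for the same query: one from the ANN index and one from a flat scan. The function name `recall_at_k` and the ID lists are hypothetical and do not come from the LanceDB API.

```python
def recall_at_k(ann_ids: list[int], exact_ids: list[int], k: int) -> float:
    """Fraction of the exact top-k neighbours that the ANN search also returned.

    Hypothetical helper for illustration; both inputs are ranked ID lists.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    # Set intersection avoids double-counting duplicate IDs in the ANN list.
    hits = len(set(ann_ids[:k]) & truth)
    return hits / len(truth)


# The ANN search found 3 of the 4 true nearest neighbours.
print(recall_at_k([1, 2, 3, 9], [1, 2, 3, 4], k=4))  # 0.75
```

Averaging this value over a sample of queries gives a recall estimate without running a flat scan on every production query.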

<Note title="Multivector distance constraint">
Multivector indexing currently requires `distance_type="cosine"` — `l2` is rejected at index-creation time. That restriction is why `bypass_vector_index()` is the escape hatch for non-cosine queries on a multivector column: the metric you want at query time cannot be served by the index, so you fall back to a flat scan. See [Multivector Search](/search/multivector-search) for the full rules.
</Note>

## Example: Construct an HNSW Index

### Index Configuration
@@ -248,6 +252,13 @@ Binary vectors are useful for hash-based retrieval, fingerprinting, or any scena
- `metric`: the `hamming` distance is used for similarity search
- The dimension of binary vectors must be a multiple of 8. For example, a 128-dimensional vector is stored as a uint8 array of size 16.

<Warning>
**`IVF_FLAT` + `hamming` is the only supported path for binary vectors.**

- `hamming` distance is only valid on packed binary (uint8) data; it is rejected on float vector columns.
- Quantized index types (`IVF_PQ`, `IVF_RQ`, `IVF_SQ`, `IVF_HNSW_PQ`, `IVF_HNSW_SQ`) do not accept binary inputs — their `distance_type` is restricted to `l2`, `cosine`, or `dot`.
</Warning>
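
As a rough illustration of the packed layout and the `hamming` metric, the sketch below packs a list of bits into bytes and counts differing bits with XOR. It is a standalone example, not LanceDB code: the helper names are hypothetical, and the most-significant-bit-first packing order is an assumption (it matches the default of `numpy.packbits`).

```python
def pack_bits(bits: list[int]) -> bytes:
    """Pack 0/1 values into bytes, 8 bits per byte, most significant bit first.

    Hypothetical helper; the bit order is an assumption of this sketch.
    """
    if len(bits) % 8 != 0:
        raise ValueError("binary vector dimension must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | (1 if bit else 0)
        out.append(byte)
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same packed length")
    # XOR leaves a 1 wherever the bits differ; count those 1s per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


packed = pack_bits([1, 0] * 64)           # a 128-dimensional binary vector
print(len(packed))                        # 16
print(hamming(packed, pack_bits([0] * 128)))  # 64
```

The 128-bit input packs into 16 bytes, which is the uint8 array size mentioned above.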

### 1. Create Table and Schema

<CodeGroup>
1 change: 1 addition & 0 deletions docs/search/full-text-search.mdx
@@ -156,6 +156,7 @@ The tokenizer is customizable, you can specify how the tokenizer splits the text
- `stem`: true
- `remove_stop_words`: true
- `ascii_folding`: true
- `custom_stop_words`: `None` — pass a `list[str]` to drop additional words beyond the language defaults. Requires `remove_stop_words=True`.
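
The interaction between `remove_stop_words` and `custom_stop_words` can be sketched in plain Python. This is an illustration under stated assumptions, not LanceDB's implementation: `DEFAULT_STOP_WORDS` is a tiny hypothetical stand-in for the language defaults, `filter_tokens` is a hypothetical helper, and raising an error when the requirement is not met is a choice made for this sketch.

```python
# Hypothetical stand-in for the built-in language defaults.
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "of", "the"})


def filter_tokens(tokens, remove_stop_words=True, custom_stop_words=None):
    """Drop default and custom stop words from a token list (illustrative only)."""
    if custom_stop_words and not remove_stop_words:
        # Assumption of this sketch: surface the documented requirement as an error.
        raise ValueError("custom_stop_words requires remove_stop_words=True")
    if not remove_stop_words:
        return list(tokens)
    # Custom words extend the defaults rather than replacing them.
    stop = DEFAULT_STOP_WORDS | {w.lower() for w in (custom_stop_words or [])}
    return [t for t in tokens if t.lower() not in stop]


print(filter_tokens(["the", "lance", "db"]))                            # ['lance', 'db']
print(filter_tokens(["the", "lance", "db"], custom_stop_words=["db"]))  # ['lance']
```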

For example, for languages with accents, you can configure the tokenizer to use `ascii_folding` to remove accents, e.g. 'é' to 'e':

33 changes: 33 additions & 0 deletions workflows/docs-audit/AGENTS.md
@@ -0,0 +1,33 @@
# AGENTS.md

This workspace orchestrates a docs-gap audit across external local repos. It does not own or vendor the source code from those repos.

## Working model

- Deterministic scripts live in `scripts/`.
- Area manifests live in `manifests/`.
- Codex prompt templates live in `prompts/`.
- Run state lives in `state/`.
- Generated run artifacts live in `artifacts/`.

## Rules for future agents

- Do not copy large code snapshots from the watched repos into this workspace.
- Keep the deterministic layer deterministic: refresh, extract, fingerprint, select, and update state.
- Keep semantic reasoning page-scoped and artifact-backed.
- Reports must describe only what is missing from the docs.
- When adding a new docs area, prefer a new manifest over changes to the core runner.
- Keep evidence compact and user-facing where possible.
- Preserve the distinction between deterministic outputs (`page_bundles`, metadata, state) and LLM outputs (`llm_outputs`, final report).

## Expected output shape

A completed run should leave behind:

- `metadata.json`
- `selected_pages.json`
- `page_bundles/*.json`
- `llm_outputs/*`
- `report.md`

The report should be concise, grouped by page or subsection, and should not contain implementation plans or doc-fix patches.