Skip to content

Explore DuckDB FTS extension for full-text search in Explorer #84

@rdhyee

Description

@rdhyee

Problem

The Interactive Explorer currently uses ILIKE '%term%' for text search (line 398). This is a brute-force substring scan — no ranking, no stemming, no stopword handling.

Proposal

Use DuckDB's full-text search extension, which is supported in DuckDB-WASM. This would give us BM25-ranked results, Porter stemming, and stopword filtering.

Approach: Pre-built index

Building the FTS index at runtime in the browser would be too slow for 6.7M records. Instead:

  1. Build offline: Run PRAGMA create_fts_index('samples', 'pid', 'label', 'description', 'place_name') in native DuckDB
  2. Export as .duckdb file: The FTS index is stored as plain tables + macros in a named schema
  3. Host on data.isamples.org alongside the parquet files
  4. ATTACH in browser: ATTACH 'https://data.isamples.org/isamples_fts_index.duckdb' — BM25 scoring comes along for free

Query change

-- Before (substring scan):
WHERE label ILIKE '%pottery%' OR description ILIKE '%pottery%'

-- After (ranked FTS):
SELECT *, score
FROM (
  SELECT *, fts_main_samples.match_bm25(pid, 'pottery') AS score
  FROM samples
)
WHERE score IS NOT NULL
ORDER BY score DESC

Open questions

  • How large will the .duckdb index file be for 6.7M records with 3-4 text columns?
  • Does ATTACH over HTTP work reliably in DuckDB-WASM for files this size?
  • Should we index all text columns or just label + description?
  • Alternative: DuckDB text analytics functions (stemming, stopwords) without a full FTS index — lighter weight but still better than raw ILIKE

Related

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions