Problem
The Interactive Explorer currently uses ILIKE '%term%' for text search (line 398). This is a brute-force substring scan — no ranking, no stemming, no stopword handling.
Proposal
Use DuckDB's full-text search extension, which is supported in DuckDB-WASM. This would give us BM25-ranked results, Porter stemming, and stopword filtering.
Approach: Pre-built index
Building the FTS index at runtime in the browser would be too slow for 6.7M records. Instead:
- Build offline: Run
PRAGMA create_fts_index('samples', 'pid', 'label', 'description', 'place_name') in native DuckDB
- Export as
.duckdb file: The FTS index is stored as plain tables + macros in a named schema
- Host on
data.isamples.org alongside the parquet files
ATTACH in browser: ATTACH 'https://data.isamples.org/isamples_fts_index.duckdb' — BM25 scoring comes along for free
Query change
-- Before (substring scan):
WHERE label ILIKE '%pottery%' OR description ILIKE '%pottery%'
-- After (ranked FTS):
SELECT *, score
FROM (
SELECT *, fts_main_samples.match_bm25(pid, 'pottery') AS score
FROM samples
)
WHERE score IS NOT NULL
ORDER BY score DESC
Open questions
- How large will the
.duckdb index file be for 6.7M records with 3-4 text columns?
- Does
ATTACH over HTTP work reliably in DuckDB-WASM for files this size?
- Should we index all text columns or just
label + description?
- Alternative: DuckDB text analytics functions (stemming, stopwords) without a full FTS index — lighter weight but still better than raw
ILIKE
Related
Problem
The Interactive Explorer currently uses
ILIKE '%term%'for text search (line 398). This is a brute-force substring scan — no ranking, no stemming, no stopword handling.Proposal
Use DuckDB's full-text search extension, which is supported in DuckDB-WASM. This would give us BM25-ranked results, Porter stemming, and stopword filtering.
Approach: Pre-built index
Building the FTS index at runtime in the browser would be too slow for 6.7M records. Instead:
PRAGMA create_fts_index('samples', 'pid', 'label', 'description', 'place_name')in native DuckDB.duckdbfile: The FTS index is stored as plain tables + macros in a named schemadata.isamples.orgalongside the parquet filesATTACHin browser:ATTACH 'https://data.isamples.org/isamples_fts_index.duckdb'— BM25 scoring comes along for freeQuery change
Open questions
.duckdbindex file be for 6.7M records with 3-4 text columns?ATTACHover HTTP work reliably in DuckDB-WASM for files this size?label+description?ILIKERelated