Updated 2026-05-08 per Codex review on #165. Major changes: refactored around a sample-centric document projection (Section 1), v1 minimum now includes dereferenced concept labels, query-time policy split from build-time tokenizer (new Section 3), quality gate hardened (Section 9). Original framing preserved in git history.
Sub-issue of #165. No code dependencies; can start in parallel with #N1 (#167).
Goal
Land a single design doc — SEARCH_INDEX_V1.md — that pins the v1 substrate contract before any pipeline or query code is written. Tracks 3-5 implement and measure against this doc.
Required contents
1. Sample search document projection
The substrate is not "tokenize these parquet columns." It is "tokenize a sample-centric document whose text fragments are joined across the property graph and tagged by their entity origin."
Each sample (pid) has a logical document of weighted text fragments. At build time, a join across the wide parquet (or its narrow equivalents) produces this projection; the substrate then tokenizes per fragment, tagging each token row with the virtual field name (entity dot field), not the source parquet column.
v1 minimum — the projection that ships first:
| virtual field | source | rationale |
|---|---|---|
| `sample.label` | `MaterialSampleRecord.label` (~6.68M coverage) | canonical title; near-universal |
| `sample.description` | `MaterialSampleRecord.description` (~1.61M ≈ 24%) | sparse but high-signal where present |
| `sample.place_name` | `samples_map_lite.parquet` `place_name[]` (~2.21M) | already proven valuable in current ILIKE search |
| `concept.label` | `material` / `context` / `object_type` URIs dereferenced via `vocab_labels.parquet` (`pref_label`, lang=en) | load-bearing addition: facet URIs are near-universal but raw URIs are useless to FTS; dereferenced labels make `pottery`, `ceramic`, `basalt`, `bone`, `marine` work as the user expects |

A sample whose `material` URI is `<…>/Pottery` gets a row `{token: 'pottery', pid: ..., field: 'concept.label', tf: 1, ...}`. One row per sample per facet URI per token. Coverage: most of the dataset.
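The per-fragment tokenization step can be sketched as follows. `fragment_rows` is a hypothetical helper name, and the pid is illustrative; the row shape matches the substrate schema in §4:

```python
from collections import Counter

def fragment_rows(pid: str, field: str, tokens: list[str]) -> list[dict]:
    """Turn one (pid, virtual-field) fragment of already-tokenized text
    into substrate token rows: one row per distinct token, tf counted
    within the fragment, doc_len shared across the fragment."""
    tf = Counter(tokens)
    doc_len = len(tokens)
    return [
        {"token": t, "pid": pid, "field": field, "tf": c, "doc_len": doc_len}
        for t, c in tf.items()
    ]

# e.g. the dereferenced concept label for a hypothetical sample:
fragment_rows("igsn:ABC", "concept.label", ["pottery"])
# [{'token': 'pottery', 'pid': 'igsn:ABC', 'field': 'concept.label', 'tf': 1, 'doc_len': 1}]
```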
v1.5 expansion (post-v1, additive — no schema change):
| virtual field | source (Solr equivalent) |
|---|---|
| `event.label` | `producedBy_label` (~1.92M) |
| `event.description` | `producedBy_description` (~5.54M ≈ 83%) |
| `event.has_feature_of_interest` | `producedBy_hasFeatureOfInterest` (~6.35M ≈ 95%) |
| `event.sampling_purpose` | `producedBy_samplingPurpose` (~262K) |
| `site.label` | `producedBy_samplingSite_label` (~190K) |
| `site.description` | `producedBy_samplingSite_description` (~172K) |
| `site.place_name` | `producedBy_samplingSite_placeName[]` (~336K rows) |
v2 / Solr `searchText` parity (named, not built):

| virtual field | source |
|---|---|
| `agent.name` | registrant + responsibility agents |
| `curation.label` | `curation_label` |
| `curation.description` | `curation_description` |
| `curation.location` | `curation_location` |
| `keywords` | (if present) |
| `source` | `source` (already a facet; low-value as FTS) |
2. Tokenizer (build-time)
- Lowercase ASCII via `String.prototype.toLowerCase()` / Python `str.lower()`.
- Unicode NFKC normalization.
- Diacritic stripping via NFD + combining-mark removal.
- Whitespace split, punctuation stripped, length filter (1 ≤ len ≤ 64).
- No stemming. Honest limitation; document in UI copy.
- Index every token, including stopwords. Stopword handling is query-time, not build-time (see §3) — keeps substrate flexible for future phrase queries.
- Parallel implementations: JS for browser query, Python for offline build. Shared regression test set (≥30 strings).
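A minimal Python sketch of the build-side pipeline above (the JS twin mirrors it; the shared regression set keeps the two honest):

```python
import re
import unicodedata

def tokenize(text: str) -> list[str]:
    """Build-time tokenizer sketch: NFKC + lowercase + diacritic strip,
    whitespace split, punctuation stripped, length filter 1..64.
    Stopwords are intentionally kept; they are handled at query time."""
    # Unicode compatibility normalization, then lowercase.
    text = unicodedata.normalize("NFKC", text).lower()
    # Diacritic stripping: decompose (NFD) and drop combining marks.
    text = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    tokens = []
    for raw in text.split():
        # Strip punctuation; keep letters and digits.
        tok = re.sub(r"[^\w]", "", raw).strip("_")
        if 1 <= len(tok) <= 64:
            tokens.append(tok)
    return tokens

tokenize("Çatalhöyük Pottery, basalt!")  # ['catalhoyuk', 'pottery', 'basalt']
```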
3. Query-time policy (distinct from build-time)
A separate axis from the build-time tokenizer.
- Tokenize the user input with the same tokenizer used at build (lowercase + NFKC + diacritic strip + whitespace split + length filter). Keeps the round-trip invariant.
- Drop or downweight English stopwords from the bag-of-words AND. Curated list (`a, an, the, of, from, for, to, in, on, at, is, was, with, and, or`) — small, conservative, no language detection.
- Rationale: a query like `pottery from Cyprus` should not fail because no sample has `from` in its text. Build-time skipping would lose phrase-query potential; query-time is reversible policy.
- AND-combine the surviving tokens. Empty surviving set ⇒ empty result with helpful copy.
- No query-language syntax in v1. No quoted phrases, no field-prefix operators (`label:foo`), no booleans, no negation. Documented v2 path: phrase quoting first; field-prefix and negation later. (Reference: `query-spec.qmd` Solr surface — explicitly not implemented in v1.)
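The query-side policy above, sketched in Python with the build-time normalization inlined so the round-trip invariant is visible (function names are illustrative):

```python
import re
import unicodedata

# Curated stopword list from the spec; applied at query time only.
STOPWORDS = {"a", "an", "the", "of", "from", "for", "to", "in", "on",
             "at", "is", "was", "with", "and", "or"}

def _tokenize(text: str) -> list[str]:
    # Same pipeline as the build-time tokenizer: NFKC + lowercase,
    # diacritic strip, whitespace split, punctuation strip, length 1..64.
    text = unicodedata.normalize("NFKC", text).lower()
    text = "".join(c for c in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(c))
    toks = (re.sub(r"[^\w]", "", t).strip("_") for t in text.split())
    return [t for t in toks if 1 <= len(t) <= 64]

def query_tokens(user_input: str) -> list[str]:
    """Tokenize with the build-time rules, then drop stopwords.
    Caller AND-combines the survivors; an empty list means an empty
    result plus helpful copy, not an error."""
    return [t for t in _tokenize(user_input) if t not in STOPWORDS]

query_tokens("pottery from Cyprus")  # ['pottery', 'cyprus']
```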
4. Substrate row schema
```
{
  token:   VARCHAR    -- normalized token
  pid:     VARCHAR    -- sample primary id
  field:   VARCHAR    -- 'sample.label' | 'sample.description' | 'sample.place_name'
                      --   | 'concept.label' | (future: 'event.*' | 'site.*' | 'agent.*' …)
  tf:      USMALLINT  -- term frequency in this (pid, field) pair
  doc_len: USMALLINT  -- token count of (pid, field) for BM25 length norm
}
```
Field weights are query-side code, not substrate data. Adding a v1.5 / v2 field = re-running the build pipeline with more sources, no schema migration.
5. Ranking spec
- BM25, fixed `k1=1.2`, `b=0.75` (tune in v1.1 only if benchmark drift demands it).
- DF (per-token document frequency) precomputed at build time, stored alongside the substrate.
- Length norm uses `doc_len` from the schema above.
- Field weights (query-side, v1):

| field | weight |
|---|---|
| `sample.label` | 3.0 |
| `concept.label` | 2.5 |
| `sample.place_name` | 2.0 |
| `sample.description` | 1.0 |
Final result rank = sum across (pid, field) BM25 contributions weighted by field weight.
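The scoring path above can be sketched as follows. The spec leaves the length-norm denominator open; this sketch assumes a per-field average length, and the function names are illustrative:

```python
import math
from collections import defaultdict

K1, B = 1.2, 0.75  # fixed per spec; revisit only on benchmark drift

# Query-side field weights from the table above.
FIELD_WEIGHTS = {"sample.label": 3.0, "concept.label": 2.5,
                 "sample.place_name": 2.0, "sample.description": 1.0}

def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float) -> float:
    # Standard BM25 term score with a +1-smoothed idf (never negative).
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_len))

def rank(rows, df, n_docs, avg_len_by_field):
    """rows: substrate rows matching the query tokens (AND applied
    upstream). Sums field-weighted per-(pid, field) BM25 contributions
    into one score per pid, highest first."""
    scores = defaultdict(float)
    for r in rows:
        w = FIELD_WEIGHTS.get(r["field"], 1.0)
        scores[r["pid"]] += w * bm25(r["tf"], df[r["token"]], n_docs,
                                     r["doc_len"], avg_len_by_field[r["field"]])
    return sorted(scores.items(), key=lambda kv: -kv[1])
```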
6. Partition shape
- Hash-partition by token: `hash(token) % N` shards.
- Per-shard byte cap: ≤ 5 MB uncompressed parquet.
- High-frequency token rule: if a single token's postings would exceed the cap, sub-shard by `hash(pid) % M` within that token's logical shard.
- Number of top-level shards (`N`): start with 64, refine in build measurement.
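Routing must agree between the Python build and the JS query side, so the hash function has to be stable; the spec does not name one, so this sketch uses SHA-1 as an illustrative fixed digest (note Python's built-in `hash()` is salted per process and must not be used):

```python
import hashlib

N_SHARDS = 64  # starting point per spec; refine in build measurement

def shard_for(token: str, n_shards: int = N_SHARDS) -> int:
    """Map a normalized token to its top-level shard index.
    A fixed digest keeps build-time and query-time routing consistent
    across processes and languages."""
    h = int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:8], "big")
    return h % n_shards
```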
7. Budgets
| metric | target | rationale |
|---|---|---|
| cold first search (P50) | ≤ 2 s | matches user expectation for "search" |
| warm repeat search | ≤ 500 ms | substantial improvement over ILIKE |
| filter-composed cold search | ≤ 3 s | accommodates source + facet AND |
| bytes transferred cold | ≤ 5 MB | acceptable on residential broadband |
| bytes transferred warm | ≤ 1 MB per query | repeated queries don't refetch shards |
These are contract. Track 5's GO/NO-GO gate is mechanical against this table. "Warm" disambiguation (per #174, deferred): the contract distinguishes
- re-run-same-query warm: same query, second invocation, same page (measures the end-to-end cache + render path)
- new-query-after-warm-up warm: different query, after parquet metadata is cached (measures query execution once the substrate file is warm)

Both are reported by the benchmark; the budget targets above apply to both.
8. Versioning
- URL pattern: `https://data.isamples.org/isamples_YYYYMM_search_index_v1/<shard>.parquet`.
- Explorer pins to a specific `YYYYMM` so a dataset rebuild can't break a deployed site mid-flight.
- Index version tied to data version. v1.x format bumps require a URL path bump (`_v1` → `_v2`).
9. Curated benchmark + quality gate
- File: `tests/search_benchmark.json`
- 12-15 queries, hand-labeled top-10 by Raymond. Must include:
  - bare-text queries (`pottery`, `basalt`)
  - multi-term (`pottery Cyprus`)
  - stopword-heavy (`pottery from Cyprus`) — verifies the query-time stopword policy works
  - concept-only queries (`ceramic`, `bone`, `mammal`) — verifies dereferenced concept labels work; fails loudly if v1 ships without concept labels
  - diacritic (`Çatalhöyük`)
  - no-hit (`xyzzyqqqplugh`)
  - filter-composed cases (source-only, source + material)
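Because the gate is mechanical, each benchmark entry needs a machine-checkable shape. A minimal sketch, assuming a hypothetical entry schema (the real schema is fixed when `tests/search_benchmark.json` lands; the field names here are illustrative):

```python
# Hypothetical shape of one benchmark entry; NOT the committed schema.
entry = {
    "query": "pottery from Cyprus",
    "expected_top10": ["pid-1", "pid-2"],  # hand-labeled sample pids
}

def hits_at_10(ranked_pids: list[str], expected: list[str]) -> int:
    """Count hand-labeled pids that appear in the engine's top 10."""
    return len(set(ranked_pids[:10]) & set(expected))
```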
- Quality gate is a hard requirement, not advisory. Each release of the substrate must hit:
10. Build-stats artifact (contract requirement)
The v1 substrate build pipeline (#170) MUST emit build_stats.json alongside the partitioned token-row parquets, recording per-virtual-field populated-sample-count, total-token-count, average doc length, concept-label URI resolution rate, top-DF tokens, and shard size distribution. Schema and acceptance thresholds are specified in #170 §6.
This contract item exists so SEARCH_INDEX_V1.md and the builder cannot drift: every release of the substrate carries empirical coverage data, not a doc claim about coverage.
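For concreteness, one possible shape of the artifact, expressed as a Python literal. All keys are assumptions for illustration; the authoritative schema and acceptance thresholds live in #170 §6:

```python
# Illustrative build_stats.json shape; keys are assumptions, not the
# committed schema (#170 §6 owns that).
build_stats = {
    "data_version": "YYYYMM",  # pinned data release this index was built from
    "per_field": {
        "sample.label": {
            "populated_samples": 0,   # samples with a non-empty fragment
            "total_tokens": 0,
            "avg_doc_len": 0.0,
        },
        # one entry per virtual field in the build
    },
    "concept_label_uri_resolution_rate": 0.0,  # resolved URIs / total facet URIs
    "top_df_tokens": [],        # e.g. [["pottery", 123456], ...]
    "shard_sizes_bytes": [],    # uncompressed size per shard, for the 5 MB cap
}
```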
11. Out of scope (v1)
- Solr-parity field set (named in §1 as the v2 expansion path; not implemented).
- Stemming (English-specific, hurts non-English content; v2+ if at all).
- Query-language syntax: quoted phrases, field operators, booleans, negation, wildcards, fuzzy matching, ranges, boosts — all v2+.
Acceptance
- `SEARCH_INDEX_V1.md` lands in repo root (or `docs/` — match the EXPLORER_STATE.md placement)
- `tests/search_benchmark.json` lands with hand-labeled top-10 for the canonical query set, including the concept-only and stopword-heavy queries
Refs
#165, #164, PR #95, #170 (build-stats), #174 (warm disambiguation)