Explorer FTS Track 2: search_index_v1 contract doc #169

@rdhyee

Description

Updated 2026-05-08 per Codex review on #165. Major changes: refactored around a sample-centric document projection (Section 1), v1 minimum now includes dereferenced concept labels, query-time policy split from build-time tokenizer (new Section 3), quality gate hardened (Section 8). Original framing preserved in git history.

Sub-issue of #165. No code dependencies; can start in parallel with #N1 (#167).

Goal

Land a single design doc — SEARCH_INDEX_V1.md — that pins the v1 substrate contract before any pipeline or query code is written. Tracks 3-5 implement and measure against this doc.

Required contents

1. Sample search document projection

The substrate is not "tokenize these parquet columns." It is "tokenize a sample-centric document whose text fragments are joined across the property graph and tagged by their entity origin."

Each sample (pid) has a logical document of weighted text fragments. At build time, a join across the wide parquet (or its narrow equivalents) produces this projection; the substrate then tokenizes per fragment, tagging each token row with the virtual field name (entity dot field), not the source parquet column.

v1 minimum — the projection that ships first:

| virtual field | source | rationale |
| --- | --- | --- |
| `sample.label` | `MaterialSampleRecord.label` (~6.68M coverage) | canonical title; near-universal |
| `sample.description` | `MaterialSampleRecord.description` (~1.61M ≈ 24%) | sparse but high-signal where present |
| `sample.place_name` | `samples_map_lite.parquet` `place_name[]` (~2.21M) | already proven valuable in current ILIKE search |
| `concept.label` | material / context / object_type URIs dereferenced via `vocab_labels.parquet` (`pref_label`, lang=en) | load-bearing addition: facet URIs are near-universal but raw URIs are useless to FTS; dereferenced labels make *pottery*, *ceramic*, *basalt*, *bone*, *marine* work as the user expects |

A sample whose material URI is <…>/Pottery gets a row {token: 'pottery', pid: ..., field: 'concept.label', tf: 1, ...}. One row per sample per facet URI per token. Coverage: most of the dataset.
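As a rough illustration, the projection for one sample might look like the sketch below. The record shape and the `vocab_labels` dict are hypothetical stand-ins; the real build derives both from a join across the wide parquet and `vocab_labels.parquet`.

```python
# Sketch of the v1 sample-centric document projection. All field names
# on `record` are placeholders, not the actual parquet column names.

def project_sample(record, vocab_labels):
    """Yield (virtual_field, text) fragments for one sample (pid)."""
    if record.get("label"):
        yield ("sample.label", record["label"])
    if record.get("description"):
        yield ("sample.description", record["description"])
    for place in record.get("place_name", []):
        yield ("sample.place_name", place)
    # Dereference facet URIs so raw URIs become searchable words.
    for uri in record.get("facet_uris", []):
        label = vocab_labels.get(uri)
        if label:
            yield ("concept.label", label)

record = {
    "label": "Sherd 42",
    "description": "Rim fragment",
    "place_name": ["Cyprus"],
    "facet_uris": ["https://example.org/vocab/Pottery"],  # placeholder URI
}
vocab = {"https://example.org/vocab/Pottery": "Pottery"}
fragments = list(project_sample(record, vocab))
# fragments includes ('concept.label', 'Pottery')
```

Each fragment then flows through the build-time tokenizer (§2), tagged with its virtual field name rather than the source parquet column.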

v1.5 expansion (post-v1, additive — no schema change):

| virtual field | source (Solr equivalent) |
| --- | --- |
| `event.label` | `producedBy_label` (~1.92M) |
| `event.description` | `producedBy_description` (~5.54M ≈ 83%) |
| `event.has_feature_of_interest` | `producedBy_hasFeatureOfInterest` (~6.35M ≈ 95%) |
| `event.sampling_purpose` | `producedBy_samplingPurpose` (~262K) |
| `site.label` | `producedBy_samplingSite_label` (~190K) |
| `site.description` | `producedBy_samplingSite_description` (~172K) |
| `site.place_name` | `producedBy_samplingSite_placeName[]` (~336K rows) |

v2 / Solr searchText parity (named, not built):

| virtual field | source |
| --- | --- |
| `agent.name` | `registrant` + responsibility agents |
| `curation.label` | `curation_label` |
| `curation.description` | `curation_description` |
| `curation.location` | `curation_location` |
| `keywords` | (if present) |
| `source` | `source` (already a facet; low-value as FTS) |

2. Tokenizer (build-time)

  • Lowercase ASCII via String.prototype.toLowerCase() / Python str.lower().
  • Unicode NFKC normalization.
  • Diacritic stripping via NFD + combining-mark removal.
  • Whitespace split, punctuation stripped, length filter (1 ≤ len ≤ 64).
  • No stemming. Honest limitation; document in UI copy.
  • Index every token, including stopwords. Stopword handling is query-time, not build-time (see §3) — keeps substrate flexible for future phrase queries.
  • Parallel implementations: JS for browser query, Python for offline build. Shared regression test set (≥30 strings).
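The rules above can be sketched in a few lines of Python; this is a minimal stand-in for the offline build implementation, not the shipped code (which carries the ≥30-string regression set):

```python
import re
import unicodedata

def tokenize(text: str) -> list[str]:
    """Build-time tokenizer sketch per the v1 spec: NFKC + lowercase,
    diacritic strip via NFD + combining-mark removal, whitespace split,
    punctuation strip, length filter 1..64. No stemming, no stopword
    removal (stopwords are query-time policy)."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    tokens = []
    for raw in text.split():
        tok = re.sub(r"[^\w]", "", raw)  # strip punctuation, keep word chars
        if 1 <= len(tok) <= 64:
            tokens.append(tok)
    return tokens

tokenize("Çatalhöyük pottery, from the site.")
# ['catalhoyuk', 'pottery', 'from', 'the', 'site'] — stopwords indexed
```

The JS twin would apply `String.prototype.normalize('NFKC')` / `('NFD')` and the same filters, with both implementations pinned to the shared regression set.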

3. Query-time policy (distinct from build-time)

A separate axis from the build-time tokenizer.

  • Tokenize the user input with the same tokenizer used at build (lowercase + NFKC + diacritic strip + whitespace split + length filter). Keeps the round-trip invariant.
  • Drop or downweight English stopwords from the bag-of-words AND. Curated list (a, an, the, of, from, for, to, in, on, at, is, was, with, and, or) — small, conservative, no language detection.
    • Rationale: a query like pottery from Cyprus should not fail because no sample has from in its text. Build-time skipping would lose phrase-query potential; query-time is reversible policy.
  • AND-combine the surviving tokens. Empty surviving set ⇒ empty result with helpful copy.
  • No query-language syntax in v1. No quoted phrases, no field-prefix operators (label:foo), no booleans, no negation. Documented v2 path: phrase quoting first; field-prefix and negation later. (Reference: query-spec.qmd Solr surface — explicitly not implemented in v1.)
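A sketch of the query-time policy, assuming the same normalization as the build-time tokenizer (the normalizer below is a simplified stand-in for the shared implementation):

```python
import unicodedata

# Curated, conservative English stopword list from the spec.
STOPWORDS = {"a", "an", "the", "of", "from", "for", "to",
             "in", "on", "at", "is", "was", "with", "and", "or"}

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))

def query_terms(user_input: str) -> list[str]:
    """Tokenize with build-equivalent rules, then drop stopwords.
    Survivors are AND-combined; an empty list means an empty result
    with helpful copy, not an error."""
    tokens = [t for t in normalize(user_input).split() if 1 <= len(t) <= 64]
    return [t for t in tokens if t not in STOPWORDS]

query_terms("pottery from Cyprus")  # ['pottery', 'cyprus']
```

Because stopwords are still indexed at build time, this policy is reversible: a v2 phrase-query feature can simply stop dropping them.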

4. Substrate row schema

```
{
  token:    VARCHAR   -- normalized token
  pid:      VARCHAR   -- sample primary id
  field:    VARCHAR   -- 'sample.label' | 'sample.description' | 'sample.place_name'
                      -- | 'concept.label' | (future: 'event.*' | 'site.*' | 'agent.*' …)
  tf:       USMALLINT -- term frequency in this (pid, field) pair
  doc_len:  USMALLINT -- token count of (pid, field) for BM25 length norm
}
```

Field weights are query-side code, not substrate data. Adding a v1.5 / v2 field = re-running the build pipeline with more sources, no schema migration.
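To make the schema concrete, here is how rows for one (pid, field) fragment might be produced, assuming an already-tokenized fragment (the pid below is a placeholder):

```python
from collections import Counter

def token_rows(pid: str, field: str, tokens: list[str]) -> list[dict]:
    """One row per distinct token in this (pid, field) fragment:
    tf = in-fragment frequency, doc_len = total fragment token count."""
    doc_len = len(tokens)
    return [
        {"token": tok, "pid": pid, "field": field, "tf": tf, "doc_len": doc_len}
        for tok, tf in Counter(tokens).items()
    ]

rows = token_rows("igsn:XYZ", "sample.label", ["basalt", "sample", "basalt"])
# 2 rows: basalt (tf=2), sample (tf=1), both with doc_len=3
```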

5. Ranking spec

  • BM25, fixed k1=1.2, b=0.75 (tune in v1.1 only if benchmark drift demands it).
  • DF (per-token document frequency) precomputed at build time, stored alongside the substrate.
  • Length norm uses doc_len from the schema above.
  • Field weights (query-side, v1):
| field | weight |
| --- | --- |
| `sample.label` | 3.0 |
| `concept.label` | 2.5 |
| `sample.place_name` | 2.0 |
| `sample.description` | 1.0 |

Final result rank = sum across (pid, field) BM25 contributions weighted by field weight.
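A minimal sketch of the scoring path, assuming the common Lucene-style IDF form (the doc pins k1/b and the weights, not the IDF variant; postings rows are hypothetical):

```python
import math

K1, B = 1.2, 0.75
FIELD_WEIGHTS = {"sample.label": 3.0, "concept.label": 2.5,
                 "sample.place_name": 2.0, "sample.description": 1.0}

def bm25(tf, df, n_docs, doc_len, avg_doc_len):
    # df is precomputed at build time and shipped with the substrate.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_doc_len))

def rank(postings, n_docs, avg_doc_len):
    """Sum field-weighted BM25 contributions per pid, best first."""
    scores = {}
    for row in postings:
        s = bm25(row["tf"], row["df"], n_docs, row["doc_len"], avg_doc_len)
        scores[row["pid"]] = scores.get(row["pid"], 0.0) + FIELD_WEIGHTS[row["field"]] * s
    return sorted(scores.items(), key=lambda kv: -kv[1])

postings = [  # hypothetical rows matching one query token
    {"pid": "A", "field": "sample.label", "tf": 1, "df": 10, "doc_len": 4},
    {"pid": "B", "field": "sample.description", "tf": 1, "df": 10, "doc_len": 4},
]
ranked = rank(postings, n_docs=1000, avg_doc_len=5.0)
# 'A' outranks 'B': identical BM25 term, but label weight 3.0 vs 1.0
```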

6. Partition shape

  • Hash-partition by token: hash(token) % N shards.
  • Per-shard byte cap: ≤ 5 MB uncompressed parquet.
  • High-frequency token rule: if a single token's postings would exceed the cap, sub-shard by hash(pid) % M within that token's logical shard.
  • Number of top-level shards (N): start with 64, refine in build measurement.
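The shard assignment might be sketched as follows; the concrete hash function is an implementation choice, not pinned by this doc (md5 is used here only because it is stable across processes, unlike Python's built-in `hash`):

```python
import hashlib

N_SHARDS = 64  # starting point per the spec; refined in build measurement

def shard_for(token: str, n_shards: int = N_SHARDS) -> int:
    """Stable hash-partition: hash(token) % N. Build and query sides
    must use the identical function so lookups hit the right file."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards

# A high-frequency token whose postings exceed the 5 MB cap would be
# sub-sharded by hash(pid) % M within that token's logical shard.
```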

7. Budgets

| metric | target | rationale |
| --- | --- | --- |
| cold first search (P50) | ≤ 2 s | matches user expectation for "search" |
| warm repeat search | ≤ 500 ms | substantial improvement over ILIKE |
| filter-composed cold search | ≤ 3 s | accommodates source + facet AND |
| bytes transferred, cold | ≤ 5 MB | acceptable on residential broadband |
| bytes transferred, warm | ≤ 1 MB per query | repeated queries don't refetch shards |

These budgets are the contract: Track 5's GO/NO-GO gate is mechanical against this table. "Warm" disambiguation (deferred per #174): the contract distinguishes two cases:

  • re-run-same-query warm: same query, second invocation, same page (measures end-to-end cache + render path)
  • new-query-after-warm-up warm: different query, after parquet metadata is cached (measures query execution after substrate file is warm)

Both are reported by the benchmark; the budget targets above apply to both.

8. Versioning

  • URL pattern: https://data.isamples.org/isamples_YYYYMM_search_index_v1/<shard>.parquet.
  • Explorer pins to a specific YYYYMM so a dataset rebuild can't break a deployed site mid-flight.
  • Index version is tied to data version. v1.x format bumps require a URL path bump (_v1 → _v2).

9. Curated benchmark + quality gate

  • File: tests/search_benchmark.json
  • 12-15 queries, hand-labeled top-10 by Raymond. Must include:
    • bare-text queries (pottery, basalt)
    • multi-term (pottery Cyprus)
    • stopword-heavy (pottery from Cyprus) — verifies query-time stopword policy works
    • concept-only queries (ceramic, bone, mammal) — verifies dereferenced concept labels work; fails loudly if v1 ships without concept labels
    • diacritic (Çatalhöyük)
    • no-hit (xyzzyqqqplugh)
    • filter-composed cases (source-only, source + material)
  • The quality gate is a hard requirement, not advisory: each release of the substrate must pass the hand-labeled benchmark before it ships.
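One entry of tests/search_benchmark.json might look like the sketch below. The field names are illustrative placeholders, not the contract; this doc fixes only the query categories and the hand-labeled top-10 requirement.

```python
import json

# Hypothetical benchmark-entry shape; pids are placeholders.
entry = {
    "query": "pottery from Cyprus",
    "category": "stopword-heavy",
    "filters": {},                     # filter-composed cases populate this
    "expected_top10": ["igsn:AAA", "igsn:BBB"],
}
serialized = json.dumps(entry)  # file would be a JSON list of such entries
```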

10. Build-stats artifact (contract requirement)

The v1 substrate build pipeline (#170) MUST emit build_stats.json alongside the partitioned token-row parquets, recording per-virtual-field populated-sample-count, total-token-count, average doc length, concept-label URI resolution rate, top-DF tokens, and shard size distribution. Schema and acceptance thresholds are specified in #170 §6.

This contract item exists so SEARCH_INDEX_V1.md and the builder cannot drift: every release of the substrate carries empirical coverage data, not a doc claim about coverage.

11. Out of scope (v1)

  • Solr-parity field set (named in §1 as the v2 expansion path; not implemented).
  • Stemming (English-specific, hurts non-English content; v2+ if at all).
  • Query-language syntax: quoted phrases, field operators, booleans, negation, wildcards, fuzzy matching, ranges, boosts — all v2+.

Acceptance

  • SEARCH_INDEX_V1.md lands in repo root (or docs/ — match the EXPLORER_STATE.md placement)
  • All 11 sections above populated
  • tests/search_benchmark.json lands with hand-labeled top-10 for the canonical query set, including the concept-only and stopword-heavy queries
  • Build-stats artifact requirement (§10) referenced in Explorer FTS Track 3: Offline index builder + tokenizer regression set #170 acceptance
  • Doc-only PR; no pipeline or browser code

Refs

#165, #164, PR #95, #170 (build-stats), #174 (warm disambiguation)
