Add single-field text_and_string indexing with native fast field support #156

Closed

tlee732 wants to merge 5 commits into indextables:main from tlee732:feature/text-and-string-single-field

Add single-field text_and_string indexing with native fast field support#156
tlee732 wants to merge 5 commits into
indextables:mainfrom
tlee732:feature/text-and-string-single-field

Conversation

Contributor

@tlee732 tlee732 commented Apr 6, 2026

Summary

Adds a single-field text_and_string indexing mode for companion splits. One tantivy field serves both full-text search and aggregations, replacing the dual-field __text companion approach from the original PR for lower storage cost and simpler query routing.

Architecture

Each text_and_string column creates one tantivy field with two independent behaviors:

| Capability | Storage | Tokenizer | How it works |
| --- | --- | --- | --- |
| Full-text search | Inverted index | default (lowercase + split on non-alphanumeric) | Standard tantivy term/phrase queries |
| EqualTo / IN filters | Inverted index | default | PhraseQuery(slop=0) as candidate; Spark post-filters for exactness |
| GROUP BY / aggregation | Fast field (columnar) | raw (stores original string) | Tantivy terms aggregation on raw values |
| Sorting | Fast field (columnar) | raw | Tantivy sort on raw values |

Write path

```
schema_derivation.rs: TextAndString →
    TextOptions::default()
        .set_indexing_options(default tokenizer, WithFreqsAndPositions)
        .set_fast(Some("raw"))

indexing.rs: doc.add_text(field, val)
    → tantivy writes to inverted index (tokenized) AND fast field (raw) in one call
```
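To make the "one call, two representations" behavior concrete, here is a minimal stand-alone model of what a single write produces: default-tokenized terms for the inverted index and the untouched original string for the raw fast field. The helper names are hypothetical and the tokenizer is a simplified approximation of tantivy's default (lowercase + split on non-alphanumeric), not the real indexing.rs code path.

```rust
/// Approximation of tantivy's "default" tokenizer: split on
/// non-alphanumeric characters, drop empties, lowercase each token.
fn default_tokenize(input: &str) -> Vec<String> {
    input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// One write, two independent representations: tokenized terms for the
/// inverted index, and the verbatim value for the raw fast field.
fn index_text_and_string(value: &str) -> (Vec<String>, String) {
    (default_tokenize(value), value.to_string())
}

fn main() {
    let (terms, raw) = index_text_and_string("Visit https://Example.com NOW!");
    // The inverted index sees lowercased word tokens...
    assert_eq!(terms, vec!["visit", "https", "example", "com", "now"]);
    // ...while the fast field keeps the original string verbatim.
    assert_eq!(raw, "Visit https://Example.com NOW!");
}
```

This is also why the punctuation integration test below holds: URLs and emails survive verbatim in the fast field while search still finds the individual words.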

Read path — fast field transcoding

The companion read path normally transcodes string fast fields from parquet at query time (Hybrid mode). TextAndString fields are excluded from transcoding because they already have native fast data from set_fast(Some("raw")).

The exclusion uses manifest.string_indexing_modes (checking for TextAndString) rather than fast_field_tokenizer.is_some() because build_column_mapping sets fast_field_tokenizer on ALL Str columns — only string_indexing_modes correctly distinguishes TextAndString from regular string fields.

Without this exclusion, merge_two_columnars() combines native + transcoded data, producing duplicate ordinals that double GROUP BY counts (the bug this PR fixes).
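The exclusion rule can be sketched as a small self-contained model. The names (`columns_to_transcode`, `FastFieldMode`, `StringIndexingMode`, `ColumnMapping`) mirror this PR's description, but the types here are simplified stand-ins rather than the real transcode.rs:

```rust
use std::collections::HashMap;

enum FastFieldMode { Hybrid, ParquetOnly, Disabled }

#[derive(PartialEq)]
enum StringIndexingMode { TextAndString }

struct ColumnMapping {
    tantivy_field_name: String,
    // Set on ALL Str columns in production, so it cannot be the discriminator.
    #[allow(dead_code)]
    fast_field_tokenizer: Option<String>,
}

fn columns_to_transcode<'a>(
    mode: &FastFieldMode,
    columns: &'a [ColumnMapping],
    string_indexing_modes: &HashMap<String, StringIndexingMode>,
) -> Vec<&'a str> {
    columns
        .iter()
        .filter(|c| match mode {
            FastFieldMode::Disabled => false,
            // ParquetOnly ignores native .fast data, so everything is transcoded.
            FastFieldMode::ParquetOnly => true,
            // Hybrid: skip columns whose indexing mode says they already carry
            // native fast data; fast_field_tokenizer is deliberately not consulted.
            FastFieldMode::Hybrid => {
                string_indexing_modes.get(&c.tantivy_field_name)
                    != Some(&StringIndexingMode::TextAndString)
            }
        })
        .map(|c| c.tantivy_field_name.as_str())
        .collect()
}

fn main() {
    let cols = vec![
        ColumnMapping { tantivy_field_name: "message".into(), fast_field_tokenizer: None },
        ColumnMapping { tantivy_field_name: "host".into(), fast_field_tokenizer: Some("raw".into()) },
    ];
    let mut modes = HashMap::new();
    modes.insert("message".to_string(), StringIndexingMode::TextAndString);

    // Hybrid skips the TextAndString column; ParquetOnly transcodes both.
    assert_eq!(columns_to_transcode(&FastFieldMode::Hybrid, &cols, &modes), vec!["host"]);
    assert_eq!(columns_to_transcode(&FastFieldMode::ParquetOnly, &cols, &modes), vec!["message", "host"]);
    assert!(columns_to_transcode(&FastFieldMode::Disabled, &cols, &modes).is_empty());
}
```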

Manifest representation

```
ColumnMapping {
    tantivy_field_name: "message",
    tantivy_type: "Str",
    fast_field_tokenizer: None,  // None = has native fast data, no transcoding
}

string_indexing_modes: { "message": TextAndString }
```

Regular string fields have fast_field_tokenizer: Some("raw") (needs transcoding).
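Condensed to its core, the build_column_mapping fix is one decision: derive fast_field_tokenizer from the column's indexing mode. A hedged sketch (the function name is hypothetical; ExactOnly is included because the commit history mentions exact_only also normalizing to "raw"):

```rust
enum StringIndexingMode { TextAndString, ExactOnly }

/// Manifest value for fast_field_tokenizer, derived from the indexing mode.
fn fast_field_tokenizer_for(mode: Option<&StringIndexingMode>) -> Option<&'static str> {
    match mode {
        // Native fast data comes from set_fast(Some("raw")) at schema time;
        // recording None keeps the read path from transcoding it a second time.
        Some(StringIndexingMode::TextAndString) => None,
        // Regular and exact_only string columns still need parquet transcoding.
        _ => Some("raw"),
    }
}

fn main() {
    assert_eq!(fast_field_tokenizer_for(Some(&StringIndexingMode::TextAndString)), None);
    assert_eq!(fast_field_tokenizer_for(Some(&StringIndexingMode::ExactOnly)), Some("raw"));
    assert_eq!(fast_field_tokenizer_for(None), Some("raw"));
}
```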

Design decisions

  1. string_indexing_modes as discriminator, not fast_field_tokenizer: All Str columns have fast_field_tokenizer set in production (build_column_mapping defaults to Some("raw")). Using fast_field_tokenizer.is_some() would skip transcoding for ALL Str fields, breaking GROUP BY on regular string columns. string_indexing_modes is the authoritative source.

  2. fast_field_tokenizer: None for TextAndString: Fixed build_column_mapping to set None instead of Some("raw"). The field has native fast data and doesn't need transcoding — None accurately represents this. Previously the misleading Some("raw") suggested it needed transcoding.

  3. Hybrid-only skip: The transcode skip only applies in FastFieldMode::Hybrid. In ParquetOnly mode, native .fast data is ignored entirely, so TextAndString must be transcoded to have any fast data at all.

  4. Backward compatibility: string_indexing_modes is #[serde(default)] so old manifests deserialize with an empty map. TextAndString and string_indexing_modes were introduced together — no old manifest can have TextAndString native fast data without the corresponding entry.

Testing

6 Rust unit tests (transcode.rs):

| Test | What it validates |
| --- | --- |
| test_columns_to_transcode_hybrid | TextAndString skipped, regular string transcoded |
| test_columns_to_transcode_hybrid_distinguishes_text_and_string_from_regular | Regression: both Str fields have the same fast_field_tokenizer, but only TextAndString is skipped |
| test_columns_to_transcode_parquet_only | ParquetOnly transcodes ALL columns, including TextAndString |
| test_columns_to_transcode_hybrid_requested_text_and_string_still_skipped | Explicit requested_columns can't force transcoding |
| test_columns_to_transcode_disabled | Disabled mode transcodes nothing |
| test_columns_to_transcode_with_filter | Column filter works on non-TextAndString fields |

Test fixture (make_test_manifest) matches production: TextAndString field has fast_field_tokenizer: None + string_indexing_modes entry. Regular string field has fast_field_tokenizer: Some("raw") with no indexing mode.

3 Rust integration tests (indexing.rs):

  • Single-field schema: validates default tokenizer + raw fast field on same field, no __text companion
  • PhraseQuery false positives: "New York" matches "New York City" and "I love New York!" — documents the ~10% FP rate
  • Punctuation in fast field: URLs and emails stored verbatim in raw fast field while tokenized search finds individual words
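The PhraseQuery false-positive behavior is easy to see in a stand-alone model: a slop=0 phrase match over default-tokenized text is only a candidate filter, and exactness is recovered by a post-filter on the raw value (the role Spark plays downstream). The helpers here (`tokenize`, `phrase_candidate`, `equal_to`) are hypothetical, not the actual query path:

```rust
/// Simplified default tokenization: lowercase, split on non-alphanumeric.
fn tokenize(s: &str) -> Vec<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// PhraseQuery with slop=0: the query's tokens appear as a contiguous run
/// anywhere in the document's token stream.
fn phrase_candidate(doc: &str, query: &str) -> bool {
    let (d, q) = (tokenize(doc), tokenize(query));
    !q.is_empty() && d.windows(q.len()).any(|w| w == q.as_slice())
}

/// EqualTo = cheap index-side phrase candidate + exact raw-value post-filter.
fn equal_to(doc: &str, query: &str) -> bool {
    phrase_candidate(doc, query) && doc == query
}

fn main() {
    // Phrase candidates: "New York" also matches longer strings...
    assert!(phrase_candidate("New York City", "New York"));
    assert!(phrase_candidate("I love New York!", "New York"));
    // ...but the post-filter keeps only exact equality.
    assert!(!equal_to("New York City", "New York"));
    assert!(equal_to("New York", "New York"));
}
```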

Open items (out of scope for this PR)

  1. Rust-side fast field post-filter: Would eliminate ~10% PhraseQuery false positives in tantivy before Spark sees them. Rejected because the companion streaming path (nativeStartStreamingRetrieval) bypasses searchWithSplitQuery — the filter would need to exist in two separate code paths. Spark's candidate post-filter already guarantees correctness. Revisit as performance optimization.

  2. Non-companion text_and_string: The Java SchemaBuilder.addTextField(fast=true) uses one tokenizer for both inverted index and fast field. Separate tokenizers (default for search, raw for fast) are only possible through the companion schema_derivation.rs path. Not a limitation in practice since text_and_string is companion-only.

Dependencies

🤖 Generated with Claude Code

tlee732 and others added 5 commits April 2, 2026 12:14
Creates two tantivy fields from one parquet string column:
- <name> with raw tokenizer (exact match, aggregation, sorting)
- <name>__text with default tokenizer (full-text search)

Includes collision detection, hash field rewriter skip, and 7 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ases

- Fix full_text/phrase queries on TextAndString fields silently hitting wrong
  field by adding explicit routing to __text companion in hash_field_rewriter
- Cache text_companion_field lookup outside per-document loop to avoid 100M+
  string allocations and HashMap lookups on large parquet files
- Add serde wire format test pinning {"mode":"text_and_string"} JSON format
- Normalize text_and_string/exact_only to "raw" in build_column_mapping to
  prevent storing invalid tokenizer names in fast_field_tokenizer
- Add design comment explaining why TextAndString omits set_stored/set_fast
- Add edge case integration test covering empty strings and multiple
  text_and_string columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single field uses default-tokenized inverted index for full-text search
and PhraseQuery equality, plus raw fast field for aggregations and sorting.
Eliminates the __text companion field, halving index size per text_and_string
column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TextAndString fields have native fast data (from set_fast(Some("raw")))
but were also transcoded from parquet in Hybrid mode. The merge of native
+ transcoded data doubled fast field ordinals, causing GROUP BY counts
to be 2x.

- Skip parquet transcoding for TextAndString by checking
  manifest.string_indexing_modes (not fast_field_tokenizer, which is
  set on ALL Str columns)
- Set fast_field_tokenizer=None for TextAndString in build_column_mapping
  (it has native fast data, no transcoding needed)
- Classify TextAndString as native in ensure_fast_fields_for_query
- Add debug logging for transcode skip decisions
- Add error logging in jni_prewarm.rs for serialization failures
- 3 new regression tests + updated fixture to match production manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

tlee732 commented Apr 7, 2026

Closing in favor of a clean branch rebased on latest main (no stacked PR dependencies). Reopening as new PR from feature/text-and-string-clean.

@tlee732 tlee732 closed this Apr 7, 2026
Contributor Author

tlee732 commented Apr 7, 2026

Replaced by #157 (same code, clean branch rebased on latest main).
