Add single-field text_and_string indexing with native fast field support #156

Closed

tlee732 wants to merge 5 commits into indextables:main from tlee732:feature/text-and-string-single-field

Add single-field text_and_string indexing with native fast field support#156
tlee732 wants to merge 5 commits into
indextables:mainfrom
tlee732:feature/text-and-string-single-field

Conversation

Contributor

@tlee732 tlee732 commented Apr 6, 2026

Summary

Adds a single-field text_and_string indexing mode for companion splits. One tantivy field serves both full-text search and aggregations, replacing the dual-field __text companion approach from the original PR for lower storage cost and simpler query routing.

Architecture

Each text_and_string column creates one tantivy field with two independent behaviors:

| Capability | Storage | Tokenizer | How it works |
| --- | --- | --- | --- |
| Full-text search | Inverted index | default (lowercase + split on non-alphanumeric) | Standard tantivy term/phrase queries |
| EqualTo / IN filters | Inverted index | default | PhraseQuery(slop=0) as candidate; Spark post-filters for exactness |
| GROUP BY / aggregation | Fast field (columnar) | raw (stores original string) | Tantivy terms aggregation on raw values |
| Sorting | Fast field (columnar) | raw | Tantivy sort on raw values |

Write path

```
schema_derivation.rs: TextAndString →
    TextOptions::default()
        .set_indexing_options(default tokenizer, WithFreqsAndPositions)
        .set_fast(Some("raw"))

indexing.rs: doc.add_text(field, val)
    → tantivy writes to inverted index (tokenized) AND fast field (raw) in one call
```
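To make the "one call, two representations" behavior concrete, here is a minimal stand-alone model of what a single write produces: default-tokenized terms for the inverted index and the untouched original string for the raw fast field. The helper names are hypothetical and the tokenizer is a simplified approximation of tantivy's default (lowercase + split on non-alphanumeric), not the real indexing.rs code path.

```rust
/// Approximation of tantivy's "default" tokenizer: split on
/// non-alphanumeric characters, drop empties, lowercase each token.
fn default_tokenize(input: &str) -> Vec<String> {
    input
        .split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// One write, two independent representations: tokenized terms for the
/// inverted index, and the verbatim value for the raw fast field.
fn index_text_and_string(value: &str) -> (Vec<String>, String) {
    (default_tokenize(value), value.to_string())
}

fn main() {
    let (terms, raw) = index_text_and_string("Visit https://Example.com NOW!");
    // The inverted index sees lowercased word tokens...
    assert_eq!(terms, vec!["visit", "https", "example", "com", "now"]);
    // ...while the fast field keeps the original string verbatim.
    assert_eq!(raw, "Visit https://Example.com NOW!");
}
```

This is also why the punctuation integration test below holds: URLs and emails survive verbatim in the fast field while search still finds the individual words.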

Read path — fast field transcoding

The companion read path normally transcodes string fast fields from parquet at query time (Hybrid mode). TextAndString fields are excluded from transcoding because they already have native fast data from set_fast(Some("raw")).

The exclusion uses manifest.string_indexing_modes (checking for TextAndString) rather than fast_field_tokenizer.is_some() because build_column_mapping sets fast_field_tokenizer on ALL Str columns — only string_indexing_modes correctly distinguishes TextAndString from regular string fields.

Without this exclusion, merge_two_columnars() combines native + transcoded data, producing duplicate ordinals that double GROUP BY counts (the bug this PR fixes).
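The exclusion rule can be sketched as a small self-contained model. The names (`columns_to_transcode`, `FastFieldMode`, `StringIndexingMode`, `ColumnMapping`) mirror this PR's description, but the types here are simplified stand-ins rather than the real transcode.rs:

```rust
use std::collections::HashMap;

enum FastFieldMode { Hybrid, ParquetOnly, Disabled }

#[derive(PartialEq)]
enum StringIndexingMode { TextAndString }

struct ColumnMapping {
    tantivy_field_name: String,
    // Set on ALL Str columns in production, so it cannot be the discriminator.
    #[allow(dead_code)]
    fast_field_tokenizer: Option<String>,
}

fn columns_to_transcode<'a>(
    mode: &FastFieldMode,
    columns: &'a [ColumnMapping],
    string_indexing_modes: &HashMap<String, StringIndexingMode>,
) -> Vec<&'a str> {
    columns
        .iter()
        .filter(|c| match mode {
            FastFieldMode::Disabled => false,
            // ParquetOnly ignores native .fast data, so everything is transcoded.
            FastFieldMode::ParquetOnly => true,
            // Hybrid: skip columns whose indexing mode says they already carry
            // native fast data; fast_field_tokenizer is deliberately not consulted.
            FastFieldMode::Hybrid => {
                string_indexing_modes.get(&c.tantivy_field_name)
                    != Some(&StringIndexingMode::TextAndString)
            }
        })
        .map(|c| c.tantivy_field_name.as_str())
        .collect()
}

fn main() {
    let cols = vec![
        ColumnMapping { tantivy_field_name: "message".into(), fast_field_tokenizer: None },
        ColumnMapping { tantivy_field_name: "host".into(), fast_field_tokenizer: Some("raw".into()) },
    ];
    let mut modes = HashMap::new();
    modes.insert("message".to_string(), StringIndexingMode::TextAndString);

    // Hybrid skips the TextAndString column; ParquetOnly transcodes both.
    assert_eq!(columns_to_transcode(&FastFieldMode::Hybrid, &cols, &modes), vec!["host"]);
    assert_eq!(columns_to_transcode(&FastFieldMode::ParquetOnly, &cols, &modes), vec!["message", "host"]);
    assert!(columns_to_transcode(&FastFieldMode::Disabled, &cols, &modes).is_empty());
}
```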

Manifest representation

```
ColumnMapping {
    tantivy_field_name: "message",
    tantivy_type: "Str",
    fast_field_tokenizer: None,  // None = has native fast data, no transcoding
}

string_indexing_modes: { "message": TextAndString }
```

Regular string fields have fast_field_tokenizer: Some("raw") (needs transcoding).
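Condensed to its core, the build_column_mapping fix is one decision: derive fast_field_tokenizer from the column's indexing mode. A hedged sketch (the function name is hypothetical; ExactOnly is included because the commit history mentions exact_only also normalizing to "raw"):

```rust
enum StringIndexingMode { TextAndString, ExactOnly }

/// Manifest value for fast_field_tokenizer, derived from the indexing mode.
fn fast_field_tokenizer_for(mode: Option<&StringIndexingMode>) -> Option<&'static str> {
    match mode {
        // Native fast data comes from set_fast(Some("raw")) at schema time;
        // recording None keeps the read path from transcoding it a second time.
        Some(StringIndexingMode::TextAndString) => None,
        // Regular and exact_only string columns still need parquet transcoding.
        _ => Some("raw"),
    }
}

fn main() {
    assert_eq!(fast_field_tokenizer_for(Some(&StringIndexingMode::TextAndString)), None);
    assert_eq!(fast_field_tokenizer_for(Some(&StringIndexingMode::ExactOnly)), Some("raw"));
    assert_eq!(fast_field_tokenizer_for(None), Some("raw"));
}
```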

Design decisions

  1. string_indexing_modes as discriminator, not fast_field_tokenizer: All Str columns have fast_field_tokenizer set in production (build_column_mapping defaults to Some("raw")). Using fast_field_tokenizer.is_some() would skip transcoding for ALL Str fields, breaking GROUP BY on regular string columns. string_indexing_modes is the authoritative source.

  2. fast_field_tokenizer: None for TextAndString: Fixed build_column_mapping to set None instead of Some("raw"). The field has native fast data and doesn't need transcoding — None accurately represents this. Previously the misleading Some("raw") suggested it needed transcoding.

  3. Hybrid-only skip: The transcode skip only applies in FastFieldMode::Hybrid. In ParquetOnly mode, native .fast data is ignored entirely, so TextAndString must be transcoded to have any fast data at all.

  4. Backward compatibility: string_indexing_modes is #[serde(default)] so old manifests deserialize with an empty map. TextAndString and string_indexing_modes were introduced together — no old manifest can have TextAndString native fast data without the corresponding entry.

Testing

6 Rust unit tests (transcode.rs):

| Test | What it validates |
| --- | --- |
| test_columns_to_transcode_hybrid | TextAndString skipped, regular string transcoded |
| test_columns_to_transcode_hybrid_distinguishes_text_and_string_from_regular | Regression: both Str fields have the same fast_field_tokenizer, but only TextAndString is skipped |
| test_columns_to_transcode_parquet_only | ParquetOnly transcodes ALL columns, including TextAndString |
| test_columns_to_transcode_hybrid_requested_text_and_string_still_skipped | Explicit requested_columns can't force transcoding |
| test_columns_to_transcode_disabled | Disabled mode transcodes nothing |
| test_columns_to_transcode_with_filter | Column filter works on non-TextAndString fields |

Test fixture (make_test_manifest) matches production: TextAndString field has fast_field_tokenizer: None + string_indexing_modes entry. Regular string field has fast_field_tokenizer: Some("raw") with no indexing mode.

3 Rust integration tests (indexing.rs):

  • Single-field schema: validates default tokenizer + raw fast field on same field, no __text companion
  • PhraseQuery false positives: "New York" matches "New York City" and "I love New York!" — documents the ~10% FP rate
  • Punctuation in fast field: URLs and emails stored verbatim in raw fast field while tokenized search finds individual words
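The PhraseQuery false-positive behavior is easy to see in a stand-alone model: a slop=0 phrase match over default-tokenized text is only a candidate filter, and exactness is recovered by a post-filter on the raw value (the role Spark plays downstream). The helpers here (`tokenize`, `phrase_candidate`, `equal_to`) are hypothetical, not the actual query path:

```rust
/// Simplified default tokenization: lowercase, split on non-alphanumeric.
fn tokenize(s: &str) -> Vec<String> {
    s.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// PhraseQuery with slop=0: the query's tokens appear as a contiguous run
/// anywhere in the document's token stream.
fn phrase_candidate(doc: &str, query: &str) -> bool {
    let (d, q) = (tokenize(doc), tokenize(query));
    !q.is_empty() && d.windows(q.len()).any(|w| w == q.as_slice())
}

/// EqualTo = cheap index-side phrase candidate + exact raw-value post-filter.
fn equal_to(doc: &str, query: &str) -> bool {
    phrase_candidate(doc, query) && doc == query
}

fn main() {
    // Phrase candidates: "New York" also matches longer strings...
    assert!(phrase_candidate("New York City", "New York"));
    assert!(phrase_candidate("I love New York!", "New York"));
    // ...but the post-filter keeps only exact equality.
    assert!(!equal_to("New York City", "New York"));
    assert!(equal_to("New York", "New York"));
}
```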

Open items (out of scope for this PR)

  1. Rust-side fast field post-filter: Would eliminate ~10% PhraseQuery false positives in tantivy before Spark sees them. Rejected because the companion streaming path (nativeStartStreamingRetrieval) bypasses searchWithSplitQuery — the filter would need to exist in two separate code paths. Spark's candidate post-filter already guarantees correctness. Revisit as performance optimization.

  2. Non-companion text_and_string: The Java SchemaBuilder.addTextField(fast=true) uses one tokenizer for both inverted index and fast field. Separate tokenizers (default for search, raw for fast) are only possible through the companion schema_derivation.rs path. Not a limitation in practice since text_and_string is companion-only.

Dependencies

🤖 Generated with Claude Code

tlee732 and others added 5 commits April 2, 2026 12:14
Creates two tantivy fields from one parquet string column:
- <name> with raw tokenizer (exact match, aggregation, sorting)
- <name>__text with default tokenizer (full-text search)

Includes collision detection, hash field rewriter skip, and 7 tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ases

- Fix full_text/phrase queries on TextAndString fields silently hitting wrong
  field by adding explicit routing to __text companion in hash_field_rewriter
- Cache text_companion_field lookup outside per-document loop to avoid 100M+
  string allocations and HashMap lookups on large parquet files
- Add serde wire format test pinning {"mode":"text_and_string"} JSON format
- Normalize text_and_string/exact_only to "raw" in build_column_mapping to
  prevent storing invalid tokenizer names in fast_field_tokenizer
- Add design comment explaining why TextAndString omits set_stored/set_fast
- Add edge case integration test covering empty strings and multiple
  text_and_string columns

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Single field uses default-tokenized inverted index for full-text search
and PhraseQuery equality, plus raw fast field for aggregations and sorting.
Eliminates the __text companion field, halving index size per text_and_string
column.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TextAndString fields have native fast data (from set_fast(Some("raw")))
but were also transcoded from parquet in Hybrid mode. The merge of native
+ transcoded data doubled fast field ordinals, causing GROUP BY counts
to be 2x.

- Skip parquet transcoding for TextAndString by checking
  manifest.string_indexing_modes (not fast_field_tokenizer, which is
  set on ALL Str columns)
- Set fast_field_tokenizer=None for TextAndString in build_column_mapping
  (it has native fast data, no transcoding needed)
- Classify TextAndString as native in ensure_fast_fields_for_query
- Add debug logging for transcode skip decisions
- Add error logging in jni_prewarm.rs for serialization failures
- 3 new regression tests + updated fixture to match production manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

tlee732 commented Apr 7, 2026

Closing in favor of a clean branch rebased on latest main (no stacked PR dependencies). Reopening as new PR from feature/text-and-string-clean.

@tlee732 tlee732 closed this Apr 7, 2026
Contributor Author

tlee732 commented Apr 7, 2026

Replaced by #157 (same code, clean branch rebased on latest main).
