Skip to content

Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery#155

Merged
schenksj merged 2 commits into
indextables:mainfrom
tlee732:feature/eliminate-store-files
Apr 11, 2026
Merged

Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery#155
schenksj merged 2 commits into
indextables:mainfrom
tlee732:feature/eliminate-store-files

Conversation

@tlee732
Copy link
Copy Markdown
Contributor

@tlee732 tlee732 commented Apr 4, 2026

Summary

  • JSON fields in schema_derivation.rs unconditionally called .set_stored(), ignoring the store_fields config flag — this caused companion splits (which set store_fields: false) to redundantly write JSON data into .store files
  • Replaced the inline JsonSubfieldMap accumulation approach (which threaded a mutable HashMap through every row of the indexing hot path) with post-commit fast-field discovery
  • Introduced _doc_mapping.json persistence in split bundles (following the _parquet_manifest.json precedent) so JSON sub-field metadata survives split creation → merge cycles
  • At merge time, reads _doc_mapping.json from each input split, unions sub-field maps with type widening, and falls back gracefully for legacy splits

Architecture

Three-tier doc_mapping recovery at merge time:

  1. Primary: Read _doc_mapping.json from each input split, parse sub-fields, union with type widening
  2. Fallback (legacy companion): Discover from fast-field columns on the merged index
  3. Fallback (legacy non-companion): Sample stored documents from .store

When merging a mix of modern and legacy splits, the primary path supplements with fast-field discovery on the merged index to avoid losing sub-fields from legacy splits.

Creation-time persistence (both modes):

  • Companion (indexing.rs): Post-commit discover_json_subfields_from_fast_fields()extract_doc_mapping_with_json_subfields() → write _doc_mapping.json to index dir → filesystem scan bundles it automatically
  • FFI/standard (arrow_ffi_import.rs): extract_doc_mapping_from_index() (samples .store) → write _doc_mapping.json → bundled into split

Split extraction (temp_management.rs): _doc_mapping.json is explicitly copied from the bundle during extraction, matching the _parquet_manifest.json pattern.

Design Decisions

  1. Post-commit discovery over inline accumulation: The original approach threaded &mut JsonSubfieldMap through the full row-processing call chain, doing per-row HashMap work (~9M wasted allocations on a 10M-row file). Post-commit fast-field discovery runs once with zero hot-path overhead.

  2. _doc_mapping.json persistence pattern: Follows the established _parquet_manifest.json convention — write to index dir before the filesystem scan in create_quickwit_split, which picks it up automatically.

  3. Raw doc_mapping persisted, not promoted: Promotion (fast: true on all fields) is a client-facing concern. extract_json_subfields_from_doc_mapping only uses name + type from sub-fields, so both versions produce identical sub-field maps at merge time.

  4. stored parameter on subfield_type_to_mapping: Previously hardcoded stored: false. Now parameterized so the metadata is truthful — prevents the metadata from masking schema bugs.

  5. Mixed legacy/modern merge safety: When only some input splits have _doc_mapping.json (rolling upgrades), we supplement with fast-field discovery on the merged index to catch sub-fields from legacy splits.

  6. Module organization: Discovery functions stay in quickwit_split/json_discovery.rs (shared infrastructure). Moving to parquet_companion would create a circular dependency since merge_impl.rs needs them. The stored param removes the companion-specific assumption.

Review Items Addressed

# Finding Resolution
1 Inline accumulation unnecessary Removed; post-commit fast-field discovery
2 throwaway_subfields wastes allocs Removed
3 ArrowFfiSplitContext::json_subfields unbounded Field deleted
4 unwrap_or_default() swallows errors Proper error propagation + logging
5 Merge should use input splits' doc_mapping Reads _doc_mapping.json, unions with widening
6 Companion-specific logic in shared code stored param on subfield_type_to_mapping
7 store_fields structural enforcement Deliberate non-fix (runtime flag sufficient)
8 Fast-field discovery misses indexed-but-not-fast IpAddr fix; documented for legacy fallback
9 extract_json_subfields_from_doc_mapping dead code Now core of merge path
10 Sample limit counts rows not discoveries Eliminated
11 1u8 as char magic number JSON_PATH_SEP constant
12 format! in double loop Pre-computed prefixes
13 Factually wrong comment Fixed
14 store_stub.rs already dead Already deleted
15 Duplicate test coverage Consolidated
16-20 Testing gaps All addressed (see below)

Test plan

  • cargo test json_discovery — 17 unit tests (persistence, discovery, widening, I/O)
  • cargo test schema_derivation — 40 tests (exhaustive store_fields for all 20 Arrow types)
  • cargo test indexing::tests — 53 tests (E2E merge, no-stored-fields verification)
  • cargo test — full suite passes (110 PR-related, 596 total)

🤖 Generated with Claude Code

Copy link
Copy Markdown
Collaborator

@schenksj schenksj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #155: Eliminate Redundant .store for Companion JSON Fields

Requesting changes. This PR has several correctness, safety, design, and unnecessary-complexity issues that need to be addressed before merge.


Blocking / High Priority

1. The entire JsonSubfieldMap inline accumulation approach is unnecessary — and the replacement is also incomplete

The PR introduces JsonSubfieldMap accumulated inline during indexing to discover JSON sub-fields without reading from .store. This approach is unnecessary overhead: after create_split_from_parquet() commits the index, the sub-field information is already available directly from the built index via two complementary APIs:

  • Fast fields: ColumnarReader::list_columns() enumerates all column names (including fieldname\x01subpath entries for JSON sub-fields) with their ColumnType, giving both name and type without any document sampling. This is already used in transcode.rs.
  • Indexed (non-fast) fields: The JSON field's term dictionary encodes terms as subpath\x01value. Iterating the term dictionary and extracting unique sub-path prefixes recovers all indexed-but-not-fast sub-paths. SegmentReader::inverted_index(field) is already used in prewarm/field_specific.rs for exactly this kind of field enumeration.

The PR's own discover_json_subfields_from_fast_fields uses only the fast-field approach and therefore misses indexed-but-not-fast JSON sub-fields. A JSON sub-field can be indexed (searchable, present in the term dictionary) but not fast (no columnar data). If it's omitted from doc_mapping_json, aggregations on that sub-field silently fail after merge. The PR introduces this gap in the merge path.

The correct fix is a single post-commit discovery function that:

  1. Calls ColumnarReader::list_columns() per segment to enumerate fast-field sub-paths and their ColumnType
  2. Iterates the JSON field's term dictionary per segment to enumerate indexed-only sub-paths (extracting the sub-path prefix before \x01 in each term)
  3. Merges results across all segments with type widening

This function is called once after create_split_from_parquet() completes — replacing both the inline accumulation in the build path and discover_json_subfields_from_fast_fields in the merge path. No JsonSubfieldMap threading through the entire document-building call chain is needed.

As a result, all of the following introduced in this PR become unnecessary and should be removed:

  • JsonSubfieldMap type alias
  • JSON_SUBFIELD_SAMPLE_LIMIT constant and sampling counter
  • json_subfields: &mut JsonSubfieldMap parameter on build_doc_from_arrow_row, arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, add_json_string_value
  • throwaway_subfields and all surrounding sampling logic
  • extract_doc_mapping_with_json_subfields wrapper
  • ArrowFfiSplitContext::json_subfields field

2. throwaway_subfields still allocates per row past the sample limit

Even within the PR's own framing, clear() + or_default() inside discover_fields_from_object allocates into the map on every row past the limit. On a 10M-row file this is ~9M wasted allocations. The intended fix (per finding #1 above) is to remove the accumulation entirely. If the approach is kept, the signature must use Option<&mut JsonSubfieldMap> and skip all work when None.

3. ArrowFfiSplitContext::json_subfields accumulates silently discarded data and grows unboundedly

/// Kept as a field to avoid changing shared function signatures that require &mut HashMap.
json_subfields: JsonSubfieldMap,

The PR's own comment acknowledges the problem but chooses not to fix it. The field grows across all partitions in a multi-partition FFI session without ever resetting, and all work populating it is discarded. add_arrow_batch feeds &mut ctx.json_subfields on two separate codepaths (non-partition and partition paths), both throwaway. The fix per finding #1 is to remove the accumulation approach entirely.

4. unwrap_or_default() in the merge path silently produces broken splits

let json_subfields = discover_json_subfields_from_fast_fields(&merged_index)
    .unwrap_or_default();

If fast-field discovery fails (e.g., corrupt fast field), the error is swallowed, the sub-field map is empty, and JSON fields are dropped from the merged split's doc_mapping_json entirely (the if discovered_subfields.is_empty() { continue; } guard). Aggregation queries on those fields then silently fail. At minimum log a warning; ideally propagate.

5. The merge path is wrong — input splits already have doc_mapping_json

The merge path re-discovers JSON sub-fields by scanning fast-field columns of the merged index. But the input splits already have a doc_mapping_json containing exactly this information. The correct approach:

  1. Read doc_mapping_json from each input split
  2. Union the sub-field maps with type widening where splits disagree
  3. Use the merged mapping for the output split

Beyond the design problem, calling discover_json_subfields_from_fast_fields on the merged index risks type divergence: if split A has a sub-field as U64 and split B has it as I64, the merged columnar result may differ from what either input split declared. Reading doc_mapping_json from input splits and widening explicitly is correct and predictable.

Note: extract_json_subfields_from_doc_mapping — added by this PR with tests but with no callsite — appears to be the intended building block for this correct approach. The PR adds the right tool and then doesn't use it, choosing the wrong approach instead.

6. Companion-specific logic does not belong in quickwit_split

subfield_type_to_mapping, JsonSubfieldMap, discover_fields_from_object, and extract_doc_mapping_with_json_subfields encode companion-specific assumptions (stored: false because parquet is the authoritative store). They have no business living in quickwit_split, which is shared infrastructure. The correct fix is to move this family of functions into parquet_companion. If quickwit_split needs to invoke sub-field mapping at merge time, it should do so through a trait or callback owned by parquet_companion.

As a secondary consequence, subfield_type_to_mapping is pub and hardcodes "stored": false for every type. For non-companion splits (store_fields: true), JSON field data is physically written to .store. If a future caller in quickwit_split uses this function for those splits, Quickwit's DocMapper will be told sub-fields are stored: false — making on-disk data permanently inaccessible with no error. Moving to parquet_companion removes it from the shared namespace and eliminates this footgun.

7. store_fields option on JSON fields should not exist for companion splits

The store_fields flag gates .stored() on JSON fields in schema derivation. But companion splits will never use .store for JSON — parquet is the authoritative store by design. This contract should be enforced structurally (separate code paths or a type-level distinction) rather than as a runtime flag.

8. discover_json_subfields_from_fast_fields misses indexed-but-not-fast sub-fields

Per finding #1, this function only scans fast-field columns and ignores JSON sub-fields that are indexed but not fast. This produces an incomplete doc_mapping_json for any split with non-fast indexed JSON sub-fields, silently omitting those fields from aggregation. Additionally, ColumnType::IpAddr columns are mapped to "text" rather than "ip", breaking range queries on IP sub-fields after merge.


Medium Priority

9. extract_json_subfields_from_doc_mapping ships with tests but has no callsite

The function is added and tested but nothing in the production code calls it. It appears to be the intended building block for the correct merge approach (reading doc_mapping_json from input splits — per finding #5), but the PR implements a different strategy in merge_impl.rs and leaves this function orphaned. Either use it correctly in the merge path or remove it.

10. Sample limit counts rows globally, not per-field discoveries

json_subfield_rows_sampled is a global counter across all parquet files and all JSON fields. Once 1000 rows have been processed, discovery stops for ALL fields — even fields whose JSON values are null in the first 1000 rows. For sparse JSON columns, this means zero sub-fields discovered. The counter should be per-field, or count rows that actually contributed new sub-fields, or use a "no new fields in last N rows" window.

11. 1u8 as char is a magic number for the tantivy separator

let sep = 1u8 as char; // JSON_PATH_SEGMENT_SEP

This appears in discover_json_subfields_from_fast_fields. Import common::json_path_writer::JSON_PATH_SEGMENT_SEP directly — it is pub. A silent change to this constant in the forked tantivy breaks all JSON path reconstruction without a compile error.

12. format!("{}{}", json_field, sep) is inside a double loop

In discover_json_subfields_from_fast_fields, prefix is recomputed for every (col_name, json_field) pair across every segment. Pre-compute once before the column loop:

let prefixes: Vec<(String, &str)> = json_field_names.iter()
    .map(|f| (format!("{}{}", f, sep), f.as_str()))
    .collect();

13. Factually wrong comment in indexing.rs

A NOTE comment added by the PR states that .store files must remain on disk for hotcache generation and that create_quickwit_split excludes them from the bundle in companion mode. Both claims are false after the PR's own changes: companion splits produce no .store files at all (no fields are stored), and split_creation.rs has no logic to exclude .store files in companion mode. The file inclusion filter only skips .tantivy* lock files.

14. store_stub.rs was already dead code before this PR

The PR deletes store_stub.rs as part of .store elimination, implying it was previously load-bearing. In fact generate_store_stubs had zero callsites before this PR — it was already dead. The deletion is correct but the motivation is misleading.

15. Duplicate test coverage for store_fields on UTF8 JSON

After the PR there are three tests in schema_derivation.rs covering the same UTF8 JSON + store_fields combinations: the original test_json_fields_stored_and_indexed (now with store_fields=true), test_json_fields_not_stored_in_companion_mode, and the new parametric test_json_field_stored_respects_store_fields_flag. The first two are redundant with the parametric test and should be removed.


Testing Gaps

16. No end-to-end test for the merge path

No test: (a) builds two companion splits with JSON fields, (b) merges them, (c) asserts the merged split's doc_mapping_json correctly reflects all JSON sub-fields including non-fast indexed ones. This is the most critical gap given the merge path is entirely new.

17. No direct unit test for discover_json_subfields_from_fast_fields

The most important new production function in the PR has no unit test. It is only reachable via the merge path integration.

18. should_widen_type edge cases not directly tested

test_discover_fields_from_object_type_widening validates should_widen_type indirectly. There is no direct test covering: ("text", "text") → true (no-op update), ("i64", "bool") → false, ("date", "bool") → false, (f64, i64) → false.

19. No cross-segment type-widening test in discover_json_subfields_from_fast_fields

No test where segment 0 has a sub-field as i64 and segment 1 has it as f64, verifying the merged result is f64.

20. extract_json_subfields_from_doc_mapping not tested with dotted paths

The roundtrip test uses only flat sub-fields. A nested sub-field addr.city should be covered to verify the dotted-path roundtrip through JSON serialization.


Nits

  • expand_dots=false with dotted key names: If a sub-field key contains a literal dot and expand_dots=false, .replace('\x01', '.') produces an ambiguous path. Add a comment.
  • u64 assertion fragility: The assertion assert_eq!(score_field["type"], "u64") in test_json_subfield_discovery_in_companion_split depends on id_offset=0 producing non-negative scores. Document this dependency or use matches!(field_type, "u64" | "i64").
  • test_json_subfield_discovery_in_companion_split tests the approach being removed: Once the inline accumulation is replaced by post-commit discovery, this test name and structure will need updating to reflect what is actually being tested.

Copy link
Copy Markdown
Collaborator

@schenksj schenksj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for finding and addressing this very real bug!!! Please see the comments inline in the conversation.

When approaching a problem like this, it can be very helpful to use claude's planning mode to find the best approach and ask the planner to fix or re-research anything that seems incomplete.

@tlee732
Copy link
Copy Markdown
Contributor Author

tlee732 commented Apr 5, 2026

Proposal: Persist doc_mapping.json in split bundles

After reviewing your feedback, I'd like to propose an approach that addresses #1, #2, #3, #4, #7, and #10 in a single architectural change.

Core idea

Write doc_mapping.json into the split's index_dir at creation time (same pattern as the parquet manifest — the filesystem scan in create_quickwit_split picks it up automatically). At merge time, read it from each extracted input split and union with type widening.

Creation time — two strategies by mode

Mode Discovery method Why
Companion After writer.commit(), call discover_json_subfields_from_fast_fields(&index) No .store exists; all fields are fast in companion mode — this is authoritative and complete, no sampling limit
Non-companion extract_doc_mapping_from_index(&index) (existing .store sampling) .store exists; not all fields are necessarily fast, so fast-field enumeration would miss indexed-but-not-fast sub-fields

Both modes write the result as doc_mapping.json in the index directory before the split is bundled.

Merge time — both modes, same path

  1. Read doc_mapping.json from each extracted input split
  2. Parse via extract_json_subfields_from_doc_mapping() (currently flagged as dead code — becomes the core of the merge path)
  3. Union all maps with should_widen_type() for type evolution across splits
  4. Fallback: if doc_mapping.json is missing (legacy splits), use the current mode-appropriate discovery with proper error logging

What this eliminates

Current code Status after this change
Inline discover_fields_from_object() during companion indexing (first 1000 rows + throwaway after) Removed entirely — post-commit fast-field discovery replaces it
json_subfields field on ArrowFfiSplitContext (grows unboundedly, never read) Removed entirely — FFI is non-companion, uses .store sampling at finalization
throwaway_subfields variable + 1000-row sample limit in indexing.rs Removed entirely
discover_json_subfields_from_fast_fields(&merged_index).unwrap_or_default() in merge path Replaced with authoritative file read + union; fast-field discovery becomes fallback only
extract_json_subfields_from_doc_mapping() dead code Now the core of merge — parses persisted doc_mappings from input splits

How it maps to review items

# Feedback How this addresses it
1 throwaway_subfields wastes ~9M allocations Inline discovery removed entirely for companion; no Option signature change needed
2 ArrowFfiSplitContext::json_subfields accumulates and discards Field removed; FFI path passes nothing since .store sampling handles it at finalization
3 unwrap_or_default() swallows errors Replaced with file read as primary path; fallback has proper error logging
4 Merge path should use input splits' doc_mapping Reads authoritative doc_mapping.json from each input split, unions with type widening
7 Sampling counts rows not discoveries Companion mode: eliminated (fast-field enumeration is complete). Non-companion: unchanged (.store sampling is the correct strategy there)
10 extract_json_subfields_from_doc_mapping dead code Becomes the core parser for the merge path

Performance impact

Dimension Effect
CPU at creation (companion) O(segments × columns) replaces O(N rows)
CPU at creation (FFI) Removes unbounded wasted HashMap work
CPU at merge Neutral — file read replaces column enumeration
Memory Removes unbounded HashMap growth in companion and FFI paths
Disk doc_mapping.json is a few KB per split (splits are 50-500MB)

Precedent

The parquet manifest already uses this exact pattern: write to index_dir, filesystem scan picks it up, bundled into .split, read back at merge time from extracted temp dirs.


Thoughts?

tlee732 and others added 2 commits April 6, 2026 21:15
JSON fields in schema_derivation.rs unconditionally called .set_stored(),
ignoring the store_fields config flag. This caused companion splits (which
set store_fields: false) to redundantly write JSON data into .store files.

- Make JSON fields conditionally stored based on config.store_fields
- Replace .store-based JSON subfield sampling with inline accumulation
  during indexing and fast-field discovery during merge
- Extract type-widening logic into reusable should_widen_type() helper
- Apply type widening to fast-field discovery so doc_mapping types are
  correct when JSON sub-field types evolve across segments
- Remove deleted store_stub.rs (no longer needed)
- Add tests for store_fields config, .store round-trip, and type widening

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eld recovery

Replace inline JsonSubfieldMap accumulation (which threaded a mutable
HashMap through every row of the indexing hot path) with post-commit
fast-field discovery and _doc_mapping.json persistence in split bundles.

Creation time:
- Companion: discover_json_subfields_from_fast_fields() post-commit
- FFI/standard: extract_doc_mapping_from_index() samples .store
- Both write _doc_mapping.json to index dir before bundling

Merge time (three-tier fallback):
- Primary: read _doc_mapping.json from input splits, union with type widening
- Legacy companion: discover from fast-field columns on merged index
- Legacy non-companion: sample .store on merged index
- Mixed legacy/modern: supplement file-derived map with fast-field discovery

Additional fixes from review:
- Make JSON fields conditionally stored based on store_fields config flag
- Add stored param to subfield_type_to_mapping (metadata reflects reality)
- Replace 1u8 magic number with JSON_PATH_SEP constant
- Pre-compute JSON field prefixes outside inner loop
- Fix ColumnType::IpAddr mapping from "text" to "ip"
- Add insert_or_widen helper to deduplicate widening logic
- Add _doc_mapping.json to split extraction in temp_management.rs
- Fix should_widen_type same-type early return and doc comment accuracy
- Add logging to all error-handling paths in discovery functions
- Consolidate duplicate store_fields tests

Test coverage (110 tests):
- E2E merge test: create 2 splits, merge, verify sub-field names + types
- No-stored-fields test: extract split, verify no stored fields + _doc_mapping.json present
- Exhaustive store_fields tests: all 20 Arrow column types in both modes
- Fast-field discovery: flat fields, nested dotted paths, cross-segment widening
- Type widening: exhaustive type pairs including ip, same-type, bool->date
- Persistence I/O: write/read roundtrip, missing file returns None
- combine_json_subfield_maps: empty, single, multi-map union

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tlee732 tlee732 force-pushed the feature/eliminate-store-files branch from 62fef8d to af9e111 Compare April 7, 2026 01:18
@tlee732 tlee732 changed the title Eliminate redundant .store for companion JSON fields Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery Apr 7, 2026
@tlee732
Copy link
Copy Markdown
Contributor Author

tlee732 commented Apr 7, 2026

Revision 2: Persist _doc_mapping.json in split bundles

Pushed two commits addressing all 20 review findings:

Commit 1: 933f842 — Original schema fix (rebased onto main)

  • Makes JSON fields conditionally stored based on config.store_fields
  • Removes store_stub.rs (already dead code)

Commit 2: af9e111 — Architectural revision per review feedback

Replaces the inline JsonSubfieldMap accumulation with the _doc_mapping.json persistence approach discussed in the proposal comment above.

What changed architecturally:

  • Removed all json_subfields parameter threading through the hot path (arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, add_json_string_value)
  • Removed ArrowFfiSplitContext::json_subfields field, throwaway_subfields, 1000-row sample limit
  • Added post-commit discover_json_subfields_from_fast_fields() for companion splits
  • Added _doc_mapping.json persistence in split bundles (same pattern as _parquet_manifest.json)
  • Added three-tier merge fallback: file → fast-field discovery → .store sampling
  • Added mixed legacy/modern merge safety (supplements file-derived map with fast-field discovery when not all input splits have _doc_mapping.json)

Additional fixes:

  • subfield_type_to_mapping now takes stored: bool — metadata reflects reality for both modes (prevents schema bugs from being masked)
  • discover_json_subfields also propagates stored from the parent field entry
  • ColumnType::IpAddr correctly maps to "ip" (was "text")
  • should_widen_type early-returns for same-type comparisons
  • JSON_PATH_SEP constant replaces 1u8 as char magic number
  • Pre-computed prefixes in fast-field discovery inner loop
  • insert_or_widen helper deduplicates 3 call sites
  • All error-handling paths in discovery functions now log (were silent continue blocks)
  • promote_doc_mapping_all_fast fallback in merge logs instead of silently degrading

Test coverage (110 tests, 0 failures):

  • E2E merge test: create 2 companion splits → merge → verify sub-field names AND types
  • No-stored-fields test: extract split bundle, verify no stored fields + _doc_mapping.json present
  • Exhaustive store_fields tests: all 20 Arrow column types in both modes (false + true)
  • Fast-field discovery: flat fields, nested/dotted paths, cross-segment type widening
  • should_widen_type: all type pairs including ip, same-type, bool→date
  • Persistence I/O: write/read roundtrip, missing file returns None
  • combine_json_subfield_maps: empty, single, multi-map union with widening
  • insert_or_widen: direct unit test for the widening helper

Deliberate non-fixes (per proposal):

  • cache eviction #7: store_fields structural enforcement — runtime flag set in 3 internal code paths, never by users
  • Revert "cache eviction" #8: Indexed-but-not-fast sub-fields — only affects legacy fallback path; companion has all fields fast; merge reads from _doc_mapping.json

tlee732 added a commit to tlee732/tantivy4java that referenced this pull request Apr 10, 2026
…g code

The single-field text_and_string rewrite eliminated the dual-field companion
design, but the text_companion_fields: &HashMap<String, Field> parameter was
left threaded through arrow_row_to_tantivy_doc, add_arrow_value_to_doc,
add_string_value_to_doc, and build_doc_from_arrow_row. The parameter was
always initialized as an empty HashMap and never read — pure dead code.

Removing it aligns PR indextables#157's function signatures with PR indextables#155's
(feature/eliminate-store-files), so the two PRs no longer collide on this
surface when both land on main. text_and_string routing now happens
exclusively in the tokenizer selection path, not via a companion field map.

No behavior change. Pure subtractive cleanup (15 deletions across 2 files).
cargo check --release compiles cleanly.

OUT OF SCOPE — pre-existing test build failure: indexing.rs has 14
compile errors in test code starting around line 3500 involving
tantivy::collector::TopDocs trait bounds (E0277/E0282). These errors were
present on this branch before this commit (verified by reproducing them with
this cleanup git-stashed) and are unrelated to the text_companion_fields
removal — they appear to be tantivy API drift from a version bump. Fixing
them is out of scope for this commit and should land in a separate PR
focused specifically on test API compatibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@schenksj schenksj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice job!

@schenksj schenksj merged commit becadd4 into indextables:main Apr 11, 2026
1 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants