Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery#155
Conversation
There was a problem hiding this comment.
Code Review — PR #155: Eliminate Redundant .store for Companion JSON Fields
Requesting changes. This PR has several correctness, safety, design, and unnecessary-complexity issues that need to be addressed before merge.
Blocking / High Priority
1. The entire JsonSubfieldMap inline accumulation approach is unnecessary — and the replacement is also incomplete
The PR introduces JsonSubfieldMap accumulated inline during indexing to discover JSON sub-fields without reading from .store. This approach is unnecessary overhead: after create_split_from_parquet() commits the index, the sub-field information is already available directly from the built index via two complementary APIs:
- Fast fields:
ColumnarReader::list_columns()enumerates all column names (includingfieldname\x01subpathentries for JSON sub-fields) with theirColumnType, giving both name and type without any document sampling. This is already used intranscode.rs. - Indexed (non-fast) fields: The JSON field's term dictionary encodes terms as
subpath\x01value. Iterating the term dictionary and extracting unique sub-path prefixes recovers all indexed-but-not-fast sub-paths.SegmentReader::inverted_index(field)is already used inprewarm/field_specific.rsfor exactly this kind of field enumeration.
The PR's own discover_json_subfields_from_fast_fields uses only the fast-field approach and therefore misses indexed-but-not-fast JSON sub-fields. A JSON sub-field can be indexed (searchable, present in the term dictionary) but not fast (no columnar data). If it's omitted from doc_mapping_json, aggregations on that sub-field silently fail after merge. The PR introduces this gap in the merge path.
The correct fix is a single post-commit discovery function that:
- Calls
ColumnarReader::list_columns()per segment to enumerate fast-field sub-paths and theirColumnType - Iterates the JSON field's term dictionary per segment to enumerate indexed-only sub-paths (extracting the sub-path prefix before
\x01in each term) - Merges results across all segments with type widening
This function is called once after create_split_from_parquet() completes — replacing both the inline accumulation in the build path and discover_json_subfields_from_fast_fields in the merge path. No JsonSubfieldMap threading through the entire document-building call chain is needed.
As a result, all of the following introduced in this PR become unnecessary and should be removed:
JsonSubfieldMaptype aliasJSON_SUBFIELD_SAMPLE_LIMITconstant and sampling counterjson_subfields: &mut JsonSubfieldMapparameter onbuild_doc_from_arrow_row,arrow_row_to_tantivy_doc,add_arrow_value_to_doc,add_string_value_to_doc,add_json_string_valuethrowaway_subfieldsand all surrounding sampling logicextract_doc_mapping_with_json_subfieldswrapperArrowFfiSplitContext::json_subfieldsfield
2. throwaway_subfields still allocates per row past the sample limit
Even within the PR's own framing, clear() + or_default() inside discover_fields_from_object allocates into the map on every row past the limit. On a 10M-row file this is ~9M wasted allocations. The intended fix (per finding #1 above) is to remove the accumulation entirely. If the approach is kept, the signature must use Option<&mut JsonSubfieldMap> and skip all work when None.
3. ArrowFfiSplitContext::json_subfields accumulates silently discarded data and grows unboundedly
/// Kept as a field to avoid changing shared function signatures that require &mut HashMap.
json_subfields: JsonSubfieldMap,The PR's own comment acknowledges the problem but chooses not to fix it. The field grows across all partitions in a multi-partition FFI session without ever resetting, and all work populating it is discarded. add_arrow_batch feeds &mut ctx.json_subfields on two separate codepaths (non-partition and partition paths), both throwaway. The fix per finding #1 is to remove the accumulation approach entirely.
4. unwrap_or_default() in the merge path silently produces broken splits
let json_subfields = discover_json_subfields_from_fast_fields(&merged_index)
.unwrap_or_default();If fast-field discovery fails (e.g., corrupt fast field), the error is swallowed, the sub-field map is empty, and JSON fields are dropped from the merged split's doc_mapping_json entirely (the if discovered_subfields.is_empty() { continue; } guard). Aggregation queries on those fields then silently fail. At minimum log a warning; ideally propagate.
5. The merge path is wrong — input splits already have doc_mapping_json
The merge path re-discovers JSON sub-fields by scanning fast-field columns of the merged index. But the input splits already have a doc_mapping_json containing exactly this information. The correct approach:
- Read
doc_mapping_jsonfrom each input split - Union the sub-field maps with type widening where splits disagree
- Use the merged mapping for the output split
Beyond the design problem, calling discover_json_subfields_from_fast_fields on the merged index risks type divergence: if split A has a sub-field as U64 and split B has it as I64, the merged columnar result may differ from what either input split declared. Reading doc_mapping_json from input splits and widening explicitly is correct and predictable.
Note: extract_json_subfields_from_doc_mapping — added by this PR with tests but with no callsite — appears to be the intended building block for this correct approach. The PR adds the right tool and then doesn't use it, choosing the wrong approach instead.
6. Companion-specific logic does not belong in quickwit_split
subfield_type_to_mapping, JsonSubfieldMap, discover_fields_from_object, and extract_doc_mapping_with_json_subfields encode companion-specific assumptions (stored: false because parquet is the authoritative store). They have no business living in quickwit_split, which is shared infrastructure. The correct fix is to move this family of functions into parquet_companion. If quickwit_split needs to invoke sub-field mapping at merge time, it should do so through a trait or callback owned by parquet_companion.
As a secondary consequence, subfield_type_to_mapping is pub and hardcodes "stored": false for every type. For non-companion splits (store_fields: true), JSON field data is physically written to .store. If a future caller in quickwit_split uses this function for those splits, Quickwit's DocMapper will be told sub-fields are stored: false — making on-disk data permanently inaccessible with no error. Moving to parquet_companion removes it from the shared namespace and eliminates this footgun.
7. store_fields option on JSON fields should not exist for companion splits
The store_fields flag gates .stored() on JSON fields in schema derivation. But companion splits will never use .store for JSON — parquet is the authoritative store by design. This contract should be enforced structurally (separate code paths or a type-level distinction) rather than as a runtime flag.
8. discover_json_subfields_from_fast_fields misses indexed-but-not-fast sub-fields
Per finding #1, this function only scans fast-field columns and ignores JSON sub-fields that are indexed but not fast. This produces an incomplete doc_mapping_json for any split with non-fast indexed JSON sub-fields, silently omitting those fields from aggregation. Additionally, ColumnType::IpAddr columns are mapped to "text" rather than "ip", breaking range queries on IP sub-fields after merge.
Medium Priority
9. extract_json_subfields_from_doc_mapping ships with tests but has no callsite
The function is added and tested but nothing in the production code calls it. It appears to be the intended building block for the correct merge approach (reading doc_mapping_json from input splits — per finding #5), but the PR implements a different strategy in merge_impl.rs and leaves this function orphaned. Either use it correctly in the merge path or remove it.
10. Sample limit counts rows globally, not per-field discoveries
json_subfield_rows_sampled is a global counter across all parquet files and all JSON fields. Once 1000 rows have been processed, discovery stops for ALL fields — even fields whose JSON values are null in the first 1000 rows. For sparse JSON columns, this means zero sub-fields discovered. The counter should be per-field, or count rows that actually contributed new sub-fields, or use a "no new fields in last N rows" window.
11. 1u8 as char is a magic number for the tantivy separator
let sep = 1u8 as char; // JSON_PATH_SEGMENT_SEPThis appears in discover_json_subfields_from_fast_fields. Import common::json_path_writer::JSON_PATH_SEGMENT_SEP directly — it is pub. A silent change to this constant in the forked tantivy breaks all JSON path reconstruction without a compile error.
12. format!("{}{}", json_field, sep) is inside a double loop
In discover_json_subfields_from_fast_fields, prefix is recomputed for every (col_name, json_field) pair across every segment. Pre-compute once before the column loop:
let prefixes: Vec<(String, &str)> = json_field_names.iter()
.map(|f| (format!("{}{}", f, sep), f.as_str()))
.collect();13. Factually wrong comment in indexing.rs
A NOTE comment added by the PR states that .store files must remain on disk for hotcache generation and that create_quickwit_split excludes them from the bundle in companion mode. Both claims are false after the PR's own changes: companion splits produce no .store files at all (no fields are stored), and split_creation.rs has no logic to exclude .store files in companion mode. The file inclusion filter only skips .tantivy* lock files.
14. store_stub.rs was already dead code before this PR
The PR deletes store_stub.rs as part of .store elimination, implying it was previously load-bearing. In fact generate_store_stubs had zero callsites before this PR — it was already dead. The deletion is correct but the motivation is misleading.
15. Duplicate test coverage for store_fields on UTF8 JSON
After the PR there are three tests in schema_derivation.rs covering the same UTF8 JSON + store_fields combinations: the original test_json_fields_stored_and_indexed (now with store_fields=true), test_json_fields_not_stored_in_companion_mode, and the new parametric test_json_field_stored_respects_store_fields_flag. The first two are redundant with the parametric test and should be removed.
Testing Gaps
16. No end-to-end test for the merge path
No test: (a) builds two companion splits with JSON fields, (b) merges them, (c) asserts the merged split's doc_mapping_json correctly reflects all JSON sub-fields including non-fast indexed ones. This is the most critical gap given the merge path is entirely new.
17. No direct unit test for discover_json_subfields_from_fast_fields
The most important new production function in the PR has no unit test. It is only reachable via the merge path integration.
18. should_widen_type edge cases not directly tested
test_discover_fields_from_object_type_widening validates should_widen_type indirectly. There is no direct test covering: ("text", "text") → true (no-op update), ("i64", "bool") → false, ("date", "bool") → false, (f64, i64) → false.
19. No cross-segment type-widening test in discover_json_subfields_from_fast_fields
No test where segment 0 has a sub-field as i64 and segment 1 has it as f64, verifying the merged result is f64.
20. extract_json_subfields_from_doc_mapping not tested with dotted paths
The roundtrip test uses only flat sub-fields. A nested sub-field addr.city should be covered to verify the dotted-path roundtrip through JSON serialization.
Nits
expand_dots=falsewith dotted key names: If a sub-field key contains a literal dot andexpand_dots=false,.replace('\x01', '.')produces an ambiguous path. Add a comment.u64assertion fragility: The assertionassert_eq!(score_field["type"], "u64")intest_json_subfield_discovery_in_companion_splitdepends onid_offset=0producing non-negative scores. Document this dependency or usematches!(field_type, "u64" | "i64").test_json_subfield_discovery_in_companion_splittests the approach being removed: Once the inline accumulation is replaced by post-commit discovery, this test name and structure will need updating to reflect what is actually being tested.
schenksj
left a comment
There was a problem hiding this comment.
Thank you for finding and addressing this very real bug!!! Please see the comments inline in the conversation.
When approaching a problem like this, it can be very helpful to use claude's planning mode to find the best approach and ask the planner to fix or re-research anything that seems incomplete.
Proposal: Persist
|
| Mode | Discovery method | Why |
|---|---|---|
| Companion | After writer.commit(), call discover_json_subfields_from_fast_fields(&index) |
No .store exists; all fields are fast in companion mode — this is authoritative and complete, no sampling limit |
| Non-companion | extract_doc_mapping_from_index(&index) (existing .store sampling) |
.store exists; not all fields are necessarily fast, so fast-field enumeration would miss indexed-but-not-fast sub-fields |
Both modes write the result as doc_mapping.json in the index directory before the split is bundled.
Merge time — both modes, same path
- Read
doc_mapping.jsonfrom each extracted input split - Parse via
extract_json_subfields_from_doc_mapping()(currently flagged as dead code — becomes the core of the merge path) - Union all maps with
should_widen_type()for type evolution across splits - Fallback: if
doc_mapping.jsonis missing (legacy splits), use the current mode-appropriate discovery with proper error logging
What this eliminates
| Current code | Status after this change |
|---|---|
Inline discover_fields_from_object() during companion indexing (first 1000 rows + throwaway after) |
Removed entirely — post-commit fast-field discovery replaces it |
json_subfields field on ArrowFfiSplitContext (grows unboundedly, never read) |
Removed entirely — FFI is non-companion, uses .store sampling at finalization |
throwaway_subfields variable + 1000-row sample limit in indexing.rs |
Removed entirely |
discover_json_subfields_from_fast_fields(&merged_index).unwrap_or_default() in merge path |
Replaced with authoritative file read + union; fast-field discovery becomes fallback only |
extract_json_subfields_from_doc_mapping() dead code |
Now the core of merge — parses persisted doc_mappings from input splits |
How it maps to review items
| # | Feedback | How this addresses it |
|---|---|---|
| 1 | throwaway_subfields wastes ~9M allocations |
Inline discovery removed entirely for companion; no Option signature change needed |
| 2 | ArrowFfiSplitContext::json_subfields accumulates and discards |
Field removed; FFI path passes nothing since .store sampling handles it at finalization |
| 3 | unwrap_or_default() swallows errors |
Replaced with file read as primary path; fallback has proper error logging |
| 4 | Merge path should use input splits' doc_mapping | Reads authoritative doc_mapping.json from each input split, unions with type widening |
| 7 | Sampling counts rows not discoveries | Companion mode: eliminated (fast-field enumeration is complete). Non-companion: unchanged (.store sampling is the correct strategy there) |
| 10 | extract_json_subfields_from_doc_mapping dead code |
Becomes the core parser for the merge path |
Performance impact
| Dimension | Effect |
|---|---|
| CPU at creation (companion) | O(segments × columns) replaces O(N rows) |
| CPU at creation (FFI) | Removes unbounded wasted HashMap work |
| CPU at merge | Neutral — file read replaces column enumeration |
| Memory | Removes unbounded HashMap growth in companion and FFI paths |
| Disk | doc_mapping.json is a few KB per split (splits are 50-500MB) |
Precedent
The parquet manifest already uses this exact pattern: write to index_dir, filesystem scan picks it up, bundled into .split, read back at merge time from extracted temp dirs.
Thoughts?
JSON fields in schema_derivation.rs unconditionally called .set_stored(), ignoring the store_fields config flag. This caused companion splits (which set store_fields: false) to redundantly write JSON data into .store files. - Make JSON fields conditionally stored based on config.store_fields - Replace .store-based JSON subfield sampling with inline accumulation during indexing and fast-field discovery during merge - Extract type-widening logic into reusable should_widen_type() helper - Apply type widening to fast-field discovery so doc_mapping types are correct when JSON sub-field types evolve across segments - Remove deleted store_stub.rs (no longer needed) - Add tests for store_fields config, .store round-trip, and type widening Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eld recovery Replace inline JsonSubfieldMap accumulation (which threaded a mutable HashMap through every row of the indexing hot path) with post-commit fast-field discovery and _doc_mapping.json persistence in split bundles. Creation time: - Companion: discover_json_subfields_from_fast_fields() post-commit - FFI/standard: extract_doc_mapping_from_index() samples .store - Both write _doc_mapping.json to index dir before bundling Merge time (three-tier fallback): - Primary: read _doc_mapping.json from input splits, union with type widening - Legacy companion: discover from fast-field columns on merged index - Legacy non-companion: sample .store on merged index - Mixed legacy/modern: supplement file-derived map with fast-field discovery Additional fixes from review: - Make JSON fields conditionally stored based on store_fields config flag - Add stored param to subfield_type_to_mapping (metadata reflects reality) - Replace 1u8 magic number with JSON_PATH_SEP constant - Pre-compute JSON field prefixes outside inner loop - Fix ColumnType::IpAddr mapping from "text" to "ip" - Add insert_or_widen helper to deduplicate widening logic - Add _doc_mapping.json to split extraction in temp_management.rs - Fix should_widen_type same-type early return and doc comment accuracy - Add logging to all error-handling paths in discovery functions - Consolidate duplicate store_fields tests Test coverage (110 tests): - E2E merge test: create 2 splits, merge, verify sub-field names + types - No-stored-fields test: extract split, verify no stored fields + _doc_mapping.json present - Exhaustive store_fields tests: all 20 Arrow column types in both modes - Fast-field discovery: flat fields, nested dotted paths, cross-segment widening - Type widening: exhaustive type pairs including ip, same-type, bool->date - Persistence I/O: write/read roundtrip, missing file returns None - combine_json_subfield_maps: empty, single, multi-map union Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
62fef8d to
af9e111
Compare
Revision 2: Persist
|
…g code The single-field text_and_string rewrite eliminated the dual-field companion design, but the text_companion_fields: &HashMap<String, Field> parameter was left threaded through arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, and build_doc_from_arrow_row. The parameter was always initialized as an empty HashMap and never read — pure dead code. Removing it aligns PR indextables#157's function signatures with PR indextables#155's (feature/eliminate-store-files), so the two PRs no longer collide on this surface when both land on main. text_and_string routing now happens exclusively in the tokenizer selection path, not via a companion field map. No behavior change. Pure subtractive cleanup (15 deletions across 2 files). cargo check --release compiles cleanly. OUT OF SCOPE — pre-existing test build failure: indexing.rs has 14 compile errors in test code starting around line 3500 involving tantivy::collector::TopDocs trait bounds (E0277/E0282). These errors were present on this branch before this commit (verified by reproducing them with this cleanup git-stashed) and are unrelated to the text_companion_fields removal — they appear to be tantivy API drift from a version bump. Fixing them is out of scope for this commit and should land in a separate PR focused specifically on test API compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
schema_derivation.rsunconditionally called.set_stored(), ignoring thestore_fieldsconfig flag — this caused companion splits (which setstore_fields: false) to redundantly write JSON data into.storefilesJsonSubfieldMapaccumulation approach (which threaded a mutable HashMap through every row of the indexing hot path) with post-commit fast-field discovery_doc_mapping.jsonpersistence in split bundles (following the_parquet_manifest.jsonprecedent) so JSON sub-field metadata survives split creation → merge cycles_doc_mapping.jsonfrom each input split, unions sub-field maps with type widening, and falls back gracefully for legacy splitsArchitecture
Three-tier doc_mapping recovery at merge time:
_doc_mapping.jsonfrom each input split, parse sub-fields, union with type widening.storeWhen merging a mix of modern and legacy splits, the primary path supplements with fast-field discovery on the merged index to avoid losing sub-fields from legacy splits.
Creation-time persistence (both modes):
indexing.rs): Post-commitdiscover_json_subfields_from_fast_fields()→extract_doc_mapping_with_json_subfields()→ write_doc_mapping.jsonto index dir → filesystem scan bundles it automaticallyarrow_ffi_import.rs):extract_doc_mapping_from_index()(samples.store) → write_doc_mapping.json→ bundled into splitSplit extraction (
temp_management.rs):_doc_mapping.jsonis explicitly copied from the bundle during extraction, matching the_parquet_manifest.jsonpattern.Design Decisions
Post-commit discovery over inline accumulation: The original approach threaded
&mut JsonSubfieldMapthrough the full row-processing call chain, doing per-row HashMap work (~9M wasted allocations on a 10M-row file). Post-commit fast-field discovery runs once with zero hot-path overhead._doc_mapping.jsonpersistence pattern: Follows the established_parquet_manifest.jsonconvention — write to index dir before the filesystem scan increate_quickwit_split, which picks it up automatically.Raw doc_mapping persisted, not promoted: Promotion (
fast: trueon all fields) is a client-facing concern.extract_json_subfields_from_doc_mappingonly usesname+typefrom sub-fields, so both versions produce identical sub-field maps at merge time.storedparameter onsubfield_type_to_mapping: Previously hardcodedstored: false. Now parameterized so the metadata is truthful — prevents the metadata from masking schema bugs.Mixed legacy/modern merge safety: When only some input splits have
_doc_mapping.json(rolling upgrades), we supplement with fast-field discovery on the merged index to catch sub-fields from legacy splits.Module organization: Discovery functions stay in
quickwit_split/json_discovery.rs(shared infrastructure). Moving toparquet_companionwould create a circular dependency sincemerge_impl.rsneeds them. Thestoredparam removes the companion-specific assumption.Review Items Addressed
throwaway_subfieldswastes allocsArrowFfiSplitContext::json_subfieldsunboundedunwrap_or_default()swallows errors_doc_mapping.json, unions with wideningstoredparam onsubfield_type_to_mappingstore_fieldsstructural enforcementextract_json_subfields_from_doc_mappingdead code1u8 as charmagic numberJSON_PATH_SEPconstantformat!in double loopstore_stub.rsalready deadTest plan
cargo test json_discovery— 17 unit tests (persistence, discovery, widening, I/O)cargo test schema_derivation— 40 tests (exhaustive store_fields for all 20 Arrow types)cargo test indexing::tests— 53 tests (E2E merge, no-stored-fields verification)cargo test— full suite passes (110 PR-related, 596 total)🤖 Generated with Claude Code