Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery by tlee732 · Pull Request #155 · indextables/tantivy4java

tlee732 · 2026-04-04T23:00:32Z

Summary

JSON fields in schema_derivation.rs unconditionally called .set_stored(), ignoring the store_fields config flag — this caused companion splits (which set store_fields: false) to redundantly write JSON data into .store files
Replaced the inline JsonSubfieldMap accumulation approach (which threaded a mutable HashMap through every row of the indexing hot path) with post-commit fast-field discovery
Introduced _doc_mapping.json persistence in split bundles (following the _parquet_manifest.json precedent) so JSON sub-field metadata survives split creation → merge cycles
At merge time, reads _doc_mapping.json from each input split, unions sub-field maps with type widening, and falls back gracefully for legacy splits

Architecture

Three-tier doc_mapping recovery at merge time:

Primary: Read _doc_mapping.json from each input split, parse sub-fields, union with type widening
Fallback (legacy companion): Discover from fast-field columns on the merged index
Fallback (legacy non-companion): Sample stored documents from .store

When merging a mix of modern and legacy splits, the primary path supplements with fast-field discovery on the merged index to avoid losing sub-fields from legacy splits.

Creation-time persistence (both modes):

Companion (indexing.rs): Post-commit discover_json_subfields_from_fast_fields() → extract_doc_mapping_with_json_subfields() → write _doc_mapping.json to index dir → filesystem scan bundles it automatically
FFI/standard (arrow_ffi_import.rs): extract_doc_mapping_from_index() (samples .store) → write _doc_mapping.json → bundled into split

Split extraction (temp_management.rs): _doc_mapping.json is explicitly copied from the bundle during extraction, matching the _parquet_manifest.json pattern.

Design Decisions

Post-commit discovery over inline accumulation: The original approach threaded &mut JsonSubfieldMap through the full row-processing call chain, doing per-row HashMap work (~9M wasted allocations on a 10M-row file). Post-commit fast-field discovery runs once with zero hot-path overhead.
_doc_mapping.json persistence pattern: Follows the established _parquet_manifest.json convention — write to index dir before the filesystem scan in create_quickwit_split, which picks it up automatically.
Raw doc_mapping persisted, not promoted: Promotion (fast: true on all fields) is a client-facing concern. extract_json_subfields_from_doc_mapping only uses name + type from sub-fields, so both versions produce identical sub-field maps at merge time.
stored parameter on subfield_type_to_mapping: Previously hardcoded stored: false. Now parameterized so the metadata is truthful — prevents the metadata from masking schema bugs.
Mixed legacy/modern merge safety: When only some input splits have _doc_mapping.json (rolling upgrades), we supplement with fast-field discovery on the merged index to catch sub-fields from legacy splits.
Module organization: Discovery functions stay in quickwit_split/json_discovery.rs (shared infrastructure). Moving to parquet_companion would create a circular dependency since merge_impl.rs needs them. The stored param removes the companion-specific assumption.

Review Items Addressed

#	Finding	Resolution
1	Inline accumulation unnecessary	Removed; post-commit fast-field discovery
2	`throwaway_subfields` wastes allocs	Removed
3	`ArrowFfiSplitContext::json_subfields` unbounded	Field deleted
4	`unwrap_or_default()` swallows errors	Proper error propagation + logging
5	Merge should use input splits' doc_mapping	Reads `_doc_mapping.json`, unions with widening
6	Companion-specific logic in shared code	`stored` param on `subfield_type_to_mapping`
7	`store_fields` structural enforcement	Deliberate non-fix (runtime flag sufficient)
8	Fast-field discovery misses indexed-but-not-fast	IpAddr fix; documented for legacy fallback
9	`extract_json_subfields_from_doc_mapping` dead code	Now core of merge path
10	Sample limit counts rows not discoveries	Eliminated
11	`1u8 as char` magic number	`JSON_PATH_SEP` constant
12	`format!` in double loop	Pre-computed prefixes
13	Factually wrong comment	Fixed
14	`store_stub.rs` already dead	Already deleted
15	Duplicate test coverage	Consolidated
16-20	Testing gaps	All addressed (see below)

Test plan

cargo test json_discovery — 17 unit tests (persistence, discovery, widening, I/O)
cargo test schema_derivation — 40 tests (exhaustive store_fields for all 20 Arrow types)
cargo test indexing::tests — 53 tests (E2E merge, no-stored-fields verification)
cargo test — full suite passes (110 PR-related, 596 total)

🤖 Generated with Claude Code

schenksj

Code Review — PR #155: Eliminate Redundant .store for Companion JSON Fields

Requesting changes. This PR has several correctness, safety, design, and unnecessary-complexity issues that need to be addressed before merge.

Blocking / High Priority

1. The entire `JsonSubfieldMap` inline accumulation approach is unnecessary — and the replacement is also incomplete

The PR introduces JsonSubfieldMap accumulated inline during indexing to discover JSON sub-fields without reading from .store. This approach is unnecessary overhead: after create_split_from_parquet() commits the index, the sub-field information is already available directly from the built index via two complementary APIs:

Fast fields: ColumnarReader::list_columns() enumerates all column names (including fieldname\x01subpath entries for JSON sub-fields) with their ColumnType, giving both name and type without any document sampling. This is already used in transcode.rs.
Indexed (non-fast) fields: The JSON field's term dictionary encodes terms as subpath\x01value. Iterating the term dictionary and extracting unique sub-path prefixes recovers all indexed-but-not-fast sub-paths. SegmentReader::inverted_index(field) is already used in prewarm/field_specific.rs for exactly this kind of field enumeration.

The PR's own discover_json_subfields_from_fast_fields uses only the fast-field approach and therefore misses indexed-but-not-fast JSON sub-fields. A JSON sub-field can be indexed (searchable, present in the term dictionary) but not fast (no columnar data). If it's omitted from doc_mapping_json, aggregations on that sub-field silently fail after merge. The PR introduces this gap in the merge path.

The correct fix is a single post-commit discovery function that:

Calls ColumnarReader::list_columns() per segment to enumerate fast-field sub-paths and their ColumnType
Iterates the JSON field's term dictionary per segment to enumerate indexed-only sub-paths (extracting the sub-path prefix before \x01 in each term)
Merges results across all segments with type widening

This function is called once after create_split_from_parquet() completes — replacing both the inline accumulation in the build path and discover_json_subfields_from_fast_fields in the merge path. No JsonSubfieldMap threading through the entire document-building call chain is needed.

As a result, all of the following introduced in this PR become unnecessary and should be removed:

JsonSubfieldMap type alias
JSON_SUBFIELD_SAMPLE_LIMIT constant and sampling counter
json_subfields: &mut JsonSubfieldMap parameter on build_doc_from_arrow_row, arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, add_json_string_value
throwaway_subfields and all surrounding sampling logic
extract_doc_mapping_with_json_subfields wrapper
ArrowFfiSplitContext::json_subfields field

2. `throwaway_subfields` still allocates per row past the sample limit

Even within the PR's own framing, clear() + or_default() inside discover_fields_from_object allocates into the map on every row past the limit. On a 10M-row file this is ~9M wasted allocations. The intended fix (per finding #1 above) is to remove the accumulation entirely. If the approach is kept, the signature must use Option<&mut JsonSubfieldMap> and skip all work when None.

3. `ArrowFfiSplitContext::json_subfields` accumulates silently discarded data and grows unboundedly

/// Kept as a field to avoid changing shared function signatures that require &mut HashMap.
json_subfields: JsonSubfieldMap,

The PR's own comment acknowledges the problem but chooses not to fix it. The field grows across all partitions in a multi-partition FFI session without ever resetting, and all work populating it is discarded. add_arrow_batch feeds &mut ctx.json_subfields on two separate codepaths (non-partition and partition paths), both throwaway. The fix per finding #1 is to remove the accumulation approach entirely.

4. `unwrap_or_default()` in the merge path silently produces broken splits

let json_subfields = discover_json_subfields_from_fast_fields(&merged_index)
    .unwrap_or_default();

If fast-field discovery fails (e.g., corrupt fast field), the error is swallowed, the sub-field map is empty, and JSON fields are dropped from the merged split's doc_mapping_json entirely (the if discovered_subfields.is_empty() { continue; } guard). Aggregation queries on those fields then silently fail. At minimum log a warning; ideally propagate.

5. The merge path is wrong — input splits already have `doc_mapping_json`

The merge path re-discovers JSON sub-fields by scanning fast-field columns of the merged index. But the input splits already have a doc_mapping_json containing exactly this information. The correct approach:

Read doc_mapping_json from each input split
Union the sub-field maps with type widening where splits disagree
Use the merged mapping for the output split

Beyond the design problem, calling discover_json_subfields_from_fast_fields on the merged index risks type divergence: if split A has a sub-field as U64 and split B has it as I64, the merged columnar result may differ from what either input split declared. Reading doc_mapping_json from input splits and widening explicitly is correct and predictable.

Note: extract_json_subfields_from_doc_mapping — added by this PR with tests but with no callsite — appears to be the intended building block for this correct approach. The PR adds the right tool and then doesn't use it, choosing the wrong approach instead.

6. Companion-specific logic does not belong in `quickwit_split`

subfield_type_to_mapping, JsonSubfieldMap, discover_fields_from_object, and extract_doc_mapping_with_json_subfields encode companion-specific assumptions (stored: false because parquet is the authoritative store). They have no business living in quickwit_split, which is shared infrastructure. The correct fix is to move this family of functions into parquet_companion. If quickwit_split needs to invoke sub-field mapping at merge time, it should do so through a trait or callback owned by parquet_companion.

As a secondary consequence, subfield_type_to_mapping is pub and hardcodes "stored": false for every type. For non-companion splits (store_fields: true), JSON field data is physically written to .store. If a future caller in quickwit_split uses this function for those splits, Quickwit's DocMapper will be told sub-fields are stored: false — making on-disk data permanently inaccessible with no error. Moving to parquet_companion removes it from the shared namespace and eliminates this footgun.

7. `store_fields` option on JSON fields should not exist for companion splits

The store_fields flag gates .stored() on JSON fields in schema derivation. But companion splits will never use .store for JSON — parquet is the authoritative store by design. This contract should be enforced structurally (separate code paths or a type-level distinction) rather than as a runtime flag.

8. `discover_json_subfields_from_fast_fields` misses indexed-but-not-fast sub-fields

Per finding #1, this function only scans fast-field columns and ignores JSON sub-fields that are indexed but not fast. This produces an incomplete doc_mapping_json for any split with non-fast indexed JSON sub-fields, silently omitting those fields from aggregation. Additionally, ColumnType::IpAddr columns are mapped to "text" rather than "ip", breaking range queries on IP sub-fields after merge.

Medium Priority

9. `extract_json_subfields_from_doc_mapping` ships with tests but has no callsite

The function is added and tested but nothing in the production code calls it. It appears to be the intended building block for the correct merge approach (reading doc_mapping_json from input splits — per finding #5), but the PR implements a different strategy in merge_impl.rs and leaves this function orphaned. Either use it correctly in the merge path or remove it.

10. Sample limit counts rows globally, not per-field discoveries

json_subfield_rows_sampled is a global counter across all parquet files and all JSON fields. Once 1000 rows have been processed, discovery stops for ALL fields — even fields whose JSON values are null in the first 1000 rows. For sparse JSON columns, this means zero sub-fields discovered. The counter should be per-field, or count rows that actually contributed new sub-fields, or use a "no new fields in last N rows" window.

11. `1u8 as char` is a magic number for the tantivy separator

let sep = 1u8 as char; // JSON_PATH_SEGMENT_SEP

This appears in discover_json_subfields_from_fast_fields. Import common::json_path_writer::JSON_PATH_SEGMENT_SEP directly — it is pub. A silent change to this constant in the forked tantivy breaks all JSON path reconstruction without a compile error.

12. `format!("{}{}", json_field, sep)` is inside a double loop

In discover_json_subfields_from_fast_fields, prefix is recomputed for every (col_name, json_field) pair across every segment. Pre-compute once before the column loop:

let prefixes: Vec<(String, &str)> = json_field_names.iter()
    .map(|f| (format!("{}{}", f, sep), f.as_str()))
    .collect();

13. Factually wrong comment in `indexing.rs`

A NOTE comment added by the PR states that .store files must remain on disk for hotcache generation and that create_quickwit_split excludes them from the bundle in companion mode. Both claims are false after the PR's own changes: companion splits produce no .store files at all (no fields are stored), and split_creation.rs has no logic to exclude .store files in companion mode. The file inclusion filter only skips .tantivy* lock files.

14. `store_stub.rs` was already dead code before this PR

The PR deletes store_stub.rs as part of .store elimination, implying it was previously load-bearing. In fact generate_store_stubs had zero callsites before this PR — it was already dead. The deletion is correct but the motivation is misleading.

15. Duplicate test coverage for `store_fields` on UTF8 JSON

After the PR there are three tests in schema_derivation.rs covering the same UTF8 JSON + store_fields combinations: the original test_json_fields_stored_and_indexed (now with store_fields=true), test_json_fields_not_stored_in_companion_mode, and the new parametric test_json_field_stored_respects_store_fields_flag. The first two are redundant with the parametric test and should be removed.

Testing Gaps

16. No end-to-end test for the merge path

No test: (a) builds two companion splits with JSON fields, (b) merges them, (c) asserts the merged split's doc_mapping_json correctly reflects all JSON sub-fields including non-fast indexed ones. This is the most critical gap given the merge path is entirely new.

17. No direct unit test for `discover_json_subfields_from_fast_fields`

The most important new production function in the PR has no unit test. It is only reachable via the merge path integration.

18. `should_widen_type` edge cases not directly tested

test_discover_fields_from_object_type_widening validates should_widen_type indirectly. There is no direct test covering: ("text", "text") → true (no-op update), ("i64", "bool") → false, ("date", "bool") → false, (f64, i64) → false.

19. No cross-segment type-widening test in `discover_json_subfields_from_fast_fields`

No test where segment 0 has a sub-field as i64 and segment 1 has it as f64, verifying the merged result is f64.

20. `extract_json_subfields_from_doc_mapping` not tested with dotted paths

The roundtrip test uses only flat sub-fields. A nested sub-field addr.city should be covered to verify the dotted-path roundtrip through JSON serialization.

Nits

expand_dots=false with dotted key names: If a sub-field key contains a literal dot and expand_dots=false, .replace('\x01', '.') produces an ambiguous path. Add a comment.
u64 assertion fragility: The assertion assert_eq!(score_field["type"], "u64") in test_json_subfield_discovery_in_companion_split depends on id_offset=0 producing non-negative scores. Document this dependency or use matches!(field_type, "u64" | "i64").
test_json_subfield_discovery_in_companion_split tests the approach being removed: Once the inline accumulation is replaced by post-commit discovery, this test name and structure will need updating to reflect what is actually being tested.

schenksj

Thank you for finding and addressing this very real bug!!! Please see the comments inline in the conversation.

When approaching a problem like this, it can be very helpful to use claude's planning mode to find the best approach and ask the planner to fix or re-research anything that seems incomplete.

tlee732 · 2026-04-05T00:30:03Z

Proposal: Persist `doc_mapping.json` in split bundles

After reviewing your feedback, I'd like to propose an approach that addresses #1, #2, #3, #4, #7, and #10 in a single architectural change.

Core idea

Write doc_mapping.json into the split's index_dir at creation time (same pattern as the parquet manifest — the filesystem scan in create_quickwit_split picks it up automatically). At merge time, read it from each extracted input split and union with type widening.

Creation time — two strategies by mode

Mode	Discovery method	Why
Companion	After `writer.commit()`, call `discover_json_subfields_from_fast_fields(&index)`	No `.store` exists; all fields are fast in companion mode — this is authoritative and complete, no sampling limit
Non-companion	`extract_doc_mapping_from_index(&index)` (existing `.store` sampling)	`.store` exists; not all fields are necessarily fast, so fast-field enumeration would miss indexed-but-not-fast sub-fields

Both modes write the result as doc_mapping.json in the index directory before the split is bundled.

Merge time — both modes, same path

Read doc_mapping.json from each extracted input split
Parse via extract_json_subfields_from_doc_mapping() (currently flagged as dead code — becomes the core of the merge path)
Union all maps with should_widen_type() for type evolution across splits
Fallback: if doc_mapping.json is missing (legacy splits), use the current mode-appropriate discovery with proper error logging

What this eliminates

Current code	Status after this change
Inline `discover_fields_from_object()` during companion indexing (first 1000 rows + throwaway after)	Removed entirely — post-commit fast-field discovery replaces it
`json_subfields` field on `ArrowFfiSplitContext` (grows unboundedly, never read)	Removed entirely — FFI is non-companion, uses `.store` sampling at finalization
`throwaway_subfields` variable + 1000-row sample limit in `indexing.rs`	Removed entirely
`discover_json_subfields_from_fast_fields(&merged_index).unwrap_or_default()` in merge path	Replaced with authoritative file read + union; fast-field discovery becomes fallback only
`extract_json_subfields_from_doc_mapping()` dead code	Now the core of merge — parses persisted doc_mappings from input splits

How it maps to review items

#	Feedback	How this addresses it
1	`throwaway_subfields` wastes ~9M allocations	Inline discovery removed entirely for companion; no `Option` signature change needed
2	`ArrowFfiSplitContext::json_subfields` accumulates and discards	Field removed; FFI path passes nothing since `.store` sampling handles it at finalization
3	`unwrap_or_default()` swallows errors	Replaced with file read as primary path; fallback has proper error logging
4	Merge path should use input splits' doc_mapping	Reads authoritative `doc_mapping.json` from each input split, unions with type widening
7	Sampling counts rows not discoveries	Companion mode: eliminated (fast-field enumeration is complete). Non-companion: unchanged (`.store` sampling is the correct strategy there)
10	`extract_json_subfields_from_doc_mapping` dead code	Becomes the core parser for the merge path

Performance impact

Dimension	Effect
CPU at creation (companion)	O(segments × columns) replaces O(N rows)
CPU at creation (FFI)	Removes unbounded wasted HashMap work
CPU at merge	Neutral — file read replaces column enumeration
Memory	Removes unbounded HashMap growth in companion and FFI paths
Disk	`doc_mapping.json` is a few KB per split (splits are 50-500MB)

Precedent

The parquet manifest already uses this exact pattern: write to index_dir, filesystem scan picks it up, bundled into .split, read back at merge time from extracted temp dirs.

Thoughts?

JSON fields in schema_derivation.rs unconditionally called .set_stored(), ignoring the store_fields config flag. This caused companion splits (which set store_fields: false) to redundantly write JSON data into .store files. - Make JSON fields conditionally stored based on config.store_fields - Replace .store-based JSON subfield sampling with inline accumulation during indexing and fast-field discovery during merge - Extract type-widening logic into reusable should_widen_type() helper - Apply type widening to fast-field discovery so doc_mapping types are correct when JSON sub-field types evolve across segments - Remove deleted store_stub.rs (no longer needed) - Add tests for store_fields config, .store round-trip, and type widening Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…eld recovery Replace inline JsonSubfieldMap accumulation (which threaded a mutable HashMap through every row of the indexing hot path) with post-commit fast-field discovery and _doc_mapping.json persistence in split bundles. Creation time: - Companion: discover_json_subfields_from_fast_fields() post-commit - FFI/standard: extract_doc_mapping_from_index() samples .store - Both write _doc_mapping.json to index dir before bundling Merge time (three-tier fallback): - Primary: read _doc_mapping.json from input splits, union with type widening - Legacy companion: discover from fast-field columns on merged index - Legacy non-companion: sample .store on merged index - Mixed legacy/modern: supplement file-derived map with fast-field discovery Additional fixes from review: - Make JSON fields conditionally stored based on store_fields config flag - Add stored param to subfield_type_to_mapping (metadata reflects reality) - Replace 1u8 magic number with JSON_PATH_SEP constant - Pre-compute JSON field prefixes outside inner loop - Fix ColumnType::IpAddr mapping from "text" to "ip" - Add insert_or_widen helper to deduplicate widening logic - Add _doc_mapping.json to split extraction in temp_management.rs - Fix should_widen_type same-type early return and doc comment accuracy - Add logging to all error-handling paths in discovery functions - Consolidate duplicate store_fields tests Test coverage (110 tests): - E2E merge test: create 2 splits, merge, verify sub-field names + types - No-stored-fields test: extract split, verify no stored fields + _doc_mapping.json present - Exhaustive store_fields tests: all 20 Arrow column types in both modes - Fast-field discovery: flat fields, nested dotted paths, cross-segment widening - Type widening: exhaustive type pairs including ip, same-type, bool->date - Persistence I/O: write/read roundtrip, missing file returns None - combine_json_subfield_maps: empty, single, multi-map union Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tlee732 · 2026-04-07T01:21:26Z

Revision 2: Persist `_doc_mapping.json` in split bundles

Pushed two commits addressing all 20 review findings:

Commit 1: `933f842` — Original schema fix (rebased onto main)

Makes JSON fields conditionally stored based on config.store_fields
Removes store_stub.rs (already dead code)

Commit 2: `af9e111` — Architectural revision per review feedback

Replaces the inline JsonSubfieldMap accumulation with the _doc_mapping.json persistence approach discussed in the proposal comment above.

What changed architecturally:

Removed all json_subfields parameter threading through the hot path (arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, add_json_string_value)
Removed ArrowFfiSplitContext::json_subfields field, throwaway_subfields, 1000-row sample limit
Added post-commit discover_json_subfields_from_fast_fields() for companion splits
Added _doc_mapping.json persistence in split bundles (same pattern as _parquet_manifest.json)
Added three-tier merge fallback: file → fast-field discovery → .store sampling
Added mixed legacy/modern merge safety (supplements file-derived map with fast-field discovery when not all input splits have _doc_mapping.json)

Additional fixes:

subfield_type_to_mapping now takes stored: bool — metadata reflects reality for both modes (prevents schema bugs from being masked)
discover_json_subfields also propagates stored from the parent field entry
ColumnType::IpAddr correctly maps to "ip" (was "text")
should_widen_type early-returns for same-type comparisons
JSON_PATH_SEP constant replaces 1u8 as char magic number
Pre-computed prefixes in fast-field discovery inner loop
insert_or_widen helper deduplicates 3 call sites
All error-handling paths in discovery functions now log (were silent continue blocks)
promote_doc_mapping_all_fast fallback in merge logs instead of silently degrading

Test coverage (110 tests, 0 failures):

E2E merge test: create 2 companion splits → merge → verify sub-field names AND types
No-stored-fields test: extract split bundle, verify no stored fields + _doc_mapping.json present
Exhaustive store_fields tests: all 20 Arrow column types in both modes (false + true)
Fast-field discovery: flat fields, nested/dotted paths, cross-segment type widening
should_widen_type: all type pairs including ip, same-type, bool→date
Persistence I/O: write/read roundtrip, missing file returns None
combine_json_subfield_maps: empty, single, multi-map union with widening
insert_or_widen: direct unit test for the widening helper

Deliberate non-fixes (per proposal):

cache eviction #7: store_fields structural enforcement — runtime flag set in 3 internal code paths, never by users
Revert "cache eviction" #8: Indexed-but-not-fast sub-fields — only affects legacy fallback path; companion has all fields fast; merge reads from _doc_mapping.json

…g code The single-field text_and_string rewrite eliminated the dual-field companion design, but the text_companion_fields: &HashMap<String, Field> parameter was left threaded through arrow_row_to_tantivy_doc, add_arrow_value_to_doc, add_string_value_to_doc, and build_doc_from_arrow_row. The parameter was always initialized as an empty HashMap and never read — pure dead code. Removing it aligns PR indextables#157's function signatures with PR indextables#155's (feature/eliminate-store-files), so the two PRs no longer collide on this surface when both land on main. text_and_string routing now happens exclusively in the tokenizer selection path, not via a companion field map. No behavior change. Pure subtractive cleanup (15 deletions across 2 files). cargo check --release compiles cleanly. OUT OF SCOPE — pre-existing test build failure: indexing.rs has 14 compile errors in test code starting around line 3500 involving tantivy::collector::TopDocs trait bounds (E0277/E0282). These errors were present on this branch before this commit (verified by reproducing them with this cleanup git-stashed) and are unrelated to the text_companion_fields removal — they appear to be tantivy API drift from a version bump. Fixing them is out of scope for this commit and should land in a separate PR focused specifically on test API compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

schenksj

LGTM. Nice job!

tlee732 temporarily deployed to test-runners April 4, 2026 23:00 — with GitHub Actions Inactive

tlee732 had a problem deploying to test-runners April 4, 2026 23:00 — with GitHub Actions Failure

schenksj requested changes Apr 4, 2026

View reviewed changes

schenksj requested changes Apr 5, 2026

View reviewed changes

tlee732 and others added 2 commits April 6, 2026 21:15

tlee732 force-pushed the feature/eliminate-store-files branch from 62fef8d to af9e111 Compare April 7, 2026 01:18

tlee732 had a problem deploying to test-runners April 7, 2026 01:18 — with GitHub Actions Failure

tlee732 temporarily deployed to test-runners April 7, 2026 01:18 — with GitHub Actions Inactive

tlee732 changed the title ~~Eliminate redundant .store for companion JSON fields~~ Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery Apr 7, 2026

tlee732 mentioned this pull request Apr 10, 2026

Add single-field text_and_string indexing with native fast field support #157

Open

schenksj approved these changes Apr 11, 2026

View reviewed changes

schenksj merged commit becadd4 into indextables:main Apr 11, 2026
1 of 3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery#155

Persist _doc_mapping.json in split bundles for merge-time JSON sub-field recovery#155
schenksj merged 2 commits into
indextables:mainfrom
tlee732:feature/eliminate-store-files

tlee732 commented Apr 4, 2026 •

edited

Loading

Uh oh!

schenksj left a comment •

edited

Loading

Uh oh!

schenksj left a comment

Uh oh!

tlee732 commented Apr 5, 2026

Uh oh!

tlee732 commented Apr 7, 2026

Uh oh!

schenksj left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tlee732 commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Design Decisions

Review Items Addressed

Test plan

Uh oh!

schenksj left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Code Review — PR #155: Eliminate Redundant .store for Companion JSON Fields

Blocking / High Priority

1. The entire JsonSubfieldMap inline accumulation approach is unnecessary — and the replacement is also incomplete

2. throwaway_subfields still allocates per row past the sample limit

3. ArrowFfiSplitContext::json_subfields accumulates silently discarded data and grows unboundedly

4. unwrap_or_default() in the merge path silently produces broken splits

5. The merge path is wrong — input splits already have doc_mapping_json

6. Companion-specific logic does not belong in quickwit_split

7. store_fields option on JSON fields should not exist for companion splits

8. discover_json_subfields_from_fast_fields misses indexed-but-not-fast sub-fields

Medium Priority

9. extract_json_subfields_from_doc_mapping ships with tests but has no callsite

10. Sample limit counts rows globally, not per-field discoveries

11. 1u8 as char is a magic number for the tantivy separator

12. format!("{}{}", json_field, sep) is inside a double loop

13. Factually wrong comment in indexing.rs

14. store_stub.rs was already dead code before this PR

15. Duplicate test coverage for store_fields on UTF8 JSON

Testing Gaps

16. No end-to-end test for the merge path

17. No direct unit test for discover_json_subfields_from_fast_fields

18. should_widen_type edge cases not directly tested

19. No cross-segment type-widening test in discover_json_subfields_from_fast_fields

20. extract_json_subfields_from_doc_mapping not tested with dotted paths

Nits

Uh oh!

schenksj left a comment

Choose a reason for hiding this comment

Uh oh!

tlee732 commented Apr 5, 2026

Proposal: Persist doc_mapping.json in split bundles

Core idea

Creation time — two strategies by mode

Merge time — both modes, same path

What this eliminates

How it maps to review items

Performance impact

Precedent

Uh oh!

tlee732 commented Apr 7, 2026

Revision 2: Persist _doc_mapping.json in split bundles

Commit 1: 933f842 — Original schema fix (rebased onto main)

Commit 2: af9e111 — Architectural revision per review feedback

Uh oh!

schenksj left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tlee732 commented Apr 4, 2026 •

edited

Loading

schenksj left a comment •

edited

Loading

1. The entire `JsonSubfieldMap` inline accumulation approach is unnecessary — and the replacement is also incomplete

2. `throwaway_subfields` still allocates per row past the sample limit

3. `ArrowFfiSplitContext::json_subfields` accumulates silently discarded data and grows unboundedly

4. `unwrap_or_default()` in the merge path silently produces broken splits

5. The merge path is wrong — input splits already have `doc_mapping_json`

6. Companion-specific logic does not belong in `quickwit_split`

7. `store_fields` option on JSON fields should not exist for companion splits

8. `discover_json_subfields_from_fast_fields` misses indexed-but-not-fast sub-fields

9. `extract_json_subfields_from_doc_mapping` ships with tests but has no callsite

11. `1u8 as char` is a magic number for the tantivy separator

12. `format!("{}{}", json_field, sep)` is inside a double loop

13. Factually wrong comment in `indexing.rs`

14. `store_stub.rs` was already dead code before this PR

15. Duplicate test coverage for `store_fields` on UTF8 JSON

17. No direct unit test for `discover_json_subfields_from_fast_fields`

18. `should_widen_type` edge cases not directly tested

19. No cross-segment type-widening test in `discover_json_subfields_from_fast_fields`

20. `extract_json_subfields_from_doc_mapping` not tested with dotted paths

Proposal: Persist `doc_mapping.json` in split bundles

Revision 2: Persist `_doc_mapping.json` in split bundles

Commit 1: `933f842` — Original schema fix (rebased onto main)

Commit 2: `af9e111` — Architectural revision per review feedback