escape rename check for static schema streams #1636
Conversation
Important: Review skipped
This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It is recommended to use a dedicated user account to post CodeRabbit review feedback.
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID:
Walkthrough
Gates JSON conflict detection and per-record renaming on a static-schema flag. Changes:
- Event JSON: static-schema gating & per-record rename
- PutStreamHeaders: remove time_partition header
- Stream creation/update: time_partition handling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (1)
src/event/format/json.rs (1)
97-99: ⚡ Quick win: Avoid unconditional raw schema inference on the static-schema path.
`raw_inferred_schema` is still inferred before this branch, but the static-schema path does not use it. Moving inference inside the `else` branch avoids unnecessary work on a hot ingestion path.
♻️ Proposed change
```diff
-    // First, infer raw schema from incoming event to detect type conflicts
-    // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
-    // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
-    // if existing schema has Timestamp), which would hide the actual conflict.
-    let raw_inferred_schema = {
-        let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
-        infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
-            anyhow!("Could not infer schema for this event due to err {:?}", err)
-        })?
-    };
-
     let value_arr = if static_schema_flag {
         value_arr
     } else {
+        // First, infer raw schema from incoming event to detect type conflicts
+        // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
+        // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
+        // if existing schema has Timestamp), which would hide the actual conflict.
+        let raw_inferred_schema = {
+            let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
+            infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
+                anyhow!("Could not infer schema for this event due to err {:?}", err)
+            })?
+        };
+
         // Detect schema conflicts using raw inferred schema vs existing stream schema
         // Pass the actual values and schema_version to check if values can be coerced to existing types
         let conflicts = detect_schema_conflicts(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/json.rs` around lines 97 - 99, The code currently infers raw_inferred_schema unconditionally even when static_schema_flag is true; move the raw schema inference so it only runs inside the else branch where value_arr is reconstructed (i.e., perform the raw_inferred_schema computation after checking static_schema_flag and inside the else block that produces value_arr) to avoid doing unnecessary work on the static-schema path; update any uses of raw_inferred_schema to be scoped where it is needed (inside the else) and ensure nothing outside the branch relies on it.
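To make the suggested gating concrete, here is a hypothetical, self-contained sketch of the pattern: skip inference entirely on the static-schema path and run it only inside the `else` branch. `infer_schema` and `prepare_batch` are illustrative stand-ins, not the real parseable functions.

```rust
// Stand-in for infer_json_schema_from_iterator: pretend every value is Utf8.
fn infer_schema(values: &[&str]) -> Vec<String> {
    values.iter().map(|v| format!("{v}: Utf8")).collect()
}

fn prepare_batch<'a>(values: Vec<&'a str>, static_schema_flag: bool) -> Vec<&'a str> {
    if static_schema_flag {
        // Static-schema path: the stored schema is authoritative, so the
        // (potentially expensive) inference is skipped entirely.
        values
    } else {
        // Dynamic path: infer the raw schema first so conflicts are detected
        // against the raw types, before any type overriding happens.
        let inferred = infer_schema(&values);
        assert_eq!(inferred.len(), values.len());
        values
    }
}

fn main() {
    assert_eq!(prepare_batch(vec!["a", "b"], true).len(), 2);
    assert_eq!(prepare_batch(vec!["a"], false).len(), 1);
    println!("ok");
}
```

The values pass through unchanged either way; the only difference is whether the inference work runs at all.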
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 11cf4bd2-d59a-4343-8ec8-9dfdf5098573
📒 Files selected for processing (1)
src/event/format/json.rs
remove time partition headers from create/update logstream
8838f6d
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/event/format/mod.rs (1)
587-634: ⚠️ Potential issue | 🟠 Major: Docstring-code mismatch: the function no longer implements the second short-circuit condition, causing a test failure.
The function's docstring claims it short-circuits when "no inferred field shares both name and type with storage," but this condition is not implemented in the current code; only `values.len() <= 1` is checked. The test `rename_per_record_short_circuits_when_no_field_overlap_at_same_type` (lines 1171-1191) expects this behavior: storage has `"escaped"` → Utf8, inferred has `"escaped"` → Boolean. With two values:
- `values.len() = 2` → the loop runs (no short-circuit)
- `target_type = Utf8` (from storage)
- `value_compatible_with_type(&true, &Utf8, V1)` returns `false` (line 477: requires `value.is_string()`)
- The field gets renamed to `escaped_bool`
- The test assertion that the `"escaped"` key remains will fail

Fix: Either update the docstring to reflect the new behavior, or restore the second short-circuit condition. The change appears intentional (based on commit 8838f6d to catch within-batch type mismatches), but the test and docstring were not updated accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 587 - 634, The docstring promised a second short-circuit that returns early when "no inferred field shares both name and type with storage"; restore that check before the mapping loop: after building inferred_types and before values.into_iter(), add a guard that scans existing_schema (or existing_schema.values()) and returns values unchanged if there is no field name for which inferred_types.get(name) is Some(same DataType as existing_schema entry). Refer to inferred_types, existing_schema and the mapping block that begins with values.into_iter().map(...) — implement the early return using those symbols so the function behavior matches the docstring and the failing test.
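The rename walk-through above can be reproduced with a minimal stand-alone sketch; `Val`, `DataType`, `value_compatible_with_type`, and the `_bool`/`_str` suffixes here are simplified assumptions, not the actual parseable definitions.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Val {
    Str(String),
    Bool(bool),
}

#[derive(Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Boolean,
}

fn value_compatible_with_type(v: &Val, t: DataType) -> bool {
    matches!(
        (v, t),
        (Val::Str(_), DataType::Utf8) | (Val::Bool(_), DataType::Boolean)
    )
}

fn suffix_for(v: &Val) -> &'static str {
    match v {
        Val::Str(_) => "_str",
        Val::Bool(_) => "_bool",
    }
}

// Rename any key whose value is incompatible with the type stored for it.
fn rename_mismatches(
    record: HashMap<String, Val>,
    storage: &HashMap<String, DataType>,
) -> HashMap<String, Val> {
    record
        .into_iter()
        .map(|(k, v)| match storage.get(&k) {
            Some(&t) if !value_compatible_with_type(&v, t) => {
                (format!("{k}{}", suffix_for(&v)), v)
            }
            _ => (k, v),
        })
        .collect()
}

fn main() {
    // Mirrors the review's walk-through: storage has "escaped" -> Utf8,
    // the incoming value is a Boolean, so the field is renamed.
    let storage = HashMap::from([("escaped".to_string(), DataType::Utf8)]);
    let record = HashMap::from([("escaped".to_string(), Val::Bool(true))]);
    let out = rename_mismatches(record, &storage);
    assert!(out.contains_key("escaped_bool"));
    assert!(!out.contains_key("escaped"));
    println!("ok");
}
```

Under this sketch the failing test's expectation (the `"escaped"` key surviving) can only hold if a short-circuit runs before the rename loop.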
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0d9f6213-244f-47d1-ba03-14776f79ce26
📒 Files selected for processing (3)
src/event/format/mod.rs
src/handlers/http/modal/utils/logstream_utils.rs
src/parseable/mod.rs
753acfd to b58d891 (Compare)
🧹 Nitpick comments (1)
src/event/format/mod.rs (1)
1106-1118: 💤 Low value: Consider updating the test comment for accuracy.
The comment "Fields not in storage are passed through (let arrow infer them fresh)" is now slightly misleading given the new behavior. Fields absent from storage are now validated against `inferred_types` (the batch-level inference). This test still passes because both boolean values are compatible with the inferred `Boolean` type, but the reason differs from what the comment suggests.
📝 Suggested comment update
```diff
     #[test]
     fn rename_per_record_skips_unknown_fields() {
-        // Fields not in storage are passed through (let arrow infer them fresh).
+        // Fields not in storage are validated against the batch-level inferred type.
+        // Here all values are compatible with the inferred Boolean, so no rename.
         let storage: HashMap<String, Arc<Field>> = HashMap::new();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 1106 - 1118, Update the test comment in the rename_per_record_skips_unknown_fields test to reflect the new behavior: instead of saying "Fields not in storage are passed through (let arrow infer them fresh)", state that fields absent from storage are validated against the batch-level inferred schema (the inferred variable / inferred_types) rather than being left to Arrow to infer; reference the storage HashMap, inferred (Schema), and the call to rename_per_record_type_mismatches with SchemaVersion::V1 to clarify where this validation occurs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1dca6d4a-e26c-48d1-bd7b-9d61ed7df2a7
📒 Files selected for processing (1)
src/event/format/mod.rs
b58d891 to 8b290af (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/event/format/mod.rs (1)
574-580: ⚡ Quick win: Update stale function docs to match new control flow.
The doc block still says this is a no-op when inferred/storage name+type overlap is missing, but that guard was removed and Line 587 now only short-circuits for single-record batches. Please align the comment with current behavior.
Also applies to: 587-589
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 574 - 580, Update the stale doc comment above the function that short-circuits for single-record batches so it no longer claims the no-op also applies when "no inferred field shares both name and type with storage"; instead state that the no-op now only occurs for batches of at most one record (the earlier guard about inferred/storage name+type overlap was removed). Adjust the wording in the doc block (and the related comment at the other occurrence) to reference that detect_schema_conflicts still handles single-record type inference, and remove the outdated claim about arrow absorbing mixed-type records.
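As a sketch of the two guards the docstring describes, here is a hypothetical stand-alone version with the removed overlap check restored; the function name, type strings, and shapes are illustrative, not the real API.

```rust
use std::collections::HashMap;

// Guard 1 is the check still present in the code under review; guard 2 is the
// one the review says was removed from the implementation.
fn should_short_circuit(
    num_records: usize,
    inferred_types: &HashMap<String, &'static str>,
    storage: &HashMap<String, &'static str>,
) -> bool {
    if num_records <= 1 {
        // Guard 1: batches of at most one record are left untouched.
        return true;
    }
    // Guard 2: short-circuit when no field name has an inferred type that
    // matches the stored type, since no per-record rename could apply.
    !storage
        .iter()
        .any(|(name, ty)| inferred_types.get(name) == Some(ty))
}

fn main() {
    let storage = HashMap::from([("escaped".to_string(), "Utf8")]);
    let inferred = HashMap::from([("escaped".to_string(), "Boolean")]);
    // Two records but no name+type overlap: with the restored guard, the
    // failing test's expectation ("escaped" stays un-renamed) would hold.
    assert!(should_short_circuit(2, &inferred, &storage));
    assert!(should_short_circuit(1, &inferred, &storage));
    println!("ok");
}
```

Either restoring guard 2 or rewording the docstring resolves the mismatch; this sketch only illustrates the former.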
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/event/format/mod.rs`:
- Around line 615-621: The early return that passes through arrays/objects (the
val.is_array() || val.is_object() branch) must be moved to after the code that
resolves target_type and applied only when the resolved target_type is a
structural Arrow type (List, LargeList, Struct, or Map); update the logic around
value_compatible_with_type to first resolve target_type and then, if val is
array/object, only skip validation when target_type.is_list() ||
target_type.is_large_list() || target_type.is_struct() || target_type.is_map(),
otherwise let value_compatible_with_type run and handle/rename incompatible
scalar targets as before (refer to val.is_array(), val.is_object(), target_type,
and value_compatible_with_type to locate and change the code).
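A minimal sketch of the ordering the inline comment asks for: resolve `target_type` first, then let array/object values skip validation only when the target itself is structural. `Val` and `DataType` are simplified stand-ins for serde_json values and Arrow types.

```rust
#[derive(PartialEq, Debug)]
enum DataType {
    Utf8,
    List,
    Struct,
}

enum Val {
    Str(&'static str),
    Array,
    Object,
}

fn is_structural(t: &DataType) -> bool {
    // Stand-in for List / LargeList / Struct / Map in the real Arrow type set.
    matches!(t, DataType::List | DataType::Struct)
}

fn compatible(v: &Val, target_type: &DataType) -> bool {
    match v {
        // Array/object values pass only when the resolved target is structural;
        // against a scalar target they must fail validation (and get renamed).
        Val::Array | Val::Object => is_structural(target_type),
        Val::Str(_) => *target_type == DataType::Utf8,
    }
}

fn main() {
    // Previously an object value was skipped unconditionally; here it fails
    // against a scalar Utf8 target, as the comment requires.
    assert!(!compatible(&Val::Object, &DataType::Utf8));
    assert!(compatible(&Val::Array, &DataType::List));
    assert!(compatible(&Val::Str("x"), &DataType::Utf8));
    println!("ok");
}
```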
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f8b01b86-f4f2-437c-b380-8c7b1b2bd922
📒 Files selected for processing (1)
src/event/format/mod.rs
Summary by CodeRabbit
Features
Bug Fixes / Changes