escape rename check for static schema streams #1636

Merged
nikhilsinhaparseable merged 4 commits into parseablehq:main from nikhilsinhaparseable:fix/static-schema-ingestion
May 2, 2026
Conversation

Contributor

@nikhilsinhaparseable nikhilsinhaparseable commented May 2, 2026

Summary by CodeRabbit

  • Features

    • JSON event processing: "static schema" mode skips conflict detection and JSON-field renaming; standard mode retains validation and renaming.
    • Per-record handling: type-mismatch renaming now applies more broadly for multi-record batches, including new-column and mixed-type cases.
  • Bug Fixes / Changes

    • Stream creation/update: static-schema validation uses an empty time-partition and changing a stream's time partition is no longer blocked.
    • Request handling: explicit time-partition header parsing removed.

@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

Important

Review skipped

This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 25975843-c34f-460a-bb31-f2b61f20ea27

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Gates JSON conflict detection and per-record renaming on a static_schema_flag (skips when true). Per-record renaming now falls back to inferred types for fields absent from storage. Removed time_partition header/field from PutStreamHeaders. Stream create now uses empty time_partition; update no longer accepts a time_partition param.
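
A minimal sketch of the gating described in the walkthrough, assuming simplified stand-in types: `Value`, `DataType`, `Record`, `rename_conflicts`, and `prepare_records` are hypothetical names for illustration, not the functions in `src/event/format/json.rs`. Only the `bool` suffix is taken from this PR (`escaped_bool`); the `str` suffix is an assumption.

```rust
use std::collections::HashMap;

// Simplified stand-ins for JSON values and Arrow types (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Text(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Utf8,
}

type Record = HashMap<String, Value>;

// True when the value can be stored under the given type.
fn value_matches(value: &Value, ty: DataType) -> bool {
    matches!(
        (value, ty),
        (Value::Bool(_), DataType::Boolean) | (Value::Text(_), DataType::Utf8)
    )
}

// Suffix appended to a conflicting key ("str" is an assumed name).
fn suffix(value: &Value) -> &'static str {
    match value {
        Value::Bool(_) => "bool",
        Value::Text(_) => "str",
    }
}

// Placeholder for the conflict-detection and renaming flow.
fn rename_conflicts(records: Vec<Record>, schema: &HashMap<String, DataType>) -> Vec<Record> {
    records
        .into_iter()
        .map(|record| {
            record
                .into_iter()
                .map(|(key, value)| match schema.get(&key) {
                    Some(ty) if !value_matches(&value, *ty) => {
                        (format!("{}_{}", key, suffix(&value)), value)
                    }
                    _ => (key, value),
                })
                .collect::<Record>()
        })
        .collect()
}

// Static-schema streams pass records through untouched; other streams
// go through conflict detection and renaming.
fn prepare_records(
    records: Vec<Record>,
    schema: &HashMap<String, DataType>,
    static_schema_flag: bool,
) -> Vec<Record> {
    if static_schema_flag {
        records
    } else {
        rename_conflicts(records, schema)
    }
}
```

With a schema of `escaped: Utf8` and a record carrying `escaped: true`, the static path leaves the key as `escaped`, while the non-static path in this sketch renames it to `escaped_bool`.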

Changes

Event JSON: static-schema gating & per-record rename

| Layer / File(s) | Summary |
| --- | --- |
| Behavior toggle: src/event/format/json.rs | EventFormat::to_data now branches on static_schema_flag: when true, it skips schema-conflict detection and JSON renaming; when false, it runs the existing conflict-detection and renaming flow. |
| Per-record rename logic: src/event/format/mod.rs | rename_per_record_type_mismatches only short-circuits for single-record batches; for multi-record batches it builds inferred_types, picks a target_type per key (storage type preferred, otherwise inferred), skips compatibility checks for arrays/objects, and renames incompatible scalar values to <key>_<datatype-suffix>. |
| Tests / Call-sites: src/event/format/... | Unit test updated to assert renaming when a field is absent from storage (first-batch/new-column case); no public signatures changed. |
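
The per-record rename described in the table above can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/event/format/mod.rs`: `Value`, `DataType`, `Types`, `compatible`, `scalar_suffix`, and `rename_per_record` are hypothetical names, `Nested` stands in for any JSON array or object, and the `str` suffix is assumed (only `bool` appears in this PR).

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Text(String),
    Nested, // stands in for any JSON array or object
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Boolean,
    Utf8,
}

type Record = HashMap<String, Value>;
type Types = HashMap<String, DataType>;

// True when a scalar value fits the target type.
fn compatible(value: &Value, ty: DataType) -> bool {
    matches!(
        (value, ty),
        (Value::Bool(_), DataType::Boolean) | (Value::Text(_), DataType::Utf8)
    )
}

// Suffix for a scalar value; None means arrays/objects are never renamed here.
fn scalar_suffix(value: &Value) -> Option<&'static str> {
    match value {
        Value::Bool(_) => Some("bool"),
        Value::Text(_) => Some("str"),
        Value::Nested => None,
    }
}

fn rename_per_record(values: Vec<Record>, storage: &Types, inferred: &Types) -> Vec<Record> {
    // Single-record batches short-circuit.
    if values.len() <= 1 {
        return values;
    }
    values
        .into_iter()
        .map(|record| {
            record
                .into_iter()
                .map(|(key, value)| {
                    // Storage type is preferred; otherwise fall back to the
                    // batch-level inferred type.
                    let target = storage
                        .get(&key)
                        .or_else(|| inferred.get(&key))
                        .copied();
                    match (target, scalar_suffix(&value)) {
                        (Some(ty), Some(sfx)) if !compatible(&value, ty) => {
                            (format!("{}_{}", key, sfx), value)
                        }
                        _ => (key, value),
                    }
                })
                .collect::<Record>()
        })
        .collect()
}
```

In this sketch a two-record batch with `escaped: "a"` and `escaped: true` against storage `escaped: Utf8` keeps the first key and renames the second to `escaped_bool`. A field missing from storage is checked against the inferred type instead, which corresponds to the first-batch/new-column case.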

PutStreamHeaders: remove time_partition header

| Layer / File(s) | Summary |
| --- | --- |
| Imports & Constants: src/handlers/http/modal/utils/logstream_utils.rs | Removed TIME_PARTITION_KEY usage from imports and header parsing. |
| Struct shape: src/handlers/http/modal/utils/logstream_utils.rs | Removed time_partition: String field from PutStreamHeaders; struct now begins with time_partition_limit and retains other fields. |
| Constructors: src/handlers/http/modal/utils/logstream_utils.rs | Default and From<&HeaderMap> no longer initialize or parse time_partition. |

Stream creation/update: time_partition handling

| Layer / File(s) | Summary |
| --- | --- |
| Creation path validation: src/parseable/mod.rs | create_update_stream validates static schema using an empty time_partition and calls create_stream with time_partition = ""; removed an earlier error for both time_partition and custom_partition being set in this path. |
| Update signature & checks: src/parseable/mod.rs | update_stream signature no longer accepts a time_partition parameter and no longer rejects non-empty time_partition in the update path; checks preventing static_schema_flag alteration remain and updates to time_partition_limit/custom_partition are unchanged. |
| Integration: src/parseable/... | Wiring adjusted so creation uses an empty time-partition and update path omits time_partition handling. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • parmesant

Poem

🐰 I nibble on keys with a careful hop,
A flag can still the renaming crop;
When static stands, the fields stay neat,
When dynamic runs, I rename each beat —
Small hops, safe schemas, and a carrot to top.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The pull request description is completely empty, missing all required sections including description of goals, solutions, and testing/documentation checklist. | Add a description following the template: explain the goal of the changes, describe the problem being solved, outline key changes in each file, and check the verification boxes. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: skipping rename checks for static schema streams, which aligns with the primary modification across all files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@nitisht nitisht requested a review from parmesant May 2, 2026 03:45
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/event/format/json.rs (1)

97-99: ⚡ Quick win

Avoid unconditional raw schema inference on the static-schema path.

raw_inferred_schema is still inferred before this branch, but the static-schema path does not use it. Moving inference inside the else branch avoids unnecessary work on a hot ingestion path.

♻️ Proposed change
-        // First, infer raw schema from incoming event to detect type conflicts
-        // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
-        // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
-        // if existing schema has Timestamp), which would hide the actual conflict.
-        let raw_inferred_schema = {
-            let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
-            infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
-                anyhow!("Could not infer schema for this event due to err {:?}", err)
-            })?
-        };
-
         let value_arr = if static_schema_flag {
             value_arr
         } else {
+            // First, infer raw schema from incoming event to detect type conflicts
+            // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
+            // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
+            // if existing schema has Timestamp), which would hide the actual conflict.
+            let raw_inferred_schema = {
+                let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
+                infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
+                    anyhow!("Could not infer schema for this event due to err {:?}", err)
+                })?
+            };
+
             // Detect schema conflicts using raw inferred schema vs existing stream schema
             // Pass the actual values and schema_version to check if values can be coerced to existing types
             let conflicts = detect_schema_conflicts(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/event/format/json.rs` around lines 97 - 99, The code currently infers
raw_inferred_schema unconditionally even when static_schema_flag is true; move
the raw schema inference so it only runs inside the else branch where value_arr
is reconstructed (i.e., perform the raw_inferred_schema computation after
checking static_schema_flag and inside the else block that produces value_arr)
to avoid doing unnecessary work on the static-schema path; update any uses of
raw_inferred_schema to be scoped where it is needed (inside the else) and ensure
nothing outside the branch relies on it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 11cf4bd2-d59a-4343-8ec8-9dfdf5098573

📥 Commits

Reviewing files that changed from the base of the PR and between 001215b and 04f2fb8.

📒 Files selected for processing (1)
  • src/event/format/json.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 2, 2026
parmesant
parmesant previously approved these changes May 2, 2026
remove time partition headers from create/update logstream
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/event/format/mod.rs (1)

587-634: ⚠️ Potential issue | 🟠 Major

Docstring-code mismatch: Function no longer implements the second short-circuit condition, causing test failure.

The function's docstring claims it short-circuits when "no inferred field shares both name and type with storage," but this condition is not implemented in the current code—only values.len() <= 1 is checked. The test rename_per_record_short_circuits_when_no_field_overlap_at_same_type (lines 1171-1191) expects this behavior: storage has "escaped"→Utf8, inferred has "escaped"→Boolean. With two values:

  1. values.len() = 2 → loop runs (no short-circuit)
  2. target_type = Utf8 (from storage)
  3. value_compatible_with_type(&true, &Utf8, V1) returns false (line 477: requires value.is_string())
  4. Field gets renamed to escaped_bool
  5. Test assertion that "escaped" key remains will fail

Fix: Either update the docstring to reflect the new behavior, or restore the second short-circuit condition. The change appears intentional (based on commit 8838f6d to catch within-batch type mismatches), but the test and docstring were not updated accordingly.
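
To make the two options concrete, here is a minimal sketch of the short-circuit decision with and without the second condition. It assumes simplified types: `DataType`, `any_field_overlaps`, `should_short_circuit`, and the `keep_overlap_guard` switch are hypothetical names for illustration, not the real function in `src/event/format/mod.rs`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Boolean,
    Utf8,
}

type Types = HashMap<String, DataType>;

// The second condition from the docstring: does any inferred field share
// both its name and its type with storage?
fn any_field_overlaps(storage: &Types, inferred: &Types) -> bool {
    inferred
        .iter()
        .any(|(name, ty)| storage.get(name) == Some(ty))
}

// Returns true when the rename pass would be skipped entirely.
// `keep_overlap_guard = true` models the documented behavior;
// `false` models the code as reviewed (only the record-count check).
fn should_short_circuit(
    record_count: usize,
    storage: &Types,
    inferred: &Types,
    keep_overlap_guard: bool,
) -> bool {
    record_count <= 1 || (keep_overlap_guard && !any_field_overlaps(storage, inferred))
}
```

For the test case above (storage `escaped: Utf8`, inferred `escaped: Boolean`, two values), the guarded variant short-circuits and leaves `escaped` in place, while the unguarded variant proceeds to the per-record loop, which is where the rename to `escaped_bool` comes from.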

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/event/format/mod.rs` around lines 587 - 634, The docstring promised a
second short-circuit that returns early when "no inferred field shares both name
and type with storage"; restore that check before the mapping loop: after
building inferred_types and before values.into_iter(), add a guard that scans
existing_schema (or existing_schema.values()) and returns values unchanged if
there is no field name for which inferred_types.get(name) is Some(same DataType
as existing_schema entry). Refer to inferred_types, existing_schema and the
mapping block that begins with values.into_iter().map(...) — implement the early
return using those symbols so the function behavior matches the docstring and
the failing test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 0d9f6213-244f-47d1-ba03-14776f79ce26

📥 Commits

Reviewing files that changed from the base of the PR and between 04f2fb8 and 8838f6d.

📒 Files selected for processing (3)
  • src/event/format/mod.rs
  • src/handlers/http/modal/utils/logstream_utils.rs
  • src/parseable/mod.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 2, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 2, 2026
@nikhilsinhaparseable nikhilsinhaparseable force-pushed the fix/static-schema-ingestion branch from 753acfd to b58d891 Compare May 2, 2026 08:43
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/event/format/mod.rs (1)

1106-1118: 💤 Low value

Consider updating the test comment for accuracy.

The comment "Fields not in storage are passed through (let arrow infer them fresh)" is now slightly misleading given the new behavior. Fields absent from storage are now validated against inferred_types (the batch-level inference). This test still passes because both boolean values are compatible with the inferred Boolean type, but the reason differs from what the comment suggests.

📝 Suggested comment update
     #[test]
     fn rename_per_record_skips_unknown_fields() {
-        // Fields not in storage are passed through (let arrow infer them fresh).
+        // Fields not in storage are validated against the batch-level inferred type.
+        // Here all values are compatible with the inferred Boolean, so no rename.
         let storage: HashMap<String, Arc<Field>> = HashMap::new();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/event/format/mod.rs` around lines 1106 - 1118, Update the test comment in
the rename_per_record_skips_unknown_fields test to reflect the new behavior:
instead of saying "Fields not in storage are passed through (let arrow infer
them fresh)", state that fields absent from storage are validated against the
batch-level inferred schema (the inferred variable / inferred_types) rather than
being left to Arrow to infer; reference the storage HashMap, inferred (Schema),
and the call to rename_per_record_type_mismatches with SchemaVersion::V1 to
clarify where this validation occurs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 1dca6d4a-e26c-48d1-bd7b-9d61ed7df2a7

📥 Commits

Reviewing files that changed from the base of the PR and between 753acfd and b58d891.

📒 Files selected for processing (1)
  • src/event/format/mod.rs

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
src/event/format/mod.rs (1)

574-580: ⚡ Quick win

Update stale function docs to match new control flow

The doc block still says this is a no-op when inferred/storage name+type overlap is missing, but that guard was removed and Line 587 now only short-circuits for single-record batches. Please align the comment with current behavior.

Also applies to: 587-589

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/event/format/mod.rs` around lines 574 - 580, Update the stale doc comment
above the function that short-circuits for single-record batches so it no longer
claims the no-op also applies when "no inferred field shares both name and type
with storage"; instead state that the no-op now only occurs for batches of at
most one record (the earlier guard about inferred/storage name+type overlap was
removed). Adjust the wording in the doc block (and the related comment at the
other occurrence) to reference that detect_schema_conflicts still handles
single-record type inference, and remove the outdated claim about arrow
absorbing mixed-type records.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/event/format/mod.rs`:
- Around line 615-621: The early return that passes through arrays/objects (the
val.is_array() || val.is_object() branch) must be moved to after the code that
resolves target_type and applied only when the resolved target_type is a
structural Arrow type (List, LargeList, Struct, or Map); update the logic around
value_compatible_with_type to first resolve target_type and then, if val is
array/object, only skip validation when target_type.is_list() ||
target_type.is_large_list() || target_type.is_struct() || target_type.is_map(),
otherwise let value_compatible_with_type run and handle/rename incompatible
scalar targets as before (refer to val.is_array(), val.is_object(), target_type,
and value_compatible_with_type to locate and change the code).
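
The guard suggested in the inline comment above could look roughly like this. It is a sketch under assumptions: `DataType` and `JsonKind` are simplified local enums, and `is_structural` and `skip_validation` are hypothetical helpers, not Arrow or project APIs.

```rust
// Simplified stand-in for the Arrow types relevant to this check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Boolean,
    Utf8,
    List,
    LargeList,
    Struct,
    Map,
}

// Shape of the incoming JSON value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JsonKind {
    Scalar,
    Array,
    Object,
}

// True for the structural target types named in the review comment.
fn is_structural(ty: DataType) -> bool {
    matches!(
        ty,
        DataType::List | DataType::LargeList | DataType::Struct | DataType::Map
    )
}

// Arrays and objects skip the compatibility check only when the resolved
// target type is itself structural; otherwise validation still runs.
fn skip_validation(kind: JsonKind, target: DataType) -> bool {
    matches!(kind, JsonKind::Array | JsonKind::Object) && is_structural(target)
}
```

Under this sketch an array arriving for a `List` column is passed through, while an object arriving for a `Utf8` column is no longer skipped and falls through to the usual compatibility check.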

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f8b01b86-f4f2-437c-b380-8c7b1b2bd922

📥 Commits

Reviewing files that changed from the base of the PR and between b58d891 and 8b290af.

📒 Files selected for processing (1)
  • src/event/format/mod.rs

Comment thread src/event/format/mod.rs Outdated
@nikhilsinhaparseable nikhilsinhaparseable merged commit bb763b1 into parseablehq:main May 2, 2026
12 checks passed

3 participants