escape rename check for static schema streams #1636
Conversation
Important: Review skipped
This PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It is recommended to use a dedicated user account to post CodeRabbit review feedback.
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID:
Walkthrough
Gates JSON conflict detection and per-record renaming on a static-schema flag. Changes:
- Event JSON: static-schema gating & per-record rename
- PutStreamHeaders: remove time_partition header
- Stream creation/update: time_partition handling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (1)
src/event/format/json.rs (1)
97-99: ⚡ Quick win: Avoid unconditional raw schema inference on the static-schema path.
`raw_inferred_schema` is still inferred before this branch, but the static-schema path does not use it. Moving inference inside the `else` branch avoids unnecessary work on a hot ingestion path.
♻️ Proposed change
```diff
-    // First, infer raw schema from incoming event to detect type conflicts
-    // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
-    // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
-    // if existing schema has Timestamp), which would hide the actual conflict.
-    let raw_inferred_schema = {
-        let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
-        infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
-            anyhow!("Could not infer schema for this event due to err {:?}", err)
-        })?
-    };
-
     let value_arr = if static_schema_flag {
         value_arr
     } else {
+        // First, infer raw schema from incoming event to detect type conflicts
+        // IMPORTANT: Detect conflicts BEFORE update_field_type_in_schema, because
+        // update_field_type_in_schema may override types (e.g., force Utf8 to Timestamp
+        // if existing schema has Timestamp), which would hide the actual conflict.
+        let raw_inferred_schema = {
+            let _span = info_span!("infer_json_schema", record_count = value_arr.len()).entered();
+            infer_json_schema_from_iterator(value_arr.iter().map(Ok)).map_err(|err| {
+                anyhow!("Could not infer schema for this event due to err {:?}", err)
+            })?
+        };
+
         // Detect schema conflicts using raw inferred schema vs existing stream schema
         // Pass the actual values and schema_version to check if values can be coerced to existing types
         let conflicts = detect_schema_conflicts(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/json.rs` around lines 97 - 99, The code currently infers raw_inferred_schema unconditionally even when static_schema_flag is true; move the raw schema inference so it only runs inside the else branch where value_arr is reconstructed (i.e., perform the raw_inferred_schema computation after checking static_schema_flag and inside the else block that produces value_arr) to avoid doing unnecessary work on the static-schema path; update any uses of raw_inferred_schema to be scoped where it is needed (inside the else) and ensure nothing outside the branch relies on it.
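To make the suggested gating concrete, here is a hypothetical, self-contained sketch of the pattern: skip inference entirely on the static-schema path and run it only inside the `else` branch. `infer_schema` and `prepare_batch` are illustrative stand-ins, not the real parseable functions.

```rust
// Stand-in for infer_json_schema_from_iterator: pretend every value is Utf8.
fn infer_schema(values: &[&str]) -> Vec<String> {
    values.iter().map(|v| format!("{v}: Utf8")).collect()
}

fn prepare_batch<'a>(values: Vec<&'a str>, static_schema_flag: bool) -> Vec<&'a str> {
    if static_schema_flag {
        // Static-schema path: the stored schema is authoritative, so the
        // (potentially expensive) inference is skipped entirely.
        values
    } else {
        // Dynamic path: infer the raw schema first so conflicts are detected
        // against the raw types, before any type overriding happens.
        let inferred = infer_schema(&values);
        assert_eq!(inferred.len(), values.len());
        values
    }
}

fn main() {
    assert_eq!(prepare_batch(vec!["a", "b"], true).len(), 2);
    assert_eq!(prepare_batch(vec!["a"], false).len(), 1);
    println!("ok");
}
```

The values pass through unchanged either way; the only difference is whether the inference work runs at all.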
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 11cf4bd2-d59a-4343-8ec8-9dfdf5098573
📒 Files selected for processing (1)
src/event/format/json.rs
remove time partition headers from create/update logstream
8838f6d
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/event/format/mod.rs (1)
587-634: ⚠️ Potential issue | 🟠 Major: Docstring-code mismatch: the function no longer implements the second short-circuit condition, causing a test failure.
The function's docstring claims it short-circuits when "no inferred field shares both name and type with storage," but this condition is not implemented in the current code; only `values.len() <= 1` is checked. The test `rename_per_record_short_circuits_when_no_field_overlap_at_same_type` (lines 1171-1191) expects this behavior: storage has `"escaped"` → Utf8, inferred has `"escaped"` → Boolean. With two values:
- `values.len() = 2` → the loop runs (no short-circuit)
- `target_type = Utf8` (from storage)
- `value_compatible_with_type(&true, &Utf8, V1)` returns `false` (line 477: requires `value.is_string()`)
- The field gets renamed to `escaped_bool`
- The test assertion that the `"escaped"` key remains will fail

Fix: Either update the docstring to reflect the new behavior, or restore the second short-circuit condition. The change appears intentional (based on commit 8838f6d to catch within-batch type mismatches), but the test and docstring were not updated accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 587 - 634, The docstring promised a second short-circuit that returns early when "no inferred field shares both name and type with storage"; restore that check before the mapping loop: after building inferred_types and before values.into_iter(), add a guard that scans existing_schema (or existing_schema.values()) and returns values unchanged if there is no field name for which inferred_types.get(name) is Some(same DataType as existing_schema entry). Refer to inferred_types, existing_schema and the mapping block that begins with values.into_iter().map(...) — implement the early return using those symbols so the function behavior matches the docstring and the failing test.
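The rename walk-through above can be reproduced with a minimal stand-alone sketch; `Val`, `DataType`, `value_compatible_with_type`, and the `_bool`/`_str` suffixes here are simplified assumptions, not the actual parseable definitions.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum Val {
    Str(String),
    Bool(bool),
}

#[derive(Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    Boolean,
}

fn value_compatible_with_type(v: &Val, t: DataType) -> bool {
    matches!(
        (v, t),
        (Val::Str(_), DataType::Utf8) | (Val::Bool(_), DataType::Boolean)
    )
}

fn suffix_for(v: &Val) -> &'static str {
    match v {
        Val::Str(_) => "_str",
        Val::Bool(_) => "_bool",
    }
}

// Rename any key whose value is incompatible with the type stored for it.
fn rename_mismatches(
    record: HashMap<String, Val>,
    storage: &HashMap<String, DataType>,
) -> HashMap<String, Val> {
    record
        .into_iter()
        .map(|(k, v)| match storage.get(&k) {
            Some(&t) if !value_compatible_with_type(&v, t) => {
                (format!("{k}{}", suffix_for(&v)), v)
            }
            _ => (k, v),
        })
        .collect()
}

fn main() {
    // Mirrors the review's walk-through: storage has "escaped" -> Utf8,
    // the incoming value is a Boolean, so the field is renamed.
    let storage = HashMap::from([("escaped".to_string(), DataType::Utf8)]);
    let record = HashMap::from([("escaped".to_string(), Val::Bool(true))]);
    let out = rename_mismatches(record, &storage);
    assert!(out.contains_key("escaped_bool"));
    assert!(!out.contains_key("escaped"));
    println!("ok");
}
```

Under this sketch the failing test's expectation (the `"escaped"` key surviving) can only hold if a short-circuit runs before the rename loop.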
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 0d9f6213-244f-47d1-ba03-14776f79ce26
📒 Files selected for processing (3)
src/event/format/mod.rs
src/handlers/http/modal/utils/logstream_utils.rs
src/parseable/mod.rs
753acfd to b58d891 (Compare)
🧹 Nitpick comments (1)
src/event/format/mod.rs (1)
1106-1118: 💤 Low value: Consider updating the test comment for accuracy.
The comment "Fields not in storage are passed through (let arrow infer them fresh)" is now slightly misleading given the new behavior. Fields absent from storage are now validated against `inferred_types` (the batch-level inference). This test still passes because both boolean values are compatible with the inferred `Boolean` type, but the reason differs from what the comment suggests.
📝 Suggested comment update
```diff
     #[test]
     fn rename_per_record_skips_unknown_fields() {
-        // Fields not in storage are passed through (let arrow infer them fresh).
+        // Fields not in storage are validated against the batch-level inferred type.
+        // Here all values are compatible with the inferred Boolean, so no rename.
         let storage: HashMap<String, Arc<Field>> = HashMap::new();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 1106 - 1118, Update the test comment in the rename_per_record_skips_unknown_fields test to reflect the new behavior: instead of saying "Fields not in storage are passed through (let arrow infer them fresh)", state that fields absent from storage are validated against the batch-level inferred schema (the inferred variable / inferred_types) rather than being left to Arrow to infer; reference the storage HashMap, inferred (Schema), and the call to rename_per_record_type_mismatches with SchemaVersion::V1 to clarify where this validation occurs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 1dca6d4a-e26c-48d1-bd7b-9d61ed7df2a7
📒 Files selected for processing (1)
src/event/format/mod.rs
b58d891 to 8b290af (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/event/format/mod.rs (1)
574-580: ⚡ Quick win: Update stale function docs to match new control flow.
The doc block still says this is a no-op when inferred/storage name+type overlap is missing, but that guard was removed and Line 587 now only short-circuits for single-record batches. Please align the comment with current behavior.
Also applies to: 587-589
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/event/format/mod.rs` around lines 574 - 580, Update the stale doc comment above the function that short-circuits for single-record batches so it no longer claims the no-op also applies when "no inferred field shares both name and type with storage"; instead state that the no-op now only occurs for batches of at most one record (the earlier guard about inferred/storage name+type overlap was removed). Adjust the wording in the doc block (and the related comment at the other occurrence) to reference that detect_schema_conflicts still handles single-record type inference, and remove the outdated claim about arrow absorbing mixed-type records.
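As a sketch of the two guards the docstring describes, here is a hypothetical stand-alone version with the removed overlap check restored; the function name, type strings, and shapes are illustrative, not the real API.

```rust
use std::collections::HashMap;

// Guard 1 is the check still present in the code under review; guard 2 is the
// one the review says was removed from the implementation.
fn should_short_circuit(
    num_records: usize,
    inferred_types: &HashMap<String, &'static str>,
    storage: &HashMap<String, &'static str>,
) -> bool {
    if num_records <= 1 {
        // Guard 1: batches of at most one record are left untouched.
        return true;
    }
    // Guard 2: short-circuit when no field name has an inferred type that
    // matches the stored type, since no per-record rename could apply.
    !storage
        .iter()
        .any(|(name, ty)| inferred_types.get(name) == Some(ty))
}

fn main() {
    let storage = HashMap::from([("escaped".to_string(), "Utf8")]);
    let inferred = HashMap::from([("escaped".to_string(), "Boolean")]);
    // Two records but no name+type overlap: with the restored guard, the
    // failing test's expectation ("escaped" stays un-renamed) would hold.
    assert!(should_short_circuit(2, &inferred, &storage));
    assert!(should_short_circuit(1, &inferred, &storage));
    println!("ok");
}
```

Either restoring guard 2 or rewording the docstring resolves the mismatch; this sketch only illustrates the former.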
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/event/format/mod.rs`:
- Around line 615-621: The early return that passes through arrays/objects (the
val.is_array() || val.is_object() branch) must be moved to after the code that
resolves target_type and applied only when the resolved target_type is a
structural Arrow type (List, LargeList, Struct, or Map); update the logic around
value_compatible_with_type to first resolve target_type and then, if val is
array/object, only skip validation when target_type.is_list() ||
target_type.is_large_list() || target_type.is_struct() || target_type.is_map(),
otherwise let value_compatible_with_type run and handle/rename incompatible
scalar targets as before (refer to val.is_array(), val.is_object(), target_type,
and value_compatible_with_type to locate and change the code).
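A minimal sketch of the ordering the inline comment asks for: resolve `target_type` first, then let array/object values skip validation only when the target itself is structural. `Val` and `DataType` are simplified stand-ins for serde_json values and Arrow types.

```rust
#[derive(PartialEq, Debug)]
enum DataType {
    Utf8,
    List,
    Struct,
}

enum Val {
    Str(&'static str),
    Array,
    Object,
}

fn is_structural(t: &DataType) -> bool {
    // Stand-in for List / LargeList / Struct / Map in the real Arrow type set.
    matches!(t, DataType::List | DataType::Struct)
}

fn compatible(v: &Val, target_type: &DataType) -> bool {
    match v {
        // Array/object values pass only when the resolved target is structural;
        // against a scalar target they must fail validation (and get renamed).
        Val::Array | Val::Object => is_structural(target_type),
        Val::Str(_) => *target_type == DataType::Utf8,
    }
}

fn main() {
    // Previously an object value was skipped unconditionally; here it fails
    // against a scalar Utf8 target, as the comment requires.
    assert!(!compatible(&Val::Object, &DataType::Utf8));
    assert!(compatible(&Val::Array, &DataType::List));
    assert!(compatible(&Val::Str("x"), &DataType::Utf8));
    println!("ok");
}
```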
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f8b01b86-f4f2-437c-b380-8c7b1b2bd922
📒 Files selected for processing (1)
src/event/format/mod.rs
Summary by CodeRabbit
Features
Bug Fixes / Changes