fix: commit schema bug by parmesant · Pull Request #1666 · parseablehq/parseable

parmesant · 2026-06-02T14:54:11Z

Parseable enterprise failed Quest test due to schema not being present in metastore as soon as possible

Description

This PR has:

been tested to ensure log ingestion and log query works.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added documentation for new or modified features or behaviors.

Summary by CodeRabbit

Refactor
- Improved event processing and schema staging flow for atomic, serialized schema updates to avoid conflicts.
Bug Fixes
- Schema JSON/parse failures are now surfaced instead of being ignored.
- Disk write flushing is now configurable and time-aware to ensure timely and reliable batch persistence.

coderabbitai · 2026-06-02T14:54:26Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d404ab71-93d9-43f0-84e4-c06b7d164da9

📥 Commits

Reviewing files that changed from the base of the PR and between 22fbbef and ba18bc9.

📒 Files selected for processing (2)

src/parseable/staging/writer.rs
src/parseable/streams.rs

🚧 Files skipped from review as they are similar to previous changes (2)

src/parseable/streams.rs
src/parseable/staging/writer.rs

Walkthrough

Event processing obtains the Stream up-front, conditionally stages/merges the schema via a new Stream::stage_schema_file (which writes atomically and respects ingestor id), and reuses the Stream for pushes. Disk staging gains configurable row and time thresholds; JSON parse errors are surfaced via a StagingError::Json variant.

Changes

Schema staging and writer updates

Layer / File(s)	Summary
Stream fields, disk batch config, and push path `src/parseable/streams.rs`	Adds `DISK_WRITE_BATCH_ROWS` (env-configured Lazy), imports Lazy, adds `schema_writer: Mutex<()>` to `Stream` and initializes it, updates disk write call to pass `DISK_WRITE_BATCH_ROWS`, and changes `get_schemas_if_present` to return `Result<Vec<Schema>, StagingError>`.
stage_schema_file and prepare_parquet `src/parseable/streams.rs`	Adds `pub fn stage_schema_file(&self, mut schema: Schema) -> Result<(), StagingError>` that locks `schema_writer`, ensures staging dir, merges existing staging `.schema` (using ingestor_id when present), writes to `.schema.tmp` and renames atomically. `prepare_parquet` delegates to this helper for non-static schemas.
Disk writer: time-based flush and batch aging `src/parseable/staging/writer.rs`	Introduces `DISK_WRITE_BATCH_MAX_AGE_SECS` (env-configured Lazy), tracks `first_seen` on `PendingDiskBatch`, updates `push_disk` to accept `batch_rows` and flush based on rows or max-age, and updates `take_flushable_disk` filtering accordingly.
Staging error variant and Event::process stream reuse `src/parseable/staging/mod.rs`, `src/event/mod.rs`	Adds `StagingError::Json(#[from] serde_json::Error)` to surface JSON errors. `Event::process` now binds a stream from `PARSEABLE.get_or_create_stream(...)` early, conditionally calls `stream.stage_schema_file(...)` on first event when `!stream.get_static_schema_flag()`, then uses `stream.push(...)` to ingest the event.

Sequence Diagram

sequenceDiagram
  participant Event as Event::process
  participant PARSEABLE
  participant Stream
  participant StagingWriter
  participant Filesystem
  Event->>PARSEABLE: get_or_create_stream(stream_name)
  PARSEABLE-->>Event: Stream
  Event->>Event: is_first_event && !stream.get_static_schema_flag()
  Event->>Stream: stage_schema_file(schema)
  Stream->>StagingWriter: open staging dir / read existing *.schema
  StagingWriter->>Filesystem: read/parse existing schema files
  Stream->>Stream: merge schemas (Schema::try_merge)
  Stream->>Filesystem: write merged schema to *.schema.tmp -> rename to *.schema
  Stream-->>Event: Ok(())
  Event->>Stream: push(event_data)
  Stream->>StagingWriter: push_disk(..., batch_rows=DISK_WRITE_BATCH_ROWS)
  StagingWriter->>StagingWriter: buffer, flush when rows>=batch_rows or age>=DISK_WRITE_BATCH_MAX_AGE_SECS

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

parseablehq/parseable#1381: Similar reorder to create streams before performing schema merge/commit/staging.
parseablehq/parseable#1258: Related changes to stream creation/get_or_create impacting upfront stream acquisition.
parseablehq/parseable#1201: Related error-typing changes around staging/commit paths interacting with StagingError.

Suggested labels

for next release

Suggested reviewers

nikhilsinhaparseable
nitisht

"I thumped a drum of schema bytes,
I stitched the files in careful rows,
I timed the flush, counted the rows,
I locked the writer, renamed the prose,
A hopping patch to keep data close. 🐇"

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is largely incomplete. It mentions the Quest test failure and schema metastore timing issue but lacks detailed explanation of the solution, key changes made, and all required checklist items are unchecked.	Complete the description with detailed explanation of changes made, mark relevant checklist items, and explain how the schema staging fix addresses the metastore availability issue.
Docstring Coverage	⚠️ Warning	Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'fix: commit schema bug' is vague and does not clearly summarize the main change. It mentions 'schema bug' but does not specify what the bug is or how it is being fixed.	Provide a more specific title that explains the actual fix, such as 'fix: stage schema file earlier during event processing' or similar that reflects the key behavioral change.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/event/mod.rs (1)
91-95: ⚡ Quick win

Add a regression test for the new first-event staging path.

This branch is the actual fix for the Quest failure, but I don’t see coverage here asserting that a dynamic-schema stream has a staged *.schema before the first push()/parquet cycle. A focused test around Event::process would make this much harder to regress.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/event/mod.rs` around lines 91 - 95, Add a regression test that exercises
the first-event staging path in Event::process: create a dynamic-schema stream
(stream.get_static_schema_flag() == false), invoke Event::process (or the flow
that sets self.is_first_event) and assert that commit_schema(&self.stream_name,
...) was called and that stream.stage_schema_file received/created the expected
"*.schema" before any push()/parquet cycle; use the Event::process entry point,
the is_first_event flag, and verify side-effects against commit_schema and
stage_schema_file (or inspect the stream's staged schema via get_schema()) to
ensure the schema file is staged on the first event.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parseable/streams.rs`:
- Around line 510-527: stage_schema_file currently does an unsynchronized
read/merge/write that can lose fields or read partial JSON; fix by serializing
updates with a per-stream lock and writing via a temp-file + atomic rename:
add/use a per-stream Mutex/RwLock (e.g., a field on the struct guarding schema
ops) and acquire it at the start of stage_schema_file, call
get_schemas_if_present (or re-read the staging file) while holding the lock,
merge via Schema::try_merge, then write the merged bytes to a temp path (e.g.,
same dir with a .tmp suffix or .ingestor.{id}.schema.tmp) and atomically rename
to the final schema filename (replace the current RelativePathBuf usage so final
name is the intended .ingestor.{id}.schema); also ensure get_schemas_if_present
is only used/read under the same lock or returns errors instead of silently
ignoring deserialization failures so partial writes are not treated as absent.

---

Nitpick comments:
In `@src/event/mod.rs`:
- Around line 91-95: Add a regression test that exercises the first-event
staging path in Event::process: create a dynamic-schema stream
(stream.get_static_schema_flag() == false), invoke Event::process (or the flow
that sets self.is_first_event) and assert that commit_schema(&self.stream_name,
...) was called and that stream.stage_schema_file received/created the expected
"*.schema" before any push()/parquet cycle; use the Event::process entry point,
the is_first_event flag, and verify side-effects against commit_schema and
stage_schema_file (or inspect the stream's staged schema via get_schema()) to
ensure the schema file is staged on the first event.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d98e25e0-7082-48cd-85ab-0a5c8c5731e7

📥 Commits

Reviewing files that changed from the base of the PR and between cefe210 and bdf5e17.

📒 Files selected for processing (2)

src/event/mod.rs
src/parseable/streams.rs

pending batches by default get pushed to disk near instantly schema writer is behind a mutex to prevent incorrect schemas

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/parseable/streams.rs (1)

451-464: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only treat NotFound as “no schema.”

stage_schema_file() now depends on this helper for merge correctness. Returning Ok(vec![]) for any read_dir failure can silently drop previously staged fields from the merged schema, and File::open(...).expect(...) can still panic on a transient race in the staging dir. Please propagate non-NotFound I/O errors here instead of treating them as an empty schema set.

Suggested fix

 pub fn get_schemas_if_present(&self) -> Result<Vec<Schema>, StagingError> {
-    let Ok(dir) = self.data_path.read_dir() else {
-        return Ok(vec![]);
-    };
+    let dir = match self.data_path.read_dir() {
+        Ok(dir) => dir,
+        Err(err) if err.kind() == std::io::ErrorKind::NotFound => return Ok(vec![]),
+        Err(err) => return Err(err.into()),
+    };

     let mut schemas: Vec<Schema> = Vec::new();

     for file in dir.flatten() {
         if let Some(ext) = file.path().extension()
             && ext.eq("schema")
         {
-            let file = File::open(file.path()).expect("Schema File should exist");
+            let file = match File::open(file.path()) {
+                Ok(file) => file,
+                Err(err) if err.kind() == std::io::ErrorKind::NotFound => continue,
+                Err(err) => return Err(err.into()),
+            };

             schemas.push(serde_json::from_reader(file)?);
         }
     }

     Ok(schemas)
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parseable/streams.rs` around lines 451 - 464, In get_schemas_if_present,
don’t treat every read_dir error as “no schemas” — only return Ok(vec![]) when
the error kind is NotFound; for other I/O errors propagate them as a
StagingError (e.g. map_err into StagingError::Io or the crate’s equivalent).
Also remove the File::open(...).expect(...) panic: open the schema file using ?
(or map_err) so transient races return an error instead of panicking, and
propagate serde_json::from_reader errors as currently done; this ensures
stage_schema_file and merge logic won’t silently drop staged fields.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/parseable/streams.rs`:
- Around line 451-464: In get_schemas_if_present, don’t treat every read_dir
error as “no schemas” — only return Ok(vec![]) when the error kind is NotFound;
for other I/O errors propagate them as a StagingError (e.g. map_err into
StagingError::Io or the crate’s equivalent). Also remove the
File::open(...).expect(...) panic: open the schema file using ? (or map_err) so
transient races return an error instead of panicking, and propagate
serde_json::from_reader errors as currently done; this ensures stage_schema_file
and merge logic won’t silently drop staged fields.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9d8d1151-be65-44dd-aded-da15ddb78480

📥 Commits

Reviewing files that changed from the base of the PR and between bdf5e17 and 22fbbef.

📒 Files selected for processing (3)

src/parseable/staging/mod.rs
src/parseable/staging/writer.rs
src/parseable/streams.rs

fix: commit schema bug

bdf5e17

coderabbitai Bot requested changes Jun 2, 2026

View reviewed changes

Comment thread src/parseable/streams.rs Outdated

fix: flush pending batches to disk, schema mutex

22fbbef

pending batches by default get pushed to disk near instantly schema writer is behind a mutex to prevent incorrect schemas

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

coderabbitai Bot previously approved these changes Jun 3, 2026

View reviewed changes

fix: deepsource and coderabbit

ba18bc9

parmesant dismissed coderabbitai[bot]’s stale review via ba18bc9 June 3, 2026 06:18

coderabbitai Bot approved these changes Jun 3, 2026

View reviewed changes

nikhilsinhaparseable approved these changes Jun 3, 2026

View reviewed changes

nikhilsinhaparseable merged commit 515f0cf into parseablehq:main Jun 3, 2026
12 of 13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: commit schema bug#1666

fix: commit schema bug#1666
nikhilsinhaparseable merged 3 commits into
parseablehq:mainfrom
parmesant:commit_schema_fix

parmesant commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

parmesant commented Jun 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Possibly Related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (2 warnings, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

parmesant commented Jun 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading