Skip to content

fix: commit schema bug#1666

Merged
nikhilsinhaparseable merged 3 commits into
parseablehq:mainfrom
parmesant:commit_schema_fix
Jun 3, 2026
Merged

fix: commit schema bug#1666
nikhilsinhaparseable merged 3 commits into
parseablehq:mainfrom
parmesant:commit_schema_fix

Conversation

@parmesant
Copy link
Copy Markdown
Contributor

@parmesant parmesant commented Jun 2, 2026

Parseable enterprise failed Quest test due to schema not being present in metastore as soon as possible

Description


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • Refactor

    • Improved event processing and schema staging flow for atomic, serialized schema updates to avoid conflicts.
  • Bug Fixes

    • Schema JSON/parse failures are now surfaced instead of being ignored.
    • Disk write flushing is now configurable and time-aware to ensure timely and reliable batch persistence.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d404ab71-93d9-43f0-84e4-c06b7d164da9

📥 Commits

Reviewing files that changed from the base of the PR and between 22fbbef and ba18bc9.

📒 Files selected for processing (2)
  • src/parseable/staging/writer.rs
  • src/parseable/streams.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/parseable/streams.rs
  • src/parseable/staging/writer.rs

Walkthrough

Event processing obtains the Stream up-front, conditionally stages/merges the schema via a new Stream::stage_schema_file (which writes atomically and respects ingestor id), and reuses the Stream for pushes. Disk staging gains configurable row and time thresholds; JSON parse errors are surfaced via a StagingError::Json variant.

Changes

Schema staging and writer updates

Layer / File(s) Summary
Stream fields, disk batch config, and push path
src/parseable/streams.rs
Adds DISK_WRITE_BATCH_ROWS (env-configured Lazy), imports Lazy, adds schema_writer: Mutex<()> to Stream and initializes it, updates disk write call to pass DISK_WRITE_BATCH_ROWS, and changes get_schemas_if_present to return Result<Vec<Schema>, StagingError>.
stage_schema_file and prepare_parquet
src/parseable/streams.rs
Adds pub fn stage_schema_file(&self, mut schema: Schema) -> Result<(), StagingError> that locks schema_writer, ensures staging dir, merges existing staging *.schema (using ingestor_id when present), writes to *.schema.tmp and renames atomically. prepare_parquet delegates to this helper for non-static schemas.
Disk writer: time-based flush and batch aging
src/parseable/staging/writer.rs
Introduces DISK_WRITE_BATCH_MAX_AGE_SECS (env-configured Lazy), tracks first_seen on PendingDiskBatch, updates push_disk to accept batch_rows and flush based on rows or max-age, and updates take_flushable_disk filtering accordingly.
Staging error variant and Event::process stream reuse
src/parseable/staging/mod.rs, src/event/mod.rs
Adds StagingError::Json(#[from] serde_json::Error) to surface JSON errors. Event::process now binds a stream from PARSEABLE.get_or_create_stream(...) early, conditionally calls stream.stage_schema_file(...) on first event when !stream.get_static_schema_flag(), then uses stream.push(...) to ingest the event.

Sequence Diagram

sequenceDiagram
  participant Event as Event::process
  participant PARSEABLE
  participant Stream
  participant StagingWriter
  participant Filesystem
  Event->>PARSEABLE: get_or_create_stream(stream_name)
  PARSEABLE-->>Event: Stream
  Event->>Event: is_first_event && !stream.get_static_schema_flag()
  Event->>Stream: stage_schema_file(schema)
  Stream->>StagingWriter: open staging dir / read existing *.schema
  StagingWriter->>Filesystem: read/parse existing schema files
  Stream->>Stream: merge schemas (Schema::try_merge)
  Stream->>Filesystem: write merged schema to *.schema.tmp -> rename to *.schema
  Stream-->>Event: Ok(())
  Event->>Stream: push(event_data)
  Stream->>StagingWriter: push_disk(..., batch_rows=DISK_WRITE_BATCH_ROWS)
  StagingWriter->>StagingWriter: buffer, flush when rows>=batch_rows or age>=DISK_WRITE_BATCH_MAX_AGE_SECS
Loading

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

Suggested labels

for next release

Suggested reviewers

  • nikhilsinhaparseable
  • nitisht

"I thumped a drum of schema bytes,
I stitched the files in careful rows,
I timed the flush, counted the rows,
I locked the writer, renamed the prose,
A hopping patch to keep data close. 🐇"

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is largely incomplete. It mentions the Quest test failure and schema metastore timing issue but lacks detailed explanation of the solution, key changes made, and all required checklist items are unchecked. Complete the description with detailed explanation of changes made, mark relevant checklist items, and explain how the schema staging fix addresses the metastore availability issue.
Docstring Coverage ⚠️ Warning Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'fix: commit schema bug' is vague and does not clearly summarize the main change. It mentions 'schema bug' but does not specify what the bug is or how it is being fixed. Provide a more specific title that explains the actual fix, such as 'fix: stage schema file earlier during event processing' or similar that reflects the key behavioral change.
✅ Passed checks (2 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/event/mod.rs (1)

91-95: ⚡ Quick win

Add a regression test for the new first-event staging path.

This branch is the actual fix for the Quest failure, but I don’t see coverage here asserting that a dynamic-schema stream has a staged *.schema before the first push()/parquet cycle. A focused test around Event::process would make this much harder to regress.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/event/mod.rs` around lines 91 - 95, Add a regression test that exercises
the first-event staging path in Event::process: create a dynamic-schema stream
(stream.get_static_schema_flag() == false), invoke Event::process (or the flow
that sets self.is_first_event) and assert that commit_schema(&self.stream_name,
...) was called and that stream.stage_schema_file received/created the expected
"*.schema" before any push()/parquet cycle; use the Event::process entry point,
the is_first_event flag, and verify side-effects against commit_schema and
stage_schema_file (or inspect the stream's staged schema via get_schema()) to
ensure the schema file is staged on the first event.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parseable/streams.rs`:
- Around line 510-527: stage_schema_file currently does an unsynchronized
read/merge/write that can lose fields or read partial JSON; fix by serializing
updates with a per-stream lock and writing via a temp-file + atomic rename:
add/use a per-stream Mutex/RwLock (e.g., a field on the struct guarding schema
ops) and acquire it at the start of stage_schema_file, call
get_schemas_if_present (or re-read the staging file) while holding the lock,
merge via Schema::try_merge, then write the merged bytes to a temp path (e.g.,
same dir with a .tmp suffix or .ingestor.{id}.schema.tmp) and atomically rename
to the final schema filename (replace the current RelativePathBuf usage so final
name is the intended .ingestor.{id}.schema); also ensure get_schemas_if_present
is only used/read under the same lock or returns errors instead of silently
ignoring deserialization failures so partial writes are not treated as absent.

---

Nitpick comments:
In `@src/event/mod.rs`:
- Around line 91-95: Add a regression test that exercises the first-event
staging path in Event::process: create a dynamic-schema stream
(stream.get_static_schema_flag() == false), invoke Event::process (or the flow
that sets self.is_first_event) and assert that commit_schema(&self.stream_name,
...) was called and that stream.stage_schema_file received/created the expected
"*.schema" before any push()/parquet cycle; use the Event::process entry point,
the is_first_event flag, and verify side-effects against commit_schema and
stage_schema_file (or inspect the stream's staged schema via get_schema()) to
ensure the schema file is staged on the first event.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d98e25e0-7082-48cd-85ab-0a5c8c5731e7

📥 Commits

Reviewing files that changed from the base of the PR and between cefe210 and bdf5e17.

📒 Files selected for processing (2)
  • src/event/mod.rs
  • src/parseable/streams.rs

Comment thread src/parseable/streams.rs Outdated
pending batches by default get pushed to disk near instantly

schema writer is behind a mutex to prevent incorrect schemas
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/parseable/streams.rs (1)

451-464: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only treat NotFound as “no schema.”

stage_schema_file() now depends on this helper for merge correctness. Returning Ok(vec![]) for any read_dir failure can silently drop previously staged fields from the merged schema, and File::open(...).expect(...) can still panic on a transient race in the staging dir. Please propagate non-NotFound I/O errors here instead of treating them as an empty schema set.

Suggested fix
 pub fn get_schemas_if_present(&self) -> Result<Vec<Schema>, StagingError> {
-    let Ok(dir) = self.data_path.read_dir() else {
-        return Ok(vec![]);
-    };
+    let dir = match self.data_path.read_dir() {
+        Ok(dir) => dir,
+        Err(err) if err.kind() == std::io::ErrorKind::NotFound => return Ok(vec![]),
+        Err(err) => return Err(err.into()),
+    };

     let mut schemas: Vec<Schema> = Vec::new();

     for file in dir.flatten() {
         if let Some(ext) = file.path().extension()
             && ext.eq("schema")
         {
-            let file = File::open(file.path()).expect("Schema File should exist");
+            let file = match File::open(file.path()) {
+                Ok(file) => file,
+                Err(err) if err.kind() == std::io::ErrorKind::NotFound => continue,
+                Err(err) => return Err(err.into()),
+            };

             schemas.push(serde_json::from_reader(file)?);
         }
     }

     Ok(schemas)
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parseable/streams.rs` around lines 451 - 464, In get_schemas_if_present,
don’t treat every read_dir error as “no schemas” — only return Ok(vec![]) when
the error kind is NotFound; for other I/O errors propagate them as a
StagingError (e.g. map_err into StagingError::Io or the crate’s equivalent).
Also remove the File::open(...).expect(...) panic: open the schema file using ?
(or map_err) so transient races return an error instead of panicking, and
propagate serde_json::from_reader errors as currently done; this ensures
stage_schema_file and merge logic won’t silently drop staged fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/parseable/streams.rs`:
- Around line 451-464: In get_schemas_if_present, don’t treat every read_dir
error as “no schemas” — only return Ok(vec![]) when the error kind is NotFound;
for other I/O errors propagate them as a StagingError (e.g. map_err into
StagingError::Io or the crate’s equivalent). Also remove the
File::open(...).expect(...) panic: open the schema file using ? (or map_err) so
transient races return an error instead of panicking, and propagate
serde_json::from_reader errors as currently done; this ensures stage_schema_file
and merge logic won’t silently drop staged fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9d8d1151-be65-44dd-aded-da15ddb78480

📥 Commits

Reviewing files that changed from the base of the PR and between bdf5e17 and 22fbbef.

📒 Files selected for processing (3)
  • src/parseable/staging/mod.rs
  • src/parseable/staging/writer.rs
  • src/parseable/streams.rs

coderabbitai[bot]
coderabbitai Bot previously approved these changes Jun 3, 2026
@nikhilsinhaparseable nikhilsinhaparseable merged commit 515f0cf into parseablehq:main Jun 3, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants