Skip to content

Harden rollout JSONL writes#24471

Closed
etraut-openai wants to merge 1 commit into
mainfrom
etraut/rollout-jsonl-repair
Closed

Harden rollout JSONL writes#24471
etraut-openai wants to merge 1 commit into
mainfrom
etraut/rollout-jsonl-repair

Conversation

@etraut-openai
Copy link
Copy Markdown
Collaborator

@etraut-openai etraut-openai commented May 25, 2026

Why

Mitigates #24425

A rollout write can fail after partially appending a JSONL record. The retry path then reopens the same file and writes the pending item again, which can splice a partial record and a retried full record onto the same physical line. That leaves the rollout unreadable even though the retried item itself is valid JSON.

What changed

  • Roll back a failed JSONL line write to the file length captured before the write started.
  • Repair the final rollout line before appending: preserve valid JSONL that is missing only its trailing newline, and truncate malformed trailing fragments.
  • Open rollout files with read/write access so repair and rollback can truncate on platforms where append-only handles are not sufficient.
  • Log ErrorKind and raw OS errno for rollout writer failures so future filesystem failures are easier to diagnose.

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7581cbdd6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1365 to +1369
let mut file = std::fs::OpenOptions::new()
.read(true)
.write(true)
.create(true)
.open(path)
.truncate(false)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve append mode for live rollout writers

When a live thread has been persisted and then metadata changes, update_thread_metadata appends a SessionMeta line through append_rollout_item_to_path while this live writer handle stays open. Because this handle is now plain read/write instead of append(true), its cursor remains at the old EOF; the next live rollout item writes from that stale offset and overwrites the metadata append, corrupting the JSONL history. Keep append semantics on the recorder handle while still allowing the read/truncate operations used for repair.

Useful? React with 👍 / 👎.

Comment on lines +701 to +704
let mut file = std::fs::OpenOptions::new()
.read(true)
.write(true)
.open(&path)?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve append mode for resumed rollout writers

The same stale-offset corruption affects resumed live threads: after RolloutRecorder::new(...Resume...) opens this non-append handle and later update_thread_metadata appends a metadata line via a separate append-only handle, subsequent resumed-thread writes use this handle's old EOF position and overwrite that appended record. The resumed writer should retain append/O_APPEND behavior in addition to the read/write permissions needed for final-line repair.

Useful? React with 👍 / 👎.

Comment on lines +1693 to +1700
let start_len = self.file.metadata().await?.len();
if let Err(err) = async {
self.file.write_all(json.as_bytes()).await?;
self.file.flush().await
}
.await
{
if let Err(rollback_err) = self.file.set_len(start_len).await {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid truncating records from other appenders

If another writer appends to the same rollout after this start_len is captured and this writer then hits a write/flush error, set_len(start_len) rolls the entire file back to the stale length and deletes the other writer's successfully appended records. This can happen with the separate metadata append path (append_rollout_item_to_path) or another resumed process, so the rollback needs to be guarded to only remove bytes written by this attempt, or skipped when the file has grown beyond this write's range.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant