Harden rollout JSONL writes#24471
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7581cbdd6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let mut file = std::fs::OpenOptions::new() | ||
| .read(true) | ||
| .write(true) | ||
| .create(true) | ||
| .open(path) | ||
| .truncate(false) |
There was a problem hiding this comment.
Preserve append mode for live rollout writers
When a live thread has been persisted and then metadata changes, update_thread_metadata appends a SessionMeta line through append_rollout_item_to_path while this live writer handle stays open. Because this handle is now plain read/write instead of append(true), its cursor remains at the old EOF; the next live rollout item writes from that stale offset and overwrites the metadata append, corrupting the JSONL history. Keep append semantics on the recorder handle while still allowing the read/truncate operations used for repair.
Useful? React with 👍 / 👎.
| let mut file = std::fs::OpenOptions::new() | ||
| .read(true) | ||
| .write(true) | ||
| .open(&path)?; |
There was a problem hiding this comment.
Preserve append mode for resumed rollout writers
The same stale-offset corruption affects resumed live threads: after RolloutRecorder::new(...Resume...) opens this non-append handle and later update_thread_metadata appends a metadata line via a separate append-only handle, subsequent resumed-thread writes use this handle's old EOF position and overwrite that appended record. The resumed writer should retain append/O_APPEND behavior in addition to the read/write permissions needed for final-line repair.
Useful? React with 👍 / 👎.
| let start_len = self.file.metadata().await?.len(); | ||
| if let Err(err) = async { | ||
| self.file.write_all(json.as_bytes()).await?; | ||
| self.file.flush().await | ||
| } | ||
| .await | ||
| { | ||
| if let Err(rollback_err) = self.file.set_len(start_len).await { |
There was a problem hiding this comment.
Avoid truncating records from other appenders
If another writer appends to the same rollout after this start_len is captured and this writer then hits a write/flush error, set_len(start_len) rolls the entire file back to the stale length and deletes the other writer's successfully appended records. This can happen with the separate metadata append path (append_rollout_item_to_path) or another resumed process, so the rollback needs to be guarded to only remove bytes written by this attempt, or skipped when the file has grown beyond this write's range.
Useful? React with 👍 / 👎.
Why
Mitigates #24425
A rollout write can fail after partially appending a JSONL record. The retry path then reopens the same file and writes the pending item again, which can splice a partial record and a retried full record onto the same physical line. That leaves the rollout unreadable even though the retried item itself is valid JSON.
What changed
ErrorKindand raw OS errno for rollout writer failures so future filesystem failures are easier to diagnose.