fix(json): retry async JSON reads when atomic rename races#19
Conversation
|
Codex review: needs maintainer review before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. Source inspection shows current main performs one async JSON read and throws a single untyped PR rating What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Next step before merge Security Review detailsBest possible solution: Land the bounded async JSON retry with the typed read-race sentinel and included regression coverage after ordinary maintainer review. Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main performs one async JSON read and throws a single untyped Is this the best way to solve the issue? Yes. Retrying the typed transient read-race sentinel at the async JSON boundary is narrower than changing all regular-file reads, and the branch preserves parse, sync-reader, and unrelated failure behavior. Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 81bf0da0781f. |
|
@clawsweeper re-review Added real-fs proof. New
Parse errors and other errors stay at 0 in both runs, so the retry is not masking corruption. |
94164ef to
d3d1841
Compare
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
ClawSweeper PR egg ✨ Hatched: 💎 rare Pearl Review Wisp Hatch commandComment Hatchability rules:
Rarity: 💎 rare. What is this egg doing here?
|
Tag the regular-file identity-change throw as FsSafeError("path-mismatch"),
matching the sentinel already used by file-store/directory-guard/pinned-write
for the same temp-file + rename race semantic. Wrap readJson, readJsonIfExists,
and tryReadJson in a bounded retry (max 3, 50ms exponential backoff) that only
retries on path-mismatch, so transient temp-file + rename rotations no longer
surface as JsonFileReadError to callers while parse errors and other read
failures still fail fast on the first attempt.
Add a real-disk regression that spawns a concurrent writeTextAtomic loop and a
readJson loop on a mkdtemp scratch file and asserts zero raceErrors over a
1-second window (per docs/contributing.md "use vi.mock sparingly"). Retain the
mocked exhaustion test for the retry-budget branch, which is hard to hit
deterministically on real disk.
Refs openclaw/openclaw#83657
d3d1841 to
66604d6
Compare
Summary
FsSafeError("path-mismatch")instead of a rawError, matching the sentinel already used insrc/file-store.ts:149,src/directory-guard.ts:35,src/pinned-write.ts:292, andsrc/pinned-python.ts:514for the same temp-file + rename race semantic.readJson,readJsonIfExists, andtryReadJsonin a bounded retry (max 3 attempts, 50ms exponential backoff) that only retries onpath-mismatch. Parse errors and other read failures still fail fast on the first attempt, so corruption is not masked.test/json.test.tsthat spawns a concurrentwriteTextAtomicloop against areadJsonloop on amkdtempscratch file and asserts zeroraceErrorsover a 1-second window. Keep one mocked test for the retry-budget exhaustion branch (hard to hit deterministically on real disk;docs/contributing.mdallowsvi.mocksparingly for cases like this).Why
openclaw/openclaw(and any other consumer of@openclaw/fs-safe/json) is hittingJsonFileReadError: Failed to read JSON file: .../paired.jsonunder normal multi-client operation. The writer uses temp-file +rename(atomic, by design). Concurrent readers race:statRegularFilereads the pre-rename inode,fs.openthen resolves to the post-rename inode,verifyStableReadTargetdetects the identity mismatch insrc/regular-file.ts:174-179and throws.That throw is correct - the read genuinely landed on a different file - but it is a transient state, not corruption. Today the only error type bubbled up is a generic
Error, so callers cannot retry without doing fragile message matching. The fix typifies it with the existingpath-mismatchcode and adds the bounded retry the issue asks for, scoped to the async public readers.Tracks openclaw/openclaw#83657.
Scope notes
readJsonSync,readRootJsonSync,tryReadJsonSync) are intentionally unchanged - they cannotawaitand the issue's pseudocode is async. A sync retry on top ofAtomics.waitor a tight spin loop is a separate decision and can land in a follow-up if anyone hits the race on the sync paths.readRegularFile/readRegularFileSyncthemselves are unchanged in behavior; only the error type the race emits is tightened. Callers that previously string-matched on the message still match the same message (File changed during read: <path>).Real behavior proof
Real disk, no
fs.openmock. The new vitest caserecovers readJson from a concurrent real atomic rewriteruns concurrentreadJsonandwriteTextAtomicloops against amkdtempscratch file for 1 second and countsraceErrorsby matching the cause message.Against
upstream/main(src/json.ts+src/regular-file.tsreverted to upstream, new tests in place):79 real
JsonFileReadErrors caused byFile changed during readduring a 1-second concurrent run - exact symptom from openclaw/openclaw#83657. The second failure also pins down today's behavior as "one attempt, no retry."With the fix:
All clean. The same real-disk concurrent loop that produced 79 race errors against upstream produces zero against the patched build, without
parseErrorsorotherErrors, confirming the retry is not masking corruption or unrelated failures.What was not tested: a real multi-process Discord + Control UI + CLI reproduction against
openclaw/openclaw's gateway. The race lives in the kernel rename path, so an in-process concurrent test exercises the same code; a multi-process reproduction behaves the same.Verification
pnpm test test/json.test.tspnpm testpnpm buildpnpm lint:fs-boundarypnpm lint:file-size