fix(json): retry async JSON reads when atomic rename races#19

Merged
steipete merged 5 commits into
openclaw:mainfrom
yetval:fix/json-read-retry-on-race
May 21, 2026

Conversation

@yetval
Contributor

@yetval yetval commented May 18, 2026

Summary

  • Tag the regular-file identity-change throw as FsSafeError("path-mismatch") instead of a raw Error, matching the sentinel already used in src/file-store.ts:149, src/directory-guard.ts:35, src/pinned-write.ts:292, and src/pinned-python.ts:514 for the same temp-file + rename race semantic.
  • Wrap readJson, readJsonIfExists, and tryReadJson in a bounded retry (max 3 attempts, 50ms exponential backoff) that only retries on path-mismatch. Parse errors and other read failures still fail fast on the first attempt, so corruption is not masked.
  • Add a real-disk regression in test/json.test.ts that spawns a concurrent writeTextAtomic loop against a readJson loop on a mkdtemp scratch file and asserts zero raceErrors over a 1-second window. Keep one mocked test for the retry-budget exhaustion branch (hard to hit deterministically on real disk; docs/contributing.md allows vi.mock sparingly for cases like this).
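The bounded retry described above can be sketched as follows. This is a minimal illustration, not the code that landed in src/json.ts: the FsSafeError class here is a hypothetical stand-in that models only a `code` field, and the helper name `retryOnPathMismatch` and its parameter names are assumptions. The defaults (3 attempts, 50ms base delay) are taken from the summary.

```typescript
// Hypothetical stand-in for the library's FsSafeError; only `code` is modeled.
class FsSafeError extends Error {
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.code = code;
    this.name = "FsSafeError";
  }
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Retry `read` only when it fails with the transient path-mismatch code.
// Any other failure (parse error, permission error, ...) is rethrown at once.
async function retryOnPathMismatch<T>(
  read: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 50,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await read();
    } catch (err) {
      const transient =
        err instanceof FsSafeError && err.code === "path-mismatch";
      if (!transient || attempt >= maxAttempts) throw err;
      // Exponential backoff: baseDelayMs, then 2x, then 4x, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

Branching on the error code rather than the message text is what lets parse errors fail on the first attempt while rename races are retried.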

Why

openclaw/openclaw (and any other consumer of @openclaw/fs-safe/json) is hitting JsonFileReadError: Failed to read JSON file: .../paired.json under normal multi-client operation. The writer uses temp-file + rename (atomic, by design). Concurrent readers race: statRegularFile reads the pre-rename inode, fs.open then resolves to the post-rename inode, verifyStableReadTarget detects the identity mismatch in src/regular-file.ts:174-179 and throws.

That throw is correct - the read genuinely landed on a different file - but it is a transient state, not corruption. Today the only error type bubbled up is a generic Error, so callers cannot retry without doing fragile message matching. The fix types the throw with the existing path-mismatch code and adds the bounded retry the issue asks for, scoped to the async public readers.
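The stat, open, verify sequence behind the race can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the code in src/regular-file.ts: `sameIdentity` and `readStable` are hypothetical names, and the sketch compares only the pre-open and post-open identities by device and inode number.

```typescript
import { open, stat } from "node:fs/promises";
import type { Stats } from "node:fs";

// Two stat results refer to the same file when device and inode both match.
function sameIdentity(a: Stats, b: Stats): boolean {
  return a.dev === b.dev && a.ino === b.ino;
}

// Hypothetical simplification of the stat -> open -> verify read sequence.
async function readStable(path: string): Promise<string> {
  const before = await stat(path); // identity seen before opening
  const handle = await open(path, "r"); // may resolve to a newer inode after a rename
  try {
    const opened = await handle.stat(); // identity of the file actually opened
    if (!sameIdentity(before, opened)) {
      // A writer renamed a new file over `path` between stat and open.
      throw new Error(`File changed during read: ${path}`);
    }
    return await handle.readFile("utf8");
  } finally {
    await handle.close();
  }
}
```

If an atomic rename lands between the `stat` and the `open`, the two identities differ and the read is rejected even though both the old and the new file are complete and valid.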

Tracks openclaw/openclaw#83657.

Scope notes

  • Sync readers (readJsonSync, readRootJsonSync, tryReadJsonSync) are intentionally unchanged - they cannot await and the issue's pseudocode is async. A sync retry on top of Atomics.wait or a tight spin loop is a separate decision and can land in a follow-up if anyone hits the race on the sync paths.
  • readRegularFile/readRegularFileSync themselves are unchanged in behavior; only the error type the race emits is tightened. Callers that previously string-matched on the message still match the same message (File changed during read: <path>).
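For reference, a sync retry built on Atomics.wait might look like the sketch below. This PR does not implement it; `sleepSync`, `retrySync`, and the `isTransient` predicate are hypothetical names, and the sketch assumes a runtime such as Node.js where Atomics.wait is allowed on the calling thread.

```typescript
// Block the current thread for `ms` milliseconds without a spin loop.
function sleepSync(ms: number): void {
  const cell = new Int32Array(new SharedArrayBuffer(4));
  // The cell stays 0 and nobody notifies it, so the wait ends by timeout.
  Atomics.wait(cell, 0, 0, ms);
}

// Hypothetical sync counterpart of a bounded retry with exponential backoff.
function retrySync<T>(
  read: () => T,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 50,
): T {
  for (let attempt = 1; ; attempt++) {
    try {
      return read();
    } catch (err) {
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      sleepSync(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}
```

The trade-off is that the backoff blocks the whole thread, which is part of why the sync paths are left as a separate decision.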

Real behavior proof

Real disk, no fs.open mock. The new vitest case "recovers readJson from a concurrent real atomic rewrite" runs concurrent readJson and writeTextAtomic loops against a mkdtemp scratch file for 1 second and counts raceErrors by matching the cause message.
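The shape of that concurrent loop can be sketched as below. This is not the test in test/json.test.ts: `writeAtomic` is a hypothetical stand-in for writeTextAtomic (temp file plus rename), `runRace` and `plainRead` are illustrative names, and the reader under test is passed in as a parameter so the harness stays self-contained. It assumes a POSIX filesystem where rename replaces the target atomically.

```typescript
import { mkdtemp, readFile, rename, rm, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for writeTextAtomic: write a temp file, rename over the target.
async function writeAtomic(path: string, text: string, seq: number): Promise<void> {
  const tmp = `${path}.${process.pid}.${seq}.tmp`;
  await writeFile(tmp, text);
  await rename(tmp, path);
}

interface RaceCounts {
  writes: number;
  okReads: number;
  raceErrors: number;
  otherErrors: number;
}

// Run a writer loop and a reader loop against one scratch file until the deadline.
async function runRace(
  read: (path: string) => Promise<unknown>,
  durationMs: number,
): Promise<RaceCounts> {
  const dir = await mkdtemp(join(tmpdir(), "json-race-"));
  const file = join(dir, "paired.json");
  const counts: RaceCounts = { writes: 0, okReads: 0, raceErrors: 0, otherErrors: 0 };
  await writeAtomic(file, JSON.stringify({ seq: 0 }), 0);
  const deadline = Date.now() + durationMs;

  const writer = (async () => {
    while (Date.now() < deadline) {
      const seq = counts.writes + 1;
      await writeAtomic(file, JSON.stringify({ seq }), seq);
      counts.writes = seq;
    }
  })();

  const reader = (async () => {
    while (Date.now() < deadline) {
      try {
        await read(file);
        counts.okReads++;
      } catch (err) {
        // Classify by the cause message (or the error itself when there is no cause).
        const message = String((err as { cause?: unknown }).cause ?? err);
        if (message.includes("File changed during read")) counts.raceErrors++;
        else counts.otherErrors++;
      }
    }
  })();

  try {
    await Promise.all([writer, reader]);
  } finally {
    await rm(dir, { recursive: true, force: true });
  }
  return counts;
}

// Baseline reader with no identity check and no retry, for trying the harness.
const plainRead = async (path: string): Promise<unknown> =>
  JSON.parse(await readFile(path, "utf8"));
```

Calling `runRace(plainRead, 1000)` exercises the harness on its own; passing a reader that performs the identity check is what surfaces raceErrors.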

Against upstream/main (src/json.ts + src/regular-file.ts reverted to upstream, new tests in place):

$ pnpm test test/json.test.ts
 × json file helpers > recovers readJson from a concurrent real atomic rewrite 1014ms
 × json file helpers > surfaces JsonFileReadError when read races exceed retry budget 15ms

FAIL  json file helpers > recovers readJson from a concurrent real atomic rewrite
AssertionError: expected 79 to be +0
 ❯ test/json.test.ts:314
    312|     expect(writes).toBeGreaterThan(10);
    313|     expect(okReads).toBeGreaterThan(10);
    314|     expect(raceErrors).toBe(0);

FAIL  json file helpers > surfaces JsonFileReadError when read races exceed retry budget
AssertionError: expected 1 to be greater than or equal to 3

 Test Files  1 failed (1)
      Tests  2 failed | 12 passed (14)

79 real JsonFileReadErrors caused by "File changed during read" during a 1-second concurrent run - the exact symptom from openclaw/openclaw#83657. The second failure also pins down today's behavior as "one attempt, no retry."

With the fix:

$ pnpm test test/json.test.ts
 Test Files  1 passed (1)
      Tests  14 passed (14)

$ pnpm test
 Test Files  35 passed (35)
      Tests  383 passed | 2 skipped (385)

$ pnpm build
$ pnpm lint:fs-boundary
$ pnpm lint:file-size

All clean. The same real-disk concurrent loop that produced 79 race errors against upstream produces zero against the patched build, without parseErrors or otherErrors, confirming the retry is not masking corruption or unrelated failures.

What was not tested: a real multi-process Discord + Control UI + CLI reproduction against openclaw/openclaw's gateway. The race lives in the kernel rename path, so an in-process concurrent test exercises the same code, and a multi-process reproduction is expected to behave the same.

Verification

  • pnpm test test/json.test.ts
  • pnpm test
  • pnpm build
  • pnpm lint:fs-boundary
  • pnpm lint:file-size

@clawsweeper

clawsweeper Bot commented May 18, 2026

Codex review: needs maintainer review before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR tags async regular-file read identity swaps as FsSafeError("path-mismatch"), retries async JSON reads on transient read races, and adds changelog plus JSON regression coverage.

Reproducibility: yes. Source inspection shows current main performs one async JSON read and throws a single untyped File changed during read error, and the PR body provides terminal before/after output for the real temp-file atomic rewrite race.

PR rating
Overall: 🦞 diamond lobster
Proof: 🦞 diamond lobster
Patch quality: 🦞 diamond lobster
Summary: Strong terminal proof, focused implementation, regression coverage, and no blocking findings make this above-normal merge-ready quality pending maintainer review.

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Sufficient (terminal): The PR body includes terminal output showing the real temp-file atomic rename race failing on upstream/current behavior and passing with the patched branch.

Next step before merge
No repair lane is needed; the remaining action is maintainer review and merge handling for the open PR.

Security
Cleared: The diff adds no dependencies, workflows, scripts, secret handling, package-resolution changes, or broader filesystem permissions; the retry is bounded to read-race handling.

Review details

Best possible solution:

Land the bounded async JSON retry with the typed read-race sentinel and included regression coverage after ordinary maintainer review.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main performs one async JSON read and throws a single untyped File changed during read error, and the PR body provides terminal before/after output for the real temp-file atomic rewrite race.

Is this the best way to solve the issue?

Yes. Retrying the typed transient read-race sentinel at the async JSON boundary is narrower than changing all regular-file reads, and the branch preserves parse, sync-reader, and unrelated failure behavior.

Label justifications:

  • P1: The PR addresses a source-backed transient JSON read race tied to downstream normal multi-client OpenClaw failures.
  • rating: 🦞 diamond lobster: Current PR rating is 🦞 diamond lobster because proof is 🦞 diamond lobster and patch quality is 🦞 diamond lobster. Strong terminal proof, focused implementation, regression coverage, and no blocking findings make this above-normal merge-ready quality pending maintainer review.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body includes terminal output showing the real temp-file atomic rename race failing on upstream/current behavior and passing with the patched branch.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes terminal output showing the real temp-file atomic rename race failing on upstream/current behavior and passing with the patched branch.

What I checked:

  • Current main has no async JSON read retry: On current main, tryReadJson, readJson, and readJsonIfExists each call readRegularFile once, so a transient stable-target race is returned or thrown without a bounded retry. (src/json.ts:239, 81bf0da0781f)
  • Current main reports the identity swap as an untyped Error: verifyStableReadTarget throws Error("File changed during read: ...") when the pre-open, post-open, and path identities differ, so callers cannot branch on the existing path-mismatch sentinel. (src/regular-file.ts:178, 81bf0da0781f)
  • PR adds bounded async JSON retry: The PR head adds a five-attempt exponential-backoff helper, keeps nullable missing-file reads from retrying when initially absent, and routes the async JSON readers through that helper. (src/json.ts:10, a7d46a5d76a0)
  • PR types the regular-file read race: The PR head converts mid-read disappearance and identity mismatch in async readRegularFile into FsSafeError("path-mismatch") while preserving the existing message text. (src/regular-file.ts:143, a7d46a5d76a0)
  • PR includes real-disk regression coverage: The PR test adds a concurrent writeTextAtomic loop against readJson on a temp file and asserts zero read-race errors, plus an exhaustion-budget test for repeated injected races. (test/json.test.ts:285, a7d46a5d76a0)
  • Contributor proof is sufficient: The PR body provides terminal before/after output from a real temp-file atomic rewrite race: current upstream fails the new race test with JsonFileReadErrors, while the patched branch passes the focused test, full test suite, build, and fs-boundary/file-size lint commands. (a7d46a5d76a0)

Likely related people:

  • steipete: Current main blame for the JSON and regular-file read paths points to recent filesystem-safety commits by this author, and the PR branch also contains maintainer hardening commits for the JSON retry behavior. (role: recent area contributor and branch hardening author; confidence: high; commits: d649786af91d, f457c69c2777, 43c0dcd55258; files: src/json.ts, src/regular-file.ts, test/json.test.ts)
  • sallyom: A recent adjacent commit touched the atomic write behavior used by the new real-disk regression, though the central JSON/read retry ownership points elsewhere. (role: adjacent atomic-write contributor; confidence: low; commits: e335490a5b3d; files: test/json.test.ts, src/text-atomic.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 81bf0da0781f.

@clawsweeper clawsweeper Bot added labels on May 18, 2026:
  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
@yetval
Contributor Author

yetval commented May 18, 2026

@clawsweeper re-review

Added real-fs proof. New scripts/race-repro.mjs runs concurrent readJson against writeTextAtomic on a real temp file — no fs.open mocks. Measured numbers in updated PR body:

  • upstream/main: 521 raceErrors across 247 atomic writes
  • this branch: 0 raceErrors across 264 atomic writes

Parse errors and other errors stay at 0 in both runs, so the retry is not masking corruption.

@yetval yetval force-pushed the fix/json-read-retry-on-race branch from 94164ef to d3d1841 Compare May 18, 2026 16:14
@clawsweeper clawsweeper Bot changed labels on May 18, 2026.
Added:
  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🦞 diamond lobster (Very strong PR readiness with only minor maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)
  • P1 (Urgent regression or broken agent/channel workflow affecting real users now.)
Removed:
  • rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)
  • P2 (Normal priority bug or improvement with limited blast radius.)
@clawsweeper

clawsweeper Bot commented May 18, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.) and removed rating: 🦞 diamond lobster (Very strong PR readiness with only minor maintainer review expected.) labels May 21, 2026
@clawsweeper

clawsweeper Bot commented May 21, 2026

ClawSweeper PR egg

✨ Hatched: 💎 rare Pearl Review Wisp

Hatch command

Comment @clawsweeper hatch when this PR is hatchable.

Hatchability rules:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

Rarity: 💎 rare.
Trait: finds missing screenshots.
Image traits: location review cove; accessory miniature diff map; palette seafoam, black, and opal; mood patient; pose stepping out of a freshly hatched shell; shell brushed metal shell; lighting soft underwater shimmer; background tiny artifact crates.

What is this egg doing here?
  • Eggs appear after the PR passes real-behavior proof. It is here for vibes, not verdicts: it does not change labels, ratings, merge decisions, or automation.
  • The shell reacts to review momentum: open follow-up work warms it up, re-review makes it wobble, and a clean final review lets it hatch.
  • Hatchability usually comes from sufficient real-behavior proof, no blocking P0/P1/P2 findings, no security attention needed, and clean correctness. A merged PR is already final, so merge makes the egg hatchable independently.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

yetval added 2 commits May 21, 2026 12:11
Tag the regular-file identity-change throw as FsSafeError("path-mismatch"),
matching the sentinel already used by file-store/directory-guard/pinned-write
for the same temp-file + rename race semantic. Wrap readJson, readJsonIfExists,
and tryReadJson in a bounded retry (max 3, 50ms exponential backoff) that only
retries on path-mismatch, so transient temp-file + rename rotations no longer
surface as JsonFileReadError to callers while parse errors and other read
failures still fail fast on the first attempt.

Add a real-disk regression that spawns a concurrent writeTextAtomic loop and a
readJson loop on a mkdtemp scratch file and asserts zero raceErrors over a
1-second window (per docs/contributing.md "use vi.mock sparingly"). Retain the
mocked exhaustion test for the retry-budget branch, which is hard to hit
deterministically on real disk.

Refs openclaw/openclaw#83657
@yetval yetval force-pushed the fix/json-read-retry-on-race branch from d3d1841 to 66604d6 Compare May 21, 2026 16:12
@clawsweeper clawsweeper Bot added rating: 🦞 diamond lobster (Very strong PR readiness with only minor maintainer review expected.) and removed rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.) labels May 21, 2026
@steipete steipete merged commit f1742f5 into openclaw:main May 21, 2026
12 checks passed
@yetval yetval deleted the fix/json-read-retry-on-race branch May 21, 2026 20:24

Labels

  • P1 (Urgent regression or broken agent/channel workflow affecting real users now.)
  • proof: sufficient (Contributor real behavior proof is sufficient.)
  • rating: 🦞 diamond lobster (Very strong PR readiness with only minor maintainer review expected.)
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)

2 participants