
fix(journal): serialize concurrent appends via lockfile #1002

Open
carlos-alm wants to merge 4 commits into main from fix/journal-atomic-appends

Conversation

@carlos-alm
Contributor

Summary

Fixes #996.

appendJournalEntries() relied on fs.appendFileSync alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual codegraph build in another shell both write to .codegraph/changes.journal, their lines can interleave, producing truncated DELETED prefixes, partial paths, or a newline-less tail. A corrupted journal then makes readJournal return valid: false, and the build silently falls back to a full hash scan — a perf regression rather than a crash.

Fix

Wrap both appendJournalEntries and writeJournalHeader in withJournalLock:

  • Acquires .codegraph/changes.journal.lock with fs.openSync(path, 'wx') — atomic exclusive-create on ext4/NTFS/APFS.
  • Stamps the holder PID inside the lockfile.
  • Retries every 25ms up to a 5s timeout.
  • Steals the lock if the holder PID is dead (process.kill(pid, 0)) or the file's mtime is older than 30s — covers crash-mid-write.
  • Releases the fd + unlinks the lockfile in finally.

Zero new dependencies. Atomics.wait on a SharedArrayBuffer provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).
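The steps above can be sketched as a minimal withJournalLock. This is a reconstruction under the stated assumptions, not the PR's actual code; the PID-liveness check and stale-lock stealing are omitted for brevity, and sleepSync uses the Atomics.wait approach described here (which, as the review notes later, blocks the event loop).

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Synchronous sleep: Atomics.wait blocks the calling thread for ms.
// (Node allows this on the main thread, unlike browsers.)
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + 5000; // 5s acquisition timeout
  let fd: number | null = null;
  while (fd === null) {
    try {
      fd = fs.openSync(lockPath, 'wx'); // atomic exclusive-create
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
      if (Date.now() > deadline) throw new Error(`timed out acquiring ${lockPath}`);
      sleepSync(25); // retry every 25ms
    }
  }
  try {
    fs.writeSync(fd, `${process.pid}\n`); // stamp holder PID for diagnostics
    return fn();
  } finally {
    fs.closeSync(fd);
    fs.unlinkSync(lockPath); // release
  }
}

// Usage: serialize an append against any concurrent writer.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'journal-'));
const journal = path.join(dir, 'changes.journal');
const lock = `${journal}.lock`;
withJournalLock(lock, () => fs.appendFileSync(journal, 'MODIFIED src/a.ts\n'));
const lockGone = !fs.existsSync(lock);
const contents = fs.readFileSync(journal, 'utf-8');
```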

Test plan

  • npx vitest run tests/unit/journal.test.ts — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
  • npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts — 27/27 pass
  • npm run typecheck — clean

appendJournalEntries used fs.appendFileSync with no cross-process
coordination, so a watcher session and a manual codegraph build in a
second shell could interleave lines and corrupt .codegraph/changes.journal
(truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both appendJournalEntries and writeJournalHeader in a
withJournalLock helper that:
- acquires .codegraph/changes.journal.lock via fs.openSync(path, 'wx')
  (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older
  than 30s (covers crash-mid-write)
- always releases in finally

Zero new dependencies — uses only node:fs and Atomics.wait on a
SharedArrayBuffer for synchronous retry sleep, preserving the existing
sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
@claude

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task


PR Review: fix(journal): serialize concurrent appends via lockfile

Todo List

  • Read and analyze the modified journal implementation
  • Review test changes for integrity and coverage
  • Check for configuration changes or rule relaxations
  • Perform root cause analysis
  • Assess backlog compliance criteria
  • Provide final review assessment

Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual codegraph build) can interleave lines in .codegraph/changes.journal, causing corruption like truncated "DELETED " prefixes or paths without newlines. This leads to readJournal returning valid: false and falling back to expensive hash scans.

Why This Approach: Uses atomic file locking via fs.openSync(path, 'wx'), an OS-level exclusive-create operation. The approach is sound: it wraps both appendJournalEntries and writeJournalHeader in withJournalLock to serialize all journal writes.

Risk Assessment: LOW - Well-implemented defensive approach with proper timeout, dead PID detection, and stale lock cleanup.

Backlog Compliance

  • Zero-dep: ✓ — Uses only Node.js built-ins (fs, process.kill, SharedArrayBuffer, Atomics)
  • Foundation-aligned: ✓ — Solves core reliability issue without violating any principles
  • Problem-fit: 4 — Directly prevents AI agents from experiencing silent perf degradation due to corrupted journals
  • Breaking: No — Internal implementation change, no API changes
  • Tier: 1

Technical Review

Implementation Quality ⭐⭐⭐⭐☆

Strong Points:

  • Atomic locking: fs.openSync(path, 'wx') is the correct way to do exclusive file creation
  • Dead PID detection: process.kill(pid, 0) with proper EPERM handling (lines 17-25)
  • Stale lock cleanup: 30s timeout prevents indefinite blocking from crashed processes
  • Sync sleep: Clever use of SharedArrayBuffer + Atomics.wait for synchronous retry without async/await
  • Proper resource cleanup: finally block ensures fd close + lock unlink (lines 101-105)

Implementation Details:

  • Lock timeout: 5s (reasonable for file I/O)
  • Retry interval: 25ms (appropriate granularity)
  • Stale threshold: 30s (good balance)
  • Lock file contains PID for diagnostics (line 34)

Test Coverage ⭐⭐⭐⭐☆

New Tests Added:

  1. Lock cleanup verification (lines 200-207): Ensures .lock file is removed after successful operations
  2. Dead PID lock stealing (lines 209-223): Tests stale lock recovery using impossible PID (2147483646)
  3. Interleaving prevention (lines 225-243): 200 concurrent appends verify no corruption occurs

Test Quality:

  • Good edge case coverage (dead PID, lock cleanup, concurrent access)
  • Uses realistic scenario (200 mixed appends simulating high concurrency)
  • Proper assertion: regex validation that all lines match expected format
  • No test integrity issues: tests are additive and strengthen coverage
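A format-integrity check of the kind described might look like the following. The entry grammar (lines of the form "MODIFIED <path>" or "DELETED <path>") is inferred from the PR description, and this single-process sketch is illustrative, not the actual test:

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

const journal = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'jrnl-')),
  'changes.journal',
);

// 200 mixed appends (a single-process stand-in for the concurrent scenario)
for (let i = 0; i < 200; i++) {
  const op = i % 2 === 0 ? 'MODIFIED' : 'DELETED';
  fs.appendFileSync(journal, `${op} src/file-${i}.ts\n`);
}

// Every line must match the assumed entry grammar; a trailing newline
// leaves one empty final element after split.
const ENTRY_RE = /^(MODIFIED|DELETED) \S+$/;
const lines = fs.readFileSync(journal, 'utf-8').split('\n');
const tail = lines.pop();
const allValid = lines.every((l) => ENTRY_RE.test(l)) && tail === '';
```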

Potential Issues (Minor)

  1. SharedArrayBuffer compatibility: While widely supported, could theoretically fail in some environments. However, fallback would be a simple setTimeout with Promise, and this preserves the existing sync API.

  2. Lock file in same directory: Generally fine, though some filesystems might have issues with rapid create/delete cycles. The implementation handles races properly with try/catch blocks.

  3. No explicit file sync: fs.writeSync and fs.appendFileSync don't guarantee disk persistence before proceeding, but this shouldn't affect correctness for the lockfile use case.

Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-implemented fix for a legitimate race condition. Uses appropriate OS primitives, has good test coverage, adds no dependencies, and solves a real reliability problem that affects AI agent performance. The implementation is defensive and handles edge cases properly.

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR wraps appendJournalEntries and writeJournalHeader in a lockfile-based mutex (withJournalLock) to prevent interleaved writes when a watcher process and a codegraph build run concurrently. The two previously-flagged issues — the P1 TOCTOU steal race and the P2 Atomics.wait event-loop freeze — are confirmed resolved via the atomic rename-with-nonce steal and the process.hrtime.bigint() busy-spin respectively.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/quality suggestions that do not block correctness.

Both prior P1 issues (TOCTOU steal race, event-loop blocking) are addressed in this revision. New findings are limited to orphaned crash-time tmp files (minor filesystem hygiene) and a test that validates format integrity but not inter-process concurrency — neither affects the lock's correctness on the happy path.

No files require special attention beyond the P2 notes on trySteal cleanup and the concurrent-append test scope.

Important Files Changed

Filename Overview
src/domain/graph/journal.ts Adds a lockfile-based mutex (acquireJournalLock / releaseJournalLock / withJournalLock) wrapping appendJournalEntries and writeJournalHeader. Prior P1 TOCTOU and event-loop-blocking issues are resolved; orphaned .tmp files on crash mid-steal remain a minor concern.
tests/unit/journal.test.ts Adds three new lock tests: lockfile cleanup, dead-PID stale-lock stealing, and TOCTOU regression guard. The "no interleaved lines" test validates format integrity but is single-process/sequential and does not verify inter-process atomicity.

Sequence Diagram

sequenceDiagram
    participant W as Watcher Process
    participant B as Build Process
    participant FS as Filesystem (.codegraph/)

    W->>FS: openSync('changes.journal.lock', 'wx') → fd_W
    W->>FS: writeSync(fd_W, PID_W + nonce_W)
    note over W: holds lock

    B->>FS: openSync('changes.journal.lock', 'wx') → EEXIST
    B->>FS: readFileSync(lock) → PID_W (alive)
    B->>FS: statSync(lock) → mtime recent
    note over B: spins sleepSync(25ms) × N

    W->>FS: appendFileSync('changes.journal', lines)
    W->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over W: releases lock

    B->>FS: openSync('changes.journal.lock', 'wx') → fd_B
    B->>FS: writeSync(fd_B, PID_B + nonce_B)
    B->>FS: appendFileSync('changes.journal', lines)
    B->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over B: releases lock



Comment thread: src/domain/graph/journal.ts (Outdated)
Comment on lines +43 to +57
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
Contributor

P1 TOCTOU race allows two processes to concurrently enter the critical section

When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:

  1. Process A dies — lockfile contains dead PID A.
  2. Processes B and C both call openSync('wx'), both get EEXIST, and both read dead PID A.
  3. B unlinks and immediately re-acquires (new inode I1, fn() starts).
  4. C now executes unlinkSync(lockPath) — it unlinks B's live lockfile (I1), not the stale one.
  5. C loops, calls openSync('wx'), succeeds (new inode I2), and its fn() begins concurrently with B's.

Both B and C are now inside the critical section simultaneously — breaking exactly the write-serialisation invariant this PR is meant to guarantee.

The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after openSync succeeds and bail if it doesn't match yours. Alternatively, use fs.renameSync(tmpfile, lockPath) to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).
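The nonce variant of that mitigation could be sketched like this. The names (trySteal, lockPath) are illustrative, not the code that landed; staging the stale lockfile in the usage below stands in for a dead holder:

```typescript
import * as crypto from 'node:crypto';
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Steal a stale lockfile atomically: write a nonce-stamped temp file and
// rename it onto the lockfile (POSIX rename is atomic), then verify
// ownership by re-reading. Returns the nonce on success, null if another
// stealer's rename landed after ours (caller should bail and retry).
function trySteal(lockPath: string): string | null {
  const nonce = crypto.randomBytes(16).toString('hex');
  const tmp = `${lockPath}.${nonce}.tmp`;
  fs.writeFileSync(tmp, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmp, lockPath); // replaces the stale file, no unlink window
  const content = fs.readFileSync(lockPath, 'utf-8');
  return content.includes(nonce) ? nonce : null;
}

// Usage: stage a "stale" lockfile left behind by a dead holder, then steal it.
const lock = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'lk-')),
  'changes.journal.lock',
);
fs.writeFileSync(lock, '99999\ndeadbeef\n'); // pretend holder PID 99999 died
const won = trySteal(lock);
```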


Contributor Author

Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.

Comment on lines +12 to +15
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
Contributor

P2 Atomics.wait freezes the Node.js event loop during lock contention

Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.

A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():

function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).


Contributor Author

Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

10 functions changed, 13 callers affected across 5 files

  • sleepSync in src/domain/graph/journal.ts:17 (4 transitive callers)
  • isPidAlive in src/domain/graph/journal.ts:24 (4 transitive callers)
  • AcquiredLock.fd in src/domain/graph/journal.ts:36 (4 transitive callers)
  • AcquiredLock.nonce in src/domain/graph/journal.ts:37 (4 transitive callers)
  • trySteal in src/domain/graph/journal.ts:51 (4 transitive callers)
  • acquireJournalLock in src/domain/graph/journal.ts:94 (7 transitive callers)
  • releaseJournalLock in src/domain/graph/journal.ts:143 (7 transitive callers)
  • withJournalLock in src/domain/graph/journal.ts:161 (8 transitive callers)
  • appendJournalEntries in src/domain/graph/journal.ts:229 (4 transitive callers)
  • writeJournalHeader in src/domain/graph/journal.ts:249 (4 transitive callers)

Address two Greptile review issues on the journal lockfile:

- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
  holder, one's unlink could cross the other's fresh openSync('wx')
  acquisition, admitting both writers into the critical section. Replace
  the unlink + openSync('wx') pattern with an atomic write-tmp + rename
  steal, then verify via a random nonce. If another stealer's rename
  landed after ours, we bail and retry instead of unlinking their live
  lockfile. Release now also nonce-verifies before unlinking.

- P2 event-loop blockage: replace Atomics.wait with a short hrtime
  busy-spin so pending FS events and timer callbacks in the watcher
  keep firing during the 25ms retry window.

Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.

Impact: 7 functions changed, 10 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +98 to +108
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
Contributor

P1 Silent writeSync failure voids the nonce and breaks mutual exclusion

If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:

  1. releaseJournalLock reads the empty file, content.includes(lock.nonce) is false, and the lockfile is orphaned rather than unlinked.
  2. Any concurrent waiter reads the empty file, computes Number('') → 0, then calls isPidAlive(0) which returns false (the pid <= 0 guard). With holderAlive = false, it immediately calls trySteal — stealing a live, active lock and breaking mutual exclusion.

The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.

Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.
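A sketch of that fix, with an injectable stamp step standing in for the real PID/nonce write (acquireOnce and its names are illustrative, not the PR's code):

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// One acquisition attempt. If stamping the lockfile fails, tear down the
// file we just created (we still hold the exclusive fd) so waiters never
// observe an empty, nonce-less lockfile, then signal the caller to retry.
function acquireOnce(lockPath: string, stamp: (fd: number) => void): number | null {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, 'wx');
  } catch (e) {
    if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
    return null; // contended; caller retries
  }
  try {
    stamp(fd); // may throw on ENOSPC / I/O error
    return fd;
  } catch {
    fs.closeSync(fd);
    try { fs.unlinkSync(lockPath); } catch { /* already gone */ }
    return null; // caller retries with a clean slate
  }
}

// Usage: a failing stamp must not leave an orphaned lockfile behind.
const lock = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'lk2-')), 'j.lock');
const failed = acquireOnce(lock, () => { throw new Error('simulated ENOSPC'); });
const orphaned = fs.existsSync(lock);
const fd = acquireOnce(lock, (f) => fs.writeSync(f, `${process.pid}\n`));
```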




Development

Successfully merging this pull request may close these issues.

bug(journal): concurrent appends are not atomic — corrupted under parallel builds
