
fix(journal): serialize concurrent appends via lockfile #1002

Open
carlos-alm wants to merge 4 commits into main from fix/journal-atomic-appends

Conversation

@carlos-alm
Contributor

Summary

Fixes #996.

appendJournalEntries() relied on fs.appendFileSync alone — that's atomic within a single call but not across concurrent writers. When a watch session and a manual codegraph build in another shell both write to .codegraph/changes.journal, their lines can interleave, producing truncated DELETED prefixes, partial paths, or a newline-less tail. A corrupted journal then makes readJournal return valid: false, and the build silently falls back to a full hash scan — a perf regression rather than a crash.

Fix

Wrap both appendJournalEntries and writeJournalHeader in withJournalLock:

  • Acquires .codegraph/changes.journal.lock with fs.openSync(path, 'wx') — atomic exclusive-create on ext4/NTFS/APFS.
  • Stamps the holder PID inside the lockfile.
  • Retries every 25ms up to a 5s timeout.
  • Steals the lock if the holder PID is dead (process.kill(pid, 0)) or the file's mtime is older than 30s — covers crash-mid-write.
  • Releases the fd + unlinks the lockfile in finally.

Zero new dependencies. Atomics.wait on a SharedArrayBuffer provides synchronous retry sleep so the existing sync API is preserved (watcher and build stages call these functions synchronously).
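The steps above can be sketched as a minimal withJournalLock. This is a reconstruction under the stated assumptions, not the PR's actual code; the PID-liveness check and stale-lock stealing are omitted for brevity, and sleepSync uses the Atomics.wait approach described here (which, as the review notes later, blocks the event loop).

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Synchronous sleep: Atomics.wait blocks the calling thread for ms.
// (Node allows this on the main thread, unlike browsers.)
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

function withJournalLock<T>(lockPath: string, fn: () => T): T {
  const deadline = Date.now() + 5000; // 5s acquisition timeout
  let fd: number | null = null;
  while (fd === null) {
    try {
      fd = fs.openSync(lockPath, 'wx'); // atomic exclusive-create
    } catch (e) {
      if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
      if (Date.now() > deadline) throw new Error(`timed out acquiring ${lockPath}`);
      sleepSync(25); // retry every 25ms
    }
  }
  try {
    fs.writeSync(fd, `${process.pid}\n`); // stamp holder PID for diagnostics
    return fn();
  } finally {
    fs.closeSync(fd);
    fs.unlinkSync(lockPath); // release
  }
}

// Usage: serialize an append against any concurrent writer.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'journal-'));
const journal = path.join(dir, 'changes.journal');
const lock = `${journal}.lock`;
withJournalLock(lock, () => fs.appendFileSync(journal, 'MODIFIED src/a.ts\n'));
const lockGone = !fs.existsSync(lock);
const contents = fs.readFileSync(journal, 'utf-8');
```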

Test plan

  • npx vitest run tests/unit/journal.test.ts — 20/20 pass (17 existing + 3 new: lockfile cleanup, dead-PID stale-lock stealing, no-interleave across 200 mixed appends)
  • npx vitest run tests/builder/detect-changes.test.ts tests/integration/build.test.ts tests/integration/watcher-rebuild.test.ts — 27/27 pass
  • npm run typecheck — clean

appendJournalEntries used fs.appendFileSync with no cross-process
coordination, so a watcher session and a manual codegraph build in a
second shell could interleave lines and corrupt .codegraph/changes.journal
(truncated DELETED prefixes, partial entries, newline-less tails).

Wrap both appendJournalEntries and writeJournalHeader in a
withJournalLock helper that:
- acquires .codegraph/changes.journal.lock via fs.openSync(path, 'wx')
  (atomic exclusive-create, cross-platform)
- stamps the holder PID for stale-lock detection
- retries every 25ms up to a 5s timeout
- steals the lock if the holder PID is dead or the lock file is older
  than 30s (covers crash-mid-write)
- always releases in finally

Zero new dependencies — uses only node:fs and Atomics.wait on a
SharedArrayBuffer for synchronous retry sleep, preserving the existing
sync API.

Fixes #996

Impact: 7 functions changed, 12 affected
@claude

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task


PR Review: fix(journal): serialize concurrent appends via lockfile

Todo List

  • Read and analyze the modified journal implementation
  • Review test changes for integrity and coverage
  • Check for configuration changes or rule relaxations
  • Perform root cause analysis
  • Assess backlog compliance criteria
  • Provide final review assessment

Root Cause Analysis

Problem Being Solved: Race condition where concurrent journal writers (watch session + manual codegraph build) can interleave lines in .codegraph/changes.journal, causing corruption like truncated "DELETED " prefixes or paths without newlines. This leads to readJournal returning valid: false and falling back to expensive hash scans.

Why This Approach: Uses atomic file locking via fs.openSync(path, 'wx'), an OS-level exclusive-create operation. The approach is sound: it wraps both appendJournalEntries and writeJournalHeader in withJournalLock to serialize all journal writes.

Risk Assessment: LOW - Well-implemented defensive approach with proper timeout, dead PID detection, and stale lock cleanup.

Backlog Compliance

  • Zero-dep: ✓ — Uses only Node.js built-ins (fs, process.kill, SharedArrayBuffer, Atomics)
  • Foundation-aligned: ✓ — Solves core reliability issue without violating any principles
  • Problem-fit: 4 — Directly prevents AI agents from experiencing silent perf degradation due to corrupted journals
  • Breaking: No — Internal implementation change, no API changes
  • Tier: 1

Technical Review

Implementation Quality ⭐⭐⭐⭐☆

Strong Points:

  • Atomic locking: fs.openSync(path, 'wx') is the correct way to do exclusive file creation
  • Dead PID detection: process.kill(pid, 0) with proper EPERM handling (lines 17-25)
  • Stale lock cleanup: 30s timeout prevents indefinite blocking from crashed processes
  • Sync sleep: Clever use of SharedArrayBuffer + Atomics.wait for synchronous retry without async/await
  • Proper resource cleanup: finally block ensures fd close + lock unlink (lines 101-105)

Implementation Details:

  • Lock timeout: 5s (reasonable for file I/O)
  • Retry interval: 25ms (appropriate granularity)
  • Stale threshold: 30s (good balance)
  • Lock file contains PID for diagnostics (line 34)

Test Coverage ⭐⭐⭐⭐☆

New Tests Added:

  1. Lock cleanup verification (lines 200-207): Ensures .lock file is removed after successful operations
  2. Dead PID lock stealing (lines 209-223): Tests stale lock recovery using impossible PID (2147483646)
  3. Interleaving prevention (lines 225-243): 200 concurrent appends verify no corruption occurs

Test Quality:

  • Good edge case coverage (dead PID, lock cleanup, concurrent access)
  • Uses realistic scenario (200 mixed appends simulating high concurrency)
  • Proper assertion: regex validation that all lines match expected format
  • No test integrity issues: tests are additive and strengthen coverage
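A format-integrity check of the kind described might look like the following. The entry grammar (lines of the form "MODIFIED <path>" or "DELETED <path>") is inferred from the PR description, and this single-process sketch is illustrative, not the actual test:

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

const journal = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'jrnl-')),
  'changes.journal',
);

// 200 mixed appends (a single-process stand-in for the concurrent scenario)
for (let i = 0; i < 200; i++) {
  const op = i % 2 === 0 ? 'MODIFIED' : 'DELETED';
  fs.appendFileSync(journal, `${op} src/file-${i}.ts\n`);
}

// Every line must match the assumed entry grammar; a trailing newline
// leaves one empty final element after split.
const ENTRY_RE = /^(MODIFIED|DELETED) \S+$/;
const lines = fs.readFileSync(journal, 'utf-8').split('\n');
const tail = lines.pop();
const allValid = lines.every((l) => ENTRY_RE.test(l)) && tail === '';
```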

Potential Issues (Minor)

  1. SharedArrayBuffer compatibility: While widely supported, could theoretically fail in some environments. However, fallback would be a simple setTimeout with Promise, and this preserves the existing sync API.

  2. Lock file in same directory: Generally fine, though some filesystems might have issues with rapid create/delete cycles. The implementation handles races properly with try/catch blocks.

  3. No explicit file sync: fs.writeSync and fs.appendFileSync don't guarantee disk persistence before proceeding, but this shouldn't affect correctness for the lockfile use case.

Critical Concerns

None. This is a solid fix for a real concurrency issue.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-implemented fix for a legitimate race condition. Uses appropriate OS primitives, has good test coverage, adds no dependencies, and solves a real reliability problem that affects AI agent performance. The implementation is defensive and handles edge cases properly.

The only reason this isn't 5 stars is that it's solving an existing issue rather than adding exceptional new value, but the execution is excellent.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR wraps appendJournalEntries and writeJournalHeader in a lockfile-based mutex (withJournalLock) to prevent interleaved writes when a watcher process and a codegraph build run concurrently. The two previously-flagged issues — the P1 TOCTOU steal race and the P2 Atomics.wait event-loop freeze — are confirmed resolved via the atomic rename-with-nonce steal and the process.hrtime.bigint() busy-spin respectively.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style/quality suggestions that do not block correctness.

Both prior P1 issues (TOCTOU steal race, event-loop blocking) are addressed in this revision. New findings are limited to orphaned crash-time tmp files (minor filesystem hygiene) and a test that validates format integrity but not inter-process concurrency — neither affects the lock's correctness on the happy path.

No files require special attention beyond the P2 notes on trySteal cleanup and the concurrent-append test scope.

Important Files Changed

Filename Overview
src/domain/graph/journal.ts Adds a lockfile-based mutex (acquireJournalLock / releaseJournalLock / withJournalLock) wrapping appendJournalEntries and writeJournalHeader. Prior P1 TOCTOU and event-loop-blocking issues are resolved; orphaned .tmp files on crash mid-steal remain a minor concern.
tests/unit/journal.test.ts Adds three new lock tests: lockfile cleanup, dead-PID stale-lock stealing, and TOCTOU regression guard. The "no interleaved lines" test validates format integrity but is single-process/sequential and does not verify inter-process atomicity.

Sequence Diagram

sequenceDiagram
    participant W as Watcher Process
    participant B as Build Process
    participant FS as Filesystem (.codegraph/)

    W->>FS: openSync('changes.journal.lock', 'wx') → fd_W
    W->>FS: writeSync(fd_W, PID_W + nonce_W)
    note over W: holds lock

    B->>FS: openSync('changes.journal.lock', 'wx') → EEXIST
    B->>FS: readFileSync(lock) → PID_W (alive)
    B->>FS: statSync(lock) → mtime recent
    note over B: spins sleepSync(25ms) × N

    W->>FS: appendFileSync('changes.journal', lines)
    W->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over W: releases lock

    B->>FS: openSync('changes.journal.lock', 'wx') → fd_B
    B->>FS: writeSync(fd_B, PID_B + nonce_B)
    B->>FS: appendFileSync('changes.journal', lines)
    B->>FS: readFileSync(lock) → nonce matches → unlinkSync(lock)
    note over B: releases lock



Comment thread: src/domain/graph/journal.ts (Outdated)
Comment on lines +43 to +57
let holderAlive = true;
try {
  const pidContent = fs.readFileSync(lockPath, 'utf-8').trim();
  holderAlive = isPidAlive(Number(pidContent));
} catch {
  /* unreadable — fall through to age check */
}

if (!holderAlive) {
  try {
    fs.unlinkSync(lockPath);
  } catch {
    /* another writer stole it first */
  }
  continue;
Contributor

P1 TOCTOU race allows two processes to concurrently enter the critical section

When multiple processes are waiting on the same stale (dead-PID) lockfile, they can each independently steal the lock from one another's fresh acquisition:

  1. Process A dies — lockfile contains dead PID A.
  2. Processes B and C both call openSync('wx'), both get EEXIST, and both read dead PID A.
  3. B unlinks and immediately re-acquires (new inode I1, fn() starts).
  4. C now executes unlinkSync(lockPath) — it unlinks B's live lockfile (I1), not the stale one.
  5. C loops, calls openSync('wx'), succeeds (new inode I2), and its fn() begins concurrently with B's.

Both B and C are now inside the critical section simultaneously — breaking exactly the write-serialisation invariant this PR is meant to guarantee.

The standard mitigation is to verify ownership after the unlink/re-create cycle: write a nonce (or use the inode number) inside the lockfile, then re-read it after openSync succeeds and bail if it doesn't match yours. Alternatively, use fs.renameSync(tmpfile, lockPath) to perform the steal atomically (create a temp file, rename onto the lockfile — POSIX rename is atomic).
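The nonce variant of that mitigation could be sketched like this. The names (trySteal, lockPath) are illustrative, not the code that landed; staging the stale lockfile in the usage below stands in for a dead holder:

```typescript
import * as crypto from 'node:crypto';
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Steal a stale lockfile atomically: write a nonce-stamped temp file and
// rename it onto the lockfile (POSIX rename is atomic), then verify
// ownership by re-reading. Returns the nonce on success, null if another
// stealer's rename landed after ours (caller should bail and retry).
function trySteal(lockPath: string): string | null {
  const nonce = crypto.randomBytes(16).toString('hex');
  const tmp = `${lockPath}.${nonce}.tmp`;
  fs.writeFileSync(tmp, `${process.pid}\n${nonce}\n`);
  fs.renameSync(tmp, lockPath); // replaces the stale file, no unlink window
  const content = fs.readFileSync(lockPath, 'utf-8');
  return content.includes(nonce) ? nonce : null;
}

// Usage: stage a "stale" lockfile left behind by a dead holder, then steal it.
const lock = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), 'lk-')),
  'changes.journal.lock',
);
fs.writeFileSync(lock, '99999\ndeadbeef\n'); // pretend holder PID 99999 died
const won = trySteal(lock);
```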


Contributor Author

Fixed in f5c737c. Replaced the unlink + openSync('wx') steal pattern with an atomic write-tmp + fs.renameSync steal. Each writer generates a random nonce, writes it into a temp file, and atomically renames onto the lockfile. After rename we re-read the lockfile and only proceed if our nonce is still there — if another stealer's rename landed after ours, we bail and retry instead of unlinking their live lock. Release also nonce-verifies before unlinking. Added a regression test that stages a lockfile with a different-writer nonce after a real steal cycle and asserts we do not retroactively unlink it.

Comment on lines +12 to +15
function sleepSync(ms: number): void {
  const buf = new Int32Array(new SharedArrayBuffer(4));
  Atomics.wait(buf, 0, 0, ms);
}
Contributor

P2 Atomics.wait freezes the Node.js event loop during lock contention

Atomics.wait is a synchronous, blocking call — it stops the entire V8 event loop for the full ms duration. In a watcher process, every filesystem notification, timer, and pending I/O callback is silenced for each 25 ms retry. In the worst case (5 000 ms timeout, 200 retries), the watcher becomes completely unresponsive for up to 5 seconds before ever throwing.

A lighter alternative that avoids blocking the event loop is a simple busy-spin with process.hrtime.bigint():

function sleepSync(ms: number): void {
  const end = process.hrtime.bigint() + BigInt(ms) * 1_000_000n;
  while (process.hrtime.bigint() < end) { /* spin */ }
}

This keeps each retry short and doesn't starve unrelated callbacks (though it does keep the CPU busy, which is acceptable for the brief per-retry duration).


Contributor Author

Fixed in f5c737c. Replaced Atomics.wait with a short process.hrtime.bigint busy-spin per your suggestion. The 25ms retry interval keeps CPU burn negligible while letting pending FS events, timers, and I/O callbacks in watcher processes keep firing during contention.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

10 functions changed, 13 callers affected across 5 files

  • sleepSync in src/domain/graph/journal.ts:17 (4 transitive callers)
  • isPidAlive in src/domain/graph/journal.ts:24 (4 transitive callers)
  • AcquiredLock.fd in src/domain/graph/journal.ts:36 (4 transitive callers)
  • AcquiredLock.nonce in src/domain/graph/journal.ts:37 (4 transitive callers)
  • trySteal in src/domain/graph/journal.ts:51 (4 transitive callers)
  • acquireJournalLock in src/domain/graph/journal.ts:94 (7 transitive callers)
  • releaseJournalLock in src/domain/graph/journal.ts:143 (7 transitive callers)
  • withJournalLock in src/domain/graph/journal.ts:161 (8 transitive callers)
  • appendJournalEntries in src/domain/graph/journal.ts:229 (4 transitive callers)
  • writeJournalHeader in src/domain/graph/journal.ts:249 (4 transitive callers)

Address two Greptile review issues on the journal lockfile:

- P1 TOCTOU: when two stealers observed the same stale (dead-PID)
  holder, one's unlink could cross the other's fresh openSync('wx')
  acquisition, admitting both writers into the critical section. Replace
  the unlink + openSync('wx') pattern with an atomic write-tmp + rename
  steal, then verify via a random nonce. If another stealer's rename
  landed after ours, we bail and retry instead of unlinking their live
  lockfile. Release now also nonce-verifies before unlinking.

- P2 event-loop blockage: replace Atomics.wait with a short hrtime
  busy-spin so pending FS events and timer callbacks in the watcher
  keep firing during the 25ms retry window.

Add a regression test for the stale-lock steal race that asserts we
never unlink a lockfile whose nonce does not match our own.

Impact: 7 functions changed, 10 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +98 to +108
try {
  const fd = fs.openSync(lockPath, 'wx');
  try {
    fs.writeSync(fd, `${process.pid}\n${nonce}\n`);
  } catch {
    /* PID stamp is advisory; fd is still exclusive */
  }
  return { fd, nonce };
} catch (e) {
  if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
}
Contributor

P1 Silent writeSync failure voids the nonce and breaks mutual exclusion

If fs.writeSync throws (e.g. ENOSPC, I/O error), the lockfile is created by openSync('wx') but remains empty — the nonce is never written. Two consequences compound:

  1. releaseJournalLock reads the empty file, content.includes(lock.nonce) is false, and the lockfile is orphaned rather than unlinked.
  2. Any concurrent waiter reads the empty file, computes Number('') → 0, then calls isPidAlive(0) which returns false (the pid <= 0 guard). With holderAlive = false, it immediately calls trySteal — stealing a live, active lock and breaking mutual exclusion.

The comment "PID stamp is advisory; fd is still exclusive" only holds when there are no concurrent waiters, but a journal with a watcher + build process is exactly the concurrent scenario this lock is meant to protect.

Fix: when the write fails, release the fd and unlink synchronously (we still hold the exclusive fd at that point) and retry, rather than proceeding into fn() with an un-stamped lockfile.
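A sketch of that fix, with an injectable stamp step standing in for the real PID/nonce write (acquireOnce and its names are illustrative, not the PR's code):

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// One acquisition attempt. If stamping the lockfile fails, tear down the
// file we just created (we still hold the exclusive fd) so waiters never
// observe an empty, nonce-less lockfile, then signal the caller to retry.
function acquireOnce(lockPath: string, stamp: (fd: number) => void): number | null {
  let fd: number;
  try {
    fd = fs.openSync(lockPath, 'wx');
  } catch (e) {
    if ((e as NodeJS.ErrnoException).code !== 'EEXIST') throw e;
    return null; // contended; caller retries
  }
  try {
    stamp(fd); // may throw on ENOSPC / I/O error
    return fd;
  } catch {
    fs.closeSync(fd);
    try { fs.unlinkSync(lockPath); } catch { /* already gone */ }
    return null; // caller retries with a clean slate
  }
}

// Usage: a failing stamp must not leave an orphaned lockfile behind.
const lock = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'lk2-')), 'j.lock');
const failed = acquireOnce(lock, () => { throw new Error('simulated ENOSPC'); });
const orphaned = fs.existsSync(lock);
const fd = acquireOnce(lock, (f) => fs.writeSync(f, `${process.pid}\n`));
```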




Development

Successfully merging this pull request may close these issues.

bug(journal): concurrent appends are not atomic — corrupted under parallel builds
