Skip to content

fix(snapshot): close TOCTOU race on save/restore/delete#1003

Open
carlos-alm wants to merge 5 commits intomainfrom
fix/snapshot-toctou-race
Open

fix(snapshot): close TOCTOU race on save/restore/delete#1003
carlos-alm wants to merge 5 commits intomainfrom
fix/snapshot-toctou-race

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • snapshotSave --force now writes VACUUM INTO to a per-pid temp file and atomically fs.renameSyncs over the destination. Two concurrent saves can no longer see a missing file or interleave their VACUUM writes.
  • snapshotRestore copies the snapshot to a temp path next to the live DB and atomically renames, so readers never observe a partially-written database file. WAL/SHM sidecar removal switched to unlink + ENOENT catch.
  • snapshotDelete collapses the existsSync / unlinkSync pair into a single unlinkSync, mapping ENOENT to the existing "not found" error.

Public signatures of snapshotSave / snapshotRestore / snapshotDelete are unchanged; CLI callers in src/cli/commands/snapshot.ts need no updates.

Closes #995

Test plan

  • Existing 26 snapshot unit tests pass unchanged.
  • New regression: leaves no temp files in snapshots dir after success — ensures the temp artifact is renamed (not leaked) on the success path.
  • New regression: does not corrupt output when two --force saves race on the same name — issues two concurrent snapshotSave(..., { force: true }) calls via Promise.allSettled and asserts the surviving file is still a valid SQLite DB with the expected contents. Pre-fix this could produce a truncated/interleaved file.
  • biome check clean on both changed files.
  • Manual repro: two shells running codegraph snapshot save checkpoint --force against the same .codegraph/ no longer produce a truncated destination.

Use temp-file + atomic rename for `snapshotSave --force` and `snapshotRestore`
so two concurrent saves or a restore racing with a reader can't observe a
missing or partially-written database file. `snapshotDelete` now unlinks
directly and maps ENOENT to the existing "not found" error instead of a
separate existsSync check.

Adds two regression tests: temp-file cleanup on success, and two concurrent
--force saves producing a valid SQLite destination.

Closes #995

Impact: 3 functions changed, 1 affected
@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude finished @carlos-alm's task —— View job


PR Review: fix(snapshot): close TOCTOU race on save/restore/delete

Todo List:

  • Phase 1: Root cause analysis - understand the TOCTOU race problem
  • Phase 2: Configuration change detection - check for high-risk changes
  • Phase 3: Test integrity - verify test changes are appropriate
  • Phase 4: Code quality review - examine the fix implementation
  • Phase 5: Backlog compliance assessment
  • Provide final assessment and rating

Root Cause Analysis

Problem Being Solved: TOCTOU (Time-Of-Check-Time-Of-Use) race conditions in snapshot operations where concurrent save operations could corrupt files, and restore operations could leave readers with partially-written databases.

Why This Approach: The fix implements atomic operations using temp files + atomic fs.renameSync(), which is the correct pattern for preventing races in file operations. This approach ensures that readers never see partially-written files and concurrent writers don't corrupt each other's output.

Risk Assessment: LOW RISK - The changes preserve existing public APIs, use well-established atomic file patterns, and include comprehensive regression tests.

Backlog Compliance

  • Zero-dep: ✓ — No new dependencies, uses only Node.js built-in fs operations
  • Foundation-aligned: ✓ — Pure defensive bug fix, aligns with P1 (reliability) and doesn't violate any principles
  • Problem-fit: 3 — Improves system reliability but doesn't directly address "AI agents waste tokens in large codebases"
  • Breaking: No — Public API signatures unchanged, behavior is only more reliable
  • Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Technical Review

Implementation Quality ⭐⭐⭐⭐☆

snapshotSave changes (lines 46-74):

  • Atomic pattern: VACUUM INTO temp file → atomic rename is the correct approach
  • Unique temp names: process.pid + Date.now() prevents collision
  • Same filesystem: Temp file in snapshots dir ensures atomic rename works
  • Cleanup on error: try/catch with best-effort temp file removal
  • SQL injection protection: Single quotes properly escaped in VACUUM command

snapshotRestore changes (lines 95-123):

  • Sidecar cleanup: Removes WAL/SHM files before restore to prevent journal replay
  • Atomic restore: Copy to temp → rename ensures readers see complete file
  • ENOENT handling: Proper error handling for missing sidecar files

snapshotDelete changes (lines 163-170):

  • Race elimination: Single unlinkSync + ENOENT check replaces existsSync + unlinkSync
  • Consistent error: ENOENT properly mapped to existing "not found" error

Test Coverage ⭐⭐⭐⭐⭐

New regression tests:

  1. Line 136-142: Verifies no temp file leakage - excellent defensive test
  2. Line 144-167: Concurrent save race test - this is exemplary testing. Uses Promise.allSettled to verify at least one succeeds and final DB is valid.

Test quality observations:

  • ✅ Existing 26 tests continue to pass (per PR description)
  • ✅ New tests target the specific race conditions being fixed
  • ✅ Race test validates the most critical scenario (concurrent --force overwrites)
  • ✅ Tests verify both success paths and cleanup behavior

Code Quality Details

Minor observations:

  • SQL escaping at line 60: tmp.replace(/'/g, "''") is correct SQLite escaping
  • Error handling is consistent with existing patterns in the codebase
  • Temp file naming scheme is collision-resistant for typical usage
  • The readonly: true SQLite connection for VACUUM is appropriate

Architecture alignment:

  • Changes stay within the snapshot feature boundary
  • No changes to CLI interface or other modules required
  • Maintains separation of concerns

Critical Concerns

None. This is a solid defensive fix that eliminates race conditions without breaking changes.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Well-implemented race condition fix with excellent test coverage. The implementation follows established atomic file operation patterns, preserves all existing behavior, and includes comprehensive regression tests that directly target the race scenarios. The concurrent save test is particularly well-designed. Only reason it's not 5-star is that it's "just" a bug fix rather than innovative functionality.

Minor suggestions (non-blocking):

  1. The temp file cleanup in catch blocks uses empty catch catch { /* cleanup best-effort */ } - consider logging cleanup failures at debug level
  2. Consider documenting the TOCTOU pattern in code comments for future maintainers

This PR should be merged without concerns.


@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 21, 2026

Greptile Summary

This PR closes the TOCTOU races in snapshotSave, snapshotRestore, and snapshotDelete by routing all writes through a unique temp file (pid + timestamp + random bytes) and placing the result atomically — renameSync for --force, linkSync for the non-force path. The error-path cleanup is handled correctly in all branches, and the new worker_threads-based regression tests genuinely exercise concurrent writes.

Confidence Score: 5/5

Safe to merge — all three TOCTOU surfaces are correctly closed and prior review concerns have been fully addressed.

No P0 or P1 bugs found. The temp-file lifecycle (pre-unlink before VACUUM, outer-catch cleanup on failure, best-effort unlink after linkSync) is correct in every code path. The non-force linkSync EEXIST check is POSIX-correct and NTFS-compatible. Both race tests use real worker threads with proper message-passing, and the assertions validate the contract rather than implementation details. Previous reviewer concerns (pseudo-concurrency, non-force TOCTOU window) are both resolved.

No files require special attention.

Important Files Changed

Filename Overview
src/features/snapshot.ts Core TOCTOU fix: introduces per-call unique temp file + atomic renameSync (force) / linkSync (non-force) placement; snapshotDelete simplified to direct unlinkSync+ENOENT mapping; snapshotRestore gains copyFileSync-to-tmp + renameSync atomicity and ENOENT-safe sidecar removal. Logic is correct in all error paths.
tests/unit/snapshot-race-worker.mjs New ESM worker entry for genuine cross-thread concurrency; imports snapshot.ts, calls snapshotSave, and posts structured ok/error message back to parent. Clean and minimal.
tests/unit/snapshot.test.ts Adds three regression tests: temp-file leak check, force-race integrity check, and non-force atomic-reject race; uses real worker_threads for genuine concurrency. Assertions correctly validate the expected contract.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([snapshotSave called]) --> B{force?}
    B -- "false" --> C{existsSync dest?}
    C -- yes --> D([throw ConfigError fast-path])
    C -- no --> E[mkdirSync snapshots dir]
    B -- "true" --> E
    E --> F["generate unique tmp path\n(pid + Date.now() + randomBytes)"]
    F --> G[VACUUM INTO tmp]
    G --> H{force?}
    H -- "true" --> I["renameSync(tmp → dest)\natomic overwrite"]
    H -- "false" --> J["linkSync(tmp, dest)\natomic guard"]
    J -- "EEXIST" --> K[outer catch: unlink tmp] --> L([throw ConfigError])
    J -- "other error" --> K
    J -- "success" --> M["unlinkSync(tmp)\nbest-effort cleanup"]
    I -- "success" --> N[statSync dest → return]
    M --> N
    I -- "error" --> O[outer catch: unlink tmp] --> P([rethrow])
    N --> Q([return name/path/size])
Loading

Reviews (5): Last reviewed commit: "Merge branch 'main' into fix/snapshot-to..." | Re-trigger Greptile

Comment on lines +144 to +167
it('does not corrupt output when two --force saves race on the same name', async () => {
// Prime the target so both calls take the --force overwrite path.
snapshotSave('race', { dbPath });

// Issue both concurrently. Pre-fix, the existsSync/unlinkSync/VACUUM INTO
// sequence could interleave and produce a corrupt DB; post-fix, rename
// is atomic so the winner's file is intact and the loser either succeeds
// (overwriting) or fails cleanly.
const results = await Promise.allSettled([
Promise.resolve().then(() => snapshotSave('race', { dbPath, force: true })),
Promise.resolve().then(() => snapshotSave('race', { dbPath, force: true })),
]);

// At least one save must have succeeded.
const succeeded = results.filter((r) => r.status === 'fulfilled');
expect(succeeded.length).toBeGreaterThanOrEqual(1);

// The final file must be a valid SQLite DB with the expected contents.
const finalPath = path.join(snapshotsDir(dbPath), 'race.db');
const db = new Database(finalPath, { readonly: true });
const rows = db.prepare('SELECT name FROM nodes').all();
db.close();
expect(rows).toEqual([{ name: 'hello' }]);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Race test doesn't exercise actual concurrency

snapshotSave relies entirely on better-sqlite3, a synchronous native module. Wrapping two synchronous calls in Promise.resolve().then(...) produces microtasks that execute sequentially — the first .then() callback runs to completion before the second one is scheduled. Promise.allSettled here is equivalent to two back-to-back snapshotSave calls: the second never starts until the first's renameSync has already finished.

This test would pass identically against the pre-fix code, because there is no opportunity for interleaving on the Node.js microtask queue. The test validates the success path (which is already covered by the "overwrites with force" test) rather than the TOCTOU concern it describes.

Genuine concurrency in this module requires worker_threads. A worker-based harness that spawns two threads and synchronises them to fire their snapshotSave calls simultaneously would be the right approach, though admittedly heavyweight for a unit test. If that's out of scope, the test comment should be softened to avoid implying concurrent execution is happening.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — replaced the Promise-based pseudo-concurrency with a worker_threads harness (tests/unit/snapshot-race-worker.mjs) that spawns two real threads racing on snapshotSave. The new test actually exercises the TOCTOU path and would fail against the pre-fix code. Also added a second test for the non-force concurrent-save case.

Comment thread src/features/snapshot.ts
Comment on lines +40 to 42
if (!options.force && fs.existsSync(dest)) {
throw new ConfigError(`Snapshot "${name}" already exists. Use --force to overwrite.`);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Non-force path still has a TOCTOU window

With force: false, two concurrent callers can both pass existsSync(dest)false, both proceed through VACUUM INTO tmp, and both successfully renameSync(tmp, dest) — silently overwriting each other with no error. The user expecting a "snapshot already exists" guard gets no protection in this race.

This is a narrower window than the --force case (the PR's stated target), so treating it as a known limitation is reasonable, but worth noting here since the fix otherwise closes most of the race surface.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — closed the non-force TOCTOU by replacing renameSync with linkSync(tmp, dest) on the non-force path. linkSync fails atomically with EEXIST if dest already exists, making the existence check authoritative even under concurrent writers. The cheap existsSync above is kept as a fail-fast for the common case. Added a regression test (atomically rejects a concurrent non-force save when one already won) that spawns two worker threads and asserts exactly one wins with the other receiving the "already exists" error.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 21, 2026

Codegraph Impact Analysis

6 functions changed6 callers affected across 3 files

  • snapshotSave in src/features/snapshot.ts:28 (1 transitive callers)
  • snapshotRestore in src/features/snapshot.ts:121 (1 transitive callers)
  • snapshotDelete in src/features/snapshot.ts:195 (1 transitive callers)
  • dbPath in tests/unit/snapshot-race-worker.mjs:13 (0 transitive callers)
  • name in tests/unit/snapshot-race-worker.mjs:13 (5 transitive callers)
  • force in tests/unit/snapshot-race-worker.mjs:13 (0 transitive callers)

Address Greptile review feedback:

1. Non-force path still had a TOCTOU window: two concurrent callers could
   both pass existsSync(dest) and overwrite each other silently. Replace
   the final rename with linkSync(tmp, dest), which fails atomically with
   EEXIST if another writer won the race, then unlinkSync(tmp). The
   fail-fast existsSync above is retained as a cheap guard; linkSync is
   the authoritative check.

2. The original race regression test used Promise.resolve().then(...) on
   synchronous better-sqlite3 calls, so both microtasks executed
   sequentially and the test could not reproduce the TOCTOU it claimed
   to guard. Replace it with a worker_threads-based harness that spawns
   two workers racing on snapshotSave, plus a new non-force race test
   that verifies exactly one caller wins and the other sees a clear
   "already exists" error.

3. process.pid is shared across worker_threads in the same process, so
   add randomBytes(6) to the temp filename to keep concurrent callers
   in any thread from colliding on the same tmp path.

4. Log best-effort cleanup failures at debug so repeated tmp leaks
   surface in troubleshooting without spamming normal runs (per Claude
   review's minor suggestion).

Impact: 2 functions changed, 1 affected
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed review feedback:

Greptile P2 — non-force TOCTOU window: Replaced the final renameSync on the non-force path with linkSync(tmp, dest), which fails atomically with EEXIST if the destination already exists. The existsSync check stays as a cheap fail-fast, but linkSync is now the authoritative guard under concurrent writers.

Greptile P2 — race test didn't exercise concurrency: Replaced the Promise.resolve().then(...) wrapping (which produces sequential microtasks over synchronous better-sqlite3 calls) with a worker_threads-based harness (tests/unit/snapshot-race-worker.mjs). Two workers race on snapshotSave in genuine cross-thread concurrency. Also added a second regression test for the non-force race that verifies exactly one caller wins and the other sees "already exists".

Claude minor suggestion — debug-log cleanup failures: Best-effort unlinkSync cleanups in snapshotSave and snapshotRestore now log at debug level when they fail (excluding ENOENT which is the expected case).

Incidental fix — worker_thread pid collisions: process.pid is shared across worker_threads in the same process, so the temp filename now includes randomBytes(6) to keep concurrent callers in any thread from colliding on the tmp path.

All 29 tests in tests/unit/snapshot.test.ts pass locally.

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Apr 22, 2026

Claude finished @carlos-alm's task —— View job


Review of TOCTOU Race Fixes

  • Review updated snapshot.ts implementation
  • Examine new worker_threads test harness
  • Validate fixes for both TOCTOU issues
  • Check code quality and test coverage

Assessment: Excellent Fixes ⭐⭐⭐⭐⭐

The changes effectively address both P2 issues from Greptile's review with well-engineered solutions:

🔧 Fixed Issue #1: Race Test Concurrency

src/features/snapshot.ts: Added randomBytes(6) to temp filenames to prevent worker thread PID collisions (lines 59, 148)

tests/unit/snapshot-race-worker.mjs: Clean worker_threads harness that:

  • Genuine concurrency: Uses separate threads instead of sequential microtasks
  • Imports correctly: Falls back from dist build to TS source seamlessly
  • Simple & focused: 21 lines, does exactly what's needed

Updated test: Now uses Promise.allSettled([spawnSaveWorker(...), spawnSaveWorker(...)]) with real threads. This would actually fail against pre-fix code, making it a proper regression test.

🔧 Fixed Issue #2: Non-Force TOCTOU

Key change (lines 84-92): Replaced renameSync(tmp, dest) with linkSync(tmp, dest) on non-force path:

  • Atomic guard: linkSync fails with EEXIST if destination exists, preventing silent overwrites
  • Proper cleanup: Subsequent unlinkSync(tmp) removes temp file after successful link
  • Preserves semantics: Still maintains the "already exists" error message

New regression test: atomically rejects a concurrent non-force save when one already won verifies exactly one worker succeeds and the other gets "already exists" error.

🎯 Code Quality Improvements

  1. Debug logging: Best-effort cleanup failures now log at debug level (lines 98, 106, 157) - addresses Claude's minor suggestion
  2. Better temp naming: randomBytes(6) prevents worker thread collisions where process.pid is shared
  3. Comprehensive comments: Clear explanations of the atomic patterns and why they work

🧪 Test Coverage

  • 29 total tests (all passing per comment)
  • 2 new regression tests that directly target the race scenarios
  • Worker threads harness provides reusable concurrency testing
  • Edge case coverage: Both force/non-force concurrent save scenarios

🚀 Implementation Details

The linkSync approach for non-force is particularly elegant:

  • POSIX semantics: Creates hard link that fails atomically if target exists
  • Windows compatible: NTFS supports hard links
  • Clean separation: Force path uses renameSync (overwrite), non-force uses linkSync (fail-if-exists)

Final Recommendation

LGTM - The fixes are well-engineered, thoroughly tested, and address the exact race conditions identified. The worker_threads test harness is a nice addition that could be reused for other concurrency testing. Ready to merge.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(snapshot): TOCTOU race on --force save path

1 participant