
fix(native): prevent SQLITE_CORRUPT in incremental pipeline#728

Merged
carlos-alm merged 3 commits into main from fix/wal-checkpoint-corruption
Apr 1, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Root cause: Two different SQLite builds (better-sqlite3 bundles SQLite 3.51, rusqlite bundles ~3.46) share the same WAL file. When one library checkpoints WAL frames written by the other, cross-library interpretation can corrupt B-tree pages, which surfaces as SQLITE_CORRUPT during incremental rebuilds.
  • Checkpoint before every cross-library handoff: Each library now checkpoints its own WAL frames via PRAGMA wal_checkpoint(TRUNCATE) before closing or yielding to the other library.
  • Disable mmap on read-write connections: Eliminates Windows mmap/regular-I/O cache coherence issues between the two libraries sharing a DB file.
  • resumeJsDb now checkpoints: Previously a no-op — native-written WAL frames accumulated until better-sqlite3 checkpointed them at close time. Now rusqlite checkpoints its own frames immediately after each analysis write batch.
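
The ownership rule in the summary above can be illustrated with a toy model. This is only a sketch: `ToyWal`, `foreignFramesBeforeFix`, and `foreignFramesAfterFix` are hypothetical names, and the model tracks nothing but which library wrote each pending frame. It does not reproduce real SQLite WAL behaviour.

```typescript
type Library = "better-sqlite3" | "rusqlite";

// Toy stand-in for the shared WAL file: it only remembers who wrote each frame.
class ToyWal {
  private frames: Library[] = [];

  write(by: Library): void {
    this.frames.push(by);
  }

  // Applies every pending frame and empties the log (like TRUNCATE).
  // Returns how many of those frames were written by the other library.
  checkpoint(by: Library): number {
    const foreign = this.frames.filter((writer) => writer !== by).length;
    this.frames = [];
    return foreign;
  }
}

// Before the fix: rusqlite writes, then better-sqlite3 checkpoints at close time.
function foreignFramesBeforeFix(): number {
  const wal = new ToyWal();
  wal.write("rusqlite");
  wal.write("rusqlite");
  return wal.checkpoint("better-sqlite3");
}

// After the fix: rusqlite checkpoints its own frames before handing off.
function foreignFramesAfterFix(): number {
  const wal = new ToyWal();
  wal.write("rusqlite");
  wal.write("rusqlite");
  wal.checkpoint("rusqlite");
  return wal.checkpoint("better-sqlite3");
}
```

In this model the pre-fix sequence leaves better-sqlite3 applying two frames it did not write, while the post-fix sequence leaves it none, which is the invariant the checkpoints are meant to enforce.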

Changes

| File | What |
| --- | --- |
| pipeline.ts | Add wal_checkpoint(TRUNCATE) after initSchema, before both nativeDb.close() calls, and in resumeJsDb callback |
| native_db.rs | Remove PRAGMA mmap_size from read-write connection pragmas (kept on read-only) |

Test plan

  • tests/builder/pipeline.test.ts — 4/4 passed
  • tests/integration/incremental-parity.test.ts — 12/12 passed
  • tests/integration/watcher-rebuild.test.ts — 4/4 passed
  • Biome lint clean
  • CI: incremental benchmark should no longer report native: null
  • CI: PRAGMA integrity_check passes after native full + incremental builds
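
As a rough sketch of the last check, the helper below treats a database as healthy only when `PRAGMA integrity_check` returns the single row `ok`. The `PragmaRunner` interface and `assertIntegrity` function are hypothetical; the row shape is modelled on how better-sqlite3 returns pragma results and is an assumption here, not the project's actual CI code.

```typescript
// Assumed row shape for `PRAGMA integrity_check`: one text column per row.
type IntegrityRow = { integrity_check: string };

// Hypothetical handle: anything that can run a PRAGMA and return its rows.
interface PragmaRunner {
  pragma(source: string): unknown;
}

// Throws when the check reports anything other than "ok".
function assertIntegrity(db: PragmaRunner): void {
  const rows = db.pragma("integrity_check") as IntegrityRow[];
  const problems = rows
    .map((row) => row.integrity_check)
    .filter((message) => message !== "ok");
  if (problems.length > 0) {
    throw new Error(`integrity_check failed: ${problems.join("; ")}`);
  }
}
```

Running such a check after both the native full build and the incremental build would catch corruption at the point it is introduced rather than on a later read.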

Closes #715, closes #717

…-library handoff

Root cause: the native engine opens two SQLite connections from different
libraries (better-sqlite3 bundles SQLite 3.51, rusqlite bundles ~3.46).
Both share the same WAL file. When one library checkpoints WAL frames
written by the other, the cross-library interpretation can corrupt B-tree
pages — especially on Windows where mmap and regular I/O are not
cache-coherent.

The WAL guard from #696 closed nativeDb during pipeline stages but left
three gaps where cross-library WAL frames accumulated:

1. Setup: nativeDb.initSchema() wrote WAL frames before better-sqlite3
   ran its initSchema
2. Analysis phase: rusqlite wrote complexity/CFG/dataflow/AST data,
   then closed without checkpointing — leaving frames for better-sqlite3
   to checkpoint at close time
3. resumeJsDb was a no-op, so native-written frames persisted in the WAL

Fix:
- Checkpoint WAL through rusqlite after initSchema (setup)
- Checkpoint WAL through rusqlite before every nativeDb close
- Add resumeJsDb checkpoint so native-written frames are applied by the
  same library that wrote them
- Disable mmap on read-write rusqlite connections to eliminate Windows
  mmap/regular-I/O cache coherence issues

Closes #715, closes #717
@claude

claude bot commented Apr 1, 2026

Claude finished @carlos-alm's task


Reviewing PR #728: SQLITE_CORRUPT Fix

Comprehensive code review following the 5-phase methodology:

Todo List

  • Read and analyze changed files
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity
  • Phase 4: Code Quality
  • Phase 5: Backlog Compliance Assessment
  • Provide final recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: SQLITE_CORRUPT errors during incremental builds when two SQLite libraries (better-sqlite3 v3.51, rusqlite ~v3.46) share the same WAL file. Cross-library WAL frame interpretation causes B-tree page corruption.

Why This Approach:

  • WAL Checkpointing: Forces each library to checkpoint its own WAL frames via PRAGMA wal_checkpoint(TRUNCATE) before handoff, ensuring clean state
  • mmap Removal: Eliminates Windows cache coherence issues between libraries
  • Strategic Placement: Checkpoints at all critical handoff points (init, suspend, resume, close)

Risk Assessment: ✅ Low Risk - This addresses a real corruption issue with a proven SQLite isolation pattern. The approach is defensive and follows SQLite best practices.

Phase 2: Configuration Change Detection ✅

No HIGH-RISK changes detected:

  • ❌ No Biome/linting rules disabled or relaxed
  • ❌ No test timeouts increased or coverage thresholds lowered
  • ❌ No build configuration changes
  • ❌ No TypeScript strict mode flags disabled
  • Only change: Removed PRAGMA mmap_size from read-write connections (justified for cache coherence)

Phase 3: Test Integrity ✅

No RED FLAGS detected:

  • ✅ PR mentions all existing tests passing (4/4, 12/12, 4/4)
  • ✅ No removed or weakened assertions
  • ✅ No test files modified alongside code changes
  • ✅ No eslint-disable comments added
  • ✅ Reasonable PR size (2 files, +27/-6 lines)
  • ✅ Single focused concern (SQLite corruption fix)

Phase 4: Code Quality ✅

Implementation Analysis:

pipeline.ts:133 - Schema init checkpoint: ✅ Correct placement after nativeDb.initSchema()

pipeline.ts:196, 247 - Close-time checkpoints: ✅ Proper error handling with try/catch, prevents dual-connection corruption

pipeline.ts:52-56 - resumeJsDb callback: ✅ CRITICAL FIX - Previously a no-op, now checkpoints native WAL frames before better-sqlite3 resumes

native_db.rs:385 - mmap removal: ✅ Well-commented rationale, kept on read-only connections (no WAL conflict)

Error Handling: ✅ Appropriate silent failure for checkpoint errors (DB may already be closed)

Performance Impact: ✅ Minimal - checkpoints are fast operations, only at handoff boundaries

Security: ✅ No SQL injection risks, using prepared PRAGMA statements

Phase 5: Backlog Compliance Assessment ✅

Zero-dep: ✅ — No new runtime dependencies added

Foundation-aligned: ✅ — Fully compliant with all principles:

  • ✅ P1 (always-current): Fixes corruption affecting build accuracy
  • ✅ P4 (zero-cost core): No external dependencies, improves reliability
  • ✅ P8 (not an agent): Pure infrastructure fix, no decision-making
  • ✅ P5 (embeddable-first): Maintains programmatic API contract

Problem-fit: 5/5 — Directly addresses core problem: prevents AI agents from getting corrupted dependency data that would cause hallucinated dependencies and missed blast radius detection.

Breaking: No — Purely internal implementation change, no API/CLI changes

Tier classification: Tier 1 (zero-dep ✅ + foundation-aligned ✅)

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Real SQLITE_CORRUPT issue from cross-library WAL sharing
**Why This Approach**: Proven SQLite isolation pattern with strategic checkpointing
**Risk Assessment**: Low risk, defensive fix following SQLite best practices

## Backlog Compliance
- Zero-dep: ✓ — no new dependencies
- Foundation-aligned: ✓ — fully compliant with all principles  
- Problem-fit: 5 — directly prevents corrupted dependency data
- Breaking: No — internal implementation only
- Tier: 1

## Critical Concerns
None. This is a well-implemented fix for a real corruption issue.

## Final Recommendation
- Rating: ⭐⭐⭐⭐☆ (4/5)
- Action: APPROVE
- Reasoning: Addresses real corruption bug with proven SQLite pattern. Excellent root cause analysis, conservative implementation, comprehensive checkpointing at all handoff points. Only minor concern is the complexity of dual-engine coordination, but this is inherent to the architecture and well-handled here.

@greptile-apps
Contributor

greptile-apps bot commented Apr 1, 2026

Greptile Summary

This PR addresses SQLITE_CORRUPT errors during incremental rebuilds caused by two SQLite libraries (rusqlite and better-sqlite3) sharing the same WAL-mode database file. The fix applies PRAGMA wal_checkpoint(TRUNCATE) at every library handoff point and disables mmap on read-write rusqlite connections.

Key changes:

  • pipeline.ts: Adds wal_checkpoint(TRUNCATE) after initSchema, before both nativeDb.close() calls, and converts resumeJsDb from a no-op into a real rusqlite checkpoint.
  • pipeline.ts: Properly closes nativeDb in the setup fallback error path (previously the connection could leak).
  • native_db.rs: Removes PRAGMA mmap_size from read-write connections to eliminate Windows mmap/regular-I/O cache-coherence issues between the two libraries; read-only connections retain mmap_size.

Minor issues found:

  • The post-analyses nativeDb.close() block does not clear ctx.engineOpts.nativeDb, inconsistent with the earlier close block which explicitly nulls it to prevent stale-reference access.
  • resumeJsDb uses a bare catch {} that silently suppresses all checkpoint errors, not just "connection already closed" — a real I/O failure during checkpoint would be invisible and leave better-sqlite3 exposed to cross-library WAL frames.
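
One possible way to address both findings is sketched below. All names here (`NativeDbLike`, `EngineCtx`, `isAlreadyClosed`, the `warn` callback) are hypothetical stand-ins, and the way an "already closed" error is recognised is an assumption; the real NativeDatabase API and error messages may differ.

```typescript
// Hypothetical minimal surface of the native handle.
interface NativeDbLike {
  exec(sql: string): void;
  close(): void;
}

// Hypothetical slice of the pipeline context that holds both references.
interface EngineCtx {
  nativeDb: NativeDbLike | undefined;
  engineOpts: { nativeDb: NativeDbLike | undefined };
}

// Assumption: a closed connection reports an error whose message mentions "closed".
function isAlreadyClosed(err: unknown): boolean {
  return err instanceof Error && /closed/i.test(err.message);
}

// Swallow only the expected "already closed" case; surface everything else.
function resumeJsDb(ctx: EngineCtx, warn: (msg: string) => void): void {
  try {
    ctx.nativeDb?.exec("PRAGMA wal_checkpoint(TRUNCATE)");
  } catch (err) {
    if (!isAlreadyClosed(err)) {
      warn(`native WAL checkpoint failed: ${String(err)}`);
    }
  }
}

// Close the handle and clear both references so no stale handle remains.
function closeNativeDb(ctx: EngineCtx): void {
  try {
    ctx.nativeDb?.close();
  } catch {
    // Closing is best-effort; the references are cleared either way.
  }
  ctx.nativeDb = undefined;
  ctx.engineOpts.nativeDb = undefined;
}
```

Reporting unexpected checkpoint failures through a callback keeps the pipeline running while still making a real I/O error visible.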

Confidence Score: 4/5

Safe to merge — the core WAL-isolation fix is correct; two P2-level inconsistencies are worth cleaning up but don't block the fix.

The root-cause analysis and checkpoint placement are sound. The two remaining findings are P2: an inconsistent ctx.engineOpts.nativeDb clear in the post-analyses close block, and a bare catch {} in resumeJsDb that silently swallows non-close-related checkpoint errors. Neither is a definite current bug (finalize uses JS paths; checkpoint errors rarely happen outside of I/O failures), but they could mask real problems.

src/domain/graph/builder/pipeline.ts — post-analyses close block and resumeJsDb error handling

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/domain/graph/builder/pipeline.ts | Adds wal_checkpoint(TRUNCATE) at every rusqlite→better-sqlite3 handoff point (post-initSchema, pre-close ×2, and in resumeJsDb); also adds proper nativeDb.close() in the setup error path. Two minor inconsistencies: ctx.engineOpts.nativeDb is not cleared in the post-analyses close block, and resumeJsDb swallows all checkpoint errors silently. |
| crates/codegraph-core/src/native_db.rs | Removes PRAGMA mmap_size from read-write connections to avoid Windows mmap/regular-I/O cache incoherence when two SQLite libraries share a WAL file; read-only connections intentionally retain mmap_size. Change is targeted and correct. |

Sequence Diagram

sequenceDiagram
    participant BS3 as better-sqlite3 (ctx.db)
    participant RQ as rusqlite (ctx.nativeDb)
    participant WAL as WAL file

    Note over BS3,WAL: setupPipeline
    RQ->>WAL: initSchema writes (rusqlite frames)
    RQ->>WAL: wal_checkpoint(TRUNCATE) — flush rusqlite frames to main DB
    BS3->>WAL: initSchema writes (bs3 frames)

    Note over BS3,WAL: runPipelineStages — start
    RQ->>WAL: wal_checkpoint(TRUNCATE) — flush before close
    RQ-->>RQ: close()
    BS3->>WAL: pipeline stages (collect, parse, insert, resolve, edges, structure)

    Note over BS3,WAL: runAnalyses — per feature module
    BS3->>WAL: suspendJsDb: wal_checkpoint(TRUNCATE) — flush bs3 frames
    RQ->>WAL: native analysis writes (rusqlite frames)
    RQ->>WAL: resumeJsDb: wal_checkpoint(TRUNCATE) — flush rusqlite frames
    BS3->>WAL: resumes reading (only sees main DB pages)

    Note over BS3,WAL: runPipelineStages — end
    RQ->>WAL: wal_checkpoint(TRUNCATE) — flush before close
    RQ-->>RQ: close()
    BS3->>WAL: finalize (JS paths only)
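
The runAnalyses portion of the diagram above can be sketched as a single wrapper. This is a minimal illustration, assuming a hypothetical `SqlHandle` interface and a hypothetical `runNativeAnalysis` function; the actual suspendJsDb and resumeJsDb callbacks in pipeline.ts may be structured differently.

```typescript
// Hypothetical minimal surface shared by both connections.
interface SqlHandle {
  exec(sql: string): void;
}

const CHECKPOINT = "PRAGMA wal_checkpoint(TRUNCATE)";

// Each library flushes its own frames before the other one touches the WAL.
function runNativeAnalysis(
  jsDb: SqlHandle,
  nativeDb: SqlHandle,
  analyse: (db: SqlHandle) => void,
): void {
  // suspendJsDb step: flush better-sqlite3 frames to the main DB.
  jsDb.exec(CHECKPOINT);
  try {
    // Native analysis writes go through rusqlite only.
    analyse(nativeDb);
  } finally {
    // resumeJsDb step: flush rusqlite frames before better-sqlite3 resumes.
    nativeDb.exec(CHECKPOINT);
  }
}
```

The `finally` block is a design choice of this sketch: it runs the native checkpoint even when the analysis throws, so better-sqlite3 never resumes with rusqlite frames still pending.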

Reviews (2): Last reviewed commit: "fix: separate checkpoint and close into ..."

…728)

Split checkpoint + close operations into independent try/catch blocks at
all three sites so close() always runs even if checkpoint throws. Also
explicitly close the NativeDatabase in setupPipeline's catch path to
prevent a live rusqlite connection from lingering until GC.
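
The split described in this commit message might look roughly like the sketch below. `NativeDbLike`, `checkpointThenClose`, and the `warn` callback are hypothetical names used for illustration, not the identifiers in pipeline.ts.

```typescript
// Hypothetical minimal surface of the native handle.
interface NativeDbLike {
  exec(sql: string): void;
  close(): void;
}

// Checkpoint and close live in separate try/catch blocks, so a failed
// checkpoint can no longer skip the close.
function checkpointThenClose(
  db: NativeDbLike,
  warn: (msg: string) => void = () => {},
): void {
  try {
    db.exec("PRAGMA wal_checkpoint(TRUNCATE)");
  } catch (err) {
    warn(`checkpoint failed: ${String(err)}`);
  }
  try {
    db.close();
  } catch (err) {
    warn(`close failed: ${String(err)}`);
  }
}
```

With a single shared try block, an exception from the checkpoint would jump straight to the catch and leave the connection open until garbage collection.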
@carlos-alm
Contributor Author

Both Greptile issues addressed in 0415113:

P1 — close() skipped when checkpoint throws: Separated checkpoint and close into independent try/catch blocks at both close sites (lines ~192 and ~243). Now close() always runs regardless of checkpoint outcome, preventing a live rusqlite connection from lingering until GC.

P2 — setupPipeline catch path doesn't close: Added explicit ctx.nativeDb?.close() (wrapped in its own try/catch) before clearing the reference. Also fixed the misleading message from "init failed" to "setup failed" since the error could come from the checkpoint call rather than init.

@carlos-alm
Contributor Author

@greptileai

@carlos-alm merged commit ef15740 into main on Apr 1, 2026
18 checks passed
@carlos-alm deleted the fix/wal-checkpoint-corruption branch on April 1, 2026 at 03:47
@github-actions bot locked and limited conversation to collaborators on Apr 1, 2026