Skip to content

feat(ledger): add SQLite run ledger#37

Merged
rianjs merged 4 commits into
mainfrom
feat/cr-06-ledger-store
May 30, 2026
Merged

feat(ledger): add SQLite run ledger#37
rianjs merged 4 commits into
mainfrom
feat/cr-06-ledger-store

Conversation

@rianjs
Copy link
Copy Markdown
Contributor

@rianjs rianjs commented May 30, 2026

Summary

  • add internal/ledger SQLite store with schema version 1 and CR-06 startup pragmas
  • add single-writer write path, typed storage constants, run/session/finding/action/named-session persistence
  • add transactional AllocateRun with recovery-mode insert-only semantics and concurrent attempt allocation coverage

Verification

  • go test ./internal/ledger
  • go test -race ./internal/ledger
  • go test ./...
  • make check

Closes #13

@rianjs
Copy link
Copy Markdown
Contributor Author

rianjs commented May 30, 2026

Architect review: clean after recovery allocation and validation coverage fixes.

STATUS: blockers=0 majors=0 minors=0 nits=0

@rianjs
Copy link
Copy Markdown
Contributor Author

rianjs commented May 30, 2026

Coverage is broad, but there are still gaps to tighten before daemon review.

  • Blocker: AllocateRun is not yet proven transactional on failure after the PR upsert and before run insert commit. A non-atomic implementation could still pass.
  • Major: negative-path tests do not consistently prove no mutation on failure; they mostly assert error classification.
  • Minor: runs_resume is asserted by index name only, not definition.
  • Nit: recovery constraint classification has a white-box helper test; public-path coverage should carry as much of that as possible.

Copy link
Copy Markdown

@monit-reviewer monit-reviewer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: e692e58

Summary

Reviewer Findings
harness-engineering:harness-architecture-reviewer 4
harness-engineering:harness-enforcement-reviewer 1
harness-engineering:harness-knowledge-reviewer 1
harness-engineering:harness-self-documenting-code-reviewer 1
security:security-code-auditor 1
harness-engineering:harness-architecture-reviewer (4 findings)

⚠️ Should Fix - internal/ledger/ledger.go:1237

isSQLiteConstraintError uses strings.Contains(err.Error(), "constraint") to classify constraint violations. This is fragile: any error whose message incidentally contains "constraint" (wrapped I/O errors, future driver message changes, trigger errors naming a constraint) will be misclassified, causing allocateRunTx to enter an incorrect retry path or return a false ErrRunExists. The modernc.org/sqlite driver exposes a structured *sqlite3.Error type with a numeric Code field; the correct check is errors.As to extract that type and compare against SQLITE_CONSTRAINT (19). This function is the load-bearing guard for the entire AllocateRun retry and recovery path.

💡 Suggestion - internal/ledger/ledger.go:356

write() holds s.mu while blocking on the unbuffered s.writes channel send. Under load with tight per-request timeouts, all concurrent callers queue behind the mutex rather than at the channel, so they cannot even begin context-cancellation detection until the previous caller finishes dispatching. Consider a buffered channel (capacity matching expected burst) or releasing the mutex before the channel send and re-checking closed state afterward to allow callers to race independently.

💡 Suggestion - internal/ledger/ledger.go:1282

The findings table declares a foreign key on session_row_id referencing sessions(session_row_id) without ON DELETE CASCADE. If a DeleteSession method is added later, SQLite will block the delete with a FK violation rather than cascading to findings, creating an asymmetric deletion contract relative to the runs → sessions cascade. Add ON DELETE CASCADE here, or add an explanatory comment if orphaned findings are acceptable.

💡 Suggestion - internal/ledger/ledger.go:1293

planned_actions.finding_id is a nullable TEXT column with no REFERENCES constraint pointing to findings(finding_id). With foreign_keys = ON, SQLite only enforces declared FK constraints; a caller can insert a planned action referencing a non-existent finding and it will be silently accepted. Given that the outbox consumer will look up findings by this ID, a dangling reference creates a silent data integrity gap. Add REFERENCES findings(finding_id), or document that the column is intentionally unvalidated.

harness-engineering:harness-enforcement-reviewer (1 findings)

⚠️ Should Fix - internal/ledger/ledger.go:344

In write(), if ctx is cancelled after the request is dispatched to the writer goroutine (first select) but before the response is received (second select), the caller receives ctx.Err() while the writer may still execute and commit the DB operation. For AllocateRun specifically, this means a run can be persisted while the caller believes the call failed. A subsequent retry with the same RunID will receive ErrRunExists, producing confusing recovery semantics. Consider returning the allocated run ID even on context cancellation, or documenting this hazard explicitly so callers can handle the ambiguous state.

harness-engineering:harness-knowledge-reviewer (1 findings)

💡 Suggestion - internal/ledger/ledger.go:412

classifyAllocateConstraint returns (retry=true, nil) when params.RunID is non-empty but the run_id does not yet exist in the database. The semantics are surprising: a caller who provided an explicit RunID and received a constraint violation unrelated to run_id uniqueness would be silently retried rather than having the real error surfaced. The path is currently unreachable under the single-writer design, but a comment or tighter condition would make the intent explicit and prevent confusion during future multi-writer work.

harness-engineering:harness-self-documenting-code-reviewer (1 findings)

💡 Suggestion - internal/ledger/ledger.go:619

GetPR, GetRun, GetSession, GetNamedSession, ListFindings, and ListPlannedActions call s.db directly without checking s.closed. After Close() is called, these methods return a generic 'sql: database is closed' error rather than the typed ErrClosed sentinel. Callers using errors.Is(err, ErrClosed) to detect post-close access will miss these cases, while the write path correctly checks s.closed before dispatching.

security:security-code-auditor (1 findings)

⚠️ Should Fix - internal/ledger/ledger.go:465

In recovery mode, classifyAllocateConstraint returns retry=true when params.RunID is set, a constraint error occurs, and the RunID does not yet exist. However, the only other uniqueness constraint that could fire is on (pr_key, sha, base_sha, profile, posting_identity, attempt). Since recovery mode retries with the same parameters, it will hit the same constraint violation on every iteration, producing a truly infinite loop. Recovery mode should fail immediately on non-run_id constraint violations rather than retrying.

3 info-level observations excluded. Run with --verbose to include.

2 PR discussion threads considered.


Completed in 3m 59s | $1.35 | sonnet | daemon 0.2.121 | Glorfindel
Field Value
Model sonnet
Reviewers hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 3m 59s wall · 10m 57s compute (Reviewers: 2m 24s · Synthesis: 1m 32s)
Cost $1.35
Tokens 217.8k in / 41.8k out
Turns 6

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 33.2k 5.5k 0 33.2k (1h) $0.21
harness-engineering:harness-architecture-reviewer sonnet 37.0k 4.7k 2.1k 34.9k (1h) $0.23
harness-engineering:harness-enforcement-reviewer sonnet 37.0k 7.4k 2.1k 34.9k (1h) $0.27
harness-engineering:harness-knowledge-reviewer sonnet 37.0k 6.0k 2.1k 34.9k (1h) $0.25
harness-engineering:harness-self-documenting-code-reviewer sonnet 36.7k 5.4k 2.1k 34.6k (1h) $0.24
security:security-code-auditor haiku 37.0k 12.8k 0 36.9k (1h) $0.14

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Comment thread internal/ledger/ledger.go
completed_at TEXT,
outcome TEXT,
artifact_path TEXT NOT NULL,
blocking_count INTEGER NOT NULL DEFAULT 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium (harness-engineering:harness-architecture-reviewer): isSQLiteConstraintError uses strings.Contains(err.Error(), "constraint") to classify constraint violations. This is fragile: any error whose message incidentally contains "constraint" (wrapped I/O errors, future driver message changes, trigger errors naming a constraint) will be misclassified, causing allocateRunTx to enter an incorrect retry path or return a false ErrRunExists. The modernc.org/sqlite driver exposes a structured *sqlite3.Error type with a numeric Code field; the correct check is errors.As to extract that type and compare against SQLITE_CONSTRAINT (19). This function is the load-bearing guard for the entire AllocateRun retry and recovery path.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - switched constraint detection to modernc error codes.

Comment thread internal/ledger/ledger.go
_ = db.Close()
return nil, err
}
if _, err := dbmig.Apply(ctx, db, migrations()); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium (harness-engineering:harness-enforcement-reviewer): In write(), if ctx is cancelled after the request is dispatched to the writer goroutine (first select) but before the response is received (second select), the caller receives ctx.Err() while the writer may still execute and commit the DB operation. For AllocateRun specifically, this means a run can be persisted while the caller believes the call failed. A subsequent retry with the same RunID will receive ErrRunExists, producing confusing recovery semantics. Consider returning the allocated run ID even on context cancellation, or documenting this hazard explicitly so callers can handle the ambiguous state.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - dispatched writes now return the writer result.

Comment thread internal/ledger/ledger.go
if err == nil {
return run, nil
}
retry, classifyErr := classifyAllocateConstraint(ctx, db, params, err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium (security:security-code-auditor): In recovery mode, classifyAllocateConstraint returns retry=true when params.RunID is set, a constraint error occurs, and the RunID does not yet exist. However, the only other uniqueness constraint that could fire is on (pr_key, sha, base_sha, profile, posting_identity, attempt). Since recovery mode retries with the same parameters, it will hit the same constraint violation on every iteration, producing a truly infinite loop. Recovery mode should fail immediately on non-run_id constraint violations rather than retrying.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - recovery retries only resume-key collisions.

Comment thread internal/ledger/ledger.go
}
go store.writer()
return store, nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-architecture-reviewer): write() holds s.mu while blocking on the unbuffered s.writes channel send. Under load with tight per-request timeouts, all concurrent callers queue behind the mutex rather than at the channel, so they cannot even begin context-cancellation detection until the previous caller finishes dispatching. Consider a buffered channel (capacity matching expected burst) or releasing the mutex before the channel send and re-checking closed state afterward to allow callers to race independently.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - added a bounded writer queue.

Comment thread internal/ledger/ledger.go
`CREATE TABLE planned_actions (
action_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE,
kind TEXT NOT NULL,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-architecture-reviewer): The findings table declares a foreign key on session_row_id referencing sessions(session_row_id) without ON DELETE CASCADE. If a DeleteSession method is added later, SQLite will block the delete with a FK violation rather than cascading to findings, creating an asymmetric deletion contract relative to the runs → sessions cascade. Add ON DELETE CASCADE here, or add an explanatory comment if orphaned findings are acceptable.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - findings cascade on session deletion.

Comment thread internal/ledger/ledger.go
attempted_at TEXT,
posted_at TEXT,
upstream_id TEXT,
error TEXT
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-architecture-reviewer): planned_actions.finding_id is a nullable TEXT column with no REFERENCES constraint pointing to findings(finding_id). With foreign_keys = ON, SQLite only enforces declared FK constraints; a caller can insert a planned action referencing a non-existent finding and it will be silently accepted. Given that the outbox consumer will look up findings by this ID, a dangling reference creates a silent data integrity gap. Add REFERENCES findings(finding_id), or document that the column is intentionally unvalidated.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - planned actions now reference findings.

Comment thread internal/ledger/ledger.go

// DeleteRun deletes a run and lets SQLite cascade child rows.
func (s *Store) DeleteRun(ctx context.Context, runID string) error {
if strings.TrimSpace(runID) == "" {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): GetPR, GetRun, GetSession, GetNamedSession, ListFindings, and ListPlannedActions call s.db directly without checking s.closed. After Close() is called, these methods return a generic 'sql: database is closed' error rather than the typed ErrClosed sentinel. Callers using errors.Is(err, ErrClosed) to detect post-close access will miss these cases, while the write path correctly checks s.closed before dispatching.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - reads now return ErrClosed after Close.

Comment thread internal/ledger/ledger.go

func configureSQLite(ctx context.Context, db *sql.DB) error {
if _, err := db.ExecContext(ctx, "PRAGMA foreign_keys = ON"); err != nil {
return fmt.Errorf("ledger: enable foreign keys: %w", err)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-knowledge-reviewer): classifyAllocateConstraint returns (retry=true, nil) when params.RunID is non-empty but the run_id does not yet exist in the database. The semantics are surprising: a caller who provided an explicit RunID and received a constraint violation unrelated to run_id uniqueness would be silently retried rather than having the real error surfaced. The path is currently unreachable under the single-writer design, but a comment or tighter condition would make the intent explicit and prevent confusion during future multi-writer work.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed - unrelated constraints are no longer retried.

Copy link
Copy Markdown

@monit-reviewer monit-reviewer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated PR Review

Reviewed commit: 83790ad | Previous: e692e58 (incremental)

Approved with 1 non-blocking suggestion below. Address at your discretion.

Summary

Reviewer Findings
harness-engineering:harness-enforcement-reviewer 1
harness-engineering:harness-enforcement-reviewer (1 findings)

💡 Suggestion - internal/ledger/ledger.go:1299

isResumeAttemptConstraintError identifies resume-key collisions by checking that all six column names appear in the driver error message text. This couples the retry decision to the modernc.org/sqlite message format: a driver update that changes the "UNIQUE constraint failed: table.col1, table.col2" format (e.g., drops the table prefix, changes the separator) would cause the function to silently return false, surfacing a raw constraint error instead of retrying. The failure mode is no-retry rather than incorrect-retry, which is safer than the previous isSQLiteConstraintError pattern, but it remains a version-sensitive contract. A named constraint (CONSTRAINT resume_key UNIQUE(...)) would make the error message carry the constraint name instead, making the match stable; alternatively, a test fixture that pins the expected message format from the driver would catch a regression at test time rather than in production.

10 PR discussion threads considered.


Completed in 2m 01s | $1.85 | sonnet | daemon 0.2.121 | Glorfindel
Field Value
Model sonnet
Mode Re-review · Cycle 2 · Session resumed
Reviewers hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor
Engine claude · sonnet
Reviewed by pr-review-daemon · monit-pr-reviewer
Duration 2m 01s wall · 4m 50s compute (Reviewers: 1m 24s · Synthesis: 20s)
Cost $1.85
Tokens 540.8k in / 15.8k out
Turns 7

Per-workstream usage

Workstream Model In Out Cache read Cache create Cost
hybrid-synthesis sonnet 43.3k 1.1k 18.6k 24.7k (1h) $0.12
harness-engineering:harness-architecture-reviewer sonnet 75.4k 112 2.8k 72.5k (1h) $0.27
harness-engineering:harness-enforcement-reviewer sonnet 78.0k 4.2k 2.8k 75.2k (1h) $0.35
harness-engineering:harness-knowledge-reviewer sonnet 157.7k 3.5k 6.9k 150.8k (1h) $0.62
harness-engineering:harness-self-documenting-code-reviewer sonnet 75.4k 3.2k 2.8k 72.5k (1h) $0.32
security:security-code-auditor haiku 83.5k 2.9k 0 83.5k (1h) $0.12
discussion-summarizer 27.5k 815 18.6k 8.9k (1h) $0.05

Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.

Comment thread internal/ledger/ledger.go
session_row_id TEXT PRIMARY KEY,
run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE,
provider_session_id TEXT NOT NULL,
role TEXT NOT NULL,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Low (harness-engineering:harness-enforcement-reviewer): isResumeAttemptConstraintError identifies resume-key collisions by checking that all six column names appear in the driver error message text. This couples the retry decision to the modernc.org/sqlite message format: a driver update that changes the "UNIQUE constraint failed: table.col1, table.col2" format (e.g., drops the table prefix, changes the separator) would cause the function to silently return false, surfacing a raw constraint error instead of retrying. The failure mode is no-retry rather than incorrect-retry, which is safer than the previous isSQLiteConstraintError pattern, but it remains a version-sensitive contract. A named constraint (CONSTRAINT resume_key UNIQUE(...)) would make the error message carry the constraint name instead, making the match stable; alternatively, a test fixture that pins the expected message format from the driver would catch a regression at test time rather than in production.

Reply to this thread when addressed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Covered by TestClassifyAllocateConstraintForRecoveryMode, which generates the driver duplicate-resume error and verifies retry classification.

@rianjs rianjs merged commit 3a526cc into main May 30, 2026
8 checks passed
@rianjs rianjs deleted the feat/cr-06-ledger-store branch May 30, 2026 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CR-06: ledger schema + store + transactional attempt allocator

2 participants