feat(ledger): add SQLite run ledger#37
Conversation
|
Architect review: clean after recovery allocation and validation coverage fixes. STATUS: blockers=0 majors=0 minors=0 nits=0 |
|
Coverage is broad, but there are still gaps to tighten before daemon review.
|
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: e692e58
Summary
| Reviewer | Findings |
|---|---|
| harness-engineering:harness-architecture-reviewer | 4 |
| harness-engineering:harness-enforcement-reviewer | 1 |
| harness-engineering:harness-knowledge-reviewer | 1 |
| harness-engineering:harness-self-documenting-code-reviewer | 1 |
| security:security-code-auditor | 1 |
harness-engineering:harness-architecture-reviewer (4 findings)
internal/ledger/ledger.go:1237
isSQLiteConstraintError uses strings.Contains(err.Error(), "constraint") to classify constraint violations. This is fragile: any error whose message incidentally contains "constraint" (wrapped I/O errors, future driver message changes, trigger errors naming a constraint) will be misclassified, causing allocateRunTx to enter an incorrect retry path or return a false ErrRunExists. The modernc.org/sqlite driver exposes a structured *sqlite3.Error type with a numeric Code field; the correct check is errors.As to extract that type and compare against SQLITE_CONSTRAINT (19). This function is the load-bearing guard for the entire AllocateRun retry and recovery path.
💡 Suggestion - internal/ledger/ledger.go:356
write() holds s.mu while blocking on the unbuffered s.writes channel send. Under load with tight per-request timeouts, all concurrent callers queue behind the mutex rather than at the channel, so they cannot even begin context-cancellation detection until the previous caller finishes dispatching. Consider a buffered channel (capacity matching expected burst) or releasing the mutex before the channel send and re-checking closed state afterward to allow callers to race independently.
💡 Suggestion - internal/ledger/ledger.go:1282
The findings table declares a foreign key on session_row_id referencing sessions(session_row_id) without ON DELETE CASCADE. If a DeleteSession method is added later, SQLite will block the delete with a FK violation rather than cascading to findings, creating an asymmetric deletion contract relative to the runs → sessions cascade. Add ON DELETE CASCADE here, or add an explanatory comment if orphaned findings are acceptable.
💡 Suggestion - internal/ledger/ledger.go:1293
planned_actions.finding_id is a nullable TEXT column with no REFERENCES constraint pointing to findings(finding_id). With foreign_keys = ON, SQLite only enforces declared FK constraints; a caller can insert a planned action referencing a non-existent finding and it will be silently accepted. Given that the outbox consumer will look up findings by this ID, a dangling reference creates a silent data integrity gap. Add REFERENCES findings(finding_id), or document that the column is intentionally unvalidated.
harness-engineering:harness-enforcement-reviewer (1 findings)
internal/ledger/ledger.go:344
In write(), if ctx is cancelled after the request is dispatched to the writer goroutine (first select) but before the response is received (second select), the caller receives ctx.Err() while the writer may still execute and commit the DB operation. For AllocateRun specifically, this means a run can be persisted while the caller believes the call failed. A subsequent retry with the same RunID will receive ErrRunExists, producing confusing recovery semantics. Consider returning the allocated run ID even on context cancellation, or documenting this hazard explicitly so callers can handle the ambiguous state.
harness-engineering:harness-knowledge-reviewer (1 findings)
💡 Suggestion - internal/ledger/ledger.go:412
classifyAllocateConstraint returns (retry=true, nil) when params.RunID is non-empty but the run_id does not yet exist in the database. The semantics are surprising: a caller who provided an explicit RunID and received a constraint violation unrelated to run_id uniqueness would be silently retried rather than having the real error surfaced. The path is currently unreachable under the single-writer design, but a comment or tighter condition would make the intent explicit and prevent confusion during future multi-writer work.
harness-engineering:harness-self-documenting-code-reviewer (1 findings)
💡 Suggestion - internal/ledger/ledger.go:619
GetPR, GetRun, GetSession, GetNamedSession, ListFindings, and ListPlannedActions call s.db directly without checking s.closed. After Close() is called, these methods return a generic 'sql: database is closed' error rather than the typed ErrClosed sentinel. Callers using errors.Is(err, ErrClosed) to detect post-close access will miss these cases, while the write path correctly checks s.closed before dispatching.
security:security-code-auditor (1 findings)
internal/ledger/ledger.go:465
In recovery mode, classifyAllocateConstraint returns retry=true when params.RunID is set, a constraint error occurs, and the RunID does not yet exist. However, the only other uniqueness constraint that could fire is on (pr_key, sha, base_sha, profile, posting_identity, attempt). Since recovery mode retries with the same parameters, it will hit the same constraint violation on every iteration, producing a truly infinite loop. Recovery mode should fail immediately on non-run_id constraint violations rather than retrying.
3 info-level observations excluded. Run with --verbose to include.
2 PR discussion threads considered.
Completed in 3m 59s | $1.35 | sonnet | daemon 0.2.121 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Reviewers | hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 3m 59s wall · 10m 57s compute (Reviewers: 2m 24s · Synthesis: 1m 32s) |
| Cost | $1.35 |
| Tokens | 217.8k in / 41.8k out |
| Turns | 6 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 33.2k | 5.5k | 0 | 33.2k (1h) | $0.21 |
| harness-engineering:harness-architecture-reviewer | sonnet | 37.0k | 4.7k | 2.1k | 34.9k (1h) | $0.23 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 37.0k | 7.4k | 2.1k | 34.9k (1h) | $0.27 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 37.0k | 6.0k | 2.1k | 34.9k (1h) | $0.25 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 36.7k | 5.4k | 2.1k | 34.6k (1h) | $0.24 |
| security:security-code-auditor | haiku | 37.0k | 12.8k | 0 | 36.9k (1h) | $0.14 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
| completed_at TEXT, | ||
| outcome TEXT, | ||
| artifact_path TEXT NOT NULL, | ||
| blocking_count INTEGER NOT NULL DEFAULT 0, |
There was a problem hiding this comment.
🟡 Medium (harness-engineering:harness-architecture-reviewer): isSQLiteConstraintError uses strings.Contains(err.Error(), "constraint") to classify constraint violations. This is fragile: any error whose message incidentally contains "constraint" (wrapped I/O errors, future driver message changes, trigger errors naming a constraint) will be misclassified, causing allocateRunTx to enter an incorrect retry path or return a false ErrRunExists. The modernc.org/sqlite driver exposes a structured *sqlite3.Error type with a numeric Code field; the correct check is errors.As to extract that type and compare against SQLITE_CONSTRAINT (19). This function is the load-bearing guard for the entire AllocateRun retry and recovery path.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - switched constraint detection to modernc error codes.
| _ = db.Close() | ||
| return nil, err | ||
| } | ||
| if _, err := dbmig.Apply(ctx, db, migrations()); err != nil { |
There was a problem hiding this comment.
🟡 Medium (harness-engineering:harness-enforcement-reviewer): In write(), if ctx is cancelled after the request is dispatched to the writer goroutine (first select) but before the response is received (second select), the caller receives ctx.Err() while the writer may still execute and commit the DB operation. For AllocateRun specifically, this means a run can be persisted while the caller believes the call failed. A subsequent retry with the same RunID will receive ErrRunExists, producing confusing recovery semantics. Consider returning the allocated run ID even on context cancellation, or documenting this hazard explicitly so callers can handle the ambiguous state.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - dispatched writes now return the writer result.
| if err == nil { | ||
| return run, nil | ||
| } | ||
| retry, classifyErr := classifyAllocateConstraint(ctx, db, params, err) |
There was a problem hiding this comment.
🟡 Medium (security:security-code-auditor): In recovery mode, classifyAllocateConstraint returns retry=true when params.RunID is set, a constraint error occurs, and the RunID does not yet exist. However, the only other uniqueness constraint that could fire is on (pr_key, sha, base_sha, profile, posting_identity, attempt). Since recovery mode retries with the same parameters, it will hit the same constraint violation on every iteration, producing a truly infinite loop. Recovery mode should fail immediately on non-run_id constraint violations rather than retrying.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - recovery retries only resume-key collisions.
| } | ||
| go store.writer() | ||
| return store, nil | ||
| } |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-architecture-reviewer): write() holds s.mu while blocking on the unbuffered s.writes channel send. Under load with tight per-request timeouts, all concurrent callers queue behind the mutex rather than at the channel, so they cannot even begin context-cancellation detection until the previous caller finishes dispatching. Consider a buffered channel (capacity matching expected burst) or releasing the mutex before the channel send and re-checking closed state afterward to allow callers to race independently.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - added a bounded writer queue.
| `CREATE TABLE planned_actions ( | ||
| action_id TEXT PRIMARY KEY, | ||
| run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE, | ||
| kind TEXT NOT NULL, |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-architecture-reviewer): The findings table declares a foreign key on session_row_id referencing sessions(session_row_id) without ON DELETE CASCADE. If a DeleteSession method is added later, SQLite will block the delete with a FK violation rather than cascading to findings, creating an asymmetric deletion contract relative to the runs → sessions cascade. Add ON DELETE CASCADE here, or add an explanatory comment if orphaned findings are acceptable.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - findings cascade on session deletion.
| attempted_at TEXT, | ||
| posted_at TEXT, | ||
| upstream_id TEXT, | ||
| error TEXT |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-architecture-reviewer): planned_actions.finding_id is a nullable TEXT column with no REFERENCES constraint pointing to findings(finding_id). With foreign_keys = ON, SQLite only enforces declared FK constraints; a caller can insert a planned action referencing a non-existent finding and it will be silently accepted. Given that the outbox consumer will look up findings by this ID, a dangling reference creates a silent data integrity gap. Add REFERENCES findings(finding_id), or document that the column is intentionally unvalidated.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - planned actions now reference findings.
|
|
||
| // DeleteRun deletes a run and lets SQLite cascade child rows. | ||
| func (s *Store) DeleteRun(ctx context.Context, runID string) error { | ||
| if strings.TrimSpace(runID) == "" { |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-self-documenting-code-reviewer): GetPR, GetRun, GetSession, GetNamedSession, ListFindings, and ListPlannedActions call s.db directly without checking s.closed. After Close() is called, these methods return a generic 'sql: database is closed' error rather than the typed ErrClosed sentinel. Callers using errors.Is(err, ErrClosed) to detect post-close access will miss these cases, while the write path correctly checks s.closed before dispatching.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - reads now return ErrClosed after Close.
|
|
||
| func configureSQLite(ctx context.Context, db *sql.DB) error { | ||
| if _, err := db.ExecContext(ctx, "PRAGMA foreign_keys = ON"); err != nil { | ||
| return fmt.Errorf("ledger: enable foreign keys: %w", err) |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-knowledge-reviewer): classifyAllocateConstraint returns (retry=true, nil) when params.RunID is non-empty but the run_id does not yet exist in the database. The semantics are surprising: a caller who provided an explicit RunID and received a constraint violation unrelated to run_id uniqueness would be silently retried rather than having the real error surfaced. The path is currently unreachable under the single-writer design, but a comment or tighter condition would make the intent explicit and prevent confusion during future multi-writer work.
Reply to this thread when addressed.
There was a problem hiding this comment.
Fixed - unrelated constraints are no longer retried.
monit-reviewer
left a comment
There was a problem hiding this comment.
Automated PR Review
Reviewed commit: 83790ad | Previous: e692e58 (incremental)
Approved with 1 non-blocking suggestion below. Address at your discretion.
Summary
| Reviewer | Findings |
|---|---|
| harness-engineering:harness-enforcement-reviewer | 1 |
harness-engineering:harness-enforcement-reviewer (1 findings)
💡 Suggestion - internal/ledger/ledger.go:1299
isResumeAttemptConstraintError identifies resume-key collisions by checking that all six column names appear in the driver error message text. This couples the retry decision to the modernc.org/sqlite message format: a driver update that changes the "UNIQUE constraint failed: table.col1, table.col2" format (e.g., drops the table prefix, changes the separator) would cause the function to silently return false, surfacing a raw constraint error instead of retrying. The failure mode is no-retry rather than incorrect-retry, which is safer than the previous isSQLiteConstraintError pattern, but it remains a version-sensitive contract. A named constraint (CONSTRAINT resume_key UNIQUE(...)) would make the error message carry the constraint name instead, making the match stable; alternatively, a test fixture that pins the expected message format from the driver would catch a regression at test time rather than in production.
10 PR discussion threads considered.
Completed in 2m 01s | $1.85 | sonnet | daemon 0.2.121 | Glorfindel
| Field | Value |
|---|---|
| Model | sonnet |
| Mode | Re-review · Cycle 2 · Session resumed |
| Reviewers | hybrid-synthesis, harness-engineering:harness-architecture-reviewer, harness-engineering:harness-enforcement-reviewer, harness-engineering:harness-knowledge-reviewer, harness-engineering:harness-self-documenting-code-reviewer, security:security-code-auditor |
| Engine | claude · sonnet |
| Reviewed by | pr-review-daemon · monit-pr-reviewer |
| Duration | 2m 01s wall · 4m 50s compute (Reviewers: 1m 24s · Synthesis: 20s) |
| Cost | $1.85 |
| Tokens | 540.8k in / 15.8k out |
| Turns | 7 |
Per-workstream usage
| Workstream | Model | In | Out | Cache read | Cache create | Cost |
|---|---|---|---|---|---|---|
| hybrid-synthesis | sonnet | 43.3k | 1.1k | 18.6k | 24.7k (1h) | $0.12 |
| harness-engineering:harness-architecture-reviewer | sonnet | 75.4k | 112 | 2.8k | 72.5k (1h) | $0.27 |
| harness-engineering:harness-enforcement-reviewer | sonnet | 78.0k | 4.2k | 2.8k | 75.2k (1h) | $0.35 |
| harness-engineering:harness-knowledge-reviewer | sonnet | 157.7k | 3.5k | 6.9k | 150.8k (1h) | $0.62 |
| harness-engineering:harness-self-documenting-code-reviewer | sonnet | 75.4k | 3.2k | 2.8k | 72.5k (1h) | $0.32 |
| security:security-code-auditor | haiku | 83.5k | 2.9k | 0 | 83.5k (1h) | $0.12 |
| discussion-summarizer | — | 27.5k | 815 | 18.6k | 8.9k (1h) | $0.05 |
Re-reviews only run when @monit-reviewer is re-requested as a reviewer — push as many commits as you need, then re-request when ready. PRs targeting branches other than main, master are skipped, even when @monit-reviewer is re-requested.
| session_row_id TEXT PRIMARY KEY, | ||
| run_id TEXT NOT NULL REFERENCES runs(run_id) ON DELETE CASCADE, | ||
| provider_session_id TEXT NOT NULL, | ||
| role TEXT NOT NULL, |
There was a problem hiding this comment.
🔵 Low (harness-engineering:harness-enforcement-reviewer): isResumeAttemptConstraintError identifies resume-key collisions by checking that all six column names appear in the driver error message text. This couples the retry decision to the modernc.org/sqlite message format: a driver update that changes the "UNIQUE constraint failed: table.col1, table.col2" format (e.g., drops the table prefix, changes the separator) would cause the function to silently return false, surfacing a raw constraint error instead of retrying. The failure mode is no-retry rather than incorrect-retry, which is safer than the previous isSQLiteConstraintError pattern, but it remains a version-sensitive contract. A named constraint (CONSTRAINT resume_key UNIQUE(...)) would make the error message carry the constraint name instead, making the match stable; alternatively, a test fixture that pins the expected message format from the driver would catch a regression at test time rather than in production.
Reply to this thread when addressed.
There was a problem hiding this comment.
Covered by TestClassifyAllocateConstraintForRecoveryMode, which generates the driver duplicate-resume error and verifies retry classification.
Summary
Verification
Closes #13