fix: make attestation verify atomic and concurrency-safe (closes #98)#119
fix: make attestation verify atomic and concurrency-safe (closes #98)#119
Conversation
…ries `AttestationService.VerifyAttestation` runs three writes serially without a wrapping DB transaction: 1. credentialSvc.IssueCredential — inserts issued_credentials row. 2. identitySvc.UpdateIdentity — flips identities.trust_level. 3. repo.Update — sets attestation_records.is_verified. Today's `if record.IsVerified` guard at the top of the function only catches retries after a fully-successful run. It does NOT catch the partial-failure window where Step 1 succeeded but Step 2 or Step 3 failed: in that state, IsVerified is still false, but AttestationRecord.CredentialID is already populated. A retry from the caller would therefore re-run Step 1 and mint a SECOND credential from the same proof. The proper fix is the full DB-transaction refactor proposed in #98 — threading bun.IDB through CredentialService.IssueCredential, IdentityService.UpdateIdentity, and AttestationRepository.Update. That is a real cross-service surface change (~7 IssueCredential call sites plus repo signature changes) and is left as the follow-up. This PR does the smaller mitigation the issue itself proposes as a fallback: extend the guard to ALSO trip when CredentialID != "". That closes the double-credential window without touching service-layer signatures. The remaining gap — Step 2 succeeding but Step 3 failing leaves trust_level promoted on a record that's not marked verified — is unchanged, but it's a strictly weaker problem (no double issuance, just an audit-trail inconsistency) and disappears entirely once the transaction work lands. ## Test plan New regression test `TestAttestationVerifyGuardCatchesPartialFailureRetry`: - runs a clean verify (ends with IsVerified=true, CredentialID set) - surgically rewinds is_verified to false in the DB to simulate the Step 2/3 failure window - asserts the retry returns 409 Conflict instead of minting a second credential Without the new condition the test fails at the second verify (returns 200 + new access token). With it, second verify is rejected as expected. Existing TestAttestationDoubleVerifyIsRejected continues to pass — the new condition is OR'd with the old one; post-success guard behavior is unchanged. Issue #98 stays open to track the full transaction refactor.
There was a problem hiding this comment.
Code Review
This pull request introduces a guard in the attestation service to prevent duplicate credential issuance during retries of partially failed requests, along with a corresponding integration test. Feedback indicates that the guard is currently ineffective because the CredentialID is only persisted in the final step of the process; a failure in earlier steps would leave the database in a state that bypasses this check. To resolve this, the reviewer suggests wrapping the database operations in a transaction to ensure atomicity. Furthermore, the integration test is noted to simulate a state that does not occur during actual application failure modes.
…oses #98) VerifyAttestation does three independent writes: 1. credentialSvc.IssueCredential — inserts issued_credentials row. 2. identitySvc.UpdateIdentity — flips identities.trust_level. 3. repo.Update — sets attestation_records.is_verified and links credential_id. Before this change each ran in its own auto-commit transaction. A failure between steps left partial state — most painfully, Step 1 succeeding then Step 2 or Step 3 failing meant a retry would mint a second credential from the same proof, since the IsVerified guard hadn't been set yet. ## Approach Context-based tx propagation rather than threading bun.IDB through every service signature. New helper `internal/store/postgres/tx.go`: - `WithTx(ctx, tx)` attaches an in-flight tx to ctx. - `dbOrTx(ctx, fallback)` (unexported) returns the tx if ctx has one, falls back to the repo's *bun.DB handle. Three repo methods now route through dbOrTx: - postgres.CredentialRepository.Create - postgres.IdentityRepository.Update - postgres.AttestationRepository.Update Repos that don't see WithTx behave exactly as before (auto-commit per statement), so this is a strict superset of the prior behavior — every existing caller is untouched, no signatures change. VerifyAttestation now wraps the three writes: s.db.RunInTx(ctx, nil, func(ctx context.Context, tx bun.Tx) error { ctx = postgres.WithTx(ctx, tx) // IssueCredential / UpdateIdentity / repo.Update }) AttestationService gains a `db *bun.DB` field so it can open the tx; NewAttestationService picks up one new parameter, wired in server.go. ## Why context-based and not signature-based Threading `bun.IDB` through CredentialService.IssueCredential and IdentityService.UpdateIdentity would have changed ~15 call sites across oauth.go, agent.go, handler/identity.go, handler/credential.go. The context-based pattern keeps the change local: only the repo methods that actually need to participate in a tx pick up the helper, and every other caller of those services keeps the existing signature. ## Defense in depth The CredentialID-based check in the ErrAttestationAlreadyVerified guard stays in place as belt-and-suspenders. With the transaction it's unreachable through the normal flow — partial state cannot exist. But it still catches direct DB manipulation, a future code path that bypasses the verify flow, or a transaction-commits-but-error-returned bug. Test pinning the guard remains. ## Test plan - [x] `go vet ./...` clean (root + pkg/authjwt) - [x] Full integration suite green ~10s, including: - TestAttestationOIDCVerifierHappyPath (proves the tx commits and all three writes land for a normal success path) - TestAttestationDoubleVerifyIsRejected (post-success guard) - TestAttestationVerifyGuardCatchesPartialFailureRetry (manually plants partial state to exercise the defense-in-depth guard) Closes #98.
Three follow-ups on the transaction-wrap PR: R1. Update the stale ErrAttestationAlreadyVerified comment. The original guard-tighten commit said "Closes one of the two failure windows tracked by #98; the full DB-transaction refactor closes the rest." After the tx wrap landed in the previous commit on this branch, the full refactor IS this PR — the comment misled a future reader about what was load-bearing. Rewritten to make clear the guard is now strictly defense- in-depth: the transaction wrap closes both failure windows; the guard only fires for direct DB manipulation, future code paths that bypass VerifyAttestation, or commit-then-error-return. R2. Route every read in the three touched repos through dbOrTx. Previously only Create / Update participated in transactions. Reads (GetByID, GetByJTI, ListByIdentity, GetHighestVerifiedLevel, RevokeAllActiveForIdentity, Revoke, GetByExternalID, GetByWIMSEURI, List, Delete) used r.db directly. In today's flow that's harmless — IssueCredential's reads don't touch tables we write to in the same tx — but it's a latent stale-read footgun: a future refactor that adds a read-after-write inside the closure would silently see pre-tx state. Now every method in postgres/{credential,identity,attestation}.go opens through dbOrTx, so the contract is uniform: if a tx is attached to ctx, the statement participates; otherwise the repo's *bun.DB falls through. No call site sees a behavior change. Other repos (apikey.go, refresh_token.go, etc.) intentionally not touched — they're not in the attestation flow and the consistent pattern can be extended on demand. R3. Add direct regression tests for the WithTx + dbOrTx mechanism. tests/integration/postgres_tx_test.go: - TestPostgresWithTxRollbackPersistsNothing: opens a tx via testDB, inserts an Identity + an IssuedCredential through the repos with WithTx attached, returns a non-nil error from the closure, and asserts neither row survives. Pins the foundational invariant behind the attestation atomicity claim — without it, every existing test in the suite could pass while the rollback semantics are silently broken. - TestPostgresWithoutTxFallsBackToAutoCommit: confirms the no-tx-in- ctx path still auto-commits per statement, so every existing call site of these repos (which doesn't open a tx) is behaviorally unchanged. ## Test plan - [x] go vet ./... clean - [x] Full integration suite green ~10s, including the two new tests - [x] Existing TestAttestationVerifyGuardCatchesPartialFailureRetry still pins the defense-in-depth guard behavior The defense-in-depth claim and the rollback claim are now each backed by their own test.
Addresses the remaining items from the self-review on the #98 tx wrap: W1. Concurrent-verify race. Two simultaneous /verify calls on the same record could each pass the pre-tx IsVerified guard, each enter RunInTx, each call IssueCredential, and leave the DB with two credentials minted from one proof. The tx wrap by itself doesn't fix this — both txs would commit successfully, just at different times. Fix: SELECT ... FOR UPDATE on the attestation record as the FIRST statement inside the tx (new repo method GetByIDForUpdate). The second verify queues on the row lock; when it acquires, the first has already written CredentialID and the in-tx guard re-check rejects it with ErrAttestationAlreadyVerified. Pinned by TestAttestationConcurrentVerifyMintsExactlyOneCredential: fires N=8 concurrent /verify calls on the same record and asserts exactly one returns 200, the other seven return 409, and the DB ends with exactly one credential row for the identity. W2. Identity read moved inside the tx. GetIdentity used to run before RunInTx, so a concurrent UpdateIdentity that deactivated the identity between the read and the tx could mint a credential against a stale view. Now reads happen inside the closure, under the same READ COMMITTED snapshot as the writes. W4. Isolation level documented. Comment on the RunInTx call notes that nil TxOptions = READ COMMITTED and that the row lock provides serialization; we don't need REPEATABLE READ across the whole tx. N1. Outer return values assigned only on success. The previous version captured `record`/`accessToken`/`cred` as outer vars and wrote into `record` mid-closure. On rollback the outer `record` was left holding "verified-looking" fields (no observable bug because of the early txErr return, but a footgun for future code that reads after the error check). Now: closure-scoped locals (`locked`, `issued`, `issuedCred`) hold in-progress state, and the three outer vars are assigned exactly once at the end of the closure. A rollback path never assigns them, so the caller can't observe partial state by accident. N3. WithTx nesting documented. Added a paragraph to WithTx's doc comment noting that re-attaching on a ctx that already carries a tx replaces it (no savepoints, no merge), and that the right pattern is "open one tx at the outermost service call, attach once." ## Test plan - [x] go vet ./... clean - [x] go build ./... clean - [x] Full integration suite green ~10s, including: - TestAttestationConcurrentVerifyMintsExactlyOneCredential (NEW — pins the row lock against an N=8 concurrent attack) - TestAttestationOIDCVerifierHappyPath (regression — happy path still commits all three writes) - TestAttestationDoubleVerifyIsRejected (post-success guard) - TestAttestationVerifyGuardCatchesPartialFailureRetry (defense-in-depth guard against direct DB manipulation) - TestPostgresWithTxRollbackPersistsNothing / TestPostgresWithoutTxFallsBackToAutoCommit (foundational WithTx + dbOrTx mechanism) W3 (cryptographic signing inside the tx) is intentionally left as-is. It's a per-attestation cost, not a hot-path concern, and refactoring would mean either pre-computing the signature outside the tx (loses ability to use ctx-derived state) or splitting IssueCredential into sign + persist phases (broader surface change with no measurable gain).
R1. Fix misleading "no-op" claim on GetByIDForUpdate. The doc previously said "outside one [tx], FOR UPDATE on Postgres is a no-op." Technically wrong — Postgres wraps every statement in an implicit transaction, so the SELECT does acquire the lock; it just releases it on the implicit commit when the statement returns. The practical effect (no useful serialization) is the same, but the wording would mislead a reader trying to reason about lock lifetime. Rewrote to describe the real behavior. R2. Enforce that GetByIDForUpdate is called inside a tx. The doc said "MUST be called inside a transaction" but the code didn't check. A future caller that forgets postgres.WithTx would silently downgrade to a per-statement lock with no useful serialization — exactly the failure mode the lock is meant to prevent, made invisible. Added a `hasTx(ctx)` helper alongside dbOrTx and a one-line guard: "GetByIDForUpdate must be called inside a postgres.WithTx context." Misuse now fails fast instead of producing a race that only shows up under concurrent load. R3. Stop calling require.* from goroutines in the race test. The Go testing contract says t.FailNow / require.* MUST be called from the goroutine running the test, not from workers. The previous race test fired N goroutines that each called the verifyAttestation helper, which routes through doRequest → require.NoError. In practice this often works, but it's officially undefined behaviour: a worker that fails won't actually halt the test, and on some Go versions the failure is silently lost. Refactored the goroutines to do raw http.NewRequest + http.DefaultClient.Do, write (status, err) to a per-slot result struct, and have the main test goroutine assert post-Wait. Same coverage, no testing-API misuse. `go test -race` confirms no data races; conflict goroutines record durations of ~23-27ms (vs ~13ms for the lock holder), independent evidence the row lock is actually serializing the requests. ## Test plan - [x] go vet ./... clean - [x] go test -race targeted suite — clean, no race warnings - [x] Full integration suite green ~10s - [x] Race test still observes 1 OK / 7 conflict / 1 credential
Three small follow-ups from the latest self-review pass: W1. Document lock ordering. The tx now takes the attestation_records row lock (FOR UPDATE) first, then implicitly takes the identities row lock when UpdateIdentity runs. A future code path that does the inverse — identity lock first, then attestation lock — could deadlock with this one. Postgres detects deadlocks (40P01) and aborts one tx, so it's not a correctness bug, just a transient retry concern. Added a comment on the RunInTx site documenting the ordering so a future contributor doesn't introduce the inverse path silently. N1. Rename `updatedRecord` to `verifiedRecord`. The variable holds the post-success attestation record. "Updated" is generic; "verified" is the specific thing that makes this record worth holding onto. One-name swap, no behavior change. N2. Tighten the pre-tx fast-fail comment. The previous wording said the pre-tx guard "shaves work off the obvious-already-done case" — accurate but understated. The real saving on retry storms is avoiding the verifier run, the second DB connection for the tx, and the BEGIN/SELECT FOR UPDATE/ROLLBACK round-trip that the in-tx guard would otherwise require. Reworded. Skipped: W2 (signing inside the lock) — a real refactor of CredentialService into "build+sign" + "persist" phases for ~10ms of lock contention savings per attestation verify; not worth the surface change. W3 (sanity-check that conflict goroutines actually waited) — timing-based assertion would be flaky on heterogeneous CI hardware. ## Test plan - [x] go vet ./... clean - [x] Full integration suite green ~10s, no regressions
R1. Add regression test for the WithTx contract on GetByIDForUpdate. The hasTx guard added in cc67d52 protects against a future caller forgetting postgres.WithTx around GetByIDForUpdate — silent downgrade to a per-statement implicit tx would acquire the row lock and release it immediately, providing no serialization. Without a test, the guard itself could be removed by a future contributor thinking it's redundant, and the regression would only surface under concurrent load. New TestGetByIDForUpdateRequiresTx in postgres_tx_test.go calls GetByIDForUpdate with context.Background() and asserts the contract error fires. W1. Drop the misleading "failed to lock attestation record" wrap. When GetByIDForUpdate's hasTx guard fires, the previous wrap chain read "failed to lock attestation record: GetByIDForUpdate must be called inside a postgres.WithTx context — ..." Misleading: an operator chasing the message would assume a runtime DB problem when the cause is actually a programming mistake. Removed the outer wrap; the repo method's error already names what failed (whether runtime or contract violation), and the inline closure comment explains why. W2. Trim the docstring redundancy. The function-level docstring previously described the atomicity + serialization invariant in detail, then the inline comments inside the closure repeated the same content at finer grain. Trimmed the docstring to a one-paragraph summary that points at the inline commentary; the inline comments remain authoritative. N1. Share the txKey type assertion via txFromContext helper. dbOrTx and hasTx both did the same `ctx.Value(txKey{}).(bun.Tx)` assertion. Extracted into a single txFromContext returning (tx, ok) so the key + type live in exactly one place. N2. Move the lock-order comment inside the closure. The "Lock-order note" comment previously sat above RunInTx but described what happens INSIDE the closure. Moved next to the GetByIDForUpdate call where the read-the-comment-then-look-at-the- code distance is zero. ## Test plan - [x] go vet ./... clean - [x] go test ./... full suite green ~10s - [x] TestGetByIDForUpdateRequiresTx pins the contract
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements transaction atomicity and concurrency control in the VerifyAttestation flow. It introduces a context-based transaction mechanism across the Attestation, Identity, and Credential repositories, utilizing SELECT FOR UPDATE to prevent duplicate credential issuance during concurrent requests. A review comment identified a risk of transaction poisoning in the IdentityRepository.Delete method due to an ignored error during a pre-deletion update.
The pre-DELETE UPDATE that stamps modified_by previously discarded its error via `_, _ = db.NewUpdate()...Exec(ctx)`. Outside a transaction this was harmless: the subsequent DELETE either succeeded (best case) or surfaced its own error (e.g., row not found). Inside a transaction this was a real bug. Postgres aborts a transaction the moment any statement fails, so a failed pre-stamp would mean the DELETE that follows fails with the generic "current transaction is aborted, commands ignored until end of transaction block" — losing the original cause and making the failure mode hard to diagnose. The Delete repo method was changed to participate in WithTx alongside Create / Update in commit aa07997. Now that callers can route Delete through a tx via dbOrTx, the swallow pattern is no longer safe. Fix is the obvious one: propagate the UPDATE error. Behavior change outside a tx is "loud failure on a benign UPDATE problem" instead of "silent audit gap" — strictly safer, and aligns the audit trail with the same atomicity guarantees the rest of the repo provides under tx. Flagged by Gemini on PR #119.
Summary
Closes #98.
AttestationService.VerifyAttestationnow performs its three side-effecting writes (IssueCredential,UpdateIdentity,repo.Update) inside a single transaction with a row-level lock, fixing both partial-failure consistency and the concurrent-verify race that could mint two credentials from one proof.Problem
VerifyAttestationran three writes serially, each in its own auto-commit transaction:credentialSvc.IssueCredential— insertsissued_credentialsrow.identitySvc.UpdateIdentity— flipsidentities.trust_level.repo.Update— setsattestation_records.is_verifiedand linkscredential_id.Two failure modes:
if record.IsVerifiedguard only catches retries after a fully-successful run. If Step 1 succeeded but Step 2 or 3 failed,credential_idwas set butis_verifiedwas stillfalse, and a retry would mint a second credential.IssueCredential— leaving two credentials in the DB and one record pointing at one of them.Fix
Both fixes ship in this PR.
Atomicity —
bun.RunInTxaround the three writesAttestationServicegains adb *bun.DBfield.VerifyAttestation's side-effecting writes now run inside a single closure passed tos.db.RunInTx. Any failure rolls all three back; retries always see a clean slate.Serialization —
SELECT ... FOR UPDATEon the attestation rowThe transaction's first statement is
s.repo.GetByIDForUpdate(...), which acquires a Postgres row-level write lock onattestation_records. Concurrent verifies on the same record queue on the lock; the second one re-reads after the first commits, seesCredentialIDset, and bails withErrAttestationAlreadyVerified.Reusable transaction-propagation pattern
New
internal/store/postgres/tx.go:WithTx(ctx, tx)(exported) attaches abun.Txto ctx.dbOrTx(ctx, fallback)(unexported) returns the in-flight tx if one is set, falling back to the repo's*bun.DB.hasTx(ctx)(unexported) for repo methods that fail-fast on missing tx (e.g.,GetByIDForUpdate).Every method in
internal/store/postgres/{credential,identity,attestation}.gonow routes throughdbOrTx. Callers that don't open a transaction get the repo's auto-commit*bun.DBexactly as before — every existing call site is behaviorally unchanged. Other repos (apikey, refresh_token, oauth_client, …) intentionally not touched; the pattern can be extended on demand.Defense in depth
The
record.IsVerified || record.CredentialID != ""clause on theErrAttestationAlreadyVerifiedguard stays. With the transaction in place it's unreachable through the normal flow (partial state can't exist by construction), but it still catches direct DB manipulation, future code paths that bypassVerifyAttestation, or commit-then-error-return.Lock-order note
This flow takes the
attestation_recordsrow lock first (SELECT FOR UPDATE), then implicitly theidentitiesrow lock (viaUpdateIdentity's UPDATE). Future code paths that hold both should acquire them in the same order. Postgres detects deadlocks (40P01) and aborts one tx, so the worst case from inversion is a transient retry, not silent corruption.Files
internal/store/postgres/tx.goWithTx/dbOrTx/hasTxhelpers, sharingtxFromContext.internal/store/postgres/credential.godbOrTx.internal/store/postgres/identity.godbOrTx.internal/store/postgres/attestation.godbOrTx; newGetByIDForUpdateenforceshasTx.internal/service/attestation.godb *bun.DBfield; three writes wrapped inRunInTx; identity re-loaded inside the tx; outer return values assigned exactly once on commit.server.godbintoNewAttestationService.tests/integration/attestation_oidc_test.goTestAttestationConcurrentVerifyMintsExactlyOneCredential+TestAttestationVerifyGuardCatchesPartialFailureRetry; updated docstring on existing tests.tests/integration/postgres_tx_test.goWithTx-required contract.Test plan
Six regression tests pin the guarantees:
TestPostgresWithTxRollbackPersistsNothing— tx rollback is honored end-to-end.TestPostgresWithoutTxFallsBackToAutoCommit— no-tx callers still auto-commit.TestGetByIDForUpdateRequiresTx—WithTxcontract is enforced.TestAttestationVerifyGuardCatchesPartialFailureRetry— defense-in-depth guard rejects manually-planted partial state.TestAttestationDoubleVerifyIsRejected— post-success idempotency unchanged.TestAttestationConcurrentVerifyMintsExactlyOneCredential— fires N=8 concurrent /verify calls, asserts exactly one credential lands in the DB and the other seven return 409.Local:
go vet ./...clean,go build ./...clean,go test -raceclean, full integration suite green in ~10s. All 10 CI checks green at every commit on the branch. Conflict-goroutine durations of 23-27ms vs winner ~13ms are independent evidence the row lock actually serializes.What's NOT in this PR (deliberate)
IssueCredentialto sign-then-persist to shrink lock duration. Surface change disproportionate to ~10ms savings on a per-attestation flow that isn't hot path.dbOrTxto apikey / refresh_token / oauth_client / etc. None of those participate in the attestation flow today; YAGNI until something needs them. Pattern is in place when it's wanted.Closes
#98