Fix data race between key rotation and registry signer #308
Merged
RotateKey swapped d.identity and then zeroed the OLD ed25519 PrivateKey buffer in place, but did so outside identityMu. The registry signer closures captured the identity under identityMu.RLock(), released the lock, and only then called cur.Sign(). An in-flight signer that grabbed the old identity before the swap therefore read the very PrivateKey bytes RotateKey was concurrently zeroing -- a use-after-zero on signing material, flagged by the race detector in the lock-graph stress harness (TestConcurrentDialEncryptDecrypt) and intermittent in production.

Fix: hold identityMu.RLock() across the whole Sign() in every signer closure (Start's and RotateKey's re-bind), and zero the old key under identityMu.Lock() in the same critical section as the swap. RLock and Lock are mutually exclusive, so signing and zeroing can never overlap. The key is still zeroed promptly, preserving the no-heap-lingering intent.

Add TestConcurrentRotateKeyAndSign: hammers the signer closure from N goroutines while another goroutine rotates the identity, asserting every signature verifies against the pubkey snapshotted under the same lock. Reverting either half of the fix makes it fail under -race.
This was referenced Jun 22, 2026
The race

Caught by the race detector in the lock-graph stress harness (TestConcurrentDialEncryptDecrypt, the "Architecture gates / Lock-graph stress harness" CI job). The failure is intermittent: the job passes only sometimes.

WRITE (pkg/daemon/daemon.go, rotate-key path): after swapping the identity, RotateKey zeroed the OLD ed25519 PrivateKey buffer in place, but outside identityMu.

READ (registry signer closures, both the load-bearing one in Start() and the re-bind in RotateKey): captured the identity under identityMu.RLock(), released the lock, then called cur.Sign().

Sign reads the private-key bytes (down into ed25519/subtle.ConstantTimeCompare). An in-flight signer that captured the old identity just before the swap races with the in-place zeroing of that old key: a use-after-zero on live signing material. rotateKeyMu only serializes rotations against each other; it never gated the signers.

The fix
Make private-key Sign mutually exclusive with the zeroing, via the existing identityMu:

- Every signer closure (Start's and RotateKey's re-bind) holds identityMu.RLock() across the whole Sign() (it no longer releases the lock before signing).
- RotateKey zeroes the old key under identityMu.Lock(), in the same critical section as the identity swap.

RLock and Lock are mutually exclusive, so a signer reading the key and the rotation zeroing it can never overlap. Invariant: no goroutine ever reads private-key bytes another goroutine is concurrently zeroing. The key is still zeroed promptly (kept under the same Lock), so the no-heap-lingering security intent is preserved; nothing was dropped.

Test
TestConcurrentRotateKeyAndSign (in pkg/daemon): N goroutines hammer the production-shaped signer closure against a live in-process registry while another goroutine repeatedly calls RotateKey(). Each signature is verified against the public key snapshotted under the same RLock, so a torn/zeroed key fails verification. Reverting either half of the fix makes it fail under -race (verified: the race detector reports WRITE in the zeroing loop vs READ in Sign, plus a corrupt signature).

Validation
- GOWORK=off go build ./... (clean)
- GOWORK=off go vet ./pkg/daemon/... (clean; gofmt clean)
- go test -race -count=3 -run TestConcurrentRotateKeyAndSign ./pkg/daemon/ (clean)
- go test -race -count=2 -parallel 4 -run TestConcurrentDialEncryptDecrypt ./tests/ (clean; 214s, no data race)