Skip to content

Fix data race between key rotation and registry signer#308

Merged
TeoSlayer merged 1 commit into
mainfrom
fix/rotate-key-sign-data-race
Jun 22, 2026
Merged

Fix data race between key rotation and registry signer#308
TeoSlayer merged 1 commit into
mainfrom
fix/rotate-key-sign-data-race

Conversation

@TeoSlayer

Copy link
Copy Markdown
Collaborator

The race

Caught by the race detector in the lock-graph stress harness (TestConcurrentDialEncryptDecrypt, the "Architecture gates / Lock-graph stress harness" CI job) — intermittent, passes only sometimes.

  • WRITE (pkg/daemon/daemon.go, rotate-key path): after swapping the identity, RotateKey zeroed the OLD ed25519 PrivateKey buffer in place — but outside identityMu:

    d.identityMu.Lock(); d.identity = newID; d.identityMu.Unlock()
    for i := range current.PrivateKey { current.PrivateKey[i] = 0 } // write, no lock
  • READ (registry signer closures, both the load-bearing one in Start() and the re-bind in RotateKey): captured the identity under identityMu.RLock(), released the lock, then called cur.Sign():

    d.identityMu.RLock(); cur := d.identity; d.identityMu.RUnlock()
    ... cur.Sign([]byte(c)) // reads private key, no lock

    Sign reads the private-key bytes (down into ed25519 / subtle.ConstantTimeCompare). An in-flight signer that captured the old identity just before the swap races with the in-place zeroing of that old key — a use-after-zero on live signing material. rotateKeyMu only serializes rotations against each other; it never gated the signers.

The fix

Make private-key Sign mutually exclusive with the zeroing, via the existing identityMu:

  1. Every signer closure now holds identityMu.RLock() across the whole Sign() (no longer releases the lock before signing).
  2. The old-key zeroing now happens under identityMu.Lock(), in the same critical section as the identity swap.

RLock and Lock are mutually exclusive, so a signer reading the key and the rotation zeroing it can never overlap. Invariant: no goroutine ever reads private-key bytes another goroutine is concurrently zeroing. The key is still zeroed promptly (kept under the same Lock), so the no-heap-lingering security intent is preserved — nothing was dropped.

Test

TestConcurrentRotateKeyAndSign (in pkg/daemon): N goroutines hammer the production-shaped signer closure against a live in-process registry while another goroutine repeatedly calls RotateKey(). Each signature is verified against the public key snapshotted under the same RLock, so a torn/zeroed key fails verification. Reverting either half of the fix makes it fail under -race (verified: race detector reports WRITE in the zeroing loop vs READ in Sign, plus a corrupt signature).

Validation

  • GOWORK=off go build ./... — clean
  • GOWORK=off go vet ./pkg/daemon/... — clean; gofmt clean
  • go test -race -count=3 -run TestConcurrentRotateKeyAndSign ./pkg/daemon/ — clean
  • go test -race -count=2 -parallel 4 -run TestConcurrentDialEncryptDecrypt ./tests/clean (214s, no data race)

RotateKey swapped d.identity and then zeroed the OLD ed25519 PrivateKey
buffer in place, but did so outside identityMu. The registry signer
closures captured the identity under identityMu.RLock(), released the
lock, and only then called cur.Sign(). An in-flight signer that grabbed
the old identity before the swap therefore read the very PrivateKey
bytes RotateKey was concurrently zeroing -- a use-after-zero on signing
material, flagged by the race detector in the lock-graph stress harness
(TestConcurrentDialEncryptDecrypt) and intermittent in production.

Fix: hold identityMu.RLock() across the whole Sign() in every signer
closure (Start's and RotateKey's re-bind), and zero the old key under
identityMu.Lock() in the same critical section as the swap. RLock and
Lock are mutually exclusive, so signing and zeroing can never overlap.
The key is still zeroed promptly, preserving the no-heap-lingering
intent.

Add TestConcurrentRotateKeyAndSign: hammers the signer closure from N
goroutines while another goroutine rotates the identity, asserting every
signature verifies against the pubkey snapshotted under the same lock.
Reverting either half of the fix makes it fail under -race.
@TeoSlayer TeoSlayer merged commit e513b55 into main Jun 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants