-
Notifications
You must be signed in to change notification settings - Fork 497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix race conditions in worker/secretrotate #13348
Conversation
There were two issues here: 1. In TestManySecretsTrigger and similar, we send two separate changes to the watcher channel. After the first one, the worker starts the timer for an hour from now. After we send the second one, the worker receives it and calls handleSecretRotateChanges, which calls computeNextRotateTime, which calls the clock's Now(). But meanwhile the test calls WaitAdvance to advance the clock by 90 minutes, and it's a race: if the worker calls Now() after the clock is advanced, the test succeeds, because computeNextRotateTime retains the existing rotate time and returns, then the worker will receive the timeout and call rotate(). If the worker calls Now() before the clock is advanced, the test fails, because computeNextRotateTime gets a Now() value that's exactly the same as the previous time, and the return early doesn't happen. The fix for this is to either: a) in the test, insert an advanceClock call of any amount between sending the first change and sending the second, or b) changing the logic in computeNextRotateTime to return early and retain the existing timeout if the next secret rotation time is after OR EQUAL TO the existing timeout (instead of just after). Better to be safe than sorry, so I've made both fixes. 2. The second issue was in TestNewSecretTriggersBefore: if the second advanceClock happened after the worker received the second change and calculated the new shorter timeout but BEFORE it actually reset the timer, the timer wouldn't be fired because it's still set to 1h. There are better ways to fix this test, but I've spent too long on this, so I'm just inserting a testing.ShortWait delay in the test to ensure the advanceClock happens after the timer is updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this.
worker.lease.manager.go also uses similar logic
// Ensure we never walk the next check back without have performed a
// scheduled check *unless* we're just starting up.
if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) {
return
}
Perhaps we can fix that also.
@@ -589,7 +589,7 @@ func (manager *Manager) setNextTimeout(t time.Time) { | |||
|
|||
// Ensure we never walk the next check back without have performed a | |||
// scheduled check *unless* we're just starting up. | |||
if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it now. Nice catch.
!!build!! |
|
1 similar comment
|
#13361 Merge from 2.9 to bring forward: - #13360 from wallyworld/simplestreams-compression - #13359 from manadart/2.9-lxd-container-images - #13352 from tlm/aws-instance-profile - #13358 from jujubot/increment-to-2.9.16 - #13354 from wallyworld/refresh-consume-proxy - #13353 from wallyworld/cmr-consume-fixes - #13346 from SimonRichardson/raft-api-client - #13349 from wallyworld/remove-orphaned-cmrapps - #13348 from benhoyt/fix-secretrotate-tests - #13119 from SimonRichardson/pass-context - #13342 from SimonRichardson/raft-facade - #13341 from ycliuhw/feature/quay.io Conflicts (easy resolution): - apiserver/common/crossmodel/interface.go - apiserver/errors/errors.go - apiserver/params/apierror.go - apiserver/testserver/server.go - scripts/win-installer/setup.iss - snap/snapcraft.yaml - version/version.go
The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se).
There were two issues here:
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().
But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().
If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the return early doesn't happen. The
fix for this is to either:
a) in the test, insert an advanceClock call of any amount between
sending the first change and sending the second, or
b) changing the logic in computeNextRotateTime to return early and
retain the existing timeout if the next secret rotation time is
after OR EQUAL TO the existing timeout (instead of just after).
Better to be safe than sorry, so I've made both fixes. Also made the
similar >= vs > fix to the timeout code in the lease manager.
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.
go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv