Fix race conditions in worker/secretrotate #13348

benhoyt · 2021-09-22T04:23:20Z

The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se).

There were two issues here:

In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the return early doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
sending the first change and sending the second, or
b) changing the logic in computeNextRotateTime to return early and
retain the existing timeout if the next secret rotation time is
after OR EQUAL TO the existing timeout (instead of just after).

Better to be safe than sorry, so I've made both fixes. Also made the
similar >= vs > fix to the timeout code in the lease manager.

The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.

go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv

There were two issues here: 1. In TestManySecretsTrigger and similar, we send two separate changes to the watcher channel. After the first one, the worker starts the timer for an hour from now. After we send the second one, the worker receives it and calls handleSecretRotateChanges, which calls computeNextRotateTime, which calls the clock's Now(). But meanwhile the test calls WaitAdvance to advance the clock by 90 minutes, and it's a race: if the worker calls Now() after the clock is advanced, the test succeeds, because computeNextRotateTime retains the existing rotate time and returns, then the worker will receive the timeout and call rotate(). If the worker calls Now() before the clock is advanced, the test fails, because computeNextRotateTime gets a Now() value that's exactly the same as the previous time, and the return early doesn't happen. The fix for this is to either: a) in the test, insert an advanceClock call of any amount between sending the first change and sending the second, or b) changing the logic in computeNextRotateTime to return early and retain the existing timeout if the next secret rotation time is after OR EQUAL TO the existing timeout (instead of just after). Better to be safe than sorry, so I've made both fixes. 2. The second issue was in TestNewSecretTriggersBefore: if the second advanceClock happened after the worker received the second change and calculated the new shorter timeout but BEFORE it actually reset the timer, the timer wouldn't be fired because it's still set to 1h. There are better ways to fix this test, but I've spent too long on this, so I'm just inserting a testing.ShortWait delay in the test to ensure the advanceClock happens after the timer is updated.

wallyworld

Thanks for digging into this.
worker.lease.manager.go also uses similar logic

	// Ensure we never walk the next check back without have performed a
	// scheduled check *unless* we're just starting up.
	if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) {
		return
	}

Perhaps we can fix that also.

manadart · 2021-09-22T07:00:58Z

worker/lease/manager.go

@@ -589,7 +589,7 @@ func (manager *Manager) setNextTimeout(t time.Time) {

 	// Ensure we never walk the next check back without have performed a
 	// scheduled check *unless* we're just starting up.
-	if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) {


I see it now. Nice catch.

benhoyt · 2021-09-22T21:30:51Z

!!build!!

benhoyt · 2021-09-22T22:18:39Z

$$merge$$

benhoyt · 2021-09-22T22:55:01Z

$$merge$$

#13361 Merge from 2.9 to bring forward: - #13360 from wallyworld/simplestreams-compression - #13359 from manadart/2.9-lxd-container-images - #13352 from tlm/aws-instance-profile - #13358 from jujubot/increment-to-2.9.16 - #13354 from wallyworld/refresh-consume-proxy - #13353 from wallyworld/cmr-consume-fixes - #13346 from SimonRichardson/raft-api-client - #13349 from wallyworld/remove-orphaned-cmrapps - #13348 from benhoyt/fix-secretrotate-tests - #13119 from SimonRichardson/pass-context - #13342 from SimonRichardson/raft-facade - #13341 from ycliuhw/feature/quay.io Conflicts (easy resolution): - apiserver/common/crossmodel/interface.go - apiserver/errors/errors.go - apiserver/params/apierror.go - apiserver/testserver/server.go - scripts/win-installer/setup.iss - snap/snapcraft.yaml - version/version.go

wallyworld approved these changes Sep 22, 2021

View reviewed changes

Same >= vs > fix in similar timeout code in worker/lease/manager.go

12eb7b5

manadart reviewed Sep 22, 2021

View reviewed changes

jujubot merged commit 4fad6d8 into juju:2.9 Sep 22, 2021

benhoyt deleted the fix-secretrotate-tests branch September 23, 2021 02:09

benhoyt restored the fix-secretrotate-tests branch September 27, 2021 04:58

benhoyt deleted the fix-secretrotate-tests branch September 27, 2021 04:58

benhoyt restored the fix-secretrotate-tests branch September 27, 2021 05:01

manadart mentioned this pull request Sep 27, 2021

Merge 2.9 into develop #13361

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race conditions in worker/secretrotate #13348

Fix race conditions in worker/secretrotate #13348

benhoyt commented Sep 22, 2021 •

edited

Loading

wallyworld left a comment

manadart Sep 22, 2021

benhoyt commented Sep 22, 2021

benhoyt commented Sep 22, 2021

benhoyt commented Sep 22, 2021

Fix race conditions in worker/secretrotate #13348

Fix race conditions in worker/secretrotate #13348

Conversation

benhoyt commented Sep 22, 2021 • edited Loading

wallyworld left a comment

Choose a reason for hiding this comment

manadart Sep 22, 2021

Choose a reason for hiding this comment

benhoyt commented Sep 22, 2021

benhoyt commented Sep 22, 2021

benhoyt commented Sep 22, 2021

benhoyt commented Sep 22, 2021 •

edited

Loading