Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race conditions in worker/secretrotate #13348

Merged
merged 2 commits into from
Sep 22, 2021

Conversation

benhoyt
Copy link
Member

@benhoyt benhoyt commented Sep 22, 2021

The worker/secretrotate tests were failing intermittently in CI, particularly during race testing (where there were additional delays involved, not because of the race detection per se).

There were two issues here:

  1. In TestManySecretsTrigger and similar, we send two separate changes
    to the watcher channel. After the first one, the worker starts the
    timer for an hour from now. After we send the second one, the worker
    receives it and calls handleSecretRotateChanges, which calls
    computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the return early doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
sending the first change and sending the second, or
b) changing the logic in computeNextRotateTime to return early and
retain the existing timeout if the next secret rotation time is
after OR EQUAL TO the existing timeout (instead of just after).

Better to be safe than sorry, so I've made both fixes. Also made the
similar >= vs > fix to the timeout code in the lease manager.

  1. The second issue was in TestNewSecretTriggersBefore: if the second
    advanceClock happened after the worker received the second change
    and calculated the new shorter timeout but BEFORE it actually reset
    the timer, the timer wouldn't be fired because it's still set to 1h.
    There are better ways to fix this test, but I've spent too long on
    this, so I'm just inserting a testing.ShortWait delay in the test to
    ensure the advanceClock happens after the timer is updated.
go test -count=10 -v -race ./worker/secretrotate/... -check.v -check.vv

There were two issues here:

1. In TestManySecretsTrigger and similar, we send two separate changes
to the watcher channel. After the first one, the worker starts the
timer for an hour from now. After we send the second one, the worker
receives it and calls handleSecretRotateChanges, which calls
computeNextRotateTime, which calls the clock's Now().

But meanwhile the test calls WaitAdvance to advance the clock by 90
minutes, and it's a race: if the worker calls Now() after the clock is
advanced, the test succeeds, because computeNextRotateTime retains the
existing rotate time and returns, then the worker will receive the
timeout and call rotate().

If the worker calls Now() before the clock is advanced, the test fails,
because computeNextRotateTime gets a Now() value that's exactly the
same as the previous time, and the return early doesn't happen. The
fix for this is to either:

a) in the test, insert an advanceClock call of any amount between
   sending the first change and sending the second, or
b) changing the logic in computeNextRotateTime to return early and
   retain the existing timeout if the next secret rotation time is
   after OR EQUAL TO the existing timeout (instead of just after).

Better to be safe than sorry, so I've made both fixes.

2. The second issue was in TestNewSecretTriggersBefore: if the second
advanceClock happened after the worker received the second change
and calculated the new shorter timeout but BEFORE it actually reset
the timer, the timer wouldn't be fired because it's still set to 1h.
There are better ways to fix this test, but I've spent too long on
this, so I'm just inserting a testing.ShortWait delay in the test to
ensure the advanceClock happens after the timer is updated.
Copy link
Member

@wallyworld wallyworld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this.
worker.lease.manager.go also uses similar logic

	// Ensure we never walk the next check back without have performed a
	// scheduled check *unless* we're just starting up.
	if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) {
		return
	}

Perhaps we can fix that also.

@@ -589,7 +589,7 @@ func (manager *Manager) setNextTimeout(t time.Time) {

// Ensure we never walk the next check back without have performed a
// scheduled check *unless* we're just starting up.
if !manager.nextTimeout.IsZero() && t.After(manager.nextTimeout) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it now. Nice catch.

@benhoyt
Copy link
Member Author

benhoyt commented Sep 22, 2021

!!build!!

@benhoyt
Copy link
Member Author

benhoyt commented Sep 22, 2021

$$merge$$

1 similar comment
@benhoyt
Copy link
Member Author

benhoyt commented Sep 22, 2021

$$merge$$

@jujubot jujubot merged commit 4fad6d8 into juju:2.9 Sep 22, 2021
@benhoyt benhoyt deleted the fix-secretrotate-tests branch September 23, 2021 02:09
@benhoyt benhoyt restored the fix-secretrotate-tests branch September 27, 2021 04:58
@benhoyt benhoyt deleted the fix-secretrotate-tests branch September 27, 2021 04:58
@benhoyt benhoyt restored the fix-secretrotate-tests branch September 27, 2021 05:01
jujubot added a commit that referenced this pull request Sep 28, 2021
#13361

Merge from 2.9 to bring forward:
- #13360 from wallyworld/simplestreams-compression
- #13359 from manadart/2.9-lxd-container-images
- #13352 from tlm/aws-instance-profile
- #13358 from jujubot/increment-to-2.9.16
- #13354 from wallyworld/refresh-consume-proxy
- #13353 from wallyworld/cmr-consume-fixes
- #13346 from SimonRichardson/raft-api-client
- #13349 from wallyworld/remove-orphaned-cmrapps
- #13348 from benhoyt/fix-secretrotate-tests
- #13119 from SimonRichardson/pass-context
- #13342 from SimonRichardson/raft-facade
- #13341 from ycliuhw/feature/quay.io

Conflicts (easy resolution):
- apiserver/common/crossmodel/interface.go
- apiserver/errors/errors.go
- apiserver/params/apierror.go
- apiserver/testserver/server.go
- scripts/win-installer/setup.iss
- snap/snapcraft.yaml
- version/version.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants