2.5 worker lease 1815471 #9742
ClaimLease now sets the timer expiry according to the requested lease duration (it doesn't tell you how long a lease you actually got without rereading Leases). So we *should* wake up, but we should then note that there is nothing to expire.
The tests sometimes passed because the loop would exit before we could observe that we wanted to Refresh. They should always be calling Refresh at time 1 minute, without expiring the lease.
There were a couple of issues:

a) setupInitialTimer() would create a NewTimer with an arbitrary timeout, and then would update itself with the timeout from Leases. However, tests could observe "hey, I have a timer waiting" after the first one, and miss the synchronization with the second one. Instead, we just always initialize Timer in setNextTimeout.

b) Reset() is not safe to call bare (which is documented in the time package). The issue is that the timer may be trying to send on the channel, and block, so we have to call Stop() and optionally drain the channel first.

This was triggered with Startup_NoExpiry_NotLongEnough, where the test was improperly synchronizing because of (a), would then think it was happy, and start tearing down the manager and store. But the code was just about to get to timer.Reset(), which then died because it was triggering synchronously while the Manager loop had already been told to TearDown.

This can be exacerbated (to make tests fail more reliably) with this patch:

--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
 	manager.muNextTimeout.Lock()
 	manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
 	manager.muNextTimeout.Unlock()
+	time.Sleep(0 * time.Millisecond)
 	// lastTick has never happened, so pass in the epoch time
 	manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
Tested the old code with the time.Sleep patch and could induce the race; then tested the new code and couldn't induce it.
@@ -427,16 +427,12 @@ func (manager *Manager) computeNextTimeout(lastTick time.Time, leases map[lease.
}
nextTick = info.Expiry
}
manager.config.Logger.Tracef("[%s] next expire decided on %v %v",
manager.config.Logger.Tracef("[%s] next expire in %v %v",
nice 👏
default:
}
}
manager.timer.Reset(d)
Good change, the docs read like a minefield.
#9748

## Description of change

This just brings develop up-to-date with all the 2.5 patches:

- Merge pull request #9745 from jameinel/2.5-lease-invalid-retries-1815719
- Merge pull request #9740 from wallyworld/cmr-multi-offer-fix
- Merge pull request #9742 from jameinel/2.5-worker-lease-1815468
- Merge pull request #9736 from jameinel/2.5-leadership-client
- Merge pull request #9737 from jameinel/2.5-worker-lease-1815468
- Merge pull request #9722 from achilleasa/fix-1812227
- Merge pull request #9734 from babbageclunk/state-worker-dep-message
- Merge pull request #9733 from babbageclunk/raftlease-stop-global-clock
- Merge pull request #9735 from howbazaar/2.5-mongo-systemd-ulimit
- Merge pull request #9724 from babbageclunk/raftlease-upgrade-blank
- Merge pull request #9731 from howbazaar/2.5-status-close-error
- Merge pull request #9730 from howbazaar/2.5-lease-race
- Merge pull request #9727 from achilleasa/fix-1814638
- Merge pull request #9728 from wallyworld/rename-delete-storage-pool
- Merge pull request #9712 from jameinel/2.5-leases-nextTick
- Merge pull request #9709 from jameinel/2.5-update-testing-clock

## QA steps

See individual patches.

## Documentation changes

See individual patches.

## Bug reference

- https://bugs.launchpad.net/juju/+bug/1815719
- https://bugs.launchpad.net/juju/+bug/1813151
- https://bugs.launchpad.net/juju/+bug/1815179
- https://bugs.launchpad.net/juju/+bug/1815471
- https://bugs.launchpad.net/juju/+bug/1815468
- https://bugs.launchpad.net/juju/+bug/1812227
- https://bugs.launchpad.net/juju/+bug/1815405
- https://bugs.launchpad.net/juju/+bug/1813996
- https://bugs.launchpad.net/juju/+bug/1813995
- https://bugs.launchpad.net/juju/+bug/1815397
- https://bugs.launchpad.net/juju/+bug/1814638
- https://bugs.launchpad.net/juju/+bug/1814556
Description of change
Fix at least 5 flaky tests around Lease behavior.
QA steps
To trigger the race in setupInitialTimer you could use this patch:
That increases the chance that the test loop will sync on the NewTimer and not on the timer.Reset that is called immediately after it.
For all the tests, the easiest way to see the flakiness is:
With this patch, I didn't get any failures after 2000 loops (though, as always, there could still be issues).
Documentation changes
None.
Bug reference
https://bugs.launchpad.net/juju/+bug/1815471