New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
only load leases on next tick #9712
Conversation
This needs: juju/clock#4 In order to pass without hanging. This is because the ExpiryRepeatedTimeout test sets up a situation where there is a lease that has already expired by the time we enter loop(). The testclock.Timer was properly being set to <=0 seconds, but Reset() wasn't triggering the timer without calling WaitAdvance(0,0,1).
ExpiryTimeout is now failing. Need to figure out if that should be a source code change or a test suite change. We wanted to give a bit of time for the expiry to run so that our Leases would be fresh before we evalute them for blocks and the next expire evaluation time. However, that means that a blocking Expire call will prevent us from running the main loop.
If we use NewTimer() we have the ability to Stop() the timer, which will remove it from the wait list, which means it doesn't make it unpredictable how many timers we have. Either we are testing waiting on that timer, or the other cause triggers and stops the timer, so it doesn't show up as a waiting timer. Breaks other tests, though. We may just want to not synchronize with retryingExpire.
Started looking at the Block tests, whose behavior *does* change quite a bit (since we don't always evaluate blocks on every pass). Found that the Block test wasn't actually properly synchronizing with its helper routine, so several tests were accidentally passing.
If we are processing a Block and we notice its information is stale, trigger a Refresh and then queue up a timeout to indicate that we should be expiring/unblocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the nits, my only question is regarding the logic in ensureNextTimeout
.
worker/lease/manager.go
Outdated
// now is the time of the current tick - expiries are done on the | ||
// basis of this time. | ||
now time.Time | ||
// nextTimeout is the next time that has a possibly expiry that we would care |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"possible".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
} | ||
|
||
func (manager *Manager) lookupLease(leaseKey lease.Key) (lease.Info, bool) { | ||
lease, exists := manager.config.Store.Leases(leaseKey)[leaseKey] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just return it directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't because Go is doing a "magic" 2 value return from a map, but that doesn't seem to work as a magic return directly from a function (it treats it as a one-value return, AFAICT).
./manager.go:173:2: not enough arguments to return
have (lease.Info)
want (lease.Info, bool)
worker/lease/manager.go
Outdated
if next.After(proposed) { | ||
manager.config.Logger.Tracef("[%s] ensuring we wake up before %v at %v\n", | ||
manager.logContext, d, next) | ||
manager.nextTimeout = next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should proposed
be used here and in the log line above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, you're absolutely right, and clearly we have a missing test as it didn't catch this.
I'll work on a test for this.
It was a little bit tricky to test, because we *were* calling Timer.Reset() with the right duration. But it lead to a situation where we thought the timeout was later than it actually was. TestClaimNoticesEarlyExpiry handles both cases properly. If we were waiting a long time to timeout, when we get a claim, it will cause us to wake up earlier, but if we then get a slightly *later* claim, it won't push our Expiry back to the later time.
#9709 ## Description of change The testing clock now iterates faster when given a long timeout, and gives stack traces when the waiters don't match expectations. This is just cleanup of the testing clock that helped when I was debugging something else, but we may as well get it into our core code. Only changes the testing clock, so it may change the test suite, but doesn't change production code. juju/clock#3 juju/clock#4 We need (4) #9712 ## QA steps None. ## Documentation changes None. ## Bug reference None.
|
#9748 ## Description of change This just brings develop up-to-date with all the 2.5 patches: prdesc Merge pull request #9745 from jameinel/2.5-lease-invalid-retries-1815719 prdesc Merge pull request #9740 from wallyworld/cmr-multi-offer-fix prdesc Merge pull request #9742 from jameinel/2.5-worker-lease-1815468 prdesc Merge pull request #9736 from jameinel/2.5-leadership-client prdesc Merge pull request #9737 from jameinel/2.5-worker-lease-1815468 prdesc Merge pull request #9722 from achilleasa/fix-1812227 prdesc Merge pull request #9734 from babbageclunk/state-worker-dep-message prdesc Merge pull request #9733 from babbageclunk/raftlease-stop-global-clock prdesc Merge pull request #9735 from howbazaar/2.5-mongo-systemd-ulimit prdesc Merge pull request #9724 from babbageclunk/raftlease-upgrade-blank prdesc Merge pull request #9731 from howbazaar/2.5-status-close-error prdesc Merge pull request #9730 from howbazaar/2.5-lease-race prdesc Merge pull request #9727 from achilleasa/fix-1814638 prdesc Merge pull request #9728 from wallyworld/rename-delete-storage-pool prdesc Merge pull request #9712 from jameinel/2.5-leases-nextTick prdesc Merge pull request #9709 from jameinel/2.5-update-testing-clock ## QA steps See individual patches. ## Documentation changes See individual patches. ## Bug reference prdesc https://bugs.launchpad.net/juju/+bug/1815719 prdesc https://bugs.launchpad.net/juju/+bug/1813151 prdesc https://bugs.launchpad.net/juju/+bug/1815179 prdesc https://bugs.launchpad.net/juju/+bug/1815471 prdesc https://bugs.launchpad.net/juju/+bug/1815468 prdesc https://bugs.launchpad.net/juju/+bug/1812227 prdesc https://bugs.launchpad.net/juju/+bug/1815405 prdesc https://bugs.launchpad.net/juju/+bug/1813996 prdesc https://bugs.launchpad.net/juju/+bug/1813995 prdesc https://bugs.launchpad.net/juju/+bug/1815397 prdesc https://bugs.launchpad.net/juju/+bug/1814638 prdesc https://bugs.launchpad.net/juju/+bug/1814556
Description of change
This changes how we handle unblocking and how we determine nextTick.
Instead of checking blocks on every pass through the loop(), we only check when we get a tick and trigger potential expiration. Christian and I discussed why it was safe, which boils down to. If someone adds a claim that would expire to a different controller, we would only care about expiring it if we had someone blocked on it expiring. Which we can tell from our blocks (arguably we can immediately unblock a block if it isn't in our Leases, which we could easily check without looking at all leases.).
I renamed tick() to expire() and retryingTick to retryingExpire, because that code is only the expiration path, while tick() is now tracking the timer and handling unblocking people.
There is probably also a lot of extra logging in there, and we should audit what is actually useful in practice vs just what was immediately developer focused.
This currently needs:
#9709
Though we could update the tests if we decide not to do it.
expire() now informs the main loop that it is a good time to process its blocks() once it has gone through the processing. Further, handleCheck() now notices if it saw stale data, and then causes the expire() to tick faster than it would otherwise, which will speed up things getting unblocked.
QA steps
The test suite should cover the important cases. The other thing you could do is run up the leadershipclaimer I wrote and try to abuse leases with it, and see if we are able to get into any bad conditions.
Documentation changes
None.
Bug reference
https://bugs.launchpad.net/juju/+bug/1814556