Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

only load leases on next tick #9712

Merged
merged 9 commits into from Feb 5, 2019
Merged

Conversation

jameinel
Copy link
Member

@jameinel jameinel commented Feb 4, 2019

Description of change

This changes how we handle unblocking and how we determine nextTick.

Instead of checking blocks on every pass through the loop(), we only check when we get a tick and trigger potential expiration. Christian and I discussed why it was safe, which boils down to. If someone adds a claim that would expire to a different controller, we would only care about expiring it if we had someone blocked on it expiring. Which we can tell from our blocks (arguably we can immediately unblock a block if it isn't in our Leases, which we could easily check without looking at all leases.).

I renamed tick() to expire() and retryingTick to retryingExpire, because that code is only the expiration path, while tick() is now tracking the timer and handling unblocking people.

There is probably also a lot of extra logging in there, and we should audit what is actually useful in practice vs just what was immediately developer focused.

This currently needs:
#9709
Though we could update the tests if we decide not to do it.

expire() now informs the main loop that it is a good time to process its blocks() once it has gone through the processing. Further, handleCheck() now notices if it saw stale data, and then causes the expire() to tick faster than it would otherwise, which will speed up things getting unblocked.

QA steps

The test suite should cover the important cases. The other thing you could do is run up the leadershipclaimer I wrote and try to abuse leases with it, and see if we are able to get into any bad conditions.

Documentation changes

None.

Bug reference

https://bugs.launchpad.net/juju/+bug/1814556

This needs:
  juju/clock#4

In order to pass without hanging. This is because the
ExpiryRepeatedTimeout test sets up a situation where there is a lease
that has already expired by the time we enter loop(). The
testclock.Timer was properly being set to <=0 seconds, but Reset()
wasn't triggering the timer without calling WaitAdvance(0,0,1).
ExpiryTimeout is now failing. Need to figure out if that should be a
source code change or a test suite change.

We wanted to give a bit of time for the expiry to run so that our Leases
would be fresh before we evalute them for blocks and the next expire
evaluation time. However, that means that a blocking Expire call will
prevent us from running the main loop.
If we use NewTimer() we have the ability to Stop() the timer, which will
remove it from the wait list, which means it doesn't make it
unpredictable how many timers we have. Either we are testing waiting on
that timer, or the other cause triggers and stops the timer, so it
doesn't show up as a waiting timer.

Breaks other tests, though. We may just want to not synchronize with
retryingExpire.
@jameinel jameinel changed the base branch from develop to 2.5 February 5, 2019 03:56
Started looking at the Block tests, whose behavior *does* change quite a
bit (since we don't always evaluate blocks on every pass). Found that
the Block test wasn't actually properly synchronizing with its helper
routine, so several tests were accidentally passing.
If we are processing a Block and we notice its information is stale,
trigger a Refresh and then queue up a timeout to indicate that we should
be expiring/unblocking.
@jameinel jameinel changed the title WIP: only load leases on next tick only load leases on next tick Feb 5, 2019
Copy link
Member

@manadart manadart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the nits, my only question is regarding the logic in ensureNextTimeout.

// now is the time of the current tick - expiries are done on the
// basis of this time.
now time.Time
// nextTimeout is the next time that has a possibly expiry that we would care
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"possible".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}
}

func (manager *Manager) lookupLease(leaseKey lease.Key) (lease.Info, bool) {
lease, exists := manager.config.Store.Leases(leaseKey)[leaseKey]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return it directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't because Go is doing a "magic" 2 value return from a map, but that doesn't seem to work as a magic return directly from a function (it treats it as a one-value return, AFAICT).

./manager.go:173:2: not enough arguments to return
have (lease.Info)
want (lease.Info, bool)

if next.After(proposed) {
manager.config.Logger.Tracef("[%s] ensuring we wake up before %v at %v\n",
manager.logContext, d, next)
manager.nextTimeout = next
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should proposed be used here and in the log line above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, you're absolutely right, and clearly we have a missing test as it didn't catch this.
I'll work on a test for this.

It was a little bit tricky to test, because we *were* calling
Timer.Reset() with the right duration. But it lead to a situation where
we thought the timeout was later than it actually was.
TestClaimNoticesEarlyExpiry handles both cases properly. If we were
waiting a long time to timeout, when we get a claim, it will cause us to
wake up earlier, but if we then get a slightly *later* claim, it won't
push our Expiry back to the later time.
jujubot added a commit that referenced this pull request Feb 5, 2019
#9709

## Description of change

The testing clock now iterates faster when given a long timeout, and
gives stack traces when the waiters don't match expectations.

This is just cleanup of the testing clock that helped when I was debugging something else, but we may as well get it into our core code. Only changes the testing clock, so it may change the test suite, but doesn't change production code.

juju/clock#3
juju/clock#4

We need (4) #9712

## QA steps

None.

## Documentation changes

None.

## Bug reference

None.
@jameinel
Copy link
Member Author

jameinel commented Feb 5, 2019

$$merge$$

@jujubot jujubot merged commit 17b943c into juju:2.5 Feb 5, 2019
@jameinel jameinel mentioned this pull request Feb 13, 2019
jujubot added a commit that referenced this pull request Feb 13, 2019
#9748

## Description of change

This just brings develop up-to-date with all the 2.5 patches:
 prdesc Merge pull request #9745 from jameinel/2.5-lease-invalid-retries-1815719
 prdesc Merge pull request #9740 from wallyworld/cmr-multi-offer-fix
 prdesc Merge pull request #9742 from jameinel/2.5-worker-lease-1815468
 prdesc Merge pull request #9736 from jameinel/2.5-leadership-client
 prdesc Merge pull request #9737 from jameinel/2.5-worker-lease-1815468
 prdesc Merge pull request #9722 from achilleasa/fix-1812227
 prdesc Merge pull request #9734 from babbageclunk/state-worker-dep-message
 prdesc Merge pull request #9733 from babbageclunk/raftlease-stop-global-clock
 prdesc Merge pull request #9735 from howbazaar/2.5-mongo-systemd-ulimit
 prdesc Merge pull request #9724 from babbageclunk/raftlease-upgrade-blank
 prdesc Merge pull request #9731 from howbazaar/2.5-status-close-error
 prdesc Merge pull request #9730 from howbazaar/2.5-lease-race
 prdesc Merge pull request #9727 from achilleasa/fix-1814638
 prdesc Merge pull request #9728 from wallyworld/rename-delete-storage-pool
 prdesc Merge pull request #9712 from jameinel/2.5-leases-nextTick
 prdesc Merge pull request #9709 from jameinel/2.5-update-testing-clock

## QA steps

See individual patches.

## Documentation changes

See individual patches.

## Bug reference

 prdesc https://bugs.launchpad.net/juju/+bug/1815719
 prdesc https://bugs.launchpad.net/juju/+bug/1813151
 prdesc https://bugs.launchpad.net/juju/+bug/1815179
 prdesc https://bugs.launchpad.net/juju/+bug/1815471
 prdesc https://bugs.launchpad.net/juju/+bug/1815468
 prdesc https://bugs.launchpad.net/juju/+bug/1812227
 prdesc https://bugs.launchpad.net/juju/+bug/1815405
 prdesc https://bugs.launchpad.net/juju/+bug/1813996
 prdesc https://bugs.launchpad.net/juju/+bug/1813995
 prdesc https://bugs.launchpad.net/juju/+bug/1815397
 prdesc https://bugs.launchpad.net/juju/+bug/1814638
 prdesc https://bugs.launchpad.net/juju/+bug/1814556
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants