Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lease manager next check scheduling #13429

Merged
merged 4 commits into from Oct 18, 2021
Merged

Conversation

manadart
Copy link
Member

@manadart manadart commented Oct 18, 2021

A missing assignment in the lease manager means that we can decide that the next block check is now, rather than at least a second from now.

This was not causing us issues until we added a tiny efficiency gain regarding scheduling the check. Before we would schedule the next check if it was now, otherwise we would not.

In addition to the linked bug, an intermittent test failure is possible, observed here.

Here we fix the assignment so that a second is correctly added to the next check time.

Even with this fix applied, we would often not schedule a next check due to the fact that we try to avoid walking back the next check later than one we have scheduled. This logic is flawed if we just made a check and are calculating the next. This too is fixed.

QA steps

For the intermittent test failure, run TestLeadershipNoLeaseBlockEvaluatedNextTick in loop with -race. The failure is very hard to trigger, but the test is fast an will run a good number of times in a reasonable duration.

For practical QA:

  • Bootstrap to LXD and add 3 units of an application (ubuntu-lite works well).
  • juju model-config -m controller logging-config="<root>=INFO;juju.worker.lease.raft=TRACE".
  • juju debug log -m controller.
  • Check that we get regular entries like this:
machine-0: 15:32:39 TRACE juju.worker.lease.raft [094931] tick at 2021-10-18 13:32:39.595055464 +0000 UTC m=+1245.389124953, running expiry checks
machine-0: 15:32:39 TRACE juju.worker.lease.raft [094931] evaluating 1 blocks
machine-0: 15:32:39 TRACE juju.worker.lease.raft [094931] next expire in 49.992s 2021-10-18 13:33:29.587359413 +0000 UTC m=+1295.381428912
  • Check that stopping the machine agent for the leader unit causes the a new leader to be elected.

Documentation changes

None.

Bug reference

https://bugs.launchpad.net/juju/+bug/1942354
https://bugs.launchpad.net/juju/+bug/1947409
https://bugs.launchpad.net/juju/+bug/1947179

scheduled lease manager block check.

This causes a missed check if scheduled at a zero duration from now.
@manadart manadart added the 2.9 label Oct 18, 2021
@manadart
Copy link
Member Author

Copy link
Member

@SimonRichardson SimonRichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm load testing + Q&A'ing with this patch, I can verify this does now correctly run.

@manadart
Copy link
Member Author

$$merge$$

@jujubot jujubot merged commit d87fd82 into juju:2.9 Oct 18, 2021
@manadart manadart deleted the 2.9-lease-next-tick branch October 18, 2021 15:20
@wallyworld wallyworld mentioned this pull request Oct 19, 2021
jujubot added a commit that referenced this pull request Oct 19, 2021
#13430

Merge 2.9

#13419 Recognise VXLAN network devices
#13423 Fix add-machine command help for zones constraint
#13422 Provision OpenStack instances with NICs for bindings
#13427 Update to latest Pebble
#13429 Fix lease manager next check scheduling
#13428 Various txn fixes mainly for mongo 4.4 compatibility

Conflicts due to mongo changes already in develop
```
# Conflicts:
# cloudconfig/podcfg/image.go
# cloudconfig/podcfg/podcfg.go
# featuretests/upgrade_test.go
# mongo/mongo.go
# testing/mgo.go
#
```

## QA steps

See PRs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants