Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.5 worker lease 1815471 #9742

Merged
merged 4 commits into from
Feb 12, 2019
Merged

Conversation

jameinel
Copy link
Member

@jameinel jameinel commented Feb 12, 2019

Description of change

Fix at least 5 flaky tests around Lease behavior.

  • The Manager.setupInitialTimer was allowing the test loops to think that setup had finished, but really it was only that setup had started and had done the initial NewTimer() call.
  • Timer.Reset() is unsafe to call without first Stopping the timer according to the time docs. (race condition with the trigger firing.)
  • Several tests were only passing because they exited the main loop 'fast enough'. They wanted to test that we weren't expiring a claim that got a longer-than-requested duration. However, because Claim doesn't actually tell you how long the Lease was given for, unless we did an extra lookup we can't actually set the timeout trivially. And we want to avoid lookups on every Extend. (And in practice we don't give out arbitrarily longer claims.) Update the tests to sync with the expected Refresh and note that nothing gets expired.
  • Also test the fact that it will expire the Lease at the appropriate times.

QA steps

To trigger the race in setupInitialTimer you could use this patch:

--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
        manager.muNextTimeout.Lock()
        manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
        manager.muNextTimeout.Unlock()
+       time.Sleep(0 * time.Millisecond)
        // lastTick has never happened, so pass in the epoch time
        manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }

That extends the chance that the test loop will sync on the NewTimer and not on the timer.Reset that is immediately called after that.

For all the tests, the easiest way to see the flakiness is:

$ go test -race -count=2000 -failfast -test.timeout=20m -check.v
$ go test -race -count=2000 -failfast -test.timeout=20m -check.v -check.f Expiry

With this patch, I didn't get any failures after 2000 loops. (though, as always, there could still be issues.)

Documentation changes

None.

Bug reference

https://bugs.launchpad.net/juju/+bug/1815471

Now that ClaimLease sets the timer expiry according to the requested
lease (since it doesn't tell you how long you succeeded without
rereading Leases). We *should* wake up, but we should note that there is
nothing to expire.
They sometimes passed because the loop would exit before we could
observe that we wanted to Refresh, but they should always be calling
Refresh at time 1 minute, but not expiring the lease.
There was a couple issues:
a) setupInitialTimer() would create a NewTimer with an arbitrary
   timeout, and then would update itself with the timeout from Leases.
   However, tests could observe "hey, I have a timer waiting" after the
   first one, and miss the synchronization with the second one. Instead,
   we just always initialize Timer in setNextTimeout.
b) Reset() is not safe to call bare (which is documented in the time
   package). The issue is that it may be trying to send on the channel,
   and block, so we have to call Stop() and optionally drain the channel
   first. This was triggered with Startup_NoExpiry_NotLongEnough, where
   it was improperly synchronizing because of (a), and then the test
   would think it was happy, and start tearing down the manager and
   store. But the code was just about to get to timer.Reset() which then
   died because it was triggering synchronously but the Manager loop had
   already been told to TearDown.

This can be exacerbated (to make tests fail more reliably) with this
patch:
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
        manager.muNextTimeout.Lock()
        manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
        manager.muNextTimeout.Unlock()
+       time.Sleep(0 * time.Millisecond)
        // lastTick has never happened, so pass in the epoch time
        manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
@jameinel jameinel changed the title 2.5 worker lease 1815468 2.5 worker lease 1815471 Feb 12, 2019
Copy link
Member

@SimonRichardson SimonRichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with the old code with time.Sleep and can induce race, then tested new code and can't induce race.

@@ -427,16 +427,12 @@ func (manager *Manager) computeNextTimeout(lastTick time.Time, leases map[lease.
}
nextTick = info.Expiry
}
manager.config.Logger.Tracef("[%s] next expire decided on %v %v",
manager.config.Logger.Tracef("[%s] next expire in %v %v",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👏

default:
}
}
manager.timer.Reset(d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good change, the docs read like a minefield.

@jameinel
Copy link
Member Author

$$merge$$

@jameinel
Copy link
Member Author

$$merge$$
fix another flaky test around SlowPeriodSuite.TestWatchRemoveBeforeKnown.
It seemed to think that you would never see the command that was just executed, but I don't think that is guaranteed in any way.

@jujubot jujubot merged commit 3bbdbe6 into juju:2.5 Feb 12, 2019
@jameinel jameinel mentioned this pull request Feb 13, 2019
jujubot added a commit that referenced this pull request Feb 13, 2019
#9748

## Description of change

This just brings develop up-to-date with all the 2.5 patches:
 prdesc Merge pull request #9745 from jameinel/2.5-lease-invalid-retries-1815719
 prdesc Merge pull request #9740 from wallyworld/cmr-multi-offer-fix
 prdesc Merge pull request #9742 from jameinel/2.5-worker-lease-1815468
 prdesc Merge pull request #9736 from jameinel/2.5-leadership-client
 prdesc Merge pull request #9737 from jameinel/2.5-worker-lease-1815468
 prdesc Merge pull request #9722 from achilleasa/fix-1812227
 prdesc Merge pull request #9734 from babbageclunk/state-worker-dep-message
 prdesc Merge pull request #9733 from babbageclunk/raftlease-stop-global-clock
 prdesc Merge pull request #9735 from howbazaar/2.5-mongo-systemd-ulimit
 prdesc Merge pull request #9724 from babbageclunk/raftlease-upgrade-blank
 prdesc Merge pull request #9731 from howbazaar/2.5-status-close-error
 prdesc Merge pull request #9730 from howbazaar/2.5-lease-race
 prdesc Merge pull request #9727 from achilleasa/fix-1814638
 prdesc Merge pull request #9728 from wallyworld/rename-delete-storage-pool
 prdesc Merge pull request #9712 from jameinel/2.5-leases-nextTick
 prdesc Merge pull request #9709 from jameinel/2.5-update-testing-clock

## QA steps

See individual patches.

## Documentation changes

See individual patches.

## Bug reference

 prdesc https://bugs.launchpad.net/juju/+bug/1815719
 prdesc https://bugs.launchpad.net/juju/+bug/1813151
 prdesc https://bugs.launchpad.net/juju/+bug/1815179
 prdesc https://bugs.launchpad.net/juju/+bug/1815471
 prdesc https://bugs.launchpad.net/juju/+bug/1815468
 prdesc https://bugs.launchpad.net/juju/+bug/1812227
 prdesc https://bugs.launchpad.net/juju/+bug/1815405
 prdesc https://bugs.launchpad.net/juju/+bug/1813996
 prdesc https://bugs.launchpad.net/juju/+bug/1813995
 prdesc https://bugs.launchpad.net/juju/+bug/1815397
 prdesc https://bugs.launchpad.net/juju/+bug/1814638
 prdesc https://bugs.launchpad.net/juju/+bug/1814556
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants