2.5 into develop #9748

Merged: 49 commits into juju:develop, Feb 13, 2019

Conversation

jameinel
Member

Description of change

This just brings develop up-to-date with all the 2.5 patches.

QA steps

See individual patches.

Documentation changes

See individual patches.

Bug reference

jameinel and others added 30 commits February 4, 2019 07:59
The testing clock now iterates faster when given a long timeout, and
gives stack traces when the waiters don't match expectations.
This needs:
  juju/clock#4

In order to pass without hanging. This is because the
ExpiryRepeatedTimeout test sets up a situation where there is a lease
that has already expired by the time we enter loop(). The
testclock.Timer was properly being set to <=0 seconds, but Reset()
wasn't triggering the timer without calling WaitAdvance(0,0,1).
ExpiryTimeout is now failing. Need to figure out if that should be a
source code change or a test suite change.
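
As a rough illustration of the WaitAdvance(0, 0, 1) synchronisation mentioned above, here is a minimal sketch assuming the github.com/juju/clock/testclock API (NewClock, NewTimer, WaitAdvance); the test name, durations and timer value are illustrative:
```
package lease_test

import (
	"testing"
	"time"

	"github.com/juju/clock/testclock"
)

// A timer whose deadline is already in the past does not fire on its own;
// the test has to synchronise with the waiter and advance the clock, even
// by zero, for the testclock to deliver the tick.
func TestAlreadyExpiredTimerFires(t *testing.T) {
	clk := testclock.NewClock(time.Time{})
	timer := clk.NewTimer(-time.Second) // e.g. a lease that expired before loop() started

	// Wait for exactly one waiter (the timer) and advance by zero.
	if err := clk.WaitAdvance(0, time.Second, 1); err != nil {
		t.Fatal(err)
	}
	select {
	case <-timer.Chan():
	case <-time.After(time.Second):
		t.Fatal("timer did not fire")
	}
}
```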

We wanted to give a bit of time for the expiry to run so that our Leases
would be fresh before we evaluate them for blocks and the next expire
evaluation time. However, that means that a blocking Expire call will
prevent us from running the main loop.
If we use NewTimer() we have the ability to Stop() the timer, which
removes it from the wait list, so the number of waiting timers stays
predictable. Either we are testing waiting on that timer, or the other
cause triggers and stops the timer, so it doesn't show up as a waiting
timer.

Breaks other tests, though. We may just want to not synchronize with
retryingExpire.
Started looking at the Block tests, whose behavior *does* change quite a
bit (since we don't always evaluate blocks on every pass). Found that
the Block test wasn't actually properly synchronizing with its helper
routine, so several tests were accidentally passing.
If we are processing a Block and we notice its information is stale,
trigger a Refresh and then queue up a timeout to indicate that we should
be expiring/unblocking.
Still needs the test suite updates, but we need this change for other
reasons.
It was a little bit tricky to test, because we *were* calling
Timer.Reset() with the right duration. But it led to a situation where
we thought the timeout was later than it actually was.
TestClaimNoticesEarlyExpiry handles both cases properly. If we were
waiting a long time to timeout, when we get a claim, it will cause us to
wake up earlier, but if we then get a slightly *later* claim, it won't
push our Expiry back to the later time.
Several tests used a timeout of 0, but the new clock means that the
waiter has to *already* be waiting.
juju#9709

## Description of change

The testing clock now iterates faster when given a long timeout, and
gives stack traces when the waiters don't match expectations.

This is just cleanup of the testing clock that helped when I was debugging something else, but we may as well get it into our core code. Only changes the testing clock, so it may change the test suite, but doesn't change production code.

juju/clock#3
juju/clock#4

We need (4) for juju#9712

## QA steps

None.

## Documentation changes

None.

## Bug reference

None.
juju#9712

## Description of change

This changes how we handle unblocking and how we determine nextTick.

Instead of checking blocks on *every* pass through the loop(), we only check when we get a tick and trigger potential expiration. Christian and I discussed why it was safe, which boils down to this: if someone adds a claim that would expire to a different controller, we only care about expiring it if someone is blocked on it expiring, which we can tell from our blocks (arguably we can immediately unblock a block if it isn't in our Leases, which we could easily check without looking at all leases).
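
As a toy illustration of that cheap check (the names and types here are purely illustrative, not the juju code), only the blocked keys need to be looked up rather than every lease:
```
package main

import "fmt"

// releaseUnheld wakes any waiters whose lease no longer exists. It is a toy
// illustration of the cheap check described above: only the blocked keys
// are looked up, not every lease in the store.
func releaseUnheld(blocks map[string][]chan struct{}, leases map[string]struct{}) {
	for key, waiters := range blocks {
		if _, held := leases[key]; !held {
			for _, w := range waiters {
				close(w) // wake the blocked claimant immediately
			}
			delete(blocks, key)
		}
	}
}

func main() {
	done := make(chan struct{})
	blocks := map[string][]chan struct{}{"app-leadership": {done}}
	releaseUnheld(blocks, map[string]struct{}{}) // the lease has already gone
	<-done
	fmt.Println("unblocked without a full expiry pass")
}
```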

I renamed tick() to expire() and retryingTick to retryingExpire, because that code is only the expiration path, while tick() is now tracking the timer and handling unblocking people.

There is probably also a lot of extra logging in there, and we should audit what is actually useful in practice vs just what was immediately developer focused.

This currently needs: 
 juju#9709
Though we could update the tests if we decide not to do it.

expire() now informs the main loop that it is a good time to process its blocks() once it has finished the expiry pass. Further, handleCheck() now notices if it saw stale data, and then causes expire() to tick faster than it would otherwise, which speeds up getting things unblocked.
 
## QA steps

The test suite should cover the important cases. The other thing you could do is run up the leadershipclaimer I wrote and try to abuse leases with it, and see if we are able to get into any bad conditions.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1814556
I'm not sure why some production systems have these (possibly they used
to be created in some much earlier version and have lasted through
upgrades), but they're not valid for raft leases so don't try to migrate
them.
juju#9728

## Description of change

Essentially s/delete-storage-pool/remove-storage-pool: the command was
incorrectly named "delete" instead of "remove".

## QA steps

create a storage pool and then remove it
juju#9727

## Description of change

Make the error message when attempting to bootstrap without credentials a bit more helpful for the end-user

## QA steps

```console
$ juju bootstrap joyent
ERROR detecting credentials for "joyent" cloud provider: credentials not found
See `juju add-credential joyent --help` for instructions
```

## Bug reference
https://bugs.launchpad.net/juju/+bug/1814638
juju#9730

This fix is for the observed failures we have seen in openstack deploys with solutions QA.

The uniter worker would get stuck setting up the initial remote state watcher. This watcher would try to claim the leadership, and the ClaimLeadership facade call would never return.

On the apiserver, the worker/lease was in a state where it was trying to restart, but waiting on the in flight wait group. If there were two concurrent claims, both of which hit errors, the first would be processed causing the worker/lease loop to exit. This triggered a defer on the wait group, which wasn't being decremented until the second in flight claim finished. However the second in flight one was blocked trying to send the error down the errors channel. This was no longer being selected on.

The fix is to not use an errors channel, but instead just kill the catacomb.
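
A rough, self-contained sketch of that shape follows, assuming the juju catacomb package (the import path varies across juju versions, and the worker here is a toy stand-in, not the actual lease manager): killing the catacomb never blocks, whereas a send on an errors channel can hang forever once the main loop has stopped selecting on it.
```
package main

import (
	"errors"
	"fmt"

	"gopkg.in/juju/worker.v1/catacomb" // path differs in newer juju versions
)

// toyWorker stands in for the lease manager: loop() is the main loop and
// background() represents an in-flight claim goroutine that hits an error.
type toyWorker struct {
	catacomb catacomb.Catacomb
}

func (w *toyWorker) loop() error {
	// Previously, failures were sent on an errors channel selected on here.
	// Once loop() returned (after the first failure), a second in-flight
	// failure would block forever on that send, and the deferred wait-group
	// Wait would never finish. Killing the catacomb instead never blocks.
	<-w.catacomb.Dying()
	return w.catacomb.ErrDying()
}

func (w *toyWorker) background() {
	// Safe from any goroutine, any number of times, before or after loop()
	// has exited; only the first error "wins".
	w.catacomb.Kill(errors.New("claim failed"))
}

func main() {
	w := &toyWorker{}
	if err := catacomb.Invoke(catacomb.Plan{
		Site: &w.catacomb,
		Work: w.loop,
	}); err != nil {
		panic(err)
	}
	go w.background()
	go w.background() // a concurrent second failure no longer deadlocks
	fmt.Println(w.catacomb.Wait())
}
```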

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815397
juju#9731

Noticed on a solutions QA run that juju wait failed with an error trying to close the API.

There is nothing a user could do, and it is possible that the command otherwise worked fine.
When investigating a production problem we could see that the state
worker reported missing dependencies even though both of its deps were
running. The state manifold does an additional check that one of its
dependencies returns true before starting, but the error is a generic
"dependency not available" which is confusing. Annotate the error with
some more context about what specifically is wrong - this is the same
technique that engine.Housing.Decorate uses to indicate when a worker
isn't running because a flag isn't set.
juju#9724

## Description of change

Leases with blank names and/or holders were causing errors reported from migrating the leases to raft during the upgrade. The leases wouldn't cause a problem other than a delay, because nothing would be extending them.

It's not clear why some production systems have these (possibly they used to be created in some much earlier version and have lasted through upgrades), but they're not valid for raft leases so don't try to migrate them.

## QA steps

* Bootstrap a 2.4.x controller.
* Monkey with the database to create some leases with blank names and/or holders.
* Upgrade the controller to this version of Juju.
* There should be no errors about invalid lease names or holders in the controller log, or blank leases in the leaseholders collection.
* There shouldn't be any blank leases in the raft snapshot created when migrating the leases.

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813995
jujubot and others added 19 commits February 11, 2019 19:43
juju#9735

Make limits support unlimited, and update mongo config based on mongo recommendations.

In order to support "unlimited", we had to change the limit map in the service code to use strings.
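
As a rough illustration (not the actual juju service code), a string-valued limit map can express "unlimited" alongside numeric values; the keys and values below follow MongoDB's published ulimit recommendations and are illustrative:
```
package main

import "fmt"

func main() {
	// Using strings lets "unlimited" sit alongside numeric values in the
	// same map, which the previous int-valued map could not express.
	limits := map[string]string{
		"fsize":   "unlimited",
		"cpu":     "unlimited",
		"memlock": "unlimited",
		"nofile":  "64000",
		"nproc":   "64000",
	}
	for name, value := range limits {
		fmt.Printf("%s=%s\n", name, value)
	}
}
```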
…clock

juju#9733

## Description of change

If we're using raft leases there's no need to have the global-clock-updater writing to the db once a second.

## QA steps

* Spin up a controller.
* Check that global-clock-updater isn't running in the engine report.
* Switch to legacy-leases: `juju controller-config features="[legacy-leases]"`
* The engine report shows global-clock-updater is started.
* Switch back to raft leases: `juju controller-config features="[]"`
* global-clock-updater is stopped again.

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813996
juju#9734

## Description of change

When investigating a production problem we could see that the state worker reported missing dependencies even though both of its deps were running. The state manifold does an additional check that one of its dependencies returns true before starting, but the error is a generic "dependency not available" which is confusing. Annotate the error with some more context about what specifically is wrong - this is the same technique that engine.Housing.Decorate uses to indicate when a worker isn't running because a flag isn't set.
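
A minimal sketch of that annotation technique, assuming github.com/juju/errors and the worker dependency package (the start function, condition, and import path are illustrative, not the exact juju code):
```
package main

import (
	"fmt"

	"github.com/juju/errors"
	"gopkg.in/juju/worker.v1/dependency" // path differs in newer juju versions
)

// startState is an illustrative stand-in for the state manifold's start
// function. Annotating dependency.ErrMissing keeps errors.Cause(err)
// recognisable to the engine while the engine report gains context.
func startState(haveStateServingInfo bool) (interface{}, error) {
	if !haveStateServingInfo {
		return nil, errors.Annotate(dependency.ErrMissing, "no StateServingInfo in config")
	}
	return struct{}{}, nil
}

func main() {
	_, err := startState(false)
	fmt.Println(err) // "no StateServingInfo in config: dependency not available"
	fmt.Println(errors.Cause(err) == dependency.ErrMissing) // still treated as missing
}
```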

## QA steps

* Bootstrap a controller.
* SSH to the controller machine and edit /var/lib/juju/agents/machine-0/agent.conf (save a backup first!) - remove the controllerkey field.
* Restart the controller agent.
* Run juju_engine_report and look at the section for the state worker - it should report "no StateServingInfo in config: dependency not available"

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815405
This is the tool used to test leadership Claim/Extend performance. It
can be used as an example of writing an external client for testing Juju
APIs.

Also included is a simple python script that can aggregate the logging
output to make the results easier to understand.
The test that checks we notice when a Claim has a timeout newer than our
current expiry timeout had a race. The timeout was updated after the
Claim had been processed, which meant the ensureTimeout could happen
arbitrarily far in the future.

Just changing the code to call ensureTimeout before the claim is
returned handles the race. This shouldn't affect production deployments,
because if all the claims have the same timeout, then any 'new' claim
will always be later than any other claims that we already have. And
even if we didn't, at worst it would backstop the expiry to the poll
timeout. So we might be a second behind expiring, but nothing that would
be a problem in production.
juju#9722

## Description of change

When deploying local charms that contain a `version` file, the charm version details are updated with the contents of that file. This was not the case when attempting to deploy charms from the charm store, which resulted in the version details not being included in the `juju status` output.

This PR bumps the sha for the `charm.v6` dependency (see: juju/charm#269) which provides a helper charm archive method for fetching the version of a charm and updates the relevant code to populate the charm version field prior to persisting its details to the store.

## QA steps

```console
$ juju deploy vault
$ juju status vault --format yaml | grep charm-version
 charm-version: 220f0ee
```

Note: you should get the same output when deploying a local charm that includes a `version` file.

## Bug reference
https://bugs.launchpad.net/juju/+bug/1812227
juju#9737

## Description of change

The test that checks we notice when a Claim has a timeout newer than our
current expiry timeout had a race. The timeout was updated after the
Claim had been processed, which meant the ensureTimeout could happen
arbitrarily far in the future.

Just changing the code to call ensureTimeout before the claim is
returned handles the race. This shouldn't affect production deployments,
because if all the claims have the same timeout, then any 'new' claim
will always be later than any other claims that we already have. And
even if we didn't, at worst it would backstop the expiry to the poll
timeout. So we might be a second behind expiring, but nothing that would
be a problem in production.

## QA steps

```
$ go test -race -count=10000 -failfast -test.timeout=10m -check.v -check.f Early
```

If you include this patch, it fails every time:
```
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -271,6 +271,7 @@ func (manager *Manager) retryingClaim(claim claim) {
 return
 }
 claim.respond(nil)
+ time.Sleep(10 * time.Millisecond)
 // now, this isn't strictly true, as the lease can be given for longer
 // than the requested duration. However, it cannot be shorter.
 // Doing it this way, we'll wake up, and then see we can sleep
```

With the patch, you can't put a sleep there, as the order is changed. But the 'run this test 10,000 times with the race detector' shouldn't fail, while even without the patch it failed occasionally.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815468
ClaimLease now sets the timer expiry according to the requested lease
(since it doesn't tell you how long you actually succeeded for without
rereading Leases). We *should* wake up, but we should note that there is
nothing to expire.
They sometimes passed because the loop would exit before we could
observe that we wanted to Refresh; they should always be calling
Refresh at the 1 minute mark, but not expiring the lease.
There were a couple of issues:
a) setupInitialTimer() would create a NewTimer with an arbitrary
   timeout, and then would update itself with the timeout from Leases.
   However, tests could observe "hey, I have a timer waiting" after the
   first one, and miss the synchronization with the second one. Instead,
   we just always initialize Timer in setNextTimeout.
b) Reset() is not safe to call bare (which is documented in the time
   package). The issue is that it may be trying to send on the channel,
   and block, so we have to call Stop() and optionally drain the channel
   first. This was triggered with Startup_NoExpiry_NotLongEnough, where
   it was improperly synchronizing because of (a), and then the test
   would think it was happy, and start tearing down the manager and
   store. But the code was just about to get to timer.Reset() which then
   died because it was triggering synchronously but the Manager loop had
   already been told to TearDown.

This can be exacerbated (to make tests fail more reliably) with this
patch:
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
        manager.muNextTimeout.Lock()
        manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
        manager.muNextTimeout.Unlock()
+       time.Sleep(0 * time.Millisecond)
        // lastTick has never happened, so pass in the epoch time
        manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
juju#9736

## Description of change

This introduces another 'script' which provides an external client for testing the Leadership Facade. While the testing is probably done for now, it can still be useful as an example of how we might want to load test specific services in the future.

## QA steps

See https://discourse.jujucharms.com/t/lease-scale-testing-2-5-0-vs-2-5-1/574/2 for a discussion of how this client was used.

## Documentation changes

None.

## Bug reference

None.
juju#9742

## Description of change

Fix at least 5 flaky tests around Lease behavior.
* The Manager.setupInitialTimer was allowing the test loops to think that setup had finished, but really it was only that setup had started and had done the initial NewTimer() call.
* Timer.Reset() is unsafe to call without first Stopping the timer, according to the time docs (there is a race with a pending trigger firing); see the sketch after this list.
* Several tests were only passing because they exited the main loop 'fast enough'. They wanted to test that we weren't expiring a claim that got a longer-than-requested duration. However, because Claim doesn't actually tell you how long the Lease was given for, unless we did an extra lookup we can't actually set the timeout trivially. And we want to avoid lookups on every Extend. (And in practice we don't give out arbitrarily longer claims.) Update the tests to sync with the expected Refresh and note that nothing gets expired.
* Also test the fact that it *will* expire the Lease at the appropriate times.
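
A minimal sketch of the safe pattern from the time package docs (shown here for a standard *time.Timer; for the juju clock.Timer interface the drain would read from Chan() instead of t.C):
```
package lease

import "time"

// safeReset resets a timer that may already have fired: Stop it first and,
// if it had fired, drain any pending value, then Reset. A bare Reset can
// race with the runtime sending the expiry on the channel.
func safeReset(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C: // discard the tick from the earlier expiry
		default: // nothing pending (value already received elsewhere)
		}
	}
	t.Reset(d)
}
```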

## QA steps

To trigger the race in setupInitialTimer you could use this patch:
```
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
 manager.muNextTimeout.Lock()
 manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
 manager.muNextTimeout.Unlock()
+ time.Sleep(0 * time.Millisecond)
 // lastTick has never happened, so pass in the epoch time
 manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
```
That increases the chance that the test loop syncs on the NewTimer and not on the timer.Reset that is called immediately after it.

For all the tests, the easiest way to see the flakiness is:
```
$ go test -race -count=2000 -failfast -test.timeout=20m -check.v
$ go test -race -count=2000 -failfast -test.timeout=20m -check.v -check.f Expiry
```

With this patch, I didn't get any failures after 2000 loops. (though, as always, there could still be issues.)

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815471
The claim retry loop wasn't properly handling getting 5 ErrInvalid
responses in a row. Now we require 10 invalid responses, and if we end
up getting all failures, we just deny the claim, because clearly the
Primary node knows something we don't. Ultimately, it is a failure that
we are failing to Sync with the master node.

The failure here means that the client should go into a
BlockUntilLeadership call. And if we have gotten in sync by then, we
will block them. If we still haven't, they will be immediately
unblocked, and they'll try to do another claim. But it should still be
a) Correct
b) With delay backoff, it shouldn't introduce too much load.

This does impact other retry loops, but it shouldn't be a problem. For
Expire, we don't have ErrInvalid, as we treat that the same as if we
successfully expired it (at the very least, we know that our existing
lease is incorrect, even if it is now just owned by someone else.)
juju#9740

## Description of change

Creating 2 offers off the one app was failing when relations were processed. To fix, change the token generated on the offer side to be based on the offer name, not the app name.

Also, fix the offer status watcher so that status changes to an offering app's units also trigger a status update in the consuming model.

Note: existing single app offers will continue to work with this fix (it's backwards compatible).
But multi-offers on the same app will need to be redeployed.

## QA steps

juju bootstrap lxd test
juju switch controller
juju model-config logging-juju switch default
juju deploy memcached
juju deploy mariadb
juju add-model foo
juju deploy mediawiki
juju offer mediawiki:cache mw-cache
juju offer mediawiki:db mw-db
juju switch default
juju consume foo.mw-db
juju consume foo.mw-cache
juju relate mw-db mariadb:db
juju relate mw-cache memcached

Check controller logs for cmr relation errors.
Check workloads

juju remove-relation mariadb:db mw-db

Check SAAS status

Upgrade a 2.5.0 cmr deployment and check operations.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813151
https://bugs.launchpad.net/juju/+bug/1815179
…1815719

juju#9745

## Description of change

The claim retry loop wasn't properly handling getting 5 ErrInvalid
responses in a row. Now we require 10 invalid responses, and if we end
up getting all failures, we just deny the claim, because clearly the
Primary node knows something we don't. Ultimately, it is a failure that
we are failing to Sync with the master node. However, having a Claim
fail 5 (10) times shouldn't cause the Manager to fail. So instead, we
just issue a ClaimDenied.

That response means that the client should go into a
BlockUntilLeadership call. And if we have gotten in sync by then, we
will block them. If we still haven't, they will be immediately
unblocked, and they'll try to do another claim. But it should still be
a) Correct
b) With delay backoff, it shouldn't introduce too much load.

This does impact other retry loops, but it shouldn't be a problem. For
Expire, we don't have ErrInvalid, as we treat that the same as if we
successfully expired it (at the very least, we know that our existing
lease is incorrect, even if it is now just owned by someone else.)
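
A self-contained toy sketch of that retry shape (the sentinel errors below stand in for core/lease's ErrInvalid and ErrClaimDenied; the real code also backs off between attempts and lives inside the Manager):
```
package main

import (
	"errors"
	"fmt"
)

// These sentinels stand in for core/lease's ErrInvalid and ErrClaimDenied;
// the names and exact semantics in juju's code may differ.
var (
	errInvalid     = errors.New("invalid lease operation")
	errClaimDenied = errors.New("lease claim denied")
)

// retryClaim keeps retrying while the store reports stale (invalid) state,
// up to maxRetries attempts. If every attempt comes back invalid, the claim
// is denied rather than failing the whole manager; the caller then falls
// back to blocking until leadership is released.
func retryClaim(tryClaim func() error, maxRetries int) error {
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := tryClaim()
		if !errors.Is(err, errInvalid) {
			return err // success, or a genuinely fatal error
		}
		// Stale local state; the real code backs off here so the store can
		// catch up with the raft leader before the next attempt.
	}
	return errClaimDenied
}

func main() {
	alwaysStale := func() error { return errInvalid }
	fmt.Println(retryClaim(alwaysStale, 10)) // lease claim denied
}
```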

## QA steps

A unit test was added which exhibits the failing behavior.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815719
Lots of pieces were out of date, should be better now.
@jameinel
Member Author

$$merge$$ self-approving a merge of 2.5 into develop

@jameinel
Member Author

$$merge$$ both failures were intermittent (one failing mongo with socket-in-use, and one failing watch test)

@jameinel
Member Author

$$merge$$

1 similar comment
@jameinel
Member Author

$$merge$$

@jameinel
Member Author

jameinel commented Feb 13, 2019

$$merge$$
Last failure was:

FAIL: upgrade_test.go:163: upgradeSuite.TestDowngradeOnMasterWhenOtherControllerDoesntStartUpgrade

[LOG] 0:07.341 DEBUG juju.apiserver [1] machine-0 API connection terminated after 1.235265564s
[LOG] 0:07.350 DEBUG juju.worker.dependency "api-server" manifold worker completed successfully
[LOG] 0:07.350 INFO juju.worker.httpserver shutting down HTTP server
[LOG] 0:07.350 DEBUG juju.worker.dependency "http-server" manifold worker completed successfully
upgrade_test.go:226:
c.Fatal("machine agent did not exit as expected")
... Error: machine agent did not exit as expected

https://bugs.launchpad.net/bugs/1793276

@jameinel
Member Author

$$merge$$

1 similar comment
@jameinel
Member Author

$$merge$$

@jujubot merged commit a0ce0c8 into juju:develop on Feb 13, 2019