2.5 into develop #9748

Merged: 49 commits into juju:develop, Feb 13, 2019

Conversation

jameinel
Member

Description of change

This just brings develop up-to-date with all the 2.5 patches.

QA steps

See individual patches.

Documentation changes

See individual patches.

Bug reference

jameinel and others added 30 commits February 4, 2019 07:59
The testing clock now iterates faster when given a long timeout, and
gives stack traces when the waiters don't match expectations.
This needs:
  juju/clock#4

In order to pass without hanging. This is because the
ExpiryRepeatedTimeout test sets up a situation where there is a lease
that has already expired by the time we enter loop(). The
testclock.Timer was properly being set to <=0 seconds, but Reset()
wasn't triggering the timer without calling WaitAdvance(0,0,1).
ExpiryTimeout is now failing. Need to figure out if that should be a
source code change or a test suite change.
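
As a rough illustration of the WaitAdvance(0, 0, 1) synchronisation mentioned above, here is a minimal sketch assuming the github.com/juju/clock/testclock API (NewClock, NewTimer, WaitAdvance); the test name, durations and timer value are illustrative:
```
package lease_test

import (
	"testing"
	"time"

	"github.com/juju/clock/testclock"
)

// A timer whose deadline is already in the past does not fire on its own;
// the test has to synchronise with the waiter and advance the clock, even
// by zero, for the testclock to deliver the tick.
func TestAlreadyExpiredTimerFires(t *testing.T) {
	clk := testclock.NewClock(time.Time{})
	timer := clk.NewTimer(-time.Second) // e.g. a lease that expired before loop() started

	// Wait for exactly one waiter (the timer) and advance by zero.
	if err := clk.WaitAdvance(0, time.Second, 1); err != nil {
		t.Fatal(err)
	}
	select {
	case <-timer.Chan():
	case <-time.After(time.Second):
		t.Fatal("timer did not fire")
	}
}
```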

We wanted to give a bit of time for the expiry to run so that our Leases
would be fresh before we evaluate them for blocks and the next expire
evaluation time. However, that means that a blocking Expire call will
prevent us from running the main loop.
If we use NewTimer() we have the ability to Stop() the timer, which
removes it from the wait list, so the number of waiting timers stays
predictable. Either we are testing waiting on that timer, or the other
cause triggers and stops the timer, so it doesn't show up as a waiting
timer.

Breaks other tests, though. We may just want to not synchronize with
retryingExpire.
Started looking at the Block tests, whose behavior *does* change quite a
bit (since we don't always evaluate blocks on every pass). Found that
the Block test wasn't actually properly synchronizing with its helper
routine, so several tests were accidentally passing.
If we are processing a Block and we notice its information is stale,
trigger a Refresh and then queue up a timeout to indicate that we should
be expiring/unblocking.
Still needs the test suite updates, but we need this change for other
reasons.
It was a little bit tricky to test, because we *were* calling
Timer.Reset() with the right duration. But it led to a situation where
we thought the timeout was later than it actually was.
TestClaimNoticesEarlyExpiry handles both cases properly. If we were
waiting a long time to timeout, when we get a claim, it will cause us to
wake up earlier, but if we then get a slightly *later* claim, it won't
push our Expiry back to the later time.
Several tests used a timeout of 0, but the new clock means that the
waiter has to *already* be waiting.
juju#9709

## Description of change

The testing clock now iterates faster when given a long timeout, and
gives stack traces when the waiters don't match expectations.

This is just cleanup of the testing clock that helped when I was debugging something else, but we may as well get it into our core code. Only changes the testing clock, so it may change the test suite, but doesn't change production code.

juju/clock#3
juju/clock#4

We need (4) for juju#9712

## QA steps

None.

## Documentation changes

None.

## Bug reference

None.
juju#9712

## Description of change

This changes how we handle unblocking and how we determine nextTick.

Instead of checking blocks on *every* pass through the loop(), we only check when we get a tick and trigger potential expiration. Christian and I discussed why it was safe, which boils down to this: if someone adds a claim that would expire to a different controller, we only care about expiring it if someone is blocked on it expiring, which we can tell from our blocks (arguably we can immediately unblock a block if it isn't in our Leases, which we could easily check without looking at all leases).
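
As a toy illustration of that cheap check (the names and types here are purely illustrative, not the juju code), only the blocked keys need to be looked up rather than every lease:
```
package main

import "fmt"

// releaseUnheld wakes any waiters whose lease no longer exists. It is a toy
// illustration of the cheap check described above: only the blocked keys
// are looked up, not every lease in the store.
func releaseUnheld(blocks map[string][]chan struct{}, leases map[string]struct{}) {
	for key, waiters := range blocks {
		if _, held := leases[key]; !held {
			for _, w := range waiters {
				close(w) // wake the blocked claimant immediately
			}
			delete(blocks, key)
		}
	}
}

func main() {
	done := make(chan struct{})
	blocks := map[string][]chan struct{}{"app-leadership": {done}}
	releaseUnheld(blocks, map[string]struct{}{}) // the lease has already gone
	<-done
	fmt.Println("unblocked without a full expiry pass")
}
```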

I renamed tick() to expire() and retryingTick to retryingExpire, because that code is only the expiration path, while tick() is now tracking the timer and handling unblocking people.

There is probably also a lot of extra logging in there, and we should audit what is actually useful in practice vs just what was immediately developer focused.

This currently needs: 
 juju#9709
Though we could update the tests if we decide not to do it.

expire() now informs the main loop that it is a good time to process its blocks() once it has finished the expiry pass. Further, handleCheck() now notices if it saw stale data, and then causes expire() to tick faster than it would otherwise, which speeds up getting things unblocked.
 
## QA steps

The test suite should cover the important cases. The other thing you could do is run up the leadershipclaimer I wrote and try to abuse leases with it, and see if we are able to get into any bad conditions.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1814556
I'm not sure why some production systems have these (possibly they used
to be created in some much earlier version and have lasted through
upgrades), but they're not valid for raft leases so don't try to migrate
them.
juju#9728

## Description of change

Essentially s/delete-storage-pool/remove-storage-pool: the command was
incorrectly named "delete" instead of "remove".

## QA steps

create a storage pool and then remove it
juju#9727

## Description of change

Make the error message when attempting to bootstrap without credentials a bit more helpful for the end-user

## QA steps

```console
$ juju bootstrap joyent
ERROR detecting credentials for "joyent" cloud provider: credentials not found
See `juju add-credential joyent --help` for instructions
```

## Bug reference
https://bugs.launchpad.net/juju/+bug/1814638
juju#9730

This fix is for the observed failures we have seen in openstack deploys with solutions QA.

The uniter worker would get stuck setting up the initial remote state watcher. This watcher would try to claim the leadership, and the ClaimLeadership facade call would never return.

On the apiserver, the worker/lease was in a state where it was trying to restart, but waiting on the in flight wait group. If there were two concurrent claims, both of which hit errors, the first would be processed causing the worker/lease loop to exit. This triggered a defer on the wait group, which wasn't being decremented until the second in flight claim finished. However the second in flight one was blocked trying to send the error down the errors channel. This was no longer being selected on.

The fix is to not use an errors channel, but instead just kill the catacomb.
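
A rough, self-contained sketch of that shape follows, assuming the juju catacomb package (the import path varies across juju versions, and the worker here is a toy stand-in, not the actual lease manager): killing the catacomb never blocks, whereas a send on an errors channel can hang forever once the main loop has stopped selecting on it.
```
package main

import (
	"errors"
	"fmt"

	"gopkg.in/juju/worker.v1/catacomb" // path differs in newer juju versions
)

// toyWorker stands in for the lease manager: loop() is the main loop and
// background() represents an in-flight claim goroutine that hits an error.
type toyWorker struct {
	catacomb catacomb.Catacomb
}

func (w *toyWorker) loop() error {
	// Previously, failures were sent on an errors channel selected on here.
	// Once loop() returned (after the first failure), a second in-flight
	// failure would block forever on that send, and the deferred wait-group
	// Wait would never finish. Killing the catacomb instead never blocks.
	<-w.catacomb.Dying()
	return w.catacomb.ErrDying()
}

func (w *toyWorker) background() {
	// Safe from any goroutine, any number of times, before or after loop()
	// has exited; only the first error "wins".
	w.catacomb.Kill(errors.New("claim failed"))
}

func main() {
	w := &toyWorker{}
	if err := catacomb.Invoke(catacomb.Plan{
		Site: &w.catacomb,
		Work: w.loop,
	}); err != nil {
		panic(err)
	}
	go w.background()
	go w.background() // a concurrent second failure no longer deadlocks
	fmt.Println(w.catacomb.Wait())
}
```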

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815397
juju#9731

Noticed on a solutions QA run that juju wait failed with an error trying to close the API.

There is nothing a user could do, and it is possible that the command otherwise worked fine.
When investigating a production problem we could see that the state
worker reported missing dependencies even though both of its deps were
running. The state manifold does an additional check that one of its
dependencies returns true before starting, but the error is a generic
"dependency not available" which is confusing. Annotate the error with
some more context about what specifically is wrong - this is the same
technique that engine.Housing.Decorate uses to indicate when a worker
isn't running because a flag isn't set.
juju#9724

## Description of change

Leases with blank names and/or holders were causing errors reported from migrating the leases to raft during the upgrade. The leases wouldn't cause a problem other than a delay, because nothing would be extending them.

It's not clear why some production systems have these (possibly they used to be created in some much earlier version and have lasted through upgrades), but they're not valid for raft leases so don't try to migrate them.

## QA steps

* Bootstrap a 2.4.x controller.
* Monkey with the database to create some leases with blank names and/or holders.
* Upgrade the controller to this version of Juju.
* There should be no errors about invalid lease names or holders in the controller log, or blank leases in the leaseholders collection.
* There shouldn't be any blank leases in the raft snapshot created when migrating the leases.

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813995
jujubot and others added 19 commits February 11, 2019 19:43
juju#9735

Make limits support unlimited, and update mongo config based on mongo recommendations.

In order to support "unlimited", we had to change the limit map in the service code to use strings.
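
As a rough illustration (not the actual juju service code), a string-valued limit map can express "unlimited" alongside numeric values; the keys and values below follow MongoDB's published ulimit recommendations and are illustrative:
```
package main

import "fmt"

func main() {
	// Using strings lets "unlimited" sit alongside numeric values in the
	// same map, which the previous int-valued map could not express.
	limits := map[string]string{
		"fsize":   "unlimited",
		"cpu":     "unlimited",
		"memlock": "unlimited",
		"nofile":  "64000",
		"nproc":   "64000",
	}
	for name, value := range limits {
		fmt.Printf("%s=%s\n", name, value)
	}
}
```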
…clock

juju#9733

## Description of change

If we're using raft leases there's no need to have the global-clock-updater writing to the db once a second.

## QA steps

* Spin up a controller.
* Check that global-clock-updater isn't running in the engine report.
* Switch to legacy-leases: `juju controller-config features="[legacy-leases]"`
* The engine report shows global-clock-updater is started.
* Switch back to raft leases: `juju controller-config features="[]"`
* global-clock-updater is stopped again.

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813996
juju#9734

## Description of change

When investigating a production problem we could see that the state worker reported missing dependencies even though both of its deps were running. The state manifold does an additional check that one of its dependencies returns true before starting, but the error is a generic "dependency not available" which is confusing. Annotate the error with some more context about what specifically is wrong - this is the same technique that engine.Housing.Decorate uses to indicate when a worker isn't running because a flag isn't set.
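
A minimal sketch of that annotation technique, assuming github.com/juju/errors and the worker dependency package (the start function, condition, and import path are illustrative, not the exact juju code):
```
package main

import (
	"fmt"

	"github.com/juju/errors"
	"gopkg.in/juju/worker.v1/dependency" // path differs in newer juju versions
)

// startState is an illustrative stand-in for the state manifold's start
// function. Annotating dependency.ErrMissing keeps errors.Cause(err)
// recognisable to the engine while the engine report gains context.
func startState(haveStateServingInfo bool) (interface{}, error) {
	if !haveStateServingInfo {
		return nil, errors.Annotate(dependency.ErrMissing, "no StateServingInfo in config")
	}
	return struct{}{}, nil
}

func main() {
	_, err := startState(false)
	fmt.Println(err) // "no StateServingInfo in config: dependency not available"
	fmt.Println(errors.Cause(err) == dependency.ErrMissing) // still treated as missing
}
```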

## QA steps

* Bootstrap a controller.
* SSH to the controller machine and edit /var/lib/juju/agents/machine-0/agent.conf (save a backup first!) - remove the controllerkey field.
* Restart the controller agent.
* Run juju_engine_report and look at the section for the state worker - it should report "no StateServingInfo in config: dependency not available"

## Documentation changes

None

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815405
This is the tool used to test leadership Claim/Extend performance. It
can be used as an example of writing an external client for testing Juju
APIs.

Also included is a simple python script that can aggregate the logging
output to make the results easier to understand.
The test that checks we notice when a Claim has a timeout newer than our
current expiry timeout had a race. The timeout was updated after the
Claim had been processed, which meant the ensureTimeout could happen
arbitrarily far in the future.

Just changing the code to call ensureTimeout before the claim is
returned handles the race. This shouldn't affect production deployments,
because if all the claims have the same timeout, then any 'new' claim
will always be later than any other claims that we already have. And
even if we didn't, at worst it would backstop the expiry to the poll
timeout. So we might be a second behind expiring, but nothing that would
be a problem in production.
juju#9722

## Description of change

When deploying local charms that contain a `version` file, the charm version details are updated with the contents of that file. This was not the case when attempting to deploy charms from the charm store, which resulted in the version details not being included in the `juju status` output.

This PR bumps the sha for the `charm.v6` dependency (see: juju/charm#269) which provides a helper charm archive method for fetching the version of a charm and updates the relevant code to populate the charm version field prior to persisting its details to the store.

## QA steps

```console
$ juju deploy vault
$ juju status vault --format yaml | grep charm-version
 charm-version: 220f0ee
```

Note: you should get the same output when deploying a local charm that includes a `version` file.

## Bug reference
https://bugs.launchpad.net/juju/+bug/1812227
juju#9737

## Description of change

The test that checks we notice when a Claim has a timeout newer than our
current expiry timeout had a race. The timeout was updated after the
Claim had been processed, which meant the ensureTimeout could happen
arbitrarily far in the future.

Just changing the code to call ensureTimeout before the claim is
returned handles the race. This shouldn't affect production deployments,
because if all the claims have the same timeout, then any 'new' claim
will always be later than any other claims that we already have. And
even if we didn't, at worst it would backstop the expiry to the poll
timeout. So we might be a second behind expiring, but nothing that would
be a problem in production.

## QA steps

```
$ go test -race -count=10000 -failfast -test.timeout=10m -check.v -check.f Early
```

If you include this patch, it fails every time:
```
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -271,6 +271,7 @@ func (manager *Manager) retryingClaim(claim claim) {
 return
 }
 claim.respond(nil)
+ time.Sleep(10 * time.Millisecond)
 // now, this isn't strictly true, as the lease can be given for longer
 // than the requested duration. However, it cannot be shorter.
 // Doing it this way, we'll wake up, and then see we can sleep
```

With the patch, you can't put a sleep there, as the order is changed. But the 'run this test 10,000 times with the race detector' shouldn't fail, while even without the patch it failed occasionally.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815468
ClaimLease now sets the timer expiry according to the requested lease
(since it doesn't tell you how long you actually succeeded for without
rereading Leases). We *should* wake up, but we should note that there is
nothing to expire.
They sometimes passed because the loop would exit before we could
observe that we wanted to Refresh; they should always be calling
Refresh at the 1 minute mark, but not expiring the lease.
There were a couple of issues:
a) setupInitialTimer() would create a NewTimer with an arbitrary
   timeout, and then would update itself with the timeout from Leases.
   However, tests could observe "hey, I have a timer waiting" after the
   first one, and miss the synchronization with the second one. Instead,
   we just always initialize Timer in setNextTimeout.
b) Reset() is not safe to call bare (which is documented in the time
   package). The issue is that it may be trying to send on the channel,
   and block, so we have to call Stop() and optionally drain the channel
   first. This was triggered with Startup_NoExpiry_NotLongEnough, where
   it was improperly synchronizing because of (a), and then the test
   would think it was happy, and start tearing down the manager and
   store. But the code was just about to get to timer.Reset() which then
   died because it was triggering synchronously but the Manager loop had
   already been told to TearDown.

This can be exacerbated (to make tests fail more reliably) with this
patch:
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
        manager.muNextTimeout.Lock()
        manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
        manager.muNextTimeout.Unlock()
+       time.Sleep(0 * time.Millisecond)
        // lastTick has never happened, so pass in the epoch time
        manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
juju#9736

## Description of change

This introduces another 'script' which provides an external client for testing the Leadership Facade. While the testing is probably done for now, it can still be useful as an example of how we might want to load test specific services in the future.

## QA steps

See https://discourse.jujucharms.com/t/lease-scale-testing-2-5-0-vs-2-5-1/574/2 for a discussion of how this client was used.

## Documentation changes

None.

## Bug reference

None.
juju#9742

## Description of change

Fix at least 5 flaky tests around Lease behavior.
* The Manager.setupInitialTimer was allowing the test loops to think that setup had finished, but really it was only that setup had started and had done the initial NewTimer() call.
* Timer.Reset() is unsafe to call without first Stopping the timer, according to the time docs (there is a race with a pending trigger firing); see the sketch after this list.
* Several tests were only passing because they exited the main loop 'fast enough'. They wanted to test that we weren't expiring a claim that got a longer-than-requested duration. However, because Claim doesn't actually tell you how long the Lease was given for, unless we did an extra lookup we can't actually set the timeout trivially. And we want to avoid lookups on every Extend. (And in practice we don't give out arbitrarily longer claims.) Update the tests to sync with the expected Refresh and note that nothing gets expired.
* Also test the fact that it *will* expire the Lease at the appropriate times.
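
A minimal sketch of the safe pattern from the time package docs (shown here for a standard *time.Timer; for the juju clock.Timer interface the drain would read from Chan() instead of t.C):
```
package lease

import "time"

// safeReset resets a timer that may already have fired: Stop it first and,
// if it had fired, drain any pending value, then Reset. A bare Reset can
// race with the runtime sending the expiry on the channel.
func safeReset(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C: // discard the tick from the earlier expiry
		default: // nothing pending (value already received elsewhere)
		}
	}
	t.Reset(d)
}
```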

## QA steps

To trigger the race in setupInitialTimer you could use this patch:
```
--- a/worker/lease/manager.go
+++ b/worker/lease/manager.go
@@ -437,6 +437,7 @@ func (manager *Manager) setupInitialTimer() {
 manager.muNextTimeout.Lock()
 manager.timer = manager.config.Clock.NewTimer(manager.config.MaxSleep)
 manager.muNextTimeout.Unlock()
+ time.Sleep(0 * time.Millisecond)
 // lastTick has never happened, so pass in the epoch time
 manager.computeNextTimeout(time.Time{}, manager.config.Store.Leases())
 }
```
That increases the chance that the test loop syncs on the NewTimer and not on the timer.Reset that is called immediately after it.

For all the tests, the easiest way to see the flakiness is:
```
$ go test -race -count=2000 -failfast -test.timeout=20m -check.v
$ go test -race -count=2000 -failfast -test.timeout=20m -check.v -check.f Expiry
```

With this patch, I didn't get any failures after 2000 loops. (though, as always, there could still be issues.)

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815471
The claim retry loop wasn't properly handling getting 5 ErrInvalid
responses in a row. Now we require 10 invalid responses, and if we end
up getting all failures, we just deny the claim, because clearly the
Primary node knows something we don't. Ultimately, it is a failure that
we are failing to Sync with the master node.

The failure here means that the client should go into a
BlockUntilLeadership call. And if we have gotten in sync by then, we
will block them. If we still haven't, they will be immediately
unblocked, and they'll try to do another claim. But it should still be
a) Correct
b) With delay backoff, it shouldn't introduce too much load.

This does impact other retry loops, but it shouldn't be a problem. For
Expire, we don't have ErrInvalid, as we treat that the same as if we
successfully expired it (at the very least, we know that our existing
lease is incorrect, even if it is now just owned by someone else.)
juju#9740

## Description of change

Creating 2 offers off the one app was failing when relations were processed. To fix, change the token generated on the offer side to be based on the offer name, not the app name.

Also, fix the offer status watcher so that status changes to an offering app's units also trigger a status update in the consuming model.

Note: existing single app offers will continue to work with this fix (it's backwards compatible).
But multi-offers on the same app will need to be redeployed.

## QA steps

juju bootstrap lxd test
juju switch controller
juju model-config logging-juju switch default
juju deploy memcached
juju deploy mariadb
juju add-model foo
juju deploy mediawiki
juju offer mediawiki:cache mw-cache
juju offer mediawiki:db mw-db
juju switch default
juju consume foo.mw-db
juju consume foo.mw-cache
juju relate mw-db mariadb:db
juju relate mw-cache memcached

Check controller logs for cmr relation errors.
Check workloads

juju remove-relation mariadb:db mw-db

Check SAAS status

Upgrade a 2.5.0 cmr deployment and check operations.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1813151
https://bugs.launchpad.net/juju/+bug/1815179
…1815719

juju#9745

## Description of change

The claim retry loop wasn't properly handling getting 5 ErrInvalid
responses in a row. Now we require 10 invalid responses, and if we end
up getting all failures, we just deny the claim, because clearly the
Primary node knows something we don't. Ultimately, it is a failure that
we are failing to Sync with the master node. However, having a Claim
fail 5 (10) times shouldn't cause the Manager to fail. So instead, we
just issue a ClaimDenied.

That response means that the client should go into a
BlockUntilLeadership call. And if we have gotten in sync by then, we
will block them. If we still haven't, they will be immediately
unblocked, and they'll try to do another claim. But it should still be
a) Correct
b) With delay backoff, it shouldn't introduce too much load.

This does impact other retry loops, but it shouldn't be a problem. For
Expire, we don't have ErrInvalid, as we treat that the same as if we
successfully expired it (at the very least, we know that our existing
lease is incorrect, even if it is now just owned by someone else.)
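
A self-contained toy sketch of that retry shape (the sentinel errors below stand in for core/lease's ErrInvalid and ErrClaimDenied; the real code also backs off between attempts and lives inside the Manager):
```
package main

import (
	"errors"
	"fmt"
)

// These sentinels stand in for core/lease's ErrInvalid and ErrClaimDenied;
// the names and exact semantics in juju's code may differ.
var (
	errInvalid     = errors.New("invalid lease operation")
	errClaimDenied = errors.New("lease claim denied")
)

// retryClaim keeps retrying while the store reports stale (invalid) state,
// up to maxRetries attempts. If every attempt comes back invalid, the claim
// is denied rather than failing the whole manager; the caller then falls
// back to blocking until leadership is released.
func retryClaim(tryClaim func() error, maxRetries int) error {
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := tryClaim()
		if !errors.Is(err, errInvalid) {
			return err // success, or a genuinely fatal error
		}
		// Stale local state; the real code backs off here so the store can
		// catch up with the raft leader before the next attempt.
	}
	return errClaimDenied
}

func main() {
	alwaysStale := func() error { return errInvalid }
	fmt.Println(retryClaim(alwaysStale, 10)) // lease claim denied
}
```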

## QA steps

A unit test was added which exhibits the failing behavior.

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1815719
Lots of pieces were out of date, should be better now.
@jameinel
Member Author

$$merge$$ self-approving a merge of 2.5 into develop

@jameinel
Member Author

$$merge$$ both failures were intermittent (one failing mongo with socket-in-use, and one failing watch test)

@jameinel
Member Author

$$merge$$

1 similar comment
@jameinel
Member Author

$$merge$$

@jameinel
Member Author

jameinel commented Feb 13, 2019

$$merge$$
Last failure was:

FAIL: upgrade_test.go:163: upgradeSuite.TestDowngradeOnMasterWhenOtherControllerDoesntStartUpgrade

[LOG] 0:07.341 DEBUG juju.apiserver [1] machine-0 API connection terminated after 1.235265564s
[LOG] 0:07.350 DEBUG juju.worker.dependency "api-server" manifold worker completed successfully
[LOG] 0:07.350 INFO juju.worker.httpserver shutting down HTTP server
[LOG] 0:07.350 DEBUG juju.worker.dependency "http-server" manifold worker completed successfully
upgrade_test.go:226:
c.Fatal("machine agent did not exit as expected")
... Error: machine agent did not exit as expected

https://bugs.launchpad.net/bugs/1793276

@jameinel
Member Author

$$merge$$

1 similar comment
@jameinel
Member Author

$$merge$$

@jujubot merged commit a0ce0c8 into juju:develop on Feb 13, 2019