Merge 2.2 1007 #7608

Merged
merged 38 commits into juju:develop on Jul 10, 2017

Owner

wallyworld commented Jul 10, 2017

Description of change

Merge 2.2 into develop, including:

  • batch status history pruning
  • separate controller and model upgrades
  • API server auth fixes
  • fix for debug-log tailing

QA steps

smoke test bootstrap

babbageclunk and others added some commits Jun 12, 2017

Batch pruning from the status history collection
Ensure that we only try to delete 1000 rows at a time. In a large
deployment with many status history records the pruner was failing
because its single RemoveAll call tried to delete so many rows at once
that it could not complete.
Extract batch-deletion to reuse for age pruning
Created a statusHistoryPruner struct so that pruneByAge and pruneBySize could
both use deleteInBatches.
state: Generalise error handling when splitting logs collection
The error checks in SplitLogCollections didn't work on MongoDB 2.4 as
older MongoDB versions only return an error string without the error
code.

mgo's IsDup helper is now used for insert error handling and "ns not
found" is now checked by code or string.

Fixes https://bugs.launchpad.net/juju/+bug/1701999
Merge pull request #7588 from mjs/log-split-upgrade-err-handling
state: Generalise error handling when splitting logs collection

## Description of change

The error checks in SplitLogCollections didn't work on MongoDB 2.4 as
older MongoDB versions only return an error string without the error
code.

mgo's IsDup helper is now used for insert error handling and "ns not
found" is now checked by code or string.

Fixes https://bugs.launchpad.net/juju/+bug/1701999
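
For reference, a minimal sketch of the dual check described above; the helper names are illustrative, not the actual Juju code.
```
package example

import (
	"strings"

	"gopkg.in/mgo.v2"
)

// isNSNotFound reports whether err means the collection does not exist.
// Newer MongoDB servers return the NamespaceNotFound error code (26);
// MongoDB 2.4 only returns the "ns not found" message, so check both.
func isNSNotFound(err error) bool {
	if qerr, ok := err.(*mgo.QueryError); ok && qerr.Code == 26 {
		return true
	}
	return err != nil && strings.Contains(err.Error(), "ns not found")
}

// insertIgnoringDup inserts doc, treating a duplicate-key error as
// success (the record was already copied on a previous attempt).
func insertIgnoringDup(coll *mgo.Collection, doc interface{}) error {
	if err := coll.Insert(doc); err != nil && !mgo.IsDup(err) {
		return err
	}
	return nil
}
```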

## QA steps

Related unit tests were failing on MongoDB 2.4 before and pass now.

## Documentation changes

N.A. 

## Bug reference

https://bugs.launchpad.net/juju/+bug/1701999
Merge pull request #7592 from howbazaar/2.2-fix-debug-log-tailing
Skip non-insert records when tailing the oplog.

## Description of change

The debug-log implementation of tailing the oplog broke when log pruning ran, because pruning wrote deletion entries into the oplog that the tailer then tried, and failed, to deserialise as log records.
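
A rough sketch of the skip, assuming the standard oplog document shape ("op" is "i" for inserts, "d" for deletes, "u" for updates); names are illustrative rather than the actual tailer code.
```
package example

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

type oplogDoc struct {
	Operation string   `bson:"op"`
	Object    bson.Raw `bson:"o"`
}

// tailLogs streams inserted log records from the oplog, ignoring the
// deletion entries written by the log pruner (and any updates/no-ops),
// which are not log records and cannot be deserialised as such.
func tailLogs(oplog *mgo.Collection, out chan<- bson.Raw) error {
	iter := oplog.Find(nil).Tail(-1)
	defer iter.Close()
	var doc oplogDoc
	for iter.Next(&doc) {
		if doc.Operation != "i" {
			continue // skip non-insert records
		}
		out <- doc.Object
	}
	return iter.Err()
}
```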

## QA steps

```
cd state
go test -check.f TestLogDeletionDuringTailing -check.vv
```
And observe that there is no log deserialisation warning.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1700453
Handle subordinate-subordinate relations correctly
Don't filter sub-sub relations in uniter API; the subordinate relations
watcher should allow relations where the other end is a subordinate to
get to the uniter, since they could result in units that exist on the
same machine.

RelationUnit.Valid returns true for subordinate-subordinate relations.

Handle them correctly in the 2.2.1 upgrade step.
Merge pull request #7590 from babbageclunk/subsub-rel-fix
Handle subordinate-subordinate relations in subordinate filtering

## Description of change
In fixing [bug 1686696](https://bugs.launchpad.net/juju/+bug/1686696) I changed the uniter API WatchUnitRelations to restrict container-scoped relations to those where the other end was the app for the principal unit. Then in the fixes for the related [bug 1699050](https://bugs.launchpad.net/juju/+bug/1699050) an upgrade step was added to make the same change in the database, and the uniter API was changed to prevent invalid relation units entering scope.

This excluded relations between subordinate applications (like ntp and nrpe), which meant that upgrading a model with sub-sub relations would result in the uniters repeatedly crashing as they tried to remove a relation state directory with files in it (because that sub-sub relation had been incorrectly filtered out).

Change the watcher, upgrade step and uniter API EnterScope method to allow sub-sub relations through.
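
A simplified illustration of the corrected filtering rule (hypothetical helper, not the actual uniter API code):
```
package example

// application captures just what the filter needs; illustrative, not
// the real state.Application interface.
type application interface {
	Name() string
	IsSubordinate() bool
}

// relationRelevant: a container-scoped relation is passed through to
// the uniter if the other end is the unit's principal application, or
// is itself a subordinate (the ntp-nrpe case that was previously being
// filtered out).
func relationRelevant(otherEnd application, principalAppName string) bool {
	if otherEnd.Name() == principalAppName {
		return true
	}
	return otherEnd.IsSubordinate()
}
```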

## QA steps

* Bootstrap a 2.1.3 controller and add a model.
* Deploy the following bundle:
```
series: xenial
machines:
  '0':
    series: xenial
  '1':
    series: xenial
services:
  min:
    charm: cs:ubuntu-10
    num_units: 1
    to:
      - "0"
  min2:
    charm: cs:ubuntu-10
    num_units: 1
    to:
      - "1"
  nrpe:
    charm: cs:nrpe-23
  ntp:
    charm: cs:ntp-18
relations:
- - min:juju-info
  - nrpe:general-info
- - min2:juju-info
  - nrpe:general-info
- - ntp:nrpe-external-master
  - nrpe:nrpe-external-master
- - ntp:juju-info
  - min:juju-info
- - ntp:juju-info
  - min2:juju-info
```
* Observe in the DB that the min-ntp, min2-ntp, min-nrpe and min2-nrpe relations all have an incorrect unitcount of 3, but the ntp-nrpe relation unitcount is 4 (which is right).
* Upgrade the controller and model from 2.1.3 to the version containing this PR.
* Check in the DB that the incorrect unitcounts have been changed to 2, but the ntp-nrpe relation is still 4.
* Check that there aren't repeated errors in the log from the uniter crashing and restarting in a loop.

## Bug reference

Fixes https://bugs.launchpad.net/juju/+bug/1700450
Merge pull request #7594 from babbageclunk/skip-bad-test
Skip flaky serverSuite.TestClosesStateFromPool

## Description of change

This test fails frequently in builds - marking it as flaky so we can track it.

## QA steps

Ran the tests, saw that it was skipped.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1702215
Merge pull request #7595 from howbazaar/2.2-fix-dummy-data-race
Always use pointer receiver to avoid data race.

## Description of change

This was a bit of a doozy because it wasn't at all obvious. This fixes a data race found in the BootstrapSuite. The CredentialSchemas() method was being called on a non-pointer receiver of the environProvider. Even though the method is returning a new structure just filled with constants, the method is just syntactic sugar for a function that takes the receiver as the first arg, and the receiver was being copied. The structure has a mutex, and other values were being written to on other goroutines, and that triggered the race detector. The solution here is to just make sure that all the receivers are pointer receivers.
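
A minimal reproduction of the pattern described above, with illustrative names: a method on a value receiver copies the whole struct, including its mutex, so the copy races with goroutines writing to the original under that mutex. `go vet` also flags the value-receiver version for passing a lock by value.
```
package example

import "sync"

type provider struct {
	mu     sync.Mutex
	events []string // written by other goroutines while holding mu
}

// Bad: the value receiver copies p (and p.mu) on every call, racing
// with concurrent writers that mutate the original struct.
func (p provider) CredentialSchemasValue() map[string]string {
	return map[string]string{"userpass": "username/password"}
}

// Good: the pointer receiver avoids the copy, so there is no race.
func (p *provider) CredentialSchemas() map[string]string {
	return map[string]string{"userpass": "username/password"}
}
```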

## QA steps

I was not able to trigger this locally, but @davecheney recognised the error pattern from an earlier fix he had done.
Use testclock.WaitAdvance in TestPause
The test passes fine locally but fails often in CI, and when run under
stress it fails after 1 or 2 runs. It looks like the clock advance is
happening before the throttlingListener manages to wait for it. Change
the Advance calls to WaitAdvance so we know the listener is waiting when
we advance.

Removed start channel - it complicated the test for no real benefit.
Merge pull request #7596 from babbageclunk/listener-test-reliability
apiserver: Use WaitAdvance in listenerSuite.TestPause

## Description of change

The test passes fine locally but fails often in CI, and when run under
stress it fails after 1 or 2 runs. It looks like the clock advance is
happening before the `throttlingListener` manages to wait for it. Change
the `Advance` calls to `WaitAdvance` so we know the listener is waiting when
we advance. The `time.After` threshold needs to be created before the
`WaitAdvance` call so that the done channel doesn't appear to be ready
early in the case that we needed to wait for the accept goroutine to be
waiting. (Picking the max waits on the `WaitAdvance` calls was fiddly
because `WaitAdvance` divides the max wait time into 10 chunks.)
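
A sketch of the WaitAdvance pattern, assuming the juju testclock package (github.com/juju/clock/testclock in current code; the test names and the code under test here are illustrative). WaitAdvance only advances the clock once the expected number of waiters is registered, so the test cannot race ahead of the goroutine that is waiting on the clock.
```
package example

import (
	"testing"
	"time"

	"github.com/juju/clock/testclock"
)

func TestPauseSketch(t *testing.T) {
	clk := testclock.NewClock(time.Now())
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-clk.After(time.Second) // stand-in for the throttling listener's pause
	}()

	// Wait up to 100ms (polled internally in 10 chunks) for one waiter
	// to be registered on the clock, then advance past the pause.
	if err := clk.WaitAdvance(time.Second, 100*time.Millisecond, 1); err != nil {
		t.Fatal(err)
	}
	<-done
}
```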

## QA steps

The test passes normally and for more than 20 runs under stress.
apiserver/uniter: Add OtherApplication to RelationResult
This required changing the return type of Relation and RelationById and
bumping the uniter API to version 6 (leaving a V5 that returns the old
types). The new field will be used in the uniter worker to determine
whether the relation can keep a subordinate unit alive.
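
An illustrative sketch of the shape of the v6 result (field names approximate, not the exact params structs): the existing relation data plus the application on the other end, so the uniter can tell whether the relation is to its principal.
```
package example

type RelationResultV6 struct {
	Life             string `json:"life"`
	Id               int    `json:"id"`
	Key              string `json:"key"`
	OtherApplication string `json:"other-application"`
}
```
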
api/uniter: Add Relation.OtherApplication
This will be used by the uniter to determine whether the relation to the
principal unit is still alive.
worker/uniter: Destroy unit when principal relation dies
Previously the uniter would only destroy the unit when all of its
container-scoped relations died - this meant that a relation to another
subordinate unit under the same principal could keep it alive.

Make ordering of relations deterministic for testing.
Merge pull request #7603 from babbageclunk/uniter-subsub
uniter: Relations between subordinates shouldn't keep the sub units alive

## Description of change

Relating two subordinates and then trying to remove the principal application would leave the units of the application in _terminating_ state. The principal units wouldn't be cleaned up until their subordinates were, and they wouldn't die because the sub-sub relation would keep them alive. For example, if you had this configuration:
```
mysql/0
    ntp/0
    nrpe/0
```
...and nrpe and ntp were also related, then `mysql/0` wouldn't get removed if you removed `mysql`.

Fix this by making the subordinate units only consider the relations to their principal when checking whether they should destroy themselves.
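
A simplified sketch of the new rule (illustrative names, not the actual uniter code): only the container-scoped relation whose other end is the principal application counts when deciding whether a subordinate should destroy itself, so a relation to another subordinate (ntp-nrpe above) no longer keeps the unit alive.
```
package example

type relationInfo struct {
	OtherApplication string
	Life             string
}

// shouldDestroySubordinate returns true once the relation to the
// principal application is no longer alive, regardless of any other
// container-scoped relations the unit has.
func shouldDestroySubordinate(principalApp string, relations []relationInfo) bool {
	for _, rel := range relations {
		if rel.OtherApplication == principalApp && rel.Life == "alive" {
			return false // the principal relation is still alive
		}
	}
	return true
}
```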

## QA steps

* Deploy the following bundle:
```
series: xenial
machines:
  '0':
    series: xenial
services:
  min:
    charm: cs:ubuntu-10
    num_units: 1
    to:
      - "0"
  nrpe:
    charm: cs:nrpe-23
  ntp:
    charm: cs:ntp-18
relations:
- - min:juju-info
  - nrpe:general-info
- - ntp:nrpe-external-master
  - nrpe:nrpe-external-master
- - ntp:juju-info
  - min:juju-info
```
* Wait for the three units to become idle.
* Remove the min application.
* It should remove cleanly.

## Bug reference
Fixes part of https://bugs.launchpad.net/juju/+bug/1702636
Introduce environ/provider versioning
Environs (i.e. the cloud representation of
a model) are now versioned, so that we can
upgrade them as necessary. Previously this
has been done as part of controller upgrades,
but this poses a problem when the environs
cannot be opened; thus we seek to decouple
the upgrading of a model's environ from the
upgrading of the controller.

EnvironProvider now has a Version method,
which returns the current/most up-to-date
environ version. Model docs now record the
version of the model's associated environ.
Environ upgrade operations specify an
environ/provider version, rather than
controller/agent version.

Once this lands, we will introduce a new
worker that will run under the model
dependency engine to upgrade the model's
environ, and which will gate access to the
model until that has happened.
Merge pull request #7599 from axw/environ-version
Introduce environ/provider versioning

## Description of change

Environs (i.e. the cloud representation of
a model) are now versioned, so that we can
upgrade them as necessary. Previously this
has been done as part of controller upgrades,
but this poses a problem when the environs
cannot be opened; thus we seek to decouple
the upgrading of a model's environ from the
upgrading of the controller.

EnvironProvider now has a Version method,
which returns the current/most up-to-date
environ version. Model docs now record the
version of the model's associated environ.
Environ upgrade operations specify an
environ/provider version, rather than
controller/agent version.

Once this lands, we will introduce a new
worker that will run under the model
dependency engine to upgrade the model's
environ, and which will gate access to the
model until that has happened.

Requires juju/description#16

(There is a drive-by fix in state/upgrade.go, to stop checking for env-uuid upgrades. That all happened before 2.0, so we don't need to carry that code anymore.)
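
A rough sketch of the shape of this change (names are illustrative, not copied from the Juju source): the provider reports its current environ version, the model document records the version its environ has reached, and upgrade operations are keyed on provider versions rather than controller/agent versions.
```
package example

// environProvider reports the most up-to-date environ version for a
// provider; a model whose recorded version is older needs upgrading.
type environProvider interface {
	Version() int
}

// upgradeOperation holds the steps needed to bring an environ up to
// targetVersion from any earlier version.
type upgradeOperation struct {
	targetVersion int
	steps         []func() error
}

// upgradeEnviron runs, in order, every operation newer than the
// version currently recorded for the model's environ, returning the
// version reached (which the real code would persist in the model doc).
func upgradeEnviron(recordedVersion int, ops []upgradeOperation) (int, error) {
	for _, op := range ops {
		if op.targetVersion <= recordedVersion {
			continue
		}
		for _, step := range op.steps {
			if err := step(); err != nil {
				return recordedVersion, err
			}
		}
		recordedVersion = op.targetVersion
	}
	return recordedVersion, nil
}
```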

## QA steps

No functional changes. Bootstrap azure with 2.1.x, juju upgrade-juju to this branch.

## Documentation changes

None.

## Bug reference

Part of the fix for https://bugs.launchpad.net/juju/+bug/1700451
Merge pull request #7604 from babbageclunk/remove-formodel
apiserver: Replace ForModel with pool in ModelStatus

## Description of change
Having status use an existing state instead of starting a fresh one (with the attendant workers) is more efficient.

## QA steps
Bootstrap and create a few models. Verify that status works.

## Bug reference
Part of https://bugs.launchpad.net/juju/+bug/1698702
{api,apiserver}/modelupgrader: add API
Add new ModelUpgrader API facade.
worker/modelupgrader: introduce new worker
In 2.2 we introduced upgrade steps that would attempt to open an
Environ for every model, and run upgrade operations on the Environ
as necessary (e.g. to move VMs to folders in vSphere). If the
Environ cannot be opened, then the entire controller upgrade fails.

We now introduce a "modelupgrader" worker, which is responsible
for running controller-side upgrade operations for a model's
Environ. Other model workers must wait for this worker to complete
before they can start. If an Environ cannot be opened, or the
Environ's upgrade operations fail, only that model will be
affected.
cmd/jujud: wire up modelupgrader
Add the modelupgrader manifold into the
model dependency engine, and gate starting
other model workers on the upgrades having
been run. Remove environ upgrades from the
controller upgrade steps.
Merge pull request #7602 from axw/worker-modelupgrader
Separate model environ upgrades from controller upgrades

## Description of change

In 2.2 we introduced upgrade steps that would attempt to open an
Environ for every model, and run upgrade operations on the Environ
as necessary (e.g. to move VMs to folders in vSphere). If the
Environ cannot be opened, then the entire controller upgrade fails.

We now introduce a "modelupgrader" worker, which is responsible
for running controller-side upgrade operations for a model's
Environ. Other model workers must wait for this worker to complete
before they can start. If an Environ cannot be opened, or the
Environ's upgrade operations fail, only that model will be
affected.

(Requires #7599)
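
A condensed sketch of the worker's responsibility (illustrative, not the actual worker/modelupgrader code): compare the model's recorded environ version with the provider's current version, run any pending upgrade operations for just that model, and only then unlock the gate that the other model workers wait on. A failure keeps only this model's workers gated; the controller upgrade itself completes.
```
package example

type facade interface {
	ModelEnvironVersion() (int, error)
	SetModelEnvironVersion(v int) error
}

type environ interface {
	// UpgradeTo runs the upgrade steps needed to reach version v.
	UpgradeTo(v int) error
}

func runModelUpgrade(f facade, env environ, providerVersion int, unlock func()) error {
	current, err := f.ModelEnvironVersion()
	if err != nil {
		return err
	}
	for v := current + 1; v <= providerVersion; v++ {
		if err := env.UpgradeTo(v); err != nil {
			return err // only this model's workers stay gated
		}
		if err := f.SetModelEnvironVersion(v); err != nil {
			return err
		}
	}
	unlock() // let the rest of the model's workers start
	return nil
}
```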

## QA steps

1. juju bootstrap azure azure1 (with 2.1.x)
2. juju add-model foo
(confirm there's no "common" deployment via the portal)
3. juju bootstrap azure azure2 (with this branch)
4. juju switch azure1
5. juju migrate foo azure2
(confirm there's now a "common" deployment in the portal)

The above shows that model upgrade steps are run when a model is migrated to a newer controller.

Repeat, using "juju upgrade-juju" on the original controller to show that the same works with an upgrade.

Repeat with "juju upgrade-juju", but this time inject a failure (before running the upgrade-juju command):
```
mkdir /var/lib/juju/wrench
cat fail-all > /var/lib/juju/wrench/modelupgrader
```

And confirm that the controller is upgraded, and only the model workers are broken until the wrench file is removed.

## Documentation changes

None.

## Bug reference

Fixes https://bugs.launchpad.net/juju/+bug/1700451
apiserver/...: fix several auth issues
Backport of auth issues found/fixed in
#7606.

- logfwd was using AuthMachineAgent, and should
  have used AuthController. It has been updated
  to use the latter.
- payloads had no auth at all; it has been updated
  to use AuthClient.
- metricsmanager and undertaker: redundant calls
  to AuthMachineAgent, when AuthController would
  do. Simplified.
- charmrevisionupdater: was allowing non-controller
  agents, now does not.
- keymanager and highavailability: both were allowing
  controller unnecessarily; dropped.
Merge pull request #7501 from babbageclunk/status-hist-pruner-delete
state: Prune status history in batches

## Description of change

In a heavily loaded system we saw the status history pruner failing because it was trying to delete too many rows at once - the RemoveAll call would time out.

Change pruning to delete rows in batches of 1000 by id using the mgo bulk API. Use ids rather than time ranges to ensure we can always make progress - in a pathological situation there might be enough status updates in a given time range that the delete still can't complete.
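
A minimal sketch of batched deletion by id with the mgo bulk API; the names are illustrative (the real pruner wraps this in the statusHistoryPruner shared by pruneByAge and pruneBySize), and the id type is left open since it is not specified here.
```
package example

import (
	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

const batchSize = 1000

// deleteInBatches removes every document matched by query, at most
// batchSize at a time, selecting ids first so each bulk remove stays
// small enough to always make progress.
func deleteInBatches(coll *mgo.Collection, query bson.D) (int, error) {
	deleted := 0
	for {
		ids := make([]interface{}, 0, batchSize)
		iter := coll.Find(query).Select(bson.M{"_id": 1}).Limit(batchSize).Iter()
		var doc struct {
			Id interface{} `bson:"_id"`
		}
		for iter.Next(&doc) {
			ids = append(ids, doc.Id)
		}
		if err := iter.Close(); err != nil {
			return deleted, err
		}
		if len(ids) == 0 {
			return deleted, nil // nothing left to prune
		}
		bulk := coll.Bulk()
		bulk.Unordered()
		for _, id := range ids {
			bulk.Remove(bson.M{"_id": id})
		}
		if _, err := bulk.Run(); err != nil {
			return deleted, err
		}
		deleted += len(ids)
	}
}
```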

## QA steps

* Bootstrap a controller, setting max-status-history-age to 5m and max-status-history-size to 1M.
* Deploy some units
* Generate a lot of status updates by running `status-set` in a loop under `juju run` - I used the following script (scp'd to the unit machines) and ran it with `juju run --unit x nohup /home/ubuntu/spam-status.bash` :
```
while true; do
      status-set maintenance xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
      status-set active "all good"
done
```
* Wait for 5 minutes for the pruner to run, and check that it reports deleting the records by age from each model.
* Turn off the age constraint by setting `max-status-history-age` to 0 in the default model.
* Check that the pruner logs that it's deleting rows by size (but only once since this is across all models).
* Check the collection size in the mongo shell using `db.statuseshistory.dataSize()` - this should be within 1Mb of the specified limit (it might be 1.9 Mb since we ask for the size in Mb and Mongo gives us an int - 1.9 truncates to 1).

## Bug reference

Fixes https://bugs.launchpad.net/juju/+bug/1696509
Merge pull request #7607 from axw/apiserver-facade-auth-fixes
apiserver/...: fix several auth issues

## Description of change

Backport of auth issues found/fixed in
#7606.

- logfwd was using AuthMachineAgent, and should
  have used AuthController. It has been updated
  to use the latter.
- payloads had no auth at all; it has been updated
  to use AuthClient.
- metricsmanager and undertaker: redundant calls
  to AuthMachineAgent, when AuthController would
  do. Simplified.
- charmrevisionupdater: was allowing non-controller
  agents, now does not.
- keymanager and highavailability: both were allowing
  controller unnecessarily; dropped.
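
A condensed illustration of the kind of constructor-time auth checks being tightened here (hypothetical facades, not the actual Juju code): controller-only facades reject anything that is not a controller agent, and client-facing facades require a client connection.
```
package example

import "errors"

var errPerm = errors.New("permission denied")

// authorizer is the slice of the apiserver authorizer that these
// checks need; illustrative, not the full interface.
type authorizer interface {
	AuthController() bool // connection is a controller machine agent
	AuthClient() bool     // connection is a user client
}

type logForwardingFacade struct{}
type payloadsFacade struct{}

// newLogForwardingFacade: controller agents only (previously any
// machine agent was accepted via AuthMachineAgent).
func newLogForwardingFacade(auth authorizer) (*logForwardingFacade, error) {
	if !auth.AuthController() {
		return nil, errPerm
	}
	return &logForwardingFacade{}, nil
}

// newPayloadsFacade: user clients only (previously there was no
// check at all).
func newPayloadsFacade(auth authorizer) (*payloadsFacade, error) {
	if !auth.AuthClient() {
		return nil, errPerm
	}
	return &payloadsFacade{}, nil
}
```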

## QA steps

Run tests. Bootstrap.

## Documentation changes

None.

## Bug reference

None.

axw approved these changes Jul 10, 2017

api/uniter/uniter_test.go
@@ -120,6 +120,17 @@ func (s *uniterSuite) assertInScope(c *gc.C, relUnit *state.RelationUnit, inScop
c.Assert(ok, gc.Equals, inScope)
}
+func (s *uniterSuite) patchNewState(

axw Jul 10, 2017

Member

I don't think we need this any more.


wallyworld Jul 10, 2017

Owner

ah yes, i meant to remove that

axw and others added some commits Jul 10, 2017

worker/modelupgrade: set model status
Set model status before and after upgrading the
model's environ. Allow the "error" status for
models, so that we can indicate that the model
requires attention (e.g. credentials are out of
date and need to be refreshed.)
Merge pull request #7609 from axw/modelupgrader-setstatus
worker/modelupgrade: set model status

## Description of change

Set model status before and after upgrading the
model's environ. Allow the "error" status for
models, so that we can indicate that the model
requires attention (e.g. credentials are out of
date and need to be refreshed.)
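
A small sketch of the flow described above (status values and wiring are illustrative): mark the model busy while its environ upgrades, record an error status on failure so the operator knows the model needs attention, and return it to available on success.
```
package example

func upgradeWithStatus(setStatus func(status, message string) error, upgrade func() error) error {
	if err := setStatus("busy", "upgrading environ"); err != nil {
		return err
	}
	if err := upgrade(); err != nil {
		if err2 := setStatus("error", "failed to upgrade environ: "+err.Error()); err2 != nil {
			return err2
		}
		return err
	}
	return setStatus("available", "")
}
```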

## QA steps

1. juju bootstrap localhost
2. modify provider code, bumping provider version to 1, and introducing an environ upgrade step that fails
3. juju upgrade-juju -m controller
4. observe model status indicates a failure occurred
5. fix the provider code, re-run juju upgrade-juju -m controller
6. observe model status is back to normal

## Documentation changes

None.

## Bug reference

https://bugs.launchpad.net/juju/+bug/1700451
Owner

wallyworld commented Jul 10, 2017

$$merge$$

Contributor

jujubot commented Jul 10, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Jul 10, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/11282

Owner

wallyworld commented Jul 10, 2017

$$merge$$

Contributor

jujubot commented Jul 10, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

jujubot merged commit 6a8745d into juju:develop Jul 10, 2017

1 check passed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!