Prevent unit failing when starting model migration #6712

Merged
merged 6 commits into from Dec 15, 2016

Contributor

mjs commented Dec 15, 2016

When the model migration fortress was locked down, some of the manifolds used by the uniter could return a "fortress shutting down" error, which was passed back to the uniter resolver loop. This would cause the unit to be put into error before the uniter shut down, causing precheck failures. The problem is timing dependent, which is why it was seen only intermittently.

There are a number of fixes here:

  1. The unit agent now logs unclassified resolver loop errors instead of swallowing them.
  2. The migrationminion now logs when it is reporting back to the master.
  3. The uniter now takes an optional function to support translation of resolver loop errors.
  4. The fortress shutdown error is now exported.
  5. Using the above, fortress errors coming out of the resolver loop now cause the resolver to restart instead of causing a unit error.
  6. The migrationmaster now runs prechecks twice in the QUIESCE phase: once before minions have reported back, to avoid long waits for an agent that will never be able to report, and again after minions have reported, to ensure that the model is truly ok to migrate.

QA

Using a charm with a long sleep in its config-changed hook, tested multiple migrations which were triggered while the hook was executing. Previously this would reliably trigger a precheck failure.

Fixes https://bugs.launchpad.net/juju/+bug/1620438

mjs added some commits Dec 14, 2016

worker/uniter: Log agent errors
Previously some errors returned by the resolver loop were being
swallowed making it difficult to diagnose failures. Now agent errors are
logged.
worker/migrationminion: Log when reporting back
When troubleshooting it's useful to see when the migrationminion
reported back to the master.
worker/uniter: Support resolver error translation
Allow injection of an optional function which converts errors returned
by the resolver loop into another error. This will be used to suppress
fortress related errors which are causing the uniter to go into a failed
state when manifolds are shut down before a migration.
worker/fortress: Expose ErrShutdown
Allow external components to identify the "fortress worker shutting
down" error. Also added IsFortressError helper to simplify
identification of ErrAborted and ErrShutdown.
worker/migrationmaster: Precheck twice in QUIESCE
Two precheck runs are needed:

Running the prechecks before waiting for minions to report avoids a
long wait if an agent is never going to be able to report (because it is
down).

Running the prechecks once minions have reported ensures that the model
is definitely ok to migrate.
Convert fortress errors in the uniter
In order to prevent the uniter from putting the unit into error when
its dependencies shut down in preparation for a model migration, an error
translation function is now injected.
Member

anastasiamac commented Dec 15, 2016

Please forward port to develop (2.2) when ready.

Contributor

mjs commented Dec 15, 2016

$$merge$$

Contributor

jujubot commented Dec 15, 2016

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Dec 15, 2016

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/9882

Contributor

mjs commented Dec 15, 2016

$$mongodb-sucks$$

Contributor

jujubot commented Dec 15, 2016

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit cc744ba into juju:2.1 Dec 15, 2016

@mjs mjs deleted the mjs:1620438-MM-not-idle branch Dec 15, 2016

jujubot added a commit that referenced this pull request Jan 7, 2017

Merge pull request #6779 from mjs/1620438-MM-not-idle-develop
Prevent unit failing when starting model migration

This is a forward port of #6712 .

When the model migration fortress was locked down, some of the manifolds used by the uniter could return a "fortress shutting down" error, which was passed back to the uniter resolver loop. This would cause the unit to be put into error before the uniter shut down, causing precheck failures. The problem is timing dependent, which is why it was seen only intermittently.

There are a number of fixes here:

1. The unit agent now logs unclassified resolver loop errors instead of swallowing them.
2. The migrationminion now logs when it is reporting back to the master.
3. The uniter now takes an optional function to support translation of resolver loop errors.
4. The fortress shutdown error is now exported.
5. Using the above, fortress errors coming out of the resolver loop now cause the resolver to restart instead of causing a unit error.
6. The migrationmaster now runs prechecks twice in the QUIESCE phase: once before minions have reported back, to avoid long waits for an agent that will never be able to report, and again after minions have reported, to ensure that the model is truly ok to migrate.

### QA

Using a charm with a long sleep in its config-changed hook, tested multiple migrations which were triggered while the hook was executing. Previously this would reliably trigger a precheck failure.

Fixes https://bugs.launchpad.net/juju/+bug/1620438