Cleanup error reporting during container provisioning #6828

Merged
merged 10 commits into from Jan 19, 2017

4 participants
Owner

jameinel commented Jan 18, 2017

This handles bug #1650252.

Key points:

  1. We were calling machine.SetStatus() instead of machine.SetInstanceStatus() for provisioning failures. The former reports the status of the agent (which doesn't exist yet), whereas the latter reports the status of the instance that we are trying to start.
  2. Polished the error messages and log messages so that they explain more clearly to users what is going on.
  3. Containers now use status values similar to those used when provisioning machines on MAAS, such as "allocating" instead of "pending".

It's possible to test this today by doing:
juju deploy app --bind badspace --to lxd:N
where 'badspace' is the name of a space that the host machine doesn't have access to. (Eventually this will be a deployment failure, but it triggers the provisioning failure right now.)

jameinel added some commits Jan 17, 2017

Change Provisioner to use Machine.SetInstanceStatus
During provisioning, we are actually reporting on the status of the machine
itself, not the agent (because the agent doesn't exist yet). And the
final status of the machine was getting suppressed because of the reporting
on the 'agent' not being connected to the machine.
Owner

jameinel commented Jan 18, 2017

While provisioning is failing, the output of 'juju show-machine N' looks like:

      1/lxd/12:
        juju-status:
          current: pending
          since: 18 Jan 2017 08:16:10+04:00
        instance-id: pending
        machine-status:
          current: pending
          message: 'failed to start instance (unable to setup network: host machine
            "1" has no available device in space(s) "ceph"), will retry in 10s'
          since: 18 Jan 2017 08:16:37+04:00
        series: xenial

Once it has finally given up (after 3 retries, roughly 30s), it says:

      1/lxd/12:
        juju-status:
          current: pending
          since: 18 Jan 2017 08:16:10+04:00
        instance-id: pending
        machine-status:
          current: error
          message: 'unable to setup network: host machine "1" has no available device
            in space(s) "ceph"'
          since: 18 Jan 2017 08:16:47+04:00
        series: xenial

The controller has a log message with:
2017-01-18 04:16:47 WARNING juju.state machine_linklayerdevices.go:1224 container "1/lxd/12" wants spaces "ceph", but host machine "1" has "guest-100", "guest-150", "space-0", "vixen" missing "ceph"

The machine that is provisioning the container has log messages of:

2017-01-18 04:16:17 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), will retry in 10s
2017-01-18 04:16:27 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), will retry in 10s
2017-01-18 04:16:37 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), will retry in 10s
2017-01-18 04:16:47 ERROR juju.provisioner provisioner_task.go:687 cannot start instance for machine "1/lxd/12": unable to setup network: host machine "1" has no available device in space(s) "ceph"

jameinel added some commits Jan 18, 2017

change all functions to look at InstanceStatus instead of Status
Provisioner is setting the error there, thus we shouldn't look at Status to tell if
the machine failed to provision due to a transient error.
Owner

jameinel commented Jan 18, 2017

Following feedback from frobware, I added an 'attempts left' section, which now looks like:

2017-01-18 09:55:39 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), retrying in 10s (3 more attempts)
2017-01-18 09:55:50 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), retrying in 10s (2 more attempts)
2017-01-18 09:56:00 WARNING juju.provisioner provisioner_task.go:715 failed to start instance (unable to setup network: host machine "1" has no available device in space(s) "ceph"), retrying in 10s (1 more attempts)
2017-01-18 09:56:11 ERROR juju.provisioner provisioner_task.go:687 cannot start instance for machine "1/lxd/16": unable to setup network: host machine "1" has no available device in space(s) "ceph"
Owner

jameinel commented Jan 18, 2017

$$merge$$

Contributor

jujubot commented Jan 18, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Jan 18, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10054

Contributor

frobware commented Jan 18, 2017

$$merge$$

Contributor

jujubot commented Jan 18, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Jan 18, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10057

Contributor

jujubot commented Jan 18, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10060

Fix the apiserver provisioner tests.
They needed to be setting InstanceStatus now. Also discovered that
InstanceStatus wasn't being directly tested, so added some tests for
those.
Might need to tweak the status values we are setting on the machine.
Owner

jameinel commented Jan 18, 2017

!!build!!

This looks very sane to me.

state/linklayerdevices_test.go
@@ -1669,7 +1669,7 @@ func (s *linkLayerDevicesStateSuite) TestSetContainerLinkLayerDevicesMissingBrid
})
c.Assert(err, jc.ErrorIsNil)
err = s.machine.SetContainerLinkLayerDevices(s.containerMachine)
- c.Assert(err.Error(), gc.Equals, `unable to find host bridge for spaces ["dmz"] for container "0/lxd/0"`)
+ c.Assert(err.Error(), gc.Equals, `unable to find host bridge for spaces "dmz" for container "0/lxd/0"`)
@perrito666

perrito666 Jan 18, 2017

Contributor

This error strikes me as odd. Is there a possibility that there is a list of spaces? Otherwise, perhaps the error should say "for space". Sorry for commenting on this here and not on the error itself.

@jameinel

jameinel Jan 19, 2017

Owner

It should be "for space(s)". There can be more than one.
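
As an illustration of the formatting discussed in this thread, here is a minimal sketch of rendering a list of space names as individually quoted, comma-separated values. `quoteJoin` is a hypothetical helper written for this example, not necessarily what the PR uses. For comparison, Go's `%q` verb applied directly to a `[]string` yields the bracketed form seen on the removed line of the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteJoin quotes each name and joins them with ", ".
// (Hypothetical helper for illustration.)
func quoteJoin(names []string) string {
	quoted := make([]string, len(names))
	for i, name := range names {
		quoted[i] = fmt.Sprintf("%q", name)
	}
	return strings.Join(quoted, ", ")
}

func main() {
	spaces := []string{"dmz"}

	// %q on the slice itself gives the bracketed form.
	fmt.Printf("spaces %q\n", spaces)

	// Quoting each element separately works for one or many names.
	fmt.Printf("space(s) %s\n", quoteJoin(spaces))
	fmt.Printf("space(s) %s\n", quoteJoin([]string{"dmz", "ceph"}))
}
```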

Owner

jameinel commented Jan 19, 2017

!!build!!

jameinel added some commits Jan 19, 2017

Use the right status values for SetInstanceStatus.
Machine.SetInstanceStatus wasn't checking what strings were being passed.
This pokes it to make it strict, and then updates the provisioner to
pass the right values (and look for the right values).
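
The kind of strictness described in this commit message could be sketched roughly as follows. The `Status` type, the set of allowed values, and `validateInstanceStatus` are all assumptions made for illustration; the allowed set here is not juju's actual list of instance statuses, and the real check lives in Machine.SetInstanceStatus.

```go
package main

import "fmt"

// Status is a hypothetical status type for this sketch.
type Status string

// Illustrative values only; "pending", "allocating" and "error" appear in
// this conversation, "running" is an assumption.
const (
	Pending    Status = "pending"
	Allocating Status = "allocating"
	Running    Status = "running"
	Error      Status = "error"
)

// validInstanceStatuses is the assumed whitelist of accepted values.
var validInstanceStatuses = map[Status]bool{
	Pending:    true,
	Allocating: true,
	Running:    true,
	Error:      true,
}

// validateInstanceStatus rejects any status that is not in the whitelist,
// rather than silently storing whatever string it is given.
func validateInstanceStatus(s Status) error {
	if !validInstanceStatuses[s] {
		return fmt.Errorf("cannot set invalid instance status %q", s)
	}
	return nil
}

func main() {
	for _, s := range []Status{Allocating, "started"} {
		if err := validateInstanceStatus(s); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", s)
	}
}
```

Validating at the setter means a caller passing an agent-style value fails loudly in tests instead of storing a status that nothing else looks for.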
Owner

jameinel commented Jan 19, 2017

$$merge$$

Contributor

jujubot commented Jan 19, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented Jan 19, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10068

Owner

jameinel commented Jan 19, 2017

!!build!!

Owner

jameinel commented Jan 19, 2017

$$merge$$

Contributor

jujubot commented Jan 19, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 03625c0 into juju:2.1-dynamic-bridges Jan 19, 2017

1 check failed

github-check-merge-juju Built PR, ran unit tests, and tested LXD deploy. Use !!.*!! to request another build. IE, !!build!!, !!retry!!

@jameinel jameinel deleted the jameinel:2.1-dynamic-bridges-surface-failure-1650252 branch Apr 22, 2017
