Openstack Provider: check for instance build to complete correctly, if #7300

Merged
merged 1 commit into from May 9, 2017

Conversation

Projects
None yet
3 participants
Member

hmlanigan commented May 1, 2017

instance build fails on 'No valid host' fault, try another AZ

Please provide the following details to expedite Pull Request review:


Description of change

Check the status of the openstack instance after it's moved from the BUILD status. If it's active,
continue. If it's set to StatusError, check the fault message which is part of the ServerDetail. If
the fault message contains "No valid host", print a related error message and try another
availability zone if possible.

If the server is in a error state during tryStartNovaInstance() delete it because the instance details won't be available to for the caller to clean up.

QA steps

Run the unit tests.

Find an openstack where 1 AZ will return No valid host.

This was tested against serverstack before the resolution of bug 1680787 where neutron networks
with port_security_enabled false and a virt-type of lxd returned 'No valid host', for the failure. However only 1 AZ was in the config.

Documentation changes

N/A

Bug reference

This may help with the following bugs:
https://bugs.launchpad.net/juju/+bug/1425808
https://bugs.launchpad.net/juju/+bug/1649990

@@ -853,31 +853,6 @@ func (s *localServerSuite) TestInstancesGatheringWithFloatingIP(c *gc.C) {
s.assertInstancesGathering(c, true)
}
-func (s *localServerSuite) TestInstancesBuildSpawning(c *gc.C) {
@axw

axw May 2, 2017

Member

I'm not sure this test should be dropped, see my comment in StartInstance

provider/openstack/provider.go
+ client *nova.Client,
+ id string,
+ ) (server *nova.ServerDetail, err error) {
+ logger.Debugf("waiting for instance %s to start...", id)
@axw

axw May 2, 2017

Member

StartInstanceParams has a StatusCallback function, which you can call to update the instance status while it's being provisioned. That would be preferable to logging, which the user can't see - can be helpful to show progress to the user.

@hmlanigan

hmlanigan May 2, 2017

Member

cool, learned something new. updated.

provider/openstack/provider.go
+ // We don't want to flood the connection while polling the server waiting
+ // for it to start.
+ logger.Debugf("server has status %s, waiting 10 seconds before polling again...", server.Status)
+ time.Sleep(10 * time.Second)
@axw

axw May 2, 2017

Member

How long does it usually take to go from build to non-build? I ask because StartInstance calls made by the provisioner are sequential: blocking for 10 seconds in here means no other instances get created for another 10+ seconds. Ideally we would queue them all up at the same time.

We might have look at overhauling the provisioner sooner rather than later if fixing this issue means we'll kill throughput.

@hmlanigan

hmlanigan May 2, 2017

Member

It's been taking between 15 to 21 seconds based on my sampling with a local openstack. 19 to 36 on a remote one.

The oracle provider waits for the instance to complete building as well.

With more reading, I should add a timeout here, since an instance might get stuck in the build state.

provider/openstack/provider.go
for a := attempts.Start(); a.Next(); {
server, err = client.RunServer(instanceOpts)
- if err == nil || gooseerrors.IsNotFound(err) == false {
+ if err != nil {
@axw

axw May 2, 2017

Member

should we not be checking for IsNotFound still?

@hmlanigan

hmlanigan May 2, 2017

Member

No, if the error IsNotFound we still break out of the loop, as we did previously. Now, any error from RunServer causes the provider to drop out of tryStartNovaInstance().

@axw

axw May 3, 2017

Member

Yep, my eyes glanced over the break and filled in with "return err".

+ } else if serverDetail.Status == nova.StatusActive {
+ break
+ } else if serverDetail.Status == nova.StatusError {
+ // Perhaps there is an error case where a retry in the same AZ
@axw

axw May 2, 2017

Member

can't the status also be StatusBuildSpawning? I think we should treat that the same as StatusActive?

@hmlanigan

hmlanigan May 2, 2017

Member

BuildSpawning was an HP cloud thing as far as I can see, I was told HP cloud was no longer an
issue. Only BUILD is listed in the api, same with Rackspace.

https://developer.rackspace.com/docs/cloud-servers/v2/extensions/ext-extended-status/#server-statuses
https://developer.openstack.org/api-ref/compute/?expanded=list-servers-detail

@axw

axw May 3, 2017

Member

OK. Looks like BUILD(spawning) was a weird way of having status=BUILD & OS-EXT-STS:vm_state=spawning. Perhaps before the extension came along.

And yeah, HP Cloud is no longer a thing.

provider/openstack/provider.go
+ if err = e.terminateInstances([]instance.Id{instance.Id(server.Id)}); err != nil {
+ logger.Debugf("Failed to delete instance in ERROR state, %q", err)
+ }
+ server = nil
break
@axw

axw May 2, 2017

Member

why not just do err = errors.New(serverDetail.Fault.Message) here, rather than below with the condition?

@hmlanigan

hmlanigan May 2, 2017

Member

where was a scope and variable shadowing issue farther up to find and resolve first. updated.

provider/openstack/provider.go
@@ -1129,8 +1171,8 @@ func (e *Environ) StartInstance(args environs.StartInstanceParams) (*environs.St
}
func isNoValidHostsError(err error) bool {
- if gooseErr, ok := err.(gooseerrors.Error); ok {
- if cause := gooseErr.Cause(); cause != nil {
+ if err != nil {
@axw

axw May 2, 2017

Member

errors.Cause(nil) == nil, so you can drop this outer if statement

provider/openstack/provider.go
+ return errors.New("server has status BUILD, waiting 10 seconds before polling again...")
+ }
+ }
+ return nil
@axw

axw May 4, 2017

Member

We should be returning err here, otherwise we'll ignore any errors from GetServer. You should use the IsFatalError field of CallArgs to decide whether or not to retry. Something like:

var errStillBuilding = errors.New("server is still being built")
err = retry.Call(retry.CallArgs{
    Clock: e.clock,
    Delay: 10 * time.Second,
    MaxDuration: timeout,
    IsFatalError: func(err error) bool {
        return err != errStillBuilding
    },
    NotifyFunc: func(lastError error, attempt int) {
        args.StatusCallback(status.Provisioning, fmt.Sprintf("%q retry attempt %d", lastError, attempt), nil)
    }
    Func: func() error {
        server, err := client.GetServer(id)
        if err != nil {
            return err
        }
        if server.Status == nova.StatusBuild {
            return errStillBuilding
        }
        return nil
    },
})
@axw

axw May 4, 2017

Member

We should probably have tests for this. So that you don't have to wait for real time to pass, there's a SetClock method on the *openstack.Environ type: see TestStopInstanceSecurityGroupNotDeleted.

@hmlanigan

hmlanigan May 4, 2017

Member

Updated and added 2 tests: TestStartInstanceWaitForActiveDetails() and TestStartInstanceGetServerFail(). A goose change was necessary to implement TestStartInstanceWaitForActiveDetails(), please see goose PR 48.

axw approved these changes May 5, 2017

provider/openstack/provider.go
+ timeout time.Duration,
+ ) (server *nova.ServerDetail, err error) {
+
+ var errStillBuilding = errors.New(fmt.Sprintf("instance %q has status BUILD", id))
@axw

axw May 5, 2017

Member

errors.Errorf("instance ...", id)

Openstack Provider: check for instance build to complete and the inst…
…ance is

in an ACTIVE state.  Handle errors seen in nova.ServerDetail.Fault if possible.
Member

hmlanigan commented May 8, 2017

$$merge$$

Contributor

jujubot commented May 8, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented May 8, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10837

Member

hmlanigan commented May 8, 2017

$$merge$$

Contributor

jujubot commented May 8, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented May 8, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10838

Member

hmlanigan commented May 9, 2017

$$merge$$

Contributor

jujubot commented May 9, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

Contributor

jujubot commented May 9, 2017

Build failed: Tests failed
build url: http://juju-ci.vapour.ws:8080/job/github-merge-juju/10839

Member

hmlanigan commented May 9, 2017

$$merge$$

Contributor

jujubot commented May 9, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 856445c into juju:develop May 9, 2017

1 check failed

github-check-merge-juju Built PR, ran unit tests, and tested LXD deploy. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details

@hmlanigan hmlanigan deleted the hmlanigan:no-valid-host branch May 9, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment