Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Openstack Provider: check for instance build to complete correctly, if #7300
Conversation
| @@ -853,31 +853,6 @@ func (s *localServerSuite) TestInstancesGatheringWithFloatingIP(c *gc.C) { | ||
| s.assertInstancesGathering(c, true) | ||
| } | ||
| -func (s *localServerSuite) TestInstancesBuildSpawning(c *gc.C) { |
| + client *nova.Client, | ||
| + id string, | ||
| + ) (server *nova.ServerDetail, err error) { | ||
| + logger.Debugf("waiting for instance %s to start...", id) |
axw
May 2, 2017
Member
StartInstanceParams has a StatusCallback function, which you can call to update the instance status while it's being provisioned. That would be preferable to logging, which the user can't see - can be helpful to show progress to the user.
| + // We don't want to flood the connection while polling the server waiting | ||
| + // for it to start. | ||
| + logger.Debugf("server has status %s, waiting 10 seconds before polling again...", server.Status) | ||
| + time.Sleep(10 * time.Second) |
axw
May 2, 2017
Member
How long does it usually take to go from build to non-build? I ask because StartInstance calls made by the provisioner are sequential: blocking for 10 seconds in here means no other instances get created for another 10+ seconds. Ideally we would queue them all up at the same time.
We might have look at overhauling the provisioner sooner rather than later if fixing this issue means we'll kill throughput.
hmlanigan
May 2, 2017
Member
It's been taking between 15 to 21 seconds based on my sampling with a local openstack. 19 to 36 on a remote one.
The oracle provider waits for the instance to complete building as well.
With more reading, I should add a timeout here, since an instance might get stuck in the build state.
| for a := attempts.Start(); a.Next(); { | ||
| server, err = client.RunServer(instanceOpts) | ||
| - if err == nil || gooseerrors.IsNotFound(err) == false { | ||
| + if err != nil { |
hmlanigan
May 2, 2017
Member
No, if the error IsNotFound we still break out of the loop, as we did previously. Now, any error from RunServer causes the provider to drop out of tryStartNovaInstance().
| + } else if serverDetail.Status == nova.StatusActive { | ||
| + break | ||
| + } else if serverDetail.Status == nova.StatusError { | ||
| + // Perhaps there is an error case where a retry in the same AZ |
axw
May 2, 2017
Member
can't the status also be StatusBuildSpawning? I think we should treat that the same as StatusActive?
hmlanigan
May 2, 2017
Member
BuildSpawning was an HP cloud thing as far as I can see, I was told HP cloud was no longer an
issue. Only BUILD is listed in the api, same with Rackspace.
https://developer.rackspace.com/docs/cloud-servers/v2/extensions/ext-extended-status/#server-statuses
https://developer.openstack.org/api-ref/compute/?expanded=list-servers-detail
axw
May 3, 2017
•
Member
OK. Looks like BUILD(spawning) was a weird way of having status=BUILD & OS-EXT-STS:vm_state=spawning. Perhaps before the extension came along.
And yeah, HP Cloud is no longer a thing.
| + if err = e.terminateInstances([]instance.Id{instance.Id(server.Id)}); err != nil { | ||
| + logger.Debugf("Failed to delete instance in ERROR state, %q", err) | ||
| + } | ||
| + server = nil | ||
| break |
axw
May 2, 2017
Member
why not just do err = errors.New(serverDetail.Fault.Message) here, rather than below with the condition?
hmlanigan
May 2, 2017
Member
where was a scope and variable shadowing issue farther up to find and resolve first. updated.
| @@ -1129,8 +1171,8 @@ func (e *Environ) StartInstance(args environs.StartInstanceParams) (*environs.St | ||
| } | ||
| func isNoValidHostsError(err error) bool { | ||
| - if gooseErr, ok := err.(gooseerrors.Error); ok { | ||
| - if cause := gooseErr.Cause(); cause != nil { | ||
| + if err != nil { |
| + return errors.New("server has status BUILD, waiting 10 seconds before polling again...") | ||
| + } | ||
| + } | ||
| + return nil |
axw
May 4, 2017
Member
We should be returning err here, otherwise we'll ignore any errors from GetServer. You should use the IsFatalError field of CallArgs to decide whether or not to retry. Something like:
var errStillBuilding = errors.New("server is still being built")
err = retry.Call(retry.CallArgs{
Clock: e.clock,
Delay: 10 * time.Second,
MaxDuration: timeout,
IsFatalError: func(err error) bool {
return err != errStillBuilding
},
NotifyFunc: func(lastError error, attempt int) {
args.StatusCallback(status.Provisioning, fmt.Sprintf("%q retry attempt %d", lastError, attempt), nil)
}
Func: func() error {
server, err := client.GetServer(id)
if err != nil {
return err
}
if server.Status == nova.StatusBuild {
return errStillBuilding
}
return nil
},
})
axw
May 4, 2017
Member
We should probably have tests for this. So that you don't have to wait for real time to pass, there's a SetClock method on the *openstack.Environ type: see TestStopInstanceSecurityGroupNotDeleted.
hmlanigan
May 4, 2017
Member
Updated and added 2 tests: TestStartInstanceWaitForActiveDetails() and TestStartInstanceGetServerFail(). A goose change was necessary to implement TestStartInstanceWaitForActiveDetails(), please see goose PR 48.
| + timeout time.Duration, | ||
| + ) (server *nova.ServerDetail, err error) { | ||
| + | ||
| + var errStillBuilding = errors.New(fmt.Sprintf("instance %q has status BUILD", id)) |
|
$$merge$$ |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
|
Build failed: Tests failed |
|
$$merge$$ |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
|
Build failed: Tests failed |
|
$$merge$$ |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
|
Build failed: Tests failed |
|
$$merge$$ |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
hmlanigan commentedMay 1, 2017
instance build fails on 'No valid host' fault, try another AZ
Please provide the following details to expedite Pull Request review:
Description of change
Check the status of the openstack instance after it's moved from the BUILD status. If it's active,
continue. If it's set to StatusError, check the fault message which is part of the ServerDetail. If
the fault message contains "No valid host", print a related error message and try another
availability zone if possible.
If the server is in a error state during tryStartNovaInstance() delete it because the instance details won't be available to for the caller to clean up.
QA steps
Run the unit tests.
Find an openstack where 1 AZ will return No valid host.
This was tested against serverstack before the resolution of bug 1680787 where neutron networks
with port_security_enabled false and a virt-type of lxd returned 'No valid host', for the failure. However only 1 AZ was in the config.
Documentation changes
N/A
Bug reference
This may help with the following bugs:
https://bugs.launchpad.net/juju/+bug/1425808
https://bugs.launchpad.net/juju/+bug/1649990