Fix 1604965: machine stays in pending state even though node has been marked failed deployment in MAAS #6321

Merged
merged 5 commits into from Oct 4, 2016

Conversation

Projects
None yet
6 participants
Contributor

perrito666 commented Sep 26, 2016

Maas provider was missing an error state by the maas controller, it is now handled.
Additionally juju machine status is updated upon failure.

QA Steps

  • Deploy a maas controller
  • tamper with the machines power profiles
  • deploy a charm
  • within moments you will see a failed deployment and the machine will be down.

perrito666 added some commits Sep 23, 2016

Made maas2 status non cached.
Maas 2 instance status used to return a cached value
which was of little to no value when trying to find
out the deployment status.
Add proper maas2 error code for deployment.
Maas failed deployment is now properly picked and also when
detected an error on the instance (for any provider) the error
is also marked in the machine.
provider/maas/environ.go
+ if args.StatusCallback == nil {
+ logger.Errorf("no status callback to report %s status: %v", inst.Id(), instStatus)
+ <-clock.WallClock.After(statusPollInterval)
+ continue
@voidspace

voidspace Sep 26, 2016

If there's no StatusCallback is there any point in continuing here?

@perrito666

perrito666 Sep 26, 2016

Contributor

Well I thought that the log message would be of value for debugging purposes, I think I might make it a Debug level instead.

provider/maas/environ.go
+ if err != nil {
+ logger.Errorf("setting instance %s status: %v", inst.Id(), err)
+ }
+ <-clock.WallClock.After(statusPollInterval)
@voidspace

voidspace Sep 26, 2016

How does this goroutine terminate?

@perrito666

perrito666 Sep 26, 2016

Contributor

My fault, I am adding it now.

provider/maas/environ.go
@@ -948,6 +951,33 @@ func (environ *maasEnviron) StartInstance(args environs.StartInstanceParams) (
return nil, errors.Annotatef(err, "cannot run instances")
}
}
+ done := make(chan struct{})
+ defer close(done)
+ go func(done chan struct{}) {
@kat-co

kat-co Sep 28, 2016

Contributor

Please do a few things here:

  1. This method is huge. Please pull this goroutine out into its own function
  2. This is a retryable operation; please use github.com/juju/rety.
  3. Consolidate the function's conditionals; as we discussed, args.StatusCallback doesn't change once passed in. Create a closure somewhere that is a noop if this is nil. This allows you to collapse the conditional in your loop.
  4. Please document why you're doing this and specifically call out that it's OK if this goroutine is never run.
@howbazaar

howbazaar Sep 28, 2016

Owner

Some points here. I agree entirely that this method is too long and should be broken up. Please don't add big bunches of code to the middle.

Don't even start the polling goroutine if there is no StatusCallback. Otherwise we are going to see repeated log entries for something.

Don't hide time.After behind clock.WallClock. Either use the clock interface in the environ or don't, but don't pretend.

Since this is a continuous poll rather than a retry operation, I'm happy for it to be in a for loop that is exited when done, but please define it in a function outside this one.

Don't take the done channel as chan struct{} but the more restrictive <-chan struct{}. You are never sending to it.

provider/maas/instance.go
@@ -64,7 +64,7 @@ func convertInstanceStatus(statusMsg, substatus string, id instance.Id) instance
if substatus != "" {
statusMsg = fmt.Sprintf("%s: %s", statusMsg, substatus)
}
- case "Failed Deployment":
+ case "Failed Deployment", "Failed deployment":
@kat-co

kat-co Sep 28, 2016

Contributor

Instead, please normalize the input:

  • Trim
  • Lowercase

Please do this in a function.

provider/maas/maas2instance.go
- statusMsg := mi.machine.StatusMessage()
+ args := gomaasapi.MachinesArgs{SystemIDs: []string{mi.machine.SystemID()}}
+ machines, err := mi.maasController.Machines(args)
+ if err != nil {
@kat-co

kat-co Sep 28, 2016

Contributor

We should at least log the error. Why aren't we passing in the error message into the status?

@axw

axw Sep 28, 2016

Member

Why are we not just returning the error? It should be up to the instance poller to make a call about what it does with the error. This is not a MAAS-specific issue.

@perrito666

perrito666 Sep 29, 2016

Contributor

Are you sugesting we change the interface?

@axw

axw Sep 29, 2016

Member

Are you sugesting we change the interface?

Ah, I forgot that the interface didn't return an error. The way all the existing code works is: to get fresh status, you get a new Instance.

So I think you should revert this function to how it was, drop the TODO, and update pollInstanceStatus to get a fresh Instance each time around the loop. Alternatively, don't call Status(), but call some internal function. Getting a fresh Instance is probably simplest.

provider/maas/maas2instance.go
+ if err != nil {
+ return convertInstanceStatus("", "", mi.Id())
+ }
+ if len(machines) != 1 {
@kat-co

kat-co Sep 28, 2016

Contributor

Please log a warning here stating we expected one back and got n.

provider/maas/environ.go
@@ -859,6 +861,24 @@ func (*maasEnviron) MaintainInstance(args environs.StartInstanceParams) error {
return nil
}
+type statusCallbackFunc func(settableStatus status.Status, info string, data map[string]interface{}) error
+
+func pollinstanceStatus(done <-chan struct{}, statusCallback statusCallbackFunc, inst maasInstance) {
@kat-co

kat-co Sep 28, 2016

Contributor

s/pollinstanceStatus/pollInstanceStatus/g

provider/maas/environ.go
+ select {
+ case <-done:
+ return
+ case <-time.After(statusPollInterval):
@kat-co

kat-co Sep 28, 2016

Contributor

Maybe I don't understand @howbazaar 's comment, but we shouldn't use time.After directly. I think what he meant was to pass a clock into this type and then do clock.After.

@perrito666

perrito666 Sep 28, 2016

Contributor

Well, as I understand, @howbazaar states that I should either pass a clock all the way from whoever is instantiating environ or just use time, I believe that for a use of After, time is reasonable.

@kat-co

kat-co Sep 28, 2016

Contributor

If that's the case, I disagree. It doesn't have to be all or nothing; we can request it from the caller, and for now -- since we don't have time -- just pass in WallClock. It edges us forward.

@perrito666

perrito666 Sep 28, 2016

Contributor

That, I can agree with

provider/maas/environ.go
+ // consequence to this goroutine never running since it means the function
+ // ended fast and instancepoller is taking care now.
+ defer close(done)
+ go pollinstanceStatus(done, args.StatusCallback, inst)
@kat-co

kat-co Sep 28, 2016

Contributor

Beautiful code/comment, thank you!

provider/maas/environ.go
@@ -948,6 +970,18 @@ func (environ *maasEnviron) StartInstance(args environs.StartInstanceParams) (
return nil, errors.Annotatef(err, "cannot run instances")
}
}
+ done := make(chan struct{})
@axw

axw Sep 28, 2016

Member

move this inside the if block below

provider/maas/environ.go
+ if args.StatusCallback != nil {
+ // the purpose of this goroutine is to provide information while the instance
+ // is coming up, once StartInstance exits instancepoller will take care but
+ // long start times can lead to confusing status stalling, there is no
@axw

axw Sep 28, 2016

Member

how slow is the code below this if block? isn't starting a node mostly asynchronous? I'm curious to know how much of a delay we're saving here

provider/maas/environ.go
+ // long start times can lead to confusing status stalling, there is no
+ // consequence to this goroutine never running since it means the function
+ // ended fast and instancepoller is taking care now.
+ defer close(done)
@axw

axw Sep 28, 2016

Member

Please also wait for the goroutine to stop before returning from StartInstance. We shouldn't assume that args.StatusCallback is valid to use once this method has returned.

provider/maas/maas2instance.go
- statusMsg := mi.machine.StatusMessage()
+ args := gomaasapi.MachinesArgs{SystemIDs: []string{mi.machine.SystemID()}}
+ machines, err := mi.maasController.Machines(args)
+ if err != nil {
@kat-co

kat-co Sep 28, 2016

Contributor

We should at least log the error. Why aren't we passing in the error message into the status?

@axw

axw Sep 28, 2016

Member

Why are we not just returning the error? It should be up to the instance poller to make a call about what it does with the error. This is not a MAAS-specific issue.

@perrito666

perrito666 Sep 29, 2016

Contributor

Are you sugesting we change the interface?

@axw

axw Sep 29, 2016

Member

Are you sugesting we change the interface?

Ah, I forgot that the interface didn't return an error. The way all the existing code works is: to get fresh status, you get a new Instance.

So I think you should revert this function to how it was, drop the TODO, and update pollInstanceStatus to get a fresh Instance each time around the loop. Alternatively, don't call Status(), but call some internal function. Getting a fresh Instance is probably simplest.

Member

axw commented Sep 28, 2016

Also: I'm not very happy about this magical instance-status influencing machine-status business. We need to come up with a better solution to composite status some time soon.

Contributor

perrito666 commented Sep 29, 2016

@axw I believe so too, perhaps a watcher on instances?

axw approved these changes Oct 3, 2016

LGTM, thanks.

@@ -54,6 +54,8 @@ var shortAttempt = utils.AttemptStrategy{
Delay: 200 * time.Millisecond,
}
+const statusPollInterval = 5 * time.Second
@axw

axw Oct 3, 2016

Member

delete?

provider/maas/environ.go
@@ -741,7 +743,7 @@ func (environ *maasEnviron) acquireNode2(
if err != nil {
return nil, errors.Trace(err)
}
- return &maas2Instance{machine, constraintMatches}, nil
+ return &maas2Instance{machine, constraintMatches, environ.maasController}, nil
@axw

axw Oct 3, 2016

Member

delete? and below

@@ -184,6 +184,14 @@ func (c *fakeController) ReleaseMachines(args gomaasapi.ReleaseMachinesArgs) err
return c.NextErr()
}
+type fakeMachineOnlyController struct {
@axw

axw Oct 3, 2016

Member

delete? unused now

@@ -65,7 +65,10 @@ func (mi *maas2Instance) Addresses() ([]network.Address, error) {
// Status returns a juju status based on the maas instance returned
// status message.
func (mi *maas2Instance) Status() instance.InstanceStatus {
- // TODO (babbageclunk): this should rerequest to get live status.
+ // A fresh status is not obtained here because the interface it is intended
@axw

axw Oct 3, 2016

Member

Thank you.

+ // A fresh status is not obtained here because the interface it is intended
+ // to satisfy gets a new maas2Instance before each call, using a fresh status
+ // would cause us to mask errors since this interface does not contemplate
+ // returing them.
@axw

axw Oct 3, 2016

Member

returning

@@ -271,6 +271,7 @@ func pollInstanceInfo(context machineContext, m machine) (instInfo instanceInfo,
return instanceInfo{}, err
}
}
+
Contributor

perrito666 commented Oct 3, 2016

$$merge$$

Contributor

jujubot commented Oct 3, 2016

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit 1fd4c66 into juju:master Oct 4, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment