Fix 1604965: machine stays in pending state even though node has been marked failed deployment in MAAS #6321
Conversation
perrito666
added some commits
Sep 23, 2016
```diff
+		if args.StatusCallback == nil {
+			logger.Errorf("no status callback to report %s status: %v", inst.Id(), instStatus)
+			<-clock.WallClock.After(statusPollInterval)
+			continue
```
perrito666
Sep 26, 2016
Contributor
Well, I thought the log message would be of value for debugging purposes. I think I might make it Debug level instead.
```diff
+		if err != nil {
+			logger.Errorf("setting instance %s status: %v", inst.Id(), err)
+		}
+		<-clock.WallClock.After(statusPollInterval)
```
```diff
@@ -948,6 +951,33 @@ func (environ *maasEnviron) StartInstance(args environs.StartInstanceParams) (
 			return nil, errors.Annotatef(err, "cannot run instances")
 		}
 	}
+	done := make(chan struct{})
+	defer close(done)
+	go func(done chan struct{}) {
```
kat-co
Sep 28, 2016
Contributor
Please do a few things here:
- This method is huge. Please pull this goroutine out into its own function.
- This is a retryable operation; please use github.com/juju/retry.
- Consolidate the function's conditionals; as we discussed, `args.StatusCallback` doesn't change once passed in. Create a closure somewhere that is a no-op if this is nil. This allows you to collapse the conditional in your loop.
- Please document why you're doing this and specifically call out that it's OK if this goroutine is never run.
howbazaar
Sep 28, 2016
Owner
Some points here:
- I agree entirely that this method is too long and should be broken up. Please don't add big bunches of code to the middle.
- Don't even start the polling goroutine if there is no StatusCallback. Otherwise we are going to see repeated log entries for something.
- Don't hide time.After behind clock.WallClock. Either use the clock interface in the environ or don't, but don't pretend.
- Since this is a continuous poll rather than a retry operation, I'm happy for it to be in a for loop that is exited when done, but please define it in a function outside this one.
- Don't take the done channel as `chan struct{}` but the more restrictive `<-chan struct{}`. You are never sending to it.
```diff
@@ -64,7 +64,7 @@ func convertInstanceStatus(statusMsg, substatus string, id instance.Id) instance
 	if substatus != "" {
 		statusMsg = fmt.Sprintf("%s: %s", statusMsg, substatus)
 	}
-	case "Failed Deployment":
+	case "Failed Deployment", "Failed deployment":
```
kat-co
Sep 28, 2016
Contributor
Instead, please normalize the input:
- Trim
- Lowercase
Please do this in a function.
```diff
-	statusMsg := mi.machine.StatusMessage()
+	args := gomaasapi.MachinesArgs{SystemIDs: []string{mi.machine.SystemID()}}
+	machines, err := mi.maasController.Machines(args)
+	if err != nil {
```
kat-co
Sep 28, 2016
Contributor
We should at least log the error. Why aren't we passing in the error message into the status?
axw
Sep 28, 2016
•
Member
Why are we not just returning the error? It should be up to the instance poller to make a call about what it does with the error. This is not a MAAS-specific issue.
axw
Sep 29, 2016
Member
Are you suggesting we change the interface?
Ah, I forgot that the interface didn't return an error. The way all the existing code works is: to get fresh status, you get a new Instance.
So I think you should revert this function to how it was, drop the TODO, and update pollInstanceStatus to get a fresh Instance each time around the loop. Alternatively, don't call Status(), but call some internal function. Getting a fresh Instance is probably simplest.
```diff
+	if err != nil {
+		return convertInstanceStatus("", "", mi.Id())
+	}
+	if len(machines) != 1 {
```
```diff
@@ -859,6 +861,24 @@ func (*maasEnviron) MaintainInstance(args environs.StartInstanceParams) error {
 	return nil
 }
+type statusCallbackFunc func(settableStatus status.Status, info string, data map[string]interface{}) error
+
+func pollinstanceStatus(done <-chan struct{}, statusCallback statusCallbackFunc, inst maasInstance) {
```
```diff
+	select {
+	case <-done:
+		return
+	case <-time.After(statusPollInterval):
```
kat-co
Sep 28, 2016
Contributor
Maybe I don't understand @howbazaar 's comment, but we shouldn't use time.After directly. I think what he meant was to pass a clock into this type and then do clock.After.
perrito666
Sep 28, 2016
Contributor
Well, as I understand, @howbazaar states that I should either pass a clock all the way from whoever is instantiating environ or just use time, I believe that for a use of After, time is reasonable.
kat-co
Sep 28, 2016
Contributor
If that's the case, I disagree. It doesn't have to be all or nothing; we can request it from the caller, and for now -- since we don't have time -- just pass in WallClock. It edges us forward.
```diff
+	// consequence to this goroutine never running since it means the function
+	// ended fast and instancepoller is taking care now.
+	defer close(done)
+	go pollinstanceStatus(done, args.StatusCallback, inst)
```
```diff
@@ -948,6 +970,18 @@ func (environ *maasEnviron) StartInstance(args environs.StartInstanceParams) (
 			return nil, errors.Annotatef(err, "cannot run instances")
 		}
 	}
+	done := make(chan struct{})
+	if args.StatusCallback != nil {
+		// the purpose of this goroutine is to provide information while the instance
+		// is coming up, once StartInstance exits instancepoller will take care but
+		// long start times can lead to confusing status stalling, there is no
```
axw
Sep 28, 2016
Member
How slow is the code below this if block? Isn't starting a node mostly asynchronous? I'm curious to know how much of a delay we're saving here.
```diff
+		// long start times can lead to confusing status stalling, there is no
+		// consequence to this goroutine never running since it means the function
+		// ended fast and instancepoller is taking care now.
+		defer close(done)
```
axw
Sep 28, 2016
Member
Please also wait for the goroutine to stop before returning from StartInstance. We shouldn't assume that args.StatusCallback is valid to use once this method has returned.
Also: I'm not very happy about this magical instance-status influencing machine-status business. We need to come up with a better solution to composite status some time soon.

@axw I believe so too, perhaps a watcher on instances?
```diff
@@ -54,6 +54,8 @@ var shortAttempt = utils.AttemptStrategy{
 	Delay: 200 * time.Millisecond,
 }
+const statusPollInterval = 5 * time.Second
```
```diff
@@ -741,7 +743,7 @@ func (environ *maasEnviron) acquireNode2(
 	if err != nil {
 		return nil, errors.Trace(err)
 	}
-	return &maas2Instance{machine, constraintMatches}, nil
+	return &maas2Instance{machine, constraintMatches, environ.maasController}, nil
```
```diff
@@ -184,6 +184,14 @@ func (c *fakeController) ReleaseMachines(args gomaasapi.ReleaseMachinesArgs) err
 	return c.NextErr()
 }
+type fakeMachineOnlyController struct {
```
```diff
@@ -65,7 +65,10 @@ func (mi *maas2Instance) Addresses() ([]network.Address, error) {
 // Status returns a juju status based on the maas instance returned
 // status message.
 func (mi *maas2Instance) Status() instance.InstanceStatus {
-	// TODO (babbageclunk): this should rerequest to get live status.
+	// A fresh status is not obtained here because the interface it is intended
+	// to satisfy gets a new maas2Instance before each call, using a fresh status
+	// would cause us to mask errors since this interface does not contemplate
+	// returning them.
```
```diff
@@ -271,6 +271,7 @@ func pollInstanceInfo(context machineContext, m machine) (instInfo instanceInfo,
 		return instanceInfo{}, err
 	}
 }
+
```
$$merge$$

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju
perrito666 commented Sep 26, 2016
The MAAS provider was missing an error state reported by the MAAS controller; it is now handled.
Additionally, the juju machine status is updated upon failure.
QA Steps