Fix bug lp#1694734 #7552

Merged
merged 1 commit into juju:develop from ExternalReality:bug_1694734 on Jun 26, 2017

Conversation

Member

ExternalReality commented Jun 25, 2017

Description of change

Why is this change needed?

Fixes the referenced bug.

In a nutshell, if the uniter were to crash while running an action, then upon restart it would attempt to fail that action regardless of what the controller thought. This was bad, since the controller does not allow arbitrary actions to be failed (only pending ones), so it would deny the uniter's failure request for any action it considered finished. The uniter would then spin in a never-ending loop of failure-request retries. A `juju run "reboot"` was a good way to jam up the uniter in this fashion.
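The decision at the heart of the fix can be sketched in isolation. This is a toy model, not juju's actual API: the names `kind`, `nextStep`, and `pendingOnController` are invented for illustration.

```go
package main

import "fmt"

// kind is a stand-in for the uniter's local operation kind.
type kind int

const (
	kindContinue kind = iota
	kindRunAction
)

// nextStep decides what a restarted uniter should do about an action it was
// running when it crashed. Before the fix it always tried to fail the action;
// the fix first considers whether the controller still regards it as pending.
func nextStep(localKind kind, pendingOnController bool) string {
	if localKind == kindRunAction && !pendingOnController {
		// The controller already marked the action finished, so a "fail"
		// request would be rejected forever; just resume normal operation.
		return "resume"
	}
	if localKind == kindRunAction {
		return "fail-action"
	}
	return "idle"
}

func main() {
	fmt.Println(nextStep(kindRunAction, false)) // resume
	fmt.Println(nextStep(kindRunAction, true))  // fail-action
	fmt.Println(nextStep(kindContinue, true))   // idle
}
```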

How do we verify that the change works?

It is difficult to verify this bug from the CLI, since invoking its cause does not always lead to the error; it is quite non-deterministic. Concretely, running `juju run "reboot"` may or may not trigger the issue. However, changing the code so that the uniter "crashes" at specific points in its operation is a reliable way to exercise the error and thus test the fix. Read on...

Add the following code immediately before the uniter writes to its local state after executing an operation (action). At this PR's proposed commit point the line should be https://github.com/juju/juju/blob/develop/worker/uniter/operation/executor.go#L103

```
if step.verb == "executing" && x.state.Kind == RunAction {
    panic("stopping uniter before writing")
}
```

This will crash the uniter when running an action. The uniter should restart, realize that it shut down while running an action, and move back into a normal state of operation, no longer trying to run the action it was running when it crashed.

You may also want to crash the uniter after it begins running an action but before it updates the controller's state. To do this, put a panic in the appropriate spot in the code. Upon restart, the uniter should recover from this too, moving into a normal waiting state without attempting to run the action in question again.

In summary, deploy a simple service:

```
juju deploy ubuntu
```

Make one of the code changes above (which help reproduce the bug's error condition) and then run an action (try the other code change afterwards):

```
juju run "ls -la" --unit=ubuntu/0
```

This will simulate a hard stop of the uniter at a point that triggers the referenced bug's error, and it will show that the uniter is now able to recover from such conditions.
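The recovery behaviour being verified can be simulated end to end in miniature. This is purely illustrative; none of these names exist in juju.

```go
package main

import "fmt"

// localState is a toy model of the uniter's persisted local state.
type localState struct{ runningAction bool }

// runAction marks an action as in progress, then either completes it or
// "crashes" partway through, leaving the in-progress marker behind.
func runAction(st *localState, crash bool) error {
	st.runningAction = true
	if crash {
		return fmt.Errorf("uniter crashed mid-action")
	}
	st.runningAction = false
	return nil
}

// restart models the fixed recovery path: acknowledge the interrupted action
// and clear local state, rather than retrying a doomed fail request forever.
func restart(st *localState) string {
	if st.runningAction {
		st.runningAction = false
		return "recovered"
	}
	return "idle"
}

func main() {
	st := &localState{}
	_ = runAction(st, true) // simulate the injected panic
	fmt.Println(restart(st)) // first restart clears the interrupted action
	fmt.Println(restart(st)) // subsequent restarts find nothing to do
}
```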

Does it affect current user workflow? CLI? API?
No

Bug reference

[lp#1694734](https://bugs.launchpad.net/juju/+bug/1694734)

Member

ExternalReality commented Jun 25, 2017

!!build!!

worker/uniter/actions/resolver.go
@@ -42,26 +42,47 @@ func (r *actionsResolver) NextOp(
opFactory operation.Factory,
) (operation.Operation, error) {
nextAction, err := nextAction(remoteState.Actions, localState.CompletedActions)
- if err != nil {
+ if err != nil && err != resolver.ErrNoOperation {
@ExternalReality

ExternalReality Jun 26, 2017

Member

If there are no operations left to run, we cannot return the error signaling that here; we must first check whether an action is already running (one that was interrupted) before declaring that there is nothing to do.

@wallyworld

wallyworld Jun 26, 2017

Owner

I think we need to add a code comment with something like the above text

Just to check - the newly added tests fail without the code modifications being in place? And performing the QA steps produces the bad behaviour without the code modifications? And all other aspects of action execution work as expected with the code mods?


worker/uniter/actions/resolver.go
return nil, err
}
switch localState.Kind {
case operation.RunHook:
// We can still run actions if the unit is in a hook error state.
- if localState.Step == operation.Pending {
+ if localState.Step == operation.Pending && err != resolver.ErrNoOperation {
@wallyworld

wallyworld Jun 26, 2017

Owner

If we are here, then either `err == nil` or `err == ErrNoOperation`, so I think it would read much nicer to say
`if localState.Step == operation.Pending && err == nil`
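Given that invariant, the two spellings are indeed equivalent, which a tiny check makes obvious (again using a stand-in error, not the real resolver package):

```go
package main

import (
	"errors"
	"fmt"
)

// errNoOperation is a stand-in for resolver.ErrNoOperation.
var errNoOperation = errors.New("no operations")

func main() {
	// At that point in the code err is either nil or errNoOperation, so
	// `err != errNoOperation` and `err == nil` select the same cases; the
	// latter states the intent directly.
	for _, err := range []error{nil, errNoOperation} {
		fmt.Printf("err=%v  err!=errNoOperation:%t  err==nil:%t\n",
			err, err != errNoOperation, err == nil)
	}
}
```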

Member

ExternalReality commented Jun 26, 2017

!!build!!

Member

ExternalReality commented Jun 26, 2017

$$MERGE$$

Contributor

jujubot commented Jun 26, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit dec2bde into juju:develop Jun 26, 2017

1 check passed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details

@ExternalReality ExternalReality deleted the ExternalReality:bug_1694734 branch Jun 26, 2017

ExternalReality pushed a commit to ExternalReality/juju that referenced this pull request Jul 11, 2017

Merge pull request #7552 from ExternalReality/bug_1694734
Fix bug lp#1694734
