Fix bug lp#1694734 #7552

Merged
merged 1 commit into juju:develop from ExternalReality:bug_1694734 on Jun 26, 2017

Conversation

Member

ExternalReality commented Jun 25, 2017

Description of change

Why is this change needed?

Fixes the referenced bug.

In a nutshell, if the uniter were to crash while running an action, then upon restart it would attempt to fail that action regardless of what the controller thought. This was bad, since the controller does not allow arbitrary actions to be failed (only pending ones), so it would deny the uniter's failure request for any action it considered finished. The uniter would then spin in a never-ending loop of failure-request retries. A `juju run "reboot"` was a good way to jam up the uniter in this fashion.
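The decision at the heart of the fix can be sketched in isolation. This is a toy model, not juju's actual API: the names `kind`, `nextStep`, and `pendingOnController` are invented for illustration.

```go
package main

import "fmt"

// kind is a stand-in for the uniter's local operation kind.
type kind int

const (
	kindContinue kind = iota
	kindRunAction
)

// nextStep decides what a restarted uniter should do about an action it was
// running when it crashed. Before the fix it always tried to fail the action;
// the fix first considers whether the controller still regards it as pending.
func nextStep(localKind kind, pendingOnController bool) string {
	if localKind == kindRunAction && !pendingOnController {
		// The controller already marked the action finished, so a "fail"
		// request would be rejected forever; just resume normal operation.
		return "resume"
	}
	if localKind == kindRunAction {
		return "fail-action"
	}
	return "idle"
}

func main() {
	fmt.Println(nextStep(kindRunAction, false)) // resume
	fmt.Println(nextStep(kindRunAction, true))  // fail-action
	fmt.Println(nextStep(kindContinue, true))   // idle
}
```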

How do we verify that the change works?

It is difficult to verify this bug from the CLI, since invoking its cause does not always lead to the error; it is quite non-deterministic. Concretely, running `juju run "reboot"` may or may not trigger the issue. However, changing the code so that the uniter "crashes" at specific points in its operation is a reliable way to exercise the error and thus test the fix. Read on...

Add the following code immediately before the uniter writes to its local state after executing an operation (action). At this PR's proposed commit point the line should be https://github.com/juju/juju/blob/develop/worker/uniter/operation/executor.go#L103

```
if step.verb == "executing" && x.state.Kind == RunAction {
    panic("stopping uniter before writing")
}
```

This will crash the uniter when running an action. The uniter should restart, realize that it shut down while running an action, and move back into a normal state of operation, no longer trying to run the action it was running when it crashed.

You may also want to crash the uniter after it begins running an action but before it updates the controller's state. To do this, put a panic in the appropriate spot in the code. Upon restart, the uniter should recover from this too, moving into a normal waiting state without attempting to run the action in question again.

In summary, deploy a simple service:

```
juju deploy ubuntu
```

Make one of the code changes above (which help reproduce the bug's error condition) and then run an action (try the other code change afterwards):

```
juju run "ls -la" --unit=ubuntu/0
```

This will simulate a hard stop of the uniter at a point that triggers the referenced bug's error, and it will show that the uniter is now able to recover from such conditions.
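The recovery behaviour being verified can be simulated end to end in miniature. This is purely illustrative; none of these names exist in juju.

```go
package main

import "fmt"

// localState is a toy model of the uniter's persisted local state.
type localState struct{ runningAction bool }

// runAction marks an action as in progress, then either completes it or
// "crashes" partway through, leaving the in-progress marker behind.
func runAction(st *localState, crash bool) error {
	st.runningAction = true
	if crash {
		return fmt.Errorf("uniter crashed mid-action")
	}
	st.runningAction = false
	return nil
}

// restart models the fixed recovery path: acknowledge the interrupted action
// and clear local state, rather than retrying a doomed fail request forever.
func restart(st *localState) string {
	if st.runningAction {
		st.runningAction = false
		return "recovered"
	}
	return "idle"
}

func main() {
	st := &localState{}
	_ = runAction(st, true) // simulate the injected panic
	fmt.Println(restart(st)) // first restart clears the interrupted action
	fmt.Println(restart(st)) // subsequent restarts find nothing to do
}
```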

Does it affect current user workflow? CLI? API?
No

Bug reference

[lp#1694734](https://bugs.launchpad.net/juju/+bug/1694734)

Member

ExternalReality commented Jun 25, 2017

!!build!!

worker/uniter/actions/resolver.go
@@ -42,26 +42,47 @@ func (r *actionsResolver) NextOp(
opFactory operation.Factory,
) (operation.Operation, error) {
nextAction, err := nextAction(remoteState.Actions, localState.CompletedActions)
- if err != nil {
+ if err != nil && err != resolver.ErrNoOperation {
@ExternalReality

ExternalReality Jun 26, 2017

Member

If there are no operations left to run, we cannot return the error signaling that here; we must first check whether an action is already running (one that was interrupted) before declaring that there is nothing to do.

@wallyworld

wallyworld Jun 26, 2017

Owner

I think we need to add a code comment with something like the above text

Just to check - the newly added tests fail without the code modifications being in place? And performing the QA steps produces the bad behaviour without the code modifications? And all other aspects of action execution work as expected with the code mods?


worker/uniter/actions/resolver.go
return nil, err
}
switch localState.Kind {
case operation.RunHook:
// We can still run actions if the unit is in a hook error state.
- if localState.Step == operation.Pending {
+ if localState.Step == operation.Pending && err != resolver.ErrNoOperation {
@wallyworld

wallyworld Jun 26, 2017

Owner

If we are here, then either `err == nil` or `err == ErrNoOperation`, so I think it would read much nicer to say
`if localState.Step == operation.Pending && err == nil`
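Given that invariant, the two spellings are indeed equivalent, which a tiny check makes obvious (again using a stand-in error, not the real resolver package):

```go
package main

import (
	"errors"
	"fmt"
)

// errNoOperation is a stand-in for resolver.ErrNoOperation.
var errNoOperation = errors.New("no operations")

func main() {
	// At that point in the code err is either nil or errNoOperation, so
	// `err != errNoOperation` and `err == nil` select the same cases; the
	// latter states the intent directly.
	for _, err := range []error{nil, errNoOperation} {
		fmt.Printf("err=%v  err!=errNoOperation:%t  err==nil:%t\n",
			err, err != errNoOperation, err == nil)
	}
}
```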

Member

ExternalReality commented Jun 26, 2017

!!build!!

Member

ExternalReality commented Jun 26, 2017

$$MERGE$$

Contributor

jujubot commented Jun 26, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit dec2bde into juju:develop Jun 26, 2017

1 check passed

github-check-merge-juju Ran tests against PR. Use !!.*!! to request another build. IE, !!build!!, !!retry!!
Details

@ExternalReality ExternalReality deleted the ExternalReality:bug_1694734 branch Jun 26, 2017

ExternalReality pushed a commit to ExternalReality/juju that referenced this pull request Jul 11, 2017

Merge pull request #7552 from ExternalReality/bug_1694734
Fix bug lp#1694734
