Fix bug lp#1694734 #7552

Merged
1 commit merged into juju:develop from the bug_1694734 branch on Jun 26, 2017

Conversation


@ExternalReality (Contributor) commented Jun 25, 2017

Description of change

Why is this change needed?

Fixes the referenced bug.

In a nutshell, if the uniter crashed while running an action, upon restart it would attempt to mark that action as failed, regardless of what the controller thought. This was bad because the controller only allows pending actions to be failed, so it would deny the uniter's failure request for any action it already considered finished. The uniter would then spin in a never-ending loop of failure-request retries. A juju run "reboot" was a good way to jam up the uniter in this fashion.
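To make the failure mode concrete, here is a minimal, self-contained Go sketch; the fakeController type and failAction method are hypothetical stand-ins for illustration only, not the real juju uniter/controller API:

```
package main

import (
    "errors"
    "fmt"
)

// fakeController is a hypothetical stand-in for the controller: it only
// allows failing actions that it still considers pending.
type fakeController struct {
    pending map[string]bool
}

func (c *fakeController) failAction(id string) error {
    if !c.pending[id] {
        return errors.New("cannot fail action " + id + ": not pending")
    }
    delete(c.pending, id)
    return nil
}

func main() {
    // The controller already considers "action-1" finished, so it is not pending.
    c := &fakeController{pending: map[string]bool{}}

    // What the buggy uniter effectively did after restarting mid-action:
    // keep asking the controller to fail the interrupted action, a request
    // the controller never accepts. (Bounded here; the real loop never ended.)
    for attempt := 1; attempt <= 3; attempt++ {
        if err := c.failAction("action-1"); err != nil {
            fmt.Printf("attempt %d: %v (retrying)\n", attempt, err)
        }
    }
}
```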

How do we verify that the change works?

It is difficult to verify this bug from the CLI, since invoking its cause does not always lead to the error; it is quite non-deterministic. Concretely, running juju run "reboot" may or may not trigger the issue. However, changing the code so that the uniter "crashes" at specific points in its operation is a reliable way to exercise the error and thus test the fix. Read on...

Add the following code immediately before the point where the uniter writes to its local state after executing an operation (action). At this PR's proposed commit, that line is https://github.com/juju/juju/blob/develop/worker/uniter/operation/executor.go#L103

```
// QA only: simulate a uniter crash after an action has run but before the
// uniter persists its local state.
if step.verb == "executing" && x.state.Kind == RunAction {
    panic("stopping uniter before writing")
}
```

This will crash the uniter while it is running an action. The uniter should restart, realize that it shut down while running an action, and move back into a normal state of operation, no longer attempting to run the action it was executing when it crashed.

You may also want to crash the uniter after it begins running an action but before it updates the controller's state. To do this, put a panic in the appropriate spot in the code, as sketched below. Upon restart, the uniter should recover from this too, moving into a normal waiting state without attempting to run the action in question again.
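For example, a guard along these lines; the placement is hypothetical (the field names mirror the snippet above, but the exact spot depends on the code at your commit), and it should sit just before the call that updates the controller's view of the action:

```
// QA only, hypothetical placement: crash once the uniter has started the
// action locally but before it has reported anything to the controller.
if x.state.Kind == RunAction {
    panic("stopping uniter before updating the controller")
}
```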

In summary, deploy a simple service:

```
juju deploy ubuntu
```

Make one of the code changes above (which help to reproduce the bug's error condition), then run an action (try the other code change afterwards):

```
juju run "ls -la" --unit=ubuntu/0
```

This simulates a hard stop of the uniter at a place that triggers the referenced bug's error, and shows that the uniter is now able to recover from such conditions.

Does it affect current user workflow? CLI? API?
No

Bug reference

lp#1694734 (https://bugs.launchpad.net/juju/+bug/1694734)

@ExternalReality
Contributor Author

!!build!!

@ExternalReality force-pushed the bug_1694734 branch 2 times, most recently from 436647b to 7c1294a, on June 26, 2017 at 02:33
```
@@ -42,26 +42,47 @@ func (r *actionsResolver) NextOp(
     opFactory operation.Factory,
 ) (operation.Operation, error) {
     nextAction, err := nextAction(remoteState.Actions, localState.CompletedActions)
-    if err != nil {
+    if err != nil && err != resolver.ErrNoOperation {
```
Contributor Author

If there are no operations left to run, then we cannot return the error signaling that here; we must first check whether an action is already running (one that was interrupted) before we declare that there is nothing to do.

Member

I think we need to add a code comment with something like the above text

@wallyworld (Member) left a comment

Just to check - the newly added tests fail without the code modifications being in place? And performing the QA steps produces the bad behaviour without the code modifications? And all other aspects of action execution work as expected with the code mods?

```
         return nil, err
     }
     switch localState.Kind {
     case operation.RunHook:
         // We can still run actions if the unit is in a hook error state.
-        if localState.Step == operation.Pending {
+        if localState.Step == operation.Pending && err != resolver.ErrNoOperation {
```
Member

If we are here, then either err == nil or err == ErrNoOperation
I think it would read much nicer to say
if localState.Step == operation.Pending && err == nil
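To make the intent of this hunk and the suggestion above concrete, here is a small self-contained sketch; the nextOp function, its parameters, and the errNoOperation sentinel are hypothetical stand-ins rather than the real resolver API. The point is that a "no new action" sentinel must not short-circuit NextOp, because an action interrupted mid-run and recorded in local state may still need handling:

```
package main

import (
    "errors"
    "fmt"
)

// errNoOperation plays the role of resolver.ErrNoOperation: a sentinel
// meaning "nothing new to start", not a genuine failure.
var errNoOperation = errors.New("no operation")

// nextOp mirrors the shape of the change under review.
func nextOp(newActionFound, interruptedAction bool) (string, error) {
    var err error
    if !newActionFound {
        err = errNoOperation
    }
    // Only genuine errors bail out early; the sentinel falls through so the
    // local state can still be inspected.
    if err != nil && err != errNoOperation {
        return "", err
    }
    if interruptedAction {
        return "resume interrupted action", nil
    }
    // Matches the reviewer's suggested condition: only start a new action
    // when one was actually found (err == nil).
    if err == nil {
        return "start new action", nil
    }
    return "", errNoOperation
}

func main() {
    fmt.Println(nextOp(false, true)) // an interrupted action is still handled
    fmt.Println(nextOp(true, false)) // a new action starts only when one was found
}
```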

@ExternalReality
Contributor Author

!!build!!

@ExternalReality
Contributor Author

$$MERGE$$

@jujubot (Collaborator) commented Jun 26, 2017

Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju

@jujubot jujubot merged commit dec2bde into juju:develop Jun 26, 2017
@ExternalReality ExternalReality deleted the bug_1694734 branch June 26, 2017 20:44
ExternalReality pushed a commit to ExternalReality/juju that referenced this pull request Jul 11, 2017
Fix bug lp#1694734
