Fix bug lp#1694734 #7552
Conversation
Force-pushed 436647b to 7c1294a (Compare)
worker/uniter/actions/resolver.go (Outdated)

```
@@ -42,26 +42,47 @@ func (r *actionsResolver) NextOp(
 	opFactory operation.Factory,
 ) (operation.Operation, error) {
 	nextAction, err := nextAction(remoteState.Actions, localState.CompletedActions)
-	if err != nil {
+	if err != nil && err != resolver.ErrNoOperation {
```
If there are no operations left to run, then we cannot return the error signaling that here; we must first check whether an action is already running (one that has been interrupted) before we declare that there is nothing to do.
I think we need to add a code comment with something like the above text
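For illustration, one way such a comment could sit on the new guard (a sketch against the hunk above, not the committed code):

```go
// nextAction returns resolver.ErrNoOperation when nothing is queued.
// Don't surface that error yet: an action may already be running (and
// have been interrupted), so only bail out on real errors here and
// decide what ErrNoOperation means further down.
nextAction, err := nextAction(remoteState.Actions, localState.CompletedActions)
if err != nil && err != resolver.ErrNoOperation {
	return nil, err
}
```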
Force-pushed 7c1294a to b6da140 (Compare)
Just to check - the newly added tests fail without the code modifications being in place? And performing the QA steps produces the bad behaviour without the code modifications? And all other aspects of action execution work as expected with the code mods?
worker/uniter/actions/resolver.go (Outdated)

```
 		return nil, err
 	}
 	switch localState.Kind {
 	case operation.RunHook:
 		// We can still run actions if the unit is in a hook error state.
-		if localState.Step == operation.Pending {
+		if localState.Step == operation.Pending && err != resolver.ErrNoOperation {
```
If we are here, then either err == nil or err == ErrNoOperation. I think it would read much nicer to say:

```
if localState.Step == operation.Pending && err == nil
```
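A sketch of the case with that suggestion applied (context abbreviated; the body of the if is elided in the hunk above, so it is only hinted at here):

```go
switch localState.Kind {
case operation.RunHook:
	// We can still run actions if the unit is in a hook error state.
	// By this point err is either nil or resolver.ErrNoOperation, so
	// err == nil says the same thing more directly.
	if localState.Step == operation.Pending && err == nil {
		// ... proceed with the queued action (not shown in the hunk).
	}
}
```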
Force-pushed b6da140 to 3f7b14e (Compare)
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
## Description of change

**Why is this change needed?**

Fixes the referenced bug.
In a nutshell, if the `uniter` were to crash while running an action, upon restart it would attempt to fail that action, regardless of what the `controller` thought. This was bad since the `controller` does not allow the failure of arbitrary actions (only pending ones), and thus the `controller` would deny the `uniter`'s failure request for actions that it considered to be finished. This would cause the `uniter` to spin in a never-ending loop of failure-request retries. A `juju run "reboot"` was a good way to potentially jam up the `uniter` in this fashion.
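To make the failure mode concrete, here is a standalone toy model (not juju code; every type and name below is invented) of why an unconditional fail-and-retry loop can never terminate once the controller considers the action finished:

```go
package main

import (
	"errors"
	"fmt"
)

// actionStatus is an invented stand-in for the controller's view of an action.
type actionStatus string

const (
	pending   actionStatus = "pending"
	completed actionStatus = "completed"
	failed    actionStatus = "failed"
)

type controller struct{ status actionStatus }

// fail mimics the controller rule described above: only actions it
// still considers pending may be marked as failed.
func (c *controller) fail() error {
	if c.status != pending {
		return errors.New("cannot fail action: not pending")
	}
	c.status = failed
	return nil
}

func main() {
	// The uniter crashed mid-action; meanwhile the controller already
	// recorded the action as completed.
	c := &controller{status: completed}

	// Pre-fix uniter behaviour: request the failure and retry on denial,
	// forever. Bounded here so the example terminates.
	for attempt := 1; attempt <= 3; attempt++ {
		if err := c.fail(); err != nil {
			fmt.Printf("attempt %d: %v -- retrying\n", attempt, err)
			continue
		}
		return
	}
	fmt.Println("the real loop had no bound: the uniter spun here")
}
```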
**How do we verify that the change works?**

It is difficult to verify this bug from the CLI since invoking its cause does not always lead to the error; it is quite non-deterministic. Concretely, running `juju run "reboot"` may or may not cause the issue to arise. However, changing the code in such a way that the `uniter` "crashes" at specific points in its operation is a good way to flex the error and thus test the solution. Read on...

Add the following code immediately before the `uniter` writes to its local state after executing an operation (action). At this PR's proposed commit point the line should be https://github.com/juju/juju/blob/develop/worker/uniter/operation/executor.go#L103

```
if step.verb == "executing" && x.state.Kind == RunAction {
	panic("stopping uniter before writing")
}
```

This will crash the `uniter` when running an action. The `uniter` should restart, realize that it shut down while running an action, and move back into a normal state of operation, no longer wanting to run the action that it was running when it crashed.

You may want to crash the `uniter` after it begins running an action but before it updates the `controller`'s state. To do this you can put a panic in the appropriate spot in the code. Upon restart, the `uniter` should recover from this too, without attempting to run the action in question again, and move into a normal waiting state.

In summary, deploy a simple service:

```
juju deploy ubuntu
```
make one of the code changes above (which help to reproduce the bug's error condition) and then run an action (try the other code change afterwards):

```
juju run "ls -la" --unit=ubuntu/0
```
This will simulate a hard stop of the `uniter` in places that will cause the referenced bug's error. It will show that the `uniter` is now able to recover from such conditions.

**Does it affect current user workflow? CLI? API?**
No
## Bug reference

[lp#1694734](https://bugs.launchpad.net/juju/+bug/1694734)