config: set recover_stopped to default to false #260
The use of recover_stopped may cause the Nomad agent to hang on startup, as the plugin tries to start an exited Podman task. Podman itself will hang forever in this state, and the HTTP client on the Nomad side is also unable to time out in this case. The result is a permanently hung Nomad agent, until someone force-kills either Nomad or Podman. Also emit a log warning that recover_stopped should not be used. We leave it in place for compatibility. Fixes #229
Spot check, 2 nodes
One simple redis job running
reboot the node with the redis alloc
the other node now has a new redis alloc
the rebooted node comes back up, the Nomad service starts successfully, and no lingering dead Podman container exists
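For anyone reproducing the spot check, a minimal sketch of the client plugin configuration under the new default follows. It assumes the plugin block is named `nomad-driver-podman`, as in the driver's README; setting the value explicitly is only for illustration, since omitting it should now behave the same way.

```hcl
# Nomad client configuration (sketch, not taken from this PR's diff).
plugin "nomad-driver-podman" {
  config {
    # New default per this PR. Setting it to true is discouraged because
    # the agent may hang on startup while trying to start an exited task.
    recover_stopped = false
  }
}
```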
Ah, nice find! Any clues why Podman would hang? And in your test, would recover_stopped cause the rebooted client to restart the task, leaving two running (one on each client)?
The intention behind this feature seems to be very close to that of max_client_disconnect, so we could log that as an alternative.
This was implemented very early in the project (d66b48c). @towe75 have you been using this configuration without issues?
My guess is something to do with the Go http client ignoring timeouts when talking to UDS, but I don't know.
No, the podman task leftover on the rebooted client would remain in the |
This PR just updates the README documentation for the recover_stopped setting changes in hashicorp#260
The use of recover_stopped may cause the Nomad agent to hang on startup after node reboot, as the plugin tries to start an exited Podman task. Podman itself will hang forever in this state, and the HTTP client on the Nomad side is also unable to time out in this case. The result is a permanently hung Nomad agent, until someone force-kills either Nomad or Podman.
Also emit a log warning that recover_stopped should not be used. We leave it in place for compatibility.
Fixes #229