This PR adds a graceful shutdown to mcollectived.
When receiving the TERM signal the server stops processing new messages and waits a configurable amount of time for any running agent threads to finish processing.
My use case for a graceful shutdown is updating the mcollective installation from within an agent action.
@ripienaar The code can definitely be improved. I didn't really understand how you manage the log_code PLMC symbols. But what do you think of this feature in general? Any chance of considering it?
Exit gracefully on SIGTERM. Wait for any active agent threads to finish.
add shutdown_timout config option
Yeah, this is interesting. We'd need something similar, but we need to be careful with this kind of thing now: with Ruby 2, signal handlers cannot block in any way, so we can't log or do a number of other things. I'll need to figure out exactly what the restrictions are and see how we do this.
We need something like this for Windows too, so it's something we'd do, see http://projects.puppetlabs.com/issues/20467 - I commented on that ticket, but we'd need to do some work before we can consider merging this.
The PL messages are managed via localeapp.com/projects/3197 - we still need to properly figure out the process for contributors; it's something we're working on.
I didn't know about the Ruby 2.0 changes regarding what you can do inside a trap context. I couldn't find any definitive documentation for this, only this blog post and this bug report.
At least for handling the TERM signal on UNIX, there is an approach that also works on Ruby 2.0: raise an Exception from the signal handler without any logging, then handle that Exception in the loop of MCollective::Runner#run and log a line there.
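To make that concrete, here is a minimal sketch of the raise-and-rescue approach. It is not the code from this PR: `ShutdownRequested` and `run_loop` are hypothetical names, and the `sleep` only stands in for the blocking receive call in the runner loop.

```ruby
# Hypothetical exception class, used only for this sketch.
class ShutdownRequested < StandardError; end

# Hypothetical stand-in for the loop in MCollective::Runner#run.
def run_loop(log)
  # Inside the trap context we do nothing but raise: no logging, no locks.
  Signal.trap("TERM") { raise ShutdownRequested }
  begin
    loop do
      sleep 0.1 # stands in for a blocking receive call
    end
  rescue ShutdownRequested
    # Back in normal (non-trap) context, so logging is safe here.
    log << "TERM received, shutting down"
  ensure
    # Restore the default handler once the loop is done.
    Signal.trap("TERM", "DEFAULT")
  end
  :stopped
end

log = []
# Send TERM to ourselves shortly after the loop has started.
Thread.new { sleep 0.3; Process.kill("TERM", Process.pid) }
result = run_loop(log)
```

The handler itself stays trivial, and everything that might block happens in the `rescue` branch after the exception has unwound out of the blocking call.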
Is this something worth pursuing or do you prefer rewriting the run method to not block until a message is received?
From looking at Stomp::Connection, there is a poll method which could be used instead of receive in the loop.
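A rough sketch of what a poll-based loop could look like, assuming poll returns a message when one is pending and nil otherwise. `FakeConnection` and `poll_loop` are hypothetical and only there to keep the example self-contained; no real broker or Stomp::Connection is involved.

```ruby
# Hypothetical stand-in for a connection with a non-blocking poll method
# that returns a pending message, or nil when nothing is waiting.
class FakeConnection
  def initialize(messages)
    @messages = messages.dup
  end

  def poll
    @messages.shift
  end
end

# Poll until should_stop (any callable) returns true.
# Sleeps briefly when idle so the loop does not spin.
def poll_loop(connection, should_stop, interval = 0.01)
  handled = []
  until should_stop.call
    message = connection.poll
    if message
      handled << message # stands in for dispatching to an agent
    else
      sleep interval
    end
  end
  handled
end

conn = FakeConnection.new(%w[a b])
deadline = Time.now + 0.2
handled = poll_loop(conn, -> { Time.now > deadline })
```

With this shape the loop checks a stop condition between polls, so a signal handler would only need to flip a flag rather than interrupt a blocking call.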
Any plans to work on this issue in the near future or is this on the back burner for now?
Without large-scale reworking I don't think we can make the main loop anything but blocking, so that'll be a last resort.
It's on my horizon because we need it for other things, but right now we have a fair bit of higher-priority work, so I won't have time to look at this PR and the related changes it brings in for a while - but I added a link to it on http://projects.puppetlabs.com/issues/20467 and will come back to this soonish.
Sorry I don't have anything better to offer; we're a bit pressed for manpower to handle this kind of change.
Waiting for CLA signature by @databus23
@databus23 - We require a Contributor License Agreement (CLA) for people who contribute to Puppet, but we have an easy click-through license with instructions, which is available at https://cla.puppetlabs.com/
Note: if your contribution is trivial and you think it may be exempt from the CLA, please post a short reply to this comment with details. http://docs.puppetlabs.com/community/trivial_patch_exemption.html
CLA signed by all contributors.
Sorry we haven't got back to you about this. I'm going to close this PR in the meantime, but we are working on a solution internally.
np. can you give a rough eta (for master) maybe?
Sadly no eta at the moment. :(
Reopening since the Windows fix was trivial and I'd like to get this into master.
This has been resolved in MCO-221 and will ship with the next MCollective release.
Very cool! One question: I can't directly see why this shouldn't work on Windows as well. Why was it made a Unix-only feature?
If an agent takes a long time to complete or time out, the service can go into a broken state on Windows during shutdown. In the long term I'm not sure if the correct action is to allow it on Windows and let users deal with it going into a broken state, or to just disallow it on Windows. For now I'm going to be overly defensive and make it Unix only, but we can re-evaluate in the near future.
OK, that's why I had a timeout for the graceful shutdown to complete in this initial PR. I believe it is a good idea in general to have the shutdown complete in a timely fashion. Otherwise a hanging agent could block the shutdown on any platform.
Would you maybe consider this as an (optional) setting?
I would really like to have the graceful shutdown capability available on Windows as well.
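To illustrate the bounded wait I mean, here is a minimal sketch, not the actual code from this PR. `wait_for_threads` is a hypothetical helper: it joins every agent thread against one shared deadline and returns whatever is still running once the timeout has passed.

```ruby
# Hypothetical helper: wait for threads to finish, but never longer than
# `timeout` seconds in total. Returns the threads that are still alive.
def wait_for_threads(threads, timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  threads.each do |thread|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0
    thread.join(remaining) # returns nil if the thread is still running
  end
  threads.select(&:alive?)
end

fast = Thread.new { sleep 0.05 } # finishes well within the timeout
slow = Thread.new { sleep 5 }    # simulates a hanging agent action
stragglers = wait_for_threads([fast, slow], 0.3)
```

Because the deadline is shared, a single hanging agent cannot stretch the shutdown beyond the configured timeout, no matter how many threads are active.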
The hanging agent action should be killed by its timeout, but I hear what you're saying. I'm completely open to it being an optional config option. I've opened https://tickets.puppetlabs.com/browse/MCO-243 where we can discuss it further and track the work.