Graceful shutdown on TERM signal #84

Closed

Contributor

This PR adds a graceful shutdown to mcollectived.

When receiving the TERM signal the server stops processing new messages and waits a configurable amount of time for any running agent threads to finish processing.
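To illustrate the "wait a configurable amount of time" part, here is a minimal sketch of a bounded wait over a set of worker threads. The helper name `wait_for_threads` and the example threads are made up for illustration and are not taken from the patch; the idea is only that the grace period is shared across all threads rather than applied to each one.

```ruby
# Hypothetical helper, not the code from this PR: wait at most `timeout`
# seconds in total for the given threads, then report which are still alive.
def wait_for_threads(threads, timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  threads.each do |thread|
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0
    thread.join(remaining) # returns nil if the thread did not finish in time
  end

  threads.select(&:alive?)
end

# Example: one short-running and one long-running "agent" thread.
workers = [Thread.new { sleep 0.1 }, Thread.new { sleep 10 }]
stuck = wait_for_threads(workers, 0.5)
puts "#{stuck.size} thread(s) still running after the grace period"
stuck.each(&:kill)
```

Whether leftover threads should be killed or simply abandoned at exit is a separate design decision; the sketch kills them only to keep the example short.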

My use case for a graceful shutdown is updating the mcollective installation from within an agent action.

@ripienaar The code can definitely be improved. I didn't really understand how you manage the log_code PLMC symbols. But what do you think of this feature in general? Any chance of considering it?

Contributor

Yeah, this is interesting. We'd need something similar, but we need to be careful with this kind of thing now: as of Ruby 2, signal handlers cannot block in any way, so we can't log or do a number of other things. I'll need to figure out exactly what the restrictions are and see how we do this.
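One restriction that is easy to demonstrate: on Ruby 2.0 and later, taking a `Mutex` inside a signal handler raises `ThreadError`. The sketch below is illustrative only and is not a complete list of what is disallowed in a trap context. It uses USR1 and a throwaway lock so that it can be run safely in isolation; the variable names are assumptions.

```ruby
lock = Mutex.new
outcome = nil

Signal.trap("USR1") do
  begin
    # Anything that needs a lock (which typically includes logging)
    # runs into this restriction inside a handler.
    lock.synchronize { outcome = :locked }
  rescue ThreadError => e
    outcome = e # on Ruby 2.0+ a ThreadError is expected here
  end
end

Process.kill("USR1", Process.pid) # deliver the signal to ourselves

# Give the interpreter a moment to run the handler.
50.times do
  break if outcome
  sleep 0.01
end

puts outcome.inspect
```

This is why a handler is usually limited to recording that the signal arrived, with logging and cleanup done later from normal code.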

We need something like this for Windows too, so it's something we'd do; see http://projects.puppetlabs.com/issues/20467. I commented on that ticket, but we'd need to do some work before we can consider merging this.

The PL messages are managed via localeapp.com/projects/3197. We still need to properly figure out the process for contributors; it's something we're working on.

Contributor

I didn't know about the Ruby 2.0 changes regarding what you can do inside a trap context. I couldn't find any definitive documentation for this, only this blog post and this bug report.

At least for handling the TERM signal on UNIX, raising an exception from the handler without any logging, then rescuing it in the loop of MCollective::Runner#run and logging a line there, also works on Ruby 2.0.
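A rough sketch of that idea, for illustration only: the exception class `ShutdownRequested`, the `run` method, and the use of a `Queue` as a stand-in for the blocking receive are all assumptions, not the actual MCollective::Runner code.

```ruby
# Hypothetical exception used only to unwind out of the blocking call.
class ShutdownRequested < StandardError; end

# The handler does nothing except raise; no logging, no locking.
Signal.trap("TERM") { raise ShutdownRequested }

def run(queue)
  loop do
    message = queue.pop # blocks, standing in for a blocking receive
    puts "processing #{message}"
  end
rescue ShutdownRequested
  # Back in normal context, so logging is allowed again.
  puts "TERM received, stopping"
  :stopped
end

# Simulate an external `kill -TERM` shortly after the loop starts.
Thread.new do
  sleep 0.2
  Process.kill("TERM", Process.pid)
end

result = run(Queue.new)
```

One caveat with this approach is that the exception can arrive at any point in the loop body, including in the middle of handling a message, so the rescue has to assume work may have been interrupted.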

Is this something worth pursuing, or do you prefer rewriting the run method to not block until a message is received?
From looking at Stomp::Connection, there is a poll method which could be used instead of receive in the loop.
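A sketch of what a polling loop could look like. To keep it runnable without the stomp gem, it uses a hypothetical `FakeConnection` whose `poll` returns the next message or `nil` when nothing is waiting, which is how I understand `Stomp::Connection#poll` to behave. The `run` signature, the stop check, and the idle sleep are assumptions.

```ruby
# Stand-in for a connection with a non-blocking poll (hypothetical).
class FakeConnection
  def initialize(messages)
    @messages = messages
  end

  # Returns the next message, or nil when none is waiting.
  def poll
    @messages.shift
  end
end

# Loop until `stop_requested` returns true, checking it between polls.
def run(connection, stop_requested, idle_sleep = 0.05)
  processed = []
  until stop_requested.call
    message = connection.poll
    if message
      processed << message
    else
      sleep idle_sleep # avoid spinning while the queue is empty
    end
  end
  processed
end

# Example: stop after roughly 0.3 seconds.
deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + 0.3
stop = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline }
handled = run(FakeConnection.new(%w[a b]), stop)
puts "handled #{handled.size} message(s) before stopping"
```

The trade-off is added latency and wake-ups while idle, in exchange for a loop that can notice a shutdown request without being interrupted.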

Any plans to work on this issue in the near future or is this on the back burner for now?

Contributor

Without large-scale reworking I don't think we can make the main loop anything but blocking, so that'll be a last resort.

It's on my horizon because we need it for other things, but right now we have a fair bit of higher-priority work, so I won't have time to look at this PR and the related changes it brings in for a while. I added a link to it on http://projects.puppetlabs.com/issues/20467 and will come back to this soonish.

Sorry I don't have much better to offer; we're a bit pressed for manpower that can handle this kind of change.

Waiting for CLA signature by @databus23

@databus23 - We require a Contributor License Agreement (CLA) for people who contribute to Puppet, but we have an easy click-through license with instructions, which is available at https://cla.puppetlabs.com/

Note: if your contribution is trivial and you think it may be exempt from the CLA, please post a short reply to this comment with details. http://docs.puppetlabs.com/community/trivial_patch_exemption.html

CLA signed by all contributors.

Contributor

Sorry we haven't got back to you about this. I'm going to close this PR in the meantime, but we are working on a solution internally.

@ploubser closed this Sep 25, 2013

Contributor

np. can you give a rough eta (for master) maybe?

Contributor

Sadly no eta at the moment. :(

@ploubser reopened this Oct 29, 2013

Contributor

Reopening since the windows fix was trivial and I'd like to get this into master.

Contributor

This has been resolved in MCO-221 and will ship with the next MCollective release.

@ploubser closed this Apr 10, 2014

Contributor

Very cool! One question: I can't directly see why this shouldn't work on Windows as well. Why was it made a Unix-only feature?

Contributor

In the case of an agent that takes long to complete or time out, the service can go into a broken state on Windows during shutdown. In the long term I'm not sure if the correct action is to allow it on Windows and let users deal with it going into a broken state, or to just disallow it on Windows. For now I'm going to be overly defensive and make it Unix-only, but we can re-evaluate in the near future.

Contributor

OK, that's why I had a timeout for the graceful shutdown to complete in this initial PR. I believe it is a good idea in general to have the shutdown complete in a timely fashion. Otherwise a hanging agent could block the shutdown on any platform.
Would you maybe consider this as an (optional) setting?
I would really like to have the graceful shutdown capability available on Windows as well.

Contributor

The hanging agent action should be killed by its timeout, but I hear what you're saying. I'm completely open to it being an optional config option. I've opened https://tickets.puppetlabs.com/browse/MCO-243 where we can discuss it further and track the work.
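For readers unfamiliar with the per-action timeout mentioned here, a generic sketch using Ruby's standard `timeout` library follows. The wrapper name `run_action` and its `:timed_out` return value are hypothetical and are not meant to describe how MCollective implements this.

```ruby
require "timeout"

# Hypothetical wrapper: run the block, giving up after `limit` seconds.
def run_action(limit)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error
  :timed_out
end

puts run_action(1) { :done }.inspect      # finishes well within the limit
puts run_action(0.1) { sleep 2 }.inspect  # interrupted by the timeout
```

With a bound like this on every action, a graceful shutdown never needs to wait longer than the largest action timeout, which is the relationship between the two timeouts being discussed.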
