Skip to content


Subversion checkout URL

You can clone with
Download ZIP


Graceful shutdown on TERM signal #84

wants to merge 2 commits into from

4 participants


This PR adds a graceful shutdown to mcollectived.

When receiving the TERM signal the server stops processing new messages and waits a configurable amount of time for any running agent threads to finish processing.

My use case for a graceful shutdown for updating the mcollective installation from within an agent action.

@ripienaar The code can definitely be improved. I didn't really understood how you manage the log_code PLMC symbols. But what do you think of this feature in general? Any chance considering it?


yeah this is interesting we'd need something similar but we need to be careful with this kind of thing now due to Ruby 2, signal handlers cannot block in any way - so we cant log or a number of other things. I'll need to figure out exactly what the resitrctions are and see how we do this.

We need something like this for windows too so its something we'd do, see - I commented on that ticket but we'd need to do some work before we can consider merging this

The PL messages are managed via - still need to properly figure out the process for contributors its something we're working on


I didn't know about the ruby 2.0 changes regarding what you can do inside a trap context. I couldn't find a definite documentation for this, only this blog post and this bug report.

At least for handling the TERM signal on UNIX raising an Exception without any logging and handling the Exception in the loop of MCollective::Runner#run and logging a line there also works on Ruby 2.0.

Is this something worth pursuing or do you prefer rewriting the run method to not block until a message is received?
Form looking at the Stomp::Connection there is a poll method which could be used instead of receive in the loop.

Any plans to work on this issue in the near future or is this on the back burner for now?


Without large scale reworking I dont think we can make the main loop be anything but blocking so that'll be last resort.

It's on my horizon cos we need it for other things but right now we have a fair bit of higher priority work, so I wont have time to look at this PR and the related changes it brings in for a while - but I added a link to it on and will come back to this soonish

Sorry I don't have much better to offer, bit pressed for man power who can handle this kind of change


Waiting for CLA signature by @databus23

@databus23 - We require a Contributor License Agreement (CLA) for people who contribute to Puppet, but we have an easy click-through license with instructions, which is available at

Note: if your contribution is trivial and you think it may be exempt from the CLA, please post a short reply to this comment with details.


CLA signed by all contributors.


Sorry we haven't got back to you about this. I'm going to close this pr in the mean time but we are working on a solution internally.

@ploubser ploubser closed this

np. can you give a rough eta (for master) maybe?


Sadly no eta at the moment. :(

@ploubser ploubser reopened this

Reopening since the windows fix was trivial and I'd like to get this into master.


This has been resolved in MCO-221 and will ship with the next MCollective release.

@ploubser ploubser closed this

Very cool!. One question: I can't directly see why this shouldn't work on windows as well. Why was it made a unix only feature?


In the case of an agent that takes long to complete or timeout the service can go into a broken state on Windows during shut down. In the long term I'm not sure if the correct action is to allow it on Windows and let users deal with it going into a broken state, or to just disallow it on Windows. For now I'm going to be overly defensive and make it Unix only, but we can re-evaluate in the near future.


Ok, thats why I hat a timeout for the graceful shutdown to complete in this initial PR. I believe it is a good idea in general to have the shutdown complete in a timely fashion. Otherwise a hanging agent could block the shutdown on any platform.
Would you maybe considering this as an (optional) setting.
I would really like to have the graceful shutdown capability on windows available as well.


The hanging agent action should be killed by its timeout, but I hear what you're saying. I'm completely open to it being an optional config option. I've opened where we can discuss it further and track the work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on May 23, 2013
Commits on May 27, 2013
  1. add shutdown_timout config option

    Fabian Ruff authored
This page is out of date. Refresh to see the latest.
4 lib/mcollective/config.rb
@@ -15,6 +15,7 @@ class Config
attr_reader :main_collective, :ssl_cipher, :registration_collective
attr_reader :direct_addressing, :direct_addressing_threshold, :ttl, :helptemplatedir
attr_reader :queueprefix, :default_discovery_method, :default_discovery_options
+ attr_reader :shutdown_timeout
def initialize
@configured = false
@@ -117,6 +118,8 @@ def loadconfig(configfile)
@default_discovery_options << val
when "default_discovery_method"
@default_discovery_method = val
+ when "shutdown_timeout"
+ @shutdown_timeout = val.to_i
raise("Unknown config parameter #{key}")
@@ -191,6 +194,7 @@ def set_config_defaults(configfile)
@default_discovery_options = []
@ttl = 60
@mode = :client
+ @shutdown_timeout = 2
# look in the config dir for the template so users can provide their own and windows
# with odd paths will just work more often, but fall back to old behavior if it does
2  lib/mcollective/exception.rb
@@ -22,7 +22,7 @@ def log(level, log_backtrace=false)
Log.logexception(@code, level, self, log_backtrace)
+ class Shutdown<Exception;end
# Exceptions for the RPC system
class DDLValidationError<CodedError;end
class ValidatorError<RuntimeError; end
25 lib/mcollective/runner.rb
@@ -20,6 +20,7 @@ def initialize(configfile)
@agents =
+ @agent_threads =
Signal.trap("USR1") do
log_code(:PLMC2, "Reloading all agents after receiving USR1 signal", :info)
@@ -31,6 +32,10 @@ def initialize(configfile)
+ Signal.trap("TERM") do
+ log_code(:PLMC99, "Received TERM signal.", :info)
+ raise Shutdown
+ end
@@ -59,6 +64,19 @@ def run
log_code(:PLMC5, "Received a control message, possibly via 'mco controller' but this has been deprecated", :error)
+ purge_agent_threads
+ rescue Shutdown
+ purge_agent_threads
+"Shutting down gracefully. Waiting for #{@agent_threads.length} agent(s) to finish up.") unless @agent_threads.empty?
+ begin
+ Timeout.timeout(@config.shutdown_timeout) do
+ @agent_threads.each {|thread| thread.join}
+ end
+ rescue Timeout::Error
+ "Shutdown timeout (#{@config.shutdown_timeout}s) exceeded waiting for agent threads. Exiting."
+ end
+ @connection.disconnect
+ break
rescue SignalException => e
logexception(:PLMC7, "Exiting after signal: %{error}", :warn, e)
@@ -81,7 +99,7 @@ def run
def agentmsg(request)
log_code(:PLMC8, "Handling message for agent '%{agent}' on collective '%{collective} with requestid '%{requestid}'", :debug, :agent => request.agent, :collective => request.collective, :requestid => request.requestid)
- @agents.dispatch(request, @connection) do |reply_message|
+ @agent_threads << @agents.dispatch(request, @connection) do |reply_message|
reply(reply_message, request) if reply_message
@@ -132,6 +150,11 @@ def reply(msg, request)
+ #remove teminated threads from collection
+ def purge_agent_threads
+ @agent_threads.delete_if {|thread| !thread.alive? }
+ end
Something went wrong with that request. Please try again.