Graceful shutdown on TERM signal #84

Closed
wants to merge 2 commits

4 participants

@databus23

This PR adds a graceful shutdown to mcollectived.

When receiving the TERM signal the server stops processing new messages and waits a configurable amount of time for any running agent threads to finish processing.
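The wait time is controlled by a new shutdown_timeout setting in the server config file (it defaults to 2 seconds), e.g.:

    # server config (e.g. server.cfg)
    shutdown_timeout = 10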

My use case for a graceful shutdown is updating the mcollective installation from within an agent action.

@ripienaar The code can definitely be improved. I didn't really understand how you manage the log_code PLMC symbols. But what do you think of this feature in general? Any chance of it being considered?

@ripienaar

Yeah, this is interesting and we'd need something similar, but we need to be careful with this kind of thing now due to Ruby 2: signal handlers cannot block in any way, so we can't log or do a number of other things. I'll need to figure out exactly what the restrictions are and see how we do this.
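For example (rough, untested sketch), even a plain mutex acquisition from the handler blows up on 2.0 - and the stdlib Logger takes a mutex internally, which is why logging is out:

    m = Mutex.new

    Signal.trap("TERM") do
      # ThreadError: can't be called from trap context (Ruby 2.0)
      m.synchronize { puts "shutting down" }
    end

    Process.kill("TERM", Process.pid)
    sleep 1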

We need something like this for Windows too, so it's something we'd do; see http://projects.puppetlabs.com/issues/20467 - I commented on that ticket, but we'd need to do some work before we can consider merging this.

The PL messages are managed via localeapp.com/projects/3197 - we still need to properly figure out the process for contributors; it's something we're working on.

@databus23

I didn't know about the Ruby 2.0 changes regarding what you can do inside a trap context. I couldn't find definitive documentation for this, only this blog post and this bug report.

At least for handling the TERM signal on UNIX, raising an exception from the trap handler without any logging, rescuing it in the loop of MCollective::Runner#run, and logging a line there also works on Ruby 2.0.
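A minimal standalone sketch of that pattern (not the MCollective code itself): the TERM handler only raises, and all logging and cleanup happen in the main loop, where blocking calls are allowed.

    class Shutdown < Exception; end

    Signal.trap("TERM") { raise Shutdown }

    begin
      loop do
        # normally: block here receiving and dispatching messages
        sleep 1
      end
    rescue Shutdown
      puts "Received TERM, shutting down gracefully"
      # join running agent threads, disconnect, exit
    end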

Is this something worth pursuing, or would you prefer rewriting the run method so it doesn't block until a message is received?
From looking at Stomp::Connection, there is a poll method which could be used instead of receive in the loop.
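A rough sketch of what that loop could look like (dispatch and @shutdown_requested are made-up names; only Stomp::Connection#poll is from the gem):

    loop do
      break if @shutdown_requested
      if frame = @connection.poll
        dispatch(frame)   # hand the message off, as receive does today
      else
        sleep 0.1         # nothing pending; back off briefly
      end
    end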

Any plans to work on this issue in the near future or is this on the back burner for now?

@ripienaar

Without large-scale reworking I don't think we can make the main loop anything but blocking, so that'll be a last resort.

It's on my horizon because we need it for other things, but right now we have a fair bit of higher-priority work, so I won't have time to look at this PR and the related changes it brings in for a while - but I added a link to it on http://projects.puppetlabs.com/issues/20467 and will come back to this soonish.

Sorry I don't have much better to offer; we're a bit pressed for manpower that can handle this kind of change.

@puppetcla

Waiting for CLA signature by @databus23

@databus23 - We require a Contributor License Agreement (CLA) for people who contribute to Puppet, but we have an easy click-through license with instructions, which is available at https://cla.puppetlabs.com/

Note: if your contribution is trivial and you think it may be exempt from the CLA, please post a short reply to this comment with details. http://docs.puppetlabs.com/community/trivial_patch_exemption.html

@puppetcla

CLA signed by all contributors.

@ploubser
Owner

Sorry we haven't got back to you about this. I'm going to close this PR in the meantime, but we are working on a solution internally.

@ploubser ploubser closed this
@databus23

np. Can you give a rough ETA (for master) maybe?

@ploubser
Owner

Sadly no ETA at the moment. :(

@ploubser ploubser reopened this
@ploubser
Owner

Reopening since the Windows fix was trivial and I'd like to get this into master.

@ploubser
Owner

This has been resolved in MCO-221 and will ship with the next MCollective release.

@ploubser ploubser closed this
@databus23

Very cool! One question: I can't directly see why this shouldn't work on Windows as well. Why was it made a Unix-only feature?

@ploubser
Owner

In the case of an agent that takes a long time to complete or time out, the service can go into a broken state on Windows during shutdown. In the long term I'm not sure whether the correct action is to allow it on Windows and let users deal with it going into a broken state, or to just disallow it on Windows. For now I'm going to be overly defensive and make it Unix-only, but we can re-evaluate in the near future.

@databus23

Ok, that's why I had a timeout for the graceful shutdown to complete in this initial PR. I believe it is a good idea in general to have the shutdown complete in a timely fashion; otherwise a hanging agent could block the shutdown on any platform.
Would you maybe consider making this an (optional) setting?
I would really like to have the graceful shutdown capability available on Windows as well.

@ploubser
Owner

The hanging agent action should be killed by its timeout, but I hear what you're saying. I'm completely open to it being an optional config option. I've opened https://tickets.puppetlabs.com/browse/MCO-243 where we can discuss it further and track the work.

Commits on May 23, 2013
Commits on May 27, 2013
  1. add shutdown_timout config option

    Fabian Ruff authored
4 lib/mcollective/config.rb
@@ -15,6 +15,7 @@ class Config
attr_reader :main_collective, :ssl_cipher, :registration_collective
attr_reader :direct_addressing, :direct_addressing_threshold, :ttl, :helptemplatedir
attr_reader :queueprefix, :default_discovery_method, :default_discovery_options
+ attr_reader :shutdown_timeout
def initialize
@configured = false
@@ -117,6 +118,8 @@ def loadconfig(configfile)
@default_discovery_options << val
when "default_discovery_method"
@default_discovery_method = val
+ when "shutdown_timeout"
+ @shutdown_timeout = val.to_i
else
raise("Unknown config parameter #{key}")
end
@@ -191,6 +194,7 @@ def set_config_defaults(configfile)
@default_discovery_options = []
@ttl = 60
@mode = :client
+ @shutdown_timeout = 2
# look in the config dir for the template so users can provide their own and windows
# with odd paths will just work more often, but fall back to old behavior if it does
2  lib/mcollective/exception.rb
@@ -22,7 +22,7 @@ def log(level, log_backtrace=false)
Log.logexception(@code, level, self, log_backtrace)
end
end
-
+ class Shutdown<Exception;end
# Exceptions for the RPC system
class DDLValidationError<CodedError;end
class ValidatorError<RuntimeError; end
25 lib/mcollective/runner.rb
@@ -20,6 +20,7 @@ def initialize(configfile)
@agents = Agents.new
+ @agent_threads = Array.new
unless Util.windows?
Signal.trap("USR1") do
log_code(:PLMC2, "Reloading all agents after receiving USR1 signal", :info)
@@ -31,6 +32,10 @@ def initialize(configfile)
Log.cycle_level
end
+ Signal.trap("TERM") do
+ log_code(:PLMC99, "Received TERM signal.", :info)
+ raise Shutdown
+ end
else
Util.setup_windows_sleeper
end
@@ -59,6 +64,19 @@ def run
else
log_code(:PLMC5, "Received a control message, possibly via 'mco controller' but this has been deprecated", :error)
end
+ purge_agent_threads
+ rescue Shutdown
+ purge_agent_threads
+ Log.info("Shutting down gracefully. Waiting for #{@agent_threads.length} agent(s) to finish up.") unless @agent_threads.empty?
+ begin
+ Timeout.timeout(@config.shutdown_timeout) do
+ @agent_threads.each {|thread| thread.join}
+ end
+ rescue Timeout::Error
+ Log.info "Shutdown timeout (#{@config.shutdown_timeout}s) exceeded waiting for agent threads. Exiting."
+ end
+ @connection.disconnect
+ break
rescue SignalException => e
logexception(:PLMC7, "Exiting after signal: %{error}", :warn, e)
@connection.disconnect
@@ -81,7 +99,7 @@ def run
def agentmsg(request)
log_code(:PLMC8, "Handling message for agent '%{agent}' on collective '%{collective} with requestid '%{requestid}'", :debug, :agent => request.agent, :collective => request.collective, :requestid => request.requestid)
- @agents.dispatch(request, @connection) do |reply_message|
+ @agent_threads << @agents.dispatch(request, @connection) do |reply_message|
reply(reply_message, request) if reply_message
end
end
@@ -132,6 +150,11 @@ def reply(msg, request)
@stats.sent
end
+
+ # remove terminated threads from collection
+ def purge_agent_threads
+ @agent_threads.delete_if {|thread| !thread.alive? }
+ end
end
end