Add task jobs:daemon:start, which spawns a simple forking daemon that actually works

The command

        $ WORKERS=n RAILS_ENV=production rake jobs:daemon:start

spawns a simple forking daemon, which spawns and restarts `n' instances of
Delayed::Worker. Worker processes are revived by the master process on receipt
of SIGCLD.

We can restart worker instances by sending SIGHUP to the master process or by
killing them directly. Sending SIGTERM, SIGINT, or SIGQUIT to the master
instructs it to kill its children and terminate.

Alternatively, there are the tasks `jobs:daemon:restart' and `jobs:daemon:stop'.

Two extra features:

* To avoid CPU thrashing, if a child worker dies 4 times in 60 seconds, a
  warning message is logged and the child sleeps for 300 seconds before
  booting up

* The master polls tmp/restart.txt and restarts children on timestamp update
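
The thrash guard in the first bullet amounts to a sliding-window counter per worker. The following is only a minimal sketch of that idea, not the code from this commit: the constant names and the `restart_delay` helper are made up for illustration. Note that the diff itself compares with `> 4`, so as committed the delay applies from the fifth death inside the window; the sketch uses `>=` to match the wording above.

```ruby
# Hypothetical names; the commit keeps this logic inline in the SIGCLD handler.
WINDOW     = 60   # seconds to look back
MAX_DEATHS = 4    # deaths tolerated inside the window (per the commit message)
BACKOFF    = 300  # seconds a repeatedly dying worker sleeps before booting

# Records a death at `now` and returns the delay for the replacement worker,
# or nil when the worker should be restarted immediately.
def restart_delay(deaths, now = Time.now)
  deaths << now
  deaths.reject! { |t| now - t > WINDOW }  # forget deaths outside the window
  deaths.size >= MAX_DEATHS ? BACKOFF : nil
end

deaths = []
start  = Time.at(0)
# Four deaths, ten seconds apart: only the fourth triggers the backoff.
delays = (0...4).map { |i| restart_delay(deaths, start + i * 10) }
p delays
```

Passing the clock in as an argument keeps the helper easy to exercise without sleeping.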
1 parent b983b7e commit 031efe6b8783563958f70409018c00ca052fc7f4 guns committed Aug 17, 2010
Showing with 140 additions and 0 deletions.
  1. +138 −0 lib/delayed/daemon_tasks.rb
  2. +2 −0 lib/delayed/tasks.rb
138 lib/delayed/daemon_tasks.rb
@@ -0,0 +1,138 @@
+### Helpers
+
+def kill_master(signal)
+ pid_file = "#{Dir.pwd}/tmp/pids/delayed_worker.master.pid"
+ abort 'No pid file found!' unless File.exists? pid_file
+ pid = File.read(pid_file).to_i
+ puts "Sending #{signal} to #{pid}"
+ Process.kill signal, pid
+rescue Errno::ESRCH => e
+ abort e.to_s
+end
+
+### Tasks
+
+namespace :jobs do
+ namespace :daemon do
+ desc 'Spawn a daemon which forks WORKERS=n instances of Delayed::Worker'
+ task :start do
+ # we want master and children to share a logfile, so set these before fork
+ rails_env = ENV['RAILS_ENV'] || 'development'
+ rails_root = Dir.pwd
+ logfile = "#{rails_root}/log/delayed_worker.#{rails_env}.log"
+
+ # Loads the Rails environment and spawns a worker
+ worker = lambda do |id, delay|
+ fork do
+ $0 = "delayed_worker.#{id}"
+ # reset all inherited traps from main thread
+ [:CLD, :HUP, :TERM, :INT, :QUIT].each { |sig| trap sig, 'DEFAULT' }
+ sleep delay if delay
+ Rake::Task[:environment].invoke
+ Delayed::Worker.logger = Logger.new logfile
+ Delayed::Worker.new(:quiet => true).start
+ end
+ end
+
+ # fork a simple master process
+ master = fork do
+ $0 = 'delayed_worker.master'
+ rails_logger = lambda do |msg|
@bkeepers
bkeepers Sep 16, 2010

Can you explain what the headaches are with open files in the daemon?

@guns
guns Sep 16, 2010

Oh just the headache of passing around open file handles. It's not really a big problem, and you solve it by using a Logger class. I opted to just use a simple Proc to keep the memory usage of the master process low. It's not a super-scalable solution, but it was dead simple.

Also, I think you are looking at it, but I had pushed an update here: http://github.com/guns/delayed_job/blob/delayed_job_daemon/lib/delayed/daemon_tasks.rb
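
For readers following along, the "simple Proc" approach described above can be sketched roughly as follows. This is an illustration under assumptions, not the daemon's code: the temporary log path and the `log` name are invented here. The point is that the file is opened in append mode for each message and closed again, so no open handle has to be shared across `fork`.

```ruby
require 'tmpdir'

# Hypothetical log location for the sketch; the task uses log/delayed_worker.<env>.log.
logfile = File.join(Dir.mktmpdir, 'delayed_worker.sketch.log')

# Open, append one line, close. Nothing stays open between calls.
log = lambda do |msg|
  File.open(logfile, 'a') { |f| f.puts "#{Time.now}: [#{$0}] #{msg}" }
end

log.call 'written by the parent'
pid = fork { log.call 'written by a forked child' }
Process.wait pid

lines = File.readlines(logfile)
puts lines
```

The trade-off is one open/close per message, which is fine for a master process that logs rarely.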

@bkeepers
bkeepers Sep 16, 2010

That was my guess. Thanks.

Yeah, I'm looking at the latest version, but github only lets you comment from the commit :/

+ File.open(logfile, 'a') { |f| f.puts "#{Time.now}: [#{$0}] #{msg}" }
+ end
+
+ # create pidfile or abort
+ pid_dir = "#{rails_root}/tmp/pids"
+ pid_file = "#{pid_dir}/#{$0}.pid"
+ if File.exists? pid_file
+ msg = "PID file #{pid_file} already exists!"
+ rails_logger.call msg
+ abort msg
+ else
+ # silence output like a proper daemon
+ [$stdin, $stdout, $stderr].each { |io| io.reopen '/dev/null' }
+ mkdir_p pid_dir, :verbose => false
+ File.open(pid_file, 'w') { |f| f.write $$ }
+ end
+
+ # spawn the first workers
+ children, times_dead = {}, {}
+ worker_count = (ENV['WORKERS'] || 1).to_i
+ rails_logger.call "Spawning #{worker_count} worker(s)"
+ worker_count.times { |id| children[worker.call id, nil] = id }
+
+ # and respawn the failures
+ trap :CLD do
+ id = children.delete Process.wait
+ # check to see if this worker is dying repeatedly
+ times_dead[id] ||= []
+ times_dead[id] << (now = Time.now)
+ times_dead[id].reject! { |time| now - time > 60 }
+ if times_dead[id].size > 4
+ delay = 60 * 5 # time to tell the children to sleep before loading
+ rails_logger.call %Q{
+ delayed_worker.#{id} has died four times in the past minute!
+ Something is seriously wrong!
+ Restarting worker in #{delay} seconds.
+ }.strip.gsub /\s+/, ' '
+ else
+ rails_logger.call "Restarting dead worker: delayed_worker.#{id}"
+ end
+ children[worker.call id, delay] = id
+ end
+
+ # restart children on SIGHUP
+ trap :HUP do
+ rails_logger.call 'SIGHUP received! Restarting workers.'
+ Process.kill :TERM, *children.keys
+ end
+
+ # terminate children on user termination
+ [:TERM, :INT, :QUIT].each do |sig|
+ trap sig do
+ rails_logger.call "SIG#{sig} received! Shutting down workers."
+ # reset trap handlers so we don't get caught in a trap loop
+ [:CLD, sig].each { |s| trap s, 'DEFAULT' }
+ # kill the children and reap them before terminating
+ Process.kill :TERM, *children.keys
+ Process.waitall
+ rm_f pid_file
+ # propagate the signal like a proper process should
+ Process.kill sig, $$
@giddie
giddie Sep 21, 2010

I assume this signal propagation is intended to cause the master to terminate? Unfortunately, this doesn't work for me. When the master receives SIGTERM, it closes the workers perfectly, removes its PID file, but fails to terminate. By adding a log line to the traps (line 95 in this commit), I can see that the master receives SIGCLD, but not the SIGTERM you'd expect from this line.

For me, the solution is to replace line 101 with "exit". The master terminates immediately after its children, and all is well. Is there any reason you know of why sending SIGTERM would be better than calling "exit" here?

@bkeepers
bkeepers Sep 21, 2010

I was seeing the same thing. If I called Process.wait after line 101, it would kill the process.

@giddie
giddie Sep 21, 2010

But doesn't Process.wait wait for child processes to exit? According to the Ruby API:

Calling this method raises a SystemError if there are no child processes.

Surely, having just called Process.waitall, the master will have no child processes. In fact thinking about it, raising any exception here would probably have the effect of terminating the master process anyway. It might be worth checking whether your added wait is actually throwing an exception or allowing the process to end gracefully.
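
A quick way to check this outside the daemon is sketched below. It assumes a Unix-like system with `fork`; on current MRI the error raised is `Errno::ECHILD`, which is a subclass of `SystemCallError`. The `outcome` variable is just a name chosen for the sketch.

```ruby
fork { exit 0 }
Process.waitall  # reaps every child, so none are left afterwards

outcome =
  begin
    Process.wait  # nothing left to wait for
    :returned
  rescue Errno::ECHILD => e
    e.class
  end

puts "Process.wait with no children: #{outcome}"
```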

@guns
guns Sep 21, 2010

giddie:

Is there any reason you know of why sending SIGTERM would be better than calling "exit" here?

In this case we could very well just call exit and be okay, but I like to reserve using exit with an explicit value for when the program exits by its own hand. When you trap a signal, you are supposed to resend it after you have done whatever it is that you wanted to do, so we don't disturb the normal rules of interprocess communication.

As to why the process doesn't properly exit for you, I'm not sure. What's your setup? Are you using the latest changes from this branch? It works on my machines, both with ruby 1.8.7 and 1.9.2. You may want to check out the other thread that brandon and I have here, but that doesn't really address this issue.

I'm also confused about the comment about adding logging to line 95 - that line resets the trap handler, so how do you inject logging into that?

Let's figure this out. I'm really curious about why Process.kill :TERM, $$ doesn't work properly...
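
For reference, the trap, clean up, reset, and re-send pattern under discussion can be reproduced in isolation. The sketch below is a stripped-down stand-in, not the daemon itself: the pipe is only there so the parent knows the child has installed its handler, and the cleanup step is left as a comment. It assumes a Unix-like system and a recent MRI; it does not attempt to explain the REE behaviour reported above.

```ruby
reader, writer = IO.pipe

pid = fork do
  reader.close
  trap :TERM do
    # ... cleanup (reap children, remove the pidfile) would go here ...
    trap :TERM, 'DEFAULT'            # restore the default disposition first
    Process.kill :TERM, Process.pid  # then re-send the signal to ourselves
  end
  writer.puts 'ready'
  writer.close
  sleep                              # block until a signal arrives
end

writer.close
reader.gets                          # wait until the child has set its trap
Process.kill :TERM, pid
_, status = Process.wait2 pid
puts "signaled=#{status.signaled?} termsig=#{status.termsig.inspect}"
```

If the re-send works, the parent sees the child as killed by a signal rather than as having exited normally, which is the difference between this approach and calling `exit` in the handler.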

@giddie
giddie Sep 22, 2010

My comment about logging was a bit obscure: I meant that I changed that line so that it would log the signal and do nothing, instead of resetting the handlers, just to see what signals were actually received. Obviously I then had to send SIGKILL to terminate the process. However, it did highlight for me the fact that SIGTERM was not propagated in the way that was intended: it was not caught.

I found this article that clearly agrees with what you say about needing to propagate the signal: http://www.cons.org/cracauer/sigint.html. It makes sense to me too. It's odd that it works for you and not me :s I'll try to look into this more later today.

So my setup is REE (1.8.7) in Archlinux. I've tried it on two systems, one x86 and one x86_64. I'm using your "guns" branch, so I'm pretty sure I've got the latest version. BTW, thank you so much for writing this daemon; it's a huge improvement :)

+ end
+ end
+
+ # NOTE: We want to block on something so that Process.waitall doesn't
+ # reap children before the SIGCLD handler does.
+ #
+ # poll passenger restart file and restart on update
+ years_ago = lambda { |n| Time.now - 60 * 60 * 24 * 365 * n }
+ mtime = lambda do |file|
+ File.exists?(file) ? File::Stat.new(file).mtime : years_ago.call(2)
+ end
+ restart_file = "#{rails_root}/tmp/restart.txt"
+ last_modified = mtime.call restart_file
+ loop do
+ if (check = mtime.call restart_file) > last_modified
+ last_modified = check
+ Process.kill :HUP, $$
+ end
+ sleep 5
+ end
+
+ # reap children and remove pidfile if the blocking loop is broken
+ Process.waitall
+ rm_f pid_file
+ end
+
+ # detach the master process and exit
+ Process.detach master
+ end
+
+ desc 'Restart an existing delayed_worker daemon'
+ task(:restart) { kill_master :SIGHUP }
+
+ desc 'Stop an existing delayed_worker daemon'
+ task(:stop) { kill_master :SIGTERM }
+ end
+end
2 lib/delayed/tasks.rb
@@ -1,3 +1,5 @@
+require 'delayed/daemon_tasks'
+
# Re-definitions are appended to existing tasks
task :environment
task :merb_env
