Add task jobs:daemon:start, which spawns a simple forking daemon that actually works

The command

        $ WORKERS=n RAILS_ENV=production rake jobs:daemon:start

spawns a simple forking daemon, which spawns and restarts `n' instances of
Delayed::Worker. Worker processes are revived by the master process on receipt
of SIGCLD.

We can restart worker instances by sending SIGHUP to the master process or by
killing them directly. Sending SIGTERM, SIGINT, or SIGQUIT to the master
instructs it to kill its children and terminate.

Alternatively, the tasks `jobs:daemon:restart' and `jobs:daemon:stop' send
SIGHUP and SIGTERM, respectively, to the master process.

Two extra features:

* To avoid CPU thrashing, if a child worker dies 4 times in 60 seconds, a
  warning message is logged and the child sleeps for 300 seconds before
  booting up

* The master polls tmp/restart.txt and restarts children on timestamp update
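
The respawn throttling in the first bullet can be sketched in isolation. This is a minimal illustration, not the code from the commit: the class name `DeathThrottle` and its constants are hypothetical, and it assumes the threshold of four deaths within 60 seconds stated above.

```ruby
# Hypothetical helper, not part of the commit: decides how long a respawned
# worker should sleep, given the times at which it has died.
class DeathThrottle
  WINDOW    = 60   # seconds to look back
  THRESHOLD = 4    # deaths within WINDOW that trigger a back-off
  BACKOFF   = 300  # seconds the respawned child sleeps before booting

  def initialize
    @deaths = Hash.new { |hash, id| hash[id] = [] }
  end

  # Record a death of worker `id` at time `now`. Returns the delay in
  # seconds, or nil if the worker should be restarted immediately.
  def record(id, now = Time.now)
    times = @deaths[id]
    times << now
    times.reject! { |time| now - time > WINDOW }
    times.size >= THRESHOLD ? BACKOFF : nil
  end
end

throttle = DeathThrottle.new
start = Time.at(0)
# Worker 0 dies four times, ten seconds apart.
p (0...4).map { |i| throttle.record(0, start + i * 10) }
# prints [nil, nil, nil, 300]
```

Keeping the timestamps per worker id means one misbehaving worker is slowed down without delaying the others.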
guns committed Aug 17, 2010
1 parent b983b7e commit 031efe6
Showing 2 changed files with 140 additions and 0 deletions.
138 changes: 138 additions & 0 deletions lib/delayed/daemon_tasks.rb
@@ -0,0 +1,138 @@
### Helpers

def kill_master(signal)
  pid_file = "#{Dir.pwd}/tmp/pids/delayed_worker.master.pid"
  abort 'No pid file found!' unless File.exist? pid_file
  pid = File.read(pid_file).to_i
  puts "Sending #{signal} to #{pid}"
  Process.kill signal, pid
rescue Errno::ESRCH => e
  abort e.to_s
end

### Tasks

namespace :jobs do
  namespace :daemon do
    desc 'Spawn a daemon which forks WORKERS=n instances of Delayed::Worker'
    task :start do
      # we want master and children to share a logfile, so set these before fork
      rails_env = ENV['RAILS_ENV'] || 'development'
      rails_root = Dir.pwd
      logfile = "#{rails_root}/log/delayed_worker.#{rails_env}.log"

      # Loads the Rails environment and spawns a worker
      worker = lambda do |id, delay|
        fork do
          $0 = "delayed_worker.#{id}"
          # reset all inherited traps from main thread
          [:CLD, :HUP, :TERM, :INT, :QUIT].each { |sig| trap sig, 'DEFAULT' }
          sleep delay if delay
          Rake::Task[:environment].invoke
          Delayed::Worker.logger = Logger.new logfile
          Delayed::Worker.new(:quiet => true).start
        end
      end

      # fork a simple master process
      master = fork do
        $0 = 'delayed_worker.master'
        rails_logger = lambda do |msg|

bkeepers Sep 16, 2010

Can you explain what the headaches are with open files in the daemon?

guns Sep 16, 2010

Owner

Oh just the headache of passing around open file handles. It's not really a big problem, and you solve it by using a Logger class. I opted to just use a simple Proc to keep the memory usage of the master process low. It's not a super-scalable solution, but it was dead simple.

Also, I think you are looking at it, but I had pushed an update here: http://github.com/guns/delayed_job/blob/delayed_job_daemon/lib/delayed/daemon_tasks.rb
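
The trade-off described in this reply can be sketched side by side. This is only an illustration: the temporary file stands in for the real log path, and the names `proc_logger` and `std_logger` are made up for the example.

```ruby
require 'logger'
require 'tempfile'

# Stand-in for log/delayed_worker.<env>.log (hypothetical path for the demo).
log = Tempfile.new('delayed_worker_demo')
log.close

# Approach taken in the commit: reopen the file in append mode for every
# message, so no open file handle is held between writes.
proc_logger = lambda do |msg|
  File.open(log.path, 'a') { |f| f.puts "#{Time.now}: [#{$0}] #{msg}" }
end

# Alternative mentioned in the reply: a Logger instance that keeps the
# file open for its lifetime.
std_logger = Logger.new(log.path)

proc_logger.call 'written by the Proc'
std_logger.info 'written by Logger'
std_logger.close

puts File.read(log.path)
```

The Proc costs one open and close per message, which is acceptable for a master process that logs only on spawn, restart, and shutdown.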

bkeepers Sep 16, 2010

That was my guess. Thanks.

Yeah, I'm looking at the latest version, but github only lets you comment from the commit :/

          File.open(logfile, 'a') { |f| f.puts "#{Time.now}: [#{$0}] #{msg}" }
        end

        # create pidfile or abort
        pid_dir = "#{rails_root}/tmp/pids"
        pid_file = "#{pid_dir}/#{$0}.pid"
        if File.exist? pid_file
          msg = "PID file #{pid_file} already exists!"
          rails_logger.call msg
          abort msg
        else
          # silence output like a proper daemon
          [$stdin, $stdout, $stderr].each { |io| io.reopen '/dev/null' }
          mkdir_p pid_dir, :verbose => false
          File.open(pid_file, 'w') { |f| f.write $$ }
        end

        # spawn the first workers
        children, times_dead = {}, {}
        worker_count = (ENV['WORKERS'] || 1).to_i
        rails_logger.call "Spawning #{worker_count} worker(s)"
        worker_count.times { |id| children[worker.call(id, nil)] = id }

        # and respawn the failures
        trap :CLD do
          id = children.delete Process.wait
          # check to see if this worker is dying repeatedly
          times_dead[id] ||= []
          times_dead[id] << (now = Time.now)
          times_dead[id].reject! { |time| now - time > 60 }
          if times_dead[id].size >= 4
            delay = 60 * 5 # time to tell the children to sleep before loading
            rails_logger.call %Q{
              delayed_worker.#{id} has died four times in the past minute!
              Something is seriously wrong!
              Restarting worker in #{delay} seconds.
            }.strip.gsub(/\s+/, ' ')
          else
            rails_logger.call "Restarting dead worker: delayed_worker.#{id}"
          end
          children[worker.call(id, delay)] = id
        end

        # restart children on SIGHUP
        trap :HUP do
          rails_logger.call 'SIGHUP received! Restarting workers.'
          Process.kill :TERM, *children.keys
        end

        # terminate children on user termination
        [:TERM, :INT, :QUIT].each do |sig|
          trap sig do
            rails_logger.call "SIG#{sig} received! Shutting down workers."
            # reset trap handlers so we don't get caught in a trap loop
            [:CLD, sig].each { |s| trap s, 'DEFAULT' }
            # kill the children and reap them before terminating
            Process.kill :TERM, *children.keys
            Process.waitall
            rm_f pid_file
            # propagate the signal like a proper process should
            Process.kill sig, $$

giddie Sep 21, 2010

I assume this signal propagation is intended to cause the master to terminate? Unfortunately, this doesn't work for me. When the master receives SIGTERM, it closes the workers perfectly, removes its PID file, but fails to terminate. By adding a log line to the traps (line 95 in this commit), I can see that the master receives SIGCLD, but not the SIGTERM you'd expect from this line.

For me, the solution is to replace line 101 with "exit". The master terminates immediately after its children, and all is well. Is there any reason you know of why sending SIGTERM would be better than calling "exit" here?

bkeepers Sep 21, 2010

I was seeing the same thing. If I called Process.wait after line 101, it would kill the process.

giddie Sep 21, 2010

But doesn't Process.wait wait for child processes to exit? According to the Ruby API:

Calling this method raises a SystemError if there are no child processes.

Surely, having just called Process.waitall, the master will have no child processes. In fact thinking about it, raising any exception here would probably have the effect of terminating the master process anyway. It might be worth checking whether your added wait is actually throwing an exception or allowing the process to end gracefully.
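
The behaviour in question can be checked with a small sketch. It runs `Process.wait` inside a freshly forked child, which has no children of its own, and reports the outcome through the exit status. In current Ruby the exception raised is `Errno::ECHILD`, a subclass of `SystemCallError`.

```ruby
# Run the check in a fresh child so that it has no children of its own.
pid = fork do
  begin
    Process.wait
    exit! 1   # unexpected: there was a child to wait for
  rescue Errno::ECHILD
    exit! 0   # expected: no child processes
  end
end

_, status = Process.wait2(pid)
puts status.exitstatus.zero? ? 'Process.wait raised Errno::ECHILD' : 'no exception raised'
```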

guns Sep 21, 2010

Owner

giddie:

Is there any reason you know of why sending SIGTERM would be better than calling "exit" here?

In this case we could very well just call exit and be okay, but I like to reserve using exit with an explicit value for when the program exits by its own hand. When you trap a signal, you are supposed to resend it after you have done whatever it is that you wanted to do, so we don't disturb the normal rules of interprocess communication.

As to why the process doesn't properly exit for you, I'm not sure. What's your setup? Are you using the latest changes from this branch? It works on my machines, both with ruby 1.8.7 and 1.9.2. You may want to check out the other thread that Brandon and I have here, but that doesn't really address this issue.

I'm also confused about the comment about adding logging to line 95 - that line resets the trap handler, so how do you inject logging into that?

Let's figure this out. I'm really curious about why Process.kill :TERM, $$ doesn't work properly...
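
The trap-and-resend convention under discussion can be reproduced in isolation. The sketch below is an assumption-laden test harness rather than the daemon's code: the method name `terminate_with_resend` is hypothetical, and the child falls back to a plain `exit! 1` after five seconds so that a failed resend shows up as a normal exit instead of a hang.

```ruby
# Fork a child that traps SIGTERM, restores the default disposition, and
# resends the signal to itself. Returns the child's Process::Status.
def terminate_with_resend
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    trap :TERM do
      # ... cleanup would go here (kill workers, remove the pid file) ...
      trap :TERM, 'DEFAULT'            # avoid re-entering this handler
      Process.kill :TERM, Process.pid  # propagate the signal
    end
    writer.puts 'ready'                # tell the parent the trap is installed
    writer.close
    sleep 5
    exit! 1                            # only reached if the resend did not terminate us
  end
  writer.close
  reader.gets                          # wait until the child is ready
  reader.close
  Process.kill :TERM, pid
  Process.wait2(pid).last
end

status = terminate_with_resend
puts "signaled: #{status.signaled?}, termsig: #{status.termsig.inspect}, exit status: #{status.exitstatus.inspect}"
```

If the resend works, the parent sees a child killed by SIGTERM rather than one that exited with a status code, which is the distinction the convention is meant to preserve.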

giddie Sep 22, 2010

My comment about logging was a bit obscure: I meant that I changed that line so that it would log the signal and do nothing, instead of resetting the handlers, just to see what signals were actually received. Obviously I then had to send SIGKILL to terminate the process. However, it did highlight for me the fact that SIGTERM was not propagated in the way that was intended: it was not caught.

I found this article that clearly agrees with what you say about needing to propagate the signal: http://www.cons.org/cracauer/sigint.html. It makes sense to me too. It's odd that it works for you and not me :s I'll try to look into this more later today.

So my setup is REE (1.8.7) in Archlinux. I've tried it on two systems, one x86 and one x86_64. I'm using your "guns" branch, so I'm pretty sure I've got the latest version. BTW, thank you so much for writing this daemon; it's a huge improvement :)

          end
        end

        # NOTE: We want to block on something so that Process.waitall doesn't
        #       reap children before the SIGCLD handler does.
        #
        # poll passenger restart file and restart on update
        years_ago = lambda { |n| Time.now - 60 * 60 * 24 * 365 * n }
        mtime = lambda do |file|
          File.exist?(file) ? File::Stat.new(file).mtime : years_ago.call(2)
        end
        restart_file = "#{rails_root}/tmp/restart.txt"
        last_modified = mtime.call restart_file
        loop do
          if (check = mtime.call restart_file) > last_modified
            last_modified = check
            Process.kill :HUP, $$
          end
          sleep 5
        end

        # reap children and remove pidfile if the blocking loop is broken
        Process.waitall
        rm_f pid_file
      end

      # detach the master process and exit
      Process.detach master
    end

    desc 'Restart an existing delayed_worker daemon'
    task(:restart) { kill_master :SIGHUP }

    desc 'Stop an existing delayed_worker daemon'
    task(:stop) { kill_master :SIGTERM }
  end
end
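
The restart-file polling at the end of the master block can be pulled out into a small object for experimentation. This is a sketch, not the commit's code: the class name `RestartWatcher` is hypothetical, and it uses the epoch as the "file missing" sentinel where the commit uses a time two years in the past.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: reports once each time the watched file's mtime advances.
class RestartWatcher
  def initialize(file)
    @file = file
    @last_modified = mtime
  end

  # True exactly once per timestamp update.
  def changed?
    check = mtime
    return false unless check > @last_modified
    @last_modified = check
    true
  end

  private

  # A missing file is treated as very old (sentinel swapped in for the demo).
  def mtime
    File.exist?(@file) ? File.mtime(@file) : Time.at(0)
  end
end

Dir.mktmpdir do |dir|
  restart_file = File.join(dir, 'restart.txt')
  watcher = RestartWatcher.new(restart_file)
  p watcher.changed?           # file does not exist yet
  FileUtils.touch restart_file
  p watcher.changed?           # file was just created
  p watcher.changed?           # no update since the last check
end
# prints false, true, false
```

In the daemon, a true result corresponds to the master sending itself SIGHUP, and the check runs every five seconds.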
2 changes: 2 additions & 0 deletions lib/delayed/tasks.rb
@@ -1,3 +1,5 @@
require 'delayed/daemon_tasks'

# Re-definitions are appended to existing tasks
task :environment
task :merb_env