God reports incorrect up status when it fails to fork #108

b4hand opened this Issue Sep 7, 2012 · 1 comment



b4hand commented Sep 7, 2012

I've seen this a couple of times now on different machines, during or after an OOM killer invocation. It is probably due to runaway memory use in either my service or memcached, but I would expect god to at least report the correct status. Instead, it shows the corresponding process as "up" even though ps shows it is not running.

F [2012-09-07 02:30:07] FATAL: Unhandled exception in driver loop - (Errno::ENOMEM): Cannot allocate memory - fork(2)
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/process.rb:238:in `fork'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/process.rb:238:in `call_action'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/watch.rb:305:in `call_action'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/watch.rb:261:in `action'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/task.rb:215:in `move'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/task.rb:444:in `handle_event'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:87:in `send'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:87:in `handle_event'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:181:in `initialize'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:179:in `loop'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:179:in `initialize'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:178:in `new'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/driver.rb:178:in `initialize'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/task.rb:51:in `new'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/task.rb:51:in `initialize'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god/watch.rb:39:in `initialize'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god.rb:283:in `new'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god.rb:283:in `task'
/var/lib/gems/1.8/gems/god-0.12.1/bin/../lib/god.rb:271:in `watch'
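
The trace shows `Errno::ENOMEM` from `fork` reaching the driver loop unhandled. As a rough sketch only (this is not god's code; `try_start` and the state names are hypothetical), the failure could be rescued at the call site and recorded, so that the reported status does not stay "up" when no child was created:

```ruby
# Hypothetical helper, not part of god's API. `forker` is any callable that
# returns a child PID, for example: lambda { fork { exec("my_service") } }
def try_start(forker)
  pid = forker.call
  { :state => :up, :pid => pid }
rescue Errno::ENOMEM, Errno::EAGAIN => e
  # fork(2) failed, so no child exists; record that instead of assuming "up".
  { :state => :start_failed, :pid => nil, :error => e.class.name }
end
```

Whether the rescue path itself can run reliably under real memory pressure is an open question; the sketch only shows that the exception is catchable in principle.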

scomma commented Apr 25, 2013

Getting this and it really burns because we trusted god to keep the production system alive. 😞

I'm not sure anything can be done about it: god is written in Ruby, so the options for handling an out-of-memory error may be limited. Attempting to do almost anything else will involve allocating more memory (unlike in C). I hope somebody proves me wrong.

Adding insult to injury, after the crisis is over and memory has been freed, god lies to me (in god status) that all workers are up when in fact many failed to spawn; a simple ps query for their PIDs reveals this.
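
For anyone who wants to cross-check a reported status without parsing ps output, here is a minimal sketch of the same check in Ruby. `pid_alive?` is a name I made up, not something god provides; it relies on the standard behaviour of signal 0, which tests for a process without delivering a signal:

```ruby
# Hypothetical helper: true if a process with this PID currently exists.
def pid_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence and permission
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # process exists but is owned by another user
end
```

Note that an unreaped zombie still counts as existing here, and PIDs can be reused, so this is a sanity check rather than proof that the right worker is running.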
