
Race condition makes "god terminate" intermittently fail #9

willbryant opened this Issue · 3 comments


Almost every time I run terminate on one of my 8-CPU daemon servers with a few jobs running and one failing, it doesn't work, because it doesn't actually run the stop action. It's pretty messy to debug, but I believe I have found at least one problem, if not the only problem: there's a race condition in the event processing.

The God stop_all method does this:

    ... do |name, w|
      ... do
        w.unmonitor if w.state != :unmonitored
        w.action(:stop) if w.alive?

The problem is that the unmonitor call gets added to the driver events queue to be run asynchronously. If the driver happens to get a turn before the next line, things work. But if not - if this stop_all method continues running before the driver wakes up and grabs the move(:up, :unmonitored) event from the queue - then the :stop action will get queued immediately behind it in the driver event queue.

Unfortunately, when the driver runs the Task#move(:up, :unmonitored), it does this:

    # cleanup from current state

This results in the stop event being cleared from the events queue! Accordingly, the unmonitor happens but the stop doesn't, so the terminate method then rolls on, obliviously waiting for the watch to finish even though it's never been stopped, eventually giving up.
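To make the interleaving concrete, here is a minimal toy model of the race described above. It is only a sketch: `ToyDriver`, `ToyWatch`, `schedule`, `clear_events` and `run_pending` are hypothetical names invented for illustration and are not god's actual classes or methods. The only behaviour it assumes is what this issue describes, namely that the queued unmonitor transition clears the driver's event queue when it runs.

```ruby
# Toy model only: ToyDriver and ToyWatch are hypothetical stand-ins,
# not god's real Driver/Watch classes.
class ToyDriver
  def initialize
    @events = []
  end

  # Queue a block to be run later by the driver.
  def schedule(&block)
    @events << block
  end

  # Drop everything that is still waiting in the queue.
  def clear_events
    @events.clear
  end

  # Run queued events in order, as the driver would when it gets a turn.
  def run_pending
    while (event = @events.shift)
      event.call
    end
  end
end

class ToyWatch
  attr_reader :log

  def initialize(driver)
    @driver = driver
    @log = []
  end

  # Queues the state transition; when it runs, it clears the queue first
  # (the assumed "cleanup from current state" step).
  def unmonitor
    @driver.schedule do
      @driver.clear_events
      @log << :unmonitored
    end
  end

  # Queues the stop action behind whatever is already waiting.
  def stop
    @driver.schedule { @log << :stopped }
  end
end

# Unlucky interleaving: both events are queued before the driver runs,
# so the unmonitor event wipes out the stop event queued behind it.
unlucky_driver = ToyDriver.new
unlucky = ToyWatch.new(unlucky_driver)
unlucky.unmonitor
unlucky.stop
unlucky_driver.run_pending

# Lucky interleaving: the driver gets a turn between the two calls.
lucky_driver = ToyDriver.new
lucky = ToyWatch.new(lucky_driver)
lucky.unmonitor
lucky_driver.run_pending
lucky.stop
lucky_driver.run_pending

puts "unlucky: #{unlucky.log.inspect}"
puts "lucky:   #{lucky.log.inspect}"
```

In this model the unlucky run records only `:unmonitored`, while the lucky run records both `:unmonitored` and `:stopped`, which matches the intermittent behaviour reported here.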

I can see a couple of ways to patch this. The most obvious is to move the unmonitor state transition and stop action into one driver event, but that seems like a bit of a hack.
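As a rough illustration of the "one driver event" idea, the sketch below uses a toy queue rather than god's real code. `ToyDriver`, `ToyWatch` and `unmonitor_and_stop` are hypothetical names, and the queue-clearing step is assumed from the description in this issue. The point is only that when the stop runs inside the same event as the transition, there is nothing queued behind it to be cleared.

```ruby
# Toy model only: these classes are hypothetical, not god's real API.
class ToyDriver
  def initialize
    @events = []
  end

  def schedule(&block)
    @events << block
  end

  def clear_events
    @events.clear
  end

  # Run queued events in order, as the driver would when it gets a turn.
  def run_pending
    while (event = @events.shift)
      event.call
    end
  end
end

class ToyWatch
  attr_reader :log

  def initialize(driver)
    @driver = driver
    @log = []
  end

  # A single queued event performs the transition and then the stop,
  # so clearing the queue cannot drop the stop.
  def unmonitor_and_stop
    @driver.schedule do
      @driver.clear_events   # assumed cleanup step from the issue
      @log << :unmonitored
      @log << :stopped       # stop runs inside the same event
    end
  end
end

driver = ToyDriver.new
watch = ToyWatch.new(driver)
watch.unmonitor_and_stop
driver.run_pending

puts watch.log.inspect
```

Whether this maps cleanly onto god's actual state machine is a separate question; the sketch only shows why the ordering problem disappears in the toy model.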

Why does the code clear the events queue? Do we need to unmonitor before queueing the stop action?


Good find! I need to dig into the code again and see why it's written that way. I'll try to get this resolved for 0.10.0 or earlier.


Any progress on this one? Just had another customer project whose terminate is not working, and I think it's probably the same issue.


I'm also running into terminate problems. Would be great to know if this is what's causing them. Would sleeping in between the unmonitor and the stop force a context switch? If so, that could be a quick way to determine whether this is the problem I'm seeing.
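For what it's worth, a sleep does give other Ruby threads a chance to run, so the experiment can be tried on a toy queue first. The sketch below is not god's code: the `queue`, `log` and `driver` names and the `:shutdown` sentinel are made up for illustration, and the 0.1 second pause is an arbitrary value that makes the good ordering very likely rather than guaranteed.

```ruby
# Toy experiment only: a driver thread draining a queue, with a sleep
# between enqueuing the "unmonitor" and "stop" events.
queue = Queue.new
log = []

# Hypothetical driver thread: pops events and runs them one at a time.
driver = Thread.new do
  loop do
    event = queue.pop
    break if event == :shutdown
    event.call
  end
end

# The "unmonitor" event clears whatever is still queued when it runs
# (assumed behaviour, taken from the description in this issue).
queue << lambda do
  queue.clear
  log << :unmonitored
end

# Give the driver thread time to wake up and process the first event
# before the stop is queued behind it.
sleep 0.1

queue << lambda { log << :stopped }
queue << :shutdown

# Bounded wait so the script cannot hang if the timing goes wrong.
driver.join(2)

puts log.inspect
```

If a similar pause between the unmonitor and the stop makes terminate work reliably in a real setup, that would be consistent with the race described above, though it would not prove there is no other cause.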
