worker process occasionally stops working, task is lost #51
Just happened again in another 578 job run. I noted the PID of the idle worker and checked: it doesn't show up in the log. That implies the idle worker was never assigned any work at all.
I did some diving in the Python bugs database. I found some things that looked relevant but none exactly right. Worker processes terminating unexpectedly cause problems: see 9205 and 22393. However, in most (all?) cases worker processes shouldn't terminate in our system; they just have an Exception pasted on their stack that causes an orderly exit.

There's a general problem in Python with programs that use locks, threads, and fork (see 6721), and our combination of logging + multiprocessing does that. And someone noticed a problem where process IPC hangs the worker (see 10037): that sounds like our problem, but I have no idea if it really is.

Short term, I suggest we try living with this and seeing how bad it is and whether there's any pattern to it. Long term I'd love to get the task management out of a single Python program entirely, and go to a more decoupled architecture that will be more robust in a lot of ways.
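For reference, one common mitigation for the locks + threads + fork hazard in 6721 is to keep workers from inheriting the parent's lock state at all. A minimal sketch of that idea, assuming nothing about the real jobs.py beyond its combination of logging and multiprocessing:

```python
# Sketch only: avoid bpo-6721-style deadlocks by never forking while a
# logging lock might be held. The "spawn" start method gives each worker
# a fresh interpreter, and the initializer sets up logging per worker.
import logging
import multiprocessing as mp

def init_worker():
    # Runs once in each new worker; nothing is inherited from the parent,
    # so no lock can arrive in a locked state.
    logging.basicConfig(level=logging.INFO)

def work(n):
    logging.getLogger(__name__).info("processing %s", n)
    return n * n

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4, initializer=init_worker) as pool:
        print(sum(pool.map(work, range(10))))
```

Spawning is slower than forking, but it sidesteps the whole class of inherited-lock problems rather than patching around one instance of it.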
Some ideas on how to deal with this bug.
This seems to fix an idle worker problem, see issue #51
Good news! My suggestion 2 above (not setting maxtasksperchild) seems to work. My best guess is the Python bug is triggered by the worker respawning that maxtasksperchild causes.

Update, second test: 60 second timeouts. I sent two worker processes a SIGTERM; new workers spawned and kept working. I never saw an idle worker, but the job stalled at 576/578 complete. I'm not too worried about this failure; if someone's at the terminal doing SIGTERM they can SIGUSR1 when they want it to stop. OTOH if one of those worker processes segfaults or something it may show the same failure.
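For anyone reading along, the SIGTERM/SIGUSR1 behavior described above could be wired up roughly like the sketch below. This isn't the project's actual jobs.py; it's a hedged illustration of polling an abort flag instead of blocking in join(), so a stalled run can still be torn down and its completed results kept.

```python
# Sketch only: salvage a stalled run on SIGUSR1 by terminating the pool
# and keeping whatever results already completed. The 578-task count
# mirrors the runs discussed in this thread; the worker is a stand-in.
import signal
import time
import multiprocessing as mp

abort_requested = False

def handle_usr1(signum, frame):
    # Runs in the parent process; just flip a flag for the main loop.
    global abort_requested
    abort_requested = True

def square(n):
    return n * n

def run_jobs(tasks):
    signal.signal(signal.SIGUSR1, handle_usr1)
    done = []
    pool = mp.Pool(processes=8)
    for task in tasks:
        # The callback runs in the pool's result-handler thread.
        pool.apply_async(square, (task,), callback=done.append)
    pool.close()
    # Poll instead of blocking in join() so the SIGUSR1 flag is noticed
    # even when an idle worker has silently stopped making progress.
    while len(done) < len(tasks) and not abort_requested:
        time.sleep(1.0)
    pool.terminate()  # harmless if all work already finished
    pool.join()
    return done       # whatever completed before the abort

if __name__ == "__main__":
    results = run_jobs(list(range(578)))
    print("%d of 578 tasks completed" % len(results))
```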
Getting better. Although I just had a whole run crash on this:
I think that's a different bug; I filed a new issue #52 for it with a suggested fix.
I've done enough testing on the new code that I'm pretty sure the important primary bug is closed. I may do some follow-up research on some related weirdnesses.
I managed to distill this down to a short test case without any timeouts or weird job exits or signals or anything. I think the problem comes down to setting maxtasksperchild. I filed a Python bug: http://bugs.python.org/issue23278
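The distilled test case itself isn't quoted in the thread (it lives with the Python bug report), but assuming maxtasksperchild is the trigger, its general shape would be something like:

```python
# Sketch of the general shape of such a test, not the actual attachment
# to http://bugs.python.org/issue23278: many tiny tasks with
# maxtasksperchild=1 forces constant worker turnover. On an affected
# Python build, map() can stall before every result comes back.
import multiprocessing as mp

def noop(n):
    return n

if __name__ == "__main__":
    pool = mp.Pool(processes=4, maxtasksperchild=1)
    results = pool.map(noop, range(578))
    pool.close()
    pool.join()
    print("completed %d of 578 tasks" % len(results))
```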
Nice work, thanks. Hope it's a fixable thing upstream.

I have a similar issue in my code.
With the new multiprocessing jobs.py there's an occasional bug that we can't reproduce. The main visible effect is the run will idle at the end, reporting almost all the work is done. However no work is being processed and that whole run is stuck.

If that happens you can still salvage the work by doing a kill -USR1 on the parent process. That will abort all work, start generating a result report, and upload data to S3.

I think there's a related thing which I've observed once or twice with htop: there's a worker process sitting around that hasn't changed its name nor has it been assigned work. You can kill this idle worker just fine with a SIGTERM and the pool will start a new one which will do useful work again. But it may be that whatever task was supposed to be assigned to that worker has now gotten lost.

I don't know how to reproduce this bug; it's occurred twice in about 20 runs I've done. It could be a bug in multiprocessing.Pool or it could be that somehow we screwed things up by not executing a job cleanly. If we could reproduce it, the thing to do is attach a debugger and start looking at the internal state of the pool to see if there's something obviously wrong.