This repository has been archived by the owner on May 5, 2022. It is now read-only.

worker process occasionally stops working, task is lost #51

Closed
NelsonMinar opened this issue Jan 19, 2015 · 11 comments

Comments

@NelsonMinar
Contributor

With the new multiprocessing jobs.py there's an occasional bug that we can't reproduce.

The main visible effect is that the run idles at the end, reporting almost all the work is done. However, no work is actually being processed and the whole run is stuck.

I 2015-01-18 23:59:39,533   MainThread     jobs:126  Job completion: 577/578 = 99%
I 2015-01-18 23:59:54,533   MainThread     jobs:126  Job completion: 577/578 = 99%

If that happens you can still salvage the work by doing a kill -USR1 on the parent process. That aborts all remaining work, generates a result report, and uploads the data to S3.

There's a related thing I've observed once or twice with htop: a worker process sitting around that hasn't changed its name and hasn't been assigned any work. You can kill this idle worker just fine with a SIGTERM, and the pool will start a new one that does useful work again. But whatever task was supposed to be assigned to that worker may now have been lost.

I don't know how to reproduce this bug; it's occurred twice in about 20 runs I've done. It could be a bug in multiprocessing.Pool, or it could be that we somehow screwed things up by not executing a job cleanly. If we could reproduce it, the thing to do would be to attach a debugger and look at the internal state of the pool to see if there's something obviously wrong.
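
For reference, a minimal sketch (not the project's actual jobs.py) of how the SIGUSR1 salvage described above can be wired on the parent process; the worker function and the report step here are hypothetical stand-ins:

    import os
    import signal
    import time
    from multiprocessing import Pool

    def work(n):
        time.sleep(0.1)          # stands in for processing one source
        return n * n

    def main():
        pool = Pool(processes=4)
        finished = []

        def abort_on_usr1(signum, frame):
            raise SystemExit('aborted by SIGUSR1')

        signal.signal(signal.SIGUSR1, abort_on_usr1)

        try:
            for result in pool.imap_unordered(work, range(100)):
                finished.append(result)
        except SystemExit:
            pool.terminate()     # give up on anything still queued
        else:
            pool.close()
        pool.join()
        # Stand-in for generating the result report / S3 upload.
        print('report covers %d completed results' % len(finished))

    if __name__ == '__main__':
        print('send SIGUSR1 to pid %d to abort early' % os.getpid())
        main()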

@NelsonMinar
Contributor Author

Just happened again in another 578-job run. I noted the PID of the idle worker and checked: it doesn't show up in the log. That implies Task.run() never got called in that process, because it would show up in the log if it had. (Update: happened 5 times total that run, ugh.)

@NelsonMinar
Contributor Author

I did some digging in the Python bug database. I found some things that looked relevant but none exactly right. Worker processes terminating unexpectedly cause problems: see 9205 and 22393. However, in most (all?) cases worker processes shouldn't terminate in our system; they just have an Exception pasted onto their stack that causes an orderly exit. There's a general problem in Python with programs that use locks, threads, and fork (see 6721), and our combination of logging + multiprocessing does exactly that. And someone noticed a problem where process IPC hangs the worker (see 10037): that sounds like our problem, but I have no idea whether it really is.

Short term, I suggest we try living with this and see how bad it is and whether there's any pattern to it. Long term, I'd love to get the task management out of a single Python program entirely and move to a more decoupled architecture that would be more robust in a lot of ways.

@NelsonMinar
Contributor Author

This is a screenshot of htop when one of the workers is in a bad state: 11624, 3rd from bottom. It's the non-thread process that didn't change its name. That process name shows up briefly when the pool first spawns the process, but should change quite quickly once our Task code is executed.

[Screenshot: screen shot 2015-01-18 at 7 05 28 pm]

@NelsonMinar
Contributor Author

Some ideas on how to deal with this bug.

  1. We could schedule abort_pool() to run, say, 6 hours after the program starts: either by sending a SIGUSR1 from the EC2 shell script that invokes the job, or by adding a master timeout right in the main job monitor loop (see the sketch after this list).
  2. I'm doing a run again with maxtasksperchild=None, the default for multiprocessing. This means Python reuses the worker processes instead of starting new ones. I prefer the idea of starting new processes, but maybe that's exactly what triggers a bug in the pool code.
  3. I'm going to read the multiprocessing.Pool() code closely and try to understand where the problem may come from. I've also set things up so I can get to pdb in the program to debug its state. I don't have a lot of hope for figuring this out.
  4. We could replace multiprocessing.Pool with our own multiprocessing-based job manager. I think about 90% of the current code would stay the same; we'd just need new code for forking processes. I hate to do this, but if the problem is a bug in Pool, that'd be one way to avoid it.
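
A rough sketch of the master-timeout variant of idea 1, assuming the monitor loop polls the pool's result iterator with a short timeout the way jobs.py already does; the six-hour deadline and the names here are illustrative, not the project's code:

    import time
    from multiprocessing import Pool, TimeoutError

    MASTER_TIMEOUT = 6 * 60 * 60   # the whole run gets at most six hours
    REPORT_INTERVAL = 15           # seconds between completion log lines

    def run_one(source):
        time.sleep(1)              # stands in for processing one source file
        return source

    def run_all(sources):
        pool = Pool(processes=8)
        result_iter = pool.imap_unordered(run_one, sources)
        deadline = time.time() + MASTER_TIMEOUT
        completed = []
        while len(completed) < len(sources):
            if time.time() > deadline:
                break              # same effect as the SIGUSR1 abort, but automatic
            try:
                completed.append(result_iter.next(timeout=REPORT_INTERVAL))
            except TimeoutError:
                pass               # nothing finished this interval; keep polling
            print('Job completion: %d/%d' % (len(completed), len(sources)))
        if len(completed) < len(sources):
            pool.terminate()       # deadline hit or work lost: abort the stragglers
        else:
            pool.close()
        pool.join()
        return completed

    if __name__ == '__main__':
        run_all(['source-%d.json' % i for i in range(20)])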

NelsonMinar added a commit that referenced this issue Jan 19, 2015
This seems to fix an idle worker problem, see issue #51
@NelsonMinar
Contributor Author

Good news! My suggestion 2 above (not setting maxtasksperchild) seems to have fixed the problem. At least, I finished a run of 578 sources cleanly (30 minute timeouts, 2 hours total to run). No hung workers, no SIGUSR1 required.

My best guess is the Python multiprocessing.Pool code has a bug around launching new worker processes. With 6370898 we no longer launch any new worker processes; we just keep reusing the first set that was created. Seems to work fine: memory is reclaimed, job timeouts work, etc. I'm going to keep testing, but I think this is resolved.

Update, second test: 60-second timeouts. I sent two worker processes a SIGTERM; new workers spawned and kept working. I never saw an idle worker, but the job stalled at 576/578 complete. I'm not too worried about this failure; if someone's at the terminal sending SIGTERM, they can send SIGUSR1 when they want it to stop. OTOH, if one of those worker processes segfaults or something, it may show the same failure.
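
For illustration (the real change is commit 6370898; this is just an assumption about its shape), the difference comes down to the maxtasksperchild argument when the pool is created:

    from multiprocessing import Pool

    # Before: every task got a freshly spawned worker process, which is what
    # seemed to coincide with the occasional idle, never-assigned worker.
    # pool = Pool(processes=8, maxtasksperchild=1)

    # After: maxtasksperchild is left at its default (None), so the first set
    # of workers is reused for the whole run.
    pool = Pool(processes=8)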

@migurski
Member

Getting better. Although, I just had a whole run crash on this:

MainThread    589851.6  DEBUG: Requesting http://gisdata.cityofchesapeake.net/public/rest/services/A_POL_Basemap/MapServer/8/query
 MainThread    589898.8   INFO: Converted to out/process_one-jt9HBi/conform-hUZ9tU/converted/us-sc-lexington-59dc607d.csv
 MainThread    589908.5   INFO: Processed data in /home/ubuntu/out/process_one-jt9HBi/out.csv
 MainThread    589908.8   INFO: Starting task for /var/opt/openaddresses/sources/us-ms-hinds.json with PID 15414
 MainThread    589909.5  DEBUG: Downloading /arcgis/rest/services/Viewers/Hinds/MapServer/3 to us-ms-hinds-98c08b37.json
Traceback (most recent call last):
  File "/usr/local/bin/openaddr-process", line 9, in <module>
    load_entry_point('OpenAddresses-Machine==1.1.1', 'console_scripts', 'openaddr-process')()
  File "/usr/local/lib/python2.7/dist-packages/openaddr/process_all.py", line 47, in main
    process(s3, paths.sources, run_name)
  File "/usr/local/lib/python2.7/dist-packages/openaddr/process_all.py", line 113, in process
    results = jobs.run_all_process_ones(source_files, 'out', source_extras)
  File "/usr/local/lib/python2.7/dist-packages/openaddr/jobs.py", line 114, in run_all_process_ones
    completed_path, result = result_iter.next(timeout=report_interval)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
    raise value
IOError: [Errno 2] No such file or directory: '/20150119/us-sc-lexington.json'
 MainThread    589975.4  DEBUG: Requesting http://gis.mesaaz.gov/mesaaz/rest/services/Address/AddressIndexAcella/MapServer/0/query

out/us-sc-lexington is empty.

@NelsonMinar
Contributor Author

I think that's a different bug; I filed a new issue #52 for it with a suggested fix.

@NelsonMinar
Contributor Author

I've done enough testing on the new code that I'm pretty sure the important primary bug is closed. I may do some follow-up research on some related weirdnesses.

  1. SIGTERM to a worker process causes new work to be scheduled, but the old job is never removed from the queue and the whole run never really completes without the SIGUSR1 hack. I recommend using SIGALRM to kill any errant workers instead; that's basically forcing an early job timeout (see the sketch after this list).
  2. maxtasksperchild=1 just seems to be broken. I may file a Python bug on this, although it could still be a problem in our own use of Pool.
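
A sketch of the SIGALRM idea from item 1, assuming it runs inside the worker process so the alarm only affects that worker's current task; the timeout value and function names are illustrative, not the project's code:

    import signal
    import time

    JOB_TIMEOUT = 30 * 60          # seconds a single source is allowed to take

    class JobTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise JobTimeout()

    def process_source(source):
        time.sleep(1)              # stands in for the real conversion work
        return source

    def run_with_alarm(source):
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(JOB_TIMEOUT)  # schedule the timeout before starting work
        try:
            return process_source(source)
        except JobTimeout:
            return None            # report a runaway task as timed out
        finally:
            signal.alarm(0)        # always cancel the pending alarm

A worker would then run each task through something like run_with_alarm, e.g. pool.imap_unordered(run_with_alarm, sources), so a stuck task kills itself instead of stalling the whole run.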

@NelsonMinar
Contributor Author

I managed to distill this down to a short test case without any timeouts or weird job exits or signals or anything. I think the problem comes down to logging vs. multiprocessing. There's some magic I don't understand where log messages get shared between processes, and maybe some of that doesn't always work reliably when new worker processes are created. For our application, I'm satisfied that avoiding maxtasksperchild=1 has solved the problem.

I filed a Python bug: http://bugs.python.org/issue23278
Demo of problem: https://gist.github.com/NelsonMinar/022794b6a709ea5b7682
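
Roughly, the shape of the problem for anyone who doesn't want to follow the gist link (the gist is authoritative; this is only a reconstruction of the setup described above): a Pool with maxtasksperchild=1 whose tasks use logging, which on an affected machine occasionally stalls with one idle worker.

    import logging
    from multiprocessing import Pool

    logging.basicConfig(level=logging.INFO)

    def task(n):
        logging.info('task %d starting', n)   # workers log, as our Tasks do
        return n

    if __name__ == '__main__':
        # maxtasksperchild=1 forces a new worker per task; combined with
        # logging this is the setup that occasionally leaves a worker idle
        # and the run stuck short of completion.
        pool = Pool(processes=4, maxtasksperchild=1)
        results = pool.map(task, range(200))
        logging.info('finished %d tasks', len(results))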

@migurski
Member

Nice work, thanks. Hope it's a fixable thing upstream.

@hvasbath

hvasbath commented Jun 1, 2017

I have a similar issue in my code:
https://github.com/hvasbath/beat/blob/seis_synth_restruct/src/paripool.py
However, I never used maxtasksperchild, so no new processes get started. But sometimes the process runs only partially and hangs somewhat randomly in the middle. This happens around 1 time out of 1000 jobs, which then causes the parent to hang and wait indefinitely. So far I thought there might be a problem in the job I am running; I am also using logging in almost exactly the same way as you do. Do you have any tips on how to investigate the problem? So far I couldn't even figure out where it hangs, as there are no tracebacks, only from the parent process. Your suggestions would be very much appreciated!
