
Corrupted Queue in Python's multiprocessing Pool implementation #13

Closed
andresriancho opened this issue Jun 30, 2017 · 5 comments

@andresriancho

andresriancho commented Jun 30, 2017

Python's multiprocessing pool has various limitations; one I tried to solve with my wrapper code is worker process timeouts. I implemented that in a rather ugly way: when the timeout is reached, I os.kill the worker process. The multiprocessing pool implementation then spawns a new worker process, and 99.99% of the time everything works well.
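A minimal sketch of that kind of workaround (the worker function, timeout values, and use of the internal `pool._pool` attribute are illustrative, not my actual wrapper code):

```python
import os
import signal
import time
import multiprocessing


def worker(seconds):
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=1)
    result = pool.apply_async(worker, (10,))
    try:
        result.get(timeout=0.5)
    except multiprocessing.TimeoutError:
        # Hard-kill the worker; the Pool respawns a replacement. If the
        # kill lands while the worker holds the result queue's lock, the
        # queue can become corrupted and the whole pool may stop working.
        for proc in pool._pool:  # _pool is an undocumented internal
            os.kill(proc.pid, signal.SIGKILL)
    pool.terminate()
    pool.join()
```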

The remaining 0.01% of the time there are some strange issues and the whole pool stops working, which I believe is caused by this behavior documented in the Python docs:

Warning If a process is killed using Process.terminate() or os.kill() while it is trying to use a Queue, then the data in the queue is likely to become corrupted. This may cause any other process to get an exception when it tries to use the queue later on.

While looking for alternative multiprocessing pool implementations I found yours, which does implement timeouts, and as far as I can tell from process.py#L360-L366 there is code to avoid killing a process while the queue is locked. Am I reading that part of the code correctly?

Which other issues from python's multiprocessing pool did you fix in your implementation?

Does your pool implementation have any known issues?

@noxdafox
Owner

noxdafox commented Jun 30, 2017

Hello,

this Pool implementation was added because no existing one supported disaster recovery and the handling of hanging workers while still offering a clean interface.

Billiard was the closest, but its undocumented and confusing API was causing issues in our systems. Moreover, it was not handling timeouts properly: celery/billiard#104. It might have improved by now.

The Pebble implementation is quite stable and we use it in some high load systems.

Among the issues it handles and the features it offers:

  • timing out operations
  • task cancellation (with worker termination)
  • crashing workers (python interpreter disasters or C libraries segfaults)
  • transfer of large data between the server and workers
  • iteration (map) over faulty results

The only known issue I'm investigating (apart from issue #10) is the pool hanging on rare occasions when very large data is transferred from the workers back to the server at once (in a single result). I'm not 100% sure it's due to Pebble though.

The lines you linked deal with hard-killing an unresponsive worker (one hanging in a C loop, for example) without corrupting the Queue. But there are plenty of other small corner cases to deal with.
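A hypothetical illustration of that guard (simplified; the function names and the shared-lock scheme are illustrative, not Pebble's actual internals): workers write to the result queue only while holding a shared lock, and the pool acquires the same lock before delivering SIGKILL, so the kill can never interrupt a write in progress.

```python
import os
import signal
import time
from multiprocessing import Lock, Process


def safe_stop(process, queue_lock):
    # Acquire the lock the workers hold while writing to the result
    # queue, so SIGKILL can never land in the middle of a queue write.
    with queue_lock:
        os.kill(process.pid, signal.SIGKILL)
    process.join()


if __name__ == "__main__":
    lock = Lock()
    # Stand-in for a hung worker that would never finish on its own.
    proc = Process(target=time.sleep, args=(30,))
    proc.start()
    safe_stop(proc, lock)
    print(proc.exitcode)  # negative exit code: terminated by a signal
```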

@andresriancho
Author

Matteo, thanks for the detailed and quick response!

It seems that the code is pretty stable. When you say very large data is being transferred from the workers back to the server, does that happen when the workers are remote (over the network)? How large is very large?

I'll start writing some unit tests for my current implementation and then migrate to Pebble. Hopefully it will all work well 👍

@noxdafox
Owner

By server I mean the Pool process. Pebble is not capable of dealing with remote processes. For that, I would recommend taking a look at Celery or Luigi.

As I said, I'm not sure about that issue, as I was pulled onto other things a while ago. I'll try to resume the investigation in that regard and, if there seems to be a real issue, I'll open a report myself.

Let me know if you encounter any problem integrating Pebble.

I will close this issue for now.

@andresriancho
Author

Integration with my code went really well: andresriancho/w3af@27c6e25

I now have less code to maintain, and the whole thing seems to be working as expected.

Thanks for making pebble open source!

@noxdafox
Owner

noxdafox commented Jun 30, 2017

Np, glad it helps somebody.

I took a look at my notes regarding the "large data issue". It turned out to be a test which was hanging due to a mistake of mine in the test itself. I will fix the test in the coming days (not really urgent).

On Windows, transferring large amounts of data through the Pool might be problematic. I need to research a bit how to improve that.

Nevertheless, it's not a good idea to transfer large chunks of data via IPC. It is better to rely on the filesystem for such use cases.
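A minimal sketch of that idea (function names and the payload are illustrative, and it assumes the result pickles cleanly): the worker writes its large result to a temporary file and sends only the small path string back through the pool's IPC channel.

```python
import pickle
import tempfile


def produce_large_result():
    # Stand-in for a worker producing a large payload: write it to disk
    # and return only the (small) file path over IPC.
    data = list(range(100_000))
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pkl") as fh:
        pickle.dump(data, fh)
        return fh.name


def load_result(path):
    # The receiving side reads the payload back from the filesystem.
    with open(path, "rb") as fh:
        return pickle.load(fh)


path = produce_large_result()
print(len(load_result(path)))  # 100000
```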

noxdafox added a commit that referenced this issue Jul 4, 2017
Signed-off-by: Matteo Cafasso <noxdafox@gmail.com>