parallel tasks never finish under heavy load #3303
Comments
Can you try to reproduce with master as well? (I assume mitigating #3106 means you are using 0.13, since it doesn't affect master.) Also try the pure ZMQ scheduler, and see if it makes any difference.
Does the number of engines matter? Can you reproduce with a smaller number of engines?
I can reproduce the bug on 4 engines, but the number of successful jobs rises to 7000, significantly more successful jobs per engine. I cannot reproduce it on a cluster running locally on my laptop with 0.13.2. It'll take me some more time to try to reproduce it on master or 0.13.2 on our cluster, where the issue was observed.
I encounter the following when trying to use master (most likely an unrelated bug).
v0.13.2, pure ZMQ scheduler:
The controller log has a bunch of
The engine logs have
Queue status seems to be the thing that goes wrong: the engines report very different queue lengths each time the stall occurs.
Reproduced on master with 4 engines run locally on a laptop, with the same symptoms. By the way, the fix for #3106 does not seem to work anymore: submission time of 40000 jobs is huge again.
A completely different problem appears with hwm set to 1.
What is the libzmq version you are running?
I only saw the bug on 3.2.2 (and three different IPython versions). It could well be ZMQ; however, note that the error also appeared when tasks lasted 50 seconds.
Does it appear when tasks last 50 seconds and you have
Excellent, then I think I am on the right track. zmq 3 changed the default HWM from -1 (infinite) to 1000, so if you try to assign more than 1000 tasks as fast as you can to engines, they will be dropped. |
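The changed default is easy to verify from pyzmq (a quick sketch, assuming pyzmq built against libzmq >= 3; the socket here is just illustrative):

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.ROUTER)

# Under libzmq >= 3 both high-water marks default to 1000 messages;
# under libzmq 2.x they defaulted to 0 (no limit).
print(sock.sndhwm, sock.rcvhwm)

# An HWM of 0 means "no limit", restoring the libzmq-2 behaviour.
sock.set_hwm(0)
print(sock.sndhwm, sock.rcvhwm)  # 0 0

sock.close()
ctx.term()
```

`Socket.set_hwm()` sets both the send and receive high-water marks in one call, which is convenient when a relay socket both queues outgoing and buffers incoming messages.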
I didn't test it, and I'm not sure it would be possible with 0.13 due to #3106.
Seems very reasonable; I wasn't able to reproduce it with 2.x.x. Does this also explain the issue with
Try #3304, and see how your tests fare.
#3304 seems to fix both problems. I still see #2215 on master, and submission of 40000 jobs takes around a minute now, so something has reinstated #3106. Is there a quick and dirty way to apply #3304 to 0.13? I don't find the pure ZMQ scheduler a viable option right now because the engines unregister every now and then.
#3106 has been fixed: submission of jobs is now linear in the number of jobs. I see that it can submit about 1600 tasks/sec pretty consistently on my laptop, so there isn't a regression; it's just not as fast as it should be. I should note that it is probably a good idea to set
You are welcome to try manual patches via
set unlimited HWM for all relay devices
libzmq 3 changed the default HWM for sockets from infinite to 1000, which can result in dropped messages on ROUTER or PUB sockets under high load. This PR explicitly sets these back to infinite to avoid dropped messages. Closes ipython#3303.
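The silent drop this PR guards against can be reproduced outside IPython with a minimal ROUTER/DEALER pair (a sketch, assuming pyzmq and libzmq >= 3; the identity, endpoint, and message counts are made up):

```python
import time
import zmq

ctx = zmq.Context()
router = ctx.socket(zmq.ROUTER)
dealer = ctx.socket(zmq.DEALER)
dealer.setsockopt(zmq.IDENTITY, b"engine-0")

# Small HWMs make the drop easy to trigger; the default of 1000
# produces the same effect once more than ~1000 messages queue up.
router.sndhwm = 10
dealer.rcvhwm = 10

router.bind("inproc://hwm-demo")
dealer.connect("inproc://hwm-demo")
time.sleep(0.1)  # give the ROUTER time to learn the peer identity

sent = 500
for i in range(sent):
    # A ROUTER never blocks: once the peer's pipe is full,
    # further messages are dropped silently.
    router.send_multipart([b"engine-0", str(i).encode()])

received = 0
while dealer.poll(100):
    dealer.recv_multipart()
    received += 1
print("received %d of %d" % (received, sent))
```

With both HWMs set to 0 instead, all 500 messages arrive, which is what the PR does for the schedulers' relay sockets.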
I run an IPython cluster and observe very weird behavior. When there are too many tasks, at some point the cluster stops completely. I was able to narrow the issue down to this:
This was done with the Python scheduler, and hwm set to 0, to partially mitigate the effect of #3106. I see no engines unregistering and nothing out of the ordinary in the logs. The controller, kernel, and engine CPU loads stay minimal when this happens, so it cannot be just a slowdown. The issue occurs for short tasks, as in the example above, but also when a single task takes ~50 sec.
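For reference, the hwm=0 mitigation mentioned above is a scheduler setting in the cluster profile; a sketch of the relevant config fragment (file location assumed, as the original snippet is not shown):

```python
# ipcontroller_config.py in the cluster's profile directory
c = get_config()

# 0 disables the per-engine high-water mark in the Python task
# scheduler, so tasks are assigned greedily instead of being
# held back on the controller.
c.TaskScheduler.hwm = 0
```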
I can investigate the issue more, but I don't quite know where to start.