ipcontroller process goes to 100% CPU, ignores connection requests #3795
Comments
Try staggering engine connections, so they aren't all trying to connect at once.
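A minimal sketch of what "staggering" could look like: start engines one at a time with a delay between launches instead of submitting them all simultaneously. The function name, the `delay` value, and the injectable `launcher` are illustrative, not part of IPython's API.

```python
import subprocess
import time


def launch_staggered(n_engines, delay=0.5, launcher=None):
    """Start engines one at a time, sleeping between launches so the
    controller is not hit by a burst of simultaneous connections.

    `launcher` is injectable for testing; by default it spawns `ipengine`.
    """
    if launcher is None:
        launcher = lambda: subprocess.Popen(["ipengine"])
    procs = []
    for _ in range(n_engines):
        procs.append(launcher())
        time.sleep(delay)  # give the controller time to register this engine
    return procs
```

With a batch system like SGE, the same effect can be had by submitting engine jobs in small batches with a pause between submissions.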
On Saturday 27 July 2013 at 17:57:10, Min RK wrote:
I've been trying this (even with long stagger times), but in the end the problem still appears after about 250 connections. When I say "250 connections", BTW, I mean the point at which the engine acknowledges the connection. This does not occur if I'm requesting fewer than about 300 processes.
Might be running into FD limits, then. You can adjust this with `ulimit`.
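The file-descriptor limit can also be inspected and raised from within a Python process via the standard `resource` module — a sketch of the check:

```python
import resource

# Inspect the per-process file-descriptor limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("FD limit: soft=%d, hard=%d" % (soft, hard))

# An unprivileged process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root (or system configuration).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Each engine connection consumes descriptors on the controller side, so the soft limit needs to comfortably exceed the number of engines.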
@minrk -- great idea, I totally missed thinking of that one. What is the expected behavior of pyzmq/IPython parallel when you hit file handle limits? Will it error out? Any good place to look for logs on this?
It depends on which call failed to allocate an FD. If it is one of the calls inside libzmq (most likely), one or more Scheduler processes may actually SIGABRT and die.
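For illustration only (this is not IPython's actual code path): hitting the descriptor limit surfaces as `EMFILE` ("Too many open files"), which is the same errno libzmq sees when socket creation fails. The limit value of 256 here is arbitrary.

```python
import errno
import os
import resource

# Temporarily lower the soft FD limit, then consume descriptors until the
# kernel refuses -- this is what "failed to allocate an FD" looks like.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (256, hard))

fds, caught = [], None
try:
    while True:
        fds.append(os.open(os.devnull, os.O_RDONLY))  # eat descriptors
except OSError as e:
    caught = e  # EMFILE: "Too many open files"
finally:
    for fd in fds:
        os.close(fd)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

print(errno.errorcode[caught.errno])  # → 'EMFILE'
```

Python-level code gets a catchable `OSError`/`ZMQError`; a failure deep inside a libzmq internal thread is what can abort the whole Scheduler process.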
On Sunday 28 July 2013 at 10:14:24, Min RK wrote:
Initially it was like that (it used to throw exceptions), but now I have set the limits higher.
After a long debugging session, I realized that SGE does not honor ulimits set by the system, and instead requires a separate configuration: fixing that made the controller work properly. Therefore this bug report is invalid. Closing. |
Thanks for the report, and sorry for the trouble. Hopefully your findings will help someone else. |
If there is a way to detect such errors (no idea how), it would be nice if IPython (ZMQ) could warn the user about such problems.
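One cheap detection strategy (a hypothetical helper, not part of IPython): at controller startup, compare the soft FD limit against the expected number of engine connections and warn if it looks too low. The function name and `headroom` margin are assumptions.

```python
import resource
import warnings


def check_fd_limit(expected_connections, headroom=64):
    """Hypothetical startup check: warn when the soft FD limit is too low
    for the number of engine connections we expect to accept."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = expected_connections + headroom
    if soft != resource.RLIM_INFINITY and soft < needed:
        warnings.warn(
            "soft RLIMIT_NOFILE (%d) is below the ~%d descriptors needed "
            "for %d connections; raise it with `ulimit -n` (and check that "
            "your batch system actually honors it)."
            % (soft, needed, expected_connections))
        return False
    return True
```

This would not catch a failure inside libzmq after the fact, but it turns the most common cause into an actionable warning before engines start connecting.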
Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster running SGE.
I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process overloads: it goes to 100% CPU and stops answering requests from the engines. It does not matter whether all jobs are sent at once or in delayed batches; I still observe the issue.
Setting different values of `TaskScheduler.hwm` has no effect. Attaching gdb to the runaway process shows the call originates in `/usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py`. Example log: https://gist.github.com/lbeltrame/6087409
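For reference, the `hwm` values I tried were set in `ipcontroller_config.py` — a sketch assuming the standard IPython parallel config mechanism (the value shown is just an example):

```python
# ipcontroller_config.py
c = get_config()

# High-water mark for the task scheduler's outgoing queues:
# 0 means unlimited buffering; small values keep unassigned tasks
# queued at the scheduler instead of at the engines.
c.TaskScheduler.hwm = 1
```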
This happens with:

- IPython master (from yesterday)
- pyzmq latest stable or latest git (happens with both)
- Python 2.7.3
Some details on the SW/HW setup as well:
32 diskless machines + 2 diskful machines (A and B). A serves the OS images for the 32, and B serves the storage area (NFS in both cases). All run Debian 7. IPython etc. installed through git/pip (no virtualenv).
All the processing happens on the shared NFS storage. When this issue occurs, network traffic isn't that high (it's a 1 Gbit network) and I/O is negligible as well.
Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60