ipcontroller process goes to 100% CPU, ignores connection requests #3795

Closed
lbeltrame opened this issue Jul 26, 2013 · 9 comments

@lbeltrame

Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster running SGE.

I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process becomes overloaded, goes to 100% CPU, and stops answering requests from the engines. It does not matter whether all jobs are sent at once or in delayed batches: ultimately I still observe this issue.

Setting different values of TaskScheduler.hwm has no effect.
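
For reference, the setting I mean is the controller-side high-water mark, e.g. in ipcontroller_config.py of the active profile (the value here is just an example):

```python
# ipcontroller_config.py
c = get_config()

# maximum number of unfinished tasks handed to each engine before the
# scheduler waits for results; 0 removes the limit entirely
c.TaskScheduler.hwm = 1
```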

Attaching gdb to the runaway process shows

(gdb) py-list
  92                timeout = -1
  93            
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98    

Called in /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.
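
For reference, attaching gdb to the controller looks roughly like this (assuming the CPython gdb helpers are loaded, e.g. via Debian's python2.7-dbg package, so that the py-* commands are available):

```
gdb -p <ipcontroller-pid>
(gdb) py-list    # Python source around the current frame
(gdb) py-bt      # full Python-level backtrace, if needed
```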

Example log: https://gist.github.com/lbeltrame/6087409

This happens with:

ipython master (from yesterday)
pyzmq latest stable or latest git (happens with both)
Python 2.7.3

Some details on the SW/HW setup as well:

32 diskless machines + 2 diskful machines (A and B). A provides the OS images for the 32 diskless machines, and B provides the storage area (both over NFS). All run Debian 7. IPython and friends are installed through git / pip (no virtualenv).

All the processing happens on the shared NFS storage. When this issue occurs, network traffic isn't that high (it's a 1 Gbit network) and I/O is negligible as well.

Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60

@minrk
Member

minrk commented Jul 28, 2013

Try staggering engine connections, so they aren't all trying to connect at once.
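
A rough sketch of what staggering might look like when starting engines by hand (with ipython-cluster-helper the engines are submitted through SGE, so the real mechanism differs; the delay and engine count are only illustrative):

```python
import subprocess
import time

n_engines = 300        # illustrative
stagger_seconds = 2    # illustrative delay between engine starts

engines = []
for _ in range(n_engines):
    # start one engine at a time instead of submitting them all at once
    engines.append(subprocess.Popen(["ipengine"]))
    time.sleep(stagger_seconds)
```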

@lbeltrame
Author

On Saturday, 27 July 2013 at 17:57:10, Min RK wrote:

Try staggering engine connections, so they aren't all trying to connect at
once.

I've been trying this (even with long stagger times), but in the end, after about 250 connections, the process goes haywire.

When I say "250 connections", BTW, I mean connections acknowledged by the engine: the processes have all (or almost all) already registered with their IDs.

This does not occur if I'm requesting fewer than about 300 processes.

@minrk
Member

minrk commented Jul 28, 2013

Might be running into FD limits, then. You can adjust this with ulimit, if you have permission.
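
For example, the limit can be checked (and the soft limit raised) from Python via the resource module; the shell equivalent is `ulimit -n` in the shell that launches the controller and engines:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# raise the soft limit up to the hard limit for this process;
# going beyond the hard limit needs root / limits.conf changes
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```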

@chapmanb
Contributor

@minrk -- great idea, I totally missed thinking of that one. What is the expected behavior of pyzmq/IPython.parallel when you hit file handle limits? Will it error out? Is there a good place to look for logs on this?

@minrk
Member

minrk commented Jul 28, 2013

It depends on which call failed to allocate an FD. If it is one of the calls inside libzmq (most likely), one or more Scheduler processes may actually SIGABRT and die.

@lbeltrame
Author

On Sunday, 28 July 2013 at 10:14:24, Min RK wrote:

Might be running into FD limits, then. You can adjust this with ulimit,
if you have permission.

Initially that was the case (it used to throw exceptions), but I have since set the FD limit to roughly 20K for the user that runs this. Nevertheless, I'm going to experiment with this some more.

@lbeltrame
Author

After a long debugging session, I realized that SGE does not honor the ulimits set by the system and instead requires separate configuration: fixing that made the controller work properly. Therefore this bug report is invalid. Closing.
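
For anyone hitting the same wall: one place to raise the per-job descriptor limits is Grid Engine's cluster configuration (a sketch only; the exact execd_params names can vary between SGE versions, so check sge_conf(5) for your installation):

```
# edit the global cluster configuration
qconf -mconf

# in execd_params, raise the descriptor limits applied to jobs, e.g.
execd_params   S_DESCRIPTORS=20000 H_DESCRIPTORS=20000
```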

@minrk
Member

minrk commented Aug 1, 2013

Thanks for the report, and sorry for the trouble. Hopefully your findings will help someone else.

@jankatins
Contributor

If there is a way to detect such errors (no idea ...), it would be nice if IPython (or ZMQ) could warn the user about such problems.
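
One conceivable check (a rough sketch only, not an existing IPython feature; the descriptors-per-engine estimate is made up) would be to compare the expected number of engine connections with the process's descriptor limit when the controller starts:

```python
import resource
import warnings

def warn_if_fd_limit_low(expected_engines, fds_per_engine=4):
    """Warn when the soft FD limit looks too small for the expected engines."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = expected_engines * fds_per_engine
    if soft != resource.RLIM_INFINITY and needed > soft:
        warnings.warn(
            "expecting roughly %d file descriptors but the soft limit is %d; "
            "consider raising it (ulimit -n) or staggering engine starts"
            % (needed, soft)
        )

warn_if_fd_limit_low(expected_engines=600)
```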
