ipcontroller process goes to 100% CPU, ignores connection requests #3795

Closed
lbeltrame opened this issue Jul 26, 2013 · 9 comments

@lbeltrame

Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster running SGE.

I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process becomes overloaded, goes to 100% CPU, and stops answering requests from the engines. It does not matter whether all jobs are sent at once or in delayed batches: ultimately I still observe this issue.

Setting different values of TaskScheduler.hwm has no effect.
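
For reference, the setting I mean is the controller-side high-water mark, e.g. in ipcontroller_config.py of the active profile (the value here is just an example):

```python
# ipcontroller_config.py
c = get_config()

# maximum number of unfinished tasks handed to each engine before the
# scheduler waits for results; 0 removes the limit entirely
c.TaskScheduler.hwm = 1
```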

Attaching gdb to the runaway process shows

(gdb) py-list
  92                timeout = -1
  93            
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98    

Called in /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.
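
For reference, attaching gdb to the controller looks roughly like this (assuming the CPython gdb helpers are loaded, e.g. via Debian's python2.7-dbg package, so that the py-* commands are available):

```
gdb -p <ipcontroller-pid>
(gdb) py-list    # Python source around the current frame
(gdb) py-bt      # full Python-level backtrace, if needed
```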

Example log: https://gist.github.com/lbeltrame/6087409

This happens with:

ipython master (from yesterday)
pyzmq latest stable or latest git (happens with both)
Python 2.7.3

Some details on the SW/HW setup as well:

32 diskless machines + 2 diskful machines (A and B). A provides the OS images for the 32 diskless machines, and B provides the storage area (both over NFS). All run Debian 7. IPython and friends are installed through git / pip (no virtualenv).

All the processing happens on the shared NFS storage. When this issue occurs, network traffic isn't that high (it's a 1 Gbit network) and I/O is negligible as well.

Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60

@minrk
Member

minrk commented Jul 28, 2013

Try staggering engine connections, so they aren't all trying to connect at once.
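
A rough sketch of what staggering might look like when starting engines by hand (with ipython-cluster-helper the engines are submitted through SGE, so the real mechanism differs; the delay and engine count are only illustrative):

```python
import subprocess
import time

n_engines = 300        # illustrative
stagger_seconds = 2    # illustrative delay between engine starts

engines = []
for _ in range(n_engines):
    # start one engine at a time instead of submitting them all at once
    engines.append(subprocess.Popen(["ipengine"]))
    time.sleep(stagger_seconds)
```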

@lbeltrame
Author

On Saturday, 27 July 2013 at 17:57:10, Min RK wrote:

Try staggering engine connections, so they aren't all trying to connect at
once.

I've been trying this (even with long stagger times), but in the end, after about 250 connections, the process goes haywire.

When I say "250 connections", BTW, I mean connections acknowledged by the engine: the processes have all (or almost all) already registered with their IDs.

This does not occur if I'm requesting fewer than about 300 processes.

@minrk
Member

minrk commented Jul 28, 2013

Might be running into FD limits, then. You can adjust this with ulimit, if you have permission.
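
For example, the limit can be checked (and the soft limit raised) from Python via the resource module; the shell equivalent is `ulimit -n` in the shell that launches the controller and engines:

```python
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft=%d, hard=%d" % (soft, hard))

# raise the soft limit up to the hard limit for this process;
# going beyond the hard limit needs root / limits.conf changes
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```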

@chapmanb
Contributor

@minrk -- great idea, I totally missed thinking of that one. What is the expected behavior of pyzmq/IPython.parallel when you hit file handle limits? Will it error out? Is there a good place to look for logs on this?

@minrk
Member

minrk commented Jul 28, 2013

It depends on which call failed to allocate an FD. If it is one of the calls inside libzmq (most likely), one or more Scheduler processes may actually SIGABRT and die.

@lbeltrame
Author

On Sunday, 28 July 2013 at 10:14:24, Min RK wrote:

Might be running into FD limits, then. You can adjust this with ulimit,
if you have permission.

Initially that was the case (it used to throw exceptions), but I have since set the FD limit to roughly 20K for the user that runs this. Nevertheless, I'm going to experiment with this some more.

@lbeltrame
Author

After a long debugging session, I realized that SGE does not honor the ulimits set by the system and instead requires separate configuration: fixing that made the controller work properly. Therefore this bug report is invalid. Closing.
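
For anyone hitting the same wall: one place to raise the per-job descriptor limits is Grid Engine's cluster configuration (a sketch only; the exact execd_params names can vary between SGE versions, so check sge_conf(5) for your installation):

```
# edit the global cluster configuration
qconf -mconf

# in execd_params, raise the descriptor limits applied to jobs, e.g.
execd_params   S_DESCRIPTORS=20000 H_DESCRIPTORS=20000
```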

@minrk
Member

minrk commented Aug 1, 2013

Thanks for the report, and sorry for the trouble. Hopefully your findings will help someone else.

@jankatins
Contributor

If there is a way to detect such errors (no idea ...), it would be nice if IPython (or ZMQ) could warn the user about such problems.
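
One conceivable check (a rough sketch only, not an existing IPython feature; the descriptors-per-engine estimate is made up) would be to compare the expected number of engine connections with the process's descriptor limit when the controller starts:

```python
import resource
import warnings

def warn_if_fd_limit_low(expected_engines, fds_per_engine=4):
    """Warn when the soft FD limit looks too small for the expected engines."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = expected_engines * fds_per_engine
    if soft != resource.RLIM_INFINITY and needed > soft:
        warnings.warn(
            "expecting roughly %d file descriptors but the soft limit is %d; "
            "consider raising it (ulimit -n) or staggering engine starts"
            % (needed, soft)
        )

warn_if_fd_limit_low(expected_engines=600)
```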
