Creation of ~300+ MPI-spawned engines causes instability in ipcluster #2589

Closed
jamesbarkerau opened this Issue Nov 16, 2012 · 2 comments

Hi folks,

I am running IPython 0.13.1 on a relatively new and large-scale cluster.

I have been experimenting with the creation of large clusters (in the IPython sense) using MPIEngineSetLauncher through ipcluster. For smaller numbers of engines (< 200), everything works fine -- the engines are created appropriately on multiple nodes of the physical cluster using mpiexec, and are able to successfully connect to and register with the hub. However, when the number of engines approaches and passes approximately 300, the ipcluster application stops the cluster and terminates during the registration phase (before early_timeout). From inspection of the code, it seems that this is done by calling stop_launchers(), but I cannot provide any more information.

As this is not a crash per se, I cannot provide a crash report. Anecdotally, I have observed (by tailing the relevant log file for the IPControllerApp) that termination occurs after a long string of registration_requests (i.e. hundreds) being received by the IPControllerApp. To me, this looks like a bug in the request queue, but I cannot say for sure.

Please let me know if you need any more information. It isn't a showstopper, but it will limit my ability to scale the use of IPython to substantial tasks.

Owner
minrk commented Nov 16, 2012

My guess is you are hitting FD limits (a downside of using zmq is a large number of FDs per engine). To get better output, try starting the controller separately:
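One way to sanity-check the FD hypothesis before launching is to compare the open-file limit with a rough estimate of what N engines would need. The sketch below uses Python's standard `resource` module (Unix only). The helper name `fd_budget` and the `fds_per_engine` and `reserve` defaults are illustrative assumptions, not measured IPython figures, and the limit reported is that of the process running the check, so it should be run from the same shell that will start the controller.

```python
import resource


def fd_budget(n_engines, fds_per_engine=10, reserve=64):
    """Compare a rough FD estimate with this process's open-file limit.

    fds_per_engine and reserve are placeholder guesses (assumptions);
    measure the real per-engine cost on your own system.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    needed = n_engines * fds_per_engine + reserve
    # RLIM_INFINITY means "no limit", so any estimate fits.
    unlimited = soft == resource.RLIM_INFINITY
    return {
        "soft": soft,
        "hard": hard,
        "needed": needed,
        "enough": unlimited or needed <= soft,
    }


if __name__ == "__main__":
    report = fd_budget(300)
    print("soft limit:", report["soft"], "hard limit:", report["hard"])
    print("estimated need:", report["needed"], "enough:", report["enough"])
```

If `enough` comes back false, raising the soft limit with `ulimit -n` in the launching shell (up to the hard limit) is the usual remedy.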

Instead of ipcluster start -n N, run ipcontroller in one shell, then ipcluster engines -n N in another. Nothing is different; it's just easier to see the controller output this way, and a controller crash won't trigger stop_launchers().

jamesbarkerau commented

Thanks for the advice. You're right, it was related to the system running out of FDs; a quick change to the ulimit setting made the problem go away. There is still some instability when creating large numbers of engines (I've just spooled up 600, of which only 598 managed to register, leaving two orphaned processes in the MPI universe and preventing automated shutdown using Client().shutdown(targets='all')), but that's definitely much better. :)
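A partial registration like 598 of 600 can be detected from the client side by polling the list of registered engine IDs until it reaches the expected count or a deadline passes. The following is a minimal sketch: `wait_for_engines` is a hypothetical helper, not part of IPython, and it only assumes the client object exposes an `ids` attribute listing registered engines, as `IPython.parallel.Client` does.

```python
import time


def wait_for_engines(client, expected, timeout=60.0, poll=1.0):
    """Poll client.ids until `expected` engines are registered or time runs out.

    Returns the number of engines registered when polling stopped.
    `client` is any object with an `ids` attribute (assumption: an
    IPython.parallel Client connected to the running controller).
    """
    deadline = time.time() + timeout
    while True:
        registered = len(client.ids)
        if registered >= expected or time.time() >= deadline:
            return registered
        time.sleep(poll)


# Hypothetical usage against a running cluster (0.13-era import path):
#   from IPython.parallel import Client
#   rc = Client()
#   n = wait_for_engines(rc, 600, timeout=300)
#   if n < 600:
#       print("only %d of 600 engines registered" % n)
```

Knowing the shortfall up front makes it possible to decide whether to proceed with the engines that did register or to clean up the stray MPI processes by hand before retrying.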

minrk closed this Jan 19, 2013