
IPython Cluster (SGE) Registration Timeouts #8569

Closed
jakirkham opened this issue Jun 24, 2015 · 11 comments

@jakirkham
Contributor

I am trying to debug sporadic registration timeouts on the engines. In this Gist, I have included the relevant config files and sample output ( https://gist.github.com/jakirkham/b0452178331db511dd0d ). All other config files are simply the result of running ipython profile create --parallel --profile=sge.
To provide more information, this is on a CentOS 6.6 VM on a single machine; as such, there is no need to worry about connectivity between the jobs. The queue has 7 jobs in it in this case and has been configured to limit the number of running jobs to the number of accessible cores. However, I have run into the same problem with fewer than 7 running jobs as well. All of the jobs are able to start and run successfully. I don't believe this is a resource issue, as I have repeatedly run heavy-duty machine learning algorithms in the VM without error.

As it is sporadic, I am wondering whether timing differences between the engines communicating with the controller could be causing the problem. For example, all the engines slam the controller at the same time, leaving the controller unable to respond, and this continues up to the timeout limit. Unfortunately, I have had trouble finding more information about parameters that could introduce delays between engine registration attempts, or similar, to test this hypothesis.

Any pointers would be appreciated.
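
If such parameters exist, I would expect them to look something like the following (a sketch only; the option names are taken from the generated ipengine_config.py and ipcluster_config.py, and the values here are hypothetical, so they should be double-checked against your IPython version):

# ipengine_config.py: how long an engine waits for the controller to answer
# its registration request before giving up
c.EngineFactory.timeout = 30.0
# ipcluster_config.py: delay (in seconds) between starting the controller
# and starting the engines
c.IPClusterStart.delay = 10.0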

@jakirkham
Contributor Author

At this point, I can start up the cluster reliably the first time; however, every time afterwards it fails. So I am revising my theory: something gets generated and is not properly removed, which stops the engines from registering with the cluster properly.

@jakirkham
Contributor Author

I think the solution was provided by @minrk in an old mailing list post ( http://mail.scipy.org/pipermail/ipython-user/2011-November/008741.html ), but I need to test this further. I believe it is lingering files in the profile's security folder that are causing the engines not to register properly.

@jakirkham
Contributor Author

After some testing, I have determined that these lingering files in the security folder are, in fact, the cause of the problem.

@jakirkham
Contributor Author

In particular, I propose the following: once the cluster is shut down, ipcontroller-engine.json and ipcontroller-client.json should be deleted. Starting up the cluster afterwards should then work regardless of what type of cluster it was.
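
For anyone who wants to clear these by hand in the meantime, a minimal cleanup sketch (assuming the sge profile lives under the default ~/.ipython directory; adjust the path if your IPython directory is elsewhere):

# remove any stale controller connection files from the profile's security folder
import glob, os
for path in glob.glob(os.path.expanduser('~/.ipython/profile_sge/security/ipcontroller-*.json')):
    os.remove(path)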

@jakirkham
Contributor Author

Does this belong here ( https://github.com/ipython/ipyparallel )?

@jakirkham
Contributor Author

@takluyver, I wanted to bring this other SGE issue to your attention. I already know the solution, but I am not sure where it belongs. Any direction you could give would be appreciated.

@minrk changed the title from "iPython Cluster (SGE) Registration Timeouts" to "IPython Cluster (SGE) Registration Timeouts" on Jul 27, 2015
@minrk
Member

minrk commented Jul 27, 2015

Yes, it does make sense to do this on the new ipyparallel repo. On a clean shutdown, the connection files are already cleaned up. The controller doesn't have the opportunity to do this if it is brought down forcefully, though.

@jakirkham
Contributor Author

Ok, I will move this once I'm back at my laptop.

What do you mean by forcefully? Currently, I am starting and stopping the IPython cluster like this (I can provide the ipcluster config file if you wish):

ipcluster start --daemon --profile=sge
ipcluster stop --profile=sge

@minrk
Member

minrk commented Jul 27, 2015

Is the controller started with SGE as well, or is it started as a normal local process? When the controller is started with SGE, I believe ipcluster stop uses qdel, which probably kills it ungracefully.

@jakirkham
Contributor Author

Yes, the controller and engines are submitted to SGE. I see. So, it doesn't send a message to the engines to terminate. Does the controller wait until all of the engines are terminated?
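
For reference, the SGE launcher configuration in ipcluster_config.py is of this form (a sketch using the standard launcher shorthands; the full config files are in the Gist linked in the first comment):

# submit both the controller and the engines through SGE
c.IPClusterStart.controller_launcher_class = 'SGE'
c.IPClusterEngines.engine_launcher_class = 'SGE'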

@jakirkham
Contributor Author

Moved to ipython/ipyparallel#21.
