ipcluster does not start all the engines #24
Comments
[ LP comment 1 by: Vishal Vatsa, on 2010-01-18 09:36:37.944247+00:00 ] Hi Toby, Which version of IPython are you using? Thanks, |
[ LP comment 2 by: Vishal Vatsa, on 2010-01-18 15:30:05+00:00 ] IPython version info from user. ---------- Forwarded message ---------- Hi Vishal, Thanks for looking at it. I’m using 0.10 with Python 2.5.4: Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] IPython 0.10 -- An enhanced Interactive Python. --Toby P.S. I tried to join the bug list, separate from the user list I guess. No response. |
[ LP comment 3 by: Brian Granger, on 2010-01-18 16:41:12+00:00 ]
Great, glad it is useful to you.
I think I know what the issue is here. We have found that sometimes The good news is that all of this is fixed in trunk (ipcluster is much In the mean time, I would suggest looking through ipcluster.py - you Cheers, Brian
Brian E. Granger, Ph.D. |
[ LP comment 4 by: Vishal Vatsa, on 2010-01-18 21:21:06+00:00 ] 2010/1/18 Brian Granger ellisonbg@gmail.com:
Yep, this sounds about correct. Toby, which OS are you on? Could the filesystem semantics be different? I have not been able to replicate this so far on a NAS-backed cluster. Regards, |
[ LP comment 5 by: Brian Granger, on 2010-01-18 22:38:16+00:00 ]
Yes, even though that method exists, we have sometimes observed it to fail.
The problems we had with this were on Windows.
Brian E. Granger, Ph.D. |
[ LP comment 6 by: Toby Burnett, on 2010-01-18 23:13:58.126879+00:00 ] This was on Linux. Looking at the ipcluster code (version 0.10), I see that the ssh mode is not supported for Windows to Linux, a case that we would like to use. Hopefully it will be with the new version, when you get it to work. |
[ LP comment 7 by: Brian Granger, on 2010-01-18 23:25:06+00:00 ] Is there an ssh daemon that is usable on Windows? I know about putty Brian On Mon, Jan 18, 2010 at 3:13 PM, Toby Burnett tburnett@uw.edu wrote:
Brian E. Granger, Ph.D. |
[ LP comment 8 by: Vishal Vatsa, on 2010-01-18 23:39:09+00:00 ] I have used cygwin and openssh to get a shell on winXP even X11 2010/1/18 Brian Granger ellisonbg@gmail.com:
|
[ LP comment 9 by: Brian Granger, on 2010-01-19 00:00:11+00:00 ] What about non-cygwin Windows? On Mon, Jan 18, 2010 at 3:39 PM, Vishal Vatsa vishal.vatsa@gmail.com wrote:
Brian E. Granger, Ph.D. |
[ LP comment 10 by: Toby Burnett, on 2010-01-19 15:56:55.304633+00:00 ] About ssh on Windows: we use the cygwin version, but not in the cygwin bash shell. I tried a version of my script that starts an ipcontroller, then 16 ipengines on each of the 4 machines. If I delay after the ipcontroller, I have the same problem: 2-3 engines don't get started per machine. However, a 100 ms delay in the loop that creates the engines works just fine. This seems inconsistent with an assertion about how the engines connect to a controller. |
[ LP comment 11 by: Vishal Vatsa, on 2010-01-19 16:30:54+00:00 ] Would you mind sending me a copy of your script. -v 2010/1/19 Toby Burnett tburnett@uw.edu:
|
[ LP comment 12 by: Toby Burnett, on 2010-01-19 18:40:49.459167+00:00 ] Here is the script, with an attempt to use ipcluster ssh commented out. The 64-engine case is run on one of the same nodes.
|
The clusterfile is no longer in use, but the newparallel code does support SSH launching of engines. See here for details: Unfortunately, it's possible that a problem in Twisted was causing part of this. In newparallel/master we've moved away from Twisted completely, so I'm closing this bug. I'm sorry that the transition is going to be somewhat painful for some users of the Twisted code, but it's simply not practical to maintain the Twisted codebase for the long haul. |
Original Launchpad bug 509015: https://bugs.launchpad.net/ipython/+bug/509015
Reported by: fdo.perez (Fernando Perez).
As reported on the mailing list...
---------- Forwarded message ----------
From: Toby Burnett tburnett@EMAIL-REMOVED
Date: Sat, Jan 16, 2010 at 8:58 AM
Subject: [IPython-user] ipcluster does not start all the engines
To: "ipython-user@scipy.org" ipython-user@scipy.org
Hi,
I did not find any previous notes on this.
I have a cluster of 4 machines, each with 8 hyperthreaded cores, so I can run 16 engines per machine, or 64 in all. It is amazingly easy and useful, thanks so much for providing this.
However, when using ipcluster on one of these machines in ssh mode, with this clusterfile,
send_furl = False
engines = { 'tev1' : 16,
            'tev2' : 16,
            'tev3' : 16,
            'tev4' : 16
          }
I typically get about 50 engines to actually start. Since there seems to be no log file for ipcluster (in spite of code that seems like it should record which engines it tried to start), I can't send that. The ipcontroller log file looks fine, except for recording fewer than the 64 engines that I expected.
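To narrow down which machines come up short, the expected counts from the clusterfile can be compared with the hosts of the engines that actually registered. The following is only a minimal sketch: `missing_engines` is a hypothetical helper, not part of IPython, and how the list of registered hostnames is obtained (for example from the controller's records) is an assumption left to the reader.

```python
from collections import Counter


def missing_engines(expected, registered_hosts):
    """Return {hostname: shortfall} for hosts with fewer engines than expected.

    `expected` has the same shape as the clusterfile's `engines` dict.
    `registered_hosts` is a hypothetical input: one hostname per engine
    that successfully registered with the controller.
    """
    seen = Counter(registered_hosts)
    return {host: want - seen[host]
            for host, want in expected.items()
            if seen[host] < want}
```

Calling it with the `engines` dict above and the observed hostnames yields an empty dict when all 64 engines registered, and a per-host shortfall otherwise.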
I have an alternative, very klugy method that starts a controller, then executes 64 ssh commands to the respective machines to simply run ipengine. I found the same problem, which went away when I introduced a 1 second delay after each ssh call, which of course takes more than a minute to run, and leaves all those ssh processes running.
So I suspect that the same thing would work in the loop in this method of ipcluster.SSHEngineSet
def _ssh_engine(self, hostname, count):
    exec_engine = "ssh %s sh %s/%s-sshx.sh %s" % (
        hostname, self.temp_dir,
        os.environ['USER'], self.engine_command
    )
    cmds = exec_engine.split()
    dlist = []
    log.msg("about to start engines...")
    for i in range(count):
        log.msg('Starting engines: %s' % exec_engine)
        d = getProcessOutput(cmds[0], cmds[1:], env=os.environ)
        dlist.append(d)
    return gatherBoth(dlist, consumeErrors=True)
but that would be inelegant, given that the real problem is probably related to the controller not responding properly to multiple requests.
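For illustration, the hand-rolled workaround described above (one ssh call per engine, with a pause after each) could be sketched as follows. This is a minimal sketch and not ipcluster code: `launch_engines_staggered` and its parameters are hypothetical names, it assumes `ipengine` is on the remote PATH and that ssh needs no password, and it treats the symptom rather than whatever causes engines to be dropped.

```python
import subprocess
import time


def launch_engines_staggered(engines, engine_command="ipengine", delay=0.1,
                             spawn=subprocess.Popen, sleep=time.sleep):
    """Start `count` engines on each host over ssh, pausing between launches.

    `engines` maps hostname -> engine count, like the clusterfile's dict.
    `spawn` and `sleep` are injectable (an assumption of this sketch) so the
    loop can be exercised without actually opening ssh connections.
    """
    procs = []
    for hostname, count in sorted(engines.items()):
        for _ in range(count):
            # One ssh process per engine, as in the hand-rolled script.
            procs.append(spawn(["ssh", hostname, engine_command]))
            # The pause between launches is the whole workaround.
            sleep(delay)
    return procs
```

With the clusterfile's `engines` dict and `delay=1.0`, this reproduces the 64 ssh calls that take a little over a minute; it also leaves one ssh process running per engine, which is the drawback already noted.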
Thanks for looking at this.
--Toby Burnett