ipcluster does not start all the engines #24

Closed
ipython opened this issue May 10, 2010 · 13 comments

ipython commented May 10, 2010

Original Launchpad bug 509015: https://bugs.launchpad.net/ipython/+bug/509015
Reported by: fdo.perez (Fernando Perez).

As reported on the mailing list...

---------- Forwarded message ----------
From: Toby Burnett tburnett@EMAIL-REMOVED
Date: Sat, Jan 16, 2010 at 8:58 AM
Subject: [IPython-user] ipcluster does not start all the engines
To: "ipython-user@scipy.org" ipython-user@scipy.org

Hi,
I did not find any previous notes on this.
I have a cluster of 4 machines, each with 8 hyperthreaded cores, so I can run 16 engines per machine, or 64 in all. It is amazingly easy and useful, thanks so much for providing this.

However, when using ipcluster on one of these machines in ssh mode, with this clusterfile,
send_furl = False
engines = { 'tev1' : 16,
            'tev2' : 16,
            'tev3' : 16,
            'tev4' : 16
}

I typically get about 50 engines to actually start. Since there seems to be no log file for ipcluster (in spite of code that seems like it should record which engines it tried to start), I can't send that. The ipcontroller log file looks fine, except for recording fewer than the 64 engines that I expected.

I have an alternative, very kludgy method that starts a controller, then executes 64 ssh commands to the respective machines to simply run ipengine. I found the same problem, which went away when I introduced a one-second delay after each ssh call, which of course takes more than a minute to run and leaves all those ssh processes running.

So I suspect that the same thing would work in the loop in this method of ipcluster.SSHEngineSet:

  def _ssh_engine(self, hostname, count):
      exec_engine = "ssh %s sh %s/%s-sshx.sh %s" % (
          hostname, self.temp_dir,
          os.environ['USER'], self.engine_command
      )
      cmds = exec_engine.split()
      dlist = []
      log.msg("about to start engines...")
      for i in range(count):
          log.msg('Starting engines: %s' % exec_engine)
          d = getProcessOutput(cmds[0], cmds[1:], env=os.environ)
          dlist.append(d)
      return gatherBoth(dlist, consumeErrors=True)

but that would be inelegant, given that the real problem is probably related to the controller not responding properly to multiple requests.

Thanks for looking at this.

--Toby Burnett
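
The loop above fires all of the ssh spawns in the same reactor tick. For illustration, here is a sketch of the staggered variant being hinted at, using Twisted's task.deferLater to space out the spawns. This is not IPython's actual fix: _ssh_engine_staggered is a hypothetical name, and self.temp_dir / self.engine_command are simply the attributes used in the snippet above.

    import os

    from twisted.internet import reactor, task
    from twisted.internet.defer import gatherBoth
    from twisted.internet.utils import getProcessOutput

    def _ssh_engine_staggered(self, hostname, count, delay=0.1):
        # Same ssh command line as the _ssh_engine method quoted above.
        exec_engine = "ssh %s sh %s/%s-sshx.sh %s" % (
            hostname, self.temp_dir,
            os.environ['USER'], self.engine_command
        )
        cmds = exec_engine.split()
        dlist = []
        for i in range(count):
            # deferLater schedules the i-th spawn i*delay seconds out, so
            # engines trickle in instead of all hitting the controller at once.
            d = task.deferLater(reactor, i * delay, getProcessOutput,
                                cmds[0], cmds[1:], env=os.environ)
            dlist.append(d)
        return gatherBoth(dlist, consumeErrors=True)

The 0.1-second spacing matches the 100 ms delay reported as sufficient in LP comment 10 below.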


ipython commented May 10, 2010

[ LP comment 1 by: Vishal Vatsa, on 2010-01-18 09:36:37.944247+00:00 ]

Hi Toby,

Which version of ipython are you using?

Thanks,
-vishal


ipython commented May 10, 2010

[ LP comment 2 by: Vishal Vatsa, on 2010-01-18 15:30:05+00:00 ]

IPython version info from the user.

---------- Forwarded message ----------
From: Toby Burnett tburnett@uw.edu
Date: 2010/1/18
Subject: RE: [IPython-user] ipcluster does not start all the engines
To: "vishal.vatsa@gmail.com" vishal.vatsa@gmail.com

Hi Vishal,

Thanks for looking at it. I’m using 0.10 with python 2.5.4

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)]

IPython 0.10 -- An enhanced Interactive Python.

--Toby

P.S. I tried to join the bug list, separate from user list I guess. No response.


ipython commented May 10, 2010

[ LP comment 3 by: Brian Granger, on 2010-01-18 16:41:12+00:00 ]

> Hi,
> I did not find any previous notes on this.
> I have a cluster of 4 machines, each with 8 hyperthreaded cores, so I can run 16 engines per machine, or 64 in all. It is amazingly easy and useful, thanks so much for providing this.

Great, glad it is useful to you.

> However, when using ipcluster on one of these machines in ssh mode, with this clusterfile, [...] I typically get about 50 engines to actually start. [...] I have an alternative, very kludgy method [...] which went away when I introduced a one-second delay after each ssh call [...]

I think I know what the issue is here. We have found that sometimes the engines start up so fast that the controller is not yet up and running. The engines that try to connect before the controller is running fail. Twisted is fully capable of handling many simultaneous connections, so I don't think it is that.

The good news is that all of this is fixed in trunk (ipcluster is much improved). The bad news is that I haven't yet gotten the ssh mode cluster working with the new ipcluster in trunk. It shouldn't be difficult, and Vishal knows this code as well.

In the meantime, I would suggest looking through ipcluster.py - you should be able to put a delay between when the controller is started and when the engines are started.

Cheers,

Brian



ipython commented May 10, 2010

[ LP comment 4 by: Vishal Vatsa, on 2010-01-18 21:21:06+00:00 ]

2010/1/18 Brian Granger ellisonbg@gmail.com:

> I think I know what the issue is here. We have found that sometimes
> the engines start up so fast that the controller is not yet up and
> running. The engines that try to connect before the controller is
> running fail. Twisted is fully capable of handling many simultaneous
> connections, so I don't think it is that.

Yep, this sounds about correct.
Though it should not happen, as there is a _delay_start method which
waits for the furl files to be created on disk; that should be the test
that ipcontroller is up and running.

Toby, which OS are you on? Could the filesystem semantics be different?
(In theory, it should work on windows/cygwin but I have never tested it.)

I have not been able to replicate this so far on a NAS-backed cluster.
If this continues to be an issue for you, I can try to give you a patch
to insert a delay in the ssh engine start.

Regards,
-vishal
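
The _delay_start check described above amounts to polling for the controller's furl file before launching any engines. A minimal standalone sketch of the idea (wait_for_furl is a hypothetical helper, and the ~/.ipython/security/ipcontroller-engine.furl path is an assumption about the 0.10 layout; the real logic lives in ipcluster.py):

    import os
    import time

    def wait_for_furl(furl_path, timeout=30.0, poll=0.1):
        # Block until the controller has written its furl file to disk.
        # Caveat: the file can exist before the controller is actually
        # accepting connections, which fits the failures described next.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if os.path.isfile(furl_path):
                return
            time.sleep(poll)
        raise RuntimeError("furl file never appeared: %s" % furl_path)

    # e.g.:
    # wait_for_furl(os.path.expanduser(
    #     '~/.ipython/security/ipcontroller-engine.furl'))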


ipython commented May 10, 2010

[ LP comment 5 by: Brian Granger, on 2010-01-18 22:38:16+00:00 ]

> Yep, this sounds about correct.
> Though it should not happen, as there is a _delay_start method which
> waits for the furl files to be created on disk; that should be the test
> that ipcontroller is up and running.

Yes, even though that method exists, we have sometimes observed it to fail.

> Toby, which OS are you on? Could the filesystem semantics be different?
> (In theory, it should work on windows/cygwin but I have never tested it.)

The problems we had with this were on Windows.



ipython commented May 10, 2010

[ LP comment 6 by: Toby Burnett, on 2010-01-18 23:13:58.126879+00:00 ]

This was on linux.
So I guess the only issue is a slight delay after starting the ipcontroller. I'll try that, thanks.

Looking at the ipcluster code (version 0.10), I see that the ssh mode is not supported for windows to linux, a case that we would like to use. Hopefully the new version will support it when you get it working.


ipython commented May 10, 2010

[ LP comment 7 by: Brian Granger, on 2010-01-18 23:25:06+00:00 ]

Is there an ssh daemon that is usable on Windows? I know about PuTTY for client ssh and we could use that. Not sure if users would have an ssh daemon on Windows, though. For Windows, the best solution is to use the Windows HPC 2008 OS and its builtin job scheduler (IPython has very good support for this). Only downside is the cost of the OS ;(

Brian



ipython commented May 10, 2010

[ LP comment 8 by: Vishal Vatsa, on 2010-01-18 23:39:09+00:00 ]

I have used Cygwin and OpenSSH to get a shell on WinXP; even X11 port forwarding in Cygwin works :)



ipython commented May 10, 2010

[ LP comment 9 by: Brian Granger, on 2010-01-19 00:00:11+00:00 ]

What about non-cygwin Windows?



ipython commented May 10, 2010

[ LP comment 10 by: Toby Burnett, on 2010-01-19 15:56:55.304633+00:00 ]

About ssh on windows, we use the cygwin version, but not in the cygwin bash shell.
C:\Users\burnett>ssh -V
OpenSSH_3.8.1p1, OpenSSL 0.9.7d 17 Mar 2004

I tried a version of my script that starts an ipcontroller, then 16 ipengines on each of the 4 machines. If I delay after the ipcontroller, I have the same problem: 2-3 engines don't get started per machine. However, a 100 ms delay in the loop that creates the engines works just fine.

This seems inconsistent with the assertion above about how the engines connect to the controller.


ipython commented May 10, 2010

[ LP comment 11 by: Vishal Vatsa, on 2010-01-19 16:30:54+00:00 ]

Would you mind sending me a copy of your script?
Also, is the setup like this:
ipcontroller on windows and ipengines on linux?

-v



ipython commented May 10, 2010

[ LP comment 12 by: Toby Burnett, on 2010-01-19 18:40:49.459167+00:00 ]

Here is the script, with an attempt to use ipcluster ssh commented out. The 64-engine case is run on one of the same nodes.

import os
import time

def setup_mec(engines=None, machines='tev1 tev2 tev3 tev4'.split()):
    """On windows: start cluster and 4 engines on the local machine, in the current directory.
    On linux: (our tev cluster) start controller on local machine, 16 engines/machine in all.
    """
    if os.name == 'nt':
        engines = engines or 4
        os.system(r'start /MIN /D %s cmd /K python C:\python25\scripts\ipcluster local -xy -n %d' % (os.getcwd(), engines))
    else:
        # on a tev machine
        engines = engines or 16  # default on a tev machine!

        #clusterfile_data='send_furl = False'\
        #    + '\nengines={'\
        #    + '\n'.join(['\t"%s" : %d,'%(m,engines) for m in machines])\
        #    + '\n}'
        #print 'cluster info:\n%s' % clusterfile_data
        #ofile=open('clusterfile', 'w')
        #ofile.writelines(clusterfile_data)
        #ofile.close()
        #os.system('ipcluster ssh -xy --clusterfile clusterfile &')

        # old, kludgy way
        os.system('ipcontroller local -xy &')
        for m in machines:
            for i in range(engines):
                time.sleep(0.1)  # make sure the controller is started?
                os.system('ssh %s ipengine &' % m)  # assumes the environment is set up for non-interactive login


fperez commented Mar 23, 2011

The clusterfile is no longer in use, but the newparallel code does support SSH launching of engines. See here for details:

http://minrk.github.com/ipython-doc/newparallel/parallelz/parallel_process.html#using-ipclusterz-in-ssh-mode
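
For reference, a rough sketch of what that SSH-mode configuration looks like, reusing the engines dict from this report. The class and trait names below are recalled from the docs of that era and should be treated as assumptions; see the linked page for the authoritative spelling:

    # sketch of an ipclusterz profile config file; names approximate
    c = get_config()

    # launch engines over ssh instead of locally
    c.Global.engine_launcher = \
        'IPython.zmq.parallel.launcher.SSHEngineSetLauncher'

    # hostname -> number of engines, same shape as the old clusterfile dict
    c.SSHEngineSetLauncher.engines = {'tev1': 16, 'tev2': 16,
                                      'tev3': 16, 'tev4': 16}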

Unfortunately it's possible there was a problem in Twisted that was causing part of this. In newparallel/master we've moved away from Twisted completely, so I'm closing this bug.

I'm sorry that the transition is going to be somewhat painful for some users of the twisted code, but it's simply not practical to maintain the twisted codebase for the long haul.
