[MRG]: new version of the ipcluster plugin to drop the SGE requirement #204

Closed · wants to merge 14 commits


@ogrisel
Contributor
ogrisel commented Jan 19, 2013

The NFS partition is used instead, making it possible to run:

ipcluster engines -n numproc

on each non-master node of the cluster.

TODO:

  • update the documentation
  • find out why running the notebook locally using the fetched JSON connection file does not work (security group configuration issue?): done, configured the SG to open the controller ports
  • review the configuration: would it be better to leave the default configuration instead?
  • change the packer config to allow selecting between json (default), pickle or msgpack + update the documentation accordingly.

This is an early pull request to collect feedback from others, especially @minrk before moving further with the documentation update.

Also some questions for Min:

  • would it be possible to restart all the engines of the cluster from the controller? Would this require configuring SSH mode [1] instead? If so, where is the API to shut down and restart all the active engines? A basic restart plugin is implemented; a better version will require some refactoring upstream in IPython
  • is there a cleaner way to shut down some engines (with a timeout before calling the pkill command) in the on_remove_node method? Good enough for now.

[1] http://ipython.org/ipython-doc/dev/parallel/parallel_process.html#using-ipcluster-in-ssh-mode

@ogrisel ogrisel Draft ipcluster plugin refactoring to drop SGE requirement.
The NFS partition is used instead to be able to run:

    ipcluster engines -n numproc

on each non-master node of the cluster.
f179387
@ogrisel
Contributor
ogrisel commented Jan 19, 2013

Some additional questions for @jtriley this time:

  • I used a dummy threaded_call function to launch things in parallel on each non-master node. Is there a better API already available in StarCluster? Fixed by using the cluster thread pool (a generic sketch of the pattern follows this list).
  • when I launch starcluster an -n 3 mycluster, the new nodes' initializations seem to run sequentially: is there a way to make them run in parallel and then wait at the end? (probably to be addressed in another pull request).
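For illustration, a generic sketch of the fan-out pattern using concurrent.futures rather than StarCluster's own thread pool; the nodes list, the node.ssh.execute helper and the exact ipcluster command line are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    def start_engines(node):
        # hypothetical per-node action, e.g. launching ipcluster engines over SSH;
        # node.ssh.execute stands in for whatever remote-exec helper is available
        node.ssh.execute("ipcluster engines -n 4 --daemonize")

    # nodes is a placeholder for the list of non-master nodes
    with ThreadPoolExecutor(max_workers=len(nodes) or 1) as pool:
        futures = [pool.submit(start_engines, node) for node in nodes]
        for f in futures:
            f.result()  # wait for completion and re-raise any per-node error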
@minrk minrk and 1 other commented on an outdated diff Jan 19, 2013
starcluster/plugins/ipcluster.py
@@ -181,7 +115,7 @@ def _start_cluster(self, master, n, profile_dir):
s.start()
try:
while not master.ssh.isfile(json):
@minrk
minrk Jan 19, 2013 Contributor

I think this while True has caused some folks to observe a hang if controller startup didn't succeed for some reason.
Maybe give it a for i in range(30), so it has a 30 second timeout, rather than waiting forever?
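A minimal sketch of that bounded wait, reusing the master and json variables from the plugin code shown above:

    import time

    # Wait up to 30 seconds for the controller to write its connection file.
    for i in range(30):
        if master.ssh.isfile(json):
            break
        time.sleep(1)
    else:
        raise RuntimeError("ipcontroller connection file did not appear "
                           "within 30 seconds; check the controller log")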

@ogrisel
ogrisel Jan 19, 2013 Contributor

Good idea, I'll do that.

@ogrisel
Contributor
ogrisel commented Jan 21, 2013

@minrk could you please answer the two questions I put in the description of the PR?

In particular a full IP cluster restart would be nice (aborting all pending tasks, killing all the engines and relaunching them), especially when you are using starcluster to debug an algorithm with buggy C bindings that leak memory, for instance. Right now it would be possible to do this using the "IPClusterStop" plugin but I find this a bit hackish and I am wondering what would be a better, more integrated way.

Also another question: the starcluster configured IPython cluster does not show up in the "clusters" tab of the notebook. Is it expected? Maybe this is related to the previous questions?

@minrk
Contributor
minrk commented Jan 22, 2013

Weird, I thought I already did. Thanks for the ping.

would it be possible to restart all the engines of the cluster from the controller? Would this require configuring SSH mode [1] instead? If so, where is the API to shut down and restart all the active engines?

No, there is no API for restart, only shutdown (client.shutdown()). This will require the addition of a KernelNanny-type process that actually handles the kernel start / restart, like a KernelManager does now. The only way to do this now would be to shut down kernels via the Client, then start them up again with ssh, assuming you know how that should be done. Control messages like shutdown jump to the head of the line, so they will pre-empt any queued tasks (but not the currently running one).

The easiest 'ipcluster restart' script would be:

client.shutdown(hub=True)

followed by running the ipcluster start process that you already have.
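A rough sketch of that recipe; the connection-file path is a placeholder and the relaunch step is whatever start logic the plugin already has:

    from IPython.parallel import Client

    client = Client('ipcontroller-client.json')
    client.shutdown(hub=True)   # shuts down all engines and the hub itself
    client.close()

    # ...then re-run the existing cluster start logic, e.g. the plugin's
    # ipcluster start / ipcluster engines launch steps.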

is there a cleaner way to shut down some engines (with a timeout before calling the pkill command) in the on_remove_node method?

Yes, you can use client.shutdown(targets=[1,3,5]). The only tricky bit is knowing which engines those are.
Unfortunately, there are really only two ways to get this information:

  1. before starting the node, check the current engine list; start the node; then check the engine list again after its engines have connected and take the difference. This doesn't work if multiple nodes are coming up and starting engines concurrently.
  2. while the cluster is idle, use client[:].apply_async(socket.gethostname).get_dict() (a rather useful trick in general) to get a mapping of engine ID to hostname. Obviously, this only works when the engines are all idle.
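Putting the two pieces together, a hedged sketch of how on_remove_node could shut down only the engines on the departing node; the connection-file path and node_alias are placeholders:

    import socket
    from IPython.parallel import Client

    client = Client('ipcontroller-client.json')

    # Map engine id -> hostname; only reliable while all engines are idle.
    hostmap = client[:].apply_async(socket.gethostname).get_dict()

    # node_alias is a placeholder for the hostname of the node being removed.
    doomed = [eid for eid, host in hostmap.items() if host == node_alias]
    if doomed:
        client.shutdown(targets=doomed)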

Also another question: the starcluster configured IPython cluster does not show up in the "clusters" tab of the notebook. Is it expected? Maybe this is related to the previous questions?

Yes. The clusters tab is only aware of clusters that were started via that interface (it is a manager of ipcluster subprocesses). We intend to rewrite ipcluster as an RPC service that you can query, which would allow the clusters tab to know about clusters that it did not start.

@ogrisel
Contributor
ogrisel commented Jan 22, 2013

Ok great, thanks. I will try and investigate a bit further. Maybe I will write an IPClusterRestart plugin for starcluster. Or do you think it would be better contributed directly to IPython?

@minrk
Contributor
minrk commented Jan 22, 2013

straight to ipcluster. The way restart should get into IPython is via the KernelNanny idea above, which is a pretty big refactor.

@ogrisel
Contributor
ogrisel commented Jan 25, 2013

@minrk I am trying to solve the client connection issue: in bd8e54 I explicitly authorize all the port numbers found in the controller / client json file. However I still cannot connect my client from my laptop:

>>> from IPython.parallel import Client
>>> Client('/Users/ogrisel/.starcluster/ipcluster/SecurityGroup:@sc-iptest-us-east-1.json', sshkey='/Users/ogrisel/.ssh/mykey.rsa')
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-2-f3cf3c18793f> in <module>()
----> 1 Client('/Users/ogrisel/.starcluster/ipcluster/SecurityGroup:@sc-iptest-us-east-1.json', sshkey='/Users/ogrisel/.ssh/mykey.rsa')

/Users/ogrisel/coding/ipython/IPython/parallel/client/client.pyc in __init__(self, url_file, profile, profile_dir, ipython_dir, context, debug, sshserver, sshkey, password, paramiko, timeout, cluster_id, **extra_args)
    454             sshserver = addr
    455         if self._ssh and password is None:
--> 456             if tunnel.try_passwordless_ssh(sshserver, sshkey, paramiko):
    457                 password=False
    458             else:

/Users/ogrisel/coding/ipython/IPython/external/ssh/tunnel.pyc in try_passwordless_ssh(server, keyfile, paramiko)
     86     else:
     87         f = _try_passwordless_paramiko
---> 88     return f(server, keyfile)
     89 
     90 def _try_passwordless_openssh(server, keyfile):

/Users/ogrisel/coding/ipython/IPython/external/ssh/tunnel.pyc in _try_passwordless_openssh(server, keyfile)
     99     while True:
    100         try:
--> 101             p.expect('[Pp]assword:', timeout=.1)
    102         except pexpect.TIMEOUT:
    103             continue

/Library/Python/2.7/site-packages/pexpect.pyc in expect(self, pattern, timeout, searchwindowsize)
   1314 
   1315         compiled_pattern_list = self.compile_pattern_list(pattern)
-> 1316         return self.expect_list(compiled_pattern_list, timeout, searchwindowsize)
   1317 
   1318     def expect_list(self, pattern_list, timeout = -1, searchwindowsize = -1):

/Library/Python/2.7/site-packages/pexpect.pyc in expect_list(self, pattern_list, timeout, searchwindowsize)
   1328         self.searchwindowsize value is used. """
   1329 
-> 1330         return self.expect_loop(searcher_re(pattern_list), timeout, searchwindowsize)
   1331 
   1332     def expect_exact(self, pattern_list, timeout = -1, searchwindowsize = -1):

/Library/Python/2.7/site-packages/pexpect.pyc in expect_loop(self, searcher, timeout, searchwindowsize)
   1381                     raise TIMEOUT ('Timeout exceeded in expect_any().')
   1382                 # Still have time left, so read more data
-> 1383                 c = self.read_nonblocking (self.maxread, timeout)
   1384                 freshlen = len(c)
   1385                 time.sleep (0.0001)

/Library/Python/2.7/site-packages/pexpect.pyc in read_nonblocking(self, size, timeout)
    818                 raise EOF ('End Of File (EOF) in read_nonblocking(). Pokey platform.')
    819 
--> 820         r,w,e = self.__select([self.child_fd], [], [], timeout)
    821 
    822         if not r:

/Library/Python/2.7/site-packages/pexpect.pyc in __select(self, iwtd, owtd, ewtd, timeout)
   1552         while True:
   1553             try:
-> 1554                 return select.select (iwtd, owtd, ewtd, timeout)
   1555             except select.error, e:
   1556                 if e[0] == errno.EINTR:

KeyboardInterrupt: 

I manually interrupted the connection attempt since it did not seem to ever time out.

@ogrisel
Contributor
ogrisel commented Jan 25, 2013

Actually this is a false alarm: it turns out the SSH tunnel setup was just taking a long time (~15 s or so). It seems to work!

@jtriley
Owner
jtriley commented Jan 25, 2013

@ogrisel @minrk This is looking great! Will test and merge soon...

@ogrisel
Contributor
ogrisel commented Jan 25, 2013

@jtriley I still have to update the documentation and I would like to add an option to set a custom notebook directory.

I would also like to add a new (sub)plugin to restart the IPython engines using a brutal pkill -f ipengine for now (the goal being to support IPython 0.13 while waiting for finer-grained support in 0.14 or later, as @minrk said in a previous comment).

@ogrisel
Contributor
ogrisel commented Jan 25, 2013

@jtriley @minrk: I just granted you push rights to my repo should you want to fix things directly in my branch, so as to keep a single pull request for this feature.

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

@jtriley I made sure that this can work both with 0.13.1 and the master branch (0.14.dev) of IPython. I think this PR is ready for testing. I also updated the documentation but I am too lazy to redo all the screenshots with IPython 0.13...

@minrk I dropped the packer='pickle' setting as the default packer configuration seems to work both under 0.13.1 and master. Is this a bad idea? Why was this specific packer configuration required in the first place? Isn't it better to use the default configuration? It makes it easier for the user not to have to pass a packer= keyword argument each time.

@minrk
Contributor
minrk commented Jan 28, 2013

it was a minor performance optimization. The default packer (stdlib JSON) is crazy slow.

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

it was a minor performance optimization. The default packer (stdlib JSON) is crazy slow.

Hum, is there a way to tell the client which packer is configured on the cluster?

@minrk
Contributor
minrk commented Jan 28, 2013

Hum, is there a way to tell the client which packer is configured on the cluster?

It's in the connection file in 0.14, no need to specify it anymore.

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

It's in the connection file in 0.14, no need to specify it anymore.

Alright. Thanks, that explains the behavior changes I observed while testing. I will probably switch back to pickle in the default config and update the doc to make it explicit that passing the packer argument to Client is only required when using 0.13.

@jtriley do you think that StarCluster will be released before or after the next release of IPython?

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

@minrk also, for 0.13 to work from the local client, I had to open the port range 1000-65535 in the EC2 SecurityGroup configuration, as the port numbers of the controller channels are not given in the connection file. Is there a smarter way to do that?
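For reference, a hedged sketch of what opening that range looks like with boto 2; the region and the @sc-iptest group name are placeholders taken from the traceback above, and a narrower CIDR than 0.0.0.0/0 would be safer:

    import boto.ec2

    # Look up the cluster's security group and open the wide TCP range.
    conn = boto.ec2.connect_to_region('us-east-1')
    group = conn.get_all_security_groups(groupnames=['@sc-iptest'])[0]
    group.authorize(ip_protocol='tcp', from_port=1000, to_port=65535,
                    cidr_ip='0.0.0.0/0')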

@minrk
Contributor
minrk commented Jan 28, 2013

0.14 has a long todo list, and probably won't be out before May or June.

I think it's probably right to use the default serialization by default: the current IPCluster plugin was actually taken from a plugin I wrote for my personal use, hence some of the probably inappropriate default config. It should be easy to use custom serialization, though (I actually always use msgpack on EC2, for instance).

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

Alright, I will stick to the default packer config by default but will also make it possible to configure the pickle and msgpack alternative packers.

Right now I have only added the option for msgpack.
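A hedged sketch of what the msgpack option amounts to, assuming the standard Session packer/unpacker settings; with 0.13 the client still has to be told explicitly, as discussed above, while 0.14 reads it from the connection file:

    # In the controller / engine config (e.g. ipcontroller_config.py):
    #   c.Session.packer = 'msgpack.packb'
    #   c.Session.unpacker = 'msgpack.unpackb'

    # A 0.13 client must pass the same choice explicitly (connection-file
    # path is a placeholder):
    from IPython.parallel import Client
    client = Client('ipcontroller-client.json',
                    packer='msgpack.packb', unpacker='msgpack.unpackb')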

@minrk
Contributor
minrk commented Jan 28, 2013

There are two ways to know the ports of the Controller in 0.13:

  1. specify them ahead of time via config
  2. connect an actual client, and read it from there after it completes registration (client._config['registration'])
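A small sketch of option 2, assuming a client that can already reach the controller (the connection-file path is a placeholder); 'registration' is the key mentioned above, and the rest of the dict can be inspected for the other channels:

    from IPython.parallel import Client

    client = Client('ipcontroller-client.json')

    # After registration completes, the negotiated controller endpoints are
    # cached on the client.
    print(client._config['registration'])
    print(client._config)  # inspect the remaining channel endpoints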
@ogrisel
Contributor
ogrisel commented Jan 28, 2013

Ok I'll fix them in the config file, that sounds more appropriate.

@ogrisel
Contributor
ogrisel commented Jan 28, 2013

Ok I have updated the packer config and updated the documentation. I did not find an example configuration that demonstrates how to configure all the controller ports needed by the client for IPython 0.13, so I decided to leave the SecurityGroup configuration as it is.

I think this PR is ready for final review and merge.

@ogrisel
Contributor
ogrisel commented Feb 3, 2013

Any comment / review?

@minrk
Contributor
minrk commented Feb 4, 2013

I've read through it, and it looks fine to me. I don't see any issues.

@ogrisel
Contributor
ogrisel commented Feb 6, 2013

@jtriley do you want me to check this PR against the new AMIs with IPython 0.13 installed by default? If so what are their ids / regions?

@ogrisel
Contributor
ogrisel commented Feb 6, 2013

BTW thanks @minrk for the review :)

@twiecki
twiecki commented Feb 12, 2013

@ogrisel In case this is still relevant (I found this out unrelatedly):

>>starcluster listpublic                                                                                                                                                                                                                     
StarCluster - (http://star.mit.edu/cluster) (v. 0.9999)                                                                                                                                                                                      
Software Tools for Academics and Researchers (STAR)                                                                                                                                                                                          
Please submit bug reports to starcluster@mit.edu                                                                                                                                                                                             

>>> Listing all public StarCluster images...                                                                                                                                                                                                 

32bit Images:                                                                                                                                                                                                                                
-------------                                                                                                                                                                                                                                
[0] ami-adc149c4 us-east-1 starcluster-base-ubuntu-12.04-x86 (EBS)                                                                                                                                                                           
[1] ami-899d49e0 us-east-1 starcluster-base-ubuntu-11.10-x86 (EBS)                                                                                                                                                                           
[2] ami-8cf913e5 us-east-1 starcluster-base-ubuntu-10.04-x86-rc3                                                                                                                                                                             
[3] ami-d1c42db8 us-east-1 starcluster-base-ubuntu-9.10-x86-rc8                                                                                                                                                                              
[4] ami-8f9e71e6 us-east-1 starcluster-base-ubuntu-9.04-x86                                                                                                                                                                                  

64bit Images:                                                                                                                                                                                                                                
--------------
[0] ami-5b3fb632 us-east-1 starcluster-base-ubuntu-12.04-x86_64 (EBS)
[1] ami-4583572c us-east-1 starcluster-base-ubuntu-11.10-x86_64-hvm (HVM-EBS)
[2] ami-999d49f0 us-east-1 starcluster-base-ubuntu-11.10-x86_64 (EBS)
[3] ami-0af31963 us-east-1 starcluster-base-ubuntu-10.04-x86_64-rc1
[4] ami-2faa7346 us-east-1 starcluster-base-ubuntu-10.04-x86_64-qiime-1.4.0 (EBS)
[5] ami-8852a0e1 us-east-1 starcluster-base-ubuntu-10.04-x86_64-hadoop
[6] ami-a5c42dcc us-east-1 starcluster-base-ubuntu-9.10-x86_64-rc4
[7] ami-a19e71c8 us-east-1 starcluster-base-ubuntu-9.04-x86_64
[8] ami-06a75a6f us-east-1 starcluster-base-centos-5.4-x86_64-ebs-hvm-gpu-hadoop-rc2 (HVM-EBS)
[9] ami-12b6477b us-east-1 starcluster-base-centos-5.4-x86_64-ebs-hvm-gpu-rc2 (HVM-EBS)
@ogrisel
Contributor
ogrisel commented Feb 12, 2013

@twiecki thanks, I did not know about this.

@ogrisel
Contributor
ogrisel commented Feb 17, 2013

@twiecki alright, I just tried with ami-5b3fb632 (Ubuntu 12.04 64bit EBS with IPython 0.13.1 installed by default) and both the IPython notebook running on the server and a local IPython 0.13.1 session can instantiate a working IPython.parallel Client.

@twiecki
twiecki commented Feb 17, 2013

@ogrisel I also gave it a test spin and haven't had any issues.

Looking at the diff this seems like a great PR: no SGE, better parallel support, simpler code. What's not to like.

@jtriley jtriley closed this in f06b9c6 Feb 26, 2013