ipcontroller crash with MPI #728

Closed
jdavid1385 opened this Issue August 24, 2011 · 2 comments

Jesus David

Hi,

I am new to using IPython to manage a cluster with MPI. I followed the
instructions for creating the profile and successfully started the cluster
with ipcluster start -n 4 --profile=mpi. I only wanted to check how it
creates the processes and what kind of logging it does. Since the console
appeared to hang (I suppose it is a blocking call), I stopped the process
with Ctrl+C, and that is when the issue occurred. I hope my description of
the scenario is helpful.

I have already sent an e-mail with the same content as this issue.
Anyway, the log output is below:

***************************************************************************

IPython post-mortem report

{'commit_hash': '464280a',
 'commit_source': 'installation',
 'ipython_path': '/usr/local/lib/python2.6/dist-packages/ipython-0.11-py2.6.egg/IPython',
 'ipython_version': '0.11',
 'os_name': 'posix',
 'platform': 'Linux-2.6.35-30-generic-i686-with-Ubuntu-10.10-maverick',
 'sys_executable': '/usr/bin/python2.6',
 'sys_platform': 'linux2',
 'sys_version': '2.6.6 (r266:84292, Sep 15 2010, 15:52:39) \n[GCC 4.4.5]'}

***************************************************************************



***************************************************************************

Crash traceback:

---------------------------------------------------------------------------
OSError                                    Python 2.6.6: /usr/bin/python2.6
                                                   Thu Aug 25 00:37:16 2011
A problem occured executing Python code.  Here is the sequence of function
calls leading up to the error, with the most recent (innermost) call last.
/usr/lib/python2.6/atexit.pyc in _run_exitfuncs()
      9 
     10 import sys
     11 
     12 _exithandlers = []
     13 def _run_exitfuncs():
     14     """run any registered exit functions
     15 
     16     _exithandlers is traversed in reverse order so functions are executed
     17     last in, first out.
     18     """
     19 
     20     exc_info = None
     21     while _exithandlers:
     22         func, targs, kargs = _exithandlers.pop()
     23         try:
---> 24             func(*targs, **kargs)
        func = <function _exit_function at 0xa2305dc>
        global function = undefined
        global to = undefined
        global be = undefined
        global called = undefined
        global at = undefined
        global exit = undefined
     25         except SystemExit:
     26             exc_info = sys.exc_info()
     27         except:
     28             import traceback
     29             print >> sys.stderr, "Error in atexit._run_exitfuncs:"
     30             traceback.print_exc()
     31             exc_info = sys.exc_info()
     32 
     33     if exc_info is not None:
     34         raise exc_info[0], exc_info[1], exc_info[2]
     35 
     36 
     37 def register(func, *targs, **kargs):
     38     """register a function to be executed upon normal program termination
     39 

/usr/lib/python2.6/multiprocessing/util.pyc in _exit_function()
    254 
    255 def _exit_function():
    256     global _exiting
    257 
    258     info('process shutting down')
    259     debug('running all "atexit" finalizers with priority >= 0')
    260     _run_finalizers(0)
    261 
    262     for p in active_children():
    263         if p._daemonic:
    264             info('calling terminate() for daemon %s', p.name)
    265             p._popen.terminate()
    266 
    267     for p in active_children():
    268         info('calling join() for process %s', p.name)
--> 269         p.join()
    270 
    271     debug('running the remaining "atexit" finalizers')
    272     _run_finalizers()
    273 
    274 atexit.register(_exit_function)
    275 
    276 #
    277 # Some fork aware types
    278 #
    279 
    280 class ForkAwareThreadLock(object):
    281     def __init__(self):
    282         self._lock = threading.Lock()
    283         self.acquire = self._lock.acquire
    284         self.release = self._lock.release

/usr/lib/python2.6/multiprocessing/process.pyc in join(self=<class 'multiprocessing.process.Process'> instance, timeout=None)
    104         self._popen = Popen(self)
    105         _current_process._children.add(self)
    106 
    107     def terminate(self):
    108         '''
    109         Terminate process; sends SIGTERM signal or uses TerminateProcess()
    110         '''
    111         self._popen.terminate()
    112 
    113     def join(self, timeout=None):
    114         '''
    115         Wait until child process terminates
    116         '''
    117         assert self._parent_pid == os.getpid(), 'can only join a child process'
    118         assert self._popen is not None, 'can only join a started process'
--> 119         res = self._popen.wait(timeout)
    120         if res is not None:
    121             _current_process._children.discard(self)
    122 
    123     def is_alive(self):
    124         '''
    125         Return whether process is alive
    126         '''
    127         if self is _current_process:
    128             return True
    129         assert self._parent_pid == os.getpid(), 'can only test a child process'
    130         if self._popen is None:
    131             return False
    132         self._popen.poll()
    133         return self._popen.returncode is None
    134 

/usr/lib/python2.6/multiprocessing/forking.pyc in wait(self=<multiprocessing.forking.Popen object>, timeout=None)
    102                 os._exit(code)
    103 
    104         def poll(self, flag=os.WNOHANG):
    105             if self.returncode is None:
    106                 pid, sts = os.waitpid(self.pid, flag)
    107                 if pid == self.pid:
    108                     if os.WIFSIGNALED(sts):
    109                         self.returncode = -os.WTERMSIG(sts)
    110                     else:
    111                         assert os.WIFEXITED(sts)
    112                         self.returncode = os.WEXITSTATUS(sts)
    113             return self.returncode
    114 
    115         def wait(self, timeout=None):
    116             if timeout is None:
--> 117                 return self.poll(0)
        global get_loggert = undefined
        global setLevelR = undefined
        global R = undefined
        global R1 = undefined
        global t = undefined
        global chdirR = undefined
        global splitextt = undefined
        global basenamet = undefined
        global dirnamet = undefined
        global impR = undefined
    118             deadline = time.time() + timeout
    119             delay = 0.0005
    120             while 1:
    121                 res = self.poll()
    122                 if res is not None:
    123                     break
    124                 remaining = deadline - time.time()
    125                 if remaining <= 0:
    126                     break
    127                 delay = min(delay * 2, remaining, 0.05)
    128                 time.sleep(delay)
    129             return res
    130 
    131         def terminate(self):
    132             if self.returncode is None:

/usr/lib/python2.6/multiprocessing/forking.pyc in poll(self=<multiprocessing.forking.Popen object>, flag=0)
     91             sys.stderr.flush()
     92             self.returncode = None
     93 
     94             self.pid = os.fork()
     95             if self.pid == 0:
     96                 if 'random' in sys.modules:
     97                     import random
     98                     random.seed()
     99                 code = process_obj._bootstrap()
    100                 sys.stdout.flush()
    101                 sys.stderr.flush()
    102                 os._exit(code)
    103 
    104         def poll(self, flag=os.WNOHANG):
    105             if self.returncode is None:
--> 106                 pid, sts = os.waitpid(self.pid, flag)
        global i = undefined
        global d = undefined
        global n = undefined
        global j = undefined
        global o = undefined
        global t = undefined
        global _ = undefined
    107                 if pid == self.pid:
    108                     if os.WIFSIGNALED(sts):
    109                         self.returncode = -os.WTERMSIG(sts)
    110                     else:
    111                         assert os.WIFEXITED(sts)
    112                         self.returncode = os.WEXITSTATUS(sts)
    113             return self.returncode
    114 
    115         def wait(self, timeout=None):
    116             if timeout is None:
    117                 return self.poll(0)
    118             deadline = time.time() + timeout
    119             delay = 0.0005
    120             while 1:
    121                 res = self.poll()

/usr/local/lib/python2.6/dist-packages/ipython-0.11-py2.6.egg/IPython/parallel/util.pyc in terminate_children(sig=2, frame=<frame object>)
    381             sock = socket.socket()
    382             sock.bind(('', 0))
    383         ports.append(sock)
    384     for i, sock in enumerate(ports):
    385         port = sock.getsockname()[1]
    386         sock.close()
    387         ports[i] = port
    388         _random_ports.add(port)
    389     return ports
    390 
    391 def signal_children(children):
    392     """Relay interupt/term signals to children, for more solid process cleanup."""
    393     def terminate_children(sig, frame):
    394         logging.critical("Got signal %i, terminating children..."%sig)
    395         for child in children:
--> 396             child.terminate()
    397         
    398         sys.exit(sig != SIGINT)
    399         # sys.exit(sig)
    400     for sig in (SIGINT, SIGABRT, SIGTERM):
    401         signal(sig, terminate_children)
    402 
    403 def generate_exec_key(keyfile):
    404     import uuid
    405     newkey = str(uuid.uuid4())
    406     with open(keyfile, 'w') as f:
    407         # f.write('ipython-key ')
    408         f.write(newkey+'\n')
    409     # set user-only RW permissions (0600)
    410     # this will have no effect on Windows
    411     os.chmod(keyfile, stat.S_IRUSR|stat.S_IWUSR)

/usr/lib/python2.6/multiprocessing/process.pyc in terminate(self=<class 'multiprocessing.process.Process'> instance)
     96                'can only start a process object created by current process'
     97         assert not _current_process._daemonic, \
     98                'daemonic processes are not allowed to have children'
     99         _cleanup()
    100         if self._Popen is not None:
    101             Popen = self._Popen
    102         else:
    103             from .forking import Popen
    104         self._popen = Popen(self)
    105         _current_process._children.add(self)
    106 
    107     def terminate(self):
    108         '''
    109         Terminate process; sends SIGTERM signal or uses TerminateProcess()
    110         '''
--> 111         self._popen.terminate()
    112 
    113     def join(self, timeout=None):
    114         '''
    115         Wait until child process terminates
    116         '''
    117         assert self._parent_pid == os.getpid(), 'can only join a child process'
    118         assert self._popen is not None, 'can only join a started process'
    119         res = self._popen.wait(timeout)
    120         if res is not None:
    121             _current_process._children.discard(self)
    122 
    123     def is_alive(self):
    124         '''
    125         Return whether process is alive
    126         '''

/usr/lib/python2.6/multiprocessing/forking.pyc in terminate(self=<multiprocessing.forking.Popen object>)
    121                 res = self.poll()
    122                 if res is not None:
    123                     break
    124                 remaining = deadline - time.time()
    125                 if remaining <= 0:
    126                     break
    127                 delay = min(delay * 2, remaining, 0.05)
    128                 time.sleep(delay)
    129             return res
    130 
    131         def terminate(self):
    132             if self.returncode is None:
    133                 try:
    134                     os.kill(self.pid, signal.SIGTERM)
    135                 except OSError, e:
--> 136                     if self.wait(timeout=0.1) is None:
    137                         raise
    138 
    139         @staticmethod
    140         def thread_is_spawning():
    141             return False
    142 
    143 #
    144 # Windows
    145 #
    146 
    147 else:
    148     import thread
    149     import msvcrt
    150     import _subprocess
    151     import time

/usr/lib/python2.6/multiprocessing/forking.pyc in wait(self=<multiprocessing.forking.Popen object>, timeout=0.10000000000000001)
    106                 pid, sts = os.waitpid(self.pid, flag)
    107                 if pid == self.pid:
    108                     if os.WIFSIGNALED(sts):
    109                         self.returncode = -os.WTERMSIG(sts)
    110                     else:
    111                         assert os.WIFEXITED(sts)
    112                         self.returncode = os.WEXITSTATUS(sts)
    113             return self.returncode
    114 
    115         def wait(self, timeout=None):
    116             if timeout is None:
    117                 return self.poll(0)
    118             deadline = time.time() + timeout
    119             delay = 0.0005
    120             while 1:
--> 121                 res = self.poll()
    122                 if res is not None:
    123                     break
    124                 remaining = deadline - time.time()
    125                 if remaining <= 0:
    126                     break
    127                 delay = min(delay * 2, remaining, 0.05)
    128                 time.sleep(delay)
    129             return res
    130 
    131         def terminate(self):
    132             if self.returncode is None:
    133                 try:
    134                     os.kill(self.pid, signal.SIGTERM)
    135                 except OSError, e:
    136                     if self.wait(timeout=0.1) is None:

/usr/lib/python2.6/multiprocessing/forking.pyc in poll(self=<multiprocessing.forking.Popen object>, flag=1)
     91             sys.stderr.flush()
     92             self.returncode = None
     93 
     94             self.pid = os.fork()
     95             if self.pid == 0:
     96                 if 'random' in sys.modules:
     97                     import random
     98                     random.seed()
     99                 code = process_obj._bootstrap()
    100                 sys.stdout.flush()
    101                 sys.stderr.flush()
    102                 os._exit(code)
    103 
    104         def poll(self, flag=os.WNOHANG):
    105             if self.returncode is None:
--> 106                 pid, sts = os.waitpid(self.pid, flag)
        global i = undefined
        global d = undefined
        global n = undefined
        global j = undefined
        global o = undefined
        global t = undefined
        global _ = undefined
    107                 if pid == self.pid:
    108                     if os.WIFSIGNALED(sts):
    109                         self.returncode = -os.WTERMSIG(sts)
    110                     else:
    111                         assert os.WIFEXITED(sts)
    112                         self.returncode = os.WEXITSTATUS(sts)
    113             return self.returncode
    114 
    115         def wait(self, timeout=None):
    116             if timeout is None:
    117                 return self.poll(0)
    118             deadline = time.time() + timeout
    119             delay = 0.0005
    120             while 1:
    121                 res = self.poll()

OSError: [Errno 10] No child processes
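
For readers following the traceback: errno 10 on Linux is ECHILD, which `os.waitpid` raises when the given pid is not (or is no longer) a waitable child of the calling process. The snippet below is a minimal sketch of that general mechanism, not code from IPython or from this report. It is written for modern Python 3 on a POSIX system, and the function name `reap_twice` is hypothetical. It only illustrates one way the error can arise, namely waiting on a child that has already been reaped. It does not claim to reproduce the exact sequence of events in the crash above.

```python
import errno
import os


def reap_twice():
    """Fork a child, reap it, then try to reap it a second time.

    Returns the errno raised by the second waitpid call, or None if
    no error was raised. POSIX only (relies on os.fork).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping the parent's cleanup handlers.
        os._exit(0)

    # First wait succeeds and removes the child from the process table.
    os.waitpid(pid, 0)

    try:
        # Second wait on the same pid: the kernel no longer has this child.
        os.waitpid(pid, 0)
    except ChildProcessError as exc:  # OSError subclass used for ECHILD
        return exc.errno
    return None


if __name__ == "__main__":
    # Compare against the symbolic constant rather than a hard-coded 10,
    # since errno values are platform-specific.
    print(reap_twice() == errno.ECHILD)
```

In the report above, the same kind of error surfaces from `poll()` in `multiprocessing/forking.py`, called from the `except OSError` branch of `terminate()`, which suggests the child was already gone by the time the signal handler tried to terminate it.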

Again (as in the e-mail I sent previously), thanks for the great work you are doing.

Cheers

Min RK
Owner

In what environment are you running? It sounds like MPI is not properly allowing the controller to start its subprocesses, or they are failing to start for some other reason. Does it work if you don't use the MPIControllerLauncher, and instead only start the engines with MPI?

Jesus David

Sorry, I had forgotten that I wrote to you. The problem was that I had enabled some settings in the ipcluster_config.py file of my mpi profile and forgot to clean them up. Specifically, I had enabled the lines in the "MPIExecControllerLauncher configuration" section, as you deduced from the log. I finally fixed it by disabling those lines and adding the correct ones, namely:

c.IPClusterEngines.engine_launcher = 'IPython.parallel.apps.launcher.MPIExecEngineSetLauncher'
c.IPClusterEngines.engine_launcher_class = 'MPIExecEngineSetLauncher'

It was just a beginner's mistake. Anyway, thanks for your response; it is good to know that there is support for such an important project.

Cheers

Jesus David jdavid1385 closed this September 12, 2011