Fixed socket binding race conditions #412

MSeal · 2019-06-13T05:57:09Z

Following fixes in jupyter_client around zeroMQ thread safety issues I found and closed a couple races in the ipykernel wrapper as well that I found during testing of concurrent patterns with nbconvert.

The first is zeroMQ contexts using .instance() which is causes a shared object that's not thread / multiprocessing safe to use. Forcing new context objects to be allocated rather than using a global instance costs a small memory footprint to avoid race condition failures around those context objects.

The second race I found was rare but enough travis builds gave the stack trace below. This points to port binding where a port is picked first, then used in a later call. But that port which was picked could be picked by a concurrent execution at the same time, resulting in a port binding error for one half of the race. To fix this I added some logic to wrap port binding with retry on address taken exceptions. I've tested an early version of this fix extensively in tracking down concurrency issues in notebook executions in nbconvert and upstream libraries, so I'm quite confident this port binding issue is what caused the error seen in travis builds. 100 retries, rather than infinite retries, seemed like a safe compromise as port bind conflicts are difficult to trigger and if you get more than a couple dozen something more serious is wrong.

___________________________ test_parallel_notebooks ____________________________
capfd = <_pytest.capture.CaptureFixture object at 0x7fbbe7dc2790>
tmpdir = local('/tmp/pytest-of-travis/pytest-0/test_parallel_notebooks0')
    def test_parallel_notebooks(capfd, tmpdir):
        """Two notebooks should be able to be run simultaneously without problems.
    
        The two notebooks spawned here use the filesystem to check that the other notebook
        wrote to the filesystem."""
    
        opts = dict(kernel_name="python")
        input_name = "Parallel Execute.ipynb"
        input_file = os.path.join(current_dir, "files", input_name)
        res = notebook_resources()
    
        with modified_env({"NBEXECUTE_TEST_PARALLEL_TMPDIR": str(tmpdir)}):
            threads = [
                threading.Thread(
                    target=run_notebook,
                    args=(
                        input_file,
                        opts,
                        res,
                        functools.partial(label_parallel_notebook, label=label),
                    ),
                )
                for label in ("A", "B")
            ]
            [t.start() for t in threads]
            [t.join(timeout=2) for t in threads]
    
        captured = capfd.readouterr()
>       assert captured.err == ""
E       assert 'Traceback (m...kernel_info\n\n' == ''
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 174, in _run_module_as_main
E         -     "__main__", fname, loader, pkg_name)
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 72, in _run_code
E         -     exec code in run_globals
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
E         -     app.launch_new_instance()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
E         -     app.initialize(argv)
E         -   File "</home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/decorator.pyc:decorator-gen-121>", line 2, in initialize
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
E         -     return method(app, *args, **kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 469, in initialize
E         -     self.init_sockets()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
E         -     self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
E         -     s.bind("tcp://%s:%i" % (self.ip, port))
E         -   File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
E         -   File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
E         -     raise ZMQError(errno)
E         - ZMQError: Address already in use
E         - Exception in thread Thread-16:
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 801, in __bootstrap_inner
E         -     self.run()
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 754, in run
E         -     self.__target(*self.__args, **self.__kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py", line 97, in run_notebook
E         -     output_nb, _ = preprocessor(cleaned_input_nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
E         -     return self.preprocess(nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 400, in preprocess
E         -     with self.setup_preprocessor(nb, resources, km=km):
E         -   File "/opt/python/2.7.15/lib/python2.7/contextlib.py", line 17, in __enter__
E         -     return self.gen.next()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
E         -     self.km, self.kc = self.start_new_kernel(**kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
E         -     kc.wait_for_ready(timeout=self.startup_timeout)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/jupyter_client/blocking/client.py", line 120, in wait_for_ready
E         -     raise RuntimeError('Kernel died before replying to kernel_info')
E         - RuntimeError: Kernel died before replying to kernel_info
E         -
/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py:305: AssertionError

MSeal · 2019-06-13T17:15:49Z

Not sure what broke the asyncio operations. I'll debug this over the weekend to figure out what's wrong.

MSeal · 2019-07-11T18:14:54Z

@Carreau I'm going to look into the async test failure tonight, but I was hoping maybe we could get this prioritized during the sprints and possibly get a release? The race condition here gets hit frequently with the new parallel execution tests in nbconvert.

MSeal · 2019-07-12T07:53:37Z

@minrk @Carreau If one of you has inisght, I think there's some low level issues the tests are hitting that I'm struggling to put together. The change above fixes the race I was seeing in nbconvert as far as I can tell, but the test suite has test_shutdown failing to teardown. I also see the test suite hang after completing, waiting for a thread to terminate. A cnt-C brings this stack trace out:

Traceback (most recent call last):
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc

If the atexit change isn't make I see similar errors in other tests around _check_rc calls hanging, usually in the form of "Kernel Died before responding" when the kernel launch times out.

Traceback (most recent call last):
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 138, in _handle_events
    handler_func(fileobj, events)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events
    zmq_events = self.socket.EVENTS
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/sugar/attrsettr.py", line 48, in __getattr__
    return self._get_attr_opt(upper_key, opt)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/sugar/attrsettr.py", line 52, in _get_attr_opt
    return self.get(opt)
  File "zmq/backend/cython/socket.pyx", line 501, in zmq.backend.cython.socket.Socket.get
  File "zmq/backend/cython/socket.pyx", line 262, in zmq.backend.cython.socket._getsockopt
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc

Considering I had to add the commented atexit cleanup back in to get the async tests to pass it looks suspiciously like zmq cleanup is failing in the tests due to a process or thread hanging indefinitely.

Changing the zmq.Context().instance() pattern seems to have no impact on the tests issues btw.

Thoughts or places to investigate more with the zmq stack traces?

minrk · 2019-07-16T11:24:49Z

ipykernel/kernelapp.py

        # Uncomment this to try closing the context.
-        # atexit.register(context.term)
+        atexit.register(context.term)


This is the source of the hang. Calling context.term in atexit is terminating the context prior to closing sockets, which causes a hang. This isn't safe to do without closing all sockets first. While this is a good thing to do, I don't think you need to do it in this PR. We would have to safely notify all sockets in all threads to be closed and then finally terminate the context.

Suggested change

atexit.register(context.term)

# atexit.register(context.term)

@minrk Makes sense. If I turn that back off I get Kernel Stopped Responding in the async tests. Though it seems more just tests running later in the pool of tests start failing with timeouts. It seemed like the tests weren't cleaning up sockets and then later tests are trying to find a free socket and timing out. Running any test individually passes but the full pytest pass starts failing (see build for first commit of PR). Ideas on what needs to be done there?

avoids hangs during context termination

because travis can be real slow

MSeal · 2019-07-18T21:37:19Z

Those changes do look good. Any inclination for why the shutdown failed on just python 3.5 suite? I'll have a little time this weekend to look at this again.

minrk · 2019-07-19T07:46:24Z

Sorry, I meant to make a comment about the changes, not silently update your branch! I went through and added some explicit cleanup of zmq sockets and contexts. The hang was being caused by some garbage collection of a context not explicitly cleaned up by a prior test, leaving its sockets hanging around, which hangs. I don't really understand why this would happen, since sockets have a reference to the context but not vice versa, so gc order shouldn't allow contexts to delete before sockets 🤷‍♂, but it's always a good idea to clean up explicitly anyway.

I'm trying to think about shutdown. My hunch is that the eventloop in the kernel is not stopping, or possibly something in zmq is not cleaning up in the right order. First thing I'd look at is if any package versions are different on 3.5. If it's random and different versions of py3 >= 3.5 hang on re-run, then I'd be almost certain it's garbage collection of zmq objects that's the cause, because Python 3.5 introduced forced random gc order at process shutdown. In that case, making sure we cleanup sockets at process shutdown should help.

might fix garbage collection-related hangs on kernel exit

minrk · 2019-07-19T07:56:38Z

I added one more commit that performs explicit cleanup at exit. If gc issues were the problem, I'd expect this to fix it (unless there are more sockets somewhere we aren't tracking)

edit: this is essentially restoring the cleanup-at-exit change you made earlier, but with the extra bookkeeping of ensuring everything is cleaned up in the right order.

MSeal · 2019-07-22T02:31:36Z

Awesome! Thanks for the help on the PR. Anything else needed for this to merge?

Carreau · 2019-07-26T00:16:13Z

Awesome! Thanks for the help on the PR. Anything else needed for this to merge?

I'd say let's go !

Carreau · 2019-07-26T00:16:27Z

And we shoudl try to do a release soon.

MSeal · 2019-07-26T16:23:23Z

Thanks! And yes a release would be helpful so the nbconvert release doesn't have to note a caveat about ipykernels occasionally crashing when parallelizing conversions.

Carreau · 2019-07-27T18:28:00Z

I have no objections; though I haven't been involved in ipykernel developement in a while so would prefer someone more involve with it to do one.

Not even sure if that woudl be a 5.2 release, a 5.1.2 or a 6.0

MSeal · 2019-07-28T17:55:29Z

@Carreau @minrk Given these commits: v5.1.1...master
I see 4 PRs:
#408
#405
#394
and this one #412

The changes are all forward compatible and fixing minor to niche problems, so I think a patch release is reasonable with a 5.1.2 release if this is roughly following semver. A 5.2.0 would probably be fine as well if we wanted to denote the parallel execution support better.

Azrealy · 2019-07-29T02:28:01Z

@MSeal Sorry for the break-in. I have an issue when I use ipykernel master branch which already merged with this branch, and then replace my 5.1.1 version ipykernel package. When I try to shut down my notebook, the following error message shows up.

Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/heartbeat.py", line 98, in run
    zmq.device(zmq.QUEUE, self.socket, self.socket)
  File "zmq/backend/cython/_device.pyx", line 54, in zmq.backend.cython._device.device
  File "zmq/backend/cython/_device.pyx", line 95, in zmq.backend.cython._device.proxy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Socket operation on non-socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib64/python3.6/threading.py", line 926, in _bootstrap_inner
    (self.name, _format_exc()), file=_sys.stderr)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 410, in write
    self._schedule_flush()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 334, in _schedule_flush
    self.pub_thread.schedule(_schedule_in_thread)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 207, in schedule
    f()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 333, in _schedule_in_thread
    self._io_loop.call_later(self.flush_interval, self._flush)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/tornado/ioloop.py", line 602, in call_later
    return self.call_at(self.time() + delay, callback, *args, **kwargs)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 164, in call_at
    functools.partial(callback, *args, **kwargs),
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 544, in call_later
    timer = self.call_at(self.time() + delay, callback, *args)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 554, in call_at
    self._check_closed()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 358, in _check_closed
    raise RuntimeError('Event loop is closed')
Error in atexit._run_exitfuncs:
RuntimeError: Event loop is closed

I have set up a new virtualenv and pip installed the notebook 6.0.0 version before I upgrade the ipykernel base on this branch.

MSeal · 2019-07-29T16:23:54Z

Hmm... Something added in the changes to get the tests to pass broke the kernel when parallelized. I can reproduce by running the nbconvert test suite with this change installed. @minrk I won't be able to dig into fixing this for a bit fyi.

minrk · 2019-08-07T11:17:37Z

#419 should fix the shutdown cleanup issue

Fixed socket binding race conditions

970ba1a

MSeal mentioned this pull request Jun 13, 2019

Tests which fail in multiprocessing contexts jupyter/nbconvert#1018

Merged

MSeal mentioned this pull request Jun 17, 2019

NBConvert 5.6 Release jupyter/nbconvert#1052

Closed

MSeal added 2 commits July 12, 2019 00:35

Added atexit and better port checks to avoid async failures

f918e83

Applied fixes for everything but test_shutdown

f1ef18d

hmaarrfk mentioned this pull request Jul 14, 2019

Rebuild for aarch64 and ppc64le conda-forge/ipykernel-feedstock#47

Merged

5 tasks

minrk reviewed Jul 16, 2019

View reviewed changes

add explicit cleanup of zmq sockets

3a63c51

avoids hangs during context termination

minrk force-pushed the concurrencyRaceFixes branch from 055179e to 3a63c51 Compare July 18, 2019 12:13

increase shutdown test timeout to 30 seconds

ae563d6

because travis can be real slow

add explicit socket cleanup

d1b2d01

might fix garbage collection-related hangs on kernel exit

MSeal mentioned this pull request Jul 22, 2019

zmq.error.ZMQError: Address already in use jupyter/nbconvert#1066

Open

Carreau merged commit 4e3769d into ipython:master Jul 26, 2019

minrk mentioned this pull request Aug 7, 2019

Further cleanup on shutdown #419

Merged

MSeal mentioned this pull request Aug 7, 2019

5.1.2 Release #420

Closed

asmeurer mentioned this pull request Sep 23, 2019

Tracebacks are not shown right away when connecting to 'ipython kernel' from 'jupyter console' ipython/ipython#11894

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed socket binding race conditions #412

Fixed socket binding race conditions #412

MSeal commented Jun 13, 2019

MSeal commented Jun 13, 2019

MSeal commented Jul 11, 2019

MSeal commented Jul 12, 2019

minrk Jul 16, 2019

MSeal Jul 16, 2019 •

edited

Loading

MSeal commented Jul 18, 2019

minrk commented Jul 19, 2019

minrk commented Jul 19, 2019 •

edited

Loading

MSeal commented Jul 22, 2019

Carreau commented Jul 26, 2019

Carreau commented Jul 26, 2019

MSeal commented Jul 26, 2019

Carreau commented Jul 27, 2019

MSeal commented Jul 28, 2019

Azrealy commented Jul 29, 2019 •

edited

Loading

MSeal commented Jul 29, 2019

minrk commented Aug 7, 2019

	atexit.register(context.term)
	# atexit.register(context.term)

Fixed socket binding race conditions #412

Fixed socket binding race conditions #412

Conversation

MSeal commented Jun 13, 2019

MSeal commented Jun 13, 2019

MSeal commented Jul 11, 2019

MSeal commented Jul 12, 2019

minrk Jul 16, 2019

Choose a reason for hiding this comment

MSeal Jul 16, 2019 • edited Loading

Choose a reason for hiding this comment

MSeal commented Jul 18, 2019

minrk commented Jul 19, 2019

minrk commented Jul 19, 2019 • edited Loading

MSeal commented Jul 22, 2019

Carreau commented Jul 26, 2019

Carreau commented Jul 26, 2019

MSeal commented Jul 26, 2019

Carreau commented Jul 27, 2019

MSeal commented Jul 28, 2019

Azrealy commented Jul 29, 2019 • edited Loading

MSeal commented Jul 29, 2019

minrk commented Aug 7, 2019

MSeal Jul 16, 2019 •

edited

Loading

minrk commented Jul 19, 2019 •

edited

Loading

Azrealy commented Jul 29, 2019 •

edited

Loading