Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed socket binding race conditions #412

Merged
merged 6 commits into from
Jul 26, 2019

Conversation

MSeal
Copy link
Contributor

@MSeal MSeal commented Jun 13, 2019

Following fixes in jupyter_client around zeroMQ thread safety issues I found and closed a couple races in the ipykernel wrapper as well that I found during testing of concurrent patterns with nbconvert.

The first is zeroMQ contexts using .instance() which is causes a shared object that's not thread / multiprocessing safe to use. Forcing new context objects to be allocated rather than using a global instance costs a small memory footprint to avoid race condition failures around those context objects.

The second race I found was rare but enough travis builds gave the stack trace below. This points to port binding where a port is picked first, then used in a later call. But that port which was picked could be picked by a concurrent execution at the same time, resulting in a port binding error for one half of the race. To fix this I added some logic to wrap port binding with retry on address taken exceptions. I've tested an early version of this fix extensively in tracking down concurrency issues in notebook executions in nbconvert and upstream libraries, so I'm quite confident this port binding issue is what caused the error seen in travis builds. 100 retries, rather than infinite retries, seemed like a safe compromise as port bind conflicts are difficult to trigger and if you get more than a couple dozen something more serious is wrong.

___________________________ test_parallel_notebooks ____________________________
capfd = <_pytest.capture.CaptureFixture object at 0x7fbbe7dc2790>
tmpdir = local('/tmp/pytest-of-travis/pytest-0/test_parallel_notebooks0')
    def test_parallel_notebooks(capfd, tmpdir):
        """Two notebooks should be able to be run simultaneously without problems.
    
        The two notebooks spawned here use the filesystem to check that the other notebook
        wrote to the filesystem."""
    
        opts = dict(kernel_name="python")
        input_name = "Parallel Execute.ipynb"
        input_file = os.path.join(current_dir, "files", input_name)
        res = notebook_resources()
    
        with modified_env({"NBEXECUTE_TEST_PARALLEL_TMPDIR": str(tmpdir)}):
            threads = [
                threading.Thread(
                    target=run_notebook,
                    args=(
                        input_file,
                        opts,
                        res,
                        functools.partial(label_parallel_notebook, label=label),
                    ),
                )
                for label in ("A", "B")
            ]
            [t.start() for t in threads]
            [t.join(timeout=2) for t in threads]
    
        captured = capfd.readouterr()
>       assert captured.err == ""
E       assert 'Traceback (m...kernel_info\n\n' == ''
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 174, in _run_module_as_main
E         -     "__main__", fname, loader, pkg_name)
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 72, in _run_code
E         -     exec code in run_globals
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
E         -     app.launch_new_instance()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
E         -     app.initialize(argv)
E         -   File "</home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/decorator.pyc:decorator-gen-121>", line 2, in initialize
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
E         -     return method(app, *args, **kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 469, in initialize
E         -     self.init_sockets()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
E         -     self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
E         -     s.bind("tcp://%s:%i" % (self.ip, port))
E         -   File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
E         -   File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
E         -     raise ZMQError(errno)
E         - ZMQError: Address already in use
E         - Exception in thread Thread-16:
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 801, in __bootstrap_inner
E         -     self.run()
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 754, in run
E         -     self.__target(*self.__args, **self.__kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py", line 97, in run_notebook
E         -     output_nb, _ = preprocessor(cleaned_input_nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
E         -     return self.preprocess(nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 400, in preprocess
E         -     with self.setup_preprocessor(nb, resources, km=km):
E         -   File "/opt/python/2.7.15/lib/python2.7/contextlib.py", line 17, in __enter__
E         -     return self.gen.next()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
E         -     self.km, self.kc = self.start_new_kernel(**kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
E         -     kc.wait_for_ready(timeout=self.startup_timeout)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/jupyter_client/blocking/client.py", line 120, in wait_for_ready
E         -     raise RuntimeError('Kernel died before replying to kernel_info')
E         - RuntimeError: Kernel died before replying to kernel_info
E         -
/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py:305: AssertionError

@MSeal
Copy link
Contributor Author

MSeal commented Jun 13, 2019

Not sure what broke the asyncio operations. I'll debug this over the weekend to figure out what's wrong.

@MSeal
Copy link
Contributor Author

MSeal commented Jul 11, 2019

@Carreau I'm going to look into the async test failure tonight, but I was hoping maybe we could get this prioritized during the sprints and possibly get a release? The race condition here gets hit frequently with the new parallel execution tests in nbconvert.

@MSeal
Copy link
Contributor Author

MSeal commented Jul 12, 2019

@minrk @Carreau If one of you has inisght, I think there's some low level issues the tests are hitting that I'm struggling to put together. The change above fixes the race I was seeing in nbconvert as far as I can tell, but the test suite has test_shutdown failing to teardown. I also see the test suite hang after completing, waiting for a thread to terminate. A cnt-C brings this stack trace out:

Traceback (most recent call last):
  File "zmq/backend/cython/context.pyx", line 136, in zmq.backend.cython.context.Context.term
  File "zmq/backend/cython/checkrc.pxd", line 12, in zmq.backend.cython.checkrc._check_rc

If the atexit change isn't make I see similar errors in other tests around _check_rc calls hanging, usually in the form of "Kernel Died before responding" when the kernel launch times out.

Traceback (most recent call last):
  File "/usr/lib/python3.6/asyncio/events.py", line 145, in _run
    self._callback(*self._args)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 138, in _handle_events
    handler_func(fileobj, events)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events
    zmq_events = self.socket.EVENTS
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/sugar/attrsettr.py", line 48, in __getattr__
    return self._get_attr_opt(upper_key, opt)
  File "/home/mseal/.py3local/lib/python3.6/site-packages/zmq/sugar/attrsettr.py", line 52, in _get_attr_opt
    return self.get(opt)
  File "zmq/backend/cython/socket.pyx", line 501, in zmq.backend.cython.socket.Socket.get
  File "zmq/backend/cython/socket.pyx", line 262, in zmq.backend.cython.socket._getsockopt
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc

Considering I had to add the commented atexit cleanup back in to get the async tests to pass it looks suspiciously like zmq cleanup is failing in the tests due to a process or thread hanging indefinitely.

Changing the zmq.Context().instance() pattern seems to have no impact on the tests issues btw.

Thoughts or places to investigate more with the zmq stack traces?

# Uncomment this to try closing the context.
# atexit.register(context.term)
atexit.register(context.term)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source of the hang. Calling context.term in atexit is terminating the context prior to closing sockets, which causes a hang. This isn't safe to do without closing all sockets first. While this is a good thing to do, I don't think you need to do it in this PR. We would have to safely notify all sockets in all threads to be closed and then finally terminate the context.

Suggested change
atexit.register(context.term)
# atexit.register(context.term)

Copy link
Contributor Author

@MSeal MSeal Jul 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@minrk Makes sense. If I turn that back off I get Kernel Stopped Responding in the async tests. Though it seems more just tests running later in the pool of tests start failing with timeouts. It seemed like the tests weren't cleaning up sockets and then later tests are trying to find a free socket and timing out. Running any test individually passes but the full pytest pass starts failing (see build for first commit of PR). Ideas on what needs to be done there?

avoids hangs during context termination
because travis can be real slow
@MSeal
Copy link
Contributor Author

MSeal commented Jul 18, 2019

Those changes do look good. Any inclination for why the shutdown failed on just python 3.5 suite? I'll have a little time this weekend to look at this again.

@minrk
Copy link
Member

minrk commented Jul 19, 2019

Sorry, I meant to make a comment about the changes, not silently update your branch! I went through and added some explicit cleanup of zmq sockets and contexts. The hang was being caused by some garbage collection of a context not explicitly cleaned up by a prior test, leaving its sockets hanging around, which hangs. I don't really understand why this would happen, since sockets have a reference to the context but not vice versa, so gc order shouldn't allow contexts to delete before sockets 🤷‍♂, but it's always a good idea to clean up explicitly anyway.

I'm trying to think about shutdown. My hunch is that the eventloop in the kernel is not stopping, or possibly something in zmq is not cleaning up in the right order. First thing I'd look at is if any package versions are different on 3.5. If it's random and different versions of py3 >= 3.5 hang on re-run, then I'd be almost certain it's garbage collection of zmq objects that's the cause, because Python 3.5 introduced forced random gc order at process shutdown. In that case, making sure we cleanup sockets at process shutdown should help.

might fix garbage collection-related hangs on kernel exit
@minrk
Copy link
Member

minrk commented Jul 19, 2019

I added one more commit that performs explicit cleanup at exit. If gc issues were the problem, I'd expect this to fix it (unless there are more sockets somewhere we aren't tracking)

edit: this is essentially restoring the cleanup-at-exit change you made earlier, but with the extra bookkeeping of ensuring everything is cleaned up in the right order.

@MSeal
Copy link
Contributor Author

MSeal commented Jul 22, 2019

Awesome! Thanks for the help on the PR. Anything else needed for this to merge?

@Carreau
Copy link
Member

Carreau commented Jul 26, 2019

Awesome! Thanks for the help on the PR. Anything else needed for this to merge?

I'd say let's go !

@Carreau Carreau merged commit 4e3769d into ipython:master Jul 26, 2019
@Carreau
Copy link
Member

Carreau commented Jul 26, 2019

And we shoudl try to do a release soon.

@MSeal
Copy link
Contributor Author

MSeal commented Jul 26, 2019

Thanks! And yes a release would be helpful so the nbconvert release doesn't have to note a caveat about ipykernels occasionally crashing when parallelizing conversions.

@Carreau
Copy link
Member

Carreau commented Jul 27, 2019

I have no objections; though I haven't been involved in ipykernel developement in a while so would prefer someone more involve with it to do one.

Not even sure if that woudl be a 5.2 release, a 5.1.2 or a 6.0

@MSeal
Copy link
Contributor Author

MSeal commented Jul 28, 2019

@Carreau @minrk Given these commits: v5.1.1...master
I see 4 PRs:
#408
#405
#394
and this one #412

The changes are all forward compatible and fixing minor to niche problems, so I think a patch release is reasonable with a 5.1.2 release if this is roughly following semver. A 5.2.0 would probably be fine as well if we wanted to denote the parallel execution support better.

@Azrealy
Copy link

Azrealy commented Jul 29, 2019

@MSeal Sorry for the break-in. I have an issue when I use ipykernel master branch which already merged with this branch, and then replace my 5.1.1 version ipykernel package. When I try to shut down my notebook, the following error message shows up.

Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/heartbeat.py", line 98, in run
    zmq.device(zmq.QUEUE, self.socket, self.socket)
  File "zmq/backend/cython/_device.pyx", line 54, in zmq.backend.cython._device.device
  File "zmq/backend/cython/_device.pyx", line 95, in zmq.backend.cython._device.proxy
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Socket operation on non-socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 884, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib64/python3.6/threading.py", line 926, in _bootstrap_inner
    (self.name, _format_exc()), file=_sys.stderr)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 410, in write
    self._schedule_flush()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 334, in _schedule_flush
    self.pub_thread.schedule(_schedule_in_thread)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 207, in schedule
    f()
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/ipykernel/iostream.py", line 333, in _schedule_in_thread
    self._io_loop.call_later(self.flush_interval, self._flush)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/tornado/ioloop.py", line 602, in call_later
    return self.call_at(self.time() + delay, callback, *args, **kwargs)
  File "/home/user/.virtualenvs/test-ipykernel/lib/python3.6/site-packages/tornado/platform/asyncio.py", line 164, in call_at
    functools.partial(callback, *args, **kwargs),
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 544, in call_later
    timer = self.call_at(self.time() + delay, callback, *args)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 554, in call_at
    self._check_closed()
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 358, in _check_closed
    raise RuntimeError('Event loop is closed')
Error in atexit._run_exitfuncs:
RuntimeError: Event loop is closed

I have set up a new virtualenv and pip installed the notebook 6.0.0 version before I upgrade the ipykernel base on this branch.

@MSeal
Copy link
Contributor Author

MSeal commented Jul 29, 2019

Hmm... Something added in the changes to get the tests to pass broke the kernel when parallelized. I can reproduce by running the nbconvert test suite with this change installed. @minrk I won't be able to dig into fixing this for a bit fyi.

@minrk minrk mentioned this pull request Aug 7, 2019
@minrk
Copy link
Member

minrk commented Aug 7, 2019

#419 should fix the shutdown cleanup issue

@MSeal MSeal mentioned this pull request Aug 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants