Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests which fail in multiprocessing contexts #1018

Merged
merged 25 commits into from Jul 8, 2019

Conversation

@alexrudy
Copy link
Contributor

alexrudy commented May 8, 2019

@MSeal and I got a ways toward supporting multiprocessing at PyCon sprints. The two tests added here, one of which still fails, demonstrate the problem from nbconvert.

I am working on a minimum reproduction for jupyter_client as well, but I wanted to put this branch up somewhere so that we can discuss & share the reproducing examples. I've marked both tests as XFail, but since they are both slow, I don't even know that this should get merged.

@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented May 12, 2019

Thanks for posting. Let's leave this up while the jupyter_client change gets reviewed and then re-evaluate if the tests need to be made quicker to merge or not.

Copy link
Collaborator

MSeal left a comment

Apologies for being disconnected for a while. This was the first weekend I was around to go through the past week or two's open source PRs (been a busy month).

Do you have any suggestions for debugging test_parallel_fork_notebooks further? I could probably take a deeper dive on Monday into what's happening here.

assert captured.err == ""

@pytest.mark.xfail
def test_parallel_fork_notebooks(capfd):

This comment has been minimized.

Copy link
@MSeal

MSeal May 25, 2019

Collaborator

I can't get this test to pass on my ubuntu 18.04 machine. It will always hang if the thread is running when multi-processing launches.

This comment has been minimized.

Copy link
@MSeal

MSeal May 25, 2019

Collaborator

I tried with and without the zeromq cleanup, and with / without the jupyter_client pending release changes. Haven't dug in deeper as to why this still hangs.

input_file,
opts,
res,
functools.partial(label_parallel_notebook, label=label),

This comment has been minimized.

Copy link
@MSeal

MSeal May 25, 2019

Collaborator

@mpacer Wanted to keep run_notebook helper from getting more complex. Could we check in Parallel Execute A.ipynb and Parallel Execute B.ipynb into the PR to avoid adding this?

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

Sorry for not reaching out about this directly. What I would recommend doing would be the following
Instead of

threads = [
            threading.Thread(
                target=run_notebook,
                args=(
                    input_file,
                    opts,
                    res,
                    functools.partial(label_parallel_notebook, label=label),
                 ),
            )
            for label in ("A", "B")
        ]

go with

input_files = [label_parallel_notebook(input_file, label=label) for label in ("A", "B")]
threads = [
            threading.Thread(
                target=run_notebook,
                args=(
                    this_input_file,
                    opts,
                    res
                 ),
            )
            for this_input_file in input_files
        ]

That way you don't need to change the signature of run_notebook. This should work because preprocessing the notebook to add a cell doesn't need to occur concurrently, but the notebooks themselves will still be running concurrently.

# Destroy the context - if you don't do this, the context
# will survive across the fork, and then fail to start properly.
import zmq
zmq.Context.instance().destroy()

This comment has been minimized.

Copy link
@MSeal

MSeal May 25, 2019

Collaborator

FYI I tested this with the new jupyter_client. Could we add a TODO: delete when jupyter_client>=5.2.4 releases?

@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented May 30, 2019

I haven't gotten an actual stack trace here, but in tests I have seen

___________________________ test_parallel_notebooks ____________________________
capfd = <_pytest.capture.CaptureFixture object at 0x7fbbe7dc2790>
tmpdir = local('/tmp/pytest-of-travis/pytest-0/test_parallel_notebooks0')
    def test_parallel_notebooks(capfd, tmpdir):
        """Two notebooks should be able to be run simultaneously without problems.
    
        The two notebooks spawned here use the filesystem to check that the other notebook
        wrote to the filesystem."""
    
        opts = dict(kernel_name="python")
        input_name = "Parallel Execute.ipynb"
        input_file = os.path.join(current_dir, "files", input_name)
        res = notebook_resources()
    
        with modified_env({"NBEXECUTE_TEST_PARALLEL_TMPDIR": str(tmpdir)}):
            threads = [
                threading.Thread(
                    target=run_notebook,
                    args=(
                        input_file,
                        opts,
                        res,
                        functools.partial(label_parallel_notebook, label=label),
                    ),
                )
                for label in ("A", "B")
            ]
            [t.start() for t in threads]
            [t.join(timeout=2) for t in threads]
    
        captured = capfd.readouterr()
>       assert captured.err == ""
E       assert 'Traceback (m...kernel_info\n\n' == ''
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 174, in _run_module_as_main
E         -     "__main__", fname, loader, pkg_name)
E         -   File "/opt/python/2.7.15/lib/python2.7/runpy.py", line 72, in _run_code
E         -     exec code in run_globals
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel_launcher.py", line 16, in <module>
E         -     app.launch_new_instance()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 657, in launch_instance
E         -     app.initialize(argv)
E         -   File "</home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/decorator.pyc:decorator-gen-121>", line 2, in initialize
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
E         -     return method(app, *args, **kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 469, in initialize
E         -     self.init_sockets()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
E         -     self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
E         -     s.bind("tcp://%s:%i" % (self.ip, port))
E         -   File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
E         -   File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
E         -     raise ZMQError(errno)
E         - ZMQError: Address already in use
E         - Exception in thread Thread-16:
E         - Traceback (most recent call last):
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 801, in __bootstrap_inner
E         -     self.run()
E         -   File "/opt/python/2.7.15/lib/python2.7/threading.py", line 754, in run
E         -     self.__target(*self.__args, **self.__kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py", line 97, in run_notebook
E         -     output_nb, _ = preprocessor(cleaned_input_nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__
E         -     return self.preprocess(nb, resources)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 400, in preprocess
E         -     with self.setup_preprocessor(nb, resources, km=km):
E         -   File "/opt/python/2.7.15/lib/python2.7/contextlib.py", line 17, in __enter__
E         -     return self.gen.next()
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 345, in setup_preprocessor
E         -     self.km, self.kc = self.start_new_kernel(**kwargs)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/execute.py", line 296, in start_new_kernel
E         -     kc.wait_for_ready(timeout=self.startup_timeout)
E         -   File "/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/jupyter_client/blocking/client.py", line 120, in wait_for_ready
E         -     raise RuntimeError('Kernel died before replying to kernel_info')
E         - RuntimeError: Kernel died before replying to kernel_info
E         -
/home/travis/virtualenv/python2.7.15/lib/python2.7/site-packages/nbconvert/preprocessors/tests/test_execute.py:305: AssertionError

now which indicates there might be another issue in on the kernel side with port management?

@@ -69,14 +72,17 @@ def build_preprocessor(opts):
return preprocessor


def run_notebook(filename, opts, resources):
def run_notebook(filename, opts, resources, preprocess_notebook=None):

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

As described below, you can avoid adding this complication that also introduces an additional meaning to what it means to preprocess a notebook.

}
)

nb.cells.insert(1, label_cell)

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

I would make a deepcopy rather than mutating the copy you sent in.

Suggested change
nb.cells.insert(1, label_cell)
nb = deepcopy(nb)
nb.cells.insert(1, label_cell)
Used for parallel testing to label two notebooks which are run simultaneously.
"""
label_cell = nbformat.NotebookNode(

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

Rather than creating a NotebookNode directly I would use the new_code_cell function from nbformat.v4.
https://github.com/jupyter/nbformat/blob/11903688167d21af96c92f7f5bf0634ab51819f1/nbformat/v4/nbbase.py#L98-L110

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

The reason to do that is it ensures that the code cell itself will be valid according to the v4 spec.

return nb


def test_parallel_notebooks(capfd, tmpdir):

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

Are capfd and tmpdir pytest fixtures? Otherwise, where are these being populated when it comes time to test them?

@@ -220,6 +220,20 @@ class ExecutePreprocessor(Preprocessor):
)
).tag(config=True)

ipython_hist_file = Unicode(

This comment has been minimized.

Copy link
@mpacer

mpacer May 30, 2019

Member

I think you might want to remove this change and put it in its own PR as it stands alone as a huge benefit.

@mpacer

This comment has been minimized.

Copy link
Member

mpacer commented May 31, 2019

E - raise ZMQError(errno)
E - ZMQError: Address already in use

That definitely suggests that whatever is assigning ports is unaware of the other instances that are assigning ports. This might have been introduced by getting rid of the singleton ZmqContext or it might have been always present but less likely to occur because there were fewer ports being assigned.

I wonder if there are lessons to be learned from how the MultiKernelManager handles this case. The notebook server builds on that in its MappingKernelManager.

But the fact that we're running into this issue might mean that we might need to provide a primitive on the nbconvert side to enable this rather than relying on library users to solve it in their own parallel code.

@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented May 31, 2019

The subprocess wouldn't have the zeromq change or parent process manipulation, as that's unreleased isolated by process. So I believe it's unrelated to the singleton change. I think I found the bug: https://github.com/ipython/ipykernel/blob/master/ipykernel/kernelapp.py#L185-L187 is a race condition where between when it finds a free port (it's written to handle port conflicts) and when it actually uses it the port could have been taken by another process. It's likely not safe regardless of nbconvert parallelism if you have a notebook server up and are running nbconvert or another notebook server at the same time.

I'll open an issue / fix for ipykernel.

On the later comment you had I don't know if an nbconvert primitive would help, because I might spawn multiple nbconvert processes at the same time outside of a parent python process (e.g. spawning nbconvert processes in a bash loop). This would probably defeat the approach MultiKernelManager takes. Maybe we could use a file lock to enable cross process nbconvert instances to wait for each kernel to launch before starting a new one. It would only make nbconvert parallelism safe (and slow) but not safe across multiple types of clients connecting to zeromq. Correct me if your thinking of something else here.

Basically I think the kernels wrappers should solve it and we should help improve the tools for making kernels to not have to think about this if we can.

@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented Jun 2, 2019

I now have a local fix to the the ipykernel socket issue. But that failure mode is pretty rare and wasn't what we were hitting here.

With some modifications to tests and printing a few lines wrapping execute starts and stops I was able to narrow down that any processes that start while the threaded execution is running fail. Looking deeper by adding a log message whenever a zmq context instance is generated revealed that the forked processes before that the parent process finishes executing would not generate new context instances. I believe the traitlets usage or some other wrapper is leaving the context object on the class and reusing it in the forked processes instead of recomputing.

/usr/lib/python3.6/concurrent/futures/_base.py:384: TimeoutError
----------------------------------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------------------------------
ABOUT TO START 0! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/Sleep One.ipynb
ABOUT TO START 1! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 8
ABOUT TO START 2! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 4! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 3! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
CREATING CONTEXT
DONE 0! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/Sleep One.ipynb
ABOUT TO START 5! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 6! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 7! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
ABOUT TO START 8! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
CREATING CONTEXT
CREATING CONTEXT
CREATING CONTEXT
CREATING CONTEXT
DONE 8! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 7
DONE 7! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 6
DONE 6! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 5
DONE 5! /home/mseal/Workspace/nbconvert/nbconvert/preprocessors/tests/files/HelloWorld.ipynb
WAITING ON 4
# Times out on 4 waiting for a failed processes
@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented Jun 2, 2019

Seems like the processes 1-4 aren't even making it to the kernel client initialization. I'll do some more debugging tonight and find where they fail to get past more definitively.

@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented Jun 12, 2019

Found the parallel processing issue after I tracked it down to the GIL freezing up before entering the __new__ class method of the KernelManager.
https://bugs.python.org/issue27422
... forgot the C rule of never thread then fork, always fork and then thread

Basically the processes hangs if the GIL is still active when the multiprocessing forks try to acquire it and deadlock.
If the multiprocessing forks happen to get started before the threads do (which we saw sometimes) then we had a fork first situation and it would pass.

Going to modify the PR to remove the thread then fork test and get the rest merged so we can move forward. There is a race condition in ipykernel that's difficult to trigger that I have PR partially written to solve the only remaining race condition I've seen crop up.

@MSeal
MSeal approved these changes Jun 12, 2019
@MSeal

This comment has been minimized.

Copy link
Collaborator

MSeal commented Jun 13, 2019

Known race condition in ipykernel now has a PR to address: ipython/ipykernel#412

@MSeal MSeal merged commit 76061f8 into jupyter:master Jul 8, 2019
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@MSeal MSeal added this to the 5.6 milestone Jul 30, 2019
@meeseeksmachine

This comment has been minimized.

Copy link

meeseeksmachine commented Aug 9, 2019

This pull request has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/nbconvert-5-6-0-release/1867/1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.