Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port already in use #279

Merged
merged 7 commits into from Aug 9, 2017

Conversation

@rolweber
Copy link
Contributor

rolweber commented Jul 6, 2017

See issue #276.

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Jul 18, 2017

This PR is ready for a first round of reviews and feedback. Summary:

  • connect.py: The ConnectionFileMixin tracks which ports are randomly assigned, and knows how to unassign them.
  • manager.py: The KernelManager.restart_kernel function has a new, optional parameter that tells whether to choose new random ports or keep the old ones.
  • restarter.py: The KernelRestarter will keep the random ports as soon as the kernel has been detected alive. If the kernel must be restarted without ever being alive, new ports will be chosen.

It depends on the kernel implementation whether that addresses the port-in-use problem. If a kernel first gets all ports and does not respond to heartbeats unless that succeeds, it will be restarted with new ports. But if there are kernel implementations out there which respond to a heartbeat and therefore appear alive, although they couldn't get one of the other ports, they will be restarted with the same ports. But there's only so much that can be detected on the Jupyter client side.

@rolweber rolweber changed the title [WIP] Port already in use Port already in use Jul 19, 2017
@minrk

This comment has been minimized.

Copy link
Member

minrk commented Jul 19, 2017

Nice! I think this is a good idea. We will have to think a little bit carefully about when newports is invoked automatically to avoid breaking things in surprising ways.

In particular, we should check what's going to happen in the notebook when this event occurs. What I would like to avoid is this sequence of events:

  1. new kernel is requested
  2. websocket channel connects to assigned port
  3. kernel fails and is restarted with a new port
  4. kernel is running and reports okay, but notebook is connected to the wrong port and wasn't notified that the port changed

This would be frustrating and confusing to have everything claim to be working, but have execution silently fail.

It's okay if this is the starting point because it's replacing an already broken case of a dead kernel, but I want to make sure that the right notifications/events are in place to handle it better in the notebook (i.e. ensure that connected channels reconnect to the new ports).

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Jul 19, 2017

@minrk: Thanks for having a look. I agree that we need to think through some error scenarios.

The websocket channel connects to the notebook server, not to the kernel, right?
All kernel sockets are for ZeroMQ communication, or am I missing something?

@minrk

This comment has been minimized.

Copy link
Member

minrk commented Jul 19, 2017

@rolweber when a websocket connects to the server, a corresponding zmq socket is connected from the server to the kernel. I want to know what will happen if the connection happens before the restart (i.e. is it possible that the zmq socket is connected to the first port and not the new port).

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Jul 24, 2017

I studied the code of class ZMQChannelsHandler in notebook/services/kernels/handlers.py, along with some base classes and mixins. Afaict, ZMQ connections to the kernel are created in create_stream, which calls one of the connect_* methods in the ConnectionFileMixin. These methods use the current port number as stored in the mixin attributes.

create_stream registers a handler for kernel restart events, on_kernel_restarted, which sends a status update about the restart. I assume that the status message is going to the browser. From there on, it's the JavaScript code in the browser that has to handle the event.

Without the change I'm proposing here, the ZMQ connections between the notebook server and the kernel already have to be re-established, because they were dropped when the old kernel got killed. With my change, re-establishing connections will pick up the new port numbers. Disclaimer: AFAICT.

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Jul 24, 2017

The KernelRestarter already has some config parameters. I could add another one that enables the new behavior, while keeping the old behavior as the default.

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Jul 24, 2017

If we feel it is safe, we could also set the default to the new behavior and keep the config parameter only as a fallback in case of unexpected problems.

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Aug 9, 2017

I'm back from vacation. Any suggestions on how I can move this PR further towards merging?

@minrk

This comment has been minimized.

Copy link
Member

minrk commented Aug 9, 2017

Thanks for your detailed look into the notebook behavior! I think we can make this True by default based on your findings. Then I think it's good to go.

@rolweber

This comment has been minimized.

Copy link
Contributor Author

rolweber commented Aug 9, 2017

Thanks for reviewing! I switched the default.

@minrk minrk merged commit d6e6bc9 into jupyter:master Aug 9, 2017
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@minrk minrk added this to the 5.2 milestone Aug 9, 2017
@rolweber rolweber deleted the rolweber:port-already-in-use branch Aug 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.