
SUB sockets take time to subscribe to the IOPub channel and miss important messages #593

Closed
SylvainCorlay opened this issue Dec 8, 2020 · 13 comments

Comments

@SylvainCorlay (Member) commented Dec 8, 2020

This occurs randomly in e.g.

  • JupyterLab when hitting the button to "restart the kernel, and re-run the whole notebook"

    [screenshot: iopub]

    (On my setup, this happens roughly one time out of three.) In the screenshot above, we see that only the IOPub stream messages from Cell 4 are received. Typically, either all or none are shown, but an intermediate result like this one also happens.

    Inspecting the messages on the websocket shows that the idle status message with the parent header corresponding to the shutdown_request is the first IOPub message, but it is sent after the first three stream messages corresponding to the first cells. (This has been reported here: Jupyter Lab stalls when choosing "Restart Kernel and Run All Cells..." jupyterlab/jupyterlab#9008)

  • Voilà, when running a dashboard

    In that case, it is a rarer occurrence. Basically, the idle status message for the kernel info request is never received because it is sent before jupyter_client has subscribed to the IOPub channel! Since the connect method in JupyterLab waits for that idle message to resolve the promise, it hangs indefinitely...

I am surprised that we did not hit this before and I wonder if this is a regression in jupyter_client or pyzmq, or if some changes in the relative timing exposed this race condition only recently.
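
For context, the underlying mechanism is ZeroMQ's well-known "slow joiner" behavior: a SUB socket's subscription takes time to propagate to the PUB side, and anything published during that window is silently dropped. A minimal standalone pyzmq sketch of the race (illustrative only, not Jupyter code; the port number is arbitrary):

    import time
    import zmq

    ctx = zmq.Context.instance()

    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://127.0.0.1:5556")

    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://127.0.0.1:5556")
    sub.setsockopt(zmq.SUBSCRIBE, b"")

    # The subscription has probably not reached the PUB side yet, so this is usually dropped.
    pub.send(b"sent immediately: usually lost")
    time.sleep(0.5)  # give the subscription time to propagate
    pub.send(b"sent after a delay: received")

    poller = zmq.Poller()
    poller.register(sub, zmq.POLLIN)
    while poller.poll(timeout=200):
        print(sub.recv())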

cc @minrk @jtpio @JohanMabille @afshin

EDIT:

Independent fixes have been applied wherever we subscribe to channels. The approach is to nudge the kernel with kernel info requests until we get something on IOPub.

@JohanMabille (Member) commented Dec 8, 2020

A reliable solution to this problem is to replace the PUB socket of the kernel with an XPUB socket, which is able to detect new subscribers. The kernel then sends a "welcome message" upon new subscription. When connecting to a kernel, the frontend waits for this message before sending any request.

The frontend could support both behaviors and select the right one based on a new optional field in the kernelspec (welcome_message for instance). This would avoid backward-incompatible changes in the protocol.
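
A rough sketch of the kernel-side idea in plain pyzmq (illustrative only; real kernels route IOPub through jupyter_client's Session, and the actual welcome-message format would be defined by the protocol change). An XPUB socket surfaces subscriptions as messages, so the kernel can greet each new subscriber:

    import zmq

    ctx = zmq.Context.instance()
    iopub = ctx.socket(zmq.XPUB)        # XPUB instead of PUB: subscription events become visible
    iopub.bind("tcp://127.0.0.1:5556")  # hypothetical IOPub port

    while True:
        event = iopub.recv()            # b'\x01' + topic on subscribe, b'\x00' + topic on unsubscribe
        if event.startswith(b"\x01"):
            topic = event[1:]
            # Send a placeholder "welcome message" so the new subscriber knows IOPub is live.
            iopub.send_multipart([topic, b'{"msg_type": "iopub_welcome"}'])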

@minrk (Member) commented Dec 9, 2020

The kernel should not care if there are subscribers, so I don't think the XPUB change is appropriate (talking to a kernel with only a shell socket is totally valid, and having 0-many subscribers is also valid).

I think we addressed this in nb classic (and nbconvert, I believe) by probing with a kernel_info_request and waiting for status messages. Only after IOPub connection is verified are requests actually sent. I'll try to find where it happens, and maybe some change resulted in losing that logic.

If there are changes to make, they are likely in notebook/jupyter_server/JupyterLab. It could be a performance regression in libzmq causing subscriptions to take a bit longer to register, if this is happening more in recent weeks than earlier. jupyter_client could also provide a wait_for_IOPub utility function.

@JohanMabille (Member) commented Dec 9, 2020

> The kernel should not care if there are subscribers, so I don't think the XPUB change is appropriate (talking to a kernel with only a shell socket is totally valid, and having 0-many subscribers is also valid).

I don't think the XPUB change invalidates that behavior; it just helps the client know that the IOPub connection is established, in case the client is waiting for that before sending requests to the kernel.

> I think we addressed this in nb classic (and nbconvert, I believe) by probing with a kernel_info_request and waiting for status messages.

The problem is that the idle status message (sent after the kernel_info_request has been answered) can be emitted before the IOPub connection is established, and in that case the message is lost. The client will wait forever for the status message (which is exactly the problem we have here).

> Only after IOPub connection is verified are requests actually sent.

Maybe there's a mechanism that I am not aware of, but for me there is no reliable way to ensure the IOPub connection has been established other than receiving a message. In that case, the kernel should regularly send idle messages for clients connecting late.

@JohanMabille (Member) commented Dec 9, 2020

After an in-person meeting, the solution suggested by @minrk is to ensure the IOPub connection is established with the following procedure:

  • the client sends a kernel_info_request to the kernel and waits for both the kernel_info_reply and the idle status message.
  • if the client receives the kernel_info_reply but not the idle message before a timeout, repeat the previous operation until both messages are received.

The advantage of this over the XPUB (and welcome message) approach is that it does not require changing all the kernels. Besides, implementing it in jupyter_server and making it not accept websocket connections until the IOPub connection is established would guarantee the fix for most JavaScript clients.
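
A minimal sketch of that procedure using jupyter_client's blocking client (the nudge helper name is illustrative, not an existing API; the actual fix lives in jupyter_server/notebook):

    import queue

    from jupyter_client.manager import start_new_kernel

    def nudge(kc, attempts=10, timeout=1.0):
        """Repeat kernel_info_request until IOPub delivers the matching idle status."""
        for _ in range(attempts):
            msg_id = kc.kernel_info()              # kernel_info_request on the shell channel
            try:
                kc.get_shell_msg(timeout=timeout)  # kernel_info_reply; the shell channel is not lossy
            except queue.Empty:
                continue
            try:
                while True:
                    msg = kc.get_iopub_msg(timeout=timeout)
                    if (msg["parent_header"].get("msg_id") == msg_id
                            and msg["content"].get("execution_state") == "idle"):
                        return True                # IOPub subscription confirmed
            except queue.Empty:
                continue                           # idle never arrived: SUB not subscribed yet, retry
        return False

    km, kc = start_new_kernel(kernel_name="python3")
    if not nudge(kc):
        raise RuntimeError("could not confirm the IOPub connection")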

@SylvainCorlay (Member, Author)

Btw, the classic notebook has exactly the same issue with "restart and run all".

@jasongrout (Member)

> Besides, implementing it in jupyter_server and making it not accept websocket connections until the IOPub connection is established would guarantee the fix for most JavaScript clients.

After a quick glance at the code, I think these are the two relevant places in jupyter_server:

(and similar places in the current notebook repo as well)

@kevin-bates (Member)

I'm not sure if this will be found helpful, but I wanted to share my findings/observations (and I apologize for the length).

  • Using notebook (6.1.5/master), jupyter_client (6.1.7), and lab (2.x) or nbclassic UI, I see the expected ~30% failure rate.
  • Using jupyter_server (1.0.8/master), jupyter_client (6.1.7), and lab 3.0rc13, I see a 100% failure rate - I cannot NOT reproduce the issue - so I question that configuration - unless others can confirm.

With --debug enabled, I went ahead and extended the activity logging to include the [parent_header][msg_type] field of the logged activity status, in order to associate the idle status with the kernel_info_request message, and found that even in failure cases a kernel_info_request idle status is (nearly always) returned.

Since that logging is not what the web client actually receives, it may not reflect the real results, so I added the same logging to ZMQChannelsHandler._on_zmq_reply() and found similar behavior.

Here's the snippet I added to the channels handler in notebook and jupyter_server just prior to the call to the superclass.

        # Log which request (parent_header.msg_type) this status message belongs to.
        if msg_type == 'status':
            execution_state = msg['content']['execution_state']
            if 'msg_type' in msg['parent_header']:
                request = msg['parent_header']['msg_type']
                self.log.debug("handler - activity on %s: from %s - %s (%s)", self.kernel_id, request, msg_type,
                               execution_state)
            else:
                self.log.debug("handler - activity on %s: %s (%s)", self.kernel_id, msg_type, execution_state)
        else:
            self.log.debug("handler - activity on %s: %s", self.kernel_id, msg_type)

FWIW - here's the applicable snippet I added to the activity tracking - in case you wanted to display the parent_header.msg_type field (notebook and jupyter_server).

            if msg_type == 'status':
                kernel.execution_state = msg['content']['execution_state']
                if 'msg_type' in msg['parent_header']:
                    request = msg['parent_header']['msg_type']
                    self.log.debug("activity on %s: from %s - %s (%s)", kernel_id, request, msg_type, kernel.execution_state)
                else:
                    self.log.debug("activity on %s: %s (%s)", kernel_id, msg_type, kernel.execution_state)
            else:
                self.log.debug("activity on %s: %s", kernel_id, msg_type)

If I'm understanding the issue correctly, the problem is that the IOPub channel is not started/initialized, and we should ensure an idle status has been received to confirm its initialization before proceeding. However, I believe I'm seeing the issue occur after an idle status has been received. And, among the failing cases, it seems I do not see the status logged from the channels handler, while it does appear to be logged by the activity monitor - which makes sense, since Lab won't send the execution requests unless it receives the status (essentially from the channels handler).

At any rate, I'm hoping you might find the additional logging easier to troubleshoot with.

@SylvainCorlay (Member, Author) commented Dec 10, 2020

Thanks for sharing your findings @kevin-bates. I think we understand the issue well now. We will be implementing a workaround in jupyter-client and jupyter-server shortly, and proposing a change in the protocol to address the issue in a more robust way.

@SylvainCorlay (Member, Author)

I have opened a couple of PRs to notebook and jupyter_server to nudge the kernel with kernel info requests until it starts receiving IOPub notifications.

minrk added a commit to minrk/ipykernel that referenced this issue Dec 18, 2020
@minrk minrk changed the title Jupyter client takes time to subscribe to the IOPub channel and misses important messages. SUB sockets take time to subscribe to the IOPub channel and miss important messages Dec 18, 2020
@minrk (Member) commented Dec 18, 2020

I was able to provoke this behavior more severely and reliably by artificially dropping the first N iopub messages from the kernel: minrk/ipykernel@3fdb063

then it's very easy to see what happens if it takes a ridiculous amount of time for subscriptions to propagate, and jupyter/notebook#5908 handles it gracefully!

@jasongrout-db

> The kernel should not care if there are subscribers, so I don't think the XPUB change is appropriate (talking to a kernel with only a shell socket is totally valid, and having 0-many subscribers is also valid).

Just to follow up on this for people reading this discussion for the first time: please read the continued discussion on jupyter/enhancement-proposals#65, in which Min and others approve using XPUB in a JEP.

@JohanMabille (Member)

Should we close this issue? The JEP has been approved, and the nudge mechanism has been implemented in jupyter_client.

@jasongrout (Member)

Fixed by #592
