Out-of-order execution #9566
Does it happen if using "run all" instead? I think the execution queue is populated differently, but still worth checking...
BTW - for me it's been impossible to reproduce it even as reported above, but that could be due to variations in the binder node I happened to get. My kernel restarts were super fast, and execution (whether manual or "run all") came very quickly and always in order. So it may be a bug that's a little tricky to trigger/replicate.
No, it happens only when executing the cells manually.
The browser output shows a bunch of
Sounds possible, though I don't know that code at all, unfortunately... Worth perhaps instrumenting that particular path to add more info about what kind of message it was trying to send. This is definitely a pretty serious bug, as it will lead to potentially very hard to understand behavior if a user doesn't realize it happened (and depending on their code, it could still run, just produce odd results).
Looks like the error might come from this?
If it's only reproducible from certain locations (I opened the binder from France), I'd be happy to help with debugging.
So it looks like it may be tied to comm messages? So perhaps widgets are a good way to reproduce it?
I am not sure if this is related, but I have opened another issue in which I could see
I think the root problem here isn't with comm messages. I can occasionally reproduce the issue on a local install of all current packages in a fresh conda env, with a notebook with simple inputs. Here is one result of restarting the kernel and immediately evaluating the cells in this notebook from top to bottom:

Conda packages:
Update: I think I found a place in the logic where some messages get queued, then others are sent (jumping the queue) before the queued messages are sent. I'll investigate more later today. Relevant logic seems to be at jupyterlab/packages/services/src/kernel/default.ts, lines 385 to 395 (d5a6043).
Maybe that explains the example above? With:
Yes, that's my hypothesis. And probably the queued messages are scheduled for sending, but async, so the direct sending cuts in line? We have to be careful, because we do want kernel info requests to jump the queue. |
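To make that hypothesis concrete, here is a minimal, self-contained TypeScript sketch (invented names, not JupyterLab code) showing how an asynchronous queue flush lets a later synchronous send cut in line:

```typescript
// Hypothetical illustration of the suspected ordering bug (not JupyterLab code).
class FakeSocket {
  sent: string[] = [];
  send(msg: string): void {
    this.sent.push(msg);
  }
}

const socket = new FakeSocket();
const pending: string[] = [];

// Messages sent while "restarting" are queued instead of sent.
function enqueue(msg: string): void {
  pending.push(msg);
}

// The queue is flushed asynchronously (e.g. after a status change resolves).
async function flushAsync(): Promise<void> {
  await Promise.resolve(); // yield to the microtask queue
  while (pending.length > 0) {
    socket.send(pending.shift()!);
  }
}

// Cells 1 and 2 are executed while the restart sentinel is still set:
enqueue('execute cell 1');
enqueue('execute cell 2');
void flushAsync();

// The sentinel clears, and cell 3 is sent directly and synchronously,
// before the async flush has had a chance to run:
socket.send('execute cell 3');

setTimeout(() => console.log(socket.sent), 0);
// -> [ 'execute cell 3', 'execute cell 1', 'execute cell 2' ]
```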
Relevant issues:
Fixes jupyterlab#9566. Follow-up on jupyterlab#8562. Changes the solution in jupyterlab#9484.

If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This happened because the initial evaluations would be queued (because we had the kernel-restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent, to preserve the message order. The one exception to this is the kernel info request when we are restarting.

We redo the logic in jupyterlab#9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner. We also realized that the sendMessage `queue` parameter really signifies that we are sending pending messages, so we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don't worry about setting the kernel session there, since the calling function does that.
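As a rough illustration of the queueing decision described above (a simplified sketch, not the actual packages/services code; the class and member names here are invented):

```typescript
// Simplified sketch of the send decision (not the actual default.ts code).
interface KernelMessage {
  msgType: string;
  content: unknown;
}

class SendDecisionSketch {
  private _pending: KernelMessage[] = [];
  private _restarting = false;

  sendMessage(msg: KernelMessage): void {
    // Exception: while restarting, a kernel_info_request may jump the queue,
    // since its reply is what clears the restarting sentinel.
    const jumpsQueue =
      this._restarting && msg.msgType === 'kernel_info_request';

    // Otherwise, queue the message whenever messages are already queued (or
    // we are restarting), so that message order is preserved.
    if (!jumpsQueue && (this._restarting || this._pending.length > 0)) {
      this._pending.push(msg);
      return;
    }
    this._socketSend(msg);
  }

  private _socketSend(msg: KernelMessage): void {
    // Placeholder for the real websocket send.
    console.log('send', msg.msgType);
  }
}
```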
@davidbrochart - can you check the binder link generated at #9571 (or test that PR directly) to see if it solves the problem you saw here? Testing locally, that fix seemed to solve the issues I was seeing.
I tried the generated binder, and I could see once that all the cells were marked with
Let's take the review conversation over to that PR. I'll copy your comment over there.
Hi, I have just read the various issues around this and wonder if the following can help
Thanks for looking at this @echarles. Can we distinguish between what might be a good idea in the future, and what is needed now to fix this rather serious bug without making things worse? To me, it sounds like your (1) item may be something to address in the future, and (2) may be a suggestion for a change on this PR? Am I reading things as you intended?

It does seem elegant to always have a queue and always work through that queue - except, of course, for our queue-jumping kernel_info messages that we need right now to clear the restarting status sentinel. But I think the logic would look very similar to how it does now. Right now, if we send a message, the _sendMessage function deals with whether to put it on a queue or to send it immediately - that is transparent to the user. Even if we enqueued all messages (except those special kernel_info_request ones), somewhere we would need logic to flush the queue. I'd hate to put that on the user as an extra step, so we'd have something somewhere keeping track of the queue state and, when a new message is added, determining whether we should flush the queue or not - which is essentially what _sendMessage is doing right now anyway.

In other words, we'd need the same logic as we have now, and the API would essentially look the same, especially to the user. Am I missing something?
Yes, (1) is for longer term, (2) can be discussed in the scope of the needed hot fix.
I don't know, but I think I must be missing something, as I have not spent enough time to look at all the conditions and the conversation history. I just had a feeling that a managed queue could help solve the issues. When I look at jupyterlab/packages/services/src/terminal/default.ts, lines 102 to 114 (9ddc785), it seems to call _sendPending before sending the message.
Well, you have to be a bit more careful: _sendPending calls this function as the lowest-level sending function, IIRC, so you have to guard against the queue-flush sends getting cycled back onto the queue itself. The other thing is that it wasn't clear this was a problem in kernels before the restart status logic was introduced. The connection status change triggered a queue flush, which IIRC was synchronous, so it wouldn't let other messages jump the queue. If that is the case, it may not be a problem here either.
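For illustration, a minimal sketch of such a guard (invented names, not the actual services code): the drain loop flags its sends so they go straight to the socket instead of cycling back onto the queue.

```typescript
// Minimal sketch of the re-entrancy guard (hypothetical names only).
type Msg = { msgType: string };

class QueueGuardSketch {
  private _pending: Msg[] = [];

  // pendingSend=true means "this call is draining the queue"; such sends
  // must go straight to the socket instead of being re-queued.
  sendMessage(msg: Msg, pendingSend = false): void {
    if (!pendingSend && this._pending.length > 0) {
      this._pending.push(msg);
      return;
    }
    this._socketSend(msg);
  }

  // Synchronous drain: nothing else can interleave while this loop runs.
  sendPending(): void {
    while (this._pending.length > 0) {
      this.sendMessage(this._pending.shift()!, true);
    }
  }

  private _socketSend(msg: Msg): void {
    // Placeholder for the real websocket send.
    console.log('send', msg.msgType);
  }
}
```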
(Let's please take discussion of the specific PR over to the PR, and leave more general conversation here, so conversation about that specific fix isn't scattered on other issues.)
Description
Manually executing cells while the kernel is restarting leads to out-of-order cell execution:
Reproduce
* Restart the kernel, then quickly execute several cells manually, one after another (not via "Run All").
Some cells are executed out of order, potentially leading to execution errors.
Expected behavior
Cells should be executed in order.
Context
Troubleshoot Output
Browser Output