
Slow Terminating Kernels #8562

Merged

Conversation

@echarles (Contributor) commented Jun 13, 2020

References

This fixes #8477 (Race Condition with Slow Restarting Kernels)

Code changes

A new field isTerminating and a new method shutdownKernel are added to the session context, and the needed signals are emitted based on the kernel status so that the Kernel Indicator is updated.
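
For orientation, here is a minimal sketch of what that added surface could look like, assuming the names given above (isTerminating, shutdownKernel); the interface name is made up for illustration and the exact typings in the PR may differ:

  // Illustrative declaration only, mirroring the description above.
  export interface ISessionContextTerminationAdditions {
    // True while the underlying kernel is shutting down; UI elements such as
    // the Kernel Indicator and the cell execution commands can read it.
    readonly isTerminating: boolean;

    // Shut down the kernel, emitting the status signals needed to keep the
    // Kernel Indicator in sync.
    shutdownKernel(): Promise<void>;
  }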

User-facing changes

The Kernel Indicator now shows the terminating and restarting statuses. While in those statuses, a code cell cannot be executed.

[screencast: kernel-terminating-restarting]

Backwards-incompatible changes

None

@jupyterlab-dev-mode bot commented Jun 13, 2020

Thanks for making a pull request to JupyterLab!

To try out this branch on binder, follow this link: Binder

@echarles changed the title from "Terminating Kernel Status" to "Slow Terminating Kernel Status" Jun 13, 2020
@echarles changed the title from "Slow Terminating Kernel Status" to "Slow Terminating Kernels" Jun 13, 2020
@blink1073 added this to the 3.0 milestone Jun 13, 2020
@blink1073 (Member) commented Jun 13, 2020

cf #8563 (comment) for an alternate approach to showing a dialog.

Do we strictly need the new API (restartKernel, isRestarting, isTerminating), or can we get there with existing APIs (and the new terminating status)?

@echarles (Contributor, Author) commented Jun 13, 2020

To simulate a slow terminating kernel, add time.sleep(10) in notebook.services.kernels.kernelmanager:shutdown_kernel().

For the slow restarting kernel, add time.sleep(10) in notebook.services.kernels.kernelmanager:restart_kernel().

@echarles (Contributor, Author) commented Jun 13, 2020

This PR does not cover a shutdown via the Running panel; see also #8564.

@echarles (Contributor, Author) commented Jun 13, 2020

cf #8563 (comment) for an alternate approach to showing a dialog.

Thx @blink1073 for your input and help. We could indeed show text below the cell instead of the blocking modal. No strong opinion on that here. @tgeorgeux WDYT?

Do we strictly need the new API (restartKernel, isRestarting, isTerminating), or can we get there with existing APIs (and the new terminating status)?

Great question! I had the same one while implementing this.

The cell execution action needs to know if the kernel is restarting or terminating, so that information needs to be exposed in some way; this is why I have added those 2 new getters.

restartKernel needs to be invoked by the modal window that asks the user for confirmation, since they will lose their variables. We could invoke the restart without a new method via session?.kernel?.restart(), but some state needs to be updated before and after doing that... Maybe you have a better approach?

  async restartKernel(): Promise<void> {
    const kernel = this.session?.kernel || null;
    this._isRestarting = true;
    this._isReady = false;
    this._statusChanged.emit('restarting');
    await this.session?.kernel?.restart();
    this._isRestarting = false;
    this._isReady = true;
    this._statusChanged.emit(this.session?.kernel?.status || 'unknown');
    this._kernelChanged.emit({
      name: 'kernel',
      oldValue: kernel,
      newValue: this.session?.kernel || null
    });
  }

@jasongrout (Contributor) commented Jun 13, 2020

There is something about this idea that is troubling me. Any client could be signaling a restart, for example - how do we handle the case where someone else is restarting the kernel? (this will probably be even more of an issue when we have real-time collaboration)

Also, for a kernel restart, why shouldn't we be able to run cells, and just have them queued for when the kernel comes back? I could imagine hitting restart, then immediately starting to run the new cells I want to run, and feeling very stilted about having to wait until the kernel is up again before I can execute cells.

When a kernel is initializing, or when the websocket is down, we don't wait until it comes up before someone can run cells. We just queue them if the websocket is not available and send them when the websocket comes up.

@echarles (Contributor, Author) commented Jun 13, 2020

There is something about this idea that is troubling me. Any client could be signaling a restart, for example - how do we handle the case where someone else is restarting the kernel? (this will probably be even more of an issue when we have real-time collaboration)

I have started from the principle that jlab is single-user for now. This implies only the single user of the server can restart a kernel.

I will check how this PR behaves in case of a kernel being killed by a third party.

Also, for a kernel restart, why shouldn't we be able to run cells, and just have them queued for when the kernel comes back? I could imagine hitting restart, then immediately starting to run the new cells I want to run, and feeling very stilted about having to wait until the kernel is up again before I can execute cells.

Yep, we have discussed that in #8477 (comment).

This blocking approach is a first step, which can be followed by a second step with the queuing approach. I just feel that the queuing might produce unexpected side effects for the user: what if the kernel never terminates and never comes up again (I have seen that multiple times with Spark kernels handling big data sets)? What if the user does not understand that the execution is queued and asks the same cell to execute multiple times...? I am sure a lot more edge cases will come along the way, which is why this first approach sounds safer and more robust for now.

When a kernel is initializing, or when the websocket is down, we don't wait until it comes up before someone can run cells. We just queue them if the websocket is not available and send them when the websocket comes up.

Yep that would be the next step.

@echarles (Contributor, Author) commented Jun 19, 2020

This blocking approach is a first step which can be followed by a second step with the queuing approach.

I have now implemented the queuing of cell execution.

[screencast: restartqueue]
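
A rough sketch of the queuing idea, assuming a per-connection pending queue and a restarting flag (the class and method names below are placeholders for illustration, not the actual PR code):

  import { KernelMessage } from '@jupyterlab/services';

  // Illustrative only: while the kernel is restarting, outgoing messages are
  // buffered on the connection instead of being written to the websocket,
  // then replayed in order once the kernel is available again. Because the
  // queue is an instance field, each connection keeps its own messages.
  class QueuingConnection {
    private _isRestarting = false;
    private _pendingMessages: KernelMessage.IMessage[] = [];

    constructor(private _sendToSocket: (msg: KernelMessage.IMessage) => void) {}

    markRestarting(): void {
      this._isRestarting = true;
    }

    send(msg: KernelMessage.IMessage): void {
      if (this._isRestarting) {
        // The kernel is not ready: keep the message for later.
        this._pendingMessages.push(msg);
        return;
      }
      this._sendToSocket(msg);
    }

    // Called when the kernel is back, e.g. on a suitable status or reply message.
    kernelRestored(): void {
      this._isRestarting = false;
      const pending = this._pendingMessages;
      this._pendingMessages = [];
      pending.forEach(m => this._sendToSocket(m));
    }
  }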

@echarles modified the milestones: 3.0, 2.2 Jun 19, 2020
@echarles (Contributor, Author) commented Jun 19, 2020

I have set the 2.2 milestone on this one assuming we have enough reviews.

@echarles force-pushed the slow-terminating-kernel branch from 8de1dad to da84a7c Jun 20, 2020
@echarles (Contributor, Author) commented Jun 20, 2020

I had to make a tradeoff to trigger the sending of the pendingMessages.

I wanted to use a status message generated by the server via the on_kernel_restarted callback, but this callback is not registered: the implementation in the multikernelmanager

https://github.com/jupyter/jupyter_client/blob/daa79f917031eb7e5bc58df9fdc8929775eb8171/jupyter_client/multikernelmanager.py#L289-L291

should be overridden somewhere by the notebook server's MappingKernelManager(MultiKernelManager):

https://github.com/jupyter/notebook/blob/9fa644bbe46da9aa907f876dc027a7347e270e8b/notebook/services/kernels/kernelmanager.py#L42-L43

... but it is not. As a consequence, the restarting message is not created on restart, and jlab cannot use it to submit the pending messages.

I fell back to this implementation, which uses the shutdown_reply message generated by the ipykernel implementation, but there is no guarantee IMHO that that message would be generated by other kernel implementations.

https://github.com/ipython/ipykernel/blob/8ec2c4689614342483b7b3d85fe2a8536eef1527/ipykernel/kernelbase.py#L682-L688
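
For illustration, a simplified sketch of that fallback, assuming the queued messages are flushed when a shutdown_reply with restart: true arrives (the helper name is made up for this sketch; the real logic lives in the services KernelConnection):

  import { KernelMessage } from '@jupyterlab/services';

  // Illustrative only: decide, from an incoming message, whether the pending
  // queue should be flushed. shutdown_reply with content.restart === true is
  // what ipykernel emits when the kernel restarts, as discussed above.
  function shouldFlushPending(msg: KernelMessage.IMessage): boolean {
    if (msg.header.msg_type !== 'shutdown_reply') {
      return false;
    }
    const content = msg.content as unknown as { restart?: boolean };
    return content.restart === true;
  }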

To summarize, I think:

  1. This PR fixes the issue in the case of ipykernel-based kernels.
  2. We could bring a better fix with a notebook release that would create that restarting message, but I would prefer migrating jlab to jupyter_server and fixing that in the https://github.com/jupyter/jupyter_server repository, especially when we move to the kernel providers.

Any second pair of eyes and any thoughts to confirm the above are very welcome @jasongrout @blink1073 @Zsailer @kevin-bates.

@blink1073 (Member) commented Jun 20, 2020

One clarification: any kernel that uses the base Kernel class would work with this PR. This includes all of the metakernel-based kernels (there are at least a dozen).

I agree we should make the fix in jupyter_server

@@ -31,6 +31,10 @@ import * as restapi from './restapi';
// Stub for requirejs.
declare let requirejs: any;

const RESTARTING_KERNEL_SESSION = '_RESTARTING_';

let _pendingMessages: KernelMessage.IMessage[] = [];
@blink1073 (Member) commented Jun 20, 2020

As I read it, any queued messages would go to the first kernel that starts. So, if you tried to run cells from two different notebooks while their kernels were starting then you'd end up executing both sets of code on one kernel.

@echarles (Contributor, Author) commented Jun 20, 2020

Nothing escapes your eagle eyes @blink1073 :) I had created that static field (while debugging...) and reverted it back to an instance field in this f0a1a1a... but forgot to push it. Now _pendingMessages is again an instance field of the KernelConnection.

Anyway, I have only tested with one restarting kernel so far, so yeah, I will test with multiple restarting kernels to ensure we meet the expected behavior.

@echarles (Contributor, Author) commented Jun 20, 2020

I have done a few tests, and the pendingMessages do not mangle each other when multiple slow kernels restart at the same time. I have also opened jupyter-server/jupyter_server#247, where we can implement a better solution for this feature. For now, I believe the proposed solution is "good enough".

@echarles (Contributor, Author) commented Jun 22, 2020

Hi @blink1073, do you have any other comments on my last push? It would be great if we could merge a fix before the 2.2 deadline this Wednesday.

@blink1073 (Member) left a comment

Looks good, thanks! Let's leave it open for a day in case someone else has comments.

@blink1073 (Member) commented Jun 25, 2020

Cheers!

@blink1073 merged commit 5555c77 into jupyterlab:master Jun 25, 2020
39 checks passed
@echarles (Contributor, Author) commented Jun 25, 2020

Thx Steve and Jason for your reviews and also for the merge.

saulshanabrook pushed a commit that referenced this issue Jun 25, 2020
jasongrout added a commit to jasongrout/jupyterlab that referenced this issue Jan 7, 2021
Fixes jupyterlab#9566
Followup on jupyterlab#8562
Changes solution in jupyterlab#9484

If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order.

One exception to this is the kernel info request when we are restarting. We redo the logic in jupyterlab#9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.

Also, we realize that the sendMessage `queue` parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
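
A minimal sketch of the ordering rule described in that commit message, assuming a per-connection pending queue (the names here are placeholders, not the actual _sendMessage signature):

  import { KernelMessage } from '@jupyterlab/services';

  // Illustrative only: a message is queued not just while restarting, but
  // also whenever earlier messages are still queued, so later messages cannot
  // overtake them. (The actual change also special-cases the kernel info
  // request sent during a restart.)
  function sendOrQueue(
    msg: KernelMessage.IMessage,
    pendingMessages: KernelMessage.IMessage[],
    isRestarting: boolean,
    sendToSocket: (m: KernelMessage.IMessage) => void
  ): void {
    if (isRestarting || pendingMessages.length > 0) {
      pendingMessages.push(msg);
      return;
    }
    sendToSocket(msg);
  }
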
jasongrout added a commit to jasongrout/jupyterlab that referenced this issue Jan 26, 2021
Fixes jupyterlab#9200

jupyterlab#8562 changed from listening for the ‘restarting’ message to a non-existent ‘autorestarting’ message from the server. This changes back to listening for a ‘restarting’ message, which indicates that the server has autorestarted the kernel.

This also refactors and simplifies some of the code around restarting and autorestarting kernels to be more consistent.
@github-actions bot added the status:resolved-locked label Feb 8, 2021
@github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2021
Labels
pkg:apputils, pkg:notebook, pkg:services, status:resolved-locked