Unhandled Promise Rejections - Kernel Autorestarting #12200

Open
matthart-sisense opened this issue Mar 11, 2022 · 6 comments

Comments

@matthart-sisense

Hi,

I am using @jupyterlab/services version 6.0.9. I am experiencing an issue with the Kernel autorestarting when executing code that causes an OOM exception. The Kernel ends up in a state where the following unhandled promise rejections crash our application, because we can only catch these rejections at the Node.js process level.

Canceled future for kernel_info_request message before replies were done
    at KernelShellFutureHandler.dispose (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/future.js:179:31)
    at /usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:982:20
    at Map.forEach (<anonymous>)
    at KernelConnection._clearKernelState (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:981:23)
    at /usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:1177:34
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
Kernel connection disconnected
    at KernelConnection.fulfill (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:507:31)
    at invokeSlot (/usr/src/app/node_modules/@lumino/signaling/dist/index.js:463:22)
    at Object.emit (/usr/src/app/node_modules/@lumino/signaling/dist/index.js:420:21)
    at Signal.exports.Signal.Signal.emit (/usr/src/app/node_modules/@lumino/signaling/dist/index.js:103:21)
    at KernelConnection._updateConnectionStatus (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:1127:39)
    at KernelConnection.dispose (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:311:14)
    at KernelConnection._updateStatus (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:956:18)
    at KernelConnection.handleShutdown (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:549:14)
    at KernelConnection.shutdown (/usr/src/app/node_modules/@jupyterlab/services/lib/kernel/default.js:538:14)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)

Things I have tried:

  • Wait for the status of the kernelConnection to be 'connected' and 'idle' before we try sending in further code for execution
  • Call interrupt() on the Kernel connection and then shutdown(), in order to spin up a new connection. This results in the "Kernel connection disconnected" error, which I believe is because the interrupt() call does not properly terminate all async tasks, as reported here. Our call to shutdown() then sets the status to disconnected, while the async call to reconnect (made when the Kernel is autorestarting) has probably not been terminated by the interrupt() call.
  • Upgraded @jupyterlab/services to version 6.3.1

Curious if anyone is aware of the best way to handle Kernel crashes due to OOM exceptions? Thanks.

@welcome

welcome bot commented Mar 11, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@jupyterlab-probot jupyterlab-probot bot added the status:Needs Triage Applied to new issues that need triage label Mar 11, 2022
@JasonWeill JasonWeill added pkg:services and removed status:Needs Triage Applied to new issues that need triage labels Mar 17, 2022
@JasonWeill
Contributor

Triage notes: Add catch that can log something to the console? This won't fix the problem if we have no unhandled inaccessible promise.

Can you share an example of code that reliably causes an OOM error so that we can test any fix against it?

@matthart-sisense
Author

We have a catch for all promises that are returned from calls to this third party library. The KernelConnection.reconnect function is returning a promise that is being rejected when the connection status is changed to 'disconnected'. This promise is not being handled, from what I see, in the call to reconnect from the Promise that is handling the 'autorestarting'.
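
To make the pattern described above concrete, here is a minimal, self-contained sketch. `FakeConnection` and all of its methods are hypothetical stand-ins, not the @jupyterlab/services implementation. It only illustrates why a promise created inside a status handler needs its own catch: the caller never receives that promise, so nothing outside the library can handle its rejection.

```typescript
// Hypothetical stand-in for a kernel connection; NOT the real
// @jupyterlab/services KernelConnection.
class FakeConnection {
  readonly log: string[] = [];
  private _rejectReconnect: ((reason: Error) => void) | null = null;

  // Returns a promise that stays pending until the connection either
  // succeeds (not modelled here) or is disconnected.
  reconnect(): Promise<void> {
    return new Promise<void>((_resolve, reject) => {
      this._rejectReconnect = reject;
    });
  }

  // Moving to 'disconnected' rejects any pending reconnect promise.
  disconnect(): void {
    if (this._rejectReconnect) {
      this._rejectReconnect(new Error('Kernel connection disconnected'));
      this._rejectReconnect = null;
    }
  }

  // Status handler: the reconnect promise is created here, so no outside
  // caller can catch it. Handling the rejection in place keeps it from
  // surfacing as an unhandled rejection.
  onAutorestarting(): Promise<void> {
    return this.reconnect().catch((err: Error) => {
      this.log.push(`reconnect abandoned: ${err.message}`);
    });
  }
}
```

In this sketch, calling `disconnect()` while `onAutorestarting()` is pending records a log entry instead of producing an unhandled rejection. Whether the real library should log, swallow, or re-route the error is a separate design question.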

As for example code, here is something that will cause an OOM exception:

import os, sys
import math
import pandas as pd

file_name = 'XXL_dataset.csv'
body = '0,1,2,3,4,5,6,7,8,9\n'
with open(file_name, 'w') as f:
    f.write('a,b,c,d,e,f,g,h,i,j\n')
    f.write(body * math.floor((1000000000 / len(body)) - 1))
print(f'{file_name} was successfully created')
df = pd.read_csv(file_name)
print(df.head(1))

@krassowski
Member

To be frank the fact that there is no way to handle this rejection is frustrating. It causes issues in our tests with random tests marked as failing or random/cryptic error logs (due to issue in jest jestjs/jest#9210) and slows down development.

this._done.reject(
  new Error(
    `Canceled future for ${this.msg.header.msg_type} message before replies were done`
  )
);

get done(): Promise<REPLY> {
  return this._done.promise;
}

So far I found that this can be suppressed in tests with:

// @ts-ignore
future['_done']['reject'] = () => {}
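
When a test holds a reference to the future, a less invasive option may be to attach a rejection handler to its `done` promise rather than overwrite a private field. The sketch below uses a hypothetical `FakeFuture` that mirrors the two snippets quoted above, and `observeCancellation` is an illustrative helper name, not a library function. It has not been checked against internal futures such as the `kernel_info_request` one in the original report, which callers cannot reach.

```typescript
// Hypothetical stand-in that mirrors the quoted snippets; NOT the real
// future handler from @jupyterlab/services.
class FakeFuture {
  readonly done: Promise<void>;
  private _reject!: (reason: Error) => void;

  constructor(readonly msgType: string) {
    this.done = new Promise<void>((_resolve, reject) => {
      this._reject = reject;
    });
  }

  // Disposing before replies arrive rejects `done`, as in the quoted code.
  dispose(): void {
    this._reject(
      new Error(
        `Canceled future for ${this.msgType} message before replies were done`
      )
    );
  }
}

// Attach a handler so that a later rejection of `done` counts as handled.
// Resolves to the error message if the future was cancelled, or to null
// if it completed normally.
function observeCancellation(future: FakeFuture): Promise<string | null> {
  return future.done.then(
    () => null,
    (err: Error) => err.message
  );
}
```

The handler has to be attached before the future is disposed or shortly after, since Node.js reports a rejection as unhandled if no handler exists once the current microtask queue drains.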

@krassowski
Member

Of note, this caused a lot of confusion and annoyance downstream:

I think that we should make kernel restart NOT throw this error (unhandled rejection). I think it is fair enough to special-case kernel shutdown/restart so that this nicer error message (warning) is logged in the console instead.
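
One possible shape for that special case is sketched below. Everything here is an assumption made for illustration: `SketchFuture`, `CancelReason`, the `cancel` method and the injected `warn` callback do not exist in @jupyterlab/services. The idea is only that cancellation caused by a restart or shutdown logs a warning, while any other cancellation still rejects.

```typescript
type CancelReason = 'restart' | 'shutdown' | 'other';

// Hypothetical future; NOT the @jupyterlab/services implementation.
class SketchFuture {
  readonly done: Promise<void>;
  private _reject!: (reason: Error) => void;

  constructor(
    readonly msgType: string,
    private readonly warn: (message: string) => void
  ) {
    this.done = new Promise<void>((_resolve, reject) => {
      this._reject = reject;
    });
  }

  // Returns true if `done` was rejected, false if only a warning was logged.
  cancel(reason: CancelReason): boolean {
    const message = `Canceled future for ${this.msgType} message before replies were done`;
    if (reason === 'restart' || reason === 'shutdown') {
      // Expected during kernel restart/shutdown: warn and leave `done` pending.
      this.warn(`${message} (kernel ${reason})`);
      return false;
    }
    this._reject(new Error(message));
    return true;
  }
}
```

Passing `console.warn` as the `warn` callback would send the message to the console. Note the trade-off in this sketch: after a restart or shutdown, `done` never settles, so code awaiting it would need a timeout or a separate signal.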

@krassowski
Member

Well, this clearly does not work like this:

image
