
Zombie Process Fix #34

Merged
merged 9 commits into from
Mar 31, 2020

Conversation

MSeal
Contributor

@MSeal MSeal commented Mar 9, 2020

This PR (probably) fixes both the dead / zombie process issues that come up when calling nbclient in nested subprocesses, as well as the memory leak when running nbclient repeatedly in the same process (might fix nteract/papermill#478). Both were caused by zmq resources not being cleaned up: the move to nbclient left them around by not deleting the kernel objects after execution. The cleanup process has been made much more robust against a number of failure modes and will attempt a direct kill call on the child process if graceful shutdown is not achieved.
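The shutdown-then-kill pattern described above can be sketched roughly as follows. This is not the PR's actual code; `shutdown_kernel`, `has_kernel`, and `_kill_kernel` are assumed here to mirror jupyter_client's KernelManager surface:

```python
def cleanup_kernel(km):
    """Attempt a graceful shutdown; fall back to killing the child.

    `km` is a hypothetical kernel-manager-like object exposing
    shutdown_kernel / has_kernel / _kill_kernel (names assumed).
    """
    try:
        # Ask the kernel process to shut down immediately.
        km.shutdown_kernel(now=True)
    except Exception:
        # Graceful shutdown failed; fall through to the hard kill.
        pass
    # If a child process is still registered, kill it directly so no
    # zombie/defunct process (and its zmq resources) is left behind.
    if km.has_kernel:
        km._kill_kernel()
```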

Also fixed a number of async issues / warnings that were cropping up. Downstream libraries were relying on some functions being synchronous, so I renamed a number of functions with an async_ prefix and used loop calls to avoid compatibility issues for the next release.

Most of these were really hard to cover with meaningful unit tests. I made some notebooks that recreated complex setups that caused process issues (or emitted warnings that would do so in some circumstances) and cleaned things up until the warnings and hanging processes were no longer around. I don't think there's an easy way to encode that into unit tests, unfortunately.

Needed for moving forward with #31
Fixes runtime issues, but not all test warnings for #33

Comment on lines 328 to 332
try:
# For AsyncKernelManager
loop.run_until_complete(self.km.shutdown_kernel(now=True))
except TypeError:
self.km.shutdown_kernel(now=True)
Member
Should we check if self.km.shutdown_kernel is a coroutine, instead of using try/except?
Also, we might want to have an async version of _cleanup_kernel, which is used in both blocking and async functions.
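The check-based dispatch suggested here could look like the following minimal sketch (`shutdown_kernel_compat` is a hypothetical helper name, not code from this PR):

```python
import asyncio

def shutdown_kernel_compat(km, loop):
    # Dispatch on the function type up front instead of catching the
    # TypeError raised when run_until_complete is handed a non-coroutine.
    if asyncio.iscoroutinefunction(km.shutdown_kernel):
        # AsyncKernelManager: drive the coroutine to completion.
        loop.run_until_complete(km.shutdown_kernel(now=True))
    else:
        # Blocking KernelManager: call directly.
        km.shutdown_kernel(now=True)
```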

Contributor Author

I can change to that pattern if you prefer.

@@ -330,22 +380,47 @@ def start_kernel_manager(self):
if self.km.ipykernel and self.ipython_hist_file:
self.extra_arguments += ['--HistoryManager.hist_file={}'.format(self.ipython_hist_file)]

self.km.start_kernel(extra_arguments=self.extra_arguments, **kwargs)
# Support AsyncKernelManager
if inspect.iscoroutinefunction(self.km.start_kernel):
Member

We could also use asyncio.iscoroutinefunction, but I guess this is fine.

Contributor Author

Based on https://stackoverflow.com/questions/36076619/test-if-function-or-method-is-normal-or-asynchronous, the asyncio version is probably the better choice to cover more cases. I'll change them over.
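For reference, the two checks agree on native `async def` and plain functions; the asyncio variant additionally recognizes legacy generator-based coroutines decorated with `@asyncio.coroutine` (that decorator was removed in Python 3.11, so it is omitted from this sketch):

```python
import asyncio
import inspect

async def modern():
    pass

def plain():
    pass

# Both checks agree on native `async def` functions...
assert inspect.iscoroutinefunction(modern)
assert asyncio.iscoroutinefunction(modern)
# ...and on plain synchronous functions.
assert not asyncio.iscoroutinefunction(plain)
```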


self.kc = self.km.client()
self.kc.start_channels()
try:
await self.kc.wait_for_ready(timeout=self.startup_timeout)
except RuntimeError:
self.kc.stop_channels()
self.km.shutdown_kernel()
self._cleanup_kernel()
Member

Here we might use an async version of _cleanup_kernel, since we are in an async function.

Contributor Author

👍 will do

try:
await yield_(None) # would just yield in python >3.5
finally:
self.kc.stop_channels()
self.kc = None
self._cleanup_kernel()
Member

Here we might use an async version of _cleanup_kernel, since we are in an async function.

"""
A custom exception used to indicate that the exception is used for cell
control actions (not the best model, but it's needed to cover existing
behvior without major refactors).
Member

Typo behvior -> behavior.

self.async_wait_for_reply(msg_id, cell=cell)
)

async def async_wait_for_reply(self, msg_id, cell=None):
Member

Should it be _async_wait_for_reply for name consistency with _wait_for_reply?

Contributor Author

I was thinking of exposing async_wait_for_reply as not private so it could be used in contracts for extensibility, since papermill ended up needing the blocking version to extend behavior. If I'm doing that, I should probably commit to it and make wait_for_reply public as well, or revert the decision overall and stick with _async_wait_for_reply.

@davidbrochart
Member

Thanks for this PR @MSeal.
I am wondering if we should have a blocking and an asynchronous directory, as it is for jupyter_client, to clearly separate the two implementations?

@MSeal
Contributor Author

MSeal commented Mar 10, 2020

Thanks for this PR @MSeal.
I am wondering if we should have a blocking and an asynchronous directory, as it is for jupyter_client, to clearly separate the two implementations?

We could do that (or rather have two files, asyncclient.py and client.py would be simple enough given the lack of directory structure here). The only downside is that we may end up repeating some logic that we're currently wrapping in run_until_complete. It would be cleaner, and we could have two distinct classes that clearly separate async from blocking concerns... Might need a base class to hold common synchronous code patterns. What's your preference @davidbrochart ?

Also, it might be a couple of days before I respond on this PR, FYI.

@davidbrochart
Member

I've made some refactoring along this line in #37.

@MSeal
Contributor Author

MSeal commented Mar 19, 2020

Assuming #37 merges, do you want me to address separating the class into blocking / non-blocking versions, or still finish providing lower-level interfaces to the blocking and non-blocking functions in each, to be used by downstream extensions?

@davidbrochart
Member

I think what we now have in #37 with blocking methods using run_sync on async methods is fine.
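The run_sync pattern referenced here can be sketched as a decorator-style helper that derives a blocking method from an async one. This is a simplified illustration, not nbclient's actual helper, which handles already-running event loops more carefully:

```python
import asyncio

def run_sync(coro_method):
    """Turn an async method into a blocking one.

    A simplified sketch of the run_sync pattern; the real nbclient
    helper reuses/handles running loops rather than creating a fresh
    one per call.
    """
    def wrapped(self, *args, **kwargs):
        loop = asyncio.new_event_loop()
        try:
            return loop.run_until_complete(coro_method(self, *args, **kwargs))
        finally:
            loop.close()
    return wrapped

class Client:
    async def async_execute(self):
        return "executed"

    # Blocking counterpart derived from the async implementation.
    execute = run_sync(async_execute)
```

With this shape, the async method remains the single source of truth and the blocking variant is generated rather than duplicated.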

@davidbrochart
Member

I believe this is blocking the release. Do you need help @MSeal ?

@MSeal
Contributor Author

MSeal commented Mar 26, 2020

I've got time to take a pass tonight on updating. Should be able to more quickly respond / finish up the release now that some other threads are resolved.

@MSeal
Contributor Author

MSeal commented Mar 27, 2020

See how those changes look. I had to do more than I expected, unfortunately, so the diff is longer.

nbclient/util.py Outdated
Comment on lines 52 to 63
async def await_or_block(func, *args, **kwargs):
"""Awaits the function if it's an asynchronous function. Otherwise block
on execution.
"""
if asyncio.iscoroutinefunction(func):
return await func(*args, **kwargs)
else:
result = func(*args, **kwargs)
# Mocks mask that the function is a coroutine :/
if isinstance(result, Coroutine):
return await result
return result
Member

@davidbrochart davidbrochart Mar 27, 2020

In jupyter_server, we will have something similar that we call ensure_async. We use it as a wrapper on a function call when we don't know if that function is async or not. I prefer that over passing the function object and its parameters to await_or_block and letting it make the call. I will try it and maybe make a PR on your zombieProcFix branch.
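The wrapper-on-the-call-site shape described here can be sketched as below; call sites write `await ensure_async(func(...))` without knowing whether `func` is async, instead of handing the function object and its arguments to a dispatcher (sketch only, not jupyter_server's actual implementation):

```python
import inspect

async def ensure_async(obj):
    """Await `obj` if it is awaitable; otherwise return it as-is.

    `obj` is the result of an ordinary call: a coroutine object if the
    callee was async, or a plain value if it was synchronous.
    """
    if inspect.isawaitable(obj):
        return await obj
    return obj
```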

Contributor Author

I'm ok with that change, thanks for helping

@MSeal
Contributor Author

MSeal commented Mar 30, 2020

I haven't gotten a chance to dig into the failures -- Looks like something is trying to await a non-async call for at least one of them.

@davidbrochart
Member

I sent a PR on your branch: MSeal#2

@MSeal
Contributor Author

MSeal commented Mar 30, 2020

:doh: -- I added that fork as a watched repo so I see those changes in a timely manner going forward.

@MSeal
Contributor Author

MSeal commented Mar 30, 2020

I think we may need to drop 3.5 support if async keeps hanging with it :/. The same failures occur in the nbconvert port to nbclient with Python 3.5. End of life is around the corner (2020-09-13); maybe we should just pull the trigger now and drop it.

@captainsafia @willingc I checked some known images / VMs that were pinned to 3.5, and those I know about have moved the defaults forward to 3.6 or 3.7. Are there internals you know of that depend heavily on 3.5? If we drop support here / in jupyter_client (where I think the issue is also manifesting) it would mean papermill, nbconvert, and scrapbook would have to follow to 3.6+.

@captainsafia
Member

Are there internals you know of that depend heavily on 3.5? If we drop support here / in jupyter_client (where I think the issue is also manifesting) it would mean papermill, nbconvert, and scrapbook would have to follow to 3.6+.

Based on my inquiries, it seems like most people are on 3.6 already.

@MSeal
Contributor Author

MSeal commented Mar 31, 2020

I asked around a bit more to confirm the change would generally be accepted. I'm going to go forward with dropping 3.5 support so we don't mislead people into failing processes.

@MSeal
Contributor Author

MSeal commented Mar 31, 2020

@davidbrochart Good with a merge / release?

Member

@davidbrochart davidbrochart left a comment

Looks great, let's release!

@MSeal MSeal merged commit 2d05335 into jupyter:master Mar 31, 2020
@davidbrochart
Member

BTW, do you need help with the releases in the future?

@MSeal
Contributor Author

MSeal commented Mar 31, 2020

@davidbrochart That would be great. I'll get this one, as PyPI permissions are more restrictive than GitHub's for the jupyter project.

I would really like to have more python developers doing release management across the projects so I'm not a bottleneck. I'll get that conversation going about promoting PyPI permissions to more people with the core team.

@MSeal
Contributor Author

MSeal commented Mar 31, 2020

@davidbrochart Actually, if you wanted to help with this one, we could use an announcement to Discourse and the Jupyter mailing list. You did the lion's share of the async work and it'd be great to publicize the changes in the release there. Up to you; I can also write it up.

@davidbrochart
Member

@MSeal sure, let me try.

Development

Successfully merging this pull request may close these issues.

Kernel dies while executing papermill
3 participants