
KernelProvisionerBase (and all classes derived from it) runs its async functions in different event loops #959

Closed
dchirikov opened this issue Jul 19, 2023 · 9 comments · Fixed by jupyter/jupyter_core#362

@dchirikov

Hi, I am trying to build a custom Kernel Provisioner and write my own class to handle the kernel's lifecycle. As almost all the functions in that class are marked as asynchronous, I assumed there would be no issue having a class with asynchronous functions and tasks to handle communication with the kernel. It turned out not to be that straightforward.

The reason is that different functions in KernelProvisionerBase are being run in different event loops, which undermines the whole idea. In my case, my tasks are blocked forever.

I noticed that pre_launch, launch_kernel and post_launch run in one event loop, and all the others (poll, wait, kill, terminate) run in another.
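
To give a rough idea of how this bites, here is a minimal, self-contained sketch (not jupyter_client code, just the pattern my provisioner relies on, assuming Python 3.10+ asyncio semantics): an awaitable set up while one loop is running is awaited later from a second loop and never wakes up:

import asyncio

ready = None  # created by pre_launch(), awaited by wait()

async def pre_launch():
    global ready
    ready = asyncio.Event()
    # Schedule the wake-up on the *currently running* loop (loop 1).
    asyncio.get_running_loop().call_later(0.1, ready.set)

async def wait():
    # If this runs on a different loop (loop 2), the callback scheduled on
    # loop 1 never fires (loop 1 has stopped), so this await never returns.
    await ready.wait()

loop1 = asyncio.new_event_loop()
loop1.run_until_complete(pre_launch())  # loop 1 stops right after pre_launch()

loop2 = asyncio.new_event_loop()
try:
    loop2.run_until_complete(asyncio.wait_for(wait(), timeout=1))
except asyncio.TimeoutError:
    print("wait() never completed: its wake-up call is stranded on loop 1")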

To illustrate this, here is a small patch for jupyter_client/provisioning/local_provisioner.py which displays the IDs of the event loops the calls are running in:

--- .tox/py311/lib/python3.11/site-packages/jupyter_client/provisioning/local_provisioner.py.orig	2023-07-19 15:01:14.634920411 +0200
+++ .tox/py311/lib/python3.11/site-packages/jupyter_client/provisioning/local_provisioner.py	2023-07-19 15:05:59.944652561 +0200
@@ -45,6 +45,7 @@
 
     async def wait(self) -> Optional[int]:
         """Wait for the provisioner process."""
+        self.log.debug(f"wait() IOLoop ID: {id(asyncio.get_running_loop())}")
         ret = 0
         if self.process:
             # Use busy loop at 100ms intervals, polling until the process is
@@ -73,6 +74,7 @@
         check if the desired signal is for interrupt and apply the
         applicable code on Windows in that case.
         """
+        self.log.debug(f"send_signal() IOLoop ID: {id(asyncio.get_running_loop())}")
         if self.process:
             if signum == signal.SIGINT and sys.platform == 'win32':
                 from ..win_interrupt import send_interrupt
@@ -94,6 +96,7 @@
 
     async def kill(self, restart: bool = False) -> None:
         """Kill the provisioner and optionally restart."""
+        self.log.debug(f"kill() IOLoop ID: {id(asyncio.get_running_loop())}")
         if self.process:
             if hasattr(signal, "SIGKILL"):
                 # If available, give preference to signalling the process-group over `kill()`.
@@ -109,6 +112,7 @@
 
     async def terminate(self, restart: bool = False) -> None:
         """Terminate the provisioner and optionally restart."""
+        self.log.debug(f"terminate() IOLoop ID: {id(asyncio.get_running_loop())}")
         if self.process:
             if hasattr(signal, "SIGTERM"):
                 # If available, give preference to signalling the process group over `terminate()`.
@@ -139,6 +143,7 @@
 
     async def cleanup(self, restart: bool = False) -> None:
         """Clean up the resources used by the provisioner and optionally restart."""
+        self.log.debug(f"cleanup() IOLoop ID: {id(asyncio.get_running_loop())}")
         if self.ports_cached and not restart:
             # provisioner is about to be destroyed, return cached ports
             lpc = LocalPortCache.instance()
@@ -160,6 +165,7 @@
 
         Returns the updated kwargs.
         """
+        self.log.debug(f"pre_launch() IOLoop ID: {id(asyncio.get_running_loop())}")
 
         # This should be considered temporary until a better division of labor can be defined.
         km = self.parent
@@ -204,6 +210,7 @@
 
     async def launch_kernel(self, cmd: List[str], **kwargs: Any) -> KernelConnectionInfo:
         """Launch a kernel with a command."""
+        self.log.debug(f"launch() IOLoop ID: {id(asyncio.get_running_loop())}")
         scrubbed_kwargs = LocalProvisioner._scrub_kwargs(kwargs)
         self.process = launch_kernel(cmd, **scrubbed_kwargs)
         pgid = None
@@ -228,12 +235,14 @@
 
     async def get_provisioner_info(self) -> Dict:
         """Captures the base information necessary for persistence relative to this instance."""
+        self.log.debug(f"get_provisioner_info() IOLoop ID: {id(asyncio.get_running_loop())}")
         provisioner_info = await super().get_provisioner_info()
         provisioner_info.update({'pid': self.pid, 'pgid': self.pgid, 'ip': self.ip})
         return provisioner_info
 
     async def load_provisioner_info(self, provisioner_info: Dict) -> None:
         """Loads the base information necessary for persistence relative to this instance."""
+        self.log.debug(f"load_provisioner_info() IOLoop ID: {id(asyncio.get_running_loop())}")
         await super().load_provisioner_info(provisioner_info)
         self.pid = provisioner_info['pid']
         self.pgid = provisioner_info['pgid']

Example output of the jupyter-kernel --debug --kernel python command:

[KernelApp] pre_launch() IOLoop ID: 140306180444624
[KernelApp] launch() IOLoop ID: 140306180444624
[KernelApp] send_signal() IOLoop ID: 140306156098192
[KernelApp] wait() IOLoop ID: 140306156098192
[KernelApp] cleanup() IOLoop ID: 140306156098192
[KernelApp] cleanup() IOLoop ID: 140306180444624

How can I tackle this? Can I do anything to pin my provisioner to a single event loop?

@kevin-bates
Member

Thanks for opening this issue @dchirikov - this does seem like a problem. Are you trying to extend LocalProvisioner or implement one from scratch?

I have not experienced this kind of issue with any of the gateway provisioners (which build directly on KernelProvisionerBase), but perhaps I haven't looked closely enough.

Unfortunately, I am not aware of what implicitly (?) triggers the use of a different event loop. I'm hoping others like @blink1073 or @davidbrochart might have ideas on where/how a different loop may be getting introduced in the kernel's lifecycle.

@dchirikov
Author

Hi @kevin-bates! Nice to see you here! You helped me with an Enterprise Gateway issue a while ago.

Yes, I am trying to implement a custom provisioner based on KernelProvisionerBase, but I took the LocalProvisioner class as an illustration for the sake of simplicity, because it also derives from KernelProvisionerBase.

In the meantime I was trying to pinpoint where the issue arises and noticed another symptom. If I run export PYTHONASYNCIODEBUG=1 && jupyter-kernel --debug --kernel python and then try to stop the kernel by pressing Ctrl+C, I get a stack trace with RuntimeError: Non-thread-safe operation invoked on an event loop other than the current one.

So it looks like threads and ioloops are getting mixed up in a bad way.

@kevin-bates
Member

Yes, I am trying to implement a custom provisioner based on KernelProvisionerBase, but I took the LocalProvisioner class as an illustration for the sake of simplicity, because it also derives from KernelProvisionerBase.

Looking at the trace output above, it seems like the difference between the contexts (i.e., event loop values) in which those particular methods are called relates to the introduction of the kernel messaging (as well as the Popen instance from within the provisioner). That is, the ioloop value 140306156098192 is associated with methods that get called after the kernel process (and messaging) has been established, while the other value is used prior to process/messaging. Might messaging introduce a different event loop? Or, if you're running from Jupyter Server, the websocket usage, etc.? In case it matters, what version of Python are you using?

In the meantime I was trying to pinpoint where the issue arises and noticed another symptom. If I run export PYTHONASYNCIODEBUG=1 && jupyter-kernel --debug --kernel python and then try to stop the kernel by pressing Ctrl+C, I get a stack trace with RuntimeError: Non-thread-safe operation invoked on an event loop other than the current one.

Again, I'll have to defer to folks more expert than myself.

@kevin-bates
Member

kevin-bates commented Jul 20, 2023

I brought this topic up in today's Server/Kernels meeting and @vidartf proposed using a custom event-loop policy to gain more detailed troubleshooting analysis. It was also suggested that the version of jupyter_client be checked, since there has been some churn in the ioloop area.

@dchirikov
Author

Brilliant idea about the custom policy. Here is what I did: I patched the bin/jupyter-kernel file the following way:

--- .tox/py311/bin/jupyter-kernel.orig	2023-07-20 20:54:40.860629917 +0200
+++ .tox/py311/bin/jupyter-kernel	2023-07-20 20:49:49.682613941 +0200
@@ -1,5 +1,30 @@
 #!/w/.tox/py311/bin/python
 # -*- coding: utf-8 -*-
+import asyncio
+import traceback
+
+from asyncio.unix_events import DefaultEventLoopPolicy
+
+class CustomAsyncioPolicy(DefaultEventLoopPolicy):
+    def get_event_loop(self):
+        loop = super().get_event_loop()
+        print(f"get_event_loop: IOLoopID: {id(loop)}")
+        traceback.print_stack()
+        return loop
+
+    def set_event_loop(self, loop):
+        print(f"set_event_loop: IOLoopID: {id(loop)}")
+        traceback.print_stack()
+        return super().set_event_loop(loop)
+
+    def new_event_loop(self):
+        loop = super().new_event_loop()
+        print(f"new_event_loop: IOLoopID: {id(loop)}")
+        traceback.print_stack()
+        return loop
+
+asyncio.set_event_loop_policy(CustomAsyncioPolicy())
+
 import re
 import sys
 from jupyter_client.kernelapp import main

Here is a full output of kernel run:
kernel-custom-policy-run.log

I am using jupyter_client==8.3.0 and you can find a full pip freeze output here:
pip.freeze.log

Hope it helps.

@dchirikov
Author

dchirikov commented Jul 21, 2023

I guess the root cause of the issue is somewhere here: https://github.com/jupyter/jupyter_client/blob/main/jupyter_client/kernelapp.py

So KernelApp.loop (loop 1) is created on L43, then on L83 KernelApp.km.start_kernel() is invoked synchronously. But KernelApp.km.start_kernel() is actually a wrapper (jupyter_core.utils.run_sync) around the asynchronous function jupyter_client.manager._async_start_kernel, and this wrapper creates a new event loop for km.start_kernel() (loop 2), which it then drives with run_until_complete() (jupyter_core/utils/__init__.py L164 and L166).

And then KernelApp.loop (loop 1) is finally started on L86 - it wasn't running until then. So loop 2 is effectively stopped at that point, and I am not sure what happens to the tasks, open sockets and other resources running in loop 2.
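
To make the sequence concrete, here is a condensed sketch of the pattern I believe is at play (simplified and approximate, not the actual jupyter_core code): a run_sync-style wrapper that drives a coroutine on its own cached loop, called before the application's own loop is ever started, so the two print different loop IDs just like the debug output above:

import asyncio

_loop_map = {}

def run_sync(coro_func):
    """Rough stand-in for jupyter_core.utils.run_sync: wrap a coroutine
    function so it can be called synchronously, on a cached private loop."""
    def wrapped(*args, **kwargs):
        name = "thread-0"  # the real code keys this on the current thread
        if name not in _loop_map:
            _loop_map[name] = asyncio.new_event_loop()  # loop 2
        loop = _loop_map[name]
        return loop.run_until_complete(coro_func(*args, **kwargs))
    return wrapped

async def _async_start_kernel():
    print("start_kernel loop:", id(asyncio.get_running_loop()))

start_kernel = run_sync(_async_start_kernel)

async def app_main():
    print("application loop:", id(asyncio.get_running_loop()))

start_kernel()           # runs on the private loop (loop 2), which then sits idle
asyncio.run(app_main())  # the app's own loop (loop 1) only starts afterwards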

@blink1073
Member

Hi @dchirikov, thanks for the thorough analysis! Yes, it seems as if it is unsafe to instantiate a KernelApp prior to there being a running event loop, and we need to find a way around this.

I think we need to provide an API in jupyter_core to get the event loop for the current thread, so that it gets added to the _loop_map if needed. Then we would call that new API in jupyter_client anywhere we would have called ioloop.IOLoop.current().
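
Roughly something like this, purely as a hypothetical sketch (the helper name and placement are illustrative, nothing is decided yet):

import asyncio
import threading

_loop_map: dict = {}

def ensure_event_loop() -> asyncio.AbstractEventLoop:
    """Hypothetical helper: return the loop run_sync would use for the current
    thread, creating and registering one if needed, so all callers in a thread
    agree on a single loop."""
    name = threading.current_thread().name
    if name not in _loop_map:
        try:
            loop = asyncio.get_event_loop()
        except RuntimeError:
            loop = asyncio.new_event_loop()
            asyncio.set_event_loop(loop)
        _loop_map[name] = loop
    return _loop_map[name]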

@blink1073
Member

@dchirikov, actually, this might be enough of a fix, if you can verify:

diff --git a/jupyter_core/utils/__init__.py b/jupyter_core/utils/__init__.py
index 68973e6..c3cb38f 100644
--- a/jupyter_core/utils/__init__.py
+++ b/jupyter_core/utils/__init__.py
@@ -126,7 +126,6 @@ class _TaskRunner:


 _runner_map = {}
-_loop_map = {}


 def run_sync(coro: Callable[..., Awaitable[T]]) -> Callable[..., T]:
@@ -160,9 +159,11 @@ def run_sync(coro: Callable[..., Awaitable[T]]) -> Callable[..., T]:
             pass

         # Run the loop for this thread.
-        if name not in _loop_map:
-            _loop_map[name] = asyncio.new_event_loop()
-        loop = _loop_map[name]
+        try:
+            loop = asyncio.get_event_loop()
+        except RuntimeError:
+            loop = asyncio.new_event_loop()
+            asyncio.set_event_loop(loop)
         return loop.run_until_complete(inner)

     wrapped.__doc__ = coro.__doc__

@vindi97

vindi97 commented Sep 21, 2023

@blink1073 Please tell me where to apply the patch: jupyter_core or jupyter_client?

Closed by jupyter/jupyter_core#362.