
Persisted sessions are not restored at startup, only if requested by kernel ID, leaking pods? #1086

Closed
dnwe opened this issue May 10, 2022 · 7 comments · Fixed by #1095

dnwe (Contributor) commented May 10, 2022

Description

After the changes made under #737 (comment), persisted sessions were no longer loaded at startup, but only "on demand" when an /api/kernels/<kernel-id> request came in that matched one of the persisted sessions.

In the scenario where a single enterprise-gateway replica is running in a Kubernetes cluster and using KubernetesProcessProxy, this can lead to driver+executor pods being leaked, remaining as idle processes that are never cleaned up.
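
To make the failure mode concrete, here is a minimal sketch of the "on demand" pattern described above. The class and method names are illustrative stand-ins, not EG's actual code:

class KernelRegistry:
    """Illustrative stand-in for EG's kernel manager plus session store."""

    def __init__(self, persisted_sessions):
        self.persisted_sessions = persisted_sessions  # kernel_id -> session info
        self.live_kernels = {}  # kernel_id -> hydrated kernel manager

    def get_kernel(self, kernel_id):
        # On-demand path: hydrate only when this specific ID is requested.
        if kernel_id not in self.live_kernels and kernel_id in self.persisted_sessions:
            self.live_kernels[kernel_id] = self._hydrate(self.persisted_sessions[kernel_id])
        return self.live_kernels.get(kernel_id)

    def list_kernels(self):
        # /api/kernels reports only hydrated kernels, so pods whose sessions
        # are never requested after a restart stay invisible -- and leak.
        return list(self.live_kernels)

    def _hydrate(self, session):
        return {"session": session}  # stand-in for reconnecting a KernelManager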

Reproduce

  1. Deploy enterprise-gateway into a Kubernetes cluster and spin up some kernels.
  2. kubectl delete --grace-period=0 the EG pod (or send it a kill -9 manually) so that it restarts without stopping any kernels.
  3. Let time pass without querying those kernels.
  4. Note that the idle driver+executor pods are still present and seemingly orphaned.

Expected behavior

Persisted sessions should be restored at startup by re-enabling the commented-out call:

--- a/enterprise_gateway/enterprisegatewayapp.py
+++ b/enterprise_gateway/enterprisegatewayapp.py
@@ -141,8 +141,7 @@ class EnterpriseGatewayApp(EnterpriseGat
         )
 
         # Attempt to start persisted sessions
-        # Commented as part of https://github.com/jupyter/enterprise_gateway/pull/737#issuecomment-567598751
-        # self.kernel_session_manager.start_sessions()
+        self.kernel_session_manager.start_sessions()
 
         self.contents_manager = None  # Gateways don't use contents manager
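
In terms of the illustrative registry sketched under "Description", the restored start_sessions() call amounts to an eager pass over the session store at startup (again a sketch, not EG's real implementation):

def start_sessions(registry):
    # Eager path: hydrate every persisted session at startup so surviving
    # pods are immediately visible on /api/kernels (and subject to culling).
    for kernel_id, session in registry.persisted_sessions.items():
        if kernel_id not in registry.live_kernels:
            registry.live_kernels[kernel_id] = registry._hydrate(session)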

Context

  • Operating System and version: Kubernetes v1.21.12
  • Browser and version: N/A
  • Jupyter Server version: 2.6.0
dnwe added the bug label on May 10, 2022
welcome bot commented May 10, 2022

Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

kevin-bates (Member) commented:

Hi @dnwe - thanks for opening this issue.

I suppose this makes sense: without the query, nothing will "hydrate" a KernelManager to "connect" to the outstanding kernels. Of course, the same "leak" can happen if there are no further uses of the orphaned kernels (i.e., the users have walked away), which I would view as outside the scope of this issue.

I'm curious whether you see this issue when multiple replicas are configured, since that is the design center here (Active/Active). If nothing requests a kernel's output following the termination of one of the EG nodes, then I'm not sure what the expected behavior should be. How are your kernels being terminated if users just stop operating against them? Are you relying completely on culling, and is that the difference here? Are you using a client other than Jupyter Lab? I'm fairly certain Lab sends periodic inquiries for the kernels it knows about.
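
For reference on the culling point: idle-kernel culling is driven by the standard Jupyter kernel-manager traits, which apply to EG as well. A sketch of a traitlets config file with illustrative values (tune them to your environment):

# Illustrative culling settings in a Jupyter traitlets config file.
c.MappingKernelManager.cull_idle_timeout = 3600  # cull kernels idle for over an hour
c.MappingKernelManager.cull_interval = 60        # run the culling check every minute
c.MappingKernelManager.cull_connected = False    # don't cull kernels with live connections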

I suppose we could introduce a "mode" here that denotes Active/Active or Active/Passive and bring back the call for Active/Passive modes, but if all EG nodes are terminated (ungracefully), kernel pods will be leaked irrespective of "mode" if they're never queried following the EG node's termination and subsequent restart.

What is the expected behavior here? There really shouldn't be a difference between running two replicas and running one EG instance followed by another, assuming a) they all use the same persisted session store and b) the users/applications using those kernels issue a request.

I think we need more information as to what the desired behavior is before proceeding. Thanks again for spending time on this as this is something we definitely want to support.

dnwe (Contributor, Author) commented May 10, 2022

The scenario here is that a single EG instance is responsible for managing the kernels within the Kubernetes namespace it's running in. It can theoretically be restarted at any time, e.g. if a node is rebooted underneath it or a new image/version is rolled out. In this scenario it's preferable to SIGKILL EG and have it re-discover all the persisted kernels at startup when it is rescheduled on another node, so that they're published on /api/kernels and their running state, and any future idleness, can be discovered externally.

kevin-bates (Member) commented:

Understood. We'll need to introduce the notion of a "mode" for this (it's not as simple as re-activating the old call). In the meantime, can you use multiple replicas where each is configured with client affinity? (Yes, that still "leaks" kernel pods if there are no additional uses of an orphaned kernel.)

Another approach would be to introduce a TTL to each kernel. Since the server pings the kernel every few seconds, we could have the kernel shut down if it hasn't been pinged in X amount of time.
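
A minimal sketch of that TTL idea (hypothetical, not an existing EG feature; the shutdown callback and the wiring of ping() to the server's pings are assumed):

import threading
import time

class TTLWatchdog:
    """Shut a kernel down if it hasn't been pinged within a TTL."""

    def __init__(self, shutdown, ttl_seconds=300, check_every=10):
        self.shutdown = shutdown  # callback that stops the kernel
        self.ttl_seconds = ttl_seconds
        self.check_every = check_every
        self.last_ping = time.monotonic()

    def ping(self):
        # Call this whenever the server pings the kernel.
        self.last_ping = time.monotonic()

    def start(self):
        def watch():
            while True:
                time.sleep(self.check_every)
                if time.monotonic() - self.last_ping > self.ttl_seconds:
                    self.shutdown()  # no ping within the TTL: shut down
                    return

        threading.Thread(target=watch, daemon=True).start()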

dnwe (Contributor, Author) commented May 10, 2022

dnwe commented May 10, 2022

Are you talking about multiple replicas with cookie-based session affinity via an nginx/traefik ingress, or NodePort with clientIP session affinity? I had seen some discussion around both before, but I'm not sure how it is intended to work if a given kernel is inadvertently queried on two running instances (e.g., the remote IP is a gateway and the cookie is lost by the client, so the affinity hashing isn't consistent).

I guess my goal is resiliency rather than availability. I'd like EG to be capable of being restarted at any time and rescheduled, and then to use the persisted sessions as the source of truth for what it should find (via the kube-api) running in its owned KERNEL_NAMESPACE, since those kernels will have kept making good progress while it was briefly offline. Any driver/executor pods that are lying around in the namespace but aren't known to EG (/api/kernels) can theoretically be culled by my own garbage-collection service (since we must assume they were once created but never persisted). Similarly, I'm fine with kernels being reported as "execution_state": "idle", as I can also perform my own tidy-up of those.
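
Such a garbage-collection service could be a small loop along the following lines. This is a sketch under stated assumptions: the EG endpoint, the namespace name, and the kernel_id pod label are placeholders to adjust for a real deployment, and the requests and kubernetes packages are assumed to be installed:

import requests
from kubernetes import client, config

EG_URL = "http://enterprise-gateway:8888"  # placeholder EG endpoint
NAMESPACE = "enterprise-gateway"           # placeholder KERNEL_NAMESPACE

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

# Kernels EG knows about (i.e., visible on /api/kernels).
known = {k["id"] for k in requests.get(f"{EG_URL}/api/kernels").json()}

# Any kernel pod in the namespace that EG doesn't know about is orphaned.
for pod in v1.list_namespaced_pod(NAMESPACE, label_selector="kernel_id").items:
    if pod.metadata.labels.get("kernel_id") not in known:
        print(f"deleting orphaned pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)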

kevin-bates (Member) commented:

> Are you talking about multiple replicas with cookie-based session affinity via an nginx/traefik ingress, or NodePort with clientIP session affinity?

EG doesn't care or know how the affinity is configured (much like myself 😄), just so long as, when the request reaches an EG instance, that instance will be the "host" for that client for the duration of the kernel or until EG terminates.

Is this something you'd like to contribute? Are you okay with introducing a configurable "AvailabilityMode" to distinguish between single and multi-node operations?

I'm also curious about what you're using for your persistent storage.

kevin-bates (Member) commented:

@dnwe - are you able to take PR #1095 for a spin in your environment? You should find that setting the mode to 'active-passive' gives you the behavior you need.
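
If the PR exposes this as a configurable trait, enabling it might look roughly like the line below; the trait name and value are assumptions taken from this comment, so check PR #1095 for the actual spelling.

# Hypothetical -- confirm the real trait name and accepted values in PR #1095.
c.EnterpriseGatewayApp.availability_mode = "active-passive"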
