
Pod containing spawned server dies regularly #1430

Closed
tarekmehrez opened this issue Sep 30, 2019 · 8 comments

Comments

@tarekmehrez

tarekmehrez commented Sep 30, 2019

The spawned server for the authenticated user dies regularly, whether it is idle or actively running a notebook. Is that behaviour expected? Is there a configurable parameter that affects the lifespan of the spawned server?

All other pods (hub, proxy, puller, etc.) have been running for more than 30 days, so it is hard to figure out why the spawned servers keep being killed.

Helm chart version: jupyterhub-0.9-445a953

@consideRatio
Member

There is a culling mechanism; see the documentation about culling of user pods and note the default values:

cull:
  enabled: true
  users: false
  timeout: 3600
  every: 600
  concurrency: 10
  maxAge: 0
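
If this culling is the cause and the user servers should live longer, these values can be overridden in the chart configuration. A minimal sketch, assuming the usual config.yaml passed to helm upgrade (the 24-hour timeout is only an example; adjust the release name and chart reference to your deployment):

cull:
  enabled: true
  timeout: 86400   # seconds of inactivity before a server is culled (chart default: 3600)
  every: 600       # how often, in seconds, the culler runs

helm upgrade <your release> jupyterhub/jupyterhub --version <chart version> -f config.yaml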

But, is this why they are dying?

If you doubt that, inspect what is going on by looking at the logs of the hub or the user pod, if you can manage to capture them before the pod terminates and is no longer available. I think you will find culling activity in the hub log.

kubectl get -n <your namespace> deploy/hub
kubectl get -n <your namespace> pod/jupyter-someuser
kubectl logs -n <your namespace> deploy/hub
kubectl logs -n <your namespace> pod/jupyter-someuser
kubectl describe -n <your namespace> deploy/hub
kubectl describe -n <your namespace> pod/jupyter-someuser
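
If culling is what terminates the pods, the hub log should say so explicitly. A quick way to narrow the output down (the exact wording of the log lines may vary between versions):

kubectl logs -n <your namespace> deploy/hub | grep -i cull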

@tarekmehrez
Author

So even when a kernel is still running in the background, if the browser is closed and culling is enabled, these pods will be killed?
That might explain the behaviour I am seeing.
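
One way to check what the hub thinks is to ask its REST API for the user's recorded activity. A rough sketch, assuming an admin API token is available and <hub-address>/<username> are filled in:

curl -s -H "Authorization: token $API_TOKEN" http://<hub-address>/hub/api/users/<username>

The returned JSON includes a last_activity timestamp; if that timestamp stops advancing while a kernel is still busy, the culler will eventually treat the server as idle.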

@aguinaldoabbj

It's a reasonable mechanism. However, there seems to be a sync problem between JupyterHub and Kubernetes: the pod is culled, but JupyterHub appears to be unaware of that. When the user logs in again, a 503 error appears and the user can't spawn a new pod.
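
One way to see that mismatch directly is to compare the servers the hub believes are running with the pods that actually exist. A rough sketch, assuming an admin API token and the chart's default pod labels (the exact JSON fields can vary between JupyterHub versions):

# servers the hub believes are running
curl -s -H "Authorization: token $API_TOKEN" http://<hub-address>/hub/api/users \
  | jq -r '.[] | select(.server != null) | .name'

# single-user pods that actually exist
kubectl get pods -n <your namespace> -l component=singleuser-server

Any user listed by the first command but missing from the second is in the out-of-sync state described here.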

@consideRatio
Member

I never manage to get full clarity about this, but there are two cullers in play: one run by JupyterHub, and perhaps also one run by the Jupyter server that JupyterHub creates.

The culling done by JupyterHub does not communicate with KubeSpawner, which is the logic within JupyterHub that keeps track of which servers are running, and this may be the crux.

I don't think this is an issue in the latest version of the Helm chart though, or even in 0.8.2, as I think they run a check every 30 seconds to verify that everything is as it should be regarding users and their routes.

What version of the jupyterhub helm chart are you using where you experience this @aguinaldoabbj?

@aguinaldoabbj

Jupyterhub helm chart version 0.8.2 and app version 0.9.6 on Kubernetes 1.16.

@consideRatio
Member

:O Hmmm, can you run 0.8.2 on k8s 1.16? Is that your kubectl client version rather than the cluster itself?

@aguinaldoabbj

aguinaldoabbj commented Nov 27, 2019

No, I have upgraded the whole cluster to 1.16. The only big problem I had with that was hotfixed with the workaround suggested in jupyterhub/kubespawner#354

@consideRatio
Member

I'll go ahead and close this issue, as I don't have sufficient information to conclude on a concrete action. It is also an issue that I've not seen reported before, so I suspect it is unrelated to the Helm chart.
