Cannot reliably spawn singleuser notebooks on Azure Kubernetes Service/KubeSpawner #2480
Comments
The main thing to debug here is to watch the failing pods to see their events. The "XXX's server failed to start in 300 seconds, giving up" message means that JupyterHub wasn't informed that the given pod entered the Running state within the time limit. This can have a few causes:
In all of these cases, the state of the pod when it is considered to have failed is the key to debugging.
Pretty sure it's in KubeSpawner, or it's an issue with AKS (Azure Kubernetes Service). I am familiar with causes 2 and 3, and none of those signs are present. I do use a pre-puller and it works quite well. I'm well below resource limits and there are no signals or indications that Kubernetes thinks otherwise. I'm currently following up with Microsoft tech folks to try to confirm or eliminate an AKS issue.
I am also encountering the exact same issue. I am pre-pulling the images, so the 2nd cause does not apply to me. I am also well below the resource limit, which rules out the 3rd. I am using AKS as well.
I'm also pursuing the idea that this is somehow related to ephemeral storage: the cluster lacks it and so cannot schedule a pod. Problems with this idea:
1 - No messages anywhere about lacking ephemeral storage.
2 - When some of my other pods run out of ephemeral storage, they get evicted.
3 - I've never seen any disk-pressure messages.
My main idea is that some pod/container is logging a lot and causing us to run out of inodes, similar to what is seen here: kubernetes/kubernetes#32542. Questions I'm working on answering:
- Would running out of inodes trigger a disk-pressure message, or is that only triggered on space?
- Does Kubernetes do any logging to ephemeral storage areas?
- How can such ephemeral storage areas be cleaned up?
@laserninja could you give us your cluster specs/config/k8s version?
Yes, sure. I also had a chat with the AKS support folks, and they mentioned that I am using VMSS for autoscaling, which is in preview. I don't think this would be an issue. Currently I am running a single-node cluster with 6 cores, 56 GB of memory, and a GPU. K8s version: 1.12.5. Let me know if you need any other details. PS: I also have Airflow with the Kubernetes executor running on this cluster. Its scheduler also needs to be restarted quite often.
OK, I was kind of wondering if it would be possible, for example, to share your helm configs, XXX-ing out whatever info you don't want shared. It would be helpful to compare against, for example, the singleuser spec I shared above, or any logs from your hub. Have you tried putting your hub in DEBUG mode? Like so: https://github.com/jupyterhub/jupyterhub/wiki/Debug-Jupyterhub ?
Yes, sure.
There are no errors in the hub except what you posted. I'll try putting my hub in DEBUG mode.
I tried DEBUG mode and there are no error messages. It just says the hub is trying to spawn the user pod.
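For anyone else trying this, here is a minimal sketch of turning up logging in `jupyterhub_config.py` (or via the helm chart's `hub.extraConfig`). `JupyterHub.log_level` and `Spawner.debug` are standard JupyterHub configuration options; exact placement in your chart config may differ.

```python
# jupyterhub_config.py -- turn up logging for both the hub and the spawner.
# These are standard JupyterHub traitlets; adjust to taste.
c.JupyterHub.log_level = 'DEBUG'   # verbose hub logs, including spawn attempts
c.Spawner.debug = True             # passes --debug through to the single-user server
```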
Good to know. Was this just for the hub, or did you plug it in for the spawner as well?
Just the hub. I am not aware of a spawner pod. Where can I see the spawner?
if you set |
If you have auto-scaler disabled you can check |
Not sure it will matter. Kubelets handle running pods; my pod never even starts. Are you suggesting checking the kubelet log for the hub? If so, what would I search for there?
You can check |
Did you try this: jupyter/jupyter#382 (comment)
No. I didn't see what this "Azure load balancer drops connections after ~5 min inactivity" has to do with anything. My connection doesn't drop after 5 minutes; it drops right away and THEN times out after 5 minutes. That is, the 5 minutes is not the problem; the dropping and the inability to spawn a pod seem to be the problem.
It worked for me. It's been 3 days and I have been able to spawn user pods consistently.
I am running a test and will try to confirm.
jupyter/jupyter#382 works. To explain what is happening more clearly: when the hub pod is created, it is connected to a load balancer that is attached to the cluster and handles networking, passing requests along to the Kubernetes API. If the connection is "inactive" for 4 (or 5?) minutes, the load balancer severs it. This connection is how your hub pod reaches the kube API to send a "signal" to spawn a new user pod. Once the connection is severed, those signals get dropped, and your Kubernetes scheduler never receives the request to spawn a user pod via the kube API behind the load balancer.

The solution basically sends a TCP keepalive from the hub pod to the load balancer every 60 seconds so that it doesn't drop the hub pod's connection. This maintains connectivity between the hub pod and the place where it sends signals to spawn new pods. I'm not sure what the load balancer policies are like on EKS or Google Cloud; this network policy may be AKS (Azure) specific. It is worth exploring increasing the keepalive interval from 60 seconds to maybe 120 or 180 seconds (2 or 3 minutes), though it probably makes no difference.
Another small change: make sure to add it to the hub extra config if you are using helm:
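For reference, the workaround from jupyter/jupyter#382 amounts to adding TCP keepalive socket options to the connections the hub's Kubernetes client opens. Here is a sketch of the kind of snippet that goes into `hub.extraConfig`; the exact option values and the `urllib3` patching approach reflect my understanding of the linked fix, so verify against the issue before relying on it.

```python
# Sketch of the keepalive workaround: make every urllib3 connection
# (which the kubernetes Python client uses under the hood) send TCP
# keepalive probes so the Azure load balancer doesn't drop idle
# hub -> kube-api connections. Values are illustrative.
import socket
from urllib3 import connection

connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # enable keepalive
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # first probe after 60s idle (Linux name)
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),  # probe every 60s after that
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # give up after 3 failed probes
]
```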
Please see https://stackoverflow.com/questions/22472844/how-to-set-socket-option-tcp-keepcnt-tcp-keepidle-tcp-keepintvl-in-java-or-n for an explanation of some of the socket options. See also: http://www.unixguide.net/network/socketfaq/4.7.shtml
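To make the linked explanations concrete, here is a stdlib-only demo of those keepalive options on a raw socket, unrelated to JupyterHub itself. Note that the `TCP_KEEPIDLE`/`TCP_KEEPINTVL` names are Linux-specific.

```python
import socket

# Create a TCP socket and apply the keepalive knobs discussed in the links.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # turn keepalive on
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds idle before the first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before the kernel drops it

# Read the values back to confirm the kernel accepted them.
keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
idle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
s.close()
print(keepalive_on, idle)
```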
@yvan thanks for writing the more detailed explanation, because I had failed to implement jupyter/jupyter#382, not trusting that it would work. Much appreciated!
Hey everyone
I recently deployed a JupyterHub setup on Azure Kubernetes Service using z2jh. I'm pretty happy with the setup. That said, I'm having one major issue where my hub doesn't spawn user servers some of the time. Here is a screenshot of what it looks like from the hub view:
During this process no pod is created; this happens before node assignment.
There is no resource pressure on any of our nodes; all kubelets are listed as ready:
There are no events in a describe of the hub pod or the nodes that I noticed; all events are empty:
Here are the Hub pod logs:
Every single-user pod tries to attach a PVC for the user's home directory (this normally goes off without a hitch), plus a shared AzureFile resource that can be shared between users. The performance of AzureFile is bad (it's basically an FTP server), but there aren't usually any issues attaching it. Both of these things happen after node assignment, and I usually get feedback on them as the pod is spawning.
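For context, a mount setup like the one described would typically look something like this in KubeSpawner config; the claim names and mount paths here are hypothetical placeholders, not our real ones.

```python
# Hypothetical sketch: a per-user home PVC plus a shared AzureFile-backed
# claim, in jupyterhub_config.py. Claim names are made up for illustration;
# {username} is expanded per-user by KubeSpawner.
c.KubeSpawner.volumes = [
    {'name': 'home', 'persistentVolumeClaim': {'claimName': 'claim-{username}'}},
    {'name': 'shared', 'persistentVolumeClaim': {'claimName': 'shared-azurefile'}},
]
c.KubeSpawner.volume_mounts = [
    {'name': 'home', 'mountPath': '/home/jovyan'},
    {'name': 'shared', 'mountPath': '/home/jovyan/shared'},
]
```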
Proxy logs don't really show any unusual behavior; the response code is 200 all around (and why would they show an issue, since they should have nothing to do with the spawning of single-user pods).
Checking the kubelet logs on the nodes themselves has proven difficult; I'm not sure how to do this on Azure. There aren't really any clear instructions on how to log into one of these nodes, so if anyone knows, please let me know; it would be helpful. I've tried accessing kubelet logs from inside a singleuser pod/container, but I don't see them where I expect to (the files/logs don't exist).
Our node OS:
Linux XXXXX 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Here is our singleuser spec with some things X-ed out.

Things I know/notice:
1- It is either at the JupyterHub/KubeSpawner level or the AKS (engine/resource management/signaling) level. I think this is obvious, but I'll state it anyway just in case I'm completely off.
2- The issue can happen to one user while another user can successfully log in.
3- I have another pod/app of my own design that just does some housekeeping; it keeps getting evicted because it lacks ephemeral storage. This may be a separate issue or may be somehow related, but no such warning or eviction has ever affected our singleuser pods.
4- After a couple of timeouts and failures to spawn, the single-user pod usually spins up with no problem.
5- It could be an Azure networking issue (we actually have our cluster deployed in a private virtual network), as the issue seems to come in 'waves.' To test this we are going to deploy our cluster with another provider and, uh, wait for the problem to happen (that said, we are trying to make this work on AKS asap).
6- I do not remember having this issue when I deployed test clusters with the JupyterHub chart outside a private virtual network. That said, the configuration was simpler then.
7- I recently upgraded to the 0.8.0 JupyterHub chart.
Ideas:
AKS thinks it cannot get enough resources for the user pod (I'm not sure how to check this other than checking actual resource usage).
A networking issue causes the hub pod to be unable to reach the AKS API to tell it to spawn a new single-user pod.
I would not be surprised to learn this is some interaction between the Azure engine and JupyterHub. They have quite a few issues they are working through with AKS.
Try running JupyterHub with increased logging so I can see what is going wrong inside the spawner/handler? I'm not sure how to do this.
Conclusion:
This issue could be related to:
jupyter/jupyter#382
I know the text above is long, but frankly part of the problem is that I'm not really sure where to look for an issue that isn't providing much feedback. I've had a look at the Kubernetes troubleshooting docs (https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/), and frankly countless issues and posts. As a general guideline the docs are good, but progress is slow, which is why I'm posting this issue. Nothing seems to match quite what I'm seeing, and I've been digging for ~2 weeks on and off now.
Sources I'm digging through:
https://github.com/jupyterhub/kubespawner/blob/master/kubespawner/spawner.py
https://github.com/jupyterhub/jupyterhub/blob/master/jupyterhub/spawner.py
https://github.com/jupyterhub/jupyterhub/blob/af0082a16ba365a4b1bfb79f9d01a7c1a6cee5be/jupyterhub/handlers/base.py
Main ask:
Where should I look to understand what is going wrong here? What log can I open? What command can I run? What setting can I put on JupyterHub to get more feedback on this issue?
Thanks a lot