
Cannot reliably spawn singleuser notebooks on Azure Kubernetes Service/KubeSpawner #2480

Closed · yvan opened this issue Mar 20, 2019 · 21 comments

yvan commented Mar 20, 2019

Hey everyone

I recently deployed a JupyterHub setup on Azure Kubernetes Service using z2jh. I'm pretty happy with the setup. That said, I'm having one major issue where my users' single-user servers don't spawn some of the time. Here is a screenshot of what it looks like from the hub view:

[Screenshot: hub view, 2019-03-20 8:42 AM]

During this process no pod is created; the failure happens before node assignment.

There is no resource pressure on any of our nodes, and all kubelets are listed as ready:

  MemoryPressure   False   Wed, 20 Mar 2019 09:07:37 +0000   Thu, 21 Feb 2019 22:20:25 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 20 Mar 2019 09:07:37 +0000   Fri, 15 Mar 2019 15:57:47 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 20 Mar 2019 09:07:37 +0000   Thu, 21 Feb 2019 22:20:25 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available

  MemoryPressure   False   Wed, 20 Mar 2019 09:07:35 +0000   Tue, 12 Mar 2019 01:02:44 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 20 Mar 2019 09:07:35 +0000   Tue, 12 Mar 2019 17:54:20 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 20 Mar 2019 09:07:35 +0000   Thu, 21 Feb 2019 22:20:19 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available

  MemoryPressure   False   Wed, 20 Mar 2019 09:07:32 +0000   Tue, 12 Mar 2019 17:51:56 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 20 Mar 2019 09:07:32 +0000   Mon, 18 Mar 2019 13:48:19 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 20 Mar 2019 09:07:32 +0000   Thu, 21 Feb 2019 22:20:56 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available

  MemoryPressure   False   Wed, 20 Mar 2019 09:07:34 +0000   Mon, 11 Mar 2019 22:48:25 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 20 Mar 2019 09:07:34 +0000   Fri, 15 Mar 2019 08:51:59 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 20 Mar 2019 09:07:34 +0000   Wed, 06 Mar 2019 16:29:07 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available

There are no events that I noticed when describing the hub pod or the nodes; all events are empty:

Events:          <none>

Here are the Hub pod logs:

0 JupyterHub log:158] 200 GET /hub/spawn/XXX (XXX@XXX) 30.46ms
[W 2019-03-20 07:33:16.520 JupyterHub base:714] User XXX is slow to start (timeout=0)
[I 2019-03-20 07:33:16.547 JupyterHub log:158] 302 POST /hub/spawn/XXX -> /user/XXX/ (XXX@XXX) 205.60ms
[I 2019-03-20 07:33:16.717 JupyterHub log:158] 302 GET /user/XXX/ -> /hub/user/XXX/ (XXX@XXX) 0.68ms
[I 2019-03-20 07:33:16.812 JupyterHub base:1012] Pending spawn for XXX didn't finish in 0.0 seconds
[I 2019-03-20 07:33:16.812 JupyterHub base:1018] XXX is pending spawn
[I 2019-03-20 07:33:16.813 JupyterHub log:158] 200 GET /hub/user/XXX/ (XXX@XXX) 32.24ms
[I 2019-03-20 07:33:44.372 JupyterHub proxy:301] Checking routes
[I 2019-03-20 07:34:44.372 JupyterHub proxy:301] Checking routes
[I 2019-03-20 07:35:44.372 JupyterHub proxy:301] Checking routes
[I 2019-03-20 07:36:44.372 JupyterHub proxy:301] Checking routes
[I 2019-03-20 07:37:44.372 JupyterHub proxy:301] Checking routes
[W 2019-03-20 07:38:16.520 JupyterHub user:471] XXX's server failed to start in 300 seconds, giving up
[W 2019-03-20 07:38:16.591 JupyterHub base:689] 2 consecutive spawns failed.  Hub will exit if failure count reaches 5 before succeeding
[E 2019-03-20 07:38:16.591 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 489, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 409, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

[E 2019-03-20 07:38:16.592 JupyterHub gen:974] Exception in Future <Task finished coro=<BaseHandler.spawn_single_user.<locals>.finish_user_spawn() done, defined at /usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py:619> exception=TimeoutError('Timeout',)> after timeout
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/tornado/gen.py", line 970, in error_callback
        future.result()
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/handlers/base.py", line 626, in finish_user_spawn
        await spawn_future
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 489, in spawn
        raise e
      File "/usr/local/lib/python3.6/dist-packages/jupyterhub/user.py", line 409, in spawn
        url = await gen.with_timeout(timedelta(seconds=spawner.start_timeout), f)
    tornado.util.TimeoutError: Timeout

Every single-user pod attaches a PVC for the user's home directory (this normally goes off without a hitch) and a shared AzureFile resource that can be shared between users. The performance of the AzureFile is bad (it's basically an FTP server), but there aren't usually any issues attaching it. Both of these mounts happen after node assignment, and I usually get feedback on them as the pod is spawning.

The proxy logs don't show any unusual behavior; response codes are 200 all around (and why would they show an issue? They should have nothing to do with the spawning of single-user pods).

Checking the kubelet logs on the nodes themselves has proven difficult. I'm not sure how to do this on Azure, and there aren't really any clear instructions on how to log into one of these nodes. If anyone knows how, please let me know; it would be helpful. I've tried accessing kubelet logs from inside a single-user pod/container, but I don't see them where I expect to (the files/logs don't exist).

Our node OS:

Linux XXXXX 4.15.0-1037-azure #39~16.04.1-Ubuntu SMP Tue Jan 15 17:20:47 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Here is our singleuser spec with some things X-ed out:

singleuser:
  prePuller:
    hook:
      enabled: true
    continuous:
      enabled: true

  image:
    name: XXX
    tag: XXX
    pullPolicy: IfNotPresent

  profileList:
    - display_name: "Small"
      description: "small with 1 cpu and 4gb ram"
      kubespawner_override:
        cpu_limit: 1
        mem_limit: "4G"
    - display_name: "Medium"
      description: "medium with 2 cpus and 8gb ram"
      kubespawner_override:
        cpu_limit: 2
        mem_limit: "8G"
    - display_name: "Large"
      description: "large with 4 cpus and 16gb ram"
      kubespawner_override:
        cpu_limit: 4
        mem_limit: "16G"

  storage:
    capacity: 25Gi
    dynamic:
      storageClass: default
    homeMountPath: "XXX"
    extraVolumes:
      - name: XXX
        persistentVolumeClaim:
          claimName: XXX
    extraVolumeMounts:
      - name: XXX
        mountPath: XXX

spec:
    - ip: "XXX"
      hostnames:
      - "XXX"

  lifecycleHooks:
    postStart:
      exec:
        command:
          - "sh"
          - "-c"
          - >
            cp XXX XXX;
            mkdir -p XXX;

Things I know/notice:

1- The problem is either at the JupyterHub level (a KubeSpawner issue) or the AKS level (an engine/resource management/signaling issue). I think this is obvious, but I'll state it anyway in case I'm completely off.

2- The issue can happen to one user while another user can successfully log in.

3- Another pod/app of my own design, which just does some housekeeping, keeps getting evicted because it lacks ephemeral storage. This may be a separate issue or somehow related; no such warning or eviction has ever affected our single-user pods.

4- After a couple of timeouts and failures to spawn, the single-user pod usually spins up with no problem.

5- It could be an Azure networking issue (our cluster is actually deployed in a private virtual network), as the issue seems to come in 'waves.' To test this we are going to deploy our cluster with another provider and wait for the problem to happen (that said, we are trying to make this work in AKS asap).

6- I do not remember having this issue when I deployed test clusters with the jupyterhub chart outside a private virtual network. That said, the configuration was simpler then.

7- I recently upgraded to the 0.8.0 chart for jupyterhub.

Ideas:

AKS thinks it cannot get enough resources for the user pod (not sure how to check this other than checking actual resource usage).

A networking issue causes the hub pod to be unable to reach the AKS api to tell it to spawn a new single user pod.

I would not be surprised to learn this is some interaction between the Azure engine and JupyterHub; they have quite a few issues they are working through with AKS.

Try to run JupyterHub with increased logging so I can see what is going wrong inside the spawner/handler? Not sure how to do this.
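For reference, a minimal sketch of the usual way to turn logging up, assuming these lines go into jupyterhub_config.py (or into hub.extraConfig in the helm chart), where c is the config object JupyterHub provides:

# Verbose hub-side logging
c.JupyterHub.log_level = 'DEBUG'

# Enable debug-logging of the single-user servers started by the spawner
c.Spawner.debug = True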

Conclusion:

This issue could be related to:

jupyter/jupyter#382

I know the text above is long, but frankly part of the problem is that I'm not really sure where to look for an issue that provides so little feedback. I've had a look at the Kubernetes troubleshooting docs (and frankly countless issues and posts) here (https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/); as a general guideline they're good, but progress is slow, which is why I'm posting this issue. Nothing quite matches what I'm seeing, and I've been digging for about two weeks on and off now.

Sources I'm digging through:

https://github.com/jupyterhub/kubespawner/blob/master/kubespawner/spawner.py

https://github.com/jupyterhub/jupyterhub/blob/master/jupyterhub/spawner.py

https://github.com/jupyterhub/jupyterhub/blob/af0082a16ba365a4b1bfb79f9d01a7c1a6cee5be/jupyterhub/handlers/base.py

Main ask:

Where should I look to understand what is going wrong here? What log can I open? What command can I run? What setting can I put on JupyterHub to get more feedback on this issue?

Thanks a lot

minrk (Member) commented Mar 25, 2019

The main thing to debug here is to watch the failing pods to see their events. kubectl get events may be what you want, or kubectl describe pod on the pod that's not starting.

The "XXX's server failed to start in 300 seconds, giving up" message means that JupyterHub wasn't informed that the given pod entered the Running state within the time limit. This can have a few causes:

  1. Events failed to be delivered, due to a Kubernetes issue. This should be rare and is quite hard to debug. It used to be a real issue, but I believe it is fixed in recent KubeSpawner (0.10). (In this case, the pod will be running, but JupyterHub incorrectly thinks it is not.)
  2. Spawning is just slow, e.g. because pulling the image takes a long time. If this is the case, use the pre-puller to pre-pull images to nodes, or increase the spawn timeout if spawns really do take that long (the pod will be in a state such as "pulling image...").
  3. Spawning fails, e.g. due to exhausted resource limits, failed image pulls, etc. A change in the relevant configuration should address this one. In this case, the pod will be in an error state, e.g. NotSchedulable, ImagePullBackoff, etc.

In all of these cases, the state of the pod when it is considered to have failed is the key to debugging.
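For reference, a rough sketch of checking a stuck pod's phase and events with the kubernetes Python client (roughly the same information kubectl describe pod surfaces); the namespace and pod name below are placeholders:

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # or load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

namespace = "jhub"             # placeholder: the z2jh release namespace
pod_name = "jupyter-someuser"  # placeholder: the user pod that is stuck

try:
    pod = v1.read_namespaced_pod(pod_name, namespace)
    print("phase:", pod.status.phase)
except ApiException as e:
    if e.status == 404:
        # No pod object exists at all, which points at the hub -> API path
        # rather than at scheduling or image pulls on the nodes.
        print("pod was never created")
    else:
        raise

# Events recorded against the pod (scheduling failures, image pulls, ...)
events = v1.list_namespaced_event(
    namespace, field_selector=f"involvedObject.name={pod_name}"
)
for ev in events.items:
    print(ev.last_timestamp, ev.reason, ev.message)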

yvan (Author) commented Mar 25, 2019

Pretty sure it's in KubeSpawner or it's an issue with AKS (Azure Kubernetes). I am familiar with 2 and 3, and none of those signs are there. I do use a pre-puller and it works quite well. I'm well below resource limits, and there are no signals or indications that Kubernetes thinks otherwise. I'm currently following up with Microsoft tech folks to try to confirm or eliminate an AKS issue.

laserninja commented:

I am also encountering the exact same issue. I am pre-pulling the images, so the 2nd point is not an issue for me. I am also well below the resource limit, which rules out the 3rd point. I am using AKS as well.

yvan (Author) commented Mar 26, 2019

I'm also pursuing the idea that this is somehow related to ephemeral storage: the cluster lacks it and so cannot schedule a pod. Problems with this idea: 1 - there are no messages anywhere about lacking it; 2 - when some of my other pods run out of ephemeral storage, they get evicted; 3 - I've never seen any disk pressure messages. My main theory is that some pod/container is logging a lot and causing us to run out of inodes, similar to what is seen here: kubernetes/kubernetes#32542.

Questions that I'm working on answering:

Would running out of inodes trigger a disk pressure message, or is that only triggered by space? Does Kubernetes do any logging to ephemeral storage areas? How can such ephemeral storage areas be cleaned up?
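As a rough way to test the inode theory (assuming you can get a shell or a privileged debug pod onto the node), os.statvfs exposes inode counts for a path; the paths below are only examples:

import os

path = "/var/lib/kubelet"  # example path; /var/lib/docker is another usual suspect
st = os.statvfs(path)

inodes_used = st.f_files - st.f_ffree
space_free_gb = st.f_bavail * st.f_frsize / 1e9
print(f"{path}: {inodes_used}/{st.f_files} inodes used, {space_free_gb:.1f} GB free")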

yvan (Author) commented Apr 8, 2019

@laserninja could you give us your cluster specs/config/k8s version?

laserninja commented:

Yes, sure. I also had a chat with the AKS support guys and they mentioned that I am using VMSS for autoscaling and it is under preview. I don't think this would be an issue.

Currently, I am running a single-node cluster with 6 total cores, 56 GB of total memory, and a GPU. K8s version: 1.12.5. Let me know if you need any other details.

PS: I also have Airflow with the Kubernetes executor running on this cluster. Its scheduler also needs to be restarted quite often.

yvan (Author) commented Apr 9, 2019

OK, I was kind of wondering if it would be possible, for example, to share your helm configs, X-ing out whatever info you don't want shared. It would be helpful to compare, for example with the singleuser spec I shared above, or any logs from your hub. Have you tried putting your hub in DEBUG mode? Like so: https://github.com/jupyterhub/jupyterhub/wiki/Debug-Jupyterhub ?

laserninja commented Apr 9, 2019

Yes, sure.

proxy:
  secretToken: "xxxxxxxxxxxxxxxxx"
singleuser:
  image:
    name: jupyter/minimal-notebook
    tag: 2343e33dec46
  extraEnv:
    AAD_TENANT_ID: 'xxxxxxxxxxxxxxxxx'
    KUBERNETES_SERVICE_HOST: 'kubernetes.default.svc.cluster.local'
    #This does not work. the env var is not injected properly from here.
  storage:
    dynamic:
      storageClass: azurefile

  profileList:
    - display_name: "Minimal environment"
      description: "To avoid too much bells and whistles: Python."
      default: true
    - display_name: "Datascience environment"
      description: "If you want the additional bells and whistles: Python, R, and Julia."
      kubespawner_override:
        image: jupyter/datascience-notebook:2343e33dec46
    - display_name: "Spark environment"
      description: "The Jupyter Stacks spark image!"
      kubespawner_override:
        image: jupyter/all-spark-notebook:2343e33dec46
    - display_name: "Tensorflow GPU"
      description: "Tensorflow on steroids!"
      kubespawner_override:
        image: gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1
        extra_resource_limits:
          nvidia.com/gpu: "1"
    - display_name: "Custom Minimal"
      description: "The Jupyter thats customized for you!"
      kubespawner_override:
        image: thoracr.azurecr.io/thor-jupyterhub/xxxxxxx

scheduling:
  podPriority:
    enabled: true
  userPlaceholder:
    # Specify three dummy user pods will be used as placeholders
    replicas: 3
auth:
  type: custom
  custom:
    className: 'oauthenticator.azuread.AzureAdOAuthenticator'
    config:
      client_id: "xxxxxxxxxxxxxxxxx"
      client_secret: "xxxxxxxxxxxxxxxxx"
      oauth_callback_url: "xxxxxxxxxxxxxxxxx"

There are no errors in the hub except what you posted. I'll try putting my hub in DEBUG mode.

laserninja commented:

I tried the DEBUG mode and there are no error messages. It just says the hub is trying to spawn the user pod.

yvan (Author) commented Apr 11, 2019

Good to know. Was this just for the hub, or did you plug it in for the spawner as well?

laserninja commented:

Just the hub. I am not aware of a spawner pod; where can I see the spawner?
I just set this up:

c.JupyterHub.log_level = 'DEBUG'

# Enable debug-logging of the single-user server
c.Spawner.debug = True

# Enable debug-logging of the single-user server
c.LocalProcessSpawner.debug = True

and checked the logs of the "hub" pod.

yvan (Author) commented Apr 11, 2019

If you set c.Spawner then that should do it; pretty sure those messages should show up in the hub logs. So if you saw nothing, then I assume there's not much more to do there. My guess is the spawner never even gets engaged.

laserninja commented:

If you have the auto-scaler disabled, you can check the kubelet logs.

yvan (Author) commented Apr 12, 2019

Not sure it will matter. Kubelets handle running pods, and my pod never even starts. Are you suggesting checking the kubelet log for the hub? If so, what would you search for there?

laserninja commented:

You can check the kubelet logs on the VM. I just want to check whether it is at least receiving some requests or throwing any errors. You can find your VM in the MC_ resource group.
To get the logs: sudo journalctl -u kubelet -o cat

laserninja commented:

Did you try this: jupyter/jupyter#382 (comment)
Did it work for you?

yvan (Author) commented Apr 16, 2019

No. I didn't see what this "Azure load balancer drops connections after ~5 min of inactivity" has to do with anything. My thing doesn't drop after 5 minutes; it drops right away and THEN times out after 5 minutes. I.e. the 5 minutes is not the problem; the dropping and inability to spawn a pod seems to be the problem.

laserninja commented:

It worked for me. It's been 3 days and I am able to spawn user pods consistently.

yvan (Author) commented Apr 19, 2019

I am running a test; will try to confirm.

yvan (Author) commented Apr 24, 2019

jupyter/jupyter#382 works. To explain what is happening more clearly: when the hub pod is created, it is connected to a load balancer attached to the cluster that handles networking and passes requests along to the Kubernetes API. If the connection is 'inactive' for 4 (or 5?) minutes, the load balancer severs it. This connection is how the hub pod reaches the kube-api to send the 'signal' to spawn a new user pod; once it is severed, those signals get dropped, and the Kubernetes scheduler never receives a request to spawn a user pod via the kube-api/load balancer. The solution basically sends a TCP keepalive from the hub pod every 60 seconds so the load balancer doesn't drop the connection, which maintains connectivity between the hub pod and the place where it sends signals to spawn new pods. Not sure what the load balancer policies are like on EKS or Google Cloud; this network policy may be AKS (Azure) specific.

It is worth exploring increasing the keepalive interval from 60 seconds to maybe 120 or 180 seconds (2 or 3 minutes); it probably makes no difference.

(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)

->

(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 120)

Another small change: make sure to add it to the hub extraConfig if using helm:

hub:
  extraConfig: |
    import socket
    from urllib3 import connection
    # workaround for Azure load balancer dropping idle connections
    connection.HTTPConnection.default_socket_options += [
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
    ]

Please see https://stackoverflow.com/questions/22472844/how-to-set-socket-option-tcp-keepcnt-tcp-keepidle-tcp-keepintvl-in-java-or-n for an explanation of some of the socket options. See also: http://www.unixguide.net/network/socketfaq/4.7.shtml
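For clarity, a standalone sketch of what that snippet does, with the options annotated (Linux-only socket constants, which is fine since the hub image runs Linux); the request URL is just a placeholder:

import socket
import urllib3
from urllib3 import connection

# Same mechanism as the extraConfig above: every connection urllib3 opens from
# now on (including the kubernetes client that KubeSpawner uses) carries kernel
# TCP keepalives, so the Azure load balancer never sees the hub's API
# connections as idle.
connection.HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # turn keepalives on
    (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # seconds idle before the first probe
    (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),  # seconds between probes
    (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),     # failed probes before the kernel drops the connection
]

http = urllib3.PoolManager()
print(http.request("GET", "https://example.com").status)  # placeholder request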

yvan closed this as completed Apr 24, 2019
paularmand commented:

@yvan thanks for writing the more detailed explanation, because I had failed to implement jupyter/jupyter#382 since I didn't trust it would work. Much appreciated!
