GPU support on GKE not available #1246

Closed
mmatiaschek opened this issue Jul 19, 2018 · 5 comments

mmatiaschek commented Jul 19, 2018

KUBEFLOW_VERSION=0.2.2

After setting up a fairly default cluster on GKE with the "getting-started-gke" deploy.sh and gpu-pool-initialNodeCount: 1, I could not spawn GPU images from JupyterHub. Removing the taint on the GPU node with kubectl taint nodes gke-hub-gpu-pool-9d1db964-9gqn nvidia.com/gpu:NoSchedule- allows me to spawn the image gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1.
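For context, a quick way to confirm which taint is blocking scheduling and whether the node exposes the GPU resource at all (a generic sketch using the node name from this cluster):

$ kubectl describe node gke-hub-gpu-pool-9d1db964-9gqn | grep -i taints
$ kubectl describe node gke-hub-gpu-pool-9d1db964-9gqn | grep nvidia.com/gpu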

I then created a Jupyter notebook and executed the following:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

but I get this error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     57 
---> 58   from tensorflow.python.pywrap_tensorflow_internal import *
     59   from tensorflow.python.pywrap_tensorflow_internal import __version__

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py in <module>()
     27             return _mod
---> 28     _pywrap_tensorflow_internal = swig_import_helper()
     29     del swig_import_helper

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py in swig_import_helper()
     23             try:
---> 24                 _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
     25             finally:

/opt/conda/lib/python3.6/imp.py in load_module(name, file, filename, details)
    242         else:
--> 243             return load_dynamic(name, filename, file)
    244     elif type_ == PKG_DIRECTORY:

/opt/conda/lib/python3.6/imp.py in load_dynamic(name, path, file)
    342             name=name, loader=loader, origin=path)
--> 343         return _load(spec)
    344 

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-1-0ca82b29604d> in <module>()
----> 1 from tensorflow.python.client import device_lib
      2 print(device_lib.list_local_devices())

/opt/conda/lib/python3.6/site-packages/tensorflow/__init__.py in <module>()
     22 
     23 # pylint: disable=g-bad-import-order
---> 24 from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
     25 # pylint: disable=wildcard-import
     26 from tensorflow.tools.api.generator.api import *  # pylint: disable=redefined-builtin

/opt/conda/lib/python3.6/site-packages/tensorflow/python/__init__.py in <module>()
     47 import numpy as np
     48 
---> 49 from tensorflow.python import pywrap_tensorflow
     50 
     51 # Protocol buffers

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     72 for some common reasons and solutions.  Include the entire stack trace
     73 above this error message when asking for help.""" % traceback.format_exc()
---> 74   raise ImportError(msg)
     75 
     76 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/opt/conda/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/opt/conda/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

More information:

$ kubectl get nodes
NAME                                 STATUS     ROLES     AGE       VERSION
gke-hub-default-pool-ad675cd4-0lcq   Ready      <none>    5h        v1.9.6-gke.1
gke-hub-default-pool-ad675cd4-5sf3   Ready      <none>    1h        v1.9.6-gke.1
gke-hub-default-pool-ad675cd4-7kk2   Ready      <none>    1h        v1.9.6-gke.1
gke-hub-default-pool-ad675cd4-nccc   Ready      <none>    5h        v1.9.6-gke.1
gke-hub-gpu-pool-9d1db964-9gqn       NotReady   <none>    3s        v1.9.6-gke.1

$ kubectl get pods -n kubeflow -o wide
NAME                                          READY     STATUS    RESTARTS   AGE       IP           NODE
ambassador-9f658d5bc-bbhfw                    2/2       Running   0          1h        10.60.0.9    gke-hub-default-pool-ad675cd4-nccc
ambassador-9f658d5bc-rxlnl                    2/2       Running   0          1h        10.60.4.4    gke-hub-default-pool-ad675cd4-5sf3
ambassador-9f658d5bc-w7cmr                    2/2       Running   0          1h        10.60.1.9    gke-hub-default-pool-ad675cd4-0lcq
centraldashboard-6665fc46cb-tz92l             1/1       Running   0          1h        10.60.0.8    gke-hub-default-pool-ad675cd4-nccc
cert-manager-555c87df98-jtgsk                 2/2       Running   0          1h        10.60.0.12   gke-hub-default-pool-ad675cd4-nccc
cgm-pd-provisioner-5cd8b667b4-kwnzh           1/1       Running   0          4m        10.60.4.5    gke-hub-default-pool-ad675cd4-5sf3
cloud-endpoints-controller-6584cfdf54-fnfrm   1/1       Running   0          1h        10.60.1.12   gke-hub-default-pool-ad675cd4-0lcq
envoy-76774f8d5c-tcqlc                        2/2       Running   2          1h        10.60.3.4    gke-hub-default-pool-ad675cd4-7kk2
envoy-76774f8d5c-vllhw                        2/2       Running   2          1h        10.60.3.3    gke-hub-default-pool-ad675cd4-7kk2
envoy-76774f8d5c-xpgxj                        2/2       Running   1          1h        10.60.4.3    gke-hub-default-pool-ad675cd4-5sf3
iap-enabler-6586ccc64-r6b46                   1/1       Running   0          1h        10.60.1.13   gke-hub-default-pool-ad675cd4-0lcq
kube-metacontroller-69fcb8c5d4-k4kq2          1/1       Running   0          1h        10.60.0.11   gke-hub-default-pool-ad675cd4-nccc
spartakus-volunteer-7f5ccf89d7-bvjg4          1/1       Running   0          1h        10.60.1.8    gke-hub-default-pool-ad675cd4-0lcq
tf-hub-0                                      1/1       Running   0          1h        10.60.0.10   gke-hub-default-pool-ad675cd4-nccc
tf-job-dashboard-644865ddff-vbg57             1/1       Running   0          1h        10.60.1.10   gke-hub-default-pool-ad675cd4-0lcq
tf-job-operator-v1alpha2-75bcb7f5f7-7nvqh     1/1       Running   0          1h        10.60.1.11   gke-hub-default-pool-ad675cd4-0lcq
whoami-app-6d9d8dc867-9xz76                   1/1       Running   0          1h        10.60.0.13   gke-hub-default-pool-ad675cd4-nccc

$ kubectl describe node gke-hub-gpu-pool-9d1db964-9gqn
Name:               gke-hub-gpu-pool-9d1db964-9gqn
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-accelerator=nvidia-tesla-k80
                    cloud.google.com/gke-nodepool=gpu-pool
                    failure-domain.beta.kubernetes.io/region=europe-west1
                    failure-domain.beta.kubernetes.io/zone=europe-west1-b
                    kubernetes.io/hostname=gke-hub-gpu-pool-9d1db964-9gqn
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Thu, 19 Jul 2018 22:55:04 +0200
Taints:             nvidia.com/gpu=present:NoSchedule
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       False   Thu, 19 Jul 2018 22:55:04 +0200   Thu, 19 Jul 2018 22:55:03 +0200   KernelHasNoDeadlock          kernel has no deadlock
  NetworkUnavailable   False   Thu, 19 Jul 2018 22:55:19 +0200   Thu, 19 Jul 2018 22:55:19 +0200   RouteCreated                 RouteController created a route
  OutOfDisk            False   Thu, 19 Jul 2018 22:55:54 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 19 Jul 2018 22:55:54 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 19 Jul 2018 22:55:54 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready                True    Thu, 19 Jul 2018 22:55:54 +0200   Thu, 19 Jul 2018 22:55:24 +0200   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.132.0.6
  ExternalIP:  146.148.13.101
  Hostname:    gke-hub-gpu-pool-9d1db964-9gqn
Capacity:
 cpu:     4
 memory:  15405960Ki
 pods:    110
Allocatable:
 cpu:     3920m
 memory:  12706696Ki
 pods:    110
System Info:
 Machine ID:                 5b124e59e9078f45cc7709c4ed99fa94
 System UUID:                5B124E59-E907-8F45-CC77-09C4ED99FA94
 Boot ID:                    060ac762-0a32-490a-b50f-050894488776
 Kernel Version:             4.4.111+
 OS Image:                   Container-Optimized OS from Google
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.9.6-gke.1
 Kube-Proxy Version:         v1.9.6-gke.1
PodCIDR:                     10.60.5.0/24
ExternalID:                  5156590157901415579
ProviderID:                  gce://child-growth-monitor/europe-west1-b/gke-hub-gpu-pool-9d1db964-9gqn
Non-terminated Pods:         (4 in total)
  Namespace                  Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                         ------------  ----------  ---------------  -------------
  kube-system                fluentd-gcp-v2.0.10-94k5b                    100m (2%)     0 (0%)      200Mi (1%)       300Mi (2%)
  kube-system                kube-proxy-gke-hub-gpu-pool-9d1db964-9gqn    100m (2%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-driver-installer-vzsz4                150m (3%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-gpu-device-plugin-vv89m               50m (1%)      50m (1%)    10Mi (0%)        10Mi (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  400m (10%)    50m (1%)    210Mi (1%)       310Mi (2%)
Events:
  Type    Reason                   Age                From                                        Message
  ----    ------                   ----               ----                                        -------
  Normal  Starting                 53s                kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Starting kubelet.
  Normal  NodeHasSufficientDisk    53s (x2 over 53s)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  53s (x2 over 53s)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    53s (x2 over 53s)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasNoDiskPressure
  Normal  NodeAllocatableEnforced  53s                kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Updated Node Allocatable limit across pods
  Normal  Starting                 49s                kube-proxy, gke-hub-gpu-pool-9d1db964-9gqn  Starting kube-proxy.
  Normal  NodeReady                33s                kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeReady

Spawning the image gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1 via JupyterHub:

$ kubectl get pods -n kubeflow -o wide
NAME                                                          READY     STATUS    RESTARTS   AGE       IP           NODE
ambassador-9f658d5bc-bbhfw                                    2/2       Running   0          1h        10.60.0.9    gke-hub-default-pool-ad675cd4-nccc
ambassador-9f658d5bc-rxlnl                                    2/2       Running   0          1h        10.60.4.4    gke-hub-default-pool-ad675cd4-5sf3
ambassador-9f658d5bc-w7cmr                                    2/2       Running   0          1h        10.60.1.9    gke-hub-default-pool-ad675cd4-0lcq
centraldashboard-6665fc46cb-tz92l                             1/1       Running   0          1h        10.60.0.8    gke-hub-default-pool-ad675cd4-nccc
cert-manager-555c87df98-jtgsk                                 2/2       Running   0          1h        10.60.0.12   gke-hub-default-pool-ad675cd4-nccc
cgm-pd-provisioner-5cd8b667b4-kwnzh                           1/1       Running   0          6m        10.60.4.5    gke-hub-default-pool-ad675cd4-5sf3
cloud-endpoints-controller-6584cfdf54-fnfrm                   1/1       Running   0          1h        10.60.1.12   gke-hub-default-pool-ad675cd4-0lcq
envoy-76774f8d5c-tcqlc                                        2/2       Running   2          1h        10.60.3.4    gke-hub-default-pool-ad675cd4-7kk2
envoy-76774f8d5c-vllhw                                        2/2       Running   2          1h        10.60.3.3    gke-hub-default-pool-ad675cd4-7kk2
envoy-76774f8d5c-xpgxj                                        2/2       Running   1          1h        10.60.4.3    gke-hub-default-pool-ad675cd4-5sf3
iap-enabler-6586ccc64-r6b46                                   1/1       Running   0          1h        10.60.1.13   gke-hub-default-pool-ad675cd4-0lcq
jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom   0/1       Pending   0          7s        <none>       <none>
kube-metacontroller-69fcb8c5d4-k4kq2                          1/1       Running   0          1h        10.60.0.11   gke-hub-default-pool-ad675cd4-nccc
spartakus-volunteer-7f5ccf89d7-bvjg4                          1/1       Running   0          1h        10.60.1.8    gke-hub-default-pool-ad675cd4-0lcq
tf-hub-0                                                      1/1       Running   0          1h        10.60.0.10   gke-hub-default-pool-ad675cd4-nccc
tf-job-dashboard-644865ddff-vbg57                             1/1       Running   0          1h        10.60.1.10   gke-hub-default-pool-ad675cd4-0lcq
tf-job-operator-v1alpha2-75bcb7f5f7-7nvqh                     1/1       Running   0          1h        10.60.1.11   gke-hub-default-pool-ad675cd4-0lcq
whoami-app-6d9d8dc867-9xz76                                   1/1       Running   0          1h        10.60.0.13   gke-hub-default-pool-ad675cd4-nccc

$ kubectl -n kubeflow describe po/jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
Name:         jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
Namespace:    kubeflow
Node:         <none>
Labels:       app=jupyterhub
              component=singleuser-server
              heritage=jupyterhub
Annotations:  hub.jupyter.org/username=accounts.google.com:mmatiaschek@gmail.com
Status:       Pending
IP:
Containers:
  notebook:
    Image:      gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      start-singleuser.sh
      --ip="0.0.0.0"
      --port=8888
      --allow-root
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      JUPYTERHUB_API_TOKEN:           5f54e7941eb74132b1d58fba8db68031
      JPY_API_TOKEN:                  5f54e7941eb74132b1d58fba8db68031
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-accounts.google.com%3Ammatiaschek%40gmail.com
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/accounts.google.com%3Ammatiaschek@gmail.com/oauth_callback
      JUPYTERHUB_USER:                accounts.google.com:mmatiaschek@gmail.com
      JUPYTERHUB_API_URL:             http://tf-hub-0:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_SERVICE_PREFIX:      /user/accounts.google.com%3Ammatiaschek@gmail.com/
      MEM_GUARANTEE:                  1Gi
      CPU_GUARANTEE:                  500m
    Mounts:
      /home/jovyan from volume-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from no-api-access-please (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  volume-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  claim-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
    ReadOnly:   false
  no-api-access-please:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  12s (x6 over 27s)  default-scheduler  0/5 nodes are available: 1 PodToleratesNodeTaints, 4 Insufficient cpu, 4 Insufficient memory.

Removing the taint:

$ kubectl taint nodes gke-hub-gpu-pool-9d1db964-9gqn nvidia.com/gpu:NoSchedule-
node "gke-hub-gpu-pool-9d1db964-9gqn" untainted

$ kubectl describe node gke-hub-gpu-pool-9d1db964-9gqn
Name:               gke-hub-gpu-pool-9d1db964-9gqn
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-accelerator=nvidia-tesla-k80
                    cloud.google.com/gke-nodepool=gpu-pool
                    failure-domain.beta.kubernetes.io/region=europe-west1
                    failure-domain.beta.kubernetes.io/zone=europe-west1-b
                    kubernetes.io/hostname=gke-hub-gpu-pool-9d1db964-9gqn
Annotations:        node.alpha.kubernetes.io/ttl=0
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Thu, 19 Jul 2018 22:55:04 +0200
Taints:             <none>
Unschedulable:      false
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       False   Thu, 19 Jul 2018 23:04:11 +0200   Thu, 19 Jul 2018 22:55:03 +0200   KernelHasNoDeadlock          kernel has no deadlock
  NetworkUnavailable   False   Thu, 19 Jul 2018 22:55:19 +0200   Thu, 19 Jul 2018 22:55:19 +0200   RouteCreated                 RouteController created a route
  OutOfDisk            False   Thu, 19 Jul 2018 23:04:15 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure       False   Thu, 19 Jul 2018 23:04:15 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 19 Jul 2018 23:04:15 +0200   Thu, 19 Jul 2018 22:55:04 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  Ready                True    Thu, 19 Jul 2018 23:04:15 +0200   Thu, 19 Jul 2018 22:55:24 +0200   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.132.0.6
  ExternalIP:  146.148.13.101
  Hostname:    gke-hub-gpu-pool-9d1db964-9gqn
Capacity:
 cpu:             4
 memory:          15405960Ki
 nvidia.com/gpu:  1
 pods:            110
Allocatable:
 cpu:             3920m
 memory:          12706696Ki
 nvidia.com/gpu:  1
 pods:            110
System Info:
 Machine ID:                 5b124e59e9078f45cc7709c4ed99fa94
 System UUID:                5B124E59-E907-8F45-CC77-09C4ED99FA94
 Boot ID:                    060ac762-0a32-490a-b50f-050894488776
 Kernel Version:             4.4.111+
 OS Image:                   Container-Optimized OS from Google
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.9.6-gke.1
 Kube-Proxy Version:         v1.9.6-gke.1
PodCIDR:                     10.60.5.0/24
ExternalID:                  5156590157901415579
ProviderID:                  gce://child-growth-monitor/europe-west1-b/gke-hub-gpu-pool-9d1db964-9gqn
Non-terminated Pods:         (5 in total)
  Namespace                  Name                                                           CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                           ------------  ----------  ---------------  -------------
  kube-system                fluentd-gcp-v2.0.10-94k5b                                      100m (2%)     0 (0%)      200Mi (1%)       300Mi (2%)
  kube-system                kube-proxy-gke-hub-gpu-pool-9d1db964-9gqn                      100m (2%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-driver-installer-vzsz4                                  150m (3%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-gpu-device-plugin-vv89m                                 50m (1%)      50m (1%)    10Mi (0%)        10Mi (0%)
  kubeflow                   jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom    500m (12%)    0 (0%)      1Gi (8%)         0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ------------  ----------  ---------------  -------------
  900m (22%)    50m (1%)    1234Mi (9%)      310Mi (2%)
Events:
  Type    Reason                   Age              From                                        Message
  ----    ------                   ----             ----                                        -------
  Normal  Starting                 9m               kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Starting kubelet.
  Normal  NodeHasSufficientDisk    9m (x2 over 9m)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  9m (x2 over 9m)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    9m (x2 over 9m)  kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeHasNoDiskPressure
  Normal  NodeAllocatableEnforced  9m               kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Updated Node Allocatable limit across pods
  Normal  Starting                 9m               kube-proxy, gke-hub-gpu-pool-9d1db964-9gqn  Starting kube-proxy.
  Normal  NodeReady                8m               kubelet, gke-hub-gpu-pool-9d1db964-9gqn     Node gke-hub-gpu-pool-9d1db964-9gqn status is now: NodeReady

$ kubectl -n kubeflow describe pod jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
Name:         jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
Namespace:    kubeflow
Node:         gke-hub-gpu-pool-9d1db964-9gqn/10.132.0.6
Start Time:   Thu, 19 Jul 2018 23:03:55 +0200
Labels:       app=jupyterhub
              component=singleuser-server
              heritage=jupyterhub
Annotations:  hub.jupyter.org/username=accounts.google.com:mmatiaschek@gmail.com
Status:       Pending
IP:
Containers:
  notebook:
    Container ID:
    Image:         gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1
    Image ID:
    Port:          8888/TCP
    Host Port:     0/TCP
    Args:
      start-singleuser.sh
      --ip="0.0.0.0"
      --port=8888
      --allow-root
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     500m
      memory:  1Gi
    Environment:
      JUPYTERHUB_API_TOKEN:           5f54e7941eb74132b1d58fba8db68031
      JPY_API_TOKEN:                  5f54e7941eb74132b1d58fba8db68031
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-accounts.google.com%3Ammatiaschek%40gmail.com
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/accounts.google.com%3Ammatiaschek@gmail.com/oauth_callback
      JUPYTERHUB_USER:                accounts.google.com:mmatiaschek@gmail.com
      JUPYTERHUB_API_URL:             http://tf-hub-0:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_SERVICE_PREFIX:      /user/accounts.google.com%3Ammatiaschek@gmail.com/
      MEM_GUARANTEE:                  1Gi
      CPU_GUARANTEE:                  500m
    Mounts:
      /home/jovyan from volume-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from no-api-access-please (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  volume-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  claim-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom
    ReadOnly:   false
  no-api-access-please:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age               From                                     Message
  ----     ------                 ----              ----                                     -------
  Warning  FailedScheduling       2m (x22 over 7m)  default-scheduler                        0/5 nodes are available: 1 PodToleratesNodeTaints, 4 Insufficient cpu, 4 Insufficient memory.
  Normal   Scheduled              1m                default-scheduler                        Successfully assigned jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom to gke-hub-gpu-pool-9d1db964-9gqn
  Normal   SuccessfulMountVolume  1m                kubelet, gke-hub-gpu-pool-9d1db964-9gqn  MountVolume.SetUp succeeded for volume "no-api-access-please"
  Normal   SuccessfulMountVolume  1m                kubelet, gke-hub-gpu-pool-9d1db964-9gqn  MountVolume.SetUp succeeded for volume "pvc-348452ed-8b8d-11e8-9c05-42010a8400f9"
  Normal   Pulling                1m                kubelet, gke-hub-gpu-pool-9d1db964-9gqn  pulling image "gcr.io/kubeflow-images-public/tensorflow-1.8.0-notebook-gpu:v0.2.1"

Even then I still cannot use the GPU, because the container image does not seem to have the right drivers for cloud.google.com/gke-accelerator=nvidia-tesla-k80.
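One way to see whether the driver libraries were mounted into the notebook container at all is to look for libcuda inside the pod; the /usr/local/nvidia path is an assumption about where GKE's device plugin mounts the drivers and may differ:

$ kubectl -n kubeflow exec -it jupyter-accounts-2egoogle-2ecom-3ammatiaschek-40gmail-2ecom -- ls /usr/local/nvidia/lib64 | grep libcuda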

Any help appreciated.

mmatiaschek changed the title from "taint prevents jupyterhub gpu container from spawning" to "GPU support on GKE not available" on Jul 19, 2018
swiftdiaries (Member) commented:

Others can correct me if I'm wrong, but I think you need to have the NVIDIA driver DaemonSet installed in your cluster:

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/cos/daemonset-preloaded.yaml

Source: https://cloud.google.com/kubernetes-engine/docs/concepts/gpus
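If the DaemonSet is in place, the installer and device-plugin pods should show up on the GPU node, and the node should report nvidia.com/gpu capacity; a quick sketch to check:

$ kubectl get pods -n kube-system -o wide | grep nvidia
$ kubectl describe node gke-hub-gpu-pool-9d1db964-9gqn | grep nvidia.com/gpu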

mmatiaschek (Author) commented:

@swiftdiaries unfortunately that isn't it. The DaemonSet is already applied by deploy.sh, and to be sure I also applied it manually, with no success.

jlewi (Contributor) commented Jul 20, 2018

It doesn't look like you requested GPUs for your notebook:

    Requests:
      cpu:     500m
      memory:  1Gi

In the JupyterHub spawner, did you supply extra resource limits? For example:

{"nvidia.com/gpu":1}

jlewi added the area/jupyter label on Jul 20, 2018
jlewi (Contributor) commented Jul 20, 2018

mmatiaschek (Author) commented:

Yes, I have tried it with and without this spawner option. In the end, the exact option as documented worked for me: {"nvidia.com/gpu": "1"}

The spawner itself suggests another format, and I don't think that one worked:
[screenshot of the spawner's suggestion]

Thank you so much for your help!!
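For anyone hitting the same thing: once the pod schedules onto the GPU node with the GPU requested, a quick sanity check is running nvidia-smi from a terminal inside the notebook (the binary path is an assumption about where GKE mounts the driver tools), after which the device_lib snippet above should list a GPU device:

$ /usr/local/nvidia/bin/nvidia-smi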
