JupyterHub fails to load image properly, but starts a notebook anyway #226

Closed
bkungfoo opened this issue Feb 9, 2018 · 6 comments

@bkungfoo

bkungfoo commented Feb 9, 2018

I've encountered this several times during deployment, both on minikube and GKE. When starting JupyterHub, spawning a server with a valid image (e.g. gcr.io/kubeflow/tensorflow-notebook-gpu:8fbc341245695e482848ac3c2034a99f7c1e5763) sometimes creates a container without any libraries installed.

kubectl logs tf-hub-0 -n $NAMESPACE shows the following warning:

[W 2018-02-08 23:51:44.573 JupyterHub configurable:168] Config option singleuser_image_spec not recognized by KubeFormSpawner. Did you mean one of: singleuser_image_pull_policy, singleuser_image_pull_secrets, singleuser_node_selector?

@jlewi
Contributor

jlewi commented Feb 10, 2018

When you say no libraries are installed, do you mean Python libraries?
Does Jupyter start running?

@jlewi
Contributor

jlewi commented Feb 12, 2018

@bkungfoo ping? Any more info?

@bkungfoo
Author

Jupyter starts running, but tf-gpu is not properly installed. Here is what I get when I create a notebook and run "import tensorflow as tf":


ImportError                               Traceback (most recent call last)
/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     57
---> 58   from tensorflow.python.pywrap_tensorflow_internal import *
     59   from tensorflow.python.pywrap_tensorflow_internal import __version__

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py in <module>()
     27             return _mod
---> 28     _pywrap_tensorflow_internal = swig_import_helper()
     29     del swig_import_helper

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py in swig_import_helper()
     23             try:
---> 24                 _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
     25             finally:

/opt/conda/lib/python3.6/imp.py in load_module(name, file, filename, details)
    242         else:
--> 243             return load_dynamic(name, filename, file)
    244     elif type_ == PKG_DIRECTORY:

/opt/conda/lib/python3.6/imp.py in load_dynamic(name, path, file)
    342             name=name, loader=loader, origin=path)
--> 343         return _load(spec)
    344

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 import tensorflow as tf

/opt/conda/lib/python3.6/site-packages/tensorflow/__init__.py in <module>()
     22
     23 # pylint: disable=wildcard-import
---> 24 from tensorflow.python import *
     25 # pylint: enable=wildcard-import
     26

/opt/conda/lib/python3.6/site-packages/tensorflow/python/__init__.py in <module>()
     47 import numpy as np
     48
---> 49 from tensorflow.python import pywrap_tensorflow
     50
     51 # Protocol buffers

/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py in <module>()
     71 for some common reasons and solutions.  Include the entire stack trace
     72 above this error message when asking for help.""" % traceback.format_exc()
---> 73   raise ImportError(msg)
     74
     75 # pylint: enable=wildcard-import,g-import-not-at-top,unused-import,line-too-long

ImportError: Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/opt/conda/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/opt/conda/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/install_sources#common_installation_problems

for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.

@jlewi
Contributor

jlewi commented Feb 12, 2018

This usually means GPUs aren't properly configured.
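
One quick way to confirm this from inside the notebook, without importing TensorFlow at all, is to ask the dynamic loader directly for the library named in the ImportError. The snippet below is only a minimal sketch: the helper name `can_load_shared_library` is made up for illustration, and it only tells you whether `libcuda.so.1` is visible to the process, not why it is missing.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library.

    Hypothetical helper for illustration; it is not part of Kubeflow,
    JupyterHub, or TensorFlow.
    """
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library cannot be found or opened, which is the
        # same condition behind "cannot open shared object file".
        return False
    return True


if __name__ == "__main__":
    # libcuda.so.1 is the library named in the ImportError in this thread.
    print("libcuda.so.1 loadable:", can_load_shared_library("libcuda.so.1"))
```

If this prints `False` in a notebook that was started from the GPU image, the driver libraries are most likely not being exposed to the container, which would point at the cluster setup rather than at the image.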

I'm assuming you are running on GKE?

  1. Did you follow the GKE instructions to install the NVIDIA drivers via a DaemonSet?
  2. When you spawned the Jupyter server via JupyterHub, did you specify GPUs in the resource requirements?
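
For the second question, a rough sketch of how one might check a spawned pod's spec for a GPU request. It assumes the pod spec has already been loaded into a Python dict (for example from the `spec` field of `kubectl get pod <pod-name> -o json`) and that GPUs are requested under the resource name `nvidia.com/gpu`; both the resource name and the function name `requested_gpus` are assumptions for illustration, so adjust them to whatever your cluster actually uses.

```python
from typing import Any, Dict

# Assumption: the cluster exposes GPUs under this resource name.
GPU_RESOURCE = "nvidia.com/gpu"


def requested_gpus(pod_spec: Dict[str, Any], resource: str = GPU_RESOURCE) -> int:
    """Sum the GPU limits across all containers in a pod spec dict.

    Hypothetical helper for illustration; returns 0 when no container
    declares a limit for the given resource.
    """
    total = 0
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        limits = resources.get("limits") or {}
        # Limits may be serialized as strings (e.g. "1"), so coerce to int.
        total += int(limits.get(resource, 0))
    return total


if __name__ == "__main__":
    # Illustrative spec only; the container name and image are placeholders.
    example_spec = {
        "containers": [
            {
                "name": "notebook",
                "image": "example/notebook-gpu",
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
            }
        ]
    }
    print("GPUs requested:", requested_gpus(example_spec))
```

A result of 0 for the notebook pod would suggest that no GPU was requested when the server was spawned.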

@aronchick
Contributor

aronchick commented Feb 13, 2018 via email

@bkungfoo
Author

This problem was likely caused by not following the instructions for deploying the NVIDIA driver DaemonSet on the GKE cluster. Closing the issue.
https://cloud.google.com/kubernetes-engine/docs/concepts/gpus

yanniszark pushed a commit to arrikto/kubeflow that referenced this issue Feb 15, 2021
elenzio9 pushed a commit to arrikto/kubeflow that referenced this issue Oct 31, 2022