JupyterHub singleuser not able to use tensorflow gpu support using systemdspawner #2927

ccauet · 2020-02-12T11:18:33Z

(This is cross-posted to SO, the jupyterhub issue tracker, and the jupyterhub/systemdspawner issue tracker.)

I have a private JupyterHub setup using SystemdSpawner, where I am trying to run tensorflow with gpu support.

I followed the tensorflow instructions and, as an alternative, tried an already provisioned AWS AMI (Deep Learning Base AMI (Ubuntu 18.04) Version 21.0) with NVIDIA drivers, both on AWS EC2 g4 instances.

On both setups I'm able to use tensorflow with gpu support in an (i)python 3.6 shell:

>>> import tensorflow as tf
>>> tf.config.list_physical_devices('GPU')
2020-02-12 10:57:13.670937: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-12 10:57:13.698230: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.699066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.73GiB deviceMemoryBandwidth: 298.08GiB/s
2020-02-12 10:57:13.699286: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-12 10:57:13.700918: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-12 10:57:13.702512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-12 10:57:13.702814: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-12 10:57:13.704561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-12 10:57:13.705586: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-12 10:57:13.709171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-12 10:57:13.709278: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-12 10:57:13.710893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

(some warnings about NUMA node, but the gpu is found)

Also using nvidia-smi and deviceQuery shows the gpu:

$ nvidia-smi
Wed Feb 12 10:39:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          10.1 / 10.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15080 MBytes (15812263936 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 30
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.0, NumDevs = 1, Device0 = Tesla T4
Result = PASS

Now I start JupyterHub, log in, and open a terminal. There I get:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

and

$ /usr/local/cuda/extras/demo_suite/deviceQuery
cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

and also

I suspect some kind of "sandbox", missing ENV vars, etc., because of which the gpu drivers are not found inside the singleuser environment, and as a result the tensorflow gpu support does not work.
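One way to test the missing-ENV-vars part of that suspicion is to print the same variables in the working shell and in the JupyterHub terminal and compare them. The following is only a minimal sketch: the list of variable names and the helper names `snapshot_env` and `diff_env` are assumptions made for illustration, not taken from this setup.

```python
import os

# Variables that commonly influence CUDA / tensorflow library discovery.
# This list is an assumption for illustration; adjust it to your setup.
CUDA_RELATED_VARS = ("PATH", "LD_LIBRARY_PATH", "CUDA_HOME", "CUDA_VISIBLE_DEVICES")


def snapshot_env(names=CUDA_RELATED_VARS, environ=None):
    """Return {name: value or None} for the given variable names."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


def diff_env(a, b):
    """Return {name: (a_value, b_value)} for every variable that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    # Run this once in the plain shell and once in the JupyterHub terminal,
    # then compare the two outputs by eye (or feed them to diff_env).
    for name, value in snapshot_env().items():
        print(f"{name}={value!r}")
```

If both environments print the same values, missing variables are probably not the cause and the "sandbox" explanation becomes more likely.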

Any ideas on this? It is probably either a small configuration tweak or, because of the architecture, not solvable at all ;)

Thanks a lot!

EDIT: link to SO post


ccauet · 2020-02-15T11:46:39Z

Problem solved. I was setting

c.SystemdSpawner.isolate_devices = True

This seemed clever to me in the past, but it got in my way now that I am trying to use GPUs.
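As far as I understand, `isolate_devices` corresponds to systemd's `PrivateDevices` setting, which gives the singleuser unit a private `/dev` without the NVIDIA device nodes. That would explain why nvidia-smi could not reach the driver. A quick check from the JupyterHub terminal could look like the sketch below. The helper name `nvidia_device_nodes` is made up for illustration, and the `/dev/nvidia*` naming is the usual Linux driver layout rather than something confirmed for every setup.

```python
import glob
import os


def nvidia_device_nodes(dev_dir="/dev"):
    """Return the sorted paths under dev_dir whose names start with 'nvidia'."""
    return sorted(glob.glob(os.path.join(dev_dir, "nvidia*")))


if __name__ == "__main__":
    nodes = nvidia_device_nodes()
    if nodes:
        print("NVIDIA device nodes visible:", ", ".join(nodes))
    else:
        # Nothing matched: this process cannot see the GPU device files,
        # which is consistent with a private /dev for the unit.
        print("No /dev/nvidia* nodes visible from this process.")
```

Running it in the plain shell and again in the singleuser terminal shows whether the device nodes are hidden from the spawned server.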

ccauet mentioned this issue Feb 13, 2020

JupyterHub singleuser not able to use tensorflow gpu support using systemdspawner jupyterhub/systemdspawner#67

Closed

ccauet closed this as completed Feb 15, 2020
