How does sysbox k8s in docker schedule tensorflow/tensorflow:2.9.1-gpu? #643
Closed
Labels
duplicate: This issue or pull request already exists
Description
- NVIDIA driver version: NVIDIA-Linux-x86_64-525.85.12.run; OS: Ubuntu 20.04
- docker run --detach --interactive --runtime=sysbox-runc --name k8s-worker01 --hostname=k8s-worker01 \
  --mount type=tmpfs,destination=/proc/driver/nvidia \
  --mount type=bind,source=/usr/bin/nvidia-smi,target=/usr/bin/nvidia-smi \
  --mount type=bind,source=/usr/bin/nvidia-debugdump,target=/usr/bin/nvidia-debugdump \
  --mount type=bind,source=/usr/bin/nvidia-persistenced,target=/usr/bin/nvidia-persistenced \
  --mount type=bind,source=/usr/bin/nvidia-cuda-mps-control,target=/usr/bin/nvidia-cuda-mps-control \
  --mount type=bind,source=/usr/bin/nvidia-cuda-mps-server,target=/usr/bin/nvidia-cuda-mps-server \
  -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu \
  --mount type=bind,source=/run/nvidia-persistenced/socket,target=/run/nvidia-persistenced/socket \
  --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \
  --device /dev/nvidia0:/dev/nvidia0 \
  nestybox/k8s-node:v1.20.2
- Inside the Sysbox container, run tensorflow/tensorflow:2.9.1-gpu as follows:
docker run --gpus all --mount type=bind,source=/usr/bin/nvidia-smi,target=/usr/bin/nvidia-smi -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --name test10 tensorflow/tensorflow:2.9.1-gpu python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
error message:
E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: aaf4ecde1157
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: aaf4ecde1157
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 525.85.12
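To narrow down whether the failure comes from TensorFlow or from the driver library itself, a small probe like the one below could be run inside the inner container. It is only a sketch: `probe_libcuda`, `ProbeResult`, and the candidate library names are illustrative choices, not part of Sysbox or TensorFlow. It tries to load the CUDA driver library with `ctypes` and calls `cuInit(0)` directly, so the raw return code can be compared with the one in the log above.

```python
import ctypes
from typing import Callable, NamedTuple, Optional, Sequence


class ProbeResult(NamedTuple):
    library: Optional[str]       # name that loaded, or None if nothing loaded
    cu_init_code: Optional[int]  # cuInit(0) return value, or None if not called


def probe_libcuda(
    candidates: Sequence[str] = ("libcuda.so.1", "libcuda.so"),  # assumed names
    loader: Callable[[str], object] = ctypes.CDLL,
) -> ProbeResult:
    """Try each candidate library name and call cuInit(0) on the first that loads."""
    for name in candidates:
        try:
            lib = loader(name)
        except OSError:
            # Library not found or not loadable under this name; try the next one.
            continue
        try:
            code = int(lib.cuInit(0))
        except AttributeError:
            # The library loaded but does not export cuInit.
            return ProbeResult(name, None)
        return ProbeResult(name, code)
    return ProbeResult(None, None)
```

Calling `print(probe_libcuda())` in the container reports which library name loaded, if any, and the code returned by `cuInit`. A result with `library=None` would suggest the dynamic loader cannot see the driver library at all, while a nonzero `cu_init_code` would suggest the library loads but initialization fails. The `loader` parameter exists only so the function can be exercised without a GPU.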