
How does sysbox k8s in docker schedule tensorflow/tensorflow:2.9.1-gpu? #643

@zhongcloudtian

Description

  1. NVIDIA driver version: NVIDIA-Linux-x86_64-525.85.12.run; OS: Ubuntu 20.04
  2. docker run --detach --interactive --runtime=sysbox-runc --name k8s-worker01 --hostname=k8s-worker01 \
    --mount type=tmpfs,destination=/proc/driver/nvidia \
    --mount type=bind,source=/usr/bin/nvidia-smi,target=/usr/bin/nvidia-smi \
    --mount type=bind,source=/usr/bin/nvidia-debugdump,target=/usr/bin/nvidia-debugdump \
    --mount type=bind,source=/usr/bin/nvidia-persistenced,target=/usr/bin/nvidia-persistenced \
    --mount type=bind,source=/usr/bin/nvidia-cuda-mps-control,target=/usr/bin/nvidia-cuda-mps-control \
    --mount type=bind,source=/usr/bin/nvidia-cuda-mps-server,target=/usr/bin/nvidia-cuda-mps-server \
    -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu \
    --mount type=bind,source=/run/nvidia-persistenced/socket,target=/run/nvidia-persistenced/socket \
    --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools \
    --device /dev/nvidia0:/dev/nvidia0 \
    nestybox/k8s-node:v1.20.2
  3. Inside the Sysbox container, I run tensorflow/tensorflow:2.9.1-gpu as follows:
    docker run --gpus all --mount type=bind,source=/usr/bin/nvidia-smi,target=/usr/bin/nvidia-smi -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --name test10 tensorflow/tensorflow:2.9.1-gpu python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
    error message:
    E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: aaf4ecde1157
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: aaf4ecde1157
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: NOT_FOUND: was unable to find libcuda.so DSO loaded into this program
    I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 525.85.12
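
The log above shows two separate symptoms: `cuInit` fails, and no `libcuda.so` is found loaded in the process, even though the kernel module reports 525.85.12. To check this independently of TensorFlow, a minimal sketch like the following could be run inside the inner container. It assumes the driver library is exposed under the usual soname `libcuda.so.1`; the function name `probe_cuda` is hypothetical and only for illustration.

```python
import ctypes


def probe_cuda(libname="libcuda.so.1"):
    """Try to load the CUDA driver library and call cuInit(0).

    Returns (loaded, code). loaded is False when the library cannot be
    opened by the dynamic loader; otherwise code is the integer CUresult
    returned by cuInit (0 means success).
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        # Library is missing, or not resolvable on the loader search path.
        return (False, None)
    # CUresult cuInit(unsigned int Flags)
    lib.cuInit.argtypes = [ctypes.c_uint]
    lib.cuInit.restype = ctypes.c_int
    return (True, lib.cuInit(0))


if __name__ == "__main__":
    loaded, code = probe_cuda()
    if not loaded:
        print("libcuda.so.1 could not be loaded")
    else:
        print("cuInit returned", code)
```

If the library cannot be loaded at all, the problem is likely in how the host's `/usr/lib/x86_64-linux-gnu` is exposed to the inner container. If it loads but `cuInit` returns a non-zero code, the device nodes or driver state are the more likely place to look.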

Labels: duplicate (This issue or pull request already exists)
