docker run & docker-compose run fails sharing nvidia-gpu capabilities #47424

Open
typoworx-de opened this issue Feb 21, 2024 · 2 comments
Labels: kind/bug, status/0-triage


typoworx-de commented Feb 21, 2024

Description

I've been trying to share the NVIDIA GPU (for CUDA/compute) with a Docker container as described in:

I'm using Ubuntu 22.04 and installed a recent Docker along with nvidia-docker2 and the nvidia-container-toolkit:

Reproduce

  1. sudo apt-get install nvidia-docker2 nvidia-container-toolkit
  2. I've added the nvidia runtime to the host's Docker daemon.json and restarted the Docker service:
  {
    "runtimes": {
      "sysbox-runc": {
        "path": "/usr/bin/sysbox-runc"
      },
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
  }
  3. docker run --rm -ti --gpus all --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04
     Failed to initialize NVML: Unknown Error

  4. Inside the container, ls -lah /dev/nvidia* shows the NVIDIA device nodes.

  5. Trying with docker-compose.yml results in the same problem:

version: '3.8'

services:
  nvidia-smi:
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
                - compute
            - driver: nvidia
              #device_ids: ['0']
              capabilities: [gpu]
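
For comparison, the GPU reservation form shown in the Docker/Compose docs uses a single devices entry that names the driver. Below is a minimal sketch of that form, reusing the image and command from this report; the count field is an assumption (device_ids could be used instead):

services:
  nvidia-smi:
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all              # assumption: reserve all GPUs
              capabilities: [gpu]

Note that the devices list in the compose file above contains two entries, and the first one has no driver key; collapsing them into a single entry as sketched here matches the documented layout.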

Expected behavior

NVIDIA GPU/compute sharing should work as documented in the Docker docs.

docker version

Client: Docker Engine - Community
 Version:           25.0.2
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Thu Feb  1 00:23:03 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.2
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       fce6e0c
  Built:            Thu Feb  1 00:23:03 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    25.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.20.3
    Path:     /usr/local/lib/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 17
  Running: 14
  Paused: 0
  Stopped: 3
 Images: 106
 Server Version: 25.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: local
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc sysbox-runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-10022-tuxedo
 Operating System: Ubuntu 23.04
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 62.53GiB
 Name: Gabriel-Tuxedo
 ID: 68d1049d-4416-42f0-a884-8fcbc24145ce
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: typoworx
 Experimental: false
 Insecure Registries:
  registry-api.php.docker
  registry.php.docker
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 192.168.165.0/24, Size: 24
   Base: 172.30.0.0/16, Size: 24

Additional Info

nvidia-container-cli --load-kmods info
NVRM version: 535.146.02
CUDA version: 12.2

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 4060 Laptop GPU
Brand: GeForce
GPU UUID: GPU-46eb0b05-a309-1169-2f8d-e076379b85a3
Bus Location: 00000000:01:00.0
Architecture: 8.9

typoworx-de added the kind/bug and status/0-triage labels on Feb 21, 2024

typoworx-de (Author) commented:

Actually, I followed this guide, which points to a package repository with updates that I installed for nvidia-docker2:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt

After upgrading the packages I ran:
sudo nvidia-ctk runtime configure --runtime=docker

But nvidia-smi still refuses to work:
Failed to initialize NVML: Unknown Error
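
As a sanity check, the runtime registration can be confirmed and the run retried with the runtime selected explicitly; a rough sketch, assuming the daemon.json shown above:

# List the runtimes the daemon knows about (should include "nvidia"):
docker info --format '{{json .Runtimes}}'

# Retry with the runtime chosen explicitly instead of relying on --gpus:
docker run --rm -ti --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04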

typoworx-de (Author) commented:

It turned out to work after running the container privileged. This is documented nowhere, and I think it should be avoided if possible: only the NVIDIA GPU should be exposed, not full privileges.
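
For reference, the workaround described here presumably amounts to something like the following sketch (not a recommendation, since --privileged hands the container full access to the host's devices):

docker run --rm -ti --gpus all --privileged \
  --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04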
