docker run & docker-compose run fails sharing nvidia-gpu capabilities #47424

Open
typoworx-de opened this issue Feb 21, 2024 · 2 comments
Labels: kind/bug, status/0-triage


typoworx-de commented Feb 21, 2024

Description

I've been trying to share the NVIDIA GPU (for CUDA/compute) with a Docker container as described in:

I'm using Ubuntu 22.04 and installed a recent Docker along with nvidia-docker2 and the nvidia-container-toolkit:

Reproduce

  1. sudo apt-get install nvidia-docker2 nvidia-container-toolkit
  2. I've added the nvidia runtime to the host's Docker daemon.json and restarted the Docker service:
  {
    "runtimes": {
      "sysbox-runc": {
        "path": "/usr/bin/sysbox-runc"
      },
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
  }
  3. docker run --rm -ti --gpus all --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04
     Failed to initialize NVML: Unknown Error

  4. Inside the container, ls -lah /dev/nvidia* shows the NVIDIA device nodes.

  5. Trying with docker-compose.yml results in the same problem:

version: '3.8'

services:
  nvidia-smi:
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
                - compute
            - driver: nvidia
              #device_ids: ['0']
              capabilities: [gpu]
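
For comparison, the GPU reservation form shown in the Docker/Compose docs uses a single devices entry that names the driver. Below is a minimal sketch of that form, reusing the image and command from this report; the count field is an assumption (device_ids could be used instead):

services:
  nvidia-smi:
    image: nvidia/cuda:12.3.1-runtime-ubuntu22.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all              # assumption: reserve all GPUs
              capabilities: [gpu]

Note that the devices list in the compose file above contains two entries, and the first one has no driver key; collapsing them into a single entry as sketched here matches the documented layout.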

Expected behavior

NVIDIA GPU/compute sharing should work as documented in the Docker docs.

docker version

Client: Docker Engine - Community
 Version:           25.0.2
 API version:       1.44
 Go version:        go1.21.6
 Git commit:        29cf629
 Built:             Thu Feb  1 00:23:03 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.2
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.6
  Git commit:       fce6e0c
  Built:            Thu Feb  1 00:23:03 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client: Docker Engine - Community
 Version:    25.0.2
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/local/lib/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.20.3
    Path:     /usr/local/lib/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 17
  Running: 14
  Paused: 0
  Stopped: 3
 Images: 106
 Server Version: 25.0.2
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: local
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc sysbox-runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.5.0-10022-tuxedo
 Operating System: Ubuntu 23.04
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 62.53GiB
 Name: Gabriel-Tuxedo
 ID: 68d1049d-4416-42f0-a884-8fcbc24145ce
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: typoworx
 Experimental: false
 Insecure Registries:
  registry-api.php.docker
  registry.php.docker
  127.0.0.0/8
 Live Restore Enabled: false
 Default Address Pools:
   Base: 192.168.165.0/24, Size: 24
   Base: 172.30.0.0/16, Size: 24

Additional Info

nvidia-container-cli --load-kmods info
NVRM version: 535.146.02
CUDA version: 12.2

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 4060 Laptop GPU
Brand: GeForce
GPU UUID: GPU-46eb0b05-a309-1169-2f8d-e076379b85a3
Bus Location: 00000000:01:00.0
Architecture: 8.9

typoworx-de added the kind/bug and status/0-triage labels on Feb 21, 2024

typoworx-de (Author) commented:

Actually, I followed this guide, which points to a package repository with updates that I installed for nvidia-docker2:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt

After upgrading the packages I ran:
sudo nvidia-ctk runtime configure --runtime=docker

But nvidia-smi still refuses to work:
Failed to initialize NVML: Unknown Error
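
As a sanity check, the runtime registration can be confirmed and the run retried with the runtime selected explicitly; a rough sketch, assuming the daemon.json shown above:

# List the runtimes the daemon knows about (should include "nvidia"):
docker info --format '{{json .Runtimes}}'

# Retry with the runtime chosen explicitly instead of relying on --gpus:
docker run --rm -ti --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04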

typoworx-de (Author) commented:

It turned out to work after running the container privileged. This is documented nowhere, and I think it should be avoided if possible: only the NVIDIA GPU should be exposed, not full privileges.
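
For reference, the workaround described here presumably amounts to something like the following sketch (not a recommendation, since --privileged hands the container full access to the host's devices):

docker run --rm -ti --gpus all --privileged \
  --entrypoint nvidia-smi nvidia/cuda:12.3.1-runtime-ubuntu22.04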
