
nvidia-container-toolkit broken and cgroups v2 issues #289

Open
RafalSkolasinski opened this issue Nov 11, 2021 · 2 comments

Comments


How did you upgrade to 21.10? (Fresh install / Upgrade)

Upgrade from 21.04 (it was actually quite accidental, in the sense that I was not aware 21.10 was still in beta :))

Related Application and/or Package Version (run apt policy $PACKAGE NAME):

nvidia-container-toolkit:
  Installed: 1.5.1-1
  Candidate: 1.5.1-1
  Version table:
 *** 1.5.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
        100 /var/lib/dpkg/status
     1.5.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.4.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.3.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.2.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.1-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.1.0-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.5-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.4-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.3-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages
     1.0.2-1 500
        500 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages

Issue/Bug Description:

The package nvidia-container-toolkit was missing after the upgrade. Previously it was provided by the System76 PPA:

nvidia-container-toolkit:
  Installed: 1.5.1-1pop1~1627998766~21.04~9847cf2
  Candidate: 1.5.1-1pop1~1627998766~21.04~9847cf2
  Version table:
 *** 1.5.1-1pop1~1627998766~21.04~9847cf2 1001
       1001 http://ppa.launchpad.net/system76/pop/ubuntu hirsute/main amd64 Packages
        100 /var/lib/dpkg/status

I had to try to get it from NVIDIA's older release repositories with

distribution=ubuntu20.04
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

but then I was getting

$ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.
ERRO[0000] error waiting for container: context canceled 

This seems to be an issue with cgroups v2 (googling the error leads to quite a few issues already reported elsewhere; I will try to compile a list later), and the workaround (not a solution) seemed to be:

sudo kernelstub -a "systemd.unified_cgroup_hierarchy=0"
sudo update-initramfs -c -k all
sudo reboot
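For anyone hitting this, a quick way to tell which cgroup hierarchy is active (and to confirm the kernelstub workaround took effect after the reboot) is to inspect the filesystem type mounted at /sys/fs/cgroup. This is a generic Linux check, not specific to Pop!_OS:

```shell
# Print the filesystem type mounted at /sys/fs/cgroup:
#   "cgroup2fs" -> unified hierarchy (cgroups v2), which triggers the error above
#   "tmpfs"     -> legacy/hybrid hierarchy (cgroups v1), which the old toolkit expects
stat -fc %T /sys/fs/cgroup/
```

If this still prints cgroup2fs after rebooting, the systemd.unified_cgroup_hierarchy=0 parameter did not make it onto the kernel command line (check /proc/cmdline).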

Steps to reproduce (if you know):

  1. Install Pop!_OS 21.10
  2. Install nvidia-container-toolkit (and the other NVIDIA packages)
  3. Try to run a docker run --gpus all ... command

Expected behavior:

It works fine, with output along the lines of:

docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Thu Nov 11 10:21:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P8     7W / 185W |   1486MiB /  7979MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Other Notes:

Happy to provide additional information. I had planned to reinstall my machine back to 21.04, but decided to postpone by a day or two in case you'd like more information about the problem or have some advice.

@RafalSkolasinski (Author) commented:

May be related: NVIDIA/nvidia-docker#1447


elezar commented Oct 11, 2022

Note that only versions after v1.8.0 of the NVIDIA Container Toolkit (including libnvidia-container1) support cgroupv2. Please install a more recent version and see if this addresses your issue.
