torch gpu problem #139

arnirs · 2024-02-23T10:04:58Z

Issue Description

Describe the issue
Hi, and thanks for your hard work. Unfortunately, PyTorch GPU support is not working in the v1.6_cuda-11.8_ubuntu-22.04 image: it reports that no CUDA 12.1 was found. PyTorch now supports CUDA 12, and the Dockerfile used to build the v1.6_cuda-11.8_ubuntu-22.04 image does not explicitly state the CUDA version when installing PyTorch, so the CUDA 12 compatible build of PyTorch gets installed. To install PyTorch with CUDA 11.8, we should use the following command:
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
or, for the latest PyTorch:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

To Reproduce
Steps to reproduce the behavior:
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.6_cuda-11.8_ubuntu-22.04

Log into JupyterLab.

Run the following in a notebook:
import torch
print(torch.version.cuda)

Expected Behavior
Torch must use CUDA 11.8.
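
A slightly fuller check than the one-line print can make the mismatch explicit. The following is a minimal sketch, not part of the original report: the helper name `cuda_matches` and its default `expected="11.8"` are illustrative assumptions, and the torch import is guarded so the snippet also runs where PyTorch is absent.

```python
import importlib.util


def cuda_matches(reported, expected="11.8"):
    """Compare major.minor of the CUDA version torch was built against.

    `reported` is the value of torch.version.cuda, which is None for
    CPU-only builds. The name and default are illustrative assumptions.
    """
    if reported is None:
        return False
    return reported.split(".")[:2] == expected.split(".")[:2]


if importlib.util.find_spec("torch") is not None:
    import torch

    print("torch version:     ", torch.__version__)
    print("built against CUDA:", torch.version.cuda)
    print("GPU usable:        ", torch.cuda.is_available())
    print("matches image tag: ", cuda_matches(torch.version.cuda))
else:
    print("torch is not installed in this environment")
```

In the situation described here, the "built against CUDA" line would presumably show a 12.x version and the last line would be False.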

Environment

Operating System:
Ubuntu 22.04.3

NVIDIA GPU and CUDA version Details:
CUDA 11.8
NVIDIA driver 520.xxx

GPU-Jupyter Version:
v1.6_cuda-11.8_ubuntu-22.04

Thanks in advance.

Layoric · 2024-02-24T22:59:55Z

Adding the environment variable TORCH_CUDA_ARCH_LIST has gotten it to load correctly for me, e.g.:

version: "3.8"
services:
  gpu-jupyter:
    container_name: gpu-jupyter
    build: .build
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
              - gpu
    # # Set hardware limits: one GPU, max. 48GB RAM, max. 31 CPUs
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]
    #           device_ids: ["0"]  # select one GPU
    #     limits:
    #       cpus: "31.0"
    #       memory: 48g
    ports:
      - 10000:8888
    volumes:
      - ./data:/home/jovyan/work
    environment:
      GRANT_SUDO: "yes"
      JUPYTER_ENABLE_LAB: "yes"
      NB_UID: ${JUPYTER_UID:-1000}
      NB_GID: ${JUPYTER_GID:-1000}
      JUPYTER_TOKEN: ${JUPYTER_TOKEN}
      TORCH_CUDA_ARCH_LIST: "8.6"
    # enable sudo permissions
    user:
      "root"
    restart: always

Hope that helps.
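
The value 8.6 in the compose file is a CUDA compute capability (8.6 is what Ampere consumer cards such as the RTX 30 series report), so other GPUs would presumably need a different value. A hedged sketch for looking it up inside the container: `arch_list_value` is a hypothetical helper, while `torch.cuda.get_device_capability` is the PyTorch call that returns the (major, minor) pair.

```python
import importlib.util


def arch_list_value(major, minor):
    """Format a compute capability pair as major.minor, e.g. (8, 6) -> "8.6"."""
    return f"{major}.{minor}"


if importlib.util.find_spec("torch") is not None:
    import torch

    if torch.cuda.is_available():
        # Device index 0 is an assumption; pick the GPU you actually use.
        major, minor = torch.cuda.get_device_capability(0)
        print("TORCH_CUDA_ARCH_LIST:", arch_list_value(major, minor))
    else:
        print("no usable CUDA device visible to torch")
else:
    print("torch is not installed in this environment")
```

This lookup only works once torch can see the GPU; if it cannot, the compute capability has to come from NVIDIA's published list for the card.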

ChristophSchranz · 2024-02-29T15:00:40Z

Hi,
The problem originates from the torch installation routine that was suggested at the time: it updates the CUDA version and corrupts the installation. PyTorch now suggests installing with a fixed CUDA version (again), and I changed this in commit 2ac3181.

I'm surprised that it occurs for this image now, as it worked in the tests. Is the error fixed if you build the image from the repository? If so, I'll update the image tag.

@Layoric Thanks for the quick fix! Since the root cause is a corrupted CUDA installation, I'll fix it at the source.

ChristophSchranz · 2024-03-21T10:06:43Z

Commit 9982802 should provide a clear solution for this problem in version v1.6_cuda-11.8_ubuntu-22.04. For the pip install of PyTorch, the index-url is now pinned to CUDA 11.8.
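
Pinning the index URL means pip pulls wheels built for one specific CUDA release instead of PyTorch's default build. As a rough sketch (the package list and the helper `pinned_pip_command` are assumptions, since the thread does not quote the contents of commit 9982802), the CUDA version maps to a wheel index suffix like this:

```python
import sys


def pinned_pip_command(cuda_version="11.8", packages=("torch", "torchvision", "torchaudio")):
    """Build a pip command that installs PyTorch wheels for one CUDA release.

    "11.8" becomes the index suffix "cu118" used by download.pytorch.org.
    The helper name and default package list are illustrative assumptions.
    """
    suffix = "cu" + cuda_version.replace(".", "")
    index_url = f"https://download.pytorch.org/whl/{suffix}"
    return [sys.executable, "-m", "pip", "install", *packages, "--index-url", index_url]


# Print the command rather than running it, so nothing is installed by accident.
print(" ".join(pinned_pip_command()))
```

Passing the returned list to `subprocess.run(..., check=True)` would perform the install.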

Please re-open if the issue still occurs.

ChristophSchranz closed this as completed Mar 21, 2024
