torch gpu problem #139

Closed
arnirs opened this issue Feb 23, 2024 · 3 comments

Comments

@arnirs

arnirs commented Feb 23, 2024

Issue Description

Describe the issue
Hi, and thanks for your hard work. Unfortunately, PyTorch GPU support is not working in the v1.6_cuda-11.8_ubuntu-22.04 image: it reports that CUDA 12.1 cannot be found. PyTorch now supports CUDA 12, and the Dockerfile used to build the v1.6_cuda-11.8_ubuntu-22.04 image does not explicitly specify the CUDA version when installing PyTorch, so the CUDA 12 build of PyTorch gets installed. To install PyTorch with CUDA 11.8, we should use the following command:
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
or, for the latest PyTorch:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

To Reproduce
Steps to reproduce the behavior:
docker run --gpus all -d -it -p 8848:8888 -v $(pwd)/data:/home/jovyan/work -e GRANT_SUDO=yes -e JUPYTER_ENABLE_LAB=yes --user root cschranz/gpu-jupyter:v1.6_cuda-11.8_ubuntu-22.04

Log in to JupyterLab.

Run the following in a notebook:
import torch
print(torch.version.cuda)

Expected Behavior
PyTorch should use CUDA 11.8, i.e. torch.version.cuda should report 11.8.
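
A slightly fuller check than the one-liner in the reproduction steps may help narrow this down. The sketch below is only an illustration: the helper names `cuda_build_matches` and `check_torch_cuda` are hypothetical, and it assumes that `torch.version.cuda` is `None` on CPU-only builds and otherwise a string such as `"11.8"`.

```python
import importlib.util


def cuda_build_matches(reported, expected="11.8"):
    """Return True if the reported CUDA version has the expected major.minor."""
    # torch.version.cuda is None for CPU-only builds of PyTorch
    if reported is None:
        return False
    return reported.split(".")[:2] == expected.split(".")[:2]


def check_torch_cuda(expected="11.8"):
    """Print what PyTorch was built with and whether it can see a GPU."""
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
        return None
    import torch

    reported = torch.version.cuda
    print("PyTorch version:", torch.__version__)
    print("Built with CUDA:", reported)
    print("GPU available:  ", torch.cuda.is_available())
    return cuda_build_matches(reported, expected)
```

Calling `check_torch_cuda()` in a notebook cell inside the container should return `False` whenever the installed PyTorch build targets a CUDA version other than 11.8.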

Environment

Operating System:
Ubuntu 22.04.3

NVIDIA GPU and CUDA version Details:
CUDA 11.8
NVIDIA driver 520.xxx

GPU-Jupyter Version:
v1.6_cuda-11.8_ubuntu-22.04

Thanks in advance.

@Layoric

Layoric commented Feb 24, 2024

I have been able to get it to load correctly by adding the environment variable TORCH_CUDA_ARCH_LIST, e.g.:

version: "3.8"
services:
  gpu-jupyter:
    container_name: gpu-jupyter
    build: .build
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
              - gpu
    # # Set hardware limits: one GPU, max. 48GB RAM, max. 31 CPUs
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           capabilities: [gpu]
    #           device_ids: ["0"]  # select one GPU
    #     limits:
    #       cpus: "31.0"
    #       memory: 48g
    ports:
      - 10000:8888
    volumes:
      - ./data:/home/jovyan/work
    environment:
      GRANT_SUDO: "yes"
      JUPYTER_ENABLE_LAB: "yes"
      NB_UID: ${JUPYTER_UID:-1000}
      NB_GID: ${JUPYTER_GID:-1000}
      JUPYTER_TOKEN: ${JUPYTER_TOKEN}
      TORCH_CUDA_ARCH_LIST: "8.6"  # compute capability of the GPU; adjust to your hardware
    # enable sudo permissions
    user:
      "root"
    restart: always

Hope that helps.
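
The value 8.6 above is the compute capability of one particular GPU generation, so other cards will likely need a different entry. As a rough sketch, the value can be read from PyTorch on a machine where it can see the GPU. The helper names `arch_list_entry` and `detect_arch_list_entry` are made up for this example.

```python
import importlib.util


def arch_list_entry(capability):
    """Format a (major, minor) compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    major, minor = capability
    return f"{major}.{minor}"


def detect_arch_list_entry(device=0):
    """Return the entry for the given GPU, or None if torch or a GPU is unavailable."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    if not torch.cuda.is_available():
        return None
    # torch.cuda.get_device_capability returns a (major, minor) tuple
    return arch_list_entry(torch.cuda.get_device_capability(device))
```

If `detect_arch_list_entry()` returns `None`, PyTorch cannot see a GPU in that environment and the value has to be looked up for the card model instead.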

@ChristophSchranz
Collaborator

Hi,
The problem originates from the PyTorch installation routine that was recommended at the time: it updates the CUDA version and corrupts the installation. PyTorch now recommends installing with a pinned CUDA version again, and I changed this in commit 2ac3181.

I'm surprised that it occurs for this image now, as it worked in the tests. Is the error fixed if you build the image from the repository? If so, I'll update the image tag.

@Layoric Thanks for the quick fix! As the root cause is a corrupted CUDA installation, I'll fix it at the source.

@ChristophSchranz
Collaborator

Commit 9982802 should provide a clean solution for this problem in version v1.6_cuda-11.8_ubuntu-22.04. For the pip install of PyTorch, the index-url is now pinned to CUDA 11.8.

Please re-open if the issue still occurs.
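
For anyone applying the same pinning by hand, here is a minimal sketch of what a pip install with a pinned index-url could look like. It is not the exact line from the commit: the helper `pinned_torch_install_cmd` is hypothetical, and the package list is an assumption. The URL follows the pattern of PyTorch's wheel index, where CUDA 11.8 builds live under `cu118`.

```python
import sys

# Base URL of the PyTorch wheel index; per-CUDA builds live under /cuXYZ
PYTORCH_WHEEL_INDEX = "https://download.pytorch.org/whl"


def pinned_torch_install_cmd(cuda_version="11.8",
                             packages=("torch", "torchvision", "torchaudio")):
    """Build a pip command that only considers wheels built for one CUDA version."""
    tag = "cu" + cuda_version.replace(".", "")  # "11.8" -> "cu118"
    return [
        sys.executable, "-m", "pip", "install",
        *packages,
        "--index-url", f"{PYTORCH_WHEEL_INDEX}/{tag}",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Because pip is restricted to the CUDA 11.8 index, it should not silently pick up a CUDA 12 build.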
