
Cuda becomes unavailable and script is executed multiple times #2622

Closed
2 of 4 tasks
MagicianWu opened this issue Apr 4, 2024 · 11 comments
Comments

@MagicianWu

System Info

- `Accelerate` version: 0.28.0
- Platform: Linux-5.4.0-173-generic-x86_64-with-glibc2.31
- Python version: 3.9.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 2015.53 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: True
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: [0,1,2,3,4,5,6,7]
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os

import torch
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    print(torch.cuda.is_available())

if __name__ == "__main__":
    main()

Executed with command:
accelerate launch accelerate_test.py
(screenshot of the output)

When executed with command:
python accelerate_test.py
(screenshot of the output)

Expected behavior

CUDA should be available when using Accelerate.
And based on my understanding, the print call should not be executed multiple times?

@muellerzr
Collaborator

Can you check python -c "import torch; print(torch.cuda.is_available())" from the CLI?

This means something is up with your torch build and/or cuda drivers.

And yes, print will be run N times because you're not using accelerator.print() :) (N == num GPUs)
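To illustrate the point above: accelerate launch spawns one process per GPU, and each process runs the whole script, including every bare print. A minimal stdlib sketch of what accelerator.print() does (rank_zero_print is a hypothetical helper; LOCAL_RANK is the per-process environment variable the launcher sets):

```python
import os

def rank_zero_print(*args, **kwargs):
    # `accelerate launch` starts one process per GPU and sets LOCAL_RANK
    # in each; printing only on rank 0 mimics accelerator.print(), so the
    # message appears once instead of N times.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        print(*args, **kwargs)

rank_zero_print("visible once, not once per GPU")
```

In a real script you would simply call accelerator.print() instead, which also works when the script is run without the launcher.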

@muellerzr
Collaborator

You also have multiple envs active, which can lead to weird issues like this (been there / seen it before). Do a full conda deactivate, then conda activate meshgpt. That might also solve the issue (something could be pointing to the wrong python or bash!)

@MagicianWu
Author

@muellerzr Thanks for your quick response!
(screenshots of the requested checks)

@muellerzr
Collaborator

My best guess is you installed accelerate in another env, and it's messed up your bash scripts, so accelerate launch is pointing to the wrong accelerate. I recommend a full uninstall, as your system is borked from the accelerate installs.

How to check:

which accelerate

It should point to something equivalent to:

/.../mycondalocation/envs/meshgpt/bin/accelerate
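The same check can be scripted from inside Python. A minimal sketch, assuming only the stdlib (the stray_launcher helper and its heuristic are illustrative, not part of Accelerate): it flags an accelerate entry point that resolves to a different bin directory than the running interpreter, which is the symptom described above (e.g. a stray ~/.local install shadowing the conda env's copy).

```python
import os
import shutil
import sys

def stray_launcher(name="accelerate"):
    """Return the launcher's path if it resolves outside the active
    interpreter's bin directory (a sign of a stray install), else None."""
    path = shutil.which(name)
    if path is None:
        return None  # not on PATH at all
    env_bin = os.path.dirname(sys.executable)
    if os.path.dirname(path) != env_bin:
        return path  # resolves to a different environment than `python`
    return None

if __name__ == "__main__":
    hit = stray_launcher()
    print(f"stray accelerate at: {hit}" if hit else "accelerate looks consistent (or is absent)")
```

If this prints a path under ~/.local/bin while python lives in the conda env, the launcher and the interpreter come from different installs.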

@muellerzr
Collaborator

Let me know if it doesn't

@muellerzr
Collaborator

Actually, I can see right there that it's calling it from your ~/.local bin, so you possibly installed it outside conda at some point, messing up the whole thing?
(screenshot)

@MagicianWu
Author

(screenshot)

@MagicianWu
Author

Should I reinstall accelerate or the whole environment?

@muellerzr
Collaborator

I'd uninstall accelerate from your base environment first (outside conda), which seems to be the source of the issue. Then reinstall it in the conda env using pip install accelerate --force-reinstall --no-deps. Hopefully afterwards which accelerate will point to the right place!

@MagicianWu
Author

@muellerzr Thanks for your help! The problems in this issue and in issue #2621 are resolved!

@muellerzr
Collaborator

Fantastic! Glad to hear it @MagicianWu :)
