
Cuda becomes unavailable and script is executed multiple times #2622

Closed
2 of 4 tasks
MagicianWu opened this issue Apr 4, 2024 · 11 comments
Comments

@MagicianWu

System Info

- `Accelerate` version: 0.28.0
- Platform: Linux-5.4.0-173-generic-x86_64-with-glibc2.31
- Python version: 3.9.13
- Numpy version: 1.21.5
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 2015.53 GB
- GPU type: NVIDIA A800-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: bf16
        - use_cpu: False
        - debug: True
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: [0,1,2,3,4,5,6,7]
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

import os

import torch
from accelerate import Accelerator

def main():
    accelerator = Accelerator()
    print(torch.cuda.is_available())

if __name__ == "__main__":
    main()

Executed with command:
accelerate launch accelerate_test.py
(screenshot of the output)

When executed with command:
python accelerate_test.py
(screenshot of the output)

Expected behavior

CUDA should be available when using Accelerate.
And based on my understanding, the print call should not be executed multiple times?

@muellerzr
Collaborator

Can you check python -c "import torch; print(torch.cuda.is_available())" from the CLI?

This means something is up with your torch build and/or cuda drivers.

And yes, print will be run N times because you're not using accelerator.print() :) (N == num GPUs)
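To illustrate the point above: accelerate launch spawns one process per GPU, and each process runs the whole script, including every bare print. A minimal stdlib sketch of what accelerator.print() does (rank_zero_print is a hypothetical helper; LOCAL_RANK is the per-process environment variable the launcher sets):

```python
import os

def rank_zero_print(*args, **kwargs):
    # `accelerate launch` starts one process per GPU and sets LOCAL_RANK
    # in each; printing only on rank 0 mimics accelerator.print(), so the
    # message appears once instead of N times.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        print(*args, **kwargs)

rank_zero_print("visible once, not once per GPU")
```

In a real script you would simply call accelerator.print() instead, which also works when the script is run without the launcher.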

@muellerzr
Collaborator

You also have multiple envs active, which can lead to weird issues like this (been there / seen it before). Do a full conda deactivate, then conda activate meshgpt. That might also solve the issue (something could be pointing to the wrong python or bash!)

@MagicianWu
Author

@muellerzr Thanks for your quick response!
(screenshots of the requested checks)

@muellerzr
Collaborator

My best guess is you installed accelerate in another env, and it's messed up your bash scripts, so accelerate launch is pointing to the wrong accelerate. I recommend a full uninstall, as your system is borked from the accelerate installs.

How to check:

which accelerate

It should point to something equivalent to:

/.../mycondalocation/envs/meshgpt/bin/accelerate
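The same check can be scripted from inside Python. A minimal sketch, assuming only the stdlib (the stray_launcher helper and its heuristic are illustrative, not part of Accelerate): it flags an accelerate entry point that resolves to a different bin directory than the running interpreter, which is the symptom described above (e.g. a stray ~/.local install shadowing the conda env's copy).

```python
import os
import shutil
import sys

def stray_launcher(name="accelerate"):
    """Return the launcher's path if it resolves outside the active
    interpreter's bin directory (a sign of a stray install), else None."""
    path = shutil.which(name)
    if path is None:
        return None  # not on PATH at all
    env_bin = os.path.dirname(sys.executable)
    if os.path.dirname(path) != env_bin:
        return path  # resolves to a different environment than `python`
    return None

if __name__ == "__main__":
    hit = stray_launcher()
    print(f"stray accelerate at: {hit}" if hit else "accelerate looks consistent (or is absent)")
```

If this prints a path under ~/.local/bin while python lives in the conda env, the launcher and the interpreter come from different installs.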

@muellerzr
Collaborator

Let me know if it doesn't

@muellerzr
Collaborator

Actually, I can see right there that it's calling it from your ~/.local bin, so you possibly installed it outside conda at some point, messing up the whole thing?
(screenshot)

@MagicianWu
Author

(screenshot)

@MagicianWu
Author

Should I reinstall accelerate or the whole environment?

@muellerzr
Collaborator

I'd uninstall accelerate from your base environment first (outside conda), which seems to be the source of the issue. Then reinstall it in the conda env using pip install accelerate --force-reinstall --no-deps. Hopefully afterwards which accelerate will point to the right place!

@MagicianWu
Author

@muellerzr Thanks for your help! The problems in this issue and in issue #2621 are resolved!

@muellerzr
Collaborator

Fantastic! Glad to hear it @MagicianWu :)
