For multiple GPUs: torch.cuda.empty_cache() stuck forever #30766

Open · animeshkumarpaul opened this issue May 11, 2024 · 0 comments

System Info

  • transformers version: 4.41.0.dev0
  • Platform: Linux-5.4.0-172-generic-x86_64-with-glibc2.31
  • Python version: 3.11.0
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@muellerzr and @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the BartForConditionalGeneration model. During multi-GPU training, the process hangs at this call:

    torch.cuda.empty_cache()
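
Since the report does not include a full script, here is a minimal sketch of the kind of setup described, assuming a facebook/bart-base checkpoint and a torchrun/accelerate multi-process launch; the model name, data, and launch method are assumptions, not taken from the original report:

    # Hypothetical repro sketch; model name, data, and launch method are assumptions.
    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

    # Run under multiple GPUs, e.g. `torchrun --nproc_per_node=2 repro.py`
    # or `accelerate launch repro.py`.
    device = torch.device("cuda")
    model.to(device)

    inputs = tokenizer("an example source sentence", return_tensors="pt").to(device)
    labels = tokenizer("an example target", return_tensors="pt").input_ids.to(device)

    loss = model(**inputs, labels=labels).loss
    loss.backward()

    # The call reported to hang forever in the multi-GPU run:
    torch.cuda.empty_cache()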

Expected behavior

With multiple GPUs, the training process gets stuck at this line forever; while it is stuck there is no GPU usage, but there is CPU usage. The expected behavior is that the call returns promptly instead of hanging.
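
Not part of the original report, but as a generic way to see where the process is stuck, the standard library's faulthandler module can dump every thread's stack on a signal, without attaching a debugger:

    # Diagnostic sketch (an assumption, not from the issue): register a
    # signal handler early in the training script, then send the signal
    # from another shell while the process is hung.
    import faulthandler
    import signal

    # `kill -USR1 <pid>` will now print each thread's Python-level stack
    # to stderr, confirming which line the main process is blocked on.
    faulthandler.register(signal.SIGUSR1, all_threads=True)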
