listing train_dataloader sampler throws out of memory error #30500

Closed
gioaca00 opened this issue Apr 26, 2024 · 1 comment · Fixed by #30501
System Info

  • transformers version: 4.39.0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The problem arises when I run trainer.train(resume_from_checkpoint=True) from the last checkpoint of a run that trained for more than 24 hours. The bug is in this line: https://github.com/huggingface/transformers/blob/20081c743ee2ce31d178f2182c7466c3313adcd2/src/transformers/trainer.py#L2156. When the sampler is listed, the GPU goes out of memory. Whatever that line is trying to accomplish, it should not throw an error; in my case it is an out-of-memory error. If I comment out that line and load the same checkpoint again, training resumes correctly.
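
For context, a minimal sketch of the pattern the report points at; the toy dataset and RandomSampler below are placeholders for whatever the real dataloader carries, and only the list(...) over the sampler comes from the report:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset; the real run used a much larger custom dataset.
dataset = TensorDataset(torch.arange(1_000_000).float())
train_dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=8)

# Roughly what the linked trainer.py line does when resuming mid-epoch:
# every index is drawn up front before training restarts, and on the
# reporter's setup this step is where memory runs out.
_ = list(train_dataloader.sampler)
```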

Expected behavior

When I run trainer.train(resume_from_checkpoint=True) from any previous checkpoint, training should continue without throwing an error. Line 2156 of trainer.py should be modified so that it does not list the sampler.
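
As an illustration of the kind of change being asked for, a hedged sketch that advances a random sampler without building the full index list; this is an assumption about one possible approach, not the change made in #30501:

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(1_000_000).float())
sampler = RandomSampler(dataset)

# Consume the sampler's iterator lazily instead of calling list(sampler),
# so its random state still advances but no giant index list is ever held
# in memory at once.
for _ in sampler:
    pass
```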

@muellerzr
Contributor

Oh that's a big ol' bug we can fix now with accelerate I think. Thanks!
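
For reference, a minimal sketch of the direction this comment hints at, using Accelerate's skip_first_batches helper to resume partway through an epoch without materializing the sampler; the toy dataloader and batch count are placeholders, and whether PR #30501 takes exactly this route is not confirmed here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import skip_first_batches

dataset = TensorDataset(torch.arange(1000).float().unsqueeze(1))
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# The number of already-consumed batches would come from the checkpoint's
# trainer state; 42 is a placeholder.
resumed = skip_first_batches(train_dataloader, num_batches=42)

for (batch,) in resumed:
    pass  # continue training from where the interrupted epoch left off
```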
