listing train_dataloader sampler throws out of memory error #30500

Closed
gioaca00 opened this issue Apr 26, 2024 · 1 comment · Fixed by #30501
System Info

  • transformers version: 4.39.0
  • Platform: Linux-3.10.0-1160.76.1.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.10.13
  • Huggingface_hub version: 0.21.4
  • Safetensors version: 0.4.2
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help?

@muellerzr @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The problem arises when I run trainer.train(resume_from_checkpoint=True) from the last checkpoint of a run that trained for more than 24 hours. The bug is in this line: https://github.com/huggingface/transformers/blob/20081c743ee2ce31d178f2182c7466c3313adcd2/src/transformers/trainer.py#L2156. When the sampler is listed, the GPU goes out of memory. Whatever that line is trying to accomplish, it should not throw an error; in my case it is an out-of-memory error. If I comment out that line and load the same checkpoint again, training resumes correctly.
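
For context, a minimal sketch of the pattern the report points at; the toy dataset and RandomSampler below are placeholders for whatever the real dataloader carries, and only the list(...) over the sampler comes from the report:

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Placeholder dataset; the real run used a much larger custom dataset.
dataset = TensorDataset(torch.arange(1_000_000).float())
train_dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=8)

# Roughly what the linked trainer.py line does when resuming mid-epoch:
# every index is drawn up front before training restarts, and on the
# reporter's setup this step is where memory runs out.
_ = list(train_dataloader.sampler)
```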

Expected behavior

When I run trainer.train(resume_from_checkpoint=True) from any previous checkpoint, training should continue without throwing an error. Line 2156 of trainer.py should be modified so that it does not list the sampler.
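
As an illustration of the kind of change being asked for, a hedged sketch that advances a random sampler without building the full index list; this is an assumption about one possible approach, not the change made in #30501:

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

dataset = TensorDataset(torch.arange(1_000_000).float())
sampler = RandomSampler(dataset)

# Consume the sampler's iterator lazily instead of calling list(sampler),
# so its random state still advances but no giant index list is ever held
# in memory at once.
for _ in sampler:
    pass
```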

@muellerzr
Contributor

Oh that's a big ol' bug we can fix now with accelerate I think. Thanks!
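
For reference, a minimal sketch of the direction this comment hints at, using Accelerate's skip_first_batches helper to resume partway through an epoch without materializing the sampler; the toy dataloader and batch count are placeholders, and whether PR #30501 takes exactly this route is not confirmed here:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import skip_first_batches

dataset = TensorDataset(torch.arange(1000).float().unsqueeze(1))
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# The number of already-consumed batches would come from the checkpoint's
# trainer state; 42 is a placeholder.
resumed = skip_first_batches(train_dataloader, num_batches=42)

for (batch,) in resumed:
    pass  # continue training from where the interrupted epoch left off
```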
