System Info
- transformers version: 4.39.0
- distributed_type: NO
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Who can help?
@muellerz @pacman100
Information
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The problem arises when I run trainer.train(resume_from_checkpoint=True) from the last checkpoint of a run that lasted more than 24 hours. The bug is in this line: https://github.com/huggingface/transformers/blob/20081c743ee2ce31d178f2182c7466c3313adcd2/src/transformers/trainer.py#L2156. When that line lists the sampler, the GPU goes out of memory. Whatever the line is trying to accomplish, it shouldn't throw an error (here, apparently an OOM error). If I comment it out and load the same checkpoint as before, training resumes correctly.
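For reference, here is a minimal sketch of how I trigger the resume. The model, dataset, and output directory below are placeholders rather than my actual setup; the only essential part is calling trainer.train(resume_from_checkpoint=True) on an output_dir that already contains a checkpoint from a long run.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder model and dataset: the issue reproduces with any run long enough
# to have written a checkpoint (mine ran for more than 24 hours).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

dataset = load_dataset("glue", "sst2", split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",   # already contains checkpoint-XXXX folders from the earlier run
    fp16=True,          # matches my accelerate config (mixed_precision: fp16)
    save_steps=500,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, tokenizer=tokenizer)

# Resuming from the last checkpoint in output_dir is what hits the OOM at
# trainer.py line 2156 (the line that lists the sampler).
trainer.train(resume_from_checkpoint=True)
```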
Expected behavior
When I run trainer.train(resume_from_checkpoint=True) from any previous checkpoint, it should continue training rather than throw an error. Line 2156 of trainer.py should be modified so that it doesn't list the sampler.
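As a stopgap (I haven't verified that it actually bypasses the offending line), resuming with ignore_data_skip=True may avoid iterating the sampler, at the cost of not skipping the batches that were already seen before the checkpoint:

```python
from transformers import TrainingArguments

# Untested workaround idea: ignore_data_skip=True tells the Trainer not to
# fast-forward the dataloader when resuming, which may avoid the sampler
# listing that triggers the OOM. Note that the data order will then differ
# from an uninterrupted run.
args = TrainingArguments(
    output_dir="out",
    fp16=True,
    ignore_data_skip=True,
)
```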