Parameter at index 195 has been marked as ready twice. #23018
Comments
There is little we can do to help without seeing a full reproducer.
I got the exact same bug when calling `gradient_checkpointing_enable()`.
Are you using DDP?

I am using DDP on two GPUs: `python -m torch.distributed.run --nproc_per_node 2 run_audio_classification.py` (using `run` because `launch` fails). All else being equal, `facebook/wav2vec2-base` works with `gradient_checkpointing` set to True, but the large model crashes unless the option is either set to False or removed. `gradient_checkpointing` works for both models on a single GPU, so the issue seems to be DDP-related. The error seems to come from: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/reducer.cpp
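Not from the thread, but for context: a minimal raw-PyTorch sketch of the suspected interaction (the module, sizes, port, and gloo/CPU backend are all made up for illustration) which, as far as I understand the reducer, reproduces the same error by combining `find_unused_parameters=True` with reentrant activation checkpointing:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(8, 8)

    def forward(self, x):
        # Reentrant checkpointing re-runs this forward during backward,
        # so DDP's per-parameter autograd hooks can fire a second time.
        return checkpoint(self.inner, x, use_reentrant=True)


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    model = DDP(Net(), find_unused_parameters=True)
    out = model(torch.randn(4, 8))
    # Expected to fail with:
    # RuntimeError: Expected to mark a variable ready only once. ...
    # Parameter at index N has been marked as ready twice.
    out.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```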
The problem may be that, when the trainer is invoked from torchrun, it sets `find_unused_parameters` to True for all devices when, apparently, it should only do so for the first one. The reason the base model works is that the option can be set to False for it; for the large model, however, it has to be True. The solution would be to change the way that argument is parsed.
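If the script trains through the `Trainer`, that flag can be pinned explicitly rather than left to the default; a minimal sketch, with the output path as a placeholder (`gradient_checkpointing` and `ddp_find_unused_parameters` are real `TrainingArguments` fields):

```python
from transformers import TrainingArguments

# When launched via torch.distributed.run, the Trainer wraps the model in
# DistributedDataParallel itself; ddp_find_unused_parameters overrides the
# value it would otherwise choose.
training_args = TrainingArguments(
    output_dir="wav2vec2-audio-cls",    # placeholder
    gradient_checkpointing=True,
    ddp_find_unused_parameters=False,   # avoid marking parameters ready twice
)
```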
Thank you @mirix. This should be fixed if you use `model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})`.
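For concreteness, a sketch of that suggestion (the checkpoint name is only an example, and `gradient_checkpointing_kwargs` requires a transformers release newer than the 4.28.0 reported below):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("roberta-base")  # example checkpoint
# Non-reentrant checkpointing does not re-run the forward inside backward,
# so DDP's per-parameter hooks fire only once.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```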
System Info

transformers version: 4.28.0

Who can help?
@ArthurZucker @younesbelkada
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I retrained RoBERTa on my own corpus with the MLM task. I set `model.gradient_checkpointing_enable()` to save memory.

My model:
There is an error:
If I remove the line `model.gradient_checkpointing_enable()`, everything runs fine. Why?
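As I read it, the failing pattern is simply default (reentrant) checkpointing under DDP; a hedged sketch, with the checkpoint name assumed:

```python
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM.from_pretrained("roberta-base")  # assumed checkpoint
model.gradient_checkpointing_enable()  # defaults to reentrant checkpointing
# Training under DDP (e.g. torchrun --nproc_per_node 2 ...) then fails with:
# RuntimeError: Parameter at index 195 has been marked as ready twice.
```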
Expected behavior

I want to pre-train with `gradient_checkpointing` enabled.