RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 128]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck! #23087
Comments
Hey!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @jordane95 @ArthurZucker
Yeah, this seems to be a problem related to the siamese architecture? Although I can avoid this error by moving the loss computation operations in …
@jordane95 any idea what caused this error? Thanks
@anaivebird If you are using the BertModel class, try to put …
Thank you for your suggestion. I encountered the same issue as well: I'm also trying to use SimCLR (or another contrastive learning framework) on top of the BERT class, and I was able to resolve the problem using the method you provided.
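The workaround comment above is cut off in this copy of the thread. For anyone hitting the same error, one commonly suggested mitigation for this kind of in-place-modification failure, which may or may not be what the commenter actually wrote, is to pass position_ids explicitly so the forward passes do not rely on BertModel's shared, registered position_ids buffer. A minimal sketch:

```python
# Hypothetical sketch (not necessarily the commenter's actual fix): build the
# position ids per batch and pass them in, instead of letting BertModel slice
# its shared registered position_ids buffer on every forward pass.
import torch
from transformers import AutoTokenizer, BertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("how do dual encoders work?", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

# Fresh tensor for this batch; shape (1, seq_len), broadcast over the batch.
position_ids = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)

outputs = model(**inputs, position_ids=position_ids)
```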
System Info

transformers version: 4.25.1

Who can help?

@sgugger @ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I want to train an embedding-based retrieval QA system by minimizing a contrastive loss over correct (q, a) pairs against in-batch negatives. I also want it to run on multiple GPUs. But I run into a problem during backward propagation in the position embedding layer of BERT (which I infer from the error log) when running in a distributed manner. I don't know where it is broken (Trainer? BertModel? PyTorch?).
Btw, the code works in the single-GPU setting.
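Since the retrieval_qa.py listing did not survive in this copy of the issue, here is a minimal, hypothetical sketch of the kind of setup being described: a shared BERT encoder embeds questions and answers, and an in-batch-negative contrastive loss is computed inside forward. The class, model name, and temperature below are placeholders, not the author's actual code:

```python
# Hypothetical stand-in for the kind of model described above; NOT the
# author's actual retrieval_qa.py.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel


class BiEncoder(nn.Module):
    """Siamese/dual encoder: one shared BERT encodes both questions and answers."""

    def __init__(self, model_name="bert-base-uncased", temperature=0.05):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.temperature = temperature

    def encode(self, input_ids, attention_mask):
        # Use the [CLS] hidden state as the sentence embedding.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]

    def forward(self, q_input_ids, q_attention_mask, a_input_ids, a_attention_mask):
        q_emb = self.encode(q_input_ids, q_attention_mask)  # (batch, hidden)
        a_emb = self.encode(a_input_ids, a_attention_mask)  # (batch, hidden)

        # In-batch negatives: the i-th answer is the positive for the i-th
        # question; every other answer in the batch serves as a negative.
        scores = q_emb @ a_emb.t() / self.temperature        # (batch, batch)
        labels = torch.arange(scores.size(0), device=scores.device)
        loss = F.cross_entropy(scores, labels)
        return {"loss": loss, "scores": scores}
```

Keeping the loss inside forward, rather than combining the two encoders' outputs outside the DDP-wrapped call, is presumably what the truncated "moving loss computation operations in …" comment above refers to, though that is an assumption since the comment is cut off.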
Command that I ran:
Error details:
Source code of retrieval_qa.py:
Expected behavior
Currently there is no problem on a single GPU.
I want this code to run normally on multiple GPUs, but it seems something is broken somewhere...
It's hard to find where the problem is because I'm not super familiar with how PyTorch / Trainer / BertModel work in a distributed manner...
Could you help me? Thanks!