RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 128]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck! #23087
System Info
transformers version: 4.25.1
Who can help?
@sgugger @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I want to train an embedding-based retrieval QA system by minimizing a contrastive loss over correct (q, a) pairs against in-batch negatives, and I want it to run on multiple GPUs. But when running in distributed mode, backward propagation fails in BERT's position embedding layer (which I infer from the error log). I don't know which part is broken (the Trainer? BertModel? PyTorch?).
By the way, the code works in the single-GPU setting.
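For context, here is a minimal sketch of the kind of in-batch-negative contrastive loss I mean. This is not the actual retrieval_qa.py; the model name, mean pooling, and temperature are just placeholders for illustration:

```python
# Minimal sketch (not the original retrieval_qa.py): in-batch negative
# contrastive loss over question/answer embeddings from BERT.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden state over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    out = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (out * mask).sum(1) / mask.sum(1)          # (B, H)

def in_batch_contrastive_loss(questions, answers, temperature=0.05):
    q = F.normalize(embed(questions), dim=-1)         # (B, H)
    a = F.normalize(embed(answers), dim=-1)           # (B, H)
    logits = q @ a.T / temperature                    # (B, B) similarities
    # Diagonal entries are the correct (q, a) pairs; every other column
    # in a row serves as an in-batch negative.
    labels = torch.arange(q.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```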
Command that I ran:
Error details:
Source code of retrieval_qa.py:
Expected behavior
Currently there is no problem on a single GPU.
I want this code to also run normally on multiple GPUs, but it seems something is broken somewhere.
It's hard to find where the problem is because I'm not very familiar with how PyTorch, the Trainer, and BertModel work in the distributed setting.
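If it helps narrow things down, one way to localize the failing op (a sketch only, assuming the loss function shown above and a placeholder batch of raw strings) is to wrap a single training step in PyTorch's anomaly detection, so the backward error points at the forward-pass operation whose saved tensor was modified in place:

```python
# Debugging sketch: `batch` is a stand-in for one batch from my dataloader.
import torch

batch = {"questions": ["What is the capital of France?",
                       "Who wrote Hamlet?"],
         "answers": ["Paris is the capital of France.",
                     "Hamlet was written by Shakespeare."]}

with torch.autograd.detect_anomaly():  # slow; enable only while debugging
    loss = in_batch_contrastive_loss(batch["questions"], batch["answers"])
    loss.backward()
```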
Could you help me? Thanks!