
training problem #19

Open
StephanPan opened this issue Mar 24, 2021 · 12 comments

@StephanPan

I trained the model on the Campus dataset and met the following problem. I am using torch 1.7 and CUDA 11.1. Also, the training strategy in the code seems to be different from the strategy given in the paper.
Traceback (most recent call last):
  File "run/train_3d.py", line 163, in <module>
    main()
  File "run/train_3d.py", line 136, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/home/gw/Project/voxelpose/lib/core/function.py", line 68, in train_3d
    accu_loss_3d.backward()
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1, 1, 1]] is at version 8; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
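
The hint at the end of the traceback refers to PyTorch's anomaly detection. A minimal way to enable it before the training loop (just a debugging aid, not part of the repo's code) is:

import torch

# Record the forward op that created each saved tensor so that, when backward()
# fails, the error points at the operation whose input was modified in place.
# This slows training down, so only enable it while debugging.
torch.autograd.set_detect_anomaly(True)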

@axhiao

axhiao commented Mar 25, 2021

Hi @StephanPan, have you solved the issue? I think the problem is this line, but I don't know how to rewrite it.
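
For context, here is a minimal, self-contained sketch (not the repo's code) of the pattern that raises this error on newer PyTorch versions: optimizer.step() updates the parameters in place, so a loss computed from the old parameter values can no longer be back-propagated afterwards.

import torch

w = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.SGD([w], lr=0.1)

loss_a = (w ** 2).sum()  # pow saves w for its backward pass
loss_b = (w ** 2).sum()  # this graph also keeps a reference to w

loss_a.backward()
opt.step()         # in-place update of w bumps its version counter
loss_b.backward()  # RuntimeError: ... modified by an inplace operation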

@StephanPan

@axhiao I changed the loss calculation in function.py as follows and it worked, but I do not know whether it will influence the model performance.
optimizer.zero_grad()
if loss_cord > 0:
    (loss_2d + loss_cord).backward()
if loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    loss_3d.backward()
optimizer.step()

@axhiao

axhiao commented Mar 25, 2021

Hi @StephanPan, I think it's due to a different PyTorch version. I recommend using requirements.txt to create a completely new virtual Python env to run this code.

@StephanPan

@axhiao that's right, but my CUDA version and GPU driver do not match torch 1.4.

@tamasino52

I'm getting the same error too...

@wkom

wkom commented May 18, 2021

@StephanPan hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:

loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
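
One detail to keep in mind: the original function.py accumulates loss_3d / accumulation_steps before calling backward on it (see the snippet quoted later in this thread), so with the combined backward above the 3D gradients are summed rather than averaged over the accumulation window, and the 2D losses, which used to be stepped every iteration, are now also accumulated. A variant that keeps the 3D term on its original scale (just a sketch, not tested against the paper's results):

loss = loss_2d + loss_cord + loss_3d / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()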

@sudo-vinnie

@StephanPan Hi, do you know what loss_cord is?

@SauBuen

SauBuen commented Jul 7, 2021

@StephanPan, @wkom hi, you are right, the problem is in the backward step, you can change the code in function.py as follows

loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()

How exactly do you change the code? This is what is in function.py now:


        loss = loss_2d + loss_3d + loss_cord
        losses.update(loss.item())

        if loss_cord > 0:
            optimizer.zero_grad()
            (loss_2d + loss_cord).backward()
            optimizer.step()

        if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0:
            optimizer.zero_grad()
            accu_loss_3d.backward()
            optimizer.step()
            accu_loss_3d = 0.0
        else:
            accu_loss_3d += loss_3d / accumulation_steps

@salvador-blanco

(quoting @SauBuen's question above)

This is how I changed it, it works for me:

        loss_2d = loss_2d.mean()
        loss_3d = loss_3d.mean()
        loss_cord = loss_cord.mean()

        losses_2d.update(loss_2d.item())
        losses_3d.update(loss_3d.item())
        losses_cord.update(loss_cord.item())
        loss = loss_2d + loss_3d + loss_cord
        losses.update(loss.item())

        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

        # if loss_cord > 0:
        #     optimizer.zero_grad()
        #     (loss_2d + loss_cord).backward()
        #     optimizer.step()

        # if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0
        #     optimizer.step()
        #     optimizer.zero_grad()
        #     accu_loss_3d.backward()
        #     accu_loss_3d = 0.0
        # else:
        #     accu_loss_3d += loss_3d / accumulation_steps

        batch_time.update(time.time() - end)
        end = time.time()

@baojunshan

baojunshan commented Oct 29, 2021

Try changing the torch version to 1.4; it should be OK. :)

@Alex-JYJ

(quoting @salvador-blanco's fix above)

The change also works for me, but I don't know whether it will affect the precision of the result. Can you give some explanation? Thanks!

@cucdengjunli

same question
