
training problem #19

Open
StephanPan opened this issue Mar 24, 2021 · 12 comments

@StephanPan

I trained the model on the Campus dataset and met the following problem. I am using torch 1.7 and CUDA 11.1. Also, the training strategy in the code seems to be different from the strategy given in the paper.
Traceback (most recent call last):
  File "run/train_3d.py", line 163, in <module>
    main()
  File "run/train_3d.py", line 136, in main
    train_3d(config, model, optimizer, train_loader, epoch, final_output_dir, writer_dict)
  File "/home/gw/Project/voxelpose/lib/core/function.py", line 68, in train_3d
    accu_loss_3d.backward()
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/gw/anaconda3/envs/VIBE/lib/python3.7/site-packages/torch/autograd/__init__.py", line 132, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 1, 1, 1]] is at version 8; expected version 6 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
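
The hint at the end of the traceback refers to PyTorch's anomaly detection. A minimal way to enable it before the training loop (just a debugging aid, not part of the repo's code) is:

import torch

# Record the forward op that created each saved tensor so that, when backward()
# fails, the error points at the operation whose input was modified in place.
# This slows training down, so only enable it while debugging.
torch.autograd.set_detect_anomaly(True)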

@axhiao

axhiao commented Mar 25, 2021

Hi @StephanPan, have you solved the issue? I think the problem is this line, but I don't know how to rewrite it.
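
For context, here is a minimal, self-contained sketch (not the repo's code) of the pattern that raises this error on newer PyTorch versions: optimizer.step() updates the parameters in place, so a loss computed from the old parameter values can no longer be back-propagated afterwards.

import torch

w = torch.nn.Parameter(torch.randn(3))
opt = torch.optim.SGD([w], lr=0.1)

loss_a = (w ** 2).sum()  # pow saves w for its backward pass
loss_b = (w ** 2).sum()  # this graph also keeps a reference to w

loss_a.backward()
opt.step()         # in-place update of w bumps its version counter
loss_b.backward()  # RuntimeError: ... modified by an inplace operation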

@StephanPan

@axhiao I changed the loss calculation in function.py as follows and it worked, but I do not know whether it will influence the model performance.
optimizer.zero_grad()
if loss_cord > 0:
    (loss_2d + loss_cord).backward()
if loss_3d > 0 and (i + 1) % accumulation_steps == 0:
    loss_3d.backward()
optimizer.step()

@axhiao

axhiao commented Mar 25, 2021

Hi @StephanPan, I think it's due to a different PyTorch version. I recommend using requirements.txt to create a completely new virtual Python env to run this code.

@StephanPan

@axhiao that's right, but my CUDA version and GPU driver do not match torch 1.4.

@tamasino52

I'm getting the same error too...

@wkom

wkom commented May 18, 2021

@StephanPan hi, you are right, the problem is in the backward step. You can change the code in function.py as follows:

loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()
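
One detail to keep in mind: the original function.py accumulates loss_3d / accumulation_steps before calling backward on it (see the snippet quoted later in this thread), so with the combined backward above the 3D gradients are summed rather than averaged over the accumulation window, and the 2D losses, which used to be stepped every iteration, are now also accumulated. A variant that keeps the 3D term on its original scale (just a sketch, not tested against the paper's results):

loss = loss_2d + loss_cord + loss_3d / accumulation_steps
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()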

@sudo-vinnie

@StephanPan Hi, do you know what loss_cord is?

@SauBuen

SauBuen commented Jul 7, 2021

@StephanPan, @wkom hi, you are right, the problem is in the backward step, you can change the code in function.py as follows

loss = loss_2d + loss_3d + loss_cord
loss.backward()
if (i + 1) % accumulation_steps == 0:
    optimizer.step()
    optimizer.zero_grad()

How exactly do you change the code? This is what is in function.py now:


        loss = loss_2d + loss_3d + loss_cord
        losses.update(loss.item())

        if loss_cord > 0:
            optimizer.zero_grad()
            (loss_2d + loss_cord).backward()
            optimizer.step()

        if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0:
            optimizer.zero_grad()
            accu_loss_3d.backward()
            optimizer.step()
            accu_loss_3d = 0.0
        else:
            accu_loss_3d += loss_3d / accumulation_steps

@salvador-blanco

(quoting @SauBuen's question above)

This is how I changed it, it works for me:

        loss_2d = loss_2d.mean()
        loss_3d = loss_3d.mean()
        loss_cord = loss_cord.mean()

        losses_2d.update(loss_2d.item())
        losses_3d.update(loss_3d.item())
        losses_cord.update(loss_cord.item())
        loss = loss_2d + loss_3d + loss_cord
        losses.update(loss.item())

        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

        # if loss_cord > 0:
        #     optimizer.zero_grad()
        #     (loss_2d + loss_cord).backward()
        #     optimizer.step()

        # if accu_loss_3d > 0 and (i + 1) % accumulation_steps == 0
        #     optimizer.step()
        #     optimizer.zero_grad()
        #     accu_loss_3d.backward()
        #     accu_loss_3d = 0.0
        # else:
        #     accu_loss_3d += loss_3d / accumulation_steps

        batch_time.update(time.time() - end)
        end = time.time()

@baojunshan

baojunshan commented Oct 29, 2021

Try changing the torch version to 1.4; it should be OK. :)

@Alex-JYJ

(quoting @salvador-blanco's fix above)

The change also works for me, but I don't know whether it will affect the precision of the result. Can you give some explanation? Thanks!

@cucdengjunli

same question
