About mghead loss compute question #19

Closed

muzi2045 opened this issue Jan 3, 2020 · 5 comments

Comments

@muzi2045

muzi2045 commented Jan 3, 2020

Besides CBGS, I am trying to train the original PointPillars on nuScenes with this repo.
I found a problem in the loss computation that leads to a gradient explosion.
Here is the Head1 box_conv weight in the first epoch:

 box conv weight: Parameter containing:
tensor([[[[-0.0235]],

         [[-0.0223]],

         [[ 0.0100]],

         ...,

         [[ 0.0126]],

         [[-0.0176]],

         [[ 0.0154]]],


        [[[-0.0487]],

         [[ 0.0367]],

         [[ 0.0096]],

         ...,

         [[ 0.0182]],

         [[ 0.0200]],

         [[-0.0325]]],


        [[[ 0.0089]],

         [[-0.0121]],

         [[-0.0017]],

         ...,

         [[-0.0492]],

         [[-0.0505]],

         [[-0.0137]]],


        ...,


        [[[-0.0302]],

         [[-0.0257]],

         [[-0.0246]],

         ...,

         [[ 0.0090]],

         [[-0.0497]],

         [[ 0.0128]]],


        [[[ 0.0449]],

         [[ 0.0291]],

         [[ 0.0460]],

         ...,

         [[ 0.0024]],

         [[-0.0081]],

         [[-0.0162]]],


        [[[ 0.0178]],

         [[-0.0133]],

         [[ 0.0189]],

         ...,

         [[ 0.0100]],

         [[-0.0445]],

         [[-0.0162]]]], device='cuda:0', requires_grad=True)

Here is the loss output (only the head1 loss is computed):

OrderedDict([('loss', [203.5531005859375]), ('cls_pos_loss', [0.04986190423369408]), ('cls_neg_loss', [201.2117919921875]), ('dir_loss_reduced', [0.6615481376647949]), ('cls_loss_reduced', [201.26165771484375]), ('loc_loss_reduced', [2.1591315269470215]), ('loc_loss_elem', [[0.05492932349443436, 0.041640881448984146, 0.67469322681427, 0.035490743815898895, 0.05674883723258972, 0.05906621366739273, 0.0, 0.0, 0.15699654817581177]]), ('num_pos', [86]), ('num_neg', [126794])])

In the second epoch, the head1 box_conv weight has changed and contains some NaN values:

 box conv weight: Parameter containing:
tensor([[[[-0.0235]],

         [[-0.0223]],

         [[ 0.0100]],

         ...,

         [[ 0.0126]],

         [[-0.0176]],

         [[ 0.0154]]],


        [[[-0.0487]],

         [[ 0.0367]],

         [[ 0.0096]],

         ...,

         [[ 0.0182]],

         [[ 0.0200]],

         [[-0.0325]]],


        [[[ 0.0089]],

         [[-0.0121]],

         [[-0.0017]],

         ...,

         [[-0.0492]],

         [[-0.0505]],

         [[-0.0137]]],


        ...,


        [[[    nan]],

         [[    nan]],

         [[    nan]],

         ...,

         [[    nan]],

         [[    nan]],

         [[    nan]]],


        [[[    nan]],

         [[    nan]],

         [[    nan]],

         ...,

         [[    nan]],

         [[    nan]],

         [[    nan]]],


        [[[ 0.0178]],

         [[-0.0133]],

         [[ 0.0189]],

         ...,

         [[ 0.0100]],

         [[-0.0445]],

         [[-0.0162]]]], device='cuda:0', requires_grad=True)

That is, the last layer's weights contain NaN values, and back propagation then turns the weights of all the other layers into NaN as well. The grad clip is set to:

optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

In another try I fixed the loss to a constant value (300); then no NaN appeared in any layer's weights and the loss stayed at normal values, which means the problem is in the loss computation rather than in the network layers themselves.
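
For reference, a minimal sketch of how the NaNs can be located, assuming a standard PyTorch training loop with a model and a scalar loss (placeholder names, not this repo's exact API):

import torch

# Report the first backward op that produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def report_nans(model, loss):
    # Check the loss itself, then every parameter gradient after loss.backward().
    if torch.isnan(loss).any():
        print("NaN in loss")
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print("NaN gradient in", name)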

@poodarchu

@poodarchu
Collaborator

Is the result right?

@muzi2045
Author

muzi2045 commented Jan 3, 2020

The test result is correct, and other people are hitting the same problem too.

@poodarchu
Collaborator

I also encounter this problem occasionally, but it's hard to reproduce so I didn't pay much attention to it.

@muzi2045
Author

muzi2045 commented Jan 3, 2020

I am comparing the loss computation in your repo and in the second.pytorch repo. In the original repo I have never encountered this kind of problem, even though the loss computation is almost the same when training PointPillars.

@muzi2045
Author

muzi2045 commented Jan 4, 2020

There is a problem in the data generation: invalid NaN values in the gt_boxes velocity cause this issue.
If anyone hits this error, please check the data generation output; the gt_boxes produced by this repo can contain wrong velocity values.
Check this part of nusc_common.py:

if not test:
            annotations = [
                nusc.get("sample_annotation", token) for token in sample["anns"]
            ]

            locs = np.array([b.center for b in ref_boxes]).reshape(-1, 3)
            dims = np.array([b.wlh for b in ref_boxes]).reshape(-1, 3)
            # rots = np.array([b.orientation.yaw_pitch_roll[0] for b in ref_boxes]).reshape(-1, 1)
            # velocity = np.array([b.velocity for b in ref_boxes]).reshape(-1, 3)
            velocity = np.array(
                [nusc.box_velocity(token)[:2] for token in sample['anns']]
            )
            # convert velo from global to lidar
            for i in range(len(ref_boxes)):
                velo = np.array([*velocity[i], 0.0])
                velo = velo @ np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(
                    l2e_r_mat).T
                velocity[i] = velo[:2]
            velocity = velocity.reshape(-1,2)

            rots = np.array([quaternion_yaw(b.orientation) for b in ref_boxes]).reshape(
                -1, 1
            )
            names = np.array([b.name for b in ref_boxes])
            tokens = np.array([b.token for b in ref_boxes])
            gt_boxes = np.concatenate(
                [locs, dims, velocity[:, :2], -rots - np.pi / 2], axis=1
            )

Although you have modified this part, the velocity computation may still produce invalid output.
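
For context, nusc.box_velocity() in the nuScenes devkit returns [nan, nan, nan] when it cannot estimate a velocity (for example, an instance with only one annotation), so the velocity array above can carry NaNs straight into gt_boxes. A minimal sketch of sanitizing it, reusing the variable names from the snippet above:

import numpy as np

# Sketch: replace the devkit's NaN velocities with 0.0 before building gt_boxes.
velocity = np.nan_to_num(velocity)
gt_boxes = np.concatenate(
    [locs, dims, velocity[:, :2], -rots - np.pi / 2], axis=1
)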

The dirtiest way to avoid this is to just add this code in mg_head.py:

if kwargs.get("mode", False):
    reg_targets = example["reg_targets"][task_id][:, :, [0, 1, 3, 4, 6]]
    reg_targets_left = example["reg_targets"][task_id][:, :, [2, 5]]
else:
    reg_targets = example["reg_targets"][task_id]

## Add part: zero out NaN regression targets (invalid velocities) for each of the 6 nuScenes tasks
for i in range(6):
    example["reg_targets"][i][torch.isnan(example["reg_targets"][i])] = 0.0
