About mghead loss compute question #19

Closed

muzi2045 opened this issue Jan 3, 2020 · 5 comments

Comments

@muzi2045

muzi2045 commented Jan 3, 2020

Besides CBGS, I am trying to train the original PointPillars on nuScenes with this repo.
I found a problem in the loss computation that leads to a gradient explosion.
Here is the Head1 box_conv weight in the first epoch:

 box conv weight: Parameter containing:
tensor([[[[-0.0235]],

         [[-0.0223]],

         [[ 0.0100]],

         ...,

         [[ 0.0126]],

         [[-0.0176]],

         [[ 0.0154]]],


        [[[-0.0487]],

         [[ 0.0367]],

         [[ 0.0096]],

         ...,

         [[ 0.0182]],

         [[ 0.0200]],

         [[-0.0325]]],


        [[[ 0.0089]],

         [[-0.0121]],

         [[-0.0017]],

         ...,

         [[-0.0492]],

         [[-0.0505]],

         [[-0.0137]]],


        ...,


        [[[-0.0302]],

         [[-0.0257]],

         [[-0.0246]],

         ...,

         [[ 0.0090]],

         [[-0.0497]],

         [[ 0.0128]]],


        [[[ 0.0449]],

         [[ 0.0291]],

         [[ 0.0460]],

         ...,

         [[ 0.0024]],

         [[-0.0081]],

         [[-0.0162]]],


        [[[ 0.0178]],

         [[-0.0133]],

         [[ 0.0189]],

         ...,

         [[ 0.0100]],

         [[-0.0445]],

         [[-0.0162]]]], device='cuda:0', requires_grad=True)

Here is the loss output (only the head1 loss is computed):

OrderedDict([('loss', [203.5531005859375]), ('cls_pos_loss', [0.04986190423369408]), ('cls_neg_loss', [201.2117919921875]), ('dir_loss_reduced', [0.6615481376647949]), ('cls_loss_reduced', [201.26165771484375]), ('loc_loss_reduced', [2.1591315269470215]), ('loc_loss_elem', [[0.05492932349443436, 0.041640881448984146, 0.67469322681427, 0.035490743815898895, 0.05674883723258972, 0.05906621366739273, 0.0, 0.0, 0.15699654817581177]]), ('num_pos', [86]), ('num_neg', [126794])])

In the second epoch, the head1 box_conv weight has changed and contains some NaN values:

 box conv weight: Parameter containing:
tensor([[[[-0.0235]],

         [[-0.0223]],

         [[ 0.0100]],

         ...,

         [[ 0.0126]],

         [[-0.0176]],

         [[ 0.0154]]],


        [[[-0.0487]],

         [[ 0.0367]],

         [[ 0.0096]],

         ...,

         [[ 0.0182]],

         [[ 0.0200]],

         [[-0.0325]]],


        [[[ 0.0089]],

         [[-0.0121]],

         [[-0.0017]],

         ...,

         [[-0.0492]],

         [[-0.0505]],

         [[-0.0137]]],


        ...,


        [[[    nan]],

         [[    nan]],

         [[    nan]],

         ...,

         [[    nan]],

         [[    nan]],

         [[    nan]]],


        [[[    nan]],

         [[    nan]],

         [[    nan]],

         ...,

         [[    nan]],

         [[    nan]],

         [[    nan]]],


        [[[ 0.0178]],

         [[-0.0133]],

         [[ 0.0189]],

         ...,

         [[ 0.0100]],

         [[-0.0445]],

         [[-0.0162]]]], device='cuda:0', requires_grad=True)

That is, the last layer's weights contain NaN values, and back propagation then turns the weights of all the other layers into NaN as well. The grad clip is set to:

optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

In another try I fixed the loss to a constant value (300); then no NaN appeared in any layer's weights and the loss stayed at normal values, which means the problem is in the loss computation rather than in the network layers themselves.
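
For reference, a minimal sketch of how the NaNs can be located, assuming a standard PyTorch training loop with a model and a scalar loss (placeholder names, not this repo's exact API):

import torch

# Report the first backward op that produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def report_nans(model, loss):
    # Check the loss itself, then every parameter gradient after loss.backward().
    if torch.isnan(loss).any():
        print("NaN in loss")
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print("NaN gradient in", name)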

@poodarchu

@poodarchu
Collaborator

Is the result right?

@muzi2045
Author

muzi2045 commented Jan 3, 2020

The test result is correct, and other people are hitting the same problem too.

@poodarchu
Collaborator

I also encounter this problem occasionally, but it's hard to reproduce so I didn't pay much attention to it.

@muzi2045
Author

muzi2045 commented Jan 3, 2020

I am comparing the loss computation in your repo and in the second.pytorch repo. In the original repo I have never encountered this kind of problem, even though the loss computation is almost the same when training PointPillars.

@muzi2045
Author

muzi2045 commented Jan 4, 2020

There is a problem in the data generation: invalid NaN values in the gt_boxes velocity cause this issue.
If anyone hits this error, please check the data generation output; the gt_boxes produced by this repo can contain wrong velocity values.
Check this part of nusc_common.py:

if not test:
            annotations = [
                nusc.get("sample_annotation", token) for token in sample["anns"]
            ]

            locs = np.array([b.center for b in ref_boxes]).reshape(-1, 3)
            dims = np.array([b.wlh for b in ref_boxes]).reshape(-1, 3)
            # rots = np.array([b.orientation.yaw_pitch_roll[0] for b in ref_boxes]).reshape(-1, 1)
            # velocity = np.array([b.velocity for b in ref_boxes]).reshape(-1, 3)
            velocity = np.array(
                [nusc.box_velocity(token)[:2] for token in sample['anns']]
            )
            # convert velo from global to lidar
            for i in range(len(ref_boxes)):
                velo = np.array([*velocity[i], 0.0])
                velo = velo @ np.linalg.inv(e2g_r_mat).T @ np.linalg.inv(
                    l2e_r_mat).T
                velocity[i] = velo[:2]
            velocity = velocity.reshape(-1,2)

            rots = np.array([quaternion_yaw(b.orientation) for b in ref_boxes]).reshape(
                -1, 1
            )
            names = np.array([b.name for b in ref_boxes])
            tokens = np.array([b.token for b in ref_boxes])
            gt_boxes = np.concatenate(
                [locs, dims, velocity[:, :2], -rots - np.pi / 2], axis=1
            )

Although you have modified this part, the velocity computation may still produce invalid output.
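
For context, nusc.box_velocity() in the nuScenes devkit returns [nan, nan, nan] when it cannot estimate a velocity (for example, an instance with only one annotation), so the velocity array above can carry NaNs straight into gt_boxes. A minimal sketch of sanitizing it, reusing the variable names from the snippet above:

import numpy as np

# Sketch: replace the devkit's NaN velocities with 0.0 before building gt_boxes.
velocity = np.nan_to_num(velocity)
gt_boxes = np.concatenate(
    [locs, dims, velocity[:, :2], -rots - np.pi / 2], axis=1
)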

The dirtiest way to avoid this is to just add this code in mg_head.py:

if kwargs.get("mode", False):
    reg_targets = example["reg_targets"][task_id][:, :, [0, 1, 3, 4, 6]]
    reg_targets_left = example["reg_targets"][task_id][:, :, [2, 5]]
else:
    reg_targets = example["reg_targets"][task_id]

## Add part: zero out NaN regression targets (invalid velocities) for each of the 6 nuScenes tasks
for i in range(6):
    example["reg_targets"][i][torch.isnan(example["reg_targets"][i])] = 0.0
