Grad Norm Becomes Inf #49

TiankaiHang · 2021-12-05T08:03:32Z

On two gpus.

Epoch: [24] [1230/1251] eta: 0:00:06 lr: 0.000375 min_lr: 0.000375 loss: 0.6870 (0.6848) loss_scale: 2097152.0000 (2046895.3111) weight_decay: 0.0500 (0.0500) grad_norm: 0.0929 (0.0969) time: 0.3023 data: 0.0010 max mem: 8361 Epoch: [24] [1240/1251] eta: 0:00:03 lr: 0.000375 min_lr: 0.000375 loss: 0.6877 (0.6848) loss_scale: 2097152.0000 (2047300.2804) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2731 data: 0.0018 max mem: 8361 Epoch: [24] [1250/1251] eta: 0:00:00 lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6849) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2560 data: 0.0012 max mem: 8361 Epoch: [24] Total time: 0:06:23 (0.3067 s / it) Averaged stats: lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6851) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) Epoch: [25] [ 0/1251] eta: 1:25:25 lr: 0.000375 min_lr: 0.000375 loss: 0.6770 (0.6770) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0918 (0.0918) time: 4.0974 data: 3.7792 max mem: 8361 Epoch: [25] [ 10/1251] eta: 0:13:50 lr: 0.000375 min_lr: 0.000375 loss: 0.6854 (0.6838) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0910 (0.0949) time: 0.6694 data: 0.3704 max mem: 8361

How does the phenomenon occur?

The text was updated successfully, but these errors were encountered:

insomniaaac · 2021-12-05T16:30:22Z

It also occurs during my training, but my 'grad norm' becomes 'nan'!!!
I wonder if this will stop my training process.

insomniaaac · 2021-12-05T16:40:20Z

after a long time escaping, seems my grad_norm became normal.

ZeWang95 · 2021-12-06T02:08:44Z

Based on my experience, such inf or nan grad problem occurs when using mixed-precision training, which is used in this code. Those abnormal steps are skipped automatically, so there should be nothing to be concerned about.

pengzhiliang · 2021-12-06T17:21:37Z

Yes, @ZeWang95 .
This problem always occurs when using mixed-precision training, but it doesn't matter!

pengzhiliang closed this as completed Dec 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grad Norm Becomes Inf #49

Grad Norm Becomes Inf #49

TiankaiHang commented Dec 5, 2021

insomniaaac commented Dec 5, 2021

insomniaaac commented Dec 5, 2021

ZeWang95 commented Dec 6, 2021

pengzhiliang commented Dec 6, 2021

Grad Norm Becomes Inf #49

Grad Norm Becomes Inf #49

Comments

TiankaiHang commented Dec 5, 2021

insomniaaac commented Dec 5, 2021

insomniaaac commented Dec 5, 2021

ZeWang95 commented Dec 6, 2021

pengzhiliang commented Dec 6, 2021