Hi, in my case I want to train on multiple GPUs (4 GPUs) with a batch size of 8, and that works well.
But then I noticed that each GPU was only using about half of its capacity, so I tried increasing the batch size to 12.
Now the same error keeps reappearing: not OOM, but "SILog is NAN, stopping training".
Do you know why this is happening? Has anyone encountered a similar problem?
I've found the root cause of this problem. In some of the label images every pixel is 0, so the log term in the SILog loss is evaluated on zero depth values and the loss becomes NaN. Eliminating these labels fixed the issue.
tl;dr
caused by an uncleaned dataset
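For anyone hitting the same thing, here is a minimal sketch of the kind of filter that worked for me. The `has_valid_depth` helper and the file names are just illustrative (not from any particular repo): the idea is simply to drop every sample whose depth label has no non-zero pixels before building the training split.

```python
import numpy as np

def has_valid_depth(depth, min_valid_pixels=1):
    """Return True if the depth label has at least `min_valid_pixels`
    non-zero pixels, i.e. some usable ground truth for the SILog loss."""
    return np.count_nonzero(depth) >= min_valid_pixels

# Toy example: one all-zero label (would make SILog NaN) and one valid label.
labels = {
    "bad.png":  np.zeros((4, 4), dtype=np.float32),
    "good.png": np.full((4, 4), 2.5, dtype=np.float32),
}
clean = [name for name, depth in labels.items() if has_valid_depth(depth)]
print(clean)  # ['good.png']
```

In practice you would run this check over the real depth files (or inside the dataset's `__getitem__`) and skip or remove the all-zero ones before training.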