the total_cost and wd_cost become nan. #9
Comments
In rare cases a single forward pass produces a NaN loss. When that value is accumulated into the moving averages of the total_cost and wd_cost variables, the values printed to stdout become NaN. We haven't fixed this internal issue, but it does not affect the training.
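To illustrate the mechanism described above, here is a minimal sketch of how one NaN value poisons an exponential moving average, and how skipping non-finite values keeps the logged number readable. This is not the repository's code: the function names, the decay of 0.9, and the loss values are all hypothetical and chosen only for illustration.

```python
import math


def update_moving_average(avg, value, decay=0.9):
    """Plain exponential moving average (hypothetical helper).

    Once a NaN is mixed in, every later result is NaN as well.
    """
    return decay * avg + (1.0 - decay) * value


def update_moving_average_skip_nan(avg, value, decay=0.9):
    """Same update, but non-finite values are ignored (hypothetical helper)."""
    if not math.isfinite(value):
        return avg
    return decay * avg + (1.0 - decay) * value


# Made-up per-step losses; the third step stands in for the rare NaN forward pass.
losses = [1.0, 0.8, float("nan"), 0.6, 0.5]

plain = guarded = losses[0]
for loss in losses[1:]:
    plain = update_moving_average(plain, loss)
    guarded = update_moving_average_skip_nan(guarded, loss)

print(math.isnan(plain))       # True: the logged average stays NaN after one bad step
print(math.isfinite(guarded))  # True: the guarded average ignores the bad step
```

Note that this only concerns the logged moving average. Whether the NaN step also reaches the model weights depends on how the training loop handles that step, which the sketch does not model.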
Thank you for your reply. In my case, it actually has a bad effect on my results. After the model is trained, when I run 'eval_stg1.sh', the results (e.g. mAP, AP50, ...) all become zero.
From your screenshot, the recall results seem correct, so I wonder whether the problem is caused by something other than NaN. All the scripts we provided were tested beforehand. Can you check whether the COCO eval metrics in TensorBoard are correct during training? If not, could you please list the exact commands you ran? I cannot investigate based on the current information.
Thanks. We haven't encountered this issue in our training, and it seems other users have not reported it either. I would suggest
There is no need to tune parameters to avoid this issue; the default parameters should work.
Hi,
Hi, when I test your code with train_stg1.sh to compute the teacher model, the logs show that total_cost and wd_cost become NaN. I did not change any code.
The dataset and GPU settings are as follows:
DATASET='coco_train2017.1@10'
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7