The total_cost and wd_cost become NaN #9

Closed
bobzhang123 opened this issue Jul 10, 2020 · 6 comments

@bobzhang123

Hi, when I test your code with train_stg1.sh to train the teacher model, the logs show that total_cost and wd_cost become NaN. I did not change any code.
The dataset and GPUs are as follows:
DATASET='coco_train2017.1@10'
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[screenshot of the training log]

@zizhaozhang
Collaborator

There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the values printed to stdout stay NaN. We haven't fixed this internal issue, but it does not affect training.
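
For illustration, here is a minimal Python sketch (not the actual code in this repo) of why one non-finite loss value makes the printed moving average stay NaN even if later steps are fine:

```python
import math

class MovingAverage:
    """Exponential moving average, similar in spirit to the smoothed
    total_cost / wd_cost values printed during training (illustrative only)."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = None

    def update(self, x):
        # Once x is NaN, the stored value becomes NaN and stays NaN,
        # because NaN propagates through every later arithmetic step.
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1 - self.decay) * x
        return self.value

avg = MovingAverage()
for loss in [0.52, 0.48, float("nan"), 0.45, 0.44]:
    print(avg.update(loss))  # NaN from the third step onward

# One possible guard (hypothetical, not part of this repo): skip non-finite
# updates so the displayed average recovers after a single bad step.
def safe_update(avg, x):
    return avg.value if not math.isfinite(x) else avg.update(x)
```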

@bobzhang123
Author

> There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the values printed to stdout stay NaN. We haven't fixed this internal issue, but it does not affect training.

Thank you for your reply. In my case, it actually has a bad effect on my results. After the model is trained, when I run 'eval_stg1.sh', the results (e.g. mAP, AP50, ...) are all zero.

@zizhaozhang
Collaborator

From your screenshot, the recall results seem correct, so I am wondering whether the problem is caused by something other than the NaN. All the scripts we provide were tested beforehand. Can you check whether the COCO eval metrics in TensorBoard are correct during training?

If not, could you please list all the exact commands you ran? I cannot investigate based on the current information.

@bobzhang123
Author

bobzhang123 commented Jul 16, 2020

> From your screenshot, the recall results seem correct, so I am wondering whether the problem is caused by something other than the NaN. All the scripts we provide were tested beforehand. Can you check whether the COCO eval metrics in TensorBoard are correct during training?
>
> If not, could you please list all the exact commands you ran? I cannot investigate based on the current information.

Hi,
I tried training the code again. This is my training script in 'train_stg1.sh':
[screenshot of train_stg1.sh]
The 25th epoch:
[screenshot of the epoch 25 log]
The 26th epoch:
[screenshot of the epoch 26 log]
The 27th epoch:
[screenshot of the epoch 27 log]

total_cost and wd_cost diverged in the 26th epoch.
The evaluation result after reaching 40 epochs is as follows:
[screenshot of the evaluation results]

I also changed the learning rate from 1e-2 to 1e-3 and still met the same problem.

@zizhaozhang
Collaborator

Thanks. We haven't met this issue in our training, and it seems other users have not reported it either. I would suggest checking:

  1. whether your TensorFlow version (1.14) meets our listed requirement (a quick check is sketched below);
  2. whether you followed the correct data preparation.

There is no need to tune parameters to avoid this issue; the default parameters should work.
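
For the first point, a minimal sanity check (assuming TensorFlow is importable in the training environment) is:

```python
# Verify the installed TensorFlow version against the 1.14.x requirement.
import tensorflow as tf

print(tf.__version__)
assert tf.__version__.startswith("1.14"), (
    "Expected TensorFlow 1.14.x, found " + tf.__version__)
```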

@bobzhang123
Author

bobzhang123 commented Jul 20, 2020

> Thanks. We haven't met this issue in our training, and it seems other users have not reported it either. I would suggest checking:
>
>   1. whether your TensorFlow version (1.14) meets our listed requirement;
>   2. whether you followed the correct data preparation.
>
> There is no need to tune parameters to avoid this issue; the default parameters should work.

Hi,
I have solved the problem. I had been using the released branch 'dependabot/pip/tensorflow-gpu-1.15.2', so my TF version was 1.15.2. Now I use the master branch, downgraded TF to 1.14.0 and Python to 3.6.8, and both the NaN problem and the OOM problem are solved!
It seems that the released branch 'dependabot/pip/tensorflow-gpu-1.15.2' is not robust and still has some bugs.
