The total_cost and wd_cost become NaN #9

Closed
bobzhang123 opened this issue Jul 10, 2020 · 6 comments

@bobzhang123

Hi, when I test your code with train_stg1.sh to train the teacher model, the logs show that total_cost and wd_cost become NaN. I did not change any code.
The dataset and GPUs are as follows:
DATASET='coco_train2017.1@10'
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[screenshot of the training log]

@zizhaozhang
Collaborator

There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the values printed to stdout stay NaN. We haven't fixed this internal issue, but it does not affect training.
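
For illustration, here is a minimal Python sketch (not the actual code in this repo) of why one non-finite loss value makes the printed moving average stay NaN even if later steps are fine:

```python
import math

class MovingAverage:
    """Exponential moving average, similar in spirit to the smoothed
    total_cost / wd_cost values printed during training (illustrative only)."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.value = None

    def update(self, x):
        # Once x is NaN, the stored value becomes NaN and stays NaN,
        # because NaN propagates through every later arithmetic step.
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1 - self.decay) * x
        return self.value

avg = MovingAverage()
for loss in [0.52, 0.48, float("nan"), 0.45, 0.44]:
    print(avg.update(loss))  # NaN from the third step onward

# One possible guard (hypothetical, not part of this repo): skip non-finite
# updates so the displayed average recovers after a single bad step.
def safe_update(avg, x):
    return avg.value if not math.isfinite(x) else avg.update(x)
```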

@bobzhang123
Author

> There is a rare case where a single forward pass can make the loss NaN, and once it is accumulated into the moving averages of the total_cost and wd_cost variables, the values printed to stdout stay NaN. We haven't fixed this internal issue, but it does not affect training.

Thank you for your reply. In my case, it actually has a bad effect on my results. After the model is trained, when I run 'eval_stg1.sh', the results (e.g. mAP, AP50, ...) are all zero.

@zizhaozhang
Collaborator

From your screenshot, the recall results seem correct, so I am wondering whether the problem is caused by something other than the NaN. All the scripts we provide were tested beforehand. Can you check whether the COCO eval metrics in TensorBoard are correct during training?

If not, could you please list all the exact commands you ran? I cannot investigate based on the current information.

@bobzhang123
Author

bobzhang123 commented Jul 16, 2020

> From your screenshot, the recall results seem correct, so I am wondering whether the problem is caused by something other than the NaN. All the scripts we provide were tested beforehand. Can you check whether the COCO eval metrics in TensorBoard are correct during training?
>
> If not, could you please list all the exact commands you ran? I cannot investigate based on the current information.

Hi,
I tried training the code again. This is my training script in 'train_stg1.sh':
[screenshot of train_stg1.sh]
The 25th epoch:
[screenshot of the epoch 25 log]
The 26th epoch:
[screenshot of the epoch 26 log]
The 27th epoch:
[screenshot of the epoch 27 log]

total_cost and wd_cost diverged in the 26th epoch.
The evaluation result after reaching 40 epochs is as follows:
[screenshot of the evaluation results]

I also changed the learning rate from 1e-2 to 1e-3 and still met the same problem.

@zizhaozhang
Collaborator

Thanks. We haven't met this issue in our training, and it seems other users have not reported it either. I would suggest checking:

  1. whether your TensorFlow version (1.14) meets our listed requirement (a quick check is sketched below);
  2. whether you followed the correct data preparation.

There is no need to tune parameters to avoid this issue; the default parameters should work.
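
For the first point, a minimal sanity check (assuming TensorFlow is importable in the training environment) is:

```python
# Verify the installed TensorFlow version against the 1.14.x requirement.
import tensorflow as tf

print(tf.__version__)
assert tf.__version__.startswith("1.14"), (
    "Expected TensorFlow 1.14.x, found " + tf.__version__)
```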

@bobzhang123
Author

bobzhang123 commented Jul 20, 2020

> Thanks. We haven't met this issue in our training, and it seems other users have not reported it either. I would suggest checking:
>
>   1. whether your TensorFlow version (1.14) meets our listed requirement;
>   2. whether you followed the correct data preparation.
>
> There is no need to tune parameters to avoid this issue; the default parameters should work.

Hi,
I have solved the problem. I had been using the released branch 'dependabot/pip/tensorflow-gpu-1.15.2', so my TF version was 1.15.2. Now I use the master branch, downgraded TF to 1.14.0 and Python to 3.6.8, and both the NaN problem and the OOM problem are solved!
It seems that the released branch 'dependabot/pip/tensorflow-gpu-1.15.2' is not robust and still has some bugs.
