Loss or weight nan error on ADE dataset #19
Comments
I tried TensorFlow 1.8 and 1.12 with both py2 and py3, and all of them work fine. See the log below. Check the dataset (http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) and the pretrained model (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz).
I re-downloaded the ADE dataset and it works! By the way, I find that preprocessing takes more than 4 minutes before the first training iteration actually starts. I wonder if most of that time is spent on the multi-GPU mechanism?
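Since re-downloading the dataset resolved the error here, a truncated or corrupted archive is one plausible cause. The following is a minimal sketch for checking the downloaded zip before training, using only the Python standard library. The function name `check_archive` and the local file path are assumptions for illustration and are not part of this repository.

```python
import zipfile


def check_archive(path):
    """Return the name of the first corrupted member, or None if all CRCs match.

    Hypothetical helper, not part of the repository.
    """
    # A truncated download usually lacks the zip end-of-central-directory record.
    if not zipfile.is_zipfile(path):
        raise ValueError("%s is not a readable zip archive" % path)
    with zipfile.ZipFile(path) as archive:
        # testzip() reads every member and verifies its CRC-32 checksum.
        # It returns the first bad member name, or None if everything checks out.
        return archive.testzip()


# Example usage (assumed local path to the downloaded dataset):
#   bad = check_archive("ADEChallengeData2016.zip")
#   print("archive OK" if bad is None else "corrupted member: " + bad)
```

A passing check only rules out a damaged archive. It does not rule out other causes of NaN values, such as the learning rate.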
It usually takes longer (2~3 minutes on my machine) than a single-GPU task. I think TensorFlow needs time to build the forward and backward graph and to set up GPU-GPU and CPU-GPU communication, etc. I don't know if there is a way to accelerate graph creation. Let me know if you have any ideas, or open a pull request directly. About the NaN error, however, even when I use
I also tried the same setting on another server (Tesla M40) and it still worked fine for about 200 iterations. I will look into these issues further. Thanks for the code and the reply!
I tried to train on the ADE dataset, but I still hit the error reported in #15. There are two differences from the example script (3.b):

1. I used `--batch_size 2 --gpu_num 4` because of GPU memory limitations, but I decreased `--lrn_rate` to `0.00001` as suggested in "loss or weight norm is nan. Training Stopped!" (#15).
2. I used the `resnet_v1_101` network and `resnet_v1_101.ckpt` as the pretrained model.

My TensorFlow version is 1.8.0. Any idea about this error? Thanks!
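To narrow down when the values diverge, it can help to log the loss and weight norm at every step and find the first non-finite entry. Below is a minimal sketch of such a check in plain Python. It is not the repository's own stopping logic, and the names `is_bad` and `first_bad_step` are hypothetical.

```python
import math


def is_bad(loss, weight_norm):
    """Return True if either value is NaN or infinite."""
    return not (math.isfinite(loss) and math.isfinite(weight_norm))


def first_bad_step(history):
    """Return the index of the first (loss, weight_norm) pair that is
    NaN or infinite, or None if every logged pair is finite.

    `history` is assumed to be a list of (loss, weight_norm) floats,
    one pair per training step.
    """
    for step, (loss, weight_norm) in enumerate(history):
        if is_bad(loss, weight_norm):
            return step
    return None
```

If the first bad step is always the same with a fixed data order, the batch at that step is a reasonable place to look for a broken image or label.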