Loss or weight nan error on ADE dataset #19

lcybuzz · 2019-01-28T12:12:10Z

I tried to train on the ADE dataset, but I still hit the error reported in #15. My setup differs from the example script (3.b) in two ways:

I used --batch_size 2 --gpu_num 4 because of GPU memory limitations, and I decreased --lrn_rate to 0.00001 as suggested in #15 ("loss or weight norm is nan. Training Stopped!").
I used the resnet_v1_101 network with resnet_v1_101.ckpt as the pretrained model.

My TensorFlow version is 1.8.0. Any idea what causes this error? Thanks!
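
For readers unfamiliar with the message from #15: it comes from a check that stops training once the loss or the weight norm is no longer a finite number. The sketch below only illustrates that kind of guard. It is not the repository's code, and the function name `check_finite` and its arguments are hypothetical.

```python
import math


def check_finite(step, loss, weight_norm):
    """Raise if the loss or the weight norm is NaN or infinite.

    Hypothetical helper for illustration; `loss` and `weight_norm` are
    plain Python floats fetched from the training step.
    """
    if not (math.isfinite(loss) and math.isfinite(weight_norm)):
        raise FloatingPointError(
            "loss or weight norm is nan at step %d (loss=%r, wd=%r)"
            % (step, loss, weight_norm)
        )
```

Calling such a guard after every logged step makes the run fail at the first bad value instead of continuing with corrupted weights.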


holyseven · 2019-01-29T09:10:21Z

I tried TensorFlow 1.8 and 1.12 with both py2 and py3, and all of them work fine. See the log below.

Check the dataset (http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) and the pretrained model (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz).

python ./train.py --batch_size 2 --gpu_num 4 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --train_max_iter 60000 --snapshot 30000 --random_rotate 0 --database 'ADE' --train_image_size 480 --test_image_size 480 --network 'resnet_v1_101' --fine_tune_filename './z_pretrained_weights/resnet_v1_101.ckpt'
GPU devices:  0,1,2,3
{'batch_size': 2, 'blur': 1, 'bn_frozen': 0, 'color_switch': 0, 'consider_dilated': 0, 'data_format': 'NHWC', 'database': 'ADE', 'eval_only': 0, 'fine_tune_filename': './z_pretrained_weights/resnet_v1_101.ckpt', 'float_type': 32, 'gpu_num': 4, 'has_aux_loss': 1, 'initializer': 'he', 'loss_type': 'normal', 'lr_step': None, 'lrn_rate': 0.01, 'mirror': 1, 'momentum': 0.9, 'network': 'resnet_v1_101', 'new_layer_names': None, 'optimizer': 'mom', 'poly_lr': 1, 'random_rotate': 0, 'random_scale': 1, 'resume_step': None, 'save_first_iteration': 0, 'scale_max': 2.0, 'scale_min': 0.5, 'snapshot': 30000, 'step_size': 0.1, 'structure_in_paper': 0, 'subsets_for_training': 'train', 'test_image_size': 480, 'test_max_iter': None, 'train_image_size': 480, 'train_like_in_paper': 0, 'train_max_iter': 60000, 'weight_decay_mode': 0, 'weight_decay_rate': 0.0001, 'weight_decay_rate2': 0.0001}

< using tf.float32 >

Database has 20210 images.
applying random mirror ...
applying random scale [0.500000, 2.000000]...

< Resnet structure >

num_residual_units:  [3, 4, 23, 3]
rates in each atrous convolution:  [1, 1, 2, 4]
stride in each block:  [1, 2, 1, 1]
channels in each block:  [256, 512, 1024, 2048]
shape after pool1:  (2, 120, 120, 64)
shape after block 1:  (2, 120, 120, 256)
shape after block 2:  (2, 60, 60, 512)
aux_logits:  (2, 60, 60, 256)
upsampled auxiliary_x for loss function:  (2, 480, 480, 150)
shape after block 3:  (2, 60, 60, 1024)
pool6 pooled size:  (2, 6, 6, 512)
pool6 output size:  (2, 60, 60, 512)
pool3 pooled size:  (2, 3, 3, 512)
pool3 output size:  (2, 60, 60, 512)
pool2 pooled size:  (2, 2, 2, 512)
pool2 output size:  (2, 60, 60, 512)
pool1 pooled size:  (2, 1, 1, 512)
pool1 output size:  (2, 60, 60, 512)
shape after block 4:  (2, 60, 60, 512)
logits:  (2, 60, 60, 512)
logits after upsampling:  (2, 480, 480, 150)
normal cross entropy with softmax ... 

< weight decay info >

Applying L2 regularization...
============================================
=============== LogDir Info ================
log_dir ./log
database_dir ./log/ADE
exp_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1
snapshot_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1/snapshot
=============== LogDir Info ================
============================================

2019-01-29 09:56:40.924629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2019-01-29 09:56:42.236680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-29 09:56:42.236729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 
2019-01-29 09:56:42.236753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y 
2019-01-29 09:56:42.236759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y 
2019-01-29 09:56:42.236762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y 
2019-01-29 09:56:42.236766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N 
< Finetuning Process: not import resnet_v1_101/block3/unit_24/weights:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/biases:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/weights:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/logits/biases:0 >
< Succesfully loaded fine-tune model from ./z_pretrained_weights/resnet_v1_101.ckpt. >

< training process begins >

2019-01-29 09:57:55.321205 39990] Step 20, lr = 0.009997, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 5.5795, precision = 0.0129, wd = 0.6102
	 estimated time left: 0.0 hours. 20/60000
2019-01-29 09:58:05.562760 39990] Step 40, lr = 0.009994, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 4.4038, precision = 0.0200, wd = 0.6112
	 estimated time left: 8.5 hours. 40/60000
2019-01-29 09:58:15.895888 39990] Step 60, lr = 0.009991, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.7097, precision = 0.0241, wd = 0.6118
	 estimated time left: 8.6 hours. 60/60000
2019-01-29 09:58:26.273976 39990] Step 80, lr = 0.009988, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5920, precision = 0.0270, wd = 0.6120
	 estimated time left: 8.6 hours. 80/60000
2019-01-29 09:58:36.568903 39990] Step 100, lr = 0.009985, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.6517, precision = 0.0302, wd = 0.6123
	 estimated time left: 8.6 hours. 100/60000
2019-01-29 09:58:46.962572 39990] Step 120, lr = 0.009982, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5043, precision = 0.0301, wd = 0.6124
	 estimated time left: 8.6 hours. 120/60000
2019-01-29 09:58:57.264132 39990] Step 140, lr = 0.009979, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5823, precision = 0.0321, wd = 0.6126
	 estimated time left: 8.6 hours. 140/60000
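
The `lr` values in the log above are consistent with a polynomial ("poly") decay schedule, which the config enables via `'poly_lr': 1` with `'lrn_rate': 0.01` and `'train_max_iter': 60000`. The sketch below reproduces them under one assumption that the log does not state: a decay power of 0.9. The function name `poly_lr` is illustrative and not taken from the repository.

```python
def poly_lr(base_lr, step, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - step / max_iter) ** power.

    power=0.9 is an assumption inferred from the logged values,
    not a setting shown in the config dump.
    """
    return base_lr * (1.0 - step / float(max_iter)) ** power


# Print the schedule at the steps that appear in the log.
for step in (20, 40, 60, 80, 100, 120, 140):
    print("Step %d, lr = %.6f" % (step, poly_lr(0.01, step, 60000)))
```

With these inputs the printed values match the log, from 0.009997 at step 20 down to 0.009979 at step 140. A power of 1.0 would give 0.009993 at step 40 rather than the logged 0.009994, which is why 0.9 is the more plausible guess.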

lcybuzz · 2019-01-29T12:22:03Z

I re-downloaded the ADE dataset and it works! However, I find that batch_size * gpu_num indeed should not be too small. My training still fails occasionally with --batch_size 2 --gpu_num 4 on a Tesla K80.
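
Since re-downloading fixed the problem, an incomplete or corrupted copy of the dataset is a plausible cause. A rough way to catch that before training is to check that every image has a non-empty annotation file. The sketch below assumes the unpacked archive has `images/<split>/*.jpg` and `annotations/<split>/*.png` subfolders; that layout and the function name `check_ade_split` are assumptions, not something stated in this thread.

```python
import os


def check_ade_split(root, split="training"):
    """Return (num_images, missing_annotations, empty_files) for one split.

    Assumes root/images/<split>/NAME.jpg pairs with
    root/annotations/<split>/NAME.png (an assumed layout).
    """
    img_dir = os.path.join(root, "images", split)
    ann_dir = os.path.join(root, "annotations", split)
    images = sorted(f for f in os.listdir(img_dir) if f.lower().endswith(".jpg"))
    missing, empty = [], []
    for name in images:
        img_path = os.path.join(img_dir, name)
        ann_path = os.path.join(ann_dir, os.path.splitext(name)[0] + ".png")
        if not os.path.isfile(ann_path):
            # Image without a label map: the pair cannot be used for training.
            missing.append(name)
        elif os.path.getsize(img_path) == 0 or os.path.getsize(ann_path) == 0:
            # Zero-byte files usually indicate a truncated download or unzip.
            empty.append(name)
    return len(images), missing, empty
```

For the training split, the image count can be compared with the "Database has 20210 images." line in the log above. This check only catches missing and zero-byte files; it does not decode the images, so a partially corrupted file would still pass.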

By the way, preprocessing takes more than 4 minutes before the first training iteration actually starts. I wonder if most of that time is spent on the multi-GPU mechanism?

holyseven · 2019-01-29T14:14:17Z

It usually takes longer (2~3 minutes on my machine) than a single-GPU task. I think TensorFlow needs time to build the forward and backward graph and to set up GPU-GPU and CPU-GPU communication, etc. I don't know whether there is a way to accelerate graph creation. Let me know if you have any ideas, or open a pull request directly.

As for the NaN error, even when I use --batch_size 1 --gpu_num 4, there is no NaN error, at least in the first 100 iterations (repeated 5 times). I am not sure what is happening. Let me know if you have any thoughts.

lcybuzz · 2019-01-29T14:36:45Z

I also tried the same setting on another server (Tesla M40) and it still worked fine for about 200 iterations. I will look into these issues further.

Thanks for the code and the reply!

lcybuzz closed this as completed Jan 29, 2019

holyseven mentioned this issue Mar 16, 2019: Does loss regularization always improve accuracy? #20 (closed)
