Loss or weight nan error on ADE dataset #19

Closed
lcybuzz opened this issue Jan 28, 2019 · 4 comments

@lcybuzz

lcybuzz commented Jan 28, 2019

I tried to train on the ADE dataset, but I still hit the error reported in #15. There are two differences from the example script (3.b):

  1. I used --batch_size 2 --gpu_num 4 because of GPU memory limitations, and I decreased --lrn_rate to 0.00001 as suggested in "loss or weight norm is nan. Training Stopped!" (#15).

  2. I used the resnet_v1_101 network with resnet_v1_101.ckpt as the pretrained model.

My TensorFlow version is 1.8.0. Any idea what causes this error? Thanks!

@holyseven
Owner

I tried TensorFlow 1.8 and 1.12 with both Python 2 and Python 3, and all combinations work fine. See the log below.

Check the dataset (http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) and the pretrained model (http://download.tensorflow.org/models/resnet_v1_101_2016_08_28.tar.gz).

python ./train.py --batch_size 2 --gpu_num 4 --weight_decay_mode 0 --weight_decay_rate 0.0001 --weight_decay_rate2 0.0001 --train_max_iter 60000 --snapshot 30000 --random_rotate 0 --database 'ADE' --train_image_size 480 --test_image_size 480 --network 'resnet_v1_101' --fine_tune_filename './z_pretrained_weights/resnet_v1_101.ckpt'
GPU devices:  0,1,2,3
{'batch_size': 2, 'blur': 1, 'bn_frozen': 0, 'color_switch': 0, 'consider_dilated': 0, 'data_format': 'NHWC', 'database': 'ADE', 'eval_only': 0, 'fine_tune_filename': './z_pretrained_weights/resnet_v1_101.ckpt', 'float_type': 32, 'gpu_num': 4, 'has_aux_loss': 1, 'initializer': 'he', 'loss_type': 'normal', 'lr_step': None, 'lrn_rate': 0.01, 'mirror': 1, 'momentum': 0.9, 'network': 'resnet_v1_101', 'new_layer_names': None, 'optimizer': 'mom', 'poly_lr': 1, 'random_rotate': 0, 'random_scale': 1, 'resume_step': None, 'save_first_iteration': 0, 'scale_max': 2.0, 'scale_min': 0.5, 'snapshot': 30000, 'step_size': 0.1, 'structure_in_paper': 0, 'subsets_for_training': 'train', 'test_image_size': 480, 'test_max_iter': None, 'train_image_size': 480, 'train_like_in_paper': 0, 'train_max_iter': 60000, 'weight_decay_mode': 0, 'weight_decay_rate': 0.0001, 'weight_decay_rate2': 0.0001}
< using tf.float32 >

Database has 20210 images.
applying random mirror ...
applying random scale [0.500000, 2.000000]...

< Resnet structure >

num_residual_units:  [3, 4, 23, 3]
rates in each atrous convolution:  [1, 1, 2, 4]
stride in each block:  [1, 2, 1, 1]
channels in each block:  [256, 512, 1024, 2048]
shape after pool1:  (2, 120, 120, 64)
shape after block 1:  (2, 120, 120, 256)
shape after block 2:  (2, 60, 60, 512)
aux_logits:  (2, 60, 60, 256)
upsampled auxiliary_x for loss function:  (2, 480, 480, 150)
shape after block 3:  (2, 60, 60, 1024)
pool6 pooled size:  (2, 6, 6, 512)
pool6 output size:  (2, 60, 60, 512)
pool3 pooled size:  (2, 3, 3, 512)
pool3 output size:  (2, 60, 60, 512)
pool2 pooled size:  (2, 2, 2, 512)
pool2 output size:  (2, 60, 60, 512)
pool1 pooled size:  (2, 1, 1, 512)
pool1 output size:  (2, 60, 60, 512)
shape after block 4:  (2, 60, 60, 512)
logits:  (2, 60, 60, 512)
logits after upsampling:  (2, 480, 480, 150)
normal cross entropy with softmax ... 

< weight decay info >

Applying L2 regularization...
============================================
=============== LogDir Info ================
log_dir ./log
database_dir ./log/ADE
exp_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1
snapshot_dir ./log/ADE/resnet_v1_101-480-train-L2-wd_alpha0.0001-wd_beta0.0001-batch_size8-lrn_rate0.01-consider_dilated0-random_rotate0-random_scale1/snapshot
=============== LogDir Info ================
============================================
2019-01-29 09:56:40.924629: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2, 3
2019-01-29 09:56:42.236680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-29 09:56:42.236729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 3 
2019-01-29 09:56:42.236753: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y Y Y 
2019-01-29 09:56:42.236759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N Y Y 
2019-01-29 09:56:42.236762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   Y Y N Y 
2019-01-29 09:56:42.236766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 3:   Y Y Y N 
< Finetuning Process: not import resnet_v1_101/block3/unit_24/weights:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block3/unit_24/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/aux_logits/biases:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool6/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool3/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool2/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/weights:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/psp/pool1/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/weights:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/beta:0 >
< Finetuning Process: not import resnet_v1_101/block4/unit_4/BatchNorm/gamma:0 >
< Finetuning Process: not import resnet_v1_101/logits/weights:0 >
< Finetuning Process: not import resnet_v1_101/logits/biases:0 >
< Succesfully loaded fine-tune model from ./z_pretrained_weights/resnet_v1_101.ckpt. >

< training process begins >

2019-01-29 09:57:55.321205 39990] Step 20, lr = 0.009997, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 5.5795, precision = 0.0129, wd = 0.6102
	 estimated time left: 0.0 hours. 20/60000
2019-01-29 09:58:05.562760 39990] Step 40, lr = 0.009994, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 4.4038, precision = 0.0200, wd = 0.6112
	 estimated time left: 8.5 hours. 40/60000
2019-01-29 09:58:15.895888 39990] Step 60, lr = 0.009991, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.7097, precision = 0.0241, wd = 0.6118
	 estimated time left: 8.6 hours. 60/60000
2019-01-29 09:58:26.273976 39990] Step 80, lr = 0.009988, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5920, precision = 0.0270, wd = 0.6120
	 estimated time left: 8.6 hours. 80/60000
2019-01-29 09:58:36.568903 39990] Step 100, lr = 0.009985, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.6517, precision = 0.0302, wd = 0.6123
	 estimated time left: 8.6 hours. 100/60000
2019-01-29 09:58:46.962572 39990] Step 120, lr = 0.009982, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5043, precision = 0.0301, wd = 0.6124
	 estimated time left: 8.6 hours. 120/60000
2019-01-29 09:58:57.264132 39990] Step 140, lr = 0.009979, wd_rate = 0.000100, wd_rate_2 = 0.000100 
	 loss = 3.5823, precision = 0.0321, wd = 0.6126
	 estimated time left: 8.6 hours. 140/60000
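
The `lr` values in the log above are consistent with a polynomial ("poly") decay of the base rate 0.01 over 60000 iterations (the run has 'poly_lr': 1). Here is a minimal sketch that reproduces the logged values, assuming an exponent of 0.9. The exponent is inferred from the numbers in the log and is not confirmed anywhere in this thread, and `poly_lr` is a hypothetical helper name, not a function from the repository.

```python
def poly_lr(base_lr, step, max_iter, power=0.9):
    """Polynomial learning-rate decay: base_lr * (1 - step / max_iter) ** power.

    power=0.9 is an assumption inferred from the logged values,
    not taken from the repository's code.
    """
    return base_lr * (1.0 - step / float(max_iter)) ** power


if __name__ == "__main__":
    # Values from the log: 'lrn_rate': 0.01, 'train_max_iter': 60000.
    for step in (20, 40, 60, 80, 100, 120, 140):
        print("Step %d, lr = %.6f" % (step, poly_lr(0.01, step, 60000)))
```

Note that the run above uses a learning rate of 0.01 (no --lrn_rate flag is passed), whereas the original report used 0.00001.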

@lcybuzz
Author

lcybuzz commented Jan 29, 2019

I re-downloaded the ADE dataset and it works! However, I find that batch_size * gpu_num indeed should not be too small: my training still fails occasionally with --batch_size 2 --gpu_num 4 on a Tesla K80.
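
Since re-downloading the dataset fixed the problem here, it may help to check the archive before extracting it. The sketch below uses only Python's standard `zipfile` module. The helper name `check_zip` and the local file name are assumptions for illustration, and the check only verifies the archive itself, not the label values inside the images.

```python
import os
import zipfile


def check_zip(path):
    """Return (first_bad_member_or_None, number_of_jpg_entries).

    ZipFile.testzip() reads every member and verifies its CRC; it returns
    the name of the first corrupt member, or None if all members are intact.
    A truncated download typically fails earlier, with zipfile.BadZipFile.
    """
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()
        jpg_count = sum(1 for name in archive.namelist()
                        if name.lower().endswith(".jpg"))
    return bad_member, jpg_count


if __name__ == "__main__":
    # Hypothetical local path to the downloaded archive.
    archive_path = "ADEChallengeData2016.zip"
    if os.path.isfile(archive_path):
        bad, n_jpg = check_zip(archive_path)
        print("first corrupt member:", bad)
        print("number of .jpg entries:", n_jpg)
    else:
        print("archive not found:", archive_path)
```

Comparing the entry count with a known-good copy is a cheap extra check for a partial download.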

By the way, preprocessing takes more than 4 minutes before the first training iteration actually starts. Is most of that time spent on the multi-GPU mechanism?

@holyseven
Owner

A multi-GPU run usually takes longer to start than a single-GPU one (2~3 minutes on my machine). I think TensorFlow needs time to build the forward and backward graph and to set up GPU-GPU and CPU-GPU communication, etc. I don't know whether there is a way to accelerate graph creation. Let me know if you have ideas, or open a pull request directly.
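
One way to see where the startup time goes is to time each phase separately. This is a generic sketch (Python 3, standard library only), not code from the repository: `timed` is a hypothetical helper, and the `time.sleep` calls are placeholders for the real graph construction and first training step.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


if __name__ == "__main__":
    durations = {}
    with timed("graph construction", durations):
        time.sleep(0.01)  # placeholder for building the multi-GPU graph
    with timed("first iteration", durations):
        time.sleep(0.01)  # placeholder for the first training step
    for label, seconds in durations.items():
        print("%s: %.3f s" % (label, seconds))
```

Wrapping the real phases this way would show whether graph construction or the first session run dominates the wait.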

As for the NaN error, even with --batch_size 1 --gpu_num 4 I see no NaN error in at least the first 100 iterations (repeated 5 times). I am not sure what is happening. Let me know if you have any thoughts.
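
To narrow down when a run diverges, a finiteness check on the monitored values can report the exact step. The following is a minimal sketch of such a guard, not the repository's implementation: `check_finite` is a hypothetical helper, and the weight-norm value in the example is made up for illustration.

```python
import math


def check_finite(step, loss, weight_norm):
    """Raise FloatingPointError if loss or weight norm is NaN or infinite."""
    for name, value in (("loss", loss), ("weight norm", weight_norm)):
        if not math.isfinite(value):
            raise FloatingPointError(
                "%s is %r at step %d; stopping training" % (name, value, step))


if __name__ == "__main__":
    # Loss taken from step 20 of the log; the weight norm 1.0 is illustrative.
    check_finite(20, 5.5795, 1.0)
    try:
        check_finite(21, float("nan"), 1.0)
    except FloatingPointError as err:
        print(err)
```

Logging the step at which the check first fails makes it easier to compare runs across batch sizes and GPUs.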

@lcybuzz
Author

lcybuzz commented Jan 29, 2019

I also tried the same setting on another server (Tesla M40), and it worked fine for about 200 iterations. I will look into these issues further.

Thanks for the code and the reply!
