

Training on VOC from Scratch #49

Open
mcever opened this issue Apr 11, 2019 · 2 comments


mcever commented Apr 11, 2019

Hi,

I am attempting to train this network on VOC from scratch, essentially trying to recreate the pre-trained weights available for download; however, after 70+ epochs, my model is still just predicting background for an mIOU of 3.49%. Here is the command I am running to train:

python issegm/voc.py --gpus 1,2,3 --split train --data-root data/VOCdevkit/ --output train_out/ --model voc_rna-a1_cls21 --batch-images 12 --crop-size 500 --origin-size 2048 --scale-rate-range 0.7,1.3 --lr-type fixed --base-lr 0.0016 --to-epoch 140 --kvstore local --prefetch-threads 4 --prefetcher thread --backward-do-mirror
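As a sanity check on that number: a model that predicts only background scores near zero mIOU, since background is the only class with any intersection. Here is a minimal confusion-matrix sketch of the metric (the function name and the toy arrays are illustrative, not from this repo):

```python
import numpy as np

def mean_iou(pred, label, num_classes=21, ignore_index=255):
    """Mean intersection-over-union from a pixel-wise confusion matrix."""
    mask = label != ignore_index
    pred, label = pred[mask], label[mask]
    # Confusion matrix: rows are ground truth, columns are predictions.
    cm = np.bincount(label * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = inter / union
    # Classes absent from both prediction and label give NaN and are skipped.
    return np.nanmean(iou)

# Toy example: ground truth with classes 0-3, prediction all background.
label = np.array([0, 0, 1, 2, 3])
pred = np.zeros_like(label)
print(mean_iou(pred, label))   # background IoU 0.4, everything else 0
```

With only 4 of 21 classes present, the all-background prediction still averages to just 0.1 here, which is the same failure mode as the 3.49% above.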

Inside data/VOCdevkit/VOC2012 I have the original download of JPEGImages and SegmentationClass, which provides the full color segmentation images. Any help would be much appreciated.
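One detail worth checking (a possible pitfall, not confirmed as the cause here): the "full color" SegmentationClass PNGs are palette-mode images, so PIL already yields class indices (0 to 20, with 255 for the ignored border), and no colour-to-class mapping is needed. A small sketch with a hypothetical 2x2 mask:

```python
import numpy as np
from PIL import Image

# Build a tiny palette ("P" mode) PNG the same way VOC stores its masks:
# pixel values are class indices (0 = background, 255 = ignored border),
# and the palette only controls how viewers colour them.
indices = np.array([[0, 1], [20, 255]], dtype=np.uint8)  # hypothetical mask
mask = Image.fromarray(indices, mode="P")
mask.putpalette([0, 0, 0, 128, 0, 0] + [0] * (254 * 3))  # VOC-style palette
mask.save("tiny_mask.png")

# Reading it back: np.array on a "P"-mode image returns the class
# indices directly, not RGB triples.
loaded = np.array(Image.open("tiny_mask.png"))
print(loaded.tolist())  # [[0, 1], [20, 255]]
```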

Here's a snippet of output that may or may not help, showing fcn_valid moving around a lot. I'm not entirely sure what this output means, so any explanation of it would be useful.

2019-04-11 15:00:09,073 Host Epoch[78] Batch [66-67] Speed: 11.93 samples/sec fcn_valid=0.623302
2019-04-11 15:00:10,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:10,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:10,075 Host Epoch[78] Batch [67-68] Speed: 11.98 samples/sec fcn_valid=0.644102
2019-04-11 15:00:10,076 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,055 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,056 Host Labels: 0 0.6 -1.0
Waited for 3.50475311279e-05 seconds
2019-04-11 15:00:11,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,077 Host Epoch[78] Batch [68-69] Speed: 11.98 samples/sec fcn_valid=0.632405
2019-04-11 15:00:12,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,058 Host Labels: 0 0.6 -1.0
Waited for 2.50339508057e-05 seconds
2019-04-11 15:00:12,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,077 Host Epoch[78] Batch [69-70] Speed: 12.00 samples/sec fcn_valid=0.775874
2019-04-11 15:00:13,057 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:13,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,077 Host Epoch[78] Batch [70-71] Speed: 12.01 samples/sec fcn_valid=0.562744
2019-04-11 15:00:14,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,058 Host Labels: 0 0.6 -1.0
Waited for 0.000184059143066 seconds
2019-04-11 15:00:14,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,075 Host Epoch[78] Batch [71-72] Speed: 12.03 samples/sec fcn_valid=0.552027


mcever commented Apr 23, 2019

After much debugging, I found that part of my issue was that I was training without initializing the weights, so my predictions quickly converged to NaNs. I decided to retrain, this time initializing from the ImageNet weights like so:

Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=0, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='fixed', model='voc_rna-a1_cls21', origin_size=2048, output='train+_out/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='models/ilsvrc-cls_rna-a_cls1000_ep-0001.params')
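A quick way to confirm that weights have diverged is to scan a checkpoint's arrays for non-finite values. A sketch using a hypothetical parameter dict (standing in for what, e.g., loading a .params file would return):

```python
import numpy as np

def find_bad_params(params):
    """Return names of parameter arrays containing NaN or Inf values."""
    return [name for name, arr in params.items()
            if not np.isfinite(arr).all()]

# Hypothetical parameter dict standing in for a loaded checkpoint.
params = {
    "conv0_weight": np.random.randn(8, 3, 3, 3),
    "fc_weight": np.full((21, 512), np.nan),  # a layer that diverged
}
print(find_bad_params(params))  # -> ['fc_weight']
```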

Meanwhile, I ran validation on the validation and train+ sets every 5 epochs to track training progress. Performance on the validation set began to stabilize around epoch 250 at roughly 45 mIOU, so I then began reducing the learning rate like so:

2019-04-19 16:52:48,408 Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=240, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='linear', model='voc_rna-a1_cls21', origin_size=2048, output='train+_outp2/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='train+_out/voc_rna-a1_cls21_ep-0240.params')
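For reference, a "linear" schedule of this kind typically anneals the rate to zero over the remaining epochs. A sketch of that behaviour (the function and its exact form are an assumption, not this repo's code):

```python
def linear_lr(base_lr, epoch, from_epoch, to_epoch):
    """Linearly anneal the learning rate from base_lr down to zero
    between from_epoch and to_epoch (a typical 'linear' schedule)."""
    progress = (epoch - from_epoch) / float(to_epoch - from_epoch)
    return base_lr * max(0.0, 1.0 - progress)

# Resuming at epoch 240 with base_lr=0.0016 and to_epoch=500:
for epoch in (240, 370, 500):
    print(epoch, linear_lr(0.0016, epoch, 240, 500))
```

Under those arguments the rate starts at the full 0.0016 at epoch 240, is halved by epoch 370, and reaches zero at epoch 500.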

Now, after about 410 epochs in total (I started reducing the learning rate at epoch 240), I am still only achieving a maximum of 54.77 mIOU on the validation set. This is much lower than the results presented in the paper. Any advice on how to improve would be greatly appreciated.

@rulixiang

Hi @mcever, I'm also trying to reproduce the results on the VOC 2012 dataset. Have you managed to reproduce the results reported in the paper? If so, could you share your training command?
