

Training on VOC from Scratch #49

Open
mcever opened this issue Apr 11, 2019 · 2 comments


mcever commented Apr 11, 2019

Hi,

I am attempting to train this network on VOC from scratch, essentially trying to recreate the pre-trained weights available for download; however, after 70+ epochs, my model is still just predicting background for an mIOU of 3.49%. Here is the command I am running to train:

python issegm/voc.py --gpus 1,2,3 --split train --data-root data/VOCdevkit/ --output train_out/ --model voc_rna-a1_cls21 --batch-images 12 --crop-size 500 --origin-size 2048 --scale-rate-range 0.7,1.3 --lr-type fixed --base-lr 0.0016 --to-epoch 140 --kvstore local --prefetch-threads 4 --prefetcher thread --backward-do-mirror
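As a sanity check on that number: a model that predicts only background scores near zero mIOU, since background is the only class with any intersection. Here is a minimal confusion-matrix sketch of the metric (the function name and the toy arrays are illustrative, not from this repo):

```python
import numpy as np

def mean_iou(pred, label, num_classes=21, ignore_index=255):
    """Mean intersection-over-union from a pixel-wise confusion matrix."""
    mask = label != ignore_index
    pred, label = pred[mask], label[mask]
    # Confusion matrix: rows are ground truth, columns are predictions.
    cm = np.bincount(label * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = inter / union
    # Classes absent from both prediction and label give NaN and are skipped.
    return np.nanmean(iou)

# Toy example: ground truth with classes 0-3, prediction all background.
label = np.array([0, 0, 1, 2, 3])
pred = np.zeros_like(label)
print(mean_iou(pred, label))   # background IoU 0.4, everything else 0
```

With only 4 of 21 classes present, the all-background prediction still averages to just 0.1 here, which is the same failure mode as the 3.49% above.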

Inside data/VOCdevkit/VOC2012 I have the original download of JPEGImages and SegmentationClass, which provides the full color segmentation images. Any help would be much appreciated.
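One detail worth checking (a possible pitfall, not confirmed as the cause here): the "full color" SegmentationClass PNGs are palette-mode images, so PIL already yields class indices (0 to 20, with 255 for the ignored border), and no colour-to-class mapping is needed. A small sketch with a hypothetical 2x2 mask:

```python
import numpy as np
from PIL import Image

# Build a tiny palette ("P" mode) PNG the same way VOC stores its masks:
# pixel values are class indices (0 = background, 255 = ignored border),
# and the palette only controls how viewers colour them.
indices = np.array([[0, 1], [20, 255]], dtype=np.uint8)  # hypothetical mask
mask = Image.fromarray(indices, mode="P")
mask.putpalette([0, 0, 0, 128, 0, 0] + [0] * (254 * 3))  # VOC-style palette
mask.save("tiny_mask.png")

# Reading it back: np.array on a "P"-mode image returns the class
# indices directly, not RGB triples.
loaded = np.array(Image.open("tiny_mask.png"))
print(loaded.tolist())  # [[0, 1], [20, 255]]
```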

Here's a snippet of output that may or may not help, showing fcn_valid moving around a lot. I'm not entirely sure what this output means, so any explanation of it would be useful.

2019-04-11 15:00:09,073 Host Epoch[78] Batch [66-67] Speed: 11.93 samples/sec fcn_valid=0.623302
2019-04-11 15:00:10,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:10,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:10,075 Host Epoch[78] Batch [67-68] Speed: 11.98 samples/sec fcn_valid=0.644102
2019-04-11 15:00:10,076 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,055 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,056 Host Labels: 0 0.6 -1.0
Waited for 3.50475311279e-05 seconds
2019-04-11 15:00:11,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:11,077 Host Epoch[78] Batch [68-69] Speed: 11.98 samples/sec fcn_valid=0.632405
2019-04-11 15:00:12,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,058 Host Labels: 0 0.6 -1.0
Waited for 2.50339508057e-05 seconds
2019-04-11 15:00:12,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:12,077 Host Epoch[78] Batch [69-70] Speed: 12.00 samples/sec fcn_valid=0.775874
2019-04-11 15:00:13,057 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,058 Host Labels: 0 0.6 -1.0
Waited for 2.59876251221e-05 seconds
2019-04-11 15:00:13,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:13,077 Host Epoch[78] Batch [70-71] Speed: 12.01 samples/sec fcn_valid=0.562744
2019-04-11 15:00:14,056 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,058 Host Labels: 0 0.6 -1.0
Waited for 0.000184059143066 seconds
2019-04-11 15:00:14,074 Host Labels: 0 0.6 -1.0
2019-04-11 15:00:14,075 Host Epoch[78] Batch [71-72] Speed: 12.03 samples/sec fcn_valid=0.552027


mcever commented Apr 23, 2019

After much debugging, I found that part of my issue was that I was training without initializing the weights, so my predictions quickly converged to NaNs. I decided to retrain, this time initializing from the ImageNet weights like so:

Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=0, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='fixed', model='voc_rna-a1_cls21', origin_size=2048, output='train+_out/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='models/ilsvrc-cls_rna-a_cls1000_ep-0001.params')
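A quick way to confirm that weights have diverged is to scan a checkpoint's arrays for non-finite values. A sketch using a hypothetical parameter dict (standing in for what, e.g., loading a .params file would return):

```python
import numpy as np

def find_bad_params(params):
    """Return names of parameter arrays containing NaN or Inf values."""
    return [name for name, arr in params.items()
            if not np.isfinite(arr).all()]

# Hypothetical parameter dict standing in for a loaded checkpoint.
params = {
    "conv0_weight": np.random.randn(8, 3, 3, 3),
    "fc_weight": np.full((21, 512), np.nan),  # a layer that diverged
}
print(find_bad_params(params))  # -> ['fc_weight']
```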

Meanwhile, I ran validation on the validation and train+ sets every 5 epochs to track training progress. Performance on the validation set began to stabilize around epoch 250 at roughly 45 mIOU, so I then began reducing the learning rate like so:

2019-04-19 16:52:48,408 Host start with arguments Namespace(backward_do_mirror=True, base_lr=0.0016, batch_images=12, cache_images=None, check_start=1, check_step=4, crop_size=500, data_root='data/VOCdevkit/', dataset=None, debug=False, from_epoch=240, gpus='1,2,3', kvstore='local', log_file='voc_rna-a1_cls21.log', lr_steps=None, lr_type='linear', model='voc_rna-a1_cls21', origin_size=2048, output='train+_outp2/', phase='train', prefetch_threads=4, prefetcher='thread', save_predictions=False, save_results=True, scale_rate_range='0.7,1.3', split='train+', stop_epoch=None, test_flipping=False, test_scales=None, test_steps=1, to_epoch=500, weight_decay=0.0005, weights='train+_out/voc_rna-a1_cls21_ep-0240.params')
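For reference, a "linear" schedule of this kind typically anneals the rate to zero over the remaining epochs. A sketch of that behaviour (the function and its exact form are an assumption, not this repo's code):

```python
def linear_lr(base_lr, epoch, from_epoch, to_epoch):
    """Linearly anneal the learning rate from base_lr down to zero
    between from_epoch and to_epoch (a typical 'linear' schedule)."""
    progress = (epoch - from_epoch) / float(to_epoch - from_epoch)
    return base_lr * max(0.0, 1.0 - progress)

# Resuming at epoch 240 with base_lr=0.0016 and to_epoch=500:
for epoch in (240, 370, 500):
    print(epoch, linear_lr(0.0016, epoch, 240, 500))
```

Under those arguments the rate starts at the full 0.0016 at epoch 240, is halved by epoch 370, and reaches zero at epoch 500.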

Now, after about 410 epochs in total (I started reducing the learning rate at epoch 240), I am still only achieving a maximum of 54.77 mIOU on the validation set. This is much lower than the results presented in the paper. Any advice on how to improve would be greatly appreciated.

@rulixiang

Hi @mcever, I'm also trying to reproduce the results on the VOC 2012 dataset. Have you managed to reproduce the results reported in the paper? If so, could you share your training command?
