Unable to reproduce Fig 1A #2

Open
JoshVarty opened this issue Dec 6, 2017 · 10 comments

Comments

@JoshVarty

I'm trying to reproduce evidence of superconvergence in Figure 1A shown below:
[Figure 1A from the paper]

I am using the following values in solver.prototxt:

net: "/home/jovarty/git/super-convergence/architectures/Resnet56Cifar.prototxt"
test_iter: 200
test_interval: 100
display: 100
lr_policy: "triangular"
base_lr: 0.1
max_lr:  3.0
stepsize: 5000
max_iter: 10000
solver_mode: GPU
weight_decay: 1e-4
momentum: 0.9

I have implemented the triangular cyclical learning rate policy in Caffe as specified here.
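
For reference, the triangular schedule from the CLR paper can be sketched in a few lines of Python (this is only an illustrative sketch, not the Caffe implementation itself), using the base_lr, max_lr, and stepsize values from the solver above:

import math

def triangular_lr(iteration, base_lr=0.1, max_lr=3.0, stepsize=5000):
    # Triangular cyclical learning rate: ramp linearly from base_lr up to
    # max_lr over `stepsize` iterations, then back down over the next
    # `stepsize` iterations, and repeat.
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# With the solver settings above, the rate peaks at 3.0 at iteration 5000
# and returns to 0.1 at iteration 10000 (max_iter).
print(triangular_lr(0), triangular_lr(5000), triangular_lr(10000))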

I am using Resnet56Cifar.prototxt as the network.

I am achieving a final training accuracy of ~90%, but a final test accuracy as low as 10-20%.

I note that the paper specifies that you achieve these results with large batch sizes of ~1,000 images. However, I wouldn't expect a smaller batch size (125, as specified in Resnet56Cifar.prototxt) to completely destroy the results.

Are there any additional steps I must take to reproduce this work?

@JoshVarty (Author)

I should note that I originally attempted to reproduce this work in TensorFlow here. However, after achieving similar results I wanted to sanity-check that I could at least reproduce the results with the provided Caffe code.

@JoshVarty (Author)

From the paper:

We tested the effect of batch normalization on super-convergence. Initially, we found that having use_global_stats : true in the test phase prevents super-convergence. However, we realized this was due to using the default value of moving_average_fraction = 0.999 that is only appropriate for the typical, long training times.

The current Resnet56Cifar.prototxt uses the following:

batch_norm_param {
  use_global_stats: true
  moving_average_fraction: 0.999
}

Are these correct? Are there any other changes to the network I should make before trying to reproduce?
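
As background on why the fraction matters for such a short run: Caffe's BatchNorm layer maintains the global mean and variance as an exponential moving average whose decay is moving_average_fraction, so the statistics applied when use_global_stats: true is set reflect the recent history of training batches. The rough Python sketch below (an illustration of the averaging behaviour, not Caffe code) shows how much weight the most recent batches carry under each setting:

def recent_weight(decay, n):
    # Approximate fraction of an exponential moving average's weight that
    # comes from the most recent n updates, for decay = moving_average_fraction.
    return 1 - decay ** n

# With 0.999 the running statistics average over roughly the last ~1000
# batches; with 0.95 they track the last few dozen, so they stay much
# closer to the current weights during a short, high-learning-rate run.
for decay, n in [(0.999, 100), (0.999, 1000), (0.95, 100)]:
    print(f"decay={decay}: last {n} batches carry {recent_weight(decay, n):.3f} of the weight")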

@lnsmith54 (Owner)

Josh,

Thank you for your efforts at reproducing our results. I had hoped it would be much simpler to do.

You might take a look at the x.sh file. I created that shell script to make modifications to the files and submit them to our GPU system. It shows the change to moving_average_fraction from 0.999 to 0.95 for Figure 1. This script might answer some of your questions.

Based on your comments, I plan to upload the output for a job that created the super-convergence run for Figure 1 (my server is currently down). That should help answer your questions and make reproducing the results easier.

Thanks,
Leslie
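
The exact contents of x.sh aren't reproduced in this thread, but the kind of edit it describes, rewriting moving_average_fraction in the network prototxt before launching a run, can be sketched as follows. This is a hypothetical Python equivalent; the output path and the single string substitution are assumptions, not the script's actual code.

from pathlib import Path

# Hypothetical sketch of the kind of substitution x.sh performs: copy the
# network definition and lower moving_average_fraction before training.
# The destination filename is an assumption for illustration only.
src = Path("architectures/Resnet56Cifar.prototxt")
dst = Path("architectures/Resnet56Cifar_maf095.prototxt")
dst.write_text(
    src.read_text().replace(
        "moving_average_fraction: 0.999",
        "moving_average_fraction: 0.95",
    )
)
print(f"Wrote {dst} with moving_average_fraction: 0.95")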

@JoshVarty (Author)

@lnsmith54 Thanks! I'll take a look and review the script in closer detail tomorrow. If I am able to reproduce the results after making these changes, I'll be sure to make corresponding edits on OpenReview as well.

I'll let you know how it goes.

@lnsmith54 (Owner)

Josh,

I've uploaded the output files to a new results folder. Please look at the clr3SS5kFig1a file for your reference. Good luck and I look forward to hearing how it goes.

Leslie

@JoshVarty (Author)

I think I'm getting much closer. After changing moving_average_fraction to 0.95:

For CLR training with Caffe I get a final accuracy of 84%

For multistep training with Caffe I get a final accuracy of 91.5%

I'll diff your output files against mine and see if I'm missing anything else.

@lnsmith54 (Owner)

Your CLR curve looks qualitatively similar to mine. My guess is that running with 8 GPUs makes a major difference (with Caffe's data-parallel training, 8 GPUs at 125 images each give an effective batch size of 1,000, matching the ~1,000-image batches mentioned above), and that you won't be able to reproduce the CLR results without similar hardware. Please prove me wrong!

@JoshVarty (Author)

One difference I've noticed is:

The pre_bn layer in Results/clr3SS5kFig1a:

layer {
  name: "pre_bn"
  type: "BatchNorm"
  bottom: "pre_conv_top"
  top: "pre_bn_top"
  include {
    phase: TRAIN
  }
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.95
  }
}

The pre_bn layer in Resnet56Cifar.prototxt:

layer { # pre_bn
  name: "pre_bn"
  type: "BatchNorm"
  bottom: "pre_conv_top"
  top: "pre_bn_top"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  include {
    phase: TRAIN
  }
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.999
  }
}

Should I be removing these params when reproducing?

@lnsmith54 (Owner)

That is curious. The params should be there. My server has been down all week, but once it is fixed I will rerun this example with Resnet56Cifar.prototxt just to double-check.

@JoshVarty (Author)

Sounds good. In the meantime, I've updated my reproducibility report with these stronger results. Thanks for your help!
