Unable to reproduce Fig 1A #2

Open
JoshVarty opened this issue Dec 6, 2017 · 10 comments

Comments

@JoshVarty

I'm trying to reproduce evidence of superconvergence in Figure 1A shown below:
[Figure 1A from the paper]

I am using the following values in solver.prototxt:

net: "/home/jovarty/git/super-convergence/architectures/Resnet56Cifar.prototxt"
test_iter: 200
test_interval: 100
display: 100
lr_policy: "triangular"
base_lr: 0.1
max_lr:  3.0
stepsize: 5000
max_iter: 10000
solver_mode: GPU
weight_decay: 1e-4
momentum: 0.9

I have implemented the triangular cyclical learning rate policy in Caffe as specified here.
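
For reference, the triangular schedule from the CLR paper can be sketched in a few lines of Python (this is only an illustrative sketch, not the Caffe implementation itself), using the base_lr, max_lr, and stepsize values from the solver above:

import math

def triangular_lr(iteration, base_lr=0.1, max_lr=3.0, stepsize=5000):
    # Triangular cyclical learning rate: ramp linearly from base_lr up to
    # max_lr over `stepsize` iterations, then back down over the next
    # `stepsize` iterations, and repeat.
    cycle = math.floor(1 + iteration / (2 * stepsize))
    x = abs(iteration / stepsize - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

# With the solver settings above, the rate peaks at 3.0 at iteration 5000
# and returns to 0.1 at iteration 10000 (max_iter).
print(triangular_lr(0), triangular_lr(5000), triangular_lr(10000))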

I am using Resnet56Cifar.prototxt as the network.

I am achieving a final training accuracy of ~90%, but a final test accuracy as low as 10-20%.

I note that the paper specifies that you achieve these results with large batch sizes of ~1,000 images. However, I wouldn't expect a smaller batch size (125, as specified in Resnet56Cifar.prototxt) to completely destroy the results.

Are there any additional steps I must take to reproduce this work?

@JoshVarty (Author)

I should note that I originally attempted to reproduce this work in TensorFlow here. However, after achieving similar results I wanted to sanity-check that I could at least reproduce the results with the provided Caffe code.

@JoshVarty (Author)

From the paper:

We tested the effect of batch normalization on super-convergence. Initially, we found that having use_global_stats : true in the test phase prevents super-convergence. However, we realized this was due to using the default value of moving_average_fraction = 0.999 that is only appropriate for the typical, long training times.

The current Resnet56Cifar.prototxt uses the following:

batch_norm_param {
  use_global_stats: true
  moving_average_fraction: 0.999
}

Are these correct? Are there any other changes to the network I should make before trying to reproduce?
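
As background on why the fraction matters for such a short run: Caffe's BatchNorm layer maintains the global mean and variance as an exponential moving average whose decay is moving_average_fraction, so the statistics applied when use_global_stats: true is set reflect the recent history of training batches. The rough Python sketch below (an illustration of the averaging behaviour, not Caffe code) shows how much weight the most recent batches carry under each setting:

def recent_weight(decay, n):
    # Approximate fraction of an exponential moving average's weight that
    # comes from the most recent n updates, for decay = moving_average_fraction.
    return 1 - decay ** n

# With 0.999 the running statistics average over roughly the last ~1000
# batches; with 0.95 they track the last few dozen, so they stay much
# closer to the current weights during a short, high-learning-rate run.
for decay, n in [(0.999, 100), (0.999, 1000), (0.95, 100)]:
    print(f"decay={decay}: last {n} batches carry {recent_weight(decay, n):.3f} of the weight")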

@lnsmith54 (Owner)

Josh,

Thank you for your efforts at reproducing our results. I had hoped it would be much simpler to do.

You might take a look at the x.sh file. I created that shell script to make modifications to the files and submit them to our GPU system. It shows the change to moving_average_fraction from 0.999 to 0.95 for Figure 1. This script might answer some of your questions.

Based on your comments, I plan to upload the output for a job that created the super-convergence run for Figure 1 (my server is currently down). That should help answer your questions and make reproducing the results easier.

Thanks,
Leslie
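
The exact contents of x.sh aren't reproduced in this thread, but the kind of edit it describes, rewriting moving_average_fraction in the network prototxt before launching a run, can be sketched as follows. This is a hypothetical Python equivalent; the output path and the single string substitution are assumptions, not the script's actual code.

from pathlib import Path

# Hypothetical sketch of the kind of substitution x.sh performs: copy the
# network definition and lower moving_average_fraction before training.
# The destination filename is an assumption for illustration only.
src = Path("architectures/Resnet56Cifar.prototxt")
dst = Path("architectures/Resnet56Cifar_maf095.prototxt")
dst.write_text(
    src.read_text().replace(
        "moving_average_fraction: 0.999",
        "moving_average_fraction: 0.95",
    )
)
print(f"Wrote {dst} with moving_average_fraction: 0.95")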

@JoshVarty (Author)

@lnsmith54 Thanks! I'll take a look and review the script in closer detail tomorrow. If I am able to reproduce the results after making these changes, I'll be sure to make corresponding edits on OpenReview as well.

I'll let you know how it goes.

@lnsmith54 (Owner)

Josh,

I've uploaded the output files to a new results folder. Please look at the clr3SS5kFig1a file for your reference. Good luck and I look forward to hearing how it goes.

Leslie

@JoshVarty (Author)

I think I'm getting much closer. After changing moving_average_fraction to 0.95:

For CLR training with Caffe I get a final accuracy of 84%

For multistep training with Caffe I get a final accuracy of 91.5%

I'll diff your output files against mine and see if I'm missing anything else.

@lnsmith54 (Owner)

Your CLR curve looks qualitatively similar to mine. My guess is that running with 8 GPUs makes a major difference (with Caffe's data-parallel training, 8 GPUs at 125 images each give an effective batch size of 1,000, matching the ~1,000-image batches mentioned above), and that you won't be able to reproduce the CLR results without similar hardware. Please prove me wrong!

@JoshVarty (Author)

One difference I've noticed is:

The pre_bn layer in Results/clr3SS5kFig1a:

layer {
  name: "pre_bn"
  type: "BatchNorm"
  bottom: "pre_conv_top"
  top: "pre_bn_top"
  include {
    phase: TRAIN
  }
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.95
  }
}

The pre_bn layer in Resnet56Cifar.prototxt:

layer { # pre_bn
  name: "pre_bn"
  type: "BatchNorm"
  bottom: "pre_conv_top"
  top: "pre_bn_top"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  include {
    phase: TRAIN
  }
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.999
  }
}

Should I be removing these params when reproducing?

@lnsmith54 (Owner)

That is curious. The params should be there. My server has been down all week, but once it is fixed I will rerun this example with Resnet56Cifar.prototxt just to double-check.

@JoshVarty (Author)

Sounds good. In the meantime, I've updated my reproducibility report with these stronger results. Thanks for your help!
