Notebook by Zach Mueller

In [0]:
from fastai.vision import *

# Dataset:

Our dataset today will be ImageWoof. [Link](https://github.com/fastai/imagenette)

Goal: Using no pre-trained weights, see how high an accuracy we can reach in a given number of epochs (five, in this notebook)

This dataset is generally harder than Imagenette; both are subsets of ImageNet.

Models are trending towards being faster and more efficient.


In [0]:
def get_data(size, woof, bs, workers=None):
    if   size<=128: path = URLs.IMAGEWOOF if woof else URLs.IMAGENETTE
    elif size<=224: path = URLs.IMAGEWOOF_320 if woof else URLs.IMAGENETTE_320
    else          : path = URLs.IMAGEWOOF     if woof else URLs.IMAGENETTE
    path = untar_data(path)

    n_gpus = num_distrib() or 1
    if workers is None: workers = min(8, num_cpus()//n_gpus)

    return (ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs, num_workers=workers)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

In [0]:
data = get_data(128, True, 64)

We will be following a progression that started on the fastai forums [here](https://forums.fast.ai/t/meet-mish-new-activation-function-possible-successor-to-relu/53299/) on August 26th of this year.

Participants in this "competition" included:
  * [Less](https://forums.fast.ai/u/lessw2020)
  * [Seb](https://forums.fast.ai/u/seb)
  * [Mikhail Grankin](https://forums.fast.ai/u/grankin)
  * [Federico Lois](https://forums.fast.ai/u/redknight)
  * [Ignacio Oguiza](https://forums.fast.ai/u/oguiza)


# The Competition:

* Lasted roughly 3 days
* We explored a variety of papers and combined ideas from them to see what could work best *together*

## Papers Referenced:

* [Bag of Tricks for Image Classification with Convolutional Neural Networks (aka the birth of xResNet)](https://arxiv.org/abs/1812.01187)
* [Large Batch Optimization for Deep Learning, LAMB](https://arxiv.org/abs/1904.00962)
* [Large Batch Training of Convolutional Networks, LARS](https://arxiv.org/pdf/1708.03888.pdf)
* [Lookahead Optimizer: k steps forward, 1 step back](https://arxiv.org/abs/1907.08610)
* [Mish: A Self Regularized Non-Monotonic Neural Activation Function](https://arxiv.org/abs/1908.08681v1)
* [On the Variance of the Adaptive Learning Rate and Beyond, RAdam](https://arxiv.org/abs/1908.03265)
* [Self-Attention Generative Adversarial Networks](https://arxiv.org/abs/1805.08318)
* [Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks, Novograd](https://arxiv.org/pdf/1905.11286.pdf)


## Other Equally Important Notables:
* Flatten + Anneal Scheduling - Mikhail Grankin
* Simple Self Attention - Seb

One trend you will notice throughout this exercise is that we (everyone mentioned above and myself) all tried combining a variety of these tools and papers before Seb eventually came up with the winning solution. For a bit of context, here is the pre-competition state of the art for ImageWoof:
![](https://forums.fast.ai/uploads/default/original/3X/9/3/9386db85de3d7ad9c7d567484fb929bb40a93d85.jpeg)

And here were the winning results:

![](https://forums.fast.ai/uploads/default/optimized/3X/a/6/a68876e6f99a87c8c81db6c39125f8f1eae99f1f_2_690x271.jpeg)

As a general rule of thumb, we always want to make sure our results are reproducible, hence the multiple runs and the reporting of the mean, standard deviation, and maximum. To save time today, we will do just a single five-epoch run of each setup. In no particular order, here is a list of what was tested, and what we will be testing today:

* Baseline (Adam + xResnet50) + OneCycle
* Ranger (RAdam + LookAhead) + OneCycle
* Ranger + Flatten Anneal
* Ranger + MXResnet (xResnet50 + Mish) + Flatten Anneal
* RangerLars (Ralamb + LARS + Ranger) + MXResnet + Flatten Anneal
* RangerLars + xResnet50 + Flatten Anneal
* Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal

The last of these achieved the best score overall.
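
Since Ranger shows up in most of the combinations above, here is a minimal sketch of the Lookahead half of it, using the update rule from the Lookahead paper: every `k` inner steps, the slow weights move a fraction `alpha` of the way toward the fast weights, and the fast weights restart from there. This is not the code from the `ranger` module. The inner optimizer is plain SGD rather than RAdam, and `lookahead_sgd`, `grad_fn`, and the values of `lr`, `k`, `alpha`, and `n_steps` are all illustrative assumptions.

```python
def lookahead_sgd(grad_fn, w0, lr=0.1, k=5, alpha=0.5, n_steps=50):
    """Minimise a 1-D function with plain SGD wrapped in Lookahead.

    All names and default values here are illustrative assumptions.
    """
    slow = w0  # slow weights, updated once every k steps
    fast = w0  # fast weights, updated by the inner optimizer
    for step in range(1, n_steps + 1):
        fast -= lr * grad_fn(fast)         # inner step (SGD here, RAdam in Ranger)
        if step % k == 0:
            slow += alpha * (fast - slow)  # pull slow weights toward the fast ones
            fast = slow                    # restart fast weights from the slow ones
    return slow

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w = lookahead_sgd(lambda w: 2 * (w - 3), w0=0.0)
print(f"w after Lookahead: {w:.4f}")
```

On this toy problem the slow weights approach the minimum more cautiously than SGD alone would, which is the "k steps forward, 1 step back" trade-off: slower progress per step in exchange for stability.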

## Functions:

For the sake of simplicity, we will borrow from Seb's GitHub repository.

In [4]:
!git clone https://github.com/sdoria/mish

Cloning into 'mish'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 46 (delta 21), reused 19 (delta 6), pack-reused 0
Unpacking objects: 100% (46/46), done.


In [4]:
%cd mish
from rangerlars import *
from mish import *
from mxresnet import *
from ranger import *

/content/mish
Mish activation loaded...
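
For reference, the Mish paper linked above defines the activation as `x * tanh(softplus(x))`. The sketch below evaluates that formula on plain Python floats so the shape of the function is easy to inspect. It is not the implementation from the `mish` module we just imported (which works on tensors), and the helper names are our own.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

# A few sample points: small negative outputs for negative inputs,
# close to the identity for large positive inputs.
for x in (-2.0, 0.0, 1.0, 3.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

Unlike ReLU, Mish lets small negative values through instead of clamping them to zero, and it is smooth everywhere.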


# Running the tests

For our tests, we will use the overall accuracy as well as top_k accuracy, as this is what was used in Jeremy's example. Note that top_k is not quite as relevant here, as we only have 10 classes.
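
To make the caveat concrete, here is a small pure-Python sketch of what a top-k metric counts: a prediction is correct if the true class is anywhere among the `k` highest-scoring classes. The function name `top_k_accuracy_sketch` and the scores are made up for illustration; this is not fastai's implementation. As far as we know, fastai's `top_k_accuracy` defaults to `k=5` (worth double-checking in your installed version), so with 10 classes even random guessing would land around 50%.

```python
def top_k_accuracy_sketch(scores, targets, k=5):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)

# Three hypothetical samples over 4 classes (made-up scores, for illustration only).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1 is ranked 1st
    [0.4, 0.3, 0.2, 0.1],  # true class 2 is ranked 3rd
    [0.3, 0.1, 0.2, 0.4],  # true class 1 is ranked last
]
targets = [1, 2, 1]

print(top_k_accuracy_sketch(scores, targets, k=1))
print(top_k_accuracy_sketch(scores, targets, k=3))
```

The larger `k` is relative to the number of classes, the easier the metric becomes, which is why plain accuracy is the number to watch in the tables below.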

## Baseline

In [0]:
opt_func = partial(optim.Adam, betas=(0.9,0.99), eps=1e-6)

In [0]:
learn = Learner(data, models.xresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [7]:
learn.fit_one_cycle(5, 3e-3, div_factor=10, pct_start=0.3)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,2.153507,2.155606,0.236,0.764,01:12
1,1.950127,2.465281,0.282,0.788,01:13
2,1.722233,1.586035,0.488,0.932,01:12
3,1.523372,1.403256,0.588,0.952,01:12
4,1.379958,1.31599,0.624,0.97,01:12


## Ranger + OneCycle

In [0]:
opt_func = partial(Ranger, betas=(0.9,0.99), eps=1e-6)

In [0]:
learn = Learner(data, models.xresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [11]:
learn.fit_one_cycle(5, 3e-3, div_factor=10, pct_start=0.3)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,1.897928,2.008069,0.312,0.796,01:12
1,1.791323,1.728758,0.418,0.898,01:13
2,1.669062,1.666615,0.478,0.912,01:13
3,1.570262,1.517981,0.548,0.936,01:12
4,1.525395,1.487397,0.548,0.938,01:12


## Ranger + Flatten Anneal

In [0]:
from fastai.callbacks import *

In [0]:
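def flattenAnneal(learn:Learner, lr:float, n_epochs:int, start_pct:float):
  "Train at a constant `lr` for `start_pct` of the iterations, then cosine-anneal it."
  n = len(learn.data.train_dl)                 # batches per epoch
  anneal_start = int(n*n_epochs*start_pct)     # iterations spent at the flat lr
  anneal_end = int(n*n_epochs) - anneal_start  # iterations spent annealing
  phases = [TrainingPhase(anneal_start).schedule_hp('lr', lr),
           TrainingPhase(anneal_end).schedule_hp('lr', lr, anneal=annealing_cos)]
  sched = GeneralScheduler(learn, phases)
  learn.callbacks.append(sched)                # note: the scheduler stays attached to `learn`
  learn.fit(n_epochs)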
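
To see what this schedule looks like without training anything, here is a rough standalone sketch that computes the learning rate per iteration: flat for the first `start_pct` of the run, then a half-cosine down toward zero. It assumes the anneal phase ends at 0 and uses our own helper name `flat_then_cosine`; the exact per-iteration offsets inside fastai's scheduler may differ slightly.

```python
import math

def flat_then_cosine(lr, total_iters, start_pct):
    """Learning rate per iteration: constant, then cosine-annealed toward 0."""
    flat_iters = int(total_iters * start_pct)
    anneal_iters = total_iters - flat_iters
    schedule = [lr] * flat_iters
    for i in range(anneal_iters):
        pct = i / anneal_iters  # progress through the anneal phase, in [0, 1)
        schedule.append(lr / 2 * (math.cos(math.pi * pct) + 1))
    return schedule

# Hypothetical run length: 5 epochs of 100 batches each.
lrs = flat_then_cosine(lr=3e-3, total_iters=500, start_pct=0.7)
for it in (0, 349, 350, 425, 499):
    print(f"iter {it:3d}: lr = {lrs[it]:.6f}")
```

Compared with OneCycle, there is no warm-up and no early decay: the model trains at the full learning rate for most of the run and only cools down at the end.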

In [0]:
opt_func = partial(Ranger, betas=(0.9,0.99), eps=1e-6)
learn = Learner(data, models.xresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [21]:
flattenAnneal(learn, 3e-3, 5, 0.7)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,2.098281,2.399817,0.266,0.744,01:13
1,1.929641,2.321711,0.302,0.798,01:13
2,1.733089,1.623181,0.506,0.92,01:13
3,1.582382,1.617398,0.496,0.924,01:13
4,1.361795,1.311129,0.67,0.952,01:13


## Ranger + MXResnet + Flatten Anneal

In [0]:
opt_func = partial(Ranger, betas=(0.9,0.99), eps=1e-6)
learn = Learner(data, mxresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [25]:
flattenAnneal(learn, 4e-3, 5, 0.7)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,2.054917,2.199414,0.286,0.804,01:17
1,1.804237,3.025912,0.254,0.732,01:17
2,1.616225,1.517143,0.574,0.944,01:17
3,1.449524,1.379319,0.622,0.938,01:18
4,1.221281,1.168319,0.728,0.958,01:18


## RangerLars + MXResnet + Flatten Anneal

In [0]:
opt_func = partial(RangerLars, betas=(0.9,0.99), eps=1e-6)
learn = Learner(data, mxresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [34]:
flattenAnneal(learn, 4e-3, 5, 0.72)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,1.945484,2.401585,0.318,0.78,01:31
1,1.71429,1.956744,0.384,0.844,01:31
2,1.587898,1.804619,0.416,0.906,01:31
3,1.50296,1.5409,0.536,0.928,01:31
4,1.361548,1.34227,0.646,0.954,01:31


## RangerLars + xResnet50 + Flatten Anneal

In [0]:
opt_func = partial(RangerLars, betas=(0.9,0.99), eps=1e-6)
learn = Learner(data, models.xresnet50(c_out=10), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [36]:
flattenAnneal(learn, 4e-3, 5, 0.72)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,2.006913,2.315927,0.284,0.708,01:22
1,1.80161,1.92757,0.378,0.852,01:22
2,1.703221,1.858955,0.394,0.88,01:22
3,1.643228,1.700991,0.448,0.872,01:22
4,1.488607,1.462179,0.594,0.94,01:22


## Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal

In [0]:
opt_func = partial(Ranger, betas=(0.95,0.99), eps=1e-6)
learn = Learner(data, mxresnet50(c_out=10, sa=True), wd=1e-2, opt_func=opt_func,
               bn_wd=False, true_wd=True, loss_func=LabelSmoothingCrossEntropy(),
               metrics=[accuracy, top_k_accuracy])

In [38]:
flattenAnneal(learn, 4e-3, 5, 0.72)

epoch,train_loss,valid_loss,accuracy,top_k_accuracy,time
0,1.968602,2.051655,0.33,0.83,01:20
1,1.709247,2.174664,0.384,0.892,01:21
2,1.537062,1.451682,0.598,0.946,01:21
3,1.393161,1.419797,0.578,0.954,01:21
4,1.17215,1.122868,0.746,0.978,01:20


As we can see, we reached 74.6% accuracy. The highest recorded is 78%.
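
To compare the runs side by side, the snippet below collects the final-epoch accuracy from each table above and ranks them against the Adam baseline. The numbers are copied from this notebook's outputs. Keep in mind that each is a single run, so small gaps may well be noise.

```python
# Final-epoch accuracy of each single run, copied from the tables above.
results = {
    "Adam + xResnet50 + OneCycle": 0.624,
    "Ranger + xResnet50 + OneCycle": 0.548,
    "Ranger + xResnet50 + Flatten Anneal": 0.670,
    "Ranger + MXResnet + Flatten Anneal": 0.728,
    "RangerLars + MXResnet + Flatten Anneal": 0.646,
    "RangerLars + xResnet50 + Flatten Anneal": 0.594,
    "Ranger + SimpleSelfAttention + MXResnet + Flatten Anneal": 0.746,
}

baseline = results["Adam + xResnet50 + OneCycle"]
# Sort from best to worst and show the gap to the baseline.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{acc:.1%}  ({acc - baseline:+.1%} vs. baseline)  {name}")
```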

From here:

I encourage you all to try out some of the combinations seen here today and build on them. For instance, are we using the best hyperparameters? What about Cut-Out? MixUp? Plenty more to explore!

Thanks to everyone mentioned above for their hard work and determination in getting us to where we are now. The fastai forum is an amazing place to bounce ideas around and try new things. Also, thank you to Jeremy for making *all* of this possible!