# Winning the Lottery with fastai
> How to find winning tickets in your neural network

- toc: true
- badges: false
- categories: [Deep Learning]
- comments: true
- image: images/pruning.png
- hide: true

<br>

<br>

## **Lottery Ticket Hypothesis**

The Lottery Ticket Hypothesis is a fascinating property of neural networks identified by Frankle and Carbin in 2019. It states that a neural network contains a subnetwork that, trained in isolation, can reach an accuracy comparable to that of the whole network in a comparable training time. The only condition is that the subnetwork starts from the same initial weights it had as part of the whole network.

In practice, this subnetwork, called a "winning ticket", can be found by pruning the network, i.e. removing its least useful connections.

The steps to isolate this winning ticket are: 
1. Get a freshly initialized network
2. Train it to convergence
3. Prune the weights with the smallest magnitude, i.e. the lowest $l_1$-norm
4. Reset the remaining weights to their original values, i.e. their values at step 1
5. Repeat steps 2 to 4
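
The steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the fasterai implementation: the "network" is a flat list of weights, `train` is a hypothetical stand-in that pulls each weight toward a fixed target, and the names `prune_smallest` and `find_winning_ticket` are made up for this example.

```python
import random

def train(weights, mask, steps=50, lr=0.1):
    """Toy stand-in for training: gradient descent on sum((w - t)^2)."""
    # Hypothetical fixed targets, chosen only so that some weights end up near zero.
    targets = [((i % 5) - 2) * 0.5 for i in range(len(weights))]
    w = list(weights)
    for _ in range(steps):
        # Masked weights are multiplied by 0, so pruned connections stay pruned.
        w = [(wi - lr * 2 * (wi - ti)) * mi for wi, ti, mi in zip(w, targets, mask)]
    return w

def prune_smallest(weights, mask, fraction):
    """Remove the given fraction of surviving weights with the smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * fraction)
    to_prune = sorted(alive, key=lambda i: abs(weights[i]))[:n_prune]
    new_mask = list(mask)
    for i in to_prune:
        new_mask[i] = 0
    return new_mask

def find_winning_ticket(n_weights=20, rounds=3, fraction=0.2, seed=0):
    rng = random.Random(seed)
    init = [rng.uniform(-1, 1) for _ in range(n_weights)]  # step 1: fresh initialization
    mask = [1] * n_weights
    weights = list(init)
    for _ in range(rounds):                                # step 5: repeat
        trained = train(weights, mask)                     # step 2: train
        mask = prune_smallest(trained, mask, fraction)     # step 3: prune smallest magnitudes
        weights = [w0 * m for w0, m in zip(init, mask)]    # step 4: reset survivors to init
    return init, mask, weights

init, mask, ticket = find_winning_ticket()
print(f"kept {sum(mask)} of {len(mask)} weights")
```

The key point is step 4: the surviving weights go back to the values stored in `init`, not to a new random draw, and the mask is what carries over from one round to the next.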

![Alt Text](images/LTH/test2.gif)

Using fasterai, we already know how to prune a network. The only change here is that we have to keep track of initialization since we want to start from the initial conditions each time.

In the original paper, the idea was to iteratively prune the network, resetting the remaining weights to their initial value after each pruning step.

In [15]:
#hide
from fastai.vision import *

In [16]:
#hide
def get_data(size, bs):
    path = untar_data(URLs.IMAGENETTE_160)

    return (ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

In [None]:
#hide
def count_parameters(model):
    num_params = sum(p.numel() for p in model.parameters())
    print(f'Total parameters : {num_params:,}' )

In [None]:
#hide
def print_sparsity(model):
    for k,m in enumerate(model.modules()):
        if isinstance(m, nn.Conv2d):
            print(f"Sparsity in {m.__class__.__name__} {k}: {100. * float(torch.sum(m.weight == 0))/ float(m.weight.nelement()):.2f}%")

In [14]:
#hide
size, bs = 224, 16
data = get_data(size, bs)

Let's first get our baseline:

In [None]:
learn = Learner(data, models.resnet18(num_classes=10), metrics=[accuracy])

In [15]:
learn.fit_one_cycle(10, 1e-4)

| epoch | train_loss | valid_loss | accuracy | time |
|-------|------------|------------|----------|-------|
| 0 | 2.016354 | 1.778865 | 0.368917 | 01:31 |
| 1 | 1.77757 | 1.50886 | 0.523567 | 01:31 |
| 2 | 1.436139 | 1.421571 | 0.569172 | 01:32 |
| 3 | 1.275864 | 1.11884 | 0.630064 | 01:31 |
| 4 | 1.13662 | 0.994999 | 0.687898 | 01:31 |
| 5 | 0.970474 | 0.824344 | 0.739618 | 01:31 |
| 6 | 0.878756 | 0.764273 | 0.765605 | 01:32 |
| 7 | 0.817084 | 0.710727 | 0.781911 | 01:31 |
| 8 | 0.716041 | 0.625853 | 0.804841 | 01:31 |
| 9 | 0.668815 | 0.605727 | 0.810955 | 01:31 |


What would be the performance of our model with regular pruning?

We will now try the LTH. The first test uses one-shot pruning, i.e. we prune our network and reset the weights only once.

The first step is thus to get our pruned model. We set the `reset` parameter to `'lth'`, meaning that after training, the remaining weights are reset to their original initialization values.

In [None]:
learn = Learner(data, models.resnet18(num_classes=10), metrics=[accuracy])
sched_func = annealing_cos

prune = SparsifyCallback(learn, sparsity=sp, granularity=granularity, method=method, criteria=criteria, sched_func=sched_func, reset='lth')
learn.fit_one_cycle(int(epochs), 1e-3, callbacks=[prune])
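
To get a feel for what a cosine schedule does to the sparsity level, here is a small standalone sketch. It assumes that the callback uses `sched_func` to interpolate the sparsity from 0 to the target as training progresses; the function below is a plain re-implementation of the usual cosine annealing formula, and `target_sparsity` and `n_epochs` are hypothetical values chosen for the example.

```python
import math

def annealing_cos(start, end, pct):
    """Cosine interpolation from `start` to `end` as `pct` goes from 0.0 to 1.0."""
    cos_out = math.cos(math.pi * pct) + 1
    return end + (start - end) / 2 * cos_out

target_sparsity = 50  # hypothetical target, in percent
n_epochs = 5          # hypothetical number of epochs

for epoch in range(n_epochs + 1):
    pct = epoch / n_epochs  # fraction of training completed
    sparsity = annealing_cos(0, target_sparsity, pct)
    print(f"progress {pct:.1f} -> sparsity {sparsity:.1f}%")
```

With this shape, pruning is gentle at the start, fastest around the middle of training, and slows down again as the target sparsity is approached.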

Now we can retrain the submodel, starting from its original initialization values, for the same number of epochs and check whether the performance is comparable.

In [None]:
ft = SparsifyCallback(learn, sparsity=sp, granularity=granularity, method=method, criteria=criteria, sched_func=annealing_no)

learn.fit_one_cycle(int(epochs), 1e-3, callbacks=[ft])

LTH can also be done iteratively, with each pruning iteration being followed by a weight reset.

In [None]:
learn = Learner(data, models.resnet18(num_classes=10), metrics=[accuracy])
sched_func = iterative

prune = SparsifyCallback(learn, sparsity=sp, granularity=granularity, method=method, criteria=criteria, sched_func=sched_func, reset='lth')
learn.fit_one_cycle(int(epochs), 1e-3, callbacks=[prune])
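
An iterative schedule can be pictured as a staircase: the sparsity jumps to a new level a few times during training, and each jump would be followed by a weight reset. The sketch below is an assumption about what such a schedule could look like, not the fasterai source. The name `iterative` matches the identifier used in the cell above, but the body and the `n_steps` parameter are made up for illustration.

```python
import math

def iterative(start, end, pct, n_steps=3):
    """Step schedule: move from `start` to `end` in `n_steps` equal jumps."""
    return start + (end - start) / n_steps * math.ceil(pct * n_steps)

target_sparsity = 60  # hypothetical target, in percent

# Sample the schedule at a few points of training progress.
for pct in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"progress {pct:.1f} -> sparsity {iterative(0, target_sparsity, pct):.0f}%")
```

Compared with a cosine schedule, the sparsity here stays constant between jumps, which gives the network several epochs to recover before the next pruning round.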

Actually, the authors have suggested that the weights need not be reset to their initial values, i.e. their values at step 0, but can instead be reset to their values at a later training step. This can be done by setting the `rewind` value to the epoch you want your weights to be reset to.

In [None]:
learn = Learner(data, models.resnet18(num_classes=10), metrics=[accuracy])
sched_func = iterative

prune = SparsifyCallback(learn, sparsity=sp, granularity=granularity, method=method, criteria=criteria, sched_func=sched_func, reset='lth', rewind=2)
learn.fit_one_cycle(int(epochs), 1e-3, callbacks=[prune])
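
The idea of rewinding can be sketched independently of any library: keep a snapshot of the weights at the chosen epoch, and use that snapshot instead of the initialization when resetting the surviving weights. Everything below is hypothetical (the `RewindCheckpoint` class, the toy weight update, the mask), and whether the snapshot is taken at the start or at the end of the chosen epoch is an assumption of this example.

```python
import copy

class RewindCheckpoint:
    """Keep a copy of the weights as they were at a chosen epoch."""

    def __init__(self, rewind_epoch):
        self.rewind_epoch = rewind_epoch
        self.saved = None

    def maybe_save(self, epoch, weights):
        # Take the snapshot once, when the chosen epoch is reached.
        if epoch == self.rewind_epoch and self.saved is None:
            self.saved = copy.deepcopy(weights)

    def reset(self, mask):
        # Surviving weights go back to the snapshot, pruned ones stay at zero.
        return [w * m for w, m in zip(self.saved, mask)]

weights = [0.5, -0.1, 0.8, 0.05]
ckpt = RewindCheckpoint(rewind_epoch=2)

for epoch in range(5):
    ckpt.maybe_save(epoch, weights)       # snapshot taken at the start of epoch 2
    weights = [w * 1.1 for w in weights]  # toy stand-in for one epoch of training

mask = [1, 0, 1, 0]                       # hypothetical mask from magnitude pruning
rewound = ckpt.reset(mask)
print(rewound)
```

With `rewind_epoch=0` this reduces to the original scheme, where the surviving weights are reset to their initialization.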

---

<br>

<br>

**That's all! Thank you for reading, and I hope you'll like FasterAI. I do not claim that it is perfect, and you will probably find bugs. If you do, please tell me so I can try to fix them 😌**

<br>

---

<br>

<p style="font-size: 15px"><i>If you notice any mistake or possible improvement, please contact me! If you found this post useful, please consider citing it as:</i></p>

```
@article{hubens2020fasterai,
  title   = "Winning the Lottery with fastai",
  author  = "Hubens, Nathan",
  journal = "nathanhubens.github.io",
  year    = "2020",
  url     = "https://nathanhubens.github.io/posts/deep%20learning/2020/08/17/FasterAI.html"
}
```
