# FasterAI
> How to make your network smaller and faster with the fastai library

- toc: true
- badges: false
- categories: [Deep Learning]
- comments: true
- image: images/pruning.png

<br>

<p style="font-size: 15px"><i>The code is available <a href="https://github.com/nathanhubens/fasterai">here</a></i></p>

<br>

## **Introducing FasterAI**

**FasterAI** is a project that I started to make my neural networks **smaller** and **faster** with the [fastai](https://github.com/fastai/fastai) library. The techniques implemented here can easily be used with plain PyTorch, but the idea was to express them in an abstract and easy-to-use manner (à la fastai).

In this article, we'll explain how to use FasterAI by going through an example use-case.

<br>

> **Ready? Let's dive in then!**

<br>

In [15]:
#hide
from fastai.vision import *

In [16]:
#hide
def get_data(size, bs):
    path = untar_data(URLs.IMAGENETTE_160)

    return (ImageList.from_folder(path).split_by_folder(valid='val')
            .label_from_folder().transform(([flip_lr(p=0.5)], []), size=size)
            .databunch(bs=bs)
            .presize(size, scale=(0.35,1))
            .normalize(imagenet_stats))

In [None]:
#hide
def count_parameters(model):
    num_params = sum(p.numel() for p in model.parameters())
    print(f'Total parameters : {num_params:,}' )

In [None]:
#hide
def print_sparsity(model):
    for k,m in enumerate(model.modules()):
        if isinstance(m, nn.Conv2d):
            print(f"Sparsity in {m.__class__.__name__} {k}: {100. * float(torch.sum(m.weight == 0))/ float(m.weight.nelement()):.2f}%")

In [14]:
#hide
size, bs = 224, 16
data = get_data(size, bs)

Let's start with a bit of context for the purpose of the demonstration. Imagine that we want to deploy a **VGG16** model on a mobile device that has limited storage capacity, and that our task requires the model to run sufficiently fast. **VGG16** is known to be neither parameter-efficient nor particularly fast, but let's see what we can do with it.

Let's first check the number of parameters and the inference time of **VGG16**.

In [None]:
learn = Learner(data, models.vgg16_bn(num_classes=10), metrics=[accuracy])

So, **VGG16** has about **134 million** parameters

In [122]:
#hide_input
count_parameters(learn.model)

Total parameters : 134,309,962


And takes **4.03ms** to perform inference on a single image.

In [None]:
#hide
model = learn.model.eval()
x,y = data.one_batch()

In [12]:
#hide_input
%%timeit
model(x[0][None].cuda())

4.03 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Snap! This is more than we can afford for deployment; ideally we would like our model to take only half of that... but should we give up? Nope, there are actually a lot of techniques that we can use to help reduce the size and improve the speed of our models! Let's see how to apply them with **FasterAI**.

<br>

We will first train our **VGG16** model to have a **baseline** of what performance we should expect from it.

In [15]:
learn.fit_one_cycle(10, 1e-4)

epoch,train_loss,valid_loss,accuracy,time
0,2.016354,1.778865,0.368917,01:31
1,1.77757,1.50886,0.523567,01:31
2,1.436139,1.421571,0.569172,01:32
3,1.275864,1.11884,0.630064,01:31
4,1.13662,0.994999,0.687898,01:31
5,0.970474,0.824344,0.739618,01:31
6,0.878756,0.764273,0.765605,01:32
7,0.817084,0.710727,0.781911,01:31
8,0.716041,0.625853,0.804841,01:31
9,0.668815,0.605727,0.810955,01:31


So we would like our network to have comparable accuracy, but with fewer parameters and faster inference... The first technique that we will use is called **Knowledge Distillation**.

<br>

---

<br>

## **Knowledge Distillation**

Knowledge distillation is a simple yet very efficient way to train a model. It was introduced in 2006 by [Buciluǎ et al.](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf){% fn 1 %}. The main idea is to use a small model (called the **student**) to approximate the function learned by a larger, high-performing model (called the **teacher**). This can be done by using the large model to pseudo-label the data. This idea has been used very recently to [beat the state-of-the-art accuracy on ImageNet](https://arxiv.org/abs/1911.04252){% fn 2 %}.

When we train a model for classification, we usually use a softmax as the last layer. The softmax tends to squish the low-valued logits towards **0** and the highest logit towards **1**. As a result, most of the inter-class information, sometimes called the *dark knowledge*, is lost. This is precisely the valuable information that we want to transfer from the teacher to the student.

To do so, we still use a regular classification loss but, at the same time, we use another loss, computed between the *softened* logits of the teacher (our *soft labels*) and the *softened* logits of the student (our *soft predictions*). Those soft values are obtained with a **soft-softmax**, i.e. a softmax whose logits are first divided by a temperature greater than 1, which avoids squishing the values at its output. Our implementation follows [this paper](http://cs230.stanford.edu/files_winter_2018/projects/6940224.pdf){% fn 3 %} and the basic principle of training is represented in the figure below:

<br>

<img src="images/fasterai/KD.png" width="600"> 
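As a rough sketch of the idea (and not FasterAI's actual implementation), the softened outputs and the combined loss could look like this in plain Python. The temperature `T`, the weight `alpha`, the `T * T` scaling and the helper names are assumptions made for the illustration; they follow the usual formulation of distillation rather than anything specific to the library:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of the logits after dividing them by `temperature`."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)                      # subtracted for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target, T=4.0, alpha=0.9):
    """Weighted sum of a soft loss (teacher vs student) and the usual cross-entropy."""
    soft_labels = softmax(teacher_logits, T)
    soft_preds = softmax(student_logits, T)
    # KL divergence between the softened teacher and student distributions
    soft_loss = sum(p * math.log(p / q) for p, q in zip(soft_labels, soft_preds))
    # Regular classification loss on the hard label, at temperature 1
    hard_loss = -math.log(softmax(student_logits)[target])
    return alpha * T * T * soft_loss + (1 - alpha) * hard_loss

teacher_logits = [8.0, 2.0, 1.0]   # hypothetical values
student_logits = [3.0, 2.5, 0.5]   # hypothetical values
print(softmax(teacher_logits))         # close to one-hot: inter-class information is squashed
print(softmax(teacher_logits, 4.0))    # softened: the other classes keep a visible share
print(distillation_loss(student_logits, teacher_logits, target=0))
```

With a temperature of 1 the teacher output is almost one-hot, while a higher temperature keeps the relative ranking of the other classes visible, which is what the student learns from.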

To use **Knowledge Distillation** with FasterAI, you only need to use this callback when training your student model:

<br>

<blockquote>
<pre><b><i> KnowledgeDistillation(student:Learner, teacher:Learner) </i></b></pre>
<p style="font-size: 15px"><i>
You only need to pass your student learner and your teacher learner to the callback. Behind the scenes, FasterAI will take care of training your model with knowledge distillation.
</i></p>
</blockquote>

<br>

In [None]:
#hide
from fasterai.distillation import *

The first thing to do is to find a teacher, which can be any model that preferably performs well. We will choose **VGG19** for our demonstration. To make sure it performs better than our **VGG16** model, let's start from a pretrained version.

In [20]:
teacher = cnn_learner(data, models.vgg19_bn, metrics=[accuracy])
teacher.fit_one_cycle(3, 1e-4)

epoch,train_loss,valid_loss,accuracy,time
0,0.249884,0.088749,0.972739,01:02
1,0.201829,0.087495,0.974268,01:02
2,0.261882,0.082631,0.974013,01:01


Our teacher reaches **97.4%** accuracy, which is pretty good; it is ready to take a student under its wing. So let's create our student model and train it with the **Knowledge Distillation** callback:

In [9]:
student = Learner(data, models.vgg16_bn(num_classes=10), metrics=[accuracy])
student.fit_one_cycle(10, 1e-4, callbacks=[KnowledgeDistillation(student, teacher)])

epoch,train_loss,valid_loss,accuracy,time
0,2.323744,2.102873,0.410955,02:16
1,2.099557,2.441147,0.571465,02:16
2,1.829197,2.215419,0.607643,02:16
3,1.617705,1.683477,0.667006,02:16
4,1.364808,1.366435,0.713376,02:16
5,1.257906,0.985063,0.788025,02:16
6,1.087404,0.877424,0.801019,02:17
7,0.94996,0.77763,0.822166,02:16
8,0.868683,0.733206,0.837707,02:17
9,0.75663,0.707806,0.843057,02:16


And we can see that the knowledge of the teacher was indeed useful for the student, which reaches **84.3%** accuracy and clearly outperforms the vanilla **VGG16** at **81.1%**.

OK, so now we are able to get more from a given model, which is kind of cool! With some experimentation we could come up with a model smaller than **VGG16** but able to reach the same performance as our baseline! You can try to find it by yourself later, but for now let's continue with the next technique!

<br>

---

<br>

## **Sparsifying**

Now that we have a student model that is performing better than our baseline, we have some room to compress it. We'll start by making the network sparse. As explained in a previous [article](https://nathanhubens.github.io/posts/deep%20learning/2020/05/22/pruning.html), there are many ways to obtain a sparse network.


<br>

> Note: Usually, the process of making a network sparse is called Pruning. I prefer using the term Pruning when parameters are **actually** removed from the network, which we will do in the next section. 

<br>
<img src="images/pruning/schedules.png" width="680">
<br>

By default, FasterAI uses the **Automated Gradual Pruning** paradigm, as it removes parameters while the model trains and doesn't require pretraining the model, so it is usually much faster. In FasterAI, this is also managed with a callback that replaces the *least important* parameters of your model with zeroes during training. The callback has a wide variety of parameters to tune your **Sparsifying** operation; let's take a look at them:

<br>

<blockquote>
    <pre><b><i>SparsifyCallback(learn, sparsity, granularity, method, criteria, sched_func)</i></b></pre>

<ul><i>
<li style="font-size:15px"><b>sparsity</b>: the percentage of parameters that you want to be zeroed in your network </li>
<li style="font-size:15px"><b>granularity</b>: the granularity at which the sparsification operates (currently supported: <code>weight</code>, <code>filter</code>)</li>
<li style="font-size:15px"><b>method</b>: either <code>local</code> or <code>global</code>; the parameters to remove are selected in each layer independently (<code>local</code>) or across the whole network (<code>global</code>).</li>
<li style="font-size:15px"><b>criteria</b>: the criteria used to select which parameters to remove (currently supported: <code>l1</code>, <code>taylor</code>)</li>
<li style="font-size:15px"><b>sched_func</b>: which schedule you want to follow for the sparsification (currently supported: <a href="https://docs.fast.ai/callback.html#Annealing-functions">any scheduling function of fastai</a>, e.g. <code>annealing_linear</code>, <code>annealing_cos</code>, ... and <code>annealing_gradual</code>, the schedule proposed by <a href="https://openreview.net/pdf?id=Sy1iIDkPM">Zhu & Gupta</a>{% fn 4 %}, shown in the figure below)</li>
</i></ul>
</blockquote>

<img src="images/pruning/gradual.png" width="500">
<br>
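To give an intuition of how such schedules behave, here is a small sketch of the cubic schedule of Zhu & Gupta next to a cosine one. The function names and the `progress` argument (the fraction of the pruning phase already completed, between 0 and 1) are hypothetical and only serve the illustration; they are not the signatures used by fastai or FasterAI:

```python
import math

def gradual_sparsity(progress, final_sparsity, initial_sparsity=0.0):
    """Cubic schedule of Zhu & Gupta: prune fast at first, then more and more slowly."""
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

def cosine_sparsity(progress, final_sparsity, initial_sparsity=0.0):
    """Cosine schedule: slow start, faster middle, slow end."""
    span = final_sparsity - initial_sparsity
    return initial_sparsity + span * (1 - math.cos(math.pi * progress)) / 2

# Compare both schedules for a target sparsity of 40%
for step in range(0, 11, 2):
    progress = step / 10
    print(f"{progress:.1f}  gradual={gradual_sparsity(progress, 40):5.2f}%  "
          f"cosine={cosine_sparsity(progress, 40):5.2f}%")
```

Both schedules start at the initial sparsity and end at the target; they only differ in how quickly they get there.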

Although I found that **Automated Gradual Pruning** usually works best, you may want to use the other paradigms. They can easily be achieved by doing:

> **One-Shot Pruning**
> ```python
> sparsifier = Sparsifier(granularity, method, criteria)
> new_model = sparsifier.prune(learn.model, sparsity)
> ```

<p style="font-size: 15px">
To perform <b>One-Shot Pruning</b>, you can simply prune your model to the desired sparsity in a single step. This is probably highly suboptimal, as removing that many parameters at once will shake up the model and hurt it quite a bit.
</p>
<br>

> **Iterative Pruning**
> ```python
> new_model = sparsifier.prune(learn.model, sparsity)
> learn = Learner(data, new_model)
> learn.fit(num_epochs, lr, callbacks=[SparsifyCallback(learn, sparsity, granularity, method, criteria, sched_func=annealing_no)])
> sparsity += increase_value
> # REPEAT
> ```

<p style="font-size: 15px">
    To perform <b>Iterative Pruning</b>, we first need to train our model, then perform several iterations of pruning and fine-tuning until the desired sparsity is reached. Fine-tuning has to be done with <code>SparsifyCallback</code> and the <code>annealing_no</code> schedule to ensure that our zeroed weights don't get updated.
</p>

<br>

**But let's come back to our example!**

In [134]:
#hide
from fasterai.sparsifier import *

Here, we will make our network **40%** sparse by zeroing entire **filters**, selected **locally** and based on their **L1 norm**. We will train with a slightly smaller learning rate to be gentle with our network, because it has already been trained. The selected **schedule** is a cosine one, so the pruning starts and ends quite slowly.

In [8]:
student.fit(10, 1e-5, callbacks=[SparsifyCallback(student, sparsity=40, granularity='filter', method='local', criteria='l1', sched_func=annealing_cos)])

Pruning of filter until a sparsity of 40%


epoch,train_loss,valid_loss,accuracy,time
0,0.584072,0.532074,0.838471,01:34
1,0.583805,0.499353,0.844586,01:34
2,0.59941,0.527805,0.836433,01:34
3,0.610081,0.544566,0.828025,01:35
4,0.625637,0.543279,0.829809,01:34
5,0.628777,0.563051,0.819618,01:34
6,0.688617,0.617627,0.8,01:34
7,0.691044,0.629927,0.801019,01:34
8,0.669935,0.57622,0.814013,01:33
9,0.682428,0.562718,0.823949,01:34


Sparsity at epoch 0: 0.98%
Sparsity at epoch 1: 3.83%
Sparsity at epoch 2: 8.25%
Sparsity at epoch 3: 13.83%
Sparsity at epoch 4: 20.01%
Sparsity at epoch 5: 26.19%
Sparsity at epoch 6: 31.76%
Sparsity at epoch 7: 36.19%
Sparsity at epoch 8: 39.02%
Sparsity at epoch 9: 40.00%
Final Sparsity: 40.00


Our network now has **40%** of its filters composed entirely of zeroes, at the cost of about **2 points** of accuracy (from **84.3%** to **82.4%**). Obviously, choosing a higher sparsity makes it more difficult for the network to keep a similar accuracy. Other parameters can also widely change the behaviour of our sparsification process. For example, choosing a more fine-grained sparsity usually leads to better results but is then more difficult to take advantage of in terms of speed.
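To make the combination "entire filters, selected locally, based on L1 norm" more concrete, here is a minimal NumPy sketch of what happens to the weights of a single convolution layer. The function name is hypothetical and rounding the number of filters down is an assumption; the real callback also handles the schedule and the training loop, which are left out here:

```python
import numpy as np

def zero_filters_l1(weight, sparsity):
    """Zero the `sparsity` percent of filters with the smallest L1 norm.

    `weight` has shape (out_channels, in_channels, kh, kw), as in a Conv2d layer.
    """
    n_filters = weight.shape[0]
    n_to_zero = int(n_filters * sparsity / 100)           # assumed rounding: down
    l1_norms = np.abs(weight).reshape(n_filters, -1).sum(axis=1)
    weakest = np.argsort(l1_norms)[:n_to_zero]            # least important filters
    pruned = weight.copy()
    pruned[weakest] = 0.0                                  # the filters stay in place, but zeroed
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 3, 3, 3))                         # toy layer with 10 filters
w_sparse = zero_filters_l1(w, sparsity=40)
zeroed = [i for i in range(10) if not w_sparse[i].any()]
print(f"{len(zeroed)} of 10 filters are now entirely zero")
```

Selecting globally instead of locally would mean ranking the filters of all layers together, so some layers could end up sparser than others.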

We can double-check that about **40%** of the weights in each convolutional layer are indeed zero.

In [9]:
#hide_input
print_sparsity(student.model)

Sparsity in Conv2d 2: 39.06%
Sparsity in Conv2d 5: 39.06%
Sparsity in Conv2d 9: 39.84%
Sparsity in Conv2d 12: 39.84%
Sparsity in Conv2d 16: 39.84%
Sparsity in Conv2d 19: 39.84%
Sparsity in Conv2d 22: 39.84%
Sparsity in Conv2d 26: 39.84%
Sparsity in Conv2d 29: 39.84%
Sparsity in Conv2d 32: 39.84%
Sparsity in Conv2d 36: 39.84%
Sparsity in Conv2d 39: 39.84%
Sparsity in Conv2d 42: 39.84%


We don't have **exactly 40%** because, as we zero out complete filters, we don't necessarily get a round number.
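The figures printed above are consistent with a simple piece of arithmetic, assuming that the number of filters to zero in each layer is rounded down (an assumption inferred from the output rather than from the library code). The helper below is purely illustrative:

```python
def achieved_sparsity(n_filters, target=0.40):
    """Sparsity (in %) actually reached when only whole filters can be zeroed."""
    n_zeroed = int(n_filters * target)      # assumed rounding: down to a whole filter
    return 100 * n_zeroed / n_filters

# The convolution layers of VGG16 have 64, 128, 256 or 512 filters
for n_filters in (64, 128, 256, 512):
    print(f"{n_filters} filters -> {achieved_sparsity(n_filters):.2f}%")
```

With 64 filters, 40% would be 25.6 filters, so only 25 are zeroed, which gives 39.06%; the wider layers land on 39.84%.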

<br>

Let's now see how much we gained in terms of speed. Because we zeroed out **40%** of the convolution filters, we should expect a crazy speed-up, right?

In [11]:
#hide
model = student.model.eval()

In [13]:
#hide_input
%%timeit
model(x[0][None].cuda())

4.02 ms ± 5.77 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Well, actually, no. We didn't remove any parameters, we just replaced some of them with zeroes, remember? The number of parameters is still the same:

In [122]:
#hide_input
count_parameters(model)

Total parameters : 134,309,962


Which leads us to the next section.

<br>

---

<br>

## **Pruning**

> Important: This is currently only supported for fully feedforward models such as VGG-like models, as more complex architectures require increasingly difficult and usually model-dependent implementations.

<br>

Why don't we see any acceleration even though we zeroed out 40% of the filters? That's because, natively, our **GPU** does not know that our matrices are sparse and thus isn't able to accelerate the computation. The easiest workaround is to **physically** remove the parameters we zeroed out. But this operation requires changing the architecture of the network.

This pruning only works if we have zeroed out entire filters beforehand, as it is the only case where you can change the architecture accordingly. Hopefully, sparse computations will [soon be available](https://pytorch.org/docs/stable/sparse.html) in common deep learning libraries, so this section will become useless in the future, but for the moment it is the best solution I could come up with 🤷
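The principle can be sketched with two consecutive layers in NumPy. To keep it short, the sketch writes 1x1 convolutions as plain matrices and ignores biases and batch norm, which a real implementation also has to handle; the function name is hypothetical:

```python
import numpy as np

def remove_zero_filters(w1, w2):
    """Drop the all-zero filters of `w1` and the matching input channels of `w2`.

    Both weights have shape (out_channels, in_channels).
    """
    keep = np.abs(w1).sum(axis=1) > 0       # filters that still carry weights
    return w1[keep], w2[:, keep]

rng = np.random.default_rng(0)
w1 = rng.normal(size=(8, 4))
w2 = rng.normal(size=(5, 8))
w1[[1, 3, 6]] = 0.0                         # pretend these filters were sparsified

x = rng.normal(size=4)
dense_out = w2 @ np.maximum(w1 @ x, 0)      # ReLU between the two layers

p1, p2 = remove_zero_filters(w1, w2)
pruned_out = p2 @ np.maximum(p1 @ x, 0)

print(w1.shape, "->", p1.shape)             # (8, 4) -> (5, 4)
print(np.allclose(dense_out, pruned_out))   # True: the output is unchanged
```

Removing a filter in one layer also removes an input channel in the next one, which is why the architecture has to change and why the parameter count drops in both layers.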

<br>

Here is what it looks like with fasterai:
<br>

In [2]:
#hide_input
#from IPython.display import SVG, display
#display(SVG(data='images/fasterai/filter_pruning.svg'))

<br>

<blockquote>
<pre><b><i>pruner = Pruner()
pruned_model = pruner.prune_model(learn.model)</i></b></pre>
<p style="font-size: 15px"><i>
You just need to pass the model whose filters have previously been sparsified and FasterAI will take care of removing them.
</i></p>
</blockquote>

> Note: This operation should be lossless as it only removes filters that already do not participate in the network anymore.

<br>

In [14]:
#hide
from fasterai.pruner import *

So in the case of our example, it gives: 

In [None]:
pruner = Pruner()
pruned_model = pruner.prune_model(student.model)

Let's see what our model is capable of now:

In [17]:
#hide
model = pruned_model.eval()

In [25]:
#hide_input
count_parameters(model)

Total parameters : 83,975,344


And in terms of speed:

In [19]:
#hide_input
%%timeit
model(x[0][None].cuda())

2.44 ms ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


<br>

Yay! Now we're talking! Let's just double-check that our accuracy is unchanged and that we didn't mess up somewhere:

In [29]:
#hide_input
pruned_learner = Learner(data, pruned_model, metrics=[accuracy])
pruned_learner.validate()

[0.5641388, tensor(0.8229)]

<br>

And there is actually more that we can do ! Let's keep going ! 

<br>

---

<br>

## **Batch Normalization Folding**

**Batch Normalization Folding** is a straightforward idea that is really easy to implement. The gist is that batch normalization is nothing more than a normalization of the input data at each layer. Moreover, at inference time, the batch statistics used for this normalization are fixed. We can thus incorporate the normalization directly into the convolution by changing its weights, and completely remove the batch normalization layers, which is a gain both in terms of parameters and in terms of computation. For a more in-depth explanation, see my [previous post](https://nathanhubens.github.io/posts/deep%20learning/2020/04/20/BN.html).
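Here is a minimal NumPy sketch of the folding itself, under the assumption of a layer written as a matrix product (a 1x1 convolution, for instance) followed by batch norm in inference mode. The function and variable names are hypothetical and not those of FasterAI:

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Return the weight and bias of a single layer equivalent to layer + batch norm."""
    scale = gamma / np.sqrt(running_var + eps)       # one factor per output channel
    folded_weight = weight * scale[:, None]
    folded_bias = (bias - running_mean) * scale + beta
    return folded_weight, folded_bias

rng = np.random.default_rng(0)
w, b = rng.normal(size=(6, 4)), rng.normal(size=6)
gamma, beta = rng.normal(size=6), rng.normal(size=6)
mean, var = rng.normal(size=6), rng.uniform(0.5, 2.0, size=6)
x = rng.normal(size=4)

# Layer followed by batch norm, using the fixed inference statistics
reference = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta

fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var)
print(np.allclose(reference, fw @ x + fb))           # True: same output, one layer less
```

The two learnable batch norm parameters per channel disappear, and so does the extra normalization step at inference time.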

This is how to use it with FasterAI:

<br>

<blockquote>
<pre><b><i>bn_folder = BN_Folder()
bn_folder.fold(learn.model)</i></b></pre>
<p style="font-size: 15px"><i>
Again, you only need to pass your model and FasterAI takes care of the rest. For models built using <code>nn.Sequential</code>, you don't need to change anything. For others, if you want to see speedup and compression, you actually need to subclass your model to remove the batch norm layers from the parameters and from the <code>forward</code> method of your network.
</i></p>
</blockquote>

> Note: This operation should also be lossless as it redefines the convolution to take batch norm into account and is thus equivalent.

<br>

In [23]:
#hide
from fasterai.bn_folder import *

Let's do this with our model ! 

In [24]:
folded_model = bn_folding_model(pruned_learner.model)

The drop in parameters is generally not that significant, especially in a network such as **VGG** where almost all parameters are contained in the FC layers but, hey, every little bit helps.

In [26]:
#hide_input
count_parameters(folded_model)

Total parameters : 83,970,260


<br>

Now that we removed the batch normalization layers, we should again see a speedup.

In [None]:
#hide
folded_model = folded_model.eval()

In [27]:
#hide_input
%%timeit
folded_model(x[0][None].cuda())

2.27 ms ± 1.22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Again, let's double check that we didn't mess up somewhere:

In [29]:
#hide_input
folded_learner = Learner(data, folded_model, metrics=[accuracy])
folded_learner.validate()

[0.5641388, tensor(0.8229)]

<br>

And we're still not done yet! As we know, for **VGG16** most of the parameters are in the fully-connected layers, so there should be something that we can do about it, right?

<br>

---

<br>

## **FC Layers Factorization**

We can indeed factorize our big fully-connected layers and replace each of them with two smaller layers that approximate it. The idea is to compute the singular value decomposition (**SVD**) of the weight matrix, which expresses the original matrix as a product of 3 matrices: $U \Sigma V^T$, with $\Sigma$ being a diagonal matrix with non-negative values along its diagonal (the singular values). We then choose a number $k$ of singular values to keep and truncate the matrices $U$ and $V^T$ accordingly. The resulting product is an approximation of the initial matrix.

<img src="images/fasterai/svd.png" width="600">
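As an illustration, the truncated SVD of a single weight matrix can be sketched with NumPy as follows. The function name and the way $\Sigma$ is merged into the first factor are choices made for this example, not necessarily what FasterAI does internally:

```python
import numpy as np

def factorize_fc(weight, percent_removed):
    """Split an (out_features, in_features) weight into two thinner matrices."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = int(len(s) * (1 - percent_removed))     # number of singular values kept
    first = np.diag(s[:k]) @ vt[:k]             # (k, in_features), applied to the input first
    second = u[:, :k]                           # (out_features, k), applied afterwards
    return first, second

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))                  # toy fully-connected layer
first, second = factorize_fc(w, percent_removed=0.5)

print(w.size, "->", first.size + second.size)   # 8192 -> 6144
print(np.linalg.norm(w - second @ first))       # approximation error, non-zero
```

Note that for an $m \times n$ layer the two factors hold $k(m+n)$ parameters, so the factorization only saves parameters when $k < \frac{mn}{m+n}$. For a square layer, removing half of the singular values leaves the parameter count unchanged; the gain comes from the strongly rectangular layers.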

In FasterAI, to decompose the fully-connected layers of your model, here is what you need to do:
<br>

<br>

<blockquote>
<pre><b><i>FCD = FCDecomposer()
decomposed_model = FCD.decompose(model, percent_removed)</i></b></pre>
<p style="font-size: 15px"><i>
    The <code>percent_removed</code> argument corresponds to the proportion of singular values removed (which determines the <i>k</i> value above).
</i></p>
</blockquote>

> Note: This time, the decomposition is not exact, so we expect a drop in performance afterwards and further retraining will be needed.

<br>

In our example, if we only want to keep half of the singular values, this gives:

In [None]:
#hide
from fasterai.fc_decomposer import *

In [112]:
fc_decomposer = FCDecomposer()
decomposed_model = fc_decomposer.decompose(folded_model, percent_removed=0.5)

How many parameters do we have now ?

In [113]:
#hide_input
count_parameters(decomposed_model)

Total parameters : 61,430,022


And how much time did we gain ? 

In [None]:
#hide
decomposed_model = decomposed_model.eval()

In [114]:
#hide_input
%%timeit
decomposed_model(x[0][None].cuda())

2.11 ms ± 462 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)


<br>

However, this technique is an approximation and thus not lossless, so we should retrain our network a bit to recover its performance.

In [117]:
final_learner = Learner(data, decomposed_model, metrics=[accuracy])
final_learner.fit_one_cycle(5, 1e-5)

epoch,train_loss,valid_loss,accuracy,time
0,0.795416,0.759886,0.772994,00:51
1,0.752566,0.701141,0.794395,00:52
2,0.700373,0.650178,0.804841,00:51
3,0.604264,0.606801,0.821656,00:51
4,0.545705,0.592318,0.823185,00:52


This operation is usually less useful for more recent architectures, as they typically do not have that many parameters in their fully-connected layers.

<br>

---

<br>

So to recap, we saw in this article how to use fasterai to:
1. Make a student model learn from a teacher model (**Knowledge Distillation**)
2. Make our network sparse (**Sparsifying**)
3. Optionally, physically remove the zeroed filters (**Pruning**)
4. Remove the batch norm layers (**Batch Normalization Folding**)
5. Approximate our big fully-connected layers by smaller ones (**Fully-Connected Layers Factorization**)



<br>

And we saw that by applying those, we could reduce our **VGG16** model from **134 million** parameters down to **61 million**, and also speed up inference from **4.03ms** to **2.11ms**, without any drop in accuracy compared to the baseline (even a slight increase actually, from **81.1%** to **82.3%**).

Of course, those techniques can be used in conjunction with [quantization](https://pytorch.org/docs/stable/quantization.html) or [mixed-precision training](https://pytorch.org/docs/stable/notes/amp_examples.html), which are already available in PyTorch, for even more compression and speedup.

<br>

> Note: Please keep in mind that the techniques presented above are not magic 🧙‍♂️, so do not expect to see a 200% speedup and compression every time. What you can achieve depends heavily on the architecture that you are using (some are already speed/parameter efficient by design) and on the task it is doing (some datasets are so easy that you can remove almost all of your network without seeing a drop in performance).

<br>

**That's all! Thank you for reading, I hope that you'll like FasterAI. I do not claim that it is perfect, and you'll probably find a lot of bugs. If you do, please just tell me, so I can try to solve them 😌**

<br>

---

<br>

<p style="font-size: 15px"><i>If you notice any mistake or possible improvement, please contact me! If you found this post useful, please consider citing it as:</i></p>

```
@article{hubens2020fasterai,
  title   = "FasterAI",
  author  = "Hubens, Nathan",
  journal = "nathanhubens.github.io",
  year    = "2020",
  url     = "https://nathanhubens.github.io/posts/deep%20learning/2020/08/17/FasterAI.html"
}
```

## **References**

- {{'[Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006](https://www.cs.cornell.edu/~caruana/compression.kdd06.pdf)' | fndetail: 1}}
- {{'[Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le: Self-training with Noisy Student improves ImageNet classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020](https://arxiv.org/abs/1911.04252)' | fndetail: 2}}
- {{'[H. Li, "Exploring knowledge distillation of Deep neural nets for efficient hardware solutions," CS230 Report, 2018](http://cs230.stanford.edu/files_winter_2018/projects/6940224.pdf)' | fndetail: 3}}
- {{'[Zhu, M. & Gupta, S. (2017). To prune, or not to prune: exploring the efficacy of pruning for model compression. ICLR, 2018](https://openreview.net/pdf?id=Sy1iIDkPM)' | fndetail: 4}}