# Efficient Deep Neural Networks
AI models are becoming remarkably capable, but model size has also been growing exponentially in recent years. Model size is essentially the number of parameters in a model, but another important concept is memory access, which is far more expensive than computation. For example, a multiplication requires about 3.7 picojoules, whereas a DRAM access requires about 640 picojoules, roughly 170 times more. Hence, more data movement means more memory references, which leads to higher energy consumption. Ultimately, data movement drains the battery of our mobile devices, so we want to keep it as small as possible.

To do that, we can reduce the model size, the activation size, and the workload. Alternatively, from a systems perspective, we can build more efficient hardware or better compilers with scheduling policies that encourage locality and thereby reduce data movement. Here, we focus on the first approach, which takes an algorithmic perspective to fundamentally reduce the need for data movement. It generally includes pruning, quantization, knowledge distillation, and, more recently, NAS (Neural Architecture Search).

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/ai_is_too_big.png?raw=true' width=650 >


# Pruning
Pruning happens naturally in the human brain. A newborn has about 2,500 synapses per neuron, and by the time the child becomes a toddler this number surges to about 15,000 synapses per neuron. During adolescence, however, the number stops increasing and starts to decrease, until adulthood, where we have roughly 7,000 synapses per neuron. That is still more than a newborn has, but only about half as many as a toddler.
So, pruning happens naturally as the brain develops: important connections are kept and unimportant synapses are pruned away.

Inspired by that, we can make neural networks smaller by removing some of the connections and some of the neurons. But, how exactly are we going to do pruning? 

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/pruning_inspiration.png?raw=true' width=650 >

### Pruning Formulation
In general, when training a neural network, we minimize the loss function with an optimizer such as Stochastic Gradient Descent (SGD). With pruning, a constraint is added to this optimization that limits the number of non-zero parameters to be smaller than a threshold. In other words, the objective is to find the weights that minimize the loss subject to a budget on the number of parameters.
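
Written out, the constrained objective could look like the following. The notation is an assumption made here for illustration: $x$ is the input, $W_P$ the pruned weights, $\lVert W_P \rVert_0$ the number of non-zero weights, and $N$ the parameter budget.

$$
\arg\min_{W_P} \; L(x; W_P) \quad \text{subject to} \quad \lVert W_P \rVert_0 < N
$$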

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/pruning_formulation.png?raw=true' width=400 >


### Pruning Granularity
*Fine-grained/Unstructured Pruning*: We start with a dense neural network and prune away individual weights with no pattern at all. This is very flexible, since we can prune whichever weights we want, but it is hard to accelerate on GPUs because the weight matrix left after unstructured pruning is very irregular. To store it compactly, we have to store not only the remaining weights but also their positions; otherwise, we have to store the whole weight matrix even though most of its entries are 0. Recently, specialized hardware such as the Efficient Inference Engine (EIE), which operates directly on this kind of sparse matrix, can accelerate the computation, but the benefit is still limited. If the goal is just to reduce the model size without computational acceleration, this is probably a very attractive approach.
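
To make the "weights plus positions" bookkeeping concrete, here is a minimal sketch using the compressed sparse row (CSR) layout, which is one common sparse format and not necessarily what any particular accelerator uses. The matrix values and the helper names `to_csr` and `csr_matvec` are made up for illustration.

```python
import numpy as np

def to_csr(dense):
    """Return (values, col_indices, row_ptr) for a 2-D array in CSR form."""
    values, col_indices, row_ptr = [], [], [0]
    for row in dense:
        nz = np.flatnonzero(row)          # column positions of non-zero weights
        values.extend(row[nz].tolist())
        col_indices.extend(nz.tolist())
        row_ptr.append(len(values))       # offset where the next row starts
    return (np.array(values, dtype=float),
            np.array(col_indices, dtype=int),
            np.array(row_ptr, dtype=int))

def csr_matvec(values, col_indices, row_ptr, x):
    """Multiply the CSR matrix by a vector, touching only stored weights."""
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        y[r] = np.dot(values[start:end], x[col_indices[start:end]])
    return y

# A small weight matrix after unstructured pruning (zeros are pruned weights).
W = np.array([
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.4],
    [0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
values, col_indices, row_ptr = to_csr(W)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = csr_matvec(values, col_indices, row_ptr, x)   # same result as W @ x
```

Only three weights are stored, but so are three column indices and five row offsets, and the inner loop reads `x` at irregular positions. That indirection is why this layout saves memory more readily than it saves time on dense-oriented hardware.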

*Coarse-grained/Structured Pruning*: As in unstructured pruning, we start with a dense neural network, but here we prune away entire rows of a weight matrix. For example, we can prune the third and fourth rows completely and condense a weight matrix that originally had eight rows into a smaller dense matrix. That is why structured pruning reduces the actual computation. However, it is less flexible, since we have to prune entire rows, so accuracy degrades more than with a fine-grained pruning method.
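
A minimal sketch of row pruning, assuming a random 8x4 weight matrix (the sizes and values are hypothetical). Removing two of the eight rows leaves an ordinary dense six-row matrix, so no index bookkeeping is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))      # 8 output neurons, 4 inputs (hypothetical sizes)
x = rng.normal(size=4)

pruned_rows = [2, 3]             # third and fourth rows, zero-indexed
keep = np.setdiff1d(np.arange(W.shape[0]), pruned_rows)

W_small = W[keep]                # dense (6, 4) matrix
y_small = W_small @ x            # fewer multiply-accumulates than W @ x
```

The surviving outputs are unchanged; the pruned outputs simply no longer exist, which is what makes the speed-up real on ordinary hardware.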

In CNNs, weights have four dimensions, which gives us many choices of pruning granularity. In this example, k<sub>w</sub> and k<sub>h</sub> are both three, c<sub>i</sub> is two (two input channels), and c<sub>o</sub> is three (three output channels).
This gives a full spectrum of pruning granularities. We can prune individual weights, prune according to patterns (like Tetris pieces), or prune at the vector level. At the more extreme end, we can do kernel-level or even channel-level pruning. However, there is no single best pruning approach; it depends entirely on the target. If we are after an extreme compression ratio, unstructured pruning is the natural choice, while if we just want acceleration on a CPU, channel pruning is the best approach. As a special case of structured, pattern-based pruning, around 2020 NVIDIA started to support up to 2x acceleration for 2:4 (two-to-four) sparsity. In this pattern, for every 4 contiguous elements, **at least** 2 must be 0.
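
The 2:4 constraint can be illustrated with a short sketch. The rule used here for choosing which two entries to zero (the two with the smallest magnitude in each group) is an assumption for illustration, and `prune_2_of_4` is a hypothetical helper, not a vendor API.

```python
import numpy as np

def prune_2_of_4(weights):
    """Zero the two smallest-magnitude entries in every group of four."""
    w = np.asarray(weights, dtype=float)
    if w.size % 4 != 0:
        raise ValueError("this sketch assumes the size is a multiple of 4")
    groups = w.reshape(-1, 4).copy()
    # Positions of the two smallest |w| within each group of four.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.array([0.1, -0.8, 0.3, 0.05, 1.2, -0.2, 0.0, 0.7])
w_24 = prune_2_of_4(w)
# Each group of four now holds at most two non-zero values:
# [0, -0.8, 0.3, 0] and [1.2, 0, 0, 0.7]
```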

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/sparsity_granularities.png?raw=true' width=700 >



### Pruning Criterion

*Magnitude-based Pruning*: This is the simplest yet an effective pruning criterion. It assumes that a weight with a larger magnitude is more important than a weight with a smaller magnitude, because the inputs each weight receives are more or less similar in scale due to normalization. Imagine a weight matrix with only four parameters. To measure magnitude element-wise, we can use the L<sub>1</sub> norm, which for a single weight is simply its absolute value. To do the same for structured, row-wise pruning, we compute the norm over the entire row. Besides the L<sub>1</sub> norm, we can also use the L<sub>2</sub> norm or even a general L<sub>p</sub> norm, yet the first two are the most popular criteria for magnitude-based pruning.
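
As a small worked example, the sketch below applies both the element-wise and the row-wise criterion to a four-parameter matrix. The weight values and the 50% pruning ratio are chosen here purely for illustration.

```python
import numpy as np

W = np.array([
    [3.0, -2.0],
    [1.0, -5.0],
])

# Element-wise importance: the absolute value of each weight.
importance = np.abs(W)

# Unstructured pruning: keep the top 50% of weights by magnitude.
k = W.size // 2
threshold = np.sort(importance, axis=None)[-k]
W_unstructured = np.where(importance >= threshold, W, 0.0)

# Row-wise importance: L1 and L2 norm of each row.
row_l1 = np.abs(W).sum(axis=1)            # [5., 6.]
row_l2 = np.sqrt((W ** 2).sum(axis=1))    # [sqrt(13), sqrt(26)]

# Structured pruning: keep only the row with the larger L1 norm.
keep_row = int(np.argmax(row_l1))
W_structured = W[[keep_row]]
```

Element-wise pruning keeps 3 and -5, one from each row, while row-wise pruning keeps the whole second row. In this example the L1 and L2 norms agree on which row matters more, though in general they need not.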

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/magnitude_pruning.png?raw=true' width=700 >

*Scaling-based Pruning*: In this approach, we multiply every element of each filter by a trainable *scaling factor* before the convolution, and then prune the channels whose scaling factor is small. For example, we can reuse the gamma (γ) parameter of a batch normalization layer as the scaling factor and add a regularization term, such as an L<sub>1</sub> or L<sub>2</sub> norm, on it. This way, we add no extra layers or overhead to the model, and we can decide which channels to keep or prune simply by looking at the learned batch normalization parameters.
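
A rough sketch of the selection step, assuming the γ values have already been learned with a sparsity penalty. The γ values, the penalty strength, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical learned batch-norm scale factors, one per output channel.
gamma = np.array([0.91, 0.02, 0.47, 0.001, 0.63, 0.08])

def l1_penalty(gamma, strength=1e-4):
    """Sparsity term added to the training loss to push gammas toward zero."""
    return strength * np.abs(gamma).sum()

def channels_to_keep(gamma, prune_ratio):
    """Return the indices of the channels with the largest |gamma|."""
    n_prune = int(round(prune_ratio * gamma.size))
    order = np.argsort(np.abs(gamma))      # ascending: smallest first
    return np.sort(order[n_prune:])

keep = channels_to_keep(gamma, prune_ratio=0.5)   # channels 0, 2 and 4 survive
```

In a real network, the kept indices would then be used to slice the preceding convolution's filters and the next layer's input channels.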

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/scaling_pruning.png?raw=true' width=400 >

*Second Order Pruning*: Instead of looking at the weights, we can take the point of view of the loss and try to minimize the error that pruning introduces in the loss function. This error can be approximated by a **Taylor series**. Originally, the loss $L$ depends on the weights $w$ and the input $x$. Pruning changes the weights by some δ, since some of them are set to zero; in other words, we perturb the weights. Expanding the loss around $w$, a perturbation δ gives a first-order term, a second-order term, a cross term, and third- and higher-order terms. Second-order pruning assumes that the higher-order terms are small enough to neglect. The first-order term is also neglected, because after training has converged the gradient is close to zero. Finally, assuming that the errors caused by pruning different parameters are independent, the cross term is neglected as well. Hence, we approximate the change in the loss due to pruning using only the second-order derivatives. However, this is computationally heavy, since we have to compute second-order derivatives, namely the diagonal of the Hessian matrix.
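
The expansion described above can be written out as follows. The symbols are assumptions introduced here: $\delta w_i$ is the change in weight $w_i$, $g_i = \partial L / \partial w_i$ is the gradient, and $h_{ij} = \partial^2 L / \partial w_i \partial w_j$ is an entry of the Hessian.

$$
\delta L = L(x; W + \delta W) - L(x; W) = \sum_i g_i \, \delta w_i + \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \, \delta w_i \, \delta w_j + O(\lVert \delta W \rVert^3)
$$

Pruning weight $w_i$ corresponds to $\delta w_i = -w_i$. After dropping the first-order, cross, and higher-order terms, the estimated loss change for that weight, which serves as its importance score, is

$$
\delta L_i \approx \frac{1}{2} h_{ii} \, w_i^2
$$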

*First Order Pruning*: Similar to second-order pruning, this criterion approximates the change in the loss due to pruning, but uses only the first-order derivative, under an independent and identically distributed (i.i.d.) assumption, which lets us derive a practical importance score from gradients alone.
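
As a rough sketch, a first-order score can be computed from a weight and its gradient alone. The toy least-squares loss, the random data, and the choice to square the estimated change into a sign-free score are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))          # toy inputs (hypothetical data)
y = rng.normal(size=32)               # toy targets
w = rng.normal(size=5)                # current, not fully converged, weights

# Gradient of the toy loss 0.5 * mean((X @ w - y) ** 2) for each weight.
g = X.T @ (X @ w - y) / len(y)

# Setting w_i to zero changes it by -w_i, so the first-order estimate of the
# resulting loss change is -g_i * w_i.
first_order_change = -g * w

# Squaring gives a sign-free importance score per weight.
importance = first_order_change ** 2
prune_first = np.argsort(importance)  # least important weight comes first
```

Unlike the second-order criterion, this needs no Hessian, but it is only informative while the gradients are not yet close to zero.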

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/taylor_approx_pruning.png?raw=true' width=650 >

*Regression Based Pruning*: Instead of the pruning error in the objective function, regression-based pruning tries to minimize the reconstruction error of a layer's output. Imagine we prune away one row of the weight matrix and the corresponding column of the input matrix: we still obtain an output matrix of the same shape. We want the difference between the pruned output and the original output to be minimal. This is a local optimization, since it only considers the output of each layer rather than the whole neural net. We can solve the problem iteratively: first, fix $w$ and select the channels; then, fix the channels and solve for $w$ to minimize the reconstruction error. In short, the regression-based approach minimizes the difference in activations layer by layer.
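
The two alternating steps can be sketched for a single linear layer with output $Z = XW$. The channel-selection rule below (keep the channels with the largest contribution norm) is a simple stand-in chosen for this sketch, not the selection method of any specific paper, and the shapes and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6))   # layer input: 64 samples, 6 input channels
W = rng.normal(size=(6, 4))    # layer weights: 6 input channels, 4 outputs
Z = X @ W                      # original output we want to reconstruct

def select_channels(X, W, n_keep):
    """Step 1 (W fixed): keep the channels that contribute most to Z."""
    # Channel c contributes the outer product of X[:, c] and W[c, :],
    # whose Frobenius norm is the product of the two vector norms.
    energy = np.linalg.norm(X, axis=0) * np.linalg.norm(W, axis=1)
    return np.sort(np.argsort(energy)[-n_keep:])

def refit_weights(X, Z, keep):
    """Step 2 (channels fixed): least-squares W minimizing ||Z - X_keep W||."""
    W_new, *_ = np.linalg.lstsq(X[:, keep], Z, rcond=None)
    return W_new

keep = select_channels(X, W, n_keep=4)
err_before = np.linalg.norm(Z - X[:, keep] @ W[keep])    # just drop channels
W_refit = refit_weights(X, Z, keep)
err_after = np.linalg.norm(Z - X[:, keep] @ W_refit)     # drop, then refit
```

Because the refit is a least-squares solution over the kept channels, `err_after` can never exceed `err_before`, which is the point of solving for the weights again after selection.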

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/regression_pruning.png?raw=true' width=500 >


### Pruning Ratio
When pruning a dense network, we often remove connections uniformly, pruning every layer to the same degree. However, we can also prune non-uniformly, removing a different share of connections in each layer. The assumption behind this approach is that layers differ in their sensitivity to sparsity. Some layers are very sensitive, and pruning even a few connections drops the accuracy drastically. On the other hand, some layers are very insensitive, and pruning them by 95% to 99% hardly affects the accuracy. Indeed, deeper layers usually contain more redundancy and can be pruned more aggressively. Moreover, if a neural net has many Fully Connected (FC) layers, those FC layers can be pruned aggressively, while the earlier layers are usually harder to prune.

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/pruning_rate.png?raw=true' width=600 >

The question, then, is how to find the ratio for each layer. One answer is to analyze the sensitivity of each layer: we prune a single layer a little and observe how much the accuracy drops, then prune it a bit more and check the accuracy again. Repeating this process for every layer yields a sensitivity analysis. Finally, we pick an accuracy threshold that cuts across the sensitivity curves, and the intersections give the sparsity to use for each layer. For example, the blue layer is quite sensitive, and the threshold intersects it at around a 75% pruning ratio. One downside of this approach is that it ignores the interaction between layers by assuming that each layer is independent. Nevertheless, in real-world design it is an easy, widely used, and robust heuristic.
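
The thresholding step can be sketched as follows. The layer names and accuracy numbers are invented for illustration and are not measurements; `pick_ratios` is a hypothetical helper.

```python
def pick_ratios(sensitivity, min_accuracy):
    """For each layer, pick the largest pruning ratio whose accuracy stays
    at or above `min_accuracy`. `sensitivity` maps a layer name to a list of
    (pruning_ratio, accuracy) pairs measured with only that layer pruned."""
    ratios = {}
    for layer, curve in sensitivity.items():
        ok = [ratio for ratio, acc in curve if acc >= min_accuracy]
        ratios[layer] = max(ok, default=0.0)
    return ratios

# Hypothetical sensitivity curves, not real results.
sensitivity = {
    "conv1": [(0.0, 0.92), (0.3, 0.90), (0.6, 0.81), (0.9, 0.40)],
    "conv5": [(0.0, 0.92), (0.3, 0.92), (0.6, 0.91), (0.9, 0.88)],
    "fc1":   [(0.0, 0.92), (0.3, 0.92), (0.6, 0.92), (0.9, 0.91)],
}
ratios = pick_ratios(sensitivity, min_accuracy=0.90)
# The sensitive first layer gets a small ratio, the FC layer a large one.
```

Each curve is measured with only one layer pruned, so the chosen ratios inherit the independence assumption mentioned above.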

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/automl_pruning.png?raw=true' width=600 >

Can we go beyond this heuristic? Yes! Instead of manual engineering, we can let the machine decide the per-layer pruning ratios with AutoML. For example, AMC (AutoML for Model Compression) uses a Reinforcement Learning (RL) agent to do the job. The state of the agent consists of 11 features, such as the number of channels, kernel size, and FLOPs. The action the agent takes is a continuous pruning ratio *α* ∈ [0, 1]. The reward is simply the negative of the error rate, although model size, latency, etc. can be added to the reward if we want to penalize those attributes as well. Finally, DDPG is chosen as the agent because it supports continuous actions rather than only discrete ones like up, down, left, and right. In the graph, we can see that AMC, which spends only a couple of GPU hours, outperforms manual tuning, which required almost a week.

Another method, NetAdapt, prunes layers adaptively using a per-step resource reduction target ΔR. For example, ΔR can be a latency reduction of p percent. Starting with the first layer, it prunes that layer until ΔR is satisfied and briefly fine-tunes the result (prune + fine-tune). This is repeated for every layer in the network, and the candidate with the smallest loss in accuracy is the one that is kept. The process is then repeated, and the algorithm stops once the overall resource target has been reached.
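
A greedy loop in this spirit might look like the sketch below. It is a simplified illustration, not the published algorithm: the stopping rule (an overall latency `budget`), the function names, and the toy latency and accuracy models are all assumptions, and `accuracy` stands in for "accuracy after a short fine-tune".

```python
def netadapt(layers, latency, prune_layer, accuracy, delta_r, budget):
    """Each round, try pruning every layer far enough to cut latency by
    delta_r, then keep the single candidate with the best accuracy."""
    config = dict(layers)                      # layer name -> channel count
    while latency(config) > budget:
        candidates = []
        for name in config:
            trial = prune_layer(config, name, latency(config) - delta_r)
            if trial is not None:              # None: layer cannot shrink further
                candidates.append((accuracy(trial), name, trial))
        if not candidates:
            break
        _, _, config = max(candidates, key=lambda c: c[0])
    return config

# Toy stand-ins (hypothetical): latency is the total channel count, and
# accuracy falls faster when a "costly" layer loses channels.
cost = {"conv1": 3.0, "conv2": 1.0, "conv3": 0.5}

def latency(config):
    return sum(config.values())

def prune_layer(config, name, target_latency):
    remove = latency(config) - target_latency
    if config[name] - remove < 1:
        return None
    trial = dict(config)
    trial[name] = config[name] - remove
    return trial

def accuracy(config):
    return -sum(cost[n] / c for n, c in config.items())

final = netadapt({"conv1": 16, "conv2": 16, "conv3": 16}, latency,
                 prune_layer, accuracy, delta_r=2, budget=40)
```

In this toy setup the most sensitive layer, `conv1`, is never chosen, so the latency reduction comes entirely from the layers that tolerate pruning better.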

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/net_adapt.png?raw=true' width=600 >

### Fine Tuning in Sparse Models

Intuitively, we can first train the model to convergence to identify which connections are important and which are not, and then prune away the unimportant ones. However, pruning alone degrades the model too much. That is why '*pruning + fine-tuning*' was introduced, which fine-tunes the model after pruning. A key point when fine-tuning is to lower the learning rate to 1/10 to 1/100 of its original value, since the model is already well trained and almost converged. In fact, it is good practice not to prune the model directly to the target sparsity. Instead, increasing the sparsity step by step and fine-tuning between steps gives much better results; this is called '*iterative pruning*'.
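
The schedule can be sketched on a toy linear regression problem. Everything here is an assumption made for illustration: the data, the magnitude criterion, the three sparsity steps, and the choice of one tenth of the base learning rate for fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
true_w = np.where(rng.random(20) < 0.3, rng.normal(size=20), 0.0)
y = X @ true_w

def fine_tune(w, mask, lr, steps=200):
    """Gradient descent on mean squared error; pruned weights stay at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = (w - lr * grad) * mask
    return w

def magnitude_mask(w, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(round(sparsity * w.size))
    mask = np.ones_like(w)
    mask[np.argsort(np.abs(w))[:n_prune]] = 0.0
    return mask

base_lr = 0.1
w = fine_tune(rng.normal(size=20), np.ones(20), lr=base_lr)   # dense training

for sparsity in (0.2, 0.4, 0.6):                    # step-by-step schedule
    mask = magnitude_mask(w, sparsity)
    w = fine_tune(w * mask, mask, lr=base_lr / 10)  # smaller learning rate
```

Already-pruned weights have magnitude zero, so each new mask contains the previous one and the sparsity only ever grows from step to step.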

In the end, we can remove 90 percent of AlexNet's parameters without losing accuracy, which is wonderful. However, if we repeated the same process without retraining, we would not reach the same accuracy level, since pruning disturbs the weight distribution.

<img src='https://github.com/muratonuryildirim/Tutorials/blob/master/images/efficient_learning/iterative_pruning.png?raw=true' width=550 >
