# Pruning of Neural Networks

## Objective

This module gives a high-level overview of the research landscape in neural network pruning.
We also briefly cover other neural network compression techniques.
The module aims at

- understanding the **motivation** behind neural network compression
- being aware of **different approaches** for neural network compression
- learning the **whats and hows of neural network pruning**
- getting comfortable with the **research and advances in neural network pruning**

## Motivation

## Bigger models: Performance

 - The bigger the model, the more parameters it has, and **the more expressive its function space**
 - As a result, state-of-the-art (SOTA) language models keep getting bigger

<img src="img/language-models-scaling.png" width="500">

[Image source: Robo-writers: the rise and risks of language-generating AI](https://www.nature.com/articles/d41586-021-00530-0)

## Bigger models: Storage & memory-bandwidth

- Bigger models take up a lot of storage space

- Difficult to store on devices with limited storage, e.g., smartphones, IoT devices

- Difficult to distribute, which real-world applications require


## Bigger models: Computational efficiency

- Require expensive computational hardware, which concentrates progress in the hands of a few giants

<img src="img/giants.png" width="500">

- Runtimes are slow, making them unsuitable for time-critical applications

- Unsuitable for embedded mobile applications. For example, a neural network with 1 billion connections does not fit in on-chip storage; running it at 20 Hz would require 12.8 W just for DRAM access [Han et al. 2015]

[Image source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/pdf/1910.01108.pdf)

[[Han et al. 2015] Learning both Weights and Connections for Efficient Neural Networks](https://arxiv.org/pdf/1506.02626.pdf)

## Compression of Neural Networks

- The aim is to reduce the size of models, a.k.a. model compression

- While **minimizing the loss in the quality** of the model
    - The quality measure depends on the task
    - For example, perplexity for language models, accuracy for visual recognition, etc.
    
- While **increasing the efficiency** of the model, where efficiency can relate to 
    - computational requirements, e.g., FLOPs, latency, etc.
    
    - storage requirements, e.g., compression ratio 


## Compression of Neural Networks: Approaches

- Constructive approaches

    - **Hand-design** a smaller network: For example, replacing fully connected layers with global average pooling in GoogLeNet [Szegedy et al. 2015], or using depthwise separable convolutions in MobileNet [Howard et al. 2017]
    
    - **AutoML**: Neural architecture search (NAS) with a constraint on the number of parameters [Dong et al. 2019]
    


[[Szegedy et al. 2015] Going deeper with convolutions](https://openaccess.thecvf.com/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf)

[[Howard et al. 2017] MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)

[[Dong et al. 2019] Network Pruning via Transformable Architecture Search](https://arxiv.org/abs/1905.09717)


## Compression of Neural Networks: Approaches

- Destructive approaches

    - **Network Pruning**: Removing redundant connections or weights
    
    - **Knowledge Distillation**: Transferring knowledge from a larger model to a smaller one [Hinton et al. 2015]
    
    - **Quantization**: Reducing the precision of the weights and biases so that the model consumes less memory, e.g., using 8-bit integers instead of 32-bit floats for the network parameters reduces the size by a factor of 4.
        - Post-training quantization: Quantize the parameters after training (leads to a higher loss in accuracy)
        - Quantization-aware training: The forward pass uses quantized parameters, while the backward pass assumes non-quantized parameters
        
        - Refer to [Gholami et al. 2021] for an extensive survey of these methods 
    
    - **Tensor Decomposition**: Low-rank approximation of fully connected layers in an over-parametrized neural network. [Read more here.](https://jacobgil.github.io/deeplearning/tensor-decompositions-deep-learning)
    
    - **Mix of above**: These approaches can be combined together [Han et al. 2016, Wang et al. 2020]
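
To make the quantization bullet above concrete, here is a minimal sketch of symmetric post-training quantization to 8-bit integers. The function names `quantize_int8` and `dequantize`, the single shared scale, and the example weights are illustrative assumptions, not a scheme prescribed by the cited works; it uses plain Python lists so that it runs without any library.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integer codes with one shared scale (symmetric scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the range [-127, 127].
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 0.9]   # hypothetical 32-bit float parameters
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # integer codes: one byte each instead of four
print(max_error)  # at most half a quantization step (scale / 2)
```

Each parameter now needs one byte plus a single shared `scale`, which is where the factor of 4 comes from; the small weight `0.003` is rounded to zero, which illustrates the accuracy loss of post-training quantization.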


[[Hinton et al. 2015] Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)

[[Gholami et al. 2021] A Survey of Quantization Methods for Efficient Neural Network Inference](https://arxiv.org/pdf/2103.13630.pdf)

[[Han et al. 2016] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding](https://arxiv.org/pdf/1510.00149.pdf)

[[Wang et al. 2020] APQ: Joint Search for Network Architecture, Pruning and Quantization Policy](https://arxiv.org/abs/2006.08509)



## Network Pruning: History 

- Le Cun et al. (1990) proposed Optimal Brain Damage (OBD), which prunes neural networks by **removing redundant or less useful weights** that do not contribute significantly to the output 

- Hassibi et al. (1992) observed that OBD often removes the wrong weights, and proposed Optimal Brain Surgeon (OBS) to prune more weights while preserving the generalization error

- Several works have followed up with different heuristics and methodologies to identify redundancies in neural networks

[[Le Cun et al. 1990] Optimal Brain Damage](http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf)

[[Hassibi et al. 1992] Second order derivatives for network pruning: Optimal Brain Surgeon](https://proceedings.neurips.cc/paper/1992/hash/303ed4c69846ab36c2904d3ba8573050-Abstract.html)

## Network Pruning: Outline

- **Pipeline**: At which stage of modelling should the pruning be done?

- **Unstructured vs Structured**: From which parts of the model should the parameters be pruned?

- **Criterion**: Which quantitative metric decides what to prune?

- **Prune Rate**: How much of the network should be pruned in each iteration?

## Network Pruning: Pipeline

- Typically, pruning is done after the model is trained

- Naturally, this leads to a loss of accuracy, thereby requiring **iterative training**

- **Iterative Finetuning**: The pruned model is trained starting from the weights retained from the initial training phase, with ***smaller learning rates*** [Han et al. 2015]

- **Iterative Retraining**: The pruned model is trained starting from the weights retained from the initial training phase, with the same learning rate schedule that was used to train the bigger model, a.k.a. **Learning-rate Rewinding** [Renda et al. 2020]

- **Iterative Rewinding**: The pruned model is trained starting from the weights of the initialized model; weights and learning rates are both rewound to their initial values [Frankle et al. 2018]

- **Prune before training**: The model is pruned before it is trained. This reduces the computational overhead related to iterative finetuning/retraining/rewinding [Lee et al. 2018]

- **Pruning as an objective**: Some methods learn the sparse structure during training either by penalizing weights or by explicitly learning a pruning mask [Savarese et al. 2019]

<img src="img/pipeline.png" width="500">


[[Han et al. 2015] Learning both Weights and Connections for Efficient Neural Networks](https://arxiv.org/pdf/1506.02626.pdf)

[[Renda et al. 2020] Comparing Rewinding and Fine-tuning in Neural Network Pruning](https://arxiv.org/abs/2003.02389)

[[Frankle et al. 2018] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635)

[[Lee et al. 2018] SNIP: Single-shot Network Pruning based on Connection Sensitivity](https://arxiv.org/abs/1810.02340)


[[Savarese et al. 2019] Winning the Lottery with Continuous Sparsification](https://arxiv.org/abs/1912.04427)

## Network Pruning: Unstructured vs Structured 

- **Unstructured Pruning**: Removes individual parameters, e.g., weights and biases
    - Connections are the most fundamental units of a network and are numerous enough to be pruned in large quantities 
    - There are no constraints on which connections can be pruned
    - Simple and intuitive
    - Directly reduces FLOPs (the number of floating-point operations) by removing individual connections or neurons [Han et al. 2015]
    - ***Disadvantage***: Most works report the reduction in FLOPs; however, realizing such gains in practice requires specialized hardware for sparse computation

<img src="img/unstructured.png" width="500">



[Image source: Han et al. 2015, Learning both Weights and Connections for Efficient Neural Networks](https://arxiv.org/pdf/1506.02626.pdf)

## Network Pruning: Unstructured vs Structured 
    
- **Structured Pruning**: Removes larger structures, e.g., convolution filters or kernels
    - Applicable to specialized architectures, e.g., convolutional neural networks
    - The final architectures do not require specialized hardware 
    - Applications like object detection and segmentation need intermediate representations. Filter pruning is useful here because the pruned models need less bandwidth for intermediate representations
    - Differs from Neural Architecture Search in that it is a destructive approach 
    - [Anwar et al. 2015], [Li et al. 2016], [Wen et al. 2016], [Liu et al. 2017], [Hacene et al. 2019]

[[Anwar et al. 2015] Structured Pruning of Deep Convolutional Neural Networks](https://arxiv.org/abs/1512.08571)

[[Li et al. 2016] Pruning filters for efficient convnets.](https://arxiv.org/abs/1608.08710)

[[Wen et al. 2016] Learning Structured Sparsity in Deep Neural Networks](https://arxiv.org/abs/1608.03665)

[[Liu et al. 2017] Learning Efficient Convolutional Networks through Network Slimming](https://arxiv.org/abs/1708.06519)

[[Hacene et al. 2019] Attention Based Pruning for Shift Networks](https://arxiv.org/abs/1905.12300)


## Network Pruning: Unstructured vs Structured 
    

<img src="img/convolution-pruning.png" width="500">


[Image source: Neural Network Pruning 101](https://towardsdatascience.com/neural-network-pruning-101-af816aaea61)


## Network Pruning: Criterion


- We use the following notation

    * $\mathbf{x}$ is the input vector
    * $\mathbf{W}$ is the weight matrix 
    * $\mathbf{b}$ is the bias vector
    * $\sigma$ is the activation function
    * $f$ is the functional representation of a neural network


$$
\mathbf{a} = f(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})
$$

- We use $\mathbf{M}$ (of the same shape as $\mathbf{W}$) as a binary mask to zero out weights. Thus, 

$$ \mathbf{a} = f(\mathbf{x}) = \sigma\big((\mathbf{M} \odot \mathbf{W})\mathbf{x} + \mathbf{b}\big) $$
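
The masked layer above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sigmoid activation, the function name `masked_layer`, and the toy values of `W`, `M`, `b`, and `x` are chosen here and do not come from the slides.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def masked_layer(W, M, b, x):
    """Compute sigma((M * W) x + b), where M * W is the elementwise product."""
    out = []
    for w_row, m_row, b_i in zip(W, M, b):
        pre_activation = sum(m * w * x_j for m, w, x_j in zip(m_row, w_row, x)) + b_i
        out.append(sigmoid(pre_activation))
    return out


W = [[1.0, -2.0], [0.5, 0.25]]
M = [[1, 0], [0, 0]]          # keep only W[0][0]; every other weight is masked out
b = [0.0, 0.0]
x = [1.0, 1.0]

a = masked_layer(W, M, b, x)
print(a)  # the second neuron sees no input, so it outputs sigmoid(0) = 0.5
```

Note that a mask only zeroes weights; the storage and compute savings appear once the zeroed entries are actually skipped or removed.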

## Network Pruning: Criterion

- **Weight magnitude**
    * Widely used criterion that is simple and works well in practice
    * Can be applied to individual weights: a weight is pruned when its magnitude is at most a threshold $\lambda$
    
    $$ M_{i,j} = \begin{cases} 0 & \text{if } |\mathbf{W}_{i,j}| \leq \lambda \\ 1 & \text{otherwise} \end{cases} $$

    * Can be applied to a group of weights $\mathbf{W}$, e.g., a convolution kernel: the whole group is pruned when its norm, normalized by the number of (nonzero) weights $||\mathbf{W}||_0$, is at most $\lambda$

    $$ \frac{||\mathbf{W}||_2}{||\mathbf{W}||_0} \leq \lambda \quad (L_2) $$

    $$ \frac{||\mathbf{W}||_1}{||\mathbf{W}||_0} \leq \lambda \quad (L_1) $$

    * The above can also be combined with $L_1$ or $L_2$ regularization
    
    * Use a learnable gate factor to completely switch off connections (e.g., convolutional channels) [Liu et al. 2017]
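
As a minimal sketch of the magnitude criterion, the snippet below builds a binary mask for individual weights and scores whole groups by a size-normalized norm. The helper names `magnitude_mask` and `group_score`, the threshold value, and the toy kernels are assumptions made for illustration; the group size `len(group)` stands in for the normalizer, which matches $||\mathbf{W}||_0$ only when all weights in the group are nonzero.

```python
def magnitude_mask(W, lam):
    """Binary mask: 0 where |W[i][j]| <= lam (pruned), 1 elsewhere (kept)."""
    return [[0 if abs(w) <= lam else 1 for w in row] for row in W]


def group_score(group, norm="l1"):
    """L1 or L2 norm of a flat group of weights, divided by the group size."""
    if norm == "l1":
        total = sum(abs(w) for w in group)
    else:  # "l2"
        total = sum(w * w for w in group) ** 0.5
    return total / len(group)


# Unstructured: prune individual weights below the threshold.
W = [[0.8, -0.05], [0.02, -1.5]]
mask = magnitude_mask(W, lam=0.1)
print(mask)    # [[1, 0], [0, 1]]

# Structured: prune whole kernels whose normalized L1 norm is small.
kernels = {"kernel_a": [0.01, -0.02, 0.03], "kernel_b": [0.5, -0.7, 0.9]}
pruned = [name for name, k in kernels.items() if group_score(k) <= 0.1]
print(pruned)  # ['kernel_a']
```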

[[Liu et al. 2017] Learning Efficient Convolutional Networks through Network Slimming](https://arxiv.org/abs/1708.06519)

## Network Pruning: Criterion

- **Activation value**
    * **If a feature map is not useful, remove the weights that produce it and the weights that use it**
    * Example 1 (MLP): 
        * If a neuron in an MLP is inactive all the time, remove the associated weights 
        * For $B$ batches with $N_b$ samples in each batch, we compute the saliency score for the $k^{th}$ neuron of the $l^{th}$ layer, where $\mathbf{a}^l_{i,k}$ is the activation of that neuron for the $i^{th}$ sample, as  
        
        $$S_{avg}(\mathbf{a}^l_{k}) = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{N_b}\sum_{i=1}^{N_b} |\mathbf{a}^l_{i,k}| $$
        
        * Thus, if $S_{avg} \leq \lambda$, remove $\mathbf{W}^{l-1}_{k, :}$, i.e., the weights producing $\mathbf{a}^l_k$, and $\mathbf{W}^{l}_{:, k}$, i.e., the weights using $\mathbf{a}^l_k$.
       
        * Similarly, one can define a saliency score $S_{std}$ using the standard deviation of $\mathbf{a}^l_{i,k}$ across the batch
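
The MLP example above can be sketched as follows. The nested-list layout of `batches`, the helper names `activation_saliency` and `remove_neuron`, the threshold `0.15`, and the toy weight matrices are all assumptions for illustration.

```python
def activation_saliency(batches):
    """S_avg per neuron: mean absolute activation within each batch, then averaged over batches.

    `batches` is a list of batches, each batch a list of samples,
    each sample the list of activations of one layer.
    """
    n_neurons = len(batches[0][0])
    scores = [0.0] * n_neurons
    for batch in batches:
        for k in range(n_neurons):
            scores[k] += sum(abs(sample[k]) for sample in batch) / len(batch)
    return [s / len(batches) for s in scores]


def remove_neuron(W_in, W_out, k):
    """Drop row k of the weights producing the layer and column k of the weights using it."""
    W_in = [row for i, row in enumerate(W_in) if i != k]
    W_out = [[w for j, w in enumerate(row) if j != k] for row in W_out]
    return W_in, W_out


batches = [
    [[0.0, 1.0, 0.5], [0.0, 3.0, 0.5]],   # batch with 2 samples, 3 neurons
    [[0.2, 2.0, 0.5]],                    # batch with 1 sample
]
scores = activation_saliency(batches)
to_prune = [k for k, s in enumerate(scores) if s <= 0.15]
print(to_prune)  # [0]: the first neuron is almost never active

W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 neurons x 2 inputs
W_out = [[1.0, 2.0, 3.0]]                     # 1 output x 3 neurons
W_in, W_out = remove_neuron(W_in, W_out, to_prune[0])
```

After the call, `W_in` has two rows and `W_out` has two columns, so the layer is genuinely smaller rather than merely masked.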
        

## Network Pruning: Criterion

- **Activation value**
    * **If a feature map is not useful, remove the weights that produce it and the weights that use it**
    
    * Example 2 (CNN): 
        * Recall that a convolution layer has $C_l$ filters, each with $C_{l-1}$ kernels containing $p \times p$ parameters, thereby producing $C_{l}$ feature maps
        
        * Represent the $k^{th}$ feature map of layer $l$ for the $i^{th}$ sample by $\mathbf{z}^l_{i,k}$

        * If a feature map is not active, remove the filter that produces it and the kernels in the subsequent layer's filters that use it
        
        * We estimate the saliency score of the $k^{th}$ feature map in the $l^{th}$ layer as 
    
    $$ S_{avg}(\mathbf{z}^l_{k}) =  \frac{1}{B}\sum_{b=1}^{B} \frac{1}{N_b}\sum_{i=1}^{N_b}\Big|  \frac{||\mathbf{z}^l_{i,k}||_1}{||\mathbf{z}^l_{i,k}||_0}   \Big| $$
    
        * Thus, if $S_{avg} \leq \lambda$, remove the $k^{th}$ filter producing the $k^{th}$ feature map, and the kernels in layer $l+1$ using the $k^{th}$ feature map
    

## Network Pruning: Criterion

- **Gradient magnitude - Activation** [Molchanov et al. 2016]
    * The saliency of a feature map is computed as the product of the gradient of the loss (w.r.t. the feature map) and the feature map
    * Example 1 (MLP):
        * Compute the saliency score for the $k^{th}$ neuron of the $l^{th}$ layer as 
    
    $$ S_{avg}(\mathbf{a}^l_{k}) = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{N_b}\sum_{i=1}^{N_b} \Big| \frac{\partial \mathcal{L}}{\partial \mathbf{a}^l_{i,k}} \times \mathbf{a}^l_{i,k} \Big| $$
    
        * Thus, if $S_{avg} \leq \lambda$, remove $\mathbf{W}^{l-1}_{k, :}$, i.e., the weights producing $\mathbf{a}^l_k$, and $\mathbf{W}^{l}_{:, k}$, i.e., the weights using $\mathbf{a}^l_k$
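
A small worked sketch of the gradient-times-activation score follows. To keep it free of any autodiff library, it assumes a hypothetical setup that is not in the slides: the hidden activations feed a linear head with weights `v`, and the loss is the squared error $\frac{1}{2}(\hat{y} - y)^2$, so the gradient w.r.t. each activation can be written by hand. A single batch is used for brevity.

```python
def gradient_activation_saliency(activations, targets, v):
    """Mean over samples of |dL/da_k * a_k| for a linear head with squared-error loss."""
    n = len(activations)
    scores = [0.0] * len(v)
    for a, y in zip(activations, targets):
        pred = sum(v_k * a_k for v_k, a_k in zip(v, a))
        err = pred - y                      # dL/dpred for L = 0.5 * (pred - y)^2
        for k in range(len(v)):
            grad_k = err * v[k]             # chain rule: dL/da_k = dL/dpred * v_k
            scores[k] += abs(grad_k * a[k]) / n
    return scores


v = [1.0, 0.0, 2.0]                         # head ignores the second neuron
activations = [[1.0, 5.0, 0.5], [2.0, 4.0, 0.0]]
targets = [1.0, 1.0]

scores = gradient_activation_saliency(activations, targets, v)
print(scores)  # [1.5, 0.0, 0.5]
```

The second neuron has the largest activations but a saliency of zero, because the loss does not depend on it. A pure activation-value criterion would have kept it, which is the motivation for including the gradient.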
       
    
[[Molchanov et al. 2016] Pruning Convolutional Neural Networks for Resource Efficient Inference](https://arxiv.org/abs/1611.06440)

## Network Pruning: Criterion

- **Gradient magnitude - Activation** [Molchanov et al. 2016]
    * The saliency of a feature map is computed as the product of the gradient of the loss (w.r.t. the feature map) and the feature map   
    * Example 2 (CNN):
        * Compute the saliency score for the $k^{th}$ feature map of the $l^{th}$ layer as 
    
       $$ S_{avg}(\mathbf{z}^l_{k}) = \frac{1}{B}\sum_{b=1}^{B} \frac{1}{N_b}\sum_{i=1}^{N_b} \Big| \frac{\partial \mathcal{L}}{\partial \mathbf{z}^l_{i,k}} \odot \mathbf{z}^l_{i,k} \Big| $$
       
        * Thus, if $S_{avg} \leq \lambda$, remove the $k^{th}$ filter producing the $k^{th}$ feature map, and the kernels in layer $l+1$ using the $k^{th}$ feature map
    
    
[[Molchanov et al. 2016] Pruning Convolutional Neural Networks for Resource Efficient Inference](https://arxiv.org/abs/1611.06440)

## Network Pruning: Criterion

- **Average Percentage of Zeros (APoZ)** [Le Cun et al. 1990, Hu et al. 2016]
    - ReLU activations induce sparsity; therefore, the APoZ of the $k^{th}$ feature map of the $l^{th}$ layer is used as a saliency score (a feature map that is zero most of the time is a candidate for pruning)
    

- **FLOPs Regularization** [Molchanov et al. 2016]

    - Minimizing FLOPs is part of the objective
    
    - The more FLOPs it takes to produce $\mathbf{z}^l_k$, the more worthwhile it is to prune the corresponding filter 
    
    - Thus, if $S_l^{flops}$ is the FLOPs associated with $\mathbf{z}^l_{k}$, we compute the FLOPs-regularized saliency for $\mathbf{z}^l_{k}$ as follows, where $\lambda$ here controls the strength of the regularization
    
    $$S_{FLOPs}(\mathbf{z}^l_{k}) = S_{avg}(\mathbf{z}^l_{k}) - \lambda \cdot S_l^{flops}$$ 
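
Both scores on this slide are easy to compute by hand; the sketch below does so for toy values. The helper names `apoz` and `flops_regularized`, the flattened feature maps, and the FLOPs figures (in arbitrary normalized units) are assumptions for illustration.

```python
def apoz(feature_maps):
    """Fraction of zero entries over a list of flattened post-ReLU feature maps."""
    total = sum(len(fm) for fm in feature_maps)
    zeros = sum(1 for fm in feature_maps for value in fm if value == 0)
    return zeros / total


def flops_regularized(s_avg, flops, lam):
    """S_FLOPs = S_avg - lam * S_flops: costly feature maps get a lower saliency."""
    return s_avg - lam * flops


# One feature map observed on two samples (flattened 2x2 maps).
maps_k = [[0.0, 0.0, 1.2, 0.0], [0.0, 0.3, 0.0, 0.0]]
print(apoz(maps_k))  # 0.75: zero most of the time, so a pruning candidate

# Two filters with equal saliency but different cost.
cheap = flops_regularized(s_avg=0.5, flops=2.0, lam=0.01)
costly = flops_regularized(s_avg=0.5, flops=8.0, lam=0.01)
print(cheap > costly)  # True: the costly filter is pruned first
```

Note the opposite directions: a high APoZ argues for pruning, whereas for $S_{FLOPs}$ it is the low values that are pruned.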


[[Le Cun et al. 1990] Optimal Brain Damage](http://yann.lecun.com/exdb/publis/pdf/lecun-90b.pdf)

[[Hu et al. 2016] Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures](https://arxiv.org/abs/1607.03250)

[[Molchanov et al. 2016] Pruning Convolutional Neural Networks for Resource Efficient Inference](https://arxiv.org/abs/1611.06440)

## Network Pruning: Pruning rate

- How many parameters to prune? 
    * Could be a fixed **number of parameters** to prune in each iteration
    * Could be a **percentage of parameters** to prune in each iteration

- Should the prune rate be applied 
    * **locally**, i.e., the prune rate is applied to each layer, or
    * **globally**, i.e., the prune rate is applied to all the parameters in the network
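
The difference between the two options can be seen on a toy two-layer network. This is a minimal sketch using magnitude as the criterion; the helper names, the layer names `fc1` and `fc2`, and the weight values are hypothetical.

```python
def threshold_for(values, rate):
    """Largest magnitude among the `rate` fraction of smallest-magnitude values."""
    ranked = sorted(abs(v) for v in values)
    n_prune = int(len(ranked) * rate)
    return ranked[n_prune - 1] if n_prune > 0 else -1.0   # -1.0 prunes nothing


def local_prune(layers, rate):
    """Apply the prune rate to each layer separately."""
    masks = {}
    for name, weights in layers.items():
        t = threshold_for(weights, rate)
        masks[name] = [0 if abs(w) <= t else 1 for w in weights]
    return masks


def global_prune(layers, rate):
    """Apply the prune rate to all parameters of the network at once."""
    t = threshold_for([w for ws in layers.values() for w in ws], rate)
    return {name: [0 if abs(w) <= t else 1 for w in ws] for name, ws in layers.items()}


layers = {"fc1": [0.9, -0.8, 0.7, 0.6], "fc2": [0.01, -0.02, 0.03, 0.04]}

local_masks = local_prune(layers, rate=0.5)
global_masks = global_prune(layers, rate=0.5)
print(local_masks)   # {'fc1': [1, 1, 0, 0], 'fc2': [0, 0, 1, 1]}
print(global_masks)  # {'fc1': [1, 1, 1, 1], 'fc2': [0, 0, 0, 0]}
```

Both variants remove half of the parameters, but the global variant removes all of `fc2` because its weights are uniformly small, whereas the local variant keeps half of every layer.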


## Network Pruning: Pruning rate

- **Global Pruning** (right): may lead to better results, but it can cause layer collapse (an entire layer is pruned, which prevents the backpropagation of errors) [Tanaka et al. 2020]


<img src="img/prune_rate.png" width="500">


[Image source: Neural Network Pruning 101](https://towardsdatascience.com/neural-network-pruning-101-af816aaea61)


[[Tanaka et al. 2020] Pruning neural networks without any data by iteratively conserving synaptic flow](https://arxiv.org/abs/2006.05467)

## Network Pruning: Sparse Training

- Step 1: Initialize a network with a random mask to prune certain connections of the network
- Step 2: Train for one epoch
- Step 3: Remove the weights of lowest magnitude 
- Step 4: Regrow the same number of weights at random positions in those layers, then repeat from Step 2
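
Steps 3 and 4 can be sketched as a single prune-and-regrow update on one flattened layer. This is only an illustration under assumptions: the function name `prune_and_regrow`, the regrowth initialization (small Gaussian noise), and the choice to allow regrowth at any currently inactive position are made here and are not taken from the cited paper; the training epoch of Step 2 is omitted.

```python
import random


def prune_and_regrow(weights, mask, fraction, rng):
    """Remove the weakest active connections, then activate as many random inactive ones."""
    weights, mask = list(weights), list(mask)      # leave the inputs untouched
    active = [i for i, m in enumerate(mask) if m == 1]
    n_swap = int(len(active) * fraction)

    # Step 3: remove the active weights closest to zero.
    for i in sorted(active, key=lambda i: abs(weights[i]))[:n_swap]:
        mask[i] = 0
        weights[i] = 0.0

    # Step 4: regrow the same number of connections at random inactive positions.
    inactive = [i for i, m in enumerate(mask) if m == 0]
    for i in rng.sample(inactive, n_swap):
        mask[i] = 1
        weights[i] = rng.gauss(0.0, 0.01)          # assumed initialization for new weights
    return weights, mask


weights = [0.5, -0.01, 0.0, 0.3, 0.0, 0.02]
mask = [1, 1, 0, 1, 0, 1]                          # random sparse mask from Step 1
new_weights, new_mask = prune_and_regrow(weights, mask, fraction=0.5, rng=random.Random(0))

print(sum(mask), sum(new_mask))  # the number of active connections stays the same
```

The sparsity level is constant throughout training; only the positions of the active connections change.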

[[Mocanu et al. 2018] Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science](https://www.nature.com/articles/s41467-018-04316-3)

## Network Pruning: Sparse Training

<img src="img/sparse-training.webp" width="1000">
An illustration of the SET procedure. For each sparse connected layer, $SC^k$ (a), of an ANN, at the end of a training epoch a fraction of the weights, the ones closest to zero, are removed (b). Then, new weights are added randomly in the same amount as the ones previously removed (c). Further on, a new training epoch is performed (d), and the procedure to remove and add weights is repeated. The process continues for a finite number of training epochs, as usual in ANN training.



[[Mocanu et al. 2018] Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science](https://www.nature.com/articles/s41467-018-04316-3)

## Network Pruning: Sparse Training


- Mocanu et al. (2018) used local pruning, i.e., maintaining pruning rate per layer
- Mostafa et al. (2019) used global pruning
- Dettmers et al. (2019) and Evci et al. (2020) proposed novel parameter regrowing techniques


[[Mocanu et al. 2018] Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science](https://www.nature.com/articles/s41467-018-04316-3)

[[Mostafa et al. 2019] Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization](https://arxiv.org/abs/1902.05967)

[[Dettmers et al. 2019] Sparse Networks from Scratch: Faster Training without Losing Performance](https://arxiv.org/abs/1907.04840)

[[Evci et al. 2020] Rigging the Lottery: Making All Tickets Winners](https://proceedings.mlr.press/v119/evci20a.html)

## Network Pruning: Pruning while Training

- Learn a pruning mask during training via a separate network:
    - Huang et al. (2018) and He et al. (2018) trained reinforcement learning agents to prune filters in CNNs
    - Yamamoto et al. (2019) proposed inserting attention networks before the layers of a pre-trained CNN to identify filters that can be pruned

[[Huang et al. 2018] Learning to Prune Filters in Convolutional Neural Networks](https://arxiv.org/pdf/1801.07365.pdf)

[[He et al. 2018] AMC: AutoML for Model Compression and Acceleration on Mobile Devices](https://arxiv.org/abs/1802.03494)

[[Yamamoto et al. 2019] PCAS: Pruning Channels with Attention Statistics for Deep Network Compression](https://arxiv.org/pdf/1806.05382.pdf)

## Network Pruning: Pruning as an objective

- **Penalty-based methods** or **Mask learning through auxiliary parameters**
    - Ideally, $L_0$ regularization would do the job, but it is non-differentiable.
    - Use differentiable penalty schemes to drive weights to 0, e.g., $L_1$ regularization
    - Chang et al. (2018) proposed a **modified $L_{1/2}$ regularization** that makes the $L_{1/2}$ regularizer differentiable
  

[[Chang et al. 2018] Prune deep neural networks with the modified L1/2 penalty](https://ieeexplore.ieee.org/iel7/6287639/6514899/08579132.pdf)


## Network Pruning: Pruning as an objective

- **Penalty-based methods**
     - **Group Lasso** [Meier et al. 2008]: Applies the LASSO penalty to groups of parameters. 
        - Assume the parameters are split into $m$ disjoint groups, $\theta_G = \{\theta^{(1)}, \theta^{(2)}, ..., \theta^{(m)}\}$
        - With $p_l$ as the number of parameters in $\theta^{(l)}$, the loss function becomes
        $$ \mathcal{L}(\mathbf{X}, \mathbf{y}) + \lambda \sum_{l=1}^m \sqrt{p_l} \cdot \big|\big|\theta^{(l)}\big|\big|_2$$
        
        - $m=1$ (a single group with all parameters) gives an $L_2$ penalty, similar to Ridge regression (which penalizes the squared norm)
        - $m=n$, with $n$ the total number of parameters (one parameter per group), is equivalent to Lasso regression
        - Intermediate values of $m$ are commonly known as Group Lasso
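
The penalty term above is short enough to compute directly. A minimal sketch, assuming groups are given as plain lists of parameters; the function name `group_lasso_penalty` and the example values are illustrative.

```python
import math


def group_lasso_penalty(groups, lam):
    """lam * sum over groups of sqrt(p_l) * ||theta_l||_2."""
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(t * t for t in g)) for g in groups
    )


# Two groups of two parameters: the all-zero group contributes nothing.
penalty = group_lasso_penalty([[3.0, 4.0], [0.0, 0.0]], lam=0.1)
print(penalty)        # 0.1 * sqrt(2) * 5

# One parameter per group: the penalty reduces to lam * (|3| + |-4|), i.e., Lasso.
lasso_like = group_lasso_penalty([[3.0], [-4.0]], lam=0.1)
print(lasso_like)
```

Because the norm of a group is not squared, the penalty tends to push entire groups to exactly zero, which is what makes it useful for removing whole structures such as filters.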


[[Meier et al. 2008] The group lasso for logistic regression](http://people.ee.duke.edu/~lcarin/lukas-sara-peter.pdf)

## Network Pruning: Pruning as an objective

- **Penalty-based methods**      
    - Various ways to **target the above penalties selectively**  
        - Carreira-Perpiñán et al. (2018) select the subset of weights to prune in a separate "compression" step 
        - Tessier et al. (2022) proposed Selective Weight Decay (SWD) to strongly penalize the weights below a certain threshold 

[[Tessier et al. 2022] Rethinking Weight Decay for Efficient Neural Network Pruning](https://arxiv.org/pdf/2011.10520.pdf)

[[Carreira-Perpiñán et al. 2018] "Learning Compression" Algorithms for Neural Net Pruning](https://faculty.ucmerced.edu/mcarreira-perpinan/papers/cvpr18.pdf)


## Network Pruning: Pruning as an objective

- **Penalty-based methods**      
    - Various ways to **target the above penalties selectively**  
        - Tessier et al. (2022) proposed Selective Weight Decay (SWD) to strongly penalize the weights below a certain threshold 
    

<img src="img/swd.png" width="1000">

[[Tessier et al. 2022] Rethinking Weight Decay for Efficient Neural Network Pruning](https://arxiv.org/pdf/2011.10520.pdf)


## Network Pruning: Pruning as an objective

- **Penalty-based methods**      
    - **Bayesian methods**
        - Molchanov et al. (2017) use variational dropout to learn individual dropout rates, such that a high dropout rate effectively prunes the weight 
        - Louizos et al. (2017) use sparsity-inducing hierarchical priors to prune nodes (sets of weights)
        - Neklyudov et al. (2017) apply dropout-based regularization to structured elements, e.g., neurons or convolutional channels. 
  
   
[[Molchanov et al. 2017] Variational Dropout Sparsifies Deep Neural Networks](https://arxiv.org/pdf/1701.05369.pdf)

[[Louizos et al. 2017] Bayesian Compression for Deep Learning](https://arxiv.org/abs/1705.08665)

[[Neklyudov et al. 2017] Structured Bayesian Pruning via Log-Normal Multiplicative Noise](https://arxiv.org/abs/1705.07283)



## Network Pruning: Lottery Ticket Hypothesis

- Frankle et al. (2018) empirically investigated whether a dense neural network contains a smaller subnetwork that, when trained in isolation, performs just as well as the original network (within the same training budget)


- The **experiments** involved the following steps:
    1. Initialize a sufficiently deep neural network with random weights and train it for some iterations
    2. Prune the weights with the smallest magnitude (the threshold is a hyperparameter)
    3. Reset the remaining weights to their values from the original initialization
    4. Train the pruned model again for some number of iterations
    5. Repeat steps 2, 3, and 4 until the desired level of sparsity is reached (70-80% in their experiments)


- **Findings**: With the right choice of hyperparameters (number of iterations, threshold, sparsity structure), the above procedure finds a much smaller subnetwork (the lottery ticket) that performs just as well as the larger unpruned network
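
The experimental loop can be sketched as iterative magnitude pruning in which the surviving weights are reset to the values they had at initialization (the rewinding described by Frankle et al.). Everything else here is a hypothetical stand-in: `find_ticket` is an illustrative name, and `toy_train` replaces real training with a few gradient steps on a made-up quadratic loss so that the sketch runs on its own.

```python
def find_ticket(init_weights, train, prune_fraction, rounds):
    """Iteratively train, prune the smallest surviving weights, and rewind to init."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        # Start every round from the ORIGINAL initial values of the surviving weights.
        start = [w if m else 0.0 for w, m in zip(init_weights, mask)]
        trained = train(start, mask)
        alive = [i for i, m in enumerate(mask) if m]
        n_prune = int(len(alive) * prune_fraction)
        for i in sorted(alive, key=lambda i: abs(trained[i]))[:n_prune]:
            mask[i] = 0
    ticket = [w if m else 0.0 for w, m in zip(init_weights, mask)]
    return ticket, mask


def toy_train(weights, mask, target=(1.0, 0.0, -2.0, 0.05), lr=0.25, steps=10):
    """Stand-in for training: gradient descent on sum((w - target)^2), masked weights stay 0."""
    w = list(weights)
    for _ in range(steps):
        w = [m * (wi - lr * 2.0 * (wi - ti)) for wi, ti, m in zip(w, target, mask)]
    return w


init = [0.1, -0.2, 0.3, -0.4]
ticket, mask = find_ticket(init, toy_train, prune_fraction=0.5, rounds=1)
print(mask)    # [1, 0, 1, 0]
print(ticket)  # [0.1, 0.0, 0.3, 0.0]
```

The returned ticket pairs the sparse mask with the original initial values of the kept weights; it is this pair that is then trained in isolation and compared against the dense network.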



[[Frankle et al. 2018] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635)


## Knowledge Distillation

- Hinton et al. (2015) proposed Knowledge Distillation (KD) as a technique to "distill" the knowledge of a bigger neural network to smaller and simpler neural networks

<img src="img/kd.png" width="1000">

[[Hinton et al. 2015] Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)

## Knowledge Distillation

- KD requires modifying the loss function of the smaller network

- For example, for a classification task, if $y$ is the one-hot encoded target vector, $z_s$ is the vector of logits output by the smaller network, and $z_l$ is the vector of logits output by the larger pre-trained network, the loss function for the smaller network is

$$
\alpha \times CE(Softmax(z_{s}), y) + (1-\alpha) \times KL(Softmax(z_l/T) \,||\, Softmax(z_{s}/T)),
$$
where $T$ is the temperature parameter and $\alpha \in [0, 1]$ is the mixing parameter

- Various ways of distillation have been studied since 2015. Refer to this [blog post](https://neptune.ai/blog/knowledge-distillation) for a broader overview. 
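
The distillation loss described above can be sketched for a single sample as follows. It assumes, following Hinton et al., that the temperature is applied to both the teacher and the student logits inside the KL term; the function names and example logits are illustrative, and refinements such as rescaling the KL term are left out.

```python
import math


def softmax(logits, T=1.0):
    """Softmax with temperature T (shifted by the maximum for numerical stability)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, y_onehot, alpha, T):
    """alpha * CE(student, y) + (1 - alpha) * KL(teacher_T || student_T)."""
    p_student = softmax(student_logits)
    ce = -sum(y * math.log(p) for y, p in zip(y_onehot, p_student))

    p_teacher_T = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher_T, p_student_T))
    return alpha * ce + (1.0 - alpha) * kl


teacher = [4.0, 1.0, -2.0]
student = [2.0, 1.5, -1.0]
y = [1, 0, 0]

loss = kd_loss(student, teacher, y, alpha=0.5, T=2.0)
print(loss)
```

With `alpha=1` the student is trained on the hard labels only, and with `alpha=0` only on the softened teacher distribution; a larger `T` flattens both distributions so that the relative probabilities of the wrong classes carry more signal.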


[[Hinton et al. 2015] Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)

## Quantization

- **Weight Hashing**
    - Group weights into buckets to reduce model storage
    - Thus, the final model is represented by the shared weight values and their indices
    - However, at inference time, these weights need to be restored for computation, so it might not lead to savings in inference time 

- **Weight Quantization**:
    - Quantize the weights to a few discrete values, e.g., binary or ternary
    - Requires special formulations for training neural networks
    

## Quantization

- Chen et al. (2015) proposed **HashedNets**
    - Before training, network weights are hashed into different groups 
    - All weights in a group share the same parameter value
    - Thus, storage is consumed only by the shared weights and their corresponding indices 
    - However, during inference, these weights are restored to their original positions, so there is less impact on the run-time memory and the inference time 
    
- Courbariaux et al. (2016) and Rastegari et al. (2016) restrict the weights to binary values, i.e., $\{-1, +1\}$; other works use ternary weights, i.e., $\{-1, 0, 1\}$
    - Large model size savings 
    - Significant speedups
    - Moderate accuracy loss 
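
The weight-sharing idea behind HashedNets can be illustrated with a small sketch in which every position of a layer's weight matrix is hashed to one of a few shared values. The CRC32 hash from the standard library is only a stand-in chosen for this example, and the names `bucket_of` and `materialize` are hypothetical.

```python
import zlib


def bucket_of(i, j, n_buckets):
    """Deterministically hash a weight position (i, j) to a bucket index."""
    return zlib.crc32(f"{i},{j}".encode("utf-8")) % n_buckets


def materialize(shared, n_rows, n_cols):
    """Rebuild the full weight matrix from the shared values, as needed at inference."""
    return [
        [shared[bucket_of(i, j, len(shared))] for j in range(n_cols)]
        for i in range(n_rows)
    ]


shared = [-0.4, 0.1, 0.7]             # only these 3 values are stored and trained
W = materialize(shared, n_rows=4, n_cols=5)

for row in W:
    print(row)                        # 20 weights, all drawn from the 3 shared values
```

The stored model holds 3 values instead of 20, but the full 4 x 5 matrix is still rebuilt for the matrix product, which is why the run-time memory and inference time benefit less than the storage does.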

[[Chen et al. 2015] Compressing Neural Networks with the Hashing Trick](https://arxiv.org/abs/1504.04788)

[[Courbariaux et al. 2016] Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1](https://arxiv.org/abs/1602.02830)

[[Rastegari et al. 2016] XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks](https://arxiv.org/abs/1603.05279)


Now open the following workbook `pruning-practical.ipynb` to learn how to build an InceptionNet

<img src="img/jupyter.png" width="250px">