# SparCL: Sparse Continual Learning on the Edge
## NeurIPS 2022
#### Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, Jennifer Dy
Sparse Continual Learning (SparCL) is presented as the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three components: task-aware dynamic masking (TDM) for weight sparsity, dynamic data removal (DDR) for data efficiency, and dynamic gradient masking (DGM) for gradient sparsity.

The authors develop a task-aware dynamic masking (TDM) strategy that keeps only the weights important for both the current and past tasks, with special consideration during task transitions. They also propose a dynamic data removal (DDR) scheme that progressively removes “easy-to-learn” examples from training iterations. This further accelerates training and also improves CL accuracy by balancing current and past data and keeping more informative samples in the buffer. Finally, they provide a dynamic gradient masking (DGM) strategy that leverages gradient sparsity for even better efficiency and preservation of learned knowledge, so that only a subset of the sparse weights is updated.

**Task-aware Dynamic Masking (TDM):**
It dynamically removes less important weights and grows back unused weights for stronger representation power by maintaining a single binary weight mask throughout the entire CL process. In other words, there is only one subnetwork for all the tasks.

*Continual weight importance (CWI)* is a score that decides whether a connection is pruned. It considers both the current task and the older tasks through a rehearsal buffer. M is the rehearsal buffer, and α, β are coefficients that control the influence of the current task and the older tasks, respectively. Moreover, L represents the cross-entropy loss for classification, while $\tilde{L}$ is the single-head cross-entropy loss, which only considers classes from the current task by masking out the logits of other classes.
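Under these definitions, the score can be written roughly as below. This is a reconstruction from the description above rather than a verbatim copy, so the exact form should be checked against the paper; $D_t$ (the training data of the current task) and $\theta$ (a group of weights) are symbols assumed here for illustration.

```math
\mathrm{CWI}(\theta) = \lVert \theta \rVert_1
  + \alpha \left\lVert \frac{\partial \tilde{L}(D_t; \theta)}{\partial \theta} \right\rVert_1
  + \beta \left\lVert \frac{\partial L(M; \theta)}{\partial \theta} \right\rVert_1
```

The first term rewards large weights, the second term rewards weights that matter for the current task, and the third term rewards weights that matter for past tasks as seen through the rehearsal buffer.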

**Dynamic Data Removal (DDR):**
The importance of each training example is measured by how often it is misclassified during CL. Examples that are misclassified frequently are considered informative and are kept in the buffer, while easy-to-learn examples are removed.

**Dynamic gradient masking (DGM):** 
To prevent the pruned weights from updating, the weight mask Mθ is also applied to the gradient matrix G as G ⊙ Mθ during backpropagation. Besides masking the gradients of pruned weights, the authors additionally remove less important gradients for faster training. To achieve this, they introduce the *Continual Gradient Importance (CGI)*, which builds on the CWI to measure the importance of weight gradients.

During training, SparCL alternates between two procedures: expand-and-shrink and shrink-and-expand. Expand-and-shrink is used at the beginning of a new task: it randomly adds new connections so that the network can explore for the new task (s% + expand) and, after a while, shrinks again based on the CWI to meet the sparsity condition (s%). Shrink-and-expand, on the other hand, is used within a task: it prunes some connections and grows new ones while maintaining the same sparsity level (s%).

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/sparcl.png?raw=true" width=800>
</p>


# GDumb: A Simple Approach that Questions Our Progress in Continual Learning
## ECCV 2020
#### Ameya Prabhu, Philip H.S. Torr, Puneet K. Dokania


GDumb (1) greedily stores samples in memory as they arrive and (2) at test time, trains a model from scratch using only the samples in memory. The authors show that even though GDumb is not specifically designed for CL problems, it obtains state-of-the-art accuracy when compared with approaches in the CL formulations for which they were specifically designed.

As a method, given a memory budget, the sampler greedily stores samples from a data-stream while making sure that the classes are balanced, and, at inference, the learner (neural network) is trained from scratch (hence dumb) using all the samples stored in the memory.
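The greedy class-balanced sampler can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `gdumb_update`, the dictionary-based memory, and the rule of evicting a random sample from the currently largest class are choices made for this sketch.

```python
import random
from collections import defaultdict


def gdumb_update(memory, sample, label, budget, rng=random):
    """Greedily insert one (sample, label) pair while keeping classes balanced.

    memory: defaultdict(list) mapping label -> stored samples (mutated in place).
    budget: maximum total number of stored samples.
    """
    total = sum(len(items) for items in memory.values())
    num_classes = len(set(memory) | {label})
    per_class_cap = budget / num_classes
    if len(memory[label]) >= per_class_cap:
        return  # this class already holds its share of the budget
    if total >= budget:
        # Memory is full: evict a random sample from the largest class.
        largest = max(memory, key=lambda c: len(memory[c]))
        memory[largest].pop(rng.randrange(len(memory[largest])))
    memory[label].append(sample)


# Usage: a stream with all samples of class 0 first, then class 1.
memory = defaultdict(list)
for i in range(100):
    gdumb_update(memory, i, 0, budget=10)
for i in range(100):
    gdumb_update(memory, i, 1, budget=10)
```

After the stream ends, the learner would be trained from scratch on the contents of `memory` only.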

The authors conclude that even though GDumb is not designed to handle the intricacies of challenging CL problems, it outperforms recently proposed algorithms in their own experimental set-ups, which is alarming.


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/gdumb.png?raw=true" width=600>
</p>




# Expert Gate: Lifelong Learning with a Network of Experts
## CVPR 2017
#### Rahaf Aljundi, Punarjay Chakravarty, Tinne Tuytelaars

In this paper, the authors introduce a model for CL based on a Network of Experts. For each new task, a new expert is learned and added to the model sequentially, building on what was learned before. To select the right expert, they introduce gating autoencoders that learn a representation for the task at hand and, at test time, automatically forward the test sample to the relevant expert. Further, the autoencoders inherently capture the relatedness of one task to another, so that knowledge transfer between tasks becomes possible.


The autoencoder is a simple one-layer encoder/decoder. The encoding step consists of one fully connected layer followed by a ReLU, and the decoding step again has one fully connected layer, but followed by a sigmoid. The sigmoid yields values in [0, 1], which allows cross-entropy to be used as the training loss. At test time, they use the Euclidean distance to compute the reconstruction error. Lastly, they add a softmax layer that takes as input the reconstruction errors of the different autoencoders for a test sample x and selects the most relevant expert.
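The gating step can be illustrated with a small sketch. It assumes the reconstruction errors have already been computed, and it applies the softmax to the negated errors so that a lower error yields a higher probability; the `temperature` parameter and the function names are assumptions of this sketch.

```python
import math


def gate_probabilities(recon_errors, temperature=1.0):
    """Softmax over negated reconstruction errors (lower error -> higher probability)."""
    scaled = [-e / temperature for e in recon_errors]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [x / total for x in exps]


def select_expert(recon_errors, temperature=1.0):
    """Return the index of the expert whose autoencoder reconstructs the sample best."""
    probs = gate_probabilities(recon_errors, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


# Usage: three task autoencoders; the second one reconstructs the sample best.
chosen = select_expert([0.9, 0.1, 0.5])
```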

To train a new expert, the algorithm first computes the average reconstruction error of the current task's data under all existing autoencoders. It then trains (fine-tunes) the new expert starting from the expert of the older task whose autoencoder returned the lowest reconstruction error.

This paper was among the early approaches that do not require storing data from previous tasks.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/expertgate.png?raw=true" width=600>
</p>




# Continual Learning via Neural Pruning (CLNP)
## NeurIPS 2019
#### Siavash Golkar, Michael Kagan, Kyunghyun Cho

In this method, subsequent tasks are trained using the inactive neurons and filters of the sparsified network. The authors use a simple neuron/filter-based sparsification scheme for accessibility. Their sparsification method comprises two parts. First, during the training of each task, an L1 regularizer promotes sparsity in the network. Its coefficient α is a hyperparameter that adjusts the strength of the regularization, so the amount of sparsity in each layer can be controlled by choosing a different α per layer. Second, post-training neuron pruning is applied based on the average activity of each neuron. For simplicity, CLNP performs only one iteration of pruning and skips the fine-tuning step. In the end, CLNP obtains a mask for each task through post-training structured pruning, exploiting the over-parameterization of deep neural networks.

# Neuro-Inspired Stability-Plasticity Adaptation (NISPA)
## ICML 2022
#### Mustafa Burak Gurbuz, Constantine Dovrolis

Neuro-Inspired Stability-Plasticity Adaptation (NISPA) proposes a Task-IL architecture that addresses the desiderata of CL through a sparse neural network with fixed density. NISPA performs sparse-to-sparse training to select promising paths in the network that preserve knowledge for a given task. It allows knowledge transfer through connections shared between tasks, yet those weights are frozen so that previous tasks are not forgotten.

In future work, the authors aim to adapt NISPA to class-incremental learning and to explore strategies to “unfreeze” carefully selected stable units. Lastly, although random connection growth is appropriate for exploring different network configurations, they aim to develop a more sophisticated approach that considers additional signals, such as network-theoretic metrics, to maximize the benefit of growing new connections.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/nispa.png?raw=true" width=900>
</p>



# XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
## ECCV 2016
#### Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi

In this paper, authors propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks (BWN) and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. 

In XNOR-Networks, unlike Binary-Weight-Networks, both the filters and the inputs to convolutional layers are binary. This results in 58× faster convolutional operations and 32× memory savings. XNOR-Nets make it possible to run state-of-the-art networks on CPUs (rather than GPUs) in real time.

To binarize the weights, the optimal estimate of a binary weight filter is obtained simply by taking the sign of the weight values. To train a CNN with binary weights (in convolutional layers), the authors compute the gradient for the real-valued weights, update the real-valued weights, and then binarize again. The real-valued weights are kept because parameter changes in gradient descent are tiny: binarizing right after each update would discard these changes, and the training objective could not improve. As a result, they obtain a much more efficient neural network with a small drop in accuracy compared to the full-precision network.
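A minimal sketch of the weight binarization and the update of the real-valued copy is given below. It works on a flat list of filter weights; besides the sign, it uses the mean absolute value of the filter as a scaling factor, which is how Binary-Weight-Networks approximate a filter as a scaled sign pattern. The function names are assumptions of this sketch.

```python
def binarize_filter(weights):
    """Approximate a real-valued filter W by alpha * sign(W)."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # mean absolute value
    signs = [1.0 if w >= 0 else -1.0 for w in weights]
    return alpha, signs


def sgd_step(real_weights, grads, lr):
    """Update the real-valued copy; the filter is re-binarized afterwards."""
    return [w - lr * g for w, g in zip(real_weights, grads)]


# Usage: binarize, take a small step on the real-valued weights, binarize again.
real = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize_filter(real)
real = sgd_step(real, grads=[0.1, -0.1, 0.2, 0.0], lr=0.01)
alpha_next, signs_next = binarize_filter(real)
```

In this usage the small step does not flip any sign, which is why the update has to be accumulated in the real-valued weights rather than in the binary ones.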

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/xnornet.png?raw=true" width=800>
</p>








# Wide Neural Networks Forget Less Catastrophically
## ICML 2022 
#### Seyed Iman Mirzadeh, Huiyi Hu, Razvan Pascanu, Dilan Gorur, Mehrdad Farajtabar

The authors motivate the study by the limited understanding of catastrophic forgetting. Instead of focusing on continual learning algorithms, they focus on the model itself and study the impact of the "width" and "depth" of the neural network architecture on catastrophic forgetting, showing that width has a surprisingly significant effect. They empirically demonstrate that increasing the width alone reduces catastrophic forgetting significantly, while this is not the case for depth. Based on the experiments, the robustness of wider networks is attributed to **Higher Gradient Orthogonality** and **Lower Gradient Sparsity**.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/widenets_not_forget.png?raw=true" width=600>
</p>




# Architecture Matters in Continual Learning
## ICLR 2023
#### Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Timothy Nguyen, Razvan Pascanu, Dilan Gorur and Mehrdad Farajtabar

In this work, the authors show that the choice of architecture can significantly impact continual learning performance, with trade-offs between the ability to remember previous tasks and the ability to learn new ones. They study the role of individual architectural decisions (e.g., width and depth, batch normalization, skip connections, and pooling layers) and how they affect continual learning performance. To motivate the study, they show that simply removing the global average pooling (GAP) layer of a ResNet and fine-tuning works better than EWC (regularization-based) and ER (rehearsal-based), which are methods specifically designed for continual learning.

Their findings are as follows:

1. **Model**: authors concluded that ResNets and Wide-ResNets have better learning abilities whereas CNNs and ViTs have better retention abilities. In their experiments, simple CNNs achieve the best trade-off between learning and retention.
2. **Width and Depth**: authors found that width translates into drastic improvements in average accuracy and forgetting. They also observed that increasing the depth is not helpful in this case. Overall, they concluded that over-parametrization through width is helpful in continual learning, whereas a similar claim cannot be made for the depth.
3. **Batch Normalization**: authors state that the effect of the batchnorm layer is data-dependent. In setups where the input distribution stays relatively stable, the BN layer improves continual learning performance, but if the input distribution changes a lot across tasks, the BN layer can hurt performance by increasing forgetting.
4. **Skip Connections**: authors mention that skip connections may not have a significant positive or negative impact on model performance in continual learning.
5. **Max and Average Pooling**: authors attribute the better performance of max pooling in a continual learning setting to a well-known empirical observation that a CNN with max pooling and stride 1 outperforms a CNN with stride 2 and no pooling. Further, authors believe that max pooling might extract low-level features, such as edges, better.
6. **Global Pooling**: authors claimed that global average pooling (GAP) layers are typically used in convolutional networks just before the final classification layer to reduce the number of parameters in the classifier. The consequence of adding a GAP layer is to reduce the width of the final classifier. Consequently, the architectures with a GAP layer can suffer from increased forgetting.
7. **Attention Heads**: authors note that the attention mechanism is a prominent component of transformer-based architectures, and that the heads of vision transformers (ViTs) have been shown to attend to both local and global features. They conclude that even doubling the number of heads in ViTs only marginally increases learning accuracy and lowers forgetting. This suggests that increasing the number of attention heads may not be an efficient way to improve continual learning performance in ViTs, although ViTs do show promising robustness against distributional shifts, as evidenced by lower forgetting numbers for the same number of parameters compared to other models.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/arch_matters_in_CL.png?raw=true" width=700>
</p>




# To prune, or not to prune: exploring the efficacy of pruning for model compression
## ICLR 2018
#### Michael H. Zhu, Suyog Gupta

This paper compares the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.

Across a broad range of neural network architectures (deep CNNs, stacked LSTM,
and seq2seq LSTM models), authors find that large-sparse models consistently outperform small-dense models and achieve up to 10x reduction in number of non-zero parameters with minimal loss in accuracy.

# Continual Learning of a Mixed Sequence of Similar and Dissimilar Tasks
## NeurIPS 2020
#### Zixuan Ke, Bing Liu and Xingchang Huang

This paper proposes a technique called Continual learning with forgetting Avoidance and knowledge Transfer (CAT) to learn a mix of similar and dissimilar tasks in the same network. For dissimilar tasks, the algorithm focuses on dealing with forgetting, and for similar tasks, it focuses on selectively transferring the knowledge learned from similar previous tasks to improve learning of the new task. To do that, the algorithm automatically detects whether a new task is similar to any previous task.

**CAT Algorithm** ($t$ tasks = $t$ classifier heads → Task-IL)
1. Apply layer-level masking.
2. If the current task $t$ is similar to one of the previous tasks, reuse that task's connections and leave them free to update.
3. If the current task $t$ is dissimilar to the previous tasks, it cannot use their connections, and the gradient flow through those connections is frozen to avoid forgetting.
4. Similarity between tasks is determined as follows: a *transfer model* f<sub>k→t</sub> is used to transfer knowledge from task $k$ to task $t$. A single-task model f<sub>∅</sub>, called the *reference model*, is used to learn task $t$ independently. If the *transfer model* f<sub>k→t</sub> classifies the validation data of task $t$ better than the *reference model* f<sub>∅</sub>, then it is assumed that task $k$ contains shareable prior knowledge that can help task $t$ learn a better model than it would without that knowledge (f<sub>∅</sub>).

*The transfer model* f<sub>k→t</sub> trains a small readout function (a one-layer fully connected classifier head on top of the CNN layers of task $k$). While training the transfer model, the CNN layers are frozen.

*The reference model* f<sub>∅</sub> is a separate network that learns task $t$ alone, from scratch with random initialization. It uses the same architecture as f<sub>k→t</sub> without applying any task masks. However, the network is smaller: in the authors' experiments it was 50% of the size of f<sub>k→t</sub>.




# Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science (SET)  
## Nature Communications 2018

#### Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H. Nguyen, Madeleine Gibescu & Antonio Liotta 

Randomly initialize the sparse connected layers (SCLs) of the network and start training.

At the end of each epoch, remove the connections with the smallest weights (the “weakest” connections) based on a threshold $t$ and replace them with randomly initialized new ones. Repeat.

SET turns out to be surprisingly robust and stable. Encouragingly, the authors show that SET models with far fewer parameters achieve results very similar to those of fully connected layer (FCL) models, sometimes surpassing them.
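The prune-and-regrow step described above can be sketched as follows. This illustration works on a flat list of weights and makes several assumptions that are not taken from the text: absent connections are stored as `0.0`, a fixed fraction `zeta` of the existing connections is removed instead of using an explicit threshold, and new connections get small random non-zero values.

```python
import random


def set_prune_and_regrow(weights, zeta=0.3, rng=random):
    """One SET-style topology update; returns a new list of weights."""
    active = [i for i, w in enumerate(weights) if w != 0.0]
    num_remove = int(zeta * len(active))
    # Weakest connections: smallest magnitude among the existing ones.
    weakest = sorted(active, key=lambda i: abs(weights[i]))[:num_remove]
    new_weights = list(weights)
    for i in weakest:
        new_weights[i] = 0.0
    # Regrow the same number of connections at positions that were empty before.
    empty = [i for i, w in enumerate(weights) if w == 0.0]
    for i in rng.sample(empty, min(num_remove, len(empty))):
        new_weights[i] = rng.choice([-1.0, 1.0]) * rng.uniform(0.01, 0.1)
    return new_weights


# Usage: five existing connections, half of them (rounded down) are replaced.
layer = [0.5, -0.01, 0.0, 0.3, 0.0, -0.02, 0.0, 0.9, 0.0, 0.0]
updated = set_prune_and_regrow(layer, zeta=0.5, rng=random.Random(0))
```

The number of connections stays constant, so the layer keeps its sparsity level across updates.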




# Overcoming Catastrophic Forgetting with Hard Attention to the Task (HAT)
## ICML 2018
#### Joan Serra, Didac Suris, Marius Miron, Alexandros Karatzoglou

This paper argues that masks should be learnable rather than pre-defined by heuristics or rules, as they are in pruning and sparse training. Therefore, HAT neither assigns pre-defined compression ratios nor determines parameter importance in a post-training step. Instead, the mask is learned during training with an attention mechanism. A task-based hard attention mechanism preserves the information from previous tasks without affecting the learning of a new task. This mechanism is placed after each layer, so the task embeddings are learned and updated during training as well. To make the attention approximately binary, HAT passes the task embedding, multiplied by a positive scaling factor (s), through the sigmoid function (σ), which decides whether the units of a layer are active for a given task. To sum up, HAT gives neural network models the ability to learn which parts of each layer should be activated for a task $t$ without any heuristic, pre-defined rule.
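The gate itself is a one-liner, sketched below for a single layer and a single task. The function names and the example values are assumptions of this sketch; the point is only to show how a larger scaling factor pushes the sigmoid output toward a binary mask.

```python
import math


def hat_gate(task_embedding, s):
    """Attention gate a = sigmoid(s * e), one value per unit of the layer."""
    return [1.0 / (1.0 + math.exp(-s * e)) for e in task_embedding]


def apply_gate(activations, gate):
    """Element-wise product of the layer output with the task gate."""
    return [h * a for h, a in zip(activations, gate)]


# Usage: the same embedding gives a soft gate for small s and a hard one for large s.
embedding = [1.0, -1.0]
soft_gate = hat_gate(embedding, s=1.0)
hard_gate = hat_gate(embedding, s=400.0)
gated_output = apply_gate([2.0, 3.0], hard_gate)
```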

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/hat.png?raw=true" width=450>
</p>


# The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
## ICLR 2019
#### Jonathan Frankle, Michael Carbin

Prior experience had shown that sparse architectures produced by pruning are difficult to train from the start. This paper, however, presents an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. The hypothesis states: 'A randomly-initialized, dense neural network contains a subnetwork that is initialized such that when trained in isolation, it can match the test accuracy of the original network after training for at most the same number of iterations.'

In the experiments, the authors identify a winning ticket by training a network and pruning its smallest-magnitude weights. The remaining, unpruned connections constitute the architecture of the winning ticket. Each unpruned connection’s value is then reset to its initialization from the original network before training. Finally, the pruned and reset sparse network is trained with the same computational budget. The performance of the sparse network turned out to be similar to that of the original dense network.

The authors also find that iterative pruning identifies winning tickets that match the accuracy of the original network at smaller sizes than one-shot pruning does. In iterative pruning with *n* rounds, each round trains the network, prunes p% of the remaining weights, and resets the surviving weights to their initial values before training continues.
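One pruning round of this procedure can be sketched as follows. The sketch assumes flat lists of weights and takes the trained weights as given (the training itself is omitted); the function name and the example values are illustrative only.

```python
def winning_ticket(initial, trained, prune_fraction, mask=None):
    """One round: prune the smallest trained weights, reset survivors to their initial values."""
    if mask is None:
        mask = [1] * len(trained)
    alive = [i for i, m in enumerate(mask) if m]
    num_prune = int(prune_fraction * len(alive))
    weakest = sorted(alive, key=lambda i: abs(trained[i]))[:num_prune]
    new_mask = list(mask)
    for i in weakest:
        new_mask[i] = 0
    # Surviving connections go back to their original initialization.
    reset = [w0 * m for w0, m in zip(initial, new_mask)]
    return new_mask, reset


# Usage: prune half of a four-weight "network" after training.
initial = [0.1, -0.2, 0.3, -0.4]
trained = [0.05, -0.9, 0.01, 0.7]
mask, ticket = winning_ticket(initial, trained, prune_fraction=0.5)
```

For iterative pruning, the returned mask would be passed back in on the next round, after the ticket has been trained again.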

This was one of the first papers that paved the way for dynamic sparse training literature. 

# Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization (Dynamic Sparse Reparameterization - DSR)
## ICML 2019
#### Hesham Mostafa, Xin Wang

This paper improves SET with a simple modification. SET applies magnitude-based pruning and random growth throughout training. DSR also applies magnitude-based pruning, but when growing new connections it decides how many free parameters each layer receives: highly sparse layers receive less growth.

In other words, after $K$ parameters are removed during the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors: layers with larger fractions of non-zero weights receive proportionally more free parameters. The underlying heuristic is that layers retaining more non-zero weights are those whose parameters receive larger loss gradients, so free parameters are directed to them.


# Rigging the Lottery: Making All Tickets Winners
## ICML 2020
#### Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro and Erich Elsen

According to the Lottery Ticket Hypothesis, if we can identify a sparse neural network with iterative pruning, then we can train that sparse network from scratch to the same degree of accuracy by beginning from the initial conditions.

Motivated by this, Rigging the Lottery, or *RigL*, updates the topology of the sparse network during training based on parameter magnitudes and infrequent gradient calculations.

RigL starts with a random sparse network and, at regularly spaced intervals, removes a fraction of connections based on their magnitudes and activates new ones using instantaneous gradient information. Growing the connections with the highest-magnitude gradients is the novelty of the method. Newly activated connections are initialized to zero and therefore do not affect the output of the network. However, they are expected to receive high-magnitude gradients in the next iteration and therefore to reduce the loss fastest.
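The drop-and-grow step can be sketched as below for a single flat layer. As a simplification assumed here, growth candidates are limited to connections that were inactive before the update, and the dense gradient is simply passed in as a list; the function name is illustrative.

```python
def rigl_update(weights, mask, dense_grads, drop_fraction):
    """Drop the smallest-magnitude active weights, grow the inactive ones with the largest gradients."""
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    k = min(int(drop_fraction * len(active)), len(inactive))
    drop = sorted(active, key=lambda i: abs(weights[i]))[:k]
    grow = sorted(inactive, key=lambda i: -abs(dense_grads[i]))[:k]
    new_mask, new_weights = list(mask), list(weights)
    for i in drop:
        new_mask[i], new_weights[i] = 0, 0.0
    for i in grow:
        new_mask[i], new_weights[i] = 1, 0.0  # grown connections start at zero
    return new_weights, new_mask


# Usage: three active connections, one of them is swapped for a high-gradient one.
weights = [0.5, 0.0, -0.1, 0.0, 0.9, 0.0]
mask = [1, 0, 1, 0, 1, 0]
grads = [0.1, 0.2, 0.3, 0.8, 0.1, 0.05]
new_weights, new_mask = rigl_update(weights, mask, grads, drop_fraction=0.34)
```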

RigL was able to find more accurate models than the current best dense-to-sparse training algorithms.


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/rigl.png?raw=true" width=450>
</p>


# SNIP: Single-Shot Network Pruning Based On Connection Sensitivity
## ICLR 2019
#### Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr

In this work, the authors present an approach that prunes a given network once, at initialization, prior to training. To achieve this, they introduce a saliency criterion based on connection sensitivity, which identifies structurally important connections in the network for the given task. This eliminates the need for both pretraining and a complex pruning schedule, while making the method robust to architecture variations. The weight initialization method matters here, since saliency scores are calculated directly from the initial weights without training. Therefore, the authors advocate variance scaling initialization, so that the variance remains the same throughout the network.

To compute the saliency scores, one batch is sampled from the dataset. Then the mask value $m$ of each weight, which is initially equal to 1, is perturbed slightly ($m-δ$, i.e. $1-δ$), one connection at a time, and the resulting change in the loss is observed. Finally, the weights with the smallest saliency scores are removed according to the predefined sparsity level. After pruning, the sparse network is trained in the standard way.
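In the limit of an infinitesimal perturbation, the change in the loss with respect to a connection's mask equals the gradient of the loss with respect to the weight multiplied by the weight itself, so the score can be computed from a single backward pass. The sketch below assumes the gradients on one batch are already available as a flat list; the function name and example values are illustrative.

```python
def snip_mask(weights, grads, sparsity):
    """Keep the connections with the largest normalized |gradient * weight|."""
    raw = [abs(g * w) for g, w in zip(grads, weights)]
    total = sum(raw) or 1.0  # avoid division by zero when all scores vanish
    saliency = [r / total for r in raw]
    num_keep = len(weights) - int(sparsity * len(weights))
    ranked = sorted(range(len(weights)), key=lambda i: -saliency[i])
    keep = set(ranked[:num_keep])
    return [1 if i in keep else 0 for i in range(len(weights))]


# Usage: prune half of four connections at initialization.
pruning_mask = snip_mask(
    weights=[1.0, -2.0, 0.5, 0.1],
    grads=[0.1, 0.3, -0.4, 0.05],
    sparsity=0.5,
)
```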

# Pruning via Iterative Ranking of Sensitivity Statistics
## preprint 2020
#### Stijn Verdenius, Amartya Sanyal, Harkirat S. Behl

It has been demonstrated that modern neural networks can be pruned effectively before training, for example with SNIP. Yet SNIP's sensitivity criterion has been criticized for not propagating the training signal properly, or even for disconnecting layers, which stops information flow. GraSP was introduced as a remedy, but it compromises on simplicity. This work shows that applying SNIP's own sensitivity criterion iteratively in smaller steps, still before training, greatly improves performance; the authors call this 'SNIP-it.'

The intuition goes as follows: certain medium-ranking model components that are not that important to the loss initially will be more important after the initial pruning event. Specifically, the authors hypothesize that some model components can be more important in the sub-network than they were in the parent network, i.e. the parent network does not effectively dictate how information flows through the sub-network in isolation. By pruning more conservatively, in multiple rounds, they grant the model components another chance, making the process more robust to the pruning algorithm and lowering the chance of disconnecting the network.

As an easy rule of thumb, the algorithm starts by pruning half of the network’s weights in the first stage and keeps halving the remainder until the desired sparsity is reached within some ε-proximity.
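The halving rule of thumb translates into a short schedule of cumulative sparsity levels. In the sketch below, the way the last step lands exactly on the target once it is within `eps` is an assumption about how the ε-proximity is handled, and the function name is illustrative.

```python
def halving_schedule(target_sparsity, eps=0.01):
    """Cumulative sparsity after each pruning round, halving the remaining weights each time."""
    schedule, remaining = [], 1.0
    while True:
        remaining /= 2.0
        sparsity = 1.0 - remaining
        if sparsity >= target_sparsity - eps:
            # Close enough (or past the target): finish at the target itself.
            schedule.append(target_sparsity)
            return schedule
        schedule.append(sparsity)


# Usage: rounds needed to reach 90% sparsity.
rounds = halving_schedule(0.9)
```

The sensitivity scores would be recomputed on the remaining sub-network before each of these rounds.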

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/snip_it.png?raw=true" width=800>
</p>


# Progressive Skeletonization: Trimming More Fat From A Network At Initialization
## ICLR 2021
#### Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Grégory Rogez, Puneet K. Dokania

Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity at both inference and training time, while only marginally degrading performance. However, beyond a certain level of sparsity (approx. 95%), these approaches fail to preserve network performance and, surprisingly, in many cases perform even worse than trivial random pruning. To this end, the authors propose (1) Iterative SNIP, which performs better because parameters that were unimportant at earlier stages of skeletonization can become important at later stages, and (2) FORCE, an iterative process that allows exploration by letting already pruned parameters resurrect at later stages of skeletonization. These methods provide remarkably improved performance at higher pruning levels (up to 99.5% of the parameters could be removed while keeping the networks trainable).

Progressive Sparsification (FORCE) starts from the observation that Iterative SNIP might be restrictive: when re-assessing the importance of unpruned parameters, it does not consider previously pruned connections, even if they could become important. FORCE, in contrast, allows previously pruned parameters to resurrect. Although such parameters do not contribute to the forward signal, they might have a non-zero gradient. This relaxation modifies the saliency, whereby the gradient is now computed at a sparsified network instead of a pruned network.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/snip_force.png?raw=true" width=850>
</p>


# Finding trainable sparse networks through Neural Tangent Transfer
## ICML 2020
#### Tianlin Liu, Friedemann Zenke

In deep learning, trainable sparse networks that perform well on a specific task are usually constructed using label-dependent pruning criteria. In this article, authors introduce Neural Tangent Transfer, a method that instead finds trainable sparse networks in a label-free manner. 

Specifically, they find sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks. This is done in function space in a manner similar to knowledge distillation, where the teacher model is a dense network and the student model is a sparse network. They show that this label-agnostic approach achieves higher classification performance while converging faster than other pruning-before-training methods.

# Sparse Networks from Scratch: Faster Training without Losing Performance (Sparse Momentum - SM)
## ICLR 2020
#### Tim Dettmers, Luke Zettlemoyer

Randomly initialize network and start training.

Calculate the mean magnitude of the momentum $M$ for each layer.

Remove the smallest 50% of weights for each layer.

Immediately after removing $K$ number of parameters during the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors, based on calculated mean magnitude of momentum $M$: layers having larger fractions of mean momentum will receive proportionally more free parameters.
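The redistribution step can be sketched as a proportional allocation. The function name, the list-based interface, and the way leftover parameters from rounding are handed to the layers with the largest momentum are assumptions of this sketch.

```python
def redistribute(num_free, mean_momentum):
    """Split `num_free` parameters across layers in proportion to their mean momentum magnitude."""
    total = sum(mean_momentum)
    shares = [int(num_free * m / total) for m in mean_momentum]
    # Rounding down can leave a few parameters unassigned; give them to
    # the layers with the largest momentum first.
    leftover = num_free - sum(shares)
    order = sorted(range(len(shares)), key=lambda i: -mean_momentum[i])
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Usage: 100 pruned parameters are regrown across three layers.
growth = redistribute(100, [5.0, 3.0, 2.0])
```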

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/SM.jpg?raw=true" width=700>
</p>



# Top-KAST: Top-K Always Sparse Training
## NeurIPS 2020
#### Siddhant M. Jayakumar, Razvan Pascanu, Jack W. Rae, Simon Osindero and Erich Elsen

In this paper, the authors propose a fully sparse training approach called Top-KAST. Top-KAST is argued to be scalable because it never requires a forward pass with dense parameters, nor the calculation of a dense gradient.

To do so, it selects a subset (*A*) of the dense network's parameters (θ) with a sparsity level of **1-D**, where *D* is the desired density. Subset *A* is used in the forward pass and for predictions. In the backward pass, another subset (*B*) of the dense parameters (θ) is used, where *B* is a superset of *A*. Hence, instead of calculating gradients for the whole dense network, it calculates them only for a subset, which is enough to grow new connections after removing the connections with the lowest magnitudes.
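A minimal sketch of how the two subsets could be chosen is shown below, assuming both are selected by top-k weight magnitude over a flat list of parameters. The function names and the `backward_density` parameter are assumptions of this sketch.

```python
def top_k_indices(weights, density):
    """Indices of the `density` fraction of weights with the largest magnitude."""
    k = max(1, int(round(density * len(weights))))
    ranked = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    return set(ranked[:k])


def topkast_sets(weights, forward_density, backward_density):
    """Forward set A and backward set B, where B contains A when backward_density >= forward_density."""
    forward_set = top_k_indices(weights, forward_density)
    backward_set = top_k_indices(weights, backward_density)
    return forward_set, backward_set


# Usage: 20% of the weights are used in the forward pass, 40% receive gradients.
theta = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.0, 0.3, -0.6, 0.15]
set_a, set_b = topkast_sets(theta, forward_density=0.2, backward_density=0.4)
```

Weights in *B* but not in *A* receive gradient updates without taking part in the forward pass, which gives them a chance to grow into *A* later.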

In other words, Top-KAST can be viewed as RigL with sparse backpropagation.

# Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers (DST)
## ICLR 2020

Randomly initialize Fully Connected Neural Network.

Randomly initialize trainable mask layers (layer-level threshold $t$) and start training on Masked Neural Network.

Instead of pruning (masking) between two training epochs with a predefined pruning schedule, this method prunes and recovers the network parameters at each training step, which is far more fine-grained than existing methods.

In each training step, the respective threshold is subtracted from the magnitude of each parameter; the parameter is masked (pruned) if the result is smaller than 0 and kept otherwise:

$Q$<sub>ij</sub>= $F$($W$<sub>ij</sub> ,$t$<sub>i</sub>) = $|W$<sub>ij</sub>$|$ - $t$<sub>i</sub>

$M$<sub>ij</sub> = $S(Q$<sub>ij</sub>$)$ where $M$<sub>ij</sub>$= 1$ if not pruned, $M$<sub>ij</sub> $= 0$ if pruned

Repeat.
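The masking rule above can be sketched as follows for one layer, with one threshold per row of the weight matrix as in the formula. Only the forward masking is shown; how the thresholds are learned is not covered, and the function names are illustrative.

```python
def unit_step(x):
    """S(x): 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0


def masked_weights(weights, thresholds):
    """Apply M_ij = S(|W_ij| - t_i) to a weight matrix given as a list of rows."""
    masked = []
    for row, t in zip(weights, thresholds):
        masked.append([w * unit_step(abs(w) - t) for w in row])
    return masked


# Usage: each row has its own threshold.
layer_weights = [[0.5, -0.1], [0.2, -0.9]]
sparse_weights = masked_weights(layer_weights, thresholds=[0.3, 0.5])
```

Because the mask is recomputed from the current weights and thresholds at every step, a pruned weight can recover as soon as its magnitude exceeds the threshold again.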

However, the authors note that $t$<sub>i</sub> cannot be learned under the step function $S(x)$, since its gradient is equal to 0 almost everywhere. Therefore, they introduce an approximation function $H(x)$ that has a usable gradient, which allows $t$<sub>i</sub>, and consequently all the mask layers, to be learned.

Finally, after training, the model would be sparse based on trained mask layers (layer-level threshold $t$<sub>i</sub>).

<p align="center">
<img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/DST.jpg?raw=true" width=550>
</p>


# Topological Insights into Sparse Neural Networks
## ECML 2020
#### Shiwei Liu, Tim Van der Lee, Anil Yaman, Zahra Atashgahi, Davide Ferraro, Ghada Sokar, Mykola Pechenizkiy, and Decebal Constantin Mocanu

In this work, the authors propose the Neural Network Sparse Topology Distance (NNSTD) to measure the distance between different sparse neural networks. Using it, they show that adaptive sparse connectivity can always unveil a plenitude of sparse subnetworks with very different topologies that outperform the dense model. This finding complements the Lottery Ticket Hypothesis, since it shows that there is a much more efficient and robust way to find “winning tickets”. Randomly initialized sparse neural networks with adaptive sparse connectivity therefore offer benefits not just in terms of computational and memory costs, but also in terms of the principal performance criteria for neural networks, e.g. accuracy for classification tasks.

Hence, in light of these findings, the authors suggest intrinsically sparse networks with topological optimizers instead of spending all resources on training over-parameterized models.

# Picking Winning Tickets Before Training by Preserving Gradient Flow 
## ICLR 2020
#### Chaoqi Wang, Guodong Zhang, Roger Grosse 

Network pruning can reduce test-time resource requirements, but it is typically applied to trained networks and therefore cannot avoid the expensive training process. The authors therefore apply pruning at network initialization, which saves resources at training time as well. They emphasize that efficient training requires preserving the gradient flow through the network, and they propose a simple but effective pruning criterion called Gradient Signal Preservation (GraSP).

The idea of GraSP is closely related to SNIP, which aims to preserve the loss of the original randomly initialized network. GraSP, however, argues that the loss is no better than chance at initialization. Hence, at the beginning of training, it is more important to preserve the training dynamics than the loss itself. SNIP does not do this automatically: even if cutting a specific connection has no impact on the loss, it might still obstruct the flow of information through the network. For instance, the authors noticed that SNIP with a high pruning ratio (e.g. 99%) tends to eliminate nearly all the weights in a particular layer, which creates a bottleneck in the network. Therefore, they take a different direction and consider how the presence or absence of one connection influences the training of the rest of the network.

Mathematically, a larger gradient norm indicates that each gradient update achieves a greater loss reduction. Therefore, the authors aim to preserve or even increase (if possible) the gradient flow after pruning. They cast the pruning operation as adding a perturbation δ to the initial weights and use a Taylor approximation to characterize how removing one weight affects the gradient flow after pruning.

GraSP computes a score for each weight, which reflects the change in gradient flow after pruning that weight. Specifically, if a weight's score is negative, removing it reduces the gradient flow, while if the score is positive, removing it increases the gradient flow. Therefore, the larger the score of a weight, the lower its importance. For a given pruning ratio p, the sparse model is found by removing the top p fraction of the weights ranked by score.
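
A small worked example may make the scoring rule concrete. The sketch below assumes the score takes the form S = −θ ⊙ (H g), with g the gradient and H the Hessian at initialization, and evaluates it on a toy quadratic loss where H is known in closed form. The loss, the matrix `A`, and all values are illustrative assumptions; a real implementation would obtain the Hessian-gradient product by automatic differentiation.

```python
def matvec(A, x):
    """Matrix-vector product for nested lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def grasp_scores(theta, A, b):
    """S = -theta * (H g) for the toy loss 0.5 * theta^T A theta - b^T theta."""
    g = [ax - bi for ax, bi in zip(matvec(A, theta), b)]   # gradient of the toy loss
    Hg = matvec(A, g)                                       # the Hessian is A here
    return [-t * h for t, h in zip(theta, Hg)]


def prune_top_fraction(scores, p):
    """Return a 0/1 mask that removes the top-p fraction of scores."""
    n_remove = int(p * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    removed = set(order[:n_remove])
    return [0 if i in removed else 1 for i in range(len(scores))]


# Diagonal Hessian so that the numbers are easy to check by hand.
A = [[2.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
b = [3.0, 1.0, 1.0, 1.0]
theta = [1.0, -1.0, 0.5, 2.0]

scores = grasp_scores(theta, A, b)          # [2.0, -2.0, -0.75, -2.0]
mask = prune_top_fraction(scores, p=0.5)    # [0, 1, 0, 1]
```

In this toy case the first weight has a positive score, so it is removed first, while the two weights with the most negative scores are kept.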




# Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot
## NeurIPS 2020
#### Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee

This paper conducts a sanity check of the following beliefs:

i. Pruning methods exploit information from training data to find good subnetworks

ii. The architecture of the pruned network is crucial for good performance.

They found that: 

i. A set of methods that aim to find good subnetworks without training the network (which the authors call “initial tickets”) hardly exploits any information from the training data.

ii. For the pruned networks obtained by these methods, such as SNIP and GraSP, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance. 

These findings inspire the authors to choose a series of simple data-independent pruning ratios* for each layer and to randomly prune each layer accordingly to obtain a subnetwork (which they call “random tickets”). They also mention that these layer-wise pruning ratios could be found with NAS for more robust networks and results. Experimental results show that the proposed zero-shot “random tickets” outperform or match existing “initial tickets”.

Finally, they show that methods that prune during or after training do pass the sanity checks: manipulating the data or the connections of the trained network drastically changes performance, whereas it does not affect performance at all when pruning before training.

This also suggests that there is more than one winning ticket in a network.

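*: The keep ratio (fraction of preserved weights) of layer $l$ is set proportional to $(L − l + 1)$<sup>2</sup> + $(L − l + 1)$, where $l$ is the index of the current layer and $L$ is the total number of layers in a ResNet. The ratio of linear layers is set to 0.3 for any sparsity $p$.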
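
The footnote can be turned into a short calculation. The sketch below interprets the expression as the fraction of weights kept in layer $l$ and rescales it so that the overall sparsity matches the target; the rescaling step, the function name, and the layer sizes are assumptions made for illustration, and the separate 0.3 rule for linear layers is left out.

```python
def smart_keep_ratios(layer_sizes, sparsity):
    """Keep ratio per layer, proportional to (L-l+1)^2 + (L-l+1) for l = 1..L."""
    L = len(layer_sizes)
    raw = [(L - l + 1) ** 2 + (L - l + 1) for l in range(1, L + 1)]
    # Scale so that the total number of kept weights matches the target sparsity.
    budget = (1.0 - sparsity) * sum(layer_sizes)
    scale = budget / sum(r * n for r, n in zip(raw, layer_sizes))
    # Clip at 1.0: a layer cannot keep more weights than it has.
    # (If clipping triggers, the total budget is no longer met exactly.)
    return [min(1.0, scale * r) for r in raw]


layer_sizes = [1000, 4000, 16000]     # hypothetical parameter counts per layer
ratios = smart_keep_ratios(layer_sizes, sparsity=0.9)
kept = sum(r * n for r, n in zip(ratios, layer_sizes))
```

With three layers the unnormalized ratios are 12, 6 and 2, so earlier layers keep a larger fraction of their weights than later ones.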

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/random_tickets_jackpot.png?raw=true" width=800>
</p>

# Pruning Neural Networks Without Any Data by Iteratively Conserving Synaptic Flow
## NeurIPS 2020
#### Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli

Video Explanation: https://www.youtube.com/watch?v=8l-TDqpoUQs

This study tries to identify highly sparse trainable subnetworks at initialization, without ever training or looking at the data. In pursuing this goal, the authors reach the following conclusions:

i. The premature pruning of an entire layer makes a network untrainable (layer-collapse). Hence, Maximal Critical Compression is formulated, which posits that a pruning algorithm should avoid layer-collapse whenever possible. It states that a pruning algorithm should never prune a set of parameters that results in layer-collapse if there exists another set of the same cardinality that would keep the network trainable.

Max compression (ρ<sub>max</sub> = Number of Parameters / Number of Layers) is the maximal compression ratio a network can reach without layer-collapse. Hence, the compression ratio (ρ<sub>cr</sub>) should be smaller than or equal to the max compression: ρ<sub>cr</sub> ≤ ρ<sub>max</sub>

ii. It is demonstrated that *synaptic saliency*, a general class of gradient-based scores for pruning, is conserved at every hidden unit and layer of a neural network, according to the 'Conservation Laws of Synaptic Saliency'.

Theorem 1 (Neuron-wise Conservation of Synaptic Saliency): For a feedforward neural network, the sum of the synaptic saliency for the incoming parameters to a hidden neuron is equal to the sum of the synaptic saliency for the outgoing parameters from the hidden neuron.

Theorem 2 (Network-wise Conservation of Synaptic Saliency): The sum of the synaptic saliency is the same across any set of parameters that exactly separates the input neurons x from the output neurons y of a feedforward neural network.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/synflow.png?raw=true" width=800>
</p>

Consider the set of parameters in a layer of a neural network. This set exactly separates the input neurons from the output neurons. Thus, by the network-wise conservation of synaptic saliency (the second law above), the total score for this set is constant for all layers, implying that the average score is inversely proportional to the layer size. Therefore, parameters in large layers receive lower scores than parameters in small layers. That is why single-shot pruning disproportionately prunes the largest layer, leading to layer-collapse.

However, if the conservation laws are coupled with iterative magnitude pruning (IMP), layer-collapse is avoided: when the largest layer is pruned and becomes smaller, its remaining parameters receive higher relative scores in subsequent iterations. Two key ingredients can thus be identified for IMP’s ability to avoid layer-collapse: (i) approximate layer-wise conservation of the pruning scores, and (ii) the iterative re-evaluation of these scores.

iii. A data-agnostic algorithm that satisfies Maximal Critical Compression is introduced, called Iterative Synaptic Flow Pruning (SynFlow). This algorithm can be interpreted as preserving the total flow of synaptic strengths through the network at initialization, subject to a sparsity constraint. Thus, this data-agnostic pruning algorithm challenges the existing paradigm that, at initialization, data must be used to quantify which synapses are important.

The algorithm works as follows:

First, all parameters are replaced by their absolute values. Second, a data point filled with ones (e.g. an image whose pixel values are all 1) is fed through the network with these positive weights, and an output vector is obtained. The elements of the output vector are then summed into a single number $R$, the value of a pseudo loss function.

Then, $R$ is backpropagated through the layers, and the score of each parameter is the derivative of $R$ with respect to that parameter multiplied by the parameter itself. In this way, a score is computed cheaply for every parameter. Finally, parameters whose scores are smaller than the threshold value are pruned.
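
The scoring step can be sketched for a tiny two-layer linear network, where the derivative of $R$ is available in closed form. The network shape, the weights, and the function name are assumptions for illustration, and only a single scoring pass is shown, not the iterative pruning loop. The example also checks the conservation laws numerically.

```python
def synflow_scores(W1, W2):
    """Scores for a two-layer linear network y = |W2| |W1| x with an all-ones input x."""
    A1 = [[abs(w) for w in row] for row in W1]    # hidden x input
    A2 = [[abs(w) for w in row] for row in W2]    # output x hidden
    hidden = [sum(row) for row in A1]             # forward pass on the all-ones input
    R = sum(sum(a * h for a, h in zip(row, hidden)) for row in A2)   # summed output
    # dR/dA2[k][j] = hidden[j];  dR/dA1[j][i] = sum_k A2[k][j] (each input is 1)
    back = [sum(A2[k][j] for k in range(len(A2))) for j in range(len(A1))]
    S1 = [[back[j] * a for a in A1[j]] for j in range(len(A1))]
    S2 = [[hidden[j] * a for j, a in enumerate(row)] for row in A2]
    return R, S1, S2


W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 0.0]]      # 2 inputs -> 3 hidden units
W2 = [[1.0, -1.0, 2.0], [0.0, 3.0, -1.0]]        # 3 hidden units -> 2 outputs

R, S1, S2 = synflow_scores(W1, W2)

# Network-wise conservation: each layer's scores sum to the same total, R.
layer_totals = [sum(map(sum, S1)), sum(map(sum, S2))]
# Neuron-wise conservation at hidden unit 0: incoming scores equal outgoing scores.
incoming_0 = sum(S1[0])
outgoing_0 = sum(row[0] for row in S2)
```

Since both layers share the same total score, the six parameters of the first layer and the six of the second split the same budget; with unequal layer sizes, the larger layer would get a lower average score per parameter.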

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/synflow1.png?raw=true" width=650>
</p>



# Embracing Change: Continual Learning in Deep Neural Networks
## Trends in Cognitive Sciences 2020
#### Raia Hadsell, Dushyant Rao, Andrei A. Rusu, Razvan Pascanu

The paper briefly describes what continual learning is, why it is important, and how closely it is related to human learning. Furthermore, it provides desiderata for continual learning and reviews the current approaches that aim to satisfy them: regularization-based, architecture-based (also referred to as modular architectures), memory-based, and meta-learning-based.

Regularization-based approaches directly modify the optimization of neural networks and have been shown to reduce catastrophic forgetting. Modular architectures offer pragmatic solutions to interference and catastrophic forgetting, while enabling forward transfer through hierarchical recomposition of skills and knowledge. End-to-end memory models could be a scalable solution for long timescale learning, and meta-learning approaches could surpass hand-designed algorithms and architectures altogether.


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/desiderata_cl.png?raw=true" width=800>
</p>

# Sparse Evolutionary Deep Learning with Over One Million Artificial Neurons on Commodity Hardware
## Neural Computing and Applications 2021
#### Shiwei Liu, Decebal Constantin Mocanu, Amarsagar Reddy Ramapuram Matavalam, Yulong Pei & Mykola Pechenizkiy 

Off-the-shelf sparsity-inducing techniques either operate from a pretrained model or enforce the sparse structure via binary masks. Thus, their training efficiency is only demonstrated in theory, not in practice. In this paper, the authors introduce a technique that allows truly sparse neural networks to be trained in practice as well. They present a new way to represent sparse connections, because most hardware and software works only with dense representations.

Experimental results demonstrate that the method can be applied directly to high-dimensional data while achieving higher accuracy than the traditional two-phase (train and prune) approaches; in particular, it overcomes the 'curse of dimensionality'.

# Sparse Training via Boosting Pruning Plasticity with Neuroregeneration
## NeurIPS 2021
#### Shiwei Liu, Tianlong Chen, Xiaohan Chen, Zahra Atashgahi, Lu Yin, Huanyu Kou, Li Shen, Mykola Pechenizkiy, Zhangyang Wang, Decebal Constantin Mocanu

Work on the lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) has drawn a lot of attention to post-training pruning (iterative magnitude pruning) and before-training pruning (pruning at initialization). Post-training pruning suffers from an extremely large computation cost, and before-training pruning usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys training/inference efficiency and comparable performance, has been less explored. To measure the effect of pruning throughout training, the study uses **pruning plasticity** (the ability of pruned networks to recover the original performance) as its metric. It also finds that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., regenerating the same number of connections as pruned. The resulting gradual magnitude pruning (GMP) method advances the state of the art; most impressively, its sparse-to-sparse version outperforms various dense-to-sparse methods.

The study found that:

i. Both pruning rate and learning rate matter for pruning plasticity. When pruned with low pruning rates, both dense-to-sparse training and sparse-to-sparse training can easily recover from pruning. On the contrary, if too many parameters are removed at one time, almost all models suffer from accuracy drops. Hence, gradually increasing the sparsity level boosts the performance of both dense-to-sparse and sparse-to-sparse training.

ii. Neuroregeneration improves pruning plasticity. When the same number of connections is regenerated as pruned, pruning plasticity improves remarkably, indicating that a more neuroplastic model is being developed. However, neuroregeneration increases memory and computational overheads in dense-to-sparse training. That is why it is more suitable for sparse-to-sparse training, where gradual pruning starts with a sparse subnetwork and gradually prunes it to the target sparsity during training.
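
One prune-and-regenerate step could look like the sketch below. It assumes that pruning removes the smallest-magnitude active weights and that regeneration picks the inactive connections with the largest gradient magnitude; the regrowth criterion, the function name, and all values are assumptions of this illustration rather than details stated above.

```python
def prune_and_regenerate(weights, grads, mask, k):
    """Prune the k smallest-magnitude active weights, then regrow k inactive ones.

    Regrowth uses gradient magnitude (an assumption of this sketch);
    regrown weights start at zero.
    """
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]
    pruned = sorted(active, key=lambda i: abs(weights[i]))[:k]
    # Only connections that were already inactive are candidates for regrowth.
    grown = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:k]
    new_mask, new_weights = list(mask), list(weights)
    for i in pruned:
        new_mask[i], new_weights[i] = 0, 0.0
    for i in grown:
        new_mask[i], new_weights[i] = 1, 0.0
    return new_weights, new_mask


weights = [0.8, 0.0, -0.05, 0.0, 0.3, -0.6]
mask = [1, 0, 1, 0, 1, 1]
grads = [0.1, 0.9, 0.2, 0.05, 0.3, 0.1]      # stand-in gradient values

new_weights, new_mask = prune_and_regenerate(weights, grads, mask, k=1)
```

The number of active connections is unchanged by this step, so the sparsity level is controlled separately by the gradual pruning schedule.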

# The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training
## ICLR 2022
#### Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, Mykola Pechenizkiy

Although random pruning is considered uncompetitive compared to post-training pruning or sparse training, it is arguably the simplest way of achieving sparsity in neural networks. In this work, the authors emphasize a counter-intuitive result regarding sparse training: random pruning at initialization can be highly effective. The study demonstrates empirically that sparse training of a network that is randomly pruned from the beginning can match the performance of its dense equivalent, without sophisticated pruning criteria or carefully designed sparsity structures. Two key factors contribute to this revival:

i. The network size: The effectiveness of sparse training with random pruning depends on the network size. Even at low sparsities (10%, 20%), random pruning hardly ever reaches the full accuracy of tiny dense networks. However, as networks get bigger and deeper, the performance of a randomly pruned sparse network quickly catches up with its dense version, even at high sparsity ratios.

ii. An appropriate layer-wise sparsity ratio: Especially for large networks, selecting the right layer-wise sparsity at initialization can be a significant help in training a randomly pruned network from scratch. A randomly pruned sparse Wide ResNet-50 with ERK initialization can outperform a densely trained Wide ResNet-50 on ImageNet.

# Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
## AAAI 2022
#### Utku Evci, Yani Ioannou, Cem Keskin and Yann Dauphin

This paper investigates why training unstructured sparse networks from random initialization performs poorly and what makes Lottery Tickets (LTs) and Dynamic Sparse Training (DST) exceptions.

It finds that sparse NNs have poor gradient flow at initialization, which demonstrates the importance of sparsity-aware initialization. Furthermore, DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, the success of LTs lies in re-learning the pruning solution from which they are derived, not in improving gradient flow.



# Sparsity May Cry

# PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
## CVPR 2018
#### Arun Mallya and Svetlana Lazebnik

Inspired by network pruning techniques, PackNet exploits redundancies in large deep networks to free up parameters that can then be employed to learn new tasks. By performing iterative pruning and network re-training, PackNet is able to sequentially “pack” multiple tasks into a single network. To do that, after finding the $n$<sup>th</sup> pack, it freezes the weights and assigns that "pack" as the subnetwork of $task$ $n$. The drawback is that PackNet forces subsequent tasks to use the previously fixed and pretrained connections, which is called biased transfer. However, it is one of the first approaches that paved the way for trainable subnetworks.

In the figure, white circles represent available neurons in the backbone, while bold circles indicate neurons that are already occupied by another pack and fixed. That is why, in the next pack selection, these neurons have to be used whether they are relevant to the task at hand or not.

* **Masking Method**: Train the entire backbone, then remove 50% or 75% of the connections based on the weights' absolute magnitude, and re-train.
* **Mask Selection**: Assumed that task identity is given in both the training and the testing stage.
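
The packing procedure can be sketched on a flat weight vector. The `owner` bookkeeping, the function names, and the toy values are hypothetical, and the training and re-training steps between prunings are replaced by hand-written weight values.

```python
def packnet_assign(weights, owner, task_id, prune_fraction):
    """Give `task_id` the largest-magnitude free weights and release the rest.

    owner[i] is 0 for a free weight, or the id of the task whose pack froze it.
    """
    free = [i for i, o in enumerate(owner) if o == 0]
    n_prune = int(prune_fraction * len(free))
    by_magnitude = sorted(free, key=lambda i: abs(weights[i]))
    released = set(by_magnitude[:n_prune])          # smallest magnitudes are pruned
    new_owner, new_weights = list(owner), list(weights)
    for i in free:
        if i in released:
            new_weights[i] = 0.0                    # stays free for later tasks
        else:
            new_owner[i] = task_id                  # frozen as part of this task's pack
    return new_weights, new_owner


def task_mask(owner, task_id):
    """Inference mask for a task: its own pack plus all earlier packs."""
    return [1 if 0 < o <= task_id else 0 for o in owner]


# Task 1: all weights are free; 50% are pruned after training.
trained_1 = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6]
weights, owner = packnet_assign(trained_1, [0] * 8, task_id=1, prune_fraction=0.5)

# Task 2: only the free weights (indices 1, 4, 5, 6) were trained; frozen ones are unchanged.
trained_2 = [0.9, 0.5, 0.4, -0.7, -0.02, 0.25, -0.8, 0.6]
weights, owner = packnet_assign(trained_2, owner, task_id=2, prune_fraction=0.5)

mask_task_1 = task_mask(owner, 1)
mask_task_2 = task_mask(owner, 2)
```

The mask of task 2 contains every weight of task 1, which is exactly the biased transfer described above: later tasks must run through connections that were fixed for earlier ones.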

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/packnet.png?raw=true" width=800>
</p>

# Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
## ECCV 2018
#### Arun Mallya, Dillon Davis, and Svetlana Lazebnik

Piggyback questions whether the weights of a network have to be changed at all. It suggests that we might get reasonable results by just selectively masking, i.e. setting certain weights to 0, while keeping the rest of the weights the same as before.

Based on this idea, Piggyback learns how to mask the weights of an existing pretrained network (e.g. VGG-16) to obtain good performance on a new task, as shown in the figure. Binary masks that take values in {0, 1} are learned and stored after each task. Since the model itself is already pretrained, Piggyback trains only the mask. To do that, it starts with real-valued masks and uses the loss of the pretrained model to update them. Finally, it applies a threshold function to make them binary.

* **Masking Method**: Mask the pretrained model with learnable real-valued masks, which are then converted to binary masks by thresholding.
* **Mask Selection**: Assumed that task identity is given in both the training and the testing stage.
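
A minimal sketch of the mask learning for a single neuron is given below. It assumes a straight-through update, i.e. the threshold is treated as the identity in the backward pass so that the real-valued mask still receives a gradient; the threshold, the initial mask value, the learning rate, and the function names are illustrative choices.

```python
def binarize(real_mask, threshold):
    """Hard threshold: 1 where the real-valued mask entry is >= threshold, else 0."""
    return [1 if m >= threshold else 0 for m in real_mask]


def masked_forward(pretrained_w, real_mask, x, threshold=0.005):
    """Dot product of x with the frozen weights after applying the binary mask."""
    binary = binarize(real_mask, threshold)
    return sum(w * b * xi for w, b, xi in zip(pretrained_w, binary, x))


def update_real_mask(real_mask, pretrained_w, x, grad_out, lr):
    """Straight-through update (assumption): d(out)/d(real_mask[i]) is taken to be
    pretrained_w[i] * x[i], as if the threshold were the identity."""
    return [m - lr * grad_out * w * xi for m, w, xi in zip(real_mask, pretrained_w, x)]


pretrained_w = [0.5, -1.0, 2.0]      # frozen, never updated
real_mask = [0.01, 0.01, 0.01]       # all entries start above the threshold
x = [1.0, 1.0, 1.0]

out_before = masked_forward(pretrained_w, real_mask, x)
# grad_out is a stand-in for dLoss/d(out) coming from the new task's loss.
new_real_mask = update_real_mask(real_mask, pretrained_w, x, grad_out=1.0, lr=0.02)
out_after = masked_forward(pretrained_w, new_real_mask, x)
```

Only `real_mask` changes during training; the pretrained weights are shared by all tasks, and each task stores just its binary mask.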

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/piggyback.png?raw=true" width=700>
</p>


# SupSup: Supermasks in Superposition
## NeurIPS 2020
#### Mitchell Wortsman et al.

Supermasks in Superposition (SupSup) model uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask). If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. Authors experimentally find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. Hence, SupSup is suitable for class-IL.


During training, SupSup learns a separate supermask (subnetwork) for each task. At inference time, SupSup can infer the task identity by superimposing all supermasks. Ideally, the appropriate supermask for a given task exhibits a confident output distribution (i.e. low entropy).

* **Masking Method**: Deconstructing lottery tickets: Zeros, signs, and the supermask.
* **Mask Selection**: Try all the supermasks and return the mask with the lowest output entropy.
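
The selection rule in the last bullet can be sketched as follows. The logits are made-up stand-ins for the outputs obtained by running one input through the network under each task's supermask; the gradient-based shortcut over the superposition is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_mask(logits_per_mask):
    """Return the index of the supermask whose output distribution has the lowest entropy."""
    entropies = [entropy(softmax(z)) for z in logits_per_mask]
    return min(range(len(entropies)), key=lambda i: entropies[i])


# Hypothetical logits for the same input under three different supermasks.
logits_per_mask = [
    [0.2, 0.1, 0.3],    # mask 0: nearly uniform output, high entropy
    [5.0, 0.1, -1.0],   # mask 1: confident output, low entropy
    [1.0, 0.8, 0.9],    # mask 2: nearly uniform output
]
best = select_mask(logits_per_mask)
```

Trying every mask costs one forward pass per task, which is why inferring the task with a single gradient step on the superposition is attractive when there are thousands of tasks.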

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/supsup.png?raw=true" width=800>
</p>


# SpaceNet: Make Free Space for Continual Learning
## Neurocomputing 2021
#### Ghada Sokar, Decebal Constantin Mocanu, and Mykola Pechenizkiy

SpaceNet trains sparse deep neural networks from scratch so that each task uses a compact number of neurons. When the model faces a new task, new sparse connections are randomly allocated between a selected number of neurons in each layer. By the end of training, the initial distribution of the connections has changed, and the connections that are important for that task are grouped together.

The most important neurons for a specific task are reserved for this task: they are frozen and not seen by the following tasks. Other neurons, which are only somewhat important or not important at all, continue to be shared between the tasks. For example, in the figure, fully filled circles represent the neurons that are most important and become specific to $task$ $t$, whereas partially filled ones are less important and can be shared with other tasks. Multi-colored circles represent neurons that are used by multiple tasks. After learning $task$ $t$, the corresponding weights are kept fixed.

For convolutional neural networks, SpaceNet performs the drop and grow phases in a coarse manner to impose structured sparsity instead of irregular sparsity. In particular, in the drop phase, whole kernels are removed instead of scalar weights. Similarly, in the grow phase, all connections of a kernel are added instead of single weights. The likelihood of adding a kernel between two feature maps is inversely proportional to their significance, similar to multilayer perceptron networks.

* **Masking Method**: A fraction $r$ of the sparse connections in each layer is dynamically changed based on the importance of the connections and neurons in that layer. Connection importance is estimated by its contribution to the change in the loss function. A first-order Taylor approximation is used to approximate the change in loss during one training iteration $i$. Connections are grown and dropped based on this importance score.

* **Mask Selection**: Use the whole network structure, no masking.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/spacenet.png?raw=true" width=700>
</p>


# Avoiding Forgetting and Allowing Forward Transfer in Continual Learning via Sparse Networks: AFAF
## ECML 2022
#### Ghada Sokar, Decebal Constantin Mocanu and Mykola Pechenizkiy

This paper improves SpaceNet by allowing knowledge transfer between tasks, which in the end increases the overall accuracy. It is still based on structured pruning (neuron pruning in FCNs and channel pruning in CNNs) while dynamically training the sparse network. To enable knowledge transfer between tasks, some new hyperparameters are introduced. Layers from $l$ = 1 up to but excluding layer $l$<sub>reuse</sub>, a hyperparameter, remain unchanged. To *selectively transfer* the relevant knowledge, for each layer l ≥ $l$<sub>reuse</sub>, the authors identify a set of candidate neurons R<sup>c</sup><sub>l</sub> that has a high potential of being useful when “reused” in learning class $c$ in a new task $t$. This potential is determined based on the neuron activation: if a neuron is highly activated for the new task, it is assumed to have a high potential to transfer knowledge. Another hyperparameter, κ, selects the number of neurons to reuse per layer. Finally, starting from $l$<sub>reuse</sub>, new sparse connections are added to the network to learn patterns that are specific to the new task $t$.

# Continual Prune-and-Select (CP&S): Class-Incremental Learning with specialized subnetworks
## *2022*
#### Aleksandr Dekhovich, David M.J. Tax, Marcel H.F. Sluiter, and Miguel A. Bessa

During training, Continual-Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given $task$ $t$. 

A new task is learned by training available neuronal connections (previously untrained) of the DNN to create a new subnetwork which can include previously trained connections belonging to other subnetwork(s) but those will not be updated. In other words, previously trained connections can be shared between tasks yet cannot be updated.

Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. This eliminates catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them.


* **Masking Method**: NNrelief
* **Mask Selection**: Try all the subnetworks and return the mask with the lowest entropy (*maximum output response*)

# Forget-free Continual Learning with Winning Subnetworks
## ICML 2022
#### Haeyong Kang et al.

For each task, WSN sequentially learns and chooses the best subnetwork. Specifically, WSN jointly learns the model weights and the task-adaptive binary masks pertaining to the subnetworks associated with each task. It also attempts to select a small set of weights to be activated *(winning ticket)* by reusing weights of the prior subnetworks.

It updates only the weights that have not been trained on the previous tasks. After training on each task, the model freezes the subnetwork parameters. Therefore, WSN is immune to catastrophic forgetting by design.

For example, in the figure, $task$ $t-1$ is learned by the orange subnetwork. In $task$ $t$, the orange weights can still be used in the forward pass, yet they cannot be updated in the backward pass. Updates are only allowed for unassigned weights.

WSN's pruning (masking) approach is a bit different, though. The network contains two sets of parameters: the weights θ used for learning, and the weight scores $s$ used for masking θ. Based on the weight scores $s$, the top $c$% of the weights in each layer are selected, where $c$ is the target layerwise capacity ratio in %: the selected weights are assigned 1 in the mask and the remaining ones 0. This approach applies the masking to the main network θ indirectly. To update the weight scores $s$, the loss of the main network (θ) is used in the backward pass.


* **Masking Method**: WSN tries to find best subnetwork by selecting the $c$% weights with respect to learnable weight scores $s$, where $c$ is the target layerwise capacity ratio in %.

* **Mask Selection**: Assumed that task identity is given in both the training and the testing stage.
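
The two rules above (top-$c$% selection by score, and updating only weights that no earlier task has used) can be sketched on a flat weight vector. The function names, the `frozen` flags, and all values are hypothetical, and the update of the scores $s$ themselves is omitted.

```python
def top_c_mask(scores, c):
    """Binary mask with 1 for the top-c% of weight scores in a layer, 0 elsewhere."""
    k = max(1, round(len(scores) * c / 100))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:k])
    return [1 if i in selected else 0 for i in range(len(scores))]


def wsn_update(theta, grads, task_mask, frozen, lr):
    """Update only weights that the current task selected and no earlier task froze."""
    return [w - lr * g if m == 1 and not f else w
            for w, g, m, f in zip(theta, grads, task_mask, frozen)]


theta = [0.5, -0.3, 0.8, 0.1]
scores = [0.9, 0.2, 0.7, 0.4]                # learnable weight scores s
frozen = [True, False, False, False]         # weight 0 belongs to an earlier task

mask = top_c_mask(scores, c=50)              # selects weights 0 and 2
grads = [1.0, 1.0, 1.0, 1.0]                 # stand-in gradient values
updated = wsn_update(theta, grads, mask, frozen, lr=0.1)
```

Weight 0 is part of the current subnetwork and is used in the forward pass, but it keeps its old value, while weight 2 is both selected and updated.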


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/winningsubnetworks.png?raw=true" width=700>
</p>




# Lifelong Learning with Dynamically Expandable Networks (DEN)
## ICLR 2018
#### Jaehong Yoon, Eunho Yang, Jeongtae Lee and Sung Ju Hwang

DEN selectively retrains the old network, expands its capacity when necessary, and thus dynamically decides its optimal capacity as it trains. DEN consists of 3 steps:
1. Train the network with $L$<sub>1</sub> regularization for $task$ $t$ to create some sparsity in the network. Then, train the network with $L$<sub>1</sub> regularization again for $task$ $t$+$1$, this time considering only the remaining parameters.
2. If the remaining parameters are not enough (𝕃oss ≥ τ) to learn $task$ $t$+$1$, expand the network in a top-down manner, while still eliminating any unnecessary neurons through $L$<sub>1</sub> regularization, which forces the network to be sparse.
3. If parameters shifted too much from their initial values (𝕡 ≥ σ), duplicate the corresponding neurons. After this duplication, the network needs to train the weights again, since the split changes the overall structure. However, in practice this secondary training usually converges fast due to the reasonable parameter initialization from the initial training.

    (𝕡 = ℓ<sub>2</sub>-distance between the incoming weights at $t$-$1$ and at $t$)

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/den.png?raw=true" width=800>
</p>



# Squeeze-and-Excitation Networks
## CVPR 2018
#### Jie Hu, Li Shen and Gang Sun

“Squeeze-and-Excitation” (SE) blocks try to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, they use global information to selectively emphasise informative features while suppressing less useful ones.

The basic structure of the SE block is illustrated in the figure. For any given transformation (e.g. a convolution or a set of convolutions), the resulting features are first passed through a *squeeze* operation. It aggregates the feature maps across the spatial dimensions with global pooling to produce a channel descriptor (CxHxW -> Cx1x1).

This is followed by an *excitation* operation, in which channel-specific activations, obtained with a sigmoid function, are learned for each channel. The features are then reweighted by the corresponding channel-specific activations. In the end, this simple attention mechanism lets the network weigh important channels (or feature maps) more heavily than the others for a given input.
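
The squeeze, excitation and reweighting steps can be sketched with nested lists. This sketch assumes the excitation is a small two-layer bottleneck (fully connected layer, ReLU, fully connected layer, sigmoid); the weight matrices `W1` and `W2` and the feature values are made up for illustration.

```python
import math


def se_block(features, W1, W2):
    """Apply squeeze-and-excitation to features of shape C x H x W (nested lists)."""
    # Squeeze: global average pooling, C x H x W -> C.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in W1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden)))) for row in W2]
    # Scale: reweight every channel by its gate.
    scaled = [[[g * v for v in row] for row in ch] for ch, g in zip(features, gates)]
    return scaled, gates


features = [
    [[1.0, 1.0], [1.0, 1.0]],     # channel 0, spatial mean 1.0
    [[2.0, 4.0], [6.0, 8.0]],     # channel 1, spatial mean 5.0
]
W1 = [[1.0, 0.2]]                 # 2 channels -> 1 hidden unit (hypothetical weights)
W2 = [[1.0], [-1.0]]              # 1 hidden unit -> 2 channel gates

scaled, gates = se_block(features, W1, W2)
```

Each gate lies between 0 and 1 and depends on the pooled statistics of all channels, so a channel is emphasised or suppressed based on global context rather than on its own values alone.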

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/squeezeandexcitation.png?raw=true" width=800>
</p>


# Powerpropagation: A sparsity inducing weight reparameterisation
## NeurIPS 2021
#### Jonathan Schwarz et al.

A weight-parameterization method for neural networks called powerpropagation produces models that are inherently sparse. Exploiting the behaviour of gradient descent, Powerprop gives rise to weight updates exhibiting a “rich get richer” dynamic by leaving low-magnitude parameters largely unaffected by learning. As a result, models trained in this way perform similarly but have a distribution with a noticeably higher density at zero, enabling the safe pruning of more parameters.

In the forward pass of a neural network, the parameters of the model are raised to the α-th power (where α > 1) while preserving their sign. By the chain rule, the parameter magnitude raised to the power α − 1 then appears in the gradient computation, scaling the usual update. Because of this, larger-magnitude parameters receive larger gradient updates than smaller-magnitude parameters do, resulting in the previously mentioned "rich get richer" phenomenon.

In the simple formulation w = v|v|<sup>α−1</sup>, for any power α ≥ 1, the sign of v is preserved so that w can still represent both negative and positive values. For α = 1 this recovers the standard backpropagation setting. For α > 1, weight updates are still obtained by standard backpropagation, but with rescaled gradients that enforce the "rich get richer" phenomenon.
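
The effect on the update can be checked with a few lines of Python. Since w = v|v|<sup>α−1</sup>, the chain rule gives dw/dv = α|v|<sup>α−1</sup>; the function names, learning rate, and values below are illustrative assumptions.

```python
def powerprop_weight(v, alpha):
    """Effective weight w = v * |v|^(alpha - 1); the sign of v is preserved."""
    return v * abs(v) ** (alpha - 1)


def powerprop_step(v, grad_w, alpha, lr):
    """One gradient step on v; the chain rule gives dw/dv = alpha * |v|^(alpha - 1)."""
    return v - lr * grad_w * alpha * abs(v) ** (alpha - 1)


grad_w = 1.0     # same upstream gradient dLoss/dw for both parameters
lr = 0.1

small_after = powerprop_step(0.1, grad_w, alpha=2, lr=lr)   # moves by about 0.02
large_after = powerprop_step(1.0, grad_w, alpha=2, lr=lr)   # moves by about 0.2

# With alpha = 1 both parameters would move by the same amount (lr * grad_w).
small_standard = powerprop_step(0.1, grad_w, alpha=1, lr=lr)
large_standard = powerprop_step(1.0, grad_w, alpha=1, lr=lr)
```

With α = 2 the large parameter moves ten times as far as the small one for the same upstream gradient, which is the "rich get richer" dynamic in miniature.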

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/powerprop.png?raw=true" width=900>
</p>


# Learning without Forgetting
## TPAMI 2017
#### Zhizhong Li and Derek Hoiem

Learning without Forgetting (LwF) approach could be seen as a combination of Distillation Networks and fine-tuning. Fine-tuning modifies the parameters of an existing CNN to train a new task. A small learning rate is often used, and sometimes part of the network is frozen to prevent overfitting. Distillation Networks helps simpler networks to return more reasonable outputs by providing additional info from the input data.

In LwF, let $t$ denote the index of the current task and let the tasks be defined in a class-incremental manner. Each previous task's network then becomes the distillation network for the following task (e.g. the network of $task$ $t-1$ is assigned as the distillation network of $task$ $t$). This distillation network guides the main network of the current task during fine-tuning so that it does not forget the previously learned tasks:

0. Add $c$ nodes to the classifier of the old network (distillation network) from $task$ $t-1$ to create the new network for $task$ $t$.
1. Forward the input of $task$ $t$ through both networks.
2. Call the logits of the distillation network *soft targets* and the class labels of the $task$ $t$ input *hard targets*.
3. Compare the hard targets with the new network's predictions to calculate the $cross$-$entropy$ $loss$.
4. Compare the soft targets with the logits of the new network to calculate the $distillation$ $loss$ (ignore the $c$ newly added nodes; in other words, consider only the old classes, which should not be forgotten).
5. Run backpropagation with **Loss** so that the new network is forced to learn the new task while preserving the old tasks through the distilled information, where

    **Loss** = $cross$-$entropy$ $loss$ + λ*$distillation$ $loss$.
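The combined loss for a single example can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the distillation term is a plain cross-entropy between the softmax of the old network's logits and the softmax of the new network's old-class logits, without the temperature scaling that distillation methods commonly use. The names `lwf_loss`, `softmax`, and `lam` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lwf_loss(new_logits, old_logits, label, lam=1.0):
    """Return (total, cross_entropy, distillation) for one example.

    new_logits: outputs of the new network (old classes first, then the c new ones)
    old_logits: outputs of the distillation network (old classes only)
    label:      index of the true class, i.e. the hard target
    """
    n_old = len(old_logits)

    # Hard-target term: cross-entropy over all classes of the new network.
    ce = -math.log(softmax(new_logits)[label])

    # Soft-target term: compare only the old-class outputs, ignoring the new nodes.
    soft_targets = softmax(old_logits)
    new_old_probs = softmax(new_logits[:n_old])
    distill = -sum(t * math.log(p) for t, p in zip(soft_targets, new_old_probs))

    return ce + lam * distill, ce, distill


# Two old classes, one newly added class (index 2) that is the true label.
total, ce, distill = lwf_loss([2.0, 0.5, 1.0], [2.0, 0.5], label=2, lam=1.0)
print(total, ce, distill)
```

The distillation term is smallest when the new network reproduces the old network's outputs on the old classes, so any drift away from them is penalized in proportion to λ.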

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/lwf.png?raw=true" width=500>
</p>

# Overcoming catastrophic forgetting in neural networks
## PNAS 2017
#### James Kirkpatrick et al.

This paper develops an algorithm analogous to synaptic consolidation for artificial neural networks, referred to as elastic weight consolidation (EWC). The algorithm slows down learning (i.e. changes) on certain weights based on how important they are to previously seen tasks. This importance is given by the diagonal of the Fisher information matrix $F$, which has three key properties: (i) it is equivalent to the second derivative of the loss near a minimum, (ii) it can be computed from first-order derivatives alone and is thus easy to calculate even for large models, and (iii) it is guaranteed to be positive semidefinite.

Overall, the loss function that we try to minimize in EWC is:

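$L$(θ) = $L$<sub>B</sub>(θ) + ∑<sub>i</sub> (λ / 2) $F$<sub>i</sub> (θ<sub>i</sub> − θ<sub>A,i</sub><sup>*</sup>)<sup>2</sup>

where

$L$<sub>B</sub>(θ): is the loss for the later task B only,

$F$<sub>i</sub>: is the i-th diagonal entry of the Fisher information matrix, i.e. the importance of parameter i for task A,

θ<sub>A,i</sub><sup>*</sup>: is the value of parameter i after training on task A,

λ: sets how important the old task A is compared with the new one, and

i: labels each parameter.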
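As a minimal sketch of the penalty term, the snippet below computes the quadratic cost of moving each parameter away from its value after task A, weighted by its importance. The Fisher values and parameter values are made-up numbers chosen for illustration, and `ewc_penalty` and `ewc_loss` are hypothetical helper names.

```python
def ewc_penalty(theta, theta_a_star, fisher, lam):
    """Sum over parameters of (lam / 2) * F_i * (theta_i - theta*_A,i)**2."""
    return sum(
        0.5 * lam * f * (t - ta) ** 2
        for t, ta, f in zip(theta, theta_a_star, fisher)
    )


def ewc_loss(loss_b, theta, theta_a_star, fisher, lam):
    """Task-B loss plus the EWC penalty that anchors parameters important for task A."""
    return loss_b + ewc_penalty(theta, theta_a_star, fisher, lam)


theta_a_star = [1.0, 1.0]  # parameters after training on task A (illustrative)
fisher = [10.0, 0.1]       # parameter 0 is important for task A, parameter 1 is not

# Moving the important parameter by 0.5 is far more expensive than moving the other one.
cost_important = ewc_penalty([1.5, 1.0], theta_a_star, fisher, lam=1.0)
cost_unimportant = ewc_penalty([1.0, 1.5], theta_a_star, fisher, lam=1.0)
print(cost_important, cost_unimportant)
```

Setting every Fisher entry to the same value would turn this into a uniform L2 anchor, which restricts all weights equally regardless of their importance.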

EWC will attempt to keep the network parameters close to the learned parameters of both tasks A and B when switching to a third task. This can be enforced either with two separate penalties or as one by noting that the sum of two quadratic penalties is itself a quadratic penalty.

In the figure, the parameters θ<sub>A</sub><sup>*</sup> are obtained after learning the first task. If we take gradient steps according to task B alone (blue arrow), we minimize the loss of task B but destroy what we have learned for task A. If we instead constrain every weight equally by assigning the same coefficient to each of them (green arrow), the restriction is too severe: we retain task A but risk failing to learn task B. EWC, on the other hand, explicitly calculates how important each weight is for task A in order to find a solution for task B without incurring a major loss on task A (red arrow).


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/ewc.png?raw=true" width=450>
</p>

# Continual Learning Through Synaptic Intelligence
## ICML 2017
#### Friedemann Zenke, Ben Poole and Surya Ganguli

Synaptic Intelligence (SI) accumulates task-relevant information over time and exploits this information to store new memories without forgetting old ones. To do that, each individual synapse is assigned a local measure of “importance” for solving the tasks that the network has been trained on in the past. For brevity, the term “synapse” is used synonymously with the term “parameter”, which includes the weights and biases between layers.

When training on a new task, SI penalizes changes in the important parameters to prevent old memories from being overwritten. To that end, SI calculates each parameter's contribution to the loss function by tracking how the loss responds as that parameter changes. If a small change in a parameter affects the loss heavily, the parameter plays an important role for the task and should not be updated, in order to preserve old knowledge. Conversely, if a small change in a parameter barely affects the loss, the parameter is not crucial for the task and can be updated freely to learn the next task.

**Importance** = *Parameter's contribution to the loss function* ≈ **Gradient of the loss with respect to the parameter, accumulated over the parameter's updates during training**
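A rough sketch of how such an importance score can be accumulated is given below. It follows the path-integral idea of SI: at every training step the product of (negative) gradient and parameter update is added to a running sum, and at the end of the task the sum is normalized by the squared total change of the parameter plus a small damping constant. The toy quadratic loss, the learning rate, the damping value `xi`, and the function names are assumptions made for illustration.

```python
def grad(theta, curvature):
    """Gradient of the toy loss L = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvature, theta)]


def train_and_track(theta, curvature, lr=0.01, steps=200, xi=1e-3):
    """Run plain gradient descent and accumulate a per-parameter importance."""
    start = list(theta)
    omega = [0.0] * len(theta)  # running contribution of each parameter to the loss drop
    for _ in range(steps):
        g = grad(theta, curvature)
        delta = [-lr * gi for gi in g]  # parameter update of this step
        omega = [w - gi * di for w, gi, di in zip(omega, g, delta)]
        theta = [t + d for t, d in zip(theta, delta)]
    total_change = [t - s for t, s in zip(theta, start)]
    # Normalize by the squared total change; xi avoids division by zero.
    importance = [w / (d ** 2 + xi) for w, d in zip(omega, total_change)]
    return theta, importance


# Parameter 0 strongly affects the loss, parameter 1 barely does.
theta, importance = train_and_track([1.0, 1.0], curvature=[4.0, 0.01])
print(importance)
```

In this toy setting the parameter with the steep loss direction ends up with the larger importance, so a later task would be penalized more for changing it.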

# Gradmax: Growing neural networks using gradient information
## ICLR 2022
#### Utku Evci, Bart van Merrienboer, Thomas Unterthiner, Max Vladymyrov, Fabian Pedregosa

GradMax aims to grow the network architecture without costly retraining. It adds new neurons during training without impacting what has already been learned, while improving the training dynamics by maximizing the gradients of the new weights. The optimal initialization is found efficiently by means of the singular value decomposition (SVD).

It starts with a small seed architecture. Then, over the course of training, new neurons are added to the seed architecture, either by increasing the width of existing layers or by creating new layers.

In the illustration of GradMax, growing a neuron requires initializing the incoming weights W<sub>l</sub><sup>new</sup> and the outgoing weights W<sub>l+1</sub><sup>new</sup> of the new neuron. GradMax sets the incoming weights W<sub>l</sub><sup>new</sup> to zero (dashed lines) in order to keep the output unchanged, so that adding the neuron does not disturb what the existing weights have learned. It initializes the outgoing weights W<sub>l+1</sub><sup>new</sup> using SVD. This maximizes the gradients on the incoming weights with the aim of accelerating training.
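The function-preserving part of this growth step can be checked with a small sketch. It assumes a single hidden layer with ReLU activations and a scalar output; the SVD-based choice of the outgoing weight is not implemented here, and the value `0.9` is only a placeholder for it.

```python
def relu(x):
    return max(0.0, x)


def forward(x, w_in, w_out):
    """One hidden layer: w_in[j] holds the incoming weights of hidden neuron j,
    w_out[j] is its outgoing weight to a single output unit."""
    hidden = [relu(sum(wi * xi for wi, xi in zip(row, x))) for row in w_in]
    return sum(h * wo for h, wo in zip(hidden, w_out))


x = [1.0, -2.0, 0.5]
w_in = [[0.2, -0.1, 0.4], [0.3, 0.8, -0.5]]
w_out = [1.5, -0.7]
before = forward(x, w_in, w_out)

# Grow one neuron: zero incoming weights, arbitrary outgoing weight.
# (In GradMax the outgoing weight would come from an SVD; 0.9 is a placeholder.)
w_in_grown = w_in + [[0.0, 0.0, 0.0]]
w_out_grown = w_out + [0.9]
after = forward(x, w_in_grown, w_out_grown)

print(before, after)
```

Because the new neuron's pre-activation is zero, its activation is zero and the network output is the same before and after growing, whatever outgoing weight is chosen.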


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/gradmax.png?raw=true" width=700>
</p>

# Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
## ICML 2022
#### Utku Evci, Vincent Dumoulin, Hugo Larochelle and Michael C. Mozer

Head-to-Toe probing (Head2Toe) selects features from all layers of the source model to train a classification head for the target domain. It aims to replace the traditional transfer learning approach, which uses only the last layer, by assuming that relevant feature maps can occur anywhere in the network.

It connects the outputs (feature maps) of all layers to the classifier head and then applies a Lasso regularizer to select only the relevant features for the classifier. Since Lasso forces connections to be zero if they are irrelevant, only important feature maps contribute to the output.
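Why a Lasso penalty yields exact zeros can be illustrated with a small sketch. It is not the Head2Toe procedure itself: it assumes a toy linear regression with two hand-made, orthogonal features standing in for feature maps from two different layers, and solves the L1-regularized problem with proximal gradient descent (ISTA). The names `soft_threshold` and `lasso_ista` are illustrative.

```python
def soft_threshold(value, lam):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_ista(features, targets, lam, steps=50):
    """Minimize (1 / 2n) * ||X w - y||^2 + lam * ||w||_1 with step size 1.

    A step size of 1 is valid here only because the toy features satisfy
    X^T X / n = I (an assumption of this sketch).
    """
    n, d = len(targets), len(features[0])
    w = [0.0] * d
    for _ in range(steps):
        residual = [
            sum(wj * xj for wj, xj in zip(w, row)) - y
            for row, y in zip(features, targets)
        ]
        grad = [sum(r * row[j] for r, row in zip(residual, features)) / n for j in range(d)]
        w = [soft_threshold(wj - gj, lam) for wj, gj in zip(w, grad)]
    return w


# Column 0: a relevant feature (say, from an early layer); column 1: a nearly irrelevant one.
features = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
targets = [2.0 * f0 + 0.05 * f1 for f0, f1 in features]

weights = lasso_ista(features, targets, lam=0.1)
print(weights)
```

The weight of the nearly irrelevant feature is driven to exactly zero, so that feature drops out of the classifier, while the relevant feature keeps a slightly shrunken weight.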

Head2Toe matches the performance obtained with fine-tuning on average while reducing training and storage cost. Critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/head2toe.png?raw=true" width=350>
</p>