# Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science (SET)  
## *Nature Communications 2018*

Randomly initialize the sparsely connected layers (SCLs) of the network and start training.

At the end of each epoch, remove the connections with the smallest weight magnitudes (the “weakest” connections) based on a threshold $t$ and replace them with the same number of randomly initialized new connections.

Repeat.
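
The prune-and-regrow step can be sketched in NumPy as follows. This is a minimal illustration for a single layer, not the authors' implementation: the function name `set_prune_and_regrow`, the binary `mask` representation, the replacement fraction `zeta` (used here instead of an explicit threshold $t$) and the small normal initialization are all assumptions made for the example.

```python
import numpy as np

def set_prune_and_regrow(weights, mask, zeta=0.3, rng=None):
    """One SET-style topology update for a single sparse layer (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    weights, mask = weights.copy(), mask.copy()

    active = np.flatnonzero(mask)            # flat indices of existing connections
    n_replace = int(zeta * active.size)      # how many connections to replace
    if n_replace == 0:
        return weights, mask

    # Remove the weakest connections (smallest absolute weight).
    weakest = active[np.argsort(np.abs(weights.flat[active]))[:n_replace]]
    mask.flat[weakest] = 0
    weights.flat[weakest] = 0.0

    # Add the same number of connections at random empty positions
    # (assumption: positions that were just removed may be chosen again).
    empty = np.flatnonzero(mask == 0)
    grown = rng.choice(empty, size=n_replace, replace=False)
    mask.flat[grown] = 1
    weights.flat[grown] = rng.normal(0.0, 0.01, size=n_replace)
    return weights, mask
```

Calling this once per epoch keeps the number of connections constant while the topology changes.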

SET turns out to be surprisingly robust and stable. Encouragingly, the authors show that SET models with far fewer parameters achieve results very similar to those of models with fully connected layers (FCLs), sometimes even surpassing them.




# Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization (Dynamic Sparse Reparameterization - DSR)
## *ICML 2019*

Randomly initialize SCLs in our network and start training.

Remove the connections with the smallest weight magnitudes based on a global adaptive threshold $t$.

For each layer, count the non-zero weights that survive pruning.

Immediately after $K$ parameters are removed in the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors: layers with larger fractions of the surviving non-zero weights receive proportionally more free parameters. The heuristic behind this is that free parameters should be redistributed to layers whose parameters receive larger loss gradients.

Repeat.
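
The redistribution rule can be illustrated with a few lines of NumPy. This is only a sketch under stated assumptions: the function name `redistribute_free_parameters` and the tie-breaking rule for the rounding remainder are hypothetical and not taken from the paper.

```python
import numpy as np

def redistribute_free_parameters(surviving_counts, total_free):
    """Split `total_free` parameters across layers in proportion to the
    number of non-zero weights that survived pruning in each layer."""
    surviving = np.asarray(surviving_counts, dtype=np.int64)
    # Integer arithmetic avoids floating-point rounding surprises.
    grow = (surviving * total_free) // surviving.sum()
    # Hand out what is left after rounding down, largest layers first
    # (assumed tie-breaking rule).
    leftover = int(total_free - grow.sum())
    for layer in np.argsort(-surviving, kind="stable")[:leftover]:
        grow[layer] += 1
    return grow
```

For example, with 60, 30 and 10 surviving weights in three layers and 10 free parameters, the layers receive 6, 3 and 1 new parameters respectively.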


# Sparse Networks from Scratch: Faster Training without Losing Performance (Sparse Momentum - SM)
## *ICLR 2020*

Randomly initialize SCLs in our network and start training.

Calculate the mean magnitude of the momentum $M$ for each layer.

Remove the 50% of weights with the smallest magnitude in each layer.

Immediately after $K$ parameters are removed in the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors based on the mean momentum magnitude $M$: layers with a larger share of the total mean momentum receive proportionally more free parameters.

Repeat.
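
A rough NumPy sketch of the two quantities involved is given below. The helper names `momentum_shares` and `prune_smallest` are hypothetical, and averaging the momentum only over active (non-zero) weights is an assumption of this example.

```python
import numpy as np

def momentum_shares(momenta, masks):
    """Share of the regrowth budget per layer, from the mean momentum
    magnitude of the active weights in each layer."""
    mean_mag = np.array([np.abs(m[mask == 1]).mean()
                         for m, mask in zip(momenta, masks)])
    return mean_mag / mean_mag.sum()

def prune_smallest(weights, mask, prune_rate=0.5):
    """Deactivate the `prune_rate` fraction of active weights with the
    smallest magnitude; returns the new mask and the number removed."""
    mask = mask.copy()
    active = np.flatnonzero(mask)
    n_prune = int(prune_rate * active.size)
    smallest = active[np.argsort(np.abs(weights.flat[active]))[:n_prune]]
    mask.flat[smallest] = 0
    return mask, n_prune
```

If $K$ weights were pruned in total, layer $l$ would then regrow roughly `shares[l] * K` zero-initialized weights.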

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/SM.jpg?raw=true" width=700>
</p>



# Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers (DST)
## *ICLR 2020*

Randomly initialize a fully connected neural network.

Randomly initialize the trainable mask layers (layer-level thresholds $t$) and start training the masked neural network.

Instead of pruning (masking) between two training epochs according to a predefined pruning schedule, this method prunes and recovers network parameters at every training step, which is far more fine-grained than existing methods.

In each training step, the respective threshold is subtracted from the magnitude of each parameter. The parameter is masked (pruned) if the result is smaller than 0 and kept otherwise:

$Q_{ij} = F(W_{ij}, t_i) = |W_{ij}| - t_i$

$M_{ij} = S(Q_{ij})$, where $M_{ij} = 1$ if the parameter is not pruned and $M_{ij} = 0$ if it is pruned

Repeat.

However, the authors observe that $t_i$ cannot be learned or updated under the function $S(x)$, since its gradient is always equal to 0. They therefore introduce an approximation function $H(x)$ that has a gradient, which makes it possible to learn $t_i$ and consequently all the mask layers.

Finally, after training, the model is sparse according to the trained mask layers (layer-level thresholds $t_i$).
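
The forward computation of the mask can be written compactly in NumPy. This sketch assumes one threshold $t_i$ per row of the weight matrix and a hypothetical function name `step_mask`; the approximation $H(x)$ used for the backward pass is not shown.

```python
import numpy as np

def step_mask(weights, thresholds):
    """Binary mask M_ij = S(|W_ij| - t_i) for a weight matrix.

    weights:    array of shape (n_out, n_in)
    thresholds: array of shape (n_out,), one threshold per row (assumed)
    """
    q = np.abs(weights) - thresholds[:, None]   # Q_ij = |W_ij| - t_i
    return (q >= 0).astype(weights.dtype)       # 1 = kept, 0 = pruned

# The masked layer then uses weights * step_mask(weights, thresholds).
```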

<p align="center">
<img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/DST.jpg?raw=true" width=550>
</p>


# PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
## *CVPR 2018*
#### Arun Mallya and Svetlana Lazebnik

Inspired by network pruning techniques, PackNet exploits redundancies in large deep networks to free up parameters that can then be used to learn new tasks. By iterating between pruning and re-training, PackNet sequentially “packs” multiple tasks into a single network. After finding the $n$<sup>th</sup> pack, it freezes those weights and assigns the pack as the subnetwork of $task$ $n$. The drawback is that PackNet forces subsequent tasks to use the previously trained and now fixed connections, which is called biased transfer. Nevertheless, it is one of the first approaches to pave the way for trainable subnetworks.

In the figure, white circles represent the neurons that are still available in the backbone, while bold circles indicate neurons that are already occupied by another pack and fixed. In the next pack selection these neurons therefore have to be used, whether or not they are relevant to the task at hand.

* **Masking Method**: Train the whole backbone, then remove 50% or 75% of the connections based on the absolute magnitude of the weights. Re-train.
* **Mask Selection**: Task identity is assumed to be given at both training and test time.
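
The magnitude-based assignment of weights to a task can be sketched as below. The `owner` array (0 for free weights, `k` for weights frozen for task `k`) and the function name `packnet_prune` are assumptions made for this illustration.

```python
import numpy as np

def packnet_prune(weights, owner, task_id, prune_frac=0.5):
    """After training `task_id` on the free weights, keep the largest
    (1 - prune_frac) of them for this task and release the rest."""
    owner = owner.copy()
    free = np.flatnonzero(owner == 0)                  # weights not yet frozen
    n_keep = free.size - int(prune_frac * free.size)
    by_magnitude = np.argsort(-np.abs(weights.flat[free]))
    owner.flat[free[by_magnitude[:n_keep]]] = task_id  # frozen for this task
    return owner                                       # remaining 0s stay free
```

Weights that remain unassigned are available for the next task, while weights with a non-zero owner are never updated again.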

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/packnet.png?raw=true" width=800>
</p>

# Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
## *ECCV 2018*
#### Arun Mallya, Dillon Davis, and Svetlana Lazebnik

Piggyback questions whether the weights of a network have to be changed at all. It suggests that we might get reasonable results by just selectively masking weights, i.e. setting certain weights to 0 while keeping the rest unchanged.

Based on this idea, Piggyback learns how to mask the weights of an existing pretrained network (e.g. VGG-16) to obtain good performance on a new task, as shown in the figure. Binary masks that take values in {0, 1} are learned and stored for each task. Since the model itself is already pretrained, Piggyback trains only the masks: it starts with real-valued masks and updates them using the loss of the pretrained model. A threshold function is then applied to make the masks binary.

* **Masking Method**: Mask the pretrained model with learnable real-valued masks, which are converted to binary masks by thresholding.
* **Mask Selection**: Task identity is assumed to be given at both training and test time.
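
A minimal NumPy sketch of a masked linear layer is shown below. The function names, the threshold value `5e-3` and the straight-through treatment of the threshold in `mask_gradient` are assumptions of this example rather than details confirmed by the summary above.

```python
import numpy as np

def piggyback_forward(x, pretrained_w, real_mask, threshold=5e-3):
    """Linear layer whose frozen weights are gated by a thresholded mask.

    x: (batch, n_in), pretrained_w and real_mask: (n_out, n_in)
    """
    binary_mask = (real_mask > threshold).astype(pretrained_w.dtype)
    return x @ (pretrained_w * binary_mask).T, binary_mask

def mask_gradient(x, pretrained_w, grad_out):
    """Gradient of the loss w.r.t. the mask, passed straight through the
    threshold to the real-valued mask (assumed estimator).

    grad_out: (batch, n_out) gradient of the loss w.r.t. the layer output
    """
    return (grad_out.T @ x) * pretrained_w
```

Only `real_mask` is updated with this gradient; `pretrained_w` stays fixed for every task.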

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/piggyback.png?raw=true" width=700>
</p>


# SupSup: Supermasks in Superposition
## *NeurIPS 2020*
#### Mitchell Wortsman et al.

The Supermasks in Superposition (SupSup) model uses a randomly initialized, fixed base network and finds a subnetwork (supermask) for each task. If the task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If it is not provided, SupSup infers the task using gradient-based optimization to find a linear superposition of the learned supermasks that minimizes the output entropy. The authors find experimentally that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. Hence, SupSup is suitable for class-incremental learning (class-IL).


During training, SupSup learns a separate supermask (subnetwork) for each task. At inference time, it can infer the task identity by superimposing all supermasks. Ideally, the appropriate supermask for a given task should produce a confident output distribution (i.e. one with low entropy).

* **Masking Method**: Supermasks, as described in *Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask*.
* **Mask Selection**: Try all the supermasks and return the mask with the lowest output entropy.
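
The exhaustive variant of mask selection can be sketched for a single linear layer as follows. The single-layer setting and the helper names are assumptions of this example, and the gradient-based superposition search is not shown.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

def select_supermask(x, weights, masks):
    """Apply every supermask to the fixed weights and return the index of
    the mask whose output distribution has the lowest entropy."""
    entropies = [entropy(softmax((weights * m) @ x)) for m in masks]
    return int(np.argmin(entropies)), entropies
```

The cost of this variant grows linearly with the number of tasks, which is what the gradient-based search is meant to avoid.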

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/supsup.png?raw=true" width=800>
</p>


# SpaceNet: Make Free Space for Continual Learning
## *Neurocomputing 2021*
#### Ghada Sokar, Decebal Constantin Mocanu, and Mykola Pechenizkiy

SpaceNet trains sparse deep neural networks from scratch so that each task occupies a compact set of neurons. When the model faces a new task, new sparse connections are randomly allocated between a selected number of neurons in each layer. By the end of training, the initial distribution of the connections has changed and the connections that are important for that task are grouped together.

The neurons that are most important for a task are reserved for that task: they are frozen and are not visible to subsequent tasks. Neurons that are only moderately important, or not important at all, continue to be shared between tasks. For example, in the figure, fully filled circles represent the most important neurons, which become specific to $task$ $t$, whereas partially filled circles are less important and can be shared with other tasks. Multi-colored circles represent neurons that are used by multiple tasks. After learning $task$ $t$, the corresponding weights are kept fixed.

For convolutional neural networks, SpaceNet performs the drop and grow phases in a coarse-grained manner to impose structured sparsity instead of irregular sparsity. In particular, in the drop phase whole kernels are removed instead of scalar weights. Similarly, in the grow phase all connections of a kernel are added at once instead of single weights. The likelihood of adding a kernel between two feature maps is inversely proportional to their significance, similar to multilayer perceptron networks.

* **Masking Method**: A fraction $r$ of the sparse connections in each layer is dynamically changed based on the importance of the connections and neurons in that layer. The importance of a connection is estimated by its contribution to the change in the loss function, using a first-order Taylor approximation of the change in loss during one training iteration $i$. Connections are grown and dropped based on this importance score.

* **Mask Selection**: Use the whole network structure, no masking.

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/spacenet.png?raw=true" width=700>
</p>


# Continual Prune-and-Select (CP&S): Class-Incremental Learning with specialized subnetworks
## *2022*
#### Aleksandr Dekhovich, David M.J. Tax, Marcel H.F. Sluiter, and Miguel A. Bessa

During training, Continual Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given $task$ $t$.

A new task is learned by training the available (previously untrained) connections of the DNN to create a new subnetwork. This subnetwork can include previously trained connections that belong to other subnetworks, but those are not updated. In other words, previously trained connections can be shared between tasks but cannot be changed.

Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. This eliminates catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them.


* **Masking Method**: NNrelief
* **Mask Selection**: Try all the subnetworks and return the mask with the lowest entropy (*maximum output response*).

# Forget-free Continual Learning with Winning Subnetworks
## *ICML 2022*
#### Haeyong Kang et al.

For each task, Winning SubNetworks (WSN) sequentially learns and selects the best subnetwork. Specifically, WSN jointly learns the model weights and the task-adaptive binary masks of the subnetworks associated with each task. It also attempts to select a small set of weights to be activated *(winning ticket)* by reusing weights of the prior subnetworks.

It updates only the weights that have not been trained on previous tasks. After training on each task, the model freezes the subnetwork parameters. WSN is therefore immune to catastrophic forgetting by design.

For example, in the figure, $task$ $t-1$ is learned by the orange subnetwork. In $task$ $t$, the orange weights can still be used in the forward pass, but they are not updated in the backward pass; only unassigned weights are updated.

Its pruning (masking) approach is a bit different, though. WSN's network contains two sets of parameters: the weights θ for learning, and the weight scores $s$ for masking the network (θ). Based on the weight scores $s$, the top $c$% of the weights are selected, where $c$ is the target layerwise capacity ratio in %: their mask entries are set to 1 and the remaining ones to 0. This applies the masking to the main network with parameters θ indirectly. The weight scores $s$ are updated with the loss of the main network (θ) in its backward pass.


* **Masking Method**: WSN tries to find the best subnetwork by selecting the top $c$% of the weights according to learnable weight scores $s$, where $c$ is the target layerwise capacity ratio in %.

* **Mask Selection**: Task identity is assumed to be given at both training and test time.
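
The top-$c$% selection and the rule that only unassigned weights are updated can be sketched in NumPy. Both function names and the `used_by_previous` bookkeeping array are hypothetical names introduced for this illustration.

```python
import numpy as np

def top_c_mask(scores, c):
    """Binary mask that selects the top c% of entries by weight score."""
    k = int(round(scores.size * c / 100.0))
    mask = np.zeros(scores.size, dtype=int)
    mask[np.argsort(-scores.ravel())[:k]] = 1
    return mask.reshape(scores.shape)

def trainable_entries(task_mask, used_by_previous):
    """Weights in the current subnetwork that may still be updated:
    selected for this task and not frozen by an earlier task."""
    return task_mask * (1 - used_by_previous)
```

Weights selected by `top_c_mask` but already frozen still take part in the forward pass; they simply receive no update.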


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/winningsubnetworks.png?raw=true" width=700>
</p>




# Lifelong Learning with Dynamically Expandable Networks (DEN)
## *ICLR 2018*
#### Jaehong Yoon, Eunho Yang, Jeongtae Lee and Sung Ju Hwang

DEN selectively retrains the old network and expands its capacity when necessary, thus dynamically deciding its optimal capacity as it trains. DEN consists of 3 steps:
1. Train the network with $L$<sub>1</sub> regularization on $task$ $t$ to create sparsity in the network. Then train the network with $L$<sub>1</sub> regularization again on $task$ $t$+$1$, this time considering only the remaining parameters.
2. If the remaining parameters are not enough to learn $task$ $t$+$1$ (𝕃oss ≥ τ), expand the network in a top-down manner, while still eliminating unnecessary neurons with $L$<sub>1</sub> regularization, which forces sparsity.
3. If parameters have shifted too much from their initial values (𝕡 ≥ σ), duplicate the corresponding neurons. After this duplication, the network needs to be trained again, since the split changes the overall structure. In practice, however, this secondary training usually converges fast thanks to the reasonable parameter initialization from the initial training.

    (𝕡 = ℓ<sub>2</sub>-distance between the incoming weights at $t$-$1$ and at $t$)
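
The drift test in step 3 can be illustrated with NumPy. The function name `neurons_to_split` and the convention that each row holds the incoming weights of one neuron are assumptions of this sketch.

```python
import numpy as np

def neurons_to_split(w_prev, w_curr, sigma):
    """Indices of neurons whose incoming weights drifted by at least sigma.

    w_prev, w_curr: (n_neurons, n_inputs) incoming weights at t-1 and t
    """
    drift = np.linalg.norm(w_curr - w_prev, axis=1)   # l2-distance per neuron
    return np.flatnonzero(drift >= sigma), drift
```

The returned neurons would be duplicated, with the copy keeping the old weights, before the network is trained again.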

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/den.png?raw=true" width=800>
</p>



# Squeeze-and-Excitation Networks
## *CVPR 2018*
#### Jie Hu, Li Shen and Gang Sun

“Squeeze-and-Excitation” (SE) blocks aim to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, they use global information to selectively emphasise informative features and suppress less useful ones.

The basic structure of the SE block is illustrated in the figure. For any given transformation (e.g. a convolution or a set of convolutions), the resulting features are first passed through a *squeeze* operation. It aggregates the feature maps across the spatial dimensions with global pooling to produce a channel descriptor (CxHxW -> Cx1x1).

This is followed by an *excitation* operation, in which a channel-specific activation is learned for each channel and obtained through a sigmoid function. The features are then reweighted by their corresponding channel-specific activations. In effect, this simple attention mechanism weights important channels (or feature maps) more heavily than the others for a given input.
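
A compact NumPy sketch of the two operations for a single feature tensor is given below. The two-layer bottleneck with a ReLU in between and the weight names `w1` and `w2` are assumptions of this example; biases are omitted for brevity.

```python
import numpy as np

def se_block(features, w1, w2):
    """Squeeze-and-excitation on one feature tensor.

    features: (C, H, W); w1: (C // r, C); w2: (C, C // r)
    """
    squeezed = features.mean(axis=(1, 2))            # squeeze: CxHxW -> C
    hidden = np.maximum(w1 @ squeezed, 0.0)          # bottleneck with ReLU
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))     # sigmoid gate per channel
    return features * scale[:, None, None], scale    # reweight each channel
```

The output has the same shape as the input, so the block can be attached after an existing transformation without changing the rest of the network.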

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/squeezeandexcitation.png?raw=true" width=800>
</p>


# Convolutional networks with adaptive inference graphs

...

# Powerpropagation: A sparsity inducing weight reparameterisation
## *NeurIPS 2021*
#### Jonathan Schwarz et al.

A weight-parameterization method for neural networks called powerpropagation produces models that are inherently sparse. Exploiting the behaviour of gradient descent, Powerprop gives rise to weight updates exhibiting a “rich get richer” dynamic by leaving low-magnitude parameters largely unaffected by learning. As a result, models trained in this way perform similarly but have a distribution with a noticeably higher density at zero, enabling the safe pruning of more parameters.

In the forward pass of a neural network, raise the parameters of the model to the α-th power (where α > 1) while preserving the sign. As a result, the parameters raised to the power α − 1 appear in the gradient computation and scale the usual update. Larger-magnitude parameters therefore receive larger gradient updates than smaller-magnitude ones, resulting in the previously mentioned "rich get richer" phenomenon.

In a simple formulation, w = v|v|<sup>α−1</sup> for an arbitrary power α ≥ 1, which preserves the sign of v so that it can still represent both negative and positive values. For α = 1 this recovers the standard backpropagation setting. For α > 1, the weight updates are still obtained by standard backpropagation, but with gradients scaled to enforce the "rich get richer" phenomenon.
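
The reparameterisation and the resulting gradient scaling follow directly from the chain rule, since dw/dv = α|v|<sup>α−1</sup>. The NumPy sketch below uses hypothetical function names and is meant only to make the scaling explicit.

```python
import numpy as np

def powerprop_weight(v, alpha):
    """Effective weight w = v * |v|^(alpha - 1); the sign of v is preserved."""
    return v * np.abs(v) ** (alpha - 1)

def powerprop_grad(v, grad_w, alpha):
    """Gradient w.r.t. v given the gradient w.r.t. w (chain rule):
    dL/dv = dL/dw * alpha * |v|^(alpha - 1)."""
    return grad_w * alpha * np.abs(v) ** (alpha - 1)
```

With α = 2, a parameter of magnitude 2 receives a gradient four times as large as a parameter of magnitude 0.5 for the same upstream gradient, while α = 1 leaves the gradient unchanged.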

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/powerprop.png?raw=true" width=900>
</p>


# Learning without Forgetting
## *TPAMI 2017*
#### Zhizhong Li and Derek Hoiem

The Learning without Forgetting (LwF) approach can be seen as a combination of Distillation Networks and fine-tuning. Fine-tuning modifies the parameters of an existing CNN to train on a new task. A small learning rate is often used, and sometimes part of the network is frozen to prevent overfitting. Distillation Networks help simpler networks to return more reasonable outputs by providing additional information about the input data.

In LwF, let the number of tasks be $t$ and let the tasks be defined in a class-incremental manner. Each previous task's network then becomes the distillation network of the following task (e.g. the network of $task$ $t-1$ is assigned as the distillation network of $task$ $t$). This distillation network guides the main network of the current task during fine-tuning so that it does not forget the previously learned tasks:

0. Add $c$ nodes to the classifier of the old network (distillation network) from $task$ $t-1$ to create the new network for $task$ $t$.
1. Forward the input of $task$ $t$ through both networks.
2. Call the logits of the distillation network *soft targets* and the class labels of the $task$ $t$ input *hard targets*.
3. Compare the hard targets with the predictions of the new network to calculate the $cross$-$entropy$ $loss$.
4. Compare the soft targets with the logits of the new network to calculate the $distillation$ $loss$ (ignoring the $c$ newly added nodes, i.e. considering only the old classes, which should not be forgotten).
5. Start backpropagation with **Loss**, so that the new network is forced to learn the new task while preserving the old tasks through the distilled information, where

    **Loss** = $cross$-$entropy$ $loss$ + λ*$distillation$ $loss$.
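
For a single example, the combined loss can be sketched in NumPy as follows. The function names, the temperature value of 2 and the cross-entropy form of the distillation term are assumptions of this illustration.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def lwf_loss(new_logits, old_logits, label, lam=1.0, temperature=2.0):
    """Cross-entropy on the hard target plus lam times the distillation
    loss on the old classes only.

    new_logits: (n_old + c,) outputs of the new network
    old_logits: (n_old,)     outputs of the old (distillation) network
    """
    n_old = old_logits.shape[0]
    cross_entropy = -log_softmax(new_logits)[label]
    soft_targets = np.exp(log_softmax(old_logits / temperature))
    # Only the first n_old outputs are compared; the c new nodes are ignored.
    log_pred_old = log_softmax(new_logits[:n_old] / temperature)
    distillation = -(soft_targets * log_pred_old).sum()
    return cross_entropy + lam * distillation
```

Setting `lam` to 0 reduces this to plain fine-tuning on the new task.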

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/lwf.png?raw=true" width=500>
</p>

# Overcoming catastrophic forgetting in neural networks
## *PNAS 2017*
#### James Kirkpatrick et al.

This paper develops an algorithm analogous to synaptic consolidation for artificial neural networks, referred to as elastic weight consolidation (EWC). The algorithm slows down learning on (i.e. changes to) certain weights based on how important they are to previously seen tasks. This importance is given by the Fisher information matrix $F$, which has three key properties: (i) it is equivalent to the second derivative of the loss near a minimum, (ii) it can be computed from first-order derivatives alone and is thus easy to calculate even for large models, and (iii) it is guaranteed to be positive semidefinite.

Overall, the loss function that we try to minimize in EWC is:

$L$(θ) = $L$<sub>B</sub>(θ) + ∑<sub>i</sub> (λ / 2) $F$<sub>i</sub> (θ<sub>i</sub> − θ<sub>A,i</sub><sup>*</sup>)<sup>2</sup>

where

$L$<sub>B</sub>(θ): is the loss for latter task B only,

λ: sets how important the old task A is compared with the new one and,

i: labels each parameter.

EWC will attempt to keep the network parameters close to the learned parameters of both tasks A and B when switching to a third task. This can be enforced either with two separate penalties or as one, by noting that the sum of two quadratic penalties is itself a quadratic penalty.

In the figure, the parameters θ<sub>A</sub><sup>*</sup> are obtained after learning the first task. If we take gradient steps according to task B alone (blue arrow), we minimize the loss of task B but destroy what we have learned for task A. If we instead constrain every weight with the same coefficient (green arrow), the restriction is too severe: we remember task A only at the risk of failing to learn task B. EWC, on the other hand, explicitly computes how important each weight is for task A in order to find a solution for task B without incurring a major loss on task A (red arrow).
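
The penalty term and a diagonal Fisher estimate can be written in a few lines of NumPy. The function names are hypothetical, and estimating the diagonal of $F$ as the mean of squared per-sample gradients is the usual approximation assumed here.

```python
import numpy as np

def fisher_diagonal(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients.

    per_sample_grads: (n_samples, n_params), computed at the task A optimum
    """
    return np.mean(np.asarray(per_sample_grads, dtype=float) ** 2, axis=0)

def ewc_penalty(theta, theta_a_star, fisher, lam):
    """Quadratic penalty sum_i (lam / 2) * F_i * (theta_i - theta*_A,i)^2."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_a_star) ** 2)

# Total loss on task B: loss_b(theta) + ewc_penalty(theta, theta_a_star, fisher, lam)
```

Parameters with a large Fisher value are pulled strongly towards their task A values, while parameters with a value near zero remain free to change.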


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/ewc.png?raw=true" width=450>
</p>

# Continual Learning Through Synaptic Intelligence
## *ICML 2017*
#### Friedemann Zenke, Ben Poole and Surya Ganguli

Synaptic Intelligence (SI) accumulates task-relevant information over time and exploits this information to store new memories without forgetting old ones. To do so, each individual synapse is assigned a local measure of “importance” in solving the tasks that the network has been trained on in the past. For brevity, the term “synapse” is used synonymously with the term “parameter”, which includes weights and biases between layers.

When training on a new task, SI penalizes changes to the important parameters to prevent old memories from being overwritten. To that end, SI estimates each parameter's contribution to the change in the loss function as the parameter changes during training. If a small change in a parameter affects the loss heavily, that parameter plays an important role for the task and should not be updated, in order to preserve old knowledge. Conversely, if a small change in a parameter does not affect the loss at all, that parameter is not crucial for the task and can easily be updated to learn the next task.

**Importance** = *Parameter's contribution to the change in the loss function* = **Gradient of the parameter × its update, accumulated over training**
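
The accumulation of importance over a training run can be sketched as below. The function name `si_importance`, the array layout (one row per training step) and the damping constant `xi` are assumptions of this example.

```python
import numpy as np

def si_importance(grads, updates, xi=1e-3):
    """Per-parameter importance from a recorded training trajectory.

    grads:   (n_steps, n_params) loss gradients at each step
    updates: (n_steps, n_params) parameter changes applied at each step
    """
    grads = np.asarray(grads, dtype=float)
    updates = np.asarray(updates, dtype=float)
    # Contribution of each parameter to the drop in loss along the path.
    contribution = -(grads * updates).sum(axis=0)
    # Normalise by the squared total displacement (xi avoids division by 0).
    total_change = updates.sum(axis=0)
    return contribution / (total_change ** 2 + xi)
```

A parameter that moved a lot without affecting the loss ends up with an importance close to zero and stays free for the next task.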

# Gradmax: Growing neural networks using gradient information

GradMax aims to grow the network architecture without costly retraining. It adds new neurons during training without affecting what has already been learned, while improving the training dynamics by maximizing the gradients of the new weights. The optimal initialization is found efficiently by means of the singular value decomposition (SVD).

It starts with a small seed architecture. Then, over the course of training, new neurons are added to the seed architecture, either by increasing the width of existing layers or by creating new layers.

In the illustration of GradMax, growing a neuron requires initializing its incoming weights W<sub>l</sub><sup>new</sup> and outgoing weights W<sub>l+1</sub><sup>new</sup>. GradMax sets the incoming weights W<sub>l</sub><sup>new</sup> to zero (dashed lines) in order to keep the output unchanged, so that the currently learned weights are not disturbed. It initializes the outgoing weights W<sub>l+1</sub><sup>new</sup> using SVD, which maximizes the gradients on the incoming weights with the aim of accelerating training.
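
The initialization can be sketched for a fully connected layer in NumPy. Everything beyond the zero incoming weights and the use of SVD is an assumption of this example: the function name `gradmax_init`, the matrix built from the batch, and the scaling of the outgoing weights to a Frobenius norm of `c`.

```python
import numpy as np

def gradmax_init(grad_next, h_prev, k, c=1.0):
    """Initial weights for k new neurons between two existing layers.

    grad_next: (batch, n_next) loss gradient w.r.t. the next layer's
               pre-activations
    h_prev:    (batch, n_prev) activations feeding the new neurons
    """
    # The gradient on the (zero) incoming weights is proportional to
    # w_out.T @ m, so large singular directions of m give large gradients.
    m = grad_next.T @ h_prev                     # (n_next, n_prev)
    u, _, _ = np.linalg.svd(m, full_matrices=False)
    w_out = u[:, :k] * (c / np.sqrt(k))          # top-k left singular vectors
    w_in = np.zeros((k, h_prev.shape[1]))        # zero: output is unchanged
    return w_in, w_out
```

For `k = 1` the outgoing vector is the top left singular vector, which maximizes the gradient norm on the incoming weights among all vectors of norm `c`.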


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/gradmax.png?raw=true" width=700>
</p>

# Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
## *ICML 2022*
#### Utku Evci, Vincent Dumoulin, Hugo Larochelle and Michael C. Mozer

Head-to-Toe probing (Head2Toe) selects features from all layers of the source model to train a classification head for the target domain. It aims to replace the traditional transfer learning approach by assuming that relevant feature maps can occur anywhere in the network rather than only in the last layer.

It connects the outputs (feature maps) of all layers to the classifier head and then applies a Lasso regularizer to select only the relevant features for the classifier. Since Lasso forces the connections to irrelevant features to zero, only important feature maps contribute to the output.

Head2Toe matches performance obtained with fine-tuning on average while reducing training and storage cost, but critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/head2toe.png?raw=true" width=350>
</p>

# Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win
## *AAAI 2022*
#### Utku Evci, Yani Ioannou, Cem Keskin and Yann Dauphin

This paper investigates why training unstructured sparse networks from random initialization performs poorly and what makes Lottery Tickets (LTs) and Dynamic Sparse Training (DST) exceptions.

It finds that sparse NNs have poor gradient flow at initialization, which demonstrates the importance of sparsity-aware initialization. Furthermore, DST methods significantly improve gradient flow during training compared with traditional sparse training methods. Finally, the success of LTs lies in re-learning the pruning solution from which they are derived, not in improved gradient flow.



# Rigging the Lottery: Making All Tickets Winners
## *ICML 2020*
#### Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro and Erich Elsen

According to the Lottery Ticket Hypothesis, if we can identify a sparse neural network with iterative pruning, then we can train that sparse network from scratch to the same degree of accuracy by starting from the original initial conditions.

Motivated by this, Rigging the Lottery, or *RigL*, updates the topology of the sparse network during training based on parameter magnitudes and infrequent gradient calculations.

RigL starts with a random sparse network, and at regularly spaced intervals it removes a fraction of the connections based on their magnitudes and activates new ones using instantaneous gradient information. Growing the connections with the highest-magnitude gradients is the novelty of this method. Newly activated connections are initialized to zero and therefore do not affect the output of the network. However, they are expected to receive high-magnitude gradients in the next iteration and therefore to reduce the loss fastest.

RigL was able to find more accurate models than the current best dense-to-sparse training algorithms.
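
One drop-and-grow update can be sketched for a single layer as follows. The function name `rigl_update`, the fixed `drop_frac` and the choice to grow only among connections that were inactive before the drop are assumptions of this example.

```python
import numpy as np

def rigl_update(weights, grads, mask, drop_frac=0.3):
    """One RigL-style topology update for a single sparse layer.

    grads holds the dense gradient of the loss w.r.t. all weights,
    including the currently inactive ones.
    """
    weights, mask = weights.copy(), mask.copy()
    active = np.flatnonzero(mask)
    inactive = np.flatnonzero(mask == 0)
    k = min(int(drop_frac * active.size), inactive.size)

    # Drop: active connections with the smallest weight magnitude.
    drop = active[np.argsort(np.abs(weights.flat[active]))[:k]]
    # Grow: inactive connections with the largest gradient magnitude.
    grow = inactive[np.argsort(-np.abs(grads.flat[inactive]))[:k]]

    mask.flat[drop] = 0
    mask.flat[grow] = 1
    weights.flat[drop] = 0.0
    weights.flat[grow] = 0.0      # new connections start at zero
    return weights, mask
```

Because the grown weights are zero, the network output is identical immediately before and after the update.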


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/rigl.png?raw=true" width=450>
</p>
