# Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science (SET)  
## *Nature Communications 2018*

Randomly initialize the sparsely connected layers (SCLs) in our network and start training.

At the end of each epoch, we remove the connections with the smallest weights (the “weakest” connections) based on a threshold $t$ and replace them with randomly initialized new ones. 

Repeat.

SET turns out to be surprisingly robust and stable. Encouragingly, the authors show that SET models with far fewer parameters achieve results very similar to those of models built from fully connected layers (FCLs), sometimes even surpassing them.
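The prune-and-regrow step described above can be sketched for a single layer as follows. This is a minimal NumPy illustration, not the authors' code: the function name `set_prune_and_regrow`, the `prune_fraction` argument (used here in place of an explicit threshold $t$), and the small Gaussian re-initialization are all assumptions made for the example.

```python
import numpy as np

def set_prune_and_regrow(weights, mask, prune_fraction, rng):
    """One SET-style topology update for a single sparse layer (illustrative).

    weights, mask: arrays of equal shape; mask is 1 where a connection exists.
    prune_fraction: share of existing connections to replace (assumed knob).
    """
    shape = weights.shape
    w = weights.ravel().copy()
    m = mask.ravel().copy()

    active = np.flatnonzero(m)
    n_replace = int(prune_fraction * active.size)

    # Drop the existing connections with the smallest absolute weight.
    order = np.argsort(np.abs(w[active]))
    dropped = active[order[:n_replace]]
    m[dropped] = 0
    w[dropped] = 0.0

    # Regrow the same number of connections at random positions that were
    # empty before this update (a simplification chosen for this sketch).
    empty = np.flatnonzero(mask.ravel() == 0)
    grown = rng.choice(empty, size=min(n_replace, empty.size), replace=False)
    m[grown] = 1
    w[grown] = rng.normal(0.0, 0.01, size=grown.size)

    return w.reshape(shape), m.reshape(shape)
```

Calling this once per epoch keeps the number of connections constant while the topology changes, as long as enough empty positions are available.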




# Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization (Dynamic Sparse Reparameterization - DSR)
## *ICML 2019*

Randomly initialize SCLs in our network and start training.

Calculate the mean magnitude of the gradient $G$ for each layer.

Remove the connections with the smallest weights based on a global adaptive threshold $t$.

Immediately after $K$ parameters are removed during the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors based on the calculated mean gradient magnitude $G$: layers with larger fractions of non-zero weights receive proportionally more free parameters. The rationale is that free parameters should go to the layers whose parameters receive larger loss gradients.

Repeat.
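The pruning and redistribution steps above can be sketched as follows. This is a rough NumPy illustration under stated assumptions: the threshold is held fixed for one call (the adaptive update of $t$ is not shown), the allocation uses only the rule that layers with larger fractions of non-zero weights receive proportionally more free parameters, and the function name `dsr_prune_and_allocate` is hypothetical.

```python
import numpy as np

def dsr_prune_and_allocate(layers, threshold):
    """Prune with one global threshold, then split the K freed parameters.

    layers: list of weight arrays in which exact zeros mark missing connections.
    Returns the pruned layers and the number of weights each layer may regrow.
    """
    pruned = [np.where(np.abs(w) < threshold, 0.0, w) for w in layers]

    # K = number of connections removed across all layers.
    k = int(sum(np.count_nonzero(w) - np.count_nonzero(p)
                for w, p in zip(layers, pruned)))

    # Fraction of surviving non-zero weights in each layer.
    fractions = np.array([np.count_nonzero(p) / p.size for p in pruned])
    if k == 0 or fractions.sum() == 0:
        return pruned, np.zeros(len(layers), dtype=int)

    # Proportional allocation, rounded with the largest-remainder method
    # so that the allocations add up to exactly K.
    exact = k * fractions / fractions.sum()
    allocation = np.floor(exact).astype(int)
    leftover = k - int(allocation.sum())
    for i in np.argsort(allocation - exact)[:leftover]:
        allocation[i] += 1
    return pruned, allocation
```

The sketch ignores the case where a layer is allocated more parameters than it has empty positions; a full implementation would need to handle that overflow.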


# Sparse Networks from Scratch: Faster Training without Losing Performance (Sparse Momentum - SM)
## *ICLR 2020*

Randomly initialize SCLs in our network and start training.

Calculate the mean magnitude of the momentum $M$ for each layer.

Remove the 50% of weights with the smallest magnitude in each layer.

Immediately after $K$ parameters are removed during the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors based on the calculated mean momentum magnitude $M$: layers that account for a larger share of the mean momentum receive proportionally more free parameters.

Repeat.
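One iteration of the steps above might look like the following NumPy sketch. The function name `sparse_momentum_shares` is hypothetical, the momentum arrays are assumed to come from the optimizer, and taking the mean only over existing connections is a choice made for this example.

```python
import numpy as np

def sparse_momentum_shares(weights, momenta, prune_rate=0.5):
    """Prune each layer by magnitude and compute its share of freed parameters.

    weights, momenta: lists of arrays, one pair per layer; zeros in a weight
    array mark missing connections.
    """
    pruned, mean_momentum, freed = [], [], 0
    for w, m in zip(weights, momenta):
        active = w != 0
        # Mean momentum magnitude over the layer's existing connections.
        mean_momentum.append(np.abs(m[active]).mean() if active.any() else 0.0)

        n_prune = int(prune_rate * active.sum())
        w = w.copy()
        if n_prune > 0:
            # Missing connections get infinite magnitude so they are never picked.
            magnitude = np.where(active, np.abs(w), np.inf).ravel()
            w.flat[np.argsort(magnitude)[:n_prune]] = 0.0
        pruned.append(w)
        freed += n_prune

    mean_momentum = np.array(mean_momentum)
    # Each layer's share of the freed parameters (assumes a non-zero total).
    shares = mean_momentum / mean_momentum.sum()
    return pruned, shares, freed
```

Multiplying `shares` by `freed` gives the number of zero-initialized parameters each layer would receive in the regrowth phase.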

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/SM.jpg?raw=true" width=700>
</p>



# Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers (DST)
## *ICLR 2020*

Randomly initialize a fully connected neural network.

Randomly initialize trainable mask layers (layer-level threshold $t$) and start training the masked neural network.

Instead of pruning (masking) between two training epochs with a predefined pruning schedule, this method prunes and recovers network parameters at each training step, which is far more fine-grained than existing methods.

In each training step, the respective threshold is subtracted from the absolute value of each parameter. The parameter is masked (pruned) if the result is smaller than 0 and kept otherwise:

$Q_{ij} = F(W_{ij}, t_i) = |W_{ij}| - t_i$

$M_{ij} = S(Q_{ij})$, where $M_{ij} = 1$ if not pruned and $M_{ij} = 0$ if pruned

Repeat.
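The forward computation of the mask can be written in a few lines of NumPy. This is only a sketch of the two formulas above, assuming one threshold per row of the weight matrix and a unit step that keeps a weight when $|W| - t \geq 0$; the name `dst_mask` and the example values are made up for illustration.

```python
import numpy as np

def dst_mask(weights, thresholds):
    """Binary mask M = S(|W| - t) with one threshold per row of W."""
    q = np.abs(weights) - thresholds[:, None]   # Q_ij = |W_ij| - t_i
    return (q >= 0).astype(weights.dtype)       # 1 = kept, 0 = pruned

weights = np.array([[0.8, -0.1, 0.3],
                    [-0.05, 0.6, -0.4]])
thresholds = np.array([0.2, 0.5])

mask = dst_mask(weights, thresholds)
masked_weights = weights * mask  # weights actually used in the forward pass
```

Because the mask is recomputed at every step, a weight that was pruned can be recovered as soon as its magnitude grows past the threshold or the threshold shrinks.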

However, the authors note that $t_i$ cannot be learned or updated through the step function $S(x)$, since its gradient is zero almost everywhere. They therefore introduce an approximation function $H(x)$ with a usable gradient, which makes it possible to learn $t_i$ and consequently all the mask layers.

Finally, after training, the model is sparse according to the trained mask layers (layer-level threshold $t_i$).

<p align="center">
<img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/DST.jpg?raw=true" width=550>
</p>


# PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
## *CVPR 2018*
#### Arun Mallya and Svetlana Lazebnik

Inspired by network pruning techniques, PackNet exploits redundancies in large deep networks to free up parameters that can then be employed to learn new tasks. By performing iterative pruning and network re-training, PackNet is able to sequentially “pack” multiple tasks into a single network while ensuring minimal drop in performance and minimal storage overhead. To do that, after finding the $n$<sup>th</sup> pack, it freezes that pack and removes it from the pool of available parameters in the large backbone. Hence, the $(n+1)$<sup>th</sup> pack constructed for the next task does not interfere with the $n$<sup>th</sup> pack. The same steps are repeated for the remaining tasks to avoid catastrophic forgetting. In the figure, white circles represent available neurons in the backbone, while bold circles indicate neurons that are already occupied by another pack and are therefore excluded from the next pack selection.

* **Masking Method**: Train the whole backbone, then remove 50% or 75% of the connections based on the absolute magnitude of their weights. Re-train.
* **Mask Selection**: -
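The bookkeeping behind this packing scheme can be sketched with an ownership array that records which task, if any, has claimed each weight. This is a simplified NumPy illustration, not the authors' implementation: `packnet_prune`, the `owner` array, and the default 50% pruning ratio are assumptions for the example, and the re-training step between pruning rounds is omitted.

```python
import numpy as np

def packnet_prune(weights, owner, task_id, prune_fraction=0.5):
    """Assign the largest-magnitude free weights to `task_id`.

    owner: integer array, 0 for free weights, k > 0 for weights frozen by task k.
    Free weights that are not selected stay at 0 and remain available.
    """
    owner = owner.copy()
    free = np.flatnonzero(owner == 0)
    n_keep = free.size - int(prune_fraction * free.size)

    # Rank the free weights from largest to smallest absolute magnitude.
    order = np.argsort(-np.abs(weights.flat[free]))
    owner.flat[free[order[:n_keep]]] = task_id
    return owner

def task_mask(owner, task_id):
    """Weights visible to `task_id`: its own pack plus all earlier packs."""
    return (owner > 0) & (owner <= task_id)
```

At inference time for task $n$, `task_mask(owner, n)` selects the weights of packs 1 to $n$; later packs are hidden, so they cannot change the output for earlier tasks.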

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/packnet.png?raw=true" width=800>
</p>

# Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
## *ECCV 2018*
#### Arun Mallya, Dillon Davis, and Svetlana Lazebnik

Piggyback is inspired by PackNet, which iteratively prunes unimportant weights and fine-tunes them to learn new tasks.
Piggyback questions whether the weights of a network have to be changed at all. It suggests that reasonable results may be obtained just by selectively masking weights, that is, setting certain weights to 0 while keeping the rest unchanged.

Based on this idea, Piggyback learns how to mask the weights of an existing “backbone” network to obtain good performance on a new task, as shown in the figure. Binary masks that take values in {0, 1} are learned and stored after each task. This simple idea is mostly suited to the task-incremental learning (task-IL) scenario.

* **Masking Method**:
* **Mask Selection**: -
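A forward pass through a masked, frozen layer can be sketched as below. How the binary mask is obtained is an assumption of this example: real-valued scores are thresholded to {0, 1}, which is one common way to make a binary mask learnable and is not spelled out in these notes. The names `piggyback_forward` and `mask_scores` are hypothetical.

```python
import numpy as np

def piggyback_forward(x, backbone_weights, mask_scores, threshold=0.0):
    """Linear layer whose frozen backbone weights are gated by a binary mask.

    mask_scores: real-valued, task-specific scores (assumed to be what is
    trained); only the resulting binary mask needs to be stored per task.
    """
    binary_mask = (mask_scores > threshold).astype(backbone_weights.dtype)
    return x @ (backbone_weights * binary_mask).T

x = np.array([[1.0, 2.0]])
backbone_weights = np.array([[1.0, 1.0],
                             [2.0, -1.0]])   # shared and never updated
mask_scores = np.array([[0.5, -0.5],
                        [-0.1, 0.3]])        # specific to one task
output = piggyback_forward(x, backbone_weights, mask_scores)
```

Since the backbone is shared and each mask costs one bit per weight, adding a task adds very little storage.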

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/piggyback.png?raw=true" width=700>
</p>


# SupSup: Supermasks in Superposition
## *NeurIPS 2020*
#### Mitchell Wortsman et al.

The Supermasks in Superposition (SupSup) model uses a randomly initialized, fixed base network and finds a subnetwork (supermask) for each task. If the task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If it is not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks that minimizes the output entropy. The authors find experimentally that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. Hence, SupSup is suitable for class-incremental learning (class-IL) as well.


During training, SupSup learns a separate supermask (subnetwork) for each task. At inference time, SupSup can infer the task identity by superimposing all supermasks. Ideally, the appropriate supermask for a given task should exhibit a confident output distribution (i.e., low entropy).

* **Masking Method**: Supermasks, as introduced in *Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask*.
* **Mask Selection**: Try all the supermasks and return the mask with the lowest entropy.
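The exhaustive variant of mask selection (try every supermask, keep the most confident one) can be sketched as follows. It is a toy NumPy example with a single linear layer; the helper names are hypothetical, and it does not implement the gradient-based superposition search described above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

def select_mask(x, weights, masks):
    """Return the index of the mask whose output distribution has lowest entropy."""
    entropies = [entropy(softmax((weights * m) @ x)) for m in masks]
    return int(np.argmin(entropies)), entropies

weights = np.array([[3.0, 3.0],
                    [1.0, 1.0]])              # fixed base network (one layer)
masks = [np.zeros((2, 2)),                    # uniform output, high entropy
         np.array([[0.0, 0.0], [1.0, 1.0]]),
         np.ones((2, 2))]                     # most confident output
x = np.array([1.0, 1.0])
best, entropies = select_mask(x, weights, masks)
```

The cost of this loop grows linearly with the number of tasks, which is the motivation for inferring the task from a superposition of masks instead.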

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/supsup.png?raw=true" width=800>
</p>


# SpaceNet: Make Free Space for Continual Learning
## *Neurocomputing 2021*
#### Ghada Sokar, Decebal Constantin Mocanu, and Mykola Pechenizkiy

SpaceNet trains sparse deep neural networks from scratch so that the sparse connections of each task are compressed into a compact number of neurons. The adaptive training of the sparse connections results in sparse representations per task that reduce the interference between tasks. Experimental results show that the method is robust against catastrophic forgetting and leaves space for more tasks to be learned.

When the model faces a new task, new sparse connections are randomly allocated between a selected number of neurons in each layer. By the end of training, the initial distribution of the connections has changed, and the connections that are important for that task are grouped together.
The most important neurons for a specific task are reserved for that task: they are frozen and are not seen by the following tasks. Other neurons that are only somewhat important, or not important at all, continue to be shared between tasks. For example, in the figure, fully filled circles represent the neurons that are most important and become specific to $task$ $t$, whereas partially filled ones are less important and can be shared with other tasks. Multi-colored circles represent neurons that are used by multiple tasks. After learning $task$ $t$, the corresponding weights are kept fixed.

For convolutional neural networks, SpaceNet performs the drop and grow phases in a coarse manner to impose structured sparsity instead of irregular sparsity. In particular, in the drop phase, whole kernels are removed instead of scalar weights. Similarly, in the grow phase, all the connections of a kernel are added instead of single weights. The likelihood of adding a kernel between two feature maps is inversely proportional to their significance, similar to multilayer perceptron networks. A feature map's importance is determined by summing the importance of its connected kernels.

* **Masking Method**: A fraction $r$ of the sparse connections in each layer is dynamically changed based on the importance of the connections and neurons in that layer. Connection importance is estimated by its contribution to the change in the loss function. The first-order Taylor approximation is used to approximate the change in loss during one training iteration $i$. Connections are grown and dropped based on this importance score.

* **Mask Selection**: -
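The importance estimate used for dropping connections can be illustrated with a short NumPy sketch. The exact form of the score is an assumption here: the first-order Taylor term (gradient times weight change) is accumulated in absolute value at every iteration, and the names `accumulate_importance` and `connections_to_drop` are hypothetical.

```python
import numpy as np

def accumulate_importance(importance, grad, w_before, w_after):
    """Add the first-order Taylor estimate of the loss change for one update."""
    return importance + np.abs(grad * (w_after - w_before))

def connections_to_drop(importance, mask, r):
    """Indices of the fraction r of existing connections with lowest importance."""
    active = np.flatnonzero(mask)
    n_drop = int(r * active.size)
    order = np.argsort(importance.flat[active])
    return active[order[:n_drop]]

# One plain SGD step on a layer with five potential connections.
mask = np.array([1, 1, 0, 1, 1])
w_before = np.array([0.2, -0.3, 0.0, 0.1, 0.5])
grad = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
w_after = w_before - 0.1 * grad * mask

importance = accumulate_importance(np.zeros(5), grad, w_before, w_after)
drop = connections_to_drop(importance, mask, r=0.5)
```

In this toy step the two active connections with the smallest gradients are selected for dropping, since they contributed least to the estimated change in loss.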

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/spacenet.png?raw=true" width=700>
</p>


# Continual Prune-and-Select (CP&S): Class-Incremental Learning with specialized subnetworks
## *2022*
#### Aleksandr Dekhovich, David M.J. Tax, Marcel H.F. Sluiter, and Miguel A. Bessa

During training, Continual-Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given $task$ $t$. 

A new task is learned by training the available (previously untrained) neuronal connections of the DNN to create a new subnetwork. This subnetwork can include previously trained connections belonging to other subnetworks, but those are not updated. In other words, previously trained connections can be shared between tasks yet cannot be updated.

Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. This eliminates catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them.


* **Masking Method**: NNrelief
* **Mask Selection**: Try all the subnetworks and return the mask with the lowest entropy (*maximum output response*).

# Forget-free Continual Learning with Winning Subnetworks
## *ICML 2022*
#### Haeyong Kang et al.

For each task, Winning SubNetworks (WSN) sequentially learns and chooses the best subnetwork. Specifically, WSN jointly learns the model weights and the task-adaptive binary masks pertaining to the subnetworks associated with each task. It also attempts to select a small set of weights to be activated *(winning ticket)* by reusing weights of the prior subnetworks.

Similar to CP&S, it updates only the weights that have not been trained on previous tasks. After training on each task, the model freezes the subnetwork parameters. Therefore, like CP&S, WSN is immune to catastrophic forgetting by design. Its pruning (masking) approach is a bit different, though.

For example, in the figure, $task$ $t-1$ is learned by an orange subnetwork. In $task$ $t$, the orange weights can still be used in the forward pass, but they are not updated in the backward pass; only unassigned weights are updated.

* **Masking Method**: WSN tries to find the best subnetwork by selecting the $c$% of weights with the highest weight scores $s$, where $c$ is the target layer-wise capacity ratio in %.

* **Mask Selection**: ?
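The two mechanisms described above, top-$c$% selection by score and updates restricted to unassigned weights, can be sketched in NumPy as follows. The function names and the plain SGD update are assumptions made for illustration; in the method itself the scores $s$ are learned jointly with the weights, which is not shown here.

```python
import numpy as np

def top_c_mask(scores, c):
    """Binary mask keeping the c% of entries in `scores` with the highest value."""
    n_keep = int(round(scores.size * c / 100.0))
    mask = np.zeros(scores.size, dtype=int)
    if n_keep > 0:
        mask[np.argsort(-scores.ravel())[:n_keep]] = 1
    return mask.reshape(scores.shape)

def masked_update(weights, grad, task_mask, used_by_previous, lr=0.1):
    """SGD step that skips weights already claimed by earlier tasks.

    Weights in `task_mask` that earlier tasks own are still used in the
    forward pass, but they receive no update here.
    """
    trainable = task_mask * (1 - used_by_previous)
    return weights - lr * grad * trainable

scores = np.array([[0.9, 0.1],
                   [0.4, 0.7]])
task_mask = top_c_mask(scores, c=50)
used_by_previous = np.array([[1, 0],
                             [0, 0]])      # top-left weight is frozen
new_weights = masked_update(np.ones((2, 2)), np.ones((2, 2)),
                            task_mask, used_by_previous)
```

After training on the task, `used_by_previous` would be extended with `task_mask` so that the newly trained weights are frozen for all later tasks.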


<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/winningsubnetworks.png?raw=true" width=700>
</p>


