# Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science (SET)  
###*Nature Communications 2018*

Randomly initialize the sparsely connected layers (SCLs) of the network and start training.

At the end of each epoch, remove the connections with the smallest weight magnitudes (the “weakest” connections), as determined by a threshold $t$, and replace them with the same number of randomly initialized new connections.

Repeat.
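The prune-and-regrow step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the dictionary representation, the function name `set_prune_and_regrow`, the Gaussian re-initialization, and expressing the threshold as a fraction `zeta` of connections to remove are all assumptions made for the example.

```python
import random

def set_prune_and_regrow(weights, n_in, n_out, zeta=0.3, rng=random):
    """One SET-style topology update on a single sparse layer (sketch).

    weights: dict mapping (input_idx, output_idx) -> weight value.
    zeta:    fraction of existing connections to remove (assumed knob,
             standing in for the threshold t in the text).
    """
    n_remove = int(zeta * len(weights))

    # Rank existing connections by magnitude and drop the weakest ones.
    weakest = sorted(weights, key=lambda c: abs(weights[c]))[:n_remove]
    for conn in weakest:
        del weights[conn]

    # Regrow the same number of connections at random empty positions.
    empty = [(i, j) for i in range(n_in) for j in range(n_out)
             if (i, j) not in weights]
    for conn in rng.sample(empty, n_remove):
        weights[conn] = rng.gauss(0.0, 0.1)  # small random init (assumption)
    return weights
```

Calling this once per epoch keeps the number of connections constant while the topology changes.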

SET turns out to be surprisingly robust and stable. Encouragingly, the authors show that SET models with far fewer parameters achieve results comparable to those of fully connected layer (FCL) models, and sometimes surpass them.




# Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization (Dynamic Sparse Reparameterization - DSR)
###*ICML 2019*

Randomly initialize SCLs in our network and start training.

Calculate the mean magnitude of the gradient $G$ for each layer.

Remove the connections with the smallest weight magnitudes, using a global adaptive threshold $t$.

Immediately after $K$ parameters are removed in the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors based on the calculated mean gradient magnitude $G$: layers with larger fractions of non-zero weights receive proportionally more free parameters. The intuition is that free parameters should go to the layers whose parameters receive larger loss gradients.

Repeat.
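A minimal sketch of the prune-then-redistribute bookkeeping described above, in plain Python. It is an illustration under assumptions, not the paper's code: the function name `dsr_reallocate`, the list-per-layer representation, and the choice to score each layer by its count of surviving non-zero weights (following the "larger fractions of non-zero weights" wording) are all hypothetical. The function only returns how many zero-initialized parameters each layer should receive; placing them is left out.

```python
def dsr_reallocate(layers, threshold):
    """Prune with one global threshold, then split the freed budget (sketch).

    layers: dict mapping layer name -> list of weights (0.0 = inactive slot).
    Returns a dict mapping layer name -> number of parameters to regrow.
    """
    pruned_total = 0
    survivors = {}
    for name, ws in layers.items():
        for idx, w in enumerate(ws):
            # The same threshold is applied to every layer.
            if w != 0.0 and abs(w) < threshold:
                ws[idx] = 0.0
                pruned_total += 1
        survivors[name] = sum(1 for w in ws if w != 0.0)

    total = sum(survivors.values())
    if total == 0:
        return {name: 0 for name in layers}

    # Each layer's share is proportional to its surviving non-zero weights.
    grow = {name: int(pruned_total * s / total) for name, s in survivors.items()}

    # Rounding leftovers go to the layer with the most survivors (assumption).
    leftover = pruned_total - sum(grow.values())
    grow[max(survivors, key=survivors.get)] += leftover
    return grow
```

The returned counts always sum to the number of pruned parameters, so the overall parameter budget stays fixed.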


# Sparse Networks from Scratch: Faster Training without Losing Performance (Sparse Momentum - SM)
###*ICLR 2020*

Randomly initialize SCLs in our network and start training.

Calculate the mean magnitude of the momentum $M$ for each layer.

Remove the 50% of weights with the smallest magnitudes in each layer.

Immediately after $K$ parameters are removed in the pruning phase, $K$ zero-initialized parameters are redistributed among the sparse parameter tensors based on the calculated mean momentum magnitude $M$: layers that account for a larger share of the total mean momentum receive proportionally more free parameters.

Repeat.
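The two per-layer computations above (the momentum share and the 50% prune) can be sketched as follows. This is a hedged illustration in plain Python; the function names `sm_layer_shares` and `sm_prune_half` and the list-based layer representation are assumptions made for the example, not part of the paper's code.

```python
def sm_layer_shares(momentum):
    """Share of the regrowth budget per layer from mean |momentum| (sketch).

    momentum: dict mapping layer name -> list of momentum values.
    Returns a dict mapping layer name -> fraction of the budget (sums to 1).
    """
    mean_mag = {name: sum(abs(m) for m in ms) / len(ms)
                for name, ms in momentum.items()}
    total = sum(mean_mag.values())
    return {name: v / total for name, v in mean_mag.items()}


def sm_prune_half(weights):
    """Zero the smallest-magnitude half of a layer's active weights in place.

    Returns the number of weights pruned, which feeds the budget K.
    """
    active = [i for i, w in enumerate(weights) if w != 0.0]
    n_prune = len(active) // 2
    for i in sorted(active, key=lambda i: abs(weights[i]))[:n_prune]:
        weights[i] = 0.0
    return n_prune
```

With $K$ equal to the sum of `sm_prune_half` over all layers, a layer with share `s` would receive roughly `s * K` regrown parameters.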

<p align="center">
  <img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/SM.jpg?raw=true" width=750>
</p>



# Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers (DST)
###*ICLR 2020*

Randomly initialize a fully connected neural network.

Randomly initialize the trainable mask layers (each with a layer-level threshold $t$) and start training the masked neural network.

Instead of pruning (masking) between two training epochs with a predefined pruning schedule, this method prunes and recovers the network parameters at each training step, which is far more fine-grained than existing methods.

In each training step, the corresponding threshold is subtracted from the magnitude of each parameter. The parameter is masked (pruned) if the result is smaller than 0 and kept otherwise:

$Q_{ij} = F(W_{ij}, t_i) = |W_{ij}| - t_i$

$M_{ij} = S(Q_{ij})$, where $M_{ij} = 1$ if the parameter is not pruned and $M_{ij} = 0$ if it is pruned

Repeat.
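The mask computation in the two formulas above can be written out directly. The sketch below uses plain Python lists; the function names `dst_mask` and `apply_mask` are hypothetical, and it assumes one threshold $t_i$ per row $i$ of the weight matrix, following the index used in the formula.

```python
def dst_mask(weights, thresholds):
    """Binary mask M with M[i][j] = S(|W[i][j]| - t[i]) (sketch).

    weights:    list of rows, W[i][j].
    thresholds: one threshold t[i] per row (assumption based on the index).
    """
    mask = []
    for row, t in zip(weights, thresholds):
        # Q = |W| - t; the unit step S keeps entries with Q >= 0.
        mask.append([1 if abs(w) - t >= 0 else 0 for w in row])
    return mask


def apply_mask(weights, mask):
    """Element-wise product W * M used in the forward pass."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]
```

Because the mask is recomputed from the current weights and thresholds at every step, a pruned weight can recover as soon as its magnitude rises above the threshold again.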

However, the authors note that $t_i$ cannot be learned or updated through the step function $S(x)$, because its gradient is zero almost everywhere. They therefore introduce an approximation function $H(x)$ that has a non-zero gradient, which makes it possible to learn $t_i$ and, consequently, all the mask layers.
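To make the problem concrete, the sketch below contrasts the step function with a surrogate derivative. Note the assumption: the triangular `surrogate_grad` used here is a simple stand-in chosen for illustration and is not claimed to be the paper's $H(x)$. The names `step`, `surrogate_grad`, and `threshold_grad` are hypothetical.

```python
def step(x):
    """Unit step S(x): 1 for x >= 0, else 0. Its true derivative is 0 for x != 0."""
    return 1.0 if x >= 0 else 0.0


def surrogate_grad(x, width=1.0):
    """Triangular stand-in for dS/dx (assumption, not the paper's H(x)).

    Peaks at x = 0 and vanishes for |x| >= width.
    """
    return max(0.0, 1.0 - abs(x) / width) / width


def threshold_grad(w, t, upstream):
    """Gradient of the loss w.r.t. t through M = S(|w| - t), via the surrogate.

    upstream: gradient of the loss w.r.t. the mask entry M.
    """
    q = abs(w) - t
    # Chain rule: dM/dt = dS/dq * dq/dt, with dq/dt = -1.
    return upstream * surrogate_grad(q) * -1.0
```

With the true derivative of `step`, `threshold_grad` would return 0 for every weight and the thresholds would never move; with a surrogate, weights close to the threshold contribute a non-zero update.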

Finally, after training, the model is sparse according to the trained mask layers (layer-level thresholds $t_i$).

<p align="center">
<img src="https://github.com/muratonuryildirim/Tutorials/blob/master/images/papers/DST.jpg?raw=true" width=600>
</p>
