Activation functions

Purpose:
- adds non-linearity
- controls output scale

With a linear (identity) activation (LU), a network of any depth reduces to a single linear mapping, so it is no more expressive than linear regression
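
The collapse can be checked numerically. The sketch below is a minimal illustration in plain Python: the weights `W1`, `b1`, `W2`, `b2` and the helper functions are made up for this example. It stacks two layers with an identity activation and compares the result with one layer whose weights are `W2 @ W1` and whose bias is `W2 @ b1 + b2`.

```python
# Small list-based helpers so the example needs no third-party libraries.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights: layer 1 maps 2 -> 3 features, layer 2 maps 3 -> 1.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, 0.0, 1.0]]
b2 = [0.25]

def two_layer_linear(x):
    hidden = add(matvec(W1, x), b1)      # identity activation: nothing applied
    return add(matvec(W2, hidden), b2)

# Equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def single_layer(x):
    return add(matvec(W, x), b)

x = [1.5, -2.0]
print(two_layer_linear(x), single_layer(x))
```

Both calls give the same value for any input, which is why a non-linear activation between the layers is needed to gain expressive power.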

Activations are usually chosen to be __differentiable__ almost everywhere (typically everywhere except isolated kink points such as 0). Optimization algorithms handle these points by using __subgradients__ or by assigning the derivative a fixed, arbitrarily chosen value there
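
As an illustration, here is a minimal sketch of a ReLU derivative that handles the kink at 0 explicitly. The function names and the `grad_at_zero` parameter are hypothetical; any value in $[0, 1]$ is a valid subgradient of ReLU at 0, and the default of 0 is just one common choice assumed here.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x, grad_at_zero=0.0):
    """Derivative of ReLU; at the kink x == 0 return a chosen subgradient."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    # x == 0: the derivative is undefined, so we pick a value from [0, 1].
    return grad_at_zero

print(relu_grad(2.0), relu_grad(-3.0), relu_grad(0.0), relu_grad(0.0, grad_at_zero=0.5))
```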

Other useful properties include convexity and smoothness

Typical scales:
- LU $\rightarrow (-\infty, +\infty)$<br><br>
- ReLU $\rightarrow [0, +\infty)$<br>when we require the output signal to be non-negative<br><br>
- Tanh $\rightarrow (-1, 1)$<br>when we require the output signal to be bounded<br><br>
- Sigmoid / Softmax $\rightarrow (0, 1)$<br>when we model the output signal as probabilities<br>
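
The output ranges can be checked on a few sample inputs. This is a minimal sketch in plain Python; the sample points and helper names are arbitrary choices for illustration. Note that ReLU can output exactly 0, while Tanh and Sigmoid approach their bounds only in the limit.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [-5.0, -1.0, 0.0, 1.0, 5.0]

relu_out = [relu(x) for x in xs]          # values in [0, +inf)
tanh_out = [math.tanh(x) for x in xs]     # values in (-1, 1)
sigmoid_out = [sigmoid(x) for x in xs]    # values in (0, 1)

for name, out in [("ReLU", relu_out), ("Tanh", tanh_out), ("Sigmoid", sigmoid_out)]:
    print(f"{name:8s} min={min(out):.4f} max={max(out):.4f}")
```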

Some functions require additional hyperparameters (for example, the clipping bounds of HardTanh or the initial slope of PReLU, whose slope is then learned), but most do not

Most functions are applied element-wise, but some depend on the whole input vector (like Softmax)
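
The difference can be seen in a small sketch (plain Python; the input vectors are arbitrary). Changing one component of the input changes every Softmax output, because of the shared normalizing sum, whereas an element-wise function such as Sigmoid leaves the other components untouched.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(v):
    m = max(v)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

v1 = [1.0, 2.0, 3.0]
v2 = [1.0, 2.0, 10.0]                       # only the last component differs

# Element-wise: the first output depends only on the first input.
sig1 = [sigmoid(x) for x in v1]
sig2 = [sigmoid(x) for x in v2]

# Vector-wise: the first output changes although the first input did not.
soft1 = softmax(v1)
soft2 = softmax(v2)

print(sig1[0] == sig2[0])    # first sigmoid output is unaffected
print(soft1[0], soft2[0])    # first softmax output differs between v1 and v2
```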

In PyTorch, activations are implemented as separate Modules (they take an input and define a forward() function, just like layers)<br>
(<a href="https://github.com/pytorch/pytorch/blob/260d1dcef4d82d0a2181d516707f6cdf2a054413/torch/nn/modules/activation.py#L97">Module class</a> and <a href="https://github.com/pytorch/pytorch/blob/260d1dcef4d82d0a2181d516707f6cdf2a054413/torch/_refs/nn/functional/__init__.py#L257">function</a>)
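
A short usage sketch of both forms, assuming PyTorch is installed. The input tensor and the layer sizes are arbitrary illustration values, not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Module form: instantiate once, then call like a layer (this invokes forward()).
relu_module = nn.ReLU()
y_module = relu_module(x)

# Functional form: a plain function call with the same result.
y_functional = F.relu(x)

# Activations with extra settings take them in the constructor.
hardtanh = nn.Hardtanh(min_val=-1.0, max_val=1.0)   # clipping bounds
prelu = nn.PReLU(num_parameters=1, init=0.25)       # learnable slope, initial value 0.25

# Being Modules, activations can be placed between layers in nn.Sequential.
model = nn.Sequential(
    nn.Linear(5, 3),
    nn.ReLU(),
    nn.Linear(3, 2),
    nn.Softmax(dim=-1),   # Softmax needs the dimension to normalize over
)
probs = model(x.unsqueeze(0))   # add a batch dimension: shape (1, 5) -> (1, 2)

print(y_module, hardtanh(x), prelu(x).detach(), probs.shape)
```

The Module form is convenient inside `nn.Sequential`, while the functional form is handy inside a custom `forward()` when the activation has no state.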

| Plot | Formula | Notes |
| --- | --- | --- |
| Basic functions | | |
| <img src="img/sigmoid.png" width=500> | <img src="img/Sigmoid_formula.png" width=200> | |
| <img src="img/tahn.png" width=500> | <img src="img/Tahn_formula.png" width=300> | |
| <img src="img/softmax.png" width=500> | <img src="img/Softmax_formula.png" width=150> | |
| Rectified Linear functions | | |
| <img src="img/ReLU.png" width=500> | <img src="img/ReLU_formula.png" width=200> | Rectified Linear Unit |
| <img src="img/pReLU.png" width=500> | <img src="img/PReLU_formula.png" width=200> | |
| <img src="img/SeLU.png" width=500> | <img src="img/SeLU_formula.png" width=500> | |
| <img src="img/ELU.png" width=500> | <img src="img/ELU_formula.png" width=300> | Exponential Linear Unit |
| <img src="img/CELU.png" width=500> | <img src="img/CELU_formula.png" width=300> | |
| <img src="img/GELU.png" width=500> | <img src="img/GELU_formula.png" width=150> | |
| <img src="img/RRELU.png" width=500> | <img src="img/RRELU_formula.png" width=200> | Randomized Leaky Rectified Linear Unit |
| <img src="img/ReLU6.png" width=500> | <img src="img/ReLU6_formula.png" width=200> | |
| <img src="img/SiLU.png" width=500> | <img src="img/SiLU_formula.png" width=500> | |
| <img src="img/LogSigmoid.png" width=500> | <img src="img/LogSigmoid_formula.png" width=200> | |
| <img src="img/Mish.png" width=500> | <img src="img/Mish.png" width=300> | self-regularized non-monotonic activation $x \tanh(\text{softplus}(x))$ |
| <img src="img/softtahn.png" width=500> | <img src="img/softtahn_formula.png" width=300> | |
| <img src="img/Hardtanh.png" width=500> | <img src="img/hardtahn_formula.png" width=150> | piecewise linear approximation of Tanh |
| <img src="img/hardsigmoid.png" width=500> | <img src="img/hardSigmoid_formula.png" width=300> | |
| <img src="img/hardswish.png" width=500> | <img src="img/Hardswish_formula.png" width=300> | piecewise approximation of Swish $x \sigma(x)$ as in <a href="https://arxiv.org/abs/1905.02244">MobileNetV3</a> - can be formulated through the popular ReLU |
| <img src="img/hardsigmoid.png" width=500> | <img src="img/hardsigmoid_formula.png" width=250> | piecewise linear approximation of sigmoid (no parameters) |
| <img src="img/hardshrink.png" width=500> | <img src="img/hardshrink_formula.png" width=250> | maps a neighborhood of zero to exactly zero |

