## Lecture 3.2: Activation Functions

#### Recap: Non-linearities

Rectified Linear Unit (ReLU) </br>
&ensp;&ensp;&ensp;&ensp;$\text{ReLU}(x)\ =\ \text{max}(x,\ 0)$
* Pro: Simple
* Con: ReLU units can be fragile during training and "die": the unit outputs zero for every input, so it receives zero gradient and stops updating
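
A minimal sketch of ReLU and its derivative in plain Python, to make the formula concrete. The function names are illustrative, and setting the derivative at exactly $x = 0$ to zero is an assumed convention, not something the formula dictates.

```python
def relu(x: float) -> float:
    """ReLU(x) = max(x, 0)."""
    return max(x, 0.0)


def relu_grad(x: float) -> float:
    """Derivative of ReLU with respect to x.

    The derivative is undefined at x == 0; this sketch picks 0 there.
    """
    return 1.0 if x > 0 else 0.0
```

Every negative input gets both a zero output and a zero gradient, which is where the fragility comes from.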

Dead ReLUs - How can we prevent dead ReLUs?
* Initialize network carefully
* Decrease the learning rate
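
To see why initialization and learning rate matter, here is a small sketch of a single ReLU unit $\text{ReLU}(wx + b)$ on scalar inputs. The helper `relu_unit_grads` and the toy loss (the sum of the unit's outputs) are assumptions made for illustration. Once the bias is pushed far enough negative, for example by one overly large update, the pre-activation is negative for every input and no gradient reaches $w$ or $b$.

```python
def relu_unit_grads(w: float, b: float, inputs: list[float]) -> tuple[float, float]:
    """Return (dL/dw, dL/db) for the toy loss L = sum(relu(w * x + b))."""
    grad_w = 0.0
    grad_b = 0.0
    for x in inputs:
        pre_activation = w * x + b
        gate = 1.0 if pre_activation > 0 else 0.0  # ReLU derivative
        grad_w += gate * x
        grad_b += gate
    return grad_w, grad_b


inputs = [0.5, 1.0, 2.0]
healthy = relu_unit_grads(w=1.0, b=0.0, inputs=inputs)   # (3.5, 3.0)
dead = relu_unit_grads(w=1.0, b=-10.0, inputs=inputs)    # (0.0, 0.0)
```

With zero gradient on both parameters, gradient descent cannot move the dead unit back into the active region.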

Leaky ReLU </br>
&ensp;&ensp;&ensp;&ensp;$\text{LeakyReLU}(x)\ =\ \text{max}(x,\ \alpha x)$
* Where $0\ <\ \alpha\ <\ 1$
* Called **PReLU** if $\alpha$ is learned
* Pro: Non-zero gradient for negative inputs
* Con: The slope $\alpha$ needs to be tuned
* Con: Negative inputs are only scaled by $\alpha$, never zeroed out
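
A minimal sketch of Leaky ReLU and its derivative. The default `alpha=0.01` is an assumed, commonly used value, not one given in these notes.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """LeakyReLU(x) = max(x, alpha * x), assuming 0 < alpha < 1."""
    return max(x, alpha * x)


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha
```

Because the derivative is `alpha` rather than zero on the negative side, a unit whose pre-activations are all negative still receives a small gradient.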

Exponential Linear Unit (ELU) </br>
$$
\text{ELU}(x) = \begin{cases}
x & \text{if}\ x \ge\ 0 \\
\alpha(e^{x}\ -\ 1) & \text{if}\ x\ <\ 0
\end{cases}
$$

* Pro: Non-zero gradient for negative inputs
* Con: $\alpha$ needs to be tuned
* Con: Exponential is computationally expensive
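
A minimal sketch of ELU using only the standard library. The default `alpha=1.0` is an assumption, and `math.expm1` computes $e^{x} - 1$ directly, which is more accurate for small $|x|$.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU(x) = x for x >= 0, alpha * (exp(x) - 1) for x < 0."""
    return x if x >= 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Derivative of ELU: 1 for x >= 0, alpha * exp(x) for x < 0."""
    return 1.0 if x >= 0 else alpha * math.exp(x)
```

For very negative inputs the output saturates towards $-\alpha$ and the gradient shrinks towards zero, although it never reaches it exactly.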

Gaussian Error Linear Unit (GeLU) </br>
&ensp;&ensp;&ensp;&ensp;$\text{GeLU}(x)\ =\ x\ \times\ \Phi (x)$
* Where $\Phi (x)$ is the CDF of the standard Gaussian
* $\Phi (x)\ =\ \frac{1}{\sqrt{2 \pi}}\ \int_{- \infty}^{x}\ e^{- \frac{t^{2}}{2}}dt$
* Pro: Non-zero gradient for negative inputs
* Con: Requires more computation
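
The integral for $\Phi (x)$ has no elementary closed form, but it can be written with the error function as $\Phi (x) = \frac{1}{2} \left( 1 + \text{erf}\left( x / \sqrt{2} \right) \right)$. A minimal sketch on that basis follows; the function names are illustrative.

```python
import math


def std_normal_cdf(x: float) -> float:
    """CDF of the standard Gaussian, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def std_normal_pdf(x: float) -> float:
    """Density of the standard Gaussian."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gelu(x: float) -> float:
    """GeLU(x) = x * Phi(x)."""
    return x * std_normal_cdf(x)


def gelu_grad(x: float) -> float:
    """Derivative of GeLU by the product rule: Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)
```

The extra cost relative to ReLU is the `erf` evaluation per element.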

Sigmoid </br>
&ensp;&ensp;&ensp;&ensp;$\sigma (x)\ =\ \frac{1}{1\ +\ e^{-x}}$
* Closely related to $\text{tanh}$, which is a rescaled and shifted sigmoid: $\text{tanh} (x)\ =\ \frac{e^{x}\ -\ e^{-x}}{e^{x}\ +\ e^{-x}}\ =\ 2 \sigma (2x)\ -\ 1$
* Con: Saturates on both ends
* Do **not** use sigmoid/tanh
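
A small sketch illustrating the saturation problem. The two-branch form of `sigmoid` is an implementation choice made here to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Sigmoid, split into two branches for numerical stability."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Derivative of sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2
```

The sigmoid gradient peaks at 0.25 at $x = 0$ and decays towards zero in both directions, so a unit with a large positive or negative pre-activation passes almost no gradient backwards. `tanh_grad` peaks at 1 but decays the same way.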

Non-linearities allow deep networks to model arbitrary differentiable functions
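
Why the non-linearity is needed: without one, stacked linear layers collapse into a single linear layer. The sketch below uses two scalar affine layers with arbitrarily chosen weights (an assumption for illustration) and compares the stack with and without a ReLU in between.

```python
def layer1(x: float) -> float:
    return 2.0 * x + 1.0


def layer2(x: float) -> float:
    return -3.0 * x + 0.5


def stacked_linear(x: float) -> float:
    """Two affine layers with no activation in between."""
    return layer2(layer1(x))


def collapsed(x: float) -> float:
    """The same function as one affine layer: (-3 * 2) x + (-3 * 1 + 0.5)."""
    return -6.0 * x - 2.5


def stacked_relu(x: float) -> float:
    """Two affine layers with a ReLU in between."""
    return layer2(max(layer1(x), 0.0))
```

`stacked_linear` and `collapsed` agree for every input, so the extra depth adds no expressive power. `stacked_relu` has a kink at $x = -0.5$ and cannot be written as a single affine map.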

#### Activation Functions - TL;DR

* Use ReLU with careful initialization and small learning rate
* If ReLU fails, try Leaky ReLU or PReLU
* Avoid Sigmoid and Tanh
* Use GeLU for sophisticated models