<a href="https://colab.research.google.com/github/pin2gupta/Deep-Learning/blob/main/Basics/Activation_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What are activation functions?
Activation functions decide whether, and how strongly, the output of a neuron is passed on to the next layer. They also transform that output into a form that the next layer or neuron can accept.
 
## Why is it needed?
- They add non-linearity to the neural network.
- They help keep the output of a neuron within a certain range, as needed.
 
Imagine a neural network without activation functions. Every neuron would only perform a linear transformation on its inputs using the weights and biases. Since a composition of linear transformations is itself linear, the whole network would be equivalent to a single linear layer, no matter how many layers it has. Such a network is simpler, but it is less powerful and cannot learn complex patterns from the data.
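The collapse of stacked linear layers can be checked with a small sketch. It uses one-neuron "layers" with made-up weights and biases (`w1`, `b1`, `w2`, `b2` are hypothetical values chosen only for illustration) and plain Python rather than a deep learning library.

```python
def linear_layer(x, w, b):
    """A single neuron with no activation function: a plain affine map."""
    return w * x + b

# Hypothetical weights and biases for two stacked one-neuron layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    """Feed x through layer 1, then layer 2, with no activation in between."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same mapping written as ONE linear layer:
# w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
w_combined = w2 * w1
b_combined = w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w_combined, b_combined)

for x in (-2.0, 0.0, 1.5, 10.0):
    print(x, two_layers(x), one_layer(x))
```

Both functions produce the same value for every input, so the second layer adds no expressive power without a non-linearity between the two.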
 
 
## What are types of activation functions?
###**Heaviside Step Function**
This is one of the common activation functions in neural networks.
 
"The function produces 1 (or true) when input passes threshold limit whereas it produces 0 (or false) when input does not pass threshold." Since it produces binary results, it is sometimes called the binary step function.
              
               f(x) = 1, if x > 0
               f(x) = 0, if x <= 0
 
![](https://drive.google.com/uc?export=view&id=1uZDfxRCUHjP4M1DTpoMBkYtBWJW-6-Pj)
 
**Problem with Step Function:**
 
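The gradient of the step function is zero everywhere except at the threshold, where it is undefined. This is because the output is constant on either side of the threshold and does not vary with x, so backpropagation has no signal with which to update the weights.
A linear function avoids the zero gradient.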
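A minimal sketch of the step function and its gradient follows. The `threshold` argument and the finite-difference helper `numerical_gradient` are illustrative names, and returning 0 exactly at the threshold is an assumption, since conventions differ.

```python
def binary_step(x, threshold=0.0):
    """Return 1.0 when x passes the threshold, otherwise 0.0.

    Assumption: an input exactly at the threshold maps to 0.0.
    """
    return 1.0 if x > threshold else 0.0

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, binary_step(x), numerical_gradient(binary_step, x))
```

Away from the threshold the estimated gradient is zero, which is why gradient-based training cannot adjust the weights feeding a step unit.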
 
###**Linear Activation Function**
The function is defined as
 
f(x) = ax

Here the activation is directly proportional to the input. The constant "a" can be any fixed value.
 
![](https://drive.google.com/uc?export=view&id=1XpYemvoA31_d_dAFDOtIID8jFlMdKOmW)
 
Here the gradient does not become zero, but it is the constant a, which does not depend on the input value x at all. This implies that the weights and biases will be updated during backpropagation, but the factor contributed by the activation is the same for every input.
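As a small illustration, the linear activation and its derivative can be written as below. The default slope `a=0.5` is an arbitrary value picked for the example.

```python
def linear(x, a=0.5):
    """Linear activation f(x) = a * x."""
    return a * x

def linear_grad(x, a=0.5):
    """Derivative of f(x) = a * x with respect to x: always a, whatever x is."""
    return a

for x in (-10.0, 0.0, 3.0, 10.0):
    print(x, linear(x), linear_grad(x))
```

The gradient column is identical for every input, so the activation tells the optimizer nothing about where the input lies.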
 
**Problem with Linear Activation Function**
 
- The gradient is the same constant for every input and every iteration, so it carries no information about the input and does little to help the network reduce the error.
- A network that uses only linear activations collapses to a single linear transformation, so it cannot train well or capture complex patterns in the data. Hence, a linear function is suited mainly to simple tasks where interpretability is highly desired.
 
The other activation functions are shown in the picture below:
![](https://drive.google.com/uc?export=view&id=1s91Eg7DxA20DGEV03LwLuQqIgTccQnrh)
 
 
###**Sigmoid :**
This is a widely used activation function. Its output ranges between 0 and 1. Unlike the binary step and linear functions, sigmoid is a nonlinear function. From the graph, the gradient is significant for inputs between -3 and 3, but the curve gets much flatter in the other regions. Inputs greater than 3 or less than -3 therefore have a very small gradient. As the gradient approaches zero, the network effectively stops learning.
 
It is generally used for binary classification problems.
 
**Problem with Sigmoid:**
- It is computationally expensive (it requires evaluating an exponential), causes the vanishing gradient problem, and is not zero-centered.
 
- The sigmoid function is not symmetric around zero, so the outputs of all the neurons have the same sign (they are always positive).
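The flattening of the gradient can be seen numerically. The sketch below uses the standard definition sigmoid(x) = 1 / (1 + e^(-x)) and its derivative sigmoid(x) * (1 - sigmoid(x)), written with Python's `math` module. It is a naive version for illustration: `math.exp` overflows for very large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-6, -3, 0, 3, 6):
    print(f"x={x:>3}  sigmoid={sigmoid(x):.4f}  gradient={sigmoid_grad(x):.4f}")
```

The gradient peaks at 0.25 for x = 0 and falls below 0.01 by |x| = 6, which is the vanishing gradient behaviour described above. Every output is positive, which is the zero-centering issue.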
 
###**Softmax :**
Softmax is a generalization of the sigmoid function to more than two classes. It converts a vector of scores into probabilities that sum to 1. It is used in multi-class classification problems.
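A minimal softmax sketch over a plain Python list is shown below, using the usual definition softmax(z)_i = e^(z_i) / sum_j e^(z_j). Subtracting the maximum score first is a common numerical-stability trick and does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs, sum(probs))

# With two classes and the second score fixed at 0, softmax reduces to sigmoid.
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```

The largest score receives the largest probability, and the two-class case matches the sigmoid, which is why sigmoid is the usual choice for binary outputs and softmax for multi-class outputs.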
 
###**tanh**
This is similar to the sigmoid function. The main difference is that it is symmetric around the origin and ranges between -1 and 1. As a result, its outputs, which are the inputs to the next layer, will not always be of the same sign.
 
**Problem with tanh:**
 
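It resolves only the zero-centering problem. It is technically a rescaled sigmoid, since tanh(x) = 2·sigmoid(2x) - 1, so it still saturates for large |x| and suffers from the vanishing gradient problem.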
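The relationship between tanh and sigmoid, and the saturation of the tanh gradient, can be checked with a short sketch. It relies on the identity tanh(x) = 2 * sigmoid(2x) - 1 and the derivative 1 - tanh(x)^2, using only the standard `math` module.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x={x:>5}  tanh={math.tanh(x):+.4f}  "
          f"via_sigmoid={tanh_via_sigmoid(x):+.4f}  gradient={tanh_grad(x):.4f}")
```

The outputs take both signs, but the gradient still shrinks toward zero as |x| grows.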
 
###**ReLU** (Rectified Linear Unit)
 
ReLU is another non-linear activation function that is widely used in convolutional neural networks. It is defined as f(x) = max(0, x). The main advantage of ReLU is that it does not activate all the neurons at the same time: a neuron is deactivated only if the output of the linear transformation is less than 0.
The ReLU function is also far more computationally efficient than the sigmoid and tanh functions.
 
**Problem with ReLU**
- It suffers from the "dying ReLU" problem. Since the output, and therefore the gradient, is zero for all negative inputs, some nodes can die completely and stop learning.
- It also has an "exploding ReLU" issue: the output has no upper limit, so activations can grow very large and make nodes unusable.
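Both behaviours follow directly from the definition f(x) = max(0, x), as the sketch below shows. Treating the gradient at exactly x = 0 as 0 is an assumption; the derivative is undefined there and implementations pick a convention.

```python
def relu(x):
    """Rectified Linear Unit: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU. Assumption: the value at x == 0 is taken to be 0."""
    return 1.0 if x > 0 else 0.0

for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(x, relu(x), relu_grad(x))
```

Negative inputs give zero output and zero gradient (the dying ReLU case), while positive inputs pass through unchanged with no upper bound.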
 
###**Leaky ReLU**
 
Leaky ReLU is an improved version of the ReLU function. As we saw, the gradient of ReLU is 0 for x < 0, which deactivates the neurons in that region.
Leaky ReLU is defined to address this problem. Instead of defining the function as 0 for negative values of x, we define it as an extremely small linear component of x, f(x) = αx for x < 0.
 
The slope α of this component is a hyperparameter and is generally set to 0.01.
 
Note that if we set α to 1, Leaky ReLU becomes the linear function f(x) = x and is of no use. Hence, the value of α is never set close to 1. If α is instead learned during training rather than fixed, possibly separately for each neuron, we get parametric ReLU or PReLU.
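A minimal Leaky ReLU sketch, with the slope of the negative part exposed as the argument `alpha` (default 0.01, as in the text):

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-5.0, -1.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))

# With alpha = 1 the function degenerates to the identity f(x) = x.
print(leaky_relu(-5.0, alpha=1.0))
```

Unlike plain ReLU, the gradient for negative inputs is small but non-zero, so those neurons can still receive updates.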
 
###**Parameterised ReLU**
 
This is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis. The parameterised ReLU, as the name suggests, introduces a new parameter as the slope of the negative part of the function.
 
![](https://drive.google.com/uc?export=view&id=1r4h0sdEoDgDAvEGhpx5Dp5uaRAf22Itt)
 
When the value of a is fixed to 0.01, the function acts as a Leaky ReLU function. However, in the case of a parameterised ReLU function, 'a' is a trainable parameter: the network learns its value for faster and better convergence.
The derivative of the function is the same as that of the Leaky ReLU function, except that the value 0.01 is replaced with the value of a.
 
The parameterised ReLU function is used when the Leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
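What makes the slope trainable is that the output also has a gradient with respect to a. The sketch below shows that gradient and a single hand-written gradient descent step. The starting slope, learning rate, input, and upstream gradient are all hypothetical values chosen for illustration, not settings from any particular library.

```python
def prelu(x, a):
    """Parametric ReLU: x for positive inputs, a * x otherwise."""
    return x if x > 0 else a * x

def prelu_grad_x(x, a):
    """Gradient with respect to the input: same form as Leaky ReLU, with slope a."""
    return 1.0 if x > 0 else a

def prelu_grad_a(x):
    """Gradient with respect to the slope a: x on the negative side, 0 otherwise."""
    return 0.0 if x > 0 else x

# One illustrative update of the slope (all values below are hypothetical).
a = 0.25             # starting slope
learning_rate = 0.1
x = -2.0             # a negative input, so the slope receives a gradient
upstream_grad = 1.0  # gradient of the loss with respect to the PReLU output

a_updated = a - learning_rate * upstream_grad * prelu_grad_a(x)
print(a, "->", a_updated)
```

Only negative inputs contribute to the update of a, because the slope has no effect on the positive side.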
 
###**Exponential Linear Unit**
 
Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit (ReLU) that modifies the slope of the negative part of the function. Unlike the Leaky ReLU and parametric ReLU functions, which use a straight line, ELU uses an exponential curve, f(x) = a(e^x - 1), for the negative values.
 
The derivative of the ELU function for values of x greater than 0 is 1, like all the ReLU variants. But for values of x < 0, the derivative is a·e^x.
 
![](https://drive.google.com/uc?export=view&id=1hzN-MI8e7H6DTQ8v-f0whZhmJtDsjckJ)
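A short sketch of ELU and its derivative follows, using the common definition f(x) = x for x > 0 and f(x) = a(e^x - 1) otherwise. The default `a=1.0` is an assumed value for the example.

```python
import math

def elu(x, a=1.0):
    """Exponential Linear Unit: identity for positive x, a * (e^x - 1) otherwise."""
    return x if x > 0 else a * (math.exp(x) - 1.0)

def elu_grad(x, a=1.0):
    """Derivative of ELU: 1 for positive x, a * e^x otherwise."""
    return 1.0 if x > 0 else a * math.exp(x)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:>6}  elu={elu(x):+.4f}  gradient={elu_grad(x):.4f}")
```

For large negative inputs the output levels off near -a instead of growing linearly, and the gradient decays smoothly rather than dropping to a fixed constant.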
 
 
## How to choose different activation functions?
 
###**For Hidden Layer**
![](https://drive.google.com/uc?export=view&id=1ibt2_P9MOQjrPpyrcZwbAAb_8SCE6JF9)
 
###**For Output Layer**
 
![](https://drive.google.com/uc?export=view&id=16yOijtE3tVDtaADfEcL_Jh7YDjLaOjWY)
 
 
 



### **References:**
1. [Everything you need to know about “Activation Functions” in Deep learning models ](https://towardsdatascience.com/everything-you-need-to-know-about-activation-functions-in-deep-learning-models-84ba9f82c253)
2. [Deep Learning cheatsheet](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning)
3. [How to Choose an Activation Function for Deep Learning](https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/)
