# Activation Function

## 1. Binary Step Function

This is the simplest activation function; it can be implemented with a single if-else condition in Python.


    def binary_step(x):
        # Return 1 when the input is non-negative, otherwise 0.
        if x >= 0:
            return 1
        else:
            return 0

$$
f(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}
$$
     
![image.png](attachment:image.png)
     
### Limitations:
1. The function is not useful when the target variable has multiple classes, since it can only output 0 or 1.
2. The gradient of the step function is zero (and undefined at x = 0), which hinders backpropagation because no update signal flows back through it.


## 2. Linear Function

f(x) = ax

Here the activation is proportional to the input. The variable 'a' in this case can be any constant value.

    def linear_function(x):
        # Linear activation with the constant a = 4.
        return 4 * x
    
![image.png](attachment:image.png)
    
The derivative of the function with respect to x is the coefficient a, which is constant. Although the gradient does not become zero, it does not depend on the input value x at all. This implies that the weights and biases will be updated during backpropagation, but the updating factor contributed by the activation is always the same.
In this scenario, the neural network will not really improve the error, since the gradient is the same for every iteration. In addition, a stack of layers with linear activations is itself just a linear function of the input, so the network cannot train well or capture complex patterns in the data.
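
One consequence of a linear activation is that stacking layers adds no expressive power. The sketch below illustrates this with a scalar, two-layer toy network; the helper names and the weight values are hypothetical and chosen only for illustration.

```python
def linear_activation(x, a=4.0):
    """Linear activation f(x) = a * x (a = 4 matches linear_function above)."""
    return a * x


def two_layer_net(x, w1, b1, w2, b2, a=4.0):
    """Two scalar layers, each followed by the linear activation."""
    hidden = linear_activation(w1 * x + b1, a)
    return linear_activation(w2 * hidden + b2, a)


def collapsed_single_layer(x, w1, b1, w2, b2, a=4.0):
    """One affine map w * x + b that reproduces two_layer_net exactly."""
    # Expanding a * (w2 * a * (w1 * x + b1) + b2) gives the terms below.
    w = a * a * w1 * w2
    b = a * a * w2 * b1 + a * b2
    return w * x + b


# Hypothetical weights chosen only for illustration.
params = dict(w1=0.5, b1=1.0, w2=-2.0, b2=0.25)
for x in (-3.0, 0.0, 2.5):
    print(x, two_layer_net(x, **params), collapsed_single_layer(x, **params))
```

Both functions print the same value for every input, which is why a non-linear activation is needed for depth to matter.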

## 3. Sigmoid

It is one of the most widely used non-linear activation functions. Sigmoid maps any real input to a value between 0 and 1.

f(x) = 1/(1 + e^-x)

![image.png](attachment:image.png)

Because sigmoid is a non-linear function, the output of the neuron is non-linear as well. The derivative of the sigmoid function is sigmoid(x) * (1 - sigmoid(x)).

### Limitations:
1. The gradient values are significant in the range -3 to 3, but the graph gets much flatter in other regions. This implies that inputs greater than 3 or less than -3 have very small gradients. As the gradient approaches zero, the network stops learning effectively.
2. The sigmoid function is not symmetric around the origin: its output is always positive, so the outputs of all the neurons have the same sign.
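
The saturation described above can be checked numerically. This is a minimal sketch using only the standard library; it uses the standard derivative sigmoid(x) * (1 - sigmoid(x)), and the two-branch form of `sigmoid` is an implementation choice made here to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Sigmoid 1 / (1 + e^-x), written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative, so e is in (0, 1)
    return e / (1.0 + e)


def sigmoid_derivative(x):
    """Derivative of sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The gradient peaks at x = 0 and shrinks quickly outside roughly [-3, 3].
for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(x, sigmoid(x), sigmoid_derivative(x))
```

Every printed sigmoid value is positive, and the derivative at x = 6 is far smaller than at x = 0, which matches both limitations listed above.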

## 4. Tanh

The tanh function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values in this case is from -1 to 1, so the inputs to the next layers will not always be of the same sign.

tanh(x) = 2sigmoid(2x) -1

![image.png](attachment:image.png)

The gradient of the tanh function is steeper than that of the sigmoid function. Tanh is usually preferred over sigmoid because it is zero-centered, so the gradients are not restricted to move in a certain direction.
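
The identity tanh(x) = 2sigmoid(2x) - 1 and the steeper gradient can both be verified with a short sketch. The helper names are hypothetical; `math.tanh` from the standard library serves as the reference, and the derivative used is the standard 1 - tanh(x)^2.

```python
import math


def sigmoid(x):
    """Sigmoid 1 / (1 + e^-x); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x):
    """tanh expressed through sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_derivative(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x), tanh_derivative(x))
```

At x = 0 the tanh derivative is 1, four times the sigmoid's maximum slope of 0.25, which is what "steeper" refers to.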

## 5. ReLU

The ReLU function is another non-linear activation function that has gained popularity in the deep learning domain. ReLU stands for Rectified Linear Unit. The main advantage of using the ReLU function over the other activation functions is that it does not activate all the neurons at the same time.

f(x) = max(0,x)

![image.png](attachment:image.png)

This means that a neuron is deactivated only if the output of the linear transformation is less than 0.
For negative input values the result is zero, which means the neuron does not get activated. Since only a subset of the neurons is activated at a time, the ReLU function is far more computationally efficient than the sigmoid and tanh functions.

On the negative side of the graph, the gradient is zero. For this reason, during backpropagation the weights and biases of some neurons are not updated. This can create dead neurons that never get activated.
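
A minimal sketch of ReLU and its gradient makes the dead-neuron issue concrete. The function names are hypothetical, and treating the derivative at exactly x = 0 as 0 is a common convention assumed here (the true derivative is undefined at that point).

```python
def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_derivative(x):
    """Gradient of ReLU: 1 for x > 0, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(x, relu(x), relu_derivative(x))
```

For every negative input both the output and the gradient are zero, so a neuron whose pre-activation stays negative receives no weight updates.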

## 6. Leaky ReLU

Leaky ReLU is an improved version of the ReLU function. As we saw, the gradient of ReLU is 0 for x<0, which deactivates the neurons in that region.
Leaky ReLU is defined to address this problem: instead of defining the function as 0 for negative values of x, we define it as an extremely small linear component of x.

$$
f(x) = \begin{cases} 0.01x, & x < 0 \\ x, & x \ge 0 \end{cases}
$$
    
![image.png](attachment:image.png)
    
By making this small modification, the gradient of the left side of the graph comes out to be a non-zero value. Hence we would no longer encounter dead neurons in that region.
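
The following sketch shows the non-zero gradient on the negative side. The names and the `slope` keyword are hypothetical; the default of 0.01 is the value used in the definition above.

```python
def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x >= 0, slope * x for x < 0."""
    return x if x >= 0 else slope * x


def leaky_relu_derivative(x, slope=0.01):
    """Gradient of Leaky ReLU: 1 for x >= 0, slope for x < 0."""
    return 1.0 if x >= 0 else slope


for x in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Unlike plain ReLU, the gradient for negative inputs is small but never zero, so the weights can still be updated.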

## 7. Parameterised ReLU

This is another variant of ReLU that aims to solve the problem of the gradient becoming zero for the left half of the axis. The parameterised ReLU, as the name suggests, introduces a new parameter as the slope of the negative part of the function.

$$
f(x) = \begin{cases} x, & x \ge 0 \\ ax, & x < 0 \end{cases}
$$
     
![image.png](attachment:image.png)
     
When the value of 'a' is fixed to 0.01, the function acts as a Leaky ReLU function. Here, however, 'a' is a trainable parameter: the network learns its value along with the weights for faster and more optimal convergence.
The derivative of the function is the same as for the Leaky ReLU function, except that the value 0.01 is replaced with 'a'.
The Parameterised ReLU function is used when the Leaky ReLU function still fails to solve the problem of dead neurons and the relevant information is not successfully passed to the next layer.
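
Because 'a' is trainable, the function has a gradient with respect to 'a' as well as with respect to x. The sketch below shows both, followed by a single illustrative gradient-descent step on 'a'. All names, the starting slope, the upstream gradient, and the learning rate are assumptions made for the example, not values from any particular framework.

```python
def prelu(x, a):
    """Parameterised ReLU: x for x >= 0, a * x for x < 0."""
    return x if x >= 0 else a * x


def prelu_grad_x(x, a):
    """Gradient with respect to the input: 1 for x >= 0, a for x < 0."""
    return 1.0 if x >= 0 else a


def prelu_grad_a(x, a):
    """Gradient with respect to the slope: 0 for x >= 0, x for x < 0."""
    return 0.0 if x >= 0 else x


# One hypothetical update of the slope for a single negative input.
a = 0.25              # assumed starting slope
x = -2.0
upstream_grad = 1.0   # assumed dLoss/d(output), for illustration only
learning_rate = 0.1   # assumed
a_updated = a - learning_rate * upstream_grad * prelu_grad_a(x, a)
print(prelu(x, a), prelu_grad_x(x, a), prelu_grad_a(x, a), a_updated)
```

Only negative inputs contribute to the update of 'a', since the slope has no effect on the output for x >= 0.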

## 8. Exponential Linear Unit

Exponential Linear Unit, or ELU for short, is also a variant of the Rectified Linear Unit (ReLU) that modifies the slope of the negative part of the function. Unlike the Leaky ReLU and Parameterised ReLU functions, which use a straight line, ELU uses an exponential curve for the negative values.

$$
f(x) = \begin{cases} x, & x \ge 0 \\ a(e^{x} - 1), & x < 0 \end{cases}
$$
     
![image.png](attachment:image.png)
     
The derivative of the ELU function for values of x greater than 0 is 1, like all the ReLU variants. For values of x<0, the derivative is a * e^x.
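
A small sketch of ELU and its derivative, following the piecewise definition above. The function names are hypothetical, and the default a = 1.0 is an assumption made for the example.

```python
import math


def elu(x, a=1.0):
    """ELU: x for x >= 0, a * (e^x - 1) for x < 0."""
    return x if x >= 0 else a * (math.exp(x) - 1.0)


def elu_derivative(x, a=1.0):
    """Gradient of ELU: 1 for x >= 0, a * e^x for x < 0."""
    return 1.0 if x >= 0 else a * math.exp(x)


for x in (-5.0, -1.0, 0.0, 1.0):
    print(x, elu(x), elu_derivative(x))
```

For very negative inputs the output levels off near -a instead of decreasing without bound, and the gradient shrinks towards zero but stays positive.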

## 9. Swish

Swish is a lesser-known activation function that was discovered by researchers at Google. Swish is reported to be about as computationally efficient as ReLU and to perform better than ReLU on deeper models. Swish is unbounded above but bounded below: its values range from about -0.28 to infinity.

f(x) = x*sigmoid(x)

f(x) = x/(1 + e^-x)


![image.png](attachment:image.png)

The curve of the function is smooth and the function is differentiable at all points. This is helpful during the model optimization process and is considered to be one of the reasons that swish outperforms ReLU.
A unique fact about the swish function is that it is not monotonic. This means that the value of the function may decrease even when the input values are increasing.
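
The non-monotonic behaviour is easy to see numerically. This is a minimal sketch with hypothetical helper names; the sample inputs are chosen to straddle the dip on the negative side, which bottoms out at roughly -0.28 near x = -1.28.

```python
import math


def sigmoid(x):
    """Sigmoid 1 / (1 + e^-x); fine for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def swish(x):
    """Swish: f(x) = x * sigmoid(x)."""
    return x * sigmoid(x)


# Inputs increase from left to right, but the outputs first fall, then rise.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, swish(x))
```

Moving from x = -3 to x = -1 the input increases while the output decreases; past the minimum the output rises again.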

## 10. Softmax

The softmax function is often described as a combination of multiple sigmoids. We know that sigmoid returns values between 0 and 1, which can be treated as probabilities of a data point belonging to a particular class. The sigmoid is widely used for binary classification problems.

![image.png](attachment:image.png)

The softmax function can be used for multiclass classification problems. This function returns the probability of a data point belonging to each individual class.

While building a network for a multiclass problem, the output layer has as many neurons as the number of classes in the target. For instance, if you have three classes, there would be three neurons in the output layer.
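
For the three-class case, a softmax output layer can be sketched as follows. The text above does not give the formula, so this uses the standard definition, in which each output is e^(z_i) divided by the sum of e^(z_j) over all classes. The function name and the example scores are hypothetical, and subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of raw output-layer scores into class probabilities."""
    largest = max(scores)
    # Shifting by the largest score avoids overflow in exp().
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs of three output neurons (one per class).
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The result has one probability per class, the values sum to 1, and the class with the largest raw score receives the highest probability.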