<a href="https://colab.research.google.com/github/pb111/Neural-Networks-and-Deep-Learning/blob/main/Activation_Functions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Activation Functions](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)**


- In this notebook, we will discuss **Activation Functions** in neural nets.

## **Table of Contents**

- 1  Activation Functions

- 2  Types of Activation Functions

  - [1. Linear](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#linear)
  - [2. ELU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#elu)
  - [3. ReLU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#relu)
  - [4. LeakyReLU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#leakyrelu)
  - [5. Sigmoid](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#sigmoid)
  - [6. Tanh](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#tanh)
  - [7. Softmax](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#softmax)

## **[1. Activation Functions](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)**


- **Activation functions** live inside neural network layers and transform the data they receive before passing it to the next layer.
- By applying non-linear functions to their inputs, they give neural networks their power: the ability to model highly complex, non-linear relationships between features.
- Popular activation functions include [relu](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#activation-relu) and [sigmoid](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#activation-sigmoid).



Activation functions typically have the following properties:

- **Non-linear** - In linear regression we’re limited to a prediction equation that looks like a straight line. This is nice for simple datasets with a one-to-one relationship between inputs and outputs, but what if the patterns in our dataset were non-linear (e.g. x², sin, log)? To model these relationships we need a non-linear prediction equation. Activation functions provide this non-linearity.

- **Continuously differentiable** — To improve our model with gradient descent, we need our output to have a nice slope so we can compute error derivatives with respect to weights. If our neuron instead outputted 0 or 1 (perceptron), we wouldn’t know in which direction to update our weights to reduce our error.

- **Fixed Range** — Activation functions typically squash the input data into a narrow range that makes training the model more stable and efficient.
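
To make the non-linearity property concrete, here is a minimal sketch (not taken from the original notebook) showing that two stacked linear layers collapse into a single linear layer, while placing a ReLU between them prevents the collapse. The scalar weights and biases are made-up illustration values.

```python
# Sketch: stacked linear layers collapse into one linear map;
# a ReLU in between prevents the collapse.
# All parameter values are hypothetical.

def layer(x, w, b):
    """One scalar neuron without activation: weighted input plus bias."""
    return w * x + b

def relu(x):
    return max(0.0, x)

w1, b1 = 2.0, 1.0    # hypothetical first-layer parameters
w2, b2 = -3.0, 0.5   # hypothetical second-layer parameters

def stacked_linear(x):
    return layer(layer(x, w1, b1), w2, b2)

def single_linear(x):
    # Equivalent single layer: w = w2 * w1, b = w2 * b1 + b2
    return layer(x, w2 * w1, w2 * b1 + b2)

def stacked_relu(x):
    return layer(relu(layer(x, w1, b1)), w2, b2)

for x in (-2.0, 0.0, 3.0):
    print(x, stacked_linear(x), single_linear(x), stacked_relu(x))
```

For every input, the first two columns agree, so the second linear layer adds no modelling power. The ReLU version differs for inputs where the first layer's output is negative.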

## **2. Types of Neural Network Activation Functions**

- Here, we will discuss the following types of **Neural Network Activation Functions**:

- 1  Linear
- 2  ELU
- 3  ReLU
- 4  LeakyReLU
- 5  Sigmoid
- 6  Tanh
- 7  Softmax

## **[1. Linear](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#linear)**


- A straight-line function where the activation is proportional to the input (the weighted sum from the neuron).

Function

       - R(z,m)={z∗m}

  ![Linear function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/linear.png)


   

In [1]:
def linear(z,m):
    return m*z


Derivative

    - R′(z,m)={m}

![Derivative of Linear Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/linear_prime.png)

In [2]:
def linear_prime(z,m):
    return m

#### **Pros**

- It gives a continuous range of activations rather than a binary output.
- We can connect several neurons together and, if more than one fires, take the max (or softmax) of their outputs and decide based on that.


#### **Cons**

- The derivative of this function is a constant, so the gradient has no relationship with the input.
- Gradient descent therefore always works with the same constant gradient.
- If there is an error in the prediction, the change made by backpropagation is constant and does not depend on the change in the input.
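
The constant-gradient point can be checked numerically. The sketch below estimates the slope of the linear activation at several inputs using a central finite difference. The helper `numerical_derivative` and the slope value `m = 0.5` are illustrative assumptions, not part of the original notebook.

```python
# Sketch: the slope of the linear activation is the same at every input.

def linear(z, m):
    return m * z

def numerical_derivative(f, z, h=1e-4):
    """Central finite-difference estimate of f'(z) (hypothetical helper)."""
    return (f(z + h) - f(z - h)) / (2 * h)

m = 0.5  # hypothetical slope
slopes = [numerical_derivative(lambda z: linear(z, m), z)
          for z in (-100.0, 0.0, 3.0, 1000.0)]
print(slopes)  # every entry is approximately m
```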

## **[2. ELU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#elu)**


- **Exponential Linear Unit**, widely known as **ELU**, is a function that tends to converge the cost to zero faster and produce more accurate results.
- Unlike other activation functions, ELU has an extra constant α, which should be a positive number.

- ELU is very similar to ReLU except for negative inputs. Both are the identity function for non-negative inputs.
- For negative inputs, ELU decreases smoothly and saturates as its output approaches -α, whereas ReLU switches sharply to 0.

Function

     - R(z) = { z             z > 0
                α(e^z − 1)    z <= 0 }


![ELU function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/elu.png)

In [3]:
import numpy as np  # np.exp is also used by later cells

def elu(z, alpha):
    return z if z > 0 else alpha * (np.exp(z) - 1)  # np.exp(z) computes e**z

Derivative

     - R′(z) = {1  ,  z > 0
               α.e^z  z < 0}

![ELU Prime Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/elu_prime.png)

In [4]:
def elu_prime(z,alpha):
	return 1 if z > 0 else alpha*np.exp(z)

#### **Pros**

- For negative inputs, ELU saturates smoothly toward -α, whereas ReLU has a sharp kink at zero.
- ELU is a strong alternative to ReLU.
- Unlike ReLU, ELU can produce negative outputs.

#### **Cons**

- For z > 0, the output is unbounded (its range is (0, inf)), so the activation can blow up.
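
The behaviour described in this section can be seen by evaluating ELU at a few points. This is a minimal sketch using only the standard library. The choice `alpha = 1.0` and the sample inputs are assumptions made for illustration.

```python
import math

def elu(z, alpha):
    # Identity for positive inputs, alpha * (e^z - 1) otherwise
    return z if z > 0 else alpha * (math.exp(z) - 1)

alpha = 1.0  # hypothetical choice; any positive value works
for z in (-10.0, -1.0, 0.0, 2.0, 100.0):
    print(z, elu(z, alpha))
```

Large negative inputs give outputs that approach -alpha without reaching it, while positive inputs pass through unchanged and grow without bound.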

## **[3. ReLU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#relu)**


- A relatively recent invention, **ReLU** stands for **Rectified Linear Unit**.
- The formula is deceptively simple: **max(0, z)**.
- Despite its name and appearance, it is not linear, and it provides the same benefits as Sigmoid but with better performance.

Function

      - R(z) = {z  z > 0
                0  z <= 0}

![ReLU Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/relu.png)

In [5]:
def relu(z):
  return max(0, z)

Derivative


     - R′(z) = {1  z > 0
                0  z < 0}

![ReLU Prime Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/relu_prime.png)

In [6]:
def relu_prime(z):
  return 1 if z > 0 else 0

#### **Pros**

- It mitigates the **vanishing gradient problem**, because its gradient is exactly 1 for every positive input.
- **ReLU** is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.


#### **Cons**

- One limitation is that it should only be used within the hidden layers of a neural network model.
- Some units can be fragile during training and can die: a weight update can push a neuron into a state where it never activates on any data point again. In short, ReLU can result in dead neurons.
- In other words, for activations in the region z < 0, the gradient is 0, so the weights are not adjusted during descent. Neurons that enter this state stop responding to variations in the error or input (because the gradient is 0, nothing changes). This is called the **dying ReLU problem**.
- The range of ReLU is [0, inf), so the activation can blow up.
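
The dying ReLU problem can be reproduced with a single neuron. In the sketch below, the neuron, the squared-error loss, the data points, and the learning rate are all made-up illustration values. The large negative bias keeps the pre-activation negative for every sample, so every gradient is zero.

```python
# Sketch: a ReLU neuron whose pre-activation is negative on all data
# receives zero gradient and never updates.

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron: prediction = relu(w * x + b), squared-error loss
w, b = 0.5, -10.0                            # negative bias keeps the neuron inactive
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]  # made-up (x, target) pairs
lr = 0.1                                     # hypothetical learning rate

for step in range(3):
    grad_w = grad_b = 0.0
    for x, target in data:
        z = w * x + b
        error = relu(z) - target
        # Chain rule: dL/dw = 2 * error * relu'(z) * x
        grad_w += 2 * error * relu_prime(z) * x
        grad_b += 2 * error * relu_prime(z)
    w -= lr * grad_w
    b -= lr * grad_b
    print(step, w, b)
```

The printed weight and bias stay at their initial values on every step, even though the prediction error is large.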

## **[4. LeakyReLU](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#leakyrelu)**

- **LeakyReLU** is a variant of ReLU.
- Instead of being 0 when z < 0, a leaky ReLU allows a small, non-zero, constant gradient α (normally α = 0.01).
- However, the consistency of the benefit across tasks is presently unclear. 

Function	

    - R(z) = {z  z > 0
              αz   z <= 0}


![LeakyReLU Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/leakyrelu.png)

In [7]:
def leakyrelu(z, alpha):
	return max(alpha * z, z)

Derivative

     - R′(z) = {1 z > 0
                α z < 0}


![LeakyReLU Prime Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/leakyrelu_prime.png)

In [8]:
def leakyrelu_prime(z, alpha):
	return 1 if z > 0 else alpha

#### **Pros**

- Leaky ReLUs are one attempt to fix the **“dying ReLU”** problem by having a small slope (of 0.01 or so) for negative inputs.
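
As a rough illustration of why the small slope helps, the sketch below computes one gradient step for a single leaky ReLU neuron whose pre-activation is negative. The neuron, the squared-error loss, and all numeric values are assumptions chosen for the example.

```python
# Sketch: with a leaky ReLU, a neuron with negative pre-activation
# still receives a (small) non-zero gradient.

def leakyrelu(z, alpha):
    return z if z > 0 else alpha * z

def leakyrelu_prime(z, alpha):
    return 1.0 if z > 0 else alpha

alpha = 0.01            # slope for negative inputs
w, b = 0.5, -10.0       # hypothetical parameters
x, target = 2.0, 3.0    # one made-up training pair
lr = 0.1                # hypothetical learning rate

z = w * x + b                            # negative pre-activation
error = leakyrelu(z, alpha) - target
# Chain rule for squared error: dL/dw = 2 * error * f'(z) * x
grad_w = 2 * error * leakyrelu_prime(z, alpha) * x
w_new = w - lr * grad_w
print(z, grad_w, w_new)
```

The gradient is small but non-zero, so the weight moves toward values that reduce the error instead of staying stuck.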


#### **Cons**

- Because it is piecewise linear, it is sometimes considered less suitable for complex classification tasks.
- It lags behind [Sigmoid](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#sigmoid) and [Tanh](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#tanh) in some use cases.

## **[5. Sigmoid](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#sigmoid)**


- Sigmoid takes a real value as input and outputs another value between 0 and 1. 
- It’s easy to work with and has all the nice properties of activation functions: it’s 
   - non-linear 
   - continuously differentiable 
   - monotonic and 
   - has a fixed output range.

Function


    - S(z) = 1 / (1 + e^(−z))


![Sigmoid Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/sigmoid.png)

In [9]:
def sigmoid(z):
  return 1.0 / (1 + np.exp(-z))

Derivative

    - S′(z) = S(z).(1−S(z))


![Sigmoid Prime Function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/sigmoid_prime.png)

In [10]:
def sigmoid_prime(z):
  return sigmoid(z) * (1-sigmoid(z))

#### **Pros**

- It is non-linear in nature, and combinations of this function are also non-linear.
- It gives an analog activation, unlike a step function.
- It has a smooth gradient.
- It’s good for a classifier.
- The output is always in the range (0, 1), compared to (-inf, inf) for the linear function, so the activations are bounded and won’t blow up.


#### **Cons**

- Towards either end of the sigmoid function, the output responds very little to changes in the input.
- This gives rise to the problem of **“vanishing gradients”**.
- Its output isn’t zero-centered (0 < output < 1). This can make the gradient updates go too far in different directions, which makes optimization harder.
- Sigmoids saturate and kill gradients.
- The network then refuses to learn further or learns drastically slowly (depending on the use case, and until the gradient computation hits floating-point limits).
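
A quick numerical check makes the saturation visible. The sketch below evaluates the sigmoid derivative at a few inputs and then prints an upper bound on the gradient factor contributed by a chain of sigmoid layers. It ignores the weights, which is a simplifying assumption made only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and shrinks quickly as |z| grows
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_prime(z))

# Each sigmoid layer multiplies the backpropagated gradient by at most 0.25,
# so n layers contribute a factor of at most 0.25 ** n (weights ignored).
for n in (1, 5, 10):
    print(n, 0.25 ** n)
```

The derivative is 0.25 at its maximum and falls below 1e-4 by z = 10, which is why deep stacks of saturated sigmoid units pass back very little gradient.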

## **[6. Tanh](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#tanh)**


- **Tanh** squashes a real-valued number to the range (-1, 1).
- It’s non-linear. But unlike Sigmoid, its output is zero-centered.
- Therefore, in practice the tanh non-linearity is usually preferred to the sigmoid non-linearity.

Function	

    - tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))


![tanh function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/tanh.png)

In [11]:
def tanh(z):
    # np.tanh equals (e^z - e^-z) / (e^z + e^-z) but does not overflow for large |z|
    return np.tanh(z)

Derivative


    - tanh′(z) = 1 − tanh(z)^2


![tanh prime function](https://ml-cheatsheet.readthedocs.io/en/latest/_images/tanh_prime.png)

In [12]:
def tanh_prime(z):
	return 1 - np.power(tanh(z), 2)

#### **Pros**

- The gradient is stronger for tanh than for sigmoid (its derivatives are steeper).


#### **Cons**

- Tanh also has the **vanishing gradient problem**.
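
The pros and cons above can be compared side by side. This sketch, written with the standard library only, prints the tanh and sigmoid derivatives at the same inputs and also checks the known identity tanh(z) = 2·S(2z) − 1, which shows that tanh is a rescaled, zero-centered sigmoid. The sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

for z in (-2.0, -0.5, 0.0, 0.5, 2.0, 5.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0  # equals tanh(z)
    print(z, math.tanh(z), rescaled_sigmoid, tanh_prime(z), sigmoid_prime(z))
```

At z = 0 the tanh derivative is 1 while the sigmoid derivative is 0.25, yet both shrink toward 0 for large |z|, so tanh still saturates.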

## **[7. Softmax](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#softmax)**


- The softmax function calculates a probability distribution over n different events.
- In other words, it calculates the probability of each target class over all possible target classes.
- The calculated probabilities are then used to determine the target class for the given inputs.
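
For a vector of scores z, softmax is commonly defined as softmax(z)_i = e^(z_i) / Σ_j e^(z_j). The sketch below is one possible standard-library implementation, not code from the original notebook. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result. The example scores are made up.

```python
import math

def softmax(scores):
    """Return softmax probabilities for a list of real-valued scores."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                   # made-up class scores
probs = softmax(scores)
predicted_class = probs.index(max(probs))  # class with the highest probability
print(probs, sum(probs), predicted_class)
```

The probabilities are positive, sum to 1 up to floating-point rounding, and the largest score receives the largest probability.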

Ref : https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html