## Why should the sigmoid activation function not be used in hidden layers, and why is it used only in the output layer for binary classification?

- Vanishing Gradient (Primary Problem): The graph of the sigmoid's derivative shows that for large positive or large negative inputs (for example, when the weights are initialized to extreme values), the gradient is close to zero. This causes the vanishing gradient problem. In addition, the derivative never exceeds 0.25, so in deep neural networks, no matter how the weights are initialized, the gradients tend to shrink during backpropagation as they travel back toward the initial layers, and the weights of those layers are barely updated.

- Slow Convergence (Secondary Problem): The sigmoid saturates for inputs of large magnitude, where its gradient is close to zero. This can slow convergence and lengthen training time.
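
The saturation described above can be checked numerically. The following is a minimal sketch using only the standard library; the sample inputs and depths are arbitrary, and the depth loop ignores the weights entirely (a simplifying assumption), so it only shows a best-case bound on how the sigmoid's local gradients multiply across layers.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and collapses for large |x| (saturation).
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x = {x:6.1f}   sigmoid'(x) = {sigmoid_grad(x):.6f}")

# Best case: every layer contributes the maximum local gradient of 0.25.
# Weights are ignored here, which is a simplifying assumption.
for depth in (1, 5, 10, 20):
    bound = 0.25 ** depth
    print(f"depth = {depth:2d}   best-case gradient factor = {bound:.3e}")
```

Even in this best case the factor shrinks geometrically with depth, which is one way to see why the early layers receive very small updates.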

![Screenshot%202023-08-18%20164954.png](attachment:Screenshot%202023-08-18%20164954.png)
-----------------------------------------------------------------------------

- Bounded Output: The sigmoid function maps input values to a range between 0 and 1. This property can be useful when you want to interpret the output as a probability or when you need to enforce certain bounds on the activations.
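
As a small illustration of the bounded output, the sketch below turns raw scores into values in (0, 1) and then into class labels. The logits and the 0.5 decision threshold are hypothetical choices for illustration, not values from these notes.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores from the last layer of a binary classifier.
logits = [-4.0, -0.3, 0.0, 1.2, 6.0]

# Interpret each sigmoid output as the probability of the positive class.
probs = [sigmoid(z) for z in logits]

# Assumed decision rule: predict class 1 when the probability is at least 0.5.
labels = [1 if p >= 0.5 else 0 for p in probs]

for z, p, y in zip(logits, probs, labels):
    print(f"logit = {z:5.1f}   probability = {p:.4f}   predicted class = {y}")
```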

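## Advantages of Tanh over Sigmoid
Zero-Centered Output: Tanh produces output values between -1 and 1, centered at 0, whereas the sigmoid ranges from 0 to 1 and is not centered around zero. Because sigmoid outputs are always positive, the gradients of all weights feeding into a neuron in the next layer share the same sign, which makes the updates zig-zag and slows learning. Tanh also has a larger maximum derivative (1 at x = 0, compared with 0.25 for the sigmoid), so gradients shrink less from layer to layer, although tanh still saturates for inputs of large magnitude.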

Improved Gradient Flow: The symmetric distribution of tanh's outputs around zero helps gradients flow more effectively during backpropagation. This leads to more stable learning and can aid in training deeper networks.
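
The two points above can be compared numerically. This is a minimal sketch: the input grid is an arbitrary symmetric sample around zero, chosen only to show where the outputs of each function are centered and how large their derivatives are at x = 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Symmetric sample of inputs around zero (an arbitrary illustrative grid).
xs = [i / 10.0 for i in range(-30, 31)]

mean_sigmoid = sum(sigmoid(x) for x in xs) / len(xs)
mean_tanh = sum(math.tanh(x) for x in xs) / len(xs)

print(f"mean sigmoid output = {mean_sigmoid:.4f}")  # centered near 0.5
print(f"mean tanh output    = {mean_tanh:.4f}")     # centered near 0
print(f"sigmoid'(0) = {sigmoid_grad(0.0):.2f}, tanh'(0) = {tanh_grad(0.0):.2f}")
```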

![Screenshot%202023-08-18%20180906.png](attachment:Screenshot%202023-08-18%20180906.png)

## ReLU Over Tanh:

![Screenshot%202023-08-18%20182749.png](attachment:Screenshot%202023-08-18%20182749.png)

Advantages of ReLU:

No Vanishing Gradient Problem (for Positive Inputs): ReLU doesn't suffer from the vanishing gradient problem for positive inputs. Its gradient is exactly 1 for positive inputs and 0 for negative inputs, so gradients passing through active neurons are not scaled down, which promotes faster learning in deeper networks.

Faster Convergence: Due to the lack of saturation for positive inputs, ReLU often leads to faster convergence during training. Neurons that are activated produce stronger gradients, leading to more significant weight updates.

Sparsity: ReLU activation introduces sparsity in the network. When a neuron is not activated for a given input (its output is 0), it contributes nothing to the next layer and its weights receive no update from that input, which effectively reduces the complexity of the model.
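
The gradient and sparsity points above can be sketched in a few lines. The pre-activation values are hypothetical, and setting the gradient at exactly x = 0 to 0 is a convention assumed here, since ReLU is not differentiable at that point.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x = 0 is set to 0 by convention here."""
    return 1.0 if x > 0 else 0.0

# Hypothetical pre-activation values for one layer.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0, 50.0]

outputs = [relu(z) for z in pre_activations]
grads = [relu_grad(z) for z in pre_activations]

# Fraction of neurons whose output is exactly zero (the sparsity of the layer).
sparsity = sum(1 for y in outputs if y == 0.0) / len(outputs)

print("outputs:  ", outputs)
print("gradients:", grads)
print("sparsity: ", sparsity)
```

Note that the gradient stays at 1 even for the large input 50.0, where a sigmoid or tanh would already be saturated.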


Disadvantages of ReLU:

Dying ReLU: ReLU can suffer from the "dying ReLU" problem, where some neurons may become unresponsive and output zero for all inputs. This can happen when the weights are adjusted in such a way that the neuron always outputs zero.

No Zero-Centered Output: ReLU's output is not zero-centered; it ranges from 0 to infinity. As with the sigmoid, activations that are never negative shift the inputs of the next layer away from zero, which can make gradient updates less efficient.

Negative Values Handling: ReLU completely blocks negative input values, which might not be suitable for all types of data, especially if the data has meaningful negative values.

## What is the dying ReLU problem and what causes it?

* A dying ReLU always outputs the same value, 0, for any input. This condition is known as the dead state of a ReLU neuron. Because the gradient of ReLU is 0 in this state, the weights of dead neurons are not updated during backpropagation, so the neuron cannot recover on its own.

* Causes of the dying ReLU:
There are two major causes of the dying ReLU problem:
1. Setting high learning rates
2. Having a large negative bias

* Solutions:
1. Lowering the learning rate and using a small positive bias can reduce the chance of dying ReLU.
2. Using Leaky ReLU, which keeps a small non-zero gradient for negative inputs.
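
To make the dead state concrete, here is a minimal sketch of a single neuron computing relu(w * x + b). The weight, the two bias values, and the inputs are hypothetical; the point is only that a large negative bias keeps the pre-activation below zero for every input, so both the output and the local gradient are zero.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Local gradient of ReLU (0 at z = 0 by convention here)."""
    return 1.0 if z > 0 else 0.0

def neuron_report(w, b, inputs):
    """Return (outputs, local_gradients) of relu(w * x + b) over the inputs."""
    pre = [w * x + b for x in inputs]
    return [relu(z) for z in pre], [relu_grad(z) for z in pre]

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]  # hypothetical input values

# Large negative bias: the pre-activation is negative for every input.
dead_out, dead_grad = neuron_report(w=0.5, b=-10.0, inputs=inputs)

# Small positive bias: the neuron is active for at least some inputs.
live_out, live_grad = neuron_report(w=0.5, b=0.1, inputs=inputs)

print("dead neuron outputs:  ", dead_out, " gradients:", dead_grad)
print("healthy neuron outputs:", live_out, " gradients:", live_grad)
```

Since the weight gradient is the upstream gradient multiplied by this local gradient, a neuron whose local gradient is zero on every input receives no weight updates.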


## Leaky ReLU:

* The Leaky Rectified Linear Unit (Leaky ReLU) is a variation of the Rectified Linear Unit (ReLU) activation function. It was introduced to address one of the limitations of the traditional ReLU function, which can lead to "dying neurons" or neurons that become inactive and output zero for all inputs. Leaky ReLU aims to overcome this issue by allowing a small, non-zero gradient for negative input values, thus preventing neurons from becoming completely inactive.

* The Leaky ReLU function is defined as follows: LeakyReLU(x) = max(a * x, x)
Here, x represents the input to the function, and a is a small positive constant (typically 0.01) that determines the slope of the function for negative inputs. When x is positive or zero, the function behaves like the regular ReLU (the output is x). When x is negative, the output is a * x, so the slope is a (for example, 0.01), allowing a small gradient to flow through the neuron.



![Screenshot%202023-08-18%20222952.png](attachment:Screenshot%202023-08-18%20222952.png)


* Benefits:
1. Preventing Neuron Death: The non-zero slope for negative inputs prevents neurons from becoming completely inactive, ensuring that gradients can flow backward and updates can be made to the neuron's weights.
2. Faster Learning: The non-zero gradient for negative inputs can lead to faster learning, especially in the early stages of training when weights are being adjusted.
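
A minimal sketch of Leaky ReLU and its gradient follows. The default slope of 0.01 matches the typical value mentioned above; the function names and sample inputs are illustrative only.

```python
def leaky_relu(x, a=0.01):
    """Leaky ReLU: x for x >= 0, a * x for x < 0 (a is a small positive slope)."""
    return x if x >= 0 else a * x

def leaky_relu_grad(x, a=0.01):
    """Gradient of Leaky ReLU: 1 for x >= 0, a for x < 0 (never exactly zero)."""
    return 1.0 if x >= 0 else a

samples = [-10.0, -1.0, 0.0, 1.0, 10.0]  # hypothetical inputs

for x in samples:
    print(f"x = {x:6.1f}   output = {leaky_relu(x):7.3f}   gradient = {leaky_relu_grad(x):.2f}")
```

For 0 < a < 1 the piecewise form used here gives the same result as max(a * x, x), and the gradient for negative inputs is a rather than 0, which is what keeps the neuron trainable.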

## Parametric ReLU (PReLU):

* PReLU is a variation of ReLU that allows the slope of the negative part of the function to be learned during training. This introduces additional flexibility and adaptability compared to traditional ReLU. PReLU is defined as follows:

* PReLU(x) = x if x >= 0
* PReLU(x) = a * x if x < 0, where a is a learnable parameter

* Advantages of PReLU:

1. Adaptive Slope: PReLU introduces an adaptive slope parameter "a" for negative inputs, allowing the network to learn the optimal slope for each neuron during training.

2. Mitigates Dead Neurons: The adaptive slope helps mitigate the dying ReLU problem, as even if the initial slope is close to zero, the network can learn to adjust it to prevent neurons from becoming inactive.

3. Improves Learning Capacity: The adaptive slope can enhance the learning capacity of the network by allowing for more nuanced behavior in the negative input range.


* Disadvantages of PReLU:

1. Increased Model Complexity: Introducing learnable parameters can increase the complexity of the model, potentially leading to overfitting if not properly regularized.
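
The idea of a learnable slope can be sketched with a single gradient step on a. Everything numeric here is hypothetical: the initial slope, the input, the target, the learning rate, and the squared-error loss are chosen only to show that a receives a gradient whenever the input is negative.

```python
def prelu(x, a):
    """PReLU: x for x >= 0, a * x for x < 0, where a is learned."""
    return x if x >= 0 else a * x

def prelu_grad_a(x):
    """Derivative of PReLU with respect to a: 0 for x >= 0, x for x < 0."""
    return 0.0 if x >= 0 else x

a = 0.25                  # hypothetical initial slope
x, target = -2.0, -1.0    # hypothetical input and desired output
lr = 0.1                  # hypothetical learning rate

y = prelu(x, a)                           # forward pass
loss_grad_y = 2.0 * (y - target)          # d/dy of the squared error (y - target)^2
grad_a = loss_grad_y * prelu_grad_a(x)    # chain rule: dL/da
a_new = a - lr * grad_a                   # one gradient descent step on a

print(f"before: a = {a:.2f}, output = {y:.2f}")
print(f"after:  a = {a_new:.2f}, output = {prelu(x, a_new):.2f}")
```

After the step, the output for the same input moves closer to the target, which illustrates how the slope for negative inputs adapts during training.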


## Exponential Linear Unit (ELU):

* The ELU activation function is designed to have smoother derivatives than the ReLU, which can lead to more stable training. It also helps mitigate the "dying ReLU" problem by allowing non-zero gradients for negative inputs. The ELU function is defined as follows:
    * ELU(x) = x if x >= 0
    * ELU(x) = a * (exp(x) - 1) if x < 0, where a is a positive constant (e.g., 1.0)

![image.png](attachment:image.png)

* Advantages of ELU:
1. Smooth Derivatives: With a = 1, the ELU function has a continuous derivative everywhere, including at the point where x = 0 (for other values of a, the slope jumps from a to 1 at that point). This smoothness can help training converge more smoothly.

2. No Dead Neurons: ELU prevents the "dying ReLU" problem by introducing a non-zero gradient for negative inputs, which helps maintain gradient flow.

3. Better Learning for Negative Inputs: ELU's non-linearity for negative inputs can encourage the network to learn better representations for negative data points.

* Disadvantages of ELU:
1. Computational Cost: The exponential function computation in the negative range can be computationally more expensive compared to ReLU or other variants.
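
A minimal sketch of ELU and its derivative follows, using only the standard library. The function names and the probe points are illustrative; the checks near x = 0 compare the slope just left of zero with the slope of 1 on the right for two values of a.

```python
import math

def elu(x, a=1.0):
    """ELU: x for x >= 0, a * (exp(x) - 1) for x < 0."""
    return x if x >= 0 else a * (math.exp(x) - 1.0)

def elu_grad(x, a=1.0):
    """Derivative of ELU: 1 for x >= 0, a * exp(x) for x < 0."""
    return 1.0 if x >= 0 else a * math.exp(x)

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"x = {x:5.1f}   ELU(x) = {elu(x):8.4f}   ELU'(x) = {elu_grad(x):.4f}")

# Slope just to the left of zero, compared with the slope of 1 on the right.
eps = 1e-9
print("a = 1.0: left slope ~", elu_grad(-eps, a=1.0))
print("a = 2.0: left slope ~", elu_grad(-eps, a=2.0))
```

For negative inputs the output approaches -a and the gradient a * exp(x) stays positive, although it becomes very small for strongly negative inputs. The exp call on the negative branch is the extra cost mentioned above.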