# Activation Functions

Activation functions are crucial components in neural networks. They introduce non-linearity into the model, allowing it to learn and represent complex patterns. Without activation functions, a stack of layers would collapse into a single linear (affine) transformation of the input, so the network would be no more expressive than a linear model, regardless of its depth.
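
To see why depth alone does not help, the following minimal sketch composes two layers with no activation in between and compares the result with a single equivalent layer. It assumes NumPy; the layer sizes and random weights are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative shapes: 3 inputs -> 4 hidden units -> 2 outputs
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Two stacked layers with no activation function in between
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping written as one affine layer: W x + b
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # prints True
```

Inserting a non-linear function between the two layers breaks this equivalence, which is what gives additional layers their value.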

## Key Concepts of Activation Functions

1. **Definition**:
    - An activation function determines a neuron's output from its weighted input, that is, whether and how strongly the neuron is activated.
    - It maps the input signal to an output signal, introducing non-linear properties to the network.

2. **Common Types of Activation Functions**:

### Sigmoid Function

#### Formula:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

#### Derivative:
The derivative of the sigmoid function is:
$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$

#### Key Properties:
- **Advantages**:
  - Smooth gradient, preventing jumps in output values.
  - Output range is (0, 1), suitable for probability estimations.
- **Disadvantages**:
  - Can cause the vanishing gradient problem during backpropagation: the derivative never exceeds 0.25 and approaches 0 for large $|x|$.
  - Outputs are not zero-centered, which can slow down convergence.
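
As a rough illustration of the formula and derivative above, here is a minimal NumPy sketch. The function names `sigmoid` and `sigmoid_grad` are chosen for this example only, and this direct form is not hardened against overflow for very negative inputs.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # close to 0, exactly 0.5, close to 1
print(sigmoid_grad(x))  # peaks at 0.25 for x = 0, nearly 0 at both ends
```

The near-zero derivative at both ends is the saturation behind the vanishing gradient problem listed above.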

### Hyperbolic Tangent (Tanh) Function

#### Formula:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

#### Derivative:
The derivative of the tanh function is:
$$\tanh'(x) = 1 - \tanh^2(x)$$

#### Key Properties:
- **Advantages**:
  - Output range is (-1, 1), making it zero-centered.
  - Stronger gradients than sigmoid: the derivative peaks at 1 (at $x = 0$), versus 0.25 for sigmoid.
- **Disadvantages**:
  - Can also cause the vanishing gradient problem during backpropagation, since the derivative approaches 0 for large $|x|$.
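
A short sketch of tanh and its derivative, assuming NumPy (which provides `np.tanh`). The helper name `tanh_grad` is hypothetical.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, 0.0, 5.0])
print(np.tanh(x))    # symmetric around 0: negative, 0, positive
print(tanh_grad(x))  # equals 1 at x = 0 and shrinks toward 0 for large |x|
```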

### Rectified Linear Unit (ReLU)

#### Formula:
$$\text{ReLU}(x) = \max(0, x)$$

#### Derivative:
The derivative of the ReLU function (undefined at $x = 0$, where it is conventionally set to 0) is:
$$\text{ReLU}'(x) = \begin{cases}
1 & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}$$

#### Key Properties:
- **Advantages**:
  - Computationally efficient, as it involves simple thresholding.
  - Reduces the likelihood of the vanishing gradient problem, since the gradient is exactly 1 for all positive inputs.
- **Disadvantages**:
  - Can cause the "dying ReLU" problem, where neurons become inactive and only output zero; since the gradient is then also zero, such neurons may stop updating.
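
ReLU and its derivative can be sketched in a few lines of NumPy. The function names are illustrative, and the derivative uses the convention that the gradient at $x = 0$ is 0, matching the piecewise formula above.

```python
import numpy as np

def relu(x):
    """ReLU: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, else 0 (convention at x = 0)."""
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```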

### Leaky ReLU

#### Formula:
$$\text{Leaky ReLU}(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}$$
where $\alpha$ is a small constant (e.g., 0.01).

#### Derivative:
The derivative of the Leaky ReLU function is:
$$\text{Leaky ReLU}'(x) = \begin{cases}
1 & \text{if } x > 0 \\
\alpha & \text{if } x \leq 0
\end{cases}$$

#### Key Properties:
- **Advantages**:
  - Mitigates the "dying ReLU" problem.
  - Allows a small gradient when $x \leq 0$.
- **Disadvantages**:
  - The slope $\alpha$ for negative inputs is small, so gradients there are small too, which may still lead to slow convergence in some cases.
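
The same sketch extends to Leaky ReLU. This assumes NumPy and uses the example value $\alpha = 0.01$ from above as the default; the function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for x > 0, alpha otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))       # values: -0.02, 0.0, 3.0
print(leaky_relu_grad(x))  # values: 0.01, 0.01, 1.0
```

Unlike ReLU, the gradient for negative inputs is $\alpha$ rather than 0, so a neuron with negative pre-activations still receives a small update.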

### Softmax Function

#### Formula:
For a vector $z$ of length $K$:
$$\text{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

#### Derivative:
The partial derivative of the $i$-th Softmax output with respect to the $j$-th input $z_j$ is:
$$\frac{\partial \text{Softmax}(z_i)}{\partial z_j} = \text{Softmax}(z_i) \cdot (\delta_{ij} - \text{Softmax}(z_j))$$
where $\delta_{ij}$ is the Kronecker delta, which is 1 if $i = j$ and 0 otherwise.

#### Key Properties:
- **Advantages**:
  - Converts logits into probabilities, which are useful for classification tasks.
  - Ensures the sum of outputs is 1.
- **Disadvantages**:
  - Can cause gradient saturation for extreme values, leading to slow learning.
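
A minimal NumPy sketch of Softmax and its Jacobian follows. Subtracting the maximum logit before exponentiating is a common stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in numerator and denominator. The names `softmax` and `softmax_jacobian` are illustrative.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D vector of logits."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)  # avoids overflow in exp; result is unchanged
    e = np.exp(shifted)
    return e / e.sum()

def softmax_jacobian(z):
    """Matrix J with J[i, j] = s_i * (delta_ij - s_j), where s = softmax(z)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p, p.sum())           # probabilities that sum to 1
print(softmax_jacobian(z))  # each row sums to 0
```

Each row of the Jacobian sums to zero because the outputs are constrained to sum to 1.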

## Choosing an Activation Function

- The choice of activation function depends on the specific problem and the architecture of the neural network.
- For hidden layers, ReLU and its variants are commonly used.
- For output layers, sigmoid and softmax are widely used in binary and multiclass classification problems, respectively.

## Example: Feedforward Neural Network

Consider a simple feedforward neural network with one hidden layer. The forward pass involves:

1. Calculating the weighted sum of inputs for the hidden layer neurons.
2. Applying an activation function (e.g., ReLU) to the hidden layer outputs.
3. Calculating the weighted sum of hidden layer outputs for the output layer neurons.
4. Applying an activation function (e.g., softmax) to the output layer to obtain probabilities.

Mathematically, for a single hidden layer:

$$a^{(1)} = \text{ReLU}(W^{(1)} x + b^{(1)})$$
$$a^{(2)} = \text{Softmax}(W^{(2)} a^{(1)} + b^{(2)})$$

Where:
- $x$ is the input vector.
- $W^{(1)}, W^{(2)}$ are the weight matrices for the hidden and output layers, respectively.
- $b^{(1)}, b^{(2)}$ are the bias vectors for the hidden and output layers, respectively.
- $a^{(1)}, a^{(2)}$ are the activations of the hidden and output layers, respectively.
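
The two equations above can be sketched directly in NumPy. The layer sizes (4 inputs, 5 hidden units, 3 classes), the random weights, and the function names are assumptions made purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer (ReLU) and a softmax output layer."""
    a1 = relu(W1 @ x + b1)      # hidden activations a^(1)
    a2 = softmax(W2 @ a1 + b2)  # output probabilities a^(2)
    return a1, a2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # input vector
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # hidden layer parameters
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # output layer parameters

a1, a2 = forward(x, W1, b1, W2, b2)
print(a1)            # non-negative hidden activations
print(a2, a2.sum())  # class probabilities that sum to 1
```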

Understanding and correctly choosing activation functions is crucial for building effective neural networks. They directly impact how well the network learns from the data and generalizes to unseen data.
