# Activation Functions in Neural Networks

Activation functions introduce non-linearity into the network, enabling it to learn and model complex patterns. Below is an overview of common activation functions:


## 1. **Sigmoid Activation**
- **Equation**: 
  $$
  \sigma(x) = \frac{1}{1 + e^{-x}}
  $$

### Detailed Explanation:
1. **Function Behavior**:
   - As x approaches positive infinity, σ(x) approaches 1
   - As x approaches negative infinity, σ(x) approaches 0
   - At x = 0, σ(x) = 0.5

2. **S-shaped Curve**:
   - The sigmoid function produces an S-shaped curve
   - This shape allows for smooth transitions between 0 and 1

3. **Derivative**:
   - The derivative of the sigmoid function is:
     $$
     \frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))
     $$
   - Because the derivative reuses σ(x), which the forward pass has already computed, backpropagation needs only one extra subtraction and multiplication

4. **Symmetry**:
   - The sigmoid function is point-symmetric about (0, 0.5), that is, σ(-x) = 1 - σ(x)

5. **Saturation**:
   - For very large positive or negative inputs, the function saturates
   - This can lead to the vanishing gradient problem in deep networks

### Examples:
1. **Input values and corresponding outputs**:
   - σ(-5) ≈ 0.0067 (very close to 0)
   - σ(-2) ≈ 0.1192
   - σ(-1) ≈ 0.2689
   - σ(0) = 0.5 (exactly)
   - σ(1) ≈ 0.7311
   - σ(2) ≈ 0.8808
   - σ(5) ≈ 0.9933 (very close to 1)

2. **Derivative examples**:
   - At x = 0: σ'(0) = σ(0) * (1 - σ(0)) = 0.5 * 0.5 = 0.25 (maximum value)
   - At x = 2: σ'(2) ≈ 0.8808 * (1 - 0.8808) ≈ 0.1050
   - At x = 5: σ'(5) ≈ 0.9933 * (1 - 0.9933) ≈ 0.0066 (very small, illustrating saturation)

3. **Practical example in binary classification**:
   - Input: x = 2.5 (high positive value)
   - Output: σ(2.5) ≈ 0.9241
   - Interpretation: the model assigns a probability of about 92.41% to the positive class

4. **Symmetry example**:
   - σ(1) ≈ 0.7311
   - σ(-1) ≈ 0.2689
   - Note: 0.7311 + 0.2689 = 1, demonstrating symmetry around 0.5
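
The values listed above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names `sigmoid` and `sigmoid_derivative` are illustrative choices, and the branch on the sign of `x` is one common way (assumed here, not taken from the text) to keep `math.exp` from overflowing for large negative inputs.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^-x), arranged so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x use the algebraically equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_derivative(x: float) -> float:
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

if __name__ == "__main__":
    for x in (-5, -2, -1, 0, 1, 2, 5):
        print(f"x={x:>2}  sigmoid={sigmoid(x):.4f}  "
              f"derivative={sigmoid_derivative(x):.4f}")
```

Summing `sigmoid(x)` and `sigmoid(-x)` for any `x` should give 1 up to floating-point rounding, which is the symmetry property in numeric form.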


### Detailed Proof of the Sigmoid Function Derivative with Explanations

We aim to prove that the derivative of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$.

1. Begin with the sigmoid function:
   $$\sigma(x) = \frac{1}{1 + e^{-x}}$$
   
   Explanation: This is our starting point, the standard definition of the sigmoid function. It's crucial to clearly state the function we're differentiating.

2. We'll use the quotient rule to find the derivative. The quotient rule states:
   For $f(x) = \frac{u(x)}{v(x)}$, the derivative is:
   $$f'(x) = \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2}$$
   
   Explanation: We choose the quotient rule because the sigmoid function is a fraction. This rule allows us to differentiate complex fractions by breaking them down into simpler parts.

3. In our case:
   $u(x) = 1$ (constant function)
   $v(x) = 1 + e^{-x}$
   
   Explanation: We split the sigmoid function into numerator and denominator to apply the quotient rule. This decomposition simplifies our calculation process.

4. Calculate the derivatives of $u(x)$ and $v(x)$:
   $u'(x) = 0$ (derivative of a constant is 0)
   $v'(x) = -e^{-x}$ (using the chain rule)
   
   Explanation: We find the derivatives of the numerator and denominator separately. The numerator's derivative is straightforward as it's a constant. For the denominator, we apply the chain rule, recognizing that the derivative of $e^{-x}$ is $-e^{-x}$.

5. Apply the quotient rule:
   $$\begin{align*}
   \sigma'(x) &= \frac{u'(x)v(x) - u(x)v'(x)}{[v(x)]^2} \\[2ex]
   &= \frac{0 \cdot (1 + e^{-x}) - 1 \cdot (-e^{-x})}{(1 + e^{-x})^2} \\[2ex]
   &= \frac{e^{-x}}{(1 + e^{-x})^2}
   \end{align*}$$
   
   Explanation: We substitute the values into the quotient rule formula. The first term becomes zero due to $u'(x) = 0$, simplifying our expression. This step is crucial as it leads us to a more manageable form of the derivative.

6. Manipulate the expression:
   $$\begin{align*}
   \sigma'(x) &= \frac{e^{-x}}{(1 + e^{-x})^2} \\[2ex]
   &= \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\[2ex]
   &= \sigma(x) \cdot \frac{e^{-x}}{1 + e^{-x}} \\[2ex]
   &= \sigma(x) \cdot \left(1 - \frac{1}{1 + e^{-x}}\right) \\[2ex]
   &= \sigma(x) \cdot (1 - \sigma(x))
   \end{align*}$$
   
   Explanation: We first split the fraction and recognize $\frac{1}{1 + e^{-x}}$ as the original sigmoid function. We then rewrite the second factor using $\frac{e^{-x}}{1 + e^{-x}} = \frac{(1 + e^{-x}) - 1}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}}$, which expresses the entire derivative in terms of the sigmoid function itself.

Thus, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. The derivative is expressed entirely in terms of the function itself, so it can be computed from a value the forward pass has already produced. This property contributes to the sigmoid's widespread use in neural networks and logistic regression.
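
As a sanity check on the derivation, the identity can be compared against a numerical derivative. The sketch below uses a central difference with a step size of `1e-5`; the step size and the helper names `central_difference` and `max_identity_error` are assumptions made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Plain sigmoid; adequate for the moderate inputs used in this check."""
    return 1.0 / (1.0 + math.exp(-x))

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Approximate f'(x) with the symmetric quotient (f(x+h) - f(x-h)) / 2h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def max_identity_error(points) -> float:
    """Largest gap between the numerical derivative and sigma * (1 - sigma)."""
    worst = 0.0
    for x in points:
        analytic = sigmoid(x) * (1.0 - sigmoid(x))
        numeric = central_difference(sigmoid, x)
        worst = max(worst, abs(analytic - numeric))
    return worst

if __name__ == "__main__":
    print(max_identity_error([-5.0, -2.0, 0.0, 1.0, 3.0]))
```

A very small printed error is evidence that the closed form and the limit definition agree at the sampled points; it does not replace the proof.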

## 2. **Tanh (Hyperbolic Tangent) Activation**
- **Equation**: 
  $$
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  $$
- **Range**: (-1, 1)
- **Usage**: Hidden layers in simple neural networks.
- **Characteristics**: 
  - Similar to sigmoid but zero-centered, with outputs between -1 and 1. Its derivative, $1 - \tanh^2(x)$, peaks at 1 (versus 0.25 for sigmoid), which gives stronger gradients near zero.
  - **Issues**: Still susceptible to the vanishing gradient problem.
- **Use Case**: Often used in hidden layers before ReLU became common.
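
The similarity to sigmoid can be made concrete: tanh is a rescaled and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$. The following sketch checks this relation against Python's built-in `math.tanh` and evaluates the derivative $1 - \tanh^2(x)$; the helper names are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a rescaled, shifted sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_derivative(x: float) -> float:
    """d/dx tanh(x) = 1 - tanh(x)^2, which peaks at 1 when x = 0."""
    t = math.tanh(x)
    return 1.0 - t * t

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.7, 3.0):
        print(x, math.tanh(x), tanh_via_sigmoid(x), tanh_derivative(x))
```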


## 3. **ReLU (Rectified Linear Unit)**

### Definition
The Rectified Linear Unit (ReLU) is defined as:

$$
f(x) = \max(0, x) = 
\begin{cases} 
0 & \text{for } x < 0 \\
x & \text{for } x \geq 0
\end{cases}
$$

### Properties
1. **Non-linearity**: Despite its piecewise linear form, ReLU introduces non-linearity into neural networks, allowing them to learn complex patterns.
2. **Sparsity**: ReLU naturally creates sparse representations by setting negative inputs to zero.
3. **Range**: $[0, \infty)$
4. **Non-differentiable at x = 0**: The function has a "kink" at x = 0, making it non-differentiable at this point.

### Advantages
1. **Computational Efficiency**: ReLU is simple to compute, requiring only a max operation.
2. **Gradient Propagation**: For positive inputs, the gradient is always 1, helping to mitigate the vanishing gradient problem.
3. **Sparse Activation**: By outputting 0 for negative inputs, ReLU can lead to sparse representations, which can be beneficial for some tasks.

### Disadvantages
1. **Dead Neurons**: A neuron whose input is negative for every training example outputs 0 and receives zero gradient, so its weights stop updating and it can "die", leaving part of the network inactive.
2. **Unbounded Output**: For large positive inputs, ReLU can produce arbitrarily large values, potentially leading to numerical instability.

### Formal Proof of Derivative
We will prove that the derivative of ReLU is:

$$
f'(x) = 
\begin{cases} 
0 & \text{for } x < 0 \\
1 & \text{for } x > 0 \\
\text{undefined} & \text{for } x = 0
\end{cases}
$$

**Proof:**

1) For $x < 0$ (taking $|h| < |x|$, so that $x + h < 0$ as well):
   $$
   \begin{align*}
   f'(x) &= \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \\
   &= \lim_{h \to 0} \frac{\max(0, x+h) - \max(0, x)}{h} \\
   &= \lim_{h \to 0} \frac{0 - 0}{h} = 0
   \end{align*}
   $$

2) For $x > 0$ (taking $|h| < |x|$, so that $x + h > 0$ as well):
   $$
   \begin{align*}
   f'(x) &= \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \\
   &= \lim_{h \to 0} \frac{\max(0, x+h) - \max(0, x)}{h} \\
   &= \lim_{h \to 0} \frac{(x+h) - x}{h} = 1
   \end{align*}
   $$

3) For $x = 0$:
   The left-hand and right-hand limits differ:
   $$
   \begin{align*}
   \lim_{h \to 0^-} \frac{f(0+h) - f(0)}{h} &= \lim_{h \to 0^-} \frac{0 - 0}{h} = 0 \\
   \lim_{h \to 0^+} \frac{f(0+h) - f(0)}{h} &= \lim_{h \to 0^+} \frac{h - 0}{h} = 1
   \end{align*}
   $$
   Therefore, the derivative is undefined at x = 0.
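
The three cases of the proof can be illustrated numerically with difference quotients. This is a sketch, not part of the proof: the step size `1e-6` is arbitrary, and `relu_grad` adopts the convention $f'(0) = 0$, which is an assumption made only so the function returns a value at the kink.

```python
def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)

def difference_quotient(f, x: float, h: float) -> float:
    """(f(x + h) - f(x)) / h for a given non-zero step h."""
    return (f(x + h) - f(x)) / h

def relu_grad(x: float) -> float:
    """Derivative of ReLU, using the convention f'(0) = 0 (an assumption)."""
    return 1.0 if x > 0 else 0.0

h = 1e-6
left = difference_quotient(relu, 0.0, -h)   # approach 0 from the left
right = difference_quotient(relu, 0.0, h)   # approach 0 from the right

if __name__ == "__main__":
    print("left quotient at 0:", left)
    print("right quotient at 0:", right)
```

The left and right quotients at 0 disagree no matter how small `h` is, which is the numerical counterpart of the two one-sided limits in case 3.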

### Practical Considerations
In practice, the derivative at x = 0 is often defined as either 0 or 1 for computational purposes. Some implementations use a "leaky ReLU" to address the dead neuron problem:

$$
f(x) = 
\begin{cases} 
\alpha x & \text{for } x < 0 \\
x & \text{for } x \geq 0
\end{cases}
$$

where $\alpha$ is a small positive constant (e.g., 0.01).

### Use in Deep Learning
ReLU is widely used in hidden layers of deep neural networks due to its simplicity and effectiveness in addressing the vanishing gradient problem. It has largely replaced earlier activation functions like sigmoid and tanh in many applications, especially in deep convolutional neural networks for image processing tasks.

## 4. **Leaky ReLU**
- **Equation**: 
  $$
  f(x) = 
  \begin{cases} 
  x & \text{for } x > 0 \\
  \alpha x & \text{for } x \leq 0
  \end{cases}
  $$
  where $\alpha$ is a small positive constant (e.g., 0.01).
- **Range**: (-∞, ∞)
- **Usage**: Variant of ReLU to address dead neuron problems.
- **Characteristics**:
  - Allows small non-zero gradients for negative values.
  - Reduces the risk of neurons becoming permanently inactive (dead), because the gradient for negative inputs is α rather than 0.
- **Use Case**: Used to prevent dead neurons in deeper networks.
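
A minimal sketch of Leaky ReLU and its gradient, using the example value α = 0.01 from above as the default. The function names are illustrative, and returning α as the gradient at exactly x = 0 is a convention chosen here, not something the definition fixes.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient: 1 for positive inputs, alpha otherwise (convention at x = 0)."""
    return 1.0 if x > 0 else alpha

if __name__ == "__main__":
    for x in (-100.0, -2.0, 0.0, 3.0):
        print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the gradient never drops to exactly zero for negative inputs, a unit that is currently inactive can still receive weight updates.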


## 5. **Softmax Activation Function**

### Definition
The Softmax function is defined as:

$$
f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^K e^{x_j}}
$$

where $x_i$ is the input to the softmax for class $i$, and $K$ is the number of classes.

### Properties
1. **Normalization**: Softmax outputs sum to 1, creating a probability distribution.
2. **Range**: (0, 1) for each output.
3. **Differentiable**: Smooth and differentiable everywhere.
4. **Monotonicity**: Preserves the order of input values: if $x_i > x_j$, then $f(x_i) > f(x_j)$.

### Advantages
1. **Probability Interpretation**: Outputs can be interpreted as probabilities.
2. **Multi-class Classification**: Ideal for problems with mutually exclusive classes.
3. **Differentiable**: Allows for effective gradient-based learning.

### Disadvantages
1. **Computational Cost**: Exponential operations can be expensive.
2. **Numerical Stability**: Can suffer from overflow/underflow for large inputs.

### Formal Proof of Derivative
We will prove that the derivative of Softmax is:

$$
\frac{\partial f(x_i)}{\partial x_j} = 
\begin{cases}
f(x_i)(1 - f(x_i)) & \text{if } i = j \\
-f(x_i)f(x_j) & \text{if } i \neq j
\end{cases}
$$

**Proof:**

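Let $S = \sum_{k=1}^K e^{x_k}$, so that $\frac{\partial S}{\partial x_j} = e^{x_j}$ for every $j$. Both cases below apply the quotient rule to $\frac{e^{x_i}}{S}$.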

1) For $i = j$:
   $$
   \begin{align*}
   \frac{\partial f(x_i)}{\partial x_i} &= \frac{\partial}{\partial x_i} \left(\frac{e^{x_i}}{S}\right) \\
   &= \frac{e^{x_i} \cdot S - e^{x_i} \cdot e^{x_i}}{S^2} \\
   &= \frac{e^{x_i}}{S} - \left(\frac{e^{x_i}}{S}\right)^2 \\
   &= f(x_i) - (f(x_i))^2 \\
   &= f(x_i)(1 - f(x_i))
   \end{align*}
   $$

2) For $i \neq j$:
   $$
   \begin{align*}
   \frac{\partial f(x_i)}{\partial x_j} &= \frac{\partial}{\partial x_j} \left(\frac{e^{x_i}}{S}\right) \\
   &= -\frac{e^{x_i} \cdot e^{x_j}}{S^2} \\
   &= -\frac{e^{x_i}}{S} \cdot \frac{e^{x_j}}{S} \\
   &= -f(x_i)f(x_j)
   \end{align*}
   $$
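
Both cases can be combined into a single Jacobian matrix with entries $f(x_i)(\delta_{ij} - f(x_j))$. The sketch below builds that matrix and compares it with a central-difference approximation; the helper names and the step size `1e-6` are assumptions for illustration.

```python
import math

def softmax(xs):
    """Softmax exactly as in the definition (fine for small inputs)."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(xs):
    """J[i][j] = f_i * (1 - f_i) if i == j, else -f_i * f_j."""
    f = softmax(xs)
    k = len(f)
    return [[f[i] * ((1.0 if i == j else 0.0) - f[j]) for j in range(k)]
            for i in range(k)]

def numerical_jacobian(xs, h: float = 1e-6):
    """Central-difference estimate of d f(x_i) / d x_j."""
    k = len(xs)
    jac = [[0.0] * k for _ in range(k)]
    for j in range(k):
        up, down = list(xs), list(xs)
        up[j] += h
        down[j] -= h
        f_up, f_down = softmax(up), softmax(down)
        for i in range(k):
            jac[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return jac

if __name__ == "__main__":
    x = [1.0, 2.0, 0.5]
    for row_a, row_n in zip(softmax_jacobian(x), numerical_jacobian(x)):
        print(row_a, row_n)
```

Each row of the Jacobian sums to zero up to rounding, because the outputs always sum to 1 and so cannot all increase together.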

### Practical Considerations
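1. **Numerical Stability**: To prevent overflow, it's common to subtract the maximum value from all inputs. The result is unchanged because the common factor $e^{-\max(x)}$ cancels between numerator and denominator: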
   $$
   f(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^K e^{x_j - \max(x)}}
   $$

2. **Cross-Entropy Loss**: Softmax is often used with cross-entropy loss for classification tasks:
   $$
   L = -\sum_{i=1}^K y_i \log(f(x_i))
   $$
   where $y_i$ is the true label (0 or 1) for class $i$.
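
The two considerations above can be sketched together in plain Python. The names `stable_softmax` and `cross_entropy` are illustrative, and the `eps` floor inside the logarithm is an added assumption to avoid `log(0)`, not part of the formula above.

```python
import math

def stable_softmax(xs):
    """Softmax with the maximum subtracted so every exponent is <= 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot, eps: float = 1e-12):
    """L = -sum_i y_i * log(p_i); eps guards against log(0) (an assumption)."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))

if __name__ == "__main__":
    logits = [1000.0, 1001.0, 1002.0]   # math.exp(1000.0) alone would overflow
    probs = stable_softmax(logits)
    print(probs)
    print(cross_entropy(probs, [0, 0, 1]))
```

With these logits, calling `math.exp` on the raw values raises `OverflowError`, while the shifted version gives the same probabilities as the inputs `[0, 1, 2]`.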

### Use in Deep Learning
Softmax is primarily used in the output layer of neural networks for multi-class classification problems. It's particularly effective when classes are mutually exclusive, such as in image classification or natural language processing tasks like sentiment analysis or language identification.

## 6. **ELU (Exponential Linear Unit)**
- **Equation**: 
  $$
  f(x) = 
  \begin{cases} 
    x & \text{if } x > 0 \\
    \alpha(e^x - 1) & \text{if } x \leq 0 
  \end{cases}
  $$
- **Range**: (-α, ∞), assuming α > 0
- **Usage**: Alternative to ReLU to avoid dead neurons.
- **Characteristics**:
  - Reduces bias shift toward positive values.
  - Avoids the dead neuron problem.
  - **Issues**: More computationally expensive due to the exponential operation.
- **Use Case**: Used in deep networks where dead ReLU units are a concern.
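
A minimal sketch of ELU and its gradient. The default `alpha=1.0` is an assumption (the text does not fix a value), and `math.expm1` is used for $e^x - 1$ because it stays accurate when `x` is close to zero.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient: 1 for positive inputs, alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

if __name__ == "__main__":
    for x in (-10.0, -1.0, 0.0, 2.0):
        print(x, elu(x), elu_grad(x))
```

For strongly negative inputs the output levels off near `-alpha`, which is where the lower end of the range comes from.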


## 7. **Swish**
- **Equation**: 
  $$
  f(x) = x \cdot \sigma(x)
  $$
  where $\sigma(x)$ is the sigmoid function.
- **Range**: approximately [-0.278, ∞); the function has a global minimum of about -0.278 near x ≈ -1.28
- **Usage**: A newer activation function shown to perform better in some deep learning models.
- **Characteristics**:
  - Smooth and non-monotonic; unlike ReLU, it lets small negative values through, so most negative inputs still receive a non-zero gradient.
  - Found to outperform ReLU in some cases.
- **Use Case**: Used in some deep architectures, for example for image classification.
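
To see how Swish behaves for negative inputs, the sketch below evaluates it on an evenly spaced grid and reports where it is smallest. The grid bounds, the number of steps, and the helper `grid_minimum` are arbitrary choices made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Sigmoid arranged so exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x: float) -> float:
    """Swish: x * sigma(x)."""
    return x * sigmoid(x)

def grid_minimum(f, lo: float, hi: float, steps: int):
    """Return (x, f(x)) at the smallest value of f on an evenly spaced grid."""
    best_x, best_y = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

x_min, y_min = grid_minimum(swish, -5.0, 0.0, 5000)

if __name__ == "__main__":
    print(f"smallest value on the grid: {y_min:.4f} at x = {x_min:.3f}")
```

On this grid the smallest value is close to -0.278, reached near x ≈ -1.28; for large positive inputs `swish(x)` approaches `x`, much like ReLU.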


## 8. **ReLU6**
- **Equation**: 
  $$
  f(x) = \min(\max(0, x), 6)
  $$
- **Range**: [0, 6]
- **Usage**: Common in mobile-friendly models like MobileNet.
- **Characteristics**:
  - Caps ReLU at 6, which bounds the activations and keeps them representable in low-precision (for example fixed-point) arithmetic.
  - Improves model robustness.
- **Use Case**: Used in mobile and edge-device models.
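
ReLU6 is a clamp of the input to the interval [0, 6]. A minimal sketch, with illustrative function names and the gradient at the two kinks (x = 0 and x = 6) set to 0 by convention:

```python
def relu6(x: float) -> float:
    """ReLU6: clamp x to the interval [0, 6]."""
    return min(max(0.0, x), 6.0)

def relu6_grad(x: float) -> float:
    """Gradient: 1 strictly inside (0, 6), 0 elsewhere (convention at kinks)."""
    return 1.0 if 0.0 < x < 6.0 else 0.0

if __name__ == "__main__":
    for x in (-3.0, 0.0, 2.5, 6.0, 40.0):
        print(x, relu6(x), relu6_grad(x))
```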

## Summary Table of Activation Functions

| **Activation Function** | **Equation**                               | **Output Range**  | **Common Usage**              |
|------------------------|--------------------------------------------|------------------|-------------------------------|
| **Sigmoid**             | $\sigma(x) = \frac{1}{1 + e^{-x}}$         | (0, 1)           | Binary classification         |
| **Tanh**                | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | (-1, 1)          | Hidden layers in neural nets  |
| **ReLU**                | $f(x) = \max(0, x)$                        | [0, ∞)           | Deep neural networks          |
| **Leaky ReLU**          | $f(x) = \alpha x \, \text{if} \, x < 0 \, \text{else} \, x$ | (-∞, ∞) | Avoiding dead neurons         |
| **Softmax**             | $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$  | (0, 1)           | Multi-class classification    |
| **ELU**                 | $f(x) = \alpha(e^x - 1) \, \text{if} \, x \leq 0, \, x \, \text{otherwise}$ | (-α, ∞) | Deep networks for robust learning |
| **Swish**               | $f(x) = x \cdot \sigma(x)$                 | ≈ [-0.278, ∞)    | Advanced architectures        |
| **ReLU6**               | $f(x) = \min(\max(0, x), 6)$               | [0, 6]           | Mobile and embedded models    |


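As a worked example, consider a four-layer network that applies ReLU in its first layer, sigmoid in the next two layers, and softmax at the output. The final equation of this network is: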

$$
y = \text{softmax}(W_3 \cdot \sigma(W_2 \cdot \sigma(W_1 \cdot \max(0, W_0x + b_0) + b_1) + b_2) + b_3)
$$

where:
- $y$ is the output vector
- $x$ is the input vector
- $W_i$ and $b_i$ are the weights and biases for layer $i$
- $\max(0, z)$ is the ReLU function
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the Sigmoid function
- $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ is the Softmax function

This equation represents the forward pass through the entire network, from input to output.

## Proof of the Final Equation for the Multilayer Network

Let's derive the final equation step by step, starting from the input and moving through each layer.

Given:
- Input vector: $x$
- Weights and biases for each layer: $W_i$ and $b_i$
- ReLU activation: $f_{\text{ReLU}}(z) = \max(0, z)$
- Sigmoid activation: $\sigma(z) = \frac{1}{1 + e^{-z}}$
- Softmax activation: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$

### Step 1: Layer 0 (ReLU)
The output of the first layer is:
$$a_0 = f_{\text{ReLU}}(W_0x + b_0) = \max(0, W_0x + b_0)$$

### Step 2: Layer 1 (Sigmoid)
The output of the second layer is:
$$a_1 = \sigma(W_1a_0 + b_1)$$

Substituting $a_0$ from Step 1:
$$a_1 = \sigma(W_1 \cdot \max(0, W_0x + b_0) + b_1)$$

### Step 3: Layer 2 (Sigmoid)
The output of the third layer is:
$$a_2 = \sigma(W_2a_1 + b_2)$$

Substituting $a_1$ from Step 2:
$$a_2 = \sigma(W_2 \cdot \sigma(W_1 \cdot \max(0, W_0x + b_0) + b_1) + b_2)$$

### Step 4: Layer 3 (Softmax)
The final output of the network is:
$$y = \text{softmax}(W_3a_2 + b_3)$$

Substituting $a_2$ from Step 3:
$$y = \text{softmax}(W_3 \cdot \sigma(W_2 \cdot \sigma(W_1 \cdot \max(0, W_0x + b_0) + b_1) + b_2) + b_3)$$

This is our final equation, representing the complete forward pass through the network.

### Conclusion
We have derived the final equation by composing the functions for each layer, starting from the input and moving through each activation. The nested structure of the equation reflects the layered architecture of the neural network, with each activation function applied in sequence.
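
The composed equation can be written as a short forward pass. The sketch below uses plain Python lists; the layer sizes, the weight and bias values in `params`, and the input `x` are hypothetical numbers chosen only to make the example runnable.

```python
import math

def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(z):
    return [max(0.0, v) for v in z]

def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def softmax(z):
    m = max(z)                       # shift by the maximum for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    """y = softmax(W3 s(W2 s(W1 relu(W0 x + b0) + b1) + b2) + b3)."""
    (W0, b0), (W1, b1), (W2, b2), (W3, b3) = params
    a0 = relu(affine(W0, x, b0))       # Step 1: Layer 0 (ReLU)
    a1 = sigmoid(affine(W1, a0, b1))   # Step 2: Layer 1 (Sigmoid)
    a2 = sigmoid(affine(W2, a1, b2))   # Step 3: Layer 2 (Sigmoid)
    return softmax(affine(W3, a2, b3)) # Step 4: Layer 3 (Softmax)

# Hypothetical parameters: 2 inputs -> 3 -> 2 -> 2 -> 3 output classes.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.0, 0.0]),
    ([[0.6, -0.1], [-0.3, 0.9]], [0.05, -0.05]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.7, 0.2]], [0.0, 0.0, 0.0]),
]
x = [1.0, 2.0]
y = forward(x, params)

if __name__ == "__main__":
    print(y, sum(y))
```

Each line of `forward` corresponds to one step of the derivation, and the output `y` is a probability vector with one entry per class.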