## **Review of Activation Functions**

Activation functions are a crucial component of neural networks. An activation function transforms each neuron's weighted sum of inputs plus bias, introducing the non-linearity that lets the network learn and model complex patterns. Here, we review three widely used activation functions: ReLU, sigmoid, and softmax.

### **1. Rectified Linear Unit (ReLU)**

The ReLU activation function is defined as:

$$
f(x) = \max(0, x)
$$

#### Key Features:
- **Behavior**: Outputs the input directly if it's positive; otherwise, it outputs zero.
- **Advantages**:
  - Efficient computation.
  - Helps mitigate the vanishing gradient problem: the gradient is exactly 1 for positive inputs, so it does not shrink as it passes through active units.
- **Disadvantages**:
  - May suffer from the "dying ReLU" problem: a neuron whose pre-activation is negative for every input outputs zero, receives zero gradient, and stops learning.

#### Plot:
ReLU is a piecewise linear function with a sharp transition at zero.
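
As a minimal sketch of the definition above (the helper names `relu` and `relu_grad` are illustrative, not part of the notebook), ReLU and its gradient can be written with NumPy. Setting the gradient at exactly zero to 0 is a convention assumed here.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0, x)

def relu_grad(x):
    # 1 where x > 0, otherwise 0 (the value at x == 0 is a chosen convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs become 0, positive inputs pass through
print(relu_grad(x))  # 0 for non-positive inputs, 1 for positive inputs
```

The zero gradient for negative inputs is what makes the "dying ReLU" problem possible.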

---

### **2. Sigmoid**

The sigmoid activation function is given by:

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

#### Key Features:
- **Behavior**: Maps inputs to a range between 0 and 1.
- **Advantages**:
  - Useful for models where outputs need to represent probabilities.
  - Smooth gradients can aid optimization in shallow networks.
- **Disadvantages**:
  - Gradients can vanish for inputs that are far from zero.
  - Output saturation leads to slower convergence.

#### Plot:
The sigmoid function has an S-shaped curve, asymptoting at 0 and 1.
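
The saturation described above can be checked numerically. The sketch below uses the identity $ f'(x) = f(x)(1 - f(x)) $; the helper names `sigmoid` and `sigmoid_grad` are illustrative. The direct formula is fine for this moderate input range, though it may emit overflow warnings for very large negative inputs.

```python
import numpy as np

def sigmoid(x):
    # Direct implementation of 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative expressed through the function value: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # close to 0, exactly 0.5, close to 1
print(sigmoid_grad(x))  # peaks at 0.25 for x = 0, tiny at both extremes
```

The gradient never exceeds 0.25, and it is several orders of magnitude smaller at $ x = \pm 10 $, which is why stacked sigmoid layers tend to pass back very small gradients.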

---

### **3. Softmax**

The softmax function is often used for the output layer in classification problems. It is defined as:

$$
f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

#### Key Features:
- **Behavior**: Converts logits (raw scores) into probabilities that sum to 1.
- **Advantages**:
  - Ideal for multi-class classification tasks.
  - Outputs are interpretable as probabilities.
- **Disadvantages**:
  - Computationally expensive for large output spaces due to the exponential and summation operations.

#### Plot:
Unlike ReLU or sigmoid, the softmax function operates on vectors rather than scalar values, and its output is a probability distribution.

---

### **Comparison**

| Function    | Range                    | Applications               | Pros                         | Cons                                        |
|-------------|--------------------------|----------------------------|------------------------------|---------------------------------------------|
| **ReLU**    | [0, ∞)                   | Hidden layers in deep nets | Computationally efficient    | Dying ReLU problem                          |
| **Sigmoid** | (0, 1)                   | Binary classification      | Probabilistic output         | Vanishing gradients                         |
| **Softmax** | (0, 1), entries sum to 1 | Multi-class classification | Probabilistic interpretation | Computationally intensive for large outputs |

Activation functions play a pivotal role in enabling neural networks to learn intricate relationships in data. Choosing the right activation function is context-dependent and can significantly impact a model's performance.


## **Categorical Cross-Entropy Loss**

Categorical Cross-Entropy (CCE) is a widely used loss function for classification tasks, especially multi-class classification. It quantifies the difference between the predicted probability distribution (from the model) and the true distribution (ground truth labels).

---

### **1. Softmax for Probability Distribution**

To compute categorical cross-entropy, the network output is typically passed through a softmax activation function to convert logits into probabilities:

$$
f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

where:
- $ x_i $ is the $ i^{th} $ logit (raw output) from the model.
- $ \sum_{j} e^{x_j} $ is the sum of exponentials of all logits, normalizing the probabilities.
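
The code cells in this notebook subtract the largest logit before exponentiating. A short derivation shows why this leaves the result unchanged: for any constant $ c $,

$$
\frac{e^{x_i - c}}{\sum_{j} e^{x_j - c}} = \frac{e^{-c}\, e^{x_i}}{e^{-c} \sum_{j} e^{x_j}} = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
$$

Choosing $ c = \max_j x_j $ makes every exponent at most 0, so each exponential lies in $ (0, 1] $ and cannot overflow.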

---

### **2. Categorical Cross-Entropy Loss**

The categorical cross-entropy loss is defined as:

$$
\mathcal{L} = -\sum_{i} y_i \log(\hat{y}_i)
$$

where:
- $ y_i $ is the ground truth label for class $ i $ (one-hot encoded: 1 for the correct class and 0 for others).
- $ \hat{y}_i $ is the predicted probability for class $ i $, obtained from the softmax function.
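
As a worked example, assuming the natural logarithm, take a three-class sample with one-hot label $ y = (1, 0, 0) $ and predicted probabilities $ \hat{y} = (0.7, 0.1, 0.2) $:

$$
\mathcal{L} = -\left(1 \cdot \log 0.7 + 0 \cdot \log 0.1 + 0 \cdot \log 0.2\right) = -\log 0.7 \approx 0.357
$$

Only the term for the correct class contributes to the sum.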

---

### **Key Points:**

1. **Interpretation**:
   - The loss measures the difference between the true label distribution $ y_i $ and the predicted probabilities $ \hat{y}_i $.
   - With one-hot labels, only the correct class $ c $ has $ y_c = 1 $, so the loss simplifies to $ -\log(\hat{y}_c) $, which penalizes low-confidence predictions for the correct class.

2. **Applications**:
   - Commonly used in multi-class classification tasks.
   - Works well with neural networks where the output layer uses softmax activation.

3. **Advantages**:
   - Provides a probabilistic interpretation of predictions.
   - Directly optimizes for the correct class's probability.

4. **Disadvantage**:
   - Can encourage overfitting: the penalty keeps rewarding ever more confident predictions, and on imbalanced datasets the average loss is dominated by the majority classes.
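
To illustrate the logarithmic penalty from the interpretation point above, the sketch below evaluates $ -\log(\hat{y}_c) $ for a few hypothetical confidence values assigned to the correct class (the values are chosen for illustration only).

```python
import numpy as np

# Hypothetical predicted probabilities for the correct class
confidences = np.array([0.99, 0.9, 0.5, 0.1, 0.01])

# Per-sample loss when the label is one-hot: -log(p_correct)
losses = -np.log(confidences)

for p, loss in zip(confidences, losses):
    print(f"p(correct) = {p:4.2f} -> loss = {loss:.4f}")
```

The loss is close to zero for confident correct predictions and grows without bound as the confidence in the correct class approaches zero.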

---

### **Summary**

Categorical Cross-Entropy Loss is a cornerstone in machine learning classification tasks. Its synergy with softmax activation allows models to output interpretable probabilities, while the logarithmic penalty drives the model to predict with higher confidence for the correct class.


In [1]:
import numpy as np

""" Imagine we are trying to use a Neural Network (NN) to predict whether an image
is a cat (index 0), a dog (index 1), or a bird (index 2).
"""

output = [4, # cat
         2,  # dog
         3]  # bird

# To convert the outputs into probabilities we apply the Softmax activation function

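def softmax(logits):
    # Use the function argument (not the global `output`) and avoid shadowing the built-in `list`
    logits = np.asarray(logits, dtype=float)
    exp_values = np.exp(logits - np.max(logits)) # Subtract max for numerical stability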
    return exp_values / np.sum(exp_values)

softmax_output = softmax(output)
print(softmax_output)

[0.66524096 0.09003057 0.24472847]


In [2]:
# Using a Class

class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))  # Stabilize
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        # Store the result on the instance and return it
        self.output = probabilities
        return probabilities




inputs = np.array([[4.0, 2.0, 3.0], 
                   [0.5, 2.5, 1.0]])  # Example batch input

activation_softmax = Activation_Softmax()  # Create Softmax Activation
softmax_output = activation_softmax.forward(inputs)  # Compute softmax
print("Softmax output:\n", softmax_output)



Softmax output:
 [[0.66524096 0.09003057 0.24472847]
 [0.09962365 0.73612472 0.16425163]]


In [3]:
# Compute the CCE Loss

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 1, 0]])

# The same targets as sparse class indices (one integer per sample)

class_targets2 = np.array([0,1,1])

# Pick out the predicted probability of the correct class.
# 1-D targets are class indices; 2-D targets are one-hot rows.
if len(class_targets.shape) == 1:
    correct_confidences = softmax_outputs[
        range(len(softmax_outputs)),
        class_targets
    ]
    
elif len(class_targets.shape) == 2:
    correct_confidences = np.sum(
        softmax_outputs * class_targets,
        axis=1
    )

neg_log = -np.log(correct_confidences)
average_loss = np.mean(neg_log)
print(average_loss)

0.38506088005216804


In [4]:
class Loss:

    # Calculates the mean data loss
    # given model output and ground truth values
    def calculate(self, output, y):

        # Calculate sample losses
        sample_losses = self.forward(output, y)
        print("Sample losses: ", sample_losses)

        # Calculate mean loss
        data_loss = np.mean(sample_losses)

        # Return loss
        return data_loss

class Loss_CategoricalCrossentropy(Loss):

    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = len(y_pred)

        # Clip predictions to avoid log(0), which would give an infinite loss
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)


        # Probabilities for target values -
        # only if labels are sparse class indices (1-D)
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]

        # Mask values - only for one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(
                y_pred_clipped*y_true,
                axis=1
            )

        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)
        return negative_log_likelihoods

loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(softmax_outputs, class_targets)
print(loss)

Sample losses:  [0.35667494 0.69314718 0.10536052]
0.38506088005216804
