### Lecture Notes: Loss Functions in Deep Learning

#### 1. Introduction to Loss Functions
*   **Definition**: A loss function is a mechanism for evaluating how well an algorithm's model is performing on a given dataset. It helps determine if the algorithm is working well or poorly.
*   **Nature**: Loss functions are mathematical functions that depend on the parameters of the machine learning algorithm.
*   **Performance Indication**:
    *   A **large output value** from the loss function means the algorithm is performing poorly.
    *   A **small output value** from the loss function means the algorithm is performing well.
*   **Adjusting Parameters**: By adjusting the algorithm's parameters (e.g., weights and biases in a neural network), the value of the loss function changes. The goal is to find parameter values that minimise this loss function.

#### 2. Importance of Loss Functions
*   **The "You Can't Improve What You Can't Measure" Principle**: Loss functions are crucial because they quantify the error or "mistake" of the model, providing a measurable metric for improvement.
*   **Guiding the Algorithm**: Loss functions act as the "eye of the algorithm". They tell the machine learning algorithm:
    *   Where to go.
    *   What parameter values to set.
    *   How much to adjust parameters to reduce error.
*   **Role in Deep Learning Model Training (Backpropagation)**:
    *   **Forward Propagation**: An input (e.g., student's CGPA and IQ) is fed into the neural network with initial random weights and biases. The network makes a prediction (output).
    *   **Loss Calculation**: A chosen loss function calculates the error between the predicted output and the true value for that single input.
    *   **Backpropagation & Gradient Descent**: Backpropagation computes the gradient of the loss with respect to every weight and bias, and **gradient descent** uses those gradients to adjust the weights and biases of the neural network.
    *   **Iteration**: This process is repeated for multiple inputs (or batches of inputs) across the entire dataset for multiple epochs.
    *   **Goal**: The ultimate aim is to minimise the loss function's value. A low loss on the training data indicates that the learned weights and biases fit that data well; performance on unseen data still has to be checked separately.
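
The training steps above can be sketched with a single linear neuron trained by gradient descent on MSE. This is a minimal illustration, not a full neural network: the function name `train_linear_neuron`, the toy data, the learning rate, and the epoch count are all assumptions made for the example.

```python
import numpy as np

def train_linear_neuron(X, y, lr=0.1, epochs=200, seed=0):
    """Full-batch gradient descent on MSE for one linear neuron (illustrative)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])   # initial random weights
    b = 0.0                           # initial bias
    history = []
    for _ in range(epochs):
        y_pred = X @ w + b                   # forward propagation
        error = y_pred - y
        history.append(np.mean(error ** 2))  # loss calculation (MSE over the batch)
        grad_w = 2.0 * X.T @ error / len(y)  # gradient of the cost w.r.t. weights
        grad_b = 2.0 * np.mean(error)        # gradient of the cost w.r.t. bias
        w -= lr * grad_w                     # gradient descent update
        b -= lr * grad_b
    return w, b, history

# Hypothetical toy data: two input features scaled to [0, 1],
# with targets generated by y = 1 + x0 + 2 * x1.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1.0, 3.0, 2.0, 4.0])

w, b, history = train_linear_neuron(X, y)
print("first loss:", history[0], "last loss:", history[-1])
```

The recorded `history` shows the loss shrinking as the parameters are adjusted, which is the behaviour the iteration step relies on.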

#### 3. Loss Function vs. Cost Function
*   **Loss Function**:
    *   Calculated for a **single training example**.
    *   Measures the error for one prediction.
    *   Sometimes referred to as an "error function".
*   **Cost Function**:
    *   Calculated over an **entire batch** or the **entire training dataset**.
    *   It is typically the average of the loss functions across all training examples in a batch or the dataset.
    *   For example, in Mean Squared Error (MSE), the loss for a single point is `(y_true - y_pred)^2`, while the cost for `N` points is `(1/N) * sum((y_true_i - y_pred_i)^2)`.
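
The distinction can be made concrete with a few numbers. The sketch below uses made-up values for `y_true` and `y_pred`; it computes the squared-error loss for each example and then averages them to get the cost.

```python
import numpy as np

# Hypothetical targets and predictions for three training examples.
y_true = np.array([3.0, 5.0, 8.0])
y_pred = np.array([2.0, 5.0, 11.0])

# Loss: one value per training example.
per_example_loss = (y_true - y_pred) ** 2   # [1.0, 0.0, 9.0]

# Cost: the average of the per-example losses over the batch.
cost = per_example_loss.mean()              # 10 / 3

print(per_example_loss, cost)
```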

#### 4. Types of Loss Functions in Deep Learning
Loss functions vary depending on the type of problem being solved.

*   **For Regression Problems**:
    *   **Mean Squared Error (MSE)**
    *   **Mean Absolute Error (MAE)**
    *   **Huber Loss**
*   **For Classification Problems**:
    *   **Binary Cross Entropy** (for binary classification)
    *   **Categorical Cross Entropy** (for multi-class classification)
    *   **Hinge Loss** (often used in Support Vector Machines)
    *   **Sparse Categorical Cross Entropy** (for multi-class classification with integer-encoded labels)
*   **For Other Problem Types**:
    *   **Autoencoders**: KL Divergence
    *   **Generative Adversarial Networks (GANs)**: Discriminator Loss, Minimax GAN Loss
    *   **Object Detection**: Focal Loss
    *   **Embeddings**: Triplet Loss
*   **Custom Loss Functions**: Researchers can create their own loss functions, and frameworks like Keras allow for the implementation of custom loss functions.

#### 5. Detailed Discussion of Key Loss Functions

##### 5.1. Mean Squared Error (MSE)
*   **Also Known As**: Squared Loss, L2 Loss.
*   **Application**: Commonly used in **regression problems**.
*   **Formula (for a single data point)**: `(y_true - y_pred)^2`.
    *   `y_true`: The actual (true) value.
    *   `y_pred`: The predicted value by the model.
*   **Why Squaring?**
    *   Ensures all error values are positive, preventing positive and negative errors from cancelling each other out when summed.
    *   **Magnifies larger errors**: Errors are penalised quadratically; an error of 2 units becomes 4 units squared, and an error of 4 units becomes 16 units squared. This means points further from the true value have a disproportionately higher impact on the loss.
    *   **Impact on Weight Updates**: Outlier points with large errors cause more drastic updates to weights and biases, guiding the model strongly towards these points.
*   **Advantages**:
    *   **Easy to interpret**.
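    *   **Differentiable everywhere**: As a function of the prediction, the loss is smooth and convex, which suits gradient descent algorithms.
    *   **Single global minimum (for linear models)**: In linear regression the MSE cost is convex in the weights and biases, so gradient descent cannot get stuck in a local minimum. In a deep neural network the cost is generally non-convex in the weights, so this guarantee does not carry over.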
*   **Disadvantages**:
    *   **Squared Unit**: The unit of the error is squared (e.g., if the output is in 'LPA', the error is in 'LPA squared'), which can be less intuitive to understand.
    *   **Not Robust to Outliers**: Due to the quadratic penalty, outliers heavily influence the model, pulling the regression line significantly towards them, potentially leading to suboptimal predictions for the majority of the data.
*   **Deep Learning Implementation**: When using MSE for regression, the output layer of the neural network should have a **linear activation function**.
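
The quadratic penalty and its effect on updates can be seen by evaluating the loss and its derivative with respect to the prediction, `-2 * (y_true - y_pred)`. The helper names `mse_loss` and `mse_grad` and the sample values are illustrative assumptions.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Squared error for each data point."""
    return (y_true - y_pred) ** 2

def mse_grad(y_true, y_pred):
    """Derivative of the squared error with respect to y_pred."""
    return -2.0 * (y_true - y_pred)

# Hypothetical predictions that miss the target by 1, 2 and 4 units.
y_true = np.array([10.0, 10.0, 10.0])
y_pred = np.array([9.0, 8.0, 6.0])

losses = mse_loss(y_true, y_pred)   # [1.0, 4.0, 16.0]
grads = mse_grad(y_true, y_pred)    # [-2.0, -4.0, -8.0]

print(losses, grads)
```

The gradient grows in proportion to the error, which is why a point far from its target pushes the parameters harder than a point close to it.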

##### 5.2. Mean Absolute Error (MAE)
*   **Also Known As**: L1 Loss.
*   **Application**: Another common loss function for **regression problems**.
*   **Formula (for a single data point)**: `|y_true - y_pred|`.
*   **Comparison to MSE**: Replaces the squaring operation with an absolute value operation.
*   **Advantages**:
    *   **Intuitive and easy to understand**.
    *   **Same Unit as Target**: The unit of the error is the same as the target variable (e.g., if the output is 'LPA', the error is in 'LPA'), making it more directly interpretable.
    *   **Robust to Outliers**: MAE penalises errors linearly, not quadratically. This means outliers do not cause disproportionately large penalties, resulting in less drastic adjustments to the model's parameters and a more robust fit to the general data trend.
*   **Disadvantages**:
    *   **Not Differentiable at Zero**: The graph of MAE (loss vs. error) has a sharp corner where the error is zero, so the loss is non-differentiable at that point. Gradient descent has to rely on **subgradients** there, and because the gradient magnitude stays the same however small the error is, convergence near the minimum can be less smooth than with MSE.
*   **Recommendation**: Use MAE when your dataset contains **outliers**.
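
A small sketch of both points, using hypothetical helper names and made-up salary figures. The subgradient with respect to the prediction is taken as `sign(y_pred - y_true)`, with 0 chosen at zero error (one valid choice). The second part uses the known fact that the constant prediction minimising MSE is the mean, while the one minimising MAE is the median.

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Absolute error for each data point."""
    return np.abs(y_true - y_pred)

def mae_subgrad(y_true, y_pred):
    """A subgradient of |y_true - y_pred| w.r.t. y_pred (0 is used at zero error)."""
    return np.sign(y_pred - y_true)

y_true = np.array([10.0, 10.0, 10.0])
y_pred = np.array([9.0, 10.0, 14.0])
subgrads = mae_subgrad(y_true, y_pred)   # [-1.0, 0.0, 1.0]

# Hypothetical salaries (in LPA) with one outlier.
salaries = np.array([4.0, 5.0, 6.0, 5.0, 50.0])
best_constant_for_mse = salaries.mean()       # 14.0, dragged towards the outlier
best_constant_for_mae = np.median(salaries)   # 5.0, stays with the bulk of the data

print(subgrads, best_constant_for_mse, best_constant_for_mae)
```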

##### 5.3. Huber Loss
*   **Application**: Used in **regression problems** when there is a significant mixture of normal points and outliers.
*   **Idea**: A hybrid loss function that combines the benefits of MSE and MAE.
    *   Behaves like **MSE for small errors** (points that are not outliers).
    *   Behaves like **MAE for large errors** (points that are outliers).
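*   **Formula (for a single data point)**, with error `e = y_true - y_pred`:
    *   `0.5 * e^2` if `|e| <= δ`
    *   `δ * |e| - 0.5 * δ^2` otherwise

   <img src='https://pbs.twimg.com/media/FNXUFtaWQAQSPyy.jpg'>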
*   **Parameter (Delta, δ)**: A threshold parameter (hyperparameter) determines what constitutes a "small" error (MSE-like behaviour) versus a "large" error (MAE-like behaviour). Its value can be tuned.
*   **Benefit**: Provides a balance, robustly handling outliers without completely ignoring smaller errors.
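
A minimal NumPy sketch of Huber loss, assuming the standard definition: `0.5 * e^2` when `|e| <= δ` and `δ * |e| - 0.5 * δ^2` otherwise, where `e = y_true - y_pred`. The function name `huber_loss` and the default `delta=1.0` are choices made for this example.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Per-example Huber loss: quadratic for small errors, linear for large ones."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                    # MSE-like branch
    linear = delta * abs_error - 0.5 * delta ** 2   # MAE-like branch
    return np.where(abs_error <= delta, quadratic, linear)

# Hypothetical errors of 0.5, 1.0 and 3.0 with delta = 1.0.
y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 3.0])

losses = huber_loss(y_true, y_pred, delta=1.0)   # [0.125, 0.5, 2.5]
print(losses)
```

The two branches meet at `|e| = δ`, so the loss stays continuous as an error crosses the threshold.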

##### 5.4. Binary Cross Entropy (BCE)
*   **Also Known As**: Log Loss.
*   **Application**: Used for **binary classification problems**, where there are exactly two classes (e.g., Yes/No, 0/1, placement/no placement).
*   **Formula (for a single data point)**: `-y_true * log(y_pred) - (1 - y_true) * log(1 - y_pred)`.
    *   `y_true`: The actual target value (0 or 1).
    *   `y_pred`: The predicted probability of the positive class (between 0 and 1).
*   **Deep Learning Implementation**: The output layer of the neural network must use a **sigmoid activation function**.
*   **Advantages**:
    *   **Differentiable**: Enables easy application of gradient descent.
*   **Disadvantages**:
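    *   The loss is convex for a single sigmoid neuron (logistic regression), but in a deep network it is non-convex in the weights and can have **multiple local minima**.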
    *   **Less intuitive** to understand compared to MSE/MAE.
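
The formula can be evaluated directly in NumPy. In this sketch the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions; clipping is a common guard against `log(0)` and is not part of the formula itself.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Per-example log loss; y_pred is the predicted probability of class 1."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Hypothetical labels and sigmoid outputs.
y_true = np.array([1.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.1])

losses = binary_cross_entropy(y_true, y_pred)
print(losses)
```

A confident correct prediction (0.9 for a true label of 1) gives a small loss, while a confident wrong prediction (0.1 for a true label of 1) gives a much larger one.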

##### 5.5. Categorical Cross Entropy (CCE)
*   **Application**: Used for **multi-class classification problems**, where there are three or more classes (e.g., 'Yes', 'No', 'Maybe').
*   **Formula (for a single data point)**: `-sum(y_true_j * log(y_pred_j))` for `j=1 to k` classes.
    *   `y_true_j`: The true label (1 for the correct class, 0 for others in one-hot encoding).
    *   `y_pred_j`: The predicted probability for class `j`.
*   **Deep Learning Architecture**:
    *   The output layer should have **as many neurons as there are classes**.
    *   The activation function for the output layer neurons must be **Softmax**. Softmax outputs probabilities for each class that sum up to 1.
*   **Data Preparation**: The true target labels must be **one-hot encoded**. For example, if there are three classes ('Yes', 'No', 'Maybe'), 'Yes' might be encoded as `[1, 0, 0]`, 'No' as `[0, 1, 0]`, and 'Maybe' as `[0, 0, 1]`.
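
A minimal sketch of the Softmax output and the CCE formula for a three-class problem. The function names, the raw scores (`logits`), and the clipping constant `eps` are assumptions made for the example; the one-hot target `[0, 1, 0]` marks the second class as correct.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(y_true_one_hot, y_pred, eps=1e-12):
    """Per-example CCE: -sum_j y_true_j * log(y_pred_j)."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.sum(y_true_one_hot * np.log(y_pred), axis=-1)

# Hypothetical raw outputs of a 3-neuron output layer for one input.
logits = np.array([[2.0, 1.0, 0.1]])
y_true = np.array([[0.0, 1.0, 0.0]])   # one-hot: the second class is correct

probs = softmax(logits)
loss = categorical_cross_entropy(y_true, probs)
print(probs, loss)
```

Because the target is one-hot, only the predicted probability of the correct class contributes to the sum.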

##### 5.6. Sparse Categorical Cross Entropy (SCCE)
*   **Application**: Also used for **multi-class classification problems**.
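*   **Key Difference from CCE**: Instead of requiring one-hot encoded true labels, SCCE expects the true labels to be **integer encoded**. For example, 'Yes' could be 0, 'No' could be 1, and 'Maybe' could be 2. Frameworks such as Keras expect these integers to start at 0 so that each label matches the index of an output neuron.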
*   **Deep Learning Architecture**: The architecture (number of output neurons and Softmax activation) remains the same as CCE.
*   **Advantages**:
    *   **Faster**: Especially beneficial when dealing with a very large number of classes, as it avoids the computational overhead of one-hot encoding the labels for each batch.
    *   **Memory Efficient**: Reduces memory usage by storing integer labels instead of sparse one-hot vectors.
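
The sketch below checks that SCCE and CCE give the same loss values and differ only in label format. It assumes 0-based integer labels and uses hypothetical names and probabilities; the sparse version simply indexes the predicted probability of the correct class.

```python
import numpy as np

def sparse_categorical_cross_entropy(labels, y_pred, eps=1e-12):
    """Per-example loss from integer labels: -log of the true-class probability."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    rows = np.arange(len(labels))
    return -np.log(y_pred[rows, labels])

# Hypothetical Softmax outputs for two inputs and three classes.
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
labels = np.array([0, 2])                # integer-encoded true classes

# The same targets in one-hot form, as CCE would require.
one_hot = np.eye(3)[labels]
cce = -np.sum(one_hot * np.log(y_pred), axis=1)
scce = sparse_categorical_cross_entropy(labels, y_pred)

print(cce, scce)
```

The integer labels take one number per example, whereas the one-hot matrix takes one number per class per example, which is where the memory saving comes from.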

#### 6. Summary and Recommendations
*   **Regression Problems**:
    *   Use **MSE** if there are no significant outliers in your data.
    *   Use **MAE** if your data contains outliers, as it is more robust.
    *   Use **Huber Loss** if your data has a mixture of normal points and outliers, offering a balance.
*   **Classification Problems**:
    *   Use **Binary Cross Entropy** for binary classification (two classes).
    *   Use **Categorical Cross Entropy** for multi-class classification when true labels are one-hot encoded.
    *   Use **Sparse Categorical Cross Entropy** for multi-class classification when true labels are integer encoded, especially with many classes for speed and memory efficiency.

---