### Gradient Descent in Neural Networks: Batch vs. Stochastic vs. Mini-Batch

This video focuses on the **three main variants of Gradient Descent** used in Backpropagation algorithms, explaining their differences, when to use each, and how they impact training. Gradient Descent is a fundamental **optimisation algorithm** used to **minimise an objective function**, specifically the **loss function** in neural networks. It works by updating model parameters (weights and biases) in the **opposite direction of the gradient** of the loss function. The **learning rate** determines the size of the steps taken towards the minimum.
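
As a minimal sketch of the update rule described above, the snippet below minimises a hypothetical one-parameter loss `L(w) = (w - 3)^2`. The function name `gradient_descent`, the loss, and the chosen learning rate are illustrative assumptions and do not come from the video.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # move opposite to the gradient
    return w

# Hypothetical loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With this learning rate each step shrinks the distance to the minimum by a constant factor, so `w_min` ends up very close to 3.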

The core difference between the Gradient Descent variants lies in **how much data is used to compute the gradient** of the objective function.

#### 1. Backpropagation and Gradient Descent

*   The Backpropagation algorithm, as discussed previously, involves deciding on epochs, looping through them, taking a data point, calculating a prediction, calculating loss, and then **updating weights and biases**.
*   This weight and bias update process is essentially **Gradient Descent**.

#### 2. Gradient Descent Variants

There are three main types of Gradient Descent:
*   **Batch Gradient Descent**
*   **Stochastic Gradient Descent (SGD)**
*   **Mini-Batch Gradient Descent**

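These variants present a **trade-off between accuracy and time**: using more data per update gives a more accurate, more stable gradient, but each update takes longer to compute.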

#### 3. Batch Gradient Descent (Vanilla Gradient Descent)

*   **Definition**: Also known as "Vanilla Gradient Descent," this is the most basic form. In Batch Gradient Descent, the **entire dataset is used** to compute the gradient for a single weight update.
*   **Algorithm (Pseudocode)**:
    1.  Decide on the number of `epochs`.
    2.  Loop `for i in range(epochs)`:
        *   Take the **entire dataset** (`X`).
        *   Calculate predictions (`y_hat`) for **all rows simultaneously** using a dot product (vectorisation).
        *   Calculate the **total loss** for all `y_hat` and actual `y` values.
        *   **Update all weights and biases once** using the calculated gradient of this total loss: `w_new = w_old - (learning_rate * ∂L/∂w)`.
*   **Characteristics**:
    *   **Weight Updates**: Weights are updated **once per epoch**. If there are 10 epochs, there will be 10 weight updates.
    *   **Speed (per epoch)**: It is **faster to complete each epoch** compared to SGD, because it performs a single vectorised pass over the data and only one update per epoch.
    *   **Convergence Speed (to solution)**: It is **slower to converge to the optimal solution** because it makes fewer updates overall. It requires more epochs to reach a good solution.
    *   **Loss Curve Behaviour**: The loss function decreases in a very **smooth and stable** manner.
    *   **Local Minima**: It is prone to getting **stuck in local minima**: each step follows the exact full-dataset gradient, so there is no noise to push it out of a shallow minimum.
    *   **Vectorisation**: It benefits from **vectorisation** (using dot products instead of explicit loops) for calculating predictions and gradients across the entire dataset, which is generally faster than loops.
    *   **Memory Requirement (Downside of Vectorisation)**: Requires loading the **entire dataset into RAM simultaneously** for the dot product operation. This makes it unsuitable for very large datasets that cannot fit into memory.
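
The pseudocode above can be sketched with NumPy for a simple linear model. This is an illustration under stated assumptions, not the video's code: the model (`y_hat = X @ w + b`), the mean squared error loss, the toy data, and the function name `batch_gradient_descent` are all hypothetical choices.

```python
import numpy as np

def batch_gradient_descent(X, y, epochs=100, learning_rate=0.1):
    """Fit y ~ X @ w + b with one parameter update per epoch."""
    n_rows, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = X @ w + b                      # predictions for all rows at once
        error = y_hat - y
        grad_w = 2.0 * (X.T @ error) / n_rows  # gradient of mean squared error
        grad_b = 2.0 * error.mean()
        w = w - learning_rate * grad_w         # the single update for this epoch
        b = b - learning_rate * grad_b
    return w, b

# Toy data (hypothetical): 50 rows of y = 2x + 1 without noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X[:, 0] + 1.0
w, b = batch_gradient_descent(X, y, epochs=500, learning_rate=0.1)
```

Note that `X` is held in memory in full for the `X @ w` product, which is the memory cost described in the last bullet.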

#### 4. Stochastic Gradient Descent (SGD)

*   **Definition**: This is the type of Gradient Descent often described in initial Backpropagation explanations. In SGD, the model updates weights **for each individual data point**.
*   **Algorithm (Pseudocode)**:
    1.  Decide on the number of `epochs`.
    2.  Loop `for i in range(epochs)`:
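        *   **Shuffle the entire dataset** (optional, but recommended so that the order of the rows does not bias training).
        *   Loop `for each data_point in dataset`: (e.g., if 50 rows, inner loop runs 50 times).
            *   Take the **next data point** from the shuffled dataset (equivalent to picking rows at random without repetition).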
            *   Calculate prediction (`y_hat`) for **that single point**.
            *   Calculate the loss for that single point.
            *   **Update all weights and biases** based on the gradient of this single-point loss.
*   **Characteristics**:
    *   **Weight Updates**: Weights are updated **`number of epochs * number of rows` times**. If 10 epochs and 50 rows, there will be 500 weight updates.
    *   **Speed (per epoch)**: It is **slower to complete each epoch** because it performs many updates.
    *   **Convergence Speed (to solution)**: It is **faster to converge to a good solution** because it makes frequent updates, allowing it to move quickly towards the minimum. It requires fewer epochs to reach a good solution than Batch GD.
    *   **Loss Curve Behaviour**: The loss function decreases in an **unstable, "spiky," or "jerky"** manner.
    *   **Local Minima**: This "random behaviour" or "jerkiness" is **beneficial for escaping local minima** and potentially finding the global minimum.
    *   **Exact Solution**: Due to its erratic movement, it might **not converge to an exact solution** but rather an approximate one around the minimum.
    *   **Memory Requirement**: It processes one data point at a time, so it does **not require loading the entire dataset** into memory.
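
A plain-Python sketch of the SGD loop above, for a one-feature linear model, might look like the following. The model, the squared-error loss, the toy data, and the name `sgd` are assumptions made for illustration. The update counter is there only to confirm the `epochs * rows` figure from the characteristics list.

```python
import random

def sgd(xs, ys, epochs=10, learning_rate=0.05, seed=0):
    """Fit y ~ w * x + b, updating the parameters after every single row."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    n_updates = 0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                # new visiting order each epoch
        for i in indices:
            y_hat = w * xs[i] + b           # prediction for one row
            error = y_hat - ys[i]
            w -= learning_rate * 2.0 * error * xs[i]  # gradient of squared error
            b -= learning_rate * 2.0 * error
            n_updates += 1
    return w, b, n_updates

# Toy data (hypothetical): 50 rows of y = 2x + 1 without noise.
xs = [i / 25.0 - 1.0 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]
w, b, n_updates = sgd(xs, ys, epochs=10)
```

With 10 epochs and 50 rows, `n_updates` is 500, matching the example above. No row-wise vectorisation is used, since each step touches a single data point.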

#### 5. Mini-Batch Gradient Descent

*   **Definition**: Mini-Batch Gradient Descent is the **middle ground** between Batch and Stochastic Gradient Descent and often considered the **"best of both worlds"**. It uses **small batches (subsets) of the data** to compute gradients and update weights.
*   **Algorithm (Pseudocode)**:
    1.  Decide on the number of `epochs`.
    2.  Decide on `batch_size` (e.g., 32, 64, 128).
    3.  Loop `for i in range(epochs)`:
        *   **Shuffle the entire dataset**.
        *   Divide the dataset into `number_of_batches` (e.g., if 320 rows and batch_size 32, then 10 batches).
        *   Loop `for each batch in batches`:
            *   Calculate predictions (`y_hat`) for **all rows within that batch** using vectorisation.
            *   Calculate the **loss for that batch**.
            *   **Update all weights and biases** based on the gradient of that batch's loss.
*   **Characteristics**:
    *   **Weight Updates**: Weights are updated **`number of epochs * number of batches` times**.
    *   **Speed (per epoch)**: Faster than SGD, slower than Batch Gradient Descent.
    *   **Convergence Speed (to solution)**: Faster than Batch Gradient Descent, slower than SGD. It converges efficiently.
    *   **Loss Curve Behaviour**: Provides a **smoother loss curve than SGD**, but with some "jerkiness" that helps escape local minima.
    *   **Local Minima**: Its balanced behaviour allows it to effectively navigate the loss landscape, potentially escaping local minima while still converging reasonably well.
    *   **Memory Requirement**: Uses vectorisation within batches, so it **does not require loading the entire dataset** into memory, making it suitable for large datasets.
    *   **Popularity**: It is the **most commonly used** Gradient Descent variant in deep learning.
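
The mini-batch loop above can be sketched in NumPy as follows. As before, the linear model, mean squared error loss, toy data, and the name `minibatch_gradient_descent` are illustrative assumptions rather than the video's implementation.

```python
import numpy as np

def minibatch_gradient_descent(X, y, epochs=50, batch_size=32,
                               learning_rate=0.1, seed=0):
    """Fit y ~ X @ w + b with one parameter update per mini-batch."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    n_updates = 0
    for _ in range(epochs):
        order = rng.permutation(n_rows)            # shuffled row indices
        for start in range(0, n_rows, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ w + b - y_batch      # vectorised within the batch
            w = w - learning_rate * 2.0 * (X_batch.T @ error) / len(idx)
            b = b - learning_rate * 2.0 * error.mean()
            n_updates += 1
    return w, b, n_updates

# Toy data (hypothetical): 320 rows of y = 2x + 1 without noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(320, 1))
y = 2.0 * X[:, 0] + 1.0
w, b, n_updates = minibatch_gradient_descent(X, y, epochs=50, batch_size=32)
```

With 320 rows and a `batch_size` of 32 there are 10 batches per epoch, so 50 epochs give 500 updates. In this sketch the whole array is in memory for simplicity. With a large dataset, each batch could instead be loaded from disk as it is needed.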

#### 6. Comparison Summary

| Feature               | Batch Gradient Descent      | Stochastic Gradient Descent | Mini-Batch Gradient Descent |
| :-------------------- | :-------------------------- | :-------------------------- | :-------------------------- |
| **Data Used**         | Entire dataset (N rows)     | 1 data point                | `batch_size` data points    |
| **Weight Updates/Epoch** | 1                           | N (number of rows)          | ceil(N / `batch_size`)      |
| **Speed (per epoch)** | Fastest                     | Slowest                     | Medium                      |
| **Convergence Speed** | Slowest                     | Fastest                     | Medium                      |
| **Loss Curve Stability** | Most stable, very smooth    | Most unstable, spiky        | Moderately stable, less spiky than SGD |
| **Local Minima Escape** | Poor (can get stuck)        | Good (jumps out)            | Good (balanced)             |
| **Vectorisation**     | Yes (entire dataset)        | No                          | Yes (within batches)        |
| **Memory**            | High (entire dataset in RAM) | Low                         | Medium                      |
| **Popularity**        | Less common                 | Less common                 | **Most common**             |
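
The "Weight Updates/Epoch" row can be checked with a small helper. The function name `updates_per_epoch` and the numbers below are hypothetical. The only assumption is that a partial final batch still counts as one update, hence the rounding up.

```python
import math

def updates_per_epoch(n_rows, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_rows / batch_size)

n_rows, epochs = 320, 10
batch_updates = epochs * updates_per_epoch(n_rows, batch_size=n_rows)  # Batch GD
sgd_updates = epochs * updates_per_epoch(n_rows, batch_size=1)         # SGD
mini_updates = epochs * updates_per_epoch(n_rows, batch_size=32)       # Mini-batch
```

Batch GD and SGD are the two extremes of the same formula: `batch_size = N` and `batch_size = 1`.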

#### 7. Practical Considerations for `batch_size` in Keras/Deep Learning

*   **Powers of 2**: Batch sizes are often chosen as powers of 2 (e.g., 32, 64, 128, 256). This is a common **optimisation technique** intended to match how memory and hardware are organised, which can make the code faster.
*   **Imperfect Division**: If the total number of rows is not perfectly divisible by the `batch_size`, the **last batch will simply contain the remaining rows**. For example, 400 rows with a `batch_size` of 150 would create two batches of 150 rows and one final batch of 100 rows.
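
The imperfect-division rule can be illustrated in plain Python. This is a sketch of the arithmetic only, not Keras's internal implementation, and `batch_sizes` is a hypothetical helper name.

```python
def batch_sizes(n_rows, batch_size):
    """Sizes of consecutive batches when n_rows rows are split in order."""
    return [min(batch_size, n_rows - start)
            for start in range(0, n_rows, batch_size)]

sizes = batch_sizes(400, 150)
```

For the example above, `sizes` is `[150, 150, 100]`: two full batches followed by one batch holding the remaining rows.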