## I. Context: The Flaw in Batch Gradient Descent


### A. Problems with Batch Gradient Descent
1.  **Memory Inefficiency:** To perform Batch Gradient Descent, the entire dataset must be held in memory at once. This is feasible for small datasets (e.g., 500 rows), but impractical for large-scale problems such as image classification, where the data can run to many gigabytes.
2.  **Slow Convergence:** Parameters are updated only once per epoch (after processing the whole dataset). This results in slow learning and convergence compared to updating parameters more frequently.

### B. The Solution: Mini-Batch Gradient Descent
The standard industry practice is **Mini-Batch Gradient Descent**. Instead of processing 1000 rows at once, the data is divided into smaller batches (e.g., 10 batches of 100 rows). The model performs a forward pass, calculates loss, and updates parameters for *each* batch sequentially.

### C. Limitations of Manual Implementation
While one can manually write loops to slice tensors into batches, this approach lacks robustness:
*   **No Standard Interface:** It does not handle complex data fetching (e.g., images stored in folder structures).
*   **Difficult Transformations:** There is no centralized place to apply data augmentations or transformations (e.g., resizing images, greyscaling).
*   **Shuffling & Sampling:** Implementing efficient random shuffling and sampling requires complex manual logic.
*   **No Parallelization:** Manual loops process data sequentially, failing to utilise multi-core CPUs for faster data loading.
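
For reference, the manual approach described above might look like the following minimal sketch. The tensor sizes and the names `X`, `y`, and `batch_size` are assumptions chosen for illustration, not values from the lecture.

```python
import torch

# Toy data: 1000 rows, 5 features (sizes chosen only for illustration).
X = torch.randn(1000, 5)
y = torch.randint(0, 2, (1000,)).float()

batch_size = 100
num_batches = 0
for start in range(0, X.shape[0], batch_size):
    # Slice one contiguous mini-batch out of the full tensors.
    batch_X = X[start:start + batch_size]
    batch_y = y[start:start + batch_size]
    # A real training loop would run the forward pass, loss,
    # backward pass, and optimizer step on this batch here.
    num_batches += 1
```

Note that this loop always visits the rows in the same order and runs entirely in the main process, which is exactly what the limitations listed above refer to.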

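To solve these issues, PyTorch provides two core classes in `torch.utils.data`: **Dataset**, an abstract class that you subclass, and **DataLoader**, a concrete class that you instantiate around a dataset.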

---

## II. The Dataset Class (`torch.utils.data.Dataset`)

The **Dataset** class is responsible for defining *how* the data is loaded and *how* a single item is retrieved. It decouples the data storage logic from the training loop.

To create a custom dataset, define a class that inherits from `torch.utils.data.Dataset` and implements three methods (`__init__` sets up the data source, while `__len__` and `__getitem__` are what the DataLoader relies on):

### 1. The `__init__` Method (Constructor)
*   **Purpose:** Defines how data is loaded from the source (e.g., reading a CSV file, defining paths to image folders).
*   **Execution:** Runs once when the object is instantiated.

### 2. The `__len__` Method
*   **Purpose:** Returns the total number of samples (rows) in the dataset.
*   **Utility:** This helps the DataLoader calculate how many batches will be created given a specific batch size.

### 3. The `__getitem__` Method
*   **Purpose:** The most critical method. Given an `index`, it retrieves the specific sample (features and label) from the dataset at that position.
*   **Transformations:** This is the correct place to apply transformations. Before returning the row, one can resize images, apply normalization, or convert text to lower case.
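
A minimal sketch of such a class, assuming the data already sits in two in-memory tensors. The class name `CustomDataset`, the optional `transform` argument, and the toy tensors are illustrative assumptions, not code from the lecture.

```python
import torch
from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, features, labels, transform=None):
        # Runs once at instantiation: store (or load) the data.
        self.features = features
        self.labels = labels
        self.transform = transform  # optional per-sample callable

    def __len__(self):
        # Total number of samples (rows).
        return self.features.shape[0]

    def __getitem__(self, index):
        # Retrieve one sample and apply the transform, if any.
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[index]


features = torch.randn(10, 2)
labels = torch.randint(0, 2, (10,))

dataset = CustomDataset(features, labels)
sample_x, sample_y = dataset[0]  # indexing calls __getitem__(0)

# Hypothetical transform: scale every feature vector by 2.
scaled = CustomDataset(features, labels, transform=lambda x: x * 2)
```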

---

## III. The DataLoader Class (`torch.utils.data.DataLoader`)

While the Dataset class fetches individual items, the **DataLoader** class orchestrates the creation of batches. It manages shuffling, batching, and parallel processing.

### A. The Workflow of DataLoader
The internal process of a DataLoader follows these steps:
1.  **Index Selection:** It retrieves all indices (e.g., 0 to 9).
2.  **Sampling/Shuffling:** It uses a **Sampler** to shuffle these indices randomly (if `shuffle=True`).
3.  **Chunking:** It groups the shuffled indices into chunks based on the defined `batch_size` (e.g., pairing indices for a batch size of 2).
4.  **Fetching:** It passes these indices one by one to the **Dataset's `__getitem__`** method to retrieve the actual data.
5.  **Collation:** It uses a **Collate Function** to combine the individual data items into a single batch tensor.
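
The five steps can be reproduced by hand with the same building blocks the DataLoader uses. This is a rough sketch of the idea rather than the DataLoader's actual internals; the ten-row toy dataset is an assumption.

```python
import torch
from torch.utils.data import BatchSampler, RandomSampler, TensorDataset
from torch.utils.data.dataloader import default_collate

# Toy dataset with 10 samples; the label of sample i is simply i.
dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

# Steps 1-2: the sampler yields the indices 0..9 in random order.
sampler = RandomSampler(dataset)

# Step 3: group the shuffled indices into chunks of batch_size.
batch_sampler = BatchSampler(sampler, batch_size=2, drop_last=False)

batches = []
for batch_indices in batch_sampler:
    # Step 4: fetch each sample through the dataset's __getitem__.
    samples = [dataset[i] for i in batch_indices]
    # Step 5: collate the list of (x, y) pairs into batch tensors.
    batch_x, batch_y = default_collate(samples)
    batches.append((batch_x, batch_y))
```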

### B. Key Parameters
*   **`batch_size`:** The number of samples per batch.
*   **`shuffle`:** If `True`, data is shuffled every epoch. If `False`, it remains sequential (useful for Time Series).
*   **`num_workers`:** Enables **Parallelization**. With the default of `0`, data is loaded in the main process. With a positive value, that many worker sub-processes fetch batches in parallel, which can speed up training noticeably when data loading is the bottleneck.
*   **`drop_last`:** If `True`, it drops the final batch if it is incomplete (smaller than the batch size). This can help with Batch Normalization, whose batch statistics become noisy on a very small final batch and which can fail in training mode on a batch containing a single sample.
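
A small sketch of how `batch_size` and `drop_last` affect the batches produced. The ten-row toy dataset is an assumption, and `num_workers` is left at its default of `0` so that everything runs in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10, 3), torch.arange(10))

# 10 samples with batch_size=4 gives batches of 4, 4, and 2.
keep_loader = DataLoader(dataset, batch_size=4, shuffle=True)

# With drop_last=True the incomplete batch of 2 is discarded.
drop_loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)

sizes_keep = [len(batch_y) for _, batch_y in keep_loader]
sizes_drop = [len(batch_y) for _, batch_y in drop_loader]
```

`len(loader)` reports the number of batches, which is where the dataset's `__len__` comes into play.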

---

## IV. Advanced Concepts: Samplers and Collate Functions

### A. The Sampler
The Sampler determines the strategy for drawing indices from the dataset.
*   **RandomSampler:** Shuffles indices (Standard for training).
*   **SequentialSampler:** Keeps indices in order (Standard for validation/testing or Time Series).
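*   **Weighted/Custom Samplers:** Crucial for **Imbalanced Datasets**. For example, if Class A has 99% of data and Class B has 1%, a weighted sampler (e.g., `WeightedRandomSampler`) can draw Class B samples more often so that batches are roughly balanced on average, and a custom sampler can go further and enforce a specific ratio per batch. Both reduce the model's bias toward the majority class.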
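
As one possible illustration, the sketch below uses PyTorch's built-in `WeightedRandomSampler` with inverse-class-frequency weights. The 990/10 class split, the weighting scheme, and the variable names are assumptions made for the example. The sampler balances classes in expectation; it does not guarantee an exact ratio in every batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Imbalanced toy data: 990 samples of class 0, 10 samples of class 1.
labels = torch.cat([torch.zeros(990), torch.ones(10)]).long()
features = torch.randn(1000, 4)
dataset = TensorDataset(features, labels)

# Weight each sample by the inverse frequency of its class.
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

# Draw with replacement so minority samples can repeat within an epoch.
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(dataset), replacement=True
)

# `sampler` and `shuffle=True` are mutually exclusive, so shuffle is omitted.
loader = DataLoader(dataset, batch_size=100, sampler=sampler)

# Count how many minority-class draws occurred over one epoch.
minority_seen = sum(int(batch_y.sum()) for _, batch_y in loader)
```

With these weights, roughly half of the 1000 draws per epoch are expected to come from the minority class, even though it makes up only 1% of the data.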

### B. The Collate Function
The `collate_fn` defines how a list of individual samples (retrieved by `__getitem__`) is merged into a batch.
*   **Default Behavior:** Stacks the per-sample tensors along a new first (batch) dimension, which requires every sample to have the same shape.
*   **Custom Usage (e.g., NLP):** If samples are sentences of variable lengths (e.g., one sentence has 4 words, another has 2), they cannot be stacked directly. A custom Collate function is required to add **Padding** (adding zeros) to match lengths before stacking.
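
A minimal sketch of a padding collate function, assuming each sample is a 1-D tensor of token ids and that `0` is reserved as the padding id. The function name `pad_collate` and the toy sentences are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Three "sentences" of different lengths, already encoded as token ids.
sentences = [
    torch.tensor([5, 2, 9, 4]),
    torch.tensor([7, 3]),
    torch.tensor([1, 8, 6]),
]


def pad_collate(batch):
    # Record the true lengths so padding can be masked out later.
    lengths = torch.tensor([len(seq) for seq in batch])
    # Right-pad every sequence with zeros to the longest one in the batch.
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


# A plain list works as a dataset here: it has __len__ and __getitem__.
loader = DataLoader(sentences, batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
```

Returning the original lengths alongside the padded tensor is a common design choice, since downstream code usually needs to tell real tokens from padding.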

---

## V. Implementation Summary

The lecture concludes by refactoring the Breast Cancer training code to use these classes:

1.  **Step 1:** Define a `CustomDataset` class implementing `__init__`, `__len__`, and `__getitem__`.
2.  **Step 2:** Instantiate `train_dataset` and `test_dataset` objects.
3.  **Step 3:** Instantiate `train_loader` and `test_loader` using `DataLoader`, specifying `batch_size` and `shuffle`.
4.  **Step 4:** Refactor the training loop to include a nested loop:
    *   **Outer Loop:** Iterates over Epochs.
    *   **Inner Loop:** Iterates over the `train_loader` to fetch `batch_features` and `batch_labels` for Mini-Batch Gradient Descent.

This structure ensures the code is scalable, memory-efficient, and cleaner compared to manual batching.
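
The nested-loop structure can be sketched as follows. This is not the lecture's Breast Cancer code: the synthetic data, the single linear layer, the loss, and the hyperparameters are all assumptions chosen to keep the example self-contained.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for a tabular binary-classification dataset.
X = torch.randn(200, 30)
y = (X[:, 0] > 0).float().unsqueeze(1)
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(30, 1)              # outputs raw logits
loss_fn = nn.BCEWithLogitsLoss()      # applies the sigmoid internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):                                 # outer loop: epochs
    running_loss = 0.0
    for batch_features, batch_labels in train_loader:  # inner loop: batches
        optimizer.zero_grad()                 # clear old gradients
        logits = model(batch_features)        # forward pass
        loss = loss_fn(logits, batch_labels)  # loss for this batch
        loss.backward()                       # backward pass
        optimizer.step()                      # one update per batch
        running_loss += loss.item()
    # Average the batch losses to get one number per epoch.
    epoch_losses.append(running_loss / len(train_loader))
```

Because the optimizer steps once per batch, each epoch here performs seven parameter updates (200 samples at a batch size of 32) instead of the single update that Batch Gradient Descent would make.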