# **Dataset and Dataloaders Classes In PyTorch**

#### **What is Batch Gradient Descent?**
- **The Rule**: It calculates the error for every single example in your entire dataset before making just one update to the model's weights.
- **The Analogy**: It is like a teacher grading 1,000 exam papers and calculating the class average before deciding to change their teaching style once.
- **The Math**: It takes the "average direction" of all data points combined, creating a very smooth path down the loss mountain.
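
The rule above can be sketched in a few lines. This is a minimal illustration, not a recipe: the toy linear-regression data, the learning rate `lr`, and the weight tensor `w` are all assumptions made up for the example. The point is that every epoch touches all 1,000 rows and produces exactly one weight update.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy data: 1,000 samples, 2 features, a linear target plus a little noise.
X = torch.randn(1000, 2)
true_w = torch.tensor([[2.0], [-3.0]])
y = X @ true_w + 0.1 * torch.randn(1000, 1)

w = torch.zeros(2, 1, requires_grad=True)
lr = 0.1  # assumed learning rate

losses = []
for epoch in range(50):
    # Forward pass over the ENTIRE dataset (this is what makes it "batch" GD).
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    # Exactly one weight update per full pass over the data.
    with torch.no_grad():
        w -= lr * w.grad
    w.grad.zero_()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Fifty passes over the data give only fifty updates, which is the cost the next section complains about.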

#### **The 3 Major Problems (Why we don't use it for Images)**
1. **Memory Crash (RAM/GPU):** To calculate the gradient for the whole dataset at once, the computer must hold every image (plus the intermediate activations for each one) in memory simultaneously. For large datasets this exceeds the available RAM or VRAM and causes Out of Memory errors.

2. **Too Slow to Learn:** Since it processes the whole dataset for just one step, it learns very slowly. If you have 10,000 images, you wait for 10,000 forward and backward calculations just to move the weights once. With a batch size of 32, mini-batch training would have moved the weights about 313 times (10,000 / 32, rounded up) in that same pass.

3. **Gets Stuck (Saddle Points):** Because it calculates the exact average gradient, it moves very smoothly. If it hits a flat area (saddle point) or a small hole (local minimum), it can stall there. It lacks the "random noise" of Stochastic Gradient Descent that helps jiggle the model out of these traps.
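
The first two problems come down to simple arithmetic, sketched below. The image size (224 x 224 RGB stored as float32) and the batch size of 32 are assumptions chosen for illustration, and the memory figure counts only the raw input tensors, not activations or gradients.

```python
import math

n_samples = 10_000
batch_size = 32  # assumed mini-batch size

# Problem 2: weight updates per pass over the data.
updates_full_batch = 1
updates_mini_batch = math.ceil(n_samples / batch_size)
print(updates_full_batch, updates_mini_batch)  # 1 313

# Problem 1: memory for the inputs alone (assumed 3 x 224 x 224 float32 images).
bytes_per_image = 3 * 224 * 224 * 4
full_dataset_gb = n_samples * bytes_per_image / 1e9
one_batch_mb = batch_size * bytes_per_image / 1e6
print(f"{full_dataset_gb:.2f} GB vs {one_batch_mb:.2f} MB")  # 6.02 GB vs 19.27 MB
```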

### **Mini-Batch Gradient Descent**
- **The Concept**: Instead of processing the entire dataset (Batch GD) or just one example (Stochastic GD), the model processes data in small groups called Batches (e.g., 32 or 64 images).
- **The Standard**: This is the default algorithm used for almost all Deep Learning tasks today.

#### **Why we use it? (3 Main Advantages)**
- **Prevents Memory Crashes:** It is impossible to load 10,000 high-res medical images into GPU memory at once. Mini-batch only loads 32 images at a time, making it efficient for limited VRAM.

- **Faster Learning (Vectorization):** GPUs are designed for parallel math. As long as the batch fits on the device, processing 32 images in one vectorized pass usually takes far less than 32x the time of processing 1 image, so you get much more work done per step.

- **Escapes "Saddle Points" (Traps):** Batch GD moves too smoothly and can get stuck in flat areas (saddle points) where the error stops decreasing. Mini-batch introduces slight noise (randomness) because every batch is different. This "jitter" helps the model shake itself out of these traps and reach a better solution.
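
To make the "noise" concrete, here is a small sketch under assumed toy data (the regression problem, the batch size of 50, and the helper `mse_grad` are all hypothetical). Each mini-batch gradient points in a slightly different direction, yet averaged over one full pass of equal-sized batches they recover the full-batch gradient.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy regression data.
X = torch.randn(1000, 2)
y = X @ torch.tensor([[2.0], [-3.0]]) + 0.1 * torch.randn(1000, 1)
w = torch.zeros(2, 1)

def mse_grad(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb)^2) with respect to w.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(Xb)

full_grad = mse_grad(X, y, w)

# Shuffle once, then cut the indices into 20 batches of 50 samples each.
perm = torch.randperm(len(X))
batch_grads = [mse_grad(X[idx], y[idx], w) for idx in perm.split(50)]

# Largest deviation of any single batch gradient from the full gradient: the "jitter".
max_dev = max((g - full_grad).abs().max().item() for g in batch_grads)
mean_of_batches = torch.stack(batch_grads).mean(dim=0)

print("full gradient:   ", full_grad.flatten().tolist())
print("mean of batches: ", mean_of_batches.flatten().tolist())
print("max batch deviation:", max_dev)
```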

**Dataset and DataLoader are core abstractions in PyTorch that decouple how you define your data from how you efficiently iterate over it in the training loop.**

In [21]:
from sklearn.datasets import make_classification
import torch

In [None]:
X, y = make_classification(
                            n_samples=10,     # Total rows of data
                            n_features=2,     # Columns (inputs)
                            n_classes=2,      # Number of output classes
                            n_informative=2,  # Features that actually help predict the output
                            n_redundant=0,    # Features that are just linear combinations of the informative ones
                            random_state=42   # Seeds the random number generator so you get the same numbers every time
                        )

In [23]:
X

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [24]:
X.shape

(10, 2)

In [25]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [26]:
y.shape

(10,)

In [27]:
# convert the data into tensors

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

In [28]:
X_tensor

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [29]:
y_tensor

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [30]:
from torch.utils.data import Dataset, DataLoader

### 1. Generating Fake Data 
- What: Creates a synthetic (fake) dataset using Scikit-Learn.

- Why: Useful for testing code when you don't have a real CSV file yet.

- Key Variables:

    - X (Features): The input data (e.g., Patient Age, BP). Shape is (10, 2) → 10 rows, 2 columns.

    - y (Labels): The target answers (e.g., 0=Healthy, 1=Sick). Shape is (10,).

In [None]:
class CustomDataset(Dataset):
    # 1. THE SETUP
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    # 2. THE LENGTH
    def __len__(self):
        return len(self.features)

    # 3. THE FETCHER (The Magic Method)
    def __getitem__(self, index):
        return self.features[index], self.labels[index]

In [32]:
dataset = CustomDataset(X_tensor, y_tensor)

In [33]:
len(dataset)


10

In [34]:
dataset[2]

(tensor([-2.8954,  1.9769]), tensor(0))

In [35]:
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

In [36]:
for batch_features, batch_labels in dataloader:
    print("Batch Features:\n", batch_features)
    print("Batch Labels:\n", batch_labels)
    print("-"*30)  

Batch Features:
 tensor([[ 1.0683, -0.9701],
        [ 1.8997,  0.8344]])
Batch Labels:
 tensor([1, 1])
------------------------------
Batch Features:
 tensor([[-1.1402, -0.8388],
        [ 1.7774,  1.5116]])
Batch Labels:
 tensor([0, 1])
------------------------------
Batch Features:
 tensor([[-1.9629, -0.9923],
        [ 1.7273, -1.1858]])
Batch Labels:
 tensor([0, 1])
------------------------------
Batch Features:
 tensor([[-0.7206, -0.9606],
        [-2.8954,  1.9769]])
Batch Labels:
 tensor([0, 0])
------------------------------
Batch Features:
 tensor([[-0.9382, -0.5430],
        [-0.5872, -1.9717]])
Batch Labels:
 tensor([1, 0])
------------------------------



### 2. Converting to Tensors
- Concept: PyTorch models operate on Tensors, not NumPy arrays, so the arrays must be converted first.

- Crucial Rule for Data Types:

    - X must be float32: Neural networks do math with decimals (weights $\times$ inputs), and layer weights are float32 by default. If you leave X as float64 (the NumPy default), passing it to a float32 layer typically raises a dtype-mismatch error.

    - y must be long (int64): Class labels (Category 0, Category 1) must be integers. Loss functions like CrossEntropyLoss expect integer class indices, not decimals.
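
The dtype rule can be checked directly. This is a small sketch with made-up data (the random array `X_np` and the `nn.Linear(2, 2)` layer are assumptions for illustration): a float64 tensor fed to a default float32 layer raises a `RuntimeError`, while float32 features with long labels work.

```python
import numpy as np
import torch
from torch import nn

# Hypothetical data: NumPy produces float64 by default.
X_np = np.random.default_rng(0).normal(size=(10, 2))
y_np = np.array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

X_f64 = torch.from_numpy(X_np)
print(X_f64.dtype)  # torch.float64

layer = nn.Linear(2, 2)  # parameters are float32 by default

# float64 input vs float32 weights: PyTorch refuses to mix the two.
try:
    layer(X_f64)
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = err
print("error on float64 input:", mismatch_error is not None)

# The fix: float32 features and long (int64) class labels.
X_f32 = torch.tensor(X_np, dtype=torch.float32)
y_long = torch.tensor(y_np, dtype=torch.long)
logits = layer(X_f32)
loss = nn.CrossEntropyLoss()(logits, y_long)
print(logits.shape, loss.item())
```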

### 3. The CustomDataset Class (The "Bookshelf")
- Concept: This class standardizes how your data is stored and accessed. It doesn't load data into the model; it just sits there waiting to be asked for data.

- The 3 Required Methods:

    - `__init__` (The Setup): Runs once. You store your data (tensors) inside self variables here.

    - `__len__` (The Count): Tells PyTorch the total size of the dataset (e.g., "I have 10 rows").

    - `__getitem__` (The Grabber): The most important part. It tells PyTorch: "When I ask for item idx, give me the Feature and Label at that index."
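
For plain in-memory tensors, PyTorch also ships `TensorDataset`, which behaves like the hand-written class. The sketch below uses made-up tensors (`features` and `labels` are assumptions) to show that `len(...)` goes through `__len__` and indexing goes through `__getitem__`, with both versions returning the same pair.

```python
import torch
from torch.utils.data import Dataset, TensorDataset

class CustomDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

# Hypothetical data: 10 rows, 2 features each.
features = torch.arange(20, dtype=torch.float32).reshape(10, 2)
labels = torch.tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

custom = CustomDataset(features, labels)
builtin = TensorDataset(features, labels)

x_c, y_c = custom[2]   # calls CustomDataset.__getitem__(2)
x_b, y_b = builtin[2]  # same row from the built-in equivalent

print(len(custom), len(builtin))  # 10 10
print(x_c, y_c)                   # tensor([4., 5.]) tensor(0)
```

Writing your own class still pays off once `__getitem__` has to do real work, such as reading a file from disk or applying a transform.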

### 4. The DataLoader (The "Delivery Truck")
- Concept: The Dataset sits on the shelf. The DataLoader is the worker that grabs items, packages them, and delivers them to the model.

- batch_size=2: Instead of feeding the model 1 row at a time (too slow) or every row at once (which exhausts memory on real datasets), we feed 2 rows at a time.

- shuffle=True: Crucial. It reshuffles the sample order at the start of every epoch.

- Why? If data is sorted (e.g., all "Cancer" cases first), the model learns the order, not the pattern. Shuffling breaks this bias.
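
A quick way to see what `shuffle=True` does is to load a dataset whose samples are just their own indices. This is a sketch with stand-in data; the seeded `generator` argument is only there to make the run repeatable. Every epoch visits each sample exactly once, in a freshly drawn order.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: each "sample" is simply its own index 0..9.
ids = torch.arange(10)
loader = DataLoader(
    TensorDataset(ids),
    batch_size=2,
    shuffle=True,
    generator=torch.Generator().manual_seed(0),  # seeded only for repeatability
)

epochs = []
for epoch in range(2):
    order = []
    for (batch,) in loader:  # TensorDataset yields 1-element tuples
        order.extend(batch.tolist())
    epochs.append(order)
    print(f"epoch {epoch}: {order}")

print("batches per epoch:", len(loader))  # 5
```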

### 5. The Training Loop
- Action: `for batch_features, batch_labels in dataloader:`

- What happens here:

    1. The loader randomly picks 2 indices (e.g., Index 3 and Index 8).

    2. It uses `__getitem__` to fetch those 2 specific rows.

    3. It stacks them together into a single batch.

- Result: You get a Tensor of shape (2, 2) (2 samples, 2 features) ready for the model.
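
The loop in these notes only prints each batch. Below is a sketch of what a training step inside that loop might look like. The model (`nn.Linear(2, 2)`), the SGD optimizer, the learning rate, and the random stand-in data are all assumptions for illustration, not part of the original notebook.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-in data with the same shapes as the notebook: (10, 2) and (10,).
X = torch.randn(10, 2)
y = torch.tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
loader = DataLoader(TensorDataset(X, y), batch_size=2, shuffle=True)

model = nn.Linear(2, 2)                                   # assumed: 2 features -> 2 class scores
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # assumed optimizer and learning rate

n_updates = 0
for epoch in range(3):
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()                  # clear gradients from the previous step
        logits = model(batch_features)         # shape (2, 2): 2 samples, 2 class scores
        loss = loss_fn(logits, batch_labels)
        loss.backward()                        # gradients for this mini-batch only
        optimizer.step()                       # one weight update per mini-batch
        n_updates += 1

print("weight updates:", n_updates)  # 3 epochs x 5 batches = 15
```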
