# 📘 Lesson 10 — Data Loading & Preprocessing: Preparing Real Datasets

---

### 🎯 Why this lesson matters
Models are only as good as the data they learn from.  
So far, we’ve used toy data — now we handle **real datasets** like images or text.  

👉 PyTorch’s **Dataset** and **DataLoader** classes handle indexing, batching, shuffling, and parallel loading.  
Preprocessing (augmentation, normalization) usually improves generalization and training stability.  

This lesson explains WHY well-prepared data helps reduce overfitting and speeds up training.


In [1]:
# Setup
import torch
from torch.utils.data import Dataset, DataLoader
import torchvision
import torchvision.transforms as transforms
import numpy as np
torch.manual_seed(42)


## 1) What is Dataset & DataLoader?

- **Dataset**: Interface to access data (e.g., images + labels).
- **DataLoader**: Wraps Dataset for batching, shuffling, parallel loading.

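👉 WHY? Hand-written batching loops are slow and error-prone; DataLoader batches, shuffles, and can load samples in parallel worker processes (`num_workers`), so the model spends less time waiting for data.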


In [2]:
# MNIST example
transform = transforms.ToTensor()
dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=2)

images, labels = next(iter(dataloader))
print("Batch images shape:", images.shape)  # [batch, channels, height, width]
print("Batch labels shape:", labels.shape)


Batch images shape: torch.Size([64, 1, 28, 28])
Batch labels shape: torch.Size([64])
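
To see what batching and shuffling actually do without downloading anything, here is a minimal sketch on a synthetic dataset. The names `toy_dataset`, `loader`, and `strict_loader` are made up for illustration; `TensorDataset` is a small built-in Dataset that wraps tensors.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset: 10 samples, 3 features each
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
targets = torch.arange(10)
toy_dataset = TensorDataset(features, targets)

# A seeded generator makes the shuffle order reproducible
generator = torch.Generator().manual_seed(0)
loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, generator=generator)

# 10 samples with batch_size=4 gives two full batches and one leftover batch
batch_sizes = [len(y) for _, y in loader]
print("Batch sizes:", batch_sizes)
print("Number of batches:", len(loader))

# drop_last=True discards the incomplete final batch
strict_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True, drop_last=True)
print("Batches with drop_last:", len(strict_loader))

# Shuffling changes the order, but one pass still visits every sample once
seen = torch.cat([y for _, y in loader])
print("All samples seen once:", sorted(seen.tolist()) == list(range(10)))
```

Whether to use `drop_last` is a judgment call: it keeps every batch the same size, at the cost of skipping a few samples per epoch.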


## 2) Custom Dataset

- Inherit from `Dataset` and implement `__len__` and `__getitem__`.

👉 WHY custom? For your own data (CSV, folders).


In [3]:
class CustomDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Deterministic toy data so the printed values are reproducible
data = torch.arange(10, dtype=torch.float32)
labels = data * 2
custom_ds = CustomDataset(data, labels)
print("Data point:", custom_ds[0][0], "Label:", custom_ds[0][1])


Data point: tensor(0.) Label: tensor(0.)
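
A custom Dataset becomes more useful once it accepts a per-sample transform and is wrapped in a DataLoader. The sketch below is one possible way to do that; `TransformedDataset` and its `transform` argument are hypothetical names chosen for this example, not part of PyTorch.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TransformedDataset(Dataset):
    """Hypothetical variant of CustomDataset with an optional per-sample transform."""
    def __init__(self, data, labels, transform=None):
        if len(data) != len(labels):
            raise ValueError("data and labels must have the same length")
        self.data = data
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform is not None:
            sample = self.transform(sample)   # applied lazily, one sample at a time
        return sample, self.labels[idx]

samples = torch.arange(12, dtype=torch.float32).reshape(6, 2)   # 6 samples, 2 features
classes = torch.tensor([0, 1, 0, 1, 0, 1])
scaled_ds = TransformedDataset(samples, classes, transform=lambda x: x / 10)

# The default collation stacks individual samples into batch tensors
loader = DataLoader(scaled_ds, batch_size=3)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)
```

Applying the transform inside `__getitem__` means the raw data stays untouched and random transforms can produce a different result each epoch.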


## 3) Data Augmentation — More Variety

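- Random flips, rotations, and crops create "new" variations of each training image on the fly.
- Only use augmentations that keep the label valid: a mirrored cat is still a cat, but a mirrored digit or letter may not be.

👉 WHY? It reduces overfitting and makes the model more robust to the variations it will see in practice.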


In [4]:
aug_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Per-channel mean, std (CIFAR-10 images are RGB)
])

aug_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=aug_transform)


## 4) Normalization & Standardization

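- Standardization rescales data to roughly mean=0, std=1 by subtracting the dataset mean and dividing by the dataset std.
- `Normalize((0.5,), (0.5,))` computes `(x - 0.5) / 0.5`, which maps pixels from [0, 1] to [-1, 1]; the result is centered near zero but is not exactly mean=0, std=1.

👉 WHY? Inputs on a consistent, zero-centered scale usually speed up convergence and make training more stable.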


In [5]:
norm_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

norm_dataset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=norm_transform)
img, _ = norm_dataset[0]
# Pixels start in [0, 1]; (x - 0.5) / 0.5 rescales them to [-1, 1]
print("Min:", img.min(), "Max:", img.max())


Min: tensor(-1.) Max: tensor(1.)
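
The 0.5 values passed to `Normalize` are convenient placeholders. To get close to mean=0, std=1, the statistics are usually measured on the training set itself. A minimal sketch on synthetic data, assuming the images are already stacked into one `[N, C, H, W]` tensor (the `images` tensor here is random and made up for illustration):

```python
import torch

# Synthetic stand-in for a stack of grayscale images with values in [0, 0.6)
torch.manual_seed(0)
images = torch.rand(256, 1, 28, 28) * 0.6   # shape [N, C, H, W]

# Per-channel statistics: reduce over every dimension except the channel one
mean = images.mean(dim=(0, 2, 3))
std = images.std(dim=(0, 2, 3))

# Same arithmetic as transforms.Normalize: (x - mean) / std, per channel
normalized = (images - mean.view(1, -1, 1, 1)) / std.view(1, -1, 1, 1)

print("Before:", images.mean().item(), images.std().item())
print("After: ", normalized.mean().item(), normalized.std().item())
```

The measured `mean` and `std` could then be passed to `transforms.Normalize`. Compute them on the training split only and reuse the same numbers for validation and test data.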


## 5) Practice Exercises

- Create DataLoader for custom CSV data.
- Apply transforms to your own images.


In [6]:
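# Practice: Custom CSV Dataset
import pandas as pd

class CSVDataset(Dataset):
    """Assumes every column is numeric and the last column holds the label."""
    def __init__(self, csv_file):
        df = pd.read_csv(csv_file)
        # Convert once up front; float32 matches the default dtype of model weights
        self.features = torch.tensor(df.iloc[:, :-1].to_numpy(dtype=np.float32))
        self.labels = torch.tensor(df.iloc[:, -1].to_numpy())

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]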
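
To try a CSV-backed Dataset end to end without a file on disk, `pd.read_csv` also accepts a file-like object. The sketch below is self-contained and uses a hypothetical `NumericCSVDataset` class plus made-up column names and values; it assumes numeric features with an integer class label in the last column.

```python
import io
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class NumericCSVDataset(Dataset):
    """Hypothetical CSV dataset: numeric feature columns, integer label in the last column."""
    def __init__(self, csv_file):
        frame = pd.read_csv(csv_file)
        self.features = torch.tensor(frame.iloc[:, :-1].to_numpy(dtype=np.float32))
        self.labels = torch.tensor(frame.iloc[:, -1].to_numpy(dtype=np.int64))

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# In-memory CSV with made-up values, so no file is needed
csv_text = "height,width,label\n1.0,2.0,0\n3.0,4.0,1\n5.0,6.0,0\n7.0,8.0,1\n"
csv_ds = NumericCSVDataset(io.StringIO(csv_text))
csv_loader = DataLoader(csv_ds, batch_size=2, shuffle=False)

batches = list(csv_loader)
for xb, yb in batches:
    print(xb.dtype, tuple(xb.shape), yb.tolist())
```
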


## 📚 Summary

✅ What we learned:
- Dataset for data access.
- DataLoader for efficient loading.
- Augmentation for variety.
- Normalization for stability.

🚀 Next Lesson: **Training Loop & Validation** — monitoring model performance.
