## Importing Libraries


torch – the main PyTorch library for building and training neural networks.

numpy – used for numerical operations and array manipulations.

torchvision.datasets and torchvision.transforms – for loading and preprocessing image datasets.

TensorDataset and DataLoader from torch.utils.data – to wrap tensors (multi-dimensional arrays) into a dataset and load it efficiently in batches during training.

train_test_split from sklearn.model_selection – to split our dataset into training and testing subsets.

In [2]:
import torch
import numpy as np
from torchvision import datasets, transforms
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split

## Defining Data Transformations

#### transforms.ToTensor()
Converts the image (originally a PIL image or NumPy array) into a PyTorch tensor and automatically scales pixel values from [0, 255] → [0, 1].

#### transforms.Lambda(lambda x: x.flatten())
Flattens each image tensor from shape (1, 28, 28) (grayscale 28×28 pixels) into a 1D vector of 784 values.

This is useful for models that expect input as a single vector rather than a 2D image (like a simple linear model).

In [7]:
linear_model_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.flatten())
])

## Loading and Preparing the MNIST Dataset

In this step, we load the MNIST handwritten digits dataset and prepare it for training.

#### datasets.MNIST
Loads the MNIST dataset directly from torchvision.

    train=True → loads the training split (60,000 images)

    train=False → loads the test split (10,000 images)

    transform=linear_model_transforms → applies the preprocessing we defined earlier (convert to tensor + flatten) whenever a sample is read by index, e.g. train_set_full[0]

#### Combining both splits
The code merges the training and test datasets into one full dataset to later perform a custom stratified split.

X_combined = torch.cat([...], dim=0).numpy()

y_combined = torch.cat([...], dim=0).numpy()


    torch.cat([...], dim=0) → concatenates tensors along the first dimension (stacking all samples together).

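    .data.float().div(255).flatten(start_dim=1) → reads the raw uint8 image tensor of shape (N, 28, 28), converts it to floating point, scales pixel values from 0–255 to 0–1, and flattens each image so the result has shape (N, 784). Because .data bypasses the transform pipeline, this reproduces the ToTensor and flatten steps by hand.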

    .numpy() → converts PyTorch tensors to NumPy arrays for later use with scikit-learn’s train_test_split.
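
The chain of operations above can be tried on small synthetic tensors without downloading anything. In this sketch, `fake_train` and `fake_test` are assumed stand-ins for `train_set_full.data` and `test_set_full.data`; only their sizes differ from the real data.

```python
import torch

# Stand-ins for the raw image tensors (assumed layout: N x 28 x 28, dtype uint8)
fake_train = torch.randint(0, 256, (6, 28, 28), dtype=torch.uint8)
fake_test = torch.randint(0, 256, (4, 28, 28), dtype=torch.uint8)

X_demo = torch.cat([
    fake_train.float().div(255).flatten(start_dim=1),  # (6, 784), values in [0, 1]
    fake_test.float().div(255).flatten(start_dim=1),   # (4, 784), values in [0, 1]
], dim=0).numpy()

print(X_demo.shape)  # (10, 784)
print(X_demo.dtype)  # float32
```

With the real MNIST tensors the same steps should give an array of shape (70000, 784).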

In [None]:
train_set_full = datasets.MNIST('./data', train=True, download=True, transform=linear_model_transforms)
test_set_full = datasets.MNIST('./data', train=False, download=True, transform=linear_model_transforms)

X_combined = torch.cat([train_set_full.data.float().div(255).flatten(start_dim=1),
                        test_set_full.data.float().div(255).flatten(start_dim=1)], dim=0).numpy()
y_combined = torch.cat([train_set_full.targets, test_set_full.targets], dim=0).numpy()



## Stratified Train–Validation–Test Split

Now that we have the full MNIST dataset, we divide it into training, validation, and test sets — while keeping the class distribution balanced across all subsets.

#### Split Proportions

    Training set: 60% (≈ 42,000 samples)

    Validation set: 20% (≈ 14,000 samples)

    Test set: 20% (≈ 14,000 samples)

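To achieve this, we perform the split in two steps using scikit-learn’s train_test_split: first hold out 40% of the data, then divide that held-out portion in half to obtain the validation and test sets. Passing the stratify argument in both steps keeps the digit proportions the same in all three sets.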
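
The arithmetic behind the two steps can be checked without touching the data. The variable names below are illustrative and do not appear in the notebook.

```python
total = 70_000            # 60,000 training + 10,000 test images combined
held_out_fraction = 0.4   # first split: validation and test together
second_split = 0.5        # second split: half of the held-out part for each

n_held_out = round(total * held_out_fraction)
n_train = total - n_held_out
n_test = round(n_held_out * second_split)
n_val = n_held_out - n_test

print(n_train, n_val, n_test)  # 42000 14000 14000
```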

#### Why Stratify is Important

Imagine you’re working with the MNIST dataset — it has 10 digit classes (0–9), each with about the same number of samples.
If you split a dataset randomly without stratification, the class proportions in each subset can drift away from those of the full dataset, for example:

Training set: more 0’s and 1’s, fewer 8’s and 9’s

Test set: some digits under-represented (or, in a small or heavily imbalanced dataset, missing entirely)

That would bias the model and make the evaluation unreliable.

By setting stratify=y, the function ensures each class appears in the same proportion in all subsets (train, validation, and test).
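
A small toy example makes the effect easier to see. The labels below are an assumed, deliberately imbalanced dataset (90 samples of class 0, 10 of class 1), not MNIST.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 0 and 10 samples of class 1
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Class counts per subset: both keep the original 90/10 ratio
print(np.bincount(y_tr).tolist())  # [72, 8]
print(np.bincount(y_te).tolist())  # [18, 2]
```

Without `stratify`, the number of class-1 samples in the test set would depend on the random seed and could be zero.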

#### Convert Back to Tensors

After splitting, all NumPy arrays are converted back to PyTorch tensors using:

    X_train, X_val, X_test = map(torch.tensor, (X_train, X_val, X_test))
    y_train, y_val, y_test = map(torch.tensor, (y_train, y_val, y_test))
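
The conversion keeps the NumPy dtypes, which can be verified on tiny stand-in arrays. `X_np` and `y_np` below are hypothetical placeholders for one split of the data.

```python
import numpy as np
import torch

X_np = np.zeros((3, 784), dtype=np.float32)  # stand-in for a slice of X_combined
y_np = np.array([7, 2, 1], dtype=np.int64)   # stand-in for the matching labels

X_t, y_t = map(torch.tensor, (X_np, y_np))

print(X_t.dtype, y_t.dtype)  # torch.float32 torch.int64

# torch.tensor copies the data, so editing the array afterwards
# leaves the tensor unchanged
X_np[0, 0] = 1.0
print(X_t[0, 0].item())  # 0.0
```

If sharing memory with the NumPy array is preferred over copying, `torch.from_numpy` is the usual alternative.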

In [8]:
RANDOM_SEED = 42
TEST_VALID_SIZE = 0.4  # fraction held out for validation and test combined

# Step 1: keep 60% for training and hold out 40%, stratified by digit label
X_train, X_temp, y_train, y_temp = train_test_split(
    X_combined, y_combined,
    test_size=TEST_VALID_SIZE,
    random_state=RANDOM_SEED,
    stratify=y_combined
)

# Step 2: split the held-out 40% in half (20% validation, 20% test)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=RANDOM_SEED,
    stratify=y_temp
)

# Convert the NumPy arrays back to PyTorch tensors
X_train, X_val, X_test = map(torch.tensor, (X_train, X_val, X_test))
y_train, y_val, y_test = map(torch.tensor, (y_train, y_val, y_test))

n_total = len(X_combined)  # 70,000 samples in the combined dataset
print(f"Train set size: {len(X_train)} ({len(X_train)/n_total*100:.0f}%)")
print(f"Validation set size: {len(X_val)} ({len(X_val)/n_total*100:.0f}%)")
print(f"Test set size: {len(X_test)} ({len(X_test)/n_total*100:.0f}%)")

Train set size: 42000 (60%)
Validation set size: 14000 (20%)
Test set size: 14000 (20%)


## Creating DataLoaders

Once the dataset is split into training, validation, and test sets, we wrap them in PyTorch DataLoaders.
These objects handle batching, shuffling, and efficient iteration during training, so the training loop does not have to slice the tensors by hand.

#### Batch Size
Each batch contains 64 samples.
This value is a common default — it balances training speed and memory usage, but it can be tuned later for performance optimization.
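
To see what a batch size of 64 means for the split sizes used here, the sketch below counts the batches per epoch. It assumes the DataLoader default of keeping the final, smaller batch (drop_last=False); `split_sizes` and `summary` are illustrative names.

```python
import math

BATCH_SIZE = 64
split_sizes = {"train": 42_000, "val": 14_000, "test": 14_000}

summary = {}
for name, n in split_sizes.items():
    n_batches = math.ceil(n / BATCH_SIZE)            # batches per epoch
    last_batch = n - (n_batches - 1) * BATCH_SIZE    # size of the final batch
    summary[name] = (n_batches, last_batch)
    print(f"{name}: {n_batches} batches, last batch has {last_batch} samples")

# train: 657 batches, last batch has 16 samples
# val: 219 batches, last batch has 48 samples
# test: 219 batches, last batch has 48 samples
```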

#### Creating TensorDatasets
    train_dataset = TensorDataset(X_train, y_train)
    val_dataset = TensorDataset(X_val, y_val)
    test_dataset = TensorDataset(X_test, y_test)

Each TensorDataset pairs input tensors (X_*) with their corresponding labels (y_*), forming mini-datasets that PyTorch can iterate over.

#### Creating DataLoaders
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

shuffle=True for the training set → ensures the model sees data in a different order each epoch (helps generalization).

shuffle=False for validation/test → keeps order consistent for evaluation.
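
The difference between the two settings can be observed on synthetic data. In this sketch the label of each sample is simply its index, an assumption that makes the iteration order visible.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in data: 150 samples, label equal to the sample index
X_demo = torch.rand(150, 784)
y_demo = torch.arange(150)
demo_dataset = TensorDataset(X_demo, y_demo)

ordered_loader = DataLoader(demo_dataset, batch_size=64, shuffle=False)
shuffled_loader = DataLoader(demo_dataset, batch_size=64, shuffle=True)

# Without shuffling, batches follow the dataset order
batch_sizes = [len(labels) for _, labels in ordered_loader]
print(batch_sizes)  # [64, 64, 22]

first_inputs, first_labels = next(iter(ordered_loader))
print(first_inputs.shape)         # torch.Size([64, 784])
print(first_labels[:5].tolist())  # [0, 1, 2, 3, 4]

# With shuffling, the order changes but every sample appears once per epoch
seen = torch.cat([labels for _, labels in shuffled_loader])
print(sorted(seen.tolist()) == list(range(150)))  # True
```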

#### Optional Dictionary Storage
    DATA_LOADERS = {
        'train': train_loader,
        'val': val_loader,
        'test': test_loader
    }

A convenient way to store all loaders together for easier access across multiple notebooks or assignment parts (e.g., A2, A3).
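
One way such a dictionary might be used is to loop over all splits with the same code. The sketch below uses a hypothetical `make_loader` helper and random stand-in data, so `DEMO_LOADERS` is not the notebook's `DATA_LOADERS`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

def make_loader(n_samples, shuffle):
    """Hypothetical helper: builds a loader over random stand-in data."""
    data = TensorDataset(
        torch.rand(n_samples, 784),
        torch.randint(0, 10, (n_samples,)),
    )
    return DataLoader(data, batch_size=64, shuffle=shuffle)

DEMO_LOADERS = {
    'train': make_loader(300, shuffle=True),
    'val': make_loader(100, shuffle=False),
    'test': make_loader(100, shuffle=False),
}

# The same loop body works for every split
sample_counts = {
    split: sum(len(labels) for _, labels in loader)
    for split, loader in DEMO_LOADERS.items()
}
print(sample_counts)  # {'train': 300, 'val': 100, 'test': 100}
```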

In [None]:
BATCH_SIZE = 64 

train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

DATA_LOADERS = {
    'train': train_loader,
    'val': val_loader,
    'test': test_loader
}