
## Generating validation split

- Many standard datasets ship with only a training and a test split, not a validation split. A validation set is needed for model tuning so that the test set can remain untouched.
- It is convenient to have a way to split off part of the training set as a validation set when needed, and to merge it back into the training set afterwards.
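
The split-and-merge idea can be sketched on a toy dataset before touching MNIST. This is a minimal illustration, not the notebook's own code: the tensors `features` and `labels` and the sizes (100 examples, 10 held out) are made up, while `Subset` and `ConcatDataset` are the standard `torch.utils.data` wrappers.

```python
import torch
from torch.utils.data import ConcatDataset, Subset, TensorDataset

# Toy stand-in for a training set: 100 examples with 4 features each.
features = torch.randn(100, 4)
labels = torch.randint(0, 10, (100,))
full_train = TensorDataset(features, labels)

# Hold out the first 10 examples for validation, keep the rest for training.
valid_part = Subset(full_train, range(0, 10))
train_part = Subset(full_train, range(10, 100))

# Merge the two parts back into a single 100-example training set.
merged = ConcatDataset([train_part, valid_part])

print(len(train_part), len(valid_part), len(merged))  # 90 10 100
```

Neither wrapper copies data; both only redirect indices to the underlying dataset, so splitting and merging are cheap.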

## A typical dataset (here: MNIST)

In [2]:
import torch
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

BATCH_SIZE = 64

In [4]:
## MNIST DATASET

# Note: transforms.ToTensor() scales input images
# to the 0-1 range

train_dataset = datasets.MNIST(root='data',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)

test_dataset = datasets.MNIST(root='data',
                              train=False,
                              transform=transforms.ToTensor())

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          shuffle=True,
                          num_workers=4)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=BATCH_SIZE,
                         num_workers=4,
                         shuffle=False)

# checking dataset

for features, labels in train_loader:
    print(features.size(), labels.size())
    break

torch.Size([64, 1, 28, 28]) torch.Size([64])


In [12]:
print(f"Total number of training examples: {train_dataset.data.size(0)}")

Total number of training examples: 60000


## Subset method

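Reserve the first 1000 training examples for validation and use the remaining 59000 examples as the new training set. Note: Subset itself does not shuffle anything. The training data is reshuffled before each epoch only because the training DataLoader below is created with shuffle=True.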

In [13]:
from torch.utils.data import Subset

In [14]:
valid_indices = torch.arange(0, 1000)
train_indices = torch.arange(1000, 60000)

train_and_valid = datasets.MNIST(root='data',
                                 train=True,
                                 transform=transforms.ToTensor(),
                                 download=True)

train_dataset = Subset(train_and_valid, train_indices)
valid_dataset = Subset(train_and_valid, valid_indices)
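
The index ranges above take the first 1000 examples for validation. If a dataset might be stored in a meaningful order (for example, sorted by class), drawing the validation indices at random is a safer default. A minimal sketch of that variant, where the seed value 123 and the names `random_valid_indices` and `random_train_indices` are illustrative assumptions:

```python
import torch

NUM_EXAMPLES = 60000   # size of the MNIST training set
NUM_VALID = 1000       # examples reserved for validation

# A dedicated generator keeps the split reproducible without touching the global RNG.
generator = torch.Generator().manual_seed(123)  # seed value is arbitrary
shuffled = torch.randperm(NUM_EXAMPLES, generator=generator)

random_valid_indices = shuffled[:NUM_VALID]
random_train_indices = shuffled[NUM_VALID:]

# The two index sets should be disjoint and together cover the whole training set.
assert set(random_valid_indices.tolist()).isdisjoint(random_train_indices.tolist())
print(len(random_train_indices), len(random_valid_indices))  # 59000 1000
```

These tensors could be passed to Subset in place of the two torch.arange ranges.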

In [15]:
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          shuffle=True)

valid_loader = DataLoader(dataset=valid_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          shuffle=False)

In [17]:
## Checking the dataset
for images, labels in valid_loader:
    print('Image batch dimension: ', images.size())
    print('Image label dimension: ', labels.size())
    break

Image batch dimension:  torch.Size([64, 1, 28, 28])
Image label dimension:  torch.Size([64])
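
The first batch holds 64 examples, but the final batch of a pass is smaller whenever the dataset size is not a multiple of the batch size, because DataLoader keeps partial batches by default (drop_last=False). This matters for the checks further down that loop to the end of a loader and print labels from the last batch. The arithmetic can be sketched in plain Python; `batch_layout` is a hypothetical helper, not a PyTorch function:

```python
import math

BATCH_SIZE = 64

def batch_layout(num_examples, batch_size):
    """Return (number of batches, size of the last batch) when drop_last=False."""
    num_batches = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (num_batches - 1) * batch_size
    return num_batches, last_batch

print(batch_layout(1000, BATCH_SIZE))    # validation split: (16, 40)
print(batch_layout(59000, BATCH_SIZE))   # training split: (922, 56)
```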


In [19]:
# Check that shuffling works properly,
# i.e., the labels should appear in random order.
# The label order should also differ in the second epoch.
# (The printed labels come from the last batch of each pass.)

for images, labels in train_loader:
    pass
print(labels[:10])

for images, labels in train_loader:
    pass
print(labels[:10])

tensor([0, 0, 7, 5, 0, 9, 8, 1, 4, 7])
tensor([3, 9, 2, 8, 5, 0, 6, 4, 3, 2])


In [20]:
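## Checking that the validation order is reproducible.
## valid_loader uses shuffle=False, so both passes return the same labels
## regardless of the seed; seeding only matters for shuffled loaders such as train_loader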

torch.manual_seed(1)
for images, labels in valid_loader:
    pass
print(labels[:10])

torch.manual_seed(1)
for images, labels in valid_loader:
    pass
print(labels[:10])

tensor([5, 1, 7, 8, 5, 0, 3, 4, 7, 7])
tensor([5, 1, 7, 8, 5, 0, 3, 4, 7, 7])
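
The effect of a fixed seed is only visible on a loader that shuffles. A minimal sketch on a toy dataset, where `toy_dataset`, `shuffled_loader`, and `epoch_order` are illustrative names and the items are just the integers 0 to 19 so the visiting order is easy to read:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset whose items are simply 0..19.
toy_dataset = TensorDataset(torch.arange(20))
shuffled_loader = DataLoader(toy_dataset, batch_size=5, shuffle=True)

def epoch_order(loader):
    """Collect the items of one full pass over the loader, in visiting order."""
    return [int(x) for (batch,) in loader for x in batch]

torch.manual_seed(1)
first_pass = epoch_order(shuffled_loader)

torch.manual_seed(1)
second_pass = epoch_order(shuffled_loader)

# Resetting the global seed before each pass reproduces the same shuffled order.
print(first_pass == second_pass)
```

Without the second torch.manual_seed(1) call, the second pass would be drawn from a different RNG state and would normally visit the items in a different order.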


## SubsetRandomSampler Method

Compared to the Subset method, the SubsetRandomSampler method is a more convenient solution if we want to apply different transformations to the training and validation sets.
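
The mechanism can be sketched with a toy dataset: two dataset instances expose the same items with different transforms, and disjoint samplers decide which instance serves which indices. `ToyDataset` and the two lambda transforms are hypothetical stand-ins for MNIST and its augmentation pipelines.

```python
import torch
from torch.utils.data import DataLoader, Dataset, SubsetRandomSampler

class ToyDataset(Dataset):
    """Hypothetical stand-in for MNIST: item i is the number i, passed through a transform."""
    def __init__(self, size, transform):
        self.size = size
        self.transform = transform

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return self.transform(torch.tensor(float(index)))

# Two views of the same 10 items, each with its own transform.
train_view = ToyDataset(10, transform=lambda x: x + 0.5)  # stands in for training augmentation
valid_view = ToyDataset(10, transform=lambda x: x)        # stands in for the validation transform

# Disjoint index lists decide which view serves which items.
toy_train_loader = DataLoader(train_view, batch_size=4,
                              sampler=SubsetRandomSampler(list(range(2, 10))))
toy_valid_loader = DataLoader(valid_view, batch_size=4,
                              sampler=SubsetRandomSampler(list(range(0, 2))))

train_seen = sorted(x.item() for batch in toy_train_loader for x in batch)
valid_seen = sorted(x.item() for batch in toy_valid_loader for x in batch)
print(train_seen)  # items 2..9, each shifted by 0.5
print(valid_seen)  # items 0 and 1, unchanged
```

With the Subset method, both subsets wrap the same dataset object and therefore share its single transform, which is why separate instances are used here.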

In [21]:
from torch.utils.data import SubsetRandomSampler

In [31]:
train_indices = torch.arange(1000, 60000)
valid_indices = torch.arange(1000)

train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(valid_indices)

training_transform = transforms.Compose([transforms.Resize((32, 32)),
                                         transforms.RandomCrop((28, 28)),
                                         transforms.ToTensor()])

valid_transform = transforms.Compose([transforms.Resize((32, 32)),
                                      transforms.CenterCrop((28, 28)),
                                      transforms.ToTensor()])

train_dataset = datasets.MNIST(root='data',
                               train=True,
                               transform=training_transform,
                               download=True)

valid_dataset = datasets.MNIST(root='data',
                               train=True,
                               transform=valid_transform,
                               download=False)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          sampler=train_sampler)

valid_loader = DataLoader(dataset=valid_dataset,
                          batch_size=BATCH_SIZE,
                          num_workers=4,
                          sampler=valid_sampler)

test_loader = DataLoader(dataset=test_dataset,
                         batch_size=BATCH_SIZE,
                         num_workers=4,
                         shuffle=False)

In [25]:
for images, labels in valid_loader:
    print('Image batch dimensions: ', images.shape)
    print('Label batch dimensions: ', labels.shape)
    break

Image batch dimensions:  torch.Size([64, 1, 28, 28])
Label batch dimensions:  torch.Size([64])


In [26]:
## Checking the shuffle
for image, labels in valid_loader:
    break
print(labels[:10])

for image, labels in valid_loader:
    break
print(labels[:10])

tensor([5, 2, 6, 1, 4, 9, 3, 5, 0, 3])
tensor([5, 8, 4, 5, 2, 9, 6, 4, 8, 6])
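
The two printed tensors differ because SubsetRandomSampler draws a new permutation of the validation indices on every pass. That does not affect aggregate metrics such as accuracy. If a fixed validation order is wanted while keeping a separate validation transform, one option (an assumption, not what this notebook does) is to wrap the validation dataset instance in a Subset and iterate with shuffle=False. A toy sketch with illustrative names:

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for the dataset instance that carries the validation transform.
valid_view = TensorDataset(torch.arange(100))

# Restrict it to the validation indices and iterate without shuffling.
fixed_valid = Subset(valid_view, range(0, 10))
fixed_valid_loader = DataLoader(fixed_valid, batch_size=4, shuffle=False)

def first_batch(loader):
    """Return the first batch of a loader as a plain list."""
    (batch,) = next(iter(loader))
    return batch.tolist()

print(first_batch(fixed_valid_loader))  # [0, 1, 2, 3]
print(first_batch(fixed_valid_loader))  # [0, 1, 2, 3]
```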


In [28]:
## checking for consistency with random seeding
torch.manual_seed(0)

for image, labels in valid_loader:
    break
print(labels[:10])

torch.manual_seed(0)
for image, labels in valid_loader:
    break
print(labels[:10])

tensor([3, 4, 1, 1, 2, 3, 8, 1, 4, 9])
tensor([3, 4, 1, 1, 2, 3, 8, 1, 4, 9])
