# DataLoaders: Pipeline for feeding the training data

In [45]:
import mototaxi_utils as mutils
import torch
import torchvision

print(f'Pytorch version {torch.__version__}')

Pytorch version 2.1.2


## Splitting the images into three datasets

The training data is organized in two sets: images **with** and **without** mototaxis. They are stored in the following directory structure.
```
 mototaxi_training_images/
 ├── mototaxi/ (558 images)
 └── no_mototaxi/ (558 images)
```
Our goal is to split all 1116 collected images into three mutually exclusive datasets: train, validation (a.k.a. development), and test.

This could be done by calling `torchvision.datasets.ImageFolder()` three times, once per folder holding the training, validation, and test images. However, we want to keep all images in a single folder rather than sorting them manually into three. We therefore use `mutils.custom_random_split()`, a customized version of `torch.utils.data.random_split()` (discussed below), to split the contents of `mototaxi_training_images` into three datasets holding 70%, 20%, and 10% of the images for train, validation, and test. The generator's seed is fixed with `manual_seed()` so that the random sampling stays reproducible for debugging.
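The split sizes follow from simple arithmetic. The sketch below assumes that `custom_random_split()` sizes its subsets according to the rule documented for fractional lengths in `torch.utils.data.random_split()`: floor each share, then hand out any leftover items one at a time in round-robin order. The helper name `split_lengths` is hypothetical and is not part of `mototaxi_utils`.

```python
import math


def split_lengths(total, fractions):
    """Turn fractions into integer subset sizes that sum to `total`.

    Hypothetical helper: mirrors the rule documented for fractional lengths
    in torch.utils.data.random_split (floor, then round-robin remainder).
    """
    lengths = [math.floor(total * frac) for frac in fractions]
    remainder = total - sum(lengths)
    # Distribute the leftover items one by one, starting with the first split.
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


print(split_lengths(1116, (0.7, 0.2, 0.1)))  # [782, 223, 111]
```

Under this rule the floors add up to 1115, so the single leftover image goes to the first (training) split, giving it 782 images rather than 781.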

In [49]:
imagenet_mean = [0.485, 0.456, 0.406]  # standard ImageNet per-channel (RGB) means
imagenet_std = [0.229, 0.224, 0.225]

img_transforms = {
    'train': torchvision.transforms.Compose([
        torchvision.transforms.Resize(224),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize(imagenet_mean, imagenet_std)
    ])
}

img_dir = '~/Downloads/dldata/mototaxi_training_images'
img_dataset = torchvision.datasets.ImageFolder(root=img_dir, transform=img_transforms["train"])
train_dataset, val_dataset, test_dataset = mutils.custom_random_split(
    img_dataset,
    (0.7, 0.2, 0.1),
    generator=torch.Generator().manual_seed(42),
)
print(f'train_dataset: Number of images={len(train_dataset)}, type {type(train_dataset)}')
print(f'val_dataset: Number of images={len(val_dataset)}, type {type(val_dataset)}')
print(f'test_dataset: Number of images={len(test_dataset)}, type {type(test_dataset)}')

print(img_dataset.class_to_idx)

train_dataset: Number of images=782, type <class 'mototaxi_utils.CustomSubset'>
val_dataset: Number of images=223, type <class 'mototaxi_utils.CustomSubset'>
test_dataset: Number of images=111, type <class 'mototaxi_utils.CustomSubset'>
{'mototaxi': 0, 'no_mototaxi': 1}


The three datasets are instances of `CustomSubset`, a customized version of `torch.utils.data.Subset` discussed below. These dataset objects are used to create DataLoaders.

## Building the DataLoader

A dry run of the training process is useful for inspecting which images (by index) make up each minibatch in each epoch, as printed in the output cell below. PyTorch's standard `random_split()` and `Subset` do not return these indices alongside the images in each minibatch, so we customized two components (see `mototaxi_utils.py`) to enable this:
- Function `custom_random_split()`, based on `torch.utils.data.random_split()`
- Class `CustomSubset`, based on `torch.utils.data.Subset`

These customizations return the indices of the selected images, as seen in the line
> `inputs_and_y, data_indices = minibatch_data`
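The actual implementation lives in `mototaxi_utils.py` and is not reproduced here. As a rough illustration of the idea only, the sketch below shows a subset wrapper whose `__getitem__` returns each sample together with its index in the full dataset. It is written in plain Python, with no PyTorch dependency, and the class name `IndexedSubset` and the toy dataset are hypothetical.

```python
class IndexedSubset:
    """Hypothetical stand-in for CustomSubset: a view of `dataset` limited to `indices`."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, position):
        original_index = self.indices[position]
        # Return the sample together with its index in the full dataset.
        return self.dataset[original_index], original_index


# Toy dataset of (input, label) pairs standing in for (image, class) tuples.
full_dataset = [(f"img_{i}", i % 2) for i in range(10)]
subset = IndexedSubset(full_dataset, indices=[7, 2, 5])

(inputs, y), data_index = subset[0]
print(inputs, y, data_index)  # img_7 1 7
```

When items of this shape pass through a DataLoader's default collation, the integer indices are gathered into a tensor. That is consistent with the `data_indices.tolist()` call in the dry run.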

In [47]:
num_samples_per_epoch = 25 
batch_size = 10 
num_epochs = 4 
num_workers = 0
test_random_sampler = torch.utils.data.RandomSampler(data_source=test_dataset,
                                                     replacement=False,
                                                     num_samples=num_samples_per_epoch,
                                                     generator=torch.Generator().manual_seed(42)
                                                     )

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False,
                                          num_workers=num_workers,
                                          sampler=test_random_sampler
                                          )

## Inspecting the DataLoader: Training dry-run

In [48]:

for epoch in range(num_epochs):
    print(f'Epoch {epoch} ------------------')
    for minibatch_index, minibatch_data in enumerate(test_loader):
        inputs_and_y, data_indices = minibatch_data
        inputs, y = inputs_and_y
        print(f'#{minibatch_index}:', data_indices.tolist())
    print()

Epoch 0 ------------------
#0: [448, 176, 175, 962, 1008, 1111, 729, 63, 749, 1065]
#1: [167, 995, 606, 367, 663, 1109, 708, 703, 548, 66]
#2: [1071, 776, 115, 965, 846]

Epoch 1 ------------------
#0: [277, 82, 392, 448, 1065, 1058, 435, 1046, 982, 808]
#1: [184, 776, 587, 66, 796, 516, 802, 54, 292, 426]
#2: [167, 846, 713, 63, 94]

Epoch 2 ------------------
#0: [184, 817, 516, 868, 708, 384, 167, 606, 75, 331]
#1: [91, 965, 661, 277, 990, 1033, 55, 237, 995, 799]
#2: [802, 63, 917, 683, 776]

Epoch 3 ------------------
#0: [1071, 808, 990, 82, 779, 943, 482, 422, 182, 1111]
#1: [1109, 817, 645, 796, 907, 348, 984, 95, 94, 485]
#2: [1097, 460, 237, 184, 963]


The output cell above confirms that:
- Each epoch uses only 25 images rather than all 111 images of the test dataset, which is the default. `num_samples_per_epoch` thus gives finer control over turnaround time (TAT).
- No image is repeated across batches within the same epoch, since the sampler draws without replacement.
- Each epoch draws a different random sample, although some images recur across epochs (for example, index 448 appears in both epoch 0 and epoch 1).
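These properties can be reproduced with a small simulation. The sketch below mimics, but does not call, `RandomSampler` with `replacement=False`. It uses Python's `random` module in place of `torch.Generator`, so the specific indices differ from the output above. The function name `epoch_batches` is hypothetical.

```python
import random


def epoch_batches(indices, num_samples, batch_size, rng):
    """Draw `num_samples` distinct indices, then cut them into minibatches."""
    drawn = rng.sample(indices, num_samples)  # sampling without replacement
    return [drawn[i:i + batch_size] for i in range(0, num_samples, batch_size)]


rng = random.Random(42)          # one generator shared across epochs
test_indices = list(range(111))  # placeholder positions for the 111 test images

for epoch in range(4):
    batches = epoch_batches(test_indices, num_samples=25, batch_size=10, rng=rng)
    flat = [i for batch in batches for i in batch]
    assert len(set(flat)) == len(flat)       # no repeats within an epoch
    print(epoch, [len(b) for b in batches])  # batch sizes: [10, 10, 5]
```

Because a single generator carries its state from one epoch to the next, each epoch draws a fresh sample. Overlap between epochs is possible but not guaranteed.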
