## Data Loading

You cannot train a model on your own data until you can load that data.
This notebook covers how the dataloader works in some detail and how you can use it to load your own data.

---

Right now we are going to cover:

 * Custom data - how you can load your own images
 * Normalization - altering image data to make it easier to process
 * Augmentation - altering image data to increase the data quantity

In [None]:
from typing import *

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torch.optim as optim

import numpy as np
from torchvision import datasets, transforms
from tensorboardX import SummaryWriter

from tqdm import tqdm

import PIL

In [None]:
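def to_image(image: torch.Tensor) -> PIL.Image.Image:
    # work on a copy so the caller's tensor is not modified in place
    image = image.detach().cpu().clone()
    # the rescaling to [0, 1] also reverses the normalization (close enough)
    image -= image.min()
    image /= image.max().clamp(min=1e-8)  # avoid dividing by zero for a constant image
    return transforms.functional.to_pil_image(image, 'RGB')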

---

### Custom Data

To build a custom data loader (`torch.utils.data.DataLoader`) you only need a custom dataset (`datasets.VisionDataset`, or the more general `torch.utils.data.Dataset`).
The data loader wraps the dataset and takes care of batching.

Let's start by looking at the CIFAR10 dataset, then work back from there.

In [None]:
train_ds = datasets.CIFAR10(
    'data',
    download=True,
    train=True,
)
# uncomment the following line to see the source code
# train_ds??

In [None]:
train_ds.__len__??

In [None]:
train_ds.__getitem__??

These are the two methods you need to implement: `__len__` and `__getitem__`.
The `__getitem__` method should also apply the transformations, as seen above.

Where does the data in `self.data` come from, though?
It is loaded when the object is created.
Your own dataset does not need to load all the data up front; CIFAR10 can because it is relatively small.

---

So we can write an equivalent dataset:

In [None]:
class ExampleLoader(datasets.VisionDataset):
    def __init__(self, image, label, transform=None, target_transform=None):
        # the transform and target_transform arguments get saved to self as self.transform and self.target_transform
        super().__init__(image, transform=transform, target_transform=target_transform)
        # load or prepare your own data after this
        self.image = PIL.Image.open(image)
        self.label = label
    
    def __len__(self) -> int:
        return 1
    
    def __getitem__(self, index):
        img = self.image
        target = self.label

        if self.transform is not None:
            img = self.transform(img)

        if self.target_transform is not None:
            target = self.target_transform(target)

        return img, target

In [None]:
example_ds = ExampleLoader('massive-data/cat/cat.jpg', 0, transform=transforms.ToTensor())

In [None]:
image, target = next(iter(example_ds))

print(target)
to_image(image)

This is a working dataset and can be added to a dataloader.

In [None]:
example_dl = torch.utils.data.DataLoader(example_ds)
images, targets = next(iter(example_dl))

images.shape, targets

In [None]:
example_dl = torch.utils.data.DataLoader(example_ds, batch_size=4)
images, targets = next(iter(example_dl))

images.shape, targets

As you can see, it works just fine.
Even with `batch_size=4` the batch contains a single image, because the data loader does not repeat the dataset to fill out a batch.

Let's see if we can make this easier to use.
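
The same behaviour shows up whenever the dataset size is not a multiple of the batch size. The sketch below uses a toy `TensorDataset` of ten dummy samples (an assumption made for illustration, not part of this notebook's data) to show the short final batch, and how `drop_last=True` discards it instead.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# ten dummy samples with integer labels 0..9
toy_ds = TensorDataset(torch.zeros(10, 3, 8, 8), torch.arange(10))

# default behaviour: the last batch is simply smaller
batch_sizes = [len(targets) for _, targets in DataLoader(toy_ds, batch_size=4)]

# drop_last=True discards the incomplete final batch instead
full_batch_sizes = [
    len(targets)
    for _, targets in DataLoader(toy_ds, batch_size=4, drop_last=True)
]

print(batch_sizes)       # [4, 4, 2]
print(full_batch_sizes)  # [4, 4]
```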

In [None]:
datasets.DatasetFolder?

In [None]:
class ExampleFolderLoader(datasets.DatasetFolder):
    def __init__(self, folder, transform=None, target_transform=None):
        super().__init__(root=folder, loader=PIL.Image.open, extensions=('jpg',), transform=transform, target_transform=target_transform)

In [None]:
example_folder_ds = ExampleFolderLoader('massive-data', transform=transforms.ToTensor())

In [None]:
image, target = next(iter(example_folder_ds))

print(target) # it has turned this into an index automatically
print(example_folder_ds.classes[target]) # this is how you find the label
to_image(image)

In [None]:
example_folder_no_class_ds = datasets.DatasetFolder(
    root='massive-data',
    loader=PIL.Image.open,
    extensions=('jpg',),
    transform=transforms.ToTensor()
)

In [None]:
image, target = next(iter(example_folder_no_class_ds))

print(target) # it has turned this into an index automatically
print(example_folder_no_class_ds.classes[target]) # this is how you find the label
to_image(image)

---

### The Simplest Dataset

I strongly recommend you use one of the loading techniques described above.

It is possible to use a list as a dataset.
It has a length and can return items by index.
You would have to prepare all the data in advance and hold it all in memory.

In [None]:
list_ds = [
    (transforms.functional.to_tensor(PIL.Image.open('massive-data/cat/cat.jpg')), 0)
]

In [None]:
list_dl = torch.utils.data.DataLoader(list_ds, batch_size=4)
images, targets = next(iter(list_dl))

images.shape, targets

I'm showing you this to demonstrate how simple a dataset really is.
Anything that behaves like a list of `(input, target)` pairs is a dataset.
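
The object does not even have to be a list. As a minimal sketch, the hypothetical `Squares` class below subclasses nothing and implements only `__len__` and `__getitem__`, yet a data loader can still batch it.

```python
import torch
from torch.utils.data import DataLoader


class Squares:
    """Hypothetical list-like object: item i is (tensor([i]), i squared)."""

    def __len__(self) -> int:
        return 5

    def __getitem__(self, index):
        return torch.tensor([float(index)]), index * index


inputs, targets = next(iter(DataLoader(Squares(), batch_size=4)))

print(inputs.shape)  # torch.Size([4, 1])
print(targets)       # tensor([0, 1, 4, 9])
```

The default collation stacks the input tensors along a new batch dimension and converts the integer targets into a single tensor.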

Using the "real" dataset classes makes it easier to apply the transformations, and we will see how valuable those are next.