## How to use

Use this URL to access:

**shorturl.at/bevwE**

To use this notebook, you first need to create a copy in your own personal Google Drive (as shown in the first picture of the first practical session). **Then switch the runtime to GPU so that your models train on the GPU, which reduces training time** (as shown in the second picture of the first practical session).

# Deep learning : **_Data_**

In this session, we go over how to prepare data before feeding it to a neural network.

![Data overview](https://www.rocq.inria.fr/cluster-willow/tchabal/courses/springschool2022/overview_data.png)

_Illustration created by Thomas Chabal and Clément Riu, 2022._

## Setting up the notebook

We first download code that will be useful for this session.

As we are going to use a GPU, do not forget to activate one for this notebook. To do so, go to _Runtime_ > _Change runtime type_ and select _GPU_ in the _Hardware accelerator_ drop-down.

In [None]:
%cd /content
!wget https://www.rocq.inria.fr/cluster-willow/tchabal/courses/springschool2022/casablanca_course.zip && \
  unzip casablanca_course.zip && \
  rm casablanca_course.zip
%cd casablanca_course
!python setup.py install

## Create a dataset

### Download images

In [None]:
# Download data and uncompress it

!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
!wget http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
!mkdir -p samples
!gunzip -c t10k-labels-idx1-ubyte.gz > samples/t10k-labels-idx1-ubyte
!gunzip -c t10k-images-idx3-ubyte.gz > samples/t10k-images-idx3-ubyte

!pip install python-mnist

In [None]:
from mnist import MNIST
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image

data_dir = "custom_dataset"
NB_IMAGES = 1000


def uncompress_ds():
  mndata = MNIST('samples')
  images, labels = mndata.load_testing()

  p = Path(data_dir)

  images_subset = images[:NB_IMAGES]
  labels_subset = labels[:NB_IMAGES]

  LABELS = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
  LABELS_DIR = { idx: label.replace(" ", "-").replace("/", "--") for idx, label in enumerate(LABELS) }
  for _, lbl in LABELS_DIR.items():
    (p / lbl).mkdir(parents=True, exist_ok=True)

  def convert_to_img(image, label, idx):
    img = np.array(image).reshape((28, 28)).astype(np.uint8)
    lbl = LABELS_DIR[label]
    file_name = p / lbl / f"{lbl}_{idx}.png"
    Image.fromarray(img).save(file_name, "PNG")

  _ = [convert_to_img(img, label, idx) for idx, (img, label) in enumerate(zip(images_subset, labels_subset))]

uncompress_ds()

### Visualize a few images

Our images are stored in the `custom_dataset` directory. These are images of various clothes from categories that we list below:

In [None]:
!ls custom_dataset

When developing a deep learning application, knowing the data is crucial. It should guide your choices of model architecture, training parameters, data augmentation, etc.

We visualize a few images in what follows.

In [None]:
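# Keep only the class directories, in a reproducible order
class_dirs = sorted(p for p in Path(data_dir).glob("*") if p.is_dir())
nb_classes = len(class_dirs)

nb_cols = 5
nb_rows = int(np.ceil(nb_classes / nb_cols))

# Create a single figure holding the whole grid
# (a separate plt.figure call would only add an empty figure)
plt.subplots(nb_rows, nb_cols, figsize=(15, 6))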
for idx, category in enumerate(class_dirs):
  plt.subplot(nb_rows, nb_cols, idx + 1)
  imgs = [p for p in category.glob("*")]
  img_path = np.random.choice(imgs)
  img = np.array(Image.open(img_path))
  plt.imshow(img, cmap="gray")
  plt.axis("off")
  label = category.stem
  plt.title(label)
  if idx == 0:
    print(f"Images are of shape {img.shape}")

As shown above, our images are small (28x28 pixels) and have a single color channel.

### From directories of images to dataset

Neural networks process several images at once. When trained in a "supervised" way, they compute predictions from the input images and compare them to labels.

We must therefore create a structure adapted to these massive computations: a **dataset**.

In PyTorch, datasets have a simple structure. They are implemented as classes with 3 methods:
- `__init__` initializes the class;
- `__len__` returns the size of the dataset, i.e. the number of images it contains;
- `__getitem__` prepares one input, and its label in the case of supervised learning.

In [None]:
import torch
from torch.utils.data import Dataset
from torchvision.io import read_image

class CustomImageDataset(Dataset):
    def __init__(self, img_paths, categories, transform=None):
        self.img_paths = img_paths
        self.categories = categories
        self.labels = { category: idx for idx, category in enumerate(self.categories) }
        self.transform = transform

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img_path = self.img_paths[idx]
        image = read_image(str(img_path))
        obj_category = img_path.parent.name
        label = self.labels[obj_category]
        if self.transform:
            image = self.transform(image)
        return image, label

Here, the dataset class takes as input the list of paths to our images as well as the list of object categories. Note that the image directories can be structured differently, in which case we would adapt how the image paths and categories are built.

Let's now get this data:

In [None]:
img_dir = Path(data_dir)
# Sort paths and categories so that label indices are the same on every run
img_paths = sorted(p for p in img_dir.rglob("*") if p.is_file())
categories = sorted(p.name for p in img_dir.glob("*") if p.is_dir())

custom_dataset = CustomImageDataset(img_paths, categories)
print(f"The dataset is made of {len(custom_dataset)} samples.")

In [None]:
idx = 42
sample = custom_dataset[idx]
img, label = sample
print(f"A sample is composed of an image of type {type(img)} and shape {img.shape}, and of a label (here: {label})")

plt.imshow(img[0, :, :], cmap="gray")
plt.title(f"Label: {label}")
plt.show()

The creation of a torch Dataset with thousands of images is very fast. Indeed, images are not all loaded in memory when the dataset is created. Instead, each image is opened and processed only when `__getitem__` is called, for instance through `custom_dataset[0]`. During the training of a neural network, this loading can become a bottleneck if `__getitem__` is expensive to compute.
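The lazy behaviour described above can be illustrated without PyTorch. The sketch below is a toy stand-in, not a torch class: `LazyDataset` and its `nb_loads` counter are hypothetical names, and the "loading" is simulated by building a string.

```python
class LazyDataset:
    """Toy stand-in for a dataset; counts how many samples were actually loaded."""

    def __init__(self, paths):
        # Only the paths are stored: creating the dataset loads no image
        self.paths = list(paths)
        self.nb_loads = 0

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The (simulated) loading happens here, one sample at a time
        self.nb_loads += 1
        return f"pixels of {self.paths[idx]}"


lazy_dataset = LazyDataset(f"img_{i}.png" for i in range(10_000))
loads_after_creation = lazy_dataset.nb_loads
first_sample = lazy_dataset[0]
loads_after_access = lazy_dataset.nb_loads
print(f"{len(lazy_dataset)} samples, {loads_after_creation} loaded at creation, "
      f"{loads_after_access} loaded after one access")
```

Creating the object only stores the paths, which is why building a dataset stays cheap even with many images.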

### Split in train, validation and test sets

To train a model and properly evaluate its performance, we need to split our data into 3 sets:
- The _training_ set gathers the images that are fed to the network and from which it learns, i.e. for which it predicts values and updates its weights.
- The _validation_ set gathers other images for which the network only predicts values, without modifying its weights. We measure the _accuracy_ of the model on this validation set, which tells us how well the model learns and generalizes to unseen images. This set guides the training and serves as the reference to decide when to stop it.
- The _test_ set gathers yet other images. We also measure the accuracy on this set, but it must not influence the training, i.e. no decision such as stopping the training should be made by looking at it. In public benchmarks, the labels of these images are often withheld to avoid cheating. The purpose of the test set is to provide an impartial evaluation of the model.
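The accuracy mentioned above is the share of samples for which the predicted class equals the label. A minimal sketch in plain Python, where the `accuracy` helper and the example values are made up for illustration:

```python
def accuracy(predictions, labels):
    """Share of predictions that match the labels, between 0 and 1."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute the accuracy of an empty set")
    nb_correct = sum(int(pred == lbl) for pred, lbl in zip(predictions, labels))
    return nb_correct / len(labels)


# Hypothetical predicted and true class indices for 5 validation images
val_predictions = [0, 3, 3, 9, 1]
val_labels = [0, 3, 2, 9, 7]
val_accuracy = accuracy(val_predictions, val_labels)
print(f"Validation accuracy: {val_accuracy:.0%}")
```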

Let us separate our data into these sets:

In [None]:
# Define the percentage of samples to attribute to the train, validation or test sets
SHARES = {
    "train": 70,
    "val": 20,
    "test": 10,
}

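# Shuffle the paths before splitting: they are listed directory by directory,
# so slicing them in order would put entire classes in a single set
rng = np.random.default_rng(0)
shuffled_paths = [img_paths[i] for i in rng.permutation(len(img_paths))]

# Compute this split
nb_train_imgs = int(SHARES["train"] * len(shuffled_paths) / 100)
nb_val_imgs = int(SHARES["val"] * len(shuffled_paths) / 100)

training_paths = shuffled_paths[:nb_train_imgs]
validation_paths = shuffled_paths[nb_train_imgs:nb_train_imgs + nb_val_imgs]
test_paths = shuffled_paths[nb_train_imgs + nb_val_imgs:]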

training_dataset = CustomImageDataset(training_paths, categories)
validation_dataset = CustomImageDataset(validation_paths, categories)
test_dataset = CustomImageDataset(test_paths, categories)

print("Our sets are made of:")
print(f"- {len(training_dataset)} images for the training set,")
print(f"- {len(validation_dataset)} images for the validation set,")
print(f"- {len(test_dataset)} images for the test set.")
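Whatever the way the split is computed, it is worth checking that every class is represented in each set. The sketch below is one possible sanity check. It assumes the `<dataset>/<class>/<file>` layout used in this notebook, and the helper names and example paths are hypothetical.

```python
from collections import Counter
from pathlib import Path


def class_counts(paths):
    """Count images per class, the class being the name of the parent directory."""
    return Counter(Path(p).parent.name for p in paths)


def missing_classes(paths, categories):
    """Return the categories that have no image at all in the given paths."""
    counts = class_counts(paths)
    return sorted(c for c in categories if counts[c] == 0)


# Hypothetical paths, following the <dataset>/<class>/<file> layout
example_categories = ["Bag", "Coat", "Dress"]
example_train = ["ds/Bag/Bag_0.png", "ds/Bag/Bag_1.png", "ds/Coat/Coat_2.png"]
example_val = ["ds/Coat/Coat_3.png", "ds/Dress/Dress_4.png"]

print(class_counts(example_train))
print("Missing in train:", missing_classes(example_train, example_categories))
print("Missing in val:", missing_classes(example_val, example_categories))
```

With the real data, one could call `missing_classes(training_paths, categories)` and expect an empty list.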

### Create a dataloader

The dataset we defined allows us to load images one by one and always in the same order. Instead, during training, we randomize the order in which the network sees images, and we also process several images at once. To do so, we wrap our datasets in another torch structure: the `DataLoader`.

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(validation_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)

We can take a look at these structures:

In [None]:
for inputs, labels in train_dataloader:
  print(f"Input tensors are of shape: {inputs.shape}")
  print(f"Labels for this batch are: {labels}")
  break

Inputs in PyTorch take the form of `torch.Tensor` objects, which are multi-dimensional arrays. In the shape of a tensor, like `(8, 1, 28, 28)`, the first dimension indicates the number of samples (or images here) contained in the tensor, and the other dimensions describe the content of each sample (here the pixel values).

Here, our tensor is of shape `(8, 1, 28, 28)`. The first dimension, of size `8`, indicates that the tensor contains 8 images, which is the _batch size_ (defined in the previous cell). The images are of shape `(1, 28, 28)`, which means 28 pixels x 28 pixels and 1 color channel.

Accordingly, the labels tensor is of size `8`: the label of an image is a single number, and we consider 8 images in this batch, hence the size 8.
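To make the link between batching and tensor shapes concrete, here is a simplified NumPy sketch of what a data loader does conceptually: shuffle the indices, group them into batches, and stack the samples. It is not PyTorch's implementation; `iterate_batches` and the fake data are assumptions made for illustration.

```python
import numpy as np


def iterate_batches(images, labels, batch_size, shuffle, seed=0):
    """Yield (image_batch, label_batch) pairs, optionally in a random order."""
    indices = np.arange(len(images))
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[start:start + batch_size]
        # Stacking the samples adds the leading "batch" dimension
        batch_images = np.stack([images[i] for i in batch_idx])
        batch_labels = np.array([labels[i] for i in batch_idx])
        yield batch_images, batch_labels


# 20 fake grayscale images of shape (channels, height, width) = (1, 28, 28)
fake_images = [np.zeros((1, 28, 28), dtype=np.uint8) for _ in range(20)]
fake_labels = list(range(20))

batch_shapes = [
    (x.shape, y.shape)
    for x, y in iterate_batches(fake_images, fake_labels, batch_size=8, shuffle=True)
]
print(batch_shapes)
```

With 20 samples and a batch size of 8, the last batch only holds the 4 remaining samples. `DataLoader` behaves the same way by default, since its `drop_last` argument defaults to `False`.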

### Common datasets

Here, we have generated a dataset and a dataloader from personal images that were stored in one of our directories. As you have seen, this can be quite tedious.

However, many public image datasets are already available through Torchvision and can be loaded very easily. For instance, the images of clothes we are working with come from a dataset called [_"Fashion MNIST"_](https://github.com/zalandoresearch/fashion-mnist), which we can load with the following lines:

In [None]:
from torchvision.datasets import FashionMNIST

downloaded_path = "./downloaded_datasets"
train_dataset = FashionMNIST(downloaded_path, train=True, download=True)
test_dataset = FashionMNIST(downloaded_path, train=False, download=True)

In [None]:
idx = 42
image, label = train_dataset[idx]

plt.imshow(image, cmap="gray")
plt.title(f"Label: {label}")

You can find a list of some of these datasets [here](https://pytorch.org/vision/stable/datasets.html).

## Data augmentation

### Define a sequence of transformations

In [None]:
# Utils function for following cells
def compare_transforms(transforms):
  base_train_dataset = FashionMNIST(downloaded_path, train=True, download=True)
  transformed_train_dataset = FashionMNIST(downloaded_path, train=True, transform=transforms, download=True)

  idx = 42

  plt.subplot(121)
  img, lbl = base_train_dataset[idx]
  plt.imshow(img, cmap="gray")
  plt.title(f"Base image (label: {lbl})")

  plt.subplot(122)
  img, lbl = transformed_train_dataset[idx]
  plt.imshow(img, cmap="gray")
  plt.title(f"Transformed image (label: {lbl})")

Training neural networks on images requires presenting a very large diversity of images to the model. In practice, we often have a limited amount of data available. To artificially enlarge these datasets, we can resort to what is called _data augmentation_.

The principle is simple: we randomly apply transformations to our images to simulate a different point of view, different lighting, object, camera parameters, etc. Torchvision, a library used jointly with PyTorch, provides tools that are very easy to apply:

In [None]:
from torchvision import transforms

# Define transforms to apply and visualize the results.
# The pipeline gets its own name so that it does not shadow the `transforms` module.
augmentations = torch.nn.Sequential(
    transforms.RandomHorizontalFlip(0.5),
    transforms.RandomRotation(35),
    transforms.CenterCrop(20),
)

compare_transforms(augmentations)

You can run the previous cell several times and obtain a different result every time.

You can find a list of the available augmentations [here](https://pytorch.org/vision/stable/transforms.html).
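To see what such transformations do to the pixels, here is a from-scratch NumPy sketch of a random horizontal flip and a center crop. These functions are written for illustration only and are not Torchvision's implementation, which may handle some cases differently, such as odd sizes.

```python
import numpy as np


def random_horizontal_flip(img, p, rng):
    """Mirror a (height, width) image left-right with probability p."""
    if rng.random() < p:
        return img[:, ::-1]
    return img


def center_crop(img, size):
    """Keep the central size x size window of a (height, width) image."""
    height, width = img.shape
    top = (height - size) // 2
    left = (width - size) // 2
    return img[top:top + size, left:left + size]


rng = np.random.default_rng(0)
image = np.arange(28 * 28).reshape(28, 28)

flipped = random_horizontal_flip(image, p=1.0, rng=rng)    # p=1: always flipped
untouched = random_horizontal_flip(image, p=0.0, rng=rng)  # p=0: never flipped
cropped = center_crop(image, 20)
print(flipped.shape, cropped.shape)
```

Because the flip is drawn at random on every call, the same image can yield different training samples from one epoch to the next.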

### Evaluate the impact of the augmentation on the performance

We evaluate in the next cell how much data augmentation improves the accuracy of the model, here with a random horizontal flip and a color jitter:

In [None]:
import torch
from torchvision import transforms
from casablanca_course.data import train_for_transforms

torch.manual_seed(0)

dataset = "fashionmnist"
train_share = 15

def train(transformations):
  train_for_transforms(transformations, dataset=dataset, batch_size=32, n_epochs=6, train_share=train_share, show_images=True)

# Without augmentations
print("=" * 30, "\nWITHOUT AUGMENTATIONS\n")
no_transforms = transforms.ToTensor()
train(no_transforms)

# With some augmentations
print("=" * 30, "\nWITH AUGMENTATIONS\n")
train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomHorizontalFlip(0.5),
    transforms.ColorJitter(brightness = 0.15, contrast = 0.2, saturation = 0.2),
])
train(train_transforms)

## Testing dataset size and data augmentation

### Dataset size

Neural networks need large amounts of data to be trained properly.

In the next cell, we train a model on the CIFAR10 dataset, a dataset of small images from 10 categories, including cars, planes and horses. We can adjust the share of the training set on which the model is trained through the variable `train_share`. Run the cell several times with various train shares and look at how the validation loss evolves depending on the number of samples fed to the network.

In [None]:
from torchvision import transforms
from casablanca_course.data import train_for_transforms

train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Change the share below (the value x means x% of the train set)
train_share = 100 # %

train_for_transforms(train_transforms, batch_size=32, n_epochs=2, train_share=train_share, show_images=True)
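How `train_for_transforms` applies `train_share` internally is not shown in this notebook. One plausible way to keep only a share of a training set is to draw a random subset of indices, as in the hypothetical sketch below (`subsample_indices` is an illustrative name, not part of the course code).

```python
import random


def subsample_indices(nb_samples, share_percent, seed=0):
    """Pick share_percent % of the indices 0..nb_samples-1, without repetition."""
    if not 0 < share_percent <= 100:
        raise ValueError("share_percent must be in (0, 100]")
    nb_kept = int(nb_samples * share_percent / 100)
    return sorted(random.Random(seed).sample(range(nb_samples), nb_kept))


# The CIFAR10 training set contains 50,000 images
kept = subsample_indices(nb_samples=50_000, share_percent=15)
print(f"Training on {len(kept)} of 50000 samples")
```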

### Data augmentation

As explained previously, data augmentation affects the performance of the trained model. This is visible in the validation loss.

In the next cell, change the transformations of `train_transforms` to increase the diversity of images in the training set and check the impact on the validation loss.

You can find other usable augmentations [here](https://pytorch.org/vision/stable/transforms.html).

In [None]:
train_transforms = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_for_transforms(train_transforms, batch_size=32, n_epochs=4, train_share=100, show_images=True)
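The `Normalize` step used in the cell above computes `(value - mean) / std` for each channel. With a mean and standard deviation of 0.5, pixel values in [0, 1] are mapped to [-1, 1]. The NumPy sketch below reproduces this arithmetic on a fake image; `normalize` is written here for illustration and is not Torchvision's code.

```python
import numpy as np


def normalize(img, mean, std):
    """Apply (img - mean) / std per channel on a (channels, height, width) array."""
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1, 1)
    return (img - mean) / std


# A fake RGB image whose values span [0, 1], like the output of ToTensor
fake_rgb = np.linspace(0.0, 1.0, 3 * 32 * 32, dtype=np.float32).reshape(3, 32, 32)
normalized = normalize(fake_rgb, (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
print(normalized.min(), normalized.max())
```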

## References

- [Pytorch documentation for Dataset and DataLoader](https://pytorch.org/docs/stable/data.html)
- [Common datasets in Pytorch](https://pytorch.org/vision/stable/datasets.html)
- [Documentation for data augmentation](https://pytorch.org/vision/main/transforms.html)
- [Many public datasets used in research](https://paperswithcode.com/datasets)

_Practical session written by Thomas Chabal and Clément Riu • Spring School on Data Science - Ecole Centrale Casablanca • 2022._