In [1]:
%load_ext notexbook

In [2]:
%texify

# Dataset and Transform in PyTorch

In this notebook we aim to understand the main concepts of **Data handling** and **Data Encapsulation** in PyTorch. 

In particular, we will work with two example case studies: (a) _Hourly Energy Consumption_ dataset for time series forecasting; and (b) _Facial Emotion Recognition_ for image classification.

The main takeaways from this lecture are: 
- learn how to convert (tabular) data in Pandas/NumPy in a format compliant with PyTorch;
- explore multiple data encapsulation strategies in PyTorch (and their corresponding pros/cons);
- understand the basic principles of data loading in PyTorch;
- master data partitioning for ML in PyTorch.

## Case Study 1: Hourly Energy Consumption

The _Hourly Energy Consumption_ dataset (available on [Kaggle](https://www.kaggle.com/robikscube/hourly-energy-consumption)) contains power consumption data across different regions of the United States, recorded on an hourly basis.

In [None]:
# Download and uncompress data
import os
from torchvision.datasets.utils import download_and_extract_archive

DOWNLOAD_ROOT = "./data"
os.makedirs(DOWNLOAD_ROOT, exist_ok=True)

ENERGY_DATASET_URL = "https://www.dropbox.com/s/m21e9cb66cqgjuu/hourly_energy_consumption.zip?dl=1"
ENERGY_DATASET_MD5 = "444c4a8e037897a248aeab64328c2b29"
ENERGY_DATASET_FILE= "data_energy.zip"
ENERGY_DATASET_FOLDER = os.path.join(DOWNLOAD_ROOT, "data_energy")

In [None]:
download_and_extract_archive(url=ENERGY_DATASET_URL, download_root=DOWNLOAD_ROOT, 
                             extract_root=ENERGY_DATASET_FOLDER,
                             filename=ENERGY_DATASET_FILE)

In [None]:
os.listdir(ENERGY_DATASET_FOLDER)

We have a total of `12` `.csv` files containing hourly energy trend data (`'est_hourly.paruqet'` and `'pjm_hourly_est.csv'` will not be used).

Let's have a look at a couple of them to see whether the data format is shared. We will have a sneak peek at the `AEP_hourly.csv` and `PJME_hourly.csv` files.

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.read_csv(os.path.join(ENERGY_DATASET_FOLDER, "AEP_hourly.csv")).head(n=10)

In [None]:
pd.read_csv(os.path.join(ENERGY_DATASET_FOLDER, "PJME_hourly.csv")).head(n=10)

Apparently, the data format is consistent between two (randomly) picked data files. Let's now have a more thorough look!

### Data Preprocessing and Preparation

In our next step, we will be reading these files and pre-processing these data in this order:

1. **Process Date/time data**: Get the time data of each individual time step and extract individual fields, so we can better partition the data (this is a common practice with date-time data). In particular:
    - Hour of the day (0 - 23)
    - Day of the week (0 - 6, as returned by pandas `dayofweek`)
    - Month (1 - 12)
    - Day of the year (1 - 366)

2. **Generate time series**: Group the data into sequences to be used as inputs to the model, along with corresponding labels
    - The sequence length (or `lookback` period) is the number of data points in history that the model will use to make the prediction
    - The label will be the next data point in time after the last one in the input sequence
    
3. **Data Partition**: Split the inputs and labels into training and test sets

4. **Data Scaling**: Scale the data to values between `0` and `1`. Algorithms tend to perform better or converge faster when features are on a similar scale and/or close to normally distributed.
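
As a minimal sketch of step 2, assuming a toy univariate series rather than the energy data, the sliding-window construction could look like this (the `make_windows` helper name is hypothetical and only used for illustration):

```python
import numpy as np

def make_windows(series: np.ndarray, lookback: int):
    """Build (samples, lookback) inputs and next-step labels from a 1D series."""
    samples = len(series) - lookback
    X = np.stack([series[i:i + lookback] for i in range(samples)])
    y = series[lookback:]  # label = value right after each window
    return X, y

toy = np.arange(6)  # 0, 1, 2, 3, 4, 5
X, y = make_windows(toy, lookback=3)
print(X)  # three windows: [0 1 2], [1 2 3], [2 3 4]
print(y)  # their labels: [3 4 5]
```

Each row of `X` is one input sequence, and the matching entry of `y` is the value that immediately follows it in time.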

In [None]:
# select datafiles
from functools import partial 
full_path = partial(os.path.join, ENERGY_DATASET_FOLDER)

datafile_paths = map(lambda f: full_path(f), os.listdir(ENERGY_DATASET_FOLDER))
datafiles = filter(lambda fp: fp.endswith(".csv") and not fp.endswith("pjm_hourly_est.csv"), datafile_paths)

In [None]:
def read_data(filepath: str) -> pd.DataFrame:
    """
    This function splits datetime information into multiple fields. 
    The new fields will be added as new columns of the input dataframe.
    
    New cols will be: Hour, DayofTheWeek, Month, DayOfTheYear
    """
    # Read data in Pandas DataFrame
    df = pd.read_csv(filepath, parse_dates=[0], names=["datetime", "energy_consumed"], header=0)
    df["hour"] = df.apply(lambda x: x["datetime"].hour, axis=1)
    df['dayofweek'] = df.apply(lambda x: x["datetime"].dayofweek,axis=1)
    # Complete for Month and Day of the Year
    
    
    df = df.sort_values("datetime").drop("datetime",axis=1)
    return df

In [None]:
LOOKBACK = 90
from typing import Tuple
def generate_time_series(df: pd.DataFrame, lookback: int = LOOKBACK) -> Tuple[np.ndarray, np.ndarray]:
    # Our time series data will be of shape (samples, lookback, features)
    data = df.values
    samples = len(data)-lookback
    features = df.shape[1]
    
    X_seq = np.zeros((samples, lookback, features))
    y_seq = np.zeros(samples)
    for i in range(lookback, len(data)):
        X_seq[i-lookback] = data[(i-lookback):i]
        y_seq[i-lookback] = data[i, 0]  # get only corresponding energy consumed
        
    X_seq = X_seq.reshape(-1, lookback, features)
    y_seq = y_seq.reshape(-1, 1)
    return X_seq, y_seq

In [None]:
Partition = Tuple[np.ndarray, np.ndarray]
def train_test_partition(X_seq: np.ndarray, y_seq: np.ndarray, test_size:float = 0.1) -> Tuple[Partition, Partition]:
    """Partition input data sequence in train and test sets"""
    test_size_idx = int(test_size * len(y_seq))
    X_train, X_test = X_seq[:-test_size_idx], X_seq[-test_size_idx:]
    y_train, y_test = y_seq[:-test_size_idx], y_seq[-test_size_idx:]
    return (X_train, X_test), (y_train, y_test)

In [None]:
from sklearn.preprocessing import MinMaxScaler

def apply_scaling(X: Partition, y: Partition) ->  Tuple[Partition, Partition]:
    """Apply Feature Scaling to Time Series"""
    X_train, X_test = X
    y_train, y_test = y
    # Flatten sequence (MinMaxScaler only supports 2D data)
    _, lookback, features = X_train.shape
    X_train, X_test = X_train.reshape(-1, features), X_test.reshape(-1, features)
    
    feat_scaler = MinMaxScaler()
    lab_scaler = MinMaxScaler()
    
    # Complete HERE
    
    
    X_train, X_test = feat_scaler.transform(X_train), feat_scaler.transform(X_test)
    y_train, y_test = lab_scaler.transform(y_train), lab_scaler.transform(y_test)
    
    # Rollback sequences
    X_train, X_test = X_train.reshape(-1, lookback, features), X_test.reshape(-1, lookback, features)
    return (X_train, X_test), (y_train, y_test)

In [None]:
X_train, X_test, y_train, y_test = None, None, None, None
for i, filepath in enumerate(datafiles):
    print(f"Processing File [{i+1}]: {os.path.split(filepath)[1]}...", end="")
    # Step 1: read data, and re-format date/time fields
    df = read_data(filepath=filepath)
    # Step 2: generate time sequence
    X_seq, y_seq = generate_time_series(df)
    # Step 3: data partition
    (X_tr, X_ts), (y_tr, y_ts) = train_test_partition(X_seq, y_seq)
    if X_train is None:
        X_train, X_test = X_tr, X_ts
        y_train, y_test = y_tr, y_ts
    else:
        X_train = np.concatenate((X_train, X_tr))
        X_test = np.concatenate((X_test, X_ts))
        y_train = np.concatenate((y_train, y_tr))
        y_test = np.concatenate((y_test, y_ts))
    print("...done")

# Step 4: Apply Feature Scaling
(X_train, X_test), (y_train, y_test) = apply_scaling((X_train, X_test), (y_train, y_test))
print("Feature Scaled!")

In [None]:
type(X_train), type(y_train)

In [None]:
X_train.shape, y_train.shape

**Brilliant**! At this stage, we have prepared our data, which is ready to be used for our <ins>Machine</ins> learning algorithm (_emphasis_ on _machine_). In other words, we have data stored in **NumPy** arrays, which is indeed the preferred data format for `scikit-learn` but not immediately suitable for PyTorch.

**However**, PyTorch has extensively _pledged_ full support and compatibility with NumPy. In fact, one thing we could immediately do is to convert NumPy arrays into `torch.Tensor` (via the `from_numpy` utility function). 
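
A small sketch of that conversion, on a toy array rather than on `X_train`: `torch.from_numpy` does not copy the data, and it keeps the NumPy `dtype` (so `float64` arrays become `float64` tensors, which often need a cast to `float32` before being fed to a model).

```python
import numpy as np
import torch

arr = np.zeros((2, 3))           # NumPy defaults to float64
t = torch.from_numpy(arr)        # no copy: tensor and array share memory
arr[0, 0] = 1.0                  # a change in the array is visible in the tensor
print(t.dtype, t[0, 0].item())   # torch.float64 1.0

t32 = t.float()                  # float32 copy, the default dtype of torch layers
print(t32.dtype)                 # torch.float32
```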

But, there is **more**[1](#fn1), because in Deep Learning training we need _more_ and _better_ abstractions for data handling than just passing the full dataset. 

A few reasons for this:
- Mini-batch learning for better model convergence: in each epoch, we will process the data into batches
    - usually _shuffled_ to cope with overfitting;
    - balanced batch preparation (i.e. _sampling_) in case of **imbalanced data**;
    - performant (i.e. _parallel_) batch data preparation for performance speed-up[2](#fn2)
    
- GPU memory is limited: data could be too large to fit into the memory of a GPU (so _batches_, again)
- **last but not least**: Data Augmentation to increase variability in training data (_more on this, later_)

<span id="fn1">**[1]**: It must have been! Otherwise, we would have not dedicated an entire notebook on the subject, _ed._</span>
<span id="fn2">**[2]**: You know, they say _the appetite comes with eating_, _ed._</span>

In *PyTorch-like* pseudocode (*not much different from the real one, ed.*), the training algorithm would be:

```python
for epoch in range(NUM_EPOCHS):
    for batch in iter(dataset, sampling=no_replacement, shuffle=True):
        X, y_true = batch               # batch is a tuple (samples, labels)
        optimizer.zero_grad()           # zero the gradient of the optimizer
        y_hat = model(X)                # forward pass
        loss = criterion(y_hat, y_true) # calculate errors
        loss.backward()                 # backward pass
        optimizer.step()                # optimisation step
```

So, we need a <ins>`Dataset`</ins> abstraction that is **subscriptable** (e.g. `dataset[i]`, or 
`dataset[batch_start:batch_end]`), and **iterable**. 

In particular, the iteration protocol should be flexible enough to adapt to different requirements, e.g. *shuffling* is required or not, sampling *with* or *without* replacement.
In these circumstances, we would presumably need a different `object` to just deal with this iteration protocol for a given input `dataset`.
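
To make this separation of concerns concrete, here is a minimal pure-Python sketch of such an iteration object. The `iterate_batches` function is hypothetical and is **not** how PyTorch implements it; it only illustrates the idea of decoupling *what* the data is (anything subscriptable) from *how* it is visited (batched, optionally shuffled, without replacement).

```python
import random
from typing import Iterator, List, Sequence

def iterate_batches(dataset: Sequence, batch_size: int,
                    shuffle: bool = False, seed: int = 0) -> Iterator[List]:
    """Yield lists of samples, visiting every index exactly once per pass."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)  # sampling without replacement
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

toy_dataset = [(x, x * x) for x in range(10)]  # (sample, label) pairs
batches = list(iterate_batches(toy_dataset, batch_size=4, shuffle=True))
print([len(b) for b in batches])  # [4, 4, 2]
```

The dataset only has to support `len()` and indexing; every policy decision (batch size, shuffling) lives in the iterating function.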

### From `Bunch` to `Dataset`

If we are transitioning from `sklearn` (ML) to `torch` (DL), we are used to thinking of our ML data in terms of `numpy` arrays (and other supported variants, i.e. `scipy.sparse.csr` or `pandas.DataFrame`).

However, it is now clear that we need something more.

`sklearn` indeed has its own general [`dataset` API](https://scikit-learn.org/stable/datasets/index.html#general-dataset-api) whenever a new dataset is loaded:

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()

print(type(iris))

In [None]:
from sklearn.utils import Bunch

Bunch?

A `sklearn.utils.Bunch` object is a dictionary-like container that exposes its keys as attributes (*brilliant, ed.*):

In [None]:
iris.keys()

In [None]:
X = iris.data  # same as iris["data"]

This is definitely a first step towards a more OOP-oriented data encapsulation, and *metadata* are handled brilliantly with this abstraction. 

However, we still need a more flexible strategy to handle data *in-memory*.

### Introducing `torch.utils.data`

(_from the doc_)

> At the heart of PyTorch data loading utility is the [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class. It represents a Python iterable over a **dataset**. [...] These options are configured by the constructor arguments of a `DataLoader`, which has signature:
>
```python
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
```

#### Dataset Types

The most important argument of the `DataLoader` constructor is `dataset`, which indicates a dataset object to load data from. 

PyTorch supports <ins>two different types</ins> of datasets:
- **map-style** datasets;
- **iterable-style** datasets;

##### Map-Style Datasets

`torch.utils.data.Dataset` is an abstract class representing a dataset in `torch`. 

Any custom dataset should inherit `Dataset` and override the following methods:

- `__len__` so that `len(dataset)` returns the size of the dataset.
- `__getitem__` to support a dataset which is **subscriptable** for sample indexing, or batch slicing.

> For example, such a `dataset`, when accessed with `dataset[idx]`, could read the `idx-th` image and its corresponding `label` from a folder on the disk.

<span class="fn">**Note**: This is the power of Python OOP abstraction!</span>
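
As a minimal sketch of a map-style dataset, consider the toy class below. `SquaresDataset` is a hypothetical name used only for illustration; it generates its samples on the fly instead of reading them from disk.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Toy map-style dataset: sample i is the pair (i, i**2)."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, idx: int):
        if not 0 <= idx < self.size:
            raise IndexError(idx)  # lets plain `for` loops terminate
        x = torch.tensor([float(idx)])
        return x, x ** 2

squares = SquaresDataset(5)
print(len(squares))   # 5
x, y = squares[3]
print(x, y)           # tensor([3.]) tensor([9.])
```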

##### Iterable-style Datasets

An iterable-style dataset is an instance of a subclass of `IterableDataset` that implements the `__iter__()` protocol, and represents an **iterable** over data samples. 

This type of dataset is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.

> For example, such a dataset, when called `iter(dataset)`, could return a stream of data reading from a database, a remote server, or even logs generated in real time.
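
A minimal sketch of an iterable-style dataset follows. `CountStream` is a hypothetical toy class: it simply yields integers, standing in for a real stream such as a database cursor or a log file.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

class CountStream(IterableDataset):
    """Toy iterable-style dataset that streams integers from 0 to limit - 1."""

    def __init__(self, limit: int):
        self.limit = limit

    def __iter__(self):
        # A real implementation might read from a socket, a database cursor, or a log file
        for value in range(self.limit):
            yield torch.tensor(value)

stream = CountStream(limit=5)
loader = DataLoader(stream, batch_size=2)  # single process (num_workers=0)
stream_batches = [b.tolist() for b in loader]
print(stream_batches)  # [[0, 1], [2, 3], [4]]
```

Note that there is no `__len__` and no indexing here: the loader can only consume samples in the order the stream produces them.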

#### Energy Consumption `torch` Dataset

After the data preparation step, we ended up having a total of `980,185` sequences of training data. 

To improve the speed of our training, we can then immediately benefit from `torch` `Dataset` and `DataLoader` abstractions. 
With particular reference to our use case, `torch.utils.data` includes a special `Dataset` subclass called `TensorDataset`, which is designed exactly for cases like ours, where the data is already organised in `torch.Tensor` (_and alike_) objects.

In [None]:
import torch
from torch.utils.data import TensorDataset

In [None]:
train_dataset = TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
test_dataset = TensorDataset(torch.from_numpy(X_test), torch.from_numpy(y_test))

We can now pass these datasets to two `DataLoader` instances, ready for our _training_ and _evaluation_ steps.

In [None]:
from torch.utils.data import DataLoader

In [None]:
BATCH_SIZE = 1024
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, drop_last=True)
test_loader = DataLoader(test_dataset, shuffle=False, batch_size=BATCH_SIZE, drop_last=True)

Let's now try to get the first batch from the `train_loader` and see what we get.

In [None]:
train_iterator = iter(train_loader)
batch = next(train_iterator)

In [None]:
len(batch), type(batch)

So we got a list of **two** items, presumably the data and its corresponding labels 

In [None]:
samples, labels = batch

In [None]:
type(samples), type(labels)

In [None]:
samples.size(), labels.size()

As expected, data is already in `torch.Tensor` format, ready to be used! Their `shape` is:

- `(BATCH_SIZE, LOOKBACK, FEATURES)` for `samples`;
- `(BATCH_SIZE, LABEL)` for `labels`.

##### Exercise

To finally implement a more realistic training loop, let's iterate the `train_loader` for **three** epochs, extracting only **two** batches per epoch (_and then breaking the loop_).

For each of these batches, let's print the features of the first `5` elements of the time series and their corresponding labels for the first **two** samples in the batch (_to potentially see the automatic shuffling in action_).

Here is the exercise structure:

```python
NUM_EPOCHS = 3
NUM_ITERATIONS = 2

for epoch in range(NUM_EPOCHS):
    for it, batch in enumerate(train_loader):
        samples, labels = batch
        ... # your code here
        if (it+1 == NUM_ITERATIONS):
            break
```

In [None]:
## Exercise CODE HERE

---

## Case Study 2: `Facial Emotion Recognition`

Let's now work on our **second** **full** case study for data preparation, using the `FER` (Facial Emotion Recognition) dataset. 

This dataset is interesting on many levels, and it will be used here to follow a _slightly_ alternative approach to data encapsulation in PyTorch.

**Note**: Both case studies will be used later in practice, for **Deep Learning** model training.

So, let's start from the very beginning and work our way towards our own `FERDataset`, a `torch.utils.data.Dataset` abstraction, starting from **downloading** the dataset and quickly preprocessing it in its *raw* form.

### The Facial Emotion Recognition Dataset 

The `FER` dataset is a publicly [available](https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data) dataset, released on **Kaggle** for the **Challenges in Representation Learning: Facial Expression Recognition Challenge**.

The goal of the challenge is to identify emotions from images of facial expressions.

**Reference**:

> *Challenges in Representation Learning: A report on three machine learning
contests.* 
>
> I Goodfellow, D Erhan, PL Carrier, A Courville, M Mirza, B
Hamner, W Cukierski, Y Tang, DH Lee, Y Zhou, C Ramaiah, F Feng, R Li,
X Wang, D Athanasakis, J Shawe-Taylor, M Milakov, J Park, R Ionescu,
M Popescu, C Grozea, J Bergstra, J Xie, L Romaszko, B Xu, Z Chuang, and
Y. Bengio. arXiv 2013.

##### Data Description

(adapted from [page](https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data) on Kaggle)

> The data consists of `48` $\times$ `48` pixel grayscale images of faces. 
> The faces have been automatically registered so that the face is more or less centered, 
> and occupies about the same amount of space in each image. 
> The task is to categorize each face based on the emotion shown in the facial expression
> into one of seven categories:
> (`0=Angry`, `1=Disgust`, `2=Fear`, `3=Happy`, `4=Sad`, `5=Surprise`, `6=Neutral`).


> The training set consists of `28,709` examples, validation set consists of `3,589` examples. The 
> final test set, also consists of another `3,589` examples.

##### Credits

>This dataset was prepared by Pierre-Luc Carrier and Aaron Courville, as part of an ongoing research project. They have graciously provided the workshop organizers with a preliminary version of their dataset to use for this contest.

**What's the plan, then?**

Well, the plan now is to work out the different bits we would need to encapsulate as `methods` of our `FERDataset` class to load and *preprocess* data. 

In particular, we are aiming to:
1. enable automatic download of the original `data`;
2. preprocess data from its original `raw` format (CSV)
3. transform data (per each partition) into `torch.Tensor`
4. **save** these tensors for later re-use, so that the previous steps never need to be repeated!
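
Step 4 can be sketched in isolation as a save/load round trip with `torch.save` and `torch.load`. The tensors and the file name below are stand-ins chosen for illustration, not the actual FER data.

```python
import os
import tempfile

import torch

images = torch.zeros((4, 48, 48), dtype=torch.uint8)  # stand-in for image data
labels = torch.tensor([0, 3, 6, 2])                   # stand-in for emotion codes

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_partition.pt")  # hypothetical file name
    torch.save((images, labels), path)                # serialise the pair once
    loaded_images, loaded_labels = torch.load(path)   # later runs start from here

print(loaded_images.shape, loaded_labels.tolist())
```

Storing the `(images, labels)` pair as a single object keeps samples and targets aligned, and loading it back is much cheaper than re-parsing the CSV file.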

### 1. Downloading the Data

The dataset is available on Kaggle, and mirrored by me on [this](https://www.dropbox.com/s/2rehtpc6b5mj9y3/fer2013.tar.gz?dl=1) Dropbox link in its original form.

To automatically download the data, we will be using the `download_and_extract_archive` utility function included in the `torchvision.datasets.utils` module[$^{2}$](#fn2).

<span id="fn2"><i>[2]: </i> This function is used in **all** the `VisionDataset` instances included in `torchvision.datasets` classes.</span>

In [None]:
FER_DATASET_URL = "https://www.dropbox.com/s/2rehtpc6b5mj9y3/fer2013.tar.gz?dl=1"
FER_DATASET_FILE = "fer2013.tar.gz"
FER_DATASET_MD5 = "ca95d94fe42f6ce65aaae694d18c628a"
FER_DATASET_FOLDER = os.path.join(DOWNLOAD_ROOT, "fer2013")

In [None]:
download_and_extract_archive(url=FER_DATASET_URL, download_root=DOWNLOAD_ROOT, filename=FER_DATASET_FILE)

As usual, let's have a look at what sort of _monster_ we have to deal with:

In [None]:
os.listdir(FER_DATASET_FOLDER)

#### Processing `raw` data using `pandas`

In [None]:
from pathlib import Path 

FER_CSV_PATH = Path(FER_DATASET_FOLDER) / "fer2013.csv"
fer_df = pd.read_csv(FER_CSV_PATH, header=0, names=["emotion", "pixels", "partition"])

In [None]:
fer_df.shape

In [None]:
fer_df.columns

In [None]:
fer_df.head()

The structure of the data looks quite simple: 

1. As documented, image `pixels` are reported _flattened_ into each row;
2. Emotions are encoded as numbers, and presumably we should make this a categorical variable;
3. The `partition` column (named `Usage` in the original CSV file) assigns a dataset partition to each sample.

In [None]:
fer_df.partition.unique()

Let's remap partitions' names into a more standard _nomenclature_

In [None]:
fer_df.partition = fer_df.partition.apply(lambda v: 
                                          'training' if v == 'Training' else 'validation' 
                                          if v == 'PrivateTest' else 'test')

In [None]:
fer_df.partition.unique()

In [None]:
fer_df.partition.value_counts()

**Emotion** labels encoding:

In [None]:
EMOTION_MAP = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy', 4: 'Sad', 5: 'Surprise', 6:'Neutral'}

fer_df.emotion = pd.Categorical(fer_df.emotion)
fer_df["emotion_label"] = fer_df.emotion.apply(lambda c: EMOTION_MAP[c])

In [None]:
fer_df.emotion.cat.codes.unique()

In [None]:
fer_df.emotion_label.values.unique()

Now let's try to plot an image loading its pixels from the `DataFrame`.

We will be using the `fromstring` function of `numpy` to convert a string of digits into a `ndarray` object.

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
pixels = fer_df.iloc[0].pixels
img = np.fromstring(pixels, dtype=np.uint8, sep=" ")
print(img.shape)

In [None]:
plt.imshow(img.reshape(48, 48), cmap="gray", interpolation='bilinear')
plt.axis("off")
plt.show()

Here is a first glance at `100` random samples:

In [None]:
from functools import partial 

# shortcut for plt.text with some settings
text_annotation = partial(plt.text, x=36, y=46, fontdict={'color': 'red', 'fontsize': 10, 'ha': 'center', 
                                                          'va': 'center', 
                                                          'bbox': dict(boxstyle="round", fc="white", 
                                                                       ec="black", pad=0.2)})
def overview(samples):
    """
    The function is used to plot first 
    several pictures for overviewing 
    image dataset.
    """
    fig = plt.figure(figsize=(25,25))
    image_shape = (48, 48)
    for i, (em_label, bytestring) in enumerate(samples):
        ax = fig.add_subplot(10,10,i+1)
        comp = np.fromstring(bytestring, dtype=np.uint8, sep=' ')
        # pixels are unsigned 8-bit grayscale values, so the display range is fixed to [0, 255]
        plt.imshow(comp.reshape(image_shape), cmap=plt.cm.gray,
                   interpolation='bilinear',
                   vmin=0, vmax=255)
        text_annotation(s='{}-{}'.format(str(i+1).zfill(2), em_label))
        plt.xticks(np.array([]))
        plt.yticks(np.array([]))
        plt.tight_layout()
    plt.show()

In [None]:
SEED = 920  # 0b1110011000
samples = fer_df.sample(n=100, random_state=SEED, replace=False)
overview(samples[["emotion_label", "pixels"]].values)  # this may take a bit to render

##### Before moving on...

The **last** and quite important bit left to consider is how samples are distributed among the three given partitions, for each `emotion` label.

In other words, we want to investigate **samples per-class distribution**.

Even if this is not necessary for the `Dataset` abstraction per se, it is always a very good idea to explore how imbalanced a dataset can be, in order to be ready to put _remedies_ in place.

In [None]:
fer_df.groupby("emotion").emotion_label.value_counts().unstack().plot(kind="bar", rot=0,
                                                                      figsize=(6, 6))
plt.show()

⚠️ As we can see, this dataset is quite imbalanced towards `happy`, `sad`, and `neutral` emotions, with the `disgust` being the least represented class (by a lot).

Let's further investigate how this is reflected in the corresponding **given**[\*](#fnstar) data partitions: 

<span id="fnstar"><i>[$\star$]: </i>Please bear in mind that this dataset comes from a Kaggle challenge, therefore original partitioning might intentionally include some bias of any sort induced by the very nature of the challenge itself.</span>

In [None]:
fer_df.groupby('partition').emotion_label.value_counts().unstack().plot(kind="bar", rot=0,
                                                                        figsize=(10, 10))
plt.show()

From this **plot** we can conclude that the distribution of samples in `test` and `validation` is comparable, if not identical. We cannot say anything about the actual selected samples yet.

#### Creating `FERDataset`

We are finally ready to encapsulate all the previous steps into a custom `torch.utils.data.Dataset` class

⚠️ **NOTE**

For the purposes of this notebook, the `Dataset` class is defined in a single cell, solely to keep the code together with the narrative and explanations.

For this reason, **comments** and *code documentation* are reduced to the bare minimum here. 

The complete and documented `fer.py` module will be used and re-used in future notebooks, also to avoid needless repetition and difficult-to-read/maintain notebook cells :)

In [None]:
from math import sqrt
from torch.utils.data import Dataset
from PIL import Image

In [None]:
class FER(Dataset):
    RAW_DATA_FILE = "fer2013.csv"  # input original CSV filename
    RAW_DATA_FOLDER = "fer2013"  # original data folder name
    resources = [("https://www.dropbox.com/s/2rehtpc6b5mj9y3/fer2013.tar.gz?dl=1",
                  "ca95d94fe42f6ce65aaae694d18c628a",)]
    # torch Tensor filename
    data_files = { "train": "training.pt", "validation": "validation.pt", "test": "test.pt",}
    # classes list
    classes = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral",]

    def __init__(self, root: str, split: str = "train", download: bool = False):
        self.root = root
        split = split.strip().lower()
        if split not in self.data_files:
            raise ValueError(
                "Data Partition not recognised. Accepted values are 'train', 'validation', 'test'."
            )
        if download:
            self.download()  # download, preprocess, and store FER data
        if not self._check_exists():
            raise RuntimeError(
                "Dataset not found." + " You can use download=True to download it"
            )
        self.split = split
        data_file = self.data_files[self.split]
        data_filepath = self.processed_folder / data_file
        # load serialisation of dataset as torch.Tensors
        self.data, self.targets = torch.load(data_filepath)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, int]:
        img, target = self.data[index], int(self.targets[index])
        img = torch.unsqueeze(img, 0)
        return img, target
    
    def _check_exists(self):
        for data_fname in self.data_files.values():
            data_file = self.processed_folder / data_fname
            if not data_file.exists():
                return False
        return True

    def download(self):
        """Download the FER data if it doesn't already exist in the processed folder"""
        if self._check_exists():
            return
        os.makedirs(self.raw_folder, exist_ok=True)
        os.makedirs(self.processed_folder, exist_ok=True)
        # download files
        for url, md5 in self.resources:
            filename = url.rpartition("/")[-1].split("?")[0]
            download_and_extract_archive(
                url, download_root=self.raw_folder, filename=filename, md5=md5)
        # process and save as torch files
        def _set_partition(label: str) -> str:
            if label == "Training":
                return "train"
            if label == "PrivateTest":
                return "validation"
            return "test"
        
        print("Processing...", end="")
        raw_data_filepath = self.raw_folder / self.RAW_DATA_FOLDER / self.RAW_DATA_FILE
        fer_df = pd.read_csv(raw_data_filepath, header=0, 
                             names=["emotion", "pixels", "partition"])
        fer_df["partition"] = fer_df.partition.apply(_set_partition)
        fer_df.emotion = pd.Categorical(fer_df.emotion)
        for partition in ("train", "validation", "test"):
            dataset = fer_df[fer_df["partition"] == partition]
            images = self._images_as_torch_tensors(dataset)
            labels = self._labels_as_torch_tensors(dataset)
            data_file = self.processed_folder / self.data_files[partition]
            with open(data_file, "wb") as f:
                torch.save((images, labels), f)
        print("Done!")

    def _images_as_torch_tensors(self, dataset: pd.DataFrame) -> torch.Tensor:
        """
        Extract all the pixel from the input dataframes, and convert images in
        a [sample x features] torch.Tensor
        """
        imgs_np = (dataset.pixels.map(self._to_numpy)).values
        imgs_np = np.concatenate(imgs_np, axis=0)
        samples_no, pixels = imgs_np.shape
        new_shape = (samples_no, int(sqrt(pixels)), int(sqrt(pixels)))
        return torch.from_numpy(imgs_np).view(new_shape)

    @staticmethod
    def _labels_as_torch_tensors(dataset: pd.DataFrame) -> torch.Tensor:
        """Extract labels from pd.Series and convert into torch.Tensor"""
        labels_np = dataset.emotion.values.astype(np.int64)  # np.int was removed in NumPy 1.24
        return torch.from_numpy(labels_np)

    @staticmethod
    def _to_numpy(pixels: str) -> np.ndarray:
        """Convert one-line string pixels into NumPy array, adding the first
        extra axis (sample dimension) later used as the concatenation axis"""
        img_array = np.fromstring(pixels, dtype=np.uint8, sep=" ")[np.newaxis, ...]
        return img_array
    
    @property
    def processed_folder(self):
        return Path(self.root) / self.__class__.__name__ / "processed"

    @property
    def raw_folder(self):
        return Path(self.root) / self.__class__.__name__ / "raw"
    
    @property
    def idx_to_class(self):
        return {i: _class for i, _class in enumerate(self.classes)}
    
    

In [None]:
fer_training = FER(root=DOWNLOAD_ROOT, download=True, split="train")

In [None]:
len(fer_training)

In [None]:
img, label = fer_training[3]

In [None]:
img.shape

In [None]:
img, label = fer_training[3]
plt.imshow(img.numpy().transpose((1, 2, 0)), interpolation="bilinear", cmap="gray")
text_annotation(s=f"Training Sample: {fer_training.idx_to_class[label]}")
plt.show()

In [None]:
fer_validation = FER(root=DOWNLOAD_ROOT, download=True, split="validation")

In [None]:
len(fer_validation)

Let's now iterate some samples from the dataset

In [None]:
fig = plt.figure(figsize=(10, 10))

for i, (face, emotion) in enumerate(iter(fer_validation), start=0):
    
    print(i, f"{face.shape}", emotion)

    ax = plt.subplot(1, 4, i + 1)
    plt.tight_layout()
    ax.set_title('Sample #{}'.format(i))
    ax.axis('off')
    ax.imshow(face.numpy().transpose((1, 2, 0)), interpolation="bilinear", cmap="gray")
    if i == 3:
        plt.show()
        break

So, the dataset is **subscriptable** as required, and we can easily `iter`-ate it to access single samples. 

However, as already anticipated, we can do a lot better... also because we need more!  And `torch` provides better abstractions to iterate over a `Dataset`: `torch.utils.data.DataLoader`!

#### Iteration Time

In the previous example, we demonstrated that we can easily use a generic Python `iterator` object to iterate over samples of a `Dataset`. 

A `Dataset` object is indeed subscriptable, therefore it is always possible to do so. 

However, in doing so we would be missing out on a lot of features provided _out of the box_ by `torch` `DataLoader` objects:

- Batching the data
- Shuffling the data
- Load the data in parallel using `multiprocessing` workers.

`torch.utils.data.DataLoader` is an iterable which provides all these features. 

Parameters used below should be clear:

- `dataset`: the dataset to iterate
- `batch_size`: the size of each `batch`, that is: "how many samples per single batch"
- `shuffle`: whether to shuffle the samples (default `False`; usually only required in training)
- `num_workers`: how many worker processes will be used to load the data
- `collate_fn`: the function used to assemble individual samples into a batch

One parameter of interest here is `collate_fn`, which is the function that a `DataLoader` instance calls internally to **prepare** the batches.

The default implementation (`default_collate`) works fine in most cases. For example, with our `FER` dataset instance:

1. each `__getitem__` call returns a `tuple` `(torch.Tensor[1, 48, 48], int)`

2. `default_collate` collects `batch_size` of those tuples and stacks the samples together so that:
    - `batch` $\mapsto$ `[Tensor[batch_size, 1, 48, 48], Tensor[batch_size]]`
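
The two steps above can be checked directly by calling `default_collate` on a handful of fake samples. The tensors below are placeholders with the same shape as the FER items, not real images.

```python
import torch
from torch.utils.data.dataloader import default_collate

# Four fake samples shaped like the FER items: (image tensor, integer label)
fake_samples = [(torch.zeros(1, 48, 48), label) for label in (0, 3, 6, 2)]

images, targets = default_collate(fake_samples)
print(images.shape)   # torch.Size([4, 1, 48, 48])
print(targets)        # tensor([0, 3, 6, 2])
```

The image tensors are stacked along a new leading batch dimension, while the plain Python integers are gathered into a single integer tensor.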
 

In [None]:
validation_loader = DataLoader(fer_validation, batch_size=4, shuffle=False)

In [None]:
batch = next(iter(validation_loader))

In [None]:
type(batch)

In [None]:
batch[0].shape, batch[1].shape

To **customise** our `collate_fn`, we will wrap the result of the `default_collate` implementation into a `Batch` namedtuple, for easier field access:

In [None]:
from collections import namedtuple

from torch.utils.data.dataloader import default_collate

Batch = namedtuple("Batch", ["samples", "emotions"])

def batch_collate_fn(batch):
    batch = default_collate(batch)
    return Batch(*batch)

validation_loader = DataLoader(fer_validation, batch_size=4,
                               collate_fn=batch_collate_fn, shuffle=False)

In [None]:
batch = next(iter(validation_loader))

type(batch)

In [None]:
batch.samples.shape

In [None]:
batch.emotions.shape

##### Exercise

Let's have a go with our `validation_loader` to show some batches, returning our custom `Batch` object. 

We will be using the `make_grid` function from `torchvision.utils` for quicker plotting.

In [None]:
from torchvision.utils import make_grid

In [None]:
for i, batch in enumerate(validation_loader):
    # YOUR CODE HERE
    samples, labels =   # COMPLETE HERE
    print(f"Emotions: {list(map(lambda e: fer_validation.idx_to_class[e.item()], labels))}")
    grid = make_grid(samples)
    plt.imshow(grid.numpy().transpose((1, 2, 0)))
    plt.show()
    
    if i == 3:
        break

#### There is more...

So far, we have worked our way towards a better encapsulation for the `FER` Dataset. However, we haven't said anything (_yet_) about _Data Augmentation_ or _sampling_ to cope with data imbalance. 

## References:

- The data preparation for the energy case study was inspired by this [blog](https://blog.floydhub.com/gru-with-pytorch/) post 
    - _note_: the data preprocessing there is wrong, as scaling is performed before the train/test split 😬