# Writing Custom Datasets and DataLoaders 

*This tutorial is based on [Writing Custom Datasets, DataLoaders and Transforms](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#writing-custom-datasets-dataloaders-and-transforms)*

Much of the effort in solving a machine learning problem goes into preparing the data. PyTorch provides many tools that make data loading easier and, hopefully, your code more readable. In this tutorial, we will see how to load and preprocess/augment data from a non-trivial dataset.

## [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

1.   `torch.utils.data.Dataset` is an abstract class representing a dataset.
2.   Your custom dataset should inherit from `Dataset` and override the following methods:
     -   `__len__`, so that `len(dataset)` returns the size of the dataset.
     -   `__getitem__`, to support indexing, so that `dataset[i]` returns the $i$-th sample.

All datasets used in this tutorial are subclasses of `torch.utils.data.Dataset`, i.e., they implement the `__getitem__` and `__len__` methods.
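To make the two required methods concrete, here is a minimal sketch of a map-style dataset. The class `SquaresDataset` and its contents are hypothetical and serve only to illustrate the protocol; they are not part of the original tutorial.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n_samples: int) -> None:
        # Inputs 0, 1, ..., n_samples - 1 and their squares as targets
        self.x = torch.arange(n_samples, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self) -> int:
        # Called by len(dataset)
        return self.x.size(0)

    def __getitem__(self, index: int):
        # Called by dataset[index]
        return self.x[index], self.y[index]


squares = SquaresDataset(n_samples=5)
n_samples = len(squares)
x3, y3 = squares[3]
print(n_samples, x3.item(), y3.item())
```

Any object exposing these two methods can be passed to a `DataLoader`, which is what the rest of this tutorial relies on.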

### Custom dataset from a `.csv` file

In [70]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

# fetch the data once and reuse the result for both features and target
housing = datasets.fetch_california_housing(as_frame=True)
data = housing.data
y = housing.target

# join data and target
data[y.name] = y.values

# split data
df_train, df_test = train_test_split(data, train_size=0.8, random_state=42)

df_train.head()


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 14196 | 3.2596 | 33.0 | 5.017657 | 1.006421 | 2300.0 | 3.691814 | 32.71 | -117.03 | 1.03 |
| 8267 | 3.8125 | 49.0 | 4.473545 | 1.041005 | 1314.0 | 1.738095 | 33.77 | -118.16 | 3.821 |
| 17445 | 4.1563 | 4.0 | 5.645833 | 0.985119 | 915.0 | 2.723214 | 34.66 | -120.48 | 1.726 |
| 14265 | 1.9425 | 36.0 | 4.002817 | 1.033803 | 1418.0 | 3.994366 | 32.69 | -117.11 | 0.934 |
| 2271 | 3.5542 | 43.0 | 6.268421 | 1.134211 | 874.0 | 2.3 | 36.78 | -119.8 | 0.965 |


In [47]:
df_test.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 20046 | 1.6812 | 25.0 | 4.192201 | 1.022284 | 1392.0 | 3.877437 | 36.06 | -119.01 | 0.477 |
| 3024 | 2.5313 | 30.0 | 5.039384 | 1.193493 | 1565.0 | 2.679795 | 35.14 | -119.46 | 0.458 |
| 15663 | 3.4801 | 52.0 | 3.977155 | 1.185877 | 1310.0 | 1.360332 | 37.8 | -122.44 | 5.00001 |
| 20484 | 5.7376 | 17.0 | 6.163636 | 1.020202 | 1705.0 | 3.444444 | 34.28 | -118.72 | 2.186 |
| 9814 | 3.725 | 34.0 | 5.492991 | 1.028037 | 1063.0 | 2.483645 | 36.62 | -121.93 | 2.78 |


The dataset is represented as a table with $9$ columns. The first $8$ columns serve as inputs, while the last column (i.e., `MedHouseVal`) serves as the target.

We write a dataset `CaliforniaHousingDataset` using this data. 

In [56]:
import torch
from torch.utils.data import Dataset
from typing import Any, Callable, Dict, IO, List, Optional, Tuple, Union


class CaliforniaHousingDataset(Dataset):
    target_name = [y.name]
    features_names = [name for name in data.columns if name != y.name]

    def __init__(self, train: bool = True) -> None:
        self.train = train

        if self.train:
            self.df = df_train
        else:
            self.df = df_test

        # `data` and `targets` are indexed in `__getitem__`, so both are stored
        # as tensors; `torch.tensor` converts a `numpy.ndarray` into a `torch.Tensor`
        self.data = torch.tensor(self.df[self.features_names].to_numpy())
        self.targets = torch.tensor(self.df[self.target_name].to_numpy())

    def __getitem__(self, index) -> Tuple[torch.Tensor, torch.Tensor]:
        return self.data[index], self.targets[index]

    def __len__(self) -> int:
        # number of samples in the dataset
        return len(self.df)


chd = CaliforniaHousingDataset()
len(chd)


16512

In [None]:
california_housing_train = CaliforniaHousingDataset(train=True)

california_housing_train[0]

(tensor([-1.1431e+02,  3.4190e+01,  1.5000e+01,  5.6120e+03,  1.2830e+03,
          1.0150e+03,  4.7200e+02,  1.4936e+00], dtype=torch.float64),
 tensor([66900.], dtype=torch.float64))

In [None]:
california_housing_test = CaliforniaHousingDataset(train=False)

california_housing_test[0]

(tensor([-122.0500,   37.3700,   27.0000, 3885.0000,  661.0000, 1537.0000,
          606.0000,    6.6085], dtype=torch.float64),
 tensor([344700.], dtype=torch.float64))
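
The samples printed above carry `dtype=torch.float64`, since `torch.tensor` keeps the dtype of the NumPy array it receives. Layers such as `torch.nn.Linear` create `float32` parameters by default, so a cast is often needed before feeding these tensors to a model. The following is a minimal sketch, assuming float32 inputs are wanted; the two-column array (the `MedInc` and `HouseAge` values of the first two training rows) and the `layer` are illustrative only and not part of the tutorial's code.

```python
import numpy as np
import torch

# Two rows of (MedInc, HouseAge); NumPy stores them as float64 by default
features = np.array([[3.2596, 33.0], [3.8125, 49.0]])

as_float64 = torch.tensor(features)                       # dtype inferred from NumPy
as_float32 = torch.tensor(features, dtype=torch.float32)  # explicit cast

# Illustrative layer: its weights are float32, so it expects float32 inputs
layer = torch.nn.Linear(in_features=2, out_features=1)
out = layer(as_float32)

print(as_float64.dtype, as_float32.dtype, tuple(out.shape))
```

In the dataset class, the same effect could be obtained by passing `dtype=torch.float32` to the two `torch.tensor` calls in `__init__`.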

## Subset of MNIST

We build a subset of the MNIST dataset, following the style used in [`torchvision.datasets.mnist`](https://pytorch.org/vision/stable/_modules/torchvision/datasets/mnist.html#EMNIST).

In [62]:
import os
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor, Normalize, Compose
from PIL import Image


def get_mnist():
    """
    gets the full (both train and test) MNIST dataset inputs and labels;
    :return:
        data, targets
    """

    training_data = MNIST(
        root="./data/",
        train=True,
        download=True,
        transform=ToTensor(),
    )

    test_data = MNIST(
        root="./data/",
        train=False,
        download=True,
        transform=ToTensor(),
    )

    data = torch.cat([training_data.data, test_data.data])

    # concatenate train and test labels, in the same order as the inputs
    targets = torch.cat([training_data.targets, test_data.targets])

    return data, targets


class SubMNIST(Dataset):
    """
    Constructs a subset of the MNIST dataset from a list of indices

    Attributes
    ----------
    indices: iterable of integers
    transform
    data
    targets

    Methods
    -------
    __init__
    __len__
    __getitem__

    """

    def __init__(self, indices, mnist_data=None, mnist_targets=None, transform=None):
        """
        :param indices: List[int]
        :param mnist_data: MNIST dataset inputs
        :param mnist_targets: MNIST dataset labels
        :param transform: transform applied to each image;
            defaults to `ToTensor` followed by `Normalize`
        """

        self.indices = indices

        # always set `self.transform`, falling back to the default when none is given
        if transform is None:
            transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
        self.transform = transform

        if mnist_data is None or mnist_targets is None:
            self.data, self.targets = get_mnist()
        else:
            self.data, self.targets = mnist_data, mnist_targets

        self.data = self.data[self.indices]
        self.targets = self.targets[self.indices]

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, index):
        img, target = self.data[index], int(self.targets[index])

        img = Image.fromarray(img.numpy(), mode="L")

        if self.transform is not None:
            img = self.transform(img)

        return img, target

In [68]:
get_mnist()[1].shape

torch.Size([70000])
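
The default transform in `SubMNIST` applies `ToTensor` and then `Normalize((0.1307,), (0.3081,))`. The arithmetic for a single pixel can be sketched in plain Python as follows; the helper `normalize_pixel` is a hypothetical name introduced here for illustration and only mimics what the two transforms do to one 8-bit value.

```python
MEAN, STD = 0.1307, 0.3081  # the values passed to Normalize in SubMNIST


def normalize_pixel(raw: int) -> float:
    """Mimic ToTensor (scale 0..255 to 0..1) followed by Normalize, for one pixel."""
    scaled = raw / 255.0
    return (scaled - MEAN) / STD


background = round(normalize_pixel(0), 4)    # black pixel
brightest = round(normalize_pixel(255), 4)   # white pixel
print(background, brightest)  # -0.4242 2.8215
```

This is why batches drawn from a `SubMNIST` dataset are dominated by the value `-0.4242`: most MNIST pixels are black background.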

**Exercise:** Randomly partition the MNIST dataset into equally sized chunks.

In [110]:
def iid_divide(l, g):
    """
    https://github.com/TalwalkarLab/leaf/blob/master/data/utils/sample.py
    divide list `l` among `g` groups
    each group has either `int(len(l)/g)` or `int(len(l)/g)+1` elements
    returns a list of groups
    """
    num_elems = len(l)
    group_size = int(len(l) / g)
    num_big_groups = num_elems - g * group_size
    num_small_groups = g - num_big_groups
    glist = []
    for i in range(num_small_groups):
        glist.append(l[group_size * i : group_size * (i + 1)])
    bi = group_size * num_small_groups
    group_size += 1
    for i in range(num_big_groups):
        glist.append(l[bi + group_size * i : bi + group_size * (i + 1)])

    return glist


def partition_mnist(n_chunks=10) -> List[SubMNIST]:
    np.random.seed(42)

    # only the first 60,000 samples returned by `get_mnist`
    # (i.e., the training split) are partitioned
    TOTAL_N_SAMPLES = 60_000

    all_indices = list(range(TOTAL_N_SAMPLES))
    np.random.shuffle(all_indices)  # shuffles the list in place

    indices_list = iid_divide(all_indices, n_chunks)

    partitions = []
    mnist_data, mnist_targets = get_mnist()  # load MNIST only once
    for indices in indices_list:
        dataset = SubMNIST(
            indices,
            mnist_data=mnist_data,
            mnist_targets=mnist_targets,
        )
        partitions.append(dataset)

    return partitions


partition_mnist()


[<__main__.SubMNIST at 0x7f00445c7b50>,
 <__main__.SubMNIST at 0x7f00445c7e80>,
 <__main__.SubMNIST at 0x7f0044075280>,
 <__main__.SubMNIST at 0x7f0054035e50>,
 <__main__.SubMNIST at 0x7f00442e64c0>,
 <__main__.SubMNIST at 0x7f004406b610>,
 <__main__.SubMNIST at 0x7f003bdd5a30>,
 <__main__.SubMNIST at 0x7f003bdd5a90>,
 <__main__.SubMNIST at 0x7f003bddd340>,
 <__main__.SubMNIST at 0x7f003bddd460>]
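
The list above only shows object addresses, so it is worth checking how `iid_divide` distributes the elements. The sketch below repeats the same splitting rule in a compact form so that it runs on its own; the small input `list(range(10))` is an arbitrary example chosen for illustration.

```python
def iid_divide(l, g):
    """Same splitting rule as above: small groups first, then groups with one extra element."""
    group_size, num_big_groups = divmod(len(l), g)
    num_small_groups = g - num_big_groups
    glist = []
    for i in range(num_small_groups):
        glist.append(l[group_size * i : group_size * (i + 1)])
    bi = group_size * num_small_groups
    group_size += 1
    for i in range(num_big_groups):
        glist.append(l[bi + group_size * i : bi + group_size * (i + 1)])
    return glist


# 10 elements in 3 groups: the sizes differ by at most one
groups = iid_divide(list(range(10)), 3)
sizes = [len(group) for group in groups]
print(sizes)  # [3, 3, 4]

# 60,000 indices in 10 groups: every chunk holds exactly 6,000 indices
mnist_sizes = [len(group) for group in iid_divide(list(range(60_000)), 10)]
print(mnist_sizes)
```

Since 60,000 is divisible by 10, the partitions produced by `partition_mnist()` with its default argument all have the same size.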

In [112]:
from torch.utils.data import DataLoader

mnist_partition = partition_mnist(n_chunks=10)
mnist_loaders = [
    DataLoader(partition, batch_size=64, shuffle=True) for partition in mnist_partition
]
mnist_loaders

[<torch.utils.data.dataloader.DataLoader at 0x7f0044075be0>,
 <torch.utils.data.dataloader.DataLoader at 0x7f0044075550>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003bddd0d0>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003bdddd30>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003bddde20>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003bdddf10>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003b809040>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003b809130>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003b809220>,
 <torch.utils.data.dataloader.DataLoader at 0x7f003b809310>]
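
A `DataLoader` yields batches rather than single samples. The following minimal sketch shows the batch shapes to expect; it assumes a synthetic stand-in (a `TensorDataset` of 200 blank images, not real MNIST data) so that it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for one partition: 200 blank 1x28x28 "images" with labels 0..9
images = torch.zeros(200, 1, 28, 28)
labels = torch.arange(200) % 10
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# Each iteration returns a batch of inputs and the matching batch of labels
batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
for shapes in batch_shapes:
    print(shapes)
```

With 200 samples and `batch_size=64`, the loader produces three full batches and a final batch of 8 samples; passing `drop_last=True` to `DataLoader` would discard that smaller batch.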

In [115]:
next(iter(mnist_loaders[0]))

[tensor([[[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           ...,
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242]]],
 
 
         [[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           ...,
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
           [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242]]],
 
 
         [[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
       