## Custom Datasets
---

This notebook shows how to define custom datasets when working with PyTorch, and how to split your data into training, validation, and test sets. We start with a very simple fictitious dataset of 100 samples and 10 features, stored in a 100 x 10 array. We generate the data points randomly and write the resulting NumPy array to disk.

In [13]:
import os
import torch
from sklearn.model_selection import train_test_split
from typing import Sequence, Optional
from torch.utils.data import Dataset, DataLoader
import numpy as np

In [14]:
# Create the data directory if it does not exist yet.
os.makedirs("data", exist_ok=True)

In [15]:
all_datapath = "data/all.npy"

In [16]:
data = np.random.rand(100, 10)
np.save(all_datapath, data)

In [17]:
data_ = np.load(all_datapath)
assert np.all(data == data_)
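
Because the array above is drawn without a seed, every run of the notebook produces different numbers. The following is a minimal sketch of a reproducible variant, not part of the original workflow: the seed value `0`, the file name `all_seeded.npy`, and the use of a temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Assumption: a fixed seed (0 here) is acceptable for the toy data.
rng = np.random.default_rng(0)
seeded_data = rng.random((100, 10))  # same shape as the array above

# Write to a temporary directory so this sketch does not touch data/all.npy.
with tempfile.TemporaryDirectory() as tmp_dir:
    seeded_path = os.path.join(tmp_dir, "all_seeded.npy")
    np.save(seeded_path, seeded_data)
    reloaded = np.load(seeded_path)

# A second generator with the same seed reproduces the array exactly.
same_again = np.random.default_rng(0).random((100, 10))
print(np.array_equal(seeded_data, reloaded), np.array_equal(seeded_data, same_again))
```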

### Problem:
How do we create a `Dataset` that reads the data from this file? And how do we split it into training, validation, and test sets?

### Solution 1: Splitting the data on disk

To prepare the data for use by a PyTorch model, we define a `Dataset` class. A `Dataset` is essentially a container that lets you interface with your data, however it is stored. For a map-style dataset like ours, it should provide an `__init__()` method for instantiation, whose arguments can be completely arbitrary, a `__len__()` method that returns the number of samples, and a `__getitem__()` method that receives an integer index (or a slice or sequence of indices if you want to allow advanced indexing) and returns the respective item as a tensor.

In [18]:
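class CustomDataset(Dataset):
    def __init__(self, dataset_path: str):
        self.dataset_path = dataset_path
        # Load the whole array into memory once; fine for a small file like this.
        self.data = np.load(dataset_path)

    def __len__(self):
        # Number of samples = number of rows.
        return self.data.shape[0]

    def __getitem__(self, idx):
        # idx may be an int, a slice, or a sequence of ints (NumPy indexing).
        return torch.from_numpy(self.data[idx, :])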

In [19]:
all_idx = np.arange(0, len(data))
train_split, valid_test_split = train_test_split(all_idx, train_size=0.8)
valid_split, test_split = train_test_split(valid_test_split, train_size=0.5)
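
The two calls above first take 80% of the indices for training and then halve the remaining 20%. A small self-contained sketch of the same idea, with a sanity check that the three index sets do not overlap. The `random_state=0` argument and the variable names are assumptions added here to make the split repeatable; the notebook cell itself does not fix a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

all_idx = np.arange(100)

# random_state is an assumption added here to make the split repeatable.
train_idx, rest_idx = train_test_split(all_idx, train_size=0.8, random_state=0)
valid_idx, test_idx = train_test_split(rest_idx, train_size=0.5, random_state=0)

# 80% train, then the remaining 20% is halved into 10% valid and 10% test.
print(len(train_idx), len(valid_idx), len(test_idx))  # 80 10 10

# The three index sets are disjoint and together cover every row exactly once.
combined = np.concatenate([train_idx, valid_idx, test_idx])
assert np.array_equal(np.sort(combined), all_idx)
```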

In [20]:
train_data = data[train_split, :]
train_datapath = "data/train.npy"
np.save(train_datapath, train_data)

valid_data = data[valid_split, :]
valid_datapath = "data/valid.npy"
np.save(valid_datapath, valid_data)

test_data = data[test_split, :]
test_datapath = "data/test.npy"
np.save(test_datapath, test_data)

In [21]:
train_ds = CustomDataset(train_datapath)
valid_ds = CustomDataset(valid_datapath)
test_ds = CustomDataset(test_datapath)

In [22]:
len(train_ds), len(valid_ds), len(test_ds)

(80, 10, 10)

In [23]:
valid_ds[:10]

tensor([[0.2658, 0.1320, 0.9718, 0.9078, 0.3574, 0.5475, 0.2282, 0.1728, 0.0578,
         0.3115],
        [0.4351, 0.3993, 0.6958, 0.6714, 0.8000, 0.7195, 0.4709, 0.8084, 0.6217,
         0.7403],
        [0.2238, 0.1720, 0.8436, 0.8522, 0.8580, 0.5453, 0.4868, 0.4745, 0.2917,
         0.2008],
        [0.1910, 0.4369, 0.3929, 0.1948, 0.2055, 0.3353, 0.3057, 0.3246, 0.9288,
         0.5554],
        [0.5993, 0.8189, 0.0609, 0.0548, 0.4448, 0.8515, 0.1662, 0.0251, 0.3644,
         0.7465],
        [0.8132, 0.5710, 0.4544, 0.2389, 0.9538, 0.1128, 0.9511, 0.2938, 0.9022,
         0.1942],
        [0.8564, 0.2553, 0.4450, 0.7629, 0.8897, 0.2289, 0.7435, 0.1221, 0.2614,
         0.7545],
        [0.8919, 0.4885, 0.6227, 0.5590, 0.7655, 0.9950, 0.4389, 0.5561, 0.7542,
         0.6109],
        [0.2427, 0.8769, 0.0546, 0.0748, 0.6120, 0.3136, 0.4596, 0.2031, 0.1239,
         0.4740],
        [0.0090, 0.6527, 0.5670, 0.3864, 0.0346, 0.4207, 0.3608, 0.9819, 0.0493,
         0.0180]], dtype=tor

In [24]:
valid_ds[10]

IndexError: index 10 is out of bounds for axis 0 with size 10
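
The tensors printed above are `torch.float64`, because `np.random.rand` produces float64 arrays and `torch.from_numpy` keeps the NumPy dtype. Layers such as `torch.nn.Linear` create float32 parameters by default, so you may want to cast before feeding a model. A hedged sketch of one way to do that: `Float32Dataset` is a hypothetical name, and it takes an in-memory array instead of a file path only to keep the example self-contained.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class Float32Dataset(Dataset):
    """Hypothetical variant of CustomDataset that casts items to float32."""

    def __init__(self, array: np.ndarray):
        # Takes an in-memory array instead of a path to keep the sketch self-contained.
        self.data = array

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        # from_numpy keeps the NumPy dtype (float64 here); .float() converts to float32.
        return torch.from_numpy(self.data[idx, :]).float()


example = Float32Dataset(np.random.rand(10, 10))
print(example[0].dtype, example[:3].shape)  # torch.float32 torch.Size([3, 10])
```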

Now that we have created datasets for training, validation, and testing, let us wrap each of them in a `DataLoader` so we can iterate over the samples in batches.

In [25]:
train_dl = DataLoader(train_ds, batch_size=4, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=4, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=4, shuffle=True)

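Any number of `DataLoader`s can be created on a dataset. A `DataLoader` is an iterable, not a one-shot iterator: every `for` loop over it (or call to `iter()`) starts a fresh pass over the whole dataset, so a single loader can be reused for as many epochs as you like. What gets used up is the iterator returned by `iter()`, which yields each batch exactly once. With `shuffle=True`, the order is redrawn at the start of every pass.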
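
To check how a single loader behaves across several passes, the sketch below iterates one `DataLoader` twice. It uses a toy `TensorDataset` holding the values 0 to 9 rather than the notebook's data, and the names `toy_ds` and `toy_dl` are made up for this example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Ten one-feature samples holding the values 0..9, so the order is easy to read.
toy_ds = TensorDataset(torch.arange(10).unsqueeze(1))
toy_dl = DataLoader(toy_ds, batch_size=4, shuffle=True)

epochs = []
for epoch in range(2):
    # Each for-loop asks the loader for a new iterator, so both passes see all samples.
    seen = torch.cat([batch[0].flatten() for batch in toy_dl])
    epochs.append(seen)
    print(epoch, seen.tolist())

# Ten samples with batch_size=4 give two full batches and one batch of two.
batch_sizes = [len(batch[0]) for batch in toy_dl]
print(batch_sizes)  # [4, 4, 2]
```

Both passes print all ten values; with `shuffle=True` the order is drawn anew for each pass.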

When creating a `DataLoader`, setting the `shuffle` parameter to `False` will respect the order of the dataset, while setting it to `True` will iterate over the dataset in a random order. See for yourself:

In [26]:
for batch in iter(test_dl):
    print(batch)

tensor([[0.6604, 0.3281, 0.3410, 0.1327, 0.6157, 0.9070, 0.3888, 0.4450, 0.9115,
         0.7691],
        [0.1119, 0.4702, 0.8129, 0.9292, 0.7062, 0.4830, 0.5732, 0.3643, 0.4619,
         0.6012],
        [0.1041, 0.4542, 0.7056, 0.7685, 0.2268, 0.3509, 0.4818, 0.4445, 0.2015,
         0.5644],
        [0.3941, 0.7311, 0.0825, 0.2191, 0.7799, 0.7084, 0.4344, 0.9510, 0.4570,
         0.6801]], dtype=torch.float64)
tensor([[0.7908, 0.6135, 0.0877, 0.2364, 0.0934, 0.8608, 0.4089, 0.1683, 0.6018,
         0.3265],
        [0.4291, 0.1776, 0.1431, 0.0675, 0.4716, 0.3212, 0.6341, 0.9953, 0.3252,
         0.9552],
        [0.6680, 0.6070, 0.1877, 0.4678, 0.7224, 0.9612, 0.2207, 0.6191, 0.3543,
         0.1904],
        [0.4047, 0.6705, 0.1524, 0.3851, 0.7845, 0.9092, 0.6224, 0.5635, 0.7001,
         0.5579]], dtype=torch.float64)
tensor([[0.5603, 0.4253, 0.0136, 0.5578, 0.4908, 0.8270, 0.5704, 0.6199, 0.7504,
         0.2012],
        [0.0545, 0.3724, 0.2470, 0.3543, 0.3972, 0.2273, 0.4405, 

In [27]:
test_dl = DataLoader(test_ds, batch_size=4, shuffle=True)
for batch in iter(test_dl):
    print(batch)

tensor([[0.4047, 0.6705, 0.1524, 0.3851, 0.7845, 0.9092, 0.6224, 0.5635, 0.7001,
         0.5579],
        [0.6604, 0.3281, 0.3410, 0.1327, 0.6157, 0.9070, 0.3888, 0.4450, 0.9115,
         0.7691],
        [0.3941, 0.7311, 0.0825, 0.2191, 0.7799, 0.7084, 0.4344, 0.9510, 0.4570,
         0.6801],
        [0.0545, 0.3724, 0.2470, 0.3543, 0.3972, 0.2273, 0.4405, 0.2374, 0.9427,
         0.8252]], dtype=torch.float64)
tensor([[0.1041, 0.4542, 0.7056, 0.7685, 0.2268, 0.3509, 0.4818, 0.4445, 0.2015,
         0.5644],
        [0.7908, 0.6135, 0.0877, 0.2364, 0.0934, 0.8608, 0.4089, 0.1683, 0.6018,
         0.3265],
        [0.1119, 0.4702, 0.8129, 0.9292, 0.7062, 0.4830, 0.5732, 0.3643, 0.4619,
         0.6012],
        [0.5603, 0.4253, 0.0136, 0.5578, 0.4908, 0.8270, 0.5704, 0.6199, 0.7504,
         0.2012]], dtype=torch.float64)
tensor([[0.4291, 0.1776, 0.1431, 0.0675, 0.4716, 0.3212, 0.6341, 0.9953, 0.3252,
         0.9552],
        [0.6680, 0.6070, 0.1877, 0.4678, 0.7224, 0.9612, 0.2207, 

In [28]:
test_dl = DataLoader(test_ds, batch_size=4, shuffle=False)
for batch in iter(test_dl):
    print(batch)

tensor([[0.1119, 0.4702, 0.8129, 0.9292, 0.7062, 0.4830, 0.5732, 0.3643, 0.4619,
         0.6012],
        [0.1041, 0.4542, 0.7056, 0.7685, 0.2268, 0.3509, 0.4818, 0.4445, 0.2015,
         0.5644],
        [0.7908, 0.6135, 0.0877, 0.2364, 0.0934, 0.8608, 0.4089, 0.1683, 0.6018,
         0.3265],
        [0.3941, 0.7311, 0.0825, 0.2191, 0.7799, 0.7084, 0.4344, 0.9510, 0.4570,
         0.6801]], dtype=torch.float64)
tensor([[0.4291, 0.1776, 0.1431, 0.0675, 0.4716, 0.3212, 0.6341, 0.9953, 0.3252,
         0.9552],
        [0.4047, 0.6705, 0.1524, 0.3851, 0.7845, 0.9092, 0.6224, 0.5635, 0.7001,
         0.5579],
        [0.6604, 0.3281, 0.3410, 0.1327, 0.6157, 0.9070, 0.3888, 0.4450, 0.9115,
         0.7691],
        [0.5603, 0.4253, 0.0136, 0.5578, 0.4908, 0.8270, 0.5704, 0.6199, 0.7504,
         0.2012]], dtype=torch.float64)
tensor([[0.6680, 0.6070, 0.1877, 0.4678, 0.7224, 0.9612, 0.2207, 0.6191, 0.3543,
         0.1904],
        [0.0545, 0.3724, 0.2470, 0.3543, 0.3972, 0.2273, 0.4405, 

In [29]:
print(test_data)

[[0.11189311 0.47022563 0.81289964 0.92920906 0.7061915  0.48302178
  0.57320662 0.364294   0.4618689  0.60119647]
 [0.104106   0.45422257 0.70556399 0.76854487 0.22682721 0.35090406
  0.48183919 0.44451674 0.20153437 0.56440944]
 [0.79082651 0.61350899 0.08772634 0.23642197 0.09341591 0.86080943
  0.4089348  0.16828306 0.60179675 0.32645062]
 [0.39412752 0.73113621 0.08249928 0.21905882 0.77990598 0.70838133
  0.43437863 0.95097075 0.45703167 0.68014718]
 [0.42905397 0.17759902 0.14309815 0.06750697 0.47163401 0.3211667
  0.63414452 0.99529385 0.32516169 0.95521448]
 [0.40473015 0.67045864 0.1523741  0.38511324 0.78445348 0.90923892
  0.62237947 0.56351123 0.70009889 0.55794555]
 [0.66037796 0.32813615 0.34095467 0.13273115 0.61572013 0.90702802
  0.3887919  0.44501581 0.91153967 0.76910439]
 [0.56028702 0.42526807 0.01361456 0.55777071 0.49075718 0.82702142
  0.57035887 0.61988486 0.75041955 0.20122308]
 [0.66802578 0.6069664  0.18773405 0.46781397 0.72241543 0.96118654
  0.2207421  
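
Comparing the printed batches with `test_data` by eye works for ten rows, but the same check can be done programmatically. A minimal sketch, assuming a deterministic stand-in array instead of the saved test split (the names `rows` and `check_ds` are hypothetical):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Deterministic stand-in for the test split: 10 rows whose first column is 0, 10, ..., 90.
rows = torch.arange(100, dtype=torch.float64).reshape(10, 10)
check_ds = TensorDataset(rows)

ordered = torch.cat([b[0] for b in DataLoader(check_ds, batch_size=4, shuffle=False)])
shuffled = torch.cat([b[0] for b in DataLoader(check_ds, batch_size=4, shuffle=True)])

# Without shuffling, concatenating the batches reproduces the dataset row by row.
print(torch.equal(ordered, rows))  # True

# With shuffling, the same rows appear in some permuted order; sorting by the
# first column (unique per row here) restores the original arrangement.
restored = shuffled[shuffled[:, 0].argsort()]
print(torch.equal(restored, rows))  # True
```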

### Solution 2: The elegant way

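Mask out what you don't want to keep: instead of writing one file per split, we keep the single file on disk and give each dataset an array of row indices that selects the samples it is allowed to see.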

In [30]:
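class CustomDataset(Dataset):

    def __init__(self, dataset_path: str, subset: Optional[Sequence[int]] = None):
        self.dataset_path = dataset_path
        self.data = np.load(dataset_path)
        # Row indices into self.data that this dataset exposes; None means "all rows".
        self.subset = np.array([*subset], dtype=np.int64) if subset is not None else None

    def __len__(self):
        return self.data.shape[0] if self.subset is None else len(self.subset)

    def __getitem__(self, idx):
        # idx may be an int, a slice, or a sequence of ints; map it through the subset first.
        if self.subset is not None:
            idx = self.subset[idx]
        return torch.from_numpy(self.data[idx, :])

    def get_subset(self, subset: Optional[Sequence[int]]) -> 'CustomDataset':
        """ Returns a new CustomDataset using only a subset of indices. """
        if subset is None:
            # No further restriction: keep whatever this dataset already exposes.
            subset = self.subset
        elif self.subset is not None:
            # Translate positions within this subset back to rows of the full array.
            subset = self.subset[np.asarray(subset, dtype=np.int64)]
        return CustomDataset(self.dataset_path, subset=subset)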


In [31]:
ds = CustomDataset(all_datapath)

In [32]:
ds[0], ds[10]

(tensor([0.9730, 0.8578, 0.8550, 0.3406, 0.5674, 0.8235, 0.5873, 0.8169, 0.0621,
         0.4731], dtype=torch.float64),
 tensor([0.4351, 0.3993, 0.6958, 0.6714, 0.8000, 0.7195, 0.4709, 0.8084, 0.6217,
         0.7403], dtype=torch.float64))

In [33]:
len(ds)

100

In [34]:
all_idx = np.arange(0, len(ds))
train_split, valid_test_split = train_test_split(all_idx, train_size=0.8)
valid_split, test_split = train_test_split(valid_test_split, train_size=0.5)

In [35]:
train_split

array([49, 91, 46, 61, 70, 60, 58,  4, 64, 87, 38, 95, 97, 81, 89, 62, 48,
       47, 66, 65, 17, 14, 42, 92, 83, 50, 41, 93, 52, 80, 99, 67, 84, 27,
       90, 40, 94, 29,  5, 22, 21,  1, 26, 57,  0, 98, 30, 68, 37, 10, 51,
       39, 63, 18,  8, 35, 19, 78, 86, 77, 20,  3, 25, 71, 15, 16, 55, 12,
       23, 36, 96, 24, 32, 69, 73, 85, 11,  2, 75, 45])

In [36]:
valid_split

array([ 6,  7,  9, 76, 44, 74, 33, 31, 13, 28])

In [37]:
test_split

array([54, 72, 43, 79, 34, 88, 56, 82, 53, 59])

In [38]:
train_ds = ds.get_subset(subset=train_split)
valid_ds = ds.get_subset(subset=valid_split)
test_ds = ds.get_subset(subset=test_split)

In [39]:
len(train_ds), len(valid_ds), len(test_ds)

(80, 10, 10)
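
For comparison, PyTorch ships a generic wrapper, `torch.utils.data.Subset`, that applies a similar index mapping to any map-style dataset. The sketch below is an alternative to the `get_subset` method, not a description of it. It assumes a toy `TensorDataset` and a seeded permutation in place of the `train_test_split` indices, and it relies only on integer indexing.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy dataset: sample i holds the single value float(i).
full_ds = TensorDataset(torch.arange(100, dtype=torch.float64).reshape(100, 1))

# Assumption: a seeded permutation stands in for the train_test_split indices.
perm = torch.randperm(100, generator=torch.Generator().manual_seed(0)).tolist()
train_sub = Subset(full_ds, perm[:80])
valid_sub = Subset(full_ds, perm[80:90])
test_sub = Subset(full_ds, perm[90:])

print(len(train_sub), len(valid_sub), len(test_sub))  # 80 10 10

# Item i of a Subset is item indices[i] of the wrapped dataset.
first_test_value = test_sub[0][0].item()
print(first_test_value == float(perm[90]))  # True
```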

In [40]:
test_ds[:10]

tensor([[5.4165e-01, 8.8414e-01, 1.0307e-01, 6.8394e-01, 7.5107e-02, 4.4322e-01,
         9.7007e-01, 6.3511e-01, 9.6964e-01, 4.3268e-04],
        [8.8705e-02, 4.9055e-01, 6.2738e-01, 3.4679e-02, 1.6851e-01, 5.0979e-01,
         4.8836e-02, 1.7073e-02, 5.5715e-01, 2.9179e-01],
        [7.5236e-01, 1.3519e-01, 6.4312e-01, 3.8824e-01, 2.8274e-01, 4.0763e-01,
         9.3979e-01, 9.7585e-01, 4.5837e-03, 6.0436e-01],
        [5.7535e-02, 1.0731e-01, 4.3320e-01, 4.1618e-01, 7.2452e-01, 1.4872e-01,
         4.5900e-01, 6.8188e-01, 5.3524e-01, 2.2020e-02],
        [2.7957e-01, 9.1127e-01, 2.2009e-01, 7.0582e-01, 3.1125e-01, 1.9890e-01,
         5.8782e-02, 6.9470e-01, 3.1745e-01, 9.6368e-01],
        [1.0411e-01, 4.5422e-01, 7.0556e-01, 7.6854e-01, 2.2683e-01, 3.5090e-01,
         4.8184e-01, 4.4452e-01, 2.0153e-01, 5.6441e-01],
        [2.3036e-01, 3.0045e-01, 8.3751e-02, 3.1431e-02, 3.9653e-01, 2.0653e-01,
         9.3354e-01, 5.7006e-01, 3.8842e-01, 7.5189e-01],
        [2.3922e-01, 3.0009

In [41]:
test_ds[10]

IndexError: index 10 is out of bounds for axis 0 with size 10

Note that we could use this technique to further subset the test dataset:

In [42]:
test_test_ds = test_ds.get_subset([0, 1, 8])

In [43]:
len(test_test_ds)

3

In [44]:
test_test_ds[:3]

tensor([[5.4165e-01, 8.8414e-01, 1.0307e-01, 6.8394e-01, 7.5107e-02, 4.4322e-01,
         9.7007e-01, 6.3511e-01, 9.6964e-01, 4.3268e-04],
        [8.8705e-02, 4.9055e-01, 6.2738e-01, 3.4679e-02, 1.6851e-01, 5.0979e-01,
         4.8836e-02, 1.7073e-02, 5.5715e-01, 2.9179e-01],
        [8.2997e-01, 1.9476e-01, 2.9357e-01, 6.6820e-01, 7.5377e-01, 7.3101e-01,
         4.8301e-01, 8.1237e-01, 5.2901e-01, 7.0016e-01]], dtype=torch.float64)

In [45]:
test_test_ds[3]

IndexError: index 3 is out of bounds for axis 0 with size 3
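
Nested subsets work because `get_subset` indexes the existing index array with the new positions. The sketch below replays that composition with plain NumPy, using the `test_split` values printed earlier in this notebook; the variable names `nested` and `rows_in_full_data` are made up for illustration.

```python
import numpy as np

# test_split as printed earlier in this notebook.
test_split = np.array([54, 72, 43, 79, 34, 88, 56, 82, 53, 59])

# Positions within the test split that the nested subset keeps.
nested = np.array([0, 1, 8])

# Composing the two index arrays: positions 0, 1, 8 of test_split.
rows_in_full_data = test_split[nested]
print(rows_in_full_data)  # [54 72 53]
```

So `test_test_ds[0]` reads row 54 of the full array, which is why its first two rows match the first two rows of `test_ds`.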

As with our previous datasets, we can define DataLoaders to iterate through all samples:

In [46]:
train_dl = DataLoader(train_ds, batch_size=4, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=4, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=4, shuffle=True)

In [47]:
for batch in iter(test_dl):
    print(batch)

tensor([[0.1041, 0.4542, 0.7056, 0.7685, 0.2268, 0.3509, 0.4818, 0.4445, 0.2015,
         0.5644],
        [0.2304, 0.3005, 0.0838, 0.0314, 0.3965, 0.2065, 0.9335, 0.5701, 0.3884,
         0.7519],
        [0.0887, 0.4905, 0.6274, 0.0347, 0.1685, 0.5098, 0.0488, 0.0171, 0.5572,
         0.2918],
        [0.2392, 0.0300, 0.9623, 0.6651, 0.2184, 0.1758, 0.0476, 0.1135, 0.0121,
         0.1898]], dtype=torch.float64)
tensor([[5.4165e-01, 8.8414e-01, 1.0307e-01, 6.8394e-01, 7.5107e-02, 4.4322e-01,
         9.7007e-01, 6.3511e-01, 9.6964e-01, 4.3268e-04],
        [5.7535e-02, 1.0731e-01, 4.3320e-01, 4.1618e-01, 7.2452e-01, 1.4872e-01,
         4.5900e-01, 6.8188e-01, 5.3524e-01, 2.2020e-02],
        [2.7957e-01, 9.1127e-01, 2.2009e-01, 7.0582e-01, 3.1125e-01, 1.9890e-01,
         5.8782e-02, 6.9470e-01, 3.1745e-01, 9.6368e-01],
        [7.5236e-01, 1.3519e-01, 6.4312e-01, 3.8824e-01, 2.8274e-01, 4.0763e-01,
         9.3979e-01, 9.7585e-01, 4.5837e-03, 6.0436e-01]], dtype=torch.float64)
tens