# DataLoader
In the previous notebook, we implemented a dataset that we can now use to access our data. However, in machine learning, we often need to perform a few additional data preparation steps before we can start training models.

An important additional class for data preparation is the **DataLoader**. By wrapping a dataset in a dataloader, we will be able to load small subsets of the dataset at a time, instead of having to load each sample separately. In machine learning, the small subsets are referred to as **mini-batches**, which will play an important role later in the lecture.

In this notebook, we will implement our own dataloader, which we can then use to load mini-batches from the dataset we implemented previously.

First, we need to import libraries and code.

In [1]:
import numpy as np

from dl_zero2one.data import DataLoader, DummyDataset

%load_ext autoreload
%autoreload 2

## 1. Iterating over a Dataset
Throughout this notebook, we will use a dummy dataset that contains all even numbers from 2 to 100. Similar to the dataset we implemented before, the dummy dataset has a `__len__()` method that allows us to call `len(dataset)`, as well as a `__getitem__()` method that allows us to call `dataset[i]` and returns a dict `{"data": val}`, where `val` is the even number at index `i`.

Let's start by defining the dataset, and calling its methods to get a better feel for it.

In [2]:
dataset = DummyDataset(
    root=None,
    divisor=2,
    limit=100
)
print(
    "Dataset Length:\t", len(dataset),
    "\nFirst Element:\t", dataset[0],
    "\nLast Element:\t", dataset[-1],
)

Dataset Length:	 50 
First Element:	 {'data': 2} 
Last Element:	 {'data': 100}
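
To make this interface concrete, here is a minimal sketch of a class that behaves like the dummy dataset above. The name `EvenNumbersDataset` is hypothetical and the actual `DummyDataset` in `dl_zero2one` may be implemented differently; the sketch only assumes the `__len__()` / `__getitem__()` behavior described above.

```python
class EvenNumbersDataset:
    """Hypothetical stand-in for DummyDataset: multiples of `divisor` up to `limit`."""

    def __init__(self, divisor=2, limit=100):
        # range(2, 101, 2) -> 2, 4, ..., 100
        self.data = list(range(divisor, limit + 1, divisor))

    def __len__(self):
        # enables len(dataset)
        return len(self.data)

    def __getitem__(self, index):
        # enables dataset[i]; each sample is a dict, as in the notebook
        return {"data": self.data[index]}


sketch_dataset = EvenNumbersDataset(divisor=2, limit=100)
print(len(sketch_dataset), sketch_dataset[0], sketch_dataset[-1])
# 50 {'data': 2} {'data': 100}
```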


In the following, we will write some code to iterate over the dataset in mini-batches, similarly to what a dataloader is supposed to do. The number of samples to load per mini-batch is called **batch size**. For the remainder of this notebook, the batch size is 3.

In [3]:
batch_size = 3

Let us now define a simple function that iterates over the dataset and groups samples into mini-batches:

In [4]:
def build_batches(dataset, batch_size):
    batches = []  # list of all mini-batches
    batch = []  # current mini-batch
    for i in range(len(dataset)):
        batch.append(dataset[i])
        if len(batch) == batch_size:  # if the current mini-batch is full,
            batches.append(batch)  # add it to the list of mini-batches,
            batch = []  # and start a new mini-batch
    return batches

batches = build_batches(
    dataset=dataset,
    batch_size=batch_size
)

Let's have a look at the mini-batches:

In [5]:
def print_batches(batches):  
    for i, batch in enumerate(batches):
        print("mini-batch %d:" % i, str(batch))

print_batches(batches)

mini-batch 0: [{'data': 2}, {'data': 4}, {'data': 6}]
mini-batch 1: [{'data': 8}, {'data': 10}, {'data': 12}]
mini-batch 2: [{'data': 14}, {'data': 16}, {'data': 18}]
mini-batch 3: [{'data': 20}, {'data': 22}, {'data': 24}]
mini-batch 4: [{'data': 26}, {'data': 28}, {'data': 30}]
mini-batch 5: [{'data': 32}, {'data': 34}, {'data': 36}]
mini-batch 6: [{'data': 38}, {'data': 40}, {'data': 42}]
mini-batch 7: [{'data': 44}, {'data': 46}, {'data': 48}]
mini-batch 8: [{'data': 50}, {'data': 52}, {'data': 54}]
mini-batch 9: [{'data': 56}, {'data': 58}, {'data': 60}]
mini-batch 10: [{'data': 62}, {'data': 64}, {'data': 66}]
mini-batch 11: [{'data': 68}, {'data': 70}, {'data': 72}]
mini-batch 12: [{'data': 74}, {'data': 76}, {'data': 78}]
mini-batch 13: [{'data': 80}, {'data': 82}, {'data': 84}]
mini-batch 14: [{'data': 86}, {'data': 88}, {'data': 90}]
mini-batch 15: [{'data': 92}, {'data': 94}, {'data': 96}]


As we can see, the iteration works, but the output is not very pretty. Let us now write a simple function that combines the dictionaries of all samples in a mini-batch.

In [6]:
def combine_batch_dicts(batch):
    batch_dict = {}
    for data_dict in batch:
        for key, value in data_dict.items():
            if key not in batch_dict:
                batch_dict[key] = []
            batch_dict[key].append(value)
    return batch_dict

combined_batches = [combine_batch_dicts(batch) for batch in batches]
print_batches(combined_batches)

mini-batch 0: {'data': [2, 4, 6]}
mini-batch 1: {'data': [8, 10, 12]}
mini-batch 2: {'data': [14, 16, 18]}
mini-batch 3: {'data': [20, 22, 24]}
mini-batch 4: {'data': [26, 28, 30]}
mini-batch 5: {'data': [32, 34, 36]}
mini-batch 6: {'data': [38, 40, 42]}
mini-batch 7: {'data': [44, 46, 48]}
mini-batch 8: {'data': [50, 52, 54]}
mini-batch 9: {'data': [56, 58, 60]}
mini-batch 10: {'data': [62, 64, 66]}
mini-batch 11: {'data': [68, 70, 72]}
mini-batch 12: {'data': [74, 76, 78]}
mini-batch 13: {'data': [80, 82, 84]}
mini-batch 14: {'data': [86, 88, 90]}
mini-batch 15: {'data': [92, 94, 96]}


This looks much more organized.

To perform operations more efficiently later, we would also like the values of the mini-batches to be contained in a numpy array instead of a simple list. Let's briefly write a function for that:

In [7]:
def batch_to_numpy(batch):
    numpy_batch = {}
    for key, value in batch.items():
        numpy_batch[key] = np.array(value)
    return numpy_batch

numpy_batches = [batch_to_numpy(batch) for batch in combined_batches]
print_batches(numpy_batches)

mini-batch 0: {'data': array([2, 4, 6])}
mini-batch 1: {'data': array([ 8, 10, 12])}
mini-batch 2: {'data': array([14, 16, 18])}
mini-batch 3: {'data': array([20, 22, 24])}
mini-batch 4: {'data': array([26, 28, 30])}
mini-batch 5: {'data': array([32, 34, 36])}
mini-batch 6: {'data': array([38, 40, 42])}
mini-batch 7: {'data': array([44, 46, 48])}
mini-batch 8: {'data': array([50, 52, 54])}
mini-batch 9: {'data': array([56, 58, 60])}
mini-batch 10: {'data': array([62, 64, 66])}
mini-batch 11: {'data': array([68, 70, 72])}
mini-batch 12: {'data': array([74, 76, 78])}
mini-batch 13: {'data': array([80, 82, 84])}
mini-batch 14: {'data': array([86, 88, 90])}
mini-batch 15: {'data': array([92, 94, 96])}


Lastly, we would like to make the loading a bit more memory efficient. Instead of loading the entire dataset into memory at once, let us only load samples when they are needed. This can be done by building a Python generator, using the `yield` keyword. See https://wiki.python.org/moin/Generators for more information on generators.
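
As a small illustration of this lazy behavior, the sketch below uses a hypothetical `lazy_squares` generator (not part of the notebook's code) that records when each value is actually computed. Nothing runs until the consumer asks for the next value.

```python
def lazy_squares(n, log):
    """Yield the squares 0, 1, 4, ... one at a time, logging each computation."""
    for i in range(n):
        log.append(i)  # runs only when the consumer requests the next value
        yield i * i


log = []
squares = lazy_squares(3, log)  # creating the generator computes nothing yet
print(log)            # []
print(next(squares))  # 0
print(log)            # [0]
print(list(squares))  # [1, 4]
print(log)            # [0, 1, 2]
```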

In [8]:
def build_batch_iterator(dataset, batch_size, shuffle):
    if shuffle:
        index_iterator = iter(np.random.permutation(len(dataset)))  # define indices as iterator
    else:
        index_iterator = iter(range(len(dataset)))  # define indices as iterator

    batch = []
    for index in index_iterator:  # iterate over indices using the iterator
        batch.append(dataset[index])
        if len(batch) == batch_size:
            yield batch  # `yield` turns this function into a generator
            batch = []
            
batch_iterator = build_batch_iterator(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True
)
batches = []
for batch in batch_iterator:
    batches.append(batch)

print_batches(
    [batch_to_numpy(combine_batch_dicts(batch)) for batch in batches]
)

mini-batch 0: {'data': array([ 16,  48, 100])}
mini-batch 1: {'data': array([64, 78, 80])}
mini-batch 2: {'data': array([46, 22, 72])}
mini-batch 3: {'data': array([44, 54, 40])}
mini-batch 4: {'data': array([30, 74, 98])}
mini-batch 5: {'data': array([24, 84, 20])}
mini-batch 6: {'data': array([82, 60, 96])}
mini-batch 7: {'data': array([88, 56, 32])}
mini-batch 8: {'data': array([62,  8, 90])}
mini-batch 9: {'data': array([18, 92, 68])}
mini-batch 10: {'data': array([52, 38, 12])}
mini-batch 11: {'data': array([26, 94, 70])}
mini-batch 12: {'data': array([86, 76, 42])}
mini-batch 13: {'data': array([36, 50,  6])}
mini-batch 14: {'data': array([10, 34,  2])}
mini-batch 15: {'data': array([28, 58, 14])}


The functionality of the cell above is now pretty close to what the dataloader is supposed to do. Two points are worth noting:
1. Two samples of the dataset are not contained in any mini-batch. This is because the number of samples in the dataset (50) is not divisible by the batch size (3), so the left-over samples are implicitly discarded. Ideally, there should be an option that allows us to decide how to handle these last samples.
2. Without shuffling, the order of the mini-batches, as well as which samples are grouped together, would always be the same (in increasing order). The `shuffle` option addresses this by randomly permuting the indices of the dataset before iterating over it, using `np.random.permutation(len(dataset))`.
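
The left-over samples from the first point follow directly from integer division. The sketch below works through the arithmetic for 50 samples and a batch size of 3; the helper `count_batches` is a hypothetical name introduced here for illustration only.

```python
import math


def count_batches(num_samples, batch_size, drop_last):
    """Number of mini-batches produced for a dataset of `num_samples` samples."""
    if drop_last:
        # only full mini-batches are kept
        return num_samples // batch_size
    # a final, smaller mini-batch holds the left-over samples
    return math.ceil(num_samples / batch_size)


full_batches, left_over = divmod(50, 3)
print(full_batches, left_over)                 # 16 2
print(count_batches(50, 3, drop_last=True))    # 16
print(count_batches(50, 3, drop_last=False))   # 17
```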

## 2. DataLoader Class Implementation
Now let's put everything together and implement the DataLoader as a proper class.
Have a look at the `class DataLoader` of `dl_zero2one/data/image_folder_dataset.py`. 


Note that the `__init__` method receives four arguments:
* **dataset** is the dataset that the dataloader should load.
* **batch_size** is the mini-batch size, i.e. the number of samples we want to load at a time.
* **shuffle** is a boolean that defines whether the dataset should be randomly shuffled or not.
* **drop_last** is a boolean that defines how to handle the last mini-batch in our dataset. Specifically, if the number of samples in our dataset is not divisible by the mini-batch size, there will be some samples left over at the end. If `drop_last=True`, we simply discard those samples; otherwise, we return them together as a smaller mini-batch.
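
To show how these four arguments could fit together, here is a minimal sketch of such a class. It is an illustration under stated assumptions, not the `DataLoader` from `dl_zero2one`: the name `SketchDataLoader` is hypothetical, and the sketch assumes every sample is a dict with the same keys.

```python
import numpy as np


class SketchDataLoader:
    """Illustrative only; not the DataLoader shipped with dl_zero2one."""

    def __init__(self, dataset, batch_size=1, shuffle=False, drop_last=False):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last

    def __len__(self):
        # number of mini-batches, depending on how left-over samples are handled
        full, left_over = divmod(len(self.dataset), self.batch_size)
        return full if self.drop_last or left_over == 0 else full + 1

    def __iter__(self):
        n = len(self.dataset)
        indices = np.random.permutation(n) if self.shuffle else np.arange(n)
        for start in range(0, n, self.batch_size):
            batch_indices = indices[start:start + self.batch_size]
            if self.drop_last and len(batch_indices) < self.batch_size:
                break  # discard the incomplete last mini-batch
            samples = [self.dataset[int(i)] for i in batch_indices]
            # combine the per-sample dicts into one dict of numpy arrays
            # (assumption: all samples share the same keys)
            yield {key: np.array([s[key] for s in samples]) for key in samples[0]}


# toy data: 7 samples, so a batch size of 3 leaves one sample over
toy_dataset = [{"data": value} for value in range(2, 16, 2)]
loader = SketchDataLoader(toy_dataset, batch_size=3, shuffle=False, drop_last=False)
for batch in loader:
    print(batch)
```

With `drop_last=False`, this toy example yields three mini-batches, the last of which contains only the left-over sample 14. With `drop_last=True`, it yields two.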

In [9]:
dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True
)

We now check whether our dataloader works as intended. You can change the value of `drop_last` to see the difference.

In [10]:
dataloader = DataLoader(
    dataset=dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False,    # Change this value to see the impact of drop_last on the last batch
)
for batch in dataloader:
    print(batch)

{'data': array([28, 44, 20])}
{'data': array([54, 84, 42])}
{'data': array([34, 68, 96])}
{'data': array([92, 18, 62])}
{'data': array([94, 76, 52])}
{'data': array([38,  4, 48])}
{'data': array([24, 40,  6])}
{'data': array([36, 80, 74])}
{'data': array([46, 22, 70])}
{'data': array([12, 64, 88])}
{'data': array([66, 10, 30])}
{'data': array([72,  2, 78])}
{'data': array([ 8, 60, 90])}
{'data': array([ 98, 100,  56])}
{'data': array([16, 58, 86])}
{'data': array([82, 14, 26])}
{'data': array([32, 50])}


# Key Takeaways
1. In machine learning, we often need to load data in **mini-batches**, which are small subsets of the training dataset. The number of samples per mini-batch is called the **batch size**.
2. In addition to the Dataset class, we use a **DataLoader** class that takes care of mini-batch construction, data shuffling, and more.
3. The dataloader is iterable and only loads those samples of the dataset that are needed for the current mini-batch. This can become a bottleneck later if batches cannot be provided fast enough for the rest of the pipeline, especially when loading from HDDs with their slow read times.
4. The dataloader task can easily be distributed among multiple processes as well as pre-fetched. When we switch to PyTorch later, we can directly use our dataset classes and replace our current DataLoader with theirs :).
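
The pre-fetching mentioned in the last point can be sketched with the standard library alone. The helper name `prefetch` and its `buffer_size` parameter are hypothetical, and this is not how PyTorch implements it; a more robust version would also forward exceptions from the worker thread and handle consumers that stop early.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the data

    def worker():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    # daemon thread: it will not keep the program alive if the consumer stops early
    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the next item is ready
        if item is sentinel:
            return
        yield item


toy_batches = ([i, i + 1] for i in range(0, 6, 2))
for batch in prefetch(toy_batches, buffer_size=2):
    print(batch)
```

The order of the mini-batches is preserved; only the loading overlaps with whatever the consumer does between iterations.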

# Outlook
We have now implemented everything we need to use the CIFAR datasets for deep learning model training. Using our dataset and dataloader, our model training will later look something like the following:

In [11]:
dataset = DummyDataset(
    root=None,
    divisor=2,
    limit=200,
)
dataloader = DataLoader(
    dataset=dataset,
    batch_size=3,
    shuffle=True,
    drop_last=True
)
model = lambda x: x
for minibatch in dataloader:
    model_output = model(minibatch)
    # do more stuff... (soon)