![banner](../src/visuals/banner.png)

# PyTorch DataLoaders
In our [Intro to PyTorch](https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/Intro%20to%20PyTorch/Intro%20to%20PyTorch.ipynb) we got a basic view of the general PyTorch mechanics. What we didn't discuss in much detail is how to actually pass data to the model efficiently. Let's quickly pull up what we did before for the MNIST dataset.


The MNIST dataset was offered by default by PyTorch as one of their testing datasets. Our goal for this notebook will be to learn how to build this **Dataset** class from scratch so we can use it on our own custom dataset!!
```
train = torchvision.datasets.MNIST('../data', train=True, download=True,
                      transform=transforms.Compose([ ### CONVERT ARRAY TO TENSOR
                          transforms.ToTensor()
                       ]))

test = torchvision.datasets.MNIST('../data', train=False, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))
```

The datasets above can then be passed to the **DataLoader** so we can grab random minibatches. We will also explore the DataLoader specifically and see what other functionality it has!

```
trainset = torch.utils.data.DataLoader(train, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
testset = torch.utils.data.DataLoader(test, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```

In [1]:
import torch
import torch.nn
import torchvision
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader

import os # Lets us list folders and build file paths
import numpy as np 
from PIL import Image # Lets us load images

## Dogs Vs Cats Dataset

The first dataloader we will build will be for the Dogs vs Cats dataset. In case you haven't done this already, run the following to download the dataset to the */data* folder

```
bash download_data.sh
```

### Important Ideas to Keep in Mind
- **Memory Constraints:** When we build a Dataset class we have to think about our hardware constraints. If we have a smaller CSV file, we can use Pandas to read it in and then index the DataFrame to grab samples. The issue with this approach is that the entire Pandas DataFrame is **read into memory**. This means if you have massive tabular datasets, Pandas is not the way to go. You would have to use a system that loads samples of your tabular data into memory only when they are accessed. We will cover one such system that has been incredible, called [Deep Lake](https://github.com/activeloopai/deeplake), in a future lesson. 
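To make the "load only when accessed" idea concrete, here is a rough sketch using only the Python standard library. The `LazyCSVRows` class and the `pets.csv` file are hypothetical names made up for this illustration (this is not how Deep Lake works internally), and the sketch assumes no field in the CSV contains a line break.

```python
import csv
import os
import tempfile


class LazyCSVRows:
    """Hypothetical helper: keeps only row offsets in memory, reads one row per access."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # Byte offset where each data row starts
        with open(path, "rb") as f:
            f.readline()  # Skip the header row
            while True:
                position = f.tell()
                line = f.readline()
                if not line:
                    break
                self.offsets.append(position)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Jump straight to the requested row instead of loading the whole file
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8")
        return next(csv.reader([line]))


# Demo on a tiny throwaway file
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "pets.csv")
with open(csv_path, "w") as f:
    f.write("id,label\n0,cat\n1,dog\n2,cat\n")

rows = LazyCSVRows(csv_path)
print(len(rows))  # 3
print(rows[1])    # ['1', 'dog']
```

Only the list of offsets lives in memory, so the cost grows with the number of rows rather than with the size of the file.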

In our Cats vs Dogs dataset, this is what our file directory looks like:

```
.
└── data/
    └── Petimages/
        ├── Dogs/
        │   ├── xxx.jpg
        │   ├── yyy.jpg
        │   └── ...
        └── Cats/
            ├── xxx.jpg
            ├── yyy.jpg
            └── ...
```

In our folder Petimages, we have two more folders called Dogs and Cats, and each folder contains images of Dogs and Cats respectively. So we have two options:
- Load all our images into Numpy Arrays (and tensors later) up front along with their class label 0 or 1. Every image would then sit in memory at the same time **### PLEASE DON'T DO THIS!!!**
- Store only a list of strings that indicate the path to each image, and then load an image only when it is accessed


### Components of a Dataset

The dataset class will have three components:
- **init**: Initialize the dataset class with everything you want to store as class variables
- **len**: We have to tell the Dataset how many samples there are. When we go to grab a sample, it can grab any sample from index 0 to index len - 1
- **getitem**: This is the block that does most of the heavy lifting. Basically, given some index between 0 and len - 1, it will grab the sample at that index.
```
class CustomDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
```
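As a quick illustration of how Python itself uses these three methods, here is a toy sketch. `SquaresDataset` is a hypothetical class invented for this example, and it deliberately does not inherit from the PyTorch `Dataset` so that it runs without any dependencies.

```python
class SquaresDataset:
    """Hypothetical toy dataset: sample idx is the pair (idx, idx ** 2)."""

    def __init__(self, num_samples):
        # Runs once: store whatever __len__ and __getitem__ will need
        self.num_samples = num_samples

    def __len__(self):
        # Called by len(dataset)
        return self.num_samples

    def __getitem__(self, idx):
        # Called by dataset[idx]. Raising IndexError past the end is what
        # tells a plain for loop over the dataset when to stop.
        if not 0 <= idx < self.num_samples:
            raise IndexError(idx)
        return idx, idx ** 2


squares = SquaresDataset(num_samples=4)
print(len(squares))   # 4
print(squares[3])     # (3, 9)
print(list(squares))  # [(0, 0), (1, 1), (2, 4), (3, 9)]
```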

### Let's be a bit more specific about what's going on!

```
.
└── data/
    └── Petimages/
        ├── Dogs/
        │   ├── 111.jpg
        │   ├── 222.jpg
        └── Cats/
            ├── 333.jpg
            ├── 444.jpg
```

Let's say in our directories we have only 4 images in total (2 Cat and 2 Dog images). Our **len** would then be 4 as we only have 4 images in total! The **getitem** will then use the indexes [0,1,2,3] to access these samples, so our goal in the getitem is to give the Dataset a way to load the image based on the sample number. 

The easiest way to do this is to store a list of filepaths to all of the images, and then when we access the filepath, we can load that image in! So we would first have to build this list of paths and store it to some class variable we can access later! 

**Note**: This function should not be in the **getitem**. You want it to happen when the Dataset is being Initialized, otherwise we will build the list every time we grab a sample. The code in **getitem** runs **EVERY TIME WE GRAB A SAMPLE**, but **init** will only run once.

```
path_to_dogs = ["data/Petimages/Dogs/111.jpg", "data/Petimages/Dogs/222.jpg"] # Make a list of the path to all the Dog files
path_to_cats = ["data/Petimages/Cats/333.jpg", "data/Petimages/Cats/444.jpg"] # Make a list of the path to all the Cat files

training_files = path_to_dogs + path_to_cats
# training_files is now:
# ["data/Petimages/Dogs/111.jpg", "data/Petimages/Dogs/222.jpg",
#  "data/Petimages/Cats/333.jpg", "data/Petimages/Cats/444.jpg"]
```
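In practice we would not type the paths by hand. A minimal sketch of building the same list from disk is below. It creates a throwaway folder with empty placeholder files so that it can run anywhere, and `list_image_paths` is a hypothetical helper name used only for this illustration.

```python
import os
import tempfile

# Build a throwaway folder that mimics the layout above (the files are empty placeholders)
root = tempfile.mkdtemp()
layout = {"Dogs": ["111.jpg", "222.jpg"], "Cats": ["333.jpg", "444.jpg"]}
for folder, names in layout.items():
    os.makedirs(os.path.join(root, folder))
    for name in names:
        open(os.path.join(root, folder, name), "w").close()


def list_image_paths(folder):
    # sorted() because os.listdir returns files in arbitrary order
    return [os.path.join(folder, name) for name in sorted(os.listdir(folder))]


path_to_dogs = list_image_paths(os.path.join(root, "Dogs"))
path_to_cats = list_image_paths(os.path.join(root, "Cats"))
training_files = path_to_dogs + path_to_cats

print(len(training_files))  # 4
print([os.path.basename(p) for p in training_files])  # ['111.jpg', '222.jpg', '333.jpg', '444.jpg']
```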

Now that we have a list of files, we have to set the **len**, which in this case would just be the length of training_files, or 4.

Lastly, when we loop through our Dataset with a for loop, the **getitem** function is called with an input of **idx**. More specifically, as we loop from 0 to 3 (our 4 total samples from before), each idx is passed to **getitem**, where we can use it to index our list of training files. For example, the first iteration passes in the index 0, which we use to grab the first filepath.

- In the first iteration at the index 0 we have the filepath *data/Petimages/Dogs/111.jpg*.
- In the second iteration at the index 1 we have the filepath *data/Petimages/Dogs/222.jpg*.
- In the third iteration at the index 2 we have the filepath *data/Petimages/Cats/333.jpg*.
- In the fourth iteration at the index 3 we have the filepath *data/Petimages/Cats/444.jpg*.

Once we do the fourth iteration, we have covered the entire length indicated in **len** and the loop will end. Therefore in **getitem**, if we can index the filepath to each sample in this way, we can then load the image from the file and return the image as an array of numbers. Also, we can easily find the label of the image (is it a Cat or a Dog?) because the filepath includes "Dogs" or "Cats" in the name. Therefore we can also return from **getitem** an integer value: 0 if Dog is in the path, or 1 if Cat is in the path.
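Here is one possible sketch of turning a filepath into a label. The `LABELS` dictionary and `label_from_path` function are hypothetical names for this illustration. It looks at the parent folder name instead of searching the whole path for a word, which is a slightly stricter variation: a simple substring check also works, but it would misfire if a filename happened to contain "Dog" or "Cat".

```python
from pathlib import Path

LABELS = {"Dogs": 0, "Cats": 1}  # Folder name -> integer label


def label_from_path(path):
    # The name of the folder that directly contains the image decides the class
    return LABELS[Path(path).parent.name]


print(label_from_path("data/Petimages/Dogs/111.jpg"))  # 0
print(label_from_path("data/Petimages/Cats/444.jpg"))  # 1
```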

#### Arrays vs Tensors
When we load an image we will use the PIL Image module that we imported above. All this module does is take a filepath and load the image in the PIL format. We can then convert this to a numpy array, but PyTorch models expect tensors rather than numpy arrays, so we need to convert to a tensor. From the PyTorch Torchvision module, we have imported transforms, which has a ton of cool image transformations we can do (and will look at a bit later). The one we need right now is the **ToTensor()** function that can accept a PIL image (or Numpy Array) and convert it to a Tensor that PyTorch models can work with!


#### 8-Bit Images
Most images are stored in what is known as an 8-bit format. Essentially each pixel in the image can take integer values in the range of [0,255]. Now the problem with this is that Deep Learning tends to prefer numbers scaled between [0,1], so we just need to scale our 8-bit images down. One way is to just divide everything by 255, but the **ToTensor()** function will already handle this for us in the PIL -> Tensor transformation.
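To see what that scaling looks like by hand, here is a small NumPy sketch on a made-up 2 x 2 image. It only mimics the two things **ToTensor()** does for 8-bit inputs (dividing by 255 and moving the channel axis to the front) and is not the torchvision implementation.

```python
import numpy as np

# A fake 2x2 RGB image in 8-bit format, laid out as [Height, Width, Channels]
image_uint8 = np.array(
    [[[0, 0, 0], [255, 255, 255]],
     [[255, 0, 0], [0, 128, 255]]],
    dtype=np.uint8,
)

scaled = image_uint8.astype(np.float32) / 255.0   # [0, 255] -> [0.0, 1.0]
channels_first = np.transpose(scaled, (2, 0, 1))  # [H, W, C] -> [C, H, W]

print(channels_first.shape)        # (3, 2, 2)
print(scaled.min(), scaled.max())  # 0.0 1.0
```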

### Let's Put All the Ideas Together!!

In [2]:
class DogsVsCats(Dataset):
    def __init__(self, path_to_folder):
        path_to_cats = os.path.join(path_to_folder, "Cat") # Get Path to Cat folder
        path_to_dogs = os.path.join(path_to_folder, "Dog") # Get Path to Dog folder 
        
        dog_files = os.listdir(path_to_dogs) # Get list of all files inside of dog folder
        cat_files = os.listdir(path_to_cats) # Get list of all files inside cat folder
        
        path_to_dog_files = [os.path.join(path_to_dogs, file) for file in dog_files] # Get full path to each dog file
        path_to_cat_files = [os.path.join(path_to_cats, file) for file in cat_files] # Get full path to each cat file
        
        self.training_files = path_to_dog_files + path_to_cat_files # Concatenate our list of paths to dog and cat files
        self.dog_label, self.cat_label = 0, 1 # Store 0/1 Labels for Dogs and Cats
        
        self.transform = transforms.ToTensor() # Convert a PIL Image or ndarray to tensor and scale the values from [0,255] for 8Bit image to [0,1]
        
        
    def __len__(self):
        return len(self.training_files) # The number of samples we have is just the number of training files we have

    def __getitem__(self, idx):
        """
        Given some index from range [0, len(self.training_files) - 1] we want to load the sample
        """
        path_to_image = self.training_files[idx] # Grab file path at the sampled index
        
        if "Dog" in path_to_image: # If the word "Dog" is in the filepath, then set the label to 0
            label = self.dog_label
        
        else:
            label = self.cat_label # Otherwise set the label to 1
            
        image = Image.open(path_to_image) # Open Image with PIL to create a PIL Image
        image = self.transform(image) # Convert image to Tensor
        
        return image, label
    
dogvcat = DogsVsCats("../data/PetImages/")

print("Total Training Samples", len(dogvcat))

for image, label in dogvcat:
    print("Image Label:",label)
    print("Image Shape:", image.shape)
    break

Total Training Samples 25002
Image Label: 0
Image Shape: torch.Size([3, 375, 500])


### We have a Dataset class built, so now let's try loading it into the DataLoader!

To reiterate, we have built a Dataset that can access a single image, but in Deep Learning, we want to sample minibatches, so we need to grab a **BATCH_SIZE** amount of images. We can do that pretty easily using the DataLoader!

The DataLoader has some more functions we will look at later, but the most basic things we need to include are:

```
DataLoader(dataset=DATASET, # We place here the dataset we have defined previously
           batch_size=16,   # How many samples do you want to put together in each batch?
           shuffle=True)    # Do you want to shuffle the data?
```
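Before pointing the DataLoader at real images, it can help to see batching on data where nothing can go wrong. The sketch below uses `TensorDataset` from `torch.utils.data` with 10 made-up samples. The names `toy_dataset` and `toy_loader` are just for this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 fake samples with 3 features each, plus one integer label per sample
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

toy_loader = DataLoader(dataset=toy_dataset, batch_size=4, shuffle=True)

batch_shapes = [tuple(x.shape) for x, y in toy_loader]
print(batch_shapes)  # [(4, 3), (4, 3), (2, 3)]
```

With 10 samples and a batch size of 4, the last batch only has 2 samples left to grab, which is why it comes out smaller.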

Let's go ahead and instantiate the DataLoader and try to do a for loop. Again, we are expecting 16 images at once!

In [4]:
dogsvcatsloader = DataLoader(dogvcat, 
                             batch_size=16, 
                             shuffle=True)

for images, labels in dogsvcatsloader:
    print(images.shape)
    print(labels)
    break

RuntimeError: stack expects each tensor to be equal size, but got [3, 375, 500] at entry 0 and [3, 100, 133] at entry 4

## Oh No! What Happened?

Let's take a look at a key assumption of the DataLoader before we keep going. Take a closer look at the error (the exact shapes and entry numbers change from run to run because the batches are shuffled):

```
RuntimeError: stack expects each tensor to be equal size, but got [3, 371, 500] at entry 0 and [3, 416, 500] at entry 1
```

What this is trying to say is the first image had dimension [3, 371, 500] and the second image had dimension [3, 416, 500] and because the shapes are different it cannot stack them together. Therefore the only way we can stack these images together is to somehow resize them all into the same shape.
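The same failure can be reproduced without any images at all. By default the DataLoader builds a batch by stacking the individual samples, so a direct call to `torch.stack` on two tensors with the shapes from the error message should fail in the same way. This is a small sketch with zero-filled placeholder tensors, not the DataLoader internals.

```python
import torch

first = torch.zeros(3, 371, 500)
second = torch.zeros(3, 416, 500)

try:
    torch.stack([first, second])  # Heights differ (371 vs 416), so this raises
    stacked_ok = True
except RuntimeError as err:
    stacked_ok = False
    print("Stack failed:", err)

# Tensors that share a shape stack into one batch with a new leading dimension
same_size = torch.stack([torch.zeros(3, 224, 224), torch.zeros(3, 224, 224)])
print(same_size.shape)  # torch.Size([2, 3, 224, 224])
```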


## Torchvision Transforms
We will be quickly taking a look at the different transformations available from [PyTorch Torchvision Transforms](https://pytorch.org/vision/stable/transforms.html). Now there are a ton of these and we can't talk about all of them but I want to cover some common ones we use. 

- **ToTensor()**: We already saw this one, but it converts a PIL Image or Numpy Array to a PyTorch Tensor.
- **Resize()**: This will resize an image to the wanted size and is what we need to handle the images in our dataset being different sizes.
- **Normalize()**: Allows us to feed in the Means and Standard Deviations for each channel (RGB) and normalize our images.
- **RandomHorizontalFlip()**: Randomly flips an image horizontally with some probability (default 0.5).
- **RandomVerticalFlip()**: Randomly flips an image vertically with some probability (default 0.5).
- **Compose()**: Allows us to stick together multiple image transformations in a single sequence!
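Of these, normalization is the one with actual arithmetic behind it: each channel has its own mean subtracted and is then divided by its own standard deviation. Here is a hand-written sketch of that calculation on a made-up image. The mean and standard deviation values are the commonly used ImageNet statistics, and the code is only meant to mirror the formula, not to replace **Normalize()**.

```python
import torch

mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)  # One mean per channel
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)   # One std per channel

image = torch.full((3, 2, 2), 0.5)  # Fake [C, H, W] image with every pixel at 0.5
normalized = (image - mean) / std   # Broadcasts across height and width

print(normalized.shape)     # torch.Size([3, 2, 2])
print(normalized[:, 0, 0])  # One normalized value per channel
```

Reshaping the statistics to `[3, 1, 1]` is what lets a single subtraction and division apply the right value to every pixel of each channel.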

**Note**: Why do random flipping? Well, an image of a dog flipped around is still a picture of a dog. Including flipped copies helps the model generalize in how it interprets the different classes. We will also talk about issues such as overfitting later, and this is a great technique to help curtail that!


## Let's Put Together a Composition of Transformations and Add It to Our Dataset Class!

Compose expects a list of the transformations we want, in the order we want them applied! Once that is done, let's try to for loop through the DataLoader to see if we get any errors!

In [16]:
### Create a Composition of Transformations
img_transforms = transforms.Compose(
    [
        transforms.Resize((224,224)), # Resize the image from whatever it is to a [3, 224, 224] image
        transforms.RandomHorizontalFlip(p=0.5), # Do a random flip with 50% probability
        transforms.ToTensor(), # Convert the image to a Tensor
        transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) # Normalize the Image
    ]
)

### Lets Add this to our Dataset Class
class DogsVsCats(Dataset):
    def __init__(self, path_to_folder, transforms): ### Our INIT now accepts a transforms argument (inside __init__ this name shadows the torchvision transforms module)
        
        ### PREVIOUS CODE ###
        path_to_cats = os.path.join(path_to_folder, "Cat")
        path_to_dogs = os.path.join(path_to_folder, "Dog")
        dog_files = os.listdir(path_to_dogs)
        cat_files = os.listdir(path_to_cats) 
        path_to_dog_files = [os.path.join(path_to_dogs, file) for file in dog_files] 
        path_to_cat_files = [os.path.join(path_to_cats, file) for file in cat_files] 
        self.training_files = path_to_dog_files + path_to_cat_files 
        self.dog_label, self.cat_label = 0, 1 
        
        
        ### NEW CODE ###
        # self.transform = transforms.ToTensor() -> Notice how our transforms was just ToTensor before. It will be our Composition of Transforms now!
        self.transform = transforms
        
    def __len__(self):
        return len(self.training_files) # The number of samples we have is just the number of training files we have

    def __getitem__(self, idx):
        ### PREVIOUS CODE ###
        path_to_image = self.training_files[idx] # Grab file path at the sampled index      
        if "Dog" in path_to_image: # If the word "Dog" is in the filepath, then set the label to 0
            label = self.dog_label
        else:
            label = self.cat_label # Otherwise set the label to 1
        image = Image.open(path_to_image) # Open Image with PIL to create a PIL Image
        
        ### UPDATED CODE ###
        image = self.transform(image) # Image now will go through series of transforms indicated in self.transform
        return image, label
    

### Instantiate Dataset With the Transforms ###
dogvcat = DogsVsCats(path_to_folder="../data/PetImages/",
                     transforms=img_transforms)


dogsvcatsloader = DataLoader(dogvcat, 
                             batch_size=16, 
                             shuffle=True)

for images, labels in dogsvcatsloader:
    print(images.shape)
    print(labels)
    break

torch.Size([16, 3, 224, 224])
tensor([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0])


## Voila! It Finally Works!!!
We are now grabbing 16 samples of images, each image has 3 channels, and each channel is of size 224 x 224. We are also getting the 16 binary class labels of 0 or 1 depending on whether it's a Dog or a Cat. This is the common Tensor format that most image-related Deep Learning tasks use:

```
[Batch Size x Num Channels x Image Height x Image Width]
```

In this notebook we are mainly focusing on how to use the DataLoader and not really training anything, but we will use this DataLoader in a future lesson!

## The Next Problem: Deep Learning Consists of Highly Expressive Models
Deep Learning models are extremely powerful because of their expressive nature and ability to model very complex, high dimensional relationships. This means that as we train a model, how it is performing on the training data can be deceptive. What I mean by this is, after training a model, you may get an astonishing 100% accuracy! You think this model is incredible, but the second you start using it for real world prediction problems with examples your model has never seen, you see that it doesn't perform accurately at all.

We briefly talked about this before in the Intro to PyTorch: the problem of **Overfitting** on your data. To have a fair comparison, we have to train our model on some of the labeled data we have, but keep a small part of it for validation. The model will never optimize on the validation data; we only run inference on it to see how well the model performs on data it has never seen. We will get into that in a future lesson, but for now, how do we set up our DataLoader to be able to do this?


### Random Split
Luckily for us, PyTorch has a function in **torch.utils.data** called **random_split**. In this function we essentially let PyTorch know how many samples we want in our training set and testing set, and it will randomly split the dataset for us! We saw previously that our dataset has about 25000 samples, so what we want to do is give 90 percent of our samples to training and the remaining 10 percent to validation. We will then load each of these datasets into its own DataLoader.

In [20]:
train_samples = int(0.9 * len(dogvcat))
test_samples = len(dogvcat) - train_samples

print("Number of Training Samples:", train_samples, "Number of Test Samples", test_samples)

train_dataset, test_dataset = torch.utils.data.random_split(dogvcat, lengths=[train_samples, test_samples]) # Randomly split the dataset by the given lengths

### Load Datasets into two DataLoaders ###
trainloader = DataLoader(train_dataset, 
                         batch_size=16, 
                         shuffle=True)

testloader = DataLoader(test_dataset, 
                        batch_size=16, 
                        shuffle=False) # Evaluation data does not need to be shuffled


### Test Loaders ###
for images, labels in trainloader:
    print(images.shape)
    print(labels)
    break
    
for images, labels in testloader:
    print(images.shape)
    print(labels)
    break

Number of Training Samples: 22437 Number of Test Samples 2494
torch.Size([16, 3, 224, 224])
tensor([0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1])
torch.Size([16, 3, 224, 224])
tensor([0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
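One thing to keep in mind is that the split is random, so rerunning the notebook gives a different training and validation set each time. If you want the same split every run, `random_split` accepts a `generator` argument. The sketch below shows this on a stand-in dataset of 100 integers. The `split` helper and the seed value 42 are arbitrary choices for this illustration.

```python
import torch
from torch.utils.data import random_split

toy_dataset = range(100)  # Anything with __len__ and __getitem__ can be split
train_count = int(0.9 * len(toy_dataset))
val_count = len(toy_dataset) - train_count


def split(seed):
    # A seeded generator makes the random permutation repeatable
    generator = torch.Generator().manual_seed(seed)
    return random_split(toy_dataset, [train_count, val_count], generator=generator)


train_a, val_a = split(seed=42)
train_b, val_b = split(seed=42)

print(len(train_a), len(val_a))        # 90 10
print(val_a.indices == val_b.indices)  # True
```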
