![banner](../src/visuals/banner.png)

# PyTorch DataLoaders
In our [Intro to PyTorch](https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/Intro%20to%20PyTorch/Intro%20to%20PyTorch.ipynb) we got a basic view of the general PyTorch mechanics. What we didn't discuss in much detail is how to actually pass data to the model efficiently. Let's quickly pull up what we did before for the MNIST dataset.


The MNIST dataset ships with Torchvision as one of its built-in datasets. Our goal for this notebook is to learn how to build this **Dataset** class from scratch so we can use it on our own custom dataset!!
```
train = torchvision.datasets.MNIST('../data', train=True, download=True,
                      transform=transforms.Compose([ ### CONVERT ARRAY TO TENSOR
                          transforms.ToTensor()
                       ]))

test = torchvision.datasets.MNIST('../data', train=False, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))
```

The datasets above can then be passed to the **DataLoader** so we can grab random minibatches. We will also explore the DataLoader itself and see what other functionality it has!

```
trainset = torch.utils.data.DataLoader(train, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
testset = torch.utils.data.DataLoader(test, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```

In [137]:
import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import ImageFolder # Stream data from images stored in folders

import os # Allows to access files
import numpy as np 
from PIL import Image # Allows us to Load Images
from collections import Counter # Utility function to give us the counts of unique items in an iterable

## Dogs Vs Cats Dataset

The first dataloader we will build will be for the Dogs vs Cats dataset. In case you haven't done this already, run the following to download all the datasets to the */data* folder:

```
bash download_data.sh
```

### Important Ideas to Keep in Mind
- **Memory Constraints:** When we build a Dataset class we have to think about our hardware constraints. If we have a small CSV file, we can use Pandas to read it in and then index the DataFrame to grab samples. The issue with this approach is that the entire Pandas DataFrame is **read into memory**. This means that if you have massive tabular datasets, Pandas is not the way to go. You would need a system that loads samples of your tabular data into memory only when they are accessed. We will cover one such system that has been incredible, called [Deep Lake](https://github.com/activeloopai/deeplake), in a future lesson.

In our Cats vs Dogs dataset, this is what our file directory looks like:

```
.
└── data/
    └── Petimages/
        ├── Dogs/
        │   ├── xxx.jpg
        │   ├── yyy.jpg
        │   └── ...
        └── Cats/
            ├── xxx.jpg
            ├── yyy.jpg
            └── ...
```

In our folder Petimages, we have two more folders called Dogs and Cats, and each folder contains images of Dogs and Cats respectively. So we have two options:
- Load all our images into Numpy Arrays (and tensors later) up front along with their class label 0 or 1 **### PLEASE DONT DO THIS!!!**
- Store only a list of strings that indicate the path to each Image and then load the image only when accessed


### Components of a Dataset

The dataset class will have three components:
- **init**: Initialize the dataset class with everything you want to store as class variables
- **len**: We have to tell the Dataset how many samples there are. When we go to grab a sample, it can grab any sample from index 0 to index len - 1
- **getitem**: This is the block that does most of the heavy lifting. Given some index between 0 and len - 1, it will load and return that sample.
```
class CustomDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
```

### Let's be a bit more specific about what's going on!

```
.
└── data/
    └── Petimages/
        ├── Dogs/
        │   ├── 111.jpg
        │   ├── 222.jpg
        └── Cats/
            ├── 333.jpg
            ├── 444.jpg
```

Let's say our directories hold only 4 images in total (2 Cat and 2 Dog images). Our **len** would then be 4! The **getitem** will then use the indexes [0,1,2,3] to access these samples, so our goal in the getitem is to give the Dataset a way to load the image based on the sample number.

The easiest way to do this is to store a list of filepaths to all of the images, and then when we access a filepath, we can load that image in! So we first have to build this list of paths and store it in a class variable we can access later!

**Note**: Building this list should not happen in the **getitem**. You want it to happen when the Dataset is being initialized, otherwise we will rebuild the list every time we grab a sample. The code in **getitem** runs **EVERY TIME WE GRAB A SAMPLE**, but **init** will only run once.

```
path_to_dogs = ["data/Petimages/Dogs/111.jpg", "data/Petimages/Dogs/222.jpg"] # Make a list of the path to all the Dog files
path_to_cats = ["data/Petimages/Cats/333.jpg", "data/Petimages/Cats/444.jpg"] # Make a list of the path to all the Cat files

training_files = path_to_dogs + path_to_cats
# training_files == ["data/Petimages/Dogs/111.jpg", "data/Petimages/Dogs/222.jpg",
#                    "data/Petimages/Cats/333.jpg", "data/Petimages/Cats/444.jpg"]
```

Now that we have a list of files, we have to set the **len**, which in this case would just be the length of training_files, or 4.

Lastly, when we loop through our Dataset with a for loop, the **getitem** function is run with an input of **idx**. More specifically, as we loop over our 4 samples, **idx** takes the values 0 through 3, and we can use each value to index our training files.

- In the first iteration at the index 0 we have the filepath *data/Petimages/Dogs/111.jpg*.
- In the second iteration at the index 1 we have the filepath *data/Petimages/Dogs/222.jpg*.
- In the third iteration at the index 2 we have the filepath *data/Petimages/Cats/333.jpg*.
- In the fourth iteration at the index 3 we have the filepath *data/Petimages/Cats/444.jpg*.

Once we do the fourth iteration, we have covered the entire length indicated by **len** and the loop will end. Therefore in **getitem**, if we can index the filepath to each sample in this way, we can then load the image from the file and return the image as an array of numbers. Also, we can easily find the label of the image (is it a Cat or Dog?) because the filepath includes "Dogs" or "Cats" in the name. Therefore we can also return from **getitem** an integer label such as 0 if Dog is in the path, or 1 if Cat is in the path.
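
The indexing logic described above can be sketched without PyTorch or any image files. The class below is a plain Python stand-in (the name `ToyPathDataset` is made up for illustration) that stores the four example paths and returns each path with its label instead of a loaded image. It relies on a general Python behaviour: a `for` loop over an object that defines `__getitem__` but not `__iter__` calls `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised.

```python
class ToyPathDataset:
    """Plain-Python stand-in for a Dataset: stores paths, loads nothing."""

    def __init__(self):
        # Built once, when the object is created
        dog_paths = ["data/Petimages/Dogs/111.jpg", "data/Petimages/Dogs/222.jpg"]
        cat_paths = ["data/Petimages/Cats/333.jpg", "data/Petimages/Cats/444.jpg"]
        self.files = dog_paths + cat_paths

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Runs on every access; indexing past the end of the list raises
        # IndexError, which is what ends the for loop below
        path = self.files[idx]
        label = 0 if "Dogs" in path else 1  # 0 = Dog, 1 = Cat
        return path, label


toy = ToyPathDataset()
print(len(toy))
for path, label in toy:
    print(label, path)
```

A real Dataset would open the file at `path` inside `__getitem__` rather than return the string.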

#### Arrays vs Tensors
When we load an image we will use the PIL Image module that we imported above. All this module does is take a filepath and load the image in the PIL format. We could then convert this to a NumPy array, but PyTorch models work with tensors rather than NumPy arrays, so we need to convert to a tensor. From the Torchvision module we have imported **transforms**, which has a ton of cool image transformations we can do (and will look at a bit later). The one we need right now is the **ToTensor()** function that can accept a PIL image (or NumPy array) and convert it to a Tensor that PyTorch models can work with!


#### 8Bit Images 
Most images are stored in what is known as an 8-bit format: each pixel value is an integer in the range [0,255]. The problem is that Deep Learning tends to prefer inputs scaled to [0,1], so we need to scale our 8-bit images down. One way is to just divide everything by 255, but the **ToTensor()** function will already handle this for us in the PIL -> Tensor transformation.
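
To make the scaling concrete, here is a rough sketch of what happens to an 8-bit image during that conversion, written with plain `torch` operations. It is an approximation for illustration, not Torchvision's actual implementation, and the random tensor is only a stand-in for a real image.

```python
import torch

# Fake 8-bit RGB image in the layout PIL/NumPy use: [Height, Width, Channels]
image_8bit = torch.randint(0, 256, (4, 5, 3), dtype=torch.uint8)

# Roughly what ToTensor() does for an 8-bit image:
# 1) move channels first -> [Channels, Height, Width]
# 2) cast to float and divide by 255 so values land in [0, 1]
image_float = image_8bit.permute(2, 0, 1).to(torch.float32) / 255.0

print(image_8bit.shape, image_8bit.dtype)
print(image_float.shape, image_float.dtype)
print(image_float.min().item(), image_float.max().item())
```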

### Let's Put all the ideas Together!!

In [138]:
class DogsVsCats(Dataset):
    def __init__(self, path_to_folder):
        path_to_cats = os.path.join(path_to_folder, "Cat") # Get Path to Cat folder
        path_to_dogs = os.path.join(path_to_folder, "Dog") # Get Path to Dog folder 
        
        dog_files = os.listdir(path_to_dogs) # Get list of all files inside of dog folder
        cat_files = os.listdir(path_to_cats) # Get list of all files inside cat folder
        
        path_to_dog_files = [os.path.join(path_to_dogs, file) for file in dog_files] # Get full path to each dog file
        path_to_cat_files = [os.path.join(path_to_cats, file) for file in cat_files] # Get full path to each cat file
        
        self.training_files = path_to_dog_files + path_to_cat_files # Concatenate our list of paths to dog and cat files
        self.dog_label, self.cat_label = 0, 1 # Store 0/1 Labels for Dogs and Cats
        
        self.transform = transforms.ToTensor() # Convert a PIL Image or ndarray to tensor and scale the values from [0,255] for 8Bit image to [0,1]
        
        
    def __len__(self):
        return len(self.training_files) # The number of samples we have is just the number of training files we have

    def __getitem__(self, idx):
        """
        Given some index from range [0, len(self.training_files) - 1] we want to load the sample
        """
        path_to_image = self.training_files[idx] # Grab file path at the sampled index
        
        if "Dog" in path_to_image: # If the word "Dog" is in the filepath, then set the label to 0
            label = self.dog_label
        
        else:
            label = self.cat_label # Otherwise set the label to 1
            
        image = Image.open(path_to_image) # Open Image with PIL to create a PIL Image
        image = self.transform(image) # Convert image to Tensor
        
        return image, label
    
dogvcat = DogsVsCats("../data/PetImages/")

print("Total Training Samples", len(dogvcat))

for image, label in dogvcat:
    print("Image Label:",label)
    print("Image Shape:", image.shape)
    break

Total Training Samples 24931
Image Label: 0
Image Shape: torch.Size([3, 375, 500])


### We have a Dataset class built, so now let's try loading it into the DataLoader!

To reiterate, we have built a Dataset that can access a single image at a time, but in Deep Learning we want to sample minibatches, so we need to grab **BATCH_SIZE** images at once. We can do that pretty easily using the DataLoader!

The DataLoader has more options that we will look at later, but the most basic things we need to include are:

```
DataLoader(dataset=DATASET, # We place here the dataset we have defined previously
           batch_size=16,   # How many samples do you want to put together in each batch?
           shuffle=True)    # Do you want to shuffle the data?
```

Let's then go ahead and instantiate the DataLoader and try a for loop. Again, we are expecting 16 images at once!

In [140]:
dogsvcatsloader = DataLoader(dogvcat, 
                             batch_size=16, 
                             shuffle=False)

for images, labels in dogsvcatsloader:
    print(images.shape)
    print(labels)
    break

RuntimeError: stack expects each tensor to be equal size, but got [3, 375, 500] at entry 0 and [3, 500, 327] at entry 1

## Oh No! What Happened?

Let's take a look at a key assumption of the DataLoader before we keep going. Take a closer look at the error:

```
RuntimeError: stack expects each tensor to be equal size, but got [3, 375, 500] at entry 0 and [3, 500, 327] at entry 1
```

What this is trying to say is that the first image had dimension [3, 375, 500] and the second image had dimension [3, 500, 327], and because the shapes are different it cannot stack them together. Therefore the only way we can stack these images together is to somehow resize them all into the same shape.
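
The failure can be reproduced in isolation. The sketch below uses random tensors with the two shapes from the error message, shows that `torch.stack` rejects them, and then resizes both to a common size. The helper `resize_to` is a hypothetical name, and it uses `torch.nn.functional.interpolate` as a stand-in for the Torchvision resize transform introduced in the next section.

```python
import torch
import torch.nn.functional as F

# Two fake images with the shapes from the error message above
img_a = torch.rand(3, 375, 500)
img_b = torch.rand(3, 500, 327)

try:
    torch.stack([img_a, img_b])
    stack_failed = False
except RuntimeError as err:
    stack_failed = True
    print("Stacking failed:", err)


def resize_to(img, size=(224, 224)):
    # interpolate expects a batch dimension, so add one and remove it afterwards
    resized = F.interpolate(img.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    return resized.squeeze(0)


batch = torch.stack([resize_to(img_a), resize_to(img_b)])
print(batch.shape)
```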


## Torchvision Transforms
We will be quickly taking a look at the different transformations available from [PyTorch Torchvision Transforms](https://pytorch.org/vision/stable/transforms.html). Now there are a ton of these and we can't talk about all of them but I want to cover some common ones we use. 

```
ToTensor(): We already saw this one, but it converts a PIL Image or Numpy Array to a PyTorch Tensor.
Resize(): This will resize an image to the wanted size and is what we need to handle the images in our dataset being different sizes
Normalize(): Allows us to feed in the Means and Standard Deviations for each channel (RGB) and normalize our images.
RandomHorizontalFlip(): Randomly flips an image horizontally with some probability (default 0.5)
RandomVerticalFlip(): Randomly flips an image Vertically with some probability (default 0.5)

Compose(): Allows us to stick together multiple image transformations in a single sequence!
```

**Note**: Why do random flipping? Well, an image of a dog flipped around is still a picture of a dog! Including flips helps the model generalize in how it interprets the different classes. We will also talk about issues such as overfitting later, and this is a great technique to help curtail that!
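
As a rough illustration of what two of these transforms compute, the sketch below applies the Normalize arithmetic and a horizontal flip by hand with plain `torch` operations. The random tensor stands in for a real image, and the per-channel means and standard deviations are the same values used in the transform pipeline later in this notebook.

```python
import torch

image = torch.rand(3, 224, 224)  # stand-in for the output of ToTensor(), values in [0, 1]

# Normalize(mean, std) computes (x - mean) / std separately for each channel
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
normalized = (image - mean) / std

# A horizontal flip reverses the width axis (the last dimension)
flipped = torch.flip(image, dims=[-1])

print(normalized.mean(dim=(1, 2)))
print(torch.equal(flipped[:, :, 0], image[:, :, -1]))
```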


## Let's put together a Composition of Transformations and Add it to our Dataset Class!

Compose expects a list of the transformations we want, in the order we want them applied! Once it is done, let's try to for loop through the DataLoader to see if we get any errors!

In [146]:
### Create a Composition of Transformations
img_transforms = transforms.Compose(
    [
        transforms.Resize((224,224)), # Resize the image from whatever it is to a [3, 224, 224] image
        transforms.RandomHorizontalFlip(p=0.5), # Do a random flip with 50% probability
        transforms.ToTensor(), # Convert the image to a Tensor
        transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]) # Normalize the Image
    ]
)

### Lets Add this to our Dataset Class
class DogsVsCats(Dataset):
    def __init__(self, path_to_folder, transforms): ### Our INIT Now Accepts a Transforms
        
        ### PREVIOUS CODE ###
        path_to_cats = os.path.join(path_to_folder, "Cat")
        path_to_dogs = os.path.join(path_to_folder, "Dog")
        dog_files = os.listdir(path_to_dogs)
        cat_files = os.listdir(path_to_cats) 
        path_to_dog_files = [os.path.join(path_to_dogs, file) for file in dog_files] 
        path_to_cat_files = [os.path.join(path_to_cats, file) for file in cat_files] 
        self.training_files = path_to_dog_files + path_to_cat_files 
        self.dog_label, self.cat_label = 0, 1 
        
        
        ### NEW CODE ###
        # self.transform = transforms.ToTensor() -> Notice how our transforms was just ToTensor before. It will be our Composition of Transforms now!
        self.transform = transforms
        
    def __len__(self):
        return len(self.training_files) # The number of samples we have is just the number of training files we have

    def __getitem__(self, idx):
        ### PREVIOUS CODE ###
        path_to_image = self.training_files[idx] # Grab file path at the sampled index      
        if "Dog" in path_to_image: # If the word "Dog" is in the filepath, then set the label to 0
            label = self.dog_label
        else:
            label = self.cat_label # Otherwise set the label to 1
        image = Image.open(path_to_image) # Open Image with PIL to create a PIL Image
        
        ### UPDATED CODE ###
        image = self.transform(image) # Image now will go through series of transforms indicated in self.transform
        return image, label
    

### Instantiate Dataset With the Transforms ###
dogvcat = DogsVsCats(path_to_folder="../data/PetImages/",
                     transforms=img_transforms)


dogsvcatsloader = DataLoader(dogvcat, 
                             batch_size=16, 
                             shuffle=True)

for images, labels in dogsvcatsloader:
    print(images.shape)
    print(labels)
    break

torch.Size([16, 3, 224, 224])
tensor([0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0])


## Voila! It Finally Works!!!
We are now grabbing 16 images at a time, each image has 3 channels, and each channel is of size 224 x 224. We are also getting the 16 binary class labels of 0 or 1 depending on whether it's a Dog or Cat. This is the common Tensor format that most image-related Deep Learning tasks use:

```
[Batch Size x Num Channels x Image Height x Image Width]
```

In this notebook we are mainly focusing on how to use the DataLoader and not really training anything, but we will use this DataLoader in a future lesson!

## The Next Problem: Deep Learning Consists of Highly Expressive Models
Deep Learning models are extremely powerful because of their expressive nature and ability to model very complex, high dimensional relationships. This means that as we train a model, how it performs on the training data can be deceptive. What I mean by this is, after training a model, you may get an astonishing 100% accuracy! You think this model is incredible, but the second you start using it for real world prediction problems with examples your model has never seen, you see that it doesn't perform accurately at all.

We briefly talked about this before in the Intro to PyTorch: the problem of **Overfitting** on your data. To have a fair comparison, we have to train our model on some of the labeled data we have, but keep a small part of it for validation. The model will never optimize on the validation data; we only run inference on it to see how well the model performs on data it has never seen. We will get into that in a future lesson, but for now, how do we set up our DataLoader to be able to do this?


### Random Split
Luckily for us, PyTorch has a function in **torch.utils.data** called **random_split**. We tell PyTorch how many samples we want in our training set and our held-out set, and it will randomly split the dataset for us! We saw previously that our dataset has about 25000 samples, so we will give 90 percent of our samples to training and the remaining 10 percent to validation (the code calls this held-out split `test`). We will then load each of these datasets into its own DataLoader.

In [147]:
train_samples = int(0.9 * len(dogvcat))
test_samples = len(dogvcat) - train_samples

print("Number of Training Samples:", train_samples, "Number of Test Samples", test_samples)

train_dataset, test_dataset = torch.utils.data.random_split(dogvcat, lengths=[train_samples, test_samples]) # Split the dataset by the lengths

### Load Datasets into two DataLoaders ###
trainloader = DataLoader(train_dataset, 
                         batch_size=16, 
                         shuffle=True)

testloader = DataLoader(test_dataset, 
                        batch_size=16, 
                        shuffle=True)


### Test Loaders ###
for images, labels in trainloader:
    print(images.shape)
    print(labels)
    break
    
for images, labels in testloader:
    print(images.shape)
    print(labels)
    break

Number of Training Samples: 22437 Number of Test Samples 2494
torch.Size([16, 3, 224, 224])
tensor([0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1])
torch.Size([16, 3, 224, 224])
tensor([0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1])
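
One thing the split above does not control is reproducibility: each run produces a different partition. A minimal sketch of a seeded split follows, using a toy `TensorDataset` in place of `dogvcat`; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset of 100 samples standing in for dogvcat
toy_dataset = TensorDataset(torch.arange(100).float().unsqueeze(1), torch.arange(100) % 2)

n_train = int(0.9 * len(toy_dataset))
n_val = len(toy_dataset) - n_train

# Seeding the generator makes the split identical on every run
generator = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(toy_dataset, [n_train, n_val], generator=generator)

print(len(train_subset), len(val_subset))
# Each Subset stores the indices it draws from the original dataset
print(sorted(val_subset.indices))
```

Because both subsets wrap the same underlying dataset, they also share its transforms, so a random flip defined on the dataset would be applied to the held-out images as well.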


## Quick Dataset Building

If your image files are stored in the format shown at the beginning (separate folders for each class with the images inside), we can use the PyTorch **ImageFolder** available in the datasets module of Torchvision. This will functionally do the same thing, but saves you a few lines of code so it comes in handy! By default, it gives each class folder its own class label. In our case, if we print out the classes attribute of our ImageFolder dataset, it will return a list:
```
['Cat', 'Dog']
```

This tells us that Cat will have the label 0 and Dog will have the label 1 (as per their indexes). Note that this is the reverse of our custom Dataset above, where Dog was 0 and Cat was 1.

The main input to **ImageFolder** is the path to the root directory (the folder that has all the class folders in it) and then the transformations we want to apply (same as the img_transforms from before).
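
The label assignment can be mimicked with the standard library alone. The sketch below assumes, as the `['Cat', 'Dog']` output suggests, that class folders are sorted alphabetically and numbered in that order. It builds a throwaway directory so nothing depends on the real data, and it is an illustration of the idea rather than Torchvision's code.

```python
import os
import tempfile

# Build a throwaway folder layout: root/Dog and root/Cat
root = tempfile.mkdtemp()
for class_name in ["Dog", "Cat"]:
    os.makedirs(os.path.join(root, class_name))

# Sort the sub-folder names, then number them in that order
classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
class_to_idx = {name: idx for idx, name in enumerate(classes)}

print(classes)
print(class_to_idx)
```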

In [148]:
dogvcat = ImageFolder(root="../data/PetImages/",
                      transform=img_transforms)

print(dogvcat.classes) # See what the Dataset Classes are

### The remaining code is identical to before! ###

train_dataset, test_dataset = torch.utils.data.random_split(dogvcat, lengths=[train_samples, test_samples]) # Split the dataset by the lengths

### Load Datasets into two DataLoaders ###
trainloader = DataLoader(train_dataset, 
                         batch_size=16, 
                         shuffle=True)

testloader = DataLoader(test_dataset, 
                        batch_size=16, 
                        shuffle=True)


### Test Loaders ###
for images, labels in trainloader:
    print(images.shape)
    print(labels)
    break
    
for images, labels in testloader:
    print(images.shape)
    print(labels)
    break

['Cat', 'Dog']
torch.Size([16, 3, 224, 224])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1])
torch.Size([16, 3, 224, 224])
tensor([1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1])


## DataLoaders for NLP 
We will now repeat for NLP what we did previously for images. There are two difficulties we will face with NLP:
- We are now working with words, not numbers, so how do we convert them?
- Sentences have different lengths, but PyTorch cannot stack tensors of different lengths, as we saw previously with the image example. Sentences also aren't something we can resize, so what do we do?

### Words to Numbers: Tokenizers
There are a ton of methods that we will explore when we dig into NLP more, but for now let's just look at the most basic one!

**Character Level Tokenizers**
We will essentially take a sentence and just split it by characters. Let's pretend we only have lowercase letters and spaces in our dataset, in which case [a-z, " "] can be represented with the numbers [0-26]: 0 to 25 for our 26 letters and 26 for the space character.
```
Tokenize Sentence
"hi my name is priyam" -> ['h', 'i', ' ', 'm','y',' ','n','a','m','e',' ','i','s',' ','p','r','i','y','a','m'] 

Convert to Numbers
['h', 'i', ' ', 'm','y',' ','n','a','m','e',' ','i','s',' ','p','r','i','y','a','m'] -> [7, 8, 26, 12, 24, 26, 13, 0, 12, 4, 26, 8, 18, 26, 15, 17, 8, 24, 0, 12]
```

We will have to build this tokenizer by iterating through all of our data, finding all the unique characters available, and making a dictionary that allows us to swap between characters and their numerical values.
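
For the simplified vocabulary above (lowercase letters plus a space), the two dictionaries and the conversion in both directions can be sketched in a few lines. The names `char_to_id`, `id_to_char`, `encode` and `decode` are chosen for this illustration only.

```python
import string

# Assumed vocabulary: lowercase letters a-z followed by a space
vocab = list(string.ascii_lowercase) + [" "]
char_to_id = {char: idx for idx, char in enumerate(vocab)}
id_to_char = {idx: char for char, idx in char_to_id.items()}


def encode(text):
    # Character -> number, one entry per character
    return [char_to_id[char] for char in text]


def decode(ids):
    # Number -> character, joined back into a string
    return "".join(id_to_char[idx] for idx in ids)


sentence = "hi my name is priyam"
token_ids = encode(sentence)
print(token_ids)
print(decode(token_ids))
```

This toy version raises a `KeyError` on any character outside the vocabulary, which is exactly the gap an unknown token is meant to cover.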

### IMDB Dataset 
For this example we will be looking at the IMDB dataset. This dataset is a binary classification problem that expects us to take in some text and then predict if it is a positive or negative review.
```
.
└── aclImdb/
    ├── train/
    │   ├── neg/
    │   │   ├── txt1.txt
    │   │   ├── txt2.txt
    │   │   └── ...
    │   └── pos/
    │       ├── txt3.txt
    │       ├── txt4.txt
    │       └── ...
    └── test/
        ├── neg/
        │   ├── txt5.txt
        │   ├── txt6.txt
        │   └── ...
        └── pos/
            ├── txt7.txt
            ├── txt8.txt
            └── ...
```

As you can see, they already split the training and testing data up for us, and each of train and test has two classes, positive and negative! Inside each of those folders we have a series of text files that we need to read in. The majority of the Dataset principles we have covered carry over, so we will build it the same way again. The main difference is that instead of making a single dataset and using **random_split**, the split is already done for us, so we can just make two datasets with two different filepaths.

### Build Tokenizer Dictionary

We will build a tokenizer off of our training data and include an Unknown token in case there are characters in our testing data that are not found in the training data. We will also include a threshold: a character has to appear more than 1500 times to be included in our list of available characters. This 1500 is arbitrary though, and you should pick a threshold that makes sense for your problem.

In [149]:
path_to_data = "../data/aclImdb/train"

path_to_pos_fld = os.path.join(path_to_data, "pos")
path_to_neg_fld = os.path.join(path_to_data, "neg")

path_to_pos_txt = [os.path.join(path_to_pos_fld, file) for file in os.listdir(path_to_pos_fld)]
path_to_neg_txt = [os.path.join(path_to_neg_fld, file) for file in os.listdir(path_to_neg_fld)]

training_files = path_to_pos_txt + path_to_neg_txt

alltxt = ""
for file in training_files:
    with open(file, "r") as f:
        text = f.readlines()
        alltxt += text[0]

### Get the counts of each character and filter 
unique_counts = dict(Counter(alltxt))
characters = sorted([key for (key,value) in unique_counts.items() if value > 1500])

### Concatenate an Unknown Token in case we come across tokens that don't exist in this set
characters.append("<UNK>")

### Concatenate a Padding token, which we will use later to make our tensors all the same length
characters.append("<PAD>")

### Build our Dictionary Mapping between Characters and Numbers
char2idx = {c:i for i,c in enumerate(characters)}
idx2char = {i:c for i,c in enumerate(characters)}

print(char2idx)

{' ': 0, '!': 1, '"': 2, '&': 3, "'": 4, '(': 5, ')': 6, '*': 7, ',': 8, '-': 9, '.': 10, '/': 11, '0': 12, '1': 13, '2': 14, '3': 15, '4': 16, '5': 17, '6': 18, '7': 19, '8': 20, '9': 21, ':': 22, ';': 23, '<': 24, '>': 25, '?': 26, 'A': 27, 'B': 28, 'C': 29, 'D': 30, 'E': 31, 'F': 32, 'G': 33, 'H': 34, 'I': 35, 'J': 36, 'K': 37, 'L': 38, 'M': 39, 'N': 40, 'O': 41, 'P': 42, 'R': 43, 'S': 44, 'T': 45, 'U': 46, 'V': 47, 'W': 48, 'Y': 49, 'Z': 50, 'a': 51, 'b': 52, 'c': 53, 'd': 54, 'e': 55, 'f': 56, 'g': 57, 'h': 58, 'i': 59, 'j': 60, 'k': 61, 'l': 62, 'm': 63, 'n': 64, 'o': 65, 'p': 66, 'q': 67, 'r': 68, 's': 69, 't': 70, 'u': 71, 'v': 72, 'w': 73, 'x': 74, 'y': 75, 'z': 76, 'é': 77, '<UNK>': 78, '<PAD>': 79}


In [150]:
class IMDBDataset(Dataset):
    def __init__(self, path_to_data, char2idx): ### Load in path to the data and our tokenizing dictionary
        path_to_pos_fld = os.path.join(path_to_data, "pos")
        path_to_neg_fld = os.path.join(path_to_data, "neg")
        
        path_to_pos_txt = [os.path.join(path_to_pos_fld, file) for file in os.listdir(path_to_pos_fld)]
        path_to_neg_txt = [os.path.join(path_to_neg_fld, file) for file in os.listdir(path_to_neg_fld)]
        
        self.training_files = path_to_pos_txt + path_to_neg_txt
        self.tokenizer = char2idx
        
    def __len__(self):
        return len(self.training_files)
    
    def __getitem__(self, idx):
        path_to_txt = self.training_files[idx]
        
        with open(path_to_txt, "r") as f:
            txt = f.readlines()[0]
            
        tokenized = []
        for char in txt:
            if char in self.tokenizer.keys(): # Check that the char is available in our tokenizer
                tokenized.append(self.tokenizer[char])
            else:
                tokenized.append(self.tokenizer["<UNK>"]) # Otherwise use our unknown token
                
                
        sample = torch.tensor(tokenized) # Convert list of tokenized numbers to a pytorch tensor
        if "neg" in path_to_txt:
            label = 0
        else:
            label = 1
        
        return sample, label
        
        
    
imdbdataset = IMDBDataset("../data/aclImdb/train", char2idx)

counter = 0
for sample, label in imdbdataset:
    print(sample.shape)
    print(label)
    counter +=1
    
    if counter ==3:
        break


torch.Size([806])
1
torch.Size([2366])
1
torch.Size([841])
1


## Data Collator

As we can see above, we are able to return tokenized forms for each sentence, but they are all different lengths! This makes sense as different sentences are different lengths, so we will append the **<PAD>** token to our samples to make sure everything in a batch is the same length. We could have, in the Dataset class, set a default length of 3000 tokens, cutting longer sentences to that size and padding shorter ones up to it. The problem with this is that if we have samples much shorter than 3000, we will have unnecessary amounts of padding. Instead, we should pad per batch, so each sample is only padded up to the longest sample in its batch.

To do this we will take a look at the **collate_fn** option in the DataLoader. By default, the **collate_fn** expects everything to be of the same size and will stack them together, but that won't work here. We will instead write our own function that first pads all the samples to the longest sample in a batch and then stacks them!

To do this, we will use the pad_sequence function that PyTorch provides in its RNN utilities. The [documentation](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html) explains how it works, and we test it below on three tensors of different sizes: it pads the sequences along the sequence dimension up to the length of the longest one.
    
Also, when it comes to PyTorch we can have our data tensors in two formats:
    
```
Sequence First [sequence length x batch x ...]
Batch First [batch x sequence length x ...]
```
I personally prefer doing batch first as it makes more sense to me, so we will just have to let everything in PyTorch know we are using batch first (for pad_sequence, that means passing `batch_first=True`).

In [151]:
### Test Pad Sequence ###
a = torch.ones(10)
b = torch.ones(8)
c = torch.ones(2)
padded = nn.utils.rnn.pad_sequence([a, b, c], batch_first=True, padding_value=999) 

print(padded)
print(padded.shape)

tensor([[  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.,   1.],
        [  1.,   1.,   1.,   1.,   1.,   1.,   1.,   1., 999., 999.],
        [  1.,   1., 999., 999., 999., 999., 999., 999., 999., 999.]])
torch.Size([3, 10])


Notice that the sequences shorter than the longest one have been padded with the value 999. We will use this ability in our collate function now!
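
To see the two layouts side by side, the sketch below pads the same three sequences with and without `batch_first=True`. It uses the default padding value of 0 rather than 999, and the variable names are arbitrary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.ones(10), torch.ones(8), torch.ones(2)]

seq_first = pad_sequence(sequences)                      # default: batch_first=False
batch_first = pad_sequence(sequences, batch_first=True)

print(seq_first.shape)    # [sequence length, batch]
print(batch_first.shape)  # [batch, sequence length]
```

The two results hold the same values and differ only by a transpose of the first two dimensions.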

The input to the collate function will be the batch of samples grabbed from the Dataset as a list of tuples like this:

```
[(text1, label1), (text2, label2), ...]
```

We just have to iterate through these tuples, stack our labels together, and then pad/stack our text together! Once we have the Collator we can define the DataLoader as usual with this extra function.

In [152]:
def data_collator(batch):
    texts, labels = [], []
    
    for text, label in batch:
        labels.append(label)
        texts.append(text)
        
    labels = torch.tensor(labels)
    
    ### Pad the list of sequences and then convert to tensor like example above but with our padding token <PAD> ###
    texts = nn.utils.rnn.pad_sequence(texts, batch_first=True, padding_value=char2idx["<PAD>"])
    return texts, labels    

### Run Without Data Collator

In [153]:
imdbloader = DataLoader(imdbdataset, batch_size=16, shuffle=True)

for text, label in imdbloader:
    print(text.shape)
    break

RuntimeError: stack expects each tensor to be equal size, but got [1623] at entry 0 and [2624] at entry 1

### Run Dataloader With Collator!

In [154]:
imdbloader = DataLoader(imdbdataset, batch_size=16, shuffle=True, collate_fn=data_collator)

for text, label in imdbloader:
    print(text.shape)
    break
    

torch.Size([16, 2934])
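
The collator can also combine both ideas discussed earlier: cut very long samples to a maximum length, then pad only up to the longest sample left in the batch. Below is a minimal sketch on a hand-made batch. `truncating_collator` and `MAX_LEN` are hypothetical names, `MAX_LEN` is kept tiny so the effect is visible, and `PAD_ID = 79` assumes the `<PAD>` index from the `char2idx` dictionary printed earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 79   # assumption: index of <PAD> in the char2idx printed earlier
MAX_LEN = 6   # hypothetical cap, kept tiny so the effect is visible


def truncating_collator(batch):
    # Cut long samples to MAX_LEN, then pad only up to the longest remaining sample
    texts = [text[:MAX_LEN] for text, _ in batch]
    labels = torch.tensor([label for _, label in batch])
    padded = pad_sequence(texts, batch_first=True, padding_value=PAD_ID)
    return padded, labels


# Hand-made batch in the same (text, label) format the Dataset returns
toy_batch = [
    (torch.tensor([5, 6, 7]), 1),
    (torch.tensor([1, 2, 3, 4, 5, 6, 7, 8, 9]), 0),
    (torch.tensor([3]), 1),
]
padded, labels = truncating_collator(toy_batch)
print(padded)
print(labels)
```

Passing such a function as `collate_fn` works the same way as with `data_collator` above.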


## It Works!! 
That is basically it! We built DataLoaders for images and text, and as you can see they are very similar, with just a few differences. We will be using versions of these DataLoaders for training later on, and there are a few more complexities that come with that, but this is essentially everything you need to build datasets for PyTorch.

## Increase Performance of DataLoaders
There are a couple of things we can also do with our DataLoaders that can make them faster. If we have a lot of preprocessing on our data, this can definitely help a bit!

- Num Workers: Data loading happens on the CPU, and the prepared tensors are then passed to the GPU. Setting `num_workers` above 0 runs the loading in that many parallel worker processes
- Pin Memory: With `pin_memory=True`, batches are placed in page-locked (pinned) CPU memory, which speeds up the copy from CPU to GPU
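
Here is a minimal sketch of how these two options are passed, using a toy `TensorDataset` in place of the datasets above. `num_workers` is left at 0 so the sketch runs anywhere; the right value for real data depends on your CPU and preprocessing cost, so treat any number as something to experiment with.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the image or text datasets above
toy_dataset = TensorDataset(torch.rand(64, 3, 8, 8), torch.randint(0, 2, (64,)))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = DataLoader(
    toy_dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,                         # 0 loads in the main process; try 2 or more on real data
    pin_memory=torch.cuda.is_available(),  # pinned host memory only helps when a GPU is used
)

for images, labels in loader:
    # non_blocking=True lets the copy overlap with compute when the source is pinned
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    break

print(images.shape, images.device)
```

When running as a script with `num_workers` above 0 on platforms that start worker processes with spawn (Windows and macOS), the loading loop should sit under `if __name__ == "__main__":`.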