![banner](../src/visuals/banner.png)

# PyTorch DataLoaders
In our [Intro to PyTorch](https://github.com/priyammaz/HAL-DL-From-Scratch/blob/main/Intro%20to%20PyTorch/Intro%20to%20PyTorch.ipynb) notebook we got a basic view of the general PyTorch mechanics. What we didn't discuss in much detail is how to actually pass data to the model efficiently. Let's quickly pull up what we did before for the MNIST dataset.


The MNIST dataset ships with torchvision as one of its built-in datasets. Our goal for this notebook is to learn how to build this kind of **Dataset** class from scratch so we can write one for our own custom data!
```
train = torchvision.datasets.MNIST('../data', train=True, download=True,
                      transform=transforms.Compose([ ### CONVERT ARRAY TO TENSOR
                          transforms.ToTensor()
                       ]))

test = torchvision.datasets.MNIST('../data', train=False, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor()
                       ]))
```

The datasets above can then be passed to the **DataLoader**, which groups samples into minibatches (drawn in random order when `shuffle=True`). We will also explore the DataLoader itself and see what other functionality it has!

```
trainset = torch.utils.data.DataLoader(train, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
testset = torch.utils.data.DataLoader(test, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
```
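
To make the batching concrete, here is a minimal sketch that pulls one minibatch from a DataLoader. It wraps random tensors shaped like MNIST in a `TensorDataset` as a stand-in, so nothing has to be downloaded. The batch size of 64, the 256 fake samples, and the names `fake_train` and `loader` are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

BATCH_SIZE = 64  # assumed value; the snippet above does not state it

# Fake data shaped like MNIST: 256 grayscale 28x28 images with labels 0-9
fake_images = torch.rand(256, 1, 28, 28)
fake_labels = torch.randint(0, 10, (256,))
fake_train = TensorDataset(fake_images, fake_labels)

# num_workers=0 keeps loading in the main process, which is simplest for a demo
loader = DataLoader(fake_train, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)

# Each iteration yields one minibatch of (images, labels)
images, labels = next(iter(loader))
print(images.shape)  # torch.Size([64, 1, 28, 28])
print(labels.shape)  # torch.Size([64])
print(len(loader))   # 4 batches, since 256 / 64 = 4
```

The DataLoader only ever calls `len()` and indexing on the dataset it is given, which is why a custom Dataset class can be swapped in without changing this loop.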

In [4]:
import torch
import torch.nn
import torchvision
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader

import os # Lets us list folders and build file paths
import numpy as np 
from PIL import Image # Allows us to Load Images

## Dogs Vs Cats Dataset

The first dataset we will build is for the Dogs vs Cats data. In case you haven't done this already, run the following to download the dataset to the */data* folder:

```
bash download_data.sh
```

### Important Ideas to Keep in Mind
- **Memory Constraints:** When we build a Dataset class we have to think about our hardware constraints. If we have a small CSV file, we can read it with Pandas and index the DataFrame to grab samples. The issue with this approach is that the entire DataFrame is **read into memory**, so for massive tabular datasets Pandas is not the way to go. Instead, you need a system that loads a sample into memory only when it is accessed. We will cover one such system, [Deep Lake](https://github.com/activeloopai/deeplake), in a future lesson.
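
As a rough illustration of "load only when accessed" for tabular data, the sketch below records the byte offset of each row once and then reads a single row from disk on demand. This is not how Deep Lake works; it is a hypothetical stdlib-only stand-in. The class name `LazyCSVRows` and the toy file are made up for the example, and it assumes no field contains an embedded newline.

```python
import csv
import os
import tempfile

class LazyCSVRows:
    """Hypothetical helper: reads one CSV row at a time via stored byte offsets."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # one small integer per row, instead of the rows themselves
        with open(path, "rb") as f:
            f.readline()  # skip the header line
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Only this one row is read from disk and parsed
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8")
        return next(csv.reader([line]))

# Build a tiny CSV file to try it on
csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["feature", "label"])
    for i in range(5):
        writer.writerow([i * 10, i % 2])

rows = LazyCSVRows(csv_path)
print(len(rows))  # 5
print(rows[3])    # ['30', '1']
```

The trade-off is one file open and seek per sample, in exchange for memory use that grows with the number of rows rather than the size of the data.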

In our Cats vs Dogs dataset, this is what our file directory looks like:

```
.
└── data/
    └── PetImages/
        ├── Dog/
        │   ├── xxx.jpg
        │   ├── yyy.jpg
        │   └── ...
        └── Cat/
            ├── xxx.jpg
            ├── yyy.jpg
            └── ...
```

Inside the PetImages folder there are two more folders, Dog and Cat, which contain the dog and cat images respectively. So we have two options:
- Load all our images into NumPy arrays (and later tensors) up front, along with their class label 0 or 1. **Please don't do this!** Every image would have to sit in memory at once.
- Store only a list of strings giving the path to each image, and load an image only when it is accessed.
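
A back-of-the-envelope calculation shows why the first option hurts. The sketch below uses the `[3, 375, 500]` shape of the sample printed at the end of this section and pretends every image were that size. The count of 25,000 images is a hypothetical figure chosen only to make the arithmetic concrete.

```python
import sys

# (channels, height, width) of one sample image as a float32 tensor
channels, height, width = 3, 375, 500
bytes_per_float32 = 4

bytes_per_image = channels * height * width * bytes_per_float32
print(bytes_per_image / 1e6)  # 2.25 (megabytes for a single image)

num_images = 25_000  # hypothetical dataset size, not taken from the notebook
total_gb = bytes_per_image * num_images / 1e9
print(total_gb)  # 56.25 (gigabytes if everything were loaded up front)

# By contrast, a path string costs on the order of a hundred bytes
path_bytes = sys.getsizeof("../data/PetImages/Dog/xxx.jpg")
print(path_bytes)
```

Under these assumptions, keeping paths instead of pixels shrinks what lives in memory by several orders of magnitude.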


### Components of a Dataset

The dataset class will have three components:
- **init** (`__init__`): Initialize the dataset class with everything you want to store as instance attributes.
- **len** (`__len__`): Tell the Dataset how many samples there are. Valid indices then run from 0 to len - 1.
- **getitem** (`__getitem__`): This is the block that does most of the heavy lifting. Given an index between 0 and len - 1, it loads and returns the corresponding sample.
```
class DogsVsCats(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        pass

    def __getitem__(self, idx):
        pass
```
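
Before filling in the skeleton for images, here is a minimal sketch of the same three methods on a toy problem. `SquaresDataset` is a hypothetical class invented for this example: sample `idx` is simply the pair `(idx, idx ** 2)`.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample idx is the pair (idx, idx ** 2)."""

    def __init__(self, num_samples):
        # Store whatever the other two methods will need
        self.num_samples = num_samples

    def __len__(self):
        # Number of samples; valid indices are 0 .. num_samples - 1
        return self.num_samples

    def __getitem__(self, idx):
        # Reject out-of-range indices so plain iteration knows when to stop
        if idx < 0 or idx >= self.num_samples:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2

squares = SquaresDataset(5)
print(len(squares))        # 5
x, y = squares[3]
print(x.item(), y.item())  # 3.0 9.0
```

Nothing is computed until `__getitem__` is called, which is the same lazy pattern the image dataset relies on.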

In [29]:
class DogsVsCats(Dataset):
    def __init__(self, path_to_folder):
        path_to_cats = os.path.join(path_to_folder, "Cat") # Get Path to Cat folder
        path_to_dogs = os.path.join(path_to_folder, "Dog") # Get Path to Dog folder 
        
        dog_files = os.listdir(path_to_dogs) # Get list of all files inside of dog folder
        cat_files = os.listdir(path_to_cats) # Get list of all files inside cat folder
        
        path_to_dog_files = [os.path.join(path_to_dogs, file) for file in dog_files] # Get full path to each dog file
        path_to_cat_files = [os.path.join(path_to_cats, file) for file in cat_files] # Get full path to each cat file
        
        self.training_files = path_to_dog_files + path_to_cat_files # Concatenate our list of paths to dog and cat files
        self.dog_label, self.cat_label = 0, 1 # Store 0/1 Labels for Dogs and Cats
        
        self.transform = transforms.ToTensor() # Convert a PIL Image or ndarray to tensor and scale the values from [0,255] for 8Bit image to [0,1]
        
        
    def __len__(self):
        return len(self.training_files) # The number of samples we have is just the number of training files we have

    def __getitem__(self, idx):
        """
        Given some index from range [0, len(self.training_files) - 1] we want to load the sample
        """
        path_to_image = self.training_files[idx]
        
        if "Dog" in path_to_image:
            label = self.dog_label
        
        else:
            label = self.cat_label
            
        image = Image.open(path_to_image) # Open Image with PIL to create a PIL Image
        image = self.transform(image) # Convert image
        
        return image, label
    
catvdog = DogsVsCats("../data/PetImages/")

for image, label in catvdog:
    print("Image Label:",label)
    print("Image Shape:", image.shape)
    break

Image Label: 0
Image Shape: torch.Size([3, 375, 500])
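
One fragile spot in the class above is the substring check `"Dog" in path_to_image`: it would mislabel every cat if the root folder itself had "Dog" in its name. A possible alternative, sketched below, reads the label from the image's immediate parent folder. The helper `label_from_path` and the `LABELS` mapping are hypothetical names, not part of the original class.

```python
import os

LABELS = {"Dog": 0, "Cat": 1}  # same 0/1 convention as the class above

def label_from_path(path_to_image):
    """Hypothetical helper: look up the label from the immediate parent folder."""
    parent_folder = os.path.basename(os.path.dirname(path_to_image))
    return LABELS[parent_folder]

dog_path = os.path.join("..", "data", "PetImages", "Dog", "xxx.jpg")
print(label_from_path(dog_path))  # 0

# A root folder with "Dog" in its name no longer flips the cat label
cat_path = os.path.join("DogsVsCats", "PetImages", "Cat", "yyy.jpg")
print(label_from_path(cat_path))  # 1
```

An unexpected folder name raises a `KeyError` here instead of silently falling through to the cat label.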
