# Tutorial 3.2: Custom Datasets

Author: [Maren Gröne](mailto:maren.groene@s2016.tu-chemnitz.de)

This tutorial focuses on supervised learning with PyTorch, so it is crucial to understand how to implement a data pipeline for training and testing your models. PyTorch provides two fundamental components for this task:
- **Dataset**:
    - A Dataset is an abstract class that represents a map from indices to data samples.
    - Each sample can be any Python object: tensors, numbers, dictionaries, lists, or custom classes. In our case, samples are mostly images with shape (channels, height, width).
    - You can import it via:
        
```python
from torch.utils.data import Dataset
```

- **DataLoader**: 
    - Takes a Dataset and creates an iterable that returns batches
    - Can transform individual samples into batches using collate_fn
    - Supports multi-process data loading to parallelize CPU operations
    - Can transfer data asynchronously to the GPU with pinned memory
    - You can import it via:
    
```python
from torch.utils.data import DataLoader
```
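
To illustrate how a `DataLoader` turns individual samples into a batch, here is a minimal sketch with a custom `collate_fn`. The toy dataset `VariableLengthDataset` and the function `pad_collate` are hypothetical names made up for this illustration. The sketch assumes 1-D sequences of different lengths, which cannot be stacked directly and are therefore padded with zeros.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class VariableLengthDataset(Dataset):
    """Hypothetical toy dataset: sample i is a sequence of i + 1 ones, labelled i."""

    def __init__(self, n_samples):
        self.n_samples = n_samples

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        return torch.ones(idx + 1), idx


def pad_collate(samples):
    """Pad all sequences of a batch with zeros to the length of the longest one."""
    sequences, labels = zip(*samples)
    max_len = max(seq.shape[0] for seq in sequences)
    batch = torch.zeros(len(sequences), max_len)
    for row, seq in enumerate(sequences):
        batch[row, : seq.shape[0]] = seq
    return batch, torch.tensor(labels)


# No shuffling here, so the batches contain samples (0, 1) and (2, 3)
loader = DataLoader(VariableLengthDataset(4), batch_size=2, collate_fn=pad_collate)
batches = list(loader)
for sequences, labels in batches:
    print(sequences.shape, labels)
```

For fixed-size images, as used in the rest of this tutorial, the default collate function is usually sufficient because all samples already have the same shape.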

## Annotation Files

It is best practice to extract the relevant metadata from your dataset and save it in a separate file. These files are called annotation files. The most commonly used formats are CSV, JSON, XML, or plain text, but many more exist. In this example, we use CSV because it is easy to read.

This process depends heavily on your specific data structure and task, e.g. classification or detection.

We assume here that the task is image classification and that the data is structured as follows:
```bash
dataset/
    test/
        class1/
            img1.jpg
            img2.jpg
        class2/
            img3.jpg
            img4.jpg
    train/
        class1/
            ...
        class2/
            ...
    val/
        class1/
            ...
        class2/
            ...
```
If you are using your own data instead of a pre-existing dataset, randomly distribute the files of each class across the train, test and val folders using roughly an 80/10/10 split.
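
One possible way to produce such a split is sketched below. The function `split_filenames`, its parameters and the filenames are hypothetical and only serve as an illustration. The sketch works on a list of filenames for a single class and leaves copying the files into the folders to you.

```python
import random


def split_filenames(filenames, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the filenames of one class and split them into train/val/test lists."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)  # a fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Hypothetical filenames for a single class
files = [f"img{i}.jpg" for i in range(20)]
train, val, test = split_filenames(files)
print(len(train), len(val), len(test))
```

Splitting each class separately, as assumed here, keeps the class proportions similar across the three sets.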

We define a function to generate three annotation files, one each for the training, validation, and test sets. Each file contains two columns: the image filename and its corresponding label. Instead of using class names (strings), we use the index of a label in a list of all labels. This numerical format aligns with the output classes of the neural network, where each label corresponds to a specific output neuron.
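
The mapping between class names and numeric labels can be sketched as follows. The class names and the predicted index are made up for illustration. The only assumption is that the same list order is used when the annotation files are created and when predictions are interpreted.

```python
# Hypothetical class names; the order of the list determines the numeric labels
labels = ["cat", "dog", "horse"]

# Map each class name to its position in the list
label_to_index = {name: idx for idx, name in enumerate(labels)}

# The inverse lookup is plain list indexing, e.g. for the index of the
# output neuron with the highest score
predicted_index = 2
predicted_name = labels[predicted_index]

print(label_to_index["dog"], predicted_name)
```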

We use the `pandas` constructor `pd.DataFrame(list_train, columns=['', ''])` to arrange the entries as a table with two columns and save it as a CSV file with `csv_train.to_csv(os.path.join(path_train, 'annot_train.csv'), sep=',', index=False)`.

In [1]:
import os
import pandas as pd  # used to build the tables and write them as CSV

def make_annotation_file(path_train: str, path_test: str, path_val: str, labels: list[str]):
    """Create one annotation CSV per split with rows (relative image path, label index)."""
    list_train = []
    list_test = []
    list_val = []
    for idx, label in enumerate(labels):
        # os.path.join works whether or not the split paths end with a separator
        for file in os.listdir(os.path.join(path_train, label)):
            list_train.append((label + '/' + file, idx))

        for file in os.listdir(os.path.join(path_test, label)):
            list_test.append((label + '/' + file, idx))

        for file in os.listdir(os.path.join(path_val, label)):
            list_val.append((label + '/' + file, idx))

    # Build one two-column table per split
    csv_train = pd.DataFrame(list_train, columns=['', ''])
    csv_test = pd.DataFrame(list_test, columns=['', ''])
    csv_val = pd.DataFrame(list_val, columns=['', ''])

    # Write each annotation file into the folder of its split
    csv_train.to_csv(os.path.join(path_train, 'annot_train.csv'), sep=',', index=False)
    csv_test.to_csv(os.path.join(path_test, 'annot_test.csv'), sep=',', index=False)
    csv_val.to_csv(os.path.join(path_val, 'annot_val.csv'), sep=',', index=False)

## Datasets in PyTorch

The PyTorch `Dataset` controls how individual data samples are loaded and preprocessed.

To make this work, we need to define three essential magic methods:
1. `__init__`

    This method runs when you create an instance of the dataset. It sets up everything needed to load data later. Typically, it includes:
    - Reading the annotation file (e.g., a CSV) to get filenames and labels
    - Storing the path to the dataset directory
    - Setting up any image or label transformations (e.g., resizing, normalization)

2. `__len__`

    This method tells Python how many samples are in the dataset. It is what makes `len(dataset)` work.
3. `__getitem__`

    This method retrieves a single sample by index, so that `dataset[0]` returns an image and its label. It handles:
    - Loading the image file from disk
    - Fetching the correct label
    - Applying any transformations

```Python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, *args, **kwargs):
        # Initialization logic (e.g., load file paths, labels)
        pass

    def __len__(self):
        # Return the total number of samples; enables len(dataset)
        raise NotImplementedError

    def __getitem__(self, idx):
        # Return one sample (image, label) at index `idx`; enables dataset[idx]
        raise NotImplementedError
```
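
To see the three methods in action without any files on disk, consider the following minimal sketch. `SquaresDataset` is a hypothetical toy class that keeps its data in memory. It is not part of PyTorch and only illustrates which Python expression triggers which method.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2) as tensors."""

    def __init__(self, n_samples):
        # Everything needed later is prepared once at construction time
        self.inputs = torch.arange(n_samples, dtype=torch.float32)
        self.targets = self.inputs ** 2

    def __len__(self):
        # Called by len(dataset)
        return len(self.inputs)

    def __getitem__(self, idx):
        # Called by dataset[idx]
        return self.inputs[idx], self.targets[idx]


dataset = SquaresDataset(5)
x, y = dataset[3]
print(len(dataset), x.item(), y.item())
```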


The following code section shows an example of a custom dataset.

In [None]:
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        # Read the annotation file: column 0 holds the relative image paths, column 1 the label indices
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        # Return total number of samples: len(dataset)
        return len(self.img_labels)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        # Load a PIL image, the input type that transforms.ToTensor() expects
        image = Image.open(img_path)
        # Convert the NumPy integer from pandas into a plain Python int
        label = int(self.img_labels.iloc[idx, 1])

        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)

        return image, label

## DataLoaders

While a `Dataset` gives access to one sample at a time, the `DataLoader` wraps it to group samples into mini-batches and, with `shuffle=True`, to reshuffle them at the start of every training epoch. It can also speed up data loading by using several worker processes. We already used it for exactly this purpose in the previous tutorials.

The `DataLoader` constructor takes a `Dataset` as its first argument. You also need to define the batch size, and `shuffle=True` activates the randomization. We also recommend setting `num_workers`, which spreads the loading work across several worker processes.

```Python
trainloader = torch.utils.data.DataLoader(training_dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
```

For more details, look into the [PyTorch documentation on DataLoader](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).
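
The effect of `batch_size` and `shuffle` can be checked with a small sketch. The random data below is purely hypothetical, and `TensorDataset` stands in for a custom dataset. `num_workers=0` is chosen only to keep the sketch simple, because all data is then loaded in the main process.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 3 features each, labelled 0..9
features = torch.randn(10, 3)
targets = torch.arange(10)
dataset = TensorDataset(features, targets)

loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

batch_sizes = []
seen_targets = []
for batch_features, batch_targets in loader:
    batch_sizes.append(batch_features.shape[0])
    seen_targets.extend(batch_targets.tolist())

print(batch_sizes)           # the last batch is smaller: 10 = 4 + 4 + 2
print(sorted(seen_targets))  # every sample appears exactly once per epoch
```

If the smaller last batch is a problem for your model, `drop_last=True` discards it.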

## Transformations

Transformations are necessary to prepare the data for the neural network model, e.g. resizing, normalization and tensor conversion. They also handle data augmentation.

The following code block is an example of image transformations.

In [None]:
from torchvision import transforms

transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])
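
To make the normalization step concrete: `transforms.ToTensor()` scales pixel values to the range [0, 1], and `transforms.Normalize` then computes `(input - mean) / std` per channel. With `mean=0.5` and `std=0.5`, this maps [0, 1] to [-1, 1]. The sketch below reproduces the arithmetic by hand on a hypothetical 2x2 single-channel image. It is an illustration of the formula, not the library's implementation.

```python
import torch

# Hypothetical single-channel "image" with values already scaled to [0, 1]
image = torch.tensor([[[0.0, 0.25], [0.5, 1.0]]])  # shape (1, 2, 2)

# One mean and one std value per channel, reshaped so they broadcast over height and width
mean = torch.tensor([0.5]).view(-1, 1, 1)
std = torch.tensor([0.5]).view(-1, 1, 1)

# Per-channel normalization: output = (input - mean) / std
normalized = (image - mean) / std
print(normalized)
```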

## Common Dataset Workflow: Putting Everything Together

1. Organize your data (with annotation files or directory structure).
2. Create a custom Dataset class to load and return (input, label) pairs.
3. Use DataLoader to batch and shuffle samples during training.

The following is an example of this workflow. Set the variables at the top according to your own data structure.

```Python
import os

import torch
from torchvision import transforms

train_data_path = '...'
test_data_path = '...'
val_data_path = '...'
classes = ['...',...]
batch_size = ...
num_workers = ...
transform = transforms.Compose([
    transforms.Resize((128, 128)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])

make_annotation_file(train_data_path, test_data_path, val_data_path, classes)

training_data = CustomDataset(os.path.join(train_data_path, 'annot_train.csv'), train_data_path, transform=transform)
trainloader = torch.utils.data.DataLoader(training_data, batch_size=batch_size, shuffle=True, num_workers=num_workers)

# The test set is only evaluated, so its order does not need to be shuffled
test_data = CustomDataset(os.path.join(test_data_path, 'annot_test.csv'), test_data_path, transform=transform)
testloader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=False, num_workers=num_workers)
```