<h1 align="center">PyTorch Input Pipeline</h1>

Those familiar with the `scikit-learn` Python package will remember that most machine learning (ML) methods it provides follow the same usage pattern. First, we load the training data into numpy arrays that store the features and labels of all training data points. These arrays are then passed to the `.fit()` method, which implements the learning algorithm for a particular ML model. After the model has been trained, we apply it to new data points with the `.predict()` method to obtain predictions for their labels.

In contrast, deep learning often involves extremely large training datasets that cannot be stored entirely in a numpy array (we would run out of RAM on a desktop computer). Therefore, deep learning methods access the training data sequentially. This approach ties in nicely with the working principle of stochastic gradient descent (SGD). In particular, we divide the dataset into smaller parts called **batches** and only store a single batch in working memory (e.g. as numpy arrays). The number of samples in a single batch is called the **batch size**.

After loading a new batch of data, we update the neural network parameters (weights and biases) using one iteration of an SGD variant. We repeat this batch-wise optimization until we have read each data point of the dataset. A sequence of batches that together cover the entire dataset is called an **epoch**. Note that the batch size is a hyperparameter of the resulting deep learning method. Its value is often chosen by manual tuning ("trial and error") to minimize the validation loss.
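
The relation between batches, batch size and epochs can be sketched in plain Python. The helper `iterate_batches` and the toy dataset below are illustrative names, not part of PyTorch; the point is only that one epoch consists of `ceil(N / batch_size)` batches, the last of which may be smaller.

```python
import math

def iterate_batches(samples, batch_size):
    """Yield consecutive batches that together cover `samples` exactly once."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # toy dataset with 10 data points
batch_size = 4

# One epoch: every data point appears in exactly one batch
epoch = list(iterate_batches(samples, batch_size))
print(epoch)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# The number of batches per epoch is ceil(N / batch_size)
batches_per_epoch = math.ceil(len(samples) / batch_size)
print(len(epoch) == batches_per_epoch)  # True
```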

In previous rounds we focused on deep learning models and optimizers. While the model architecture and the optimization algorithm are determining factors for overall performance, one more factor can create a performance bottleneck: the input pipeline. Opening, reading and preprocessing data consume a significant amount of time and resources.

The need for efficient input pipelines coincides with the increased availability of GPU and TPU hardware accelerators. GPUs and TPUs are optimized for computations with vectors and matrices, but they do not handle data transformation and preprocessing well. Therefore, data preprocessing is often performed on the CPU, and a slow CPU-side pipeline leaves the accelerator waiting for data. Inefficient use of accelerators in turn increases the financial burden, since GPU and TPU time is expensive, not to mention the ecological footprint of training large deep neural networks.

In previous rounds, each training step consisted of the following:

- Opening a file if it hasn't been opened yet
- Fetching a data entry from the file
- Using the data for training

In this scenario, the process is sequential: the model sits idle while data is read, and data fetching stalls while the model trains.

**PyTorch DataLoader** (`torch.utils.data.DataLoader`) is a utility for building and executing efficient input pipelines for deep learning. The idea of the PyTorch input pipeline is to decouple data delivery from data consumption, which reduces the time spent idle. This is achieved by prefetching: while the model trains on the current samples, the input pipeline prepares the samples for the next training step using multiple worker processes (enabled with `num_workers > 0`). As a result, preprocessing overlaps with the model's training computations.
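
The effect of prefetching can be illustrated without PyTorch. The sketch below is a simplified model of the idea, not how `DataLoader` is implemented: it uses a background thread and a bounded queue, whereas `DataLoader` uses worker processes. The functions `load_batch` and `train_step` are hypothetical stand-ins that simply sleep to imitate work.

```python
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.05)  # stands in for reading and preprocessing a batch
    return [i] * 4

def train_step(batch):
    time.sleep(0.05)  # stands in for a forward/backward pass
    return sum(batch)

def run_sequential(num_batches):
    # Load, then train, then load again: the two steps never overlap
    return [train_step(load_batch(i)) for i in range(num_batches)]

def run_prefetched(num_batches, buffer_size=2):
    buffer = queue.Queue(maxsize=buffer_size)  # holds prefetched batches

    def producer():
        for i in range(num_batches):
            buffer.put(load_batch(i))  # blocks while the buffer is full
        buffer.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := buffer.get()) is not None:
        results.append(train_step(batch))  # next batch loads meanwhile
    return results

start = time.perf_counter()
sequential_results = run_sequential(5)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
prefetched_results = run_prefetched(5)
prefetched_time = time.perf_counter() - start

print(sequential_results == prefetched_results)  # same results either way
print(f"sequential: {sequential_time:.2f}s, prefetched: {prefetched_time:.2f}s")
```

Both variants produce the same results; the prefetched variant typically finishes sooner because loading batch `i + 1` overlaps with training on batch `i`. The exact timings depend on the machine.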

**PyTorch DataLoader Features:**

PyTorch DataLoader provides several key advantages for efficient data processing:

- **Automatic Batching**: Seamlessly combines individual data samples into batches
- **Shuffling**: Randomizes data order to improve training stability
- **Parallel Data Loading**: Uses multiple worker processes to load data in parallel
- **Memory Pinning**: Places batches in page-locked host memory to speed up transfers to the GPU
- **Custom Sampling**: Supports various sampling strategies
- **Transform Pipeline**: Works with preprocessing and augmentation applied inside the `Dataset`
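
A few of these features can be seen together in a short sketch. The toy tensors and the index list `label_zero_indices` are assumptions made for illustration; `TensorDataset` and `SubsetRandomSampler` are standard classes from `torch.utils.data`. Note that `sampler` and `shuffle=True` are mutually exclusive, and `num_workers=0` is used here only so the snippet runs in any environment.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset: 6 samples with 2 features each; labels alternate 0, 1
features = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.tensor([0, 1, 0, 1, 0, 1])
dataset = TensorDataset(features, labels)

# Custom sampling: draw only the samples with label 0, in random order
label_zero_indices = [0, 2, 4]

loader = DataLoader(
    dataset,
    batch_size=2,                                    # automatic batching
    sampler=SubsetRandomSampler(label_zero_indices),
    num_workers=0,                                   # load in the main process
    pin_memory=torch.cuda.is_available(),            # only useful with a GPU
)

for x, y in loader:
    # Samples are stacked into batch tensors; the last batch holds one sample
    print(x.shape, y)
```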

## Goals

- Build input pipelines with PyTorch
- Understand the basic idea of the PyTorch `Dataset` and `DataLoader` APIs
- Build custom PyTorch `Dataset` classes and apply various transformations
- Optimize data loading performance with multiprocessing

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt

# Create a tensor from a list; its elements along the first dimension act as samples,
# similar in spirit to tf.data.Dataset.from_tensor_slices
tensor_data = torch.tensor([1, 2, 3, 4, 5, 6])
print("Tensor:", tensor_data)

# PyTorch tensors can be iterated directly
for value in tensor_data:
    print(f"Value: {value}, Numpy: {value.numpy()}")

In [None]:
# Create a custom Dataset class
class NumberDataset(Dataset):
    def __init__(self, data, transform=None):
        self.data = data
        self.transform = transform
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        sample = self.data[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample

# Define a preprocessing function
def preprocess(x):
    return x * x

# Create dataset with preprocessing
data = torch.arange(10).repeat(100)  # Values 0..9 repeated 100 times (1000 samples)
dataset = NumberDataset(data, transform=preprocess)

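# Create DataLoader for batching, shuffling, and parallel loading.
# num_workers=2 starts two worker processes; if worker startup fails in a notebook
# (this may happen on Windows or macOS), fall back to num_workers=0.
dataloader = DataLoader(dataset, batch_size=5, shuffle=True, num_workers=2)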

print("Dataset length:", len(dataset))
print("Number of batches:", len(dataloader))

# Demonstrate batch processing
for i, batch in enumerate(dataloader):
    if i >= 3:  # Show only first 3 batches
        break
    print(f"Batch {i+1}:", batch.numpy())

## What is a Pipeline and Why is it Needed for PyTorch?

A **pipeline** in machine learning and deep learning is the sequence of data processing steps that transforms raw data into a format suitable for training neural networks. The following sections list the components of a PyTorch data pipeline and the reasons it matters.

### Key Components of PyTorch Data Pipeline:

1. **Dataset Class**: Abstract base class for representing a dataset
2. **DataLoader**: Provides data batching, shuffling, and parallel loading
3. **Transforms**: Preprocessing and augmentation operations
4. **Samplers**: Control how data samples are drawn from the dataset

### Why PyTorch Pipeline is Essential:

#### 1. **Memory Efficiency**
- **Problem**: Large datasets cannot fit entirely in memory
- **Solution**: Load and process data in small batches on-demand
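
On-demand loading can be sketched with a `Dataset` that stores only file paths and reads a file when a sample is requested. The class `LazyNpyDataset` and the temporary `.npy` files are hypothetical, created here only so the snippet is self-contained.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class LazyNpyDataset(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # The file is read only when this particular sample is requested
        return torch.from_numpy(np.load(self.paths[idx]))

# Write 8 small sample files to a temporary directory for the demo
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(8):
    path = tmp_dir / f"sample_{i}.npy"
    np.save(path, np.full(3, i, dtype=np.float32))
    paths.append(path)

lazy_dataset = LazyNpyDataset(paths)
lazy_loader = DataLoader(lazy_dataset, batch_size=4)

# Only the four files of the first batch have been read at this point
first_batch = next(iter(lazy_loader))
print(first_batch.shape)  # torch.Size([4, 3])
```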

#### 2. **GPU Utilization**
- **Problem**: GPUs are fast but data loading/preprocessing is slow
- **Solution**: Parallel data loading with multiple workers while GPU trains
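
A minimal sketch of the settings involved, assuming random toy data: `pin_memory` and `non_blocking=True` allow host-to-GPU copies to overlap with computation, and `num_workers` controls how many worker processes load data. `num_workers=0` is used here so the snippet runs anywhere; on a real workload a larger value enables parallel loading.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_cuda = device.type == "cuda"

# Toy data: 64 samples with 8 features and a binary label
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=0,        # raise this to load batches in worker processes
    pin_memory=use_cuda,  # page-locked host memory enables asynchronous copies
)

batch_devices = []
for x, y in loader:
    # non_blocking=True lets the copy overlap with computation when the
    # source tensor is in pinned memory; on CPU it has no effect
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    batch_devices.append(x.device.type)

print(batch_devices)
```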

#### 3. **Data Preprocessing**
- **Problem**: Raw data needs normalization, augmentation, resizing
- **Solution**: Composable transforms that apply operations efficiently
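
The idea of composable transforms can be sketched in a few lines. `ComposeTransforms` below is a minimal stand-in written for this illustration; `torchvision.transforms.Compose` provides the same chaining behaviour for real pipelines. The functions `scale_to_unit` and `normalize` and their constants are assumptions.

```python
import torch

class ComposeTransforms:
    """Minimal stand-in that applies a list of callables in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample

def scale_to_unit(x):
    # Map pixel-like values from [0, 255] to [0, 1]
    return x / 255.0

def normalize(x, mean=0.5, std=0.5):
    # Shift and scale values from [0, 1] to [-1, 1]
    return (x - mean) / std

pipeline = ComposeTransforms([scale_to_unit, normalize])

raw = torch.tensor([0.0, 127.5, 255.0])
processed = pipeline(raw)
print(processed)  # values -1, 0 and 1
```

Such a pipeline object can be passed as the `transform` argument of a custom `Dataset` and applied inside `__getitem__`.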

#### 4. **Batch Processing**
- **Problem**: Neural networks work best with mini-batches
- **Solution**: Automatic batching with customizable batch sizes
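
Automatic batching stacks samples of equal shape. When samples differ in length, a custom `collate_fn` can define how a batch is assembled. The sketch below pads variable-length sequences with `torch.nn.utils.rnn.pad_sequence`; the function `pad_collate` and the toy sequences are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: a plain list of sequences with different lengths
sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4]),
    torch.tensor([5, 6]),
    torch.tensor([7, 8, 9, 10]),
]

def pad_collate(batch):
    """Pad all sequences in the batch to the length of the longest one."""
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

seq_loader = DataLoader(sequences, batch_size=2, collate_fn=pad_collate)
seq_batches = list(seq_loader)

for padded, lengths in seq_batches:
    print(padded.tolist(), lengths.tolist())
```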

#### 5. **Reproducibility**
- **Problem**: Need consistent data ordering and randomization
- **Solution**: Controlled shuffling and sampling with random seeds
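
One way to control shuffling is to pass a seeded `torch.Generator` to the `DataLoader`. The helper `epoch_order` below is a hypothetical name; the sketch shows that two loaders seeded identically visit the data in the same order.

```python
import torch
from torch.utils.data import DataLoader

def epoch_order(seed):
    """Return the batches of one shuffled epoch for a given seed."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    loader = DataLoader(
        list(range(10)),      # toy dataset: the integers 0..9
        batch_size=5,
        shuffle=True,
        generator=generator,  # source of randomness for shuffling
    )
    return [batch.tolist() for batch in loader]

order_a = epoch_order(seed=0)
order_b = epoch_order(seed=0)
print(order_a == order_b)  # True: same seed, same order
```

With `num_workers > 0`, any randomness inside the workers (for example in augmentations that use `random` or `numpy`) may need to be seeded separately, for instance through the `worker_init_fn` argument.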

### PyTorch Pipeline vs Traditional Approach:

**Traditional Approach:**
```python
# Load all data at once (memory intensive).
# load_all_images, load_all_labels and preprocess are placeholder helpers.
all_images = load_all_images()
all_labels = load_all_labels()

# Process entire dataset (slow)
processed_images = preprocess(all_images)

# Manual batching
batch_size = 32
for i in range(0, len(processed_images), batch_size):
    batch = processed_images[i:i+batch_size]
    batch_labels = all_labels[i:i+batch_size]
    # ... training step
```

**PyTorch Pipeline Approach:**
```python
# Custom Dataset (lazy loading).
# CustomDataset, data_paths and my_transforms are placeholders.
dataset = CustomDataset(data_paths, transforms=my_transforms)

# DataLoader (efficient batching + parallel loading)
dataloader = DataLoader(dataset, batch_size=32,
                        shuffle=True, num_workers=4)

# Clean training loop
for batch in dataloader:
    ...  # training step
```
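
For completeness, here is a runnable sketch of the pipeline approach on synthetic data. The regression task, the linear model and all hyperparameters are assumptions chosen to keep the example small; they are not taken from the earlier rounds.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data: y = 2x + 1 plus a little noise
x = torch.linspace(-1, 1, 128).unsqueeze(1)
y = 2 * x + 1 + 0.01 * torch.randn_like(x)

train_loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

epoch_losses = []
for epoch in range(20):
    total = 0.0
    for xb, yb in train_loader:  # one SGD step per batch
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        total += loss.item() * len(xb)
    # Average loss over all samples seen in this epoch
    epoch_losses.append(total / len(train_loader.dataset))

print(f"first epoch loss: {epoch_losses[0]:.4f}, last: {epoch_losses[-1]:.4f}")
```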

### Performance Benefits:

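1. **Parallel Processing**: Multiple CPU worker processes load data while the GPU trains
2. **Prefetching**: The next batches are prepared while the current batch is processed
3. **Memory Management**: Only the current batch and a few prefetched batches are held in memory, not the whole dataset
4. **Automatic Optimization**: Built-in defaults for common patterns, such as collating samples into batch tensors and optional pinned memory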