### **Tutorial 06: Data Loader**

In machine learning, and especially in deep learning, datasets are often too large to load into memory all at once. **PyTorch's DataLoader** wraps a dataset and serves it in batches, with optional shuffling and parallel loading, so the training loop is not left waiting for data.

---

#### **Why We Need DataLoader**

- **Efficient Batch Loading**: 
  - Loads data in smaller, manageable batches instead of the entire dataset.
  - Reduces memory usage and speeds up computations.

- **Shuffling**:
  - Randomizes the order of the data at the start of each epoch.
  - Keeps the model from picking up patterns that come only from the order in which samples are stored, which generally helps generalization.

- **Parallel Data Loading**:
  - Supports loading with multiple worker processes via the `num_workers` parameter (the default, `0`, loads data in the main process).
  - Speeds up data preparation and minimizes bottlenecks.

- **Custom Dataset Handling**:
  - Works with **custom Dataset objects** to handle various data types (e.g., images, time series).
  - Custom logic for transforming and preprocessing data is easily integrated.

- **Handling Different Data Types**:
  - Exposes the same iteration interface for images, sequences, and tabular data.
  - Lets the training loop stay the same when the underlying data type changes.
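
The shuffling behavior described above can be checked with a small sketch. The toy dataset of ten integers and the seed value are assumptions made for illustration; they are not part of the tutorial's own example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (assumed for illustration): ten samples whose value equals their index.
dataset = TensorDataset(torch.arange(10))

# A seeded generator makes the shuffled order reproducible between runs.
generator = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=generator)

epoch_orders = []
for epoch in range(2):
    # Each full pass over the loader is one epoch and draws a fresh permutation.
    order = torch.cat([batch[0] for batch in loader]).tolist()
    epoch_orders.append(order)
    print(f"Epoch {epoch}: {order}")

# Whatever the order, every sample is visited exactly once per epoch.
for order in epoch_orders:
    assert sorted(order) == list(range(10))
```

Shuffling changes only the order in which samples are visited; each epoch still covers the whole dataset.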

---

#### **Key Features of DataLoader**

- **Batching**:
  - Loads data in fixed-size batches (`batch_size`).
  
- **Shuffling**:
  - Reshuffles the data at every epoch when `shuffle=True`, which generally improves model generalization.

- **Parallelization**:
  - Parallel loading of data using multiple CPU workers (`num_workers`).

- **Collate Function**:
  - Customizes how individual samples are combined into a batch (`collate_fn`).
  
- **Pin Memory**:
  - Places batches in page-locked (pinned) host memory when `pin_memory=True`, which speeds up transfer to a CUDA GPU.

- **Drop Last**:
  - Optionally drops the last batch if it is smaller than the specified batch size (`drop_last=True`).
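
As a minimal sketch of how `batch_size` and `drop_last` interact, the snippet below counts batches for 1000 samples with a batch size of 32. The use of `TensorDataset` and the variable names are assumptions chosen for brevity.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 samples with 5 features each, plus binary labels (random, for illustration).
features = torch.randn(1000, 5)
labels = torch.randint(0, 2, (1000,))
dataset = TensorDataset(features, labels)

with_remainder = DataLoader(dataset, batch_size=32, drop_last=False)
without_remainder = DataLoader(dataset, batch_size=32, drop_last=True)

print(len(with_remainder))     # ceil(1000 / 32) = 32 batches
print(len(without_remainder))  # floor(1000 / 32) = 31 batches

# The final batch of the first loader holds the leftover samples.
last_batch_size = [x.shape[0] for x, _ in with_remainder][-1]
print(last_batch_size)         # 1000 - 31 * 32 = 8 samples
```

Dropping the last batch can be useful when a layer or loss assumes a fixed batch size, at the cost of skipping a few samples each epoch.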


In [None]:
import torch
from torch.utils.data import Dataset, DataLoader


class ExampleDataset(Dataset):
    def __init__(self):
        # 1000 samples with 5 random features each
        self.data = torch.randn(1000, 5)
        # One random binary label (0 or 1) per sample
        self.labels = torch.randint(0, 2, (1000,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index], self.labels[index]


dataset = ExampleDataset()

dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
print(f"Number of batches: {len(dataloader)}")

for batch_data, batch_labels in dataloader:
    print(f"Data Batch Shape: {batch_data.shape}")
    print(f"Label Batch Shape: {batch_labels.shape}")
    break
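
The default collate function stacks samples, which works only when they all have the same shape. For variable-length data, a custom `collate_fn` can pad each batch. The sketch below is one possible approach; `VariableLengthDataset` and `pad_collate` are hypothetical names introduced here for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset of 1-D sequences with different lengths."""

    def __init__(self):
        self.sequences = [torch.ones(n) for n in (2, 5, 3, 4)]

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index]


def pad_collate(batch):
    # `batch` is a list of samples as returned by __getitem__.
    lengths = torch.tensor([len(seq) for seq in batch])
    # Pad every sequence with zeros up to the longest one in this batch.
    padded = pad_sequence(batch, batch_first=True, padding_value=0.0)
    return padded, lengths


loader = DataLoader(VariableLengthDataset(), batch_size=2, collate_fn=pad_collate)

batch_shapes = []
for padded, lengths in loader:
    batch_shapes.append(tuple(padded.shape))
    print(padded.shape, lengths.tolist())
```

Returning the original lengths alongside the padded tensor lets later code mask out the padding.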
    
