#### `Dataset` and `DataLoader`: The Why?
Loading entire datasets into memory at once can cause problems like memory overflow and slow processing. Dataset and DataLoader classes solve these issues by:
1. Loading data in smaller batches to prevent memory overflow

2. Using parallel processing to speed up computations.


These tools help us work with large datasets more efficiently.

In [15]:
# Importing necessaey libraries
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler

import torch
from torch.utils.data import Dataset, DataLoader

In [16]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate random data for 100 students
n_students = 1000

# Generate features
cgpa = np.random.uniform(5.0, 10.0, n_students)  # CGPA between 5.0 and 10.0
iq = np.random.normal(100, 15, n_students)  # IQ with mean 100 and std 15
marks_12th = np.random.uniform(60, 100, n_students)  # 12th marks between 60 and 100
marks_10th = np.random.uniform(60, 100, n_students)  # 10th marks between 60 and 100

# Generate placement status (binary)
# Higher probability of placement for students with better scores
placement_prob = (cgpa/10 * 0.3 + iq/150 * 0.3 + marks_12th/100 * 0.2 + marks_10th/100 * 0.2)
placed = np.random.binomial(1, placement_prob)

# Create DataFrame
student_data = pd.DataFrame({
    'CGPA': np.round(cgpa, 2),
    'IQ': np.round(iq),
    'Marks_12th': np.round(marks_12th, 2),
    'Marks_10th': np.round(marks_10th, 2),
    'Placed': placed
})

print("Sample of generated student data:")
student_data.head()

Sample of generated student data:


Unnamed: 0,CGPA,IQ,Marks_12th,Marks_10th,Placed
0,6.87,103.0,98.86,75.51,0
1,9.75,80.0,73.25,92.14,1
2,8.66,106.0,79.28,96.07,1
3,7.99,109.0,67.84,68.14,0
4,5.78,108.0,84.43,62.68,1


In [17]:
# Shape of the dataset
student_data.shape

(1000, 5)

In [18]:
# Splitting the data
X = student_data.drop(columns=['Placed']),
y = student_data['Placed']

#### The Golden Rule: "Fit Once, Transform Anytime"

Preprocessing generally falls into two categories:

##### 1. Global Preprocessing (Requires the whole dataset)
**Examples:** Scaling (StandardScaler, MinMax), Vocabulary Building (NLP), Label Encoding, Imputing Missing Values (Mean/Median).

*   **Where:** In `__init__` (or before passing data to the Dataset).
*   **Why:** These techniques need to see *all* the data to calculate statistics (like Mean or Max). You cannot calculate the "mean of the dataset" by looking at just one row in `__getitem__`.
*   **Best Practice:** Calculate these statistics *once* and apply them *once* (if data fits in memory).

##### 2. On-the-Fly Preprocessing (Per-item independent)
**Examples:** Data Augmentation (Random flips/rotations), Resizing images, Convert to Tensor, Normalization (using *known* mean/std).

*   **Where:** In `__getitem__`.
*   **Why:** These are specific to the individual item.
    *   **Augmentation:** You want a *different* random crop every time you load the image.
    *   **Memory efficiency:** If you have 100GB of images, you can't load them all in `__init__`. You load and resize one specific image in `__getitem__` only when needed.

---

#### Summary of Best Practices

| Type of Data | Technique | Where to put it? |
| :--- | :--- | :--- |
| **Tabular** | Scaling (Standard/MinMax) | `__init__` (Calculate once, apply once) |
| **Tabular** | Encoding / Missing Values | `__init__` |
| **Images** | **Data Augmentation** | `__getitem__` (Needs to be random per epoch) |
| **Images** | Resizing / Normalization | `__getitem__` (If loading images from disk to save RAM) |
| **Text** | Vocabulary Building | `__init__` (Needs full corpus) |
| **Text** | Tokenization / Padding | Can be both, but often `__getitem__` or `collate_fn` |

In [19]:
# Building a Dataset Class
class CustomDataset(Dataset):
    def __init__(self, features: np.array, labels: np.array) -> None:
        """ In this method, you write an operation which loads the data from any local/cloud storage """

        # Global Preprocessing Steps
        scaler = StandardScaler()
        self.features = scaler.fit_transform(X = features) # Applies feature by feature 
        
        # The data we are getting is directly provided by the user through features and labels
        self.features = torch.tensor(features, device='cpu', requires_grad=False)
        self.labels = torch.tensor(labels, device='cpu', requires_grad=False)

    def __len__(self) -> torch.int32: # len(df) -> 1000 rows
        """ Return the size of the dataset """
        return len(self.features)

    def __getitem__(self, index): # df[index] -> recode[index]
        """ Here you can mention the return data you want, however the way you have mentioned here simillarly you have to access it while loading the data for training """

        # Local Preprocessing Steps —→ Augumentation, Image resizing
        return self.features[index], self.labels[index]

In [20]:
# Creating an object of dataset class
dataset = CustomDataset(
    features=np.array(X[0]),
    labels=np.array(y)
)

In [21]:
# Finding the length of the dataset
len(dataset)

1000

In [22]:
# Accessing a row
index = int(input("Enter the index: "))
print(f"Features at index {index}:", dataset[index][0])
print(f"Labels at index {index}:", dataset[index][1])

Features at index 10: tensor([  5.1000, 106.0000,  68.7600,  85.9300], dtype=torch.float64)
Labels at index 10: tensor(1, dtype=torch.int32)


In [23]:
# Creating a DataLoader object
dataloader = DataLoader(
    dataset=dataset,
    batch_size=5,
    shuffle=True, # Will be False for data like Time Series
    pin_memory=True,
    num_workers=0 # If you mention more than zero workers then you will not be able to print the data
)

# Returning the generator
dataloader

<torch.utils.data.dataloader.DataLoader at 0x1669b58efd0>

In [24]:
for batch_features, batch_labels in dataloader: # At every iteration, dataloader return whatever you returned in __getitem__ function
    print(batch_features, batch_labels)
    print("-" * 50)

tensor([[  5.1000, 106.0000,  68.7600,  85.9300],
        [  7.0500, 108.0000,  67.8100,  82.9600],
        [  5.6400, 104.0000,  84.0600,  95.9200],
        [  9.8200,  77.0000,  71.6200,  67.6200],
        [  5.0700, 113.0000,  93.7100,  91.8400]], dtype=torch.float64) tensor([1, 1, 1, 0, 1], dtype=torch.int32)
--------------------------------------------------
tensor([[  9.0200, 137.0000,  75.8100,  61.9800],
        [  8.1100, 111.0000,  67.0100,  93.3500],
        [  7.6600,  95.0000,  87.3000,  90.5000],
        [  5.3700,  90.0000,  88.9600,  67.5200],
        [  7.4600, 125.0000,  69.0200,  92.7000]], dtype=torch.float64) tensor([0, 0, 1, 1, 1], dtype=torch.int32)
--------------------------------------------------
tensor([[  7.9100, 106.0000,  92.6900,  69.0400],
        [  7.1600, 100.0000,  97.2700,  68.8000],
        [  7.9300, 113.0000,  98.0900,  66.8400],
        [  8.1000, 109.0000,  66.6300,  82.6600],
        [  5.7800, 116.0000,  71.2300,  95.0900]], dtype=torch.float