Dataset and DataLoader are core abstractions in PyTorch that decouple how you define your data from how you efficiently iterate over it during model training.

**Dataset Class**

The Dataset class is essentially a blueprint. When you create a custom Dataset, you decide how data is loaded and returned.

It defines:
1. __init__(), which sets up how the data is loaded.
2. __len__(), which returns the total number of samples.
3. __getitem__(index), which returns the data (and label) at the given index.

**DataLoader Class**

The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

DataLoader Control Flow:
1. At the start of each epoch, the DataLoader shuffles the indices using a sampler (if shuffle=True).
2. It divides the indices into chunks of batch_size.
3. For each index in a chunk, the corresponding sample is fetched from the Dataset object.
4. The samples are collected and combined into a batch (using collate_fn).
5. The batch is returned to the main training loop for model training.
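
The five steps above can be sketched by hand. This is only an illustration of the control flow, not the DataLoader's actual implementation: ToyDataset and manual_batches are hypothetical names, and torch.stack stands in for the default collation.

```python
import torch
from torch.utils.data import Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset: 10 samples with 2 features each and a binary label."""

    def __init__(self):
        self.features = torch.arange(20, dtype=torch.float32).view(10, 2)
        self.labels = torch.tensor([0, 1] * 5)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        return self.features[index], self.labels[index]


def manual_batches(dataset, batch_size, shuffle):
    # Step 1: build the index order (shuffled if requested).
    if shuffle:
        indices = torch.randperm(len(dataset)).tolist()
    else:
        indices = list(range(len(dataset)))

    # Step 2: split the indices into chunks of batch_size.
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]

        # Step 3: fetch one sample per index from the Dataset.
        samples = [dataset[i] for i in chunk]

        # Step 4: collate the list of (feature, label) pairs into batch tensors.
        features = torch.stack([f for f, _ in samples])
        labels = torch.stack([l for _, l in samples])

        # Step 5: hand the batch to the caller (the training loop).
        yield features, labels


toy = ToyDataset()
batches = list(manual_batches(toy, batch_size=4, shuffle=False))
for features, labels in batches:
    print(features.shape, labels.shape)
```

With 10 samples and batch_size=4, the generator yields two full batches and a final batch of 2 samples.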

Batching: it groups samples from the Dataset into batches of a specified size. Shuffling: it can shuffle the data each epoch so that the model does not learn the order of the samples.

It can also load data in parallel using multiple worker processes, which speeds up data loading, and it automatically collates the samples from the Dataset to form a batch.

In this notebook, the Dataset class serves data (rows) that is already in memory.
The main component is the DataLoader class: it decides how many rows go into each batch, builds the batches, and passes them on for model training.
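
As a small illustration of how batch_size determines the number of batches, the sketch below wraps hypothetical random data in PyTorch's built-in TensorDataset (used here only for brevity) and compares keeping versus dropping the final, smaller batch via the drop_last argument.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data: 10 samples with 2 features each.
features = torch.randn(10, 2)
labels = torch.randint(0, 2, (10,))
dataset = TensorDataset(features, labels)

# Default behaviour: the final, smaller batch is kept.
loader = DataLoader(dataset, batch_size=4)
# drop_last=True discards a final batch with fewer than batch_size samples.
loader_drop = DataLoader(dataset, batch_size=4, drop_last=True)

print(len(loader), math.ceil(10 / 4))   # batches per epoch, partial batch kept
print(len(loader_drop), 10 // 4)        # batches per epoch, partial batch dropped
print([len(y) for _, y in loader])      # sizes of the individual batches
```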

In [6]:
from sklearn.datasets import make_classification

# creating synthetic classification dataset with the help of sklearn

x, y = make_classification(
    n_samples     = 10,    # number of samples
    n_features    = 2,     # number of features
    n_informative = 2,     # number of informative features
    n_redundant   = 0,     # number of redundant features
    n_classes     = 2,     # number of classes
    random_state  = 42     # for reproducibility
)

In [7]:
x

array([[ 1.06833894, -0.97007347],
       [-1.14021544, -0.83879234],
       [-2.8953973 ,  1.97686236],
       [-0.72063436, -0.96059253],
       [-1.96287438, -0.99225135],
       [-0.9382051 , -0.54304815],
       [ 1.72725924, -1.18582677],
       [ 1.77736657,  1.51157598],
       [ 1.89969252,  0.83444483],
       [-0.58723065, -1.97171753]])

In [8]:
y

array([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [10]:
# converting these data to PyTorch tensors
import torch

x_tensor = torch.tensor(x, dtype = torch.float32)
y_tensor = torch.tensor(y, dtype = torch.long)

In [2]:
import torch

In [5]:
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class CustomDataset(Dataset):


  def __init__(self, features, labels): #tells how data should be loaded

    self.features = features
    self.labels   = labels

  def __len__(self): #returns the total number of samples.

    return len(self.features)

  def __getitem__(self, index): #returns the data and label at the given index.

    return self.features[index], self.labels[index]

In [11]:
dataset = CustomDataset(x_tensor, y_tensor)

In [14]:
x_tensor

tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388],
        [-2.8954,  1.9769],
        [-0.7206, -0.9606],
        [-1.9629, -0.9923],
        [-0.9382, -0.5430],
        [ 1.7273, -1.1858],
        [ 1.7774,  1.5116],
        [ 1.8997,  0.8344],
        [-0.5872, -1.9717]])

In [15]:
y_tensor

tensor([1, 0, 0, 0, 0, 1, 1, 1, 1, 0])

In [12]:
len(dataset)

10

In [25]:
getitem1 = dataset[0]
getitem1

(tensor([ 1.0683, -0.9701]), tensor(1))

In [13]:
dataset[2]

(tensor([-2.8954,  1.9769]), tensor(0))

In [16]:
dataset[5]

(tensor([-0.9382, -0.5430]), tensor(1))

In [17]:
dataloader = DataLoader(dataset, batch_size = 2, shuffle = False)

Sampling, shuffling, batch management, and parallelization are all handled by the **DataLoader** class.

The collate_fn argument of PyTorch's DataLoader class is a function that specifies how to combine a list of samples from a dataset into a single batch.
By default, the DataLoader uses a simple collation mechanism that stacks the samples along a new batch dimension,
but collate_fn allows you to customize how the data is processed and batched.
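
One common reason to customize collation is that samples have different lengths and cannot be stacked directly. The sketch below is a minimal example under that assumption: VariableLengthDataset and pad_collate are hypothetical names, and padding with zeros is just one possible choice.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class VariableLengthDataset(Dataset):
    """Hypothetical dataset whose samples are 1-D tensors of different lengths."""

    def __init__(self):
        self.sequences = [torch.ones(n) for n in (3, 5, 2, 4)]
        self.labels = torch.tensor([0, 1, 0, 1])

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, index):
        return self.sequences[index], self.labels[index]


def pad_collate(samples):
    # samples is a list of (sequence, label) pairs returned by __getitem__.
    sequences, labels = zip(*samples)
    lengths = torch.tensor([len(s) for s in sequences])
    # Pad every sequence with zeros up to the longest one in this batch.
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0.0)
    return padded, torch.stack(labels), lengths


loader = DataLoader(VariableLengthDataset(), batch_size=2, collate_fn=pad_collate)
for padded, labels, lengths in loader:
    print(padded.shape, labels.tolist(), lengths.tolist())
```

Each batch is padded only to the length of its own longest sequence, and the original lengths are returned alongside so that downstream code can ignore the padding.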

**num_workers:**

1. The number of worker processes used to load data in parallel.
2. Setting num_workers > 0 can speed up data loading by leveraging multiple CPU cores, especially if I/O or preprocessing is a bottleneck.

In [22]:
for batch_features, batch_labels in dataloader:

  print('Batch Features :',batch_features)
  print('\n')
  print('Batch Labels   :',batch_labels)
  print('_' * 50)

Batch Features : tensor([[ 1.0683, -0.9701],
        [-1.1402, -0.8388]])


Batch Labels   : tensor([1, 0])
__________________________________________________
Batch Features : tensor([[-2.8954,  1.9769],
        [-0.7206, -0.9606]])


Batch Labels   : tensor([0, 0])
__________________________________________________
Batch Features : tensor([[-1.9629, -0.9923],
        [-0.9382, -0.5430]])


Batch Labels   : tensor([0, 1])
__________________________________________________
Batch Features : tensor([[ 1.7273, -1.1858],
        [ 1.7774,  1.5116]])


Batch Labels   : tensor([1, 1])
__________________________________________________
Batch Features : tensor([[ 1.8997,  0.8344],
        [-0.5872, -1.9717]])


Batch Labels   : tensor([1, 0])
__________________________________________________


**Data Transformations**

Data transformations are crucial for preprocessing and augmenting data before feeding it to a model.

Common transformations include:

1. Normalization: Scaling data to a specific range (e.g., [0, 1] or [-1, 1]).
2. Resizing: Changing the dimensions of images.
3. Cropping: Extracting specific regions from images.
4. Random Rotations/Flips: Augmenting images to increase data diversity.
5. Converting to Tensors: Converting data to PyTorch tensors.

Where to Apply Transformations:

Transformations are typically applied within the Dataset class, either during initialization of the Dataset object or in the __getitem__() method.

1. During initialization (__init__):
Apply transformations that only need to be done ONCE, such as loading and preprocessing static data.

2. Inside __getitem__:
This is the most common place to apply transformations, especially those that must be done on the fly for each sample, such as random augmentations.
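
The two places can be combined in one Dataset. The following is a minimal sketch, not a prescribed pattern: TransformDataset and add_gaussian_noise are hypothetical names, the data is random, and the noise scale of 0.05 is an arbitrary assumption.

```python
import torch
from torch.utils.data import Dataset


class TransformDataset(Dataset):
    """Hypothetical dataset that standardizes once and augments per sample."""

    def __init__(self, features, labels, transform=None):
        # One-time preprocessing: standardize each feature column.
        mean = features.mean(dim=0, keepdim=True)
        std = features.std(dim=0, keepdim=True)
        self.features = (features - mean) / (std + 1e-8)
        self.labels = labels
        # Optional callable applied on the fly to each sample.
        self.transform = transform

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        x = self.features[index]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[index]


def add_gaussian_noise(x, scale=0.05):
    # Random augmentation: a different noise draw on every access.
    return x + scale * torch.randn_like(x)


features = torch.randn(8, 3)
labels = torch.randint(0, 2, (8,))
dataset = TransformDataset(features, labels, transform=add_gaussian_noise)

first, second = dataset[0][0], dataset[0][0]
print(first.shape, torch.equal(first, second))
```

Because the noise is drawn inside __getitem__, the same index returns a slightly different tensor on each access, while the standardization in __init__ is computed only once. In a real train/test setup, the statistics used for standardization would normally come from the training split only.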



---




**MODIFYING OUR EXISTING NEURAL NETWORK CODE FOR THE BREAST CANCER DATASET**

In [26]:

import numpy as np
import pandas as pd
import torch

In [27]:
df = pd.read_csv('https://raw.githubusercontent.com/gscdit/Breast-Cancer-Detection/refs/heads/master/data.csv')
df.head(10)

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,
5,843786,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,
7,84458202,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,
8,844981,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,
9,84501001,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,


In [28]:
df.shape

(569, 33)

In [29]:
df.drop(columns=['id', 'Unnamed: 32'], inplace= True)

In [30]:
df.head(10)

,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
5,M,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,...,15.47,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244
6,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
7,M,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,...,17.06,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151
8,M,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,...,15.49,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072
9,M,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,...,15.09,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075


In [31]:
from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0], test_size=0.2)

# target column
y = df['diagnosis']

# input columns, i.e. all columns except 'diagnosis'
X = df.drop(columns=['diagnosis'])

# split dataset into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# shapes of datasets
print('X_train shape :', X_train.shape)
print('X_test shape  :', X_test.shape)
print('y_train shape :', y_train.shape)
print('y_test shape  :', y_test.shape)

X_train shape : (455, 30)
X_test shape  : (114, 30)
y_train shape : (455,)
y_test shape  : (114,)


In [32]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

In [33]:
X_train

array([[-1.44075296, -0.43531947, -1.36208497, ...,  0.9320124 ,
         2.09724217,  1.88645014],
       [ 1.97409619,  1.73302577,  2.09167167, ...,  2.6989469 ,
         1.89116053,  2.49783848],
       [-1.39998202, -1.24962228, -1.34520926, ..., -0.97023893,
         0.59760192,  0.0578942 ],
       ...,
       [ 0.04880192, -0.55500086, -0.06512547, ..., -1.23903365,
        -0.70863864, -1.27145475],
       [-0.03896885,  0.10207345, -0.03137406, ...,  1.05001236,
         0.43432185,  1.21336207],
       [-0.54860557,  0.31327591, -0.60350155, ..., -0.61102866,
        -0.3345212 , -0.84628745]])

In [34]:
y_train

,diagnosis
68,B
181,M
63,B
248,B
60,B
...,...
71,B
106,B
270,B
435,M


In [35]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
y_test = encoder.transform(y_test)

In [36]:
y_train

array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,

In [37]:
y_test

array([0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1])

In [38]:
X_train_tensor = torch.from_numpy(X_train.astype(np.float32))
X_test_tensor = torch.from_numpy(X_test.astype(np.float32))
y_train_tensor = torch.from_numpy(y_train.astype(np.float32))
y_test_tensor = torch.from_numpy(y_test.astype(np.float32))

In [39]:
X_train_tensor.shape

torch.Size([455, 30])

In [40]:
y_train_tensor.shape

torch.Size([455])

In [46]:
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):

  def __init__(self, features, labels):

    self.features = features
    self.labels = labels

  def __len__(self):

    return len(self.features)

  def __getitem__(self, idx):

    return self.features[idx], self.labels[idx]

In [47]:
training_dataset = MyDataset(X_train_tensor, y_train_tensor)
testing_dataset  = MyDataset(X_test_tensor, y_test_tensor)

In [49]:
training_dataset[10]

(tensor([-0.4976,  0.6137, -0.4981, -0.5310, -0.5769, -0.1749, -0.3622, -0.2849,
          0.4335,  0.1782, -0.3684,  0.5531, -0.3167, -0.4052,  0.0403, -0.0380,
         -0.1804,  0.1648, -0.1217,  0.2308, -0.5004,  0.8194, -0.4692, -0.5331,
         -0.0491, -0.0416, -0.1491,  0.0968,  0.1062,  0.4904]),
 tensor(0.))

In [54]:
training_loader = DataLoader(training_dataset, batch_size=32, shuffle=True)

testing_loader = DataLoader(testing_dataset, batch_size=32, shuffle=False)  # evaluation does not need shuffling

In [55]:
import torch.nn as nn


class MySimpleNN(nn.Module):

  def __init__(self, num_features):

    super().__init__()
    self.linear1 = nn.Linear(num_features, 4)
    self.relu    = nn.ReLU()   # stateless, so one instance is reused after both hidden layers
    self.linear2 = nn.Linear(4, 2)
    self.linear3 = nn.Linear(2, 1)
    self.sigmoid = nn.Sigmoid()

  def forward(self, features):

    out = self.linear1(features)
    out = self.relu(out)
    out = self.linear2(out)
    out = self.relu(out)
    out = self.linear3(out)
    out = self.sigmoid(out)

    return out

In [56]:
learning_rate = 0.1
epochs = 30

In [57]:
# create model
model = MySimpleNN(X_train_tensor.shape[1])

# define optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# define loss function
loss_function = nn.BCELoss()

In [58]:
# Training Loop

for epoch in range(epochs):

  for batch_features, batch_labels in training_loader:

    # forward pass
    y_pred = model(batch_features)

    # calculate loss (labels reshaped to (batch_size, 1) to match y_pred)
    loss = loss_function(y_pred, batch_labels.view(-1,1))

    # clear gradients
    optimizer.zero_grad()

    # backward pass
    loss.backward()

    # parameters update
    optimizer.step()

  # print the loss of the last batch in each epoch
  print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')

Epoch: 1, Loss: 0.6245325207710266
Epoch: 2, Loss: 0.6887750029563904
Epoch: 3, Loss: 0.5982382893562317
Epoch: 4, Loss: 0.5485871434211731
Epoch: 5, Loss: 0.39628860354423523
Epoch: 6, Loss: 0.3977431356906891
Epoch: 7, Loss: 0.10803245007991791
Epoch: 8, Loss: 0.11309971660375595
Epoch: 9, Loss: 0.038751743733882904
Epoch: 10, Loss: 0.0236782468855381
Epoch: 11, Loss: 0.6889012455940247
Epoch: 12, Loss: 0.04864979907870293
Epoch: 13, Loss: 0.16502591967582703
Epoch: 14, Loss: 0.6674279570579529
Epoch: 15, Loss: 0.03856906294822693
Epoch: 16, Loss: 0.04940050095319748
Epoch: 17, Loss: 0.0028327382169663906
Epoch: 18, Loss: 0.47392651438713074
Epoch: 19, Loss: 0.0023966971784830093
Epoch: 20, Loss: 0.009046901948750019
Epoch: 21, Loss: 0.00988232996314764
Epoch: 22, Loss: 0.01446314062923193
Epoch: 23, Loss: 0.02083640731871128
Epoch: 24, Loss: 0.007853366434574127
Epoch: 25, Loss: 0.023009393364191055
Epoch: 26, Loss: 0.04819929599761963
Epoch: 27, Loss: 0.002812727587297559
Epoch: 28

In [60]:
# Model evaluation using testing_loader

model.eval()       # set the model to evaluation mode


accuracy_list = []

with torch.no_grad():
    for batch_features, batch_labels in testing_loader:


        y_pred = model(batch_features)   # forward pass
        y_pred = (y_pred > 0.8).float()  # convert probabilities to binary predictions (threshold 0.8)

        # calculate accuracy for the current batch
        batch_accuracy = (y_pred.view(-1) == batch_labels).float().mean().item()
        accuracy_list.append(batch_accuracy)


# average the per-batch accuracies (approximate when the last batch is smaller than the others)
overall_accuracy = sum(accuracy_list) / len(accuracy_list)
print(f'Accuracy : {overall_accuracy:.4f}')

Accuracy : 0.9844
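
Averaging per-batch accuracies gives every batch the same weight, even though the 114 test samples with batch_size=32 form batches of 32, 32, 32 and 18. A sample-weighted alternative counts correct predictions and samples across all batches. The sketch below is one possible way to do that: evaluate_accuracy is a hypothetical helper, its default threshold of 0.5 is an assumption, and the toy model and data exist only so that the block runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader, threshold=0.5):
    """Return the fraction of correctly classified samples over the whole loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch_features, batch_labels in loader:
            probs = model(batch_features).view(-1)
            preds = (probs > threshold).float()
            # Count correct predictions instead of averaging per batch.
            correct += (preds == batch_labels).sum().item()
            total += batch_labels.numel()
    return correct / total


# Hypothetical stand-in model and data so the sketch is self-contained.
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Linear(3, 1), torch.nn.Sigmoid())
toy_features = torch.randn(10, 3)
toy_labels = torch.randint(0, 2, (10,)).float()
toy_loader = DataLoader(TensorDataset(toy_features, toy_labels), batch_size=4)

print(f'Accuracy : {evaluate_accuracy(toy_model, toy_loader):.4f}')
```

With the objects defined in this notebook, the call would be evaluate_accuracy(model, testing_loader, threshold=0.8) to match the threshold used in the cell above.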
