# **Datasets and DataLoaders**
In PyTorch, `Dataset` and `DataLoader` are fundamental for handling data. Here's a breakdown of how they work and how to use them:

**1. Dataset**
The `Dataset` class is essentially a blueprint: a custom `Dataset` defines how data is loaded and how individual samples are returned. It implements three methods:
- `__init__()`: sets up the data source (for example, loads the data into memory or stores file paths).
- `__len__()`: returns the total number of samples.
- `__getitem__(index)`: returns the sample (and its label) at the given index.

**2. DataLoader**
The DataLoader wraps a Dataset and handles batching, shuffling, and parallel loading for you.

DataLoader Control Flow:
- At the start of each epoch, the DataLoader's sampler produces the order of the indices (shuffled if `shuffle=True`).
- The indices are divided into chunks of `batch_size`.
- For each index in a chunk, the sample is fetched from the Dataset object.
- The samples are combined into a batch (using `collate_fn`).
- The batch is returned to the main training loop.
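
The control flow above can be reproduced by hand as a rough sketch. `ToyDataset` is a hypothetical dataset invented for illustration, and the code assumes a recent PyTorch release in which `default_collate` is importable from `torch.utils.data`. The real DataLoader adds worker processes, memory pinning, and other details on top of this.

```python
import torch
from torch.utils.data import BatchSampler, Dataset, RandomSampler, default_collate


class ToyDataset(Dataset):
    """Hypothetical dataset: sample i is (tensor([i, i]), label i % 2)."""

    def __len__(self):
        return 10

    def __getitem__(self, index):
        return torch.tensor([index, index], dtype=torch.float32), index % 2


dataset = ToyDataset()

# Step 1: the sampler yields a shuffled permutation of the indices 0..9.
sampler = RandomSampler(dataset)

# Step 2: the batch sampler groups those indices into chunks of batch_size.
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=False)

batches = []
for batch_indices in batch_sampler:
    # Step 3: fetch each sample from the dataset by index.
    samples = [dataset[i] for i in batch_indices]
    # Step 4: collate the list of (features, label) pairs into batched tensors.
    features, labels = default_collate(samples)
    # Step 5: hand the batch to the training loop (here it is only stored).
    batches.append((features, labels))

print([tuple(features.shape) for features, _ in batches])
```

With 10 samples and a batch size of 4, the loop produces two full batches and one batch of 2.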

**Tips:**
1. **Custom Collate Function**: For datasets that have variable-length inputs, you can use a custom collate function.
2. **Lazy Loading**: If your dataset is too large, implement lazy loading in the `__getitem__` method by loading data directly from files.
3. **Data Augmentation**: Use `torchvision.transforms` for on-the-fly data augmentation.
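
As a minimal sketch of the lazy-loading tip, the dataset below stores only file paths and reads each file inside `__getitem__`. The class name `LazyNpyDataset`, the `.npy` file layout, and the temporary files are assumptions made for illustration; a real dataset would point at its own files.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class LazyNpyDataset(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""

    def __init__(self, file_paths, labels):
        # Store paths and labels only; no array is read here.
        self.file_paths = file_paths
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):
        # The file is read from disk only when this sample is requested.
        features = np.load(self.file_paths[index])
        return torch.from_numpy(features).float(), self.labels[index]


# Write a few small arrays to a temporary directory to stand in for real data.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(5):
    path = os.path.join(tmp_dir, f"sample_{i}.npy")
    np.save(path, np.full(3, i, dtype=np.float32))
    paths.append(path)

lazy_dataset = LazyNpyDataset(paths, labels=[0, 1, 0, 1, 0])
features, label = lazy_dataset[3]
print(len(lazy_dataset), features, label)
```

Memory use stays roughly constant in the number of samples, at the cost of one file read per requested sample.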

#### **A Note about Samplers**
In PyTorch, the sampler in the DataLoader determines the strategy for selecting samples from the dataset during data loading. It controls how indices of the dataset are drawn for each batch.

**Types of Samplers:**
PyTorch provides several predefined samplers, and you can create custom ones:
1. `SequentialSampler`:
   - Samples elements sequentially, in the order they appear in the dataset.
   - Default when shuffle=False.
2. `RandomSampler`:
   - Samples elements randomly, without replacement by default.
   - Default when shuffle=True.
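
A small sketch of how samplers produce indices. `EvenFirstSampler` is a hypothetical custom sampler written only for illustration, and the seeded `torch.Generator` is an assumption used to make the random order repeatable.

```python
import torch
from torch.utils.data import RandomSampler, Sampler, SequentialSampler

items = list(range(8))  # any object with a length can act as the data source here

# SequentialSampler yields the indices in order: 0, 1, ..., 7.
sequential_indices = list(SequentialSampler(items))

# RandomSampler yields a permutation of the same indices (no repeats by default).
generator = torch.Generator().manual_seed(0)
random_indices = list(RandomSampler(items, generator=generator))


class EvenFirstSampler(Sampler):
    """Hypothetical custom sampler: even indices first, then odd ones."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        n = len(self.data_source)
        return iter(list(range(0, n, 2)) + list(range(1, n, 2)))

    def __len__(self):
        return len(self.data_source)


custom_indices = list(EvenFirstSampler(items))

print(sequential_indices)
print(sorted(random_indices) == sequential_indices)
print(custom_indices)
```

A custom sampler only needs `__iter__` (yielding dataset indices) and `__len__`; it can then be passed to the DataLoader through its `sampler` argument.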

#### **A Note about `collate_fn`**
The `collate_fn` argument of PyTorch's DataLoader is a function that combines a list of samples from a dataset into a single batch. By default, the DataLoader converts the samples to tensors where needed and stacks them along a new batch dimension, which requires all samples to have the same shape. A custom `collate_fn` lets you change how the samples are processed and batched.
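
As a sketch of a custom `collate_fn` for variable-length inputs, the function below pads every sequence in a batch to the length of the longest one. The sample data and the name `pad_collate` are made up for illustration; padding with zeros is one common choice, not the only one.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical variable-length samples: (sequence, label) pairs.
samples = [
    (torch.tensor([1.0, 2.0, 3.0]), 0),
    (torch.tensor([4.0]), 1),
    (torch.tensor([5.0, 6.0]), 0),
    (torch.tensor([7.0, 8.0, 9.0, 10.0]), 1),
]


def pad_collate(batch):
    """Pad the sequences in a batch to equal length and keep their true lengths."""
    sequences, labels = zip(*batch)
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0.0)
    return padded, lengths, torch.tensor(labels)


loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)

results = []
for padded, lengths, labels in loader:
    results.append((tuple(padded.shape), lengths.tolist(), labels.tolist()))
print(results)
```

Returning the original lengths alongside the padded tensor lets the model ignore the padded positions later.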

#### **Important Parameters of DataLoader**
The DataLoader class in PyTorch comes with several parameters that allow you to customize how data is loaded, batched, and preprocessed. Some of the most commonly used and important parameters include:

1. **dataset (mandatory)**:
    - The Dataset from which the DataLoader will pull data.
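    - Typically a map-style dataset, i.e. a subclass of torch.utils.data.Dataset that implements `__getitem__` and `__len__`. Iterable-style datasets (torch.utils.data.IterableDataset) are also supported.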

2. **batch_size**:
    - How many samples per batch to load.
    - Default is 1.
    - Larger batch sizes can speed up training on GPUs but require more memory.

3. **shuffle**:
    - If True, the DataLoader reshuffles the dataset indices at every epoch. Default is False.
    - Helpful to avoid the model becoming too dependent on the order of samples.

4. **num_workers**:
    - The number of worker processes used to load data in parallel.
    - Setting num_workers > 0 can speed up data loading by leveraging multiple CPU cores, especially if I/O or preprocessing is a bottleneck.

5. **pin_memory**:
    - If True, the DataLoader will copy tensors into pinned (page-locked) memory before returning them.
    - This can improve GPU transfer speed and thus overall training throughput, particularly on CUDA systems.

6. **drop_last**:
    - If True, the DataLoader will drop the last incomplete batch if the total number of samples is not divisible by the batch size.
    - Useful when exact batch sizes are required (for example, in some batch normalization scenarios).

7. **collate_fn**:
    - A callable that processes a list of samples into a batch (the default simply stacks tensors).
    - Custom collate_fn can handle variable-length sequences, perform custom batching logic, or handle complex data structures.

8. **sampler / batch_sampler**:
    - `sampler` defines the strategy for drawing individual samples (e.g., for handling imbalanced classes or other custom sampling strategies). It cannot be combined with shuffle=True.
    - `batch_sampler` works at the batch level, controlling how batches are formed. It cannot be combined with batch_size, shuffle, sampler, or drop_last.
    - Typically, you don’t need to specify these if you are using batch_size and shuffle. However, they provide lower-level control if you have advanced requirements.
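
The effect of `batch_size` and `drop_last` on the number of batches can be checked with a short sketch. The random tensors are placeholders; the size of 398 samples is chosen to match the training split created later in this notebook.

```python
import math

import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

# Placeholder data: 398 samples with 30 features each.
dataset = TensorDataset(torch.randn(398, 30), torch.zeros(398))

keep_last = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=False)
drop_last = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)

# len(DataLoader) is the number of batches per epoch.
print(len(keep_last), math.ceil(398 / 32))  # both are 13
print(len(drop_last), 398 // 32)            # both are 12

# Without drop_last, the final batch holds the 398 - 12 * 32 = 14 leftover samples.
batch_sizes = [features.shape[0] for features, _ in keep_last]
print(batch_sizes[-1])

# Passing a sampler together with shuffle=True is rejected by the DataLoader.
raised = False
try:
    DataLoader(dataset, shuffle=True, sampler=SequentialSampler(dataset))
except ValueError as error:
    raised = True
    print("ValueError:", error)
```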

## **Import Dependencies**

In [75]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchinfo import summary
from torchmetrics import Accuracy

import warnings
warnings.filterwarnings('ignore')

## **Read the Dataset**

In [4]:
# Load the breast cancer dataset using Pandas
data = pd.read_csv(r"D:\GITHUB\pytorch-for-deep-Learning-and-machine-learning\datasets\breast_cancer_data.csv")
print(data.shape)
data.head()

(569, 33)


| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | |


## **Data Pre-processing**

### **Data Cleaning**

In [5]:
# Drop the irrelevant columns
data.drop(columns=['id', 'Unnamed: 32'], inplace=True)
print(data.shape)
data.head()

(569, 31)


| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


### **Train-Test Split**

In [6]:
# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=['diagnosis']),
    data['diagnosis'],
    test_size=0.3,
    random_state=42
)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((398, 30), (171, 30), (398,), (171,))

### **Feature Scaling**

In [8]:
# Print the column information
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 398 entries, 149 to 102
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   radius_mean              398 non-null    float64
 1   texture_mean             398 non-null    float64
 2   perimeter_mean           398 non-null    float64
 3   area_mean                398 non-null    float64
 4   smoothness_mean          398 non-null    float64
 5   compactness_mean         398 non-null    float64
 6   concavity_mean           398 non-null    float64
 7   concave points_mean      398 non-null    float64
 8   symmetry_mean            398 non-null    float64
 9   fractal_dimension_mean   398 non-null    float64
 10  radius_se                398 non-null    float64
 11  texture_se               398 non-null    float64
 12  perimeter_se             398 non-null    float64
 13  area_se                  398 non-null    float64
 14  smoothness_se            398 

In [11]:
# Scale the input variables using StandardScaler (fit on the training data only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)

(398, 30) (171, 30)
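
What the scaler does can be sketched in plain NumPy: each feature is shifted by its training mean and divided by its training standard deviation, and the same training statistics are reused for the test data. The random arrays below are hypothetical stand-ins for the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # hypothetical training features
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))    # hypothetical test features

# The statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics; do not refit

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
```

The scaled training features have mean 0 and standard deviation 1 per column; the scaled test features will generally be close to, but not exactly, that.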


### **Label Encoding**

In [14]:
# Encode the target variable using label encoder
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(y_train_encoded.shape, y_test_encoded.shape)

(398,) (171,)
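
`LabelEncoder` assigns integer codes in the sorted order of the class names, and the fitted mapping can be inspected through `label_encoder.classes_`. The sketch below mimics that behaviour with `np.unique`; the label values are hypothetical and assume the column contains only `B` and `M`.

```python
import numpy as np

labels = np.array(["M", "B", "B", "M", "B"])  # hypothetical diagnosis values

# np.unique returns the sorted classes and, for each label, its index in that sorted list.
classes, encoded = np.unique(labels, return_inverse=True)

print(classes)  # sorted class names
print(encoded)  # integer code per sample
```

Under that assumption, `B` is encoded as 0 and `M` as 1, so a prediction of 1 corresponds to `M`.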


### **Convert NumPy Arrays to PyTorch Tensors**

In [15]:
X_train_tensor = torch.from_numpy(X_train_scaled).type(torch.float32)
X_test_tensor = torch.from_numpy(X_test_scaled).type(torch.float32)
y_train_tensor = torch.from_numpy(y_train_encoded).type(torch.float32)
y_test_tensor = torch.from_numpy(y_test_encoded).type(torch.float32)

print(X_train_tensor.shape, X_test_tensor.shape, y_train_tensor.shape, y_test_tensor.shape)

torch.Size([398, 30]) torch.Size([171, 30]) torch.Size([398]) torch.Size([171])


## **Create Dataset and DataLoader**

In [17]:
# Create a custom dataset class
class CustomDataset(Dataset):

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

In [22]:
# Create an object of the custom dataset class
train_dataset = CustomDataset(X_train_tensor, y_train_tensor)
test_dataset = CustomDataset(X_test_tensor, y_test_tensor)

print('Length of the training dataset:', len(train_dataset))
# Print a row from the training dataset
train_dataset[10]

Length of the training dataset: 398


(tensor([-0.7975, -0.3776, -0.8148, -0.7253, -0.5369, -0.9810, -0.7749, -0.7285,
         -0.7530, -0.4882, -0.7349,  0.6774, -0.7181, -0.5517, -0.5347, -0.8574,
         -0.6873, -0.9723, -0.8696, -0.8052, -0.6753,  1.7994, -0.6747, -0.6325,
          0.5883, -0.5853, -0.4461, -0.4210,  0.1446, -0.3479]),
 tensor(0.))
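
When the data already lives in tensors, as it does here, PyTorch's built-in `TensorDataset` offers the same indexing behaviour as the custom class above. A minimal sketch, using small random tensors as stand-ins for `X_train_tensor` and `y_train_tensor`:

```python
import torch
from torch.utils.data import TensorDataset

features = torch.randn(6, 30)               # stand-in for X_train_tensor
labels = torch.randint(0, 2, (6,)).float()  # stand-in for y_train_tensor

tensor_dataset = TensorDataset(features, labels)

# Indexing returns a tuple with one entry per wrapped tensor.
sample_features, sample_label = tensor_dataset[2]
print(len(tensor_dataset), sample_features.shape, sample_label)
```

A custom `Dataset` is still the better fit when samples need per-item loading or transformation.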

In [26]:
# Create DataLoader objects
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Note: shuffling is not needed for evaluation; shuffle=False is the usual choice for test data
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=True)

# Print the first two batches from the training dataset
for idx, (batch_features, batch_labels) in enumerate(train_dataloader):
    print(batch_features)
    print(batch_labels)
    print('-'*50)

    if idx == 1:
        break

tensor([[ 1.1452e+00, -1.0910e-01,  1.1560e+00,  1.0413e+00,  1.3702e+00,
          8.8371e-01,  1.1465e+00,  1.5283e+00,  1.0766e+00,  6.0239e-02,
          1.4786e+00,  7.5068e-01,  9.4802e-01,  1.1923e+00, -1.0144e+00,
          2.3669e-01, -1.3624e-01, -4.3428e-01,  7.9533e-01,  5.0428e-01,
          8.9966e-01, -2.2888e-01,  8.3943e-01,  7.7039e-01, -1.6404e-01,
         -1.3060e-01, -3.1637e-02,  2.9247e-01,  2.2173e-01, -2.0962e-01],
        [-5.7092e-01, -2.6829e-01, -5.7572e-01, -5.7049e-01, -3.7034e-01,
         -4.8731e-01, -7.5781e-01, -8.8067e-01, -1.2635e+00,  2.4323e-02,
         -6.3908e-01, -2.6271e-02, -5.6705e-01, -4.9849e-01, -6.8902e-01,
         -4.8386e-01, -6.0930e-01, -1.0865e+00, -7.2847e-01, -7.3006e-01,
         -6.1484e-01,  3.9342e-01, -5.6092e-01, -5.8671e-01, -4.7716e-01,
         -1.4273e-01, -5.4365e-01, -8.9238e-01, -7.6639e-01, -3.5708e-01],
        [ 1.3151e+00,  6.6785e-01,  1.2962e+00,  1.2562e+00,  4.2463e-01,
          6.9601e-01,  9.1756e-01,  

## **Build a Neural Network Model**

In [39]:
class MyModel(nn.Module):

    def __init__(self, n_features):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(n_features, 3),
            nn.ReLU(),
            nn.Linear(3, 1),
            nn.Sigmoid()
        )

    def forward(self, X: torch.Tensor):
        out = self.network(X)
        return out

### **Training Pipeline**

In [96]:
# Create an object of the model
model = MyModel(X_train_tensor.shape[1])
summary(model)

Layer (type:depth-idx)                   Param #
MyModel                                  --
├─Sequential: 1-1                        --
│    └─Linear: 2-1                       93
│    └─ReLU: 2-2                         --
│    └─Linear: 2-3                       4
│    └─Sigmoid: 2-4                      --
Total params: 97
Trainable params: 97
Non-trainable params: 0
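
The parameter counts in the summary can be verified by hand: a `Linear` layer has `in_features * out_features` weights plus `out_features` biases. The sketch below rebuilds the same layer stack with 30 input features and counts the parameters directly.

```python
from torch import nn

n_features = 30
network = nn.Sequential(
    nn.Linear(n_features, 3),
    nn.ReLU(),
    nn.Linear(3, 1),
    nn.Sigmoid(),
)

# Each Linear layer has in_features * out_features weights plus out_features biases.
first_layer = n_features * 3 + 3  # 93
second_layer = 3 * 1 + 1          # 4

total = sum(p.numel() for p in network.parameters())
print(first_layer, second_layer, total)
```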

In [97]:
# Set the learning rate and number of epochs
lr = 0.01
epochs = 50

# Define a loss function and an optimizer
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
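
To make the two objects less opaque, the sketch below recomputes binary cross-entropy by hand and performs a single SGD step on a toy parameter. The probabilities, targets, and the parameter `w` are made-up values for illustration only.

```python
import torch
from torch import nn

probs = torch.tensor([0.9, 0.2, 0.6])    # hypothetical sigmoid outputs
targets = torch.tensor([1.0, 0.0, 1.0])  # hypothetical labels

# Binary cross-entropy per sample: -[y * log(p) + (1 - y) * log(1 - p)], then the mean.
manual = -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()
builtin = nn.BCELoss()(probs, targets)

# One SGD step on a single parameter: w <- w - lr * grad.
w = torch.tensor([2.0], requires_grad=True)
toy_optimizer = torch.optim.SGD([w], lr=0.01)
toy_loss = (w ** 2).sum()  # the gradient is 2 * w = 4
toy_loss.backward()
toy_optimizer.step()       # w becomes 2.0 - 0.01 * 4 = 1.96

print(manual.item(), builtin.item(), w.item())
```

`nn.BCELoss` expects probabilities in [0, 1], which is why the model ends with a `Sigmoid` layer.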

In [98]:
# Define the training loop
for epoch in range(epochs):

    model.train()  # Set the model to training mode
    total_loss = 0.0  # Running sum of the batch losses for this epoch

    # Iterate over the training batches
    for batch_X, batch_y in train_dataloader:

        # Forward pass
        y_pred = model(batch_X)

        # Loss calculation (squeeze(1) turns shape [batch, 1] into [batch])
        loss = loss_fn(y_pred.squeeze(1), batch_y)
        # .item() extracts a Python float, so the computation graph is not kept alive
        total_loss += loss.item()

        # Zero gradients
        optimizer.zero_grad()

        # Backward pass
        loss.backward()

        # Parameters update
        optimizer.step()

    # Validate on the testing batches
    model.eval()  # Set the model to evaluation mode
    total_test_loss = 0.0

    with torch.no_grad():
        for batch_X, batch_y in test_dataloader:
            test_preds = model(batch_X)
            test_loss = loss_fn(test_preds.squeeze(1), batch_y)
            total_test_loss += test_loss.item()

    # Print the mean batch loss for the epoch
    print(f'Epoch: {epoch}, Loss: {(total_loss / len(train_dataloader)):.2f}, Val Loss: {(total_test_loss / len(test_dataloader)):.2f}')

Epoch: 0, Loss: 0.68, Val Loss: 0.67
Epoch: 1, Loss: 0.66, Val Loss: 0.65
Epoch: 2, Loss: 0.65, Val Loss: 0.64
Epoch: 3, Loss: 0.64, Val Loss: 0.62
Epoch: 4, Loss: 0.63, Val Loss: 0.62
Epoch: 5, Loss: 0.61, Val Loss: 0.60
Epoch: 6, Loss: 0.60, Val Loss: 0.58
Epoch: 7, Loss: 0.59, Val Loss: 0.58
Epoch: 8, Loss: 0.58, Val Loss: 0.56
Epoch: 9, Loss: 0.56, Val Loss: 0.54
Epoch: 10, Loss: 0.55, Val Loss: 0.53
Epoch: 11, Loss: 0.54, Val Loss: 0.51
Epoch: 12, Loss: 0.53, Val Loss: 0.50
Epoch: 13, Loss: 0.51, Val Loss: 0.48
Epoch: 14, Loss: 0.50, Val Loss: 0.48
Epoch: 15, Loss: 0.49, Val Loss: 0.47
Epoch: 16, Loss: 0.48, Val Loss: 0.46
Epoch: 17, Loss: 0.48, Val Loss: 0.43
Epoch: 18, Loss: 0.46, Val Loss: 0.44
Epoch: 19, Loss: 0.45, Val Loss: 0.43
Epoch: 20, Loss: 0.44, Val Loss: 0.42
Epoch: 21, Loss: 0.44, Val Loss: 0.42
Epoch: 22, Loss: 0.43, Val Loss: 0.41
Epoch: 23, Loss: 0.42, Val Loss: 0.40
Epoch: 24, Loss: 0.41, Val Loss: 0.39
Epoch: 25, Loss: 0.41, Val Loss: 0.37
Epoch: 26, Loss: 0.40,

## **Model Evaluation**

In [99]:
# Model evaluation using test dataloader
model.eval() # Set the model to evaluation mode
accuracy_list = []
accuracy = Accuracy(task='binary')

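# Make predictions on testing data
with torch.no_grad():
    for batch_X, batch_y in test_dataloader:
        y_pred = model(batch_X).squeeze(1)
        y_pred = (y_pred > 0.5).float()  # Threshold the probabilities at 0.5
        batch_acc = accuracy(y_pred, batch_y)

        accuracy_list.append(batch_acc.item())  # Store as a Python float

# Calculate overall accuracy as the mean of the per-batch accuracies
# (the smaller final batch gets the same weight as the full batches)
overall_acc = np.mean(accuracy_list)
print(f'Accuracy: {overall_acc:.2f}')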

Accuracy: 0.96
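
Averaging per-batch accuracies gives the smaller final batch the same weight as the full ones, so the result can differ slightly from the accuracy over all samples. The sketch below illustrates the gap with made-up predictions for 171 samples, the size of the test split; with torchmetrics, calling `accuracy.compute()` after the loop should return the accumulated, sample-weighted value instead.

```python
import torch

torch.manual_seed(0)
targets = torch.randint(0, 2, (171,)).float()  # hypothetical labels, same size as the test split
preds = targets.clone()
preds[-5:] = 1 - preds[-5:]                    # make the last 5 predictions wrong

batch_accs, correct, total = [], 0, 0
for start in range(0, 171, 32):                # batches of 32; the last one has 11 samples
    p, t = preds[start:start + 32], targets[start:start + 32]
    batch_correct = (p == t).sum().item()
    batch_accs.append(batch_correct / len(t))
    correct += batch_correct
    total += len(t)

mean_of_batches = sum(batch_accs) / len(batch_accs)  # every batch weighted equally
sample_weighted = correct / total                    # every sample weighted equally
print(f"{mean_of_batches:.4f} {sample_weighted:.4f}")
```

Here all five errors fall in the short final batch, so the mean of batch accuracies comes out lower than the sample-weighted accuracy.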
