# D2Lab-A. Dataset and Dataloader for our lab

## About this notebook

This notebook was used in the 50.039 Deep Learning course at the Singapore University of Technology and Design.

**Author:** Matthieu DE MARI (matthieu_demari@sutd.edu.sg)

**Version:** 1.0 (01/02/2025)

**Requirements:**
- Python 3
- Matplotlib
- Numpy
- Pandas
- Torch
- Torchmetrics

## 0. Imports and CUDA

In addition to the libraries mentioned above, you will need the *helper_functions.py* file, which contains a few additional functions that help make this notebook simpler for you (e.g. visualisation, test cases, etc.).

Please refrain from modifying said file, but feel free to have a look at it.

In [None]:
# Matplotlib
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# Numpy
import numpy as np
# Pandas
import pandas as pd
# Torch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
#from torchmetrics.classification import BinaryAccuracy
# Helper functions (additional file)
from helper_functions import *

<div class="alert alert-block alert-info">
<b>A note before we start:</b> While not necessary, you might want to run the code for this homework on a GPU. It remains possible, however, to use the CPU only.
</div>

In [None]:
# Use GPU if available, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
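
Once `device` is defined, tensors and models have to be moved to it explicitly before they can be used together. The snippet below is a minimal sketch of that step; the tensor `x` and the layer `layer` are hypothetical placeholders that are not part of this lab.

```python
import torch

# Same device-selection idiom as in the cell above
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy tensor and layer, used only to illustrate the .to(device) call
x = torch.ones(4, 2)
layer = torch.nn.Linear(2, 1)

# Tensor.to returns a tensor on the target device; Module.to moves the parameters
x = x.to(device)
layer = layer.to(device)

# Both operands now live on the same device, so the forward pass works
out = layer(x)
print(out.shape, out.device)
```

If a tensor and a model sit on different devices, PyTorch raises a runtime error, which is why both are moved here.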

## 1. Loading and visualizing the dataset

In this first section, we are going to load a dataset from the *'dataset.xlsx'* file.

Feel free to have a look at this Excel file if you need to.

The cells below will define the parameters of our dataset, and load the data from the file.

In [None]:
# Dataset parameters
np.random.seed(17)
min_val = -1
max_val = 1
n_points = 1000

In [None]:
# Load dataset from file
excel_file_path = 'dataset_new.xlsx'
val1_list, val2_list, inputs, outputs = load_dataset(excel_file_path = excel_file_path)

In [None]:
# Visualize data in arrays
print(inputs.shape, outputs.shape)
print("Number of samples with class 0:", len(outputs) - sum(outputs))
print("Number of samples with class 1:", sum(outputs))

The visualization below shows the samples in the dataset, along with their ground truth class (red cross = 1, green dot = 0).

In [None]:
# Visualize the dataset
plot_dataset(min_val, max_val, val1_list, val2_list, outputs)

<div class="alert alert-block alert-info">
<b>Question 1:</b> Given the code executed above, can you describe the different elements of the Machine Learning problem that we seem to be currently facing? At the moment you should be able to describe the task (T), dataset (D), inputs and outputs (I, O). The model (M) and loss (L) will be discussed later.
</div>

<div class="alert alert-block alert-info">
<b>Question 2:</b> What geometric property of the decision boundary makes it challenging for linear models like logistic regression? What concept should we use in our neural network to overcome this limitation?
</div>

## 2. Writing a PyTorch Dataset object

Right now, we would like to write a *PyTorch Dataset* object for our Machine Learning problem.

Have a look at the incomplete code below. You will notice several variables set to None, which need to be replaced with something else.

Once you have figured out the correct values to use in place of the None variables, you should be able to run the function *test_dataset_object()* below. It will run two test cases for you, and both should pass for this task to be considered resolved.

Your class is expected to have the following features.
- Initialization (`__init__` method): The dataset is initialized by reading an Excel file (*dataset.xlsx*) with the Pandas read_excel function, and the result is stored in the dataframe attribute.
- Length method (`__len__` method): This method should return a specific piece of information about the dataset.
- Get item method (`__getitem__` method): This method is called when you index into the dataset (e.g., dataset[idx]). It retrieves a single sample from the dataset at the given index idx, extracting the features x1 and x2 along with the target y from the dataframe. The features x1 and x2 should be converted into PyTorch tensors of type torch.float32, and so should the target y. The features should then be stacked together into a single tensor inputs with 2 columns and one row per sample. Finally, this method should return two values: the input features tensor inputs and the target tensor y.
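
The three methods listed above follow the general protocol of a map-style PyTorch Dataset. The snippet below is a minimal sketch of that protocol on a made-up toy problem. It is not the lab's `CustomDataset`: the class name `ToyPairsDataset`, its argument `n_samples`, and the data it generates are all hypothetical and chosen only for illustration.

```python
import torch
from torch.utils.data import Dataset

class ToyPairsDataset(Dataset):
    """Hypothetical toy dataset: sample i has features (i, -i) and target i**2."""

    def __init__(self, n_samples):
        # Store the raw data once, at construction time
        self.values = torch.arange(n_samples, dtype=torch.float32)

    def __len__(self):
        # Called by len(dataset)
        return len(self.values)

    def __getitem__(self, idx):
        # Called by dataset[idx]; builds one (inputs, target) pair
        a = self.values[idx]
        b = -self.values[idx]
        # Stack the two scalar features into a single tensor of shape (2,)
        inputs = torch.stack([a, b])
        y = a ** 2
        return inputs, y

toy_dataset = ToyPairsDataset(5)
inputs, y = toy_dataset[3]
print(len(toy_dataset), inputs.shape, y)
```

Note that each call to `__getitem__` returns a single sample. Grouping samples into a two-dimensional batch is the job of the DataLoader, not of the Dataset.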

In [None]:
class CustomDataset(Dataset):
    def __init__(self):
        self.dataframe = pd.read_excel('dataset.xlsx')
        
    def __len__(self):
        return None
    
    def __getitem__(self, idx):
        # Select columns corresponding to the different inputs and outputs from the dataframe we just created.
        # And convert to PyTorch tensors
        x1 = None
        x2 = None
        y = None
        x1 = torch.tensor(x1, dtype = torch.float32)
        x2 = torch.tensor(x2, dtype = torch.float32)
        y = torch.tensor(y, dtype = torch.float32)
        # Assemble all input features in a single inputs tensor with 2 columns and rows for each sample in the dataset.
        inputs = None
        return inputs, y

In [None]:
# Create our PyTorch Dataset object from the class above
pt_dataset = CustomDataset()

In [None]:
# Running test function for our dataset object
test_dataset_object(pt_dataset)

<div class="alert alert-block alert-info">
<b>Question 3:</b> Show the code for your CustomDataset object, after you have correctly figured out how to replace the different None variables.
</div>

<div class="alert alert-block alert-info">
<b>Question 4:</b> What information about the dataset is the __len__ special method supposed to return? What would happen if it returned an incorrect value (e.g., return 0)?
</div>

## 3. Writing a Dataloader object

Our next task is to write a PyTorch DataLoader object. It will serve as a conveyor belt for the PyTorch Dataset object from the previous section.

Its objective will be to form mini-batches of size 128, shuffling the samples in the dataset.
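
To see what mini-batching and shuffling look like in practice, here is a minimal sketch on hypothetical toy data. The tensors `toy_inputs` and `toy_targets`, the use of `TensorDataset`, and the batch size of 4 are assumptions made for illustration; they are not the values expected in this lab.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 2 features each, and one target per sample
toy_inputs = torch.arange(20, dtype=torch.float32).reshape(10, 2)
toy_targets = torch.arange(10, dtype=torch.float32)
toy_dataset = TensorDataset(toy_inputs, toy_targets)

# batch_size and shuffle are DataLoader keyword arguments; the values are illustrative
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# Iterating over the loader yields (inputs, targets) mini-batches
batch_sizes = []
for batch_inputs, batch_targets in toy_loader:
    batch_sizes.append(batch_inputs.shape[0])
    print(batch_inputs.shape, batch_targets.shape)
```

With 10 samples and a batch size of 4, the loader yields three batches of 4, 4 and 2 samples, because the last incomplete batch is kept by default. The order of the samples changes every time the loader is iterated, since `shuffle=True`.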

In [None]:
# Define batch size
batch_size = None

<div class="alert alert-block alert-info">
<b>Question 5:</b> Can you figure out what to put in place of the None variables in the cells above and below? Show your code in your report.
</div>

In [None]:
# Create DataLoader object
pt_dataloader = DataLoader(None)

<div class="alert alert-block alert-info">
<b>Question 6:</b> Why is shuffling the dataset important in the DataLoader? What happens if it is turned off?
</div>

If you have correctly figured out the code for the cell above, the test cases checked by the function *test_dataloader_object()* should all pass.

In [None]:
# Running test function for our dataloader object
test_dataloader_object(pt_dataloader)

## What is next?

Our task continues in Notebook B.