<a href="https://colab.research.google.com/github/parthivz/Fundamentals-of-GenAI-Course-Lab/blob/main/PyTorch_3_Exercise_PyTorch_Dataset_Dataloader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **TODO:** Download the yelp dataset

## Use PyTorch `Dataset` and `Dataloader` with a structured dataset

In [1]:
import os

import pandas as pd
import torch as pt

from torch import nn
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

pt.set_default_dtype(pt.float64)

Read the files that match `part-*.csv` from the `data` subdirectory into a Pandas data frame named `df`.

In [2]:
import os
import urllib.request
import zipfile

# Define URL and download path
URL = "https://s3.amazonaws.com/courses.axel.net/data_working_with_pytorch.zip"
ZIP_PATH = "data_working_with_pytorch.zip"
EXTRACT_PATH = "data"

# Download the dataset if not already downloaded
if not os.path.exists(ZIP_PATH):
    print("Downloading dataset...")
    urllib.request.urlretrieve(URL, ZIP_PATH)
    print("Download complete.")

# Extract the dataset if not already extracted
if not os.path.exists(EXTRACT_PATH):
    print("Extracting dataset...")
    with zipfile.ZipFile(ZIP_PATH, 'r') as zip_ref:
        zip_ref.extractall(EXTRACT_PATH)
    print("Extraction complete.")


## Explore the `df` data frame, including the column names, the first few rows of the dataset, and the data frame's memory usage.

In [3]:
import pandas as pd
import torch as pt

from torch import nn
from torch.utils.data import DataLoader, TensorDataset

pt.set_default_dtype(pt.float64)  # Ensure consistency in precision

# Read all CSV files matching "part-*.csv" into a single DataFrame
df = pd.concat([pd.read_csv(os.path.join(EXTRACT_PATH, file))
                for file in os.listdir(EXTRACT_PATH) if file.startswith("part-")])

print(f"Dataset loaded with {len(df)} rows and {len(df.columns)} columns.")

# Display column names
print(df.columns)

# Show the first few rows
print(df.head())

# Check memory usage
print(df.info(memory_usage="deep"))


Dataset loaded with 6368133 rows and 7 columns.
Index(['fareamount', 'origindatetime_tr', 'origin_block_latitude',
       'origin_block_longitude', 'destination_block_latitude',
       'destination_block_longitude', 'id'],
      dtype='object')
   fareamount origindatetime_tr  origin_block_latitude  \
0        4.87  06/01/2017 07:00              38.898314   
1       12.70  06/01/2017 14:00              38.904683   
2        5.14  06/01/2017 12:00              38.910635   
3        5.14  06/02/2017 13:00              38.889184   
4       14.32  06/01/2017 13:00              38.901336   

   origin_block_longitude  destination_block_latitude  \
0              -77.028849                   38.902521   
1              -77.046645                   38.940181   
2              -77.042514                   38.909652   
3              -77.021907                   38.897207   
4              -77.037534                   38.942216   

   destination_block_longitude  \
0                   -77.03079

## Drop the `origindatetime_tr` column from the data frame.

For now you are going to predict the taxi fare just based on the lat/lon coordinates of the pickup and the drop off locations. Remove the `origindatetime_tr` column from the data frame in your working dataset.

In [4]:
df = df.drop(columns=["origindatetime_tr"])  # Drop timestamp column


## Sample 10% of your working dataset into a test dataset data frame

* **hint:** use the Pandas `sample` function with the dataframe. Specify a value for the `random_state` to achieve reproducibility.

In [5]:
# Sample 10% for testing
test_df = df.sample(frac=0.1, random_state=42)  # Reproducible results


## Drop the rows that exist in your test dataset from the working dataset to produce a training dataset.

* **hint** DataFrame's `drop` function can use index values from a data frame to drop specific rows.

In [6]:
# Drop test rows from training set
train_df = df.drop(test_df.index)

print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")


Train size: 5177451, Test size: 636813


## Define 2 Python lists: 1st for the feature column names; 2nd for the target column name

In [8]:
# Define feature columns (latitude, longitude) and target column (fare)
feature_columns = ["pickup_lat", "pickup_lon", "dropoff_lat", "dropoff_lon"]
target_column = ["fareamount"]


## Create `X` and `y` tensors with the values of your feature and target columns in the training dataset

In [9]:
# Ensure all feature columns exist
available_features = [col for col in feature_columns if col in train_df.columns]

# Convert train data to tensors
X_train = pt.tensor(train_df[available_features].values, dtype=pt.float64)
y_train = pt.tensor(train_df[target_column].values, dtype=pt.float64)


## Create a `TensorDataset` instance with the `y` and `X` tensors (in that order)

In [10]:
train_dataset = TensorDataset(X_train, y_train)


## Create a `DataLoader` instance specifying a custom batch size

A batch size of `2 ** 18 = 262,144` should work well.

In [11]:
# Define batch size
BATCH_SIZE = 2 ** 18  # 262,144

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)


## Create a model using `nn.Linear`

In [12]:
# Define a simple linear regression model
class TaxiFareModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(len(feature_columns), 1)  # Input features → 1 output (fare)

    def forward(self, x):
        return self.linear(x)

# Instantiate model
model = TaxiFareModel()
model = nn.Linear(X_train.shape[1], 1).double()  # Convert model to float64




## Create an instance of the `AdamW` optimizer for the model

In [13]:
optimizer = pt.optim.AdamW(model.parameters(), lr=0.01)  # AdamW for stability


## Declare your `forward`, `loss` and `metric` functions

* **hint:** if you are tried of computing MSE by hand you can use `nn.functional.mse_loss` instead.

In [14]:
import torch.nn.functional as F

# Loss function (Mean Squared Error)
def loss_fn(y_pred, y_true):
    return F.mse_loss(y_pred, y_true)

# Root Mean Squared Error (RMSE)
def rmse(y_pred, y_true):
    return pt.sqrt(loss_fn(y_pred, y_true))


## Iterate over the batches returned by your `DataLoader` instance

For every step of gradient descent, print out the MSE, RMSE, and the batch index
* **hint:** you can use Python's `enumerable` for an iterable
* **hint:** the batch returned by the `enumerable` has the same contents as your `TensorDataset` instance

In [15]:
for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
    # Forward pass
    y_pred = model(X_batch)

    # Compute MSE loss
    loss = loss_fn(y_pred, y_batch.view(-1, 1))  # Ensure y_batch has correct shape

    # Compute RMSE
    rmse_value = pt.sqrt(loss)

    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Print results
    print(f"Batch {batch_idx}: MSE = {loss.item():.4f}, RMSE = {rmse_value.item():.4f}")


Batch 0: MSE = 115.8499, RMSE = 10.7634
Batch 1: MSE = 115.3742, RMSE = 10.7412
Batch 2: MSE = 114.7101, RMSE = 10.7103
Batch 3: MSE = 114.1856, RMSE = 10.6858
Batch 4: MSE = 114.3140, RMSE = 10.6918
Batch 5: MSE = 114.8712, RMSE = 10.7178
Batch 6: MSE = 114.0713, RMSE = 10.6804
Batch 7: MSE = 113.5932, RMSE = 10.6580
Batch 8: MSE = 113.2199, RMSE = 10.6405
Batch 9: MSE = 113.0383, RMSE = 10.6319
Batch 10: MSE = 113.5653, RMSE = 10.6567
Batch 11: MSE = 113.2965, RMSE = 10.6441
Batch 12: MSE = 113.1279, RMSE = 10.6362
Batch 13: MSE = 112.7565, RMSE = 10.6187
Batch 14: MSE = 112.4849, RMSE = 10.6059
Batch 15: MSE = 112.6724, RMSE = 10.6147
Batch 16: MSE = 112.4410, RMSE = 10.6038
Batch 17: MSE = 112.2282, RMSE = 10.5938
Batch 18: MSE = 111.9377, RMSE = 10.5801
Batch 19: MSE = 111.4893, RMSE = 10.5589


## Implement 10 epochs of gradient descent training

For every step of gradient descent, printout the MSE, RMSE, epoch index, and batch index.

* **hint:** you can call `enumerate(DataLoader)` repeatedly in a `for` loop

In [16]:
EPOCHS = 10  # Number of epochs

for epoch in range(EPOCHS):
    for batch_idx, (X_batch, y_batch) in enumerate(train_loader):
        # Forward pass
        y_pred = model(X_batch)

        # Compute MSE loss
        loss = loss_fn(y_pred, y_batch.view(-1, 1))

        # Compute RMSE
        rmse_value = pt.sqrt(loss)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print results
        print(f"Epoch {epoch + 1}, Batch {batch_idx}: MSE = {loss.item():.4f}, RMSE = {rmse_value.item():.4f}")


Epoch 1, Batch 0: MSE = 111.5720, RMSE = 10.5628
Epoch 1, Batch 1: MSE = 111.0290, RMSE = 10.5370
Epoch 1, Batch 2: MSE = 110.8370, RMSE = 10.5279
Epoch 1, Batch 3: MSE = 110.2292, RMSE = 10.4990
Epoch 1, Batch 4: MSE = 110.6765, RMSE = 10.5203
Epoch 1, Batch 5: MSE = 110.0693, RMSE = 10.4914
Epoch 1, Batch 6: MSE = 110.1186, RMSE = 10.4937
Epoch 1, Batch 7: MSE = 109.5369, RMSE = 10.4660
Epoch 1, Batch 8: MSE = 110.9675, RMSE = 10.5341
Epoch 1, Batch 9: MSE = 110.0242, RMSE = 10.4892
Epoch 1, Batch 10: MSE = 109.6585, RMSE = 10.4718
Epoch 1, Batch 11: MSE = 109.7333, RMSE = 10.4754
Epoch 1, Batch 12: MSE = 109.4427, RMSE = 10.4615
Epoch 1, Batch 13: MSE = 108.9281, RMSE = 10.4369
Epoch 1, Batch 14: MSE = 108.4539, RMSE = 10.4141
Epoch 1, Batch 15: MSE = 109.0394, RMSE = 10.4422
Epoch 1, Batch 16: MSE = 108.5257, RMSE = 10.4176
Epoch 1, Batch 17: MSE = 109.0430, RMSE = 10.4424
Epoch 1, Batch 18: MSE = 107.9345, RMSE = 10.3892
Epoch 1, Batch 19: MSE = 107.7562, RMSE = 10.3806
Epoch 2, B