
# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which the categorical features get learnable embeddings. In this question, we will use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network.

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.6.3-py3-none-any.whl.metadata (10 kB)
Collecting scipy<=1.12.0,>=1.7.3 (from pytorch-widedeep)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting torchmetrics (from pytorch-widedeep)
  Downloading torchmetrics-1.4.2-py3-none-any.whl.metadata (19 kB)
Collecting fastparquet>=0.8.1 (from pytorch-widedeep)
  Downloading fastparquet-2024.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting sentence-transformers (from pytorch-widedeep)
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting cramjam>=2.3 (from fastparquet>=0.8.1->pytorch-widedeep)
  Downloading cramjam-2.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.6 kB)
Collecting lightning-utilities>=0.

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score



1. Divide the dataset (‘hdb_price_prediction.csv’) into training and test sets, using entries from 2020 and earlier as training data and entries from 2021 onwards as test data.

In [4]:
df = pd.read_csv('hdb_price_prediction.csv')

df = df.drop(['full_address', 'nearest_stn'], axis=1)

train_df = df[df.year <= 2020]
test_df = df[df.year >= 2021]
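
As a quick sanity check on this kind of year-based split, the sketch below applies the same filter to a small made-up DataFrame and confirms that the two partitions are disjoint and together cover every row. The `toy_*` names and all values are invented for illustration and are not taken from `hdb_price_prediction.csv`.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration).
toy_df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021, 2022],
    "resale_price": [300000, 320000, 350000, 400000, 420000],
})

# Same rule as above: 2020 and earlier for training, 2021 onwards for testing.
toy_train = toy_df[toy_df.year <= 2020]
toy_test = toy_df[toy_df.year >= 2021]

# The partitions should share no rows and should cover the whole frame.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```

Running the same two checks on `train_df` and `test_df` would be one way to confirm that no rows are lost or duplicated by the split.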

2. Refer to the documentation of Pytorch-WideDeep (https://pytorch-widedeep.readthedocs.io/en/latest/index.html) and perform the following tasks:
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [5]:
continuous_var = ["dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality", "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm"]
categorical_var = ["month", "town", "flat_model_type", "storey_range"]

tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_var,
    continuous_cols=continuous_var
)

X_tab = tab_preprocessor.fit_transform(train_df)

tab_mlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_var
)
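
Conceptually, a model of this kind looks up an embedding vector for each categorical value, concatenates it with the continuous features, and passes the result through the MLP. The NumPy sketch below illustrates that data flow with random, untrained weights. The single categorical column, the embedding size of 4, the ReLU activation, and the helper `dense_relu` are assumptions chosen for illustration; they do not reflect TabMlp's defaults or its internal implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, n_categories, embed_dim, n_cont = 8, 5, 4, 6

# One categorical column encoded as integer ids, plus six continuous columns.
cat_ids = rng.integers(0, n_categories, size=n_rows)
cont = rng.normal(size=(n_rows, n_cont))

# Embedding table: one vector per category (random here, learned in a real model).
embedding = rng.normal(size=(n_categories, embed_dim))

# Look up each row's embedding and concatenate it with the continuous features.
x = np.concatenate([embedding[cat_ids], cont], axis=1)

def dense_relu(inputs, out_dim):
    """Apply a randomly initialised linear layer followed by ReLU."""
    weights = rng.normal(scale=0.1, size=(inputs.shape[1], out_dim))
    return np.maximum(inputs @ weights, 0.0)

# Two hidden layers with 200 and 100 units, then a single regression output.
h = dense_relu(dense_relu(x, 200), 100)
y_hat = h @ rng.normal(scale=0.1, size=(100, 1))

print(x.shape, h.shape, y_hat.shape)
```

The hidden widths 200 and 100 mirror `mlp_hidden_dims=[200, 100]`; everything else in the sketch is a simplification.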



In [6]:
model = WideDeep(deeptabular=tab_mlp)

trainer = Trainer(model, objective="rmse", num_workers=0)

trainer.fit(
    X_tab=X_tab,
    target=train_df['resale_price'].values,
    n_epochs=100,
    batch_size=64
)

epoch 1: 100%|██████████| 1366/1366 [00:29<00:00, 47.06it/s, loss=1.8e+5]
epoch 2: 100%|██████████| 1366/1366 [00:18<00:00, 73.91it/s, loss=1.01e+5]
epoch 3: 100%|██████████| 1366/1366 [00:18<00:00, 74.98it/s, loss=7.94e+4]
epoch 4: 100%|██████████| 1366/1366 [00:18<00:00, 72.31it/s, loss=6.51e+4]
epoch 5: 100%|██████████| 1366/1366 [00:20<00:00, 68.25it/s, loss=5.95e+4]
epoch 6: 100%|██████████| 1366/1366 [00:20<00:00, 68.05it/s, loss=5.74e+4]
epoch 7: 100%|██████████| 1366/1366 [00:18<00:00, 73.38it/s, loss=5.63e+4]
epoch 8: 100%|██████████| 1366/1366 [00:18<00:00, 74.25it/s, loss=5.52e+4]
epoch 9: 100%|██████████| 1366/1366 [00:19<00:00, 71.57it/s, loss=5.43e+4]
epoch 10: 100%|██████████| 1366/1366 [00:19<00:00, 71.86it/s, loss=5.34e+4]
epoch 11: 100%|██████████| 1366/1366 [00:18<00:00, 71.97it/s, loss=5.27e+4]
epoch 12: 100%|██████████| 1366/1366 [00:18<00:00, 73.25it/s, loss=5.2e+4]
epoch 13: 100%|██████████| 1366/1366 [00:20<00:00, 67.97it/s, loss=5.14e+4]
epoch 14: 100%|████████

3. Report the test RMSE and the test R2 value that you obtained.

In [7]:
from sklearn.metrics import mean_squared_error, r2_score

# Inference: reuse the preprocessor fitted on the training data to transform the test set
X_tab_test = tab_preprocessor.transform(test_df)
predictions = trainer.predict(X_tab=X_tab_test, batch_size=64)

y_labels = test_df['resale_price'].values.tolist()
# Taking the square root of the mean squared error gives the RMSE
rmse = np.sqrt(mean_squared_error(y_labels, predictions))
r_squared = r2_score(y_labels, predictions)

print("RMSE", rmse)
print("R Squared", r_squared)

predict: 100%|██████████| 1128/1128 [00:10<00:00, 104.23it/s]


RMSE 100923.36314600529
R Squared 0.6441376209259033
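
For reference, both reported metrics can be computed directly from their definitions: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses small made-up arrays rather than the model's predictions, so its numbers are unrelated to the results above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, r2)
```

Because RMSE is expressed in the units of the target, the test RMSE above is in the same units as `resale_price`, while R2 is unitless.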
