# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network.
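
To make the idea of a learnable embedding concrete, here is a minimal sketch in plain Python. It is not the library's implementation: each categorical value simply indexes a small vector of numbers, and those vectors are concatenated with the continuous values to form the input of the MLP. The helper names, embedding sizes, and the toy row are assumptions chosen for illustration; in a real model the vector entries are parameters updated during training.

```python
import random

def make_embedding_table(categories, dim, rng):
    """Create one randomly initialised vector of length `dim` per category."""
    return {cat: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for cat in categories}

def build_mlp_input(row, tables, continuous_cols):
    """Concatenate the embedding of each categorical value with the continuous values."""
    features = []
    for col, table in tables.items():
        features.extend(table[row[col]])  # embedding lookup for this column
    features.extend(row[col] for col in continuous_cols)
    return features

rng = random.Random(42)

# Hypothetical toy categories and embedding sizes, for illustration only.
tables = {
    "town": make_embedding_table(["ANG MO KIO", "BEDOK"], dim=3, rng=rng),
    "storey_range": make_embedding_table(["01 TO 03", "04 TO 06"], dim=2, rng=rng),
}
row = {"town": "BEDOK", "storey_range": "01 TO 03", "floor_area_sqm": 67.0}

x = build_mlp_input(row, tables, continuous_cols=["floor_area_sqm"])
print(len(x))  # 3 + 2 + 1 = 6 input features
```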

In [1]:
!pip install pytorch-widedeep --user






In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset ('hdb_price_prediction.csv') into train and test sets, using entries from 2020 and earlier as training data and entries from 2021 onwards as test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# YOUR CODE HERE

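# Time-based split: 2020 and earlier for training, 2021 onwards for testing.
df_train = df[df["year"] <= 2020]
df_test = df[df["year"] >= 2021]

# Sanity check: list the years that ended up in each split.
print("Training data : ", df_train["year"].unique())
print("Testing data : ", df_test["year"].unique())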

Training data :  [2017 2018 2019 2020]
Testing data :  [2021 2022 2023]


2. Refer to the documentation of Pytorch-WideDeep (https://pytorch-widedeep.readthedocs.io/en/latest/index.html) and perform the following tasks:
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)
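
As a rough guide to the size of the network described above, the sketch below counts the weights and biases of a stack of fully connected layers with 200 and 100 hidden units and a single output. It assumes plain linear layers with biases and a one-unit regression head; the input width of 30 is a hypothetical placeholder, since the actual width depends on the embedding sizes the library assigns to each categorical column. The function name is ours, not part of Pytorch-WideDeep.

```python
def mlp_param_count(input_dim, hidden_dims, output_dim=1):
    """Count weights and biases of fully connected layers input -> hidden... -> output."""
    dims = [input_dim] + list(hidden_dims) + [output_dim]
    total = 0
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

# input_dim=30 is an assumed value for illustration, not read from the model.
n_params = mlp_param_count(input_dim=30, hidden_dims=[200, 100])
print(n_params)  # 6200 + 20100 + 101 = 26401
```

The embedding tables themselves add further parameters that this count does not include.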

In [4]:
# YOUR CODE & RESULT HERE

# Categorical features (embedded) and continuous features (used as numeric inputs).
cat_embed_cols = ["month", "town", "flat_model_type", "storey_range"]
continuous_cols = [
    "dist_to_nearest_stn", "dist_to_dhoby", "degree_centrality",
    "eigenvector_centrality", "remaining_lease_years", "floor_area_sqm",
]

# Use TabPreprocessor to create the deeptabular component from the continuous and categorical features.
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=cat_embed_cols,
    continuous_cols=continuous_cols,
    # cols_to_scale=continuous_cols,  # optional scaling, left disabled here
)

# Fit the preprocessor on the training set only, then transform it.
X_tab = tab_preprocessor.fit_transform(df_train)  # x_train
target = df_train["resale_price"].values  # y_train

# Create the TabMlp model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
model = TabMlp(
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols,
    mlp_hidden_dims=[200, 100],
)

# Wrap the TabMlp as the deeptabular component of a WideDeep model (no other components are used here).
wide_deep_model = WideDeep(deeptabular=model)

# Create a Trainer with the root mean squared error (RMSE) cost function; R2 is tracked as a metric.
Trainer_ = Trainer(
    wide_deep_model,
    objective="root_mean_squared_error",
    metrics=[R2Score],
    num_workers=0,
)

# Train the model for 100 epochs with a batch size of 64.
Trainer_.fit(X_tab=X_tab, target=target, n_epochs=100, batch_size=64)


epoch 1: 100%|██████████| 1366/1366 [00:09<00:00, 150.83it/s, loss=1.86e+5, metrics={'r2': -1.2865}]
epoch 2: 100%|██████████| 1366/1366 [00:09<00:00, 151.54it/s, loss=9.98e+4, metrics={'r2': 0.4848}]
epoch 3: 100%|██████████| 1366/1366 [00:09<00:00, 144.39it/s, loss=7.93e+4, metrics={'r2': 0.6829}]
epoch 4: 100%|██████████| 1366/1366 [00:09<00:00, 138.86it/s, loss=6.69e+4, metrics={'r2': 0.7916}]
epoch 5: 100%|██████████| 1366/1366 [00:09<00:00, 144.35it/s, loss=6.18e+4, metrics={'r2': 0.8273}]
epoch 6: 100%|██████████| 1366/1366 [00:09<00:00, 138.98it/s, loss=5.97e+4, metrics={'r2': 0.8401}]
epoch 7: 100%|██████████| 1366/1366 [00:10<00:00, 126.46it/s, loss=5.87e+4, metrics={'r2': 0.8455}]
epoch 8: 100%|██████████| 1366/1366 [00:11<00:00, 115.34it/s, loss=5.78e+4, metrics={'r2': 0.8499}]
epoch 9: 100%|██████████| 1366/1366 [00:10<00:00, 130.90it/s, loss=5.68e+4, metrics={'r2': 0.8555}]
epoch 10: 100%|██████████| 1366/1366 [00:10<00:00, 127.33it/s, loss=5.62e+4, metrics={'r2': 0.8584}
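
The `1366/1366` counter in each progress bar is the number of mini-batches per epoch. Assuming the final partial batch is kept rather than dropped, that count is the number of training rows divided by the batch size, rounded up. The sketch below works through the arithmetic; the helper names are ours, and because the exact row count is not printed above, only the range of row counts consistent with 1366 batches of 64 is derived.

```python
import math

def batches_per_epoch(n_rows, batch_size):
    """Number of mini-batches when the last, possibly smaller, batch is kept."""
    return math.ceil(n_rows / batch_size)

def row_range_for(n_batches, batch_size):
    """Smallest and largest row counts that yield exactly `n_batches` batches."""
    return (n_batches - 1) * batch_size + 1, n_batches * batch_size

low, high = row_range_for(1366, 64)
print(low, high)  # 87361 87424
```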

3. Report the test RMSE and the test R2 value that you obtained.

In [5]:
# YOUR CODE & RESULT HERE
import math
from sklearn.metrics import r2_score, mean_squared_error

# Apply the preprocessor fitted on the training set to the test set (no refitting).
x_test = tab_preprocessor.transform(df_test)
y_test = df_test["resale_price"].values

predictions = Trainer_.predict(X_tab=x_test, batch_size=64)

# Compare the predictions against the true test prices.
print("RMSE : ", math.sqrt(mean_squared_error(y_test, predictions)))
print("R2 : ", r2_score(y_test, predictions))

predict: 100%|██████████| 1128/1128 [00:02<00:00, 379.87it/s]

RMSE :  100623.95106220947
R2 :  0.6462460788447137
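
For reference, both reported metrics can be written out from their definitions. The sketch below computes RMSE and R2 in plain Python on a made-up four-point example (the prices are illustrative, not taken from the dataset). It is meant to mirror what `mean_squared_error` followed by a square root, and `r2_score`, compute for a single-output regression, not to replace them.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up prices for illustration only.
y_true = [300_000.0, 400_000.0, 500_000.0, 600_000.0]
y_pred = [310_000.0, 390_000.0, 520_000.0, 580_000.0]

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(round(toy_rmse, 2), round(toy_r2, 2))  # 15811.39 0.98
```

Because R2 compares the squared error with the variance of the targets in the set being evaluated, a negative value such as the -1.2865 logged in epoch 1 means the model is doing worse than always predicting the mean.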



