# CS4001/4042 Assignment 1, Part B, Q2
In Question B1, we used the Category Embedding model, which builds a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We will use only the deeptabular component of this library, through the TabMlp network.
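To make "learnable embeddings" concrete, the sketch below walks through the lookup in plain Python: a categorical value is mapped to an integer index, the index selects a row of a weight table, and that row is concatenated with the continuous features to form the MLP input. This is an illustration only, not the Category Embedding or TabMlp implementation. The toy rows, `vocab`, `embedding_table`, and `encode` are hypothetical names, and the random weights stand in for parameters that a real model would update during training.

```python
import random

rng = random.Random(42)

# Hypothetical toy rows: one categorical feature and one continuous feature.
rows = [
    {"town": "ANG MO KIO", "floor_area_sqm": 67.0},
    {"town": "BEDOK", "floor_area_sqm": 92.0},
    {"town": "ANG MO KIO", "floor_area_sqm": 110.0},
]

# Step 1: label-encode the categorical values.
# In this sketch, index 0 is reserved for values not seen when building the vocabulary.
vocab = {"ANG MO KIO": 1, "BEDOK": 2}

# Step 2: one embedding row per index. In a real model these weights are
# trainable parameters; here they are just fixed random numbers.
embed_dim = 3
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(embed_dim)]
    for _ in range(len(vocab) + 1)
]

def encode(row):
    """Look up the town embedding and append the continuous value."""
    idx = vocab.get(row["town"], 0)
    return embedding_table[idx] + [row["floor_area_sqm"]]

# Step 3: these vectors are what the MLP layers would receive as input.
mlp_inputs = [encode(r) for r in rows]
print(len(mlp_inputs[0]))
```

Each input vector has `embed_dim + 1` entries, and rows that share a town share the same embedding part, which is what lets the network learn one representation per category.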

In [1]:
!pip install pytorch-widedeep

Collecting pytorch-widedeep
  Downloading pytorch_widedeep-1.3.2-py3-none-any.whl (21.8 MB)
Collecting einops (from pytorch-widedeep)
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
Collecting torchmetrics (from pytorch-widedeep)
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
Collecting fastparquet>=0.8.1 (from pytorch-widedeep)
  Downloading fastparquet-2023.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting cramjam>=2.3 (from fastparquet>=0.8.1->pytorch-wided

In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score



>Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [3]:
df = pd.read_csv('hdb_price_prediction.csv')

# TODO: Enter your code here
# year 2020 and before as training data
df_train = df[df['year'] <= 2020]
# entries from 2021 and after as the test data
df_test = df[df['year'] >= 2021]

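A quick sanity check for a year-based split like the one above can be sketched on a toy frame. The data here are hypothetical and only the `year` and `resale_price` column names mirror the dataset; the checks assume `year` holds integers with no missing values.

```python
import pandas as pd

# Hypothetical toy frame, not rows from hdb_price_prediction.csv.
toy = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2022],
    "resale_price": [300_000, 320_000, 410_000, 450_000, 500_000],
})

toy_train = toy[toy["year"] <= 2020]
toy_test = toy[toy["year"] >= 2021]

# For integer years the two masks are complementary: no row is lost or shared.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train.index.intersection(toy_test.index).empty

# Every training year precedes every test year, so the test set is strictly "future" data.
assert toy_train["year"].max() < toy_test["year"].min()

print(len(toy_train), len(toy_test))
```

The same three assertions could be run on `df_train` and `df_test`, provided the real `year` column satisfies the assumptions above.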


>Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous
features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [4]:
# TODO: Enter your code here

# Reference: https://towardsdatascience.com/pytorch-widedeep-deep-learning-for-tabular-data-9cd1c48eb40d
# Target from the train dataset
target = df_train['resale_price'].values

# taken from my B1
# continuous_columns: List[str]: Column names of the numeric fields. Defaults to []
# Numeric / Continuous features given: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
continuous_columns = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

# categorical_columns: List[str]: Column names of the categorical fields to treat differently
# Categorical features given: month, town, flat_model_type, storey_range
categorical_columns = ['month', 'town', 'flat_model_type', 'storey_range']

# create the deeptabular component using the continuous features and the categorical features
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_columns, continuous_cols=continuous_columns
)

# transform the training dataset
X_tab = tab_preprocessor.fit_transform(df_train)

# Create model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
tab_mlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_columns
)

# Create a Trainer for the training of the created TabMlp model with
# cost function = root mean squared error (RMSE)
# choose R2Score as the metric for next part
# num_workers = 0
model = WideDeep(deeptabular=tab_mlp)
trainer = Trainer(model, cost_function="rmse", metrics=[R2Score], num_workers=0)
# Train model for 100 epochs, batch size of 64
trainer.fit(
    X_tab=X_tab, # Input for the deeptabular model component
    target=target,
    n_epochs=100, # number of epochs
    batch_size=64, # batch size
)

epoch 1: 100%|██████████| 1366/1366 [00:22<00:00, 60.98it/s, loss=2.31e+5, metrics={'r2': -2.2724}]
epoch 2: 100%|██████████| 1366/1366 [00:18<00:00, 73.07it/s, loss=9.59e+4, metrics={'r2': 0.5559}] 
epoch 3: 100%|██████████| 1366/1366 [00:11<00:00, 116.28it/s, loss=8.25e+4, metrics={'r2': 0.6865}]
epoch 4: 100%|██████████| 1366/1366 [00:12<00:00, 113.81it/s, loss=7.71e+4, metrics={'r2': 0.7314}]
epoch 5: 100%|██████████| 1366/1366 [00:11<00:00, 115.10it/s, loss=7.41e+4, metrics={'r2': 0.7538}]
epoch 6: 100%|██████████| 1366/1366 [00:11<00:00, 118.01it/s, loss=7.22e+4, metrics={'r2': 0.7674}]
epoch 7: 100%|██████████| 1366/1366 [00:11<00:00, 122.09it/s, loss=7.09e+4, metrics={'r2': 0.7759}]
epoch 8: 100%|██████████| 1366/1366 [00:12<00:00, 109.88it/s, loss=6.94e+4, metrics={'r2': 0.786}]
epoch 9: 100%|██████████| 1366/1366 [00:12<00:00, 112.29it/s, loss=6.87e+4, metrics={'r2': 0.7905}]
epoch 10: 100%|██████████| 1366/1366 [00:11<00:00, 114.50it/s, loss=6.79
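The progress bars above show 1366 iterations per epoch at a batch size of 64, which bounds the number of training rows. The arithmetic below is a rough check under two assumptions not stated in the notebook: that all training rows are used (no validation split was passed to `fit`) and that the final partial batch is kept rather than dropped.

```python
import math

batch_size = 64
batches_per_epoch = 1366  # read off the training progress bar

# If the last partial batch is kept, batches = ceil(n_rows / batch_size),
# so n_rows lies in ((batches - 1) * batch_size, batches * batch_size].
min_rows = (batches_per_epoch - 1) * batch_size + 1
max_rows = batches_per_epoch * batch_size

# Both ends of the range reproduce the observed number of batches.
assert math.ceil(min_rows / batch_size) == batches_per_epoch
assert math.ceil(max_rows / batch_size) == batches_per_epoch

print(min_rows, max_rows)
```

Comparing this range with `len(df_train)` is a cheap way to confirm that the trainer saw the whole training set.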

>Report the test RMSE and the test R2 value that you obtained.

In [5]:
# TODO: Enter your code here
import math
import sklearn.metrics

# Transform the test data
X_test_tab = tab_preprocessor.transform(df_test)

# Predict the resale prices for the test data
y_pred = trainer.predict(X_tab=X_test_tab)

print(f"Test RMSE: {math.sqrt(sklearn.metrics.mean_squared_error(df_test['resale_price'], y_pred))}")
print(f"Test R2: {sklearn.metrics.r2_score(df_test['resale_price'], y_pred)}")

predict: 100%|██████████| 1128/1128 [00:04<00:00, 263.65it/s]


Test RMSE: 97868.21495280517
Test R2: 0.6653569079110121
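For reference, the two reported metrics can be written out from their definitions. The sketch below uses plain Python and hypothetical toy prices rather than the notebook's predictions; `rmse` and `r2` are illustrative helper names, intended to follow the standard definitions of root mean squared error and the coefficient of determination.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy prices, not taken from the HDB dataset.
y_true = [300_000.0, 450_000.0, 600_000.0, 750_000.0]
y_pred = [320_000.0, 430_000.0, 640_000.0, 710_000.0]

toy_rmse = rmse(y_true, y_pred)
toy_r2 = r2(y_true, y_pred)
print(f"toy RMSE: {toy_rmse:.2f}")
print(f"toy R2: {toy_r2:.4f}")
```

RMSE is in the same units as the target (here, resale price), while R2 is unitless: a model that always predicts the mean of `y_true` scores exactly 0, and a perfect model scores 1.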
