# Question B2 (10 marks)
In Question B1, we used the Category Embedding model, which creates a feedforward neural network in which each categorical feature gets a learnable embedding. In this question, we use a library called Pytorch-WideDeep, which makes it easy to work with multimodal deep-learning problems that combine images, text, and tables. We use only the deeptabular component of this library, through the TabMlp network.
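As a rough illustration of what "learnable embeddings feeding an MLP" means, the sketch below runs a forward pass in plain NumPy. It is not the TabMlp implementation: the sizes, the random weights, the omission of biases, and the names `embedding_table` and `forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one categorical feature with 4 levels
# and 2 continuous features.
n_categories, embed_dim, n_continuous = 4, 3, 2

# The embedding table is a trainable matrix with one row per category level.
embedding_table = rng.normal(size=(n_categories, embed_dim))

# Weights for two hidden layers (shrunk to 8 and 4 units for readability)
# and a single regression output. Biases are omitted to keep the sketch short.
w1 = rng.normal(size=(embed_dim + n_continuous, 8))
w2 = rng.normal(size=(8, 4))
w_out = rng.normal(size=(4, 1))


def forward(cat_ids, cont_values):
    """Look up embeddings, concatenate with continuous inputs, apply the MLP."""
    embedded = embedding_table[cat_ids]                 # (batch, embed_dim)
    x = np.concatenate([embedded, cont_values], axis=1)  # (batch, embed_dim + n_continuous)
    h1 = np.maximum(x @ w1, 0.0)                        # ReLU hidden layer 1
    h2 = np.maximum(h1 @ w2, 0.0)                       # ReLU hidden layer 2
    return h2 @ w_out                                   # (batch, 1)


cat_ids = np.array([0, 2, 2])
cont_values = np.array([[0.5, 1.0], [0.1, -0.3], [0.1, -0.3]])
preds = forward(cat_ids, cont_values)
print(preds.shape)
```

Rows that share the same category id and the same continuous values produce the same prediction, because the category id only enters the network through its embedding row. During training, that row is updated by gradient descent like any other weight.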

In [1]:
!pip install pytorch-widedeep



In [2]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_widedeep.preprocessing import TabPreprocessor
from pytorch_widedeep.models import TabMlp, WideDeep
from pytorch_widedeep import Trainer
from pytorch_widedeep.metrics import R2Score

1. Divide the dataset (‘hdb_price_prediction.csv’) into train and test sets by using entries from the year 2020 and before as training data, and entries from 2021 and after as the test data.

In [4]:
df = pd.read_csv('hdb_price_prediction.csv')

train_df = df[df['year'] <= 2020]  
test_df = df[df['year'] >= 2021]  
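A year-based split like the one above can be sanity-checked on a small frame. The sketch below uses made-up rows (the values and the names `toy_df` and `cutoff_year` are hypothetical, not taken from the dataset) to confirm that the two masks are disjoint, cover every row, and keep all test rows strictly later than all training rows.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from hdb_price_prediction.csv.
toy_df = pd.DataFrame(
    {
        "year": [2019, 2020, 2020, 2021, 2022],
        "resale_price": [300000, 320000, 410000, 450000, 500000],
    }
)

cutoff_year = 2020  # last year that belongs to the training set

toy_train = toy_df[toy_df["year"] <= cutoff_year]
toy_test = toy_df[toy_df["year"] > cutoff_year]

# Every row lands in exactly one of the two sets.
assert len(toy_train) + len(toy_test) == len(toy_df)
# No temporal leakage: the newest training row predates the oldest test row.
assert toy_train["year"].max() < toy_test["year"].min()

print(len(toy_train), len(toy_test))  # 3 2
```

For an integer `year` column, `> 2020` and `>= 2021` select the same rows. Writing both masks against a single cutoff simply makes it harder for the two conditions to drift apart.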

2. Refer to the documentation of Pytorch-WideDeep and perform the following tasks:
https://pytorch-widedeep.readthedocs.io/en/latest/index.html
* Use [**TabPreprocessor**](https://pytorch-widedeep.readthedocs.io/en/latest/examples/01_preprocessors_and_utils.html#2-tabpreprocessor) to create the deeptabular component using the continuous features and the categorical features. Use this component to transform the training dataset.
* Create the [**TabMlp**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/model_components.html#pytorch_widedeep.models.tabular.mlp.tab_mlp.TabMlp) model with 2 linear layers in the MLP, with 200 and 100 neurons respectively.
* Create a [**Trainer**](https://pytorch-widedeep.readthedocs.io/en/latest/pytorch-widedeep/trainer.html#pytorch_widedeep.training.Trainer) for the training of the created TabMlp model with the root mean squared error (RMSE) cost function. Train the model for 100 epochs using this trainer, keeping a batch size of 64. (Note: set the *num_workers* parameter to 0.)

In [5]:
# Columns that receive learnable embeddings vs. columns fed in as numeric values
categorical_cols = ['month', 'town', 'flat_model_type', 'storey_range']
continuous_cols = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality',
                   'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']

# Create the TabPreprocessor and fit it on the training data only
tab_preprocessor = TabPreprocessor(
    cat_embed_cols=categorical_cols,
    continuous_cols=continuous_cols
)
X_tab = tab_preprocessor.fit_transform(train_df)

# Build the TabMlp model: two hidden layers with 200 and 100 neurons
tabmlp = TabMlp(
    mlp_hidden_dims=[200, 100],
    column_idx=tab_preprocessor.column_idx,
    cat_embed_input=tab_preprocessor.cat_embed_input,
    continuous_cols=continuous_cols
)

# Wrap the deeptabular component in a WideDeep model
model = WideDeep(deeptabular=tabmlp)

# Create the Trainer with the RMSE cost function
trainer = Trainer(
    model=model,
    cost_function="rmse",
    num_workers=0
)

no_epochs = 100
batch_size = 64

# Train on the preprocessed training set
trainer.fit(
    X_tab=X_tab,
    target=train_df['resale_price'].values,
    n_epochs=no_epochs,
    batch_size=batch_size
)

epoch 1: 100%|██████████| 1366/1366 [00:10<00:00, 135.44it/s, loss=1.81e+5]
epoch 2: 100%|██████████| 1366/1366 [00:22<00:00, 61.11it/s, loss=9.88e+4]
epoch 3: 100%|██████████| 1366/1366 [00:11<00:00, 116.24it/s, loss=7.69e+4]
epoch 4: 100%|██████████| 1366/1366 [00:15<00:00, 88.84it/s, loss=6.45e+4] 
epoch 5: 100%|██████████| 1366/1366 [00:09<00:00, 151.75it/s, loss=5.98e+4]
epoch 6: 100%|██████████| 1366/1366 [00:08<00:00, 166.37it/s, loss=5.76e+4]
epoch 7: 100%|██████████| 1366/1366 [00:07<00:00, 171.75it/s, loss=5.59e+4]
epoch 8: 100%|██████████| 1366/1366 [00:08<00:00, 156.65it/s, loss=5.47e+4]
epoch 9: 100%|██████████| 1366/1366 [00:08<00:00, 167.31it/s, loss=5.34e+4]
epoch 10: 100%|██████████| 1366/1366 [00:08<00:00, 169.34it/s, loss=5.22e+4]
epoch 11: 100%|██████████| 1366/1366 [00:08<00:00, 159.15it/s, loss=5.12e+4]
epoch 12: 100%|██████████| 1366/1366 [00:08<00:00, 158.53it/s, loss=5.02e+4]
epoch 13: 100%|██████████| 1366/1366 [00:09<00:00, 142.01it/s, loss=4.92e+4]
epoch 14:
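The progress bars above report 1366 iterations per epoch at a batch size of 64, which bounds the number of training rows. The short calculation below works out that range. It assumes the final, possibly smaller, batch is kept rather than dropped, which the source does not state explicitly.

```python
import math

batch_size = 64
batches_per_epoch = 1366  # read off the progress bars in the training log

# Assumption: the last partial batch is kept, so batches = ceil(rows / batch_size).
max_rows = batches_per_epoch * batch_size
min_rows = (batches_per_epoch - 1) * batch_size + 1

# Both ends of the range reproduce the observed number of batches.
assert math.ceil(min_rows / batch_size) == batches_per_epoch
assert math.ceil(max_rows / batch_size) == batches_per_epoch

print(min_rows, max_rows)  # 87361 87424
```

Under that assumption, the training set holds between 87,361 and 87,424 rows, and 100 epochs correspond to 136,600 optimizer steps.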

3. Report the test RMSE and the test R2 value that you obtained.

In [6]:

from sklearn.metrics import mean_squared_error, r2_score

X_tab_test = tab_preprocessor.transform(test_df)

y_pred = trainer.predict(X_tab=X_tab_test)
y_true = test_df['resale_price'] 

print('RMSE & R2')

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE is the square root of the MSE; avoids the deprecated squared=False argument
print(f'Test RMSE: {rmse}')

r2 = r2_score(y_true, y_pred) 
print(f'Test R2: {r2}')

predict: 100%|██████████| 1128/1128 [00:02<00:00, 396.10it/s]

RMSE & R2
Test RMSE: 100470.89891734322
Test R2: 0.6473214017550503
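To make the two reported metrics concrete, the sketch below computes RMSE and R2 directly from their definitions in NumPy. The helper names `rmse_by_hand` and `r2_by_hand` and the toy numbers are hypothetical; they illustrate the formulas and do not reproduce the test results above.

```python
import numpy as np


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Made-up values chosen so the arithmetic is easy to follow.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 310.0]

print(rmse_by_hand(toy_true, toy_pred))  # 10.0
print(r2_by_hand(toy_true, toy_pred))
```

RMSE is expressed in the units of the target (here, resale price), while R2 is unitless and compares the model against always predicting the mean of the test targets. The two figures are therefore best read together.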



