# Question B1 (15 marks)

Real-world datasets often have a mix of numeric and categorical features, and this dataset is one example. To build models on such data, categorical features have to be encoded or embedded.

PyTorch Tabular is a library that makes it very convenient to build neural networks for tabular data. It is built on top of PyTorch Lightning, which abstracts away boilerplate model training code and makes it easy to integrate other tools, e.g. TensorBoard for experiment tracking.

For questions B1 and B2, the following features should be used:   
- **Numeric / Continuous** features: dist_to_nearest_stn, dist_to_dhoby, degree_centrality, eigenvector_centrality, remaining_lease_years, floor_area_sqm
- **Categorical** features: month, town, flat_model_type, storey_range
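
As a rough illustration of what "encoded" means for the categorical features listed above, the sketch below maps each distinct category to an integer index, which is the form an embedding layer looks up. It is a plain-Python illustration only: the helper names `fit_category_index` and `encode`, the reserved `UNKNOWN_INDEX`, and the toy town lists are assumptions made for this example, and PyTorch Tabular performs its own encoding internally, which may differ in detail.

```python
from typing import Dict, Hashable, Iterable, List

# Index reserved for categories never seen in training (an assumption of this sketch).
UNKNOWN_INDEX = 0


def fit_category_index(values: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct training category to an integer, starting at 1."""
    mapping: Dict[Hashable, int] = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def encode(values: Iterable[Hashable], mapping: Dict[Hashable, int]) -> List[int]:
    """Encode values with a fitted mapping; unseen categories get UNKNOWN_INDEX."""
    return [mapping.get(value, UNKNOWN_INDEX) for value in values]


# Toy data: the mapping is fitted on training values only.
train_towns = ["ANG MO KIO", "BISHAN", "ANG MO KIO", "SENGKANG"]
test_towns = ["SENGKANG", "QUEENSTOWN"]

town_index = fit_category_index(train_towns)
encoded_train = encode(train_towns, town_index)
encoded_test = encode(test_towns, town_index)
```

Fitting the mapping on the training split alone mirrors how the model is trained here: a town that appears only in later years has no learned embedding, so it falls back to the reserved index.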



---



In [24]:
pip install pytorch-tabular



In [25]:
SEED = 42

import os

import random
random.seed(SEED)

import numpy as np
np.random.seed(SEED)

import pandas as pd

from pytorch_tabular import TabularModel
from pytorch_tabular.models import CategoryEmbeddingModelConfig
from pytorch_tabular.config import (
    DataConfig,
    OptimizerConfig,
    TrainerConfig,
)

1. Divide the dataset (‘hdb_price_prediction.csv’) into train, validation and test sets by using entries from year 2019 and before as training data, year 2020 as validation data and year 2021 as test data.
**Do not** use data from year 2022 and year 2023.



In [26]:
df = pd.read_csv('hdb_price_prediction.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159553 entries, 0 to 159552
Data columns (total 14 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   month                   159553 non-null  int64  
 1   year                    159553 non-null  int64  
 2   town                    159553 non-null  object 
 3   full_address            159553 non-null  object 
 4   nearest_stn             159553 non-null  object 
 5   dist_to_nearest_stn     159553 non-null  float64
 6   dist_to_dhoby           159553 non-null  float64
 7   degree_centrality       159553 non-null  float64
 8   eigenvector_centrality  159553 non-null  float64
 9   flat_model_type         159553 non-null  object 
 10  remaining_lease_years   159553 non-null  float64
 11  floor_area_sqm          159553 non-null  float64
 12  storey_range            159553 non-null  object 
 13  resale_price            159553 non-null  float64
dtypes: float64(7), int64

In [27]:
df.head()

Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
0,1,2017,ANG MO KIO,406 ANG MO KIO AVENUE 10,Ang Mo Kio,1.007264,7.006044,0.016807,0.006243,"2 ROOM, Improved",61.333333,44.0,10 TO 12,232000.0
1,1,2017,ANG MO KIO,108 ANG MO KIO AVENUE 4,Ang Mo Kio,1.271389,7.983837,0.016807,0.006243,"3 ROOM, New Generation",60.583333,67.0,01 TO 03,250000.0
2,1,2017,ANG MO KIO,602 ANG MO KIO AVENUE 5,Yio Chu Kang,1.069743,9.0907,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,262000.0
3,1,2017,ANG MO KIO,465 ANG MO KIO AVENUE 10,Ang Mo Kio,0.94689,7.519889,0.016807,0.006243,"3 ROOM, New Generation",62.083333,68.0,04 TO 06,265000.0
4,1,2017,ANG MO KIO,601 ANG MO KIO AVENUE 5,Yio Chu Kang,1.092551,9.130489,0.016807,0.002459,"3 ROOM, New Generation",62.416667,67.0,01 TO 03,265000.0


In [28]:
# YOUR CODE HERE
df = pd.read_csv('hdb_price_prediction.csv')

# Training Data Set: Year 2019 and before
df_train = df[df['year'] <= 2019].copy()
# Validation Data Set: Year 2020
df_validation = df[df['year'] == 2020].copy()
# Testing Data Set: Year 2021
df_test = df[df['year'] == 2021].copy()


2. Refer to the documentation of **PyTorch Tabular** and perform the following tasks: https://pytorch-tabular.readthedocs.io/en/latest/#usage
- Use **[DataConfig](https://pytorch-tabular.readthedocs.io/en/latest/data/)** to define the target variable, as well as the names of the continuous and categorical variables.
- Use **[TrainerConfig](https://pytorch-tabular.readthedocs.io/en/latest/training/)** to automatically tune the learning rate. Set batch_size to 1024 and max_epochs to 50.
- Use **[CategoryEmbeddingModelConfig](https://pytorch-tabular.readthedocs.io/en/latest/models/#category-embedding-model)** to create a feedforward neural network with 1 hidden layer containing 50 neurons.
- Use **[OptimizerConfig](https://pytorch-tabular.readthedocs.io/en/latest/optimizer/)** to choose Adam optimiser. There is no need to set the learning rate (since it will be tuned automatically) nor scheduler.
- Use **[TabularModel](https://pytorch-tabular.readthedocs.io/en/latest/tabular_model/)** to initialise the model and put all the configs together.

In [29]:
# YOUR CODE HERE
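all_columns = df.columns.tolist()
# Note: the task lists month, town, flat_model_type and storey_range as the categorical
# features; this run uses the columns below instead (month is omitted).
categorical_columns = ['town', 'full_address', 'nearest_stn', 'flat_model_type', 'storey_range']
num_columns = ['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']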


print(categorical_columns)
print(num_columns)

# DataConfig
data_config = DataConfig(
    target=["resale_price"],  
    continuous_cols=num_columns,
    categorical_cols=categorical_columns
)


# TrainerConfig
trainer_config = TrainerConfig(
    auto_lr_find=True,  # run the learning-rate finder before training
    batch_size=1024,
    max_epochs=50,
)

# CategoryEmbeddingModelConfig
model_config = CategoryEmbeddingModelConfig(
    task="regression",
    layers="50",  # one hidden layer with 50 neurons
)

# OptimizerConfig (defaults to the Adam optimiser; no learning rate or scheduler set)
optimizer_config = OptimizerConfig()

# TabularModel: combine all configs
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)

['town', 'full_address', 'nearest_stn', 'flat_model_type', 'storey_range']
['dist_to_nearest_stn', 'dist_to_dhoby', 'degree_centrality', 'eigenvector_centrality', 'remaining_lease_years', 'floor_area_sqm']


In [30]:
tabular_model.fit(train=df_train, validation=df_validation)
result = tabular_model.evaluate(df_test)
prediction_df = tabular_model.predict(df_test)

Seed set to 42


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X_encoded[col].fillna(self._imputed, inplace=True)

GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:639: Checkpoint directory /Users/mihirbhupathiraju/Desktop/sc4001/saved_models exists and is not empty.
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_steps=100` reached.
Learning rate set to 0.47863009232263803
Restoring states from the checkpoint path at /Users/mihirbhupathiraju/Desktop/sc4001/.lr_find_5b1cb070-2983-4ce9-bdc7-1e112385797a.ckpt
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/lightning_fabric/utilities/cloud_io.py:56: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We

  return torch.load(f, map_location=map_location)

3. Report the test RMSE and the test R² value that you obtained.



In [31]:
result
prediction_df

Unnamed: 0,resale_price_prediction
87370,206476.437500
87371,236978.015625
87372,296334.812500
87373,273965.218750
87374,240292.156250
...,...
116422,575501.562500
116423,573838.937500
116424,640002.437500
116425,740735.312500


In [32]:
# YOUR CODE & RESULT HERE
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Extract the predicted values
predicted_values = prediction_df['resale_price_prediction'].values  
actual_values = df_test['resale_price'].values  

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(actual_values, predicted_values))
# Calculate R^2
r2 = r2_score(actual_values, predicted_values)
print(f"Test RMSE: {rmse:.4f}")
print(f"Test R²: {r2:.4f}")


Test RMSE: 66821.2455
Test R²: 0.8312
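
For reference, the two metrics reported above follow directly from their definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes both without scikit-learn; the function names `rmse` and `r2` and the toy price lists are assumptions made for illustration and are not values from this dataset.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values (made up for this sketch).
toy_actual = [200_000.0, 300_000.0, 400_000.0]
toy_predicted = [210_000.0, 290_000.0, 430_000.0]

toy_rmse = rmse(toy_actual, toy_predicted)
toy_r2 = r2(toy_actual, toy_predicted)
```

Because RMSE is in the same units as `resale_price`, it reads as a typical prediction error in price terms, whereas R² is unitless and compares the model against always predicting the mean price.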


4. Print out the corresponding rows in the dataframe for the top 25 test samples with the largest errors. Identify a trend in these poor predictions and suggest a way to reduce these errors.



In [33]:
# YOUR CODE & RESULT HERE
error_df = pd.DataFrame({'actual': actual_values,'predicted': predicted_values})

# Calculate absolute errors
error_df['error'] = abs(error_df['actual'] - error_df['predicted'])

# Get the top 25 largest errors
top_errors = error_df.nlargest(25, 'error')

# Print the corresponding rows in the original test DataFrame.
# error_df has a fresh 0..n-1 index, so its labels are row positions in df_test.
top_error_indices = top_errors.index
top_error_samples = df_test.iloc[top_error_indices]

print("Top 25 test samples with largest errors:")
top_error_samples

Top 25 test samples with largest errors:


Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
92405,11,2021,BUKIT MERAH,46 SENG POH ROAD,Tiong Bahru,0.581977,2.309477,0.016807,0.047782,"3 ROOM, Standard",50.166667,88.0,01 TO 03,780000.0
92226,9,2021,BUKIT MERAH,96A HENDERSON ROAD,Tiong Bahru,0.586629,2.932814,0.016807,0.047782,"5 ROOM, Improved",96.75,113.0,28 TO 30,1220000.0
92442,11,2021,BUKIT MERAH,127D KIM TIAN ROAD,Tiong Bahru,0.686789,2.664024,0.016807,0.047782,"5 ROOM, Improved",90.333333,113.0,16 TO 18,1165000.0
96910,12,2021,GEYLANG,332 UBI AVENUE 1,Ubi,0.53638,7.02884,0.016807,0.004409,"EXECUTIVE, Maisonette",62.833333,152.0,01 TO 03,950000.0
106132,11,2021,QUEENSTOWN,50 COMMONWEALTH DRIVE,Commonwealth,0.197249,5.421535,0.016807,0.00535,"5 ROOM, Improved",92.333333,117.0,34 TO 36,1230000.0
92443,11,2021,BUKIT MERAH,96A HENDERSON ROAD,Tiong Bahru,0.586629,2.932814,0.016807,0.047782,"5 ROOM, Improved",96.583333,113.0,40 TO 42,1256000.0
105702,6,2021,QUEENSTOWN,150 MEI LING STREET,Queenstown,0.245207,4.709043,0.016807,0.008342,"EXECUTIVE, Apartment",73.416667,148.0,10 TO 12,1235000.0
90608,12,2021,BISHAN,273B BISHAN STREET 24,Bishan,0.776182,6.297489,0.033613,0.015854,"5 ROOM, DBSS",88.833333,120.0,37 TO 39,1360000.0
109220,11,2021,SENGKANG,215B COMPASSVALE DRIVE,Sengkang,0.291216,11.358756,0.016807,0.000233,"5 ROOM, Premium Apartment",94.583333,112.0,13 TO 15,820000.0
95622,2,2021,CLEMENTI,440C CLEMENTI AVENUE 3,Clementi,0.245502,9.31326,0.016807,0.001179,"5 ROOM, Improved",96.583333,112.0,34 TO 36,1095000.0


In [34]:
top_error_samples['town'].value_counts()

town
SENGKANG           6
BUKIT MERAH        5
QUEENSTOWN         5
CLEMENTI           3
BISHAN             2
GEYLANG            1
HOUGANG            1
KALLANG/WHAMPOA    1
WOODLANDS          1
Name: count, dtype: int64

In [35]:
print("\nTrends in Poor Predictions:")
top_error_samples.describe(include='all') 


Trends in Poor Predictions:


Unnamed: 0,month,year,town,full_address,nearest_stn,dist_to_nearest_stn,dist_to_dhoby,degree_centrality,eigenvector_centrality,flat_model_type,remaining_lease_years,floor_area_sqm,storey_range,resale_price
count,25.0,25.0,25,25,25,25.0,25.0,25.0,25.0,25,25.0,25.0,25,25.0
unique,,,9,21,10,,,,,8,,,13,
top,,,SENGKANG,216C COMPASSVALE DRIVE,Sengkang,,,,,"5 ROOM, Improved",,,01 TO 03,
freq,,,6,3,6,,,,,8,,,3,
mean,9.48,2021.0,,,,0.450793,7.176338,0.018151,0.014929,,87.903333,117.36,,1010240.0
std,3.043025,0.0,,,,0.235894,4.026541,0.004654,0.019884,,12.808342,20.979911,,189672.8
min,1.0,2021.0,,,,0.171256,2.128424,0.016807,2.4e-05,,50.166667,88.0,,777000.0
25%,8.0,2021.0,,,,0.245207,3.720593,0.016807,0.000233,,88.916667,112.0,,820000.0
50%,11.0,2021.0,,,,0.451387,6.370404,0.016807,0.00535,,93.916667,112.0,,958000.0
75%,12.0,2021.0,,,,0.586629,11.358756,0.016807,0.015854,,94.833333,117.0,,1220000.0


The location of the flat is an important factor in the largest errors. Sengkang, in the north-east, accounts for 6 of the top 25 samples, and Bukit Merah, which is considered a very desirable area, accounts for 5; Queenstown also contributes 5. The flats involved are mostly large and expensive: the most common type is "5 ROOM, Improved" (8 of 25), the median floor area is 112 sqm, the median remaining lease is about 94 years, and the mean resale price is about 1.01 million, with a minimum of 777,000. For towns such as Queenstown, the degree_centrality and eigenvector_centrality values are also low, which may suggest to the model that these properties are not central.

The model's performance could be improved by enlarging the dataset and applying feature engineering. This includes creating interaction or polynomial features from the existing numeric columns and improving the encoding of the categorical variables. For example, target encoding of town or flat_model_type may capture their effect on price more directly.
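
A minimal sketch of the target encoding suggested above is given below, assuming a smoothed per-category mean of the target computed on the training split only. The function names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the toy prices are all assumptions made for this illustration; this was not part of the model trained in this notebook.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_target_encoding(
    categories: Sequence[str],
    targets: Sequence[float],
    smoothing: float = 10.0,
) -> Tuple[Dict[str, float], float]:
    """Return a smoothed mean target per category and the global mean.

    Each category mean is pulled towards the global mean, so rare
    categories do not get extreme encodings from a handful of rows.
    """
    global_mean = sum(targets) / len(targets)
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in sums
    }
    return encoding, global_mean


def apply_target_encoding(
    categories: Sequence[str],
    encoding: Dict[str, float],
    global_mean: float,
) -> List[float]:
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(category, global_mean) for category in categories]


# Toy training data (prices are made up for this sketch).
train_towns = ["SENGKANG", "SENGKANG", "BUKIT MERAH"]
train_prices = [400_000.0, 500_000.0, 1_200_000.0]

town_encoding, mean_price = fit_target_encoding(train_towns, train_prices, smoothing=1.0)
encoded = apply_target_encoding(
    ["SENGKANG", "BUKIT MERAH", "QUEENSTOWN"], town_encoding, mean_price
)
```

Fitting the encoding on the training years only matters here: computing town means with validation or test prices would leak the target into the features.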