<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 10px; height: 55px">


# Capstone Project: HDB Resale Price Predictions


## Notebook 4/4: Modelling
---

## Getting Started

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pycaret.regression import *
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

In [2]:
# Importing relevant csv files
final = pd.read_csv('../data/final.csv')

In [3]:
# Shape and head of final dataset
print(final.shape)
final.head()

(581229, 56)


Unnamed: 0,date,year,month,flat_type,block,street_name,address,floor_area_sqm,flat_model,lease_commence_date,remaining_lease,resale_price,town_bedok,town_bishan,town_bukit batok,town_bukit merah,town_bukit panjang,town_bukit timah,town_central area,town_choa chu kang,town_clementi,town_geylang,town_hougang,town_jurong east,town_jurong west,town_kallang/whampoa,town_marine parade,town_pasir ris,town_punggol,town_queenstown,town_sembawang,town_sengkang,town_serangoon,town_tampines,town_toa payoh,town_woodlands,town_yishun,storey_cat,mrt,mrt_dist,num_mrt_1km,mall,mall_dist,num_mall_1km,supermarket,supermarket_dist,num_supermarket_1km,hawker_dist,num_hawker_1km,park_dist,num_park_1km,school,school_dist,num_school_1km,number_school_btw_1km_2km,cityhall_dist
0,2000-01-01,2000,1,1,170,ANG MO KIO AVE 4,170 ANG MO KIO AVE 4,69.0,1,1986,14.0,147000.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,MAYFLOWER MRT STATION,0.151573,1,Broadway Plaza,1.083526,0,560161,0.347426,5,0.31391,5,0.306142,11,569948,0.247264,3,4,9.107076
1,2000-01-01,2000,1,1,174,ANG MO KIO AVE 4,174 ANG MO KIO AVE 4,61.0,1,1986,14.0,144000.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,MAYFLOWER MRT STATION,0.237776,1,Broadway Plaza,0.990745,1,560161,0.238291,5,0.189372,4,0.340958,11,569948,0.232789,3,4,9.202219
2,2000-01-01,2000,1,1,216,ANG MO KIO AVE 1,216 ANG MO KIO AVE 1,73.0,1,1976,24.0,159000.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,MAYFLOWER MRT STATION,0.881327,3,Broadway Plaza,0.804709,3,560215,0.05038021,9,0.207364,5,0.752762,4,569920,0.341979,1,8,8.161709
3,2000-01-01,2000,1,1,215,ANG MO KIO AVE 1,215 ANG MO KIO AVE 1,73.0,1,1976,24.0,167000.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,MAYFLOWER MRT STATION,0.875881,2,Broadway Plaza,0.755579,3,560215,8.185305e-07,9,0.227171,5,0.734493,3,569920,0.342322,1,8,8.188099
4,2000-01-01,2000,1,1,218,ANG MO KIO AVE 1,218 ANG MO KIO AVE 1,67.0,1,1976,24.0,163000.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,BRIGHT HILL MRT STATION,0.957857,2,AMK Hub,0.885895,2,560215,0.151839,9,0.307044,4,0.739071,4,569920,0.456619,1,7,8.039989


In [4]:
# Dropping columns in preparation for modelling
final = final.drop(['date', 'block', 'street_name', 'address', 'mrt', 'mall', 'supermarket','school'], axis = 1)

In [5]:
# Last check on dataset
final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581229 entries, 0 to 581228
Data columns (total 48 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   year                       581229 non-null  int64  
 1   month                      581229 non-null  int64  
 2   flat_type                  581229 non-null  int64  
 3   floor_area_sqm             581229 non-null  float64
 4   flat_model                 581229 non-null  int64  
 5   lease_commence_date        581229 non-null  int64  
 6   remaining_lease            581229 non-null  float64
 7   resale_price               581229 non-null  float64
 8   town_bedok                 581229 non-null  int64  
 9   town_bishan                581229 non-null  int64  
 10  town_bukit batok           581229 non-null  int64  
 11  town_bukit merah           581229 non-null  int64  
 12  town_bukit panjang         581229 non-null  int64  
 13  town_bukit timah           58

---
## Baseline Model

We will start by creating a basic baseline model to compare our results against. One of the simpler models we can use for this would be Linear Regression.

### Train-Test Split

In [6]:
# Creating list of all feature columns in dataset
features = [col for col in final.columns if col != 'resale_price']
len(features)

47

In [7]:
# Assigning X (feature matrix) and y (response vector)
X = final[features]
y = final['resale_price']

In [8]:
# Split our dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [9]:
# Looking at the shapes of the training and test data
print(X.shape)
print(X_train.shape)
print(X_test.shape)
print('')
print(y.shape)
print(y_train.shape)
print(y_test.shape)

(581229, 47)
(523106, 47)
(58123, 47)

(581229,)
(523106,)
(58123,)


### Scaling and Instantiating Model

In [10]:
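# Scaling the data (the cells below fit on the unscaled X_train; for ordinary least squares, R2 and RMSE are unaffected by feature scaling)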
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [11]:
# Instantiate Linear Regression
lr = LinearRegression()

### Linear Regression Model

In [16]:
# Cross Validation
print("R2 of LR:", cross_val_score(lr, X_train, y_train, cv=5).mean())
print("RMSE of LR:", -cross_val_score(lr, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error').mean())

R2 of LR: 0.8329561229750919
RMSE of LR: 62749.5834813807


In [13]:
# Fitting LR instantiated model
lr.fit(X_train, y_train)

LinearRegression()

In [14]:
# Scoring fit model on training and testing set
print('Training R2 of LR: ', lr.score(X_train, y_train))
print('Testing R2 of LR: ', lr.score(X_test, y_test))

Training R2 of LR:  0.8330005966593647
Testing R2 of LR:  0.8337901548296898


In [15]:
# RMSE of model for training and testing set
print("Training RMSE of LR:", np.sqrt(mean_squared_error(y_train, lr.predict(X_train))))
print("Testing RMSE of LR:", np.sqrt(mean_squared_error(y_test, lr.predict(X_test))))

Training RMSE of LR: 62742.25930638633
Testing RMSE of LR: 62480.07511531172


For a baseline model, the R2 scores are consistent at around 0.83, which is already quite high. The RMSE is around the 62k mark, and we will aim to reduce it in the modelling below. The consistency of both metrics across the cross-validation, training and testing scores also shows that our model is not overfitted.
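The baseline cells fit a `StandardScaler` but then pass the unscaled features to the model. For plain linear regression this makes no difference to R2 or RMSE, but for models where scaling matters it is safer to bundle the scaler and the model together. The following is a minimal sketch on synthetic data (not the HDB dataset; `X_demo`, `y_demo` and `pipe` are hypothetical names) showing one way to do this with a scikit-learn pipeline, so that each cross-validation fold fits its own scaler:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([3.0, 0.5, 0.01]) + rng.normal(scale=1.0, size=500)

# Scaler and model in one estimator: scaling is refit inside every CV fold
pipe = make_pipeline(StandardScaler(), LinearRegression())

r2_scaled = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()
r2_raw = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5).mean()
rmse_scaled = -cross_val_score(
    pipe, X_demo, y_demo, cv=5, scoring='neg_root_mean_squared_error'
).mean()

print('R2 with scaling:   ', r2_scaled)
print('R2 without scaling:', r2_raw)
print('RMSE with scaling: ', rmse_scaled)
```

The two R2 values should agree to within floating-point error, which is why the baseline scores above are still valid even though the scaled arrays were not used.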

---
## Modelling with PyCaret

[PyCaret](https://pycaret.gitbook.io/docs/) is an open-source, low-code machine learning library in Python that automates machine learning workflows. We will be using its features to streamline and enhance our modelling process.

Notes on Setup:
- Dataset is normalized with z-score (Same as StandardScaler)
- Target variable is also transformed using box-cox so that it closely resembles a normal distribution
- Polynomial features will be created
- The threshold for automated feature selection will be lowered, as advised by the documentation
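To make the first two notes concrete, here is a small sketch of what z-score normalization and a Box-Cox transform do, using scikit-learn and SciPy directly rather than PyCaret. The data are hypothetical right-skewed "prices" in thousands of dollars, not the HDB dataset, and the variable names are illustrative only:

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Hypothetical right-skewed, strictly positive prices (Box-Cox needs values > 0)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=6.0, sigma=0.4, size=5000).reshape(-1, 1)

# z-score by hand: subtract the mean, divide by the (population) standard deviation
z_manual = (prices - prices.mean()) / prices.std()
z_sklearn = StandardScaler().fit_transform(prices)
print('z-score matches StandardScaler:', np.allclose(z_manual, z_sklearn))

# Box-Cox: a power transform whose exponent is chosen to make the data more normal
boxcox = PowerTransformer(method='box-cox', standardize=False)
prices_bc = boxcox.fit_transform(prices)

skew_before = skew(prices.ravel())
skew_after = skew(prices_bc.ravel())
print('Skewness before Box-Cox:', skew_before)
print('Skewness after Box-Cox: ', skew_after)

# Predictions made on the transformed scale can be mapped back to dollars
prices_back = boxcox.inverse_transform(prices_bc)
```

When the target is transformed this way, a model is trained on the transformed values and its predictions are passed through the inverse transform, so reported errors remain in the original price units.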

In [18]:
# Setting up Environment in PyCaret
s = setup(data = final, 
          target = 'resale_price',
          train_size=0.9,
          normalize=True,
          transformation = True, 
          transform_target = True,
          remove_multicollinearity = True, 
          polynomial_features=True,
          ignore_low_variance=True,
          feature_selection=True,
          feature_selection_threshold=0.5,
          use_gpu=True,
          n_jobs=-1,
          silent=True,
          session_id = 123
          ) 

Unnamed: 0,Description,Value
0,session_id,123
1,Target,resale_price
2,Original Data,"(581229, 48)"
3,Missing Values,False
4,Numeric Features,12
5,Categorical Features,35
6,Ordinal Features,False
7,High Cardinality Features,False
8,High Cardinality Method,
9,Transformed Train Set,"(523106, 111)"


In [None]:
# Comparing all models
best_model = compare_models(sort = 'RMSE')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,14868.5478,453299878.7729,21290.4761,0.9808,0.0624,0.045,512.192
rf,Random Forest Regressor,15084.4264,473103294.7255,21750.5305,0.9799,0.0627,0.0453,463.409
lightgbm,Light Gradient Boosting Machine,20331.7769,849471692.6987,29145.4967,0.964,0.0783,0.0593,4.395
dt,Decision Tree Regressor,20683.8239,915404968.0445,30255.2884,0.9612,0.0859,0.0619,16.444
gbr,Gradient Boosting Regressor,28216.2586,1635530045.0228,40440.8772,0.9306,0.1047,0.081,222.851
knn,K Neighbors Regressor,38510.4713,2924758317.7764,54080.7033,0.876,0.1527,0.114,173.485
huber,Huber Regressor,44906.9806,3461933530.7955,58837.8301,0.8532,0.167,0.1344,87.525
lr,Linear Regression,45140.5812,3484705203.2,59031.0465,0.8522,0.1666,0.1352,2.337
ridge,Ridge Regression,45140.6504,3484713676.8,59031.1184,0.8522,0.1666,0.1352,0.856
br,Bayesian Ridge,45141.2784,3484816770.6552,59031.9915,0.8522,0.1666,0.1352,6.37


The models' scores are sorted by their Root Mean Squared Error (RMSE). Because the errors are squared before they are averaged, RMSE gives a relatively high weight to large errors, which makes it useful when large errors are particularly undesirable. It is also measured in the same units as our target variable.
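A short worked example may help show how RMSE weights large errors. The prices and errors below are made up for illustration; both sets of predictions have the same mean absolute error, but the set with one large miss has twice the RMSE:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average, then take the root."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute errors."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))

# Hypothetical resale prices
y_true = np.array([300_000, 400_000, 500_000, 600_000])

# Case A: every prediction is off by 20k
even_errors = y_true + np.array([20_000, -20_000, 20_000, -20_000])
# Case B: three perfect predictions and one that is off by 80k
one_big_error = y_true + np.array([0, 0, 0, 80_000])

print(mae(y_true, even_errors), rmse(y_true, even_errors))      # 20000.0 20000.0
print(mae(y_true, one_big_error), rmse(y_true, one_big_error))  # 20000.0 40000.0
```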

Comparing PyCaret's Linear Regression scores with our baseline model, R2 has increased by about 0.02 (0.833 to 0.852) and RMSE has decreased by about 3.7k (62.7k to 59.0k). This suggests that the preprocessing in our PyCaret environment has been useful in improving the accuracy of our predictions.

The Extra Trees and Random Forest Regressors have the best RMSEs at around 21k, and the best R2 scores at around 0.98. These scores are excellent, but the two models take up the bulk of the processing time for the `compare_models` function: each took more than 100 times as long as our third-best model, `lightgbm`.

---

## Predictions on Unseen Data

Next, let us test out a few selected models on our holdout set. 

### Extra Trees Regressor (Best Model)

In [24]:
# Creating model
model = create_model('et')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,14847.619,450895729.3885,21234.3055,0.9809,0.0627,0.0451
1,14830.535,451857337.0323,21256.9362,0.9806,0.0622,0.0448
2,14882.5888,451777591.0092,21255.0604,0.9809,0.0623,0.0448
3,14991.8396,461027005.2917,21471.5394,0.9806,0.0626,0.0452
4,14848.1772,446224880.4656,21124.0356,0.9814,0.0618,0.0449
5,14877.1423,457347793.0998,21385.6913,0.9804,0.0628,0.0452
6,14801.2007,447301377.628,21149.5006,0.9811,0.062,0.0447
7,14863.1776,460450818.9046,21458.1178,0.9803,0.0631,0.0452
8,14842.9507,453356240.3739,21292.1638,0.9808,0.0623,0.0451
9,14908.7573,457202958.1394,21382.3048,0.9806,0.0626,0.045


In [25]:
# Scores of predictions
predict_model(model)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extra Trees Regressor,14709.6058,446556716.3027,21131.8886,0.9809,0.0621,0.0447


Unnamed: 0,num_mall_1km_2,num_school_1km_6,num_mrt_1km_2,num_mrt_1km_9,num_mall_1km_6,number_school_btw_1km_2km_8,num_school_1km_5,month_2,num_mrt_1km_5,month_10,...,num_school_1km_3,num_mrt_1km_7,flat_type_3,num_park_1km_2,mrt_dist,month_1,num_mrt_1km_6,hawker_dist,resale_price,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.118456,0.0,0.0,-1.094747,219500.0,214171.012169
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.357491,0.0,0.0,-1.189315,340000.0,362862.482190
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.825242,0.0,0.0,-1.319504,150000.0,158444.340646
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.528365,0.0,0.0,-0.362267,500000.0,509734.724461
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.917783,1.0,0.0,-1.013772,228000.0,209681.115173
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.233519,0.0,0.0,0.304994,446000.0,487661.404425
58119,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.739797,0.0,0.0,-1.610684,200000.0,201336.887234
58120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.299616,0.0,0.0,-1.054239,500000.0,663957.689214
58121,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.525884,1.0,0.0,-0.458900,265000.0,249302.470571


### Huber Regressor (Robust to Outliers)

In [26]:
# Creating model
model_1 = create_model('huber')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,44760.194,3430333300.4977,58569.0473,0.8549,0.1667,0.1343
1,45053.2141,3481472096.3772,59004.0007,0.8509,0.1673,0.1347
2,44961.7458,3461242237.5526,58832.3231,0.8533,0.167,0.1339
3,45302.0245,3516844394.6292,59302.9881,0.8522,0.1679,0.1351
4,44788.7155,3467114414.005,58882.208,0.8553,0.1663,0.1338
5,45057.822,3477238490.7784,58968.1142,0.8507,0.1677,0.1352
6,44847.7342,3444038068.284,58685.9273,0.8547,0.1662,0.1339
7,44886.3627,3460417816.7697,58825.3161,0.8517,0.1672,0.1346
8,44703.5595,3445903716.0808,58701.8204,0.8541,0.1668,0.1344
9,44710.2365,3434475119.8098,58604.3951,0.8541,0.1665,0.1338


In [27]:
predict_model(model_1)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Huber Regressor,45202.5956,3519792235.3019,59327.8369,0.8495,0.1678,0.135


Unnamed: 0,num_mall_1km_2,num_school_1km_6,num_mrt_1km_2,num_mrt_1km_9,num_mall_1km_6,number_school_btw_1km_2km_8,num_school_1km_5,month_2,num_mrt_1km_5,month_10,...,num_school_1km_3,num_mrt_1km_7,flat_type_3,num_park_1km_2,mrt_dist,month_1,num_mrt_1km_6,hawker_dist,resale_price,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.118456,0.0,0.0,-1.094747,219500.0,261619.942470
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.357491,0.0,0.0,-1.189315,340000.0,314688.554296
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.825242,0.0,0.0,-1.319504,150000.0,164035.764686
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.528365,0.0,0.0,-0.362267,500000.0,491762.932786
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.917783,1.0,0.0,-1.013772,228000.0,155775.486678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.233519,0.0,0.0,0.304994,446000.0,494513.747361
58119,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.739797,0.0,0.0,-1.610684,200000.0,235569.721710
58120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.299616,0.0,0.0,-1.054239,500000.0,596666.507155
58121,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.525884,1.0,0.0,-0.458900,265000.0,227703.207058


### Gradient Boosting Regressor (To Compare with LightGBM)

In [28]:
# Creating model
model_2 = create_model('gbr')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,28170.5217,1623743861.8306,40295.7053,0.9313,0.1043,0.0809
1,28382.7884,1666123132.2181,40818.1716,0.9286,0.1055,0.0814
2,28271.451,1628208886.7208,40351.0705,0.931,0.1043,0.0806
3,28358.4559,1666239195.2281,40819.5933,0.93,0.1049,0.0811
4,28213.73,1629900214.0589,40372.0227,0.932,0.1042,0.0809
5,27943.8625,1607538073.4302,40094.1152,0.931,0.1039,0.0805
6,28253.2761,1646300904.819,40574.6338,0.9305,0.1044,0.0807
7,27971.2049,1601677732.3812,40020.9662,0.9313,0.1045,0.0808
8,28228.481,1638932578.4485,40483.7323,0.9306,0.1054,0.0816
9,28368.814,1646635871.0923,40578.7613,0.93,0.1053,0.0815


In [29]:
predict_model(model_2)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Gradient Boosting Regressor,28116.3488,1618052357.4955,40225.0215,0.9308,0.1043,0.0808


Unnamed: 0,num_mall_1km_2,num_school_1km_6,num_mrt_1km_2,num_mrt_1km_9,num_mall_1km_6,number_school_btw_1km_2km_8,num_school_1km_5,month_2,num_mrt_1km_5,month_10,...,num_school_1km_3,num_mrt_1km_7,flat_type_3,num_park_1km_2,mrt_dist,month_1,num_mrt_1km_6,hawker_dist,resale_price,Label
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.118456,0.0,0.0,-1.094747,219500.0,228729.452516
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.357491,0.0,0.0,-1.189315,340000.0,381975.296592
2,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.825242,0.0,0.0,-1.319504,150000.0,164722.741852
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.528365,0.0,0.0,-0.362267,500000.0,479456.608698
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.917783,1.0,0.0,-1.013772,228000.0,195389.391415
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58118,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.233519,0.0,0.0,0.304994,446000.0,486032.436141
58119,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.739797,0.0,0.0,-1.610684,200000.0,222667.633843
58120,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.299616,0.0,0.0,-1.054239,500000.0,548068.365966
58121,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,-0.525884,1.0,0.0,-0.458900,265000.0,243513.665873


### LightGBM (Low Processing Power Required + Accurate)

In [25]:
# Creating model
model_3 = create_model('lightgbm')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,20325.5013,846079078.234,29087.4385,0.9642,0.078,0.0592
1,20315.3471,857864986.4076,29289.3323,0.9633,0.0783,0.0591
2,20228.0722,830348256.7447,28815.764,0.9648,0.0779,0.0588
3,20400.7043,853636351.0654,29217.0558,0.9641,0.0782,0.0592
4,20200.179,833964490.557,28878.4434,0.9652,0.0774,0.0588
5,20321.4447,855393362.0412,29247.1086,0.9633,0.0785,0.0594
6,20292.8929,848880470.7968,29135.5534,0.9642,0.0781,0.059
7,20338.3964,855181194.9594,29243.4812,0.9633,0.0787,0.0594
8,20221.255,851191270.0069,29175.1824,0.964,0.0781,0.0592
9,20417.4925,847632009.6767,29114.1205,0.964,0.0787,0.0596


In [26]:
predict_model(model_3)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,20301.5745,848988191.777,29137.4019,0.9637,0.0781,0.0592


Unnamed: 0,num_supermarket_1km_8,num_mall_1km_0,num_hawker_1km_6,town_woodlands,num_school_1km_8,num_hawker_1km_4,num_park_1km_5,num_park_1km_0,num_park_1km_10,num_school_1km_4,...,park_dist,number_school_btw_1km_2km_14,num_school_1km_6,month_3,num_mrt_1km_1,num_supermarket_1km_9,num_hawker_1km_9,remaining_lease,resale_price,Label
0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.553264,0.0,0.0,1.0,1.0,0.0,0.0,-0.041455,219500.0,226074.955796
1,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.190660,0.0,0.0,0.0,0.0,0.0,0.0,0.532198,340000.0,387939.113684
2,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.099870,0.0,0.0,0.0,0.0,1.0,0.0,-0.096066,150000.0,168953.255114
3,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.863650,0.0,0.0,0.0,1.0,0.0,0.0,1.350660,500000.0,506563.499863
4,0.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.297112,0.0,0.0,0.0,1.0,0.0,0.0,-0.152193,228000.0,202670.105422
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58118,1.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.861987,0.0,0.0,0.0,0.0,0.0,0.0,-1.020338,446000.0,511093.575873
58119,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.607025,0.0,1.0,0.0,1.0,0.0,0.0,-1.508133,200000.0,218112.719304
58120,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.509191,0.0,0.0,0.0,0.0,1.0,0.0,1.408893,500000.0,597261.383923
58121,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.773044,0.0,0.0,0.0,0.0,0.0,0.0,-0.269002,265000.0,247630.745630


Scores are consistent between cross-validation and the holdout set for every model, which suggests that our models were not overfitted. Extra Trees Regressor is still the most accurate, but its long processing time makes it less suitable for future deployment. Both GBR and LightGBM are gradient boosted regressors, but LightGBM is better in almost every way: a significantly shorter processing time, an R2 score higher by 0.03, and a roughly 28% lower RMSE. LightGBM is also the most efficient model, with the best score-to-time ratio.
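The comparison between the models can be made explicit with a few lines of arithmetic. The figures below are copied from the `predict_model` and `compare_models` outputs above; the dictionary and variable names are only illustrative:

```python
# Holdout RMSE from the predict_model outputs above
rmse_holdout = {'et': 21131.8886, 'gbr': 40225.0215, 'lightgbm': 29137.4019}
# Training time in seconds (TT column) from the compare_models output
train_time = {'et': 512.192, 'gbr': 222.851, 'lightgbm': 4.395}

# Relative RMSE reduction of LightGBM over GBR
rmse_reduction = 1 - rmse_holdout['lightgbm'] / rmse_holdout['gbr']

# How many times longer the other models took compared to LightGBM
slowdown_gbr = train_time['gbr'] / train_time['lightgbm']
slowdown_et = train_time['et'] / train_time['lightgbm']

print(f'LightGBM RMSE is {rmse_reduction:.1%} lower than GBR')
print(f'GBR took {slowdown_gbr:.0f}x as long as LightGBM')
print(f'Extra Trees took {slowdown_et:.0f}x as long as LightGBM')
```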

Looking at all the above considerations, LightGBM was chosen as our final model.

---
## Tuning Chosen Model

We will now use the `tune_model` function to automatically tune the hyperparameters of our model using pre-defined search spaces. We will also use Optuna, an automatic hyperparameter optimization framework, to explore the hyperparameter space more efficiently than an exhaustive search would.
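To show the general shape of an automated hyperparameter search, here is a small sketch using scikit-learn only. It is a stand-in, not what `tune_model` runs: it uses random search (`RandomizedSearchCV`) instead of Optuna, and `GradientBoostingRegressor` on synthetic data instead of LightGBM on the HDB data. The search ranges are arbitrary assumptions chosen to keep the example fast:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (assumption; not the HDB dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Distributions to sample hyperparameters from (illustrative ranges only)
param_distributions = {
    'n_estimators': randint(50, 150),       # integers from 50 to 149
    'learning_rate': uniform(0.01, 0.29),   # floats from 0.01 to 0.30
    'max_depth': randint(2, 5),             # integers from 2 to 4
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                                # number of sampled configurations
    scoring='neg_root_mean_squared_error',   # same metric we optimize above
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_
print('Best parameters:', search.best_params_)
print('Best cross-validated RMSE:', best_rmse)
```

The main difference is that random search samples each configuration independently, whereas Optuna's samplers use the results of earlier trials to decide which configurations to try next.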

In [27]:
# Model tuning
tuned_lightgbm = tune_model(model_3, optimize='RMSE',
                   tuner_verbose=False, search_library='optuna')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,15081.2319,450315266.5335,21220.633,0.981,0.0612,0.0453
1,15148.4986,461003348.143,21470.9885,0.9803,0.0611,0.0451
2,15128.2186,454942678.1625,21329.3853,0.9807,0.061,0.045
3,15202.5464,463638959.5493,21532.2772,0.9805,0.061,0.0452
4,15079.4048,455179501.1742,21334.9362,0.981,0.0606,0.045
5,15157.33,463650283.3495,21532.5401,0.9801,0.0616,0.0455
6,15182.8628,461169208.4013,21474.8506,0.9805,0.0614,0.0453
7,15085.1678,460691746.175,21463.7309,0.9803,0.0613,0.0452
8,15125.0562,459132118.1083,21427.3684,0.9806,0.061,0.0453
9,15132.6712,458255857.062,21406.9114,0.9805,0.0617,0.0454


After tuning, our RMSE has dropped by another ~8k (from about 29.1k to about 21.4k), and our R2 score has increased by about 0.017 (from 0.964 to 0.981). These scores are similar to those of our initial Extra Trees Regressor model.

In [28]:
# Saving tuned model
save_model(tuned_lightgbm,'Tuned LightGBM')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='resale_price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_st...
                                                                          min_child_samples=29,
                                                                          min_child_weight=0.001,
                                                                          min_sp

### Blending Models

Let's test whether combining LightGBM with another model will help improve accuracy. The Decision Tree Regressor was chosen as it is also a high-scoring model with a low processing time.

In [29]:
# Creating models
dt = create_model('dt', verbose = False)
lightgbm = create_model('lightgbm', verbose = False)

In [31]:
# Blending models
blend_model = blend_models(estimator_list=[dt, lightgbm],
                           choose_better=True,
                           optimize='RMSE')

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,17417.2461,618288080.9604,24865.3993,0.9739,0.0703,0.0518
1,17441.2576,630645798.712,25112.6621,0.973,0.07,0.0516
2,17494.3308,624570446.3958,24991.4075,0.9735,0.0702,0.0518
3,17602.0025,635578293.7043,25210.6782,0.9733,0.0705,0.0521
4,17400.9373,620346318.9613,24906.7525,0.9741,0.0698,0.0517
5,17492.6103,631778409.0157,25135.2026,0.9729,0.0705,0.052
6,17381.2064,621504525.7138,24929.9925,0.9738,0.07,0.0516
7,17500.4036,634401404.1011,25187.3263,0.9728,0.071,0.0522
8,17421.5963,630594183.3621,25111.6344,0.9733,0.0701,0.0518
9,17518.4351,631010533.2862,25119.923,0.9732,0.0705,0.052


In [32]:
# Tuning blended model
blend_model = tune_model(blend_model, optimize='RMSE', tuner_verbose=False)

Fold,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,17469.958,620295946.8211,24905.7412,0.9738,0.0698,0.0518
1,17496.3756,632555409.9025,25150.6543,0.9729,0.0696,0.0517
2,17520.9488,623333793.243,24966.6536,0.9736,0.0698,0.0517
3,17643.5903,636131808.7244,25221.6536,0.9733,0.0701,0.0521
4,17445.0237,620901506.5439,24917.8953,0.9741,0.0694,0.0516
5,17546.764,633039587.2896,25160.278,0.9728,0.0701,0.0521
6,17427.9969,623123055.8557,24962.4329,0.9737,0.0696,0.0515
7,17548.8802,635201970.0036,25203.2135,0.9728,0.0706,0.0522
8,17463.9127,631405405.9435,25127.7816,0.9733,0.0697,0.0518
9,17573.2752,631769176.6882,25135.0189,0.9732,0.0702,0.052


The blended model performs worse than our Tuned LightGBM model (an RMSE of about 25k versus about 21.4k, and an R2 of about 0.973 versus about 0.981). We will therefore stick with the Tuned LightGBM model as our final model.
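For intuition on what blending does, the sketch below averages the predictions of two regressors with scikit-learn's `VotingRegressor`. It assumes an unweighted average as the blending rule, uses `GradientBoostingRegressor` as a stand-in for LightGBM, and runs on synthetic data, so it illustrates the idea rather than reproducing the cell above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption; not the HDB dataset)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=15.0, random_state=1)

# A voting ensemble fits each member, then averages their predictions
blend = VotingRegressor(estimators=[
    ('dt', DecisionTreeRegressor(max_depth=6, random_state=1)),
    ('gbr', GradientBoostingRegressor(random_state=1)),
])
blend.fit(X_demo, y_demo)

# Predictions of each fitted member, one column per model
member_preds = np.column_stack([est.predict(X_demo) for est in blend.estimators_])
blend_preds = blend.predict(X_demo)

# The blended prediction is the row-wise mean of the members' predictions
print(np.allclose(blend_preds, member_preds.mean(axis=1)))
```

Because the blend is an average, a weaker member pulls the result towards its own errors, which is one plausible reason a blend can score worse than its strongest member.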

---
## Model Evaluation

In [34]:
evaluate_model(tuned_lightgbm)


In [41]:
interpret_model(tuned_lightgbm, save=True)



![shap](../plots/SHAP%20summary.png)

In [37]:
plot_model(tuned_lightgbm, plot = 'feature', save=True)

'Feature Importance.png'

![some](../plots/Feature%20Importance.png)

In [38]:
plot_model(tuned_lightgbm, plot = 'feature_all', save=True)

'Feature Importance (All).png'

![all](../plots/Feature%20Importance%20(All).png)

### Finalising Model

In [42]:
# Creating final model
final_lightgbm = finalize_model(tuned_lightgbm)



In [43]:
# Scoring final model
predict_model(final_lightgbm)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Light Gradient Boosting Machine,14696.2967,429615159.4138,20727.1599,0.9816,0.0593,0.044


Unnamed: 0,num_supermarket_1km_8,num_mall_1km_0,num_hawker_1km_6,town_woodlands,num_school_1km_8,num_hawker_1km_4,num_park_1km_5,num_park_1km_0,num_park_1km_10,num_school_1km_4,...,park_dist,number_school_btw_1km_2km_14,num_school_1km_6,month_3,num_mrt_1km_1,num_supermarket_1km_9,num_hawker_1km_9,remaining_lease,resale_price,Label
0,1.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.553264,0.0,0.0,1.0,1.0,0.0,0.0,-0.041455,219500.0,221890.540403
1,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,1.0,...,-0.190660,0.0,0.0,0.0,0.0,0.0,0.0,0.532198,340000.0,370725.684880
2,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.099870,0.0,0.0,0.0,0.0,1.0,0.0,-0.096066,150000.0,161601.235643
3,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.863650,0.0,0.0,0.0,1.0,0.0,0.0,1.350660,500000.0,529018.428661
4,0.0,1.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.297112,0.0,0.0,0.0,1.0,0.0,0.0,-0.152193,228000.0,215498.535784
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58118,1.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.861987,0.0,0.0,0.0,0.0,0.0,0.0,-1.020338,446000.0,446035.820303
58119,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.607025,0.0,1.0,0.0,1.0,0.0,0.0,-1.508133,200000.0,204288.544013
58120,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.509191,0.0,0.0,0.0,0.0,1.0,0.0,1.408893,500000.0,623360.661868
58121,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.773044,0.0,0.0,0.0,0.0,0.0,0.0,-0.269002,265000.0,249616.238744


In [44]:
# Saving final model
save_model(final_lightgbm,'Final LightGBM Model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[],
                                       target='resale_price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_st...
                                                                          min_child_samples=29,
                                                                          min_child_weight=0.001,
                                                                          min_sp

---
## Conclusion

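We managed to create a model with a significantly lower RMSE than our baseline. The final RMSE is 20,727, which is less than 4% of the current mean resale price of ~$550k. Note that `finalize_model` refits the model on the full dataset, including the holdout set, so this figure is somewhat optimistic; the cross-validated RMSE of about 21.4k from tuning is the more conservative estimate, and it is also under 4% of the mean price. The results also suggest that the features we engineered are informative for resale flat prices and were useful in improving the accuracy of our models, although no formal significance test was run in this notebook. Our model should be able to flag undervalued and overvalued flats, and should be able to give a good estimate of the Cash Over Valuation (COV).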

Moving forward, I aim to deploy the trained model in a web-based application built with a framework like Streamlit. I also plan to test this model further on future data. HDB updates the dataset frequently, which makes this plan feasible.
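As a rough illustration of how the model could flag undervalued and overvalued flats, the sketch below compares a transacted or asking price against the model's prediction and uses the final RMSE as a tolerance band. The function name `flag_listing` and the choice of one RMSE as the threshold are assumptions made for this example, not something decided in the notebook:

```python
FINAL_RMSE = 20_727          # final holdout RMSE reported above
MEAN_RESALE_PRICE = 550_000  # approximate current mean resale price quoted above

def flag_listing(price, predicted_price, tolerance=FINAL_RMSE):
    """Label a flat by how far its price sits from the model's prediction.

    The one-RMSE tolerance is a hypothetical threshold chosen for illustration.
    """
    gap = price - predicted_price
    if gap < -tolerance:
        return 'undervalued'
    if gap > tolerance:
        return 'overvalued'
    return 'fair'

# RMSE relative to the mean resale price
relative_error = FINAL_RMSE / MEAN_RESALE_PRICE
print(f'RMSE is {relative_error:.1%} of the mean resale price')

# Two rows from the final predict_model output above (resale_price, Label)
print(flag_listing(219_500, 221_890.540403))  # fair
print(flag_listing(500_000, 623_360.661868))  # undervalued
# A made-up listing priced well above its prediction
print(flag_listing(600_000, 550_000))         # overvalued
```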

## Limitations

There are other factors that influence resale flat prices and COVs, such as the condition of a flat and the direction it faces. Flats with extensive renovations and furnishings, or flats that are well maintained, tend to fetch a higher price. North- or south-facing flats are generally in higher demand than east- or west-facing ones because of the heat and glare of the sun. Our inability to take these factors into account might ultimately affect our model's accuracy.

A key consideration in this project was whether to select the most accurate model or the most efficient one. Processing power was a big concern; I only managed to train all the models with the help of AWS SageMaker Studio. I ended up choosing the most efficient model, LightGBM, because it is the easiest to deploy. This decision might have come at some cost in prediction accuracy.

Euclidean distance was also used instead of travel time. This may not be the most accurate measure of accessibility, because the straight-line distance to a location is not perfectly correlated with the time taken to get there. This could be improved by using a paid API (e.g. Google Maps API) to obtain actual travel times between two locations.
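For reference, a straight-line distance between two latitude/longitude points can be computed with the haversine formula, as sketched below. This is an assumption about how such a distance could be derived, not a record of how the `*_dist` features in this project were calculated, and the coordinates are approximate illustrative points in Singapore rather than values from the dataset:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Hypothetical flat and city-centre coordinates (approximate, for illustration)
flat = (1.3700, 103.8490)
city_centre = (1.2930, 103.8520)

print(f'Straight-line distance: {haversine_km(*flat, *city_centre):.2f} km')
```

A travel-time feature would replace this calculation with a routing query, since two flats at the same straight-line distance from an MRT station can have very different walking routes.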

---