### Predicting HDB Prices with and without Macroeconomic Data
This notebook examines whether adding macroeconomic factors to a dataset that initially contains only house-specific features improves the performance of house price prediction models. Prior research has indicated that tree-based models, particularly random forest and XGBoost, are effective at forecasting house prices. These models are typically trained on house-specific features alone. However, evidence suggests that macroeconomic variables such as interest rates, inflation, and GDP also affect house prices. This notebook therefore trains one XGBoost model on house-specific plus macroeconomic features and another on house-specific features only, then compares their performance.


#### Load Libraries

In [1]:
# Data Manipulation
import pandas as pd
import numpy as np

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Train Test Split
from sklearn.model_selection import train_test_split

# Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

# Modelling
from xgboost import XGBRegressor

# Model Evaluation
from sklearn.metrics import mean_absolute_error, r2_score, mean_absolute_percentage_error
import scipy.stats as stats

#### Load data into DataFrame and Remove Unwanted Columns

In [2]:
# Store the file path in a variable so only this line changes if the notebook is moved
file_path = '../data/processed/final_HDB_for_model.parquet.gzip'

# Read the parquet file into a DataFrame
df = pd.read_parquet(file_path)

# Check to see if it loaded correctly
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 890376 entries, 0 to 890375
Data columns (total 27 columns):
 #   Column                                                                         Non-Null Count   Dtype         
---  ------                                                                         --------------   -----         
 0   town                                                                           890376 non-null  object        
 1   flat_type                                                                      890376 non-null  object        
 2   block                                                                          890376 non-null  object        
 3   street_name                                                                    890376 non-null  object        
 4   storey_range                                                                   890376 non-null  object        
 5   floor_area_sqm                                                          

In [3]:
# Put all columns to be deleted into a list
drop_cols = ['block', 'street_name','address','sold_year_month']

# Drop columns
df = df.drop(columns=drop_cols)

In [4]:
df.columns

Index(['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model',
       'lease_commence_date', 'resale_price', 'sold_year',
       'sold_remaining_lease', 'max_floor_lvl', '5 year bond yields',
       'GDPm (Current Prices)', 'GDP per capita', 'Personal Income m',
       'Unemployment Rate', 'Core inflation', 'Median Household Inc',
       'Lime, Cement, & Fabricated Construction Materials Excl Glass & Clay Materials',
       'Clay Construction Materials & Refractory Construction Materials',
       'most_closest_mrt', 'walking_time_mrt', 'ResidentPopulation',
       'ResidentPopulation_Growth_Rate'],
      dtype='object')

In [5]:
df_only_hdb = df[['town', 'flat_type', 'storey_range', 'floor_area_sqm', 'flat_model',
       'lease_commence_date', 'resale_price', 'sold_year',
       'sold_remaining_lease', 'max_floor_lvl']].copy()

#### Creating a Pipeline 
Pipelines in machine learning are useful because they streamline the entire process of preparing data and building models. Most importantly, they make the preprocessing steps repeatable and reduce the risk of human error.
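As a minimal sketch of this idea, preprocessing and a model can be chained into a single object with `make_pipeline`, so that one `fit` call applies every step in order. The toy DataFrame, its values, and the names `toy`, `toy_preproc`, and `toy_pipe` are assumptions made for illustration. `LinearRegression` is used only as a lightweight stand-in; `XGBRegressor` follows the scikit-learn estimator interface and could be substituted.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values, not taken from the HDB dataset)
toy = pd.DataFrame({
    "town": ["A", "B", "A", "C", "B", "C"],
    "floor_area_sqm": [60.0, 75.0, 90.0, 65.0, 100.0, 80.0],
    "resale_price": [300.0, 380.0, 420.0, 310.0, 500.0, 390.0],
})
X_toy = toy.drop(columns="resale_price")
y_toy = toy["resale_price"]

# One-hot encode the categorical column and pass numeric columns through
toy_preproc = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["town"]),
    remainder="passthrough",
)

# Chain preprocessing and the model (stand-in estimator) into one pipeline
toy_pipe = make_pipeline(toy_preproc, LinearRegression())
toy_pipe.fit(X_toy, y_toy)
toy_pred = toy_pipe.predict(X_toy)
```

Because the encoder is fitted inside the pipeline, calling `toy_pipe.predict` on new rows reuses the categories learned during `fit` without any manual transform step.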

In [6]:
# Create lists of the categorical and numerical columns allowing them to be treated differently
cat_cols = df.select_dtypes(include=['object']).columns
cat_cols_only_hdb = df_only_hdb.select_dtypes(include=['object']).columns


In [7]:
# Create an instance of OneHotEncoder for the categorical columns
cat_transformer = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=True)

# Create a column transformer that encodes the categorical columns and passes the rest through
prepoc = make_column_transformer(
    (cat_transformer, cat_cols),
    remainder='passthrough'
)

# View Pipeline
prepoc

In [8]:
# Create an instance of OneHotEncoder for the HDB-only categorical columns
cat_transformer_only_hdb = OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=True)

# Create a column transformer that encodes the categorical columns and passes the rest through
prepoc_only_hdb = make_column_transformer(
    (cat_transformer_only_hdb, cat_cols_only_hdb),
    remainder='passthrough'
)

# View Pipeline
prepoc_only_hdb

#### Train Test Split the Data 

In [9]:
# Select target column
target_col = 'resale_price'

# Ready X and y
X = df.loc[:, ~df.columns.isin([target_col])]
y = df[target_col]
X_only_hdb = df_only_hdb.loc[:, ~df_only_hdb.columns.isin([target_col])]
y_only_hdb = df_only_hdb[target_col]

In [10]:
# Split the data, 80-20 split with a random state included for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 54)
X_train_only_hdb, X_test_only_hdb, y_train_only_hdb, y_test_only_hdb = train_test_split(X_only_hdb,y_only_hdb, test_size = 0.2, random_state = 54)
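Because both calls use the same `random_state` on DataFrames with the same number of rows, the two splits should select the same rows, which is what later allows the two models' predictions to be compared row by row. The following sketch checks this on a small made-up DataFrame; the column names and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy "full" frame and a column subset of it (hypothetical columns)
full = pd.DataFrame({
    "a": range(10),
    "macro": range(10, 20),
    "price": range(100, 110),
})
subset = full[["a", "price"]].copy()

# Split both frames with the same test_size and random_state
_, X_te_full, _, y_te_full = train_test_split(
    full.drop(columns="price"), full["price"], test_size=0.2, random_state=54
)
_, X_te_sub, _, y_te_sub = train_test_split(
    subset.drop(columns="price"), subset["price"], test_size=0.2, random_state=54
)

# The test rows (by index) and the test targets match across the two splits
same_rows = X_te_full.index.equals(X_te_sub.index)
same_target = y_te_full.equals(y_te_sub)
```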

#### Preprocess the Data

In [11]:
# Fit each preprocessor on the training features, then apply it to the test features
X_train_processed = prepoc.fit_transform(X_train)
X_test_processed = prepoc.transform(X_test)
X_train_only_hdb_processed = prepoc_only_hdb.fit_transform(X_train_only_hdb)
X_test_only_hdb_processed = prepoc_only_hdb.transform(X_test_only_hdb)

#### Train the Models

In [12]:
# Instantiate the model
xgb_reg = XGBRegressor()
xgb_reg_only_hdb = XGBRegressor()

In [13]:
# Fit the model to the training data
xgb_reg.fit(X_train_processed, y_train)

In [14]:
xgb_reg_only_hdb.fit(X_train_only_hdb_processed, y_train_only_hdb)

#### Testing Models

In [19]:
# Predict y with fitted model
y_pred_only_hdb = xgb_reg_only_hdb.predict(X_test_only_hdb_processed)

# results
test_base_r2_mean_only_hdb = round(r2_score(y_test_only_hdb, y_pred_only_hdb),2)
test_base_mae_mean_only_hdb = round(mean_absolute_error(y_test_only_hdb, y_pred_only_hdb),2)
test_base_mape_mean_only_hdb = round(mean_absolute_percentage_error(y_test_only_hdb, y_pred_only_hdb),2)

print("Only HDB data - testing r2 score =", test_base_r2_mean_only_hdb)
print("Only HDB data - testing Mean Absolute Error =", test_base_mae_mean_only_hdb)
print("Only HDB data - testing Mean Absolute Percentage Error =", test_base_mape_mean_only_hdb)

Only HDB data - testing r2 score = 0.97
Only HDB data - testing Mean Absolute Error = 20716.63
Only HDB data - testing Mean Absolute Percentage Error = 0.07


In [20]:
# Predict y with fitted model
y_pred = xgb_reg.predict(X_test_processed)

# results
test_base_r2_mean = round(r2_score(y_test, y_pred),2)
test_base_mae_mean = round(mean_absolute_error(y_test, y_pred),2)
test_base_mape_mean = round(mean_absolute_percentage_error(y_test, y_pred),2)

print("Macroecon + HDB data - test set r2 score =", test_base_r2_mean)
print("Macroecon + HDB data - test set Mean Absolute Error =", test_base_mae_mean)
print("Macroecon + HDB data - test set Mean Absolute Percentage Error =", test_base_mape_mean)

Macroecon + HDB data - test set r2 score = 0.98
Macroecon + HDB data - test set Mean Absolute Error = 17535.12
Macroecon + HDB data - test set Mean Absolute Percentage Error = 0.06
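For reference, the three metrics reported above can be computed by hand. The sketch below uses four made-up prices (not values from the dataset) and compares the manual formulas with the scikit-learn functions. Note that `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a value of 0.07 corresponds to roughly 7%.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

# Made-up actual and predicted prices for illustration
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_hat = np.array([310_000.0, 430_000.0, 540_000.0, 600_000.0])

# MAE: average absolute difference, in the same units as the target
mae_manual = np.mean(np.abs(y_true - y_hat))

# MAPE: average absolute difference relative to the actual value (a fraction)
mape_manual = np.mean(np.abs((y_true - y_hat) / y_true))

# R^2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The library functions should agree with the manual formulas
mae_sklearn = mean_absolute_error(y_true, y_hat)
mape_sklearn = mean_absolute_percentage_error(y_true, y_hat)
r2_sklearn = r2_score(y_true, y_hat)
```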


#### Hypothesis Test
- Null Hypothesis (H0): There is no difference in the performance of the models (the means of the errors from both models are equal).
- Alternative Hypothesis (H1): There is a difference in the performance of the models (the means of the errors from both models are not equal).
- Significance level of 0.05

To test this hypothesis:
1) Collect the residuals: for each model, calculate the difference between the actual and predicted values for each data point in the test set.
2) Perform a paired t-test: for each data point, take the difference between the two models' residuals, then test whether the mean of these differences is zero.
3) Interpret the statistics: the t-statistic expresses how far the mean difference is from zero in units of its standard error. The p-value is the probability of observing a result at least as extreme as the one obtained if the null hypothesis (no difference between the two sets of residuals) were true.
4) Decide: if the p-value is less than 0.05, reject the null hypothesis, which would suggest a significant difference in model performance.
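The steps above can be sketched on synthetic residuals to show what `stats.ttest_rel` computes. The arrays `res_a` and `res_b` are randomly generated for illustration and are not the notebook's residuals. The manual t-statistic is the mean of the paired differences divided by its standard error, and the two-sided p-value comes from a t-distribution with n - 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Synthetic paired residuals (illustrative only)
rng = np.random.default_rng(0)
res_a = rng.normal(0, 1, 200)
res_b = res_a + rng.normal(0.05, 0.5, 200)

# Paired differences, one per data point
d = res_a - res_b

# t = mean difference / standard error of the mean difference
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Two-sided p-value from the t-distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=len(d) - 1)

# The same test via SciPy
t_scipy, p_scipy = stats.ttest_rel(res_a, res_b)
```

The sign of the statistic follows the order of the arguments: a negative value means the first array has the lower mean.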


In [29]:
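# Calculate residuals (actual minus predicted) for each model against its own test split.
# Both splits use random_state=54, so the test rows line up for the paired test.
residuals_only_hdb = y_test_only_hdb - y_pred_only_hdb
residuals_hdb_macro = y_test - y_pred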

In [30]:
# Perform the paired t-test
t_statistic, p_value = stats.ttest_rel(residuals_only_hdb, residuals_hdb_macro)

print("T-statistic:", t_statistic)
print("P-value:", p_value)

T-statistic: -1.547679604336388
P-value: 0.12170123203723181


**Test Results**
- The t-statistic is negative, indicating that the mean of the residuals for the first set (HDB data only) is lower than the mean of the residuals for the second set (HDB + macroeconomic data). However, the magnitude of this value is small.
- The p-value (0.12) is greater than the chosen significance level (0.05). 
- The evidence is not strong enough to reject the null hypothesis, suggesting no significant difference in performance.
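One point worth noting is that residuals are signed, so a paired t-test on them compares the average bias of the two models rather than the size of their errors. As a hedged sketch of a possible complementary check (not something the notebook runs), the same paired test could be applied to absolute errors. The data below are synthetic and the names `pred_a` and `pred_b` are hypothetical: both sets of predictions are unbiased, but `pred_a` is noisier.

```python
import numpy as np
from scipy import stats

# Synthetic prices and two sets of unbiased predictions (illustrative only)
rng = np.random.default_rng(1)
y_true = rng.normal(500_000, 100_000, 1_000)
pred_a = y_true + rng.normal(0, 30_000, 1_000)  # noisier predictions
pred_b = y_true + rng.normal(0, 20_000, 1_000)  # less noisy predictions

# Paired t-test on signed residuals: compares mean bias
t_signed, p_signed = stats.ttest_rel(y_true - pred_a, y_true - pred_b)

# Paired t-test on absolute errors: compares mean error magnitude
t_abs, p_abs = stats.ttest_rel(
    np.abs(y_true - pred_a), np.abs(y_true - pred_b)
)
```

In this synthetic setup the absolute-error test detects that `pred_a` has larger errors, while the signed-residual test has no systematic difference in bias to detect. Whether the same pattern holds for the HDB models would need to be checked on the actual predictions.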

### Conclusion
The model using only HDB data achieved a testing R² score of 0.97, a Mean Absolute Error (MAE) of 20,716.63, and a Mean Absolute Percentage Error (MAPE) of 0.07. In contrast, the model that combined HDB data with macroeconomic variables like interest rates, inflation, and GDP recorded a marginally higher R² score of 0.98, indicating a better fit. It also achieved a lower MAE of 17,535.12 and a MAPE of 0.06, signifying slightly more accurate predictions.

Although these results align with the initial hypothesis that including macroeconomic factors affects the prediction of house prices, a paired t-test at a significance level of 0.05 indicated no significant difference in the performance of the models.

Statistical significance, however, does not necessarily equate to practical significance: a difference that is not statistically significant may still be of practical importance. As such, for our use case of projecting into the future, we will continue to use the macroeconomic data.