# HDB Resale Price Regression Models

**Models implemented:**
1. Linear Regression (OLS)
2. Non-linear Regression (polynomial terms for selected variables)
3. IV Regression (2SLS, exploratory)

**Common settings:**
- 80/20 train-test split
- 5-fold Cross-Validation (CV) for hyperparameter tuning
- RMSE (computed on log resale price) used as the main metric

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


## Load and Prepare Data

In [2]:
DATA_PATH = "../data/HDB_data_2021_sample.xlsx"

df = pd.read_excel(DATA_PATH)

# drop rows with missing resale_price
df = df.dropna(subset=["resale_price"])

# define target: log(price) reduces right skew in prices and lets
# coefficients be read as approximate percentage effects
df["log_resale_price"] = np.log(df["resale_price"])

In [3]:
# features: every column except the raw price, the log-price target,
# and year (year is constant at 2021)
drop_cols_full = ["resale_price", "log_resale_price", "year"]
X_full = df.drop(columns=drop_cols_full)
y = df["log_resale_price"].values

## Train-Test Split (80/20)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.2, random_state=42
)

## Utility: Compute RMSE

In [5]:
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

## Linear Regression (OLS)

In [6]:
# baseline: simple OLS on full feature set
ols_baseline = LinearRegression()
ols_baseline.fit(X_train, y_train)

# predictions
y_train_pred_ols = ols_baseline.predict(X_train)
y_test_pred_ols = ols_baseline.predict(X_test)

In [7]:
# evaluate
ols_train_rmse = rmse(y_train, y_train_pred_ols)
ols_test_rmse = rmse(y_test, y_test_pred_ols)

ols_train_mae = mean_absolute_error(y_train, y_train_pred_ols)
ols_test_mae = mean_absolute_error(y_test, y_test_pred_ols)

ols_train_r2 = r2_score(y_train, y_train_pred_ols)
ols_test_r2 = r2_score(y_test, y_test_pred_ols)

print("\n========================")
print("Baseline OLS (no polynomial terms)")
print("========================")
print(f"Train RMSE: {ols_train_rmse:.4f}")
print(f"Test  RMSE: {ols_test_rmse:.4f}")
print(f"Train MAE:  {ols_train_mae:.4f}")
print(f"Test  MAE:  {ols_test_mae:.4f}")
print(f"Train R^2:  {ols_train_r2:.4f}")
print(f"Test  R^2:  {ols_test_r2:.4f}")


Baseline OLS (no polynomial terms)
Train RMSE: 0.0732
Test  RMSE: 0.0772
Train MAE:  0.0561
Test  MAE:  0.0594
Train R^2:  0.9470
Test  R^2:  0.9429
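
Because the target is `log(resale_price)`, the RMSE and MAE above are in log units rather than dollars. The sketch below shows one way to read them: a small log-scale RMSE is roughly a typical relative error, and predictions can be mapped back with `exp` to get an RMSE on the price scale. The helper names and the toy prices are hypothetical and only illustrate the call pattern. Note that `exp` of a log prediction is not an unbiased estimate of the mean price, so the price-scale figure is indicative only.

```python
import numpy as np

def log_rmse_to_pct(log_rmse):
    """Approximate typical relative error implied by an RMSE on log(price)."""
    return np.exp(log_rmse) - 1.0

def rmse_in_price_units(y_true_log, y_pred_log):
    """RMSE after mapping log-price values back to the price scale."""
    price_true = np.exp(y_true_log)
    price_pred = np.exp(y_pred_log)
    return np.sqrt(np.mean((price_true - price_pred) ** 2))

# Test RMSE reported in the baseline OLS output
test_log_rmse = 0.0772
pct_error = log_rmse_to_pct(test_log_rmse)
print(f"Approx. typical relative error: {pct_error:.1%}")

# Hypothetical toy prices, not taken from the HDB data
toy_true = np.log(np.array([400_000.0, 550_000.0, 620_000.0]))
toy_pred = np.log(np.array([410_000.0, 530_000.0, 640_000.0]))
toy_price_rmse = rmse_in_price_units(toy_true, toy_pred)
print(f"Toy price-scale RMSE: {toy_price_rmse:,.0f}")
```

In the notebook itself, the same helper could be called as `rmse_in_price_units(y_test, y_test_pred_ols)`.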


## OLS with Polynomial Terms

In [9]:
degrees = [1, 2]
cv_results = []

for deg in degrees:
    model = Pipeline([
        ("poly", PolynomialFeatures(degree=deg, include_bias=False)),
        ("ols", LinearRegression())
    ])

    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    neg_mse = cross_val_score(
        model, X_train, y_train,
        scoring="neg_mean_squared_error",
        cv=kf, n_jobs=-1
    )

    rmse_scores = np.sqrt(-neg_mse)

    cv_results.append({
        "degree": deg,
        "cv_rmse_mean": rmse_scores.mean(),
        "cv_rmse_std": rmse_scores.std()
    })

print("\n==== Polynomial OLS — CV RMSE ====")
for res in cv_results:
    print(f"Degree {res['degree']}: RMSE = {res['cv_rmse_mean']:.4f} (std {res['cv_rmse_std']:.4f})")

# select best degree
best_degree = min(cv_results, key=lambda x: x["cv_rmse_mean"])["degree"]
print(f"\nBest polynomial degree: {best_degree}")


==== Polynomial OLS — CV RMSE ====
Degree 1: RMSE = 0.0770 (std 0.0011)
Degree 2: RMSE = 11.1991 (std 7.3566)

Best polynomial degree: 1
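
The degree-2 CV RMSE is orders of magnitude worse than degree 1. One plausible cause, not verified here, is that `PolynomialFeatures` is applied to every column, so any dummy variables are squared and crossed with everything else and the design matrix becomes ill-conditioned. The sketch below shows one possible way to restrict polynomial terms to selected continuous variables, as the model list at the top describes, using `ColumnTransformer`. The data are synthetic and the column names (`floor_area`, `lease_years`, `town_A`, `town_B`) are hypothetical stand-ins, not columns confirmed to exist in the HDB file.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data; column names are hypothetical
rng = np.random.default_rng(42)
n = 500
demo_X = pd.DataFrame({
    "floor_area": rng.uniform(40, 150, n),
    "lease_years": rng.uniform(40, 99, n),
    "town_A": rng.integers(0, 2, n),
    "town_B": rng.integers(0, 2, n),
})
demo_y = (
    12.0
    + 0.01 * demo_X["floor_area"]
    - 0.00002 * demo_X["floor_area"] ** 2
    + 0.005 * demo_X["lease_years"]
    + 0.1 * demo_X["town_A"]
    + rng.normal(0, 0.05, n)
).to_numpy()

# Only these continuous columns receive polynomial terms
poly_cols = ["floor_area", "lease_years"]

def make_selective_poly_model(degree):
    """OLS with polynomial terms on poly_cols only; other columns pass through."""
    numeric_poly = Pipeline([
        ("scale", StandardScaler()),  # keeps squared terms on a comparable scale
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
    ])
    features = ColumnTransformer(
        [("num_poly", numeric_poly, poly_cols)],
        remainder="passthrough",  # dummy columns are left unchanged
    )
    return Pipeline([("features", features), ("ols", LinearRegression())])

demo_kf = KFold(n_splits=5, shuffle=True, random_state=42)
demo_cv_rmse = {}
for d in [1, 2]:
    neg_mse = cross_val_score(
        make_selective_poly_model(d), demo_X, demo_y,
        scoring="neg_mean_squared_error", cv=demo_kf,
    )
    demo_cv_rmse[d] = np.sqrt(-neg_mse).mean()
    print(f"Degree {d}: CV RMSE = {demo_cv_rmse[d]:.4f}")
```

Scaling happens inside the pipeline, so each CV fold fits the scaler on its own training portion only. Applying this to the real data would require replacing `poly_cols` with the actual continuous column names in `X_train`.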
