# Penalised, PCR, and Stepwise Models for HDB Resale Prices

This notebook implements the following modelling families:

Penalised Regression Models
- **LASSO**
- **Ridge**
- **Elastic Net**

Dimension Reduction
- **Principal Component Regression (PCR)**

Variable Selection
- **Forward Stepwise OLS using AIC**
- **Forward Stepwise OLS using BIC**

Common Settings
- Target variable: `log_resale_price = log(resale_price)`
- Train–Test Split: 80/20
- 5-Fold Cross-Validation for hyperparameter search
- Performance Metrics: **RMSE and MAE**
- Penalised models and PCR use **standardised predictors**
- Stepwise selection uses a **small, interpretable feature set**
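
PCR, as used in the settings above (standardise, project onto principal components, then fit OLS), can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data, not the notebook's own PCR code; the step names `scaler`, `pca`, and `ols`, the demo arrays, and the grid values are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 10 predictors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# PCR = standardise -> PCA -> OLS on the retained components.
pcr = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Pick the number of components by 5-fold CV (grid values are illustrative).
pcr_grid = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": [2, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
pcr_grid.fit(X_demo, y_demo)
print(pcr_grid.best_params_)
```

Because PCA chooses components without looking at the target, a small number of components can discard directions that matter for prediction, which is one reason PCR may trail the penalised models.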

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.pipeline import Pipeline

## Load and Prepare Data

In [2]:
DATA_PATH = "../data/HDB_data_2021_sample.xlsx"

df = pd.read_excel(DATA_PATH)

# drop rows where resale_price is missing
df = df.dropna(subset=["resale_price"])

# target variable: log(price)
df["log_resale_price"] = np.log(df["resale_price"])

# predictors for penalised models & PCR
drop_cols_full = ["resale_price", "log_resale_price", "year"]
drop_cols_full = [c for c in drop_cols_full if c in df.columns]

X_full = df.drop(columns=drop_cols_full)
y = df["log_resale_price"].values

feature_names_full = X_full.columns.tolist()

print("Number of observations:", X_full.shape[0])
print("Number of predictors (full):", X_full.shape[1])

Number of observations: 6000
Number of predictors (full): 228


## Train–Test Split & Scaling

In [3]:
# 80/20 split
X_full_train, X_full_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.2, random_state=42
)

# standardisation (important for penalised models and PCA);
# the scaler is fit on the training split only, then applied to the test split
scaler = StandardScaler()
X_full_train_scaled = scaler.fit_transform(X_full_train)
X_full_test_scaled = scaler.transform(X_full_test)

# performance metrics
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

def mae(y_true, y_pred):
    return mean_absolute_error(y_true, y_pred)
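
One caveat with scaling up front: any later cross-validation splits the already-scaled training data, so each validation fold has contributed to the scaler's means and variances. The effect is usually small, but placing the scaler and the model in one `Pipeline` refits the scaler inside every fold. A minimal sketch on synthetic data (the arrays and the names `pipe` and `search` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (assumption).
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(300, 3))
y_demo = 0.3 * X_demo[:, 0] + 0.02 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# The scaler is a pipeline step, so it is refit on each CV training fold.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```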

## Helper: Penalised Model Runner

In [4]:
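def run_penalised_model(model_name, base_model, param_grid,
                        X_train, y_train, X_test, y_test, feature_names):
    """Fit a baseline and a CV-tuned penalised model and report test RMSE/MAE.

    Returns (best_model, best_rmse, best_mae, best_params).
    """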

    print(f"\n=== {model_name} ===")

    # baseline: default hyperparameters
    base_model.fit(X_train, y_train)
    y_pred_test_base = base_model.predict(X_test)
    baseline_rmse = rmse(y_test, y_pred_test_base)
    baseline_mae  = mae(y_test, y_pred_test_base)

    print(f"{model_name} Baseline - Test RMSE: {baseline_rmse:.4f}, MAE: {baseline_mae:.4f}")

    # hyperparameter tuning via CV
    grid = GridSearchCV(
        estimator=base_model,
        param_grid=param_grid,
        scoring="neg_mean_squared_error",
        cv=5,
        n_jobs=-1
    )
    grid.fit(X_train, y_train)

    best_model = grid.best_estimator_
    y_pred_test_best = best_model.predict(X_test)
    best_rmse = rmse(y_test, y_pred_test_best)
    best_mae  = mae(y_test, y_pred_test_best)

    print(f"{model_name} Tuned - Best Params: {grid.best_params_}")
    print(f"{model_name} Tuned - Test RMSE: {best_rmse:.4f}, MAE: {best_mae:.4f}")

    # coefficients on the standardised predictors, ranked by absolute size
    # (signs are dropped by .abs())
    coefs = pd.Series(best_model.coef_, index=feature_names)
    print(f"\nTop 10 {model_name} coefficients by absolute magnitude:")
    print(coefs.abs().sort_values(ascending=False).head(10))

    return best_model, best_rmse, best_mae, grid.best_params_

## Penalised Models: LASSO, Ridge, Elastic Net

In [5]:
results = []  # store final results

# LASSO
lasso_base = Lasso(max_iter=5000, random_state=42)
lasso_grid = {"alpha": [0.0005, 0.001, 0.01, 0.1, 1.0]}

best_lasso, lasso_rmse, lasso_mae, lasso_params = run_penalised_model(
    "LASSO", lasso_base, lasso_grid,
    X_full_train_scaled, y_train,
    X_full_test_scaled, y_test,
    feature_names_full
)
results.append({"Model": "LASSO", "Test_RMSE": lasso_rmse, "Test_MAE": lasso_mae,
                "Best_Params": lasso_params})

# Ridge
ridge_base = Ridge()
ridge_grid = {"alpha": [0.1, 1.0, 10.0, 100.0]}

best_ridge, ridge_rmse, ridge_mae, ridge_params = run_penalised_model(
    "Ridge", ridge_base, ridge_grid,
    X_full_train_scaled, y_train,
    X_full_test_scaled, y_test,
    feature_names_full
)
results.append({"Model": "Ridge", "Test_RMSE": ridge_rmse, "Test_MAE": ridge_mae,
                "Best_Params": ridge_params})

# Elastic Net
enet_base = ElasticNet(max_iter=5000, random_state=42)
enet_grid = {"alpha": [0.0005, 0.001, 0.01, 0.1, 1.0],
             "l1_ratio": [0.2, 0.5, 0.8]}

best_enet, enet_rmse, enet_mae, enet_params = run_penalised_model(
    "ElasticNet", enet_base, enet_grid,
    X_full_train_scaled, y_train,
    X_full_test_scaled, y_test,
    feature_names_full
)
results.append({"Model": "Elastic Net", "Test_RMSE": enet_rmse, "Test_MAE": enet_mae,
                "Best_Params": enet_params})


=== LASSO ===
LASSO Baseline - Test RMSE: 0.3236, MAE: 0.2571
LASSO Tuned - Best Params: {'alpha': 0.0005}
LASSO Tuned - Test RMSE: 0.0774, MAE: 0.0599

Top 10 LASSO coefficients by absolute magnitude:
floor_area_sqm           0.178885
Remaining_lease          0.154745
Dist_CBD                 0.062814
mature                   0.031116
Dist_nearest_GHawker     0.030710
storey_range_01.TO.03    0.027989
flat_type_3.ROOM         0.027887
max_floor_lvl            0.026193
Dist_nearest_station     0.025844
postal_2digits_44        0.024547
dtype: float64

=== Ridge ===
Ridge Baseline - Test RMSE: 0.0773, MAE: 0.0595
Ridge Tuned - Best Params: {'alpha': 1.0}
Ridge Tuned - Test RMSE: 0.0773, MAE: 0.0595

Top 10 Ridge coefficients by absolute magnitude:
floor_area_sqm             0.162985
Remaining_lease            0.153364
Dist_CBD                   0.097727
Dist_nearest_GAI_jc        0.073005
Dist_nearest_university    0.071881
Dist_nearest_jc            0.042982
flat_type_3.ROOM          

## Stepwise Feature Subset (Interpretable Variables)

In [8]:
# manually selected interpretable features
nonlinear_features = [
    "floor_area_sqm",
    "Remaining_lease",
    "max_floor_lvl",
    "mature",
    "Dist_CBD",
    "Dist_nearest_station",
    "Dist_nearest_hospital",
]

nonlinear_features = [f for f in nonlinear_features if f in df.columns]

step_df = df[nonlinear_features + ["log_resale_price"]].dropna()

X_small = step_df[nonlinear_features]
y_small = step_df["log_resale_price"]

X_small_train, X_small_test, y_small_train, y_small_test = train_test_split(
    X_small, y_small, test_size=0.2, random_state=42
)

print("Stepwise subset observations:", X_small.shape[0])
print("Candidate features:", nonlinear_features)

Stepwise subset observations: 6000
Candidate features: ['floor_area_sqm', 'Remaining_lease', 'max_floor_lvl', 'mature', 'Dist_CBD', 'Dist_nearest_station', 'Dist_nearest_hospital']


## Compute AIC/BIC

In [10]:
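def compute_ic(y_true, y_pred, k, criterion="AIC"):
    """Gaussian AIC or BIC for a fitted linear model with k estimated coefficients."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = y_true.shape[0]

    rss = np.sum((y_true - y_pred)**2)
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance

    # Gaussian log-likelihood evaluated at the MLE
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

    criterion = criterion.upper()
    if criterion == "AIC":
        return 2 * k - 2 * loglik
    if criterion == "BIC":
        return k * np.log(n) - 2 * loglik
    raise ValueError(f"Unknown criterion: {criterion!r}")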
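
Because the Gaussian log-likelihood at the MLE depends on the data only through the RSS, both criteria reduce to `n * ln(RSS / n)` plus a penalty and a constant that is identical for every model fitted to the same rows. A small numeric check of that identity, using made-up values for `n`, `rss`, and `k`:

```python
import numpy as np

n, rss, k = 100, 4.0, 3  # hypothetical values, for illustration only

# Long form: via the Gaussian log-likelihood at the MLE.
sigma2 = rss / n
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
aic = 2 * k - 2 * loglik
bic = k * np.log(n) - 2 * loglik

# Short form: n * ln(RSS / n) + penalty + model-independent constant.
const = n * (np.log(2 * np.pi) + 1)
aic_short = n * np.log(rss / n) + 2 * k + const
bic_short = n * np.log(rss / n) + k * np.log(n) + const

print(np.isclose(aic, aic_short), np.isclose(bic, bic_short))

# The two criteria differ only in the penalty: BIC - AIC = k * (ln n - 2).
print(np.isclose(bic - aic, k * (np.log(n) - 2)))
```

Since `ln n > 2` once `n` is at least 8, BIC penalises each extra coefficient more heavily than AIC on any realistically sized sample, so it tends to select the same or fewer features.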

## Forward Stepwise (AIC/BIC)

In [11]:
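def forward_stepwise_ic(X, y, candidate_features, criterion="AIC", tol=1e-4, verbose=True):
    """Greedy forward selection: at each step add the feature that lowers the
    AIC/BIC the most, and stop when no candidate improves it by more than tol."""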

    selected = []
    remaining = list(candidate_features)
    best_ic = np.inf

    while remaining:
        ic_with_candidates = []

        for feat in remaining:
            trial_features = selected + [feat]

            # Fit OLS with sklearn
            X_trial = X[trial_features].values
            model = LinearRegression()
            model.fit(X_trial, y)

            y_pred = model.predict(X_trial)
            k = len(trial_features) + 1  # coeffs + intercept

            ic_val = compute_ic(y, y_pred, k, criterion)
            ic_with_candidates.append((ic_val, feat))

            if verbose:
                print(f"{criterion}: try adding {feat:>25s} -> {ic_val:.4f}")

        ic_with_candidates.sort(key=lambda x: x[0])
        best_new_ic, best_new_feat = ic_with_candidates[0]

        if best_new_ic + tol < best_ic:
            selected.append(best_new_feat)
            remaining.remove(best_new_feat)
            best_ic = best_new_ic

            if verbose:
                print(f"--> Added {best_new_feat}, new best {criterion} = {best_ic:.4f}\n")
        else:
            if verbose:
                print(f"No improvement in {criterion}. Stopping.")
            break

    return selected, best_ic
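
To see the greedy procedure recover a known signal, the same idea can be run on a toy dataset. This is a compact, self-contained variant rather than the function above: it uses BIC with the model-independent constant dropped, and the data, the column names `x0` to `x4`, and the helper `bic` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def bic(y, y_pred, k):
    """BIC up to an additive constant shared by all models on the same rows."""
    n = len(y)
    rss = np.sum((np.asarray(y) - y_pred) ** 2)
    return n * np.log(rss / n) + k * np.log(n)

# Toy data: only x0 and x1 drive the response; x2..x4 are pure noise.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(500, 5)), columns=[f"x{i}" for i in range(5)])
y_toy = 3.0 * X_toy["x0"] - 2.0 * X_toy["x1"] + rng.normal(scale=0.5, size=500)

selected, remaining, best = [], list(X_toy.columns), np.inf
while remaining:
    scores = []
    for feat in remaining:
        cols = selected + [feat]
        fit = LinearRegression().fit(X_toy[cols], y_toy)
        k = len(cols) + 1  # coefficients + intercept
        scores.append((bic(y_toy, fit.predict(X_toy[cols]), k), feat))
    score, feat = min(scores)  # candidate with the lowest BIC
    if score >= best:
        break  # no candidate improves the criterion
    selected.append(feat)
    remaining.remove(feat)
    best = score

print(selected)
```

The two informative features should be picked first; whether any noise column slips in afterwards depends on the sample.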

## Run Stepwise AIC & BIC

In [12]:
candidate_features = list(X_small_train.columns)

### --- AIC ---
print("\n=== Forward Stepwise (AIC) ===")
selected_aic, best_aic = forward_stepwise_ic(
    X_small_train, y_small_train, candidate_features, "AIC"
)

aic_model = LinearRegression()
aic_model.fit(X_small_train[selected_aic], y_small_train)

y_aic_test_pred = aic_model.predict(X_small_test[selected_aic])
aic_rmse = rmse(y_small_test, y_aic_test_pred)
aic_mae = mae(y_small_test, y_aic_test_pred)

results.append({"Model": "Stepwise (AIC)", "Test_RMSE": aic_rmse,
                "Test_MAE": aic_mae, "Best_Params": {"selected_features": selected_aic}})


### --- BIC ---
print("\n=== Forward Stepwise (BIC) ===")
selected_bic, best_bic = forward_stepwise_ic(
    X_small_train, y_small_train, candidate_features, "BIC"
)

bic_model = LinearRegression()
bic_model.fit(X_small_train[selected_bic], y_small_train)

y_bic_test_pred = bic_model.predict(X_small_test[selected_bic])
bic_rmse = rmse(y_small_test, y_bic_test_pred)
bic_mae = mae(y_small_test, y_bic_test_pred)

results.append({"Model": "Stepwise (BIC)", "Test_RMSE": bic_rmse,
                "Test_MAE": bic_mae, "Best_Params": {"selected_features": selected_bic}})


=== Forward Stepwise (AIC) ===
AIC: try adding            floor_area_sqm -> -379.9940
AIC: try adding           Remaining_lease -> 1782.5026
AIC: try adding             max_floor_lvl -> 1387.1563
AIC: try adding                    mature -> 2500.0828
AIC: try adding                  Dist_CBD -> 2426.2184
AIC: try adding      Dist_nearest_station -> 2563.1592
AIC: try adding     Dist_nearest_hospital -> 2573.6651
--> Added floor_area_sqm, new best AIC = -379.9940

AIC: try adding           Remaining_lease -> -1642.7671
AIC: try adding             max_floor_lvl -> -3148.6260
AIC: try adding                    mature -> -918.7692
AIC: try adding                  Dist_CBD -> -1660.2105
AIC: try adding      Dist_nearest_station -> -568.9949
AIC: try adding     Dist_nearest_hospital -> -752.5156
--> Added max_floor_lvl, new best AIC = -3148.6260

AIC: try adding           Remaining_lease -> -3519.7010
AIC: try adding                    mature -> -3795.7308
AIC: try adding                  D

## Summary Table

In [13]:
results_df = pd.DataFrame(results)
results_df.sort_values("Test_RMSE")

| | Model | Test_RMSE | Test_MAE | Best_Params |
|---|---|---|---|---|
| 2 | Elastic Net | 0.077115 | 0.05936 | {'alpha': 0.0005, 'l1_ratio': 0.2} |
| 1 | Ridge | 0.077322 | 0.05951 | {'alpha': 1.0} |
| 0 | LASSO | 0.077444 | 0.059862 | {'alpha': 0.0005} |
| 4 | Stepwise (AIC) | 0.119253 | 0.094862 | {'selected_features': ['floor_area_sqm', 'max_... |
| 5 | Stepwise (BIC) | 0.119253 | 0.094862 | {'selected_features': ['floor_area_sqm', 'max_... |
| 3 | PCR | 0.163911 | 0.125199 | {'pca__n_components': 30} |
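
Since the target is the natural log of price, these errors are on the log scale. An RMSE of `r` in logs corresponds to a typical multiplicative error of roughly `exp(r) - 1` in price terms. The conversion below applies that rule of thumb to the RMSE values in the table above; it is a rough reading aid, not an RMSE computed on the price scale.

```python
import numpy as np

# Test RMSE on the log scale, copied from the summary table.
log_rmse = {
    "Elastic Net": 0.077115,
    "Ridge": 0.077322,
    "LASSO": 0.077444,
    "Stepwise (AIC)": 0.119253,
    "Stepwise (BIC)": 0.119253,
    "PCR": 0.163911,
}

# expm1(r) = exp(r) - 1: approximate relative error implied by a log-scale error r.
pct_error = {name: 100 * np.expm1(r) for name, r in log_rmse.items()}

for name, pct in pct_error.items():
    print(f"{name:<15s} ~{pct:.1f}% typical relative error")
```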
