<h1> Stacking model</h1>
This notebook builds a stacking model that blends the three models that performed best on a 30% held-out test split: Random Forest Regressor, Gradient Boosting Regressor, and Multilayer Perceptron. The best configuration and hyperparameters for each model were found by training it separately beforehand. Each model's mean squared error on price per sqft (per_price) is then converted into its weight in the final blend: the lower the error, the larger the weight.

In [34]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import random

In [107]:
# Libraries for MLP
import torch
from mlp import *  # local module; provides BaseNN and houseTestDataset used below
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler

In [262]:
# Reproducibility
def set_seeds(seed):
    """Seed PyTorch, NumPy, and Python's random module."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)

set_seeds(8)
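
Seeding matters because every later cell that draws random numbers depends on the generator state. The following is a minimal sketch, using only Python's `random` module and a hypothetical helper `draw`, of what a fixed seed buys: reseeding with the same value replays the same draws. Separately from the global seeds, scikit-learn estimators such as `RandomForestRegressor` also accept a `random_state` argument.

```python
import random


def draw(seed, n=3):
    """Seed Python's generator and return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(8)
second = draw(8)  # same seed, so the same sequence is replayed
other = draw(9)   # different seed, so a different sequence

print(first == second)
print(first == other)
```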

In [263]:
def format_predicted(predicted):
    """Return a DataFrame with columns ['Id', 'Predicted'], using the row index as Id."""
    df_predicted = pd.DataFrame(predicted)
    df_predicted.columns = ['Predicted']
    df_predicted['Id'] = df_predicted.index
    df_predicted = df_predicted.reindex(columns=['Id', 'Predicted'])
    return df_predicted

In [264]:
# Load data
df_train = pd.read_csv('train_final_complete.csv')
df_test = pd.read_csv('test_final_complete_cleaned.csv')

In [265]:
feature_list = ['built_year', 'num_beds', 'num_baths', 'lat', 'lng', 'size_sqft',
                'tenure_group', 'subzone_per_price_encoded',
                'property_type_ordinal',
                # MRT
                'dist_to_nearest_important_mrt_rounded',
                # Schools
                'number_of_nearby_primary_schools',
                'number_of_nearby_secondary_schools',
                # Shopping malls
                'number_of_nearby_shopping_malls',
                # BN and CR
                'name_of_nearest_BN_ordinal',
                'name_of_nearest_CR_ordinal']
X_train = df_train[feature_list]
y_train = df_train['per_price']
size_sqft_train = df_train['size_sqft']

X_test = df_test[feature_list]
X_train = X_train.reindex(columns=list(X_test.columns)) # unify column order

<h2> Random Forest </h2>

In [266]:
#Fit Model
random_forest_model = RandomForestRegressor(max_depth=50,max_features=8,n_estimators=100)
random_forest_model.fit(X_train, y_train)

RandomForestRegressor(max_depth=50, max_features=8)

In [267]:
# Get prediction error on the train set.
# Note: mean_squared_error returns the MSE, not the RMSE.
y_pred = random_forest_model.predict(X_train)
random_forest_score = mean_squared_error(y_train, y_pred)
print('RandomForestRegressor MSE:', random_forest_score)

RandomForestRegressor MSE: 55343.215727600356


<h2> Gradient Boosting </h2>

In [268]:
#Fit model
gradient_boosting_model = GradientBoostingRegressor(learning_rate=0.5,max_depth=4,n_estimators=400)
gradient_boosting_model.fit(X_train, y_train)

GradientBoostingRegressor(learning_rate=0.5, max_depth=4, n_estimators=400)

In [269]:
# Get prediction error on the train set.
# Note: mean_squared_error returns the MSE, not the RMSE.
y_pred = gradient_boosting_model.predict(X_train)
gradient_boosting_score = mean_squared_error(y_train, y_pred)
print('GradientBoostingRegressor MSE:', gradient_boosting_score)

GradientBoostingRegressor MSE: 50975.49296366274
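
The two scores printed so far come straight from `mean_squared_error`, which returns the mean of the squared residuals, so they are in squared per_price units. As a minimal sketch (not part of the original pipeline), taking the square root gives the error in the same units as per_price. The dictionary names below are illustrative; the numbers are the ones printed above.

```python
import math

# Training-set scores printed by the two cells above (mean squared errors).
mse_values = {
    "random_forest": 55343.215727600356,
    "gradient_boosting": 50975.49296366274,
}

# RMSE is the square root of MSE and is expressed in per_price units.
rmse_values = {name: math.sqrt(mse) for name, mse in mse_values.items()}

for name, rmse in rmse_values.items():
    print(f"{name} RMSE: {rmse:.2f}")
```

Both numbers are measured on the training data, so they are likely to understate the error on unseen listings.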


<h2> Multilayer Perceptron</h2>

In [270]:
def mlp_predict(model, data):
    """Load the pretrained weights from ./model.pth and predict one value per row of data."""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.load_state_dict(torch.load('./model.pth', map_location=device))
    model.to(device)
    model.eval()
    prices = []
    test_dataset = houseTestDataset(data)
    with torch.no_grad():
        # Predict sample by sample; a separate loop variable avoids shadowing `data`
        for sample in tqdm(test_dataset):
            input_tensor = sample.to(device)
            pred = model(input_tensor).detach().cpu().item()
            prices.append(pred)
    return np.asarray(prices)

In [271]:
# Normalize feature data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_normalized = scaler.transform(X_train)
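
`StandardScaler` learns a per-feature mean and population standard deviation in `fit` and applies them in `transform`. The sketch below reproduces that arithmetic for a single feature column in plain Python, to show why data scaled later must reuse the statistics learned from the training data. The column values and the helper `standardize` are hypothetical.

```python
from statistics import fmean, pstdev

# Hypothetical values for one feature column (e.g. a build year).
train_col = [1990.0, 2000.0, 2010.0, 2020.0]
test_col = [2005.0, 2030.0]

# "fit": learn the statistics from the training column only.
mean = fmean(train_col)
std = pstdev(train_col)  # population standard deviation (ddof=0)


def standardize(values, mean, std):
    """Apply (x - mean) / std with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]


# "transform": both splits use the training statistics.
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has zero mean and unit variance. The test column generally does not, which is expected.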

In [278]:
model = BaseNN(len(feature_list))
y_pred = mlp_predict(model, X_train_normalized)
# Alternative: score on total price instead of price per sqft
# mlp_score = mean_squared_error(y_train*size_sqft_train, y_pred*size_sqft_train)
# Note: mean_squared_error returns the MSE, not the RMSE.
mlp_score = mean_squared_error(y_train, y_pred)
print('MLP MSE:', mlp_score)

100%|██████████| 20003/20003 [00:01<00:00, 13543.35it/s]

MLP MSE: 99874.31874259491

<h2> Stack Results </h2>

<h3>Random forest + gradient boosting + mlp </h3>

In [273]:
# Weight each model by its inverse error, normalized so the weights sum to 1
sum_scores = (1/random_forest_score) + (1/gradient_boosting_score) + (1/mlp_score)
random_forest_weight = (1/random_forest_score) / sum_scores
gradient_boosting_weight = (1/gradient_boosting_score) / sum_scores
mlp_weight = 1 - random_forest_weight - gradient_boosting_weight
print('Random Forest weight:', random_forest_weight, 'GradientBoosting weight:', gradient_boosting_weight, 'MLP weight:', mlp_weight)

Random Forest weight: 0.3788149426695488 GradientBoosting weight: 0.41127286611909625 MLP weight: 0.2099121912113549
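
The weights above follow from inverse-error weighting: each model gets a share proportional to 1/error, and the shares are normalized to sum to 1. As a sanity check, this sketch recomputes them from the three scores printed earlier in the notebook. The helper name `inverse_error_weights` is illustrative and not part of the original code.

```python
# Training-set scores printed by the cells above.
scores = {
    "random_forest": 55343.215727600356,
    "gradient_boosting": 50975.49296366274,
    "mlp": 99874.31874259491,
}


def inverse_error_weights(errors):
    """Return weights proportional to 1/error, normalized to sum to 1."""
    inverse = {name: 1.0 / err for name, err in errors.items()}
    total = sum(inverse.values())
    return {name: inv / total for name, inv in inverse.items()}


weights = inverse_error_weights(scores)
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

The model with the lowest error (gradient boosting) receives the largest weight, and the MLP, with roughly twice the error, receives about half as much.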


In [274]:
random_forest_prediction = random_forest_model.predict(X_test)
gradient_boosting_prediction = gradient_boosting_model.predict(X_test)

X_test_normalized = scaler.transform(X_test)
mlp_prediction = mlp_predict(model,X_test_normalized)

100%|██████████| 6966/6966 [00:00<00:00, 14271.31it/s]


In [275]:
# Blend the per-sqft predictions, then multiply by size to get the total price
final_prediction = (
    random_forest_weight * random_forest_prediction
    + gradient_boosting_weight * gradient_boosting_prediction
    + mlp_weight * mlp_prediction
) * df_test.size_sqft
df_pred = format_predicted(final_prediction)
df_pred.to_csv('./stacking_pred.csv',index = False)
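
The blending step is a weighted average of the per-sqft predictions, and multiplying by the unit size converts that average into a total price. Here is a minimal sketch of that arithmetic in plain Python. The predictions and sizes are hypothetical, and the weights are the values printed above rounded to two decimals.

```python
# Weights rounded from the values printed above; they sum to 1.
weights = {"rf": 0.38, "gb": 0.41, "mlp": 0.21}

# Hypothetical per-sqft predictions for two listings.
per_sqft = {
    "rf": [1000.0, 1500.0],
    "gb": [1100.0, 1400.0],
    "mlp": [900.0, 1600.0],
}
size_sqft = [800.0, 1200.0]  # hypothetical unit sizes

# Weighted average of the per-sqft predictions, listing by listing.
blended = [
    sum(weights[m] * per_sqft[m][i] for m in weights)
    for i in range(len(size_sqft))
]

# Convert price per sqft into total price.
total_price = [b * s for b, s in zip(blended, size_sqft)]

print(blended)
print(total_price)
```

Because the weights are non-negative and sum to 1, each blended value lies between the smallest and largest individual prediction for that listing.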

<h3> Random forest + gradient boosting </h3>

In [276]:
# Recompute the inverse-error weights for the two tree-based models only
sum_scores = (1/random_forest_score) + (1/gradient_boosting_score)
random_forest_weight = (1/random_forest_score) / sum_scores
gradient_boosting_weight = 1 - random_forest_weight
print('Random Forest weight:', random_forest_weight, 'GradientBoosting weight:', gradient_boosting_weight)

Random Forest weight: 0.47945929358199335 GradientBoosting weight: 0.5205407064180066


In [277]:
# Blend the per-sqft predictions, then multiply by size to get the total price
final_prediction = (
    random_forest_weight * random_forest_prediction
    + gradient_boosting_weight * gradient_boosting_prediction
) * df_test.size_sqft
df_pred = format_predicted(final_prediction)
df_pred.to_csv('./stacking_pred_nomlp.csv',index = False)