In this notebook I use the HEBO package to perform Bayesian optimisation, tuning the hyperparameters of a random forest regressor that predicts house sale prices.


Step 1. Import packages


In [15]:
import sys
sys.path.append('../')
import pandas as pd
import numpy  as np
import torch
from hebo.design_space.design_space import DesignSpace
from hebo.optimizers.hebo import HEBO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from hebo.sklearn_tuner import sklearn_tuner
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

Step 2. Load dataset

In [44]:
df = pd.read_csv('kc_house_data.csv')
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Step 3. Preprocess the dataset before tuning


In [45]:
# Convert the date string into a numeric feature: days elapsed since 2000-01-01
df['date'] = pd.to_datetime(df['date'], format='%Y%m%dT%H%M%S')
df['days_since_Jan1st2000'] = (df['date'] - pd.Timestamp('2000-01-01')).dt.days
df = df.drop('date', axis=1)
# Convert every column to a numeric dtype where possible
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='ignore')
# Place the price outcome as the last column
columns = df.columns.to_list()
columns.remove('price')
df = df[columns + ['price']]
# Create training and test sets with 99% in training, 1% in test
Xy = np.array(df)
X = Xy[:, :-1]
y = Xy[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=20)
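
To make the date handling above concrete, here is a minimal sketch that applies the same conversion to two date strings copied from the table in Step 2. The names `demo` and `days` are only for illustration and are not part of the notebook.

```python
import pandas as pd

# Two raw date strings copied from the rows displayed in Step 2
demo = pd.DataFrame({'date': ['20141013T000000', '20150225T000000']})

# Same conversion as in the cell above: parse, then count days since 2000-01-01
demo['date'] = pd.to_datetime(demo['date'], format='%Y%m%dT%H%M%S')
days = (demo['date'] - pd.Timestamp('2000-01-01')).dt.days.tolist()

print(days)  # [5399, 5534]
```

Encoding the sale date as a day count keeps the ordering of sales in time while giving the forest a plain numeric feature to split on.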


Step 4. Create the search space for the hyperparameters of the random forest regressor

In [46]:
space_rf = [
    {'name' : 'max_depth', 'type' : 'int', 'lb' : 1, 'ub' : 20},
    {'name' : 'min_samples_leaf', 'type' : 'num', 'lb' : 1e-4, 'ub' : 0.5},
    {'name' : 'max_features', 'type' : 'cat', 'categories' : [ 'sqrt', 'log2']},
    {'name' : 'bootstrap', 'type' : 'bool'},
    {'name' : 'min_impurity_decrease', 'type' : 'pow', 'lb' : 1e-4, 'ub' : 1.0},
]
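
Because `min_samples_leaf` is searched as a float, scikit-learn reads it as a fraction of the training rows and requires at least `ceil(min_samples_leaf * n_samples)` samples in every leaf. The sketch below shows what the bounds of that range mean in absolute terms. The row count `n_rows` is a hypothetical round number chosen for illustration, not the size of this dataset, and `min_leaf_size` is a helper I made up for the example.

```python
import math

def min_leaf_size(fraction: float, n_samples: int) -> int:
    """Minimum samples per leaf when min_samples_leaf is given as a fraction."""
    return math.ceil(fraction * n_samples)

n_rows = 10_000  # hypothetical number of training rows, for illustration only

for fraction in (0.01, 0.1, 0.5):
    print(fraction, min_leaf_size(fraction, n_rows))
```

At the upper bound of 0.5 every leaf must hold half of the rows, so a tree can make at most one split. Good configurations are therefore likely to sit near the lower end of the range.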

Step 5. Tune the random forest regressor model with the HEBO tuner

In [48]:
result = sklearn_tuner(RandomForestRegressor, space_rf, X_train, y_train, metric=r2_score, max_iter=12)

Iter 0, best metric: 0.304875
Iter 1, best metric: 0.530789
Iter 2, best metric: 0.530789
Iter 3, best metric: 0.530789
Iter 4, best metric: 0.530789
Iter 5, best metric: 0.530789
Iter 6, best metric: 0.530789
Iter 7, best metric: 0.530789
Iter 8, best metric: 0.622292
Iter 9, best metric: 0.753243
Iter 10, best metric: 0.769291
Iter 11, best metric: 0.855799
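
To give a feel for what one tuning iteration costs, here is a rough sketch of how a single candidate configuration could be scored. I assume the tuner evaluates each candidate with cross-validated predictions on the training data; the internals of `sklearn_tuner` may differ. The data is synthetic, and `candidate`, `X_demo`, `y_demo` and the reduced `n_estimators` are illustrative choices of mine, not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Small synthetic regression problem standing in for the house price data
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# One hypothetical point from the search space defined in Step 4
candidate = {
    'max_depth': 10,
    'min_samples_leaf': 0.01,
    'max_features': 'sqrt',
    'bootstrap': True,
    'min_impurity_decrease': 1e-4,
}

# Score the candidate on held-out folds rather than on the data it was fitted to
model_demo = RandomForestRegressor(n_estimators=20, random_state=0, **candidate)
y_cv = cross_val_predict(model_demo, X_demo, y_demo, cv=5)
score = r2_score(y_demo, y_cv)

print("cross-validated R^2 for this candidate:", score)
```

An optimiser then only needs to propose the next `candidate` and record its score, which is the part HEBO handles.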


Step 6. Train a random forest regressor with the tuned hyperparameters on the training set, then evaluate the final tuned model on the test set


In [52]:
# Create the model with the tuned hyperparameters
model = RandomForestRegressor(
    max_depth=result['max_depth'],
    min_samples_leaf=result['min_samples_leaf'],
    bootstrap=result['bootstrap'],
    min_impurity_decrease=result['min_impurity_decrease'],
    max_features=result['max_features'],
)
model.fit(X_train, y_train)

# create tuned, trained model predictions for test set
y_pred = model.predict(X_test)

# Evaluate the tuned, trained model on the test set
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("R^2 coefficient of tuned, trained regression model is", r2)


R^2 coefficient of tuned, trained regression model is 0.8532793000391117
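
The cell above also computes `mse` but only prints R^2. As a small illustration of how the two metrics relate, the sketch below works both out by hand on three made-up prices. The values in `y_true_demo` and `y_pred_demo` are invented for the example and are not taken from the dataset.

```python
import math

# Invented sale prices and predictions, for illustration only
y_true_demo = [200000.0, 500000.0, 300000.0]
y_pred_demo = [210000.0, 480000.0, 300000.0]
n = len(y_true_demo)

# Residual sum of squares, and its mean (the MSE)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true_demo, y_pred_demo))
mse_demo = ss_res / n

# RMSE is in the same units as the price, which makes it easier to read
rmse_demo = math.sqrt(mse_demo)

# R^2 compares the residuals with the spread of the true values around their mean
mean_true = sum(y_true_demo) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true_demo)
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.2f}")  # RMSE: 12909.94
print(f"R^2: {r2_demo:.4f}")     # R^2: 0.9893
```

Reporting the square root of `mse` next to R^2 would show the typical size of the prediction error in the same units as the price.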
