This notebook compares several ML algorithms on how well they predict house prices. We'll be considering lasso and ridge regression, random forests, xgboost and maybe LightGBM later. All models will be tuned via Bayesian optimisation to get the best average score across 5-fold cross-validation. Note that here we are predicting log house prices and scoring with RMSE, so our model evaluation needs to be tailored to that.
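To make the metric concrete, here is a minimal sketch of RMSE computed on log prices. The function name `rmse_log` and the prices are made up for illustration and are not taken from the data set.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of actual and predicted prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

# Made-up prices, purely for illustration
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
print(rmse_log(actual, predicted))
```

Because the error is measured on logs, overshooting a cheap house by 10% costs the same as overshooting an expensive one by 10%.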

First we'll import the data and clean it into an ML-friendly format. We'll use a combination of random shuffling and forward filling to get rid of the NAs, then convert the strings into dummy variables. We will not be exploring feature creation, as there are already many features in this data set. Since exploratory data analysis has been covered in the other notebook, we won't repeat it here.
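As a rough illustration of the shuffle-and-forward-fill idea, here is a toy sketch. The frame, its column names and the incrementing seed are all made up for the example; the point is only that each gap ends up holding a value that was actually observed in the same column.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a numeric and a string column (made-up values)
toy = pd.DataFrame({
    "num_col": [65.0, np.nan, 80.0, np.nan, 70.0],
    "cat_col": ["a", None, "b", "a", None],
})

filled = toy.copy()
seed = 0  # incrementing seed, only so the example is reproducible
while filled.isna().sum().sum() > 0:
    # Shuffle the rows, then copy the previous row's value into each gap
    filled = filled.sample(frac=1, random_state=seed).ffill()
    seed += 1

# Restore the original row order
filled = filled.sort_index()
print(filled)
```

A gap that lands in the first row after a shuffle has nothing above it to copy, which is why the shuffle and fill are repeated until no NaNs remain.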

In [70]:
import pandas as pd
import numpy as np
import time

train = pd.read_csv('Data/train.csv')
test = pd.read_csv('Data/test.csv')

In [71]:
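#Setting up forward filling function

def fill_nas(df):
    #Repeatedly shuffle the rows and forward fill until no NaNs remain,
    #so each gap is filled with an observed value from the same column
    #Note: this loops forever if a column is entirely NaN

    #Keeping track of time
    t0 = time.time()

    #Counting NaNs
    na_count = df.isna().sum().sum()

    while na_count > 0:
        df = df.sample(frac=1)
        df = df.ffill(limit=11)
        na_count = df.isna().sum().sum()

    #Restoring the original row order
    filled_df = df.sort_index()

    #Calculating time taken
    t1 = time.time()
    print(t1 - t0)

    #Return filled df
    return filled_df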

In [72]:
train = fill_nas(train)
test = fill_nas(test)

0.13582897186279297
0.10307097434997559


In [73]:
#Stripping SalePrice from the training data and combining with the test data
SalePrice = np.log(train['SalePrice'])
train = train.drop('SalePrice',axis=1)
X = pd.concat([train,test])


In [74]:
#Getting strings and numerics
#MSSubClass is a categorical variable, so we convert it to a string 
X['MSSubClass'] = X['MSSubClass'].apply(str) 
numerics = X.select_dtypes(exclude='object')
strings = X.select_dtypes(include='object')

#Converting strings to dummies and joining with numerics
dummies = pd.get_dummies(strings)
X = pd.concat([numerics,dummies],axis=1)

#Splitting into train and test data
X_train = X.iloc[:train.shape[0],]
X_test = X.iloc[train.shape[0]:,]
train = pd.concat([SalePrice,X_train],axis=1)
print(X_train.shape, X_test.shape)
print(train.shape, test.shape)

(1460, 304) (1459, 304)
(1460, 305) (1459, 80)


Now that the initial data preprocessing is done, we'll set up the cross-validation and Bayesian optimisation procedures for each model.
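The scoring functions in this notebook all share one convention: scikit-learn's `'neg_mean_squared_error'` scorer returns a negated MSE per fold, which is turned into an RMSE per fold, averaged, and negated again so that a maximising optimiser can use it. A minimal sketch of just that conversion, where the helper name `neg_mean_rmse` and the fold scores are made up for illustration:

```python
import numpy as np

def neg_mean_rmse(neg_mse_scores):
    """Turn per-fold negated MSEs into a single negated mean RMSE."""
    rmse_per_fold = np.sqrt(-np.asarray(neg_mse_scores, dtype=float))
    return float(-np.mean(rmse_per_fold))

# Made-up fold scores, in the form cross_val_score returns them
fold_scores = [-0.0169, -0.0196, -0.0144, -0.0225, -0.0121]
print(neg_mean_rmse(fold_scores))
```

With this sign convention a better model has a target closer to zero, so maximising the target is the same as minimising the average RMSE.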

In [59]:
#Setting up optimisation for Ridge regression model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

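def ridge_score(alpha):
    #Shuffling rows first, since cv=5 splits the data into consecutive, unshuffled folds
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(estimator=Ridge(alpha=alpha), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    #Converting negated MSE per fold to RMSE, then negating the mean since the optimiser maximises
    return(-np.average(np.sqrt(-scores)))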

bounds_ridge = {'alpha': (0,1000)}

In [60]:
#Setting up optimisation for Lasso model
from sklearn.linear_model import Lasso

def lasso_score(alpha):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(estimator=Lasso(alpha=alpha), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_lasso = {'alpha': (1e-19,0.1)}

In [61]:
#Creating optimizer for a Random Forests regressor
from sklearn.ensemble import RandomForestRegressor as RF

def RF_score(n_estimators,max_depth,min_samples_split,min_samples_leaf,max_features):
    
    #Constraining hyperparameters to be integers (e.g. number of decision trees can't be continuous!)
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    min_samples_split = int(min_samples_split)
    min_samples_leaf = int(min_samples_leaf)
    max_features = int(max_features)
    
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(
        estimator=RF(
                    n_estimators=n_estimators, 
                    max_depth=max_depth, 
                    min_samples_split=min_samples_split,
                    min_samples_leaf = min_samples_leaf,
                    max_features = max_features),
    X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_RF = {
    'n_estimators': (1,3000),
    'max_depth': (1,100),
    'min_samples_split': (2,200),
    'min_samples_leaf': (1,200),
    'max_features': (1,290)
}
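One thing to keep in mind with the integer casts above: the optimiser proposes continuous values, and `int()` truncates toward zero rather than rounding. A minimal sketch of an alternative that rounds to the nearest integer and clips to the stated bounds; the helper name `to_int_param` is hypothetical and not part of any library.

```python
def to_int_param(value, lower, upper):
    """Round a continuous proposal to the nearest integer and clip it to [lower, upper]."""
    return int(min(max(round(value), lower), upper))

# int() truncates, so a proposal just below an integer loses almost a full unit
print(int(9.988), to_int_param(9.988, 1, 100))

# Clipping keeps the result inside the bounds even if the proposal drifts outside
print(to_int_param(0.4, 1, 100), to_int_param(3000.6, 1, 3000))
```

Whether truncation or rounding is preferable is a judgment call; the clip mainly guards against values such as a depth of 0 reaching the model.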

In [62]:
#Creating optimiser for xgboost, note that doing it this way isn't entirely necessary
#xgboost already contains the inbuilt functionality to do so, but I want to be consistent in my approach
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
import xgboost as xgb

def xgb_score(eta, max_depth, gamma):
    
    #Constraining max_depth to be an integer (tree depth can't be continuous!)
    max_depth = int(max_depth)
    
    #Shuffling rows first, since cv=5 splits the data into consecutive, unshuffled folds
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=np.array(train_data['SalePrice'])

    #eta is xgboost's name for the learning rate, so the tuned value is passed through here
    xgb_model = xgb.XGBRegressor(learning_rate=eta, 
                                 max_depth=max_depth, 
                                 min_split_loss=gamma, 
                                 objective="reg:squarederror")
    
    scores = cross_val_score(estimator=xgb_model, X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    #Converting negated MSE per fold to RMSE, then negating the mean since the optimiser maximises
    return(-np.average(np.sqrt(-scores)))

bounds_xgb = {'eta': (0.08,0.12),'max_depth':(5,15),'gamma':(0,0.1)}

In [56]:
#Importing the Bayesian optimiser
from bayes_opt import BayesianOptimization

BO = BayesianOptimization(xgb_score, bounds_xgb)

In [57]:
BO.maximize(n_iter=50,alpha=0.0001)

|   iter    |  target   |   gamma   | max_depth |
-------------------------------------------------
| 1         | -0.1335   | 0.01501   | 9.988     |
| 2         | -0.3997   | 0.0909    | 0.9037    |
| 3         | -0.1383   | 0.06796   | 13.18     |
| 4         | -0.1387   | 0.09692   | 2.963     |
| 5         | -0.1371   | 0.01459   | 11.42     |
| 6         | -0.1401   | 0.0       | 5.981     |
| 7         | -0.131    | 0.0       | 15.0      |
| 8         | -0.1396   | 0.1       | 4.395     |
| 9         | -0.1367   | 0.1       | 8.018     |
| 10        | -0.1319   | 0.0       | 3.753     |
| 11        | -0.1327   | 0.1       | 15.0      |

Now that we've tuned the hyperparameters using Bayes opt and got a good feel for how the models perform with different parameter sets, it's pretty easy to see that xgboost performs best. As a final evaluation, I'll plug in some parameter values to score the models side by side and then pick from there.

In [75]:
print(
    lasso_score(0.00001),
    ridge_score(0.01),
    RF_score(n_estimators=1000, 
                    max_depth=100, 
                    min_samples_split=20,
                    min_samples_leaf = 5,
                    max_features = 250),
    xgb_score(eta=0.1,max_depth=10,gamma=0))



-0.14694637973357275 -0.15107785073146096 -0.14957741736316 -0.1317039883423656


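As noted, xgboost clearly outperforms the other models, scoring about -0.132 against -0.147 to -0.151 for the rest (these are negated RMSEs, so closer to zero is better). So we'll use that configuration to make our test predictions and submit on Kaggle.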

In [76]:
model_xgb = xgb.XGBRegressor(learning_rate=0.1, max_depth=10, min_split_loss=0, objective="reg:squarederror")
model_xgb.fit(X_train,SalePrice)
#SalePrice was transformed with np.log, so np.exp is the matching inverse
xgb_preds = np.exp(model_xgb.predict(X_test))

#Submission csv
submission = pd.read_csv('Data/sample_submission.csv')
submission['SalePrice'] = xgb_preds
submission.to_csv('Data/submission.csv',index=False)
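Since the model is trained on log prices, predictions have to be mapped back to price units with the inverse that matches the forward transform. A small sketch with made-up prices showing which inverse pairs with which transform:

```python
import numpy as np

prices = np.array([100000.0, 250000.0])  # made-up prices for illustration

# np.log pairs with np.exp
print(np.exp(np.log(prices)))

# np.log1p pairs with np.expm1
print(np.expm1(np.log1p(prices)))

# Mixing the pairs leaves every value off by exactly 1
print(np.expm1(np.log(prices)) - prices)
```

At house-price scale a difference of 1 is negligible for the score, but keeping the pair consistent avoids a silent offset.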



Well! Turns out we did rather poorly despite all the work we put in: we got a public score of 0.14504, placing us in the bottom 50% :(