This notebook assesses how well a variety of ML algorithms predict house prices. We'll consider lasso and ridge regression, random forests, XGBoost and LightGBM. All models will be tuned via Bayesian optimisation to minimise the average RMSE across 5-fold cross-validation. Note that we are predicting log house prices and scoring with RMSE, so model evaluation needs to be tailored to that.
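
To make the metric concrete, here is a minimal sketch of RMSE computed on log prices. The prices are made up for illustration, and the helper name `rmse_log` is an assumption rather than something defined in this notebook.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between the natural logs of true and predicted prices."""
    log_true = np.log(np.asarray(y_true, dtype=float))
    log_pred = np.log(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_true - log_pred) ** 2)))

# Made-up prices, purely for illustration
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

score = rmse_log(actual, predicted)
# An optimiser that maximises its objective needs the negated error
objective = -score
print(score, objective)
```

Because the error is measured on the log scale, it depends on the ratio between predicted and actual price rather than on the absolute difference, so cheap and expensive houses are weighted alike.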

First we'll import the data and clean it into an ML-friendly format. We'll use a combination of random shuffling and forward filling to get rid of the NAs, and convert the string columns into dummy variables. We won't explore feature creation, as the data set already has many features. Exploratory data analysis is covered in the other notebook, so we won't repeat it here.
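
As a rough illustration of the shuffle-and-forward-fill idea, the sketch below applies it to a tiny made-up frame. The column values and the fixed seeds are assumptions added only so the example is reproducible. Each gap ends up holding a value observed elsewhere in the same column.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are made up for illustration
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0, np.nan, 70.0],
    'Alley': ['Grvl', None, None, 'Pave', None],
})

filled = toy.copy()
seed = 0  # fixed seeds only so the sketch is reproducible
while filled.isna().sum().sum() > 0:
    filled = filled.sample(frac=1, random_state=seed)  # shuffle the rows
    filled = filled.ffill()                             # copy down the value from the row above
    seed += 1

filled = filled.sort_index()                            # restore the original row order
print(filled)
```

Rows that were already populated are left untouched; only the gaps receive a value borrowed from a randomly chosen neighbour.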

In [18]:
import pandas as pd
import numpy as np
import time

train = pd.read_csv('Data/train.csv')
test = pd.read_csv('Data/test.csv')

In [19]:
#Setting up forward filling function

def fill_nas(df):
    
    #Keeping track of time
    t0 = time.time()
    
    #Counting NaNs
    na_count = df.isna().sum().sum()
    
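    #Shuffle the rows and forward fill until no NaNs remain
    #(a column with no observed values at all would never be filled)
    while na_count>0:
        df = df.sample(frac=1)
        df = df.ffill(limit=11)
        na_count = df.isna().sum().sum()

    #Restoring the original row order
    filled_df = df.sort_index()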
    
    #Calculating time taken
    t1 = time.time()
    print(t1-t0)
    
    #Return filled df
    return filled_df

In [20]:
train = fill_nas(train)
test = fill_nas(test)

0.131577730178833
0.08455395698547363


In [21]:
#Stripping SalePrice from the training data (kept on the log scale) and combining with the test data
SalePrice = np.log(train['SalePrice'])
train = train.drop('SalePrice',axis=1)
X = pd.concat([train,test])


In [22]:
#Getting strings and numerics
numerics = X.select_dtypes(exclude='object')
strings = X.select_dtypes(include='object')

#Converting strings to dummies and joining with numerics
dummies = pd.get_dummies(strings)
X = pd.concat([numerics,dummies],axis=1)

#Splitting into train and test data
X_train = X.iloc[:train.shape[0]]
X_test = X.iloc[train.shape[0]:]
#Re-attaching log SalePrice as the first column of train
train = pd.concat([SalePrice,X_train],axis=1)
print(X_train.shape, X_test.shape)
print(train.shape, test.shape)

(1460, 289) (1459, 289)
(1460, 290) (1459, 80)
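
The reason for stacking train and test before calling `get_dummies` can be seen on a small made-up example: a category that only appears in one of the two sets would otherwise produce mismatched columns. The `Roof` column and its values are hypothetical.

```python
import pandas as pd

# Made-up categorical column; 'Wood' only occurs in the test rows
toy_train = pd.DataFrame({'Roof': ['Tile', 'Slate', 'Tile']})
toy_test = pd.DataFrame({'Roof': ['Wood', 'Tile']})

# Encoding each set separately gives different columns
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)
print(list(sep_train.columns), list(sep_test.columns))

# Encoding the stacked frame gives one shared set of columns
combined = pd.get_dummies(pd.concat([toy_train, toy_test]))
enc_train = combined.iloc[:len(toy_train)]
enc_test = combined.iloc[len(toy_train):]
print(list(enc_train.columns) == list(enc_test.columns))
```

This is why `X_train` and `X_test` above share the same 289 columns.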


Now that the initial data preprocessing is done, we'll set up the cross-validation and Bayesian optimisation procedures for each model.
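
The scoring functions that follow all share one pattern: shuffle the rows, split off the target, run 5-fold cross-validation, convert each fold's negated MSE to an RMSE, then average and negate. A minimal sketch of that pattern as one reusable helper is below. The name `neg_cv_rmse`, the fixed seed and the synthetic data are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def neg_cv_rmse(estimator, data, target='SalePrice', folds=5):
    """Negated mean of the per-fold RMSEs, so that larger is better."""
    shuffled = data.sample(frac=1, random_state=0)  # fixed seed: an assumption for reproducibility
    X = shuffled.drop(columns=target)
    y = shuffled[target]
    neg_mse = cross_val_score(estimator, X, y,
                              scoring='neg_mean_squared_error', cv=folds)
    return -np.mean(np.sqrt(-neg_mse))

# Synthetic stand-in for the training frame
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))
toy = pd.DataFrame(features, columns=['a', 'b', 'c'])
toy['SalePrice'] = features @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

print(neg_cv_rmse(Ridge(alpha=1.0), toy))
```

The result is negated because the optimiser used later maximises its objective, so a smaller RMSE must map to a larger score.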

In [23]:
#Setting up optimisation for Ridge regression model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

def ridge_score(alpha):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    #Per-fold RMSE from the negated MSE scores, averaged and negated so that larger is better
    scores = cross_val_score(estimator=Ridge(alpha=alpha), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return -np.average(np.sqrt(-scores))

bounds_ridge = {'alpha': (0,1000)}

In [190]:
#Setting up optimisation for Lasso model
from sklearn.linear_model import Lasso

def lasso_score(alpha):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(estimator=Lasso(alpha=alpha), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_lasso = {'alpha': (1e-19,0.1)}
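
The lasso bounds span many orders of magnitude, so it may be worth checking how much of that range gets explored on the raw scale. The sketch below uses plain uniform random draws as a simple stand-in for the optimiser's exploration, which is an assumption and not a model of what Bayesian optimisation actually does. It compares them with draws that are uniform in `log10(alpha)`.

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 1e-19, 0.1
n = 100_000

# Uniform draws over the raw bounds
uniform_alpha = rng.uniform(low, high, size=n)

# Uniform draws over log10(alpha), mapped back with 10**x
log_alpha = rng.uniform(np.log10(low), np.log10(high), size=n)
log_uniform_alpha = 10.0 ** log_alpha

share_raw = (uniform_alpha < 1e-3).mean()      # share of raw-scale draws below 0.001
share_log = (log_uniform_alpha < 1e-3).mean()  # share of log-scale draws below 0.001
print(share_raw, share_log)
```

If small penalties matter, one option is an objective that takes `log10_alpha` and calls `lasso_score(10 ** log10_alpha)`, with bounds such as `(-19, -1)`. That wrapper is a suggestion, not something this notebook defines.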

In [185]:
#Creating optimizer for a Random Forests regressor
from sklearn.ensemble import RandomForestRegressor as RF

def RF_score(n_estimators,max_depth,min_samples_split,min_samples_leaf,max_features):
    
    #Constraining hyperparameters to integers (e.g. the number of decision trees can't be continuous!)
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    min_samples_split = int(min_samples_split)
    min_samples_leaf = int(min_samples_leaf)
    max_features = int(max_features)
    
    assert type(n_estimators) == int
    assert type(max_depth) == int
    assert type(min_samples_split) == int
    assert type(min_samples_leaf) == int
    assert type(max_features) == int
    
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(
        estimator=RF(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            max_features=max_features),
        X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return -np.average(np.sqrt(-scores))

bounds_RF = {
    'n_estimators': (1,3000),
    'max_depth': (1,100),
    'min_samples_split': (2,200),
    'min_samples_leaf': (1,200),
    'max_features': (1,289)  #X_train has 289 feature columns
}
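
Since the optimiser proposes continuous values, the integer casts at the top of `RF_score` could also be factored out. The sketch below shows one possible way with a small decorator. `cast_to_int` and `show_params` are hypothetical names used only for illustration.

```python
import functools

def cast_to_int(*names):
    """Decorator that truncates the named keyword arguments to int."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            for name in names:
                if name in kwargs:
                    kwargs[name] = int(kwargs[name])
            return func(**kwargs)
        return wrapper
    return decorate

@cast_to_int('n_estimators', 'max_depth')
def show_params(n_estimators, max_depth, learning_rate):
    # Stand-in objective that just reports what it received
    return n_estimators, max_depth, learning_rate

print(show_params(n_estimators=512.7, max_depth=9.99, learning_rate=0.05))
```

Note that `int()` truncates rather than rounds, so a proposed value of 9.99 becomes 9. The upper end of each bound is therefore almost never reached.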

In [24]:
#Setting up optimisation for XGBoost model
import xgboost as xgb

def xgb_score(eta):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    xgb_model = xgb.XGBRegressor(learning_rate=eta)
    scores = cross_val_score(estimator=xgb_model, X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_xgb = {'eta': (0,1)}

In [25]:
#Setting up the Bayesian optimiser for the XGBoost scoring function
from bayes_opt import BayesianOptimization

BO = BayesianOptimization(xgb_score, bounds_xgb)

In [None]:
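#alpha is passed through to the underlying Gaussian process; newer bayes_opt
#releases may require BO.set_gp_params(alpha=0.0001) before BO.maximize(n_iter=50) instead
BO.maximize(n_iter=50,alpha=0.0001)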

In [191]:
print(
    lasso_score(0.00001),
    ridge_score(0.01),
    RF_score(n_estimators=1000,
             max_depth=100,
             min_samples_split=20,
             min_samples_leaf=5,
             max_features=250))



-0.15092148943147535 -0.15552330349325888 -0.1500128815231247
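
These scores are negated RMSEs on the log scale, so they can be read as rough multiplicative errors in price. The sketch below does that conversion for rounded versions of the three values printed above. The `log_prediction` value is made up, to show how a log-scale prediction maps back to a price.

```python
import numpy as np

# Rounded negated scores printed above (lasso, ridge, random forest)
scores = {'lasso': -0.1509, 'ridge': -0.1555, 'random_forest': -0.1500}

for name, neg_rmse in scores.items():
    rmse = -neg_rmse
    factor = np.exp(rmse)  # multiplicative error corresponding to one RMSE on the log scale
    print(f'{name}: log RMSE {rmse:.4f}, about {100 * (factor - 1):.1f}% in price terms')

# Predictions made on the log scale are mapped back to prices with np.exp
log_prediction = 12.0      # made-up value for illustration
price = np.exp(log_prediction)
print(price)
```

Since each call reshuffles the training data, the scores vary slightly between runs, so small gaps between models should be read with caution.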
