This notebook will assess a variety of ML algorithms in their performance in predicting house prices. We'll be considering lasso and ridge regression, random forests, xgboost and maybe later light gbm. All models will be tuned via Bayesian Optimisation in order to minimise the average 5-fold cross validation score. Note that here specifically we are predicting log house prices and using the RMSE metric so our model evaluation will need to be tailored to that.

First we'll import the data and clean it so that it can be used in a ML friendly format. We'll use a combination of random shuffling and forward filling to get rid of the NAs and convert the strings into dummy variables. We will not be exploring feature creation as there are many features in this data set already. Since exploratory data analysis has already been covered in the other notebook, we won't bother repeating it here.

In [1]:
import pandas as pd
import numpy as np
import time
from scipy.stats import skew
from bayes_opt import BayesianOptimization

In [2]:
#Setting up forward filling function

def fill_nas(df):
    
    #Keeping tack of time
    t0 = time.time()
    
    #Counting NaNs
    na_count = df.isna().sum().sum()
    
    while na_count>0:
        df = df.sample(frac=1)
        df = df.fillna(method='ffill',limit=1)
        na_count = df.isna().sum().sum()

    filled_df = df.sort_index()
    
    #Calculating time taken
    t1 = time.time()
    print(t1-t0)
    
    #Return filled df
    return(filled_df)

In [3]:
train = pd.read_csv('Data/train.csv')
test = pd.read_csv('Data/test.csv')

#Stripping SalePrice from the training data and combining with the test data
SalePrice = np.log(train['SalePrice'])
train = train.drop('SalePrice',axis=1)
X = pd.concat([train,test], ignore_index=True)

Doing some data pre-processing

In [4]:
#MSSubClass is a categorical variable, so we convert it to a string 
X['MSSubClass'] = X['MSSubClass'].apply(str)

In [5]:
##Dealing with numerics
numerics = X.select_dtypes(exclude='object')

#Filling nas
numerics = fill_nas(numerics)

#log transformation of skewed variables
skews = numerics.apply(lambda x:skew(x.dropna()))
skewed = skews>0.75
skewed_data = numerics[skewed.index[skewed]]
skewed_feats = skewed_data.columns
numerics = numerics.drop(skewed_feats, axis=1)
log_transformed = np.log1p(skewed_data)
numerics = pd.concat([numerics,log_transformed],axis=1)


0.01822805404663086


In [6]:
#Dealing with strings
strings = X.select_dtypes(include='object')

#Converting strings to dummies and joining with numerics
dummies = pd.get_dummies(strings, dummy_na=True)
X = pd.concat([numerics,dummies],axis=1)

#Splitting into train and test data
X_train = X.iloc[:train.shape[0],]
X_test = X.iloc[train.shape[0]:,]
train = pd.concat([SalePrice,X_train],axis=1)
print(X_train.shape, X_test.shape)
print(train.shape, test.shape)

(1460, 348) (1459, 348)
(1460, 349) (1459, 80)


So now that the initial data preprocessing has been done, we are now going to set up cross-validation and Bayesian optimisation procedures for each model. The reason why we use Bayesian Optimisation for hyperparameter tuning is because of the stochastic nature of our objective function. We are randomly sorting the data to better replicate the variability in the dgp when measuring the validation score. While a deterministic optimiser may work, it may not be the best solution if it does not account for the noise.

In [7]:
#Setting up optimisation for Ridge regression model
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

def ridge_score(alpha):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']
    
    scores = cross_val_score(estimator=Ridge(alpha=alpha), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_ridge = {'alpha': (0,20)}

For the parameter bounds, I have chosen alpha to be between 0 and 20. Using a 0 value will be equivalent to estimation by OLS, any positive values will shrink the parameter estimates towards zero. I do not want to using an upper bound that is too high as it will waste computation power as in practice, optimal values for alpha are relatively small.

In [35]:
#Running Bayes Opt to tune ridge regression

BO = BayesianOptimization(ridge_score, bounds_ridge)
BO.maximize(n_iter=50,alpha=0.0001)

|   iter    |  target   |   alpha   |
-------------------------------------
| [0m 1       [0m | [0m-0.1287  [0m | [0m 9.72    [0m |
| [95m 2       [0m | [95m-0.1265  [0m | [95m 10.92   [0m |
| [0m 3       [0m | [0m-0.1317  [0m | [0m 14.69   [0m |
| [0m 4       [0m | [0m-0.1266  [0m | [0m 7.17    [0m |
| [0m 5       [0m | [0m-0.132   [0m | [0m 1.477   [0m |
| [0m 6       [0m | [0m-0.1297  [0m | [0m 20.0    [0m |
| [0m 7       [0m | [0m-0.1516  [0m | [0m 0.001091[0m |
| [0m 8       [0m | [0m-0.1265  [0m | [0m 20.0    [0m |
| [0m 9       [0m | [0m-0.1265  [0m | [0m 20.0    [0m |
| [0m 10      [0m | [0m-0.1288  [0m | [0m 20.0    [0m |
| [0m 11      [0m | [0m-0.1283  [0m | [0m 20.0    [0m |
| [0m 12      [0m | [0m-0.1331  [0m | [0m 20.0    [0m |
| [0m 13      [0m | [0m-0.1272  [0m | [0m 20.0    [0m |
| [0m 14      [0m | [0m-0.1315  [0m | [0m 20.0    [0m |
| [0m 15      [0m | [0m-0.1315  [0m | [0m 20.0  

The optimal choice of alpha seems to be around 10.

In [17]:
#Setting up optimisation for Lasso model
from sklearn.linear_model import Lasso

def lasso_score(alpha):
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(estimator=Lasso(alpha=alpha, max_iter=10000), X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_lasso = {'alpha': (0.00001,20)}

I'm adjusting the bounds for the lasso as 0 values for regularisation cause convergence problems.

In [41]:
#Running Bayes Opt to tune Lasso regression

BO = BayesianOptimization(lasso_score, bounds_lasso)
BO.maximize(n_iter=50,alpha=0.0001)

|   iter    |  target   |   alpha   |
-------------------------------------
| [0m 1       [0m | [0m-0.305   [0m | [0m 7.299   [0m |
| [0m 2       [0m | [0m-0.3066  [0m | [0m 9.549   [0m |
| [0m 3       [0m | [0m-0.3148  [0m | [0m 17.51   [0m |
| [95m 4       [0m | [95m-0.3044  [0m | [95m 5.857   [0m |
| [0m 5       [0m | [0m-0.316   [0m | [0m 19.1    [0m |
| [95m 6       [0m | [95m-0.1356  [0m | [95m 1.006e-0[0m |
| [0m 7       [0m | [0m-0.1363  [0m | [0m 1e-05   [0m |
| [95m 8       [0m | [95m-0.132   [0m | [95m 1e-05   [0m |
| [0m 9       [0m | [0m-0.132   [0m | [0m 1e-05   [0m |
| [0m 10      [0m | [0m-0.132   [0m | [0m 1e-05   [0m |
| [0m 11      [0m | [0m-0.132   [0m | [0m 1e-05   [0m |
| [0m 12      [0m | [0m-0.1363  [0m | [0m 1e-05   [0m |
| [0m 13      [0m | [0m-0.1418  [0m | [0m 1e-05   [0m |
| [0m 14      [0m | [0m-0.1418  [0m | [0m 1e-05   [0m |
| [0m 15      [0m | [0m-0.1418  [0m | [0m 

I'm quite suspicious about the behaviour of the Bayes optimiser in this case, it cannot seem to explore other values than the lower bound. While I could change the lower bound, it will cause convergence issues and will require an increased number of iterations. I can handpick a value for alpha that does noticeably better.

In [44]:
lasso_score(0.0001)

-0.12857380432989013

At this point I'm not going to try delve into the workings of optimiser as I don't think it will be very fruitful. Instead I'm going to try using a different optimiser to see if we get better convergence.

In [48]:
from scipy.optimize import minimize

minimize(lasso_score, x0=1, bounds = ((0.00001,20),), tol=0.00001)

      fun: -0.2677542767110079
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([6881.69102627])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 50
      nit: 2
   status: 0
  success: True
        x: array([0.99999875])

This also performs quite poorly. I think it gets thrown off by the randomness of the objective function. It does find a good optimum, but only if I put in an optimal starting point!

In [49]:
minimize(lasso_score, x0=0.001, bounds = ((0.00001,20),), tol=0.00001)

      fun: -0.12544040196865947
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([-196060.25283075])
  message: b'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
     nfev: 32
      nit: 2
   status: 0
  success: True
        x: array([0.00097682])

In [None]:
With a good choice of alpha, the lasso marginally outpeforms the ridge regression.

In [18]:
#Creating optimizer for a Random Forests regressor
from sklearn.ensemble import RandomForestRegressor as RF

def RF_score(n_estimators,max_depth,min_samples_split,min_samples_leaf,max_features):
    
    #Contraining hyperparameters to be converted to integers (e.g. number of decision trees can't be continuous!)
    n_estimators = int(n_estimators)
    max_depth = int(max_depth)
    min_samples_split = int(min_samples_split)
    min_samples_leaf = int(min_samples_leaf)
    max_features = int(max_features)
    
    assert type(n_estimators) == int
    assert type(max_depth) == int
    assert type(min_samples_split) == int
    assert type(min_samples_leaf) == int
    assert type(max_features) == int
    
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=train_data['SalePrice']

    scores = cross_val_score(
        estimator=RF(
                    n_estimators=n_estimators, 
                    max_depth=max_depth, 
                    min_samples_split=min_samples_split,
                    min_samples_leaf = min_samples_leaf,
                    max_features = max_features),
    X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_RF = {
    'n_estimators': (1,3000),
    'max_depth': (1,100),
    'min_samples_split': (2,200),
    'min_samples_leaf': (1,200),
    'max_features': (1,290)
}

In [12]:
#Running Bayes Opt to tune Rando Forests regression

BO = BayesianOptimization(RF_score, bounds_RF)
BO.maximize(n_iter=50,alpha=0.0001)

|   iter    |  target   | max_depth | max_fe... | min_sa... | min_sa... | n_esti... |
-------------------------------------------------------------------------------------
| [0m 1       [0m | [0m-0.2525  [0m | [0m 83.77   [0m | [0m 47.77   [0m | [0m 165.1   [0m | [0m 117.2   [0m | [0m 44.25   [0m |
| [0m 2       [0m | [0m-0.2654  [0m | [0m 63.19   [0m | [0m 22.49   [0m | [0m 145.3   [0m | [0m 133.5   [0m | [0m 2.303e+0[0m |
| [95m 3       [0m | [95m-0.2111  [0m | [95m 65.32   [0m | [95m 236.2   [0m | [95m 107.6   [0m | [95m 143.9   [0m | [95m 946.7   [0m |
| [0m 4       [0m | [0m-0.2505  [0m | [0m 6.298   [0m | [0m 263.0   [0m | [0m 158.0   [0m | [0m 167.8   [0m | [0m 1.276e+0[0m |
| [0m 5       [0m | [0m-0.2573  [0m | [0m 6.909   [0m | [0m 273.4   [0m | [0m 197.2   [0m | [0m 141.0   [0m | [0m 1.167e+0[0m |
| [0m 6       [0m | [0m-0.2822  [0m | [0m 1.0     [0m | [0m 290.0   [0m | [0m 1.0     [0m | [0m 41.3

In [15]:
#Creating optimiser for xgboost, note that doing it this way isn't entirely necessary
#xgboost already contains the inbuilt functionality to do so, but I want to be consistent in my approach
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
import xgboost as xgb

def xgb_score(eta, max_depth, gamma, colsample, subsample, early_stop):
    
    #Contraining hyperparameters to be converted to integers (e.g. number of decision trees can't be continuous!)
    max_depth = int(max_depth)
    early_stop = int(early_stop)
    
    #assert type(n_estimators) == int
    assert type(max_depth) == int
    assert type(early_stop) == int
    
    train_data = train.sample(frac=1)
    X=train_data.iloc[:,1:]
    y=np.array(train_data['SalePrice'])
    

    xgb_model = xgb.XGBRegressor(learning_rate=0.1, 
                                 max_depth=max_depth, 
                                 min_split_loss=gamma, 
                                 colsample_bytree = colsample,
                                 subsample = subsample,
                                 early_stopping_rounds = early_stop,
                                 objective="reg:squarederror")
    
    scores = cross_val_score(estimator=xgb_model, X=X, y=y, scoring='neg_mean_squared_error', cv=5)
    return(-np.average(np.sqrt(-scores)))

bounds_xgb = {'eta': (0.01,0.5),
              'max_depth':(3,40),
              'gamma':(0,0.4), 
              'colsample':(0.3,1), 
              'subsample':(0.3,1),
              'early_stop':(1,15)}

In [16]:
#Running Bayes Opt to tune Random Forests regression

BO = BayesianOptimization(xgb_score, bounds_xgb)
BO.maximize(n_iter=50,alpha=0.0001)

|   iter    |  target   | colsample | early_... |    eta    |   gamma   | max_depth | subsample |
-------------------------------------------------------------------------------------------------
| [0m 1       [0m | [0m-0.1421  [0m | [0m 0.9163  [0m | [0m 8.453   [0m | [0m 0.07675 [0m | [0m 0.3335  [0m | [0m 19.24   [0m | [0m 0.9383  [0m |
| [95m 2       [0m | [95m-0.1381  [0m | [95m 0.4751  [0m | [95m 11.29   [0m | [95m 0.118   [0m | [95m 0.2687  [0m | [95m 11.04   [0m | [95m 0.7378  [0m |
| [95m 3       [0m | [95m-0.1326  [0m | [95m 0.7861  [0m | [95m 13.81   [0m | [95m 0.08371 [0m | [95m 0.07473 [0m | [95m 21.65   [0m | [95m 0.8461  [0m |
| [95m 4       [0m | [95m-0.1305  [0m | [95m 0.7299  [0m | [95m 9.31    [0m | [95m 0.4099  [0m | [95m 0.1261  [0m | [95m 27.06   [0m | [95m 0.3844  [0m |
| [0m 5       [0m | [0m-0.1388  [0m | [0m 0.3472  [0m | [0m 1.883   [0m | [0m 0.03803 [0m | [0m 0.2748  [0m | [0m 15.2 

| [0m 51      [0m | [0m-0.1326  [0m | [0m 0.5703  [0m | [0m 1.037   [0m | [0m 0.4504  [0m | [0m 0.1016  [0m | [0m 3.023   [0m | [0m 0.9569  [0m |
| [0m 52      [0m | [0m-0.1291  [0m | [0m 0.504   [0m | [0m 14.95   [0m | [0m 0.3867  [0m | [0m 0.0374  [0m | [0m 3.018   [0m | [0m 0.7305  [0m |
| [0m 53      [0m | [0m-0.1404  [0m | [0m 0.8258  [0m | [0m 14.99   [0m | [0m 0.3848  [0m | [0m 0.3104  [0m | [0m 3.009   [0m | [0m 0.4784  [0m |
| [0m 54      [0m | [0m-0.1383  [0m | [0m 0.5643  [0m | [0m 1.075   [0m | [0m 0.2045  [0m | [0m 0.1931  [0m | [0m 39.98   [0m | [0m 0.7325  [0m |
| [0m 55      [0m | [0m-0.1428  [0m | [0m 0.5509  [0m | [0m 15.0    [0m | [0m 0.176   [0m | [0m 0.3965  [0m | [0m 3.013   [0m | [0m 0.7466  [0m |


So now that we've tuned the hyperparameters using Bayes opt and gotten a good feel of how the models perform using different parameter sets, it's pretty easy to see that xgboost performs the best. As a final evaluation, I'll put in some paramter values to evaluate our models concurrently and then pick from there.

In [19]:
print(
    lasso_score(0.001),
    ridge_score(10),
    RF_score(n_estimators=20, 
                    max_depth=100, 
                    min_samples_split=2,
                    min_samples_leaf = 1,
                    max_features = 130),
    xgb_score(eta=0.1,
              max_depth=40,
              gamma=0.05,
              colsample=0.9,
              subsample=0.7,
              early_stop=10)
    )

-0.1269655787503537 -0.12987026651616576 -0.14191146053416806 -0.1309665650518888


As noted, xgboost clearly outperforms all other models. So we'll use that configuration to make our test predictions and submit on Kaggle.

In [156]:
model_xgb = xgb.XGBRegressor(learning_rate=0.1, max_depth=10, min_split_loss=0, objective="reg:squarederror")
model_xgb.fit(X_train,SalePrice)
xgb_preds = np.expm1(model_xgb.predict(X_test))


#Submission csv
submission = pd.read_csv('Data/sample_submission.csv')
submission['SalePrice'] = xgb_preds
submission.to_csv('Data/submission.csv',index=False)

  if getattr(data, 'base', None) is not None and \


Well! Turns out we did rather poorly despite all the work we put in, we got a public score of 0.14504 placing us in the bottom 50% :(.

In [52]:
model_lasso = Lasso(alpha=0.001)
model_lasso.fit(X_train,SalePrice)
lasso_preds = np.expm1(model_lasso.predict(X_test))

#Submission csv
submission = pd.read_csv('Data/sample_submission.csv')
submission['SalePrice'] = lasso_preds
submission.to_csv('Data/submission.csv',index=False)