### Model Implementation and Machine Learning
Now that our features are set, we can move on to training models for prediction. We'll fit several candidates, evaluate each by R2, Root Mean Squared Error (RMSE), Mean Squared Error (MSE) and Mean Absolute Error (MAE), and choose the best one.



According to [this Towards Data Science article](https://towardsdatascience.com/what-are-the-best-metrics-to-evaluate-your-regression-model-418ca481755b):

> "R Square/Adjusted R Square is better used to explain the model to other people because you can explain the number as a percentage of the output variability. MSE, RMSE, or MAE are better be used to compare performance between different regression models."

Since the goal is to predict house prices accurately, I will report all four metrics.
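To make the four metrics concrete, here is a minimal pure-Python sketch of what they compute. The function name `regression_metrics` and the toy values are made up for illustration; the notebook itself relies on the scikit-learn functions imported further down.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # average absolute error
    mse = sum(e * e for e in errors) / n       # average squared error
    rmse = math.sqrt(mse)                      # back on the scale of the target
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variability
    ss_res = sum(e * e for e in errors)                 # unexplained variability
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Toy values (hypothetical, not taken from the housing data)
y_true = [12.0, 11.5, 12.4, 11.9]
y_pred = [12.1, 11.3, 12.4, 12.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE, MSE and RMSE are errors (lower is better), while R2 is a share of explained variance (higher is better), which is why the tables below are sorted by MSE in ascending order.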





In [3]:
# Read the dummy-encoded CSVs produced in the previous notebook
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
test_dummy = pd.read_csv('test_dummy.csv') # note: these files were saved without index=False
train_dummy = pd.read_csv('train_dummy.csv')

In [4]:
test_dummy

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,7.041123,27.511249,0.000000,12.108210,8.528312,10.330766,14.416106,14.477288,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0
1,7.070709,28.872139,7.777777,14.593156,0.000000,11.629871,16.073678,16.073678,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0
2,6.857203,28.662062,0.000000,13.998313,0.000000,8.395096,14.614315,14.614315,13.545452,0.0,...,0,0,0,1,0,0,0,0,1,0
3,6.981065,26.534606,4.192081,12.990095,0.000000,10.898153,14.605862,14.605862,13.422305,0.0,...,0,0,0,1,0,0,0,0,1,0
4,5.657628,22.470602,0.000000,10.250736,0.000000,14.976507,15.916058,15.916058,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,4.278004,17.719351,0.000000,0.000000,0.000000,12.642798,12.642798,12.642798,12.642798,0.0,...,0,0,0,1,0,0,0,0,1,0
1455,4.278004,17.619961,0.000000,10.121473,0.000000,10.593170,12.642798,12.642798,12.642798,0.0,...,0,0,0,1,1,0,0,0,0,0
1456,8.814500,31.239346,0.000000,15.729901,0.000000,0.000000,15.729901,15.729901,0.000000,0.0,...,0,0,0,1,1,0,0,0,0,0
1457,6.450860,26.821947,0.000000,11.023351,0.000000,12.826025,14.546282,14.788544,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0


In [None]:
train_dummy

Unnamed: 0,LotFrontage,LotArea,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,1stFlrSF,2ndFlrSF,LowQualFinSF,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,6.557896,25.503637,9.383456,13.571795,0.000000,8.638462,14.300394,14.300394,14.291377,0.0,...,0,0,0,1,0,0,0,0,1,0
1,7.041123,26.291998,0.000000,14.821045,0.000000,10.485990,15.856944,15.856944,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0
2,6.661108,27.300424,8.848653,12.237560,0.000000,11.852637,14.580417,14.580417,14.345227,0.0,...,0,0,0,1,0,0,0,0,1,0
3,6.377215,26.259338,0.000000,9.664321,0.000000,12.603923,13.827349,14.751724,13.827349,0.0,...,0,0,0,1,1,0,0,0,0,0
4,7.157766,28.868815,11.144754,13.295773,0.000000,12.265784,15.455351,15.455351,15.115838,0.0,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1451,6.450860,25.108776,0.000000,0.000000,0.000000,14.718764,14.718764,14.718764,13.508319,0.0,...,0,0,0,1,0,0,0,0,1,0
1452,7.186239,28.337016,8.025855,13.993513,8.865603,12.911848,16.709169,18.032005,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0
1453,6.592710,25.919503,0.000000,10.386924,0.000000,14.394068,15.480279,15.606602,15.480279,0.0,...,0,0,0,1,0,0,0,0,1,0
1454,6.661108,26.367896,0.000000,5.933621,15.023383,0.000000,15.210371,15.210371,0.000000,0.0,...,0,0,0,1,0,0,0,0,1,0


In [5]:
# The CSVs are loaded; separate the predictors from the target
predictors = train_dummy.drop('SalePrice',axis=1)
target = train_dummy.SalePrice

In [6]:
# Import required packages
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import RepeatedKFold
# ML algorithms
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor 
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from catboost import CatBoostRegressor

#Performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [7]:
# Split the data
x_train, x_test, y_train, y_test = train_test_split(predictors,target, train_size=0.8,
                                                    random_state= 42)
# Check if the data split correctly
print(x_train.shape, x_test.shape)
y_train.shape, y_test.shape

(1164, 667) (292, 667)


((1164,), (292,))
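The printed shapes can be checked by hand. The sketch below assumes that scikit-learn rounds the training share down and gives the remaining rows to the test set when only a float `train_size` is passed; the row count is taken from the shapes above.

```python
import math

n_samples = 1164 + 292   # total rows, from the shapes printed above
train_size = 0.8

# Assumption: the training count is floored and the test set gets the rest
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train
print(n_train, n_test)   # 1164 292
```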

In [8]:
model_accuracy = pd.DataFrame()
models = {'LinReg': LinearRegression(),
          'KNNReg': KNeighborsRegressor(),
          'RFR': RandomForestRegressor(),
          'XGB' : XGBRegressor(),
          'ADA': AdaBoostRegressor(),
          'GBR' : GradientBoostingRegressor(),
          'Elastic_net' : ElasticNet(),
          'Ridge' : Ridge(),
          'Lasso' : Lasso(),
          'svr': SVR(), 
          'cat': CatBoostRegressor(silent=True),
         }

          

for name, reg in models.items():
    reg.fit(x_train, y_train)
    y_pred = reg.predict(x_test)
    
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)  # same value as mean_squared_error(..., squared=False)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(name + ' RMSE score : ', "{:.2f}".format(rmse))
    print(name + ' MAE score :' ,  "{:.2f}".format(mae))
    print(name + ' MSE score : ', "{:.2f}".format(mse))
    print('r2 scores',  "{:.2f}".format(r2))
    print('*' * 40)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    row = pd.DataFrame([{'Model': name, 'RMSE': rmse, 'MAE': mae, 'MSE': mse, 'R2': r2}])
    model_accuracy = pd.concat([model_accuracy, row], ignore_index=True)
    

LinReg RMSE score :  0.17
LinReg MAE score : 0.12
LinReg MSE score :  0.03
r2 scores 0.82
****************************************
KNNReg RMSE score :  0.26
KNNReg MAE score : 0.18
KNNReg MSE score :  0.07
r2 scores 0.61
****************************************
RFR RMSE score :  0.14
RFR MAE score : 0.09
RFR MSE score :  0.02
r2 scores 0.89
****************************************
XGB RMSE score :  0.14
XGB MAE score : 0.10
XGB MSE score :  0.02
r2 scores 0.88
****************************************
ADA RMSE score :  0.18
ADA MAE score : 0.14
ADA MSE score :  0.03
r2 scores 0.82
****************************************
GBR RMSE score :  0.13
GBR MAE score : 0.09
GBR MSE score :  0.02
r2 scores 0.90
****************************************
Elastic_net RMSE score :  0.31
Elastic_net MAE score : 0.22
Elastic_net MSE score :  0.09
r2 scores 0.45
****************************************
Ridge RMSE score :  0.13
Ridge MAE score : 0.09
Ridge MSE score :  0.02
r2 scores 0.90
*****************

In [9]:
model_accuracy.sort_values('MSE',ascending=True)

,MAE,MSE,Model,R2,RMSE
10,0.083135,0.014533,cat,0.914806,0.120551
5,0.090557,0.016414,GBR,0.903774,0.128118
7,0.093027,0.017909,Ridge,0.89501,0.133826
2,0.094212,0.018307,RFR,0.892677,0.135304
3,0.100604,0.021004,XGB,0.876867,0.144928
0,0.11732,0.029944,LinReg,0.824461,0.173042
4,0.139227,0.031504,ADA,0.815314,0.177493
9,0.170191,0.060055,svr,0.647938,0.245061
1,0.179262,0.06701,KNNReg,0.607166,0.258863
6,0.218185,0.09385,Elastic_net,0.44982,0.30635


CatBoostRegressor, GradientBoostingRegressor, Ridge, RandomForestRegressor and XGBRegressor all reach R2 scores above 0.85. CatBoost is the best of them, with the lowest MAE, MSE and RMSE.

Next I will stack the models that scored above 0.85.
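Before using scikit-learn's `StackingRegressor`, it may help to see the idea in miniature. The sketch below is a simplified NumPy illustration on synthetic data, not the library's implementation: two deliberately weak base models each see one feature, their out-of-fold predictions become the inputs of a linear final estimator, and the base models are then refit on all training rows. All names (`fit_linear`, `predict_linear`, `base_columns`) are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic data: the target depends on both features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:160], X[160:], y[:160], y[160:]

# Two weak base models: each one only sees a single feature
base_columns = [[0], [1]]

# Step 1: out-of-fold predictions of every base model on the training set
folds = np.array_split(np.arange(len(X_tr)), 5)
oof = np.zeros((len(X_tr), len(base_columns)))
for fold in folds:
    mask = np.ones(len(X_tr), dtype=bool)
    mask[fold] = False  # train on everything except the current fold
    for j, cols in enumerate(base_columns):
        coef = fit_linear(X_tr[mask][:, cols], y_tr[mask])
        oof[fold, j] = predict_linear(coef, X_tr[fold][:, cols])

# Step 2: the final estimator learns how to combine the base predictions
meta_coef = fit_linear(oof, y_tr)

# Step 3: refit the base models on all training rows, then stack on the test set
base_coefs = [fit_linear(X_tr[:, cols], y_tr) for cols in base_columns]
test_preds = np.column_stack(
    [predict_linear(c, X_te[:, cols]) for c, cols in zip(base_coefs, base_columns)]
)
stacked_pred = predict_linear(meta_coef, test_preds)

base_mse = [mse(y_te, test_preds[:, j]) for j in range(len(base_columns))]
stacked_mse = mse(y_te, stacked_pred)
print('base MSE:', base_mse, 'stacked MSE:', stacked_mse)
```

Training the final estimator on out-of-fold predictions keeps it from simply rewarding whichever base model overfits the training rows most.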

In [10]:
from sklearn.ensemble import StackingRegressor
estimators = [
          ('RFR', RandomForestRegressor()),
          ('XGB' , XGBRegressor()),
          ('GBR' , GradientBoostingRegressor()),
          ('Ridge' , Ridge()),
          ('Cat', CatBoostRegressor(silent=True))]
sr = StackingRegressor(estimators= estimators, final_estimator = LinearRegression())
sr.fit(x_train, y_train)
y_pred = sr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same value as mean_squared_error(..., squared=False)
print('mean squared error : ', mse)
print('root mse : ', rmse)
print('r2 scores : ', r2_score(y_test,y_pred))


# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
stacked_row = pd.DataFrame([{'Model': 'Stacked', 'RMSE': rmse,
                             'MAE': mean_absolute_error(y_test, y_pred), 'MSE': mse,
                             'R2': r2_score(y_test, y_pred)}])
model_accuracy = pd.concat([model_accuracy, stacked_row], ignore_index=True)

mean squared error :  0.014345087907367058
root mse :  0.11977098107374365
r2 scores :  0.9159043567629865


In [11]:
model_accuracy.sort_values('MSE',ascending=True)

,MAE,MSE,Model,R2,RMSE
11,0.080203,0.014345,Stacked,0.915904,0.119771
10,0.083135,0.014533,cat,0.914806,0.120551
5,0.090557,0.016414,GBR,0.903774,0.128118
7,0.093027,0.017909,Ridge,0.89501,0.133826
2,0.094212,0.018307,RFR,0.892677,0.135304
3,0.100604,0.021004,XGB,0.876867,0.144928
0,0.11732,0.029944,LinReg,0.824461,0.173042
4,0.139227,0.031504,ADA,0.815314,0.177493
9,0.170191,0.060055,svr,0.647938,0.245061
1,0.179262,0.06701,KNNReg,0.607166,0.258863


Best model: the stacking regressor. Next I will use Hyperopt-Sklearn to tune the parameters of the base models.

Hyperopt-Sklearn does not provide a CatBoost component, however, so I only use it for XGBoost (Extreme Gradient Boosting), random forest and gradient boosting.

I will tune the CatBoost parameters afterwards with RandomizedSearchCV and GridSearchCV.

In [32]:
from hpsklearn import HyperoptEstimator
from hpsklearn import xgboost_regression 
from hpsklearn import random_forest_regression 
from hpsklearn import gradient_boosting_regression
from hyperopt import tpe

In [33]:
model = HyperoptEstimator(regressor=(xgboost_regression('xgboost_regression')),
                            algo=tpe.suggest, trial_timeout=300)

model.fit(x_train, y_train)

print(model.score(x_test, y_test))
print(model.best_model())

100%|██████████| 1/1 [00:14<00:00, 14.09s/trial, best loss: 0.2255404789484481]
100%|██████████| 2/2 [01:13<00:00, 73.36s/trial, best loss: 0.12188189345469458]
100%|██████████| 3/3 [03:15<00:00, 195.79s/trial, best loss: 0.10546851550469027]
100%|██████████| 4/4 [02:11<00:00, 131.22s/trial, best loss: 0.0822680857184166]
100%|██████████| 5/5 [00:12<00:00, 12.61s/trial, best loss: 0.0822680857184166]
100%|██████████| 6/6 [00:14<00:00, 14.19s/trial, best loss: 0.0822680857184166]
100%|██████████| 7/7 [02:25<00:00, 145.40s/trial, best loss: 0.0822680857184166]
100%|██████████| 8/8 [01:05<00:00, 65.76s/trial, best loss: 0.0822680857184166]
100%|██████████| 9/9 [02:26<00:00, 146.99s/trial, best loss: 0.0822680857184166]
100%|██████████| 10/10 [00:12<00:00, 12.22s/trial, best loss: 0.0822680857184166]
0.9097723131430315
{'learner': XGBRegressor(base_score=0.5, booster='gbtree',
             colsample_bylevel=0.5920480741161004, colsample_bynode=1,
             colsample_bytree=0.53421103311

In [34]:
model = HyperoptEstimator(regressor=(random_forest_regression('random_forest')),
                            algo=tpe.suggest, trial_timeout=300)

model.fit(x_train, y_train)

print(model.score(x_test, y_test))
print(model.best_model())

100%|██████████| 1/1 [00:18<00:00, 18.19s/trial, best loss: 0.18683289546923665]
100%|██████████| 2/2 [00:14<00:00, 14.23s/trial, best loss: 0.16849822417406646]
100%|██████████| 3/3 [00:15<00:00, 15.07s/trial, best loss: 0.16849822417406646]
100%|██████████| 4/4 [00:18<00:00, 18.11s/trial, best loss: 0.12228393290332329]
100%|██████████| 5/5 [00:25<00:00, 25.99s/trial, best loss: 0.12228393290332329]
100%|██████████| 6/6 [00:10<00:00, 10.11s/trial, best loss: 0.12228393290332329]
100%|██████████| 7/7 [00:10<00:00, 10.54s/trial, best loss: 0.12228393290332329]
100%|██████████| 8/8 [00:12<00:00, 12.37s/trial, best loss: 0.12228393290332329]
100%|██████████| 9/9 [00:09<00:00,  9.62s/trial, best loss: 0.12228393290332329]
100%|██████████| 10/10 [00:12<00:00, 12.28s/trial, best loss: 0.12228393290332329]
0.8996232391287924
{'learner': RandomForestRegressor(bootstrap=False, max_features=0.3987863192293277,
                      n_estimators=247, n_jobs=1, random_state=1,
                   

In [36]:
model = HyperoptEstimator(regressor=(gradient_boosting_regression('gradient_boost')),
                            algo=tpe.suggest, trial_timeout=300)

model.fit(x_train, y_train)

print(model.score(x_test, y_test))
print(model.best_model())

100%|██████████| 1/1 [04:20<00:00, 260.10s/trial, best loss: 0.12933690993181968]
100%|██████████| 2/2 [00:14<00:00, 14.02s/trial, best loss: 0.12933690993181968]
100%|██████████| 3/3 [00:04<00:00,  4.39s/trial, best loss: 0.12933690993181968]
100%|██████████| 4/4 [00:05<00:00,  5.51s/trial, best loss: 0.12933690993181968]
100%|██████████| 5/5 [00:10<00:00, 10.30s/trial, best loss: 0.12933690993181968]
100%|██████████| 6/6 [00:03<00:00,  3.43s/trial, best loss: 0.12933690993181968]
100%|██████████| 7/7 [00:09<00:00,  9.94s/trial, best loss: 0.12933690993181968]
100%|██████████| 8/8 [00:03<00:00,  3.94s/trial, best loss: 0.12933690993181968]
100%|██████████| 9/9 [00:03<00:00,  3.43s/trial, best loss: 0.12933690993181968]
100%|██████████| 10/10 [00:04<00:00,  4.96s/trial, best loss: 0.12933690993181968]
0.8832103351844581
{'learner': GradientBoostingRegressor(alpha=0.8570036099535357,
                          learning_rate=0.026472079517656004, loss='huber',
                          ma

In [39]:
xgb = XGBRegressor(base_score=0.5, booster='gbtree',
             colsample_bylevel=0.5920480741161004, colsample_bynode=1,
             colsample_bytree=0.5342110331102824, gamma=0.022747667282595144,
             gpu_id=-1, importance_type='gain', interaction_constraints='',
             learning_rate=0.10079137168083327, max_delta_step=0, max_depth=9,
             min_child_weight=2,
             n_estimators=5200, n_jobs=-1, num_parallel_tree=1,
             objective='reg:linear', random_state=2,
             reg_alpha=0.0070599715322066025, reg_lambda=2.6415023331554033,
             scale_pos_weight=1, seed=2, subsample=0.6080316946851188,
             tree_method='exact', validate_parameters=1, verbosity=None)


rfr = RandomForestRegressor(bootstrap=False, max_features=0.3987863192293277,
                      n_estimators=247, n_jobs=-1, random_state=1,
                      verbose=False)

gbr = GradientBoostingRegressor(alpha=0.8570036099535357,
                          learning_rate=0.026472079517656004, loss='huber',
                          max_depth=None, max_features='sqrt', n_estimators=713,
                          random_state=0, subsample=0.9565703063356579)

In [40]:
# Catboost Parameters
cat_params = {'depth':[3,1,2,6,4,5,7,8,9,10],
          'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3], 
          'l2_leaf_reg':[3,1,5,10,100],
          'border_count':[32,5,10,20,50,100,200],
             }

In [41]:
def eval_best_params(model, parameters, X_train, y_train):
    """Run a 5-fold randomized search over `parameters` and print the best combination.

    `model` is an unfitted base estimator such as RandomForestRegressor()."""
    search = RandomizedSearchCV(estimator=model, param_distributions=parameters,
                                cv=5, verbose=2, n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_)
 

In [42]:
eval_best_params(CatBoostRegressor(silent=True), cat_params, x_train,y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
{'learning_rate': 0.01, 'l2_leaf_reg': 1, 'depth': 6, 'border_count': 100}
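The log line "10 candidates, totalling 50 fits" reflects how a randomized search differs from an exhaustive grid. The standard-library sketch below only illustrates the sampling idea; it is not scikit-learn's code, and the seed is arbitrary.

```python
import itertools
import random

cat_params = {'depth': [3, 1, 2, 6, 4, 5, 7, 8, 9, 10],
              'learning_rate': [0.03, 0.001, 0.01, 0.1, 0.2, 0.3],
              'l2_leaf_reg': [3, 1, 5, 10, 100],
              'border_count': [32, 5, 10, 20, 50, 100, 200]}

# Every combination an exhaustive grid search would have to fit
names = list(cat_params)
full_grid = [dict(zip(names, values))
             for values in itertools.product(*cat_params.values())]

# A randomized search only draws a fixed number of them (10, matching the log above)
random.seed(0)  # arbitrary seed, only to make the sketch reproducible
candidates = random.sample(full_grid, 10)

print(len(full_grid), 'possible combinations')
print(len(candidates) * 5, 'fits with 5-fold cross-validation')
```

With 2100 possible combinations, the search above explored fewer than 1% of the grid, so the reported best parameters are a reasonable starting point rather than a guaranteed optimum.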


In [44]:
# Tune the number of trees with GridSearchCV, starting from the parameters found above
cat = CatBoostRegressor(learning_rate=0.01, l2_leaf_reg = 1, depth = 6,border_count = 100, silent= True)

# CatBoost calls the number of trees `iterations` (n_estimators and num_trees are aliases)
grid_cv = GridSearchCV(estimator = cat,
                       param_grid= {'iterations': [1000,2000,5000,10000,20000,50000]} , cv=5, n_jobs=-1)

In [45]:
grid_cv.fit(x_train, y_train)
print('n_estimators for CatBoost is : ', grid_cv.best_params_)

n_estimators for CatBoost is :  {'iterations': 5000}


In [48]:
cat = CatBoostRegressor(iterations= 5000,
                        learning_rate=0.01, 
                        l2_leaf_reg = 1,
                         depth = 6,
                         border_count = 100, 
                         silent= True)

In [54]:
# Define the list of (name, estimator) pairs to stack, now using the tuned models
estimators = [
          ('RFR', rfr),
          ('XGB' , xgb),
          ('GBR' , gbr),
          ('Ridge' , Ridge(alpha=10)),
          ('Cat', cat)]
sr = StackingRegressor(estimators= estimators, final_estimator = LinearRegression())
sr.fit(x_train, y_train)
y_pred = sr.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same value as mean_squared_error(..., squared=False)
print('mean squared error : ', mse)
print('root mse : ', rmse)
print('r2 scores : ', r2_score(y_test,y_pred))


# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
tuned_row = pd.DataFrame([{'Model': 'hyper_stacked', 'RMSE': rmse,
                           'MAE': mean_absolute_error(y_test, y_pred), 'MSE': mse,
                           'R2': r2_score(y_test, y_pred)}])
model_accuracy = pd.concat([model_accuracy, tuned_row], ignore_index=True)

# predict houseprices
house_price = sr.predict(test_dummy)

mean squared error :  0.013775009636378269
root mse :  0.11736698699539948
r2 scores :  0.9192463438741025


In [55]:
model_accuracy.sort_values('MSE',ascending=True)

,MAE,MSE,Model,R2,RMSE
12,0.076752,0.013775,hyper_stacked,0.919246,0.117367
13,0.076752,0.013775,hyper_stacked,0.919246,0.117367
11,0.080203,0.014345,Stacked,0.915904,0.119771
10,0.083135,0.014533,cat,0.914806,0.120551
5,0.090557,0.016414,GBR,0.903774,0.128118
7,0.093027,0.017909,Ridge,0.89501,0.133826
2,0.094212,0.018307,RFR,0.892677,0.135304
3,0.100604,0.021004,XGB,0.876867,0.144928
0,0.11732,0.029944,LinReg,0.824461,0.173042
4,0.139227,0.031504,ADA,0.815314,0.177493


Hyperparameter tuning of the stacked models is now complete. As the table above shows, the stack of individually tuned models achieves the best scores on every metric.
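To put the gain in perspective, the relative drop in MSE can be computed from the table values. This is a small arithmetic sketch; the helper `relative_reduction` is made up for illustration.

```python
# MSE values copied from the model_accuracy table above
mse_scores = {'cat': 0.014533, 'Stacked': 0.014345, 'hyper_stacked': 0.013775}

def relative_reduction(before, after):
    """Fraction by which the error dropped, relative to the starting value."""
    return (before - after) / before

vs_untuned = relative_reduction(mse_scores['Stacked'], mse_scores['hyper_stacked'])
vs_catboost = relative_reduction(mse_scores['cat'], mse_scores['hyper_stacked'])
print(f"vs. untuned stack: {vs_untuned:.1%}")
print(f"vs. CatBoost alone: {vs_catboost:.1%}")
```

The improvement is roughly 4% over the untuned stack and 5% over CatBoost alone. Because it is measured on a single 292-row hold-out split, a difference of this size should be read with some caution.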

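Time to turn the test-set predictions into house prices. The models predict on a log scale, so `np.expm1` converts the predictions back to prices (this assumes `SalePrice` was transformed with `np.log1p` in the previous notebook).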

In [56]:
house_price = np.expm1(house_price)

In [57]:
house_price

array([131331.96715068, 159510.41169311, 202715.50446174, ...,
       161076.42077712, 122322.68534736, 216190.4312172 ])
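The back-transformation can be checked on a single value. The sketch below assumes the target was transformed with `log1p` in the previous notebook, which is what the use of `np.expm1` suggests; it uses the standard-library `math` equivalents of the NumPy functions.

```python
import math

price = 131331.96715068            # first prediction shown above
log_price = math.log1p(price)      # the scale the models are assumed to be trained on
restored = math.expm1(log_price)   # what np.expm1 does element-wise

print(round(log_price, 4))
print(round(restored, 2))
```

This also explains why the errors reported earlier are so small: an RMSE of about 0.12 is measured on the log scale, not on the prices themselves.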

In [58]:
houses = pd.read_csv('sample_submission.csv')
houses['SalePrice'] = house_price

In [59]:
houses

,Id,SalePrice
0,1461,131331.967151
1,1462,159510.411693
2,1463,202715.504462
3,1464,205736.260546
4,1465,185620.830751
...,...,...
1454,2915,89253.912441
1455,2916,84008.915365
1456,2917,161076.420777
1457,2918,122322.685347


In [60]:
houses.to_csv('submission.csv', index=False)

The submission scored 0.13138, which placed 1245th on the leaderboard.

This is my very first regression analysis, and I learnt a lot from this dataset.

Thank you for taking the time to read the code. If you liked it, I would appreciate an upvote.

Stay safe.