# Random Forest Regressor Model

## **Notebook Content**

1. Introduction


2. Data Preprocessing
    - 2.1 Importing the required packages
    - 2.2 Loading the dataset
    - 2.3 Preparing the data
        - 2.3.1 Remove variables with low correlation
        - 2.3.2 Getting the Dependent and Independent variables
        - 2.3.3 Creating new dataframes based on the data type


3. Building a "baseline" model applying logistic regression
    - 3.1 Build a prediction model with numerical variables
    - 3.2 Build a prediction model with numerical and categorical variables 
        - 3.2.1 Applying dummy variables
        - 3.2.2 Convert the remaining categorical variables into numbers
        - 3.2.3 Implement the same changes in the test set
    - 3.3 Saving the changes
         
         
4. Building a Random Forest Regressor
    - 4.1 Defining the Random Forest Regressor baseline
        - 4.1.1 Fitting the Random Forest Regressor
        - 4.1.2 Predicting the results
    - 4.2 Applying K-Fold Cross-Validation technique
    - 4.3 Applying Grid-Search technique
        - 4.3.1 Random Hyper-parameters Grid technique
            - a) Creating a parameter grid
            - b) Random Search Training
            - c) Evaluate Random Search
        - 4.3.2 Grid-Search technique
            - a) Initiate the grid search model
            - b) Fit the grid search to the data
            - c) Fitting the final Random Forest Regressor Model

## 1. Introduction

The goal of this notebook is to build a predictive regression model using logistic regression and random forest regressor techniques.

## 2. Data Preprocessing

### 2.1 Importing the required packages

In [11]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### 2.2 Loading the dataset

In [12]:
#loading the training set
df_train_clean = pd.read_csv('df_train_clean.csv')
df_train_clean.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500.0
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500.0
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500.0
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000.0
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000.0


In [13]:
df_train_clean.shape

(1324, 75)

### 2.3 Preparing the data

#### 2.3.1 Remove variables with low correlation

First, we remove the variables that have a low correlation with the dependent variable (SalePrice).

In [14]:
#remove variables with low correlation
df_train_clean.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath'], axis = 1, inplace = True)

In [15]:
#check the shape of the dataframe after removing the variables with low correlation
df_train_clean.shape

(1324, 65)
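
The list of dropped columns comes from a correlation analysis done beforehand. As a minimal sketch of how such a list could be derived, the snippet below keeps the names of the numeric columns whose absolute Pearson correlation with the target falls below a cut-off. The toy dataframe, the `find_low_corr_columns` helper and the 0.1 threshold are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def find_low_corr_columns(df, target, threshold=0.1):
    """Return numeric columns whose |Pearson correlation| with `target` is below `threshold`."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    return corr[corr.abs() < threshold].index.tolist()

# Toy data (hypothetical): 'area' drives the price, 'noise' just alternates +1 / -1
area = np.arange(100, dtype=float)
toy = pd.DataFrame({
    'area': area,
    'noise': np.tile([1.0, -1.0], 50),
    'SalePrice': 1000.0 * area + 5000.0,
})

low_corr = find_low_corr_columns(toy, 'SalePrice', threshold=0.1)
toy_reduced = toy.drop(columns=low_corr)
print(low_corr, toy_reduced.shape)
```

Dropping by name with `drop(columns=...)` returns a new dataframe, which avoids mutating the original as `inplace = True` does.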

#### 2.3.2 Getting the Dependent and Independent variables

In [16]:
#Getting the Dependent and Independent variables
X_train = df_train_clean.iloc[:, :-1] #all rows, all columns except the last one
y_train = df_train_clean.iloc[:, -1] #all rows, only the last column (SalePrice)

In [17]:
#check the shape of X_train and y_train
X_train.shape, y_train.shape

((1324, 64), (1324,))

#### 2.3.3 Creating new dataframes based on the data type 

Let's start building our prediction model by creating dataframes grouped by the data type of the variables in X_train. These dataframes will help us build a pilot model composed of only numerical variables.

Then, we are going to add categorical variables to the model to improve its score and predictive power.

In [8]:
##Create dtype dataframes
#create a dataframe with only categorical variables
df_object = X_train.select_dtypes(include=[object])
#create a dataframe with only numerical variables
df_number = X_train.select_dtypes(include=[np.number])

In [9]:
df_object.head()

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
1,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,...,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
2,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
3,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,...,SBrkr,Gd,Typ,Detchd,Unf,TA,TA,Y,WD,Abnorml
4,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal


In [19]:
df_object.columns

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
       'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

Comments =>> We have a total of 38 categorical variables. For our predictive model, we have to convert them to numeric values in order to apply algorithms such as logistic regression and random forest. However, we cannot transform them all in the same way, as in some cases it is more convenient to turn a variable into dummy variables to achieve a positive impact on the model as a whole.

In the following sections, you will find more details about the treatment of the categorical variables.

In [10]:
df_number.head()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,...,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch
0,65.0,8450,7,5,2003,2003,196.0,706.0,150.0,856.0,...,3,1,8,0,2003.0,2.0,548.0,0,61,0
1,80.0,9600,6,8,1976,1976,0.0,978.0,284.0,1262.0,...,3,1,6,1,1976.0,2.0,460.0,298,0,0
2,68.0,11250,7,5,2001,2002,162.0,486.0,434.0,920.0,...,3,1,6,1,2001.0,2.0,608.0,0,42,0
3,60.0,9550,7,5,1915,1970,0.0,216.0,540.0,756.0,...,3,1,7,1,1998.0,3.0,642.0,0,35,272
4,84.0,14260,8,5,2000,2000,350.0,655.0,490.0,1145.0,...,4,1,9,1,2000.0,3.0,836.0,192,84,0


In [18]:
df_number.columns

Index(['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch'],
      dtype='object')

Comments => We have a total of 26 numerical variables. For our predictive model, we can consider all of them, since both Logistic Regression and Random Forest operate on numerical variables and their number is not very high.

## 3. Building a "baseline" model applying logistic regression

### 3.1 Build a prediction model with numerical variables

#### 3.1.1 Pilot Model 1 (numerical variables with correlation above +0.5 or below -0.5)

Let's start by creating a pilot model with only those variables that are strongly correlated with the dependent variable (an absolute correlation above 0.50).

In [20]:
pilot_model_1 = df_number[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt']]

In [21]:
pilot_model_1.shape

(1324, 11)

In [22]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_1 = LogisticRegression (random_state = 0)
log_regressor_1.fit(pilot_model_1, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

__Important Note:__ The classifier learns the relationship between the features in pilot_model_1 and the target y_train.

Now, let's compute the score of the model. Because `LogisticRegression` is a classifier that treats every distinct SalePrice as a separate class, its `score` method returns the mean accuracy (the share of rows whose predicted price exactly matches the true price), not the $R^2$ (coefficient of determination). The $R^2$, which applies to the random forest regressor used later, measures the proportion of the variation in the results that can be explained by the model.

**The best possible $R^2$ is 1.0 and it can be negative** (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.
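
As a worked illustration of the $R^2$ definition, $R^2 = 1 - SS_{res} / SS_{tot}$, the sketch below computes it by hand for three sets of predictions: exact ones, a constant equal to the mean of y, and deliberately bad ones. The `r2_score_manual` helper and the four target values are made up for this example.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 200.0, 300.0, 400.0])

perfect = r2_score_manual(y_true, y_true)                      # exact predictions
constant = r2_score_manual(y_true, np.full(4, y_true.mean()))  # always predicts the mean
worse = r2_score_manual(y_true, y_true[::-1])                  # predictions in reversed order
print(perfect, constant, worse)
```

The reversed predictions give a residual sum of squares four times larger than the total sum of squares, which is how the score ends up below zero.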

In [23]:
#Compute Score (mean accuracy, as returned by LogisticRegression.score) for pilot_model_1 and y_train
print('Training Score: {}'.format(log_regressor_1.score(pilot_model_1, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_1 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_1.predict(pilot_model_1) - y_train)**2)))

Training Score: 0.2190332326283988
Training MSE: 1242486857.018127
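
When reading the score above, it helps to know that for a scikit-learn classifier such as `LogisticRegression`, `score` reports the mean accuracy of the predictions. The sketch below checks this on a small synthetic classification problem; the data and the class thresholds are invented for illustration and have nothing to do with the house prices dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic, illustrative data: two features and three integer classes
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.digitize(X[:, 0] + X[:, 1], bins=[-0.5, 0.5])  # classes 0, 1, 2

clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(X, y)

score = clf.score(X, y)                       # what .score() reports
accuracy = accuracy_score(y, clf.predict(X))  # fraction of exactly matched labels
print(score, accuracy)
```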


#### 3.1.2 Pilot Model 2 (all numerical variables)

Now, we are going to include all the numerical variables in the pilot model in order to check its performance and define a preliminary baseline.

In [24]:
pilot_model_2 = df_number[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF']]

In [25]:
pilot_model_2.shape

(1324, 26)

In [26]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_2 = LogisticRegression (random_state = 0)
log_regressor_2.fit(pilot_model_2, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [38]:
#Compute Score (mean accuracy, as returned by LogisticRegression.score) for pilot_model_2 and y_train
print('Training Score: {}'.format(log_regressor_2.score(pilot_model_2, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_2 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_2.predict(pilot_model_2) - y_train)**2)))

Training Score: 0.6102719033232629
Training MSE: 515970129.33912385


__Comments:__ Both the Score and the MSE improved after including all the numerical variables, which show a significant level of correlation with the dependent variable.

__Note [2]:__ We tried to improve the performance of the model by checking the distribution of every numerical variable in the model and applying logarithms to those variables whose distribution was close to Gaussian.

However, the score of the Logistic Regression worsened after applying the logarithm functions, so we decided not to use this method, as the results were not as expected.

You can check the details of our analysis at the following link:

[Notebook - Testing Variables Distribution Applying 'Logarithms'](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Testing%20Variables%20Distribution%20Applying%20'Logarithms'.ipynb)
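
For readers who have not used this kind of transformation, here is a minimal sketch of a logarithmic transform on a right-skewed variable. The synthetic series, its name and the choice of `np.log1p` are assumptions made for this example; the linked notebook may proceed differently.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable (illustrative only, not taken from the dataset)
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=2000), name='LotArea_like')

transformed = np.log1p(raw)   # log(1 + x), which is also defined for zeros

skew_raw = raw.skew()
skew_log = transformed.skew()
print('skew before: {:.2f}, after: {:.2f}'.format(skew_raw, skew_log))

# np.expm1 inverts the transform, e.g. to map values back to the original scale
restored = np.expm1(transformed)
```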

### 3.2 Build a prediction model with numerical and categorical variables

#### 3.2.1 Convert some categorical variables into dummy variables

The first step to build a predictive model that uses both numerical and categorical variables is to convert into dummy variables those categorical variables that have a positive impact on the model and only a few categories, in order to avoid a significant increase in the number of features in the dataset.

To keep this notebook short and free of redundant code, we carried out this analysis in other notebooks, which are linked from this file.

The following Jupyter Notebook shows the analysis in which we converted every candidate variable into dummy variables and measured the impact of each one on the performance of the model:

[Categorical Data - Dummy Variables Testing](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Categorical%20Data%20-%20Dummy%20Variables%20Testing.ipynb)

However, we found that certain variables ('ExterCond', 'Utilities' and 'Street') did not have a positive impact on the model once transformed into dummy variables. So, we excluded these variables from the dummy analysis and created a final version of the notebook, in which we tested the performance of the model with all the chosen dummy variables and the remaining categorical variables converted into numbers.

You can check the details of our analysis in the Jupyter Notebook:

[Categorical Variables - Analysis & Testing.ipynb](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Categorical%20Data%20-%20Analysis%20%26%20Testing.ipynb)


So, let's go!

We are going to start building our predictive model by converting the chosen categorical variables into dummy variables.

In [27]:
#convert the chosen categorical variables into dummy variables
X_train = pd.get_dummies (X_train, columns = ['LotShape', 'LandContour', 'LandSlope', 'BldgType', 'MasVnrType', 'ExterQual', 
                                              'BsmtQual', 'BsmtCond', 'BsmtExposure', 'CentralAir', 'KitchenQual', 
                                              'GarageFinish', 'PavedDrive'])
                                              
#check the shape of X_train after converting the variables into dummy
X_train.shape

(1324, 99)
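
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy dataframe (the three rows are illustrative). Each encoded column is replaced by one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are illustrative)
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'LotShape': ['Reg', 'Reg', 'IR1'],
})

encoded = pd.get_dummies(toy, columns=['LotShape'])
print(list(encoded.columns))
print(encoded.shape)
```

The same mechanism explains the shape reported above: the 13 encoded columns are replaced by 48 indicator columns, so the 64 original columns become 64 - 13 + 48 = 99.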

In [28]:
#check the name of the columns after converting the variables into dummy
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Utilities',
       'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterCond',
       'Foundation', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF',
       'TotalBsmtSF', 'Heating', 'HeatingQC', 'Electrical', '1stFlrSF',
       '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'GarageType', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'GarageQual', 'GarageCond', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'SaleType', 'SaleCondition', 'LotShape_IR1',
       'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk',
       'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl',
       'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev',

Now, we combine the numerical variables and all the dummy variables in the same pilot model.

In [29]:
#numerical model + dummy variables
pilot_model_3 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl','LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 
                           'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None', 
                           'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 
                           'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'BsmtQual_Ex', 
                           'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 
                           'BsmtCond_TA', 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 
                           'CentralAir_N', 'CentralAir_Y', 'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd', 
                           'KitchenQual_TA', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf','PavedDrive_N', 
                           'PavedDrive_P', 'PavedDrive_Y']]

pilot_model_3.shape

(1324, 74)

Let's fit a logistic regression on the training data to check the performance of the model after including all the dummy variables.

In [30]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_3 = LogisticRegression (random_state = 0)
log_regressor_3.fit(pilot_model_3, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [32]:
#Compute Score (mean accuracy, as returned by LogisticRegression.score) for pilot_model_3 and y_train
print('Training Score: {}'.format(log_regressor_3.score(pilot_model_3, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_3 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_3.predict(pilot_model_3) - y_train)**2)))

Training Score: 0.7741691842900302
Training MSE: 183925345.60725075


Comments =>> The Score and MSE of our model improved after applying all the dummy variables compared to pilot_model_2 ("0.7741 VS 0.6102" and "183925345 VS 515970129" respectively).

So, the *combination of those dummy variables and the numerical variables has had a positive impact on model performance*.

#### 3.2.2 Convert the remaining categorical variables into numbers

Now, we are going to convert the remaining categorical variables into numbers and check the performance of the model.

In [33]:
#convert the rest of the categorical variables into numbers
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in X_train.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    X_train[col] = lencoders[col].fit_transform(X_train[col])

In [34]:
#check the datatype of X_train to review that all the variables are numbers
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1324 entries, 0 to 1323
Data columns (total 99 columns):
MSZoning              1324 non-null int32
LotFrontage           1324 non-null float64
LotArea               1324 non-null int64
Street                1324 non-null int32
Utilities             1324 non-null int32
LotConfig             1324 non-null int32
Neighborhood          1324 non-null int32
Condition1            1324 non-null int32
Condition2            1324 non-null int32
HouseStyle            1324 non-null int32
OverallQual           1324 non-null int64
OverallCond           1324 non-null int64
YearBuilt             1324 non-null int64
YearRemodAdd          1324 non-null int64
RoofStyle             1324 non-null int32
RoofMatl              1324 non-null int32
Exterior1st           1324 non-null int32
Exterior2nd           1324 non-null int32
MasVnrArea            1324 non-null float64
ExterCond             1324 non-null int32
Foundation            1324 non-null int32
BsmtFin

Let's merge all the variables in the same pilot_model.

In [35]:
#numerical variables + all dummy variables + remaining categorical variables (label-encoded)
pilot_model_4 = X_train

pilot_model_4.shape

(1324, 99)

In [36]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_4 = LogisticRegression (random_state = 0)
log_regressor_4.fit(pilot_model_4, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [37]:
#Compute Score (mean accuracy, as returned by LogisticRegression.score) for pilot_model_4 and y_train
print('Training Score: {}'.format(log_regressor_4.score(pilot_model_4, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_4 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_4.predict(pilot_model_4) - y_train)**2)))

Training Score: 0.8919939577039275
Training MSE: 57845974.32024169


Once we apply logistic regression with all the transformed variables, we achieve a Score of 0.89199 and a Mean Squared Error of 57,845,974.

Therefore, we take pilot_model_4 as the reference or baseline. The goal is to improve the Score and the MSE of our model by applying one of the most popular ensemble algorithms, the Random Forest Regressor.

#### 3.2.3 Implementing the changes in the test set

In order to build a good model and predict the results correctly, we have to apply to the test set the same changes we made earlier to the training set, so that the information is in the same format and the results are realistic.

##### a) Loading the dataset

Let's start loading the test set.

In [93]:
#loading the X_test dataset
X_test = pd.read_csv('df_test_clean.csv')
X_test.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,20,RH,80.0,11622,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,120,0,0,6,2010,WD,Normal
1,20,RL,81.0,14267,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,36,0,0,0,0,12500,6,2010,WD,Normal
2,60,RL,74.0,13830,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,34,0,0,0,0,0,3,2010,WD,Normal
3,60,RL,78.0,9978,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,36,0,0,0,0,0,6,2010,WD,Normal
4,120,RL,43.0,5005,Pave,IR1,HLS,AllPub,Inside,Gtl,...,82,0,0,144,0,0,1,2010,WD,Normal


In [94]:
#check the shape of the test set
X_test.shape

(1319, 74)

##### b) Delete variables with low correlation

We are going to remove those variables that show a low correlation with respect to the SalePrice variable.

In [95]:
#remove variables with low correlation
X_test.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath'], axis = 1, inplace = True)

In [96]:
#check the shape of the test set after removing variables with low correlation
X_test.shape

(1319, 64)

##### c) Convert some categorical variables into dummy


We are going to convert into dummy variables some categorical features that have an impact on the SalePrice.

In [97]:
#convert some categorical variables into dummy variables
X_test = pd.get_dummies (X_test, columns = ['LandSlope', 'LotShape', 'LandContour', 'MasVnrType', 'ExterQual', 'BldgType', 
                                             'BsmtQual', 'BsmtCond', 'BsmtExposure', 'CentralAir', 'KitchenQual', 'GarageFinish',
                                             'PavedDrive'])

#check the shape of test set after converting the variables into dummy
X_test.shape

(1319, 99)
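
Here both sets end up with 99 columns, but `pd.get_dummies` only creates indicator columns for the categories it actually sees, so the two sets can diverge when a category is missing from one of them. The sketch below, on toy data invented for the example, shows one possible safeguard: reindexing the encoded test set on the training columns.

```python
import pandas as pd

train = pd.DataFrame({'LotArea': [8450, 9600, 11250],
                      'LotShape': ['Reg', 'IR1', 'IR2']})
test = pd.DataFrame({'LotArea': [11622, 14267],
                     'LotShape': ['Reg', 'IR1']})   # 'IR2' never appears here

train_enc = pd.get_dummies(train, columns=['LotShape'])
test_enc = pd.get_dummies(test, columns=['LotShape'])

# Add missing indicator columns (filled with 0), drop extras, match the column order
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Comparing `list(X_train.columns)` with `list(X_test.columns)` is a quick way to confirm that equal shapes also mean equal columns.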

##### d) Convert the remaining categorical variables into numbers

Regarding the remaining categorical variables, we have to convert them into numbers in order to build our regression model.

In [98]:
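#convert the rest of the categorical variables into numbers
#NOTE: fitting new encoders on the test set can assign integer codes that differ from
#the ones learned on the training set whenever the two sets do not contain the same categories
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in X_test.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    X_test[col] = lencoders[col].fit_transform(X_test[col])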
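
`LabelEncoder` numbers the categories in sorted order of the values it was fitted on, so an encoder fitted on the test set alone may map the same category to a different integer than the training encoder did. The sketch below illustrates the difference on made-up zoning labels; it is one possible way to keep the codes consistent, not the approach used in the cell above.

```python
from sklearn.preprocessing import LabelEncoder

train_zoning = ['RL', 'RM', 'FV', 'RL']
test_zoning = ['RM', 'RL']            # 'FV' is absent from the test sample

# Separate fit on the test data: codes come from the test categories only
separate_codes = LabelEncoder().fit_transform(test_zoning)

# Shared encoder: fitted on the training data, then reused on the test data
shared = LabelEncoder().fit(train_zoning)
shared_codes = shared.transform(test_zoning)

print(list(shared.classes_))
print(list(separate_codes), list(shared_codes))
```

Note that `transform` raises a `ValueError` when the test data contains a category the encoder never saw, so unseen categories need to be handled before reusing the training encoders.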

In [99]:
#check the datatype of X_test to review that all the variables are numbers
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319 entries, 0 to 1318
Data columns (total 99 columns):
MSZoning              1319 non-null int32
LotFrontage           1319 non-null float64
LotArea               1319 non-null int64
Street                1319 non-null int32
Utilities             1319 non-null int32
LotConfig             1319 non-null int32
Neighborhood          1319 non-null int32
Condition1            1319 non-null int32
Condition2            1319 non-null int32
HouseStyle            1319 non-null int32
OverallQual           1319 non-null int64
OverallCond           1319 non-null int64
YearBuilt             1319 non-null int64
YearRemodAdd          1319 non-null int64
RoofStyle             1319 non-null int32
RoofMatl              1319 non-null int32
Exterior1st           1319 non-null int32
Exterior2nd           1319 non-null int32
MasVnrArea            1319 non-null float64
ExterCond             1319 non-null int32
Foundation            1319 non-null int32
BsmtFin

In [100]:
#Review the final data
X_test.head()

Unnamed: 0,MSZoning,LotFrontage,LotArea,Street,Utilities,LotConfig,Neighborhood,Condition1,Condition2,HouseStyle,...,KitchenQual_Ex,KitchenQual_Fa,KitchenQual_Gd,KitchenQual_TA,GarageFinish_Fin,GarageFinish_RFn,GarageFinish_Unf,PavedDrive_N,PavedDrive_P,PavedDrive_Y
0,2,80.0,11622,1,0,4,12,1,2,2,...,0,0,0,1,0,0,1,0,0,1
1,3,81.0,14267,1,0,0,12,2,2,2,...,0,0,1,0,0,0,1,0,0,1
2,3,74.0,13830,1,0,4,8,2,2,4,...,0,0,0,1,1,0,0,0,0,1
3,3,78.0,9978,1,0,4,8,2,2,4,...,0,0,1,0,1,0,0,0,0,1
4,3,43.0,5005,1,0,4,22,2,2,2,...,0,0,1,0,0,1,0,0,0,1


Now we are ready to train the Random Forest Regressor model, as the training set and test set are aligned.

### 3.3 Saving the changes

In [101]:
#export the baseline model to csv
X_train.to_csv('Xtrain_baseline_model.csv', index=False)
y_train.to_csv('ytrain_baseline_model.csv', index = False)
X_test.to_csv('Xtest_baseline_model.csv', index = False)


In [102]:
y_train.to_pickle("./y_training.pkl")

## 4. Building a Random Forest Regressor Model

### 4.1 Defining the Random Forest Regressor baseline

We start our analysis by building a simple random forest regressor model, which serves as the baseline. Then, using cross-validation techniques, we will search for the best parameters and apply them in order to build a solid and robust predictive model.

So, let's start creating the baseline model.

#### 4.1.1 Fitting the Random Forest Regressor

In [103]:
# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
rf_regressor.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

In [104]:
#Compute Score (R^2) for X_train and y_train
print('Training Score: {}'.format(rf_regressor.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for X_train and y_train
print('Training MSE: {}'.format(np.mean((rf_regressor.predict(X_train) - y_train)**2)))

Training Score: 0.9739051625728526
Training MSE: 123354012.79515868


Comments =>> Comparing the random forest results with those of the logistic regression, the random forest shows greater potential, as the training score rises by about 0.08 (from 0.892 to 0.974). Keep in mind that the comparison is only indicative: the random forest score is an $R^2$, whereas the logistic regression score is an accuracy.

#### 4.1.2 Predicting the results

In [107]:
# Predicting the test set results
y_pred = rf_regressor.predict(X_test)
y_pred

array([126600., 154805., 178900., ...,  85250., 148860., 221970.])

Comments => As we do not have any test labels (y_test) to validate the predictions of our model, we have to look for another method to check that our model is trained properly (without overfitting) and predicts new results correctly.

One possible option is to apply **cross-validation techniques**, which validate the model directly on the training set.

### 4.2 Applying K-Fold Cross-Validation technique

One of the most common cross-validation techniques is **K-fold, which consists of splitting the training set into K subsets, called folds**. Then, we iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the remaining fold (called the validation data). At the very end, we average the performance over the folds to come up with the final validation metrics for the model.

Let's see how the K-fold technique works with our random forest regressor model.

In [109]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(estimator = rf_regressor, X = X_train, y = y_train, cv = 10)
print("Score / Accuracy: {:.2f} %".format(scores.mean()*100))
print("Standard Deviation: {:.2f} %".format(scores.std()*100))

Score / Accuracy: 85.48 %
Standard Deviation: 3.63 %
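
To show what `cross_val_score` does internally, the sketch below writes the K-fold loop by hand and compares the per-fold scores with the library result. It runs on synthetic data from `make_regression` with 5 folds, both of which are assumptions chosen to keep the example small and fast; the scores it prints are unrelated to the ones above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=10, random_state=0)
kfold = KFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on the other K-1 folds
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X, y, cv=kfold)
print(manual_scores.mean(), manual_scores.std())
```

For a regressor, each fold score is an $R^2$, so the averaged value is a mean $R^2$ rather than a classification accuracy.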


Comments: We got a mean score ($R^2$) of 85.48% across the 10 folds with the K-fold cross-validation technique, well below the training score of 97.39%.

Let's check how to improve the result with another popular technique that builds on cross-validation, the *Grid-Search technique*.

### 4.3 Applying Grid-Search technique

We would like to optimize our random forest model by tuning some hyper-parameters to get better performance. To do so, we will use the **"Random Hyper-parameters Grid" and "Grid-Search" methods to find the best parameters for our regression model.**

Before we start tuning the hyper-parameters, we need to check which parameters we are using now.

In [110]:
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf_regressor.get_params())

Parameters currently in use:

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 0,
 'verbose': 0,
 'warm_start': False}


We will try adjusting the following set of hyperparameters:
* __n_estimators__ = number of trees in the forest
* __max_features__ = max number of features considered for splitting a node
* __max_depth__ = max number of levels in each decision tree
* __min_samples_split__ = min number of data points placed in a node before the node is split
* __min_samples_leaf__ = min number of data points allowed in a leaf node
* __bootstrap__ = method for sampling data points (with or without replacement)
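As a rough illustration of how these hyper-parameters are passed to the model and how they constrain it, the sketch below fits a small forest and inspects the resulting trees. The synthetic data and the chosen values are assumptions made only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=20,        # number of trees in the forest
    max_features='sqrt',    # features considered when splitting a node
    max_depth=4,            # maximum number of levels in each tree
    min_samples_split=5,    # minimum samples in a node before it can be split
    min_samples_leaf=2,     # minimum samples allowed in a leaf node
    bootstrap=True,         # sample the training rows with replacement
    random_state=0,
)
forest.fit(X_demo, y_demo)

# Inspect the fitted trees to confirm the constraints were applied
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print("Number of trees:", n_trees)
print("Deepest tree:", deepest, "levels")
```

No tree grows deeper than `max_depth`, and the forest contains exactly `n_estimators` trees.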

#### 4.3.1 Random Hyper-parameters Grid technique

We start our analysis with the **Random Hyper-parameters Grid technique**. The benefit of a random search is that it does not try every combination; instead, it samples combinations at random from a wide range of values.

##### a) Creating a parameter grid 

In order to apply the Random Hyper-parameters Grid technique, we have to use the **RandomizedSearchCV class**. So, we first need to create a parameter grid to sample from during fitting.

In [111]:
## Creating a parameter grid

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


Comments =>> On each iteration, the algorithm will choose a different combination of the hyper-parameter values.
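To get a feel for why sampling is attractive here, the following sketch counts every combination in the grid defined above and then draws 100 of them at random. It uses `ParameterGrid` and `ParameterSampler` from scikit-learn; the name `random_grid_demo` is an assumption used to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the random grid above, rebuilt so the example runs on its own
random_grid_demo = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Every possible combination (what an exhaustive grid search would try)
n_total = len(ParameterGrid(random_grid_demo))

# 100 combinations drawn at random (what a random search with n_iter=100 tries)
sampled = list(ParameterSampler(random_grid_demo, n_iter=100, random_state=0))

print("Total combinations in the grid:", n_total)
print("Combinations sampled:", len(sampled))
print("First sampled combination:", sampled[0])
```

The grid holds 10 x 2 x 12 x 3 x 3 x 2 = 4320 combinations, so a random search with 100 iterations evaluates only a small fraction of them.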

##### b) Random Search Training

We will use the random grid to search for the best-performing values of the hyper-parameters of the random forest regression model.

In [113]:
## Use the random grid to search for best hyperparameters

# First create the base model to tune
rf_bm = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Random search of parameters, using 10 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf_bm, param_distributions = random_grid, n_iter = 100, cv = 10, verbose = 2,
                               random_state = 0, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   56.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 10.8min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 29.5min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 47.6min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 74.2min finished


RandomizedSearchCV(cv=10, error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=10, n_jobs=None,
                                                   oob_score=False,
                                                   random_state=...


Comments =>> The most important arguments in RandomizedSearchCV are n_iter, which controls the number of different combinations to try, and cv, which is the number of folds to use for cross-validation. In our case, we used 100 iterations and 10 folds, for a total of 1000 fits. More iterations cover a wider search space and more folds give a more reliable score estimate, which reduces the risk of overfitting, but both increase the run time (about 74 minutes here).
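The relationship between `n_iter`, `cv` and the number of fits can be checked on a much smaller problem. The sketch below is only illustrative: the synthetic data, the reduced grid `small_grid` and the small values of `n_iter` and `cv` are assumptions chosen so that it runs in seconds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)

# A deliberately small grid so the search is fast
small_grid = {
    'n_estimators': [10, 20, 30],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=5,   # number of combinations to try
    cv=3,       # number of folds per combination
    random_state=0,
)
search.fit(X_demo, y_demo)

# Each candidate is fitted once per fold
n_candidates = len(search.cv_results_['params'])
n_fits = n_candidates * search.n_splits_
print("Candidates tried:", n_candidates)
print("Folds per candidate:", search.n_splits_)
print("Total fits:", n_fits)
```

With `n_iter = 100` and `cv = 10`, the same arithmetic gives the 1000 fits reported in the log above.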


Now, let's check the best parameters from fitting the random search.

In [114]:
#check the best random-search parameters
rf_random.best_params_

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': False}

#####  c) Evaluate Random Search

To determine whether the random search produced a better model, we compute the score (R²) and MSE metrics of the rf_random model. Then, we compare the results of the random search model with the base model.

In [119]:
#Compute Score (R²) for the rf_random and y_training
print('Training Score: {}'.format(rf_random.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for the rf_random and y_training
print('Training MSE: {}'.format(np.mean((rf_random.predict(X_train) - y_train)**2)))

Training Score: 0.9999955053205318
Training MSE: 21246.989952617543


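Comments =>> The training score is almost 1. This is expected: with bootstrap = False and min_samples_leaf = 1, each tree is grown on the full training set and can fit it almost perfectly. These training metrics therefore say little about how well the model generalizes; the cross-validated score (rf_random.best_score_) is the fairer figure to compare with the base model.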
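The metrics above are computed on the training data, so a held-out comparison can help put them in context. The sketch below is a minimal illustration on synthetic data; the split into `X_fit` / `X_val` and the model settings are assumptions, not the notebook's actual data or tuned model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data (assumption: stands in for the housing data)
X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=25.0, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Fully grown trees without bootstrap, similar in spirit to the best random-search parameters
model = RandomForestRegressor(n_estimators=50, max_features='sqrt',
                              bootstrap=False, random_state=0)
model.fit(X_fit, y_fit)

# Metrics on the data the model was fitted on
train_r2 = r2_score(y_fit, model.predict(X_fit))
train_mse = mean_squared_error(y_fit, model.predict(X_fit))

# Metrics on data the model has never seen
val_r2 = r2_score(y_val, model.predict(X_val))
val_mse = mean_squared_error(y_val, model.predict(X_val))

print("Training   R²: {:.4f}  MSE: {:.2f}".format(train_r2, train_mse))
print("Validation R²: {:.4f}  MSE: {:.2f}".format(val_r2, val_mse))
```

A large gap between the two rows suggests that the training score alone overstates the quality of the model.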

We can further improve our results by using grid search to focus on the most promising hyper-parameter ranges found in the random search.

#### 4.3.2 Grid-Search technique

Once we are clear about the ranges of values that allow our model to achieve a good score, we are ready to apply the grid-search technique to find the best parameters for our final predictive regression model.

##### a) Initiate the grid search model

In [130]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = [{
    'bootstrap': [False],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [1.0, 2, 3],  # note: a float is read as a fraction of the samples, so 1.0 means all of them
    'n_estimators': [1000, 1200, 1400, 1600]}]

# Create a base model
rf_gs = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf_gs, param_grid = param_grid, 
                          cv = 10, n_jobs = -1)

##### b) Fit the grid search to the data

In [131]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

#check the best grid-search score (mean cross-validated R² of the best candidate)
best_accuracy = grid_search.best_score_
#check the best grid-search parameters
best_parameters = grid_search.best_params_
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Best Accuracy: 85.67 %
Best Parameters: {'bootstrap': False, 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 1400}


Comments =>> Although the difference is small, the Grid-search gave a slightly better cross-validated score than the baseline model evaluated with K-Fold cross-validation (85.67% versus 85.48%).
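When the gap between two scores is this small, it can help to look at the fold-to-fold spread of each candidate as well as its mean. The sketch below shows one way to do that through the `cv_results_` attribute of a fitted search; the synthetic data and the small grid `demo_grid` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

# A small grid so that the search finishes quickly
demo_grid = {'max_depth': [2, 4, None], 'n_estimators': [10, 30]}
demo_search = GridSearchCV(RandomForestRegressor(random_state=0),
                           param_grid=demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of the fold scores
results = pd.DataFrame(demo_search.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')
print(summary.to_string(index=False))
```

If the difference between two mean scores is smaller than their standard deviations, the ranking between those candidates should be read with caution.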



##### c) Fitting the final Random Forest Regressor Model

Finally, we build and fit the final Random Forest Regressor Model using the best parameter values that the grid-search has provided.

In [135]:
#Fitting the final Random Forest Regressor Model

regressor_finalmodel = RandomForestRegressor(n_estimators = 1400, random_state = 0, bootstrap = False, max_depth = 80,
                                             max_features = 3, min_samples_leaf = 1, min_samples_split = 2)

regressor_finalmodel.fit(X_train, y_train)

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=80,
                      max_features=3, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=1400,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)
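As a side note, a fitted `GridSearchCV` with the default `refit=True` already holds a model refitted on the whole training data in its `best_estimator_` attribute. The sketch below, on synthetic data with an assumed small grid, suggests that rebuilding the model by hand from `best_params_` gives the same predictions when the same `random_state` is used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption: stands in for X_train / y_train)
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    param_grid={'max_depth': [3, None], 'min_samples_leaf': [1, 3]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Option 1: reuse the model that the search refitted on all the data
auto_model = demo_search.best_estimator_

# Option 2: rebuild the model by hand from the best parameters
manual_model = RandomForestRegressor(n_estimators=10, random_state=0,
                                     **demo_search.best_params_)
manual_model.fit(X_demo, y_demo)

same = np.allclose(auto_model.predict(X_demo), manual_model.predict(X_demo))
print("Best parameters:", demo_search.best_params_)
print("Same predictions from both options:", same)
```

Unpacking `best_params_` with `**` also avoids copying the values by hand, which reduces the chance of a transcription error.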

In [136]:
#Compute Score (R²) for the regressor_finalmodel and y_training
print('Training Score: {}'.format(regressor_finalmodel.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for the regressor_finalmodel and y_training
print('Training MSE: {}'.format(np.mean((regressor_finalmodel.predict(X_train) - y_train)**2)))

Training Score: 0.9999954924677704
Training MSE: 21307.74678617668


In [137]:
# Predicting the test set results
y_pred_finalmodel = regressor_finalmodel.predict(X_test)
y_pred_finalmodel

array([127374.48785714, 155743.18857143, 179954.93      , ...,
        97405.93785714, 156186.78      , 221590.27285714])

__End of analysis.__

__Thanks for reading!!__