# Categorical Variables | Analysis & Testing

1. Introduction


2. Data Preprocessing
    - 2.1 Importing the required packages
    - 2.2 Loading the dataset
    - 2.3 Preparing the data


3. Variables Testing & Adjustments
    - 3.1 Build a prediction model with numerical variables
    - 3.2 Build a prediction model with numerical and categorical variables 
        - 3.2.1 Applying dummy variables
        - 3.2.2 Convert the remaining categorical variables into numbers

## 1. Introduction

The goal of this notebook is to try to improve the performance of the initial model created in the *House Prices Prediction* project, which is composed only of numerical variables.

In the main notebook of the *House Prices Prediction* project we analyzed the distribution and contribution of the categorical variables individually. Now we go a step further and look for the best way to use the categorical variables to improve model performance.

To do so, we will convert into dummy variables those categorical variables that have fewer than 5 categories and a high impact on the dependent variable. We will then convert the remaining categorical variables into numbers and build the final model for our House Price Prediction analysis.
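The first half of that selection rule (fewer than 5 categories) can be checked programmatically. The following is a minimal sketch on a toy frame, not the notebook's own code: the `toy` data and the `MAX_LEVELS` name are assumptions made for illustration, and the "high impact on the dependent variable" criterion still has to come from the analysis in the main notebook.

```python
import pandas as pd

# Toy frame standing in for the training data (illustrative values only).
toy = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave', 'Pave', 'Pave'],
    'Neighborhood': ['CollgCr', 'Veenker', 'Crawfor', 'NoRidge', 'Mitchel', 'Somerst'],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198, 1362],
})

MAX_LEVELS = 5  # hypothetical name for the "fewer than 5 options" threshold

# Treat every non-numeric column as categorical.
categorical = toy.select_dtypes(exclude='number').columns
level_counts = toy[categorical].nunique()

# Keep only the columns with fewer than MAX_LEVELS distinct values.
dummy_candidates = level_counts[level_counts < MAX_LEVELS].index.tolist()
print(dummy_candidates)
```

Here `Street` has two levels and qualifies, while `Neighborhood` has six and would be left for the numeric conversion step.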

## 2. Data Preprocessing

### 2.1 Importing the required packages

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy import stats
from scipy.stats import skew, boxcox_normmax, norm
from scipy.special import boxcox1p

import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator

import warnings
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

### 2.2 Loading the dataset

In [2]:
#loading the training set
df_train_clean = pd.read_csv('df_train_clean.csv')
df_train_clean.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706.0,Unf,0.0,150.0,856.0,GasA,Ex,Y,SBrkr,856,854,0,1710,1.0,0.0,2,1,3,1,Gd,8,Typ,0,Attchd,2003.0,RFn,2.0,548.0,TA,TA,Y,0,61,0,0,0,0,0,2,2008,WD,Normal,208500.0
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978.0,Unf,0.0,284.0,1262.0,GasA,Ex,Y,SBrkr,1262,0,0,1262,0.0,1.0,2,0,3,1,TA,6,Typ,1,Attchd,1976.0,RFn,2.0,460.0,TA,TA,Y,298,0,0,0,0,0,0,5,2007,WD,Normal,181500.0
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486.0,Unf,0.0,434.0,920.0,GasA,Ex,Y,SBrkr,920,866,0,1786,1.0,0.0,2,1,3,1,Gd,6,Typ,1,Attchd,2001.0,RFn,2.0,608.0,TA,TA,Y,0,42,0,0,0,0,0,9,2008,WD,Normal,223500.0
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216.0,Unf,0.0,540.0,756.0,GasA,Gd,Y,SBrkr,961,756,0,1717,1.0,0.0,1,0,3,1,Gd,7,Typ,1,Detchd,1998.0,Unf,3.0,642.0,TA,TA,Y,0,35,272,0,0,0,0,2,2006,WD,Abnorml,140000.0
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655.0,Unf,0.0,490.0,1145.0,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1.0,0.0,2,1,4,1,Gd,9,Typ,1,Attchd,2000.0,RFn,3.0,836.0,TA,TA,Y,192,84,0,0,0,0,0,12,2008,WD,Normal,250000.0


In [3]:
df_train_clean.shape

(1324, 75)

### 2.3 Preparing the data

In [4]:
#remove variables with low correlation
df_train_clean.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath'], axis = 1, inplace = True)

In [5]:
#check the shape of the dataframe after removing the variables with low correlation
df_train_clean.shape

(1324, 65)

In [6]:
#Split the data into independent variables (X_train) and the dependent variable (y_train)
X_train = df_train_clean.iloc[:, :-1] #all rows, all columns except the last one
y_train = df_train_clean.iloc[:, 64] #all rows, only the last column (SalePrice)

In [7]:
#check the shape of X_train and y_train
X_train.shape, y_train.shape

((1324, 64), (1324,))
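The positional split above depends on `SalePrice` being the last column, at index 64. As a hedged alternative, the target can be selected by name so the split keeps working if the column order changes. This is a sketch on a toy frame; `TARGET`, `X_toy` and `y_toy` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same kind of layout: features first, target last.
toy = pd.DataFrame({
    'GrLivArea': [1710, 1262, 1786],
    'Street': ['Pave', 'Pave', 'Grvl'],
    'SalePrice': [208500.0, 181500.0, 223500.0],
})

TARGET = 'SalePrice'  # hypothetical constant naming the dependent variable

X_toy = toy.drop(columns=[TARGET])  # every column except the target
y_toy = toy[TARGET]                 # the target column only
print(X_toy.shape, y_toy.shape)
```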

## 3. Variables Testing & Adjustments

### 3.1 Build a prediction model with numerical variables

In [8]:
pilot_model_3 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF']]

In [9]:
pilot_model_3.shape

(1324, 26)

In [10]:
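#Fit a logistic regression model on the training set
#Note: LogisticRegression is a classifier, so each distinct SalePrice value is treated as a class label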
from sklearn.linear_model import LogisticRegression
log_regressor_3 = LogisticRegression (random_state = 0)
log_regressor_3.fit(pilot_model_3, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [11]:
#Compute the training score for pilot_model_3 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_3.score(pilot_model_3, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_3 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_3.predict(pilot_model_3) - y_train)**2)))

Training Score: 0.6102719033232629
Training MSE: 515970129.33912385
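When reading the score above, note that scikit-learn's `score` method means different things for classifiers and regressors: mean accuracy for `LogisticRegression`, and the coefficient of determination (R²) for regressors such as `LinearRegression`. The sketch below illustrates the difference on synthetic data; the data and variable names are assumptions for illustration and unrelated to the House Prices set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic data: two features, one continuous target, one binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y_cont = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
y_cls = (y_cont > 0).astype(int)

clf = LogisticRegression(random_state=0).fit(X, y_cls)
reg = LinearRegression().fit(X, y_cont)

# Classifier: score() is the fraction of correctly predicted labels.
clf_score = clf.score(X, y_cls)
# Regressor: score() is R², the share of target variance explained.
reg_score = reg.score(X, y_cont)

print('accuracy:', clf_score)
print('R2:', reg_score)
```

For a continuous target such as `SalePrice`, a regressor's R² is the more conventional measure; the scores reported in this notebook are accuracies.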


### 3.2 Build a prediction model with numerical and categorical variables

Important Note =>> We are going to analyze the following categorical variables:

Street, LotShape, LandContour, Utilities, LandSlope, BldgType, MasVnrType, ExterQual, ExterCond, LotConfig, Neighborhood, Condition1, Condition2, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd.


Based on the analysis performed in the main House Price Prediction notebook, we will convert only the following variables into dummy variables:

*Street, LotShape, LandContour, Utilities, LandSlope, BldgType, MasVnrType, ExterQual, ExterCond*


The remaining categorical variables will be converted into numbers at the end of the notebook.
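One common way to convert a categorical column into numbers is to map each category to an integer code. The sketch below uses pandas category codes on a toy frame. It is an assumption about how this step could look, not the method used later in the notebook, and the resulting integers impose an arbitrary (alphabetical) order on the categories.

```python
import pandas as pd

# Toy frame with two of the categorical columns named above.
toy = pd.DataFrame({
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr', 'Crawfor'],
    'RoofStyle': ['Gable', 'Hip', 'Gable', 'Gable'],
})

encoded = toy.copy()
for col in ['Neighborhood', 'RoofStyle']:
    # Categories are sorted alphabetically, then replaced by their position.
    encoded[col] = encoded[col].astype('category').cat.codes

print(encoded)
```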

#### 3.2.1 Applying dummy variables


__Street__

The *Street* variable identifies the type of road access to the property (gravel or paved).

In [12]:
#convert the Street variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['Street'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 65)

In [13]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
       'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', 'SaleType', 'SaleCondition',
       'Street_Grvl', 'Street

In [14]:
#numerical model + Street dummy variables
pilot_model_4 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'Street_Grvl', 'Street_Pave']]

In [15]:
pilot_model_4.shape

(1324, 28)

In [16]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_4 = LogisticRegression (random_state = 0)
log_regressor_4.fit(pilot_model_4, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
#Compute the training score for pilot_model_4 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_4.score(pilot_model_4, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_4 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_4.predict(pilot_model_4) - y_train)**2)))

Training Score: 0.6087613293051359
Training MSE: 546088210.070997


Comments =>> It looks like the score and MSE of our model worsened after including the 'Street_Grvl' and 'Street_Pave' variables.
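Note that a full dummy encoding of a two-level variable such as *Street* produces two columns that always sum to 1, so the second column carries no extra information. `pd.get_dummies` accepts `drop_first=True` to keep one column fewer per variable. The sketch below shows the difference on a toy frame; it illustrates the option and is not a step the notebook performs.

```python
import pandas as pd

toy = pd.DataFrame({'Street': ['Pave', 'Pave', 'Grvl', 'Pave']})

# Full encoding: one column per level.
full = pd.get_dummies(toy, columns=['Street'])
# Reduced encoding: the first level (alphabetically) is dropped.
reduced = pd.get_dummies(toy, columns=['Street'], drop_first=True)

# In the full encoding every row has exactly one active column.
row_sums = full[['Street_Grvl', 'Street_Pave']].astype(int).sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```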

__Utilities__

The *Utilities* variable identifies the type of utilities available (all public utilities, electricity, gas, water, etc).

In [18]:
#convert the Utilities variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['Utilities'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 66)

In [19]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
       'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'SaleType', 'SaleCondition', 'Street_Grvl',
       'Street_Pave', 'Util

In [20]:
#numerical model + Utilities dummy variables
pilot_model_5 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'Utilities_AllPub', 'Utilities_NoSeWa']]

In [21]:
pilot_model_5.shape

(1324, 28)

In [22]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_5 = LogisticRegression (random_state = 0)
log_regressor_5.fit(pilot_model_5, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
#Compute the training score for pilot_model_5 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_5.score(pilot_model_5, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_5 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_5.predict(pilot_model_5) - y_train)**2)))

Training Score: 0.6170694864048338
Training MSE: 519413507.13066465


Comments =>> It looks like the score of our model improved after including the 'Utilities_AllPub' and 'Utilities_NoSeWa' variables (0.6170 vs 0.6102), but the MSE increased slightly compared to the pilot_model_3 (519413507.130 vs 515970129.339).

__LandSlope__

The *LandSlope* variable identifies the slope of the property.

In [24]:
#convert the LandSlope variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['LandSlope'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 68)

In [25]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotShape', 'LandContour',
       'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'SaleType', 'SaleCondition', 'Street_Grvl',
       'Street_Pave', 'Utilities_AllPub'

In [26]:
#numerical model + LandSlope dummy variables
pilot_model_6 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl','LandSlope_Mod', 'LandSlope_Sev']]

In [27]:
pilot_model_6.shape

(1324, 29)

In [28]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_6 = LogisticRegression (random_state = 0)
log_regressor_6.fit(pilot_model_6, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [29]:
#Compute the training score for pilot_model_6 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_6.score(pilot_model_6, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_6 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_6.predict(pilot_model_6) - y_train)**2)))

Training Score: 0.622356495468278
Training MSE: 476450343.1268882


Comments =>> It looks like the **score and MSE of our model improved remarkably** after including the 'LandSlope_Gtl', 'LandSlope_Mod' and 'LandSlope_Sev' variables compared to the pilot_model_3 ("0.6223 VS 0.6102" and "476450343.126 VS 515970129.339" respectively).

__LotShape__

The *LotShape* variable identifies the general shape of the property.

In [30]:
#convert the LotShape variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['LotShape'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 71)

In [31]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LandContour', 'LotConfig',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'SaleType',
       'SaleCondition', 'Street_Grvl', 'Street_Pave', 'Utilities_AllPub',
       'Utilities

In [32]:
#numerical model + LotShape dummy variables
pilot_model_7 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg']]

In [33]:
pilot_model_7.shape

(1324, 30)

In [34]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_7 = LogisticRegression (random_state = 0)
log_regressor_7.fit(pilot_model_7, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [35]:
#Compute the training score for pilot_model_7 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_7.score(pilot_model_7, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_7 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_7.predict(pilot_model_7) - y_train)**2)))

Training Score: 0.6321752265861027
Training MSE: 464727139.60120845


Comments =>> It looks like the **score and MSE of our model improved remarkably** after including the 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg' variables compared to the pilot_model_3 ("0.6321 VS 0.6102" and "464727139.601 VS 515970129.339" respectively).

__LandContour__

The *LandContour* variable identifies the flatness of the property.

In [36]:
#convert the LandContour variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['LandContour'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 74)

In [37]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF',
       'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'SaleType',
       'SaleCondition', 'Street_Grvl', 'Street_Pave', 'Utilities_AllPub',
       'Utilities_NoSeWa', 'Land

In [38]:
#numerical model + LandContour dummy variables
pilot_model_8 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low',
                         'LandContour_Lvl']]

In [39]:
pilot_model_8.shape

(1324, 30)

In [40]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_8 = LogisticRegression (random_state = 0)
log_regressor_8.fit(pilot_model_8, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [41]:
#Compute the training score for pilot_model_8 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_8.score(pilot_model_8, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_8 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_8.predict(pilot_model_8) - y_train)**2)))

Training Score: 0.6261329305135952
Training MSE: 487159486.7356495


Comments =>> It looks like the **score and MSE of our model improved** after including the 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low', 'LandContour_Lvl' variables compared to the pilot_model_3 ("0.6261 VS 0.6102" and "487159486.735 VS 515970129.339" respectively).

__MasVnrType__

The *MasVnrType* variable identifies the masonry veneer type.

In [42]:
#convert the MasVnrType variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['MasVnrType'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 77)

In [43]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterQual', 'ExterCond',
       'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
       'BsmtFinSF1', 'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'GrLivArea', 'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
       'Fireplaces', 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars',
       'GarageArea', 'GarageQual', 'GarageCond', 'PavedDrive', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', 'SaleType', 'SaleCondition',
       'Street_Grvl', 'Street_Pave', 'Utilities_AllPub', 'Utilities_NoSeWa',
       'LandSlope_Gtl', 'L

In [44]:
#numerical model + MasVnrType dummy variables
pilot_model_9 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None',
                           'MasVnrType_Stone']]

In [45]:
pilot_model_9.shape

(1324, 30)

In [46]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_9 = LogisticRegression (random_state = 0)
log_regressor_9.fit(pilot_model_9, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [47]:
#Compute the training score for pilot_model_9 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_9.score(pilot_model_9, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_9 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_9.predict(pilot_model_9) - y_train)**2)))

Training Score: 0.6306646525679759
Training MSE: 469581231.55740184


Comments =>> It looks like the **score and MSE of our model improved remarkably** after including the 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone' variables compared to the pilot_model_3 ("0.6306 VS 0.6102" and "469581231.557 VS 515970129.339" respectively).

__ExterQual__

The ExterQual variable evaluates the quality of the material on the exterior.

In [48]:
#convert the ExterQual variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['ExterQual'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 80)

In [49]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'SaleType', 'SaleCondition', 'Street_Grvl',
       'Street_Pave', 'Utilities_AllPub', 'Utilities_NoSeWa', 'LandSlope_Gtl',
       'LandSlope_Mod'

In [50]:
#numerical model + ExterQual dummy variables
pilot_model_10 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA']]

In [51]:
pilot_model_10.shape

(1324, 30)

In [52]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_10 = LogisticRegression (random_state = 0)
log_regressor_10.fit(pilot_model_10, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [53]:
#Compute the training score for pilot_model_10 (LogisticRegression.score returns mean accuracy, not R²)
print('Training Score: {}'.format(log_regressor_10.score(pilot_model_10, y_train)))
#Compute MSE (Mean Squared Error) between the pilot_model_10 predictions and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_10.predict(pilot_model_10) - y_train)**2)))

Training Score: 0.629154078549849
Training MSE: 439340030.62009066


Comments =>> It looks like the **score and MSE of our model improved remarkably** after including the 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA' variables compared to the pilot_model_3 ("0.6291 VS 0.6102" and "439340030.620 VS 515970129.339" respectively).

__ExterCond__

The *ExterCond* variable evaluates the present condition of the material on the exterior.

In [54]:
#convert the ExterCond variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['ExterCond'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

(1324, 83)

In [55]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
       'BsmtFullBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'KitchenQual', 'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', 'SaleType', 'SaleCondition', 'Street_Grvl',
       'Street_Pave', 'Utilities_AllPub', 'Utilities_NoSeWa', 'LandSlope_Gtl',
       'LandSlope_Mod', 'LandSlope_

In [56]:
#numerical model + ExterCond dummy variables
pilot_model_11 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'ExterCond_Ex', 'ExterCond_Fa', 'ExterCond_Gd', 'ExterCond_TA']]

In [57]:
pilot_model_11.shape

(1324, 30)

In [58]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_11 = LogisticRegression (random_state = 0)
log_regressor_11.fit(pilot_model_11, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [59]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_11 and y_train
print('Training Score: {}'.format(log_regressor_11.score(pilot_model_11, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_11 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_11.predict(pilot_model_11) - y_train)**2)))

Training Score: 0.6193353474320241
Training MSE: 505790662.2877644


Comments =>> After including the 'ExterCond_Ex', 'ExterCond_Fa', 'ExterCond_Gd' and 'ExterCond_TA' variables, the Score improved only slightly compared to the pilot_model_3 ("0.6193 VS 0.6102"), and the MSE also improved ("505790662.287 VS 515970129.339").
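
For reference, the two numbers printed above can be reproduced by hand. The sketch below uses small made-up arrays (not the notebook's data) and relies on the fact that `score` on a scikit-learn classifier reports the share of exactly matching predictions, while the MSE is the mean squared difference.

```python
import numpy as np

# Made-up prices for illustration only; they are not taken from the dataset.
y_true = np.array([200000, 150000, 180000, 120000])
y_pred = np.array([200000, 150000, 175000, 120000])

# Share of predictions that match the true value exactly
accuracy = np.mean(y_pred == y_true)

# Mean squared error, computed as in the cells of this notebook
mse = np.mean((y_pred - y_true) ** 2)

print('Score: {}'.format(accuracy))
print('MSE: {}'.format(mse))
```

Three of the four toy predictions are exact, so the Score is 0.75, while a single miss of 5000 is enough to produce an MSE in the millions. This is why the two metrics can move in opposite directions.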

In [61]:
#numerical model + dummy variables
pilot_model_12 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA']]

In [62]:
pilot_model_12.shape

(1324, 45)

In [63]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_12 = LogisticRegression (random_state = 0)
log_regressor_12.fit(pilot_model_12, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [64]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_12 and y_train
print('Training Score: {}'.format(log_regressor_12.score(pilot_model_12, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_12 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_12.predict(pilot_model_12) - y_train)**2)))

Training Score: 0.676737160120846
Training MSE: 327632101.0287009


Good result!

__BldgType__

The BldgType variable identifies the type of dwelling.

In [65]:
#convert the BldgType variable into dummy variables
X_train = pd.get_dummies (X_train, columns = ['BldgType'])
#check the shape of df_object after converting the variables into dummy
X_train.shape

(1324, 87)
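
The change in shape from (1324, 83) to (1324, 87) follows from how `pd.get_dummies` works: the original column is dropped and one indicator column per category is added, so the five BldgType categories give a net gain of four columns. A minimal sketch on a made-up frame (the `LotArea` values and the name `toy` are only illustrative):

```python
import pandas as pd

# Toy frame with the five BldgType categories; values are invented.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'BldgType': ['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE'],
})

# BldgType is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=['BldgType'])

print(toy.shape)              # 2 columns before encoding
print(encoded.shape)          # 1 remaining column + 5 indicator columns
print(list(encoded.columns))
```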

In [66]:
X_train.columns

Index(['MSZoning', 'LotFrontage', 'LotArea', 'LotConfig', 'Neighborhood',
       'Condition1', 'Condition2', 'HouseStyle', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrArea', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC', 'CentralAir',
       'Electrical', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath',
       'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'SaleType',
       'SaleCondition', 'Street_Grvl', 'Street_Pave', 'Utilities_AllPub',
       'Utilities_NoSeWa', 'LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev',
       'LotSh

In [67]:
#numerical model + BldgType dummy variables
pilot_model_13 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE']]

In [68]:
pilot_model_13.shape

(1324, 31)

In [69]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_13 = LogisticRegression (random_state = 0)
log_regressor_13.fit(pilot_model_13, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [70]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_13 and y_train
print('Training Score: {}'.format(log_regressor_13.score(pilot_model_13, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_13 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_13.predict(pilot_model_13) - y_train)**2)))

Training Score: 0.6178247734138973
Training MSE: 481267784.5558912


Comments =>> It looks like the **Score and MSE of our model improved** after including the 'BldgType_1Fam', 'BldgType_2fmCon',
'BldgType_Duplex', 'BldgType_Twnhs' and 'BldgType_TwnhsE' variables compared to the pilot_model_3 ("0.6178 VS 0.6102" and "481267784.555 VS 515970129.339" respectively).

Now, we are going to build a model that combines the numerical variables with the dummy variables which improved the score of the pilot_model_3 (the initial reference model), to test whether the performance of the model improves or not.
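
Since the column lists in these cells are long and typed by hand, one option is to collect the dummy columns by prefix. The sketch below assumes the `<variable>_<category>` naming that `pd.get_dummies` produces. The helper name `columns_for` and the toy frame are illustrative and not part of the notebook.

```python
import pandas as pd

def columns_for(df, numeric_cols, dummy_prefixes):
    """Return numeric_cols plus every column named '<prefix>_<category>'."""
    prefixes = tuple(p + '_' for p in dummy_prefixes)
    dummy_cols = [c for c in df.columns if c.startswith(prefixes)]
    return list(numeric_cols) + dummy_cols

# Toy frame that mimics a few columns of X_train (values are invented).
toy = pd.DataFrame({
    'OverallQual': [7, 6],
    'GrLivArea': [1710, 1262],
    'ExterQual_Gd': [1, 0],
    'ExterQual_TA': [0, 1],
    'ExterCond_TA': [1, 1],
    'Street_Pave': [1, 1],
})

# Keep the numerical columns plus the ExterQual and Street dummies only
selected = columns_for(toy, ['OverallQual', 'GrLivArea'], ['ExterQual', 'Street'])
print(selected)
```

The resulting list can be used directly to index the frame, for example `toy[selected]`.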

In [71]:
#numerical model + dummy variables which improved the score of the pilot_model_3
pilot_model_14 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE']]

In [72]:
pilot_model_14.shape

(1324, 50)

In [73]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_14 = LogisticRegression (random_state = 0)
log_regressor_14.fit(pilot_model_14, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [74]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_14 and y_train
print('Training Score: {}'.format(log_regressor_14.score(pilot_model_14, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_14 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_14.predict(pilot_model_14) - y_train)**2)))

Training Score: 0.6865558912386707
Training MSE: 319883868.4003021


Great result!!

__What happens if we include those dummy variables with which we had not achieved a good result?__

In [75]:
#numerical model + dummy variables which improved the score of the pilot_model_3 + Street dummy variables
pilot_model_15 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'Street_Grvl', 'Street_Pave']]

In [76]:
pilot_model_15.shape

(1324, 52)

In [77]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_15 = LogisticRegression (random_state = 0)
log_regressor_15.fit(pilot_model_15, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [78]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_15 and y_train
print('Training Score: {}'.format(log_regressor_15.score(pilot_model_15, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_15 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_15.predict(pilot_model_15) - y_train)**2)))

Training Score: 0.6933534743202417
Training MSE: 341749669.73036253


The Score improved (0.6934 vs 0.6866), but the MSE worsened (341749669.73 vs 319883868.40) with respect to the pilot_model_14!!
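
Because a higher Score is better while a lower MSE is better, it is easy to misread these comparisons. A small helper can make the direction explicit. The function name `compare` is hypothetical, and the numbers are the rounded values printed for pilot_model_14 and pilot_model_15 above.

```python
def compare(reference, candidate):
    """Report whether the candidate beats the reference on each metric.

    Both arguments are (score, mse) tuples. Score improves when it rises,
    MSE improves when it falls.
    """
    ref_score, ref_mse = reference
    new_score, new_mse = candidate
    return {
        'score_improved': new_score > ref_score,
        'mse_improved': new_mse < ref_mse,
    }

# Rounded training results copied from the outputs of this notebook
pilot_14 = (0.6866, 319883868.40)
pilot_15 = (0.6934, 341749669.73)

result = compare(pilot_14, pilot_15)
print(result)
```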

In [79]:
#numerical model + dummy variables which improved the score of the pilot_model_3 + Utilities dummy variables
pilot_model_16 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'Utilities_AllPub',
                           'Utilities_NoSeWa']]

In [80]:
pilot_model_16.shape

(1324, 52)

In [81]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_16 = LogisticRegression (random_state = 0)
log_regressor_16.fit(pilot_model_16, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [82]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_16 and y_train
print('Training Score: {}'.format(log_regressor_16.score(pilot_model_16, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_16 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_16.predict(pilot_model_16) - y_train)**2)))

Training Score: 0.6910876132930514
Training MSE: 308161450.32250756


Comments =>> **Both the Score and MSE improved remarkably after including the dummy variables related to Utilities with respect to the pilot_model_14.**

In [83]:
#numerical model + dummy variables which improved the score of the pilot_model_3 + ExterCond dummy variables
pilot_model_17 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'ExterCond_Ex', 'ExterCond_Fa',
                           'ExterCond_Gd', 'ExterCond_TA']]

In [84]:
pilot_model_17.shape

(1324, 54)

In [85]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_17 = LogisticRegression (random_state = 0)
log_regressor_17.fit(pilot_model_17, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [86]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_17 and y_train
print('Training Score: {}'.format(log_regressor_17.score(pilot_model_17, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_17 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_17.predict(pilot_model_17) - y_train)**2)))

Training Score: 0.6993957703927492
Training MSE: 337150346.93429005


The Score improved (0.6994 vs 0.6866), but the MSE worsened (337150346.93 vs 319883868.40) with respect to the pilot_model_14!!

In [87]:
#numerical model + all dummy variables 
pilot_model_18 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'ExterCond_Ex', 'ExterCond_Fa',
                           'ExterCond_Gd', 'ExterCond_TA', 'Street_Grvl', 'Street_Pave', 'Utilities_AllPub',
                           'Utilities_NoSeWa']]

In [88]:
pilot_model_18.shape

(1324, 58)

In [89]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_18 = LogisticRegression (random_state = 0)
log_regressor_18.fit(pilot_model_18, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [90]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_18 and y_train
print('Training Score: {}'.format(log_regressor_18.score(pilot_model_18, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_18 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_18.predict(pilot_model_18) - y_train)**2)))

Training Score: 0.697129909365559
Training MSE: 321083335.3413897


Comments =>> The score of the model improved compared to the pilot_model_16 (0.6971 vs 0.6911), reaching a value close to 0.7, but the MSE got worse (321083335.34 vs 308161450.32).
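
To keep track of the trade-off between the two metrics, the results printed so far can be collected in one table. The sketch below copies the training Score and MSE of pilot_model_14 to pilot_model_18 from the outputs above (rounded). The variable names are illustrative.

```python
import pandas as pd

# Training results copied from the outputs above (rounded).
results = pd.DataFrame(
    {
        'score': [0.6866, 0.6934, 0.6911, 0.6994, 0.6971],
        'mse': [319883868.40, 341749669.73, 308161450.32,
                337150346.93, 321083335.34],
    },
    index=['pilot_model_14', 'pilot_model_15', 'pilot_model_16',
           'pilot_model_17', 'pilot_model_18'],
)

best_score = results['score'].idxmax()  # model with the highest Score
best_mse = results['mse'].idxmin()      # model with the lowest MSE

print(results.sort_values('mse'))
print('Highest Score:', best_score)
print('Lowest MSE:', best_mse)
```

On these numbers no single model wins on both metrics: pilot_model_17 has the highest Score and pilot_model_16 the lowest MSE.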

#### 3.2.2 Convert the remaining categorical variables into numbers

Now, we are going to convert the remaining categorical variables into numbers and check the performance of the model after adding them to the variables used so far.

In [98]:
#variables converted into dummy: Street, LotShape, LandContour ,Utilities, LandSlope, BldgType, MasVnrType, ExterQual, ExterCond

#remaining variables to convert into numbers: LotConfig, Neighborhood, Condition1, Condition2, HouseStyle,
#RoofStyle, RoofMatl, Exterior1st, Exterior2nd

#convert the rest of the categorical variables into numbers
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in X_train.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    X_train[col] = lencoders[col].fit_transform(X_train[col])
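
One point worth keeping in mind is that `LabelEncoder` numbers the categories in sorted order, so the resulting integers are identifiers rather than a ranking. The sketch below applies the same idea to a made-up frame and shows that keeping the encoders in a dictionary allows the codes to be mapped back. The frame `toy` and the explicit column list are assumptions for illustration; the notebook selects the columns by dtype instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame; only the two text columns are encoded.
toy = pd.DataFrame({
    'LotConfig': ['Inside', 'Corner', 'Inside', 'CulDSac'],
    'RoofStyle': ['Gable', 'Hip', 'Gable', 'Gable'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Listed explicitly here to keep the example independent of dtype detection
categorical_cols = ['LotConfig', 'RoofStyle']

lencoders = {}
for col in categorical_cols:
    lencoders[col] = LabelEncoder()
    toy[col] = lencoders[col].fit_transform(toy[col])

print(toy)

# Map the integer codes back to the original labels
decoded = lencoders['LotConfig'].inverse_transform(toy['LotConfig']).tolist()
print(decoded)
```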

In [99]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1324 entries, 0 to 1323
Data columns (total 87 columns):
MSZoning              1324 non-null int32
LotFrontage           1324 non-null float64
LotArea               1324 non-null int64
LotConfig             1324 non-null int32
Neighborhood          1324 non-null int32
Condition1            1324 non-null int32
Condition2            1324 non-null int32
HouseStyle            1324 non-null int32
OverallQual           1324 non-null int64
OverallCond           1324 non-null int64
YearBuilt             1324 non-null int64
YearRemodAdd          1324 non-null int64
RoofStyle             1324 non-null int32
RoofMatl              1324 non-null int32
Exterior1st           1324 non-null int32
Exterior2nd           1324 non-null int32
MasVnrArea            1324 non-null float64
Foundation            1324 non-null int32
BsmtQual              1324 non-null int32
BsmtCond              1324 non-null int32
BsmtExposure          1324 non-null int32
BsmtFin

In [101]:
X_train.head()

Unnamed: 0,MSZoning,LotFrontage,LotArea,LotConfig,Neighborhood,Condition1,Condition2,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrArea,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,GrLivArea,BsmtFullBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,SaleType,SaleCondition,Street_Grvl,Street_Pave,Utilities_AllPub,Utilities_NoSeWa,LandSlope_Gtl,LandSlope_Mod,LandSlope_Sev,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_Bnk,LandContour_HLS,LandContour_Low,LandContour_Lvl,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,ExterQual_Ex,ExterQual_Fa,ExterQual_Gd,ExterQual_TA,ExterCond_Ex,ExterCond_Fa,ExterCond_Gd,ExterCond_TA,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE
0,3,65.0,8450,4,5,2,2,5,7,5,2003,2003,1,1,11,13,196.0,2,2,3,3,2,706.0,5,150.0,856.0,0,0,1,4,856,854,1710,1.0,2,1,3,1,2,8,6,0,1,2003.0,1,2.0,548.0,4,4,2,0,61,0,8,4,0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0
1,3,80.0,9600,2,24,1,2,2,6,8,1976,1976,1,1,7,8,0.0,1,2,3,1,0,978.0,5,284.0,1262.0,0,0,1,4,1262,0,1262,0.0,2,0,3,1,3,6,6,1,1,1976.0,1,2.0,460.0,4,4,2,298,0,0,8,4,0,1,1,0,1,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0
2,3,68.0,11250,4,5,2,2,5,7,5,2001,2002,1,1,11,13,162.0,2,2,3,2,2,486.0,5,434.0,920.0,0,0,1,4,920,866,1786,1.0,2,1,3,1,2,6,6,1,1,2001.0,1,2.0,608.0,4,4,2,0,42,0,8,4,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0
3,3,60.0,9550,0,6,2,2,5,7,5,1915,1970,1,1,12,15,0.0,0,3,1,3,0,216.0,5,540.0,756.0,0,2,1,4,961,756,1717,1.0,1,0,3,1,2,7,6,1,5,1998.0,2,3.0,642.0,4,4,2,0,35,272,8,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,0,0,0
4,3,84.0,14260,2,15,2,2,5,8,5,2000,2000,1,1,11,13,350.0,2,2,3,0,2,655.0,5,490.0,1145.0,0,0,1,4,1145,1053,2198,1.0,2,1,4,1,2,9,6,1,1,2000.0,1,3.0,836.0,4,4,2,192,84,0,8,4,0,1,1,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0


In [102]:
#numerical model + all dummy variables + remaining categorical variables converted into numbers
pilot_model_19 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'ExterCond_Ex', 'ExterCond_Fa',
                           'ExterCond_Gd', 'ExterCond_TA', 'Street_Grvl', 'Street_Pave', 'Utilities_AllPub',
                           'Utilities_NoSeWa',  'LotConfig', 'Neighborhood', 'Condition1', 'Condition2', 'HouseStyle', 
                            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd']]

pilot_model_19.shape

(1324, 67)

In [103]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_19 = LogisticRegression (random_state = 0)
log_regressor_19.fit(pilot_model_19, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [104]:
#Compute Score (mean accuracy returned by LogisticRegression.score, not R2) for the pilot_model_19 and y_train
print('Training Score: {}'.format(log_regressor_19.score(pilot_model_19, y_train)))
#Compute MSE (Mean Squared Error) for the pilot_model_19 and y_training
print('Training MSE: {}'.format(np.mean((log_regressor_19.predict(pilot_model_19) - y_train)**2)))

Training Score: 0.790785498489426
Training MSE: 271007633.95694864


Comments =>> **We have achieved the best training result so far (Score of 0.7908 and MSE of 271007633.96) by applying the combination of dummy variables and the rest of the categorical variables converted into numbers.**