# Categorical Data | Analysis & Dummy Variables Testing

1. Introduction


2. Data Preprocessing
    - 2.1 Importing the required packages
    - 2.2 Loading the dataset
    - 2.3 Preparing the data


3. Variables Testing & Adjustments
    - 3.1 Build a prediction model with numerical variables
    - 3.2 Build a prediction model with numerical and categorical variables 
        - 3.2.1 Applying dummy variables
            - Group A 
            - Group B
            - Applying all dummy variables
        - 3.2.2 Convert the remaining categorical variables into numbers
         
         
4. Back-up

## 1. Introduction

The goal of this notebook is to try to improve the performance of the initial model created in the *House Prices Prediction* project, which is composed only of numerical variables.

In the main notebook of the *House Prices Prediction* project we analyzed the distribution and contribution of the categorical variables individually. Now, let's go a step further and analyze how the categorical variables can best improve model performance.

To do so, we are going to convert into dummy variables those categorical variables that have fewer than 5 options and a high impact on the dependent variable. Then we will convert the remaining categorical variables into numbers and build the final model for our House Price Prediction analysis.
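
As a rough illustration of the selection rule described above, the sketch below counts the distinct levels of each non-numeric column and splits the columns at a threshold. The toy values and the `MAX_LEVELS` name are assumptions made for this example; the "high impact on the dependent variable" criterion is not modeled here.

```python
import pandas as pd

# Toy data: column names mimic the House Prices dataset, values are made up.
df = pd.DataFrame({
    "LandSlope": ["Gtl", "Mod", "Gtl", "Sev", "Gtl", "Gtl"],
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "Edwards", "Somerst", "Gilbert"],
    "CentralAir": ["Y", "Y", "N", "Y", "Y", "N"],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
})

MAX_LEVELS = 5  # hypothetical threshold for "fewer than 5 options"

# Number of distinct levels in every non-numeric column.
levels = df.select_dtypes(exclude="number").nunique()

# Low-cardinality columns are candidates for dummy variables;
# the rest would be converted into numbers later.
dummy_candidates = levels[levels < MAX_LEVELS].index.tolist()
remaining = levels[levels >= MAX_LEVELS].index.tolist()

print(dummy_candidates)
print(remaining)
```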

## 2. Data Preprocessing

### 2.1 Importing the required packages

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy import stats
from scipy.stats import skew, boxcox_normmax, norm
from scipy.special import boxcox1p

import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator

import warnings
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

### 2.2 Loading the dataset

In [3]:
#loading the training set
df_train_clean = pd.read_csv('df_train_clean.csv')
df_train_clean.head()

FileNotFoundError: [Errno 2] File b'df_train_clean.csv' does not exist: b'df_train_clean.csv'

In [None]:
df_train_clean.shape

### 2.3 Preparing the data

In [None]:
#remove variables with low correlation
df_train_clean.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath'], axis = 1, inplace = True)

In [None]:
#check the shape of the dataframe after removing the variables with low correlation
df_train_clean.shape

In [None]:
#Get the dependent and independent variables
X_train = df_train_clean.iloc[:, :-1] #all rows, all columns except the last one
y_train = df_train_clean.iloc[:, -1] #all rows, only the last column

In [None]:
#check the shape of X_train and y_train
X_train.shape, y_train.shape
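
The positional split above relies on the target being the last column. A minimal sketch of the same split done by column name is shown below; the name `SalePrice` and the toy values are assumptions for illustration, since the notebook does not name the target column.

```python
import pandas as pd

# Toy frame; "SalePrice" as the target name is an assumption for this sketch.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "OverallQual": [7, 6, 7],
    "SalePrice": [208500, 181500, 223500],
})

# Positional split: everything but the last column, then the last column.
X = toy.iloc[:, :-1]
y = toy.iloc[:, -1]

# Equivalent split by name, which keeps working if the column order changes.
X_by_name = toy.drop(columns=["SalePrice"])
y_by_name = toy["SalePrice"]

print(X.shape, y.shape)
```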

## 3. Variables Testing & Adjustments

### 3.1 Build a prediction model with numerical variables

In [None]:
pilot_model_3 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF']]

In [None]:
pilot_model_3.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_3 = LogisticRegression(random_state = 0)
log_regressor_3.fit(pilot_model_3, y_train)

In [None]:
#Compute the training score for pilot_model_3 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_3.score(pilot_model_3, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_3 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_3.predict(pilot_model_3) - y_train)**2)))
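
In scikit-learn, the `score` method of a classifier such as `LogisticRegression` returns mean accuracy, while the `score` method of a regressor such as `LinearRegression` returns R2. The sketch below checks both facts on synthetic data; the data and variable names are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))

# Synthetic continuous target and a binary label derived from it.
y_cont = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)
y_class = (y_cont > 0).astype(int)

clf = LogisticRegression(random_state=0).fit(X, y_class)
reg = LinearRegression().fit(X, y_cont)

# Classifier: .score is the fraction of correctly predicted labels.
clf_score = clf.score(X, y_class)
clf_accuracy = accuracy_score(y_class, clf.predict(X))

# Regressor: .score is the coefficient of determination (R2).
reg_score = reg.score(X, y_cont)
reg_r2 = r2_score(y_cont, reg.predict(X))

print(clf_score, clf_accuracy)
print(reg_score, reg_r2)
```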

### 3.2 Build a prediction model with numerical and categorical variables

Important Note =>> We are going to analyze the following categorical variables:

Street, LotShape, LandContour, Utilities, LandSlope, BldgType, MasVnrType, ExterQual, ExterCond, LotConfig, Neighborhood, Condition1, Condition2, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd.


Regarding the dummy variables conversion, we are only going to treat the following variables keeping in mind the analysis performed in the main House Price Prediction Notebook:

*Street, LotShape, LandContour, Utilities, LandSlope, BldgType, MasVnrType, ExterQual, ExterCond*


The remaining categorical variables will be converted into numbers at the end of the notebook.
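
One possible way to turn a categorical column into numbers is to use pandas category codes, sketched below on made-up values. This is an assumption about the technique, not necessarily the method used later in the notebook, and the resulting integers impose an arbitrary (alphabetical) order on the categories.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
toy = pd.DataFrame({"Neighborhood": ["NAmes", "CollgCr", "NAmes", "OldTown"]})

# Categories are sorted alphabetically and numbered from 0.
toy["Neighborhood_code"] = toy["Neighborhood"].astype("category").cat.codes

print(toy)
```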

#### 3.2.1 Applying dummy variables


#### Group A

In this section we are going to treat the following variables keeping in mind the analysis performed in the main House Price Prediction Notebook:

*LotShape, LandContour, LandSlope, BldgType, MasVnrType, ExterQual*.

__1 - LandSlope__

The *LandSlope* variable identifies the slope of the property.

In [None]:
#convert the LandSlope variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['LandSlope'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape
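
To make the effect of `pd.get_dummies` concrete, here is a small sketch on made-up data. It shows that the original column is replaced by one indicator column per level, named `<column>_<level>`, and that exactly one indicator is set in each row.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "LandSlope": ["Gtl", "Mod", "Sev", "Gtl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

encoded = pd.get_dummies(toy, columns=["LandSlope"])

# The original column is dropped and the indicator columns are appended.
print(encoded.columns.tolist())

# Each row has exactly one active indicator.
row_sums = encoded.filter(like="LandSlope_").sum(axis=1)
print(row_sums.tolist())
```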

In [None]:
X_train.columns

In [None]:
#numerical model + LandSlope dummy variables
pilot_model_4 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl','LandSlope_Mod', 'LandSlope_Sev']]

pilot_model_4.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_4 = LogisticRegression(random_state = 0)
log_regressor_4.fit(pilot_model_4, y_train)

In [None]:
#Compute the training score for pilot_model_4 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_4.score(pilot_model_4, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_4 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_4.predict(pilot_model_4) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'LandSlope_Gtl', 'LandSlope_Mod' and 'LandSlope_Sev' variables compared to the pilot_model_3 ("0.6223 VS 0.6102" and "476450343.126 VS 515970129.339" respectively).
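
Since the same fit, score and MSE steps are repeated for every feature set, they could be wrapped in a small helper. The function below is a hypothetical sketch: `evaluate_features` is not part of the notebook, the data is synthetic, and `max_iter=1000` is an added assumption to help the solver converge on the toy data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def evaluate_features(X, y, columns):
    """Hypothetical helper: fit on the given columns and return
    (training score, training MSE), mirroring the cells above."""
    subset = X[columns]
    model = LogisticRegression(random_state=0, max_iter=1000)
    model.fit(subset, y)
    score = model.score(subset, y)  # mean accuracy for a classifier
    mse = np.mean((model.predict(subset) - y) ** 2)
    return score, mse

# Synthetic stand-in for X_train / y_train (values are made up).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(40, 3)), columns=["f1", "f2", "f3"])
y_toy = (X_toy["f1"] + X_toy["f2"] > 0).astype(int)

# Compare a baseline feature set with an extended one.
results = {
    "baseline": evaluate_features(X_toy, y_toy, ["f1"]),
    "extended": evaluate_features(X_toy, y_toy, ["f1", "f2", "f3"]),
}
for name, (score, mse) in results.items():
    print(name, score, mse)
```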

__2 - LotShape__

The *LotShape* variable identifies the general shape of the property.

In [None]:
#convert the LotShape variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['LotShape'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
X_train.columns

In [None]:
#numerical model + LotShape dummy variables
pilot_model_5 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg']]

pilot_model_5.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_5 = LogisticRegression(random_state = 0)
log_regressor_5.fit(pilot_model_5, y_train)

In [None]:
#Compute the training score for pilot_model_5 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_5.score(pilot_model_5, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_5 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_5.predict(pilot_model_5) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3' and 'LotShape_Reg' variables compared to the pilot_model_3 ("0.6321 VS 0.6102" and "464727139.601 VS 515970129.339" respectively).

__3 - LandContour__

The LandContour variable identifies the flatness of the property.

In [None]:
#convert the LandContour variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['LandContour'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
X_train.columns

In [None]:
#numerical model + LandContour dummy variables
pilot_model_6 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low',
                         'LandContour_Lvl']]

pilot_model_6.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_6 = LogisticRegression(random_state = 0)
log_regressor_6.fit(pilot_model_6, y_train)

In [None]:
#Compute the training score for pilot_model_6 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_6.score(pilot_model_6, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_6 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_6.predict(pilot_model_6) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'LandContour_Bnk', 'LandContour_HLS', 'LandContour_Low' and 'LandContour_Lvl' variables compared to the pilot_model_3 ("0.6261 VS 0.6102" and "487159486.735 VS 515970129.339" respectively).

__4 - MasVnrType__

The MasVnrType variable identifies the masonry veneer type.

In [None]:
#convert the MasVnrType variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['MasVnrType'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
X_train.columns

In [None]:
#numerical model + MasVnrType dummy variables
pilot_model_7 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None',
                           'MasVnrType_Stone']]

pilot_model_7.shape 

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_7 = LogisticRegression(random_state = 0)
log_regressor_7.fit(pilot_model_7, y_train)

In [None]:
#Compute the training score for pilot_model_7 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_7.score(pilot_model_7, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_7 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_7.predict(pilot_model_7) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None' and 'MasVnrType_Stone' variables compared to the pilot_model_3 ("0.6306 VS 0.6102" and "469581231.557 VS 515970129.339" respectively).

__5 - ExterQual__

The ExterQual variable evaluates the quality of the material on the exterior.

In [None]:
#convert the ExterQual variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['ExterQual'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
X_train.columns

In [None]:
#numerical model + ExterQual dummy variables
pilot_model_8 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA']]

pilot_model_8.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_8 = LogisticRegression(random_state = 0)
log_regressor_8.fit(pilot_model_8, y_train)

In [None]:
#Compute the training score for pilot_model_8 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_8.score(pilot_model_8, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_8 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_8.predict(pilot_model_8) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd' and 'ExterQual_TA' variables compared to the pilot_model_3 ("0.6291 VS 0.6102" and "439340030.620 VS 515970129.339" respectively).

__6 - BldgType__

The BldgType variable identifies the type of dwelling.

In [None]:
#convert the BldgType variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['BldgType'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
X_train.columns

In [None]:
#numerical model + BldgType dummy variables
pilot_model_9 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE']]

pilot_model_9.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_9 = LogisticRegression(random_state = 0)
log_regressor_9.fit(pilot_model_9, y_train)

In [None]:
#Compute the training score for pilot_model_9 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_9.score(pilot_model_9, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_9 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_9.predict(pilot_model_9) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'BldgType_1Fam', 'BldgType_2fmCon',
'BldgType_Duplex', 'BldgType_Twnhs' and 'BldgType_TwnhsE' variables compared to the pilot_model_3 ("0.6178 VS 0.6102" and "481267784.555 VS 515970129.339" respectively).

Now, we are going to build a model applying the numerical variables and the dummy variables (Group A) which improved the score of the pilot_model_3 (the initial reference model) to test whether the performance of the model improves.

In [None]:
#numerical model + dummy variables (Group A), which improved the score of the pilot_model_3
pilot_model_10 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl',
                           'LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn',
                           'MasVnrType_BrkFace', 'MasVnrType_None', 'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 
                           'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 'BldgType_2fmCon',
                           'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE']]

pilot_model_10.shape
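
The hand-typed column list above could also be assembled from the names of the source variables, since `pd.get_dummies` names every indicator `<column>_<level>`. The sketch below illustrates the idea on made-up data; `numeric_cols` and `group_a` are hypothetical names introduced for this example.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 1786],
    "LandSlope": ["Gtl", "Mod", "Gtl"],
    "LotShape": ["Reg", "IR1", "Reg"],
    "Street": ["Pave", "Pave", "Grvl"],
})
toy = pd.get_dummies(toy, columns=["LandSlope", "LotShape", "Street"])

numeric_cols = ["GrLivArea"]         # hypothetical numerical feature list
group_a = ["LandSlope", "LotShape"]  # source variables whose dummies we want

# Collect every indicator column that starts with one of the prefixes.
prefixes = tuple(f"{name}_" for name in group_a)
dummy_cols = [c for c in toy.columns if c.startswith(prefixes)]

selected = toy[numeric_cols + dummy_cols]
print(selected.columns.tolist())
```

Matching on the trailing underscore keeps a variable such as `LotShape` from accidentally picking up columns of another variable that merely shares the same leading characters.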

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_10 = LogisticRegression(random_state = 0)
log_regressor_10.fit(pilot_model_10, y_train)

In [None]:
#Compute the training score for pilot_model_10 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_10.score(pilot_model_10, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_10 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_10.predict(pilot_model_10) - y_train)**2)))

Great result!!

#### Group B

In this section we are going to treat the following variables keeping in mind the analysis performed in the main House Price Prediction Notebook:

*BsmtQual, BsmtCond, BsmtExposure, CentralAir, KitchenQual, GarageFinish, and PavedDrive*


__1 - BsmtQual__

The BsmtQual variable evaluates the height of the basement.

In [None]:
#convert the BsmtQual variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['BsmtQual'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + BsmtQual dummy variables
pilot_model_11 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BsmtQual_Ex',
                           'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA']]
pilot_model_11.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_11 = LogisticRegression(random_state = 0)
log_regressor_11.fit(pilot_model_11, y_train)

In [None]:
#Compute the training score for pilot_model_11 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_11.score(pilot_model_11, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_11 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_11.predict(pilot_model_11) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'BsmtQual_Ex','BsmtQual_Fa', 'BsmtQual_Gd' and 'BsmtQual_TA' variables compared to the pilot_model_3 ("0.6276 VS 0.6102" and "438421112.044 VS 515970129.339" respectively).

__2 - BsmtCond__

The BsmtCond variable evaluates the general condition of the basement.

In [None]:
#convert the BsmtCond variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['BsmtCond'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + BsmtCond dummy variables
pilot_model_12 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 'BsmtCond_TA']]

pilot_model_12.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_12 = LogisticRegression(random_state = 0)
log_regressor_12.fit(pilot_model_12, y_train)

In [None]:
#Compute the training score for pilot_model_12 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_12.score(pilot_model_12, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_12 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_12.predict(pilot_model_12) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po' and 'BsmtCond_TA' variables compared to the pilot_model_3 ("0.6283 VS 0.6102" and "472572443.84 VS 515970129.339" respectively).

__3 - BsmtExposure__

The BsmtExposure variable refers to walkout or garden level walls.

In [None]:
#convert the BsmtExposure variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['BsmtExposure'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + BsmtExposure dummy variables
pilot_model_13 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn',
                           'BsmtExposure_No']]

pilot_model_13.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_13 = LogisticRegression(random_state = 0)
log_regressor_13.fit(pilot_model_13, y_train)

In [None]:
#Compute the training score for pilot_model_13 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_13.score(pilot_model_13, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_13 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_13.predict(pilot_model_13) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after including the 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn' and 'BsmtExposure_No' variables compared to the pilot_model_3 ("0.6389 VS 0.6102" and "467935964.311 VS 515970129.339" respectively).

__4 - CentralAir__

The CentralAir variable refers to the central air conditioning.

In [None]:
#convert the CentralAir variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['CentralAir'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape


In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + CentralAir dummy variables
pilot_model_14 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'CentralAir_N','CentralAir_Y']]

pilot_model_14.shape
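
For a two-level variable such as CentralAir, the two indicator columns are exact complements of each other, so one of them carries all the information. The sketch below, on made-up values, shows this and how `drop_first=True` keeps a single column; whether to drop one level is a modeling choice and not something the notebook prescribes.

```python
import pandas as pd

# Toy column; the values are made up for illustration.
toy = pd.DataFrame({"CentralAir": ["Y", "N", "Y", "Y"]})

# Full encoding: one indicator per level.
both = pd.get_dummies(toy, columns=["CentralAir"])
complement_check = both["CentralAir_N"].astype(int) + both["CentralAir_Y"].astype(int)

# Reduced encoding: the first level ("N") is dropped.
reduced = pd.get_dummies(toy, columns=["CentralAir"], drop_first=True)

print(both.columns.tolist())
print(reduced.columns.tolist())
```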

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_14 = LogisticRegression(random_state = 0)
log_regressor_14.fit(pilot_model_14, y_train)

In [None]:
#Compute the training score for pilot_model_14 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_14.score(pilot_model_14, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_14 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_14.predict(pilot_model_14) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after applying the 'CentralAir_N' and 'CentralAir_Y' variables compared to the pilot_model_3 ("0.6163 VS 0.6102" and "493214700.237 VS 515970129.339" respectively).

__5 - KitchenQual__

The KitchenQual variable evaluates the kitchen quality.

In [None]:
#convert the KitchenQual variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['KitchenQual'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + KitchenQual dummy variables
pilot_model_15 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd',
                           'KitchenQual_TA']]

pilot_model_15.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_15 = LogisticRegression(random_state = 0)
log_regressor_15.fit(pilot_model_15, y_train)

In [None]:
#Compute the training score for pilot_model_15 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_15.score(pilot_model_15, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_15 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_15.predict(pilot_model_15) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after applying the 'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd' and 'KitchenQual_TA' variables compared to the pilot_model_3 ("0.6351 VS 0.6102" and "403595969.785 VS 515970129.339" respectively).

__6 - GarageFinish__

The GarageFinish variable refers to the interior finish of the garage.

In [None]:
#convert the GarageFinish variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['GarageFinish'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + GarageFinish dummy variables
pilot_model_16 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf']]

pilot_model_16.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_16 = LogisticRegression(random_state = 0)
log_regressor_16.fit(pilot_model_16, y_train)

In [None]:
#Compute the training score for pilot_model_16 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_16.score(pilot_model_16, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_16 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_16.predict(pilot_model_16) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after applying the 'GarageFinish_Fin', 'GarageFinish_RFn' and 'GarageFinish_Unf' variables compared to the pilot_model_3 ("0.6397 VS 0.6102" and "498530501.830 VS 515970129.339" respectively).

__7 - PavedDrive__

The PavedDrive variable refers to the paved driveway.

In [None]:
#convert the PavedDrive variable into dummy variables
X_train = pd.get_dummies(X_train, columns = ['PavedDrive'])
#check the shape of X_train after converting the variable into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

In [None]:
#numerical model + PavedDrive dummy variables
pilot_model_17 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y']]

pilot_model_17.shape

In [None]:
#Fit a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
log_regressor_17 = LogisticRegression(random_state = 0)
log_regressor_17.fit(pilot_model_17, y_train)

In [None]:
#Compute the training score for pilot_model_17 and y_train
#(LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_17.score(pilot_model_17, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_17 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_17.predict(pilot_model_17) - y_train)**2)))

Comments =>> It looks like the **Score and MSE of our model improved** after applying the 'PavedDrive_N', 'PavedDrive_P' and 'PavedDrive_Y' variables compared to the pilot_model_3 ("0.6216 VS 0.6102" and "479070896.809 VS 515970129.339" respectively).

Now, we are going to build a model applying the numerical variables and the dummy variables (Group B) which improved the score of the pilot_model_3 (the initial reference model) to test whether the performance of the model improves.

In [None]:
#numerical model + dummy variables (Group B), which improved the score of the pilot_model_3
pilot_model_18 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'BsmtQual_Ex', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 
                          'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Av', 'BsmtExposure_Gd', 
                          'BsmtExposure_Mn', 'BsmtExposure_No', 'CentralAir_N', 'CentralAir_Y', 'KitchenQual_Ex', 
                          'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'GarageFinish_Fin', 'GarageFinish_RFn', 
                          'GarageFinish_Unf','PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y']]

pilot_model_18.shape

In [None]:
#Fit a logistic regression model to the training set
from sklearn.linear_model import LogisticRegression
log_regressor_18 = LogisticRegression(random_state=0)
log_regressor_18.fit(pilot_model_18, y_train)

In [None]:
#Compute the training score for pilot_model_18 (note: LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_18.score(pilot_model_18, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_18 on the training set
print('Training MSE: {}'.format(np.mean((log_regressor_18.predict(pilot_model_18) - y_train)**2)))

Great results!!

#### Applying all dummy variables

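Finally, we are going to build a model applying the numerical variables together with the dummy variables from both Group A and Group B that improved the score of pilot_model_3, to test whether the combination improves the performance of the model.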

In [None]:
#numerical model + dummy variables (Group A & Group B) which improved the score of the pilot_model_3
pilot_model_19 = X_train[['OverallQual', 'GrLivArea', '1stFlrSF', 'FullBath', 'YearBuilt', 'YearRemodAdd',
                           'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'TotalBsmtSF', 'GarageYrBlt', 'MasVnrArea' ,
                           'BsmtFinSF1', 'LotFrontage', 'BsmtFullBath', 'Fireplaces', 'OpenPorchSF', 'WoodDeckSF', 
                           '2ndFlrSF', 'HalfBath','LotArea', 'BedroomAbvGr', 'OverallCond', 'KitchenAbvGr',
                           'EnclosedPorch', 'BsmtUnfSF', 'LandSlope_Gtl','LandSlope_Mod', 'LandSlope_Sev', 'LotShape_IR1', 
                           'LotShape_IR2', 'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk', 'LandContour_HLS',
                           'LandContour_Low', 'LandContour_Lvl', 'MasVnrType_BrkCmn', 'MasVnrType_BrkFace', 'MasVnrType_None', 
                           'MasVnrType_Stone', 'ExterQual_Ex', 'ExterQual_Fa', 'ExterQual_Gd', 'ExterQual_TA', 'BldgType_1Fam', 
                           'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE', 'BsmtQual_Ex', 
                           'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 
                           'BsmtCond_TA', 'BsmtExposure_Av', 'BsmtExposure_Gd', 'BsmtExposure_Mn', 'BsmtExposure_No', 
                           'CentralAir_N', 'CentralAir_Y', 'KitchenQual_Ex', 'KitchenQual_Fa', 'KitchenQual_Gd', 
                           'KitchenQual_TA', 'GarageFinish_Fin', 'GarageFinish_RFn', 'GarageFinish_Unf','PavedDrive_N', 
                           'PavedDrive_P', 'PavedDrive_Y']]

pilot_model_19.shape

In [None]:
#Fit a logistic regression model to the training set
from sklearn.linear_model import LogisticRegression
log_regressor_19 = LogisticRegression(random_state=0)
log_regressor_19.fit(pilot_model_19, y_train)

In [None]:
#Compute the training score for pilot_model_19 (note: LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_19.score(pilot_model_19, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_19 on the training set
print('Training MSE: {}'.format(np.mean((log_regressor_19.predict(pilot_model_19) - y_train)**2)))

Comments =>> **We have achieved the highest result by applying the combination of those dummy variables that have a positive impact on model performance.**

#### 3.2.2 Convert the remaining categorical variables into numbers

Now, we are going to convert the remaining categorical variables into numbers and check the performance of the model.

In [None]:
#convert the rest of the categorical variables into numbers
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in X_train.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    X_train[col] = lencoders[col].fit_transform(X_train[col])
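
One thing to keep in mind about the cell above: `LabelEncoder` assigns integer codes following the sorted order of the labels, so the resulting numbers carry no quality ranking. The sketch below shows this on made-up quality labels and contrasts it with an explicit ordinal mapping; the `quality_order` ranking and the sample values are assumptions for illustration, not taken from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of quality labels (hypothetical data)
quality = pd.Series(['TA', 'Gd', 'Ex', 'Fa', 'TA'])

# LabelEncoder: each label's code is its position in the sorted classes_ array
encoder = LabelEncoder()
label_codes = encoder.fit_transform(quality)
label_mapping = {label: code for code, label in enumerate(encoder.classes_)}

# Alternative: an explicit mapping from worst to best (assumed ranking)
quality_order = {'Fa': 0, 'TA': 1, 'Gd': 2, 'Ex': 3}
ordinal_codes = quality.map(quality_order)

print(label_mapping)
print(label_codes.tolist())
print(ordinal_codes.tolist())
```

With the sorted-order codes, 'Ex' receives the lowest number and 'TA' the highest, which a model that treats the column as numeric would read as a ranking.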

In [None]:
#check the datatype of X_train to review that all the variables are numbers
X_train.info()

In [None]:
#Review the final data
X_train.head()

In [None]:
#numerical model + all dummy variables + remaining categorical variables (label-encoded)
pilot_model_20 = X_train

pilot_model_20.shape

In [None]:
#Fit a logistic regression model to the training set
from sklearn.linear_model import LogisticRegression
log_regressor_20 = LogisticRegression(random_state=0)
log_regressor_20.fit(pilot_model_20, y_train)

In [None]:
#Compute the training score for pilot_model_20 (note: LogisticRegression.score returns mean accuracy, not R2)
print('Training Score: {}'.format(log_regressor_20.score(pilot_model_20, y_train)))
#Compute the MSE (Mean Squared Error) for pilot_model_20 on the training set
print('Training MSE: {}'.format(np.mean((log_regressor_20.predict(pilot_model_20) - y_train)**2)))

**THE MSE HAS WORSENED**

**WHICH VARIABLES CAUSED IT?**
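
One possible way to find the variables responsible for a change in the error is to add the newly encoded columns to a reference feature set one at a time and record how the training MSE changes. The sketch below is only an illustration under stated assumptions: `mse_change_per_column` and `training_mse` are hypothetical helpers, `LeastSquares` is a small stand-in model used so the example runs on its own, and the data is synthetic. In the notebook, the factory would return the estimator used above and `base_cols` would hold the columns of the reference model.

```python
import numpy as np
import pandas as pd

class LeastSquares:
    """Tiny stand-in model with fit/predict (not the notebook's estimator)."""
    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        self.coef_, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([np.ones(len(X)), np.asarray(X, dtype=float)])
        return A @ self.coef_

def training_mse(make_model, X, y):
    """Fit a fresh model and return its MSE on the same data."""
    model = make_model().fit(X, y)
    return float(np.mean((model.predict(X) - y) ** 2))

def mse_change_per_column(make_model, X, y, base_cols, candidate_cols):
    """MSE(base + column) - MSE(base) for each candidate column."""
    base = training_mse(make_model, X[base_cols], y)
    changes = {col: training_mse(make_model, X[base_cols + [col]], y) - base
               for col in candidate_cols}
    # Largest values first: columns that hurt the most (or help the least)
    return pd.Series(changes).sort_values(ascending=False)

# Synthetic data (hypothetical): one informative code and one pure-noise code
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'area': rng.normal(size=200),
    'useful_code': rng.integers(0, 4, size=200),
    'noise_code': rng.integers(0, 4, size=200),
})
y_demo = 3 * X_demo['area'] + 2 * X_demo['useful_code'] + rng.normal(scale=0.1, size=200)

changes = mse_change_per_column(LeastSquares, X_demo, y_demo,
                                ['area'], ['useful_code', 'noise_code'])
print(changes)
```

Positive values would flag columns whose addition raised the training MSE. The least-squares stand-in can never show a positive value on its own training data, so this demo only illustrates the mechanics; with a different estimator the sign can go either way.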

In [None]:
X_train.to_csv('X_train_modificado.csv', index=False)

## 4 Back-up