# House Prices Prediction

## (Version for the Kaggle competition)

## Notebook Content

1. Introduction


2. Previous Analysis
    - 2.1 Importing the required packages
    - 2.2 Loading the dataset
    - 2.3 'SalePrice' distribution analysis


3. Data Cleaning
    - 3.1 Merging the training and test dataframes
    - 3.2 Missing Data Analysis
    - 3.3 Splitting the data into training and test dataframes
    - 3.4 Saving the changes
    
    
4. Correlation Analysis 
     - 4.1 Introduction
     - 4.2 First Results & Data Visualization
     - 4.3 Top 10 variables with the highest correlation analysis
     - 4.4 Deleting variables with low correlation


5. Categorical Data Analysis 
    - 5.1 Overview and first visualizations
    - 5.2 Detailed analysis and adjustments


6. Building a "baseline" model applying logistics regression
   - 6.1 Preparing the data
       - 6.1.1 Getting the Dependent and Independent variables
       - 6.1.2 Creating new dataframes based on the data type

   - 6.2 Build a model with only numerical variables
        - 6.2.1 Pilot Model 1 (numerical variables with correlation > [+0.5 & -0.5] )
        - 6.2.2 Pilot Model 2 (all numerical variables)

   - 6.3. Build a model with numerical and categorical variables
        - 6.3.1 Convert some categorical variables into dummy variables
        - 6.3.2 Convert the remaining categorical variables into numbers
        - 6.3.3 Saving the changes


7. Building a predictive model applying Random Forest Regressor
    - 7.1 Defining the Random Forest Regressor baseline
        * 7.1.1 Fitting the Random Forest Regressor
        * 7.1.2 Predicting the results
    - 7.2 Applying K-Fold Cross-Validation technique
    - 7.3 Applying Grid-Search technique
        * 7.3.1 Creating a parameter grid
        * 7.3.2 Random Search Training
        * 7.3.3 Evaluate the Random Search
        * 7.3.4 Initiate the grid search model
        * 7.3.5 Fitting the grid search to the data
        * 7.3.6 Fitting the final Random Forest Regressor Model
        * 7.3.7 Predicting the results
        * 7.3.8 Computing metrics for the random forest model
        * 7.3.9 Create a submission in the Kaggle competition

## 1 Introduction 

The goal of this project is to **find the best fitting model for predicting house prices in the city of Ames**, using advanced regression techniques such as random forest or gradient boosting.

To do this, we will use a dataset composed of 2930 observations and 80 variables (23 nominal, 23 ordinal, 14 discrete and 20 continuous), which describe the sale of individual residential properties in Ames from 2006 to 2010. The data has been provided by the Ames City Assessor's Office.

## 2 Previous Analysis 

### 2.1 Importing the required packages 

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from scipy import stats
from scipy.stats import skew, boxcox_normmax, norm
from scipy.special import boxcox1p

import matplotlib.gridspec as gridspec
from matplotlib.ticker import MaxNLocator

import warnings
pd.options.display.max_columns = 250
pd.options.display.max_rows = 250
warnings.filterwarnings('ignore')
plt.style.use('fivethirtyeight')

from IPython.display import Image

### 2.2 Loading the dataset 

#### 2.2.1 Checking the training set 

In [3]:
#loading the training set
df_train_prelim = pd.read_csv('train.csv')
df_train_prelim.head(15)

In [None]:
#check the shape of the data
df_train_prelim.shape

In [None]:
#check the type of data
df_train_prelim.info()

__Important!__
After checking the information of the dataset, we realised that the variable *SalePrice* is the dependent variable (the value that we want to predict with our model).

#### 2.2.2 Checking the test set 

In [None]:
#loading the test set
df_test_prelim = pd.read_csv('test.csv')
df_test_prelim.head(15)

In [None]:
#check the shape of the data
df_test_prelim.shape

In [None]:
#check the type of data
df_test_prelim.info()

### 2.3 'SalePrice' distribution analysis

First, we need to check the SalePrice column to clearly understand the distribution of prices.

We are going to analyze if the dependent variable (SalePrice) follows a normal distribution.

In [None]:
#check the  main statistics of the dependent variable 
df_train_prelim['SalePrice'].describe() 

According to these statistics, the mean price is around 180k USD. The most expensive house sold for 755k USD and the cheapest for only 34.9k USD. In addition, the median (50% quantile) lies at 163k USD.

Let's build a histogram to review the distribution of the house prices.

In [None]:
#SalePrice Histogram 1
#without applying kernel density function
sns.set_style('darkgrid')
fig,ax = plt.subplots(1,1,figsize=(6,6))
sns.distplot(df_train_prelim['SalePrice'], ax=ax, kde=False)
ax.set_xlabel('House price, USD')
plt.suptitle('SalePrice Histogram', size=15)
plt.show()

In [None]:
#SalePrice Histogram 2
#applying kernel density function
sns.distplot(df_train_prelim['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train_prelim['SalePrice'], plot=plt)

We will also check the skewness and kurtosis of the SalePrice variable in order to verify whether it follows a normal distribution.

In [None]:
Image("Skewness_Kurtosis.PNG")

In [None]:
#SalePrice Skewness & Kurtosis
print("Skewness: %f" % df_train_prelim['SalePrice'].skew())
print("Kurtosis: %f" % df_train_prelim['SalePrice'].kurt())

__Comments:__ After analyzing the metrics, we realized that the SalePrice does not follow a normal distribution (Gaussian distribution).
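
As a point of reference for reading these metrics, the following minimal sketch compares the skewness and excess kurtosis of two synthetic samples (not the competition data): a normal sample, for which both values should be close to 0, and a log-normal sample, which is right-skewed like many price distributions. The sample size, seed and distribution parameters are arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the Ames data)
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))
skewed_sample = pd.Series(rng.lognormal(mean=0.0, sigma=0.5, size=100_000))

# pandas reports *excess* kurtosis, so a normal distribution gives values near 0
normal_skew, normal_kurt = normal_sample.skew(), normal_sample.kurt()
skewed_skew, skewed_kurt = skewed_sample.skew(), skewed_sample.kurt()

print("Normal sample     -> skewness: %.3f, kurtosis: %.3f" % (normal_skew, normal_kurt))
print("Log-normal sample -> skewness: %.3f, kurtosis: %.3f" % (skewed_skew, skewed_kurt))
```

A clearly positive skewness together with a positive excess kurtosis, as in the second sample, points to a long right tail.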

Let's check how many houses have a price higher than 442,567 USD, the threshold used in the queries below.

In [None]:
df_train_prelim.query('SalePrice > 442567.0100000005')

In [None]:
len(df_train_prelim.query('SalePrice > 442567.0100000005'))

We can see that **only 15 houses have a price above roughly 442,500 USD**. It seems that we can drop them as outliers.

In [None]:
#Checking the Ids related to the outliers
id_outliers = list(df_train_prelim.query('SalePrice > 442567.0100000005')['Id'])
print(id_outliers)

In [None]:
#delete the outliers (drop() expects index labels, not Id values)
df_train_prelim.drop([178, 185, 440, 527, 591, 691, 769, 798, 803, 898, 1046, 1169, 1182, 1243, 1373], inplace=True)

#check the shape of the dataframe after deleting the outliers
df_train_prelim.shape
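
The hard-coded threshold and index list above can also be derived from the data. The sketch below uses a small synthetic dataframe (an assumption, not the competition data) and takes the 99th percentile of SalePrice as the cut-off. The choice of 0.99 is hypothetical and only illustrates the idea. Rows are dropped through the index of the filtered frame, so no list of labels has to be typed by hand.

```python
import numpy as np
import pandas as pd

# Synthetic prices for illustration only
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    'Id': np.arange(1, 1001),
    'SalePrice': rng.lognormal(mean=12.0, sigma=0.4, size=1000),
})

# Hypothetical cut-off: the 99th percentile of the prices
threshold = df_demo['SalePrice'].quantile(0.99)

# Index labels of the rows above the cut-off (these are what drop() expects)
outlier_index = df_demo[df_demo['SalePrice'] > threshold].index

df_demo_no_outliers = df_demo.drop(outlier_index)
print(len(outlier_index), df_demo_no_outliers.shape)
```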

Let's check the main statistics and distribution of the SalePrice variable after removing the outliers values.

In [None]:
#check the main statistics of the dependent variable (SalePrice) 
df_train_prelim['SalePrice'].describe() 

In [None]:
#Histogram 3
#keep the Axes returned by distplot so the label is set on this plot
ax = sns.distplot(df_train_prelim['SalePrice'], kde=False)
ax.set_xlabel('House price, USD')
plt.suptitle('SalePrice Histogram (without outliers)', size=15)
plt.show()

In [None]:
#SalePrice Histogram 4
#applying kernel density function
sns.distplot(df_train_prelim['SalePrice'], fit=norm)
fig = plt.figure()
res = stats.probplot(df_train_prelim['SalePrice'], plot=plt)

In [None]:
#SalePrice Skewness & Kurtosis
print("Skewness: %f" % df_train_prelim['SalePrice'].skew())
print("Kurtosis: %f" % df_train_prelim['SalePrice'].kurt())

__Comments:__
After dropping the outliers from the training set, we realized that the SalePrice variable shows a distribution closer to a *normal distribution*. 



__Note [1]:__
We decided to drop the outliers directly from the training dataframe (df_train_prelim) before merging, in order to avoid deleting the wrong records: when the training set and the test set are combined in a single dataframe, the Id variable shows duplicate values.

## 3 Data Cleaning

### 3.1 Merging the training and test dataframes

Let's combine the two datasets (training and test) so that every cleaning step is applied to both at once.

In [None]:
#generate the 'origin' column.
df_train_prelim['origin']= 0
df_test_prelim['origin']= 100

In [None]:
#create the dataframe df_total, which is a dataframe union of df_test and df_train
df_total = pd.concat([df_train_prelim,df_test_prelim], sort = False)
df_total.head(10)
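
To make the effect of the merge concrete, here is a minimal sketch with two toy dataframes (hypothetical values, not the competition files). It shows that `pd.concat` keeps the original index labels, so labels repeat in the combined frame, and that the 'origin' flag is enough to split the rows back later.

```python
import pandas as pd

# Toy stand-ins for the training and test sets (hypothetical values)
toy_train = pd.DataFrame({'Id': [1, 2, 3], 'LotArea': [8450, 9600, 11250],
                          'SalePrice': [208500, 181500, 223500]})
toy_test = pd.DataFrame({'Id': [4, 5], 'LotArea': [11622, 14267]})

# Same flag values as in the notebook: 0 for training rows, 100 for test rows
toy_train['origin'] = 0
toy_test['origin'] = 100

toy_total = pd.concat([toy_train, toy_test], sort=False)

# The index labels 0 and 1 now appear twice; SalePrice is NaN for the test rows
print(toy_total.index.tolist())
print(toy_total['SalePrice'].isnull().sum())

# The flag recovers each subset regardless of the repeated index labels
toy_train_back = toy_total[toy_total['origin'] == 0]
toy_test_back = toy_total[toy_total['origin'] == 100]
```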

In [None]:
#check the shape of the data
print('\033[1m'+ 'The Total Set is having {0} observations for {1} characteristics.'.format(*df_total.shape)+ '\033[0m')
print()

#and the type of data
print('\033[1m' +'The characteristics are of following data types:' + '\033[0m')
df_total.info()

### 3.2 Missing Data Analysis

We are going to review how many null values we have in the dataset in order to clean the data for further analysis.

In [None]:
#check how many null values there are in the dataset
df_total.isnull().sum().sum()

In reality, __there are a total of 13905 NaN values in the features__. The remaining 1459 are the values of the Y variable (SalePrice) that do not exist for the test set (15364 - 13905 = 1459).
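
The arithmetic above can be checked directly by counting the missing values of the target column separately from those of the features. A minimal sketch on a toy dataframe (hypothetical values), assuming, as in this notebook, that the target is the 'SalePrice' column:

```python
import numpy as np
import pandas as pd

# Toy combined frame: the last two rows play the role of test rows without SalePrice
toy_total = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan, 80.0],
    'Alley': [np.nan, 'Grvl', np.nan, np.nan, 'Pave'],
    'SalePrice': [208500.0, 181500.0, 223500.0, np.nan, np.nan],
})

total_nulls = toy_total.isnull().sum().sum()
target_nulls = toy_total['SalePrice'].isnull().sum()
feature_nulls = toy_total.drop(columns='SalePrice').isnull().sum().sum()

# total_nulls is always target_nulls + feature_nulls
print(total_nulls, target_nulls, feature_nulls)
```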

We are going to analyze the dimension of the dataset by columns and rows to decide what columns and rows should be rejected from the dataset.

First, let's check the percentage of null values per column, filtering only for those columns that have NaN values.

In [None]:
# Replace 'None' with NaN Values
# (whole-frame replace avoids the chained inplace call on each column)
df_total = df_total.replace('None', np.nan)

In [None]:
## Save SalePrice info in other DF

df_total.insert(0, 'New_Id', range(1, 1 + len(df_total)))
df_salesprice = df_total[['New_Id','SalePrice']]
df_total.drop('SalePrice', axis=1, inplace= True)

In [None]:
df_salesprice.head(5)

In [None]:
df_total.head()

In [None]:
#Creating table with "% missing"

column_list = []
for i in df_total.columns:
    my_list =[i, str(df_total[i].dtype), int(df_total[i].isnull().sum()), round(df_total[i].isnull().sum()/df_total[i].isnull().count()*100,2)]
    column_list.append(my_list) 

missing = pd.DataFrame(column_list, columns=['Column', 'Type', 'NonValues', 'Percentage'])
missing = missing[missing['NonValues']!=0]
missing.sort_values(by='NonValues', ascending=False)
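
An equivalent table can be built without an explicit loop. The sketch below is a vectorized alternative that keeps the same column names ('Column', 'Type', 'NonValues', 'Percentage'), shown on a toy dataframe with hypothetical values. `summarize_missing` is a helper name introduced here for illustration only.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return one row per column that has at least one missing value."""
    table = pd.DataFrame({
        'Column': df.columns,
        'Type': df.dtypes.astype(str).values,
        'NonValues': df.isnull().sum().values,
        # isnull().mean() is the share of missing cells in each column
        'Percentage': (df.isnull().mean() * 100).round(2).values,
    })
    table = table[table['NonValues'] != 0]
    return table.sort_values(by='NonValues', ascending=False).reset_index(drop=True)

toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 68.0, 80.0],
    'LotArea': [8450, 9600, 11250, 9550],
})
missing_demo = summarize_missing(toy)
print(missing_demo)
```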

In [None]:
#remove columns with more than 48% missing values
df_total.drop(list(missing[missing['Percentage']>48]['Column']), axis = 1, inplace=True)
print(df_total.shape)

Now, we are going to replace the null values.

For numerical variables, we will replace nulls with the median (or with the mean when the median is 0), while for categorical variables we will replace missing values with the most frequent category.

In [None]:
# Replace NaN values

columns = list(missing[missing['Percentage']<48]['Column'])
for i in columns:
    if (missing[missing['Column'] == i]['Type'] == 'object').any():
        #categorical variable: fill with the most frequent category
        df_total[i] = df_total[i].fillna(df_total[i].value_counts().index[0])
    elif df_total[i].median() == 0:
        #numerical variable with median 0: fill with the mean
        df_total[i] = df_total[i].fillna(df_total[i].mean(skipna=True))
    else:
        #numerical variable: fill with the median
        df_total[i] = df_total[i].fillna(df_total[i].median(skipna=True))
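
The imputation rule (mode for categorical columns, median for numerical columns, mean when the median is 0) can be seen in isolation in this minimal sketch. The toy dataframe and the `impute_column` helper are hypothetical and only illustrate the logic.

```python
import numpy as np
import pandas as pd

def impute_column(series):
    """Fill NaN with the mode (non-numeric) or the median/mean (numeric)."""
    if not pd.api.types.is_numeric_dtype(series):
        # Most frequent category
        return series.fillna(series.value_counts().index[0])
    median = series.median()
    if median == 0:
        # Median is 0: fall back to the mean, as described in the notebook
        return series.fillna(series.mean())
    return series.fillna(median)

toy = pd.DataFrame({
    'MasVnrType': ['BrkFace', np.nan, 'BrkFace', 'Stone', 'BrkFace'],
    'LotFrontage': [60.0, np.nan, 80.0, 70.0, 90.0],
    'MasVnrArea': [0.0, 0.0, 0.0, np.nan, 100.0],
})

toy_filled = toy.apply(impute_column)
print(toy_filled)
```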

In [None]:
#check how many null values there are in the dataset (after cleaning the numerical variables)
df_total.isnull().sum().sum()

After finishing these adjustments, we verified that __all the null values have been removed and the data for training and testing is completely clean__.

In [None]:
# Reassign SalesPrice and drop help index

df_total = df_total.merge(df_salesprice, on='New_Id', how='inner')
df_total.drop('New_Id', axis=1, inplace=True)

In [None]:
df_total.head()

### 3.3 Splitting the data into training and test dataframes

Before splitting df_total into training and test sets, we will check how the 'origin' column looks after the cleaning.

In [None]:
# The train data has origin 0, and the test data 100
df_total['origin']

In [None]:
# Split the dataset into training set and test set
df_train_clean = df_total[df_total['origin'] == 0].drop('origin', axis=1)
df_test_clean = df_total[df_total['origin'] == 100].drop('origin', axis=1)
df_train_clean.shape, df_test_clean.shape, df_total.shape

In [None]:
#delete the SalePrice variable from the test dataframe
df_test_clean.drop(['SalePrice'], axis = 1, inplace = True)

In [None]:
#drop the 'Id' column from the training and test dataframes(we do not need it to build the prediction model)
df_train_clean = df_train_clean.drop('Id', axis=1)
df_test_clean = df_test_clean.drop('Id', axis=1)
df_train_clean.shape, df_test_clean.shape

In [None]:
df_train_clean.head(15)

In [None]:
df_test_clean.head(15)

### 3.4 Saving the changes 

####  Exporting the data to csv  

In [None]:
#export the df_train and df_test data to csv to check the final data
df_train_clean.to_csv('df_train_clean.csv', index=False)
df_test_clean.to_csv('df_test_clean.csv', index=False)
#export the summary metrics of df_train_clean to csv to analyze the data
df_train_clean.describe().transpose().to_csv('df_train_clean_metrics.csv', index= True)

In [None]:
#export the df_train and df_test data to pkl to check the final data
df_train_clean.to_pickle("./df_train_clean.pkl")
df_test_clean.to_pickle("./df_test_clean.pkl")
#export the summary metrics of df_train_clean to pkl to analyze the data
df_train_clean.describe().transpose().to_pickle("./df_train_clean_metrics.pkl")

## 4 Correlation Analysis  

### 4.1 Introduction

Before starting the process of selecting the variables that will be part of our predictive model, we will analyze the correlation of numerical variables with respect to the dependent variable to review what the data shows us and make more accurate decisions.

We think that *correlation is one of the most reliable ways of ruling out variables that have no relation to the dependent variable of a regression model*. Therefore, once the correlation is calculated, we will **delete those variables that present a correlation close to zero** (between -0.1 and +0.1), since they have little influence on the variation of the price, either negatively or positively.

In [None]:
Image("correlation_interpretation.PNG")

### 4.2 First Results & Data Visualization 

We have to check the correlation between the numerical independent variables (both float and integer) and the dependent variable (SalePrice).

So, let's start our correlation analysis calculating the overall result between SalePrice and numerical variables. Then, we can split the list between float values and integer values to identify more insights in the data.

In [None]:
#load the training data to start the analysis 
df_train = pd.read_csv('df_train_clean.csv')

#Create dtype lists
cat_cols = list(df_train.select_dtypes('object').columns)
num_cols = list(df_train.select_dtypes(include=[np.number]).columns)

In [None]:
#check correlation between SalePrice and numerical variables
df_train[num_cols].corr()['SalePrice'].sort_values(ascending = False)

__Comments and Observations:__
We realized that the 'GarageCars' (size of garage in car capacity) and 'GarageArea' (size of garage in square feet) variables are the ones that show the highest positive correlation with the SalePrice variable within the group of float variables.

Regarding the *integer variables*, the 'OverallQual' (rates the overall material and finish of the house) and '1stFlrSF' (first floor square feet) variables are the ones that show the highest positive correlation with the SalePrice variable.

Now, we are going to plot a heatmap to visualize a summary of the correlations between the variables.

In [None]:
## Plotting heatmap. 

plt.subplots(figsize = (30,20))

# Generate a mask for the upper triangle (taken from seaborn example gallery)
corr = df_train[num_cols].corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True


sns.heatmap(corr, 
            cmap=sns.diverging_palette(20, 220, n=200), 
            mask = mask, 
            annot=True, 
            center = 0);
## Give title and shape. 
plt.title("Heatmap of all the Train_Clean Features", fontsize = 30);

### 4.3 Top 10 variables with the highest correlation analysis

After visualizing the correlation of all the numerical variables of the training dataset, we are going to examine the top 10 variables with the highest correlation with the SalePrice variable, because they are the ones that have the greatest influence on the house price. 

In order to do it, we will create some heat maps and histogram graphs to check their relevance and understand the distribution of each of the variables.

In [None]:
#Heat Map with the top 10 variables with the highest correlation to the SalePrice feature
k = 11 #number of variables for heatmap (SalePrice + top 10)
corrmat = df_train[num_cols].corr()
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
plt.figure(figsize=(8, 8))
sns.set(font_scale=1)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

In [None]:
#Summarized information of distribution of the top 10 variables with the SalePrice feature
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars','GarageArea','1stFlrSF', 'TotalBsmtSF', 'FullBath', 'YearBuilt', 'YearRemodAdd', 'TotRmsAbvGrd'] 
sns.pairplot(df_train[cols], height = 2.5)
plt.show();

Now, we are going to start the distribution analysis of the top 10 variables with the highest correlation by building histograms for each of them.

__1 - GarageYrBlt__

In [None]:
# GarageYrBlt Histogram 
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['GarageYrBlt'], ax=ax, kde=False)

ax.set_xlabel('')
plt.suptitle('Garage Year Built Distribution', size=15)
ax.set_xlabel('Years')
plt.show()

__2 - GarageCars__

In [None]:
# GarageCars Histogram
df_train['GarageCars'].hist(density=0, histtype='stepfilled', bins=30)
plt.ylabel('frequency')
plt.xlabel('values')
plt.title('Garage Cars Distribution', fontsize = 15)

In [None]:
# GarageCars relative frequency table
100 * df_train['GarageCars'].value_counts() / len(df_train['GarageCars'])  
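
The same relative frequency table can be obtained with the `normalize` argument of `value_counts`. A minimal sketch on a toy series (hypothetical values, not the competition data):

```python
import pandas as pd

toy_garage_cars = pd.Series([2, 2, 1, 3, 2, 1, 2, 2, 3, 2])

# Manual version, as used in the notebook
manual = 100 * toy_garage_cars.value_counts() / len(toy_garage_cars)

# Equivalent version: normalize=True returns proportions instead of counts
normalized = 100 * toy_garage_cars.value_counts(normalize=True)

print(normalized)
```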

__3 - GarageArea__

In [None]:
# GarageArea Histogram
df_train['GarageArea'].hist(density=0, histtype='stepfilled', bins=30)
plt.ylabel('frequency')
plt.xlabel('values')
plt.title('Garage Area Distribution', fontsize = 15)

__4 - TotalBsmtSF__

In [None]:
# TotalBsmtSF Histogram
df_train['TotalBsmtSF'].hist(density=1, histtype='stepfilled', bins=30)
plt.ylabel('frequency')
plt.xlabel('values')
plt.title('Total square feet of basement area - Distribution', fontsize = 15)

__Comments and Observations:__

* Except for "YearBuilt", the rest of the variables show a non-normal distribution, with most of the values concentrated on the left side of the graph.
* GarageCars analysis =>> About 60% of the analyzed sample has space for 2 cars, while the remaining 40% is divided between 1 and 3 car slots (27% for 1 car slot and 13% for 3 car slots, respectively).
* Regarding the Garage Area distribution, we observe that garages with an area between 400 and 600 square feet are the most frequent in the analyzed sample, while the most frequent total basement area is between 500 and 1500 square feet.
* Finally, we observe that most of the garages were built between 1960 and the first decade of the 2000s, with a notable peak in the first decade of the 21st century.

Let's continue with the analysis of the distribution of the numerical variables with high correlation.

__5 - OverallQual__

In [None]:
# OverallQual Histogram 
df_train['OverallQual'].hist(density=0, histtype='stepfilled', bins=30)
plt.ylabel('frequency')
plt.xlabel('values')
plt.title('Rates of material quality - Distribution', fontsize = 15)

In [None]:
# OverallQual relative frequency table
100 * df_train['OverallQual'].value_counts() / len(df_train['OverallQual'])  

__5 - GrLivArea__

In [None]:
#GrLivArea Histogram 
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['GrLivArea'], ax=ax, kde=False)

ax.set_xlabel('Values')
plt.suptitle('Above grade/ground living area square feet - Distribution', size=15)
plt.show()

__6 - 1stFlrSF__

In [None]:
#1stFlrSF Histogram 
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['1stFlrSF'], ax=ax, kde=False)

ax.set_xlabel('Values')
plt.suptitle('First Floor square feet - Histogram', size=15)
plt.show()

__7 - FullBath__

In [None]:
#FullBath Histogram 
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['FullBath'], ax=ax, kde=False)

ax.set_xlabel('Full bathrooms above grade - values')
plt.suptitle('Full bathrooms above grade - Histogram 1', size=15)
plt.show()

In [None]:
# FullBath relative frequency table
100 * df_train['FullBath'].value_counts() / len(df_train['FullBath'])  

__8 - YearBuilt__

In [None]:
#YearBuilt Histogram
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['YearBuilt'], ax=ax, kde=False)

ax.set_xlabel('Year Built - Values')
plt.suptitle('Year Built - Histogram', size=15)
plt.show()

__9 - YearRemodAdd__

In [None]:
#YearRemodAdd Histogram
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['YearRemodAdd'], ax=ax, kde=False)

ax.set_xlabel('Years')
plt.suptitle('Remodel date - Histogram', size=15)
plt.show()

__10 - TotRmsAbvGrd__

In [None]:
#TotRmsAbvGrd Histogram 
sns.set_style('darkgrid')

fig,ax = plt.subplots(1,1,figsize=(8,6))
sns.distplot(df_train['TotRmsAbvGrd'], ax=ax, kde=False)

ax.set_xlabel('Total rooms above grade - values')
plt.suptitle('Total rooms above grade - Histogram', size=15)
plt.show()

In [None]:
# TotRmsAbvGrd relative frequency table
100 * df_train['TotRmsAbvGrd'].value_counts() / len(df_train['TotRmsAbvGrd'])  

__Comments and Observations:__

* "Rates of material quality" analysis =>> 77% of the analyzed sample has ratings of 5 (Average), 6 (Above average) or 7 (Good). Barely 12% has ratings of 8 (Very good) or 9 (Excellent).
* Regarding capacity, TotRmsAbvGrd shows that 52% of the sample has 6 or 7 rooms (28.62% and 23.39% respectively), while for FullBath 54% of the records are two-bathroom houses and 44% are one-bathroom houses.
* Regarding time, most of the houses were built between the 1960s and 2010, with a strong peak of construction in the first decade of the 2000s. These results are similar to the remodel data: most of the houses were remodeled between the 1950s and 2010, with a peak in the 2010s.
* Finally, most of the records of the "First Floor square feet" variable are between 500 and 1500 square feet, while the "Above grade/ground living area square feet" variable is concentrated between 1000 and 2000 square feet.

### 4.4 Deleting variables with low correlation

To finalize the correlation analysis, we are going to remove those numerical variables that have a correlation equal or close to 0 (between -0.10 and +0.10), because they do not show a relation with the SalePrice variable.

The numerical variables susceptible to removal are as follows:

* ScreenPorch (0.084846)
* MoSold (0.073477)
* 3SsnPorch (0.057596)
* PoolArea (0.034475)
* BsmtFinSF2 (-0.009578)
* MiscVal (-0.019445)
* BsmtHalfBath (-0.030081)
* YrSold (-0.033782)
* LowQualFinSF (-0.060264)
* OverallCond (-0.076713)
* MSSubClass (-0.085869)

In [None]:
#remove variables with low correlation
df_train.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath', 'OverallCond'], axis = 1, inplace = True)
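
The list of columns above was written by hand from the correlation output. As a minimal sketch, the same selection rule (absolute correlation with SalePrice below 0.1) can be applied programmatically. The data here is synthetic and the column names 'Strong' and 'Noise' are hypothetical; only the threshold comes from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
strong = rng.normal(size=n)
toy = pd.DataFrame({
    'Strong': strong,                           # drives the price
    'Noise': rng.normal(size=n),                # unrelated to the price
    'SalePrice': 100 + 10 * strong + rng.normal(size=n),
})

# Correlation of every numerical column with the target (target itself excluded)
corr_with_target = toy.corr()['SalePrice'].drop('SalePrice')

# Columns whose correlation lies between -0.1 and +0.1
low_corr_cols = corr_with_target[corr_with_target.abs() < 0.1].index.tolist()

toy_reduced = toy.drop(columns=low_corr_cols)
print(low_corr_cols)
```

Keeping the resulting list in a variable also makes it easy to drop exactly the same columns from the test set.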

In [None]:
#check the shape of the dataframe after removing the variables with low correlation
df_train.shape

In [None]:
#check the columns after removing the variables with low correlation
df_train.columns

In addition, we have to apply to the test set the same changes we made earlier to the training set, in order to have the information in the same format and get realistic results.

Therefore, we proceed to remove those variables that show a low correlation with respect to the SalePrice variable.

In [None]:
#load the test data to implement the changes
df_test = pd.read_csv('df_test_clean.csv') 

#remove variables with low correlation
df_test.drop(['MoSold', 'ScreenPorch', '3SsnPorch', 'PoolArea', 'MiscVal', 'YrSold', 'LowQualFinSF', 'MSSubClass',
               'BsmtFinSF2', 'BsmtHalfBath', 'OverallCond'], axis = 1, inplace = True)

In [None]:
#check the shape of the test set after removing variables with low correlation
df_test.shape

In [None]:
#save the data to csv
df_train.to_csv('df_train_corr.csv', index=False)
df_test.to_csv('df_test_corr.csv', index=False)

## 5 Categorical Data Analysis

In the previous sections, we analyzed the datasets considering only the numerical features in order to obtain clean data for building the prediction models. In this section, we will analyze the behavior of the categorical variables and their impact on the predicted value. To this end, we will start by plotting the categorical variables and then analyze their distribution to assess whether they can be discarded or need to be analyzed in more depth.

If a categorical feature presents a uniform distribution (i.e. all categories with the same or similar values), we consider it a candidate for discarding. Otherwise, the feature must be analyzed individually, considering the inclusion of dummy variables if the feature has few categories (at most 4), in order to enrich the dataset and improve the accuracy of the future prediction model.
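
These screening criteria can be turned into a quick summary per categorical column: the number of categories and the share of the most frequent one. The sketch below uses a toy dataframe with hypothetical values. The limit of 4 categories comes from the text, while the helper name `summarize_categoricals` and its output columns are assumptions introduced for illustration.

```python
import pandas as pd

def summarize_categoricals(df, max_dummy_categories=4):
    """Count categories and measure how dominant the top category is."""
    rows = []
    for col in df.select_dtypes(exclude='number').columns:
        shares = df[col].value_counts(normalize=True)
        rows.append({
            'Column': col,
            'Categories': int(shares.size),
            'TopShare': round(float(shares.iloc[0]) * 100, 2),
            # Few categories: candidate for dummy variables
            'DummyCandidate': bool(shares.size <= max_dummy_categories),
        })
    return pd.DataFrame(rows)

toy = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Pave', 'Grvl', 'Pave', 'Pave'],
    'Neighborhood': ['NAmes', 'CollgCr', 'OldTown', 'Edwards', 'Somerst', 'Gilbert'],
    'LotArea': [8450, 9600, 11250, 9550, 14260, 14115],
})
summary = summarize_categoricals(toy)
print(summary)
```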

### 5.1 Overview and first visualizations

In [None]:
#Count plots for categorical variables

#create the figure only; the axes are added one by one inside the loop
fig = plt.figure(figsize =(20,240))
sns.color_palette("husl", 8)
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)

for i, feature in enumerate(cat_cols, 1):
    plt.subplot(len(cat_cols), 3,i)
    sns.countplot(x=df_train[feature])
    plt.xlabel(f'{feature}', size=20)
    plt.ylabel('Count', size=15)
plt.show()

In [None]:
# Median SalePrice per category for categorical data

#create the figure only; the axes are added one by one inside the loop
fig = plt.figure(figsize =(12,120))
sns.color_palette("husl", 8)
plt.subplots_adjust(right=2)
plt.subplots_adjust(top=2)

for i, feature in enumerate(cat_cols, 1):
    plt.subplot(len(cat_cols), 3,i)
    df_train.groupby(feature)['SalePrice'].median().plot.bar()
    plt.xlabel(f'{feature}', size=15)
    plt.ylabel('Median SalePrice', size=15)
plt.show()

__Comments and Observations:__


There is a large number of categorical variables with widely dispersed distributions, so it is difficult to quantify whether they all contribute equally when building the model.

Therefore, in order to analyze which are the most impactful categorical variables, we will analyze them one by one and evaluate whether they are worth converting them into dummy variables to build a robust predictive model.

### 5.2 Detailed analysis and adjustments

Next, we will analyze each of the variables to see what information they provide us and check if it can add value to the model or not.

__a) MSZoning__

The variable MSZoning identifies the general zoning classification of the sale (agriculture, commercial, residential, etc).

* A	Agriculture
* C	Commercial
* FV	Floating Village Residential
* I	Industrial
* RH	Residential High Density
* RL	Residential Low Density
* RP	Residential Low Density Park 
* RM	Residential Medium Density
	

In [None]:
# MSZoning bar chart
plot = df_train['MSZoning'].value_counts().plot(kind='bar', color=['tab:blue','tab:orange','tab:green'], title='MSZoning')

__b) Street__

The variable Street identifies the type of road access to property (gravel or paved).

In [None]:
# Street bar chart
plot = df_train['Street'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Street - type of road access to the property')

Comments =>> It could be interesting to convert the Street variable into dummy.
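
A minimal sketch of what converting Street into a dummy variable could look like with `pd.get_dummies`, on a toy dataframe with hypothetical values. Using `drop_first=True` and `dtype=int` is an assumption made here to avoid a redundant column and to get 0/1 values; the notebook may encode the variable differently.

```python
import pandas as pd

toy = pd.DataFrame({
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# One indicator column per category; drop_first removes the first category (Grvl)
toy_dummies = pd.get_dummies(toy, columns=['Street'], drop_first=True, dtype=int)
print(toy_dummies)
```

With two categories, a single indicator column (Street_Pave) carries all the information of the original variable.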

__c) LotShape__

The variable LotShape identifies the general shape of property.

* Reg	- Regular	
* IR1	- Slightly irregular
* IR2	- Moderately Irregular
* IR3	- Irregular

In [None]:
# LotShape bar chart
plot = df_train['LotShape'].value_counts().plot(kind='bar', color=['tab:blue','tab:orange','tab:green'], 
                                                title='General shape of property')

Comments =>> It could be interesting to convert the LotShape variable into dummy.

__d) LandContour__

The variable LandContour identifies the flatness of the property.

* Lvl	- Near Flat/Level	
* Bnk	- Banked - Quick and significant rise from street grade to building
* HLS	- Hillside - Significant slope from side to side
* Low	- Depression

In [None]:
# LandContour bar chart
plot = df_train['LandContour'].value_counts().plot(kind='bar', color=['tab:blue','tab:orange','tab:green'],
                                                   title='Flatness of the property')

Comments =>> It could be interesting to convert the LandContour variable into dummy.

__e) Utilities__

The variable Utilities identifies the type of utilities available (all public utilities, electricity, gas, water, etc).

* AllPub	-- All public Utilities (E,G,W,& S)	
* NoSewr	-- Electricity, Gas, and Water (Septic Tank)
* NoSeWa	-- Electricity and Gas Only
* ELO	-- Electricity only	

In [None]:
# Utilities bar chart
plot = df_train['Utilities'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Type of utilities')

Comments =>> It could be interesting to convert the Utilities variable into dummy.

__f) LotConfig__

The variable LotConfig identifies the lot configuration.

* Inside	-- Inside lot
* Corner	-- Corner lot
* CulDSac	-- Cul-de-sac
* FR2	-- Frontage on 2 sides of property
* FR3	-- Frontage on 3 sides of property

In [None]:
# LotConfig bar chart
plot = df_train['LotConfig'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Lot Configuration')

Comments =>> I am not sure whether it makes sense to include this variable (I could not understand its meaning). 

__g) LandSlope__

The variable LandSlope identifies the slope of property.

* Gtl	-- Gentle slope
* Mod	-- Moderate Slope	
* Sev	-- Severe Slope

In [None]:
# LandSlope bar chart
plot = df_train['LandSlope'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='LandSlope')

Comments =>> It could be interesting to convert the LandSlope variable into dummy and work together with the variable LandContour.

__h) Neighborhood__

The variable Neighborhood identifies the physical locations within Ames city limits.

In [None]:
# Neighborhood bar chart
plot = df_train['Neighborhood'].value_counts().plot(kind='bar', color=['tab:blue','tab:orange','tab:green'], 
                                                    title='Physical locations within Ames city limits')

Comments =>> I am not sure whether it makes sense to include this variable in the model.

__i) Condition1__

The variable Condition1 identifies the proximity to various conditions.

In [None]:
# Condition1 bar chart
plot = df_train['Condition1'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Condition 1')

In [None]:
# Condition1 relative frequency table
100 * df_train['Condition1'].value_counts() / len(df_train['Condition1'])

__j) Condition2__

The variable Condition2 identifies the proximity to various conditions (if more than one is present).

In [None]:
# Condition2 bar chart
plot = df_train['Condition2'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Condition 2')

In [None]:
# Condition2 relative frequency table
100 * df_train['Condition2'].value_counts() / len(df_train['Condition2'])
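
The relative frequency tables above divide the category counts by the number of rows. As a minimal sketch on a made-up column (the values below are illustrative and not taken from the dataset), pandas can produce the same percentages directly with `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy column standing in for df_train['Condition2'] (values are made up)
toy = pd.Series(['Norm', 'Norm', 'Norm', 'Feedr', 'Artery', 'Norm', 'Feedr', 'Norm'])

# Manual version, as in the cells above: counts divided by the number of rows
manual_pct = 100 * toy.value_counts() / len(toy)

# Built-in version: normalize=True returns proportions instead of counts
builtin_pct = 100 * toy.value_counts(normalize=True)

print(builtin_pct)
```

The two versions agree only when the column has no missing values, because `value_counts` ignores NaN by default while `len` counts every row.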

__k) BldgType__

The variable BldgType identifies the type of dwelling/housing.

* 1Fam	-- Single-family Detached	
* 2FmCon	-- Two-family Conversion; originally built as one-family dwelling
* Duplx	-- Duplex
* TwnhsE	-- Townhouse End Unit
* TwnhsI	-- Townhouse Inside Unit


In [None]:
# BldgType bar chart
plot = df_train['BldgType'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='BldgType')

Comments =>> It could be interesting to convert the BldgType variable into dummy variables.

__l) HouseStyle__

The variable HouseStyle identifies the style of dwelling/housing.

* 1Story	-- One story
* 1.5Fin	-- One and one-half story: 2nd level finished
* 1.5Unf	-- One and one-half story: 2nd level unfinished
* 2Story	-- Two story
* 2.5Fin	-- Two and one-half story: 2nd level finished
* 2.5Unf	-- Two and one-half story: 2nd level unfinished
* SFoyer	-- Split Foyer
* SLvl	-- Split Level

In [None]:
# HouseStyle bar chart
plot = df_train['HouseStyle'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='HouseStyle')

__m) RoofStyle__

The variable RoofStyle identifies the type of roof.

* Flat	-- Flat
* Gable	-- Gable
* Gambrel	-- Gambrel (Barn)
* Hip	-- Hip
* Mansard	-- Mansard
* Shed	-- Shed

In [None]:
# RoofStyle bar chart
plot = df_train['RoofStyle'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Roof Style')

In [None]:
# RoofStyle relative frequency table
100 * df_train['RoofStyle'].value_counts() / len(df_train['RoofStyle'])

__n) RoofMatl:__

The variable RoofMatl identifies the roof material.

* ClyTile	-- Clay or Tile
* CompShg	-- Standard (Composite) Shingle
* Membran	-- Membrane
* Metal	-- Metal
* Roll	-- Roll
* Tar&Grv	-- Gravel & Tar
* WdShake	-- Wood Shakes
* WdShngl	-- Wood Shingles

In [None]:
# RoofMatl bar chart
plot = df_train['RoofMatl'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], title='Roof Material')

In [None]:
# RoofMatl relative frequency table
100 * df_train['RoofMatl'].value_counts() / len(df_train['RoofMatl'])

__o) Exterior1st:__

The variable Exterior1st identifies the exterior covering on house.


In [None]:
# Exterior1st bar chart
plot = df_train['Exterior1st'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], 
                                                         title='Exterior covering on house 1')

__p) Exterior2nd:__

The variable Exterior2nd identifies the exterior covering on house (if more than one material).

In [None]:
# Exterior2nd bar chart
plot = df_train['Exterior2nd'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], 
                                                         title='Exterior covering on house 2')

__q) ExterQual:__

The variable ExterQual evaluates the quality of the material on the exterior.

In [None]:
# ExterQual bar chart
plot = df_train['ExterQual'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], 
                                                    title='Quality of the material on the exterior')

Comments =>> I am not sure whether it makes sense to include this variable in the model.

__r) ExterCond:__

The variable ExterCond evaluates the present condition of the material on the exterior.

In [None]:
# ExterCond bar chart
plot = df_train['ExterCond'].value_counts().plot(kind='bar', 
                                                      color=['tab:blue','tab:orange','tab:green'], 
                                                    title='Present condition of the material on the exterior')

Comments =>> I am not sure whether it makes sense to include this variable in the model.

__Summary of variables that could be transformed into dummy variables due to their characteristics:__

* Street (2 options)
* LotShape (4 options)
* LandContour (4 options)
* ExterQual (4 options)
* ExterCond (4 options)
* Utilities (2 options)
* LandSlope (3 options)
* BldgType (5 options)
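
As a minimal sketch of what "transforming into dummy variables" means for the candidates listed above, the example below applies `pd.get_dummies` to a toy frame. The `LandSlope` categories come from the description above; the `LotArea` values are made up for illustration.

```python
import pandas as pd

# Toy frame: 'LandSlope' uses the three categories listed earlier, 'LotArea' is made up
toy = pd.DataFrame({'LandSlope': ['Gtl', 'Mod', 'Gtl', 'Sev'],
                    'LotArea': [8450, 9600, 11250, 9550]})

# One indicator column per category; the original 'LandSlope' column is dropped
toy_dummies = pd.get_dummies(toy, columns=['LandSlope'])

print(toy_dummies.columns.tolist())
```

Each row has exactly one active indicator, so a variable with k categories adds k columns. This is why variables with few options are the better candidates.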

## 6 Building a baseline model applying Logistic Regression 

### 6.1 Preparing the data

#### 6.1.1 Getting the Dependent and Independent variables

In [None]:
#Getting the Dependent and Independent variables
X_train = df_train.iloc[:, :-1] #all rows, all columns except the last one
y_train = df_train.iloc[:, -1] #all rows, only the last column

In [None]:
#check the shape of X_train and y_train
X_train.shape, y_train.shape
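
Selecting the target by position works as long as the column order never changes. As a hedged alternative, the sketch below selects it by name on a toy frame; only the column name `SalePrice` comes from this notebook, the other columns and values are made up.

```python
import pandas as pd

# Toy frame with made-up values; only the column name 'SalePrice' comes from the notebook
toy = pd.DataFrame({'OverallQual': [7, 6, 7],
                    'GrLivArea': [1710, 1262, 1786],
                    'SalePrice': [208500, 181500, 223500]})

# Select the target by name, so the code does not depend on the column position
X_toy = toy.drop(columns='SalePrice')
y_toy = toy['SalePrice']

print(X_toy.shape, y_toy.shape)
```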

#### 6.1.2 Creating new dataframes based on the data type 

Let's start the creation of our prediction model by building some dataframes based on the data type of the variables in X_train. These dataframes will help us to build a pilot model composed of only numerical variables.

Then, we are going to add categorical variables to our model to improve its score and predictive power.

In [None]:
##Create dtype dataframes
#create a dataframe with only categorical variables
df_object = X_train.select_dtypes(include=[object])
#create a dataframe with only numerical variables
df_number = X_train.select_dtypes(include=[np.number])

In [None]:
df_object.head()

In [None]:
df_object.shape

In [None]:
#check the columns of the df_object dataframe
df_object.columns

Comments =>> We realized that we have a total of 38 categorical variables. For our predictive model, we have to convert them into numbers in order to be able to apply algorithms such as logistic regression and random forest. However, we cannot transform them all at once, as in some cases it is convenient to turn some variables into dummy variables to achieve a positive impact on the model as a whole.

In the following section, you will find more details about the treatment of the categorical variables.

In [None]:
df_number.head()

In [None]:
df_number.shape

In [None]:
#check the columns of the df_number dataframe
df_number.columns

Comments => We realized that we have a total of 26 numerical variables. For our predictive model, we can consider all the numerical variables, since both logistic regression and random forest operate on numerical variables and the number of variables is not very high.

### 6.2 Build a model with only numerical variables 

#### 6.2.1 Pilot Model 1 (numerical variables with correlation > [+0.5 & -0.5] )

Let's start by creating a pilot model with only those variables whose correlation with the dependent variable is stronger than 0.50 in absolute value.

In [None]:
pilot_model_1 = df_number[['OverallQual','GrLivArea', 'GarageCars', 'GarageArea','TotalBsmtSF', '1stFlrSF', 'YearBuilt', 
                           'FullBath','YearRemodAdd' ]]

In [None]:
pilot_model_1.shape

In [None]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_1 = LogisticRegression (random_state = 0)
log_regressor_1.fit(pilot_model_1, y_train)

__Important Note:__ The classifier learns the relationship between the features in pilot_model_1 and the target y_train.

Now, let's compute the score of the model. Because LogisticRegression is a classifier, its score method returns the mean accuracy (the share of rows whose SalePrice is predicted exactly) rather than the $R^2$ (coefficient of determination). The $R^2$, which measures the proportion of the variation of the results that can be explained by the model, is computed with metrics.r2_score for the final baseline in section 6.4.

**The best possible $R^2$ score is 1.0 and it can be negative** (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get an $R^2$ score of 0.
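
The properties of $R^2$ stated above follow from its definition, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below checks them by hand with NumPy on made-up prices; the helper `r2` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up prices, for illustration only
y = np.array([100.0, 150.0, 200.0, 250.0])

perfect = r2(y, y)                            # exact predictions
constant = r2(y, np.full_like(y, y.mean()))   # always predict the mean
worse = r2(y, y[::-1])                        # predictions in reversed order

print(perfect, constant, worse)
```

Exact predictions give 1, the constant mean predictor gives 0, and predictions that are worse than the mean give a negative value.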

In [None]:
#Compute the score for pilot_model_1 and y_train (mean accuracy, since LogisticRegression is a classifier)
print('Training Score: {}'.format(log_regressor_1.score(pilot_model_1, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_1 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_1.predict(pilot_model_1) - y_train)**2)))
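
Because SalePrice is a continuous target, a regression estimator is the more usual choice for this kind of baseline. The following is only a sketch on synthetic data, not the approach used in this notebook: it fits scikit-learn's `LinearRegression`, whose `score` method returns $R^2$. All names ending in `_demo` and the coefficients are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the price depends linearly on two made-up features plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 50_000 + 20_000 * X_demo[:, 0] + 5_000 * X_demo[:, 1] + rng.normal(0, 1_000, size=200)

lin_regressor = LinearRegression()
lin_regressor.fit(X_demo, y_demo)

# For regressors, score() returns R^2 (not accuracy)
r2_demo = lin_regressor.score(X_demo, y_demo)
mse_demo = np.mean((lin_regressor.predict(X_demo) - y_demo) ** 2)

print('Training R^2: {:.4f}'.format(r2_demo))
print('Training MSE: {:.1f}'.format(mse_demo))
```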

#### 6.2.2 Pilot Model 2 (all numerical variables)

Now, we are going to include all the numerical variables in the pilot model in order to check its performance and define a preliminary baseline.

In [None]:
pilot_model_2 = df_number[['OverallQual','GrLivArea', 'GarageCars', 'GarageArea','TotalBsmtSF', '1stFlrSF', 'YearBuilt', 
                           'FullBath','YearRemodAdd','TotRmsAbvGrd','GarageYrBlt','Fireplaces','MasVnrArea','BsmtFinSF1',
                           'WoodDeckSF', 'LotFrontage','OpenPorchSF', '2ndFlrSF','HalfBath','LotArea','BsmtFullBath',  
                            'BsmtUnfSF','BedroomAbvGr','EnclosedPorch','KitchenAbvGr']]

In [None]:
pilot_model_2.shape

In [None]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_2 = LogisticRegression (random_state = 0)
log_regressor_2.fit(pilot_model_2, y_train)

In [None]:
#Compute the score for pilot_model_2 and y_train (mean accuracy, since LogisticRegression is a classifier)
print('Training Score: {}'.format(log_regressor_2.score(pilot_model_2, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_2 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_2.predict(pilot_model_2) - y_train)**2)))

__Comments:__ We realized that the Score and the MSE have improved after including all the numerical variables that show a significant level of correlation with the dependent variable.

__Note [2]:__ We tried to improve the performance of the model by checking the distribution of all the numerical variables in the model and applying logarithms to adjust those variables whose distribution was close to Gaussian.

However, we realized that the score of the logistic regression worsened after applying the logarithm functions, so we decided not to use this method to improve model performance, as the results were not as expected.

You can check the details of our analysis by clicking on the following link:

[Notebook - Testing Variables Distribution Applying 'Logarithms'](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Testing%20Variables%20Distribution%20Applying%20'Logarithms'.ipynb)
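
For context, log transforms are most often applied to right-skewed variables to bring them closer to a symmetric shape. The sketch below illustrates that effect on a synthetic log-normal sample (an assumption made for this example, not one of the dataset's columns) by comparing the skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed variable (log-normal); not taken from the dataset
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=1000))

# log1p computes log(1 + x), which is also defined for x = 0
logged = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = logged.skew()

print('Skewness before: {:.2f}, after: {:.2f}'.format(skew_before, skew_after))
```

A skewness near 0 after the transform indicates a roughly symmetric distribution. Whether this helps a given model still has to be checked empirically, as the linked notebook does.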

### 6.3. Build a model with numerical and categorical variables 

#### 6.3.1 Convert some categorical variables into dummy variables

The first step in building a predictive model that uses both numerical and categorical variables is to convert into dummy variables those categorical variables that have a positive impact on the model and that have few options/classes, in order to avoid significantly increasing the number of features in the dataset.

We do not want to make this notebook too long, so we include only the necessary code and avoid redundancy. The analysis itself was done in other notebooks, which are linked from this file.

In the following Jupyter Notebook you can see the analysis in which we converted into dummy variables every variable that seemed a suitable candidate, together with the impact of each variable on the performance of the model:

[Categorical Data - Dummy Variables Testing](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Categorical%20Data%20-%20Dummy%20Variables%20Testing.ipynb)

However, we found that certain variables ('ExterCond', 'Utilities' and 'Street') did not have a positive impact on the model after being transformed into dummy variables. So, we excluded these variables from the dummy analysis and created a final version of the notebook, in which we tested the performance of the model with all the chosen dummy variables and the remaining categorical variables converted into numbers.

You can check the details of our analysis in the Jupyter Notebook:

[Categorical Variables - Analysis & Testing.ipynb](https://github.com/lmendezotero/Postgraduate-Project/blob/master/House%20Prices%20Prediction/Categorical%20Data%20-%20Analysis%20%26%20Testing.ipynb)

So, let's go!

We are going to start building our predictive model by converting the chosen categorical variables into dummy variables.

In [None]:
#convert the chosen categorical variables into dummy variables
X_train = pd.get_dummies (X_train, columns = ['LotShape', 'LandContour', 'LandSlope', 'BldgType', 'ExterQual', 
                                              'BsmtQual', 'BsmtCond', 'BsmtExposure', 'CentralAir', 'KitchenQual', 
                                              'GarageFinish', 'PavedDrive'])

#check the shape of X_train after converting the variables into dummy variables
X_train.shape

In [None]:
#check the name of the columns after converting the variables into dummy
X_train.columns

Now, we proceed to merge all the dummy variables in the same pilot_model.

In [None]:
#numerical model + dummy variables
pilot_model_3 = X_train[['OverallQual','GrLivArea', 'GarageCars', 'GarageArea','TotalBsmtSF', '1stFlrSF', 'YearBuilt', 
                           'FullBath','YearRemodAdd','TotRmsAbvGrd','GarageYrBlt','Fireplaces','MasVnrArea','BsmtFinSF1',
                           'WoodDeckSF', 'LotFrontage','OpenPorchSF', '2ndFlrSF','HalfBath','LotArea','BsmtFullBath',  
                            'BsmtUnfSF','BedroomAbvGr','EnclosedPorch','KitchenAbvGr','LotShape_IR1', 'LotShape_IR2',
                           'LotShape_IR3', 'LotShape_Reg', 'LandContour_Bnk','LandContour_HLS', 'LandContour_Low',
                          'LandContour_Lvl','LandSlope_Gtl', 'LandSlope_Mod', 'LandSlope_Sev', 'BldgType_1Fam',
                           'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs','BldgType_TwnhsE', 'ExterQual_Ex', 
                           'ExterQual_Fa', 'ExterQual_Gd','ExterQual_TA', 'BsmtQual_Ex', 'BsmtQual_Fa', 'BsmtQual_Gd',
                           'BsmtQual_TA', 'BsmtCond_Fa', 'BsmtCond_Gd', 'BsmtCond_Po', 'BsmtCond_TA', 'BsmtExposure_Av',
                           'BsmtExposure_Gd', 'BsmtExposure_Mn','BsmtExposure_No', 'CentralAir_N', 'CentralAir_Y', 
                         'KitchenQual_Ex','KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'GarageFinish_Fin',
                           'GarageFinish_RFn', 'GarageFinish_Unf','PavedDrive_N', 'PavedDrive_P', 'PavedDrive_Y']]

pilot_model_3.shape

Let's fit a logistic regression on the training data to check the performance of the model after including all the dummy variables.

In [None]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_3 = LogisticRegression (random_state = 0)
log_regressor_3.fit(pilot_model_3, y_train)

In [None]:
#Compute the score for pilot_model_3 and y_train (mean accuracy, since LogisticRegression is a classifier)
print('Training Score: {}'.format(log_regressor_3.score(pilot_model_3, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_3 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_3.predict(pilot_model_3) - y_train)**2)))

Comments =>> It looks like the Score and MSE of our model improved after applying all the dummy variables compared to pilot_model_2 ("0.7328 VS 0.5723" and "254.689.611 VS 598.234.875" respectively).

So, the *combination of those dummy variables and the numerical variables has had a positive impact on model performance*.

Now, we proceed to apply the same changes to the test set (df_test) so that it has the same format as the training set.

In [None]:
#convert some categorical variables into dummy variables
df_test = pd.get_dummies (df_test, columns = ['LotShape', 'LandContour', 'LandSlope', 'BldgType', 'ExterQual', 
                                              'BsmtQual', 'BsmtCond', 'BsmtExposure', 'CentralAir', 'KitchenQual', 
                                              'GarageFinish', 'PavedDrive'])

#check the shape of test set after converting the variables into dummy
df_test.shape
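
When `pd.get_dummies` is applied to the training and test sets separately, a category that is missing from one of them produces a different set of columns. One way to handle this, sketched here on toy frames (the frames and their values are made up, and this step is not part of the original notebook), is to reindex the test columns against the training columns.

```python
import pandas as pd

# Toy train/test frames; the test set lacks the 'Sev' category
toy_train = pd.DataFrame({'LandSlope': ['Gtl', 'Mod', 'Sev']})
toy_test = pd.DataFrame({'LandSlope': ['Gtl', 'Mod', 'Gtl']})

train_dummies = pd.get_dummies(toy_train, columns=['LandSlope'])
test_dummies = pd.get_dummies(toy_test, columns=['LandSlope'])

# Give the test set exactly the training columns, in the same order;
# indicator columns that are missing in the test set are filled with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Comparing `X_train.shape[1]` with `df_test.shape[1]` after the conversion is a quick way to see whether such an alignment is needed.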

#### 6.3.2 Convert the remaining categorical variables into numbers

Finally, we have to convert the remaining categorical variables into numbers and check the performance of the model.

In [None]:
#convert the rest of the categorical variables into numbers
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in X_train.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    X_train[col] = lencoders[col].fit_transform(X_train[col])

In [None]:
#check the datatype of X_train to review that all the variables are numbers
X_train.info()

Let's merge all the variables in the same pilot_model.

In [None]:
#numerical model + all dummy variables + remaining categorical variables (label-encoded)
pilot_model_4 = X_train

pilot_model_4.shape

In [None]:
#Fitting logistic Regression into the Training set
from sklearn.linear_model import LogisticRegression
log_regressor_4 = LogisticRegression (random_state = 0)
log_regressor_4.fit(pilot_model_4, y_train)

In [None]:
#Compute the score for pilot_model_4 and y_train (mean accuracy, since LogisticRegression is a classifier)
print('Training Score: {}'.format(log_regressor_4.score(pilot_model_4, y_train)))
#Compute MSE (Mean Squared Error) for pilot_model_4 and y_train
print('Training MSE: {}'.format(np.mean((log_regressor_4.predict(pilot_model_4) - y_train)**2)))

Comments =>> Once we apply logistic regression with all the transformed variables, we observe a Score of 0.8747 and a Mean Squared Error of 114.541.500.

Therefore, we consider pilot_model_4 as the reference or baseline. So, the goal is to improve the Score and the MSE of our model by applying one of the most popular ensemble algorithms, the Random Forest Regressor.

Before saving the changes, we have to perform the same actions on the test set (df_test) so that it has the same format as the training set.

In [None]:
#convert the rest of the categorical variables into numbers
from sklearn.preprocessing import LabelEncoder
lencoders = {}

for col in df_test.select_dtypes(include=['object']).columns:
    lencoders[col] = LabelEncoder()
    df_test[col] = lencoders[col].fit_transform(df_test[col])
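
Fitting a fresh encoder on the test set can assign different integers to the same category in the training and test sets whenever the observed categories differ. One way to keep the codes consistent, sketched here with pandas on made-up columns (an assumption for illustration, not what this notebook ran), is to learn the category list from the training data and reuse it for both sets.

```python
import pandas as pd

# Toy columns; 'Gambrel' appears only in the test set
train_col = pd.Series(['Gable', 'Hip', 'Gable', 'Flat'])
test_col = pd.Series(['Hip', 'Gable', 'Gambrel'])

# Learn the category order from the training data only
categories = sorted(train_col.unique())

# Apply the same mapping to both sets; values unseen in training get code -1
train_codes = pd.Categorical(train_col, categories=categories).codes
test_codes = pd.Categorical(test_col, categories=categories).codes

print(list(train_codes), list(test_codes))
```

With this mapping 'Hip' and 'Gable' receive the same code in both sets, and the unseen 'Gambrel' is flagged with -1 instead of silently taking another category's number.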

In [None]:
#check the datatype of df_test to review that all the variables are numbers
df_test.info()

In [None]:
#Review the final data
df_test.head()

Now, we are ready to train the Random Forest Regressor model, as the training set and test set are aligned.

#### 6.3.3 Saving the changes

In [None]:
#let's check the shape of the final training dataframes
X_train.shape, y_train.shape

In [None]:
#export the baseline model to csv
X_train.to_csv('Xtrain_baseline_model.csv', index=False)
y_train.to_csv('ytrain_baseline_model.csv', index = False)
df_test.to_csv('Xtest_baseline_model.csv', index = False)

In [None]:
#export the baseline model to pickle
X_train.to_pickle("./Xtrain_baseline_model.pkl")
y_train.to_pickle("./y_training.pkl")
df_test.to_pickle("./Xtest_baseline_model.pkl")

### 6.4 Computing metrics for the logistic regression model

In [None]:
#predict the results for the training set
y_pred_1 = log_regressor_4.predict(X_train)
np.set_printoptions(precision=2)
print(y_pred_1)

In [None]:
print(y_train)

In [None]:
import sklearn.metrics as metrics
print("R-squared =", metrics.r2_score(y_train, y_pred_1))

In [None]:
print("MSE =", metrics.mean_squared_error(y_train, y_pred_1))

In [None]:
#predict the results for the test set
y_pred_2 = log_regressor_4.predict(df_test)
np.set_printoptions(precision=2)
print(y_pred_2)

In [None]:
y_pred_2_df = pd.DataFrame(y_pred_2)

In [None]:
print(y_pred_2_df)

In [None]:
y_pred_1_df = pd.DataFrame(y_pred_1)

In [None]:
print(y_pred_1_df)

In [None]:
#export to csv
y_pred_1_df.to_csv('y_pred_logreg_train.csv', index=False)
y_pred_2_df.to_csv('y_pred_logreg_test.csv', index=False)                   

## 7 Building a predictive model applying Random Forest Regressor

### 7.1 Defining the Random Forest Regressor baseline

We start our analysis by building a simple random forest regressor model, which is the baseline. Then, using cross-validation techniques, we will search for the best parameters and apply them in order to build a solid and robust predictive model.

So, let's start creating the Random Forest baseline model.

#### 7.1.1 Fitting the Random Forest Regressor

In [None]:
# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
rf_regressor.fit(X_train, y_train)

In [None]:
#Compute Score (𝑅2) for the X_train and y_training
print('Training Score: {}'.format(rf_regressor.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for the X_train  and y_training
print('Training MSE: {}'.format(np.mean((rf_regressor.predict(X_train) - y_train)**2)))

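Comparing the results of the random forest model with those of the logistic regression, we see that the random forest shows greater potential, as the score increased by about 10 percentage points (from 0.87 to 0.97). Note that the two scores are not the same metric: for the logistic regression classifier the score is the mean accuracy, whereas for the random forest regressor it is $R^2$, so the comparison is only indicative.

Therefore, we define a Random Forest baseline model that has a Score of 0.974 and an MSE of 126.354.326,35. So, **our goal is to try to improve these results.** 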

#### 7.1.2 Predicting the results

In [None]:
# Predicting the test set results
y_pred = rf_regressor.predict(df_test)
y_pred

Comments => As we do not have any test labels (y_test) to validate the predictions of our model, we have to look for another method to confirm that our model is trained properly (without overfitting) and predicts new results correctly.

One option is to apply **Cross-Validation techniques**, which let us validate the model directly on the training set.

### 7.2 Applying K-Fold Cross-Validation technique

One of the most common cross-validation techniques is **K-fold, which consists of splitting the training set into K subsets, called folds**. Then, we iteratively fit the model K times, each time training on K-1 of the folds and evaluating on the remaining fold (called the validation data). At the very end, we average the performance over the folds to come up with the final validation metrics for the model.
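
The K-fold procedure described above can be written out by hand. The sketch below uses NumPy only, with synthetic data and an ordinary least-squares model as stand-ins (both are assumptions for illustration; the notebook itself uses a random forest with scikit-learn's `cross_val_score`).

```python
import numpy as np

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

k = 5
indices = rng.permutation(len(X_demo))   # shuffle the row indices once
folds = np.array_split(indices, k)       # K roughly equal folds

fold_scores = []
for i in range(k):
    val_idx = folds[i]                                                  # fold i is held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # the other K-1 folds

    # Fit ordinary least squares (with an intercept column) on the training folds
    A_train = np.column_stack([np.ones(len(train_idx)), X_demo[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y_demo[train_idx], rcond=None)

    # Evaluate R^2 on the held-out fold
    A_val = np.column_stack([np.ones(len(val_idx)), X_demo[val_idx]])
    residuals = y_demo[val_idx] - A_val @ coef
    ss_tot = np.sum((y_demo[val_idx] - y_demo[val_idx].mean()) ** 2)
    fold_scores.append(1.0 - np.sum(residuals ** 2) / ss_tot)

print('Mean R^2 over {} folds: {:.3f}'.format(k, np.mean(fold_scores)))
```

Every row is used for validation exactly once, which is what makes the averaged score a fairer estimate than the training score.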

In the following picture, we can see visually how cross-validation works:

In [None]:
Image("FOLD_CROSS-VALIDATION.PNG")

Let's see how the K-fold technique works with our random forest regressor model.

In [None]:
from sklearn.model_selection import cross_val_score
#for a regressor, cross_val_score uses the estimator's default score, which is R^2
scores = cross_val_score(estimator = rf_regressor, X = X_train, y = y_train, cv = 10)
print("Mean CV Score (R^2): {:.2f} %".format(scores.mean()*100))
print("Standard Deviation: {:.2f} %".format(scores.std()*100))

We got a mean score ($R^2$) of 85.15% applying the K-fold cross-validation technique.

Let's check how to improve the result with another popular technique that builds on cross-validation, the *Grid-Search technique*.

### 7.3 Applying Grid-Search technique

We would like to optimize our random forest model by tuning some hyper-parameters to get better performance. To do so, we will first run a random search to find the ranges of hyper-parameter values that give our model a good score. Then, we will apply the "Grid-Search" method to find the best parameters for our regression model.

Before we start tuning the hyper-parameters, we need to check which parameters we are using now.

In [None]:
from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf_regressor.get_params())

We will try adjusting the following set of hyperparameters:
* __n_estimators__ = number of trees in the forest
* __max_features__ = max number of features considered for splitting a node
* __max_depth__ = max number of levels in each decision tree
* __min_samples_split__ = min number of data points placed in a node before the node is split
* __min_samples_leaf__ = min number of data points allowed in a leaf node
* __bootstrap__ = method for sampling data points (with or without replacement)

#### 7.3.1 Creating a parameter grid

We are going to start our analysis by applying the **Random Hyper-parameters Grid technique**. The benefit of a random search is that we do not try every combination, but instead sample at random from a wide range of values.

In order to apply the Random Hyper-parameters Grid technique, we have to use the **RandomizedSearchCV class**. So, we first need to create a parameter grid to sample from during fitting.

In [None]:
## Creating a parameter grid

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# (1.0 = all features; it replaces 'auto', which recent scikit-learn versions no longer accept)
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

Comments =>> On each iteration, the algorithm will choose a different combination of these hyper-parameter values.
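
To see why sampling is attractive here, the sketch below counts how many combinations a grid of this size contains and draws 100 of them at random. It uses scikit-learn's `ParameterGrid` and `ParameterSampler`; the dictionary `demo_grid` is a stand-in written out for this example, with candidate values mirroring the random grid above (1.0 for `max_features` means "use all features").

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Stand-in grid with candidate values mirroring the random grid above
demo_grid = {'n_estimators': list(range(200, 2001, 200)),
             'max_features': [1.0, 'sqrt'],
             'max_depth': list(range(10, 111, 10)) + [None],
             'min_samples_split': [2, 5, 10],
             'min_samples_leaf': [1, 2, 4],
             'bootstrap': [True, False]}

# An exhaustive search would have to try every combination
n_total = len(ParameterGrid(demo_grid))

# A random search draws only n_iter of them
sampled = list(ParameterSampler(demo_grid, n_iter=100, random_state=0))

print('Combinations in the full grid:', n_total)
print('Combinations tried by the random search:', len(sampled))
print('Example candidate:', sampled[0])
```

Each sampled candidate is still fitted once per cross-validation fold, so the total number of model fits is the number of candidates times the number of folds.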

#### 7.3.2 Random Search Training

We will use the random grid to search for the best-performing values of the hyper-parameters of the random forest regression model.

In [None]:
## Use the random grid to search for best hyperparameters

# First create the base model to tune
rf_bm = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Random search of parameters, using 5-fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf_bm, param_distributions = random_grid, n_iter = 100, cv =5, verbose=2, 
                               random_state=0, n_jobs = -1)

# Fit the random search model
rf_random.fit(X_train, y_train)

Comments =>> The most important arguments in RandomizedSearchCV are n_iter, which controls the number of different combinations to try, and cv, which is the number of folds to use for cross-validation. In our case, we used a total of 100 iterations and 5 folds. In addition, we noticed that the run time increases with the number of folds chosen, but more folds allow us to reduce the risk of overfitting.

Now, let's check the best parameters from fitting the random search.

In [None]:
#check the best random-search parameters
rf_random.best_params_

#### 7.3.3 Evaluate the Random Search

To determine if random search got a better model, we compute the Score(R2) and MSE metrics of the rf_random model. Then, we compare the results of the random search model with the base model.

In [None]:
#Compute Score (𝑅2) for the rf_random and y_training
print('Training Score: {}'.format(rf_random.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for the rf_random  and y_training
print('Training MSE: {}'.format(np.mean((rf_random.predict(X_train) - y_train)**2)))

We can further improve our results by using grid search to focus on the most promising hyperparameter ranges found in the random search.

#### 7.3.4 Initiate the grid search model

Once we are clear about the ranges of values that can fit our model to achieve a good score, we are ready to apply the grid-search technique to find the best parameters for our final predictive regression model.

In [None]:
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search 
param_grid = [{
    'bootstrap': [False],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [1.0, 2, 3],  # a float is read as a fraction of the samples, so 1.0 means 100%, not 1 sample
    'n_estimators': [1000, 1200, 1400, 1600]}]

# Create a base model
rf_gs = RandomForestRegressor(n_estimators = 10, random_state = 0)

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf_gs, param_grid = param_grid, 
                          cv = 5, n_jobs = -1)

#### 7.3.5 Fitting the grid search to the data

In [None]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

In [None]:
#check the best grid-search score (mean cross-validated R^2 of the best candidate)
best_accuracy = grid_search.best_score_
#check the best grid-search parameters
best_parameters = grid_search.best_params_
print("Best CV Score (R^2): {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Comments =>> The results obtained with the Grid-search method are very similar to those obtained with the K-Fold cross-validation technique (86.30 versus 85.15, respectively).

#### 7.3.6 Fitting the final Random Forest Regressor Model

Finally, we proceed to build and fit the final Random Forest Regressor Model keeping in mind the best parameters values that the grid-search has provided us.

In [None]:
#Fitting the final Random Forest Regressor Model

regressor_finalmodel = RandomForestRegressor(n_estimators = 1400, random_state = 0, bootstrap = False, max_depth = 80,
                                             max_features = 3, min_samples_leaf = 1, min_samples_split = 2)

regressor_finalmodel.fit(X_train, y_train)

In [None]:
#Compute Score (𝑅2) for the regressor_finalmodel and y_training
print('Training Score: {}'.format(regressor_finalmodel.score(X_train, y_train)))
#Compute MSE (Mean Squared Error) for the regressor_finalmodel  and y_training
print('Training MSE: {}'.format(np.mean((regressor_finalmodel.predict(X_train) - y_train)**2)))

#### 7.3.7 Predicting the results

In [202]:
#predict the results for the training set
y_pred_3 = regressor_finalmodel.predict(X_train)
np.set_printoptions(precision=2)
print(y_pred_3)

[208500.   181500.   223500.   ... 266500.   142125.   147505.54]


In [1]:
#predict the results for the test set
y_pred_4 = regressor_finalmodel.predict(df_test)
np.set_printoptions(precision=2)
print(y_pred_4)

In [203]:
print(y_train)

0       208500.0
1       181500.0
2       223500.0
3       140000.0
4       250000.0
5       143000.0
6       307000.0
7       200000.0
8       129900.0
9       118000.0
10      129500.0
11      345000.0
12      144000.0
13      279500.0
14      157000.0
15      132000.0
16      149000.0
17       90000.0
18      159000.0
19      139000.0
20      325300.0
21      139400.0
22      230000.0
23      129900.0
24      154000.0
25      256300.0
26      134800.0
27      306000.0
28      207500.0
29       68500.0
30       40000.0
31      149350.0
32      179900.0
33      165500.0
34      277500.0
35      309000.0
36      145000.0
37      153000.0
38      109000.0
39       82000.0
40      160000.0
41      170000.0
42      144000.0
43      130250.0
44      141000.0
45      319900.0
46      239686.0
47      249700.0
48      113000.0
49      127000.0
50      177000.0
51      114500.0
52      110000.0
53      385000.0
54      130000.0
55      180500.0
56      172500.0
57      196500.0
58      438780

Explain....

In [211]:
#convert the y_pred results into dataframes
y_pred_3 = pd.DataFrame(y_pred_3)
y_pred_4 = pd.DataFrame(y_pred_4)

In [213]:
#export the y_pred_3 and y_pred_4 to csv
y_pred_3.to_csv('y_pred_rf_train.csv', index=False)
y_pred_4.to_csv('y_pred_rf_test.csv', index=False)  

#### 7.3.8 Computing metrics for the random forest model

In [244]:
import sklearn.metrics as metrics
#Compute the R-squared score (R^2) for y_train and y_pred_3
print("R-squared =", metrics.r2_score(y_train, y_pred_3))
#Compute the MSE (Mean Squared Error) for y_train and y_pred_3
print("MSE =", metrics.mean_squared_error(y_train, y_pred_3))

R-squared = 0.9999959716067569
MSE = 19508.861070310475


In [209]:
from sklearn.metrics import mean_squared_error
from math import sqrt

#Compute the Root-Mean-Squared-Error (RMSE) for y_train and y_pred_3
print("RMSE =", sqrt(mean_squared_error(y_train, y_pred_3)))

RMSE = 139.67412455537524


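__Observations:__ The Root-Mean-Squared-Error (RMSE) is always non-negative, and a value of 0 (almost never achieved in practice) would indicate a perfect fit to the data. In general, a lower RMSE is better than a higher one. Note that the value above is computed on the training set (y_train versus y_pred_3), so it measures how closely the model fits the data it was trained on rather than how well it predicts unseen houses.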
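
As a small illustration of how the MSE and the RMSE relate, here is a minimal sketch on made-up toy values (not the notebook's data). The names `y_true`, `y_hat` and `rmse` are assumptions introduced only for this example; the MSE line mirrors the manual computation used earlier for the training MSE.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root-mean-squared error: the square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    mse = np.mean((y_hat - y_true) ** 2)
    return np.sqrt(mse)

# Hypothetical toy prices: every prediction is off by exactly 10,000
y_true = np.array([200000.0, 150000.0, 300000.0, 120000.0])
y_hat = np.array([210000.0, 140000.0, 290000.0, 130000.0])

toy_mse = np.mean((y_hat - y_true) ** 2)
toy_rmse = rmse(y_true, y_hat)
perfect_rmse = rmse(y_true, y_true)

print("MSE =", toy_mse)            # 100000000.0
print("RMSE =", toy_rmse)          # 10000.0
print("Perfect fit RMSE =", perfect_rmse)  # 0.0
```

Because the RMSE is expressed in the same units as the target (here, the sale price), it is usually easier to interpret than the MSE.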

#### 7.3.9 Create a submission in the Kaggle competition

The last step in our analysis is to merge the "Id" column of the test set with the predicted results (y_pred_4) in order to make a submission to the Kaggle competition. Kaggle requires the submission file to contain 1459 prediction rows plus a header row with the "Id" and "SalePrice" columns.

So, we are going to combine this information into one dataframe and submit it to the Kaggle Leaderboard.

Let's do it!

In [241]:
#Create the "Id_test" dataframe holding the Id column of the test set
Id_test = df_test_prelim[['Id']]
#Merge the "Id_test" dataframe with the y_pred_4 results, column-wise
rf_final_results = pd.concat([Id_test, y_pred_4], axis=1)
rf_final_results.head()

|   | Id   | 0        |
|---|------|----------|
| 0 | 1461 | 176000.0 |
| 1 | 1462 | 158000.0 |
| 2 | 1463 | 190000.0 |
| 3 | 1464 | 215000.0 |
| 4 | 1465 | 201000.0 |


In [238]:
#check the shape of the rf_final_results dataframe
rf_final_results.shape

(1459, 2)

In [242]:
#rename the columns of the  rf_final_results dataframe
rf_final_results.columns = ['Id', 'SalePrice']
rf_final_results.columns

Index(['Id', 'SalePrice'], dtype='object')

In [243]:
#export the rf_final_results dataframe to csv
rf_final_results.to_csv('rf_final_results.csv', index=False)  
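
The submission steps above can also be sketched as a single helper that checks the result before export. This is a minimal sketch, not the notebook's actual code: `build_submission` and its `expected_rows` parameter are hypothetical names, and the toy ids and prices are only illustrative. One reason for building the frame from plain arrays is that `pd.concat(..., axis=1)` aligns rows on the index, so two inputs with different indexes would produce rows containing NaN.

```python
import numpy as np
import pandas as pd

def build_submission(ids, predictions, expected_rows=1459):
    """Hypothetical helper: pair ids with predictions and validate the result."""
    # Drop any existing index so rows are paired purely by position
    ids = pd.Series(ids).reset_index(drop=True)
    # Accept a list, array, Series or single-column DataFrame of predictions
    preds = np.asarray(predictions, dtype=float).ravel()

    if len(ids) != len(preds):
        raise ValueError("ids and predictions must have the same length")

    submission = pd.DataFrame({"Id": ids, "SalePrice": preds})

    if expected_rows is not None and len(submission) != expected_rows:
        raise ValueError(
            "expected {} rows, got {}".format(expected_rows, len(submission))
        )
    if submission["SalePrice"].isna().any():
        raise ValueError("predictions contain missing values")
    return submission

# Toy usage with three rows (expected_rows is overridden for the demo)
demo = build_submission(
    [1461, 1462, 1463],
    [176000.0, 158000.0, 190000.0],
    expected_rows=3,
)
print(demo)
```

With the notebook's own objects, a call such as `build_submission(df_test_prelim['Id'], y_pred_4)` followed by `to_csv('rf_final_results.csv', index=False)` would play the same role as the cells above, assuming `y_pred_4` holds one prediction per test row.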

To conclude our analysis, we have summarized the results and the errors of the prediction models (logistic regression, random forest and artificial neural network).

You can find the summary at the following link:

[Outcomes Models Summary](https://onedrive.live.com/edit.aspx?cid=ed1967779d009305&page=view&resid=ED1967779D009305!383&parId=ED1967779D009305!331&app=PowerPoint)

__End of analysis.__

__Thanks for reading!!__