# Project 2: Ames Housing Data Analysis and Modelling

## Problem Statement

Using the information in the Ames Housing Dataset, is it possible to **predict the housing sale prices for houses in Ames, IA, USA (by employing different Machine Learning techniques)**? If yes, **how accurately can these housing sale prices be predicted?**

## Executive Summary

According to this [page](https://nycdatascience.com/blog/student-works/machine-learning/machine-learning-project-ames-housing-dataset/), the Ames Housing Dataset contains observations of housing sales in Ames, Iowa, USA between 2006 and 2010. There are 23 nominal, 23 ordinal, 14 discrete, and 20 continuous features describing each house's size, quality, area, age, and other miscellaneous attributes. In this project, **I seek to apply different machine learning techniques to predict the sale price of houses based on their features**.

For ease of organization and understanding, the project has been **divided into 3 separate Jupyter notebooks**. The first two notebooks prepare the train and test datasets, respectively, for the machine learning techniques applied later. This involves data cleaning, converting ordinal variables to numerical scales, and one-hot encoding the nominal variables. The third notebook then focuses on EDA, different methods of feature selection, modelling and evaluation.

In order to predict the housing sale prices, I employ different methods of feature selection and machine learning techniques. Some of the techniques/models employed in this project are **Linear Regression, Regularization (Lasso, Ridge & ElasticNet), Recursive Feature Elimination with Cross-Validation (RFECV), GridSearchCV and Sequential Feature Selector**. Each model is evaluated using metrics such as **R^2 score, Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)** to gauge how well it may generalize to new data.
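As a rough illustration of the three metrics named above, the sketch below computes them with NumPy on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical values chosen for the example, not results from this project.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 190000.0])

residuals = y_true - y_pred

# Mean Squared Error: average of the squared residuals.
mse = np.mean(residuals ** 2)

# Root Mean Squared Error: square root of MSE, in the same units as the target.
rmse = np.sqrt(mse)

# R^2: 1 minus the ratio of the residual sum of squares to the total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
```

RMSE is often the easiest of the three to interpret here because it is expressed in the same units as the sale price.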

The predictions of housing sale prices from these models may be very relevant in the real world. For instance, these predictions may be used by home owners, buyers, sellers and realtors to gauge how much a house may be worth given its characteristics (features). The predictions may also be used by the local government for tax assessment purposes or by financial institutions to determine loan and mortgage rates. Although the predictions modelled in this project may be localized to the city of Ames, they may be applied to other cities (with caution) to understand the differences, before refining the models for such cities.

### Contents:
- Jupyter Notebook 1 - ***1_data_cleaning_train_csv.ipynb***
    - Preparation of train.csv
        - Data Import & Cleaning
        - Data Dictionary
        - One-hot Encoding of Nominal Variables
- Jupyter Notebook 2 - ***2_data_cleaning_test_csv.ipynb***
    - [Preparation of test.csv](#Preparation-of-test.csv)
        - [Data Import & Cleaning](#Data-Import-&-Cleaning)
        - [Data Dictionary](#Data-Dictionary)
        - [One-hot Encoding of Nominal Variables](#One-hot-Encoding-of-Nominal-Variables)
        - [Matching columns between train & test datasets](#Matching-columns-between-train-&-test-datasets)
- Jupyter Notebook 3 - ***3_EDA_feature_selection_model_evaluation.ipynb***
    - Data Dictionary & Data Importing
    - Exploratory Data Analysis
        - EDA of Continuous Variables
        - EDA of Discrete Variables
        - EDA of Ordinal Variables
        - EDA of Nominal Variables
        - Correlations between Selected Variables
        - Imputation of Missing Data
    - Feature Selection, Model Building & Evaluation
        - Model Preparation
        - RFECV with LinearRegression
        - RidgeCV followed by Ridge
        - LassoCV followed by Lasso
        - GridSearchCV with ElasticNet
        - Sequential Forward Selection with Linear Regression
    - Conclusions and Recommendations

In [1]:
# Imports

import pandas as pd
import numpy as np

## Preparation of *test.csv*

### Data Import & Cleaning

In [2]:
# Read test.csv and assign it to a dataframe 'test'.
# keep_default_na=False is used because, per the data description, 'NA' is a meaningful category in a few object columns.
# We need to prevent these 'NA' values from being wrongly read as empty NaN values by default.
# Only truly missing values (blank in the CSV file) should be identified as NaN values.

test = pd.read_csv('../datasets/test.csv', keep_default_na=False)

# We also read test.csv without keep_default_na=False to get the list of all numeric columns from it.
# This list will be later used to convert columns to their correct data types.

temp_test = pd.read_csv('../datasets/test.csv')
numeric_cols = list(temp_test.select_dtypes(include='number').columns)
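The effect of `keep_default_na=False` can be seen on a tiny in-memory CSV. This is only a sketch: `csv_text`, `default_read` and `literal_read` are hypothetical names, and the two rows are invented to mimic an 'NA' category next to a genuinely blank cell.

```python
import io
import pandas as pd

# Hypothetical two-row CSV: 'NA' is a real category, the blank cell is truly missing.
csv_text = "Id,Alley,Lot Frontage\n1,NA,69\n2,Grvl,\n"

default_read = pd.read_csv(io.StringIO(csv_text))
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Default parsing turns both the 'NA' label and the blank cell into NaN.
print(default_read.isna().sum().to_dict())

# With keep_default_na=False, 'NA' survives as text and the blank cell becomes ''.
print(literal_read.loc[0, 'Alley'], repr(literal_read.loc[1, 'Lot Frontage']))

# No NaN values remain, and 'Lot Frontage' is now stored as text rather than numbers.
print(literal_read.isna().sum().sum())
```

This is why the notebook reads the file twice: once to preserve the 'NA' labels, and once with default parsing to find out which columns are numeric.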

In [3]:
numeric_cols

['Id',
 'PID',
 'MS SubClass',
 'Lot Frontage',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Yr Blt',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold']

In [4]:
test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,...,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,...,0,185,0,,,,0,7,2009,WD


In [5]:
test.tail()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
874,1662,527377110,60,RL,80,8000,Pave,,Reg,Lvl,...,0,0,0,,,,0,11,2007,WD
875,1234,535126140,60,RL,90,14670,Pave,,Reg,Lvl,...,0,0,0,,MnPrv,,0,8,2008,WD
876,1373,904100040,20,RL,55,8250,Pave,,Reg,Lvl,...,0,0,0,,,,0,8,2008,WD
877,1672,527425140,20,RL,60,9000,Pave,,Reg,Lvl,...,0,0,0,,GdWo,,0,5,2007,WD
878,1939,535327160,20,RL,70,8400,Pave,,Reg,Lvl,...,0,0,0,,GdWo,,0,3,2007,WD


In [6]:
test.shape

(879, 80)

In [7]:
test.info()

# Notice that some of the columns in list 'numeric_cols' are shown below with dtype object, instead of int/float.
# We also see that reading the CSV file with keep_default_na=False doesn't give us ANY NaN (null/missing) values in the entire dataframe.

# This is because it reads the missing values in the CSV as '' (empty string) instead.
# As a result, any int/float column that has at least one missing value is read as object type.
# So, we need to replace all '' with np.nan in all the numerical columns ('numeric_cols' defined earlier), and convert the remaining values to type float.
# We also need to replace all '' with np.nan in all non-numerical columns.
# Then we will proceed with data cleaning as usual, while also checking the types and values in each column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 879 entries, 0 to 878
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Id               879 non-null    int64 
 1   PID              879 non-null    int64 
 2   MS SubClass      879 non-null    int64 
 3   MS Zoning        879 non-null    object
 4   Lot Frontage     879 non-null    object
 5   Lot Area         879 non-null    int64 
 6   Street           879 non-null    object
 7   Alley            879 non-null    object
 8   Lot Shape        879 non-null    object
 9   Land Contour     879 non-null    object
 10  Utilities        879 non-null    object
 11  Lot Config       879 non-null    object
 12  Land Slope       879 non-null    object
 13  Neighborhood     879 non-null    object
 14  Condition 1      879 non-null    object
 15  Condition 2      879 non-null    object
 16  Bldg Type        879 non-null    object
 17  House Style      879 non-null    ob

In [8]:
# Mapping a lambda function for all numerical columns to convert all '' (empty string) to NaN and all other values to type float.

for col in numeric_cols:
    test[col] = test[col].map(lambda x : np.nan if x=='' else float(x))

In [9]:
# Mapping a lambda function for all non-numerical columns to convert all '' (empty string) to NaN.

for col in test.columns:
    if col not in numeric_cols:
        test[col] = test[col].map(lambda x : np.nan if x=='' else x)
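The two loops above could also be written with vectorised pandas calls. The sketch below shows one possible alternative on a made-up frame; `toy`, `toy_numeric_cols` and `other_cols` are hypothetical names. Note one behavioural difference: `errors='coerce'` silently turns any unparseable string into NaN, whereas `float(x)` would raise an error on it.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame read with keep_default_na=False.
toy = pd.DataFrame({
    'Lot Frontage': ['69', '', '58'],
    'Alley': ['Grvl', 'NA', ''],
})
toy_numeric_cols = ['Lot Frontage']

# Numeric columns: '' cannot be parsed as a number, so it ends up as NaN.
for col in toy_numeric_cols:
    toy[col] = pd.to_numeric(toy[col], errors='coerce').astype(float)

# Remaining columns: swap only the empty string for NaN, keeping 'NA' as text.
other_cols = [c for c in toy.columns if c not in toy_numeric_cols]
toy[other_cols] = toy[other_cols].replace('', np.nan)

print(toy)
```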

In [10]:
# Checking data types of columns again.
# Also notice that now some columns have a few NaN values.

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 879 entries, 0 to 878
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               879 non-null    float64
 1   PID              879 non-null    float64
 2   MS SubClass      879 non-null    float64
 3   MS Zoning        879 non-null    object 
 4   Lot Frontage     719 non-null    float64
 5   Lot Area         879 non-null    float64
 6   Street           879 non-null    object 
 7   Alley            879 non-null    object 
 8   Lot Shape        879 non-null    object 
 9   Land Contour     879 non-null    object 
 10  Utilities        879 non-null    object 
 11  Lot Config       879 non-null    object 
 12  Land Slope       879 non-null    object 
 13  Neighborhood     879 non-null    object 
 14  Condition 1      879 non-null    object 
 15  Condition 2      879 non-null    object 
 16  Bldg Type        879 non-null    object 
 17  House Style     

In [11]:
test.dtypes.unique()

array([dtype('float64'), dtype('O')], dtype=object)

In [12]:
# Checking for all np.nan values in all columns and filtering out columns which have one or more rows with np.nan values.
# This is the true number of missing values in the dataframe.

null_cols = test.isna().sum()[test.isna().sum()!=0]
null_cols

Lot Frontage     160
Mas Vnr Type       1
Mas Vnr Area       1
Electrical         1
Garage Yr Blt     45
Garage Finish      1
dtype: int64
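A missing-value count is often easier to judge next to the share of rows it represents. The following sketch builds such a summary on a made-up frame; `toy`, `missing_count` and `summary` are hypothetical names and the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cleaned frame.
toy = pd.DataFrame({
    'Lot Frontage': [69.0, np.nan, 58.0, np.nan],
    'Electrical': ['SBrkr', 'FuseA', np.nan, 'SBrkr'],
    'Lot Area': [9142.0, 9662.0, 17104.0, 8520.0],
})

# Count missing values once, then keep only the columns that have any.
missing_count = toy.isna().sum()
summary = pd.DataFrame({
    'missing': missing_count,
    'percent': (missing_count / len(toy) * 100).round(1),
})
summary = summary[summary['missing'] > 0].sort_values('missing', ascending=False)

print(summary)
```

On the real data, the 160 missing 'Lot Frontage' values out of 879 rows come to roughly 18.2%.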

In [13]:
# Checking all unique values of 'MS SubClass' column.

sorted(test['MS SubClass'].unique())

[20.0,
 30.0,
 40.0,
 45.0,
 50.0,
 60.0,
 70.0,
 75.0,
 80.0,
 85.0,
 90.0,
 120.0,
 160.0,
 180.0,
 190.0]

In [14]:
# Checking all unique values of 'MS Zoning' column.

sorted(test['MS Zoning'].unique())

['C (all)', 'FV', 'I (all)', 'RH', 'RL', 'RM']

In [15]:
# Filtering all rows with missing values in 'Lot Frontage' column.
# We will deal with data imputation if this column ends up being used in our regression model.

test[test['Lot Frontage'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
1,2718.0,905108090.0,90.0,RL,,9662.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,,,0.0,8.0,2006.0,WD
4,625.0,535105100.0,20.0,RL,,9500.0,Pave,,IR1,Lvl,...,0.0,185.0,0.0,,,,0.0,7.0,2009.0,WD
7,858.0,907202130.0,20.0,RL,,9286.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,,,0.0,10.0,2009.0,WD
13,818.0,906230030.0,90.0,RL,,7976.0,Pave,,Reg,Lvl,...,0.0,0.0,0.0,,,,0.0,10.0,2009.0,WD
20,222.0,905105070.0,20.0,RL,,8246.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,MnPrv,,0.0,5.0,2010.0,WD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
854,2582.0,535301010.0,90.0,RL,,7032.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,,,0.0,12.0,2006.0,WD
860,984.0,923275140.0,20.0,RL,,8780.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,MnPrv,,0.0,3.0,2009.0,WD
867,2271.0,916460020.0,20.0,RL,,7777.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,,,0.0,11.0,2007.0,WD
868,1633.0,527182170.0,160.0,RL,,5062.0,Pave,,IR1,Lvl,...,0.0,0.0,0.0,,,,0.0,9.0,2007.0,WD


In [16]:
# Checking for any values in 'Lot Frontage' column which are negative.

test[test['Lot Frontage']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [17]:
# Checking for any values in 'Lot Area' column which are negative.

test[test['Lot Area']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
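The per-column checks for negative values can also be run over several numeric columns in one pass. This is a minimal sketch on invented data; `toy`, `toy_numeric_cols` and `negative_counts` are hypothetical names, and the `-5.0` is planted so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns; -5.0 is a planted bad value.
toy = pd.DataFrame({
    'Lot Frontage': [69.0, np.nan, 58.0],
    'Lot Area': [9142.0, 9662.0, 17104.0],
    'Mas Vnr Area': [0.0, -5.0, 120.0],
})
toy_numeric_cols = ['Lot Frontage', 'Lot Area', 'Mas Vnr Area']

# Comparisons with NaN evaluate to False, so missing values are not flagged.
negative_counts = (toy[toy_numeric_cols] < 0).sum()

# Show only the columns that contain at least one negative value.
print(negative_counts[negative_counts > 0])
```

An empty result from the final line would correspond to the empty dataframes returned by the individual checks in this notebook.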


In [18]:
# Checking all unique values of 'Street' column.

sorted(test['Street'].unique())

['Grvl', 'Pave']

In [19]:
# Checking all unique values of 'Alley' column.

sorted(test['Alley'].unique())

['Grvl', 'NA', 'Pave']

In [20]:
# Checking all unique values of 'Lot Shape' column.

sorted(test['Lot Shape'].unique())

['IR1', 'IR2', 'IR3', 'Reg']

In [21]:
# Converting ordinal values in 'Lot Shape' column to discrete values of type float.

ordinal_lot_shape = {'Reg':4.0, 'IR1':3.0, 'IR2':2.0, 'IR3':1.0}

test['Lot Shape'] = test['Lot Shape'].map(ordinal_lot_shape)

In [22]:
# Checking type of 'Lot Shape' column again to verify successful conversion.

test['Lot Shape'].dtypes

dtype('float64')

In [23]:
# Checking all unique values of 'Lot Shape' column.

sorted(test['Lot Shape'].unique())

[1.0, 2.0, 3.0, 4.0]
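One caveat with `Series.map` and a dictionary is that any label missing from the dictionary becomes NaN without raising an error, which is why the unique values are re-checked after each conversion. The sketch below shows that behaviour on a made-up column; `lot_shape` and the label `'IR4'` are hypothetical.

```python
import pandas as pd

ordinal_lot_shape = {'Reg': 4.0, 'IR1': 3.0, 'IR2': 2.0, 'IR3': 1.0}

# Hypothetical column with a typo ('IR4') that the mapping does not cover.
lot_shape = pd.Series(['Reg', 'IR1', 'IR4', 'IR3'])

# Series.map returns NaN for any value missing from the dictionary.
mapped = lot_shape.map(ordinal_lot_shape)

# Compare against the raw column to list the labels that were lost.
unmapped = sorted(lot_shape[mapped.isna() & lot_shape.notna()].unique())
print(unmapped)
```

Running a check like this before overwriting the original column would make it possible to spot unexpected labels while the raw text is still available.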

In [24]:
# Checking all unique values of 'Land Contour' column.

sorted(test['Land Contour'].unique())

['Bnk', 'HLS', 'Low', 'Lvl']

In [25]:
# Checking all unique values of 'Utilities' column.

sorted(test['Utilities'].unique())

['AllPub', 'NoSewr']

In [26]:
# Converting ordinal values in 'Utilities' column to discrete values of type float.

ordinal_utilities = {'AllPub':4.0, 'NoSewr':3.0, 'NoSeWa':2.0, 'ELO':1.0}
 
test['Utilities'] = test['Utilities'].map(ordinal_utilities)

In [27]:
# Checking type of 'Utilities' column again to verify successful conversion.

test['Utilities'].dtypes

dtype('float64')

In [28]:
# Checking all unique values of 'Utilities' column.

sorted(test['Utilities'].unique())

[3.0, 4.0]

In [29]:
# Checking all unique values of 'Lot Config' column.

sorted(test['Lot Config'].unique())

['Corner', 'CulDSac', 'FR2', 'FR3', 'Inside']

In [30]:
# Checking all unique values of 'Land Slope' column.

sorted(test['Land Slope'].unique())

['Gtl', 'Mod', 'Sev']

In [31]:
# Converting ordinal values in 'Land Slope' column to discrete values of type float.

ordinal_land_slope = {'Gtl':3.0, 'Mod':2.0, 'Sev':1.0}

test['Land Slope'] = test['Land Slope'].map(ordinal_land_slope)

In [32]:
# Checking type of 'Land Slope' column again to verify successful conversion.

test['Land Slope'].dtypes

dtype('float64')

In [33]:
# Checking all unique values of 'Land Slope' column.

sorted(test['Land Slope'].unique())

[1.0, 2.0, 3.0]

In [34]:
# Checking all unique values of 'Neighborhood' column.

sorted(test['Neighborhood'].unique())

['Blmngtn',
 'Blueste',
 'BrDale',
 'BrkSide',
 'ClearCr',
 'CollgCr',
 'Crawfor',
 'Edwards',
 'Gilbert',
 'Greens',
 'IDOTRR',
 'MeadowV',
 'Mitchel',
 'NAmes',
 'NPkVill',
 'NWAmes',
 'NoRidge',
 'NridgHt',
 'OldTown',
 'SWISU',
 'Sawyer',
 'SawyerW',
 'Somerst',
 'StoneBr',
 'Timber',
 'Veenker']

In [35]:
# Checking all unique values of 'Condition 1' column.

sorted(test['Condition 1'].unique())

['Artery', 'Feedr', 'Norm', 'PosA', 'PosN', 'RRAe', 'RRAn', 'RRNe', 'RRNn']

In [36]:
# Checking all unique values of 'Condition 2' column.

sorted(test['Condition 2'].unique())

['Feedr', 'Norm', 'PosA', 'PosN']

In [37]:
# Checking all unique values of 'Bldg Type' column.

sorted(test['Bldg Type'].unique())

['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE']

In [38]:
# Checking all unique values of 'House Style' column.

sorted(test['House Style'].unique())

['1.5Fin', '1.5Unf', '1Story', '2.5Fin', '2.5Unf', '2Story', 'SFoyer', 'SLvl']

In [39]:
# Checking all unique values of 'Overall Qual' column.

sorted(test['Overall Qual'].unique())

[2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

In [40]:
# Checking all unique values of 'Overall Cond' column.

sorted(test['Overall Cond'].unique())

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

In [41]:
# Checking for any values in 'Year Built' column which are negative.

test[test['Year Built']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [42]:
# Checking for any values in 'Year Remod/Add' column which are negative.

test[test['Year Remod/Add']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [43]:
# Checking all unique values of 'Roof Style' column.

sorted(test['Roof Style'].unique())

['Flat', 'Gable', 'Gambrel', 'Hip', 'Mansard', 'Shed']

In [44]:
# Checking all unique values of 'Roof Matl' column.

sorted(test['Roof Matl'].unique())

['CompShg', 'Metal', 'Roll', 'Tar&Grv', 'WdShake', 'WdShngl']

In [45]:
# Checking all unique values of 'Exterior 1st' column.

sorted(test['Exterior 1st'].unique())

['AsbShng',
 'AsphShn',
 'BrkComm',
 'BrkFace',
 'CemntBd',
 'HdBoard',
 'MetalSd',
 'Plywood',
 'PreCast',
 'Stucco',
 'VinylSd',
 'Wd Sdng',
 'WdShing']

In [46]:
# Checking all unique values of 'Exterior 2nd' column.

sorted(test['Exterior 2nd'].unique())

['AsbShng',
 'AsphShn',
 'Brk Cmn',
 'BrkFace',
 'CBlock',
 'CmentBd',
 'HdBoard',
 'ImStucc',
 'MetalSd',
 'Other',
 'Plywood',
 'PreCast',
 'Stucco',
 'VinylSd',
 'Wd Sdng',
 'Wd Shng']

In [47]:
# Notice that a few values are misspelled.
# 'Wd Shng' is misspelled. According to data description, it should be 'WdShing'.
# 'Brk Cmn' is misspelled. According to data description, it should be 'BrkComm'.
# 'CmentBd' is misspelled. According to data description, it should be 'CemntBd'.
# Replacing the wrong spellings with the correct ones.

test['Exterior 2nd'] = test['Exterior 2nd'].map(lambda x : 'WdShing' if x=='Wd Shng' else x)
test['Exterior 2nd'] = test['Exterior 2nd'].map(lambda x : 'BrkComm' if x=='Brk Cmn' else x)
test['Exterior 2nd'] = test['Exterior 2nd'].map(lambda x : 'CemntBd' if x=='CmentBd' else x)
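As an alternative to one `map` call per misspelling, the corrections could be collected in a single dictionary and applied with `Series.replace`. This is only a sketch on a made-up sample; `exterior_2nd` and `spelling_fixes` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample of 'Exterior 2nd' values, including the three misspellings.
exterior_2nd = pd.Series(['Wd Shng', 'VinylSd', 'Brk Cmn', 'CmentBd', 'Wd Sdng'])

# One dictionary holds every correction; values not listed are left unchanged.
spelling_fixes = {'Wd Shng': 'WdShing', 'Brk Cmn': 'BrkComm', 'CmentBd': 'CemntBd'}
exterior_2nd = exterior_2nd.replace(spelling_fixes)

print(sorted(exterior_2nd.unique()))
```

Keeping the corrections in one dictionary would also make it easier to apply the same fixes to both the train and test datasets.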

In [48]:
# Checking all unique values of 'Exterior 2nd' column again.

sorted(test['Exterior 2nd'].unique())

['AsbShng',
 'AsphShn',
 'BrkComm',
 'BrkFace',
 'CBlock',
 'CemntBd',
 'HdBoard',
 'ImStucc',
 'MetalSd',
 'Other',
 'Plywood',
 'PreCast',
 'Stucco',
 'VinylSd',
 'Wd Sdng',
 'WdShing']

In [49]:
# Checking all unique values of 'Mas Vnr Type' column.

test['Mas Vnr Type'].unique()

array(['None', 'BrkFace', 'Stone', 'BrkCmn', 'CBlock', nan], dtype=object)

In [50]:
# Checking whether any rows in 'Mas Vnr Type' column still have missing values.
# We will deal with data imputation if 'Mas Vnr Type' column ends up being used in our regression model.

test['Mas Vnr Type'].isna().sum()

1

In [51]:
# Filtering all rows with missing values in 'Mas Vnr Area' column.
# We will deal with data imputation if 'Mas Vnr Area' column ends up being used in our regression model.

test[test['Mas Vnr Area'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
866,868.0,907260030.0,60.0,RL,70.0,8749.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,11.0,2009.0,WD


In [52]:
# Checking for any values in 'Mas Vnr Area' column which are negative.

test[test['Mas Vnr Area']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [53]:
# Checking all unique values of 'Exter Qual' column.

test['Exter Qual'].unique()

array(['TA', 'Gd', 'Fa', 'Ex'], dtype=object)

In [54]:
# Checking all unique values of 'Exter Cond' column.

test['Exter Cond'].unique()

array(['Fa', 'TA', 'Gd', 'Ex', 'Po'], dtype=object)

In [55]:
# Converting ordinal values in 'Exter Qual' & 'Exter Cond' columns to discrete values of type float.

ordinal_exter = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0}

test['Exter Qual'] = test['Exter Qual'].map(ordinal_exter)

test['Exter Cond'] = test['Exter Cond'].map(ordinal_exter)

In [56]:
# Checking type of 'Exter Qual' column again to verify successful conversion.

test['Exter Qual'].dtypes

dtype('float64')

In [57]:
# Checking type of 'Exter Cond' column again to verify successful conversion.

test['Exter Cond'].dtypes

dtype('float64')

In [58]:
# Checking all unique values of 'Exter Qual' column.

sorted(test['Exter Qual'].unique())

[2.0, 3.0, 4.0, 5.0]

In [59]:
# Checking all unique values of 'Exter Cond' column.

sorted(test['Exter Cond'].unique())

[1.0, 2.0, 3.0, 4.0, 5.0]
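Several columns in this dataset share the same Ex/Gd/TA/Fa/Po scale, so the conversion and its verification could be wrapped in a loop. The sketch below assumes a made-up frame; `toy`, `ordinal_quality` and `quality_cols` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for two columns that share the Ex/Gd/TA/Fa/Po scale.
toy = pd.DataFrame({
    'Exter Qual': ['TA', 'Gd', 'Ex'],
    'Exter Cond': ['Fa', 'TA', 'Po'],
})

ordinal_quality = {'Ex': 5.0, 'Gd': 4.0, 'TA': 3.0, 'Fa': 2.0, 'Po': 1.0}
quality_cols = ['Exter Qual', 'Exter Cond']

for col in quality_cols:
    toy[col] = toy[col].map(ordinal_quality)
    # Fail loudly if a label was missing from the dictionary and became NaN.
    assert toy[col].notna().all(), f"unmapped label in {col!r}"

print(toy.dtypes)
```

The assertion is only appropriate for columns that had no missing values before mapping; a column with genuine NaN entries would need a looser check.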

In [60]:
# Checking all unique values of 'Foundation' column.

sorted(test['Foundation'].unique())

['BrkTil', 'CBlock', 'PConc', 'Slab', 'Stone', 'Wood']

In [61]:
# Checking all unique values of 'Bsmt Qual' column.

test['Bsmt Qual'].unique()

array(['Fa', 'Gd', 'TA', 'Ex', 'NA', 'Po'], dtype=object)

In [62]:
# Checking all unique values of 'Bsmt Cond' column.

test['Bsmt Cond'].unique()

array(['TA', 'Gd', 'NA', 'Fa'], dtype=object)

In [63]:
# Converting ordinal values in 'Bsmt Qual' & 'Bsmt Cond' columns to discrete values of type float.

ordinal_bsmt = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0, 'NA':0.0}

test['Bsmt Qual'] = test['Bsmt Qual'].map(ordinal_bsmt)

test['Bsmt Cond'] = test['Bsmt Cond'].map(ordinal_bsmt)

In [64]:
# Checking type of 'Bsmt Qual' column again to verify successful conversion.

test['Bsmt Qual'].dtypes

dtype('float64')

In [65]:
# Checking type of 'Bsmt Cond' column again to verify successful conversion.

test['Bsmt Cond'].dtypes

dtype('float64')

In [66]:
# Checking all unique values of 'Bsmt Qual' column.

sorted(test['Bsmt Qual'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

In [67]:
# Checking all unique values of 'Bsmt Cond' column.

sorted(test['Bsmt Cond'].unique())

[0.0, 2.0, 3.0, 4.0]

In [68]:
# Checking all unique values of 'Bsmt Exposure' column.

test['Bsmt Exposure'].unique()

array(['No', 'Av', 'NA', 'Mn', 'Gd'], dtype=object)

In [69]:
# Converting ordinal values in 'Bsmt Exposure' column to discrete values of type float.

ordinal_bsmt_exposure = {'Gd':4.0, 'Av':3.0, 'Mn':2.0, 'No':1.0, 'NA':0.0}

test['Bsmt Exposure'] = test['Bsmt Exposure'].map(ordinal_bsmt_exposure)

In [70]:
# Checking type of 'Bsmt Exposure' column again to verify successful conversion.

test['Bsmt Exposure'].dtypes

dtype('float64')

In [71]:
# Checking all unique values of 'Bsmt Exposure' column.

sorted(test['Bsmt Exposure'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0]

In [72]:
# Checking all unique values of 'BsmtFin Type 1' column.

test['BsmtFin Type 1'].unique()

array(['Unf', 'GLQ', 'BLQ', 'Rec', 'ALQ', 'NA', 'LwQ'], dtype=object)

In [73]:
# Checking all unique values of 'BsmtFin Type 2' column.

test['BsmtFin Type 2'].unique()

array(['Unf', 'LwQ', 'NA', 'ALQ', 'GLQ', 'Rec', 'BLQ'], dtype=object)

In [74]:
# Converting ordinal values in 'BsmtFin Type 1' & 'BsmtFin Type 2' columns to discrete values of type float.

ordinal_bsmtfin_type = {'GLQ':6.0, 'ALQ':5.0, 'BLQ':4.0, 'Rec':3.0, 'LwQ':2.0, 'Unf':1.0, 'NA':0.0}

test['BsmtFin Type 1'] = test['BsmtFin Type 1'].map(ordinal_bsmtfin_type)

test['BsmtFin Type 2'] = test['BsmtFin Type 2'].map(ordinal_bsmtfin_type)

In [75]:
# Checking type of 'BsmtFin Type 1' column again to verify successful conversion.

test['BsmtFin Type 1'].dtypes

dtype('float64')

In [76]:
# Checking type of 'BsmtFin Type 2' column again to verify successful conversion.

test['BsmtFin Type 2'].dtypes

dtype('float64')

In [77]:
# Checking all unique values of 'BsmtFin Type 1' column.

sorted(test['BsmtFin Type 1'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

In [78]:
# Checking all unique values of 'BsmtFin Type 2' column.

sorted(test['BsmtFin Type 2'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

In [79]:
# Checking for any values in 'BsmtFin SF 1' column which are negative.

test[test['BsmtFin SF 1']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [80]:
# Checking for any values in 'BsmtFin SF 2' column which are negative.

test[test['BsmtFin SF 2']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [81]:
# Checking for any values in 'Bsmt Unf SF' column which are negative.

test[test['Bsmt Unf SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [82]:
# Checking for any values in 'Total Bsmt SF' column which are negative.

test[test['Total Bsmt SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [83]:
# Checking all unique values of 'Heating' column.

sorted(test['Heating'].unique())

['Floor', 'GasA', 'GasW', 'Grav']

In [84]:
# Checking all unique values of 'Heating QC' column.

test['Heating QC'].unique()

array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object)

In [85]:
# Converting ordinal values in 'Heating QC' column to discrete values of type float.

ordinal_heating_qc = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0}

test['Heating QC'] = test['Heating QC'].map(ordinal_heating_qc)

In [86]:
# Checking type of 'Heating QC' column again to verify successful conversion.

test['Heating QC'].dtypes

dtype('float64')

In [87]:
# Checking all unique values of 'Heating QC' column.

sorted(test['Heating QC'].unique())

[2.0, 3.0, 4.0, 5.0]

In [88]:
# Checking all unique values of 'Central Air' column.

test['Central Air'].unique()

array(['N', 'Y'], dtype=object)

In [89]:
# Converting binary values ('Y'/'N') in 'Central Air' column to discrete values (1.0/0.0) of type float.

ordinal_central_air = {'Y':1.0, 'N':0.0}

test['Central Air'] = test['Central Air'].map(ordinal_central_air)

In [90]:
# Checking type of 'Central Air' column again to verify successful conversion.

test['Central Air'].dtypes

dtype('float64')

In [91]:
# Checking all unique values of 'Central Air' column.

sorted(test['Central Air'].unique())

[0.0, 1.0]

In [92]:
# Checking all unique values of 'Electrical' column.

test['Electrical'].unique()

array(['FuseP', 'SBrkr', 'FuseA', 'FuseF', nan], dtype=object)

In [93]:
# Filtering all rows with missing values in 'Electrical' column.
# We will deal with data imputation if 'Electrical' column ends up being used in our regression model.

test[test['Electrical'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
635,1578.0,916386080.0,80.0,RL,73.0,9735.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,5.0,2008.0,WD


In [94]:
# Converting ordinal values in 'Electrical' column to discrete values of type float.

ordinal_electrical = {'SBrkr':5.0, 'FuseA':4.0, 'FuseF':3.0, 'FuseP':2.0, 'Mix':1.0}

test['Electrical'] = test['Electrical'].map(ordinal_electrical)

In [95]:
# Checking type of 'Electrical' column again to verify successful conversion.

test['Electrical'].dtypes

dtype('float64')

In [96]:
# Checking all unique values of 'Electrical' column.

sorted(test['Electrical'].unique())

[2.0, 3.0, 4.0, 5.0, nan]
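Because `nan` compares as neither smaller nor larger than any number, `sorted` gives no guarantee about where it lands when a column still contains missing values. A sketch of an alternative check on made-up data follows; `electrical`, `counts` and `codes` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical mapped 'Electrical' values with one missing entry.
electrical = pd.Series([5.0, 2.0, np.nan, 4.0, 5.0, 3.0])

# dropna=False keeps NaN as its own row; sort_index orders the codes, NaN last.
counts = electrical.value_counts(dropna=False).sort_index()
print(counts)

# Unique non-missing codes in ascending order, without relying on NaN comparisons.
codes = sorted(electrical.dropna().unique())
print(codes)
```

The counts also show how many rows fall into each category, which the list of unique values alone does not reveal.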

In [97]:
# Checking for any values in '1st Flr SF' column which are negative.

test[test['1st Flr SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [98]:
# Checking for any values in '2nd Flr SF' column which are negative.

test[test['2nd Flr SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [99]:
# Checking for any values in 'Low Qual Fin SF' column which are negative.

test[test['Low Qual Fin SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [100]:
# Checking for any values in 'Gr Liv Area' column which are negative.

test[test['Gr Liv Area']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [101]:
# Filtering all rows with missing values in 'Bsmt Full Bath' column.

test[test['Bsmt Full Bath'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [102]:
# Checking for any values in 'Bsmt Full Bath' column which are negative.

test[test['Bsmt Full Bath']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [103]:
# Checking for any values in 'Bsmt Half Bath' column which are negative.

test[test['Bsmt Half Bath']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [104]:
# Checking for any values in 'Full Bath' column which are negative.

test[test['Full Bath']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [105]:
# Checking for any values in 'Half Bath' column which are negative.

test[test['Half Bath']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [106]:
# Checking for any values in 'Bedroom AbvGr' column which are negative.

test[test['Bedroom AbvGr']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [107]:
# Checking for any values in 'Kitchen AbvGr' column which are negative.

test[test['Kitchen AbvGr']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [108]:
# Checking all unique values of 'Kitchen Qual' column.

test['Kitchen Qual'].unique()

array(['Fa', 'TA', 'Gd', 'Ex', 'Po'], dtype=object)

In [109]:
# Converting ordinal values in 'Kitchen Qual' column to discrete values of type float.

ordinal_kitchen_qual = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0}

test['Kitchen Qual'] = test['Kitchen Qual'].map(ordinal_kitchen_qual)

In [110]:
# Checking type of 'Kitchen Qual' column again to verify successful conversion.

test['Kitchen Qual'].dtypes

dtype('float64')

In [111]:
# Checking all unique values of 'Kitchen Qual' column.

sorted(test['Kitchen Qual'].unique())

[1.0, 2.0, 3.0, 4.0, 5.0]

In [112]:
# Checking for any values in 'TotRms AbvGrd' column which are negative.

test[test['TotRms AbvGrd']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [113]:
# Checking all unique values of 'Functional' column.

test['Functional'].unique()

array(['Typ', 'Min2', 'Min1', 'Mod', 'Maj1', 'Maj2'], dtype=object)

In [114]:
# Converting ordinal values in 'Functional' column to discrete values of type float.

ordinal_functional = {'Typ':8.0, 'Min1':7.0, 'Min2':6.0, 'Mod':5.0, 'Maj1':4.0, 'Maj2':3.0, 'Sev':2.0, 'Sal':1.0}

test['Functional'] = test['Functional'].map(ordinal_functional)

In [115]:
# Checking type of 'Functional' column again to verify successful conversion.

test['Functional'].dtypes

dtype('float64')

In [116]:
# Checking all unique values of 'Functional' column.

sorted(test['Functional'].unique())

[3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
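
`Series.map` with a dictionary silently returns NaN for any label that is not a key in the dictionary, so a misspelt or unexpected category would go unnoticed in the dtype check above. The following is a minimal sketch of a guard against that, run on a made-up column rather than the real `test` dataframe; the label `'Typo'` is invented for illustration.

```python
import pandas as pd

# Made-up column; 'Typo' is an invented label that is missing from the mapping.
functional = pd.Series(['Typ', 'Min1', 'Typo', 'Maj2'])

ordinal_functional = {'Typ': 8.0, 'Min1': 7.0, 'Min2': 6.0, 'Mod': 5.0,
                      'Maj1': 4.0, 'Maj2': 3.0, 'Sev': 2.0, 'Sal': 1.0}

mapped = functional.map(ordinal_functional)

# Labels that were present before mapping but turned into NaN afterwards.
unmapped = functional[mapped.isna() & functional.notna()].unique()
print(list(unmapped))
```

An empty list here would suggest that every original label was covered by the mapping.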

In [117]:
# Checking for any values in 'Fireplaces' column which are negative.

test[test['Fireplaces']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [118]:
# Checking all unique values of 'Fireplace Qu' column.

test['Fireplace Qu'].unique()

array(['NA', 'Gd', 'Fa', 'TA', 'Po', 'Ex'], dtype=object)

In [119]:
# Converting ordinal values in 'Fireplace Qu' column to discrete values of type float.

ordinal_fireplace_qu = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0, 'NA':0.0}

test['Fireplace Qu'] = test['Fireplace Qu'].map(ordinal_fireplace_qu)

In [120]:
# Checking type of 'Fireplace Qu' column again to verify successful conversion.

test['Fireplace Qu'].dtypes

dtype('float64')

In [121]:
# Checking all unique values of 'Fireplace Qu' column.

sorted(test['Fireplace Qu'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

In [122]:
# Checking all unique values of 'Garage Type' column.

sorted(test['Garage Type'].unique())

['2Types', 'Attchd', 'Basment', 'BuiltIn', 'CarPort', 'Detchd', 'NA']

In [123]:
# Checking for rows in 'Garage Yr Blt' column which have missing values (NaN).
# We will deal with data imputation if 'Garage Yr Blt' column ends up being used in our regression model.

test[test['Garage Yr Blt'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
29,1904.0,534451020.0,50.0,RL,51.0,3500.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,MnPrv,Shed,2000.0,7.0,2007.0,WD
45,979.0,923228150.0,160.0,RM,21.0,1533.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,5.0,2009.0,WD
66,2362.0,527403120.0,20.0,RL,,8125.0,Pave,,3.0,Lvl,...,0.0,0.0,0.0,,,,0.0,6.0,2006.0,WD
68,2188.0,908226180.0,30.0,RH,70.0,4270.0,Pave,,4.0,Bnk,...,0.0,0.0,0.0,,,,0.0,5.0,2007.0,WD
106,1988.0,902207010.0,30.0,RM,40.0,3880.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,8.0,2007.0,WD
110,217.0,905101300.0,90.0,RL,72.0,10773.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,5.0,2010.0,WD
114,2908.0,923205120.0,20.0,RL,90.0,17217.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,7.0,2006.0,WD
145,1507.0,908250040.0,50.0,RL,57.0,8050.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,4.0,2008.0,WD
153,1368.0,903476110.0,50.0,RM,60.0,5586.0,Pave,,3.0,Bnk,...,0.0,0.0,0.0,,MnPrv,,0.0,9.0,2008.0,ConLD
157,332.0,923228270.0,160.0,RM,21.0,1900.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,,,,0.0,6.0,2010.0,WD


In [124]:
# Checking for any values in 'Garage Yr Blt' column which are negative.

test[test['Garage Yr Blt']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [125]:
# Checking all unique values of 'Garage Finish' column.

test['Garage Finish'].unique()

array(['Unf', 'Fin', 'RFn', 'NA', nan], dtype=object)

In [126]:
# Counting the rows in 'Garage Finish' column which have missing values (NaN).
# We will deal with data imputation if 'Garage Finish' column ends up being used in our regression model.

test['Garage Finish'].isna().sum()

1

In [127]:
# Converting ordinal values in 'Garage Finish' column to discrete values of type float.

ordinal_garage_finish = {'Fin':3.0, 'RFn':2.0, 'Unf':1.0, 'NA':0.0}

test['Garage Finish'] = test['Garage Finish'].map(ordinal_garage_finish)

In [128]:
# Checking type of 'Garage Finish' column again to verify successful conversion.

test['Garage Finish'].dtypes

dtype('float64')

In [129]:
# Checking all unique values of 'Garage Finish' column.

sorted(test['Garage Finish'].unique())

[0.0, 1.0, 2.0, 3.0, nan]

In [130]:
# Checking for rows in 'Garage Cars' column which have missing values (NaN).

test[test['Garage Cars'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [131]:
# Checking for any values in 'Garage Cars' column which are negative.

test[test['Garage Cars']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [132]:
# Checking for rows in 'Garage Area' column which have missing values (NaN).

test[test['Garage Area'].isna()]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [133]:
# Checking for any values in 'Garage Area' column which are negative.

test[test['Garage Area']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [134]:
# Checking all unique values of 'Garage Qual' column.

test['Garage Qual'].unique()

array(['Po', 'TA', 'Fa', 'NA', 'Gd'], dtype=object)

In [135]:
# Checking all unique values of 'Garage Cond' column.

test['Garage Cond'].unique()

array(['Po', 'TA', 'NA', 'Fa', 'Gd', 'Ex'], dtype=object)

In [136]:
# Converting ordinal values in 'Garage Qual' & 'Garage Cond' columns to discrete values of type float.

ordinal_garage = {'Ex':5.0, 'Gd':4.0, 'TA':3.0, 'Fa':2.0, 'Po':1.0, 'NA':0.0}

test['Garage Qual'] = test['Garage Qual'].map(ordinal_garage)

test['Garage Cond'] = test['Garage Cond'].map(ordinal_garage)

In [137]:
# Checking type of 'Garage Qual' column again to verify successful conversion.

test['Garage Qual'].dtypes

dtype('float64')

In [138]:
# Checking type of 'Garage Cond' column again to verify successful conversion.

test['Garage Cond'].dtypes

dtype('float64')

In [139]:
# Checking all unique values of 'Garage Qual' column.

sorted(test['Garage Qual'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0]

In [140]:
# Checking all unique values of 'Garage Cond' column.

sorted(test['Garage Cond'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

In [141]:
# Checking all unique values of 'Paved Drive' column.

test['Paved Drive'].unique()

array(['Y', 'N', 'P'], dtype=object)

In [142]:
# Converting ordinal values in 'Paved Drive' column to discrete values of type float.

ordinal_paved_drive = {'Y':3.0, 'P':2.0, 'N':1.0}

test['Paved Drive'] = test['Paved Drive'].map(ordinal_paved_drive)

In [143]:
# Checking type of 'Paved Drive' column again to verify successful conversion.

test['Paved Drive'].dtypes

dtype('float64')

In [144]:
# Checking all unique values of 'Paved Drive' column.

sorted(test['Paved Drive'].unique())

[1.0, 2.0, 3.0]

In [145]:
# Checking for any values in 'Wood Deck SF' column which are negative.

test[test['Wood Deck SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [146]:
# Checking for any values in 'Open Porch SF' column which are negative.

test[test['Open Porch SF']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [147]:
# Checking for any values in 'Enclosed Porch' column which are negative.

test[test['Enclosed Porch']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [148]:
# Checking for any values in '3Ssn Porch' column which are negative.

test[test['3Ssn Porch']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [149]:
# Checking for any values in 'Screen Porch' column which are negative.

test[test['Screen Porch']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [150]:
# Checking for any values in 'Pool Area' column which are negative.

test[test['Pool Area']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
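
The negative-value checks above are run one column at a time. As a minimal sketch, the same check can be applied to several numeric columns in one pass. The helper name `count_negatives` and the toy values are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the test dataframe (values are made up for illustration).
toy = pd.DataFrame({
    'Full Bath': [1.0, 2.0, 2.0],
    'Half Bath': [0.0, 1.0, 0.0],
    'Pool Area': [0.0, 0.0, -5.0],  # deliberately invalid value
})

def count_negatives(df, cols):
    """Hypothetical helper: number of negative values in each listed column."""
    return (df[cols] < 0).sum()

neg_counts = count_negatives(toy, ['Full Bath', 'Half Bath', 'Pool Area'])

# Show only the columns that contain at least one negative value.
print(neg_counts[neg_counts > 0])
```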


In [151]:
# Checking all unique values of 'Pool QC' column.

test['Pool QC'].unique()

array(['NA', 'Ex', 'TA'], dtype=object)

In [152]:
# Converting ordinal values in 'Pool QC' column to discrete values of type float.

ordinal_pool_qc = {'Ex':4.0, 'Gd':3.0, 'TA':2.0, 'Fa':1.0, 'NA':0.0}

test['Pool QC'] = test['Pool QC'].map(ordinal_pool_qc)

In [153]:
# Checking type of 'Pool QC' column again to verify successful conversion.

test['Pool QC'].dtypes

dtype('float64')

In [154]:
# Checking all unique values of 'Pool QC' column.

sorted(test['Pool QC'].unique())

[0.0, 2.0, 4.0]

In [155]:
# Checking all unique values of 'Fence' column.

test['Fence'].unique()

array(['NA', 'MnPrv', 'GdPrv', 'GdWo', 'MnWw'], dtype=object)

In [156]:
# Converting ordinal values in 'Fence' column to discrete values of type float.

ordinal_fence = {'GdPrv':4.0, 'MnPrv':3.0, 'GdWo':2.0, 'MnWw':1.0, 'NA':0.0}

test['Fence'] = test['Fence'].map(ordinal_fence)

In [157]:
# Checking type of 'Fence' column again to verify successful conversion.

test['Fence'].dtypes

dtype('float64')

In [158]:
# Checking all unique values of 'Fence' column.

sorted(test['Fence'].unique())

[0.0, 1.0, 2.0, 3.0, 4.0]

In [159]:
# Checking all unique values of 'Misc Feature' column.

sorted(test['Misc Feature'].unique())

['Gar2', 'NA', 'Othr', 'Shed']

In [160]:
# Checking for any values in 'Misc Val' column which are negative.

test[test['Misc Val']<0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [161]:
# Checking for any values in 'Mo Sold' column which are <= 0 or >= 13.
# The two conditions are combined with | (or), since no value can satisfy both at once.

test[(test['Mo Sold']<=0) | (test['Mo Sold']>=13)]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
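
A valid month lies between 1 and 12, so an out-of-range value is either below 1 or above 12. As a minimal sketch on made-up values, the same range check can be written with `Series.between`, which includes both end points by default. Note that negating it would also flag missing values, since `between` returns False for NaN.

```python
import pandas as pd

# Made-up month values; 0.0 and 13.0 are deliberately invalid.
months = pd.Series([4.0, 0.0, 12.0, 13.0])

# Keep only the values that fall outside the inclusive range 1..12.
out_of_range = months[~months.between(1, 12)]
print(out_of_range.tolist())
```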


In [162]:
# Checking for any values in 'Yr Sold' column which are <= 0.

test[test['Yr Sold']<=0]

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type


In [163]:
# Checking all unique values of 'Sale Type' column.

sorted(test['Sale Type'].unique())

['COD', 'CWD', 'Con', 'ConLD', 'ConLI', 'ConLw', 'New', 'Oth', 'VWD', 'WD ']

In [164]:
test.shape

(879, 80)

In [165]:
# Counting np.nan values in every column and keeping only the columns which have one or more rows with np.nan values.
# This is the true number of missing values in the dataframe.
# We will do imputation of missing data if any of these columns are chosen to be used in the final regression model.

null_cols = test.isna().sum()[test.isna().sum()!=0]
null_cols

Lot Frontage     160
Mas Vnr Type       1
Mas Vnr Area       1
Electrical         1
Garage Yr Blt     45
Garage Finish      1
dtype: int64
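
The expression above calls `isna().sum()` twice. A minimal sketch of an equivalent version that computes the counts once and sorts them is shown below, on a toy dataframe with made-up values.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values and a few missing entries.
toy = pd.DataFrame({
    'Lot Frontage': [69.0, np.nan, 58.0],
    'Lot Area': [9142.0, 9662.0, 17104.0],
    'Electrical': [5.0, np.nan, np.nan],
})

# Count missing values once, then keep only the columns that have any.
null_counts = toy.isna().sum()
null_cols = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_cols)
```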

In [166]:
# test dataframe should have exact same columns as train_cleaned.csv
# Read train_cleaned.csv and assign it to a dataframe 'train_cleaned'.

train_cleaned = pd.read_csv('../datasets/train_cleaned.csv', keep_default_na=False)

list(train_cleaned.columns)

['Id',
 'PID',
 'MS SubClass',
 'MS Zoning',
 'Lot Frontage',
 'Lot Area',
 'Street',
 'Alley',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Mas Vnr Area',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin SF 1',
 'BsmtFin Type 2',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'Kitchen Qual',
 'TotRms AbvGrd',
 'Functional',
 'Fireplaces',
 'Fireplace Qu',
 'Garage Type',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Are

In [167]:
# Filtering extra columns in test dataframe that do not exist in train_cleaned dataframe.

test_extra_cols = []

for col in list(test.columns):
    if col not in list(train_cleaned.columns):
        test_extra_cols.append(col)

test_extra_cols

# No extra columns exist in test dataframe that do not exist in train_cleaned dataframe.

[]

In [168]:
# Filtering missing columns in test dataframe that exist in train_cleaned dataframe.

test_missing_cols = []

for col in list(train_cleaned.columns):
    if col not in list(test.columns):
        test_missing_cols.append(col)

test_missing_cols

# As expected, only 'SalePrice' column is missing in test dataframe, and it exists in train_cleaned dataframe.

['SalePrice']
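
The two loops above compare the column lists in each direction. As a minimal sketch, the same comparison can be expressed with `Index.difference`; the short column lists here are made up for illustration, and the result comes back sorted rather than in the original column order.

```python
import pandas as pd

# Made-up column sets standing in for train_cleaned.columns and test.columns.
train_cols = pd.Index(['Id', 'Lot Area', 'SalePrice'])
test_cols = pd.Index(['Id', 'Lot Area'])

# Columns present in test but not in train, and vice versa.
extra_in_test = test_cols.difference(train_cols).tolist()
missing_in_test = train_cols.difference(test_cols).tolist()

print(extra_in_test)
print(missing_in_test)
```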

In [169]:
# Ensuring the order of columns in test dataframe is same as train_cleaned dataframe (just for ease of viewing and comparison purposes).

train_cleaned_cols = list(train_cleaned.columns)
train_cleaned_cols.remove('SalePrice')

test = test[train_cleaned_cols]

In [170]:
# Exporting the cleaned test dataframe to a CSV file using a relative path.

test.to_csv("../datasets/test_cleaned.csv", index=False)

### Data Dictionary

A data dictionary was compiled in MS Excel during the data cleaning of train.csv in the previous notebook. It is imported below to make it easier to identify the nominal columns which need to be one-hot encoded.

In [171]:
data_dict = pd.read_csv("../datasets/data_dictionary.csv")

In [172]:
data_dict.head(10)

Unnamed: 0,Column name,Variable type,Col dtype (initial),Col dtype (converted to),Col dtype (final),Needs one-hot encoding,Possible unique values,Unique values in train,Rows of missing data,Description
0,Id,discrete,float,,float,no,,,0,Observation number.
1,PID,nominal,float,,float,no,,,0,Parcel identification number.
2,MS SubClass,nominal,float,,float,yes,16.0,16.0,0,Identifies the type of dwelling involved in th...
3,MS Zoning,nominal,object,,object,yes,8.0,7.0,0,Identifies the general zoning classification o...
4,Lot Frontage,continuous,float,,float,no,,,330,Linear feet of street connected to property.
5,Lot Area,continuous,float,,float,no,,,0,Lot size in square feet.
6,Street,nominal,object,,object,yes,2.0,2.0,0,Type of road access to property.
7,Alley,nominal,object,,object,yes,3.0,3.0,0,Type of alley access to property.
8,Lot Shape,ordinal,object,float,float,no,4.0,4.0,0,"General shape of property. 'Reg' = 4, 'IR1' = ..."
9,Land Contour,nominal,object,,object,yes,4.0,4.0,0,Flatness of the property.


### One-hot Encoding of Nominal Variables

21 nominal columns were identified which need to be one-hot encoded before they can be used as features in machine learning models.

In [173]:
# Filtering columns from data_dict which require one-hot encoding.

data_dict[data_dict['Needs one-hot encoding']=='yes']

Unnamed: 0,Column name,Variable type,Col dtype (initial),Col dtype (converted to),Col dtype (final),Needs one-hot encoding,Possible unique values,Unique values in train,Rows of missing data,Description
2,MS SubClass,nominal,float,,float,yes,16.0,16.0,0,Identifies the type of dwelling involved in th...
3,MS Zoning,nominal,object,,object,yes,8.0,7.0,0,Identifies the general zoning classification o...
6,Street,nominal,object,,object,yes,2.0,2.0,0,Type of road access to property.
7,Alley,nominal,object,,object,yes,3.0,3.0,0,Type of alley access to property.
9,Land Contour,nominal,object,,object,yes,4.0,4.0,0,Flatness of the property.
11,Lot Config,nominal,object,,object,yes,5.0,5.0,0,Lot configuration.
13,Neighborhood,nominal,object,,object,yes,28.0,28.0,0,Physical locations within Ames city limits (ma...
14,Condition 1,nominal,object,,object,yes,9.0,9.0,0,Proximity to various conditions.
15,Condition 2,nominal,object,,object,yes,9.0,8.0,0,Proximity to various conditions (if more than ...
16,Bldg Type,nominal,object,,object,yes,5.0,5.0,0,Type of dwelling.


In [174]:
# Preparing list of columns in test to be one-hot encoded.

test_enc_list = list(data_dict[(data_dict['Needs one-hot encoding']=='yes')]['Column name'])
print(len(test_enc_list))
test_enc_list

21


['MS SubClass',
 'MS Zoning',
 'Street',
 'Alley',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Foundation',
 'Heating',
 'Garage Type',
 'Misc Feature',
 'Sale Type']

In [175]:
test.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658.0,902301120.0,190.0,RM,69.0,9142.0,Pave,Grvl,4.0,Lvl,...,0.0,0.0,0.0,0.0,0.0,,0.0,4.0,2006.0,WD
1,2718.0,905108090.0,90.0,RL,,9662.0,Pave,,3.0,Lvl,...,0.0,0.0,0.0,0.0,0.0,,0.0,8.0,2006.0,WD
2,2414.0,528218130.0,60.0,RL,58.0,17104.0,Pave,,3.0,Lvl,...,0.0,0.0,0.0,0.0,0.0,,0.0,9.0,2006.0,New
3,1989.0,902207150.0,30.0,RM,60.0,8520.0,Pave,,4.0,Lvl,...,0.0,0.0,0.0,0.0,0.0,,0.0,7.0,2007.0,WD
4,625.0,535105100.0,20.0,RL,,9500.0,Pave,,3.0,Lvl,...,0.0,185.0,0.0,0.0,0.0,,0.0,7.0,2009.0,WD


In [176]:
test.shape

(879, 80)

In [177]:
# One-hot encoding the nominal columns in test dataframe.

# We also notice from above that 'Mas Vnr Type' column contains 1 row with a missing value (NaN).
# Since the number of rows with missing values (1) is only about 0.1% of the total number of rows (879), we will ignore this null value while one-hot encoding.
# So, we set dummy_na=False while one-hot encoding.

test = pd.get_dummies(test, columns=test_enc_list, drop_first=True, dummy_na=False)
test.head()

Unnamed: 0,Id,PID,Lot Frontage,Lot Area,Lot Shape,Utilities,Land Slope,Overall Qual,Overall Cond,Year Built,...,Misc Feature_Shed,Sale Type_CWD,Sale Type_Con,Sale Type_ConLD,Sale Type_ConLI,Sale Type_ConLw,Sale Type_New,Sale Type_Oth,Sale Type_VWD,Sale Type_WD
0,2658.0,902301120.0,69.0,9142.0,4.0,4.0,3.0,6.0,8.0,1910.0,...,0,0,0,0,0,0,0,0,0,1
1,2718.0,905108090.0,,9662.0,3.0,4.0,3.0,5.0,4.0,1977.0,...,0,0,0,0,0,0,0,0,0,1
2,2414.0,528218130.0,58.0,17104.0,3.0,4.0,3.0,7.0,5.0,2006.0,...,0,0,0,0,0,0,1,0,0,0
3,1989.0,902207150.0,60.0,8520.0,4.0,4.0,3.0,5.0,6.0,1923.0,...,0,0,0,0,0,0,0,0,0,1
4,625.0,535105100.0,,9500.0,3.0,4.0,3.0,6.0,5.0,1963.0,...,0,0,0,0,0,0,0,0,0,1


In [178]:
test.shape

(879, 202)
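
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy dataframe with made-up rows. Each encoded column is replaced by one indicator column per category, except the first category in sorted order, which becomes the implicit baseline.

```python
import pandas as pd

# Toy dataframe with one numeric and one nominal column (made-up rows).
toy = pd.DataFrame({
    'Lot Area': [9142.0, 9662.0, 17104.0],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# 'Grvl' sorts first, so it is dropped and only 'Street_Pave' remains.
encoded = pd.get_dummies(toy, columns=['Street'], drop_first=True, dummy_na=False)
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns may be stored as integers or as booleans.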

In [179]:
# Counting np.nan values in every column and keeping only the columns which have one or more rows with np.nan values.
# This is the true number of missing values in the dataframe.
# I will consider imputing this missing data if these columns are chosen as features in the regression model.

null_cols = test.isna().sum()[test.isna().sum()!=0]
null_cols

Lot Frontage     160
Mas Vnr Area       1
Electrical         1
Garage Yr Blt     45
Garage Finish      1
dtype: int64

### Matching columns between train & test datasets

In [180]:
# test dataframe should have exact same columns as train_cleaned_encoded.csv now.
# Read train_cleaned_encoded.csv and assign it to a dataframe 'train_cleaned_encoded'.

train_cleaned_encoded = pd.read_csv('../datasets/train_cleaned_encoded.csv', keep_default_na=False)

list(train_cleaned_encoded.columns)

['Id',
 'PID',
 'Lot Frontage',
 'Lot Area',
 'Lot Shape',
 'Utilities',
 'Land Slope',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'Exter Qual',
 'Exter Cond',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin SF 1',
 'BsmtFin Type 2',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 'Heating QC',
 'Central Air',
 'Electrical',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'Kitchen Qual',
 'TotRms AbvGrd',
 'Functional',
 'Fireplaces',
 'Fireplace Qu',
 'Garage Yr Blt',
 'Garage Finish',
 'Garage Cars',
 'Garage Area',
 'Garage Qual',
 'Garage Cond',
 'Paved Drive',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Pool QC',
 'Fence',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'SalePrice',
 'MS SubClass_30.0',
 'MS SubClass_40.0',
 'MS SubClass_45.0

In [181]:
# Filtering extra columns in test dataframe that do not exist in train_cleaned_encoded dataframe.

test_extra_cols = []

for col in list(test.columns):
    if col not in list(train_cleaned_encoded.columns):
        test_extra_cols.append(col)

test_extra_cols

['Roof Matl_Metal',
 'Roof Matl_Roll',
 'Exterior 1st_PreCast',
 'Exterior 2nd_Other',
 'Exterior 2nd_PreCast',
 'Mas Vnr Type_CBlock',
 'Heating_GasA',
 'Sale Type_VWD']

In [182]:
# The above columns need to be dropped from the test dataframe, because they do not exist in the train_cleaned_encoded dataframe.
# Since the regression model is built on train_cleaned_encoded, columns which do not exist there cannot be a part of the model.

test.drop(labels=test_extra_cols, axis=1, inplace=True)

In [183]:
# Filtering missing columns in test dataframe that exist in train_cleaned_encoded dataframe.

test_missing_cols = []

for col in list(train_cleaned_encoded.columns):
    if col not in list(test.columns):
        test_missing_cols.append(col)

test_missing_cols

['SalePrice',
 'MS SubClass_150.0',
 'MS Zoning_C (all)',
 'Neighborhood_GrnHill',
 'Neighborhood_Landmrk',
 'Condition 2_Feedr',
 'Condition 2_RRAe',
 'Condition 2_RRAn',
 'Condition 2_RRNn',
 'Roof Matl_CompShg',
 'Roof Matl_Membran',
 'Exterior 1st_CBlock',
 'Exterior 1st_ImStucc',
 'Exterior 1st_Stone',
 'Exterior 2nd_Stone',
 'Heating_OthW',
 'Heating_Wall',
 'Misc Feature_Gar2',
 'Misc Feature_TenC']
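
One point worth checking here: with `drop_first=True`, the dropped baseline is the first category present in each dataframe, so it can differ when train and test are encoded separately. 'Heating_GasA' appearing only in test and 'Roof Matl_CompShg' appearing only in train are consistent with that. The sketch below uses made-up rows to show the effect, and one possible way (an assumption, not the method used in this notebook) to keep the baseline identical by fixing the category list taken from train.

```python
import pandas as pd

# Made-up rows: train has no 'Floor', test has no 'Wall'.
train_heat = pd.Series(['GasA', 'GasW', 'Wall'], name='Heating')
test_heat = pd.Series(['Floor', 'GasA', 'GasW'], name='Heating')

# Encoded independently, each series drops its own first category.
train_dummies = pd.get_dummies(train_heat, prefix='Heating', drop_first=True)
test_dummies = pd.get_dummies(test_heat, prefix='Heating', drop_first=True)
print(list(train_dummies.columns))  # baseline is 'GasA'
print(list(test_dummies.columns))   # baseline is 'Floor'

# Fixing the categories to those seen in train keeps 'GasA' as the baseline.
cats = sorted(train_heat.unique())
test_fixed = pd.get_dummies(
    test_heat.astype(pd.CategoricalDtype(categories=cats)),
    prefix='Heating', drop_first=True)
print(list(test_fixed.columns))
```

In this sketch a category unseen in train ('Floor') becomes NaN and is encoded as all zeros, which makes it indistinguishable from the baseline; whether that is acceptable depends on the model.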

In [184]:
# Apart from 'SalePrice' (expected), a few columns which exist in the train_cleaned_encoded dataframe are missing from the test dataframe.
# So, the above columns need to be added to the test dataframe.
# Since they exist in the train_cleaned_encoded dataframe, they can possibly be a part of the regression model.
# These columns will all have value 0.

test_missing_cols.remove('SalePrice')

for col in test_missing_cols:
    test[col]=0

In [185]:
# Ensuring the order of columns in test dataframe is same as train_cleaned_encoded dataframe (just for ease of viewing and comparison purposes).

train_cleaned_encoded_cols = list(train_cleaned_encoded.columns)
train_cleaned_encoded_cols.remove('SalePrice')

test = test[train_cleaned_encoded_cols]
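
The three steps above (dropping extra columns, adding missing columns filled with 0, and reordering) can also be sketched as a single `DataFrame.reindex` call. The column names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical feature columns of the encoded train dataframe (without 'SalePrice').
train_cols = ['Id', 'Heating_OthW', 'Sale Type_New']

# Toy test dataframe: lacks 'Heating_OthW' and has an extra 'Sale Type_VWD'.
test_toy = pd.DataFrame({
    'Id': [1, 2],
    'Sale Type_New': [0, 1],
    'Sale Type_VWD': [1, 0],
})

# Drops columns not in train_cols, adds missing ones filled with 0, and reorders.
aligned = test_toy.reindex(columns=train_cols, fill_value=0)
print(aligned)
```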

In [186]:
# Exporting the cleaned & encoded test dataframe to a CSV file using a relative path.

test.to_csv("../datasets/test_cleaned_encoded.csv", index=False)