<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2 Regression Challenge Data Cleaning File

_Authors: Joel Quek (SG)_

## 1. Initialisation of Files

### Import Libraries

In [892]:
import numpy as np
import pandas as pd
# pd.set_option('display.max_columns', 500)
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score
import re

### Instantiate File Paths

In [893]:
sample_sub_file = 'datasets/sample_sub_reg.csv' 
test_file = 'datasets/test.csv'
train_file = 'datasets/train.csv'

In [894]:
sample_sub_df = pd.read_csv(sample_sub_file)
test_df = pd.read_csv(test_file)
train_df = pd.read_csv(train_file)

### Enable all Rows and Columns to be Printed

In [895]:
pd.options.display.max_rows = 200
pd.options.display.max_columns = 200


## 2. Sample Sub Cleaning

In [896]:
sample_sub_df.head()

Unnamed: 0,Id,SalePrice
0,2,181479.1217
1,4,181479.1217
2,6,181479.1217
3,7,181479.1217
4,17,181479.1217


## 3. Train Data Cleaning

### (I) Loading the Data

1. What does the dataframe look like?
2. What are the column names?
3. What is the shape of the dataframe?

In [897]:
pd.options.display.max_columns = None
print(train_df.shape)
train_df.head()

(2051, 81)


Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500


The documentation for the column meanings can be found online:

http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

In [898]:
train_df.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'G

### (II) Data cleaning: Initial check

Check the following in the cells below:
1. Do we have any null values?
2. Are any numerical columns being read in as `object`?
3. What are the basic statistics of the dataset?
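
The second check can be partly automated. Here is a minimal sketch on a small made-up frame rather than `train_df`. The helper name `numeric_looking_object_columns` and the toy columns are assumptions for illustration, not part of the original notebook. It flags `object` columns whose non-null values all parse as numbers.

```python
import pandas as pd

def numeric_looking_object_columns(df):
    """Return object-dtype columns whose non-null values all parse as numbers."""
    flagged = []
    for col in df.select_dtypes(include='object').columns:
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # nothing to judge in an all-null column
        parsed = pd.to_numeric(non_null, errors='coerce')
        if parsed.notna().all():
            flagged.append(col)
    return flagged

# Hypothetical toy data: one numeric column stored as strings.
toy_df = pd.DataFrame({
    'Lot Area': pd.Series(['13517', '11492', '7922'], dtype=object),
    'Street': pd.Series(['Pave', 'Pave', 'Grvl'], dtype=object),
    'SalePrice': [130500, 220000, 109000],
})

suspect_columns = numeric_looking_object_columns(toy_df)
print(suspect_columns)
```

An empty list would suggest that no numerical column was read in as `object`.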

In [899]:
train_df.isnull().sum().sort_values(ascending=False)

Pool QC            2042
Misc Feature       1986
Alley              1911
Fence              1651
Fireplace Qu       1000
Lot Frontage        330
Garage Finish       114
Garage Qual         114
Garage Yr Blt       114
Garage Cond         114
Garage Type         113
Bsmt Exposure        58
BsmtFin Type 2       56
BsmtFin Type 1       55
Bsmt Cond            55
Bsmt Qual            55
Mas Vnr Area         22
Mas Vnr Type         22
Bsmt Half Bath        2
Bsmt Full Bath        2
Garage Area           1
Total Bsmt SF         1
Bsmt Unf SF           1
BsmtFin SF 2          1
BsmtFin SF 1          1
Garage Cars           1
Mo Sold               0
Sale Type             0
Full Bath             0
Half Bath             0
Bedroom AbvGr         0
Kitchen AbvGr         0
Kitchen Qual          0
Yr Sold               0
Misc Val              0
Pool Area             0
Screen Porch          0
TotRms AbvGrd         0
Functional            0
Fireplaces            0
3Ssn Porch            0
Enclosed Porch  

**Null Values Dictionary**

There are too many columns to read at a glance, so we pull out only the columns that contain null values.

In [900]:
null_values = train_df.isnull().sum().sort_values(ascending=False)

null_value_dict = {}

for key_, value_ in null_values.items():
    if value_ > 0:
        null_value_dict[key_]=value_

null_value_dict

{'Pool QC': 2042,
 'Misc Feature': 1986,
 'Alley': 1911,
 'Fence': 1651,
 'Fireplace Qu': 1000,
 'Lot Frontage': 330,
 'Garage Finish': 114,
 'Garage Qual': 114,
 'Garage Yr Blt': 114,
 'Garage Cond': 114,
 'Garage Type': 113,
 'Bsmt Exposure': 58,
 'BsmtFin Type 2': 56,
 'BsmtFin Type 1': 55,
 'Bsmt Cond': 55,
 'Bsmt Qual': 55,
 'Mas Vnr Area': 22,
 'Mas Vnr Type': 22,
 'Bsmt Half Bath': 2,
 'Bsmt Full Bath': 2,
 'Garage Area': 1,
 'Total Bsmt SF': 1,
 'Bsmt Unf SF': 1,
 'BsmtFin SF 2': 1,
 'BsmtFin SF 1': 1,
 'Garage Cars': 1}
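
The same dictionary can also be built without an explicit loop. The sketch below uses a made-up three-column frame (`toy_df` and `toy_null_dict` are illustrative names only) and relies on boolean indexing followed by `Series.to_dict()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in two of the three columns.
toy_df = pd.DataFrame({
    'Alley': [np.nan, np.nan, 'Grvl'],
    'Lot Frontage': [np.nan, 43.0, 68.0],
    'Lot Area': [13517, 11492, 7922],
})

null_counts = toy_df.isnull().sum().sort_values(ascending=False)

# Boolean indexing keeps only the columns with at least one null.
toy_null_dict = null_counts[null_counts > 0].to_dict()
print(toy_null_dict)
```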

In [901]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            140 non-null    object 
 8   Lot Shape        2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  Lot Config       2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

In [902]:
train_df.describe()

Unnamed: 0,Id,PID,MS SubClass,Lot Frontage,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,TotRms AbvGrd,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,SalePrice
count,2051.0,2051.0,2051.0,1721.0,2051.0,2051.0,2051.0,2051.0,2051.0,2029.0,2050.0,2050.0,2050.0,2050.0,2051.0,2051.0,2051.0,2051.0,2049.0,2049.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0,1937.0,2050.0,2050.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0,2051.0
mean,1474.033642,713590000.0,57.008776,69.0552,10065.208191,6.11214,5.562165,1971.708922,1984.190151,99.695909,442.300488,47.959024,567.728293,1057.987805,1164.488055,329.329108,5.512921,1499.330083,0.427526,0.063446,1.577279,0.371039,2.843491,1.042906,6.435885,0.590931,1978.707796,1.776585,473.671707,93.83374,47.556802,22.571916,2.591419,16.511458,2.397855,51.574354,6.219893,2007.775719,181469.701609
std,843.980841,188691800.0,42.824223,23.260653,6742.488909,1.426271,1.104497,30.177889,21.03625,174.963129,461.204124,165.000901,444.954786,449.410704,396.446923,425.671046,51.06887,500.447829,0.522673,0.251705,0.549279,0.501043,0.826618,0.20979,1.560225,0.638516,25.441094,0.764537,215.934561,128.549416,66.747241,59.84511,25.229615,57.374204,37.78257,573.393985,2.744736,1.312014,79258.659352
min,1.0,526301100.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,0.0,0.0,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,1895.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0
25%,753.5,528458100.0,20.0,58.0,7500.0,5.0,5.0,1953.5,1964.5,0.0,0.0,0.0,220.0,793.0,879.5,0.0,0.0,1129.0,0.0,0.0,1.0,0.0,2.0,1.0,5.0,0.0,1961.0,1.0,319.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129825.0
50%,1486.0,535453200.0,50.0,68.0,9430.0,6.0,5.0,1974.0,1993.0,0.0,368.0,0.0,474.5,994.5,1093.0,0.0,0.0,1444.0,0.0,0.0,2.0,0.0,3.0,1.0,6.0,1.0,1980.0,2.0,480.0,0.0,27.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,162500.0
75%,2198.0,907180100.0,70.0,80.0,11513.5,7.0,6.0,2001.0,2004.0,161.0,733.75,0.0,811.0,1318.75,1405.0,692.5,0.0,1728.5,1.0,0.0,2.0,1.0,3.0,1.0,7.0,1.0,2002.0,2.0,576.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,2930.0,924152000.0,190.0,313.0,159000.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1474.0,2336.0,6110.0,5095.0,1862.0,1064.0,5642.0,3.0,2.0,4.0,2.0,8.0,3.0,15.0,4.0,2207.0,5.0,1418.0,1424.0,547.0,432.0,508.0,490.0,800.0,17000.0,12.0,2010.0,611657.0


### (III) Clean Columns with Missing Values

#### Null Value Dictionary

There are 2051 rows in this dataset.

We need to handle the missing values.

We will refer to the documentation at
http://jse.amstat.org/v19n3/decock/DataDocumentation.txt 

In [903]:
null_value_dict

{'Pool QC': 2042,
 'Misc Feature': 1986,
 'Alley': 1911,
 'Fence': 1651,
 'Fireplace Qu': 1000,
 'Lot Frontage': 330,
 'Garage Finish': 114,
 'Garage Qual': 114,
 'Garage Yr Blt': 114,
 'Garage Cond': 114,
 'Garage Type': 113,
 'Bsmt Exposure': 58,
 'BsmtFin Type 2': 56,
 'BsmtFin Type 1': 55,
 'Bsmt Cond': 55,
 'Bsmt Qual': 55,
 'Mas Vnr Area': 22,
 'Mas Vnr Type': 22,
 'Bsmt Half Bath': 2,
 'Bsmt Full Bath': 2,
 'Garage Area': 1,
 'Total Bsmt SF': 1,
 'Bsmt Unf SF': 1,
 'BsmtFin SF 2': 1,
 'BsmtFin SF 1': 1,
 'Garage Cars': 1}

Some columns also contain hidden null values, for example when the nulls are recorded in the form of 'None' or '0'.
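
One way to make such hidden nulls visible is to count the placeholder values next to the explicit nulls. This is a rough sketch on made-up data. Which placeholder applies to which column ('None' for a label column, 0 for an area column) is an assumption that has to be checked against the data documentation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'None' and 0 stand in for "no masonry veneer".
toy_df = pd.DataFrame({
    'Mas Vnr Type': pd.Series(['BrkFace', 'None', 'None', np.nan], dtype=object),
    'Mas Vnr Area': [289.0, 0.0, 0.0, np.nan],
})

explicit_nulls = toy_df.isnull().sum()

# Placeholder values are counted per column; the placeholder itself is assumed.
hidden_nulls = pd.Series({
    'Mas Vnr Type': (toy_df['Mas Vnr Type'] == 'None').sum(),
    'Mas Vnr Area': (toy_df['Mas Vnr Area'] == 0).sum(),
})

summary = pd.DataFrame({'explicit': explicit_nulls, 'hidden': hidden_nulls})
print(summary)
```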

----

#### (a) Drop Columns with Excessive NA Values

'Pool QC': 2042
'Misc Feature': 1986
'Alley': 1911
'Fence': 1651
'Fireplace Qu': 1000

These columns are missing most of their values: four of them are null in **more than half of the 2051 rows**, and Fireplace Qu, at 1000 nulls, is just under half.

I will therefore drop all five columns.

In [904]:
train_df.drop(['Pool QC','Misc Feature','Alley','Fence','Fireplace Qu'], axis=1, inplace=True)
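
The columns to drop could also be selected by a rule instead of by name. The sketch below is only illustrative: the toy frame and the 0.5 cutoff are assumptions, and the notebook itself lists the columns by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two mostly empty columns and one complete column.
toy_df = pd.DataFrame({
    'Pool QC': [np.nan, np.nan, np.nan, 'Gd'],
    'Fence': [np.nan, 'MnPrv', np.nan, np.nan],
    'Lot Area': [13517, 11492, 7922, 9802],
})

threshold = 0.5  # assumed cutoff: drop columns that are more than half null

# isnull().mean() gives the fraction of null rows per column.
null_fraction = toy_df.isnull().mean()
columns_to_drop = null_fraction[null_fraction > threshold].index.tolist()

reduced_df = toy_df.drop(columns=columns_to_drop)
print(columns_to_drop)
```

With a strict 0.5 cutoff, a column such as Fireplace Qu (1000 of 2051 rows null) would not be selected, so the cutoff would need to be slightly lower to reproduce the hand-picked list.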

----

#### (b) Drop NA from Garage Columns


The garage columns contain a moderate number of NA values (at most 114 of 2051 rows).

'Garage Finish': 114
'Garage Qual': 114
'Garage Yr Blt': 114
'Garage Cond': 114
'Garage Type': 113

All of these columns describe the garage. Rows that are missing these values represent houses with **no garage**, so we can drop these rows.

In [905]:
train_df.dropna(subset=['Garage Qual','Garage Finish','Garage Yr Blt','Garage Cond','Garage Type'],axis=0,inplace=True)
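
As a small illustration of how `dropna(subset=...)` behaves, the sketch below uses made-up data. With the default `how='any'`, a row is removed as soon as any one of the listed columns is null, which is why columns with slightly different null counts (114 versus 113 here) can still be handled in a single call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 lacks both values, row 2 lacks only one.
toy_df = pd.DataFrame({
    'Garage Type': ['Attchd', np.nan, 'Detchd'],
    'Garage Qual': ['TA', np.nan, np.nan],
    'Garage Area': [475.0, 0.0, 246.0],
})

garage_cols = ['Garage Type', 'Garage Qual']

# Default how='any': drop a row if any column in the subset is null.
kept = toy_df.dropna(subset=garage_cols)
rows_dropped = len(toy_df) - len(kept)
print(rows_dropped)
```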

---

#### (c) Drop Masonry Veneer Columns


'Mas Vnr Area': 22
'Mas Vnr Type': 22

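These columns contain hidden null values in the form of 'None' and 0.

In [906]:
# Replace only the 'None' label with NaN. Series.map with a one-key dict
# would turn every other label into NaN as well.
train_df['Mas Vnr Type'] = train_df['Mas Vnr Type'].replace('None', np.nan)

In [907]:
# .sum() counts the True values; .count() would return the number of rows.
train_df['Mas Vnr Type'].isnull().sum()

In [908]:
(train_df['Mas Vnr Area']==0).sum()

The median of 'Mas Vnr Area' in the `describe()` output in section (II) is 0, so at least half of the recorded values in the full training set are zero, meaning no masonry veneer. I will therefore remove these two columns.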

In [909]:
train_df.drop(['Mas Vnr Area','Mas Vnr Type'], axis=1, inplace=True)
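
When counting how many rows satisfy a condition, it matters which aggregation is applied to the boolean Series. The toy sketch below (made-up values, illustrative names) contrasts `.count()`, which returns the number of non-null entries whatever the condition, with `.sum()` and `.mean()`, which count and average the `True` values.

```python
import pandas as pd

# Hypothetical veneer areas: three of the five houses have no veneer.
area = pd.Series([289.0, 0.0, 0.0, 132.0, 0.0])
is_zero = area == 0

n_rows = is_zero.count()     # number of entries, True or False
n_zero = is_zero.sum()       # number of True entries
share_zero = is_zero.mean()  # fraction of True entries

print(n_rows, n_zero, share_zero)
```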

---

#### (d) Lot Frontage [Imputation of Missing Values]

'Lot Frontage': 330

The missing values in 'Lot Frontage' do not line up with 'Lot Area': there are many rows where Lot Area is given but Lot Frontage is empty. I classify this missing data as MAR (missing at random).

I will fill the NA values with the mean.

In [910]:
train_df['Lot Frontage']=train_df['Lot Frontage'].fillna(train_df['Lot Frontage'].mean())
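
A minimal sketch of the same mean imputation on made-up data follows. Storing the mean in a variable (here the hypothetical `frontage_mean`) makes it possible to reuse the training mean on another frame. Applying it to the test set is an assumption about later steps, not something this notebook shows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the train and test data.
toy_train = pd.DataFrame({'Lot Frontage': [43.0, np.nan, 68.0, 73.0]})
toy_test = pd.DataFrame({'Lot Frontage': [np.nan, 60.0]})

# Series.mean() skips NaN by default, so only observed values are averaged.
frontage_mean = toy_train['Lot Frontage'].mean()

toy_train['Lot Frontage'] = toy_train['Lot Frontage'].fillna(frontage_mean)
toy_test['Lot Frontage'] = toy_test['Lot Frontage'].fillna(frontage_mean)

print(frontage_mean)
```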

----

#### (e) Basement Columns

'Bsmt Exposure': 58
'BsmtFin Type 2': 56
'BsmtFin Type 1': 55
'Bsmt Cond': 55
'Bsmt Qual': 55

All of these columns describe the basement. Rows containing NA values just mean the house has no basement, so we can drop these rows.

In [911]:
train_df.dropna(subset=['Bsmt Exposure','BsmtFin Type 2','BsmtFin Type 1','Bsmt Cond','Bsmt Qual'],axis=0,inplace=True)

In [912]:
train_df.shape

(1887, 74)

----

#### (f) Columns with Very few Missing Values


'Bsmt Half Bath': 2
'Bsmt Full Bath': 2

'Garage Area': 1
'Total Bsmt SF': 1
'Bsmt Unf SF': 1
'BsmtFin SF 2': 1
'BsmtFin SF 1': 1
'Garage Cars': 1

I will just drop those rows with NA in these columns.

In [913]:
train_df.dropna(subset=['Bsmt Half Bath','Bsmt Full Bath','Garage Cars','Garage Area','Bsmt Unf SF','BsmtFin SF 2','Total Bsmt SF','BsmtFin SF 1'],axis=0,inplace=True)

----

#### (g) Drop 'Misc Val'

In [914]:
train_df.drop('Misc Val', axis=1, inplace=True)

### (IV) One-Hot Encode all Remaining Non-Numerical Columns

#### (a) Remaining Columns in Dataframe

In [915]:
train_df.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config',
       'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type',
       'House Style', 'Overall Qual', 'Overall Cond', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Garage Type', 'Garage Yr Blt', 'Garage Finish',
       'Garage Cars', 'Garage Area', 'Garage Qual', 'Gar

In [916]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1887 entries, 0 to 2050
Data columns (total 73 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               1887 non-null   int64  
 1   PID              1887 non-null   int64  
 2   MS SubClass      1887 non-null   int64  
 3   MS Zoning        1887 non-null   object 
 4   Lot Frontage     1887 non-null   float64
 5   Lot Area         1887 non-null   int64  
 6   Street           1887 non-null   object 
 7   Lot Shape        1887 non-null   object 
 8   Land Contour     1887 non-null   object 
 9   Utilities        1887 non-null   object 
 10  Lot Config       1887 non-null   object 
 11  Land Slope       1887 non-null   object 
 12  Neighborhood     1887 non-null   object 
 13  Condition 1      1887 non-null   object 
 14  Condition 2      1887 non-null   object 
 15  Bldg Type        1887 non-null   object 
 16  House Style      1887 non-null   object 
 17  Overall Qual  

The non-numerical columns are:

In [917]:
train_df.select_dtypes(include='object').columns

Index(['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
       'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC',
       'Central Air', 'Electrical', 'Kitchen Qual', 'Functional',
       'Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond',
       'Paved Drive', 'Sale Type'],
      dtype='object')
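
For the nominal columns in this list, one common way to one-hot encode is `pd.get_dummies`. The sketch below runs on a made-up two-column frame. The use of `drop_first=True` is an assumption, chosen to avoid a redundant dummy column in a linear model, and is not taken from this notebook.

```python
import pandas as pd

# Hypothetical toy data: one nominal column and one numeric column.
toy_df = pd.DataFrame({
    'Street': pd.Series(['Pave', 'Grvl', 'Pave'], dtype=object),
    'Lot Area': [13517, 11492, 7922],
})

object_cols = toy_df.select_dtypes(include='object').columns.tolist()

# Each category becomes its own indicator column; drop_first removes the
# first category (in sorted order) so the remaining dummies are not redundant.
encoded_df = pd.get_dummies(toy_df, columns=object_cols, drop_first=True)
print(encoded_df.columns.tolist())
```

Ordered ratings such as the five-point quality scale are usually better served by an ordinal mapping than by dummies.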

#### (b) One-Hot Encoding and Final Cleaning

##### (i) PID Column

PID (Nominal): Parcel identification number. This column seems unnecessary, so I will drop it.

In [918]:
train_df.drop(['PID'], axis=1, inplace=True)

##### (ii) MS SubClass and MS Zoning

MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.	

       020	1-STORY 1946 & NEWER ALL STYLES
       030	1-STORY 1945 & OLDER
       040	1-STORY W/FINISHED ATTIC ALL AGES
       045	1-1/2 STORY - UNFINISHED ALL AGES
       050	1-1/2 STORY FINISHED ALL AGES
       060	2-STORY 1946 & NEWER
       070	2-STORY 1945 & OLDER
       075	2-1/2 STORY ALL AGES
       080	SPLIT OR MULTI-LEVEL
       085	SPLIT FOYER
       090	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES
       
MS Zoning (Nominal): Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density

In [919]:
# Lookup table from zoning code to its description in the data documentation.
MS_ZONING_LABELS = {
    'A': 'Agriculture',
    'C': 'Commercial',
    'FV': 'Floating Village Residential',
    'I': 'Industrial',
    'RH': 'Residential High Density',
    'RL': 'Residential Low Density',
    'RP': 'Residential Low Density Park',
    'RM': 'Residential Medium Density',
}

def ms_zoning(x):
    # Codes that are not in the table are returned unchanged.
    return MS_ZONING_LABELS.get(x, x)

In [920]:
# Lookup table from MS SubClass code to a short dwelling-type label.
MS_SUBCLASS_LABELS = {
    190: 'Family Conversion',
    180: 'PUD Multi',
    160: '2 Storey PUD',
    150: '1.5 Storey PUD',
    120: '1 Storey PUD',
    90: 'Duplex All',
    85: 'Split Foyer',
    80: 'Split',
    75: '2.5 Story All',
    70: '2 Storey 1946 Older',
    60: '2 Storey 1946 Newer',
    50: '1.5 Finished',
    45: '1.5 Unfinished',
    40: '1 Storey W/Finished Attic',
    30: '1 Storey 1945 older',
    20: '1 Storey Newer All',
}

def ms_subclass(x):
    # Codes that are not in the table are returned unchanged.
    return MS_SUBCLASS_LABELS.get(x, x)

In [921]:
train_df['MS Zoning']=train_df['MS Zoning'].apply(ms_zoning)
train_df['MS SubClass']=train_df['MS SubClass'].apply(ms_subclass)


In [922]:
train_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,2 Storey 1946 Newer,Residential Low Density,69.594544,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,3,2010,WD,130500
1,544,2 Storey 1946 Newer,Residential Low Density,43.0,11492,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,4,2009,WD,220000
2,153,1 Storey Newer All,Residential Low Density,68.0,7922,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,1,2010,WD,109000
3,318,2 Storey 1946 Newer,Residential Low Density,73.0,9802,Pave,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,4,2010,WD,174000
4,255,1.5 Finished,Residential Low Density,82.0,14235,Pave,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,3,2010,WD,138500


##### (iii) Lot Frontage 

The Lot Frontage column has already been imputed with the mean in section (III)(d).

##### (iv) 5-Point Rating System

There is a 5-point rating system used in a few columns 
	
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor

I will change these values to ordinals.

In [923]:
def five_point(x):
    if str(x)=='Ex':
        x=5
    elif str(x)=='Gd':
        x=4
    elif str(x)=='TA':
        x=3
    elif str(x)=='Fa':
        x=2 
    elif str(x)=='Po':
        x=1
    elif str(x)=='nan' or str(x)=='None':
        x=0
    return x
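
The same mapping can be expressed as a dictionary passed to `Series.map`. This is a sketch on made-up ratings, and the names `FIVE_POINT_SCALE` and `ordinal` are illustrative. One difference from `five_point` is worth noting: any label missing from the dictionary becomes 0 here, whereas `five_point` returns it unchanged.

```python
import numpy as np
import pandas as pd

# Ordinal codes for the five-point scale; 0 is reserved for missing ratings.
FIVE_POINT_SCALE = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}

# Hypothetical ratings, including one missing value.
ratings = pd.Series(['Gd', 'TA', 'Ex', np.nan, 'Po'], dtype=object)

# map() yields NaN for anything not in the dictionary; fill those with 0.
ordinal = ratings.map(FIVE_POINT_SCALE).fillna(0).astype(int)
print(ordinal.tolist())
```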

##### (v) Exter Qual and Exter Cond [Exterior Quality and Exterior Condition]

In [924]:
train_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,2 Storey 1946 Newer,Residential Low Density,69.594544,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,3,2010,WD,130500
1,544,2 Storey 1946 Newer,Residential Low Density,43.0,11492,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,4,2009,WD,220000
2,153,1 Storey Newer All,Residential Low Density,68.0,7922,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,1,2010,WD,109000
3,318,2 Storey 1946 Newer,Residential Low Density,73.0,9802,Pave,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,4,2010,WD,174000
4,255,1.5 Finished,Residential Low Density,82.0,14235,Pave,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,3,2010,WD,138500


In [925]:
train_df['Exter Qual']=train_df['Exter Qual'].apply(five_point)
train_df['Exter Cond']=train_df['Exter Cond'].apply(five_point)

In [926]:
train_df.rename(columns={'Exter Qual':'Exterior Quality', 'Exter Cond': 'Exterior Condition'}, inplace=True)

In [927]:
train_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exterior Quality,Exterior Condition,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,2 Storey 1946 Newer,Residential Low Density,69.594544,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,4,3,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,3,2010,WD,130500
1,544,2 Storey 1946 Newer,Residential Low Density,43.0,11492,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,4,3,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,4,2009,WD,220000
2,153,1 Storey Newer All,Residential Low Density,68.0,7922,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,3,4,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,1,2010,WD,109000
3,318,2 Storey 1946 Newer,Residential Low Density,73.0,9802,Pave,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,3,3,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,4,2010,WD,174000
4,255,1.5 Finished,Residential Low Density,82.0,14235,Pave,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,3,3,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,3,2010,WD,138500


##### (vi) Overall Qual & Overall Cond [Overall Quality & Overall Condition]

In [928]:
print(type(train_df['Overall Qual'][1]))

<class 'numpy.int64'>


In [929]:
print(type(train_df['Overall Cond'][1]))

<class 'numpy.int64'>


In [930]:
def remove_nan(x):
    if str(x)=='nan':
        x=0
    return int(x)  

In [931]:
train_df['Overall Qual']=train_df['Overall Qual'].apply(remove_nan)
train_df['Overall Cond']=train_df['Overall Cond'].apply(remove_nan)

In [932]:
train_df.rename(columns={'Overall Qual':'Overall Quality', 'Overall Cond':'Overall Condition'}, inplace=True)

In [933]:
train_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Quality,Overall Condition,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exterior Quality,Exterior Condition,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,2 Storey 1946 Newer,Residential Low Density,69.594544,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,4,3,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,3,2010,WD,130500
1,544,2 Storey 1946 Newer,Residential Low Density,43.0,11492,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,4,3,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,4,2009,WD,220000
2,153,1 Storey Newer All,Residential Low Density,68.0,7922,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,3,4,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,1,2010,WD,109000
3,318,2 Storey 1946 Newer,Residential Low Density,73.0,9802,Pave,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,3,3,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,4,2010,WD,174000
4,255,1.5 Finished,Residential Low Density,82.0,14235,Pave,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,3,3,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,3,2010,WD,138500


##### (vii) Final Check of Null Values

In [934]:
train_df.isnull().sum().sort_values(ascending=False)

Id                    0
MS SubClass           0
Functional            0
TotRms AbvGrd         0
Kitchen Qual          0
Kitchen AbvGr         0
Bedroom AbvGr         0
Half Bath             0
Full Bath             0
Bsmt Half Bath        0
Bsmt Full Bath        0
Gr Liv Area           0
Low Qual Fin SF       0
2nd Flr SF            0
1st Flr SF            0
Electrical            0
Central Air           0
Fireplaces            0
Garage Type           0
Garage Yr Blt         0
Enclosed Porch        0
Sale Type             0
Yr Sold               0
Mo Sold               0
Pool Area             0
Screen Porch          0
3Ssn Porch            0
Open Porch SF         0
Garage Finish         0
Wood Deck SF          0
Paved Drive           0
Garage Cond           0
Garage Qual           0
Garage Area           0
Garage Cars           0
Heating QC            0
Heating               0
Total Bsmt SF         0
Lot Config            0
House Style           0
Bldg Type             0
Condition 2     

There are no more null values
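
Scanning the sorted counts by eye works, but the same check can be made explicit. The sketch below uses a hypothetical helper, `columns_with_nulls` (not part of the original notebook), that lists only the columns still containing missing values. It runs on a small toy frame with made-up values rather than on `train_df`.

```python
import numpy as np
import pandas as pd

def columns_with_nulls(df):
    """Return null counts, restricted to columns that have at least one null."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy data standing in for train_df (values are illustrative only).
toy = pd.DataFrame({
    'Lot Area': [13517, 11492, 7922],
    'Lot Frontage': [np.nan, 43.0, 68.0],
    'Street': ['Pave', 'Pave', 'Pave'],
})

remaining = columns_with_nulls(toy)
print(remaining)  # only 'Lot Frontage' is listed

# After imputing the gap, the helper returns an empty Series.
cleaned = toy.fillna({'Lot Frontage': toy['Lot Frontage'].mean()})
assert columns_with_nulls(cleaned).empty
```

An empty result is a stricter confirmation than reading a long (and possibly truncated) printout.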

In [935]:
train_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Quality,Overall Condition,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exterior Quality,Exterior Condition,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,2 Storey 1946 Newer,Residential Low Density,69.594544,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,4,3,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,3,2010,WD,130500
1,544,2 Storey 1946 Newer,Residential Low Density,43.0,11492,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,4,3,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,4,2009,WD,220000
2,153,1 Storey Newer All,Residential Low Density,68.0,7922,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,3,4,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,1,2010,WD,109000
3,318,2 Storey 1946 Newer,Residential Low Density,73.0,9802,Pave,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,3,3,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,4,2010,WD,174000
4,255,1.5 Finished,Residential Low Density,82.0,14235,Pave,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,3,3,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,3,2010,WD,138500


### (V) Data Columns

In [936]:
train_df.columns

Index(['Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street',
       'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope',
       'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type',
       'House Style', 'Overall Quality', 'Overall Condition', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Exterior Quality', 'Exterior Condition', 'Foundation',
       'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
       'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF',
       'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Functional', 'Fireplaces', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Cars', 'Garage Area', 'Gar

### (VI) Categories of Factors

#### Column Header Documentation

http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

In [937]:
train_df.columns

Index(['Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street',
       'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope',
       'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type',
       'House Style', 'Overall Quality', 'Overall Condition', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Exterior Quality', 'Exterior Condition', 'Foundation',
       'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
       'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF',
       'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Functional', 'Fireplaces', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Cars', 'Garage Area', 'Gar

#### Micro Factors Affecting Property Value



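Source: https://www.christinekangproperties.com/post/what-are-the-factors-affecting-singapore-residential-property-market

Micro factors drawn from this article: investment in infrastructure, landscape, unit orientation and views, and design and architectural style.

With these in mind, the columns are grouped into the following categories: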



1. Location and Proximity to Amenities [External Tangible Factors]
'MS SubClass', 'MS Zoning', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'Land Contour', 'Land Slope', 'Street', 'Lot Shape'

2. Quality and Condition of Home
'Overall Quality', 'Overall Condition', 'Exterior Quality', 'Exterior Condition', 'Functional', 'Garage Qual', 'Garage Cond'

3. Parts of the Home
'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Fireplaces', 'Garage Type', 'Garage Finish', 'Garage Cars', 'Paved Drive'

4. Dimensions [Related to Floor Plan]
'1st Flr SF', '2nd Flr SF', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Lot Frontage', 'Lot Area'

5. Utilities and Intangible Factors
'Heating', 'Heating QC', 'Central Air', 'Electrical', 'Utilities'

6. Time Factors
'Mo Sold', 'Yr Sold', 'Year Built', 'Year Remod/Add', 'Garage Yr Blt'

7. Price
'Sale Type', 'SalePrice' ('Misc Val' belongs here conceptually, but it is no longer among the cleaned columns)

### (VII) Factors Groupings Dataframes

I will now create a separate dataframe for each class of factors.

In [938]:
location_proximity = train_df[['MS SubClass', 'MS Zoning', 'Lot Config',  'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'Land Contour' , 'Land Slope', 'Street', 'Lot Shape']]

quality_condition = train_df[['Overall Quality', 'Overall Condition', 'Exterior Quality', 'Exterior Condition', 'Functional', 'Garage Qual','Garage Cond']]

parts_of_home = train_df[['House Style','Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Foundation',  'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
       'BsmtFin Type 2','Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Fireplaces', 'Garage Type', 'Garage Finish', 'Garage Cars', 'Paved Drive']]

dimensions = train_df[['1st Flr SF', '2nd Flr SF','BsmtFin SF 1','BsmtFin SF 2', 'Bsmt Unf SF',
       'Total Bsmt SF',  'Low Qual Fin SF','Gr Liv Area',  'Garage Area', 'Wood Deck SF', 'Open Porch SF',  'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area',  'Lot Frontage', 'Lot Area']]

utilities_intangible = train_df[['Heating', 'Heating QC', 'Central Air', 'Electrical', 'Utilities']]

time_factors_df = train_df[['Mo Sold', 'Yr Sold', 'Year Built','Year Remod/Add', 'Garage Yr Blt']]

price_factors = train_df[['Sale Type', 'SalePrice']]
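
Because the column lists are typed by hand, a column can be left out of every group or listed twice. A minimal sketch of a consistency check follows. The helper `check_groups` and the toy column list are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def check_groups(all_columns, groups):
    """Report columns that no group claims and columns claimed by more than one group."""
    assigned = [col for cols in groups.values() for col in cols]
    unassigned = [c for c in all_columns if c not in assigned]
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    return unassigned, duplicated

# Toy frame with a handful of the real column names (no data needed for this check).
toy = pd.DataFrame(columns=['Id', 'Lot Area', 'Heating', 'Yr Sold', 'SalePrice'])
groups = {
    'dimensions': ['Lot Area'],
    'utilities_intangible': ['Heating'],
    'time_factors': ['Yr Sold'],
    'price_factors': ['SalePrice'],
}

unassigned, duplicated = check_groups(toy.columns, groups)
print(unassigned)  # columns that belong to no group
print(duplicated)  # columns that appear in more than one group
```

Running the same check against the real groupings would show at a glance whether any cleaned column was forgotten.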

### (VIII) Feature Engineering

#### (a) Price Per Square Foot

In [939]:
print(price_factors.shape)
price_factors.info()

(1887, 2)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1887 entries, 0 to 2050
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Sale Type  1887 non-null   object
 1   SalePrice  1887 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 108.8+ KB


In [940]:
print(dimensions.shape)
dimensions.info()

(1887, 17)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1887 entries, 0 to 2050
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   1st Flr SF       1887 non-null   int64  
 1   2nd Flr SF       1887 non-null   int64  
 2   BsmtFin SF 1     1887 non-null   float64
 3   BsmtFin SF 2     1887 non-null   float64
 4   Bsmt Unf SF      1887 non-null   float64
 5   Total Bsmt SF    1887 non-null   float64
 6   Low Qual Fin SF  1887 non-null   int64  
 7   Gr Liv Area      1887 non-null   int64  
 8   Garage Area      1887 non-null   float64
 9   Wood Deck SF     1887 non-null   int64  
 10  Open Porch SF    1887 non-null   int64  
 11  Enclosed Porch   1887 non-null   int64  
 12  3Ssn Porch       1887 non-null   int64  
 13  Screen Porch     1887 non-null   int64  
 14  Pool Area        1887 non-null   int64  
 15  Lot Frontage     1887 non-null   float64
 16  Lot Area         1887 non-null   int64  
dtypes: 

In [941]:
# 'Lot Area' is the lot size, so this is the sale price per square foot of lot.
# assign() returns a new DataFrame, which avoids a SettingWithCopyWarning on the slice of train_df.
price_factors = price_factors.assign(**{'Price Per Sq Ft': price_factors['SalePrice'] / dimensions['Lot Area']})


In [942]:
price_factors.head()

Unnamed: 0,Sale Type,SalePrice,Price Per Sq Ft
0,WD,130500,9.654509
1,WD,220000,19.143752
2,WD,109000,13.759152
3,WD,174000,17.751479
4,WD,138500,9.72954
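
The division above lines rows up by index label, not by position, which matters because the cleaned frames keep the original labels (1887 rows indexed from 0 to 2050). The sketch below reproduces the first two values from the output using the `SalePrice` and `Lot Area` figures shown earlier, and reverses one operand on purpose to show that alignment still holds.

```python
import pandas as pd

# First two rows of the cleaned training data (index labels 0 and 1).
price = pd.DataFrame({'SalePrice': [130500, 220000]}, index=[0, 1])
dims = pd.DataFrame({'Lot Area': [13517, 11492]}, index=[0, 1])

# Reversing the row order does not change the result: pandas aligns on index labels.
reversed_dims = dims.iloc[::-1]
per_sq_ft = price['SalePrice'] / reversed_dims['Lot Area']

print(per_sq_ft.round(6))
```

Since the denominator is the lot size, the figure is a price per square foot of land rather than of living area.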


### (IX) Export New Dataframes to CSV

In [943]:
train_df.to_csv('train_clean.csv')
location_proximity.to_csv('location_proximity.csv')
quality_condition.to_csv('quality_condition.csv')
parts_of_home.to_csv('parts_of_home.csv')
dimensions.to_csv('dimensions.csv')
utilities_intangible.to_csv('utilities_intangible.csv')
time_factors_df.to_csv('time_factors.csv')
price_factors.to_csv('price_factors.csv')
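
One thing to keep in mind when these files are loaded again: `to_csv` writes the index as an unnamed first column by default, and a plain `read_csv` brings it back as a regular column called `Unnamed: 0`. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Index with a gap, similar to the cleaned training data after rows were removed.
df = pd.DataFrame({'SalePrice': [130500, 109000]}, index=[0, 2])

buffer = io.StringIO()
df.to_csv(buffer)            # default index=True writes the index as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the old index returns as a column named 'Unnamed: 0'

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # treat the first column as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index_col=0` when reading (or `index=False` when writing, if the index carries no meaning) keeps the stray column out of downstream models.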

---

## 4. Test Data Cleaning

The documentation for the column meanings can be found online:

http://jse.amstat.org/v19n3/decock/DataDocumentation.txt

In [944]:
print(test_df.shape)
test_df.head()

(878, 80)


Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Mas Vnr Type,Mas Vnr Area,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Fireplace Qu,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,,0.0,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,,9662,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Gd,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,,9500,Pave,,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,BrkFace,247.0,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Gd,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,,,,0,7,2009,WD


In [945]:
test_df.isnull().sum().sort_values(ascending=False)

Pool QC            874
Misc Feature       837
Alley              820
Fence              706
Fireplace Qu       422
Lot Frontage       160
Garage Yr Blt       45
Garage Finish       45
Garage Qual         45
Garage Cond         45
Garage Type         44
BsmtFin Type 1      25
Bsmt Qual           25
Bsmt Cond           25
Bsmt Exposure       25
BsmtFin Type 2      25
Electrical           1
Mas Vnr Type         1
Mas Vnr Area         1
Kitchen AbvGr        0
TotRms AbvGrd        0
Bedroom AbvGr        0
Half Bath            0
Full Bath            0
Bsmt Half Bath       0
Bsmt Full Bath       0
Gr Liv Area          0
Kitchen Qual         0
Id                   0
Functional           0
Fireplaces           0
2nd Flr SF           0
Garage Cars          0
Garage Area          0
Paved Drive          0
Wood Deck SF         0
Open Porch SF        0
Enclosed Porch       0
3Ssn Porch           0
Screen Porch         0
Pool Area            0
Misc Val             0
Mo Sold              0
Yr Sold    

In [946]:
null_values = test_df.isnull().sum().sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
null_value_dict = null_values[null_values > 0].to_dict()

null_value_dict

{'Pool QC': 874,
 'Misc Feature': 837,
 'Alley': 820,
 'Fence': 706,
 'Fireplace Qu': 422,
 'Lot Frontage': 160,
 'Garage Yr Blt': 45,
 'Garage Finish': 45,
 'Garage Qual': 45,
 'Garage Cond': 45,
 'Garage Type': 44,
 'BsmtFin Type 1': 25,
 'Bsmt Qual': 25,
 'Bsmt Cond': 25,
 'Bsmt Exposure': 25,
 'BsmtFin Type 2': 25,
 'Electrical': 1,
 'Mas Vnr Type': 1,
 'Mas Vnr Area': 1}

In [947]:
test_df.drop(['Pool QC','Misc Feature','Alley','Fence','Fireplace Qu'], axis=1, inplace=True)
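
The five columns dropped here are the ones with the most missing values, from 422 to 874 of the 878 test rows (roughly 48% to 99.5%). The same selection can be expressed as a rule. In the sketch below, the helper `sparse_columns` and the 40% threshold are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

def sparse_columns(df, threshold=0.4):
    """Names of columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isnull().mean()
    return missing_share[missing_share > threshold].index.tolist()

# Toy data with made-up values.
toy = pd.DataFrame({
    'Pool QC': [np.nan, np.nan, np.nan, 'Gd'],     # 75% missing
    'Fireplace Qu': [np.nan, 'Gd', np.nan, 'TA'],  # 50% missing
    'Lot Frontage': [69.0, np.nan, 58.0, 60.0],    # 25% missing
})

to_drop = sparse_columns(toy)
trimmed = toy.drop(columns=to_drop)
print(to_drop)
print(list(trimmed.columns))
```

A rule like this makes it easier to apply the identical decision to the training and test sets.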

In [948]:
# test_df.dropna(subset=['Garage Qual','Garage Finish','Garage Yr Blt','Garage Cond','Garage Type'],axis=0,inplace=True)

In [949]:
print(test_df.shape)

(878, 75)


In [950]:
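# Series.map with a dict turns every value that is not a key into NaN, which would blank the whole column;
# replace() converts only the literal string 'None' to NaN and leaves other values untouched.
test_df['Mas Vnr Type'] = test_df['Mas Vnr Type'].replace('None', np.nan)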

In [951]:
test_df.drop(['Mas Vnr Area','Mas Vnr Type'], axis=1, inplace=True)

In [952]:
test_df['Lot Frontage']=test_df['Lot Frontage'].fillna(test_df['Lot Frontage'].mean())
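
This cell fills the gaps with the mean of the test set itself. A common alternative, shown here only as a sketch and not as what the notebook does, is to compute the fill value once on the training data and reuse it for the test data, so both sets are imputed with the same number. The toy frames and their values are made up.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({'Lot Frontage': [60.0, np.nan, 80.0]})
test_toy = pd.DataFrame({'Lot Frontage': [np.nan, 50.0]})

# Learn the fill value from the training data only; mean() skips NaN.
train_mean = train_toy['Lot Frontage'].mean()

train_filled = train_toy.fillna({'Lot Frontage': train_mean})
test_filled = test_toy.fillna({'Lot Frontage': train_mean})

print(train_mean)
print(test_filled)
```

Whether this matters in practice depends on how different the two means are. In this dataset the imputed values visible in the outputs are close (69.59 in the training rows versus 69.55 in the test rows).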

In [953]:
test_df['Bsmt Exposure'].value_counts()

No    567
Av    130
Gd     80
Mn     76
Name: Bsmt Exposure, dtype: int64

In [954]:
# test_df.dropna(subset=['Bsmt Exposure','BsmtFin Type 2','BsmtFin Type 1','Bsmt Cond','Bsmt Qual'],axis=0,inplace=True)

In [955]:
# test_df.dropna(subset=['Bsmt Half Bath','Bsmt Full Bath','Garage Cars','Garage Area','Bsmt Unf SF','BsmtFin SF 2','Total Bsmt SF','BsmtFin SF 1'],axis=0,inplace=True)

In [956]:
test_df.drop('Misc Val', axis=1, inplace=True)

In [957]:
test_df.columns

Index(['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config',
       'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type',
       'House Style', 'Overall Qual', 'Overall Cond', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Garage Type', 'Garage Yr Blt', 'Garage Finish',
       'Garage Cars', 'Garage Area', 'Garage Qual', 'Gar

In [958]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 72 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               878 non-null    int64  
 1   PID              878 non-null    int64  
 2   MS SubClass      878 non-null    int64  
 3   MS Zoning        878 non-null    object 
 4   Lot Frontage     878 non-null    float64
 5   Lot Area         878 non-null    int64  
 6   Street           878 non-null    object 
 7   Lot Shape        878 non-null    object 
 8   Land Contour     878 non-null    object 
 9   Utilities        878 non-null    object 
 10  Lot Config       878 non-null    object 
 11  Land Slope       878 non-null    object 
 12  Neighborhood     878 non-null    object 
 13  Condition 1      878 non-null    object 
 14  Condition 2      878 non-null    object 
 15  Bldg Type        878 non-null    object 
 16  House Style      878 non-null    object 
 17  Overall Qual    

In [959]:
test_df.drop(['PID'], axis=1, inplace=True)

In [960]:
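def ms_zoning(x):
    # Map MS Zoning codes to descriptive labels; values without an entry are returned unchanged.
    # The dummy-variable output later in this section contains a column named 'MS Zoning_I (all)',
    # so at least one raw value ('I (all)') does not match these keys and passes through as-is.
    zoning_labels = {
        'A': 'Agriculture',
        'C': 'Commercial',
        'FV': 'Floating Village Residential',
        'I': 'Industrial',
        'RH': 'Residential High Density',
        'RL': 'Residential Low Density',
        'RP': 'Residential Low Density Park',
        'RM': 'Residential Medium Density',
    }
    return zoning_labels.get(x, x)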

In [961]:
def ms_subclass(x):
    # Map numeric MS SubClass codes to short descriptive labels;
    # codes without an entry are returned unchanged.
    subclass_labels = {
        190: 'Family Conversion',
        180: 'PUD Multi',
        160: '2 Storey PUD',
        150: '1.5 Storey PUD',
        120: '1 Storey PUD',
        90: 'Duplex All',
        85: 'Split Foyer',
        80: 'Split',
        75: '2.5 Story All',
        70: '2 Storey 1946 Older',
        60: '2 Storey 1946 Newer',
        50: '1.5 Finished',
        45: '1.5 Unfinished',
        40: '1 Storey W/Finished Attic',
        30: '1 Storey 1945 older',
        20: '1 Storey Newer All',
    }
    return subclass_labels.get(x, x)

In [962]:
test_df['MS Zoning']=test_df['MS Zoning'].apply(ms_zoning)
test_df['MS SubClass']=test_df['MS SubClass'].apply(ms_subclass)
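
Both mapping functions return any value they do not recognise unchanged, so an unexpected raw code slips through silently. A small sketch of how such values could be surfaced before mapping is given below. The shortened dictionary and the toy series are illustrative, and `'I (all)'` is included because a label of that form shows up in the dummy columns later in this section.

```python
import pandas as pd

# Shortened mapping for illustration; the labels match those used in ms_zoning.
zoning_labels = {
    'FV': 'Floating Village Residential',
    'RL': 'Residential Low Density',
    'RM': 'Residential Medium Density',
}

raw = pd.Series(['RL', 'RM', 'I (all)', 'FV', 'RL'])  # toy values

# Values with no exact key are the ones that would pass through unchanged.
unmapped = sorted(set(raw.dropna()) - set(zoning_labels))
print(unmapped)

# Same pass-through behaviour as the functions above, written with a dict lookup.
mapped = raw.map(lambda code: zoning_labels.get(code, code))
print(mapped.tolist())
```

Printing `unmapped` on the real column would show which raw codes need an extra key or a normalisation step.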

In [963]:
test_df.head()

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exter Qual,Exter Cond,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type
0,2658,Family Conversion,Residential Medium Density,69.0,9142,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,TA,Fa,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,4,2006,WD
1,2718,Duplex All,Residential Low Density,69.545961,9662,Pave,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,TA,TA,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,8,2006,WD
2,2414,2 Storey 1946 Newer,Residential Low Density,58.0,17104,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,Gd,TA,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,9,2006,New
3,1989,1 Storey 1945 older,Residential Medium Density,60.0,8520,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,1Fam,1Story,5,6,1923,2006,Gable,CompShg,Wd Sdng,Wd Sdng,Gd,TA,CBlock,TA,TA,No,Unf,0,Unf,0,968,968,GasA,TA,Y,SBrkr,968,0,0,968,0,0,1,0,2,1,TA,5,Typ,0,Detchd,1935.0,Unf,2,480,Fa,TA,N,0,0,184,0,0,0,7,2007,WD
4,625,1 Storey Newer All,Residential Low Density,69.545961,9500,Pave,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,6,5,1963,1963,Gable,CompShg,Plywood,Plywood,TA,TA,CBlock,Gd,TA,No,BLQ,609,Unf,0,785,1394,GasA,Gd,Y,SBrkr,1394,0,0,1394,1,0,1,1,3,1,TA,6,Typ,2,Attchd,1963.0,RFn,2,514,TA,TA,Y,0,76,0,0,185,0,7,2009,WD


In [964]:
def five_point(x):
    if str(x)=='Ex':
        x=5
    elif str(x)=='Gd':
        x=4
    elif str(x)=='TA':
        x=3
    elif str(x)=='Fa':
        x=2 
    elif str(x)=='Po':
        x=1
    elif str(x)=='nan' or str(x)=='None':
        x=0
    return x

In [965]:
test_df['Exter Qual']=test_df['Exter Qual'].apply(five_point)
test_df['Exter Cond']=test_df['Exter Cond'].apply(five_point)
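
The same Po-to-Ex encoding can also be written with an ordered categorical, which makes the ordering explicit in one place. This is a sketch of an alternative rather than the notebook's method. `five_point_codes` and `QUALITY_ORDER` are hypothetical names, and unlike `five_point`, any unrecognised value (not just NaN or 'None') becomes 0.

```python
import numpy as np
import pandas as pd

QUALITY_ORDER = ['Po', 'Fa', 'TA', 'Gd', 'Ex']  # worst to best

def five_point_codes(series):
    """Encode Po..Ex as 1..5; anything else (including NaN) becomes 0."""
    cat = pd.Categorical(series, categories=QUALITY_ORDER, ordered=True)
    # Categorical codes run 0..4, with -1 for values outside the categories.
    return pd.Series(cat.codes + 1, index=series.index)

exter_qual = pd.Series(['TA', 'Gd', 'Ex', np.nan, 'Po'])  # toy values
encoded = five_point_codes(exter_qual)
print(encoded.tolist())
```

Keeping the order in a single list also makes it easy to reuse the identical scale for every quality column.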

In [966]:
test_df.rename(columns={'Exter Qual':'Exterior Quality', 'Exter Cond': 'Exterior Condition'}, inplace=True)

In [967]:
test_df.head(3)

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,Overall Cond,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exterior Quality,Exterior Condition,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type
0,2658,Family Conversion,Residential Medium Density,69.0,9142,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,3,2,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,4,2006,WD
1,2718,Duplex All,Residential Low Density,69.545961,9662,Pave,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,3,3,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,8,2006,WD
2,2414,2 Storey 1946 Newer,Residential Low Density,58.0,17104,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,4,3,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,9,2006,New


In [968]:
def remove_nan(x):
    if str(x)=='nan':
        x=0
    return int(x)  

In [969]:
test_df['Overall Qual']=test_df['Overall Qual'].apply(remove_nan)
test_df['Overall Cond']=test_df['Overall Cond'].apply(remove_nan)

In [970]:
test_df.rename(columns={'Overall Qual':'Overall Quality', 'Overall Cond':'Overall Condition'}, inplace=True)

In [971]:
test_df.head(3)

Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Quality,Overall Condition,Year Built,Year Remod/Add,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Exterior Quality,Exterior Condition,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,Heating QC,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Paved Drive,Wood Deck SF,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Mo Sold,Yr Sold,Sale Type
0,2658,Family Conversion,Residential Medium Density,69.0,9142,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Norm,Norm,2fmCon,2Story,6,8,1910,1950,Gable,CompShg,AsbShng,AsbShng,3,2,Stone,Fa,TA,No,Unf,0,Unf,0,1020,1020,GasA,Gd,N,FuseP,908,1020,0,1928,0,0,2,0,4,2,Fa,9,Typ,0,Detchd,1910.0,Unf,1,440,Po,Po,Y,0,60,112,0,0,0,4,2006,WD
1,2718,Duplex All,Residential Low Density,69.545961,9662,Pave,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,Norm,Norm,Duplex,1Story,5,4,1977,1977,Gable,CompShg,Plywood,Plywood,3,3,CBlock,Gd,TA,No,Unf,0,Unf,0,1967,1967,GasA,TA,Y,SBrkr,1967,0,0,1967,0,0,2,0,6,2,TA,10,Typ,0,Attchd,1977.0,Fin,2,580,TA,TA,Y,170,0,0,0,0,0,8,2006,WD
2,2414,2 Storey 1946 Newer,Residential Low Density,58.0,17104,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,2Story,7,5,2006,2006,Gable,CompShg,VinylSd,VinylSd,4,3,PConc,Gd,Gd,Av,GLQ,554,Unf,0,100,654,GasA,Ex,Y,SBrkr,664,832,0,1496,1,0,2,1,3,1,Gd,7,Typ,1,Attchd,2006.0,RFn,2,426,TA,TA,Y,100,24,0,0,0,0,9,2006,New


In [972]:
location_proximity = test_df[['Id', 'MS SubClass', 'MS Zoning', 'Lot Config', 'Condition 1', 'Condition 2', 'Bldg Type', 'Land Contour' , 'Land Slope', 'Street', 'Lot Shape']]

quality_condition = test_df[['Overall Quality', 'Overall Condition', 'Exterior Quality', 'Exterior Condition', 'Functional', 'Garage Qual','Garage Cond']]

parts_of_home = test_df[['House Style','Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Foundation',  'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1',
       'BsmtFin Type 2','Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd',
       'Fireplaces', 'Garage Type', 'Garage Finish', 'Garage Cars', 'Paved Drive']]

dimensions = test_df[['1st Flr SF', '2nd Flr SF','BsmtFin SF 1','BsmtFin SF 2', 'Bsmt Unf SF',
       'Total Bsmt SF',  'Low Qual Fin SF','Gr Liv Area',  'Garage Area', 'Wood Deck SF', 'Open Porch SF',  'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area',  'Lot Frontage', 'Lot Area']]

utilities_intangible = test_df[['Heating', 'Heating QC', 'Central Air', 'Electrical', 'Utilities']]

time_factors_df = test_df[['Mo Sold', 'Yr Sold', 'Year Built','Year Remod/Add', 'Garage Yr Blt']]


## Factor 1: Location and Proximity to Amenities

In [973]:
location_proximity = pd.get_dummies(location_proximity,columns=['MS SubClass', 'MS Zoning', 'Lot Config', 'Condition 1',
       'Condition 2', 'Bldg Type', 'Land Contour', 'Land Slope', 'Street',
       'Lot Shape'], drop_first=True)

print(location_proximity.shape)

location_proximity.head(3)

(878, 47)


Unnamed: 0,Id,MS SubClass_1 Storey Newer All,MS SubClass_1 Storey PUD,MS SubClass_1 Storey W/Finished Attic,MS SubClass_1.5 Finished,MS SubClass_1.5 Unfinished,MS SubClass_2 Storey 1946 Newer,MS SubClass_2 Storey 1946 Older,MS SubClass_2 Storey PUD,MS SubClass_2.5 Story All,MS SubClass_Duplex All,MS SubClass_Family Conversion,MS SubClass_PUD Multi,MS SubClass_Split,MS SubClass_Split Foyer,MS Zoning_Floating Village Residential,MS Zoning_I (all),MS Zoning_Residential High Density,MS Zoning_Residential Low Density,MS Zoning_Residential Medium Density,Lot Config_CulDSac,Lot Config_FR2,Lot Config_FR3,Lot Config_Inside,Condition 1_Feedr,Condition 1_Norm,Condition 1_PosA,Condition 1_PosN,Condition 1_RRAe,Condition 1_RRAn,Condition 1_RRNe,Condition 1_RRNn,Condition 2_Norm,Condition 2_PosA,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,Land Contour_HLS,Land Contour_Low,Land Contour_Lvl,Land Slope_Mod,Land Slope_Sev,Street_Pave,Lot Shape_IR2,Lot Shape_IR3,Lot Shape_Reg
0,2658,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,1
1,2718,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0
2,2414,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0
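
The test set is encoded on its own here, so its dummy columns reflect only the categories that occur in the 878 test rows. If the training data contains a category the test data lacks (or the reverse), the two matrices end up with different columns. One way to reconcile them, sketched below on toy data, is to reindex the test dummies to the training columns. The sketch omits `drop_first=True` for simplicity, which is a departure from the cell above.

```python
import pandas as pd

# Toy frames: 'Grvl' and 'IR2' occur only in the training data.
train_toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                          'Lot Shape': ['Reg', 'IR1', 'IR2']})
test_toy = pd.DataFrame({'Street': ['Pave', 'Pave'],
                         'Lot Shape': ['Reg', 'IR1']})

train_dummies = pd.get_dummies(train_toy)
test_dummies = pd.get_dummies(test_toy)

# Give the test matrix exactly the training columns; categories absent from test become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))
print(list(test_aligned.columns))
```

After alignment, a model fitted on the training columns can be applied to the test matrix without column mismatches.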


## Factor 2: Quality and Condition of Home

In [974]:
quality_condition.head()

Unnamed: 0,Overall Quality,Overall Condition,Exterior Quality,Exterior Condition,Functional,Garage Qual,Garage Cond
0,6,8,3,2,Typ,Po,Po
1,5,4,3,3,Typ,TA,TA
2,7,5,4,3,Typ,TA,TA
3,5,6,4,3,Typ,Fa,TA
4,6,5,3,3,Typ,TA,TA


In [975]:
def functional(x):
    if x == 'Typ':
        x=8
    elif x == 'Min1':
        x=7
    elif x=='Min2':
        x=6
    elif x=='Mod':
        x=5
    elif x=='Maj1':
        x=4
    elif x=='Maj2':
        x=3
    elif x=='Sal':
        x=2
    elif x=='Sev':
        x=1
    else:
        x=0
    return x

In [976]:
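def ranking(x):
    # Ordinal scale for the garage ratings: Ex=6 down to Po=2; anything else (including NaN) is 0.
    # Note that this scale is offset by one from five_point (Ex=5 down to Po=1).
    if x=='Ex':
        x=6
    elif x=='Gd':
        x=5
    elif x=='TA':
        x=4
    elif x=='Fa':
        x=3
    elif x=='Po':
        x=2
    else:
        x=0
    return x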

In [977]:
# assign() returns a new DataFrame, so the columns are overwritten without
# triggering a SettingWithCopyWarning on the slice taken from test_df.
quality_condition = quality_condition.assign(**{
    'Garage Qual': quality_condition['Garage Qual'].map(ranking),
    'Garage Cond': quality_condition['Garage Cond'].map(ranking),
    'Functional': quality_condition['Functional'].map(functional),
})

In [978]:
quality_condition.head(3)

Unnamed: 0,Overall Quality,Overall Condition,Exterior Quality,Exterior Condition,Functional,Garage Qual,Garage Cond
0,6,8,3,2,8,2,2
1,5,4,3,3,8,4,4
2,7,5,4,3,8,4,4
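
The `SettingWithCopyWarning` printed by the mapping cell typically appears when a dataframe was created by selecting columns from a larger one and is then modified. The sketch below assumes that is how `quality_condition` was built (that cell is outside this section) and shows that taking an explicit `.copy()` avoids the warning. The frame `full_df` and its values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the full test set (values are illustrative)
full_df = pd.DataFrame({
    'Garage Qual': ['Po', 'TA', 'TA'],
    'Garage Cond': ['Po', 'TA', 'TA'],
    'Lot Area': [9142, 9662, 17104],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments do not raise SettingWithCopyWarning
subset = full_df[['Garage Qual', 'Garage Cond']].copy()

scale = {'Ex': 6, 'Gd': 5, 'TA': 4, 'Fa': 3, 'Po': 2}
subset['Garage Qual'] = subset['Garage Qual'].map(scale)

# The original frame is left untouched
print(full_df['Garage Qual'].tolist())
print(subset['Garage Qual'].tolist())
```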


In [979]:
quality_condition['Overall_Exterior_Quality'] = quality_condition['Overall Quality'] * quality_condition['Exterior Quality']
quality_condition['Overall_Exterior_Condition'] = quality_condition['Overall Condition'] * quality_condition['Exterior Condition']
quality_condition['Garage_Qual_Condition'] = quality_condition['Garage Qual'] * quality_condition['Garage Cond']

In [980]:
quality_condition.head()

Unnamed: 0,Overall Quality,Overall Condition,Exterior Quality,Exterior Condition,Functional,Garage Qual,Garage Cond,Overall_Exterior_Quality,Overall_Exterior_Condition,Garage_Qual_Condition
0,6,8,3,2,8,2,2,18,16,4
1,5,4,3,3,8,4,4,15,12,16
2,7,5,4,3,8,4,4,28,15,16
3,5,6,4,3,8,3,4,20,18,12
4,6,5,3,3,8,4,4,18,15,16


In [981]:
quality_condition=quality_condition[['Overall_Exterior_Quality','Overall_Exterior_Condition','Garage_Qual_Condition']]

## Factor 3

In [982]:
parts_of_home.head(3)

Unnamed: 0,House Style,Roof Style,Roof Matl,Exterior 1st,Exterior 2nd,Foundation,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin Type 2,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath,Bedroom AbvGr,Kitchen AbvGr,Kitchen Qual,TotRms AbvGrd,Fireplaces,Garage Type,Garage Finish,Garage Cars,Paved Drive
0,2Story,Gable,CompShg,AsbShng,AsbShng,Stone,Fa,TA,No,Unf,Unf,0,0,2,0,4,2,Fa,9,0,Detchd,Unf,1,Y
1,1Story,Gable,CompShg,Plywood,Plywood,CBlock,Gd,TA,No,Unf,Unf,0,0,2,0,6,2,TA,10,0,Attchd,Fin,2,Y
2,2Story,Gable,CompShg,VinylSd,VinylSd,PConc,Gd,Gd,Av,GLQ,Unf,1,0,2,1,3,1,Gd,7,1,Attchd,RFn,2,Y


In [983]:
parts_of_home.columns

Index(['House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Bsmt Full Bath', 'Bsmt Half Bath',
       'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'Kitchen Qual', 'TotRms AbvGrd', 'Fireplaces', 'Garage Type',
       'Garage Finish', 'Garage Cars', 'Paved Drive'],
      dtype='object')

In [984]:
parts_of_home = parts_of_home[['Roof Matl', 'Foundation', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Full Bath', 'Kitchen Qual', 'TotRms AbvGrd', 'Paved Drive']]

In [985]:
parts_of_home.head(3)

Unnamed: 0,Roof Matl,Foundation,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin Type 2,Full Bath,Kitchen Qual,TotRms AbvGrd,Paved Drive
0,CompShg,Stone,TA,No,Unf,Unf,2,Fa,9,Y
1,CompShg,CBlock,TA,No,Unf,Unf,2,TA,10,Y
2,CompShg,PConc,Gd,Av,GLQ,Unf,2,Gd,7,Y


In [986]:
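def paved(x):
    # Encode 'Paved Drive': 'Y' (paved) = 2, 'P' (partial pavement) = 1,
    # any other value = 0
    if x == 'Y':
        x = 2
    elif x == 'P':
        x = 1
    else:
        x = 0
    return x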

In [987]:
parts_of_home['Paved Drive'] = parts_of_home['Paved Drive'].map(paved)

In [988]:
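def bsmt_expose(x):
    # Encode 'Bsmt Exposure': 'Gd' = 4, 'Av' = 3, 'Mn' = 2, 'No' = 1,
    # any other value (including missing) = 0
    if x == 'Gd':
        x = 4
    elif x == 'Av':
        x = 3
    elif x == 'Mn':
        x = 2
    elif x == 'No':
        x = 1
    else:
        x = 0
    return x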

In [989]:
parts_of_home['Bsmt Exposure']=parts_of_home['Bsmt Exposure'].map(bsmt_expose)

In [990]:
parts_of_home['Kitchen Qual'] = parts_of_home['Kitchen Qual'].map(ranking)
parts_of_home['Bsmt Cond'] = parts_of_home['Bsmt Cond'].map(ranking)

In [991]:
parts_of_home.head(3)

Unnamed: 0,Roof Matl,Foundation,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin Type 2,Full Bath,Kitchen Qual,TotRms AbvGrd,Paved Drive
0,CompShg,Stone,4,1,Unf,Unf,2,3,9,2
1,CompShg,CBlock,4,1,Unf,Unf,2,4,10,2
2,CompShg,PConc,5,3,GLQ,Unf,2,5,7,2


## Factor 4

In [992]:
utilities_intangible.head(3)

Unnamed: 0,Heating,Heating QC,Central Air,Electrical,Utilities
0,GasA,Gd,N,FuseP,AllPub
1,GasA,TA,Y,SBrkr,AllPub
2,GasA,Ex,Y,SBrkr,AllPub


In [993]:
utilities_intangible['Utilities']=utilities_intangible['Utilities'].map({'AllPub':1, 'NoSeWa':0})


In [994]:
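def electric(x):
    # Encode 'Electrical': 'SBrkr' = 4, 'FuseA' = 3, 'FuseF' = 2, 'FuseP' = 1,
    # any other value (including missing) = 0
    if x == 'SBrkr':
        x = 4
    elif x == 'FuseA':
        x = 3
    elif x == 'FuseF':
        x = 2
    elif x == 'FuseP':
        x = 1
    else:
        x = 0
    return x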

In [995]:
utilities_intangible['Electrical']=utilities_intangible['Electrical'].map(electric)


In [996]:
utilities_intangible['Central Air']=utilities_intangible['Central Air'].map({'Y':1,'N':0})


In [997]:
utilities_intangible['Heating QC']=utilities_intangible['Heating QC'].map(ranking)


In [998]:
utilities_intangible.head(3)

Unnamed: 0,Heating,Heating QC,Central Air,Electrical,Utilities
0,GasA,5,0,1,1.0
1,GasA,4,1,4,1.0
2,GasA,6,1,4,1.0
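
`Utilities` is shown as `1.0` rather than `1`. When `Series.map` is given a dictionary, any label missing from the dictionary becomes `NaN`, which forces a float column, so this may indicate that some `Utilities` values were not covered by the mapping. The sketch below shows one way to list such labels. The toy series is an assumption, and `'NoSewr'` is used only as an example of an unmapped label.

```python
import pandas as pd

# Toy data; 'NoSewr' stands in for any label missing from the mapping
utilities = pd.Series(['AllPub', 'AllPub', 'NoSewr'])
mapping = {'AllPub': 1, 'NoSeWa': 0}

mapped = utilities.map(mapping)

# Labels that were present in the data but absent from the mapping
unmapped = utilities[mapped.isna() & utilities.notna()].unique()

print(mapped.dtype)
print(list(unmapped))
```

Running a check like this before modelling helps catch categories that appear in the test set but were not anticipated when the mapping was written.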


## Factor 5

In [999]:
time_factors_df.head(3)

Unnamed: 0,Mo Sold,Yr Sold,Year Built,Year Remod/Add,Garage Yr Blt
0,4,2006,1910,1950,1910.0
1,8,2006,1977,1977,1977.0
2,9,2006,2006,2006,2006.0


In [1000]:
time_factors_df['Built Age'] = time_factors_df['Yr Sold'] - time_factors_df['Year Built']


In [1001]:
time_factors_df['Remod Age'] = time_factors_df['Yr Sold'] - time_factors_df['Year Remod/Add']


In [1002]:
time_factors_df.head(3)

Unnamed: 0,Mo Sold,Yr Sold,Year Built,Year Remod/Add,Garage Yr Blt,Built Age,Remod Age
0,4,2006,1910,1950,1910.0,96,56
1,8,2006,1977,1977,1977.0,29,29
2,9,2006,2006,2006,2006.0,0,0
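
Age features computed as a difference of years can turn negative if a sale is recorded before the build or remodel year. The sketch below is a hypothetical sanity check on toy data, not a step taken in this notebook: it counts negative ages and clips them to zero.

```python
import pandas as pd

# Toy data; the third row is deliberately sold before its build year
demo = pd.DataFrame({
    'Yr Sold': [2006, 2007, 2007],
    'Year Built': [1910, 1977, 2008],
})

demo['Built Age'] = demo['Yr Sold'] - demo['Year Built']

# Rows where the computed age is negative
negative = demo[demo['Built Age'] < 0]
print(len(negative))

# One possible treatment: clip negative ages to zero
demo['Built Age'] = demo['Built Age'].clip(lower=0)
print(demo['Built Age'].tolist())
```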


In [1003]:
time_factors_df = time_factors_df.drop(['Mo Sold','Yr Sold','Year Built','Year Remod/Add','Garage Yr Blt'], axis=1)

## Merged Dataframe

In [1004]:
test_merge = location_proximity.join(quality_condition, how='outer')

In [1005]:
test_merge=test_merge.join(parts_of_home,how='outer')

In [1006]:
test_merge=test_merge.join(utilities_intangible,how='outer')

In [1007]:
test_merge=test_merge.join(time_factors_df,how='outer')

In [1008]:
test_merge.head()

Unnamed: 0,Id,MS SubClass_1 Storey Newer All,MS SubClass_1 Storey PUD,MS SubClass_1 Storey W/Finished Attic,MS SubClass_1.5 Finished,MS SubClass_1.5 Unfinished,MS SubClass_2 Storey 1946 Newer,MS SubClass_2 Storey 1946 Older,MS SubClass_2 Storey PUD,MS SubClass_2.5 Story All,MS SubClass_Duplex All,MS SubClass_Family Conversion,MS SubClass_PUD Multi,MS SubClass_Split,MS SubClass_Split Foyer,MS Zoning_Floating Village Residential,MS Zoning_I (all),MS Zoning_Residential High Density,MS Zoning_Residential Low Density,MS Zoning_Residential Medium Density,Lot Config_CulDSac,Lot Config_FR2,Lot Config_FR3,Lot Config_Inside,Condition 1_Feedr,Condition 1_Norm,Condition 1_PosA,Condition 1_PosN,Condition 1_RRAe,Condition 1_RRAn,Condition 1_RRNe,Condition 1_RRNn,Condition 2_Norm,Condition 2_PosA,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,Land Contour_HLS,Land Contour_Low,Land Contour_Lvl,Land Slope_Mod,Land Slope_Sev,Street_Pave,Lot Shape_IR2,Lot Shape_IR3,Lot Shape_Reg,Overall_Exterior_Quality,Overall_Exterior_Condition,Garage_Qual_Condition,Roof Matl,Foundation,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin Type 2,Full Bath,Kitchen Qual,TotRms AbvGrd,Paved Drive,Heating,Heating QC,Central Air,Electrical,Utilities,Built Age,Remod Age
0,2658,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,1,18,16,4,CompShg,Stone,4,1,Unf,Unf,2,3,9,2,GasA,5,0,1,1.0,96,56
1,2718,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,15,12,16,CompShg,CBlock,4,1,Unf,Unf,2,4,10,2,GasA,4,1,4,1.0,29,29
2,2414,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,28,15,16,CompShg,PConc,5,3,GLQ,Unf,2,5,7,2,GasA,6,1,4,1.0,0,0
3,1989,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,1,20,18,12,CompShg,CBlock,4,1,Unf,Unf,1,4,5,0,GasA,4,1,4,1.0,84,1
4,625,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,18,15,16,CompShg,CBlock,4,1,BLQ,Unf,1,4,6,2,GasA,5,1,4,1.0,46,46


In [1009]:
test_merge.shape

(878, 67)

The Kaggle submission requires 878 rows, which matches the row count of the merged dataframe above.
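
`DataFrame.join` with `how='outer'` aligns on the index, so the row count is only preserved when every piece shares the same index. A minimal sketch on toy frames (all names and values here are illustrative assumptions) shows both the aligned case and what happens when the indexes differ:

```python
import pandas as pd

left = pd.DataFrame({'Id': [2658, 2718, 2414]})

# Same default index (0, 1, 2): the outer join keeps three rows
aligned = pd.DataFrame({'Built Age': [96, 29, 0]})
merged = left.join(aligned, how='outer')
print(len(merged))

# Shifted index (1, 2, 3): the outer join takes the union of both indexes
shifted = pd.DataFrame({'Built Age': [96, 29, 0]}, index=[1, 2, 3])
misaligned = left.join(shifted, how='outer')
print(len(misaligned))
print(int(misaligned['Built Age'].isna().sum()))
```

Checking `len(merged) == len(left)` after each join is a cheap way to confirm that no rows were added by index misalignment.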

In [1010]:
test_merge2 = test_merge.set_index('Id')

In [1011]:
test_merge.to_csv('test_clean.csv')

In [1012]:
test_merge2.to_csv('test_clean_Id.csv')
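
The two files differ in what ends up as the first column. By default `to_csv` writes the index, so a frame with the default integer index gains an unnamed leading column when read back, while a frame indexed by `Id` writes `Id` first. The sketch below demonstrates this on a toy frame using in-memory buffers; the data is illustrative.

```python
import io
import pandas as pd

demo = pd.DataFrame({'Id': [2658, 2718], 'Built Age': [96, 29]})

# Default: the integer index is written as an unnamed first column
buf = io.StringIO()
demo.to_csv(buf)
buf.seek(0)
with_index_cols = pd.read_csv(buf).columns.tolist()
print(with_index_cols)

# index=False: only the real columns are written
buf = io.StringIO()
demo.to_csv(buf, index=False)
buf.seek(0)
no_index_cols = pd.read_csv(buf).columns.tolist()
print(no_index_cols)

# Indexed by Id: Id is written as the first column
buf = io.StringIO()
demo.set_index('Id').to_csv(buf)
buf.seek(0)
id_index_cols = pd.read_csv(buf).columns.tolist()
print(id_index_cols)
```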