# Housing Prices - Evaluation

### James Mwakichako - jmwakich@hawk.iit.edu
### Michael Baroody  - mbaroody@hawk.iit.edu

### Description 

In [114]:
%matplotlib inline
import pandas as pd
from ipywidgets import widgets
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt

# train DataFrame object
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
train = pd.read_csv("train.csv", header = 0)

Recall that some features had many missing values. The listing below shows every feature with at least one missing value; all other features are complete.

In [115]:
train = pd.read_csv("train.csv", header = 0)

print("Feature \tFraction of Values Missing")
print("------- \t--------------------------")

# there are 19 features that contain missing values 
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count
print((1 - (train.count() / len(train))).sort_values(ascending=False).nlargest(19))

Feature 	Fraction of Values Missing
------- 	--------------------------
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageCond      0.055479
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64


We drop every feature that is missing more than 25% of its values. That excludes 'PoolQC', 'MiscFeature', 'Alley', 'Fence', and 'FireplaceQu' from the training data.

In [116]:
# drop the five features with more than 25% missing values
train = train.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'])
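
The 25% rule above can also be applied programmatically instead of listing the columns by hand. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `threshold` name are assumptions for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],       # 75% missing
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],      # 25% missing
    "SalePrice": [208500, 181500, 223500, 140000],  # 0% missing
})

threshold = 0.25  # drop features with MORE than 25% missing values

# isna().mean() gives the fraction of missing values in each column
missing_fraction = toy.isna().mean()

# keep only the column names whose missing fraction exceeds the threshold
to_drop = missing_fraction[missing_fraction > threshold].index
toy = toy.drop(columns=to_drop)
```

Because the comparison is strict, a column that is missing exactly 25% of its values is kept.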

We will fill in the missing values for the rest of the features. For numerical features, we fill in the missing values with the mean of that column. For example, the mean of all known 'LotFrontage' values is around 70. Therefore, every 'NaN' value in the 'LotFrontage' column is replaced with that mean.

In [119]:
# fill in the missing numerical features with the means
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html#pandas.DataFrame.mean
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html#pandas.Series.fillna
# numeric_only=True restricts the means to the numerical columns
means = train.mean(numeric_only=True, skipna=True)
for feature, mean in means.items():
    train[feature] = train[feature].fillna(value=mean)
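
To see what mean imputation does on a small scale, here is a minimal sketch on a made-up three-row frame (the `toy` data is an assumption for illustration). It passes the whole Series of means to `DataFrame.fillna`, which fills each column named in the Series index and leaves other columns alone:

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical column (made-up values).
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageQual": ["TA", None, "Gd"],
})

# numeric_only=True restricts the means to numerical columns
means = toy.mean(numeric_only=True)

# fillna with a Series fills only the columns that appear in its index
toy = toy.fillna(means)
```

Here the known 'LotFrontage' values are 60 and 80, so the missing entry becomes 70, while the missing 'GarageQual' entry is untouched until the categorical step.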

For the categorical features, we fill in the missing values with the mode (the most frequent category) of that column. For example, the most frequent 'GarageQual' value is 'TA'. Therefore, every 'NaN' value in the 'GarageQual' column is replaced with 'TA'.

In [120]:
# fill in the missing categorical values with the modes
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html#pandas.DataFrame.mode
train = train.fillna(train.mode().iloc[0])
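
The one-liner above relies on `DataFrame.mode()` returning a frame whose first row holds one mode per column. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration) makes the intermediate step visible:

```python
import pandas as pd

# Toy frame with missing categorical values (made-up values).
toy = pd.DataFrame({
    "GarageQual": ["TA", "TA", None, "Gd"],
    "Electrical": ["SBrkr", None, "SBrkr", "FuseA"],
})

# mode() ignores missing values by default; row 0 holds one mode per column
modes = toy.mode().iloc[0]

# fill each column's missing entries with that column's mode
toy = toy.fillna(modes)
```

If several categories tie for most frequent, row 0 contains only one of them, so the choice among tied categories is arbitrary with respect to the data.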

Now we must deal with the categorical variables. We do this by encoding the categorical feature labels with numbers, using sklearn.preprocessing.
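
As a rough illustration of this encoding step, the sketch below uses `LabelEncoder` from `sklearn.preprocessing` on a made-up frame. The choice of `LabelEncoder`, the `toy` data, and the `encoders` dictionary are assumptions for illustration; the text above only names the `sklearn.preprocessing` module.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one categorical and one numerical column (made-up values).
toy = pd.DataFrame({
    "GarageQual": ["TA", "Gd", "TA", "Fa"],
    "LotFrontage": [65.0, 80.0, 70.0, 60.0],
})

encoders = {}  # keep each fitted encoder so the mapping can be inverted later

# encode every non-numerical column; numerical columns are left unchanged
for column in toy.select_dtypes(exclude="number").columns:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder
```

`LabelEncoder` assigns integers in sorted order of the labels, so the numbers carry no quality ranking. The scikit-learn documentation describes it as intended for target values; `OrdinalEncoder` is the counterpart designed for feature columns.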