# Housing Prices - Evaluation

### James Mwakichako - jmwakich@hawk.iit.edu
### Michael Baroody  - mbaroody@hawk.iit.edu

### Preprocessing

Before we are able to fit our model, we have to take care of missing values and categorical features. 

In [21]:
%matplotlib inline
import pandas as pd
from ipywidgets import widgets
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, linear_model, model_selection, neural_network

# train DataFrame object
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
train = pd.read_csv("train.csv", header = 0)

Recall that we had some features with many missing values. Below are all the features that have some missing values. All the other features have all values filled in. 

In [22]:
print("Feature \tProportion Values Missing")
print("------- \t----------------------")

# there are 19 features that contain missing values 
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count
print((1 - (train.count() / len(train))).sort_values(ascending=False).nlargest(19))

Feature 	Proportion Values Missing
------- 	----------------------
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageCond      0.055479
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64


We want to throw out all features that have > 25% missing values. That means 'PoolQC', 'MiscFeature', 'Alley', 'Fence', and 'FireplaceQu' will all be excluded from the training data. We also don't care about the 'Id' for obvious reasons, and 'Utilities' because virtually all datapoints have the same value for the 'Utilities' feature. 

In [23]:
del train['PoolQC']
del train['MiscFeature']
del train['Alley']
del train['Fence']
del train['FireplaceQu']
del train['Id']
del train['Utilities']

Below are the features and their respective types: 

In [24]:
# purposefully-ordered 
ordered_categorical_features = ['OverallCond', 'Fireplaces', 'GarageCars', 
                                'TotRmsAbvGrd', 'BedroomAbvGr', 'FullBath', 
                                'BsmtFullBath', 'OverallQual', 'KitchenQual', 
                                'CentralAir', 'HeatingQC', 'BsmtCond', 'BsmtQual', 
                                'ExterCond', 'ExterQual', 'BsmtHalfBath']

# arbitrarily-ordered
unordered_categorical_features = ['MSZoning', 'Street', 'LotShape', 
                                  'LandContour', 'LotConfig', 'LandSlope', 
                                  'Neighborhood', 'Condition1', 'Condition2', 
                                  'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 
                                  'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 
                                  'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 
                                  'Electrical', 'GarageFinish', 'Functional', 'GarageType', 
                                  'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition', 
                                  'MSSubClass', 'PoolArea', 'MoSold']

# note choice of years as continuous 
continuous_features = ['LotFrontage', 'LotArea', 'YearBuilt', 
                       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 
                       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 
                       '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 
                       'GrLivArea', 'HalfBath', 'KitchenAbvGr', 
                       'GarageYrBlt', 'GarageArea', 'WoodDeckSF', 
                       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 
                       'ScreenPorch', 'MiscVal', 'YrSold']
    

We want to encode ordered categorical features in such a way that reflects their class heirarchy and unordered categorical features in such a way that each class is equidistant from each other class using OneHotEncoder. 

But first we must fill in the missing ('Nan') values. For numerical features, we will to fill in the missing values with the mean for that column. For example, we know that the mean value for all of the known 'LotFrontage' values is around 70. Therefore, for all of the 'NaN' values encountered in the 'LotFrontage' column, we will replace the value with 70.

In [25]:
means = {feat:np.nanmean(train[feat].values) for feat in continuous_features}
for feature,mean in means.items():
    train[feature] = train[feature].fillna(value=mean)

For the categorical features, we will fill in the missing values with the mode category for that column. For example, it is known that the most frequent 'GarageQual' is 'TA.' Therefore, for all of the 'NaN' values we encounter in the 'GarageQual' column, we will replace the value with 'TA.'

In [26]:
# fill in the missing categorical values with the modes 
categorical_features = unordered_categorical_features + ordered_categorical_features
modes = {feat:train[feat].mode()[0] for feat in categorical_features}
for feature,mode in modes.items():
    train[feature] = train[feature].fillna(value=mode)

Now we must encode the categorical variables. The unordered categorical features need to be transformed using the OneHotEncoder (expanding our data using pandas.get_dummies). Some categorical features need special treatement. Many of the ordered categorical variables are already encoded. For example, 'OverallCond.'

In [27]:
# simple encoding scheme much like http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
for feat in unordered_categorical_features:
    dummies = pd.get_dummies(train[feat], prefix=feat)
    train[dummies.columns] = dummies
    train = train.drop(feat, 1)


# some ordered categorical need special treatement
special_encode = ['KitchenQual','HeatingQC','BsmtCond', 
                 'ExterCond', 'ExterQual', 'CentralAir']



CondKey = {
        # answer not available (0 weight)
        'TA' : 0,
        # Poor 
        'Po' : 1,
        # Fair
        'Fa' : 1,
        # Good
        'Gd' : 2,
        # Excellent
        'Ex' : 3
}

BinKey = {
    # No
    'N' : 0,
    # Yes
    'Y' : 1
}

def encode(data, key):
    return np.array([key[d] for d in data])

train['KitchenQual'] = encode(train['KitchenQual'], CondKey)
train['HeatingQC'] = encode(train['HeatingQC'], CondKey)
train['BsmtCond'] = encode(train['BsmtCond'], CondKey)
train['BsmtQual'] = encode(train['BsmtQual'], CondKey)
train['ExterCond'] = encode(train['ExterCond'], CondKey)
train['ExterQual'] = encode(train['ExterQual'], CondKey)
train['CentralAir'] = encode(train['CentralAir'], BinKey)



In [30]:
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html#pandas.DataFrame.drop
X = train.drop('SalePrice', 1).values
y = train['SalePrice'].values

# see http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn-linear-model-lasso
#lasso = linear_model.Lasso(alpha=6.0)
#lasso_average_score = np.mean(model_selection.cross_val_score(lasso, X, y, cv=10, scoring='r2'))

# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
linReg = linear_model.LinearRegression()
linReg_average_score = np.mean(model_selection.cross_val_score(linReg, X, y, cv=10, scoring='r2'))

# http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
mlp = neural_network.MLPRegressor()
mlp_average_score = np.mean(model_selection.cross_val_score(mlp, X, y, cv=2, scoring='r2'))

print('Classifier\t\t\tAverage Cross-Validation Score (k=10 folds)')
print('----------\t\t\t-------------------------------------------')
print('Lasso\t\t\t\t%0.3f' % lasso_average_score)
print('Linear\t\t\t\t%0.3f' % linReg_average_score)
print('MLP\t\t\t\t%0.3f' % mlp_average_score)
#print('MLP\t\t\t%d' % mlp_average_score)



Classifier			Average Cross-Validation Score (k=10 folds)
----------			-------------------------------------------
Lasso				0.815
Linear				-4050743.415
MLP				0.396


