# Housing Prices - Evaluation

### James Mwakichako - jmwakich@hawk.iit.edu
### Michael Baroody  - mbaroody@hawk.iit.edu

### Preprocessing

Before we can fit our models, we have to handle missing values and encode the categorical features.

In [2]:
%matplotlib inline
import pandas as pd
from ipywidgets import widgets
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing, linear_model, model_selection, neural_network

# train DataFrame object
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
train = pd.read_csv("train.csv", header = 0)

Recall that some features have many missing values. The 19 features listed below are the only ones with any missing values; every other feature is complete.

In [3]:
print("Feature \tProportion Values Missing")
print("------- \t----------------------")

# there are 19 features that contain missing values 
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html#pandas.DataFrame.count
print((1 - (train.count() / len(train))).sort_values(ascending=False).nlargest(19))

Feature 	Proportion Values Missing
------- 	----------------------
PoolQC          0.995205
MiscFeature     0.963014
Alley           0.937671
Fence           0.807534
FireplaceQu     0.472603
LotFrontage     0.177397
GarageCond      0.055479
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
BsmtExposure    0.026027
BsmtFinType2    0.026027
BsmtFinType1    0.025342
BsmtCond        0.025342
BsmtQual        0.025342
MasVnrArea      0.005479
MasVnrType      0.005479
Electrical      0.000685
dtype: float64


We drop every feature with more than 25% of its values missing. That means 'PoolQC', 'MiscFeature', 'Alley', 'Fence', and 'FireplaceQu' are all excluded from the training data.

In [4]:
# drop the five features with more than 25% missing values in one call
train = train.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'])
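
The 25% rule can also be applied programmatically instead of naming each column. The following is a minimal sketch on a toy DataFrame (not the housing data); the helper name `drop_sparse_columns` and its `max_missing` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.25):
    """Return a copy of df without columns whose missing fraction exceeds max_missing."""
    # isna().mean() gives the per-column share of missing values
    missing_fraction = df.isna().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=sparse)


# Toy frame: 'b' is 75% missing, 'c' is exactly 25% missing, 'a' is complete.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, 1.0, 2.0, 3.0],
})
kept = drop_sparse_columns(toy)
print(list(kept.columns))  # ['a', 'c']
```

Because the comparison is strict, a column sitting exactly at the threshold (like 'c' here) is kept.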

We fill in the missing values for the remaining features. For numerical features, we replace each missing value with the mean of that column. For example, the mean of the known 'LotFrontage' values is around 70, so every 'NaN' in the 'LotFrontage' column is replaced with that mean.

In [5]:
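# fill in the missing numerical features with the means
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html#pandas.DataFrame.mean
# and http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html#pandas.Series.fillna
# numeric_only=True restricts the means to numeric columns (recent pandas raises on object columns otherwise)
means = train.mean(numeric_only=True, skipna=True)
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for feature, mean in means.items():
    train[feature] = train[feature].fillna(value=mean)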
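
To make the effect of mean imputation concrete, here is a small sketch on made-up values (the numbers are illustrative, not taken from train.csv). Passing the Series of means to `fillna` fills each numeric column with its own mean and leaves non-numeric columns untouched.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric column and one categorical column, each with a gap.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "Street": ["Pave", None, "Pave"],
})

# Means are computed over the known values only (NaN is skipped by default).
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills column by column, matching on the column name.
filled = toy.fillna(numeric_means)
print(filled["LotFrontage"].tolist())  # [60.0, 70.0, 80.0]
```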

For the categorical features, we fill in the missing values with the mode (the most frequent category) of that column. For example, the most frequent 'GarageQual' value is 'TA', so every 'NaN' in the 'GarageQual' column is replaced with 'TA'.

In [6]:
# fill in the missing categorical values with the modes
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html#pandas.DataFrame.mode
train = train.fillna(train.mode().iloc[0])
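
As a quick illustration of the mode fill, consider the toy column below (made-up values). `DataFrame.mode()` can return several rows when categories tie, so `.iloc[0]` simply takes the first row; this sketch assumes that picking the first mode is acceptable.

```python
import pandas as pd

# Toy categorical column with one missing entry.
toy = pd.DataFrame({"GarageQual": ["TA", "TA", "Gd", None]})

# mode() ignores missing values by default; row 0 holds the first mode per column.
modes = toy.mode().iloc[0]
filled = toy.fillna(modes)
print(filled["GarageQual"].tolist())  # ['TA', 'TA', 'Gd', 'TA']
```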

Now we must deal with the categorical variables. We do this by encoding the categorical feature labels with numbers, using sklearn.preprocessing. In the visualization phase of our project, we identified 49 categorical features. However, some of these were already encoded with numbers. The 'MoSold' feature, for example, is a categorical feature of the month the house was sold, and it is already encoded for us by the number of the month in the calendar. Therefore, we leave it alone. 

In [7]:
# first, find those features that are unencoded categorical
categorical_features = [feat for feat in train.columns.values if train[feat].dtype == 'object']

# see http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
# the encoder is refitted for every feature; labels map to integers in sorted order
le = preprocessing.LabelEncoder()
for feature in categorical_features:
    train[feature] = le.fit_transform(train[feature])
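
The sketch below shows what label encoding does to a single toy column, next to a one-hot encoding for comparison. The one-hot variant is not part of our pipeline; it is included only because integer codes impose an ordering (here alphabetical) that a linear model may interpret as meaningful.

```python
import pandas as pd
from sklearn import preprocessing

# Toy quality column (made-up values).
quality = pd.Series(["TA", "Gd", "Ex", "TA"])

# LabelEncoder assigns integers following the sorted order of the labels.
le = preprocessing.LabelEncoder()
codes = le.fit_transform(quality)
print(list(le.classes_))  # ['Ex', 'Gd', 'TA']
print(list(codes))        # [2, 1, 0, 2]

# Alternative (an assumption, not used above): one indicator column per category.
one_hot = pd.get_dummies(quality, prefix="Qual")
print(list(one_hot.columns))  # ['Qual_Ex', 'Qual_Gd', 'Qual_TA']
```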

In [11]:
# see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html#pandas.DataFrame.drop
# the axis must be passed by keyword in recent pandas
X = train.drop('SalePrice', axis=1).values
y = train['SalePrice'].values

# see http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn-linear-model-lasso
# cross_val_score uses the estimator's default scorer, which is R^2 for regressors
lasso = linear_model.Lasso(alpha=1.0)
lasso_average_score = np.mean(model_selection.cross_val_score(lasso, X, y, cv=10))

# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
linReg = linear_model.LinearRegression()
linReg_average_score = np.mean(model_selection.cross_val_score(linReg, X, y, cv=10))

# http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
# note: MLPClassifier treats every distinct SalePrice as its own class, and this model
# is scored with accuracy over 2 folds, so its score is not comparable to the R^2 values above
mlp = neural_network.MLPClassifier()
print(y)
mlp_average_score = np.mean(model_selection.cross_val_score(mlp, X, y, cv=2, scoring='accuracy'))

print('Classifier\t\t\tAverage Cross-Validation Score (k=10 folds)')
print('----------\t\t\t-------------------------------------------')
print('Lasso\t\t\t\t%0.3f' % lasso_average_score)
print('Linear\t\t\t\t%0.3f' % linReg_average_score)
print('MLP\t\t\t%d' % mlp_average_score)

[208500 181500 223500 ..., 266500 142125 147500]




Classifier			Average Cross-Validation Score (k=10 folds)
----------			-------------------------------------------
Lasso				0.821
Linear				0.821
MLP			0
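
The MLP score of 0 is unsurprising: a classifier must predict the exact sale price to count as correct, and the `%d` format truncates any accuracy below 1 to 0. Since 'SalePrice' is continuous, a regressor is the more natural choice. The following is a minimal sketch of how that could look, run on synthetic data from `make_regression` rather than on our training set; the scaling step, `max_iter`, and `random_state` values are assumptions for illustration, and no scores are claimed for the housing data.

```python
import numpy as np
from sklearn import model_selection, neural_network, pipeline, preprocessing
from sklearn.datasets import make_regression

# Synthetic stand-in for (X, y); the real features would come from the train DataFrame.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Neural networks are sensitive to feature scale, so standardize inside the pipeline.
# Scaling inside the pipeline keeps each fold's statistics separate from its test split.
mlp_reg = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    neural_network.MLPRegressor(max_iter=500, random_state=0),
)

# With a regressor, the default scorer is R^2, matching the Lasso and linear models.
scores = model_selection.cross_val_score(mlp_reg, X_demo, y_demo, cv=5)
print('MLP (regressor)\t%0.3f' % np.mean(scores))
```

Using the same scorer and the same number of folds for all three models would make the final table directly comparable.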
