<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" width="50" height="50">

# Project 3: Ames Housing Price Prediction

In this project, you will use everything we've learned so far to predict the home prices of properties in the lovely Ames, Iowa. The Ames dataset is included in this repo as `data/ames.csv` and `data/ames_description.txt`.

To pass this project, you must follow all of the best practices and instructions outlined below.

**Read the following carefully:**
* All imports must go in the first code cell of the notebook.
* As best you can, follow all coding syntax best practices used by the instructor in class. For example:
    - `snake_case` variable names.
    - Spaces around operators and after commas.
    - No cell requires you to scroll horizontally. Split up long lines of code into multiple lines.
    - No cell output is too long. If you intend on showing data, please use `.head()` or similar.
    - **THESE ARE NOT JUST THE INSTRUCTOR'S TASTES!!** We have been meticulously following PEP 8, the official Python style guide. You can read more about it [here](https://www.python.org/dev/peps/pep-0008/).

Also required:
* Manipulate, clean, augment, and transform data before cells involving modeling.
* Produce clean, `scikit-learn`-usable `X` and `y` objects as done in class. Our y-variable for this problem will be `SalePrice`, the sale price of a home.
* You _must_ carry out one of the model validation techniques done in class.
    - This should be done _only_ on the data contained in `train.csv`. Yes, that means you might be splitting a dataset called `train.csv` into a train/test split.
* You _must_ carry out at least **TWO** types of models appropriate for this task. They do not need to be good. The quality of your model fit _will not impact your grade at all._
* For each model, calculate **THREE** different model metrics from the `sklearn.metrics` submodule (i.e., six model metrics in total).
    - Built-in methods don't count; i.e., `.score()` and `.oob_score_` don't satisfy this requirement.
* One of the above model metrics must be human-interpretable.
* In a few sentences, interpret the interpretable metric for each model. According to this metric, which model was better?
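
One way to combine the validation and metric requirements is to score out-of-fold predictions with functions from `sklearn.metrics`. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the cleaned Ames `X` and `y`, and the names `X_demo`, `y_demo`, the fold count, and the model settings are all assumptions rather than course requirements.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the cleaned Ames features and target (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

models = {
    'linear_regression': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

folds = KFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    # Every prediction comes from a model that never saw that row.
    y_hat = cross_val_predict(model, X_demo, y_demo, cv=folds)
    results[name] = {
        'rmse': np.sqrt(metrics.mean_squared_error(y_demo, y_hat)),
        'mae': metrics.mean_absolute_error(y_demo, y_hat),
        'r2': metrics.r2_score(y_demo, y_hat),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Because both models are scored on the same folds, their metrics can be compared directly. MAE is the human-interpretable one here, since it is in the same units as the target.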

---

Optionally, you may submit predictions to this [Kaggle competition](https://www.kaggle.com/t/06379d0606b04d74bab15a31734d2a9f). To make a submission:
* Make a prediction for each row of `data/kaggle_test_set.csv`. 
* Upload your solutions to Kaggle in the same format as `data/kaggle_sample_submission.csv` (e.g. use `df.to_csv(..., index=False)`).
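
As a rough sketch of the submission step, the snippet below writes a small predictions table to CSV without the index and reads it back to confirm the layout. The column names `Id` and `SalePrice`, the ID values, and the predicted prices are all hypothetical placeholders; check `data/kaggle_sample_submission.csv` for the real format.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical IDs and predictions. In the notebook these would come from
# the Kaggle test set and a fitted model's .predict() output.
test_ids = np.array([1461, 1462, 1463])
predictions = np.array([169000.0, 187500.0, 183600.0])

# Column names are assumptions; match them to the sample submission file.
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': predictions})

# Written to a temporary directory here so the sketch has no side effects.
out_path = os.path.join(tempfile.mkdtemp(), 'submission.csv')
submission.to_csv(out_path, index=False)  # no extra index column

check = pd.read_csv(out_path)
print(check.columns.tolist())
```

Without `index=False`, pandas writes the row index as an extra unnamed first column, which would no longer match the sample file's layout.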

In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as mt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

lr = LinearRegression()
rf = RandomForestRegressor()

In [2]:
ames = pd.read_csv('data/ames.csv')

In [3]:
ames.columns = ames.columns.str.lower()

In [4]:
ames.shape

(1460, 81)

In [5]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1500)

In [6]:
# Restrict to numeric columns; newer pandas raises on string columns.
round(ames.select_dtypes('number').corr()['saleprice'].sort_values(), 2)

kitchenabvgr    -0.14
enclosedporch   -0.13
mssubclass      -0.08
overallcond     -0.08
yrsold          -0.03
lowqualfinsf    -0.03
id              -0.02
miscval         -0.02
bsmthalfbath    -0.02
bsmtfinsf2      -0.01
3ssnporch        0.04
mosold           0.05
poolarea         0.09
screenporch      0.11
bedroomabvgr     0.17
bsmtunfsf        0.21
bsmtfullbath     0.23
lotarea          0.26
halfbath         0.28
openporchsf      0.32
2ndflrsf         0.32
wooddecksf       0.32
lotfrontage      0.35
bsmtfinsf1       0.39
fireplaces       0.47
masvnrarea       0.48
garageyrblt      0.49
yearremodadd     0.51
yearbuilt        0.52
totrmsabvgrd     0.53
fullbath         0.56
1stflrsf         0.61
totalbsmtsf      0.61
garagearea       0.62
garagecars       0.64
grlivarea        0.71
overallqual      0.79
saleprice        1.00
Name: saleprice, dtype: float64

In [7]:
hood = pd.get_dummies(ames.neighborhood, prefix='neighborhood', drop_first=True)
bldg = pd.get_dummies(ames.bldgtype, prefix='bldg', drop_first=True)
ames = pd.concat([ames, hood, bldg], axis=1)

In [8]:
ames.fireplaces.value_counts()

0    690
1    650
2    115
3      5
Name: fireplaces, dtype: int64

In [9]:
# 0 = low-density residential (RL), 1 = every other zoning class.
ames['zoning'] = ames.mszoning.replace(
    {'RL': 0, 'RM': 1, 'FV': 1, 'RH': 1, 'C (all)': 1}
)
ames['sqft'] = ames.totalbsmtsf + ames['1stflrsf'] + ames['2ndflrsf']

In [10]:
# Half baths count as 0.5 each.
ames['baths'] = (ames.bsmtfullbath + ames.fullbath
                 + (ames.bsmthalfbath + ames.halfbath) / 2)

In [11]:
feature_cols = ['lotarea', 'yearbuilt', 'sqft',
                'fireplaces', 'bedroomabvgr', 'baths',
                'mosold', 'yrsold', 'zoning',
                'neighborhood_Blueste', 'neighborhood_BrDale', 'neighborhood_BrkSide',
                'neighborhood_ClearCr', 'neighborhood_CollgCr', 'neighborhood_Crawfor',
                'neighborhood_Edwards', 'neighborhood_Gilbert', 'neighborhood_IDOTRR',
                'neighborhood_MeadowV', 'neighborhood_Mitchel', 'neighborhood_NAmes',
                'neighborhood_NPkVill', 'neighborhood_NWAmes', 'neighborhood_NoRidge',
                'neighborhood_NridgHt', 'neighborhood_OldTown', 'neighborhood_SWISU',
                'neighborhood_Sawyer', 'neighborhood_SawyerW', 'neighborhood_Somerst',
                'neighborhood_StoneBr', 'neighborhood_Timber', 'neighborhood_Veenker',
                'bldg_2fmCon', 'bldg_Duplex', 'bldg_Twnhs', 'bldg_TwnhsE']

X = ames[feature_cols]
y = ames.saleprice

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [12]:
pd.set_option('display.max_columns', 500)
X.head()

,lotarea,yearbuilt,sqft,fireplaces,bedroomabvgr,baths,mosold,yrsold,zoning,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,neighborhood_CollgCr,neighborhood_Crawfor,neighborhood_Edwards,neighborhood_Gilbert,neighborhood_IDOTRR,neighborhood_MeadowV,neighborhood_Mitchel,neighborhood_NAmes,neighborhood_NPkVill,neighborhood_NWAmes,neighborhood_NoRidge,neighborhood_NridgHt,neighborhood_OldTown,neighborhood_SWISU,neighborhood_Sawyer,neighborhood_SawyerW,neighborhood_Somerst,neighborhood_StoneBr,neighborhood_Timber,neighborhood_Veenker,bldg_2fmCon,bldg_Duplex,bldg_Twnhs,bldg_TwnhsE
0,8450,2003,2566,0,3,3.5,2,2008,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,9600,1976,2524,1,3,2.5,5,2007,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,11250,2001,2706,1,3,3.5,9,2008,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9550,1915,2473,1,3,2.0,2,2006,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,14260,2000,3343,1,4,3.5,12,2008,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [38]:
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print('RMSE:', np.sqrt(mt.mean_squared_error(y_test, y_pred)))
print('Explained Variance:', mt.explained_variance_score(y_test, y_pred))
print('MAE:', mt.mean_absolute_error(y_test, y_pred))

RMSE: 35884.27289416872
Explained Variance: 0.8058216541861771
MAE: 23227.901320619698


In [39]:
rf.fit(X_train, y_train)

rf_y_pred = rf.predict(X_test)

print('RMSE:', np.sqrt(mt.mean_squared_error(y_test, rf_y_pred)))
print('Explained Variance:', mt.explained_variance_score(y_test, rf_y_pred))
print('MAE:', mt.mean_absolute_error(y_test, rf_y_pred))

RMSE: 32288.95380812464
Explained Variance: 0.8423778093890688
MAE: 19122.45416894977


In [None]:
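# Across the board, the RandomForestRegressor fit the test data better than
# LinearRegression: lower RMSE, lower MAE, and higher explained variance.
# RMSE is the square root of the average squared difference between the
# predicted and actual values, so large misses are penalized more heavily.
# Explained variance is the proportion of the variance in sale price that
# the model accounts for (about 0.84 for the random forest vs. 0.81 for
# linear regression).
# MAE is the average of the absolute differences between the predicted and
# actual values, in dollars. On average, the random forest's predictions
# are off by about $19,122, compared with about $23,228 for linear
# regression, so the random forest is the better model by this metric.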
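
To make the three definitions above concrete, here is a small worked example that computes each metric by hand with NumPy and checks it against `sklearn.metrics`. The prices are made up for illustration and are not taken from the Ames data.

```python
import numpy as np
from sklearn import metrics

# Made-up prices for illustration only (not from the Ames data).
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 280000.0])

errors = y_true - y_hat  # [-10000, 10000, 20000, -5000]

# MAE: average size of a miss, in dollars.
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared miss; big misses weigh more.
rmse = np.sqrt(np.mean(errors ** 2))

# Explained variance: 1 - Var(errors) / Var(actual values).
explained_var = 1 - np.var(errors) / np.var(y_true)

print('MAE:', mae)    # 11250.0
print('RMSE:', rmse)  # 12500.0
print('Explained Variance:', explained_var)

# The hand-computed values agree with sklearn.metrics.
assert np.isclose(mae, metrics.mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
assert np.isclose(
    explained_var, metrics.explained_variance_score(y_true, y_hat)
)
```

RMSE comes out larger than MAE here because the single $20,000 miss is squared before averaging, which is why the two metrics can rank models differently when one model makes a few large errors.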