# Project 2 - Ames Housing Data
## Model Tuning

![House](phil-hearing-house-small.jpg)
<br>Photo by:
https://unsplash.com/photos/IYfp2Ixe9nM?utm_source=unsplash&utm_medium=referral&utm_content=creditShareLink

In [121]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

In [122]:
houses = pd.read_csv('../datasets/train_processed.csv')
houses.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,floors,bathrooms
0,109,533352170,60,7,69.0552,13517,2,0,1,3,...,0,0,1,0,0,0,0,0,2,3.0
1,544,531379050,60,7,43.0,11492,2,0,1,3,...,0,0,0,1,0,0,0,0,2,4.0
2,153,535304180,20,7,68.0,7922,2,0,0,3,...,0,0,0,0,0,0,0,0,1,2.0
3,318,916386060,60,7,73.0,9802,2,0,0,3,...,0,0,0,0,0,0,1,0,2,3.0
4,255,906425045,50,7,82.0,14235,2,0,1,3,...,0,0,0,1,0,0,0,0,2,2.0


#### Find which features are most correlated with SalePrice

In [123]:
print('All features:', houses.columns.values.tolist())

All features: ['Id', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Lot Config', 'Land Slope', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch S

Don't include the Id or PID columns.

In [124]:
feature_list = houses.columns.values.tolist()[2:]

In [125]:
all_corrs = pd.DataFrame(houses[feature_list].corr()['SalePrice'].sort_values(ascending=False) )

# Idea taken from:
#https://git.generalassemb.ly/DSIR-1116/3.08-lesson-feature-engineering-and-model-workflow/blob/master/solution-code/power-transformer.ipynb

In [126]:
all_corrs.head()

Unnamed: 0,SalePrice
SalePrice,1.0
Overall Qual,0.804237
Exter Qual,0.719518
Gr Liv Area,0.716996
Kitchen Qual,0.694095


#### Create a feature set to measure whether being in a neighborhood close to Iowa State is positively correlated with sale price

The problem statement was to gauge whether homes in neighborhoods near the university sell at a premium (5%) over homes in neighborhoods farther away.

By looking at several real estate websites and Google Maps, I determined that the following neighborhood values are located close to the Iowa State campus:
* BrkSide (Brookside)
* Crawfor (Crawford)
* IDOTRR (Iowa DOT and Rail Road)
* OldTown (Old Town)
* SWISU (South & West of Iowa State University)
* Sawyer
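
The cell that creates the `near_ISU` column used later is not shown in this section. One way such an indicator could be built is sketched below, assuming it is simply 1 when any of the six neighborhood dummy columns is 1. The toy rows and the derivation itself are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Dummy columns for the six neighborhoods listed above
# (names follow the Neighborhood_* columns in the processed data).
isu_hoods = ['Neighborhood_BrkSide', 'Neighborhood_Crawfor', 'Neighborhood_IDOTRR',
             'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer']

# Toy stand-in for `houses` (made-up rows, for illustration only).
toy = pd.DataFrame(0, index=range(4), columns=isu_hoods)
toy.loc[0, 'Neighborhood_Sawyer'] = 1
toy.loc[2, 'Neighborhood_SWISU'] = 1

# Assumed definition: 1 if the home is in any of the six neighborhoods, else 0.
toy['near_ISU'] = toy[isu_hoods].max(axis=1)
print(toy['near_ISU'].tolist())
```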

![Iowa State](ISU-Hoods-Star-small.png)

---
Null Hypothesis $H_0$:
Properties located in the six neighborhoods closest to Iowa State sell at a 5% price premium relative to other neighborhoods, measured by price per square foot.


Alternative Hypothesis $H_A$: Properties in the six neighborhoods closest to Iowa State do not sell at a 5% price premium.
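
The hypotheses are stated in terms of price per square foot, so the quantity of interest is the relative gap between the two group means. A minimal sketch of that calculation follows. It assumes `Gr Liv Area` is the square footage used and that `near_ISU` is a 0/1 indicator, and the four rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; in the notebook these columns would come from `houses`.
toy = pd.DataFrame({
    'SalePrice':   [210000, 105000, 200000, 100000],
    'Gr Liv Area': [2000, 1000, 2000, 1000],
    'near_ISU':    [1, 1, 0, 0],
})

# Price per square foot for each home (assumes Gr Liv Area is the right denominator).
toy['price_per_sqft'] = toy['SalePrice'] / toy['Gr Liv Area']

# Mean price per square foot near campus versus elsewhere.
group_means = toy.groupby('near_ISU')['price_per_sqft'].mean()

# Relative premium: 0.05 would correspond to the 5% premium in the hypotheses.
premium = group_means.loc[1] / group_means.loc[0] - 1
print(f'Relative premium near ISU: {premium:.1%}')
```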



Try `near_ISU` by itself as a feature.

In [142]:
X = houses[['near_ISU']]
y = houses['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)

No need to scale if just using one feature
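
For an unpenalized linear regression, standardizing the feature only rescales the coefficient; the predictions and the R² score stay the same. The sketch below checks this on synthetic data (the feature, prices, and noise level are invented for illustration). Note that this invariance does not carry over to penalized models such as Lasso, where the penalty interacts with the feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single binary feature and noisy prices (illustrative values only).
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 1)).astype(float)
y = 180_000 - 60_000 * X[:, 0] + rng.normal(0, 50_000, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# R^2 on the test set with the raw feature.
raw_score = LinearRegression().fit(X_train, y_train).score(X_test, y_test)

# R^2 on the test set after standardizing with statistics from the training set.
ss = StandardScaler().fit(X_train)
scaled_score = LinearRegression().fit(ss.transform(X_train), y_train).score(ss.transform(X_test), y_test)

print(np.isclose(raw_score, scaled_score))
```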

#### Linear Regression

In [143]:
lr = LinearRegression()
lr.fit(X_train, y_train)

LinearRegression()

In [144]:
y_preds = lr.predict(X_test)
print(f'The r squared value of the linear regression model: {r2_score(y_test, y_preds)}')

The r squared value of the linear regression model: 0.08951967495556534


In [145]:
lr.coef_

array([-61005.9135761])

#### Lasso regression

In [146]:
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-3, 0, 100)

# Cross-validate over our list of Lasso alphas.
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=5000)

# Fit the model; LassoCV selects the best Lasso alpha by cross-validation.
lasso_cv.fit(X_train, y_train);

In [147]:
print(f'Lasso CV score with training data: {lasso_cv.score(X_train, y_train)} ')
print(f'Lasso CV score with testing data: {lasso_cv.score(X_test, y_test)} ')

Lasso CV score with training data: 0.11215147128337843 
Lasso CV score with testing data: 0.0895216256867909 


In [148]:
lasso_cv.coef_

array([-61000.64627117])
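
After fitting, it can be worth checking which alpha `LassoCV` selected and whether it sits on the edge of the search grid, since a boundary value suggests the grid may need widening. The sketch below uses the same grid as above on synthetic data; the data and the `at_edge` check are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic single-feature data (illustrative values only).
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(200, 1)).astype(float)
y_demo = 180_000 - 60_000 * X_demo[:, 0] + rng.normal(0, 50_000, size=200)

# Same alpha grid as in the notebook cell above.
l_alphas = np.logspace(-3, 0, 100)
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=5000).fit(X_demo, y_demo)

# alpha_ holds the alpha chosen by cross-validation.
at_edge = np.isclose(lasso_cv.alpha_, [l_alphas.min(), l_alphas.max()]).any()
print(f'Selected alpha: {lasso_cv.alpha_:.4f}, on grid boundary: {at_edge}')
```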

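Both the linear regression and the Lasso model, each using `near_ISU` as the only feature, have a negative coefficient of about -61,000, implying that homes near Iowa State are expected to sell for less than homes farther from campus. With a test R² of about 0.09, this feature alone explains little of the variation in sale price.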
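
If `near_ISU` is a 0/1 indicator (an assumption; its definition is not shown in this section), the coefficient of a one-feature linear regression equals the difference between the two group means. A small sketch with made-up prices illustrates this reading of the coefficient:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up indicator and prices, for illustration only.
near = np.array([1, 1, 1, 0, 0, 0, 0])
price = np.array([120_000, 130_000, 140_000, 180_000, 190_000, 200_000, 210_000])

# Fit a one-feature linear regression on the indicator.
model = LinearRegression().fit(near.reshape(-1, 1), price)

# Difference between the mean price near campus and the mean price elsewhere.
mean_gap = price[near == 1].mean() - price[near == 0].mean()

print(np.isclose(model.coef_[0], mean_gap))
```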

Next, try modeling with each of the neighborhoods near Iowa State as a separate feature, along with a list of other features.

In [149]:
features = ['Overall Qual', 'Year Built','1st Flr SF','2nd Flr SF','Garage Cars','bathrooms',
 'Neighborhood_BrkSide', 'Neighborhood_Crawfor', 'Neighborhood_IDOTRR',
 'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer']

X = houses[features]
y = houses['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)

#### Linear Regression

In [150]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

In [151]:
lr.fit(X_train, y_train)

LinearRegression()

In [152]:
y_preds = lr.predict(X_test)
print(f'The r squared value of the linear regression model: {r2_score(y_test, y_preds)}')

The r squared value of the linear regression model: 0.790065893650115


In [153]:
pd.DataFrame(list(zip(X.columns, lr.coef_)))

# From class session 3.03
# https://git.generalassemb.ly/pdmill/3.03-intro-to-linear-regression/blob/master/linear-regression-starter.ipynb

Unnamed: 0,0,1
0,Overall Qual,31519.591331
1,Year Built,10679.791689
2,1st Flr SF,34412.401813
3,2nd Flr SF,18124.337243
4,Garage Cars,6356.129181
5,bathrooms,6332.7958
6,Neighborhood_BrkSide,2631.634081
7,Neighborhood_Crawfor,3765.991508
8,Neighborhood_IDOTRR,1800.219893
9,Neighborhood_OldTown,1690.698106
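
Because the features were standardized before fitting, each coefficient in the table above is the change in predicted price per one standard deviation of that feature, not per original unit. Dividing by the scaler's `scale_` attribute converts the coefficients back to original units. The sketch below shows the conversion on noiseless synthetic data; the feature names and values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features: square footage and a 1-10 quality score (illustrative only).
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=100)
qual = rng.integers(1, 11, size=100).astype(float)
X_demo = np.column_stack([sqft, qual])

# Noiseless target: $50 per square foot plus $10,000 per quality point.
y_demo = 50 * sqft + 10_000 * qual + 20_000

# Fit on standardized features, as in the notebook.
ss_demo = StandardScaler().fit(X_demo)
lr_demo = LinearRegression().fit(ss_demo.transform(X_demo), y_demo)

# Convert per-standard-deviation coefficients to per-original-unit coefficients.
per_unit = lr_demo.coef_ / ss_demo.scale_
print(np.round(per_unit, 2))
```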


In [154]:
features = ['Overall Qual', 'Year Built','1st Flr SF','2nd Flr SF','Garage Cars','bathrooms','near_ISU']

X = houses[features]
y = houses['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)

In [155]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)

lr.fit(X_train, y_train)

LinearRegression()

In [156]:
y_preds = lr.predict(X_test)
print(f'The r squared value of the linear regression model with highly negatively correlated features: {r2_score(y_test, y_preds)}')

The r squared value of the linear regression model with highly negatively correlated features: 0.7884167410320949


In [157]:
pd.DataFrame(list(zip(X.columns, lr.coef_)))

# From class session 3.03
# https://git.generalassemb.ly/pdmill/3.03-intro-to-linear-regression/blob/master/linear-regression-starter.ipynb

Unnamed: 0,0,1
0,Overall Qual,31674.566051
1,Year Built,10762.748898
2,1st Flr SF,34615.267321
3,2nd Flr SF,18211.783652
4,Garage Cars,6335.763674
5,bathrooms,6263.776397
6,near_ISU,4250.104407
