# Project 2 - Ames Housing Data
## Modeling Production

![House](images/phil-hearing-house-small.jpg)
<br>Photo by:
https://unsplash.com/photos/IYfp2Ixe9nM?utm_source=unsplash&utm_medium=referral&utm_content=creditShareLink

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

In [45]:
houses = pd.read_csv('../datasets/train_processed.csv')
houses.head()

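| Unnamed: 0 | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Neighborhood_OldTown | Neighborhood_SWISU | Neighborhood_Sawyer | Neighborhood_SawyerW | Neighborhood_Somerst | Neighborhood_StoneBr | Neighborhood_Timber | Neighborhood_Veenker | floors | bathrooms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | 7 | 69.0552 | 13517 | 2 | 0 | 1 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 3.0 |
| 1 | 544 | 531379050 | 60 | 7 | 43.0 | 11492 | 2 | 0 | 1 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 4.0 |
| 2 | 153 | 535304180 | 20 | 7 | 68.0 | 7922 | 2 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2.0 |
| 3 | 318 | 916386060 | 60 | 7 | 73.0 | 9802 | 2 | 0 | 0 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 3.0 |
| 4 | 255 | 906425045 | 50 | 7 | 82.0 | 14235 | 2 | 0 | 1 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 2.0 |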


#### Best performing model
A Ridge regression model using approximately 30 features achieved an $R^2$ similar to that of a Lasso model, but Ridge scored slightly better on the testing data.

Five features that were highly correlated with other independent variables were dropped.

In [46]:
feature_list = houses.columns.values.tolist()[2:]

In [47]:
all_corrs = pd.DataFrame(houses[feature_list].corr()['SalePrice'].sort_values(ascending=False) )

# Idea taken from:
#https://git.generalassemb.ly/DSIR-1116/3.08-lesson-feature-engineering-and-model-workflow/blob/master/solution-code/power-transformer.ipynb

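Create a list of the 30 columns most positively correlated with SalePrice as a starting point for modeling. The list includes SalePrice itself (correlation 1.0), which is removed in a later cell.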

In [48]:
features_pos = []

for index in range(0,30):
    #print(f'Feature: {all_corrs.iloc[index].name.ljust(30)} Correlation:  {all_corrs.iloc[index,0]} ')
    features_pos.append(all_corrs.iloc[index].name)
    
# found out about ljust here:
# https://stackabuse.com/padding-strings-in-python/
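
The ranking step above can also be written without an explicit loop. The following is a minimal sketch on synthetic data, not the notebook's actual dataset: the `demo` frame, its column names, and `top_k` are hypothetical stand-ins chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical columns, not the real dataset).
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "quality": rng.normal(size=n),
    "area": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["SalePrice"] = (
    3.0 * demo["quality"] + 1.5 * demo["area"] + rng.normal(scale=0.5, size=n)
)

# Correlation of every column with the target, strongest positive first.
corr_with_target = demo.corr()["SalePrice"].sort_values(ascending=False)

# The target correlates perfectly with itself, so drop it before taking the top k.
top_k = 2
top_features = corr_with_target.drop("SalePrice").head(top_k).index.tolist()
print(top_features)
```

Dropping the target before slicing means the resulting list never contains `SalePrice`, so no later cleanup step is needed.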

Drop some features that are highly correlated with each other.
For example, Garage Area and Garage Cars are highly correlated, so it's safe to drop one of them.

In [49]:
features_pos.remove('Garage Cars') # Highly correlated with Garage Area
features_pos.remove('TotRms AbvGrd') # Highly correlated with Gr Liv Area
features_pos.remove('Fireplaces') # Highly correlated with Fireplace Qu
features_pos.remove('1st Flr SF') # Highly correlated with Total Bsmt SF
features_pos.remove('BsmtFin Type 1') # Highly correlated with Bsmt Qual

# SalePrice is the target: leaving it in the feature list would leak the answer into X
features_pos.remove('SalePrice')
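
The pairs dropped above were presumably identified by inspecting correlations between features. One way to surface such pairs programmatically is sketched below, under stated assumptions: the data is synthetic, the column names are illustrative, and the `threshold` of 0.8 is a hypothetical cutoff rather than the one used in this project.

```python
import numpy as np
import pandas as pd

# Synthetic data in which two columns are near-duplicates (hypothetical names).
rng = np.random.default_rng(1)
n = 500
garage_area = rng.normal(size=n)
demo = pd.DataFrame({
    "garage_area": garage_area,
    # Almost a noisy copy of garage_area, so the two are strongly correlated.
    "garage_cars": garage_area + rng.normal(scale=0.2, size=n),
    "lot_area": rng.normal(size=n),
})

threshold = 0.8  # hypothetical cutoff for "highly correlated"
corr = demo.corr().abs()
cols = corr.columns

# Visit each unordered pair once (j > i) and keep those above the cutoff.
high_pairs = [
    (cols[i], cols[j], corr.iloc[i, j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]

for a, b, value in high_pairs:
    print(f"{a} ~ {b}: |r| = {value:.3f}")
```

Which member of each pair to drop remains a judgment call; a common heuristic is to keep the one more strongly correlated with the target.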




### Model Creation
---

In [50]:
X = houses[features_pos]
y = houses['SalePrice']

Create train and test sets using an 80-20 split.

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)

Scale the data

In [52]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
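
The scaler above is fit on the training rows only and then reused on the test rows, which keeps test-set statistics out of the model. A small sketch on synthetic data (the array `X_demo` and its shape are assumptions for illustration) shows the effect: the training columns end up exactly standardized, while the test columns are only approximately so.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix (hypothetical, not the housing data).
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_tr, X_te = train_test_split(X_demo, train_size=0.8, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics on the test rows

# Training columns are centered at zero; test columns are close to zero but not exact.
print(X_tr_scaled.mean(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3))
```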

Create a benchmark model with DummyRegressor that always predicts the mean sale price.
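
No baseline cell appears at this point in the notebook, so the following is only a sketch of what such a benchmark could look like, run on synthetic data rather than the housing set. The arrays `X_demo` and `y_demo` and the coefficient values are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical, not the housing data).
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = 100.0 + X_demo @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# strategy="mean" ignores the features and always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_tr, y_tr)

# score() returns R^2: zero on the training data, at most zero on the test data.
baseline_train_r2 = baseline.score(X_tr, y_tr)
baseline_test_r2 = baseline.score(X_te, y_te)
print(baseline_train_r2, baseline_test_r2)
```

Any real model should beat this baseline; an $R^2$ near zero on the test set would mean the features add nothing beyond predicting the mean.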

### Ridge Regression Model

Code borrowed from class 4.02

In [53]:
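# Ridge model with a fixed regularization strength.
ridge_model = Ridge(alpha=10)

# Set up a list of ridge alphas to check.
# np.logspace generates 100 values evenly spaced between 0 and 5,
# then converts them to alphas between 10^0 and 10^5.
r_alphas = np.logspace(0, 5, 100)

# Set up cross-validation over our list of ridge alphas.
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

# Caution: this fits ridge_model (fixed alpha=10) and rebinds ridge_cv to it,
# so the RidgeCV search above never runs. Fitting the RidgeCV object instead
# (ridge_cv.fit(X_train, y_train)) would run the search over r_alphas.
ridge_cv = ridge_model.fit(X_train, y_train)

In [54]:
# This is the alpha passed to Ridge above, not a cross-validated optimum.
ridge_cv.alpha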

10

In [55]:
print(f'Ridge CV score with training data: {ridge_cv.score(X_train, y_train)} ')
print(f'Ridge CV score with testing data: {ridge_cv.score(X_test, y_test)} ')

Ridge CV score with training data: 0.8886740562102724 
Ridge CV score with testing data: 0.858223857845129 
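
The scores above come from a Ridge model with `alpha=10`. For comparison, the sketch below shows how a cross-validated alpha search with `RidgeCV` could be run end to end. It uses synthetic data, so the alpha and scores it prints say nothing about the housing results; `X_demo`, `y_demo`, and the coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical, not the housing data).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42
)

# Scale using training statistics only.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# 100 candidate alphas spaced evenly on a log scale from 10^0 to 10^5.
alphas = np.logspace(0, 5, 100)

# Fit the RidgeCV object itself so the 5-fold search over alphas actually runs.
search = RidgeCV(alphas=alphas, scoring="r2", cv=5)
search.fit(X_tr, y_tr)

# The selected alpha is stored with a trailing underscore (set during fit).
best_alpha = search.alpha_
print(best_alpha)
print(search.score(X_tr, y_tr), search.score(X_te, y_te))
```

Note the attribute name: a plain `Ridge` exposes the constructor argument as `alpha`, while `RidgeCV` exposes the chosen value as `alpha_`.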
