## Modeling
- **Establish your baseline score.**
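- Fit linear regression. Look at your coefficients. Are any of them implausibly large?
    **For every 1 unit increase in $x_i$, we expect SalePrice to increase by $\beta_i$, holding all other features constant.** (With scaled features, one unit is one unit of the scaled feature, e.g. one standard deviation under standard scaling.)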
- Fit lasso/ridge/elastic net with default parameters.
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This does not have to be your best performing Kaggle model, but rather the model that best answers your problem statement.)
- Refine and interpret your production model.
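
The coefficient interpretation in the checklist above can be checked numerically. The following is a minimal sketch on synthetic data, not the housing data; `X_demo`, `y_demo`, and `demo_lm` are hypothetical names introduced only for illustration. It raises one feature by one unit while holding the other fixed and compares the change in the prediction to the fitted coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): y = 3*x0 - 2*x1 + 5 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5 + rng.normal(scale=0.1, size=200)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# Raise feature 0 by one unit while holding feature 1 fixed.
x_base = np.array([[0.5, 1.0]])
x_plus = x_base + np.array([[1.0, 0.0]])
pred_change = (demo_lm.predict(x_plus) - demo_lm.predict(x_base))[0]

# The change in the prediction equals the coefficient on feature 0.
print(pred_change, demo_lm.coef_[0])
```

The equality holds for any starting point `x_base`, because a linear model's prediction changes by exactly $\beta_i$ per unit of $x_i$ when the other inputs do not move.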

In [42]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

In [43]:
# Load the scaled train/test split variables stored by notebook 02_Preprocessing_and_Feature_Engineering
%store -r X_train
%store -r X_test
%store -r y_train
%store -r y_test

In [44]:
# Instantiate linear regression model
lm = LinearRegression()

# Fit the linear regression to the chosen scaled features.
lm.fit(X_train, y_train)

LinearRegression()

***To evaluate the model:***
- Obtain three $R^2$ scores:

    1. Train
    
    2. Test
    
    3. Cross-val (the mean of the five validation-fold scores on the training data). This will serve as the baseline $R^2$ for the model.

In [45]:
# If the test and cross-val scores are similar, the test set is likely representative.
# If they diverge, the train/test split probably has a large sampling error.

In [46]:
# 1. Train Score
lm.score(X_train,y_train)

0.7777852697275695

In [47]:
# 2. Test Score
lm.score(X_test, y_test)

0.8222516669009068

In [48]:
# 3. Cross-Val: Baseline Score
cross_val_score(lm, X_train, y_train, cv=5).mean()

0.7636902331607144
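
One way to judge whether a test score and a cross-val mean are "similar" is to look at how much the individual fold scores vary. The sketch below uses synthetic data from `make_regression` rather than the housing data, and every variable name in it is hypothetical. It prints the fold-to-fold standard deviation next to the gap between the test score and the cross-val mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the housing data (assumption: 7 numeric features).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=7, noise=25.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_lm = LinearRegression().fit(X_tr, y_tr)

# One R^2 per validation fold, computed on the training data only.
fold_scores = cross_val_score(demo_lm, X_tr, y_tr, cv=5)
test_score = demo_lm.score(X_te, y_te)

# Compare the test/CV gap with the spread across folds.
gap = abs(test_score - fold_scores.mean())
print(f"CV mean: {fold_scores.mean():.4f}  CV std: {fold_scores.std():.4f}")
print(f"Test:    {test_score:.4f}  |test - CV mean|: {gap:.4f}")
```

A gap that is small relative to the fold standard deviation suggests the difference is within ordinary sampling noise; a much larger gap is a reason to look at how the split was made.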

In [49]:
# The `lm` object contains our model's coefficients
pd.Series(lm.coef_, index=features)

Year Remod/Add     6163.780925
Year Built         8968.494023
1st Flr SF         6612.027035
Total Bsmt SF      5744.896014
Garage Area       10243.796560
Gr Liv Area       22860.742323
Overall Qual      29022.187381
dtype: float64

In [50]:
# And the y-intercept.
lm.intercept_

181807.08072916663
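
If the features were centered to mean zero on the training set (an assumption about the preprocessing notebook, which is not shown here), the intercept of an ordinary least squares fit equals the mean of the training target. The sketch below illustrates this on synthetic data with `StandardScaler`; all names in it are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic raw features and target (hypothetical, not the housing data).
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=50, scale=10, size=(300, 3))
y_demo = 1000 * X_raw[:, 0] + 500 * X_raw[:, 2] + rng.normal(scale=2000, size=300)

# Standardize: each column now has mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_raw)
demo_lm = LinearRegression().fit(X_scaled, y_demo)

# With centered features, the intercept is the prediction at the "average"
# observation, which for OLS is the mean of the target.
print(demo_lm.intercept_, y_demo.mean())
```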

In [51]:
# Instantiate LassoCV and RidgeCV with custom alpha grids (they are fit inside cross_val_score below)
lasso = LassoCV(n_alphas=200)
ridge = RidgeCV(alphas=np.linspace(.1, 10, 100))

In [52]:
# Use cross_val_score to evaluate LassoCV
lasso_scores = cross_val_score(lasso, X_train, y_train, cv=5)
lasso_scores.mean()

0.7636457522513328

In [53]:
# Use cross_val_score to evaluate RidgeCV (note: cv=3 here, so this mean is not strictly comparable to the 5-fold scores above)
ridge_scores = cross_val_score(ridge, X_train, y_train, cv=3)
ridge_scores.mean()

0.7604572901878699
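
`cross_val_score` fits clones of the estimator, so the `lasso` and `ridge` objects above remain unfitted. To see which penalty strength the search selects, the estimator has to be fit directly. The sketch below shows this on synthetic data; the data and the `demo_` names are hypothetical, and the ridge alpha grid mirrors the one used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 7 features, only 4 of them informative).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=7, n_informative=4, noise=20.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Fit directly so the selected alpha and the coefficients are available.
demo_lasso = LassoCV(cv=5).fit(X_demo, y_demo)
demo_ridge = RidgeCV(alphas=np.linspace(0.1, 10, 100)).fit(X_demo, y_demo)

print("Lasso alpha:", demo_lasso.alpha_)
print("Ridge alpha:", demo_ridge.alpha_)

# Lasso can shrink coefficients exactly to zero, which hints at removable features.
print("Lasso coefficients set to zero:", int(np.sum(demo_lasso.coef_ == 0)))
```

If the selected ridge alpha sits at either end of the grid, that is usually a sign the grid should be widened before tuning further.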

## Modeling
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This is the model that best answers your problem statement.)
- Refine and interpret your production model.