## Modeling
- **Establish your baseline score.**
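- Fit linear regression. Look at your coefficients. Are any of them wildly overblown?
    **Holding the other features constant, for every 1 unit increase in $x_i$, we expect SalePrice to increase by $\beta_i$.** (When the features are standardized, one unit is one standard deviation of $x_i$.)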
- Fit lasso/ridge/elastic net with default parameters.
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This does not have to be your best performing Kaggle model, but rather the model that best answers your problem statement.)
- Refine and interpret your production model.

In [42]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics

In [54]:
# Access scaled & Test/Train-Split variables from Notebook: 02_Preprocessing_and_Feature_Engineering
%store -r X_train
%store -r X_test
%store -r X_train_ss
%store -r X_test_ss
%store -r y_train
%store -r y_test

### Model Preparation

##### Instantiate the models

In [69]:
# Instantiate linear regression model
lm = LinearRegression()

In [75]:
# Instantiate LassoCV with a grid of 200 candidate alphas (fitting happens further down)
lasso = LassoCV(n_alphas=200)

In [76]:
# Instantiate RidgeCV with 100 candidate alphas evenly spaced between 0.1 and 10
ridge = RidgeCV(alphas=np.linspace(.1, 10, 100))
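
Once a `LassoCV` or `RidgeCV` object has been fit, the penalty it selected is available as `alpha_`. The sketch below shows how that could be inspected. It runs on synthetic data from `make_regression` (an assumption, not the housing data), and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the housing data used in this notebook)
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=42)
X_demo_ss = StandardScaler().fit_transform(X_demo)

# Candidate ridge penalties, same grid as in the cell above
ridge_alphas = np.linspace(.1, 10, 100)

# Each CV estimator searches its alpha grid during fit
lasso_demo = LassoCV(cv=5).fit(X_demo_ss, y_demo)
ridge_demo = RidgeCV(alphas=ridge_alphas).fit(X_demo_ss, y_demo)

# The selected penalties
print(f'Lasso alpha: {lasso_demo.alpha_}, Ridge alpha: {ridge_demo.alpha_}')
```

If the selected ridge alpha sits at either end of the grid, widening the grid is probably worth a try.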

##### Establish Baseline Score: Cross Validation

In [72]:
# 3. Cross-Val: Baseline Score
lm_scores = cross_val_score(lm, X_train_ss, y_train, cv=5)
lm_scores.mean()

0.7636902331607144

In [73]:
# Use cross_val_score to evaluate LassoCV
lasso_scores = cross_val_score(lasso, X_train_ss, y_train, cv=5)
lasso_scores.mean()

0.7636457522513329

In [78]:
# Use cross_val_score to evaluate RidgeCV
ridge_scores = cross_val_score(ridge, X_train_ss, y_train, cv=5)
ridge_scores.mean()

0.7639045038601507
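
The scores above are computed on features that were scaled before cross-validation. Assuming `X_train_ss` came from a scaler fit on all of `X_train`, each validation fold contributed to the scaler's mean and variance. One possible way to keep the scaling inside each fold is a pipeline, sketched here on synthetic data (an assumption, not the housing data) with the hypothetical names `ridge_pipe` and `pipe_scores`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=42)

# The scaler is refit on the training portion of every fold
ridge_pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=np.linspace(.1, 10, 100)))

# Five R^2 scores, one per held-out fold
pipe_scores = cross_val_score(ridge_pipe, X_demo, y_demo, cv=5)
print(pipe_scores.mean())
```

With a dataset of this size the difference from pre-scaling is usually small, but the pipeline removes the question entirely.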

### So which model performed best?

##### Fit the best performing model

In [90]:
# Fit the linear regression to chosen scaled-features
lm.fit(X_train_ss, y_train)

# 1. Train Score
lm_train_score = lm.score(X_train_ss,y_train)
# 2. Test Score
lm_test_score = lm.score(X_test_ss, y_test)
# 4. Cross-Val "Test" Score
lm_test_scores = cross_val_score(lm, X_test_ss, y_test, cv=5)

print(f'LinearReg: Train={lm_train_score}, Test={lm_test_score}, \nCross-Val: Train={lm_scores.mean()}, Test={lm_test_scores.mean()}')

LinearReg: Train=0.7777852697275695, Test=0.8222516669009068, 
Cross-Val: Train=0.7636902331607144, Test=0.8272680836096882


In [89]:
# Fit the lasso regression to the chosen scaled features (the same data it is scored on below)
lasso.fit(X_train_ss, y_train)

# 1. Train Score
lasso_train_score = lasso.score(X_train_ss,y_train)
# 2. Test Score
lasso_test_score = lasso.score(X_test_ss, y_test)
# 4. Cross-Val "Test" Score
lasso_test_scores = cross_val_score(lasso, X_test_ss, y_test, cv=5)

print(f'LassoCV:   Train={lasso_train_score}, Test={lasso_test_score}, \nCross-Val: Train={lasso_scores.mean()}, Test={lasso_test_scores.mean()}')

LassoCV:   Train=0.7777706441649628, Test=0.8218238760027691, 
Cross-Val: Train=0.7636457522513329, Test=0.8275888228568269


In [88]:
# Fit the ridge regression to the chosen scaled features (the same data it is scored on below)
ridge.fit(X_train_ss, y_train)

# 1. Train Score
ridge_train_score = ridge.score(X_train_ss,y_train)
# 2. Test Score
ridge_test_score = ridge.score(X_test_ss, y_test)
# 4. Cross-Val "Test" Score
ridge_test_scores = cross_val_score(ridge, X_test_ss, y_test, cv=5)

print(f'RidgeCV:   Train={ridge_train_score}, Test={ridge_test_score}, \nCross-Val: Train={ridge_scores.mean()}, Test={ridge_test_scores.mean()}')

RidgeCV:   Train=0.7777766125446623, Test=0.8222867156281176, 
Cross-Val: Train=0.8277320497245618, Test=0.8277320497245618


***To evaluate each model, obtain three scores:***

1. Train
2. Test
3. Cross-Val (the average of the five held-out fold scores). This serves as the baseline $R^2$ for the model.
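
The three scores can be collected with one small helper, sketched below. `three_scores` is a hypothetical function that is not part of this notebook, and the demo runs on synthetic data from `make_regression` (an assumption, not the housing data).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler


def three_scores(model, X_tr, y_tr, X_te, y_te, cv=5):
    """Return train R^2, test R^2, and the mean cross-validated R^2 on the training data."""
    # Cross-validate first: cross_val_score fits clones, so `model` itself stays unfitted here
    cv_mean = cross_val_score(model, X_tr, y_tr, cv=cv).mean()
    # Then fit on the full training set for the train and test scores
    model.fit(X_tr, y_tr)
    return {
        'train': model.score(X_tr, y_tr),
        'test': model.score(X_te, y_te),
        'cross_val': cv_mean,
    }


# Synthetic stand-in data (assumption: not the housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=7, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
scores = three_scores(LinearRegression(), scaler.transform(X_tr), y_tr, scaler.transform(X_te), y_te)
print(scores)
```

The same call works for any estimator with `fit` and `score`, so the linear, lasso, and ridge models can be compared side by side.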

In [None]:
# Predict on the scaled test features, matching the data used for the scores above
pred = ridge.predict(X_test_ss)

In [45]:
# If the Test and Cross-Val scores are similar, the test set is probably representative.
# If they diverge, the test set probably carries a large sampling error.

In [59]:
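# The `lm` object contains our model's coefficients.
# Label them by column name (assumes X_train is a DataFrame whose columns match X_train_ss)
pd.Series(lm.coef_, index=X_train.columns)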

Year Remod/Add     6163.780925
Year Built         8968.494023
1st Flr SF         6612.027035
Total Bsmt SF      5744.896014
Garage Area       10243.796560
Gr Liv Area       22860.742323
Overall Qual      29022.187381
dtype: float64

In [60]:
# And the y-intercept.
lm.intercept_

181807.08072916666
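
Because `lm` was fit on scaled features, each coefficient above describes a one-standard-deviation change in that feature, and the intercept is the mean of the training target. The sketch below shows one way to convert such coefficients back to original units. It assumes the scaling was done with `StandardScaler`, uses synthetic data rather than the housing data, and every name in it is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales (assumption: not the housing data)
X_raw, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
X_raw = X_raw * np.array([1.0, 50.0, 400.0])

# Fit on standardized features, as in this notebook
scaler = StandardScaler().fit(X_raw)
lm_scaled = LinearRegression().fit(scaler.transform(X_raw), y_demo)

# Change in y per one standard deviation of each feature
coef_per_sd = lm_scaled.coef_
# Change in y per one original unit: divide by each feature's standard deviation
coef_per_unit = coef_per_sd / scaler.scale_

# Cross-check against a model fit directly on the unscaled features
lm_raw = LinearRegression().fit(X_raw, y_demo)
print(np.allclose(coef_per_unit, lm_raw.coef_))
print(np.isclose(lm_scaled.intercept_, y_demo.mean()))
```

Per-standard-deviation coefficients are convenient for ranking features against each other, while per-unit coefficients are easier to state in terms such as dollars per square foot.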

## Modeling
- Go back and remove features that might be causing issues in your models.
- Tune hyperparameters.
- **Identify a production model.** (This is the model that best answers your problem statement.)
- Refine and interpret your production model.