<h1 style="font-size:42px; text-align:center; margin-bottom:30px;"><span style="color:SteelBlue">Regression:</span> Model Training</h1>
<hr>

At last, it's time to build our models! 

It might seem like it took us a while to get here, but professional data scientists actually spend the bulk of their time on the 3 steps leading up to this one: 
1. Exploratory Analysis
2. Data Cleaning
3. Feature Engineering

That's because the biggest jumps in model performance are from **better data**, not from fancier algorithms.

<br><hr id="toc">

### In this lesson...

First, we'll load our analytical base table from lesson 3. 

Then, we'll go through the essential modeling steps:

1. [Split your dataset](#split)
2. [Build model pipelines](#pipelines)
3. [Declare hyperparameters to tune](#hyperparameters)
4. [Fit and tune models with cross-validation](#fit-tune)
5. [Evaluate models and select winner](#evaluate)

Finally, we'll save the best model as a project deliverable!

<br><hr>

### First, let's import libraries, recruit models, and load the analytical base table.

Let's import our libraries and load the dataset. It's good practice to keep all of your library imports at the top of your notebook or program.

In [None]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline 

# Seaborn for easier visualization
import seaborn as sns

# Scikit-Learn for Modeling
import sklearn

Next, let's import 5 algorithms we introduced in the previous lesson.

In [None]:
# Import ElasticNet, Ridge, and Lasso Regression from sklearn.linear_model
from sklearn.linear_model import ElasticNet, Ridge, Lasso

# Import RandomForest and GradientBoosting Regressors from sklearn.ensemble
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

<strong>Quick note about this lesson.</strong><br> In this lesson, we'll be relying heavily on Scikit-Learn, which has many helpful functions we can take advantage of. However, we won't import everything right away. Instead, we'll be importing each function from Scikit-Learn as we need it. That way, we can point out where you can find each function.

In [None]:
# Load cleaned dataset from lesson 3
df = pd.read_csv('project_files/real-estate_abt.csv')

print(df.shape)

<br id="split">

# 1. Split your dataset

Let's start with a crucial but sometimes overlooked step: **Splitting** your data.

<br>
First, let's import the <code style="color:steelblue">train_test_split()</code> function from Scikit-Learn.

In [None]:
# Function for splitting training and test set
from sklearn.model_selection import train_test_split

Next, split your dataframe into separate objects for the target variable (<code style="color:steelblue">y</code>) and the input features (<code style="color:steelblue">X</code>).

In [None]:
# Create separate object for target variable
y = df.tx_price
# Create separate object for input features
X = df.drop('tx_price', axis=1)

<br>

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5.1</span>

**First, split <code style="color:steelblue">X</code> and <code style="color:steelblue">y</code> into training and test sets using the <code style="color:steelblue">train_test_split()</code> function.** 
* **Tip:** Its first two arguments should be X and y.
* **Pass in the argument <code style="color:steelblue">test_size=<span style="color:crimson">0.2</span></code> to set aside 20% of our observations for the test set.**
* **Pass in <code style="color:steelblue">random_state=<span style="color:crimson">1234</span></code> to set the random state for replicable results.**
* You can read more about this function in the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html" target="_blank">documentation</a>.

The function returns a list with 4 elements: <code style="color:steelblue">(X_train, X_test, y_train, y_test)</code>. Remember, you can **unpack** it and save it into 4 separate variables.

In [None]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)

Let's confirm we have the right number of observations in each subset.

<br>

**Next, run this code to confirm the size of each subset is correct.**

In [None]:
print("Training Set:") 
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)
print()
print("Testing Set:")
print("X_test:", X_test.shape)
print("y_test:", y_test.shape)

Next, when we train our models, we can fit them on the <code style="color:steelblue">X_train</code> feature values and <code style="color:steelblue">y_train</code> target values.

Finally, when we're ready to evaluate our models on our test set, we would use the trained models to predict <code style="color:steelblue">X_test</code> and evaluate the predictions against <code style="color:steelblue">y_test</code>.
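
To see what the split guarantees, here is a minimal sketch on toy data. The arrays `X_demo` and `y_demo` are made up for illustration and are not part of the real-estate dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 observations, 2 features.
# Row i of X_demo is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)

# 20% of 10 observations go to the test set
print(X_tr.shape, X_te.shape)

# Features and targets are shuffled together, so rows stay paired
rows_stay_paired = np.array_equal(X_te[:, 0], 2 * y_te)
print(rows_stay_paired)

# Re-using the same random_state reproduces the same split
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1234
)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

The fixed `random_state` is what makes results comparable between runs: every model sees exactly the same training rows.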

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">
<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>

<br id="pipelines">

# 2. Build model pipelines

In lessons 1, 2, and 3, you explored the dataset, cleaned it, and engineered new features. However, sometimes we'll want to preprocess the training data even more before feeding it into our algorithms. 

<br>

### Standardization
First, let's show the summary statistics from our training data.

In [None]:
# Summary statistics of X_train
X_train.describe()

Next, standardize the training data manually, creating a new <code style="color:steelblue">X_train_new</code> object.

In [None]:
# Standardize X_train
X_train_new = (X_train - X_train.mean()) / X_train.std()

Let's look at the summary statistics for <code style="color:steelblue">X_train_new</code> to confirm standardization worked correctly.
* How can you tell?

In [None]:
# Summary statistics of X_train_new
X_train_new.describe()
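
One way to check, sketched on a toy DataFrame: after standardization, every column should have a mean of roughly 0 and a standard deviation of roughly 1. The column names and values below are hypothetical stand-ins for <code style="color:steelblue">X_train</code>.

```python
import pandas as pd

# Hypothetical toy features standing in for X_train
toy_train = pd.DataFrame({
    'sqft': [1000.0, 1500.0, 2000.0, 2500.0],
    'beds': [1.0, 2.0, 3.0, 4.0],
})

# Same formula as above: subtract column means, divide by column standard deviations
toy_train_new = (toy_train - toy_train.mean()) / toy_train.std()

# Each column should now have mean ~0 and standard deviation ~1
print(toy_train_new.mean().round(6))
print(toy_train_new.std().round(6))
```

Note that pandas' `.std()` uses the sample standard deviation, while `StandardScaler` divides by the population standard deviation, so the two approaches give slightly different scaled values.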

<br>

### Make Pipeline

In practice, we'll almost never standardize manually, because we'll include preprocessing steps in **model pipelines**.

<br>
So let's import the <code style="color:steelblue">make_pipeline()</code> function from Scikit-Learn.

In [None]:
# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

Now let's import the <code style="color:steelblue">StandardScaler</code>, which is used for standardization.

In [None]:
# For standardization
from sklearn.preprocessing import StandardScaler

<br>

### Next, create a <code style="color:steelblue">pipeline_dict</code> dictionary.

* It should include 3 keys: <code style="color:crimson">'lasso'</code>, <code style="color:crimson">'ridge'</code>, and <code style="color:crimson">'enet'</code>
* The corresponding values should be pipelines that first standardize the data.
* For the algorithm in each pipeline, set <code style="color:steelblue">random_state=<span style="color:crimson">123</span></code> to ensure replicable results.

In [None]:
# Create pipelines dictionary
pipeline_dict = { 'lasso' : make_pipeline(StandardScaler(), Lasso(random_state=123)),
                 'ridge' : make_pipeline(StandardScaler(), Ridge(random_state=123)),
                 'enet' : make_pipeline(StandardScaler(), ElasticNet(random_state=123)) }

In the next exercise, you'll add pipelines for tree ensembles.

<br>

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## <span style="color:RoyalBlue">Exercise 5.2</span>

**Add pipelines for <code style="color:SteelBlue">RandomForestRegressor</code> and <code style="color:SteelBlue">GradientBoostingRegressor</code> to your pipeline dictionary.**
* Name them <code style="color:crimson">'rf'</code> for random forest and <code style="color:crimson">'gb'</code> for gradient boosted tree.
* Both pipelines should standardize the data first.
* For both, set <code style="color:steelblue">random_state=<span style="color:crimson">123</span></code> to ensure replicable results.

In [None]:
# Add a pipeline for 'rf' to 'pipeline_dict'
pipeline_dict['rf'] = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=123))

# Add a pipeline for 'gb' to 'pipeline_dict'
pipeline_dict['gb'] = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=123))

Let's make sure our dictionary has pipelines for each of our algorithms.

<br>

**Run this code to confirm that you have all 5 algorithms, each part of a pipeline.**

In [None]:
# Check that we have all 5 algorithms, and that they are all pipelines
for key, value in pipeline_dict.items():
    print( key, type(value) )

Now that we have our pipelines, we're ready to move on to declaring hyperparameters to tune.

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>

<br>

<br id="hyperparameters">

# 3. Declare hyperparameters to tune

Up to now, we've been casually talking about "tuning" models, but now it's time to treat the topic more formally.

<br>

**First, list all the tunable hyperparameters for your Lasso regression pipeline.** We can do this for any Scikit-Learn algorithm to see which hyperparameters can be tuned. Choosing which ones to tune, and which values to try, is much more of an art than a science.

In [None]:
# List tuneable hyperparameters of our Lasso pipeline
pipeline_dict['lasso'].get_params()
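
The keys in that output follow a naming pattern that the hyperparameter grids below rely on. As a small sketch, using a throwaway pipeline named `demo_pipeline` (a hypothetical name, built the same way as the Lasso pipeline above):

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), Lasso(random_state=123))

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# Parameters of a step are addressed as '<step name>__<parameter name>'
params = demo_pipeline.get_params()
print(params['lasso__alpha'])

# set_params uses the same naming scheme
demo_pipeline.set_params(lasso__alpha=0.5)
print(demo_pipeline.get_params()['lasso__alpha'])
```

This is why the grid keys are written as `'lasso__alpha'` or `'elasticnet__l1_ratio'`: the part before the double underscore picks the pipeline step, and the part after it picks that step's parameter.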

Next, declare hyperparameters to tune for Lasso and Ridge regression.
* Try values between 0.001 and 10 for <code style="color:steelblue">alpha</code>.

> **ex:** [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]

In [None]:
# Lasso hyperparameters
lasso_hyperparameters = { 'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10] }

# Ridge hyperparameters 
ridge_hyperparameters = { 'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10] }

Now declare a hyperparameter grid for Elastic-Net.
* You should tune the <code style="color:steelblue">l1_ratio</code> in addition to <code style="color:steelblue">alpha</code>.

* Try values between 0.1 and 0.9 for <code style="color:steelblue">l1_ratio</code>.
> **ex:** [0.1, 0.3, 0.5, 0.7, 0.9]

In [None]:
# Elastic Net hyperparameters
enet_hyperparameters = {
    'elasticnet__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10], 
    'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]
    }

<br>

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5.3</span>

Let's start by declaring the hyperparameter grid for our random forest.

<br>

**Declare a hyperparameter grid for <code style="color:SteelBlue">RandomForestRegressor</code>.**
* Name it <code style="color:steelblue">rf_hyperparameters</code>

* Set <code style="color:steelblue"><span style="color:crimson">'randomforestregressor__n_estimators'</span>: [100, 200]</code>
* Set <code style="color:steelblue"><span style="color:crimson">'randomforestregressor__max_features'</span>: ['auto', 'sqrt', 0.33]</code>

In [None]:
# Random forest hyperparameters
rf_hyperparameters = { 
    'randomforestregressor__n_estimators' : [100, 200],
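    # Note: 'auto' was removed in scikit-learn 1.3; on newer versions, 1.0 (all features) is the equivalent setting for regressors
    'randomforestregressor__max_features': ['auto', 'sqrt', 0.33],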
}

Next, let's declare settings to try for our boosted tree.

<br>

**Declare a hyperparameter grid for <code style="color:SteelBlue">GradientBoostingRegressor</code>.**
* Name it <code style="color:steelblue">gb_hyperparameters</code>.
* Set <code style="color:steelblue"><span style="color:crimson">'gradientboostingregressor__n_estimators'</span>: [100, 200]</code>
* Set <code style="color:steelblue"><span style="color:crimson">'gradientboostingregressor__learning_rate'</span>: [0.05, 0.1, 0.2]</code>
* Set <code style="color:steelblue"><span style="color:crimson">'gradientboostingregressor__max_depth'</span>: [1, 3, 5]</code>

In [None]:
# Boosted tree hyperparameters
gb_hyperparameters = { 
    'gradientboostingregressor__n_estimators': [100, 200],
    'gradientboostingregressor__learning_rate': [0.05, 0.1, 0.2],
    'gradientboostingregressor__max_depth': [1, 3, 5]
}

<br>

## Now that we have all of our hyperparameters declared, let's store them in a dictionary for ease of access.

<br>

### Create a <code style="color:steelblue">hyperparameters</code> dictionary.
* Use the same keys as in the <code style="color:steelblue">pipeline_dict</code> dictionary.
    * If you forgot what those keys were, you can insert a new code cell and call <code style="color:steelblue">pipeline_dict.keys()</code> for a reminder.
* Set the values to the corresponding **hyperparameter grids** we've been declaring throughout this module.
    * e.g. <code style="color:steelblue"><span style="color:crimson">'rf'</span> : rf_hyperparameters</code>
    * e.g. <code style="color:steelblue"><span style="color:crimson">'lasso'</span> : lasso_hyperparameters</code>

In [None]:
# Create hyperparameters dictionary
hyperparameters = {
    'rf' : rf_hyperparameters,
    'gb' : gb_hyperparameters,
    'lasso' : lasso_hyperparameters,
    'ridge' : ridge_hyperparameters,
    'enet' : enet_hyperparameters
}

<br>

**Finally, run this code to check that <code style="color:steelblue">hyperparameters</code> is set up correctly.**

In [None]:
for key in ['enet', 'gb', 'ridge', 'rf', 'lasso']:
    
    if key in hyperparameters:
        
        if type(hyperparameters[key]) is dict:
            print( key, 'was found in hyperparameters, and it is a grid.' )
            
        else:
            print( key, 'was found in hyperparameters, but it is not a grid.' )
            
    else:
        print( key, 'was not found in hyperparameters')

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">
<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>

<br id="fit-tune">

# 4. Fit and tune models with cross-validation

Now that we have our <code style="color:steelblue">pipeline_dict</code> and <code style="color:steelblue">hyperparameters</code> dictionaries declared, we're ready to tune our models with cross-validation.

### Cross-Validation on a Single Model
First, let's import a helper for cross-validation called <code style="color:steelblue">GridSearchCV</code>.

In [None]:
# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

Next, to see an example, set up cross-validation for Lasso regression.

In [None]:
# Create cross-validation object from Lasso pipeline and Lasso hyperparameters
model = GridSearchCV(pipeline_dict['lasso'], hyperparameters['lasso'], cv=10, n_jobs=-1)

Pass <code style="color:steelblue">X_train</code> and <code style="color:steelblue">y_train</code> into the <code style="color:steelblue">.fit()</code> function to tune hyperparameters.

In [None]:
# Fit and tune model
model.fit(X_train, y_train)

By the way, don't worry if you get the message:

<pre style="color:crimson">ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations</pre>

We'll dive into some of the under-the-hood nuances later.
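
Once a search has been fitted, it remembers which setting won. Here is a minimal sketch on synthetic data (the data, the 3-value grid, and `cv=5` are assumptions chosen to keep the example quick, not the settings used in this lesson):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: y depends on the first of 3 features, plus noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=100)

demo_search = GridSearchCV(
    make_pipeline(StandardScaler(), Lasso(random_state=123)),
    {'lasso__alpha': [0.001, 0.1, 10]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# The winning hyperparameter values
print(demo_search.best_params_)

# Mean cross-validated score of the winner (R^2 by default for regressors)
print(demo_search.best_score_)

# Mean cross-validated score of every candidate, in grid order
print(demo_search.cv_results_['mean_test_score'])
```

With the default `refit=True`, the search also retrains the winning setting on all of the data passed to `.fit()`, which is why the fitted search object can be used directly for prediction.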
<br>

<br>

### In the next exercise, we'll write a loop that tunes all of our models.

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5.4</span>

**Create a dictionary of models named <code style="color:SteelBlue">fitted_models</code> that have been tuned using cross-validation.**
* The keys should be the same as those in the <code style="color:SteelBlue">pipeline_dict</code> and <code style="color:SteelBlue">hyperparameters</code> dictionaries. 
* The values should be <code style="color:steelblue">GridSearchCV</code> objects that have been fitted to <code style="color:steelblue">X_train</code> and <code style="color:steelblue">y_train</code>.
* After fitting each model, print <code style="color:crimson">'name, "has been fitted."'</code> just to track the progress.

This step can take a few minutes, so please be patient.
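
A rough way to see why: a grid search trains one model per hyperparameter combination per fold. The sketch below counts those fits for the grids declared earlier, assuming 10-fold cross-validation; `count_fits` is a hypothetical helper written for this illustration.

```python
from math import prod

# Grids copied from the cells above
grids = {
    'lasso': {'lasso__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]},
    'ridge': {'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]},
    'enet': {
        'elasticnet__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
        'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9],
    },
    'rf': {
        'randomforestregressor__n_estimators': [100, 200],
        'randomforestregressor__max_features': ['auto', 'sqrt', 0.33],
    },
    'gb': {
        'gradientboostingregressor__n_estimators': [100, 200],
        'gradientboostingregressor__learning_rate': [0.05, 0.1, 0.2],
        'gradientboostingregressor__max_depth': [1, 3, 5],
    },
}

cv_folds = 10  # assumed number of cross-validation folds


def count_fits(grid, folds):
    """Number of model fits: (combinations in the grid) x (folds)."""
    combinations = prod(len(values) for values in grid.values())
    return combinations * folds


for name, grid in grids.items():
    print(name, count_fits(grid, cv_folds))

total_fits = sum(count_fits(grid, cv_folds) for grid in grids.values())
print('total', total_fits)
```

Each search also refits its winning setting once on the full training set, so the real count is slightly higher. Adding one more value to a single hyperparameter multiplies the work for that model, which is worth keeping in mind before widening a grid.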

In [None]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through pipeline_dict.items(), grabbing the name and pipeline, creating a new model and tuning it on each iteration.
for name, pipeline in pipeline_dict.items():
    
    # 1. Create cross-validation object from pipeline and hyperparameters
    model = GridSearchCV(pipeline, hyperparameters[name], cv=10, n_jobs=-1)
    
    # 2. Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # 3. Store model in fitted_models[name] 
    fitted_models[name] = model
    
    # 4. Print name 'has been fitted'
    print(name, 'has been fitted.')

<br>

**Run this code to check that the models are of the correct type.**

In [None]:
# Check that we have 5 cross-validation objects
for key, value in fitted_models.items():
    print( key, type(value) )

<br>

**Finally, run this code to check that the models have been fitted correctly.**

In [None]:
from sklearn.exceptions import NotFittedError

for name, model in fitted_models.items():
    try:
        pred = model.predict(X_test)
        print(name, 'has been fitted.')
        
    except NotFittedError as e:
        print(repr(e))

Nice. Now we're ready to evaluate how our models performed!

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>

<br id="evaluate">

# 5. Evaluate models and select winner

### Finally, it's time to evaluate our models and pick the best one.

<br>
Let's display the holdout $R^2$ score for each fitted model.

In [None]:
# Display best_score_ for each fitted model
for name, model in fitted_models.items():
    print(name, model.best_score_)

You should see something similar to the below scores:

  
    lasso 0.309321321129
    ridge 0.316805719351
    enet 0.342759786956
    rf 0.480576134721
    gb 0.48873808731


If your numbers are way off, check to see if you've set the <code style="color:steelblue">random_state=</code> correctly for each of the models.

Next, import the <code style="color:steelblue">r2_score()</code> and <code style="color:steelblue">mean_absolute_error()</code> functions.

In [None]:
# Import r2_score and mean_absolute_error functions
from sklearn.metrics import r2_score 
from sklearn.metrics import mean_absolute_error

<br>

### Let's see how the fitted models perform on our test set!

<br>
First, access your fitted random forest and display the object.

In [None]:
# Display fitted random forest object
fitted_models['rf']

Predict the test set using the fitted random forest.

In [None]:
# Predict test set using fitted random forest
pred = fitted_models['rf'].predict(X_test)

Finally, we use the scoring functions we imported to calculate and print $R^2$ and MAE.

In [None]:
# Calculate and print R^2 and MAE
print('R^2: ', r2_score(y_test, pred))
print('MAE: ', mean_absolute_error(y_test, pred))
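
For intuition, both metrics can be computed by hand. MAE is the average absolute error, and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ are the predictions and $\bar{y}$ is the mean of the actual values. The sketch below uses four made-up prices, not values from the dataset.

```python
# Hypothetical actual and predicted prices
actual = [200000.0, 300000.0, 400000.0, 500000.0]
predicted = [210000.0, 290000.0, 420000.0, 480000.0]

n = len(actual)
mean_actual = sum(actual) / n

# MAE: average size of the errors, in the same units as the target
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n

# R^2: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print('MAE:', mae)
print('R^2:', r2)
```

MAE reads directly in dollars, which makes it easy to compare against a business target, while $R^2$ expresses how much better the model is than simply predicting the average price.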

In the next exercise, we'll evaluate all of our fitted models on the test set and pick the winner.

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5.5</span>

**Using a <code style="color:SteelBlue">for</code> loop, print the performance of each model in <code style="color:SteelBlue">fitted_models</code> on the test set.**
* Print both <code style="color:SteelBlue">r2_score</code> and <code style="color:SteelBlue">mean_absolute_error</code>.
* Those functions each take two arguments:
    * The actual values for your target variable (<code style="color:SteelBlue">y_test</code>)
    * Predicted values for your target variable
* Label the output with the name of the algorithm. For example:

<pre>
lasso
--------
R^2: 0.409313458932
MAE: 84963.5598922
</pre>

In [None]:
# Print test-set R^2 and MAE for each fitted model
for name, model in fitted_models.items(): 
    pred_var = model.predict(X_test)
    print(name)
    print('R^2: ', r2_score(y_test, pred_var))
    print('MAE: ', mean_absolute_error(y_test, pred_var))
    print('===================================')

**Next, ask yourself these questions to pick the winning model:**
* Which model had the highest $R^2$ on the test set?

> Random forest

* Which model had the lowest mean absolute error?

> Random forest

* Are these two models the same one?

> Yes

* Did it also have the best holdout $R^2$ score from cross-validation?

> Yes

* **Does it satisfy our win condition?**

> Yes, its mean absolute error is less than \$70,000!

<br>

**Finally, let's plot the performance of the winning model on the test set. Run the code below.**
* It first draws a scatter plot.
* Predicted transaction price is on the x-axis.
* Actual transaction price is on the y-axis.

In [None]:
rf_pred = fitted_models['rf'].predict(X_test)
plt.scatter(rf_pred, y_test)
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()

This last visual check is a nice way to confirm our model's performance.
* Are the points scattered around the 45 degree diagonal?

<br>
<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>

<br>

### Finally, let's save the winning model.

Great job! You've created a pretty kick-ass model for real-estate valuation. Now it's time to save your hard work.

First, let's take a look at the data type of your winning model.

***Run each code cell below after completing the exercises above.***

In [None]:
type(fitted_models['rf'])

It looks like this is still the <code style="color:steelblue">GridSearchCV</code> data type. 
* You can actually directly save this object if you want, because it will use the winning model pipeline by default. 
* However, what we really care about is the actual winning model <code style="color:steelblue">Pipeline</code>, right?

In that case, we can use the <code style="color:steelblue">best\_estimator_</code> attribute to access it:

In [None]:
type(fitted_models['rf'].best_estimator_)

If we output that object directly, we can also see the winning values for our hyperparameters.

In [None]:
fitted_models['rf'].best_estimator_

See? The winning values for our hyperparameters are:
* <code style="color:steelblue">n_estimators: <span style="color:crimson">200</span></code>
* <code style="color:steelblue">max_features : <span style="color:crimson">'auto'</span></code>

Great, now let's import a helpful package called <code style="color:steelblue">pickle</code>, which saves Python objects to disk.

In [None]:
import pickle

Let's save the winning <code style="color:steelblue">Pipeline</code> object into a pickle file.

In [None]:
with open('saved_models/final_model_employee.pkl', 'wb') as f:
    pickle.dump(fitted_models['rf'].best_estimator_, f)
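
To use a saved model later, reverse the process with `pickle.load()`. The sketch below shows the full round trip on a small stand-in pipeline; the synthetic data, the `Ridge` model, and the temporary file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the winning pipeline, fitted on synthetic data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1]
demo_model = make_pipeline(StandardScaler(), Ridge()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')

    # Save: open the file in binary write mode
    with open(path, 'wb') as f:
        pickle.dump(demo_model, f)

    # Load: open the file in binary read mode
    with open(path, 'rb') as f:
        loaded_model = pickle.load(f)

# The loaded pipeline should predict exactly like the original
predictions_match = np.allclose(
    demo_model.predict(X_demo), loaded_model.predict(X_demo)
)
print(predictions_match)
```

Because the scaler is part of the pipeline, the loaded object applies the same preprocessing automatically. Two practical cautions: load pickles with a Scikit-Learn version compatible with the one that saved them, and only unpickle files from sources you trust.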

Congratulations... you've built and saved a successful model trained using machine learning!

As a reminder, here are a few things you did in this module:
* You split your dataset into separate training and test sets.
* You set up preprocessing pipelines.
* You tuned your models using cross-validation.
* And you evaluated your models, selecting and saving the winner.

<br>
<hr>

<div style="text-align:center; margin: 40px 0 40px 0;">
    
[**Back to Contents**](#toc)
</div>
