# <span style="color:SteelBlue">Lesson:</span> Building an ML Pipeline</h1>
<hr>
We've talked about the process of training a model already which is a failry straight forward process when were talking about a single model, with predefined HyperParameters.

However, in most real life situations we DO NOT know what models may be the best suited for the task at hand, let alone the appropriate hyperparameters for each model. 

**The Solution?**

We will write a series of loops that use *Cross-Validation* to "test" each model, including every combination of relevant hyperparameters —for each model, in order to discern which configuration is the most effective.

**Relevant topics for this section on Building Pipelines:**
1. Dictionaries
2. Looping

<hr>

<br>

## 1. Imports

### Import PyData Libraries

In [2]:
# NumPy for numerical computing
import numpy as np
# Pandas for DataFrames
import pandas as pd
# Pickle for saving model files
import pickle as pkl

### Import SKLearn Classifiers

In [3]:
# Import Logistic Regression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression
# Import RandomForestClassifier and GradientBoostingClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor 
# Import KNeighborsClassifier from sklearn.neighbors 
from sklearn.neighbors import KNeighborsClassifier 

### Import Libraries to make Pipeline

In [4]:
# import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split 
# import make_pipeline from sklearn.pipeline
from sklearn.pipeline import make_pipeline 
# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler 
# import GridSearchCV from sklearn.model_selection
from sklearn.model_selection import GridSearchCV 
# import roc_curve and auc from sklearn.metrics
from sklearn.metrics  import roc_curve, auc 

<hr>
<br>

## 2. Load Data

### `read_csv` `pima_diabetes.csv`

In [5]:
# your code here!
pf = pd.read_csv('./Pima_Indian_Diabetes/data/pima_diabetes.csv')

### `.shape`

In [6]:
# your code here
pf.shape

(768, 9)

### `.head()`

In [32]:

pf.head()
# Your code here!

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


<hr>
<br>

## 3. Initial Data Split

### Split Dataset into `X` and `y` sets

In [7]:
# what should X be?
X = pf.drop('Outcome', axis=1)
X.head()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33


In [8]:
# what should y be?
y = pf['Outcome']
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

### `shape` of `X`

In [9]:
# your code here!
X.shape

(768, 8)

### `shape` of `y`

In [10]:
# your code here
y.shape

(768,)

<br>

### Split `X` and `y` into `Train` and `Test` Sets

**We want to split both the `X` and `y` datasets into Train and Test sets. This will result in 4 datasets total.**

```(X_train, X_test, y_train, y_test)```

> **`X`:** `X_train` and `X_test`

> **`y`:** `y_train` and `y_test`

The **Train Set** is the set that we will operform further splits on, during the cross-validation step. The motivation for this is to find the best combination of model and hyperparameters, which we can only achieve by running performance metrics on every possible combination of Model and Hyperparameters.

The **Test Set** will remain untouched until after the cross-validation step.

### `train_test_split()`

Perform a `train_test_split` with parmeters:
```python 
X
y
test_size should be 0.2, 
Stratify by the `Outcome` column, 
random_state should be 123
```

In [11]:
# your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=123, stratify=y)

### `print()` `shape` of `X_train` and `X_test`

In [12]:
# your code here
print(X_train.shape)
print(X_test.shape)

(614, 8)
(154, 8)


### `print()` `shape` of `y_train` and `y_test`

In [13]:
# your code here
print(y_train.shape)
print(y_test.shape)

(614,)
(154,)


Both **Train** sets, **X_train and y_train**, should have the **same number of rows**

Both **Test** sets, **X_test and y_test**, should also have the **same number of rows**.

<hr>
<br>

## 4. Build Pipeline Dictionary

The pipeline dictionary can contain as many models as your heart's content. We use this approach to automate the process of fitting a model, predicting labels, and finally testing model accuracy.

This will all be done with the SKlearn model_selection function **GridSearchCV()** which expects at least two arguments a **Pipeline** dictionary and a **Hyperparameters** dictionary, with matching keys!!!

Let's start with the **Pipeline Dictionary!**

The `keys` should contain the model name's —as `strings`
The `values` should contain pipeline objects.
>The `pipeline` objects should contain a normalizer/standardizer, and instantiate a model.

<br>

`"model_name": make_pipeline(scalar()/normalizer(), ModelRegressor()/ModelClassifier())`

In [14]:
from sklearn.linear_model import LogisticRegression
# Import RandomForestClassifier and GradientBoostingClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier 
# Import KNeighborsClassifier from sklearn.neighbors 
from sklearn.neighbors import KNeighborsClassifier 

In [25]:
pipeline_dict = {
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty= 'l1', random_state= 123)),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty= 'l2', random_state= 123)),
    'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state= 123)), 
    'kn': make_pipeline(StandardScaler(), KNeighborsClassifier()),
    
}

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## <span style="color:RoyalBlue">Exercise 1</span> - Add Remaining `Pipelines` to `pipeline_dict`

Add the `pipelines` for the three remaining Classifiers: `RandomForrestClassifier`, `GradientBoostClassifier`, and `KNeighborsClassifier`.

> 1. All `Pipelines` in `pipeline_dict` should begin with a `StandrardScaler()`
> 2. All `ModelInstantiations()` should set a `random_state = 123`, except for `KNeighborsClassifier`.

In [16]:
#Do work here
pipeline_dict['gb'] = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=123))

### Check your work! 
Loope through `pipeline_dict` 
`keys` should contain a `sklearn.pipeline.Pipeline` object as a matching `value`.

In [17]:
#Do work here
for x, y in pipeline_dict.items():
    print(x)
    print(type(y))

l1
<class 'sklearn.pipeline.Pipeline'>
l2
<class 'sklearn.pipeline.Pipeline'>
rf
<class 'sklearn.pipeline.Pipeline'>
kn
<class 'sklearn.pipeline.Pipeline'>
gb
<class 'sklearn.pipeline.Pipeline'>


<hr>
<br>

## 5. Build a Hyperparameters Nested-Dictionary

The following describes the goal for this section, however we will begin with some exercises to help understand the concept of a nested dictionary!

**The Final Hyperparameter Dictionary: `hp_dict`**

> Every `key` in `hp_dict` will match the keys from `pipeline_dict`.

> Every `value` in `hp_dict` will be a dictionary.

**Nested Dictionaries inside `hp_dict`**

> Every `key` in each nested `dict` will be a different hyperparameter that needs tuning.

> Every `value` will hold a range of values that each particular hyperparameter can take on.

**Example of an `hp_dict` with values for 2 different models**

*Notice that the `values` of `hp_dict` are surrounded by `{}`, this means that they are themselves dictionaries!*
```python 
hp_dict = {
    'l1':{"logisticregression__C" : np.linspace(1e-4, 1e4, 10)}
}```

In [18]:
for x, y in pipeline_dict.items():
    print(x, '##')
    for j, value in y.get_params().items():
        print(j, '::', value)

l1 ##
memory :: None
steps :: [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]
standardscaler :: StandardScaler(copy=True, with_mean=True, with_std=True)
logisticregression :: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
standardscaler__copy :: True
standardscaler__with_mean :: True
standardscaler__with_std :: True
logisticregression__C :: 1.0
logisticregression__class_weight :: None
logisticregression__dual :: False
logisticregression__fit_intercept :: True
l

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## <span style="color:RoyalBlue">Exercise 2</span> - Performing Action during Loop

**Loop through `pipeline_dict` to print out the `type` of object that is associated with the `get_params()` method, for every model "key."**

**Remember:** Dictionaries in Python are set up like this:
```python 
some_dict = {
    key1 : val1
    key2 : val2
}```


**Hint:** I want you to **`print` every `key`** and the **`type` of object that results from calling the `get_params()` method on every matching `value` of `pipeline_dict`**

In [24]:
# Build a loop that checks type of get_params() for ea value in pipeline_dict 

for x, y in pipeline_dict.items():
    print(type(y.get_params()))

<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>
<class 'dict'>


<br>

**What type of object does `get_params()` return? 
What could we do to said object?**

Thoughts?



<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 3</span> - Nested Dictionary Loop

**Since `get_params()` results in a `dict` object we can use `.items()` to loop through every `key` `value` pair inside of each parameter `dict`!**

There is nothing stopping us from nesting a second for loop from inside a for loop!

This is one of the harder concepts to understand at first, but once you get it, you'll realize how straightforward the process really is!

In [20]:
# Build a nested loop that loops through get_params() dict

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 4</span> - Making individual HyperParameter Dictionaries

### Hyperparameters for L1 aka "LASSO" Logistic Regression
We want to tune:
1. C
2. should be `10` *equally spaced values* between 
> `1e-4` or 0.0001 and `1e4` or 10,000
3. **Hint**: What numpy *generator* function allows us to create a range of **linearly spaced values**?

In [21]:
# Create a dictionary for all of the hyperparameters that you want to tune for this model.

### Hyperparameters for L2 aka "Ridge" Logistic Regression
We want to tune:
1. C
2. should be `10` *equally spaced values* between 
> `1e-4` or 0.0001 and `1e4` or 10,000
3. **Hint**: What numpy *generator* function allows us to create a range of **linearly spaced values**?

In [17]:
# Create a dictionary for all of the hyperparameters that you want to tune for this model.

### Hyperparameters for Random Forest Classifier
We want to tune:
1. n_estimators
2. max_features

In [17]:
# Create a dictionary for all of the hyperparameters that you want to tune for this model.

### Hyperparameters for Gradient Boosting Classifier
We want to tune:
1. n_estimators
2. learning_rate
3. max_depth

In [17]:
# Create a dictionary for all of the hyperparameters that you want to tune for this model.

### Hyperparameters for KNN Classifier
We want to tune:
1. n_neighbors

In [17]:
# Create a dictionary for all of the hyperparameters that you want to tune for this model.

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5a</span> - Assemble `hp_dict`!

Construct the final `hp_dict` with similar naming convention to `pipeline_dict`
ex

## <span style="color:RoyalBlue">Exercise 5b</span> - Loop throu `hp_dict` to verify work!

Write a for loop that prints out each `key` and `value` in `hp_dict`

In [107]:
# Your work here!

<br>

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## 6. Cross-Validation

Now that we have both a `Pipeline_dict` and a `hp_dict` we can build the loop that will handle cross-validation for us.

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 6</span> - Write a loop that uses `GridSearchCV()` to fit models!

In [13]:
# Create empty dictionary called fitted_models

# Loop through model pipelines, tuning each one and saving it to fitted_models
    
    # Create cross-validation object from pipeline and hyperparameters
    
    # Fit model on X_train, y_train
    
    # Store model in fitted_models[name] 
    
    # Print '{name} has been fitted'

## 7. Evaluate metrics

Finally, it's time to evaluate our models and pick the best one.

<br>

**First, display the <code style="color:steelblue">best\_score_</code> attribute for each fitted model.**

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 7</span> - Access `best_score_` attribute for each cross-validation object in `fitted_models`

**Hint: this is what it looks like to access the `best_score_` attribute for a single member of `fitted_models`**

In [15]:
fitted_models['l1'].best_score_

In [16]:
# Loop thorugh fitted_models and check best_score_ for each fitted_model

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">