# <span style="color:SteelBlue">Lesson:</span> Building an ML Pipeline
<hr>
We've already talked about the process of training a model, which is fairly straightforward when we're talking about a single model with predefined hyperparameters.

However, in most real-life situations we DO NOT know which models are best suited to the task at hand, let alone the appropriate hyperparameters for each model.

**The Solution?**

We will write a series of loops that use *Cross-Validation* to "test" each model, including every combination of relevant hyperparameters for that model, in order to discern which configuration is the most effective.
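To make "every combination" concrete, here is a minimal sketch in plain Python. The model names and hyperparameters (`model_a`, `alpha`, and so on) are made up for illustration and are not part of this lesson's dataset or models.

```python
from itertools import product

# Hypothetical search space, for illustration only
search_space = {
    "model_a": {"alpha": [0.1, 1.0, 10.0]},
    "model_b": {"depth": [1, 3], "trees": [10, 100]},
}

# Build one (model, hyperparameter setting) pair per combination
candidates = []
for model_name, grid in search_space.items():
    param_names = list(grid)
    for values in product(*(grid[p] for p in param_names)):
        candidates.append((model_name, dict(zip(param_names, values))))

for model_name, params in candidates:
    print(model_name, params)

print("total candidates:", len(candidates))
```

Each printed line is one configuration that cross-validation would have to score. The number of candidates grows multiplicatively with every hyperparameter added to a grid.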

**Relevant topics for this section on Building Pipelines:**
1. Dictionaries
2. Looping

<hr>

<br>

## 1. Imports

### Import PyData Libraries

In [7]:
# NumPy for numerical computing
import numpy as np

# Pandas for DataFrames
import pandas as pd

# Matplotlib for visualization
import matplotlib.pyplot as plt

# Seaborn for easier visualization
import seaborn as sns

# Pickle for saving model files
import pickle

### Import SKLearn Classifiers

In [35]:
# Import Logistic Regression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier and GradientBoostingClassifier from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

### Import Libraries to make Pipeline

In [36]:
# import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# import make_pipeline from sklearn.pipeline
from sklearn.pipeline import make_pipeline

# import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# import GridSearchCV from sklearn.model_selection
from sklearn.model_selection import GridSearchCV

# import roc_curve and auc from sklearn.metrics
from sklearn.metrics import roc_curve, auc

<hr>
<br>

## 2. Load Data

### `read_csv`

In [10]:
# your code here!

In [11]:
df = pd.read_csv('./pima_indian_diabetes/data/pima_diabetes.csv')

### `.shape`

In [None]:
# your code here!

In [None]:
df.shape

### `.head()`

In [None]:
# Your code here!

In [None]:
df.head()

<hr>
<br>

## 3. Initial Data Split

### Split Dataset into `X` and `y` sets

In [38]:
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

### `shape` of `X`

In [39]:
# your code here!

In [40]:
X.shape

(768, 8)

### `shape` of `y`

In [41]:
# your code here

In [42]:
y.shape

(768,)

### Split `X` and `y` into Train and Test Sets

**We want to split both the `X` and `y` datasets into Train and Test sets. This will result in 4 datasets total.**

```(X_train, X_test, y_train, y_test)```

> **`X`:** `X_train` and `X_test`

> **`y`:** `y_train` and `y_test`

The **Train Set** is the set that we will perform further splits on during the cross-validation step. The motivation is to find the best combination of model and hyperparameters, which we can only do by computing performance metrics for every combination of model and hyperparameters that we want to compare.

The **Test Set** will remain untouched until after the cross-validation step.
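As a rough illustration of what "further splits" means, the sketch below cuts a tiny synthetic training set into 5 folds with `StratifiedKFold`. The data (`X_demo`, `y_demo`) and the fold count are assumptions chosen to keep the output small; they are not the lesson's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a training set: 20 rows, 10 per class
X_demo = np.zeros((20, 2))
y_demo = np.array([0] * 10 + [1] * 10)

cv = StratifiedKFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_demo, y_demo)):
    # Each fold holds out a different slice of the training rows for validation
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("fold", fold, "train rows:", len(train_idx), "validation rows:", len(val_idx))
```

Every row is used for validation exactly once across the folds, and none of these rows come from the Test Set.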

### `train_test_split()`

You will pass the following values to `train_test_split`:
```python 
X
y
test_size = 0.2, 
stratify = df['Outcome'], 
random_state = 123
```

In [43]:
# your code here

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    stratify = df['Outcome'], 
    random_state = 123)

### `print()` `shape` of `X_train` and `X_test`

In [45]:
# your code here

In [46]:
print("X_train:", X_train.shape) 
print("X_test:", X_test.shape)

X_train: (614, 8)
X_test: (154, 8)


### `print()` `shape` of `y_train` and `y_test`

In [47]:
# your code here

In [48]:
print("y_train:", y_train.shape) 
print("y_test:", y_test.shape)

y_train: (614,)
y_test: (154,)


Both **Train** sets, **X_train and y_train**, should have the **same number of rows**.

Both **Test** sets, **X_test and y_test**, should also have the **same number of rows**.
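The `stratify` argument used in the split above keeps the class proportions the same in the Train and Test sets. Here is a small sketch of that behaviour on synthetic data; `X_demo` and `y_demo` are stand-ins for illustration, not the diabetes dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positive labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,
    random_state=123)

# Row counts should match within each pair
print("train rows:", X_tr.shape[0], y_tr.shape[0])
print("test rows:", X_te.shape[0], y_te.shape[0])

# Share of positive labels in each set
print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

Without `stratify`, the positive share in a small Test set can drift away from the share in the full dataset purely by chance.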

<hr>
<br>

## 4. Build Pipeline Dictionary

The pipeline dictionary can contain as many models as you like. We use this approach to automate the process of fitting a model, predicting labels, and finally testing model accuracy.

This will all be done with the scikit-learn `model_selection` class **GridSearchCV()**, which expects at least two arguments: an estimator (here, a **Pipeline**) and a grid of **Hyperparameters**. We will store the pipelines and the hyperparameter grids in two dictionaries with matching keys, so that each pipeline can be paired with its grid!

Let's start with the **Pipeline Dictionary!**

The `keys` should contain the model names, as `strings`.

The `values` should contain pipeline objects.
>The `pipeline` objects should contain a normalizer/standardizer, and instantiate a model.

<br>

`"model_name": make_pipeline(scaler()/normalizer(), ModelRegressor()/ModelClassifier())`

In [49]:
pipeline_dict = {
    'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty= 'l1', random_state= 123)),
    'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty= 'l2', random_state= 123)),
}

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## <span style="color:RoyalBlue">Exercise 1</span> - Add Pipelines to `pipeline_dict`

Add the `pipelines` for the three remaining Classifiers: `RandomForestClassifier`, `GradientBoostingClassifier`, and `KNeighborsClassifier`.

> 1. All `Pipelines` in `pipeline_dict` should begin with a `StandardScaler()`
> 2. All `ModelInstantiations()` should set a `random_state = 123`, except for `KNeighborsClassifier`, which has no `random_state` parameter.

In [50]:
#Do work here
pipeline_dict['rf'] = make_pipeline(StandardScaler(), RandomForestClassifier(random_state= 123))
pipeline_dict['gb'] = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=123))
pipeline_dict['kn'] = make_pipeline(StandardScaler(), KNeighborsClassifier())

### Check your work! 
All `pipeline_dict` `keys` should contain a `sklearn.pipeline.Pipeline` object as a matching `value`.

In [20]:
for key, val in pipeline_dict.items():
    print(key + ":", type(val))

l1: <class 'sklearn.pipeline.Pipeline'>
l2: <class 'sklearn.pipeline.Pipeline'>
rf: <class 'sklearn.pipeline.Pipeline'>
gb: <class 'sklearn.pipeline.Pipeline'>
kn: <class 'sklearn.pipeline.Pipeline'>


<hr>
<br>

## 5. Build a Hyperparameters Nested-Dictionary

The following describes the goal for this section. However, we will begin with some exercises to help understand the concept of a nested dictionary!

**The Final Hyperparameter Dictionary: `hp_dict`**

> Every `key` in `hp_dict` will match the keys from `pipeline_dict`.

> Every `value` in `hp_dict` will be a dictionary.

**Nested Dictionaries inside `hp_dict`**

> Every `key` in each nested `dict` will be a different hyperparameter that needs tuning.

> Every `value` will hold a range of values that each particular hyperparameter can take on.

**Example of an `hp_dict` with values for 2 different models**

*Notice that the `values` of `hp_dict` are surrounded by `{}`. This means that they are themselves dictionaries!*
```python 
hp_dict = {
    'kn': {"kneighborsclassifier__n_neighbors": np.arange(1, 11)},
    'l1': {"logisticregression__C": np.linspace(1e-4, 1e4, 10)},
}
```
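The keys inside each nested dictionary follow the pattern `<step name>__<parameter name>`, with two underscores in between. The sketch below shows where the step name comes from: `make_pipeline` names each step after its class, in lowercase. The variable names (`demo_pipeline`, `step_names`) are only for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline generates step names from the lowercased class names
step_names = [name for name, _ in demo_pipeline.steps]
print(step_names)

# A valid hyperparameter key is "<step name>__<parameter name>"
key = "kneighborsclassifier__n_neighbors"
print(key in demo_pipeline.get_params())
```

If a key in a hyperparameter dictionary does not appear in `get_params()`, the grid search will raise an error when it tries to set that parameter, so this check is a quick way to catch typos.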

<hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## <span style="color:RoyalBlue">Exercise 2</span> - Performing Action during Loop

**Loop through `pipeline_dict` and, for every model "key," print the `type` of the object returned by calling the `get_params()` method on the matching pipeline.**

**Remember:** Dictionaries in Python are set up like this:
```python 
some_dict = {
    key1 : val1,
    key2 : val2,
}
```


**Hint:** I want you to **`print` every `key`** and the **`type` of object that results from calling the `get_params()` method on every matching `value` of `pipeline_dict`**

In [21]:
for key, value in pipeline_dict.items():
    print(key, type(value.get_params()))

l1 <class 'dict'>
l2 <class 'dict'>
rf <class 'dict'>
gb <class 'dict'>
kn <class 'dict'>


**What type of object does get_params() return? 
What could we do to said object?**

Thoughts?



<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 3</span> - Nested Dictionary Loop

**Since `get_params()` results in a `dict` object we can use `.items()` to loop through every `key` `value` pair inside of each parameter `dict`!**

There is nothing stopping us from nesting a second for loop inside a for loop!

This is one of the harder concepts to understand at first, but once you get it, you'll realize how straightforward the process really is!

In [22]:
for key, value in pipeline_dict.items():
    print(key)
    print("-------")
    for k, v in value.get_params().items():
        print(k, v)
    print()

l1
-------
memory None
steps [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]
standardscaler StandardScaler(copy=True, with_mean=True, with_std=True)
logisticregression LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l1', random_state=123, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
standardscaler__copy True
standardscaler__with_mean True
standardscaler__with_std True
logisticregression__C 1.0
logisticregression__class_weight None
logisticregression__dual False
logisticregression__fit_intercept True
logisticregression__intercept

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 4</span> - Making individual HyperParameter Dictionaries

### Hyperparameters for L1 aka "LASSO" Logistic Regression
We want to tune:
1. C

`C` should take `10` *equally spaced values* between
> `1e-4` or 0.0001 and `1e4` or 10,000

**Hint**: What NumPy function allows us to create a range of **equally spaced values**?

In [23]:
# np.linspace(1e-4, 1e4, 10)

In [24]:
l1_hyperparameters = { 
    "logisticregression__C" : np.linspace(1e-4, 1e4, 10) 
}
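It can help to look at what this grid actually contains. The sketch below prints the linearly spaced values used in this exercise next to a logarithmically spaced grid built with `np.logspace`. The log-spaced grid is not part of this exercise; it is shown only as a comparison, since it covers the same range with one value per power of ten.

```python
import numpy as np

# Grid used in this exercise: equal steps between 1e-4 and 1e4
linear_grid = np.linspace(1e-4, 1e4, 10)

# Alternative shown for comparison only: one value per power of ten
log_grid = np.logspace(-4, 4, 9)

print("linear:", linear_grid)
print("log:   ", log_grid)

# How many candidate values fall below 1 in each grid?
print("linear values below 1:", int((linear_grid < 1).sum()))
print("log values below 1:", int((log_grid < 1).sum()))
```

With equal spacing, only the first value is smaller than 1, so nearly all candidates correspond to weak regularization (large `C`).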

### Hyperparameters for L2 aka "Ridge" Logistic Regression
We want to tune:
1. C

In [25]:
# np.linspace(1e-4, 1e4, 10)

In [52]:
l2_hyperparameters = { 
    "logisticregression__C" : np.linspace(1e-4, 1e4, 10) 
}   

### Hyperparameters for Random Forest Classifier
We want to tune:
1. n_estimators
2. max_features

In [27]:
rf_hyperparameters = { 
        "randomforestclassifier__n_estimators" : [10, 100, 200], 
        "randomforestclassifier__max_features" : ["auto", "sqrt","log2",None]  
}     

### Hyperparameters for Gradient Boosting Classifier
We want to tune:
1. n_estimators
2. learning_rate
3. max_depth

In [28]:
gb_hyperparameters = {
        "gradientboostingclassifier__n_estimators" : [10, 100, 200],
        "gradientboostingclassifier__learning_rate" : [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
        "gradientboostingclassifier__max_depth" : [1, 3, 5, 7, 9]
}

### Hyperparameters for KNN Classifier
We want to tune:
1. n_neighbors

In [29]:
# np.arange(1,11)

In [30]:
kn_hyperparameters = { "kneighborsclassifier__n_neighbors" : 
                      np.arange(1,11)
}

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 5a</span> - Assemble `hp_dict`!

Construct the final `hp_dict`, using the same keys as `pipeline_dict`.

In [53]:
hp_dict = {
    'l1' : l1_hyperparameters,
    'l2' : l2_hyperparameters,
    'kn' : kn_hyperparameters,
    'gb' : gb_hyperparameters,
    'rf' : rf_hyperparameters,
}
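Before running a grid search, it is worth checking two things: that the keys match those of `pipeline_dict`, and how many candidate settings each grid defines. The sketch below repeats the grids from Exercise 4 under a separate name (`demo_hp_dict`) so that it runs on its own. `pipeline_keys` is a hand-written stand-in for `pipeline_dict.keys()`.

```python
import math
import numpy as np

# Same grids as in Exercise 4, repeated so this cell runs on its own
demo_hp_dict = {
    "l1": {"logisticregression__C": np.linspace(1e-4, 1e4, 10)},
    "l2": {"logisticregression__C": np.linspace(1e-4, 1e4, 10)},
    "kn": {"kneighborsclassifier__n_neighbors": np.arange(1, 11)},
    "gb": {
        "gradientboostingclassifier__n_estimators": [10, 100, 200],
        "gradientboostingclassifier__learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
        "gradientboostingclassifier__max_depth": [1, 3, 5, 7, 9],
    },
    "rf": {
        "randomforestclassifier__n_estimators": [10, 100, 200],
        "randomforestclassifier__max_features": ["auto", "sqrt", "log2", None],
    },
}

# Stand-in for pipeline_dict.keys()
pipeline_keys = {"l1", "l2", "rf", "gb", "kn"}
keys_match = set(demo_hp_dict) == pipeline_keys
print("keys match:", keys_match)

# Candidates per model = product of the number of values per hyperparameter
n_candidates = {
    name: math.prod(len(values) for values in grid.values())
    for name, grid in demo_hp_dict.items()
}
print(n_candidates)
print("total candidates:", sum(n_candidates.values()))
```

With k-fold cross-validation, each candidate is fit once per fold, so the gradient boosting grid accounts for most of the running time.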

## <span style="color:RoyalBlue">Exercise 5b</span> - Loop through `hp_dict` to verify work!

Write a for loop that prints out each `key` and `value` in `hp_dict`. There is no need to make it look as fancy as mine; it can be done in 3 lines of code. My example took 6.

In [54]:
# Your work here!

In [55]:
for key, val in hp_dict.items():
    print(key)
    print('------')
    for k, v in val.items():
        print(k, v)
    print()

l1
------
logisticregression__C [1.0000000e-04 1.1111112e+03 2.2222223e+03 3.3333334e+03 4.4444445e+03
 5.5555556e+03 6.6666667e+03 7.7777778e+03 8.8888889e+03 1.0000000e+04]

l2
------
logisticregression__C [1.0000000e-04 1.1111112e+03 2.2222223e+03 3.3333334e+03 4.4444445e+03
 5.5555556e+03 6.6666667e+03 7.7777778e+03 8.8888889e+03 1.0000000e+04]

kn
------
kneighborsclassifier__n_neighbors [ 1  2  3  4  5  6  7  8  9 10]

gb
------
gradientboostingclassifier__n_estimators [10, 100, 200]
gradientboostingclassifier__learning_rate [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
gradientboostingclassifier__max_depth [1, 3, 5, 7, 9]

rf
------
randomforestclassifier__n_estimators [10, 100, 200]
randomforestclassifier__max_features ['auto', 'sqrt', 'log2', None]



<br>

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">

## 6. Cross-Validation

Now that we have both a `pipeline_dict` and an `hp_dict`, we can build the loop that will handle cross-validation for us.
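Before writing the full loop, it may help to see a single `GridSearchCV` call in isolation. This is a minimal sketch on synthetic data from `make_classification`, with one pipeline and a deliberately small grid. The data, the grid values, and `cv=5` are assumptions chosen to keep it fast, not the settings used in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=123)

# One pipeline and one small hyperparameter grid
demo_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
demo_grid = {"kneighborsclassifier__n_neighbors": [1, 3, 5]}

# First argument: the estimator. Second argument: the grid to search.
search = GridSearchCV(demo_pipeline, demo_grid, cv=5)
search.fit(X_demo, y_demo)

print("candidates tried:", len(search.cv_results_["params"]))
print("best params:", search.best_params_)
print("best mean CV score:", round(search.best_score_, 3))
```

The loop in the next exercise repeats this same call once per key, pulling the pipeline from `pipeline_dict` and the grid from `hp_dict`.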

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 6</span> - Write a loop that uses `GridSearchCV()` to fit models!

In [56]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipeline_dict.items():
    
    # Create cross-validation object from pipeline and hyperparameters
    model = GridSearchCV(pipeline, hp_dict[name], cv=10, n_jobs=-1)
    
    # Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # Store model in fitted_models[name] 
    fitted_models[name] = model
    
    # Print '{name} has been fitted'
    print(name, 'has been fitted')

l1 has been fitted
l2 has been fitted
rf has been fitted
gb has been fitted
kn has been fitted

## 7. Evaluate metrics

Finally, it's time to evaluate our models and pick the best one.

<br>

**First, display the <code style="color:steelblue">best\_score_</code> attribute for each fitted model.**

<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">
## <span style="color:RoyalBlue">Exercise 7</span> - Access `best_score_` attribute for each cross-validation object in `fitted_models`

**Hint: this is what it looks like to access the `best_score_` attribute for a single member of `fitted_models`**

In [57]:
fitted_models['l1'].best_score_

0.7736156351791531

In [None]:
# Your for-loop here! 

In [58]:
for model_name, cv_obj in fitted_models.items():
    print(model_name, cv_obj.best_score_)

l1 0.7736156351791531
l2 0.7736156351791531
rf 0.757328990228013
gb 0.7654723127035831
kn 0.737785016286645
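With the scores printed, one simple way to rank the models is to sort them by cross-validated score. The sketch below copies the scores from the output above into a plain dictionary (`cv_scores`, a name introduced here for illustration) so that it runs without the fitted models.

```python
# Scores copied from the output above
cv_scores = {
    "l1": 0.7736156351791531,
    "l2": 0.7736156351791531,
    "rf": 0.757328990228013,
    "gb": 0.7654723127035831,
    "kn": 0.737785016286645,
}

# Rank models from highest to lowest cross-validated score
ranking = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")

# max() returns the first key with the highest score, so ties go to the earlier key
best_name = max(cv_scores, key=cv_scores.get)
print("best:", best_name)
```

Here `l1` and `l2` are tied, so cross-validated score alone does not separate them.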


<br><hr style="border-color:royalblue;background-color:royalblue;height:1px;">