<img src="./img/vi_logo.png" style="float: left; margin: 10px; height: 45px">




# Vertical Institute Data Science Bootcamp
# Lesson 7: Advanced Model Evaluation


---


### Learning Objectives

**After this lesson, you will be able to:**
- Review initial EDA strategies
- Explain the intuition behind GridSearch
- Apply GridSearch to tune a KNN model
- Find the optimal hyperparameters of a model

<a name="prologue"></a>
## Prologue to Gridsearch 

When doing exploratory analysis and starting to think about model selection, we have a few good starting points.

* Looking at correlation matrices
* Selecting features (variables) to use in our models
* Considering parameters that might work, in a broad sense
* Validation strategy

A **correlation matrix** is used to investigate the **dependence between multiple variables at the same time**. The result is a table containing the correlation coefficients between each variable and the others. **This is ideal for feature selection when deciding which features to use in a predictive model.**
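
As a minimal sketch of how such a matrix might be computed, the snippet below calls pandas' `DataFrame.corr()` on a small synthetic table. The column names and values are made up for illustration and are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical columns, for illustration only)
rng = np.random.default_rng(0)
income = rng.normal(5000, 1500, size=200)
demo = pd.DataFrame({
    "income": income,
    "loan_amount": 0.03 * income + rng.normal(0, 20, size=200),  # built to track income
    "noise": rng.normal(0, 1, size=200),                         # unrelated to the others
})

# Pairwise Pearson correlation between every pair of columns
corr_matrix = demo.corr()
print(corr_matrix.round(2))
```

Correlation only captures linear relationships, so treat the matrix as a starting point for feature selection rather than a final answer.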

<a id="gs"></a>
## Intro to Gridsearch

What is "gridsearch"? Gridsearch is the process of searching for the optimal set of tuning parameters for a model. It searches across values of parameters and uses cross-validation to evaluate the effect. It's called gridsearch because the idea is that there is a "grid" of parameters that are iteratively searched.


### A Hypothetical Example

Consider these **K-Nearest Neighbors** parameters:

| Parameter | Potential Values |
| --- | ---|
| **n_neighbors** | int range 1-150 |
| **weights** | strs:  "uniform", "distance" or user defined function |
| **algorithm** | strs: "ball_tree", "kd_tree", "brute", "auto" |
| **leaf_size** | int range 1-150 |
| **metric** | str: "minkowski" or a callable |
| **p** | int: 1=manhattan_distance, 2= euclidean_distance |

```python
from sklearn import neighbors

# Search - 1
neighbors.KNeighborsClassifier(n_neighbors=1, weights="uniform", algorithm="ball_tree", leaf_size=30)  # other parameters omitted
# Search - 2
neighbors.KNeighborsClassifier(n_neighbors=2, weights="uniform", algorithm="ball_tree", leaf_size=30)
# Search - 3
neighbors.KNeighborsClassifier(n_neighbors=3, weights="uniform", algorithm="ball_tree", leaf_size=30)
# ...
# ... ** chunk chunk chunk -- hours later **
# ...
# Search - 300,000+
neighbors.KNeighborsClassifier(n_neighbors=150, weights="distance", algorithm="auto", leaf_size=150)
```
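
To see where a number like "300,000+" comes from, here is a rough count of the combinations in a grid like the one in the table. It assumes `leaf_size` runs from 1 to 150 (scikit-learn requires a positive leaf size), `metric` stays fixed at "minkowski", and `p` takes the values 1 and 2. The `grid` dictionary is a hypothetical name used only for this illustration.

```python
from itertools import islice, product
from math import prod

# Candidate values per parameter, mirroring the table above
# (assumption: leaf_size covers 1-150 and metric is fixed to "minkowski")
grid = {
    "n_neighbors": range(1, 151),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute", "auto"],
    "leaf_size": range(1, 151),
    "p": [1, 2],
}

# Total combinations = product of the number of candidate values per parameter
n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)  # 360000

# The first few combinations, in the order a nested loop would visit them
for combo in islice(product(*grid.values()), 3):
    print(dict(zip(grid.keys(), combo)))
```

With 5-fold cross-validation, each combination is fitted five times, so this grid would mean 1.8 million model fits. This is why we normally tune only a handful of parameters and values.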

<a id= "iris"></a>
## Applying GridSearch to the Loan Approval Dataset

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Import GridSearchCV and the KNN classifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Load loan dataset
df = pd.read_csv('clean_loan_approval.csv')

# Creating X and y
y = df['Loan_Status']
X = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History']]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Check out the KNeighborsClassifier parameters
# Don't run this cell. Place your cursor inside the parentheses and press Shift+Tab to read the documentation.
KNeighborsClassifier( <shift tab here!> )

In [None]:
# In practice, we pick only the most important hyperparameters to tune.
# Which ones are important depends on theory, experience, and common practice (documentation and online resources help).
# Set up our GridSearch parameters
search_parameters = {
    'n_neighbors':  [3, 50], 
    'weights':      ["uniform", "distance"],
    'algorithm':    ["ball_tree", "kd_tree", "brute", "auto"],
}

In [None]:
# Initialize KNN 
knn = KNeighborsClassifier()

# Initialize GridSearchCV
# Pass in the model to tune and the hyperparameter grid to search
# verbose > 0 prints more detail in the output below
clf = GridSearchCV(knn, search_parameters, verbose=1)

# Fit on the data; GridSearchCV handles the cross-validation splits internally (5 folds by default)
# We are building a lot of models here and finding the best one!
clf.fit(X, y)

### At this point, GridSearch has:
- Tried every combination of the parameter values in the grid
- Built and cross-validated a model for each unique combination
- Set the attributes for the best parameters, score, and estimator object for further evaluation
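
The steps above can be sketched as a manual loop. This is a simplified illustration of the idea, not scikit-learn's actual implementation. It uses synthetic data from `make_classification` instead of the loan dataset, and the names `X_demo`, `y_demo`, and `demo_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_grid = {"n_neighbors": [3, 50], "weights": ["uniform", "distance"]}

best_score, best_params = -1.0, None
for params in ParameterGrid(demo_grid):      # every combination in the grid
    model = KNeighborsClassifier(**params)   # one model per combination
    score = cross_val_score(model, X_demo, y_demo, cv=5).mean()  # mean CV accuracy
    if score > best_score:                   # remember the best combination so far
        best_score, best_params = score, params

print(best_params, round(best_score, 4))
```

`GridSearchCV` wraps this loop for us and additionally stores per-fold results and refits the best model.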

In [None]:
print("Best Params:", clf.best_params_) # best model's parameters
print("Best Score:", clf.best_score_) # best model's mean cross-validated accuracy

# Look at the first 5 experiments/models
pd.DataFrame(clf.cv_results_).head()
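
`cv_results_` has many columns, so it often helps to keep only a few and sort by rank. The sketch below does this on synthetic data; the names `demo_search` and `summary` are hypothetical, and the column names are the ones scikit-learn writes into `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 10, 30], "weights": ["uniform", "distance"]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = (
    results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(summary)
```

Looking at `std_test_score` alongside the mean gives a sense of whether the top few combinations are meaningfully different from each other.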

### Keep in mind

- This is a small dataset
- These are the minimum steps needed to run a grid search

### Exercise
### Using Grid Search to Improve the Lesson 6 Loan Approval Prediction (Decision Tree)
- We have learnt how to apply GridSearch to KNN.
- Recall that in Lesson 6, we predicted loan approval using a decision tree.
- Below are the steps we took to predict loan approval.

In [None]:
# Load loan dataset again just in case
df = pd.read_csv('clean_loan_approval.csv')

# Creating X and y
y = df['Loan_Status']
X = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History']]

In [None]:
# Let's build a baseline Decision Tree model with DEFAULT hyperparameters
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split into train test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)

# Creating DecisionTreeClassifier model object
tree = DecisionTreeClassifier(random_state = 42)

# Training the model with X_train and y_train
tree.fit(X_train, y_train)

# Predict using X_test
y_preds = tree.predict(X_test)

# Compute the accuracy score
accuracy_score(y_test,y_preds)

### Use GridSearch to Improve the Model
- Our accuracy score is around 0.7317 (the same for all of us because we set `random_state`, the seed).
- Now, using GridSearch, find better parameters for this Decision Tree to reduce overfitting and improve the accuracy score.
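
One quick way to check for overfitting is to compare accuracy on the training data with accuracy on the test data. The sketch below does this for a default decision tree on synthetic data (not the loan dataset); the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# A default tree keeps splitting until every training leaf is pure
default_tree = DecisionTreeClassifier(random_state=42).fit(Xtr, ytr)

train_acc = default_tree.score(Xtr, ytr)  # accuracy on data the tree has seen
test_acc = default_tree.score(Xte, yte)   # accuracy on unseen data
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers suggests the tree has memorised the training data. Limiting `max_depth` or raising `min_samples_leaf` are common ways to narrow that gap.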

#### Parameters on GridSearchCV
`GridSearchCV` implements `fit` and `score` methods.
Use the links below to help you with the exercise:
- GridSearchCV Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
- Scoring Parameters: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
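
The `scoring` argument accepts either a predefined scorer name or a scorer object. The sketch below, on synthetic data with hypothetical variable names, shows that `"accuracy"` and `make_scorer(accuracy_score)` lead to the same result.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo_params = {"max_depth": [2, 4, 6]}

# Option 1: a predefined scorer name
by_name = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_params,
                       scoring="accuracy", cv=5).fit(X_demo, y_demo)

# Option 2: a scorer object built from a metric function
by_object = GridSearchCV(DecisionTreeClassifier(random_state=42), demo_params,
                         scoring=make_scorer(accuracy_score), cv=5).fit(X_demo, y_demo)

print(by_name.best_score_, by_object.best_score_)
```

Other predefined names, such as `"f1"` or `"roc_auc"`, are listed on the scoring page linked above.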

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score # Note: if no scoring is given, GridSearchCV uses the estimator's own score method, which is accuracy for classifiers

clf = DecisionTreeClassifier(random_state=42)

# Create the parameters list we wish to tune.
parameters = {_________________________________________}

# Make an accuracy scoring object.
scorer = make_scorer(accuracy_score) # there are many other scoring objects; refer to the link above

# Create the grid search object, using 'scorer' as the scoring method.
grid_obj = GridSearchCV(_______, _____, scoring=_______, cv=20, n_jobs=-1)

# Fit the grid search object to the training data to find the optimal parameters.
grid_fit = grid_obj.fit(_____, _____)

# View results of GridSearchCV in DataFrame
pd.concat([pd.DataFrame(grid_fit.cv_results_["params"]),pd.DataFrame(grid_fit.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)

# Get the estimator.
best_clf = grid_fit.best_estimator_

# Fit the new model.
best_clf.fit(_____, _____)

# Make predictions using the new model.
best_test_predictions = best_clf.predict(X_test)

# Get the accuracy score
_______________(y_test,best_test_predictions)

#### Solution

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Create model object
clf = DecisionTreeClassifier(random_state=42)

# Create the parameters list we wish to tune.
parameters = {
              'max_depth':[1, 2, 3, 4, 5, 6],
              'min_samples_leaf':[2, 3, 4], 
              'min_samples_split':[10, 20, 30]
             }


# Make an accuracy scoring object. 
# We are asking the CV to evaluate using Accuracy
scorer = make_scorer(accuracy_score)

# Perform grid search on the classifier using 'scorer' as the scoring method.
# Create the grid search object. Feed the model object, parameters and the scoring object into the function
grid_obj = GridSearchCV(clf, parameters, scoring = scorer, cv = 5, verbose=1)

# Fit the grid search object to the training data and find the optimal parameters.
grid_fit = grid_obj.fit(X_train, y_train)

# View results of GridSearchCV in DataFrame
pd.concat([pd.DataFrame(grid_fit.cv_results_["params"]),   
           pd.DataFrame(grid_fit.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)


In [None]:
# Get the estimator (the model object with the BEST hyperparameters)
best_clf = grid_fit.best_estimator_
print(best_clf)

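# Fit the new model.
# (Optional: with GridSearchCV's default refit=True, best_estimator_ has already been refit on X_train.)
best_clf.fit(X_train, y_train)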

# Make predictions using the new model.
best_test_predictions = best_clf.predict(X_test)

# Print the accuracy
accuracy_score(y_test,best_test_predictions)
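
Because `GridSearchCV` refits the best model on the full training data by default (`refit=True`), the fitted search object can also be used directly for prediction. The sketch below checks this on synthetic data; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [1, 2, 3, 4], "min_samples_leaf": [2, 4]},
    cv=5,
)
demo_search.fit(Xtr, ytr)  # refit=True by default: the best model is retrained on all of Xtr

# Both calls use the same refitted model, so no extra .fit() is needed
preds_via_search = demo_search.predict(Xte)
preds_via_estimator = demo_search.best_estimator_.predict(Xte)
print(np.array_equal(preds_via_search, preds_via_estimator))
print("test accuracy:", demo_search.score(Xte, yte))
```

Calling `.fit()` again on `best_estimator_`, as in the solution, is harmless here but not required.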

## Sklearn Pipelines

<img src="img/pipe.png" style="height: 600px">

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Load loan dataset just in case
df = pd.read_csv('clean_loan_approval.csv')

# Creating X and y
y = df['Loan_Status']
X = df[['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History']]

In [None]:
# train test split
X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=42)

# We can use Pipeline to chain preprocessing and modelling steps in a fixed order.
# This keeps the code cleaner and ensures the scaler is fit only on the data passed to fit().
# Optional
# General flow
pipe = Pipeline([('scaler', StandardScaler()), 
                 ('knn', KNeighborsClassifier())])

# Fit each step of the pipeline, in order, on the training data
pipe.fit(X_train, y_train)

# Score/Make predictions using the test data
pipe.score(X_test, y_test)

### Inspecting pipelines

In [None]:
# Look up the pipeline's steps by name
pipe.named_steps
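
Each step's parameters are exposed under the name `<step name>__<parameter name>` (with a double underscore). As a small sketch, the snippet below rebuilds the same two-step pipeline under the hypothetical name `demo_pipe` and lists those names with `get_params()`.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("knn", KNeighborsClassifier())])

# Access an individual step by the name it was given
print(demo_pipe.named_steps["knn"])

# Tunable parameters are exposed as <step name>__<parameter name>
tunable = [name for name in demo_pipe.get_params() if "__" in name]
print(tunable)
```

These are the keys a parameter grid must use when the pipeline is passed to `GridSearchCV`.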

### GridSearchCV on pipelines

In [None]:
search_parameters = {'scaler__with_mean':[True,False],
                     'knn__n_neighbors':  [3,10,30,50],
                     'knn__weights': ["uniform", "distance"],
                    }

In [None]:
gs = GridSearchCV(pipe, search_parameters)

gs.fit(X_train, y_train)

print(gs.best_params_)
print(gs.best_estimator_)
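
After tuning, it is worth comparing the cross-validated score with the score on the held-out test set. The sketch below does this for a scaler plus KNN pipeline on synthetic data (not the loan dataset); the names `demo_pipe` and `demo_gs` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the loan dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

demo_pipe = Pipeline([("scaler", StandardScaler()),
                      ("knn", KNeighborsClassifier())])

demo_gs = GridSearchCV(
    demo_pipe,
    {"knn__n_neighbors": [3, 10, 30], "knn__weights": ["uniform", "distance"]},
    cv=5,
)
demo_gs.fit(Xtr, ytr)  # the scaler is re-fit inside each CV fold, on that fold's training part only

cv_accuracy = demo_gs.best_score_         # mean accuracy across the CV folds
test_accuracy = demo_gs.score(Xte, yte)   # accuracy of the refitted best pipeline on unseen data
print(demo_gs.best_params_)
print(f"CV accuracy: {cv_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Keeping the scaler inside the pipeline means the validation part of each fold never influences the scaling, which keeps the cross-validated score an honest estimate.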


<a name="conclusion"></a>
## Lesson Summary


Let's review what we learned today. We:

- Reviewed initial EDA strategies for model selection
- Went through gridsearch as an optimization method for our estimator
- Found the optimal hyperparameters by searching over parameter values and checking which combination gives the best metric score
- Tuned KNN, decision tree, and pipeline models on the Loan Approval dataset