<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px">
# Advanced Model Evaluation
---
Week 4 | Lesson 4.1

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Review Initial EDA Strategies
- Review Classification Metrics
- Explain the Intuition Behind GridSearch
- Implement GridSearch

### LESSON GUIDE
| TIMING  | TYPE  | TOPIC  |
|:-:|---|---|
| 5 min  | [Opening](#opening)  | Review  |
| 10 min  | [Introduction](#introduction)   | Introduction to gridsearch  |
| 15 min  | [Demo](#demo)  | Multinomial logistic regression  |
| 25 min  | [Guided Practice](#guided-practice)  | Gridsearch with multinomial logistic modeling on crime data |
| 25 min  | [Independent Practice](#ind-practice)  | Classification metrics  |
| 5 min  | [Conclusion](#conclusion)  | Gridsearch and multinomial logistic  |

<a name="opening"></a>
## Opening: First time looking at a dataset, what do we do? (5 mins)

<a name="introduction"></a>
## Prolog to Gridsearch (10 mins)

When doing exploratory analysis and starting to think about model selection, we have a few good starting points.

* Looking at correlation coefficient matrices
* Selecting features (variables) to use in our models
* Considering parameters that might work, in a broad sense
* Validation strategy

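A **correlation matrix** is used to investigate the **pairwise linear dependence between multiple variables at the same time**. The result is a table containing the correlation coefficient between each variable and every other variable. **This is a useful starting point for feature selection when deciding which features to use in a predictive model**: features that correlate strongly with the target are candidates to keep, while pairs of features that correlate strongly with each other may be redundant.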

NumPy and pandas both offer easy-to-use functions for correlation analysis (`numpy.corrcoef` and `DataFrame.corr`). Let's [review the code](../code/starter-code/week4-4.1-breast-cancer-coefficients.ipynb) for computing a Pearson correlation coefficient matrix on the Breast Cancer Dataset.
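As a rough sketch of what that notebook covers, the snippet below computes a Pearson correlation matrix with both libraries. It assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) is an acceptable stand-in for the CSV used in class; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled breast cancer data stands in for
# the CSV file used in the linked notebook.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# pandas: Pearson correlation between every pair of columns
corr_pd = df.corr(method="pearson")

# NumPy: the same matrix; rowvar=False treats each column as a variable
corr_np = np.corrcoef(df.values, rowvar=False)

# Correlation of each feature with the target (0 = malignant, 1 = benign)
target_corr = df.corrwith(pd.Series(data.target)).sort_values()

print(corr_pd.iloc[:3, :3].round(2))
print(target_corr.head())
```

Both calls produce the same square matrix, so the choice mostly comes down to whether you want labeled rows and columns (pandas) or a bare array (NumPy).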

_Note: Has anyone used Gridsearch before?_

## Intro to Gridsearch

What is "gridsearch"? Gridsearch is the process of searching for the optimal set of tuning parameters for a model. It searches across values of parameters and uses cross-validation to evaluate the effect. It's called gridsearch because the idea is that there is a "grid" of parameters that are iteratively searched.
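To make the idea concrete, here is a minimal hand-rolled sketch of what a gridsearch does: loop over every combination of parameter values and score each one with cross-validation. The iris dataset, the small grid, and the 5-fold setting are all assumptions chosen to keep the example fast; they are not part of the lesson materials.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative dataset only (not from the lesson)
X, y = load_iris(return_X_y=True)

# The "grid": every combination of these values gets tried
grid = {"n_neighbors": [1, 5, 15], "weights": ["uniform", "distance"]}

results = []
for n_neighbors, weights in product(grid["n_neighbors"], grid["weights"]):
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    # Mean 5-fold cross-validated accuracy for this one combination
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append((score, n_neighbors, weights))

# Pick the combination with the highest cross-validated score
best_score, best_k, best_weights = max(results)
print(f"best: n_neighbors={best_k}, weights={best_weights}, accuracy={best_score:.3f}")
```

scikit-learn's `GridSearchCV` wraps this same loop and adds conveniences such as refitting the best model on all of the training data.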



### A Hypothetical Example

Consider these **K-Nearest Neighbors** parameters:

| Parameter | Potential Values |
| --- | --- |
| **n_neighbors** | int: range 1-150 |
| **weights** | str: "uniform", "distance", or a user-defined function |
| **algorithm** | str: "ball_tree", "kd_tree", "brute", "auto" |
| **leaf_size** | int: range 1-150 (must be at least 1) |
| **metric** | str: "minkowski" (default) or another supported metric name, or a callable |
| **p** | int: 1 = manhattan_distance, 2 = euclidean_distance |

```python
from sklearn import neighbors

# Each search builds one model per parameter combination.
# Parameters not shown here keep their default values.
# Search - 1
neighbors.KNeighborsClassifier(n_neighbors=1, weights="uniform", algorithm="ball_tree", leaf_size=30)
# Search - 2
neighbors.KNeighborsClassifier(n_neighbors=2, weights="uniform", algorithm="ball_tree", leaf_size=30)
# Search - 3
neighbors.KNeighborsClassifier(n_neighbors=3, weights="uniform", algorithm="ball_tree", leaf_size=30)
# ...
# ... ** chunk chunk chunk -- hours later **
# ...
# Search - 300,000+
neighbors.KNeighborsClassifier(n_neighbors=150, weights="distance", algorithm="auto", leaf_size=150)
```
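To see where a number like "300,000+" comes from, the sketch below counts the combinations in such a grid with scikit-learn's `ParameterGrid`. It assumes `leaf_size` runs from 1 to 150 (the value must be at least 1), leaves `metric` at its default, and uses 5-fold cross-validation purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Grid built from the parameter table above (metric left at its default)
param_grid = {
    "n_neighbors": list(range(1, 151)),
    "weights": ["uniform", "distance"],
    "algorithm": ["ball_tree", "kd_tree", "brute", "auto"],
    "leaf_size": list(range(1, 151)),
    "p": [1, 2],
}

# Number of distinct parameter combinations: 150 * 2 * 4 * 150 * 2
n_candidates = len(ParameterGrid(param_grid))

# Assumption: 5-fold cross-validation, so each candidate is fit 5 times
cv_folds = 5
total_fits = n_candidates * cv_folds

print(f"{n_candidates:,} candidates, {total_fits:,} model fits")
```

Because the count is the product of the list lengths, trimming any one list shrinks the search proportionally. That is why a coarse grid is usually tried first and then refined around the best values.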

<a name="demo"></a>
## Demo: Multinomial logistic regression (15 mins)

Review [KNN GridSearch Example](../code/starter-code/4.1-knn_gridsearch_example.ipynb) and [Classification Report](../code/starter-code/week4-4.1-classification-report.ipynb) techniques for use in independent practice and project work.
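For orientation before opening the notebooks, here is a minimal sketch that combines the two techniques: a small KNN gridsearch followed by a classification report on held-out data. The dataset (scikit-learn's bundled breast cancer data), the grid values, and the split settings are assumptions for illustration and may differ from what the notebooks use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data; hold out 30% for the final evaluation
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Small grid: 4 * 2 = 8 candidates, each scored with 5-fold CV
param_grid = {"n_neighbors": [3, 5, 9, 15], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# The search refits the best candidate, so it can predict directly
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

print(search.best_params_)
print(cm)
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```

The report lists precision, recall, and F1 per class, which is more informative than a single accuracy number when the classes are imbalanced.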


## Guided Practice: Gridsearch with logistic modeling on political data (15-25 mins)
Let's continue practicing these techniques with a familiar model: logistic regression. This practice demonstrates a full-featured, end-to-end implementation that uses GridSearch and logistic regression to predict a binary response.

[Logistic GridSearch with Test Train Split](../code/starter-code/4.1-logistic_example_gridsearch.ipynb)
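The overall shape of that workflow can be sketched as follows. Since the political dataset lives in the notebook, this example substitutes synthetic data from `make_classification`; the grid values, the scaling step, and the split sizes are likewise assumptions rather than the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic binary data stands in for the political dataset
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling inside the pipeline is re-fit on each CV training fold,
# which avoids leaking information from the validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10, 100],
    "logisticregression__class_weight": [None, "balanced"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the refit best model once on the untouched test set
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```

Keeping the test set out of the search means the final score is an honest estimate; the cross-validated `best_score_` tends to be slightly optimistic because it was used to pick the winner.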


<a name="guided-practice"></a>
## (Optional) Guided Practice: Gridsearch with multinomial logistic modeling on crime data (25 mins)

So far, we have been using logistic regression for binary problems where there are only two class labels. As you might have suspected or read in the documentation, logistic regression can be extended to dependent variables with multiple classes.

We will use gridsearch in conjunction with multinomial logistic regression to tune a model that predicts the category (type) of crime from various features recorded by San Francisco police departments.

> Note: Switch to Jupyter notebook here

[Multinomial logistic regression starter](../code/starter-code/gridsearch-multinomial-logistic-starter.ipynb)
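As a warm-up for the starter notebook, the sketch below tunes a multiclass logistic regression with gridsearch. It uses scikit-learn's three-class wine dataset as a stand-in for the crime data, and the grid over `C` is an arbitrary illustration. It also assumes a recent scikit-learn release, where the default `lbfgs` solver fits a multinomial (softmax) model when there are more than two classes.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the 3-class wine data stands in for the crime categories
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Default lbfgs solver; multiclass targets are handled automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# One probability per class for each test row; each row sums to 1
proba = search.predict_proba(X_test)
test_accuracy = search.score(X_test, y_test)

print(search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

The only visible difference from the binary case is that `predict_proba` returns one column per class and the classification report has one row per class.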

<a name="ind-practice"></a>
## Independent Practice: Classification Metrics (25 minutes)

Use the [Wisconsin Breast Cancer Dataset](../../3.1-classification_visualization_with_tableau/assets/datasets/) (wdbc.csv) to practice the GridSearch and classification reporting techniques learned today.

- Create a new notebook
- Load the dataset
- Set up a train/test split
- Implement GridSearch
  - Try 2-3+ parameters
  - Plug in your own parameter ranges (e.g., `['option1', 'option2', 'option3', ...]` or `range(2, 50, 2)`)
  - Multiple-choice parameters are defined with lists or tuples
- Use classification reporting

**Bonus**
- Apply the same evaluation tasks to another dataset with categorical predictors, using KNN


<a name="conclusion"></a>
## Conclusion (5 mins)
- Review independent practice deliverable(s)
- Recap GridSearch