

## Learning Objectives
By the end of this lesson students will be able to:

- Understand what cross validation is
- Understand what grid searching is
- Use the `GridSearchCV` class from sklearn to find optimal hyperparameters
- Differentiate `cross_val_score` from `GridSearchCV`

---

### Cross Validation

*"Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice."* -- [Wikipedia](https://en.wikipedia.org/wiki/Cross-validation_(statistics))

In short, this is a way for us to better understand the quality of the predictions made by our estimator. 
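As a minimal sketch of the idea, the block below estimates accuracy with 5-fold cross validation. The choice of a scaled logistic regression, `cv=5`, and `max_iter=1000` are assumptions made for illustration, not settings prescribed by the lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each training fold only.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),  # max_iter is an assumed setting
])

# cv=5: five rounds, each holding out a different fifth of the rows for scoring.
scores = cross_val_score(model, X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

The spread of the per-fold scores gives a rough sense of how much a single train/test split could mislead.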

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import ConfusionMatrixDisplay


#### Classification Review

Build and compare a KNN and Logistic Regression model.

In [2]:
cancer = load_breast_cancer(as_frame=True).frame
cancer.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [None]:
#train/test split


In [None]:
#pipeline for knn to scale and model


In [2]:
#pipeline for logistic to scale and model


In [None]:
#evaluate knn 


In [3]:
#evaluate logistic
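
One possible way to fill in the review cells above is sketched here. The step names (`scaler`, `knn`, `lgr`), `random_state=42`, `stratify=y`, and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer(as_frame=True).frame
X = cancer.drop(columns="target")
y = cancer["target"]

# Hold out a test set; random_state and stratify are assumed choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

# Each pipeline scales the features, then fits a classifier.
knn_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])
lgr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lgr", LogisticRegression(max_iter=1000)),
])

for name, pipe in [("knn", knn_pipe), ("logistic", lgr_pipe)]:
    pipe.fit(X_train, y_train)
    # For classifiers, .score() reports accuracy.
    print(name, pipe.score(X_train, y_train), pipe.score(X_test, y_test))
```

Comparing the train and test accuracy for each model is a quick check for overfitting before moving on to cross validation.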


#### Confusion Matrix

In [4]:
#confusion matrix display


In [5]:
#side by side for knn and logistic
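
As a rough sketch of the comparison, the block below computes the raw confusion matrices for both models with `confusion_matrix` rather than plotting them, so that it runs without a display. The split settings and step names are assumptions. To draw the same counts as a plot, `ConfusionMatrixDisplay.from_estimator(pipe, X_test, y_test)` is the usual route.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y  # assumed split settings
)

models = {
    "knn": Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]),
    "logistic": Pipeline([
        ("scaler", StandardScaler()),
        ("lgr", LogisticRegression(max_iter=1000)),
    ]),
}

matrices = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    # Rows are the true classes, columns are the predicted classes.
    matrices[name] = confusion_matrix(y_test, pipe.predict(X_test))
    print(name)
    print(matrices[name])
```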


#### K-Fold Cross Validation

![](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
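
To make the folds concrete, here is a small sketch using `KFold` on a toy array (the ten-row array is made up for illustration). Each row is held out exactly once across the five folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy rows, two features

# Without shuffling, the folds are consecutive blocks of rows.
kfold = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(f"fold {i}: train rows {train_idx}, held-out rows {test_idx}")
```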

In [9]:
# cross validate 


In [10]:
# knn with 35 neighbors


## GridSearch CV
`GridSearchCV` is a scikit-learn class.

It cross-validates an estimator for every combination of hyperparameter values in a grid you supply.

It replaces the slow, verbose approach of calling `cross_val_score` inside a `for` loop.

`GridSearchCV` is a convenient default for tuning hyperparameters when the grid is small enough to search exhaustively.
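
For comparison, the manual loop might look something like the sketch below. The candidate values for `n_neighbors` and the step names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

mean_scores = {}
for k in [1, 5, 15, 35]:  # candidate values chosen for illustration
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=k)),
    ])
    # Average the five per-fold accuracies for this value of k.
    mean_scores[k] = cross_val_score(pipe, X, y, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```

The loop works, but the bookkeeping (storing scores, picking the winner, refitting) is exactly what `GridSearchCV` handles for you.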

## Hyperparameters vs parameters

- __Definition 1 of `parameters`__: a function "defines a parameter, and the calling code passes an argument to that parameter. You can think of the parameter as a parking space and the argument as an automobile." - Quoted from MSDN in [this SO question](https://stackoverflow.com/q/1788923/4590385).

The values you pass to a function are called `arguments`. The terms _argument_ and _parameter_ are often used interchangeably.

- __Definition 2 of `parameters`__: the weights in a model. For example, the $ \beta $ values in a linear regression equation. These are the model's parameters.

- `hyperparameters` are the arguments YOU CHOOSE to pass to a transformer or estimator. You tune these to improve model performance. For example, the most important hyperparameter for a scikit-learn Ridge regression model is `alpha`. 
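
The distinction can be seen directly on a fitted model. In this sketch, the diabetes dataset and `alpha=1.0` are assumptions chosen only to have something to fit.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

# alpha is a hyperparameter: we choose it before fitting.
ridge = Ridge(alpha=1.0)
print(ridge.get_params()["alpha"])

# coef_ and intercept_ are parameters: the model learns them during fit.
ridge.fit(X, y)
print(ridge.coef_)
print(ridge.intercept_)
```

Changing `alpha` and refitting gives different learned coefficients, which is why hyperparameters are worth tuning.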


### Just remember: YOU choose the hyperparameters.

---
## GridSearchCV 

- `GridSearchCV` cross-validates one candidate model per combination of hyperparameter values, using the data you fit it on.
- It keeps the best-performing candidate and, by default, refits it on all the data you passed to `fit`.
- You treat it like an estimator.

`GridSearchCV` accepts a scikit-learn `estimator` object and a **parameter grid**.

- The param grid is a dictionary. 
- The key is the name of the hyperparameter argument in scikit-learn.  
- The value is an iterable to search over (generally a list or a range-style object).



#### Set up a parameter grid with several values for `n_neighbors`

#### Instantiate a GridSearchCV object by passing it an estimator and a param_grid.

#### We use this GridSearch object like it's an estimator, fitting, predicting, and scoring it like normal.

##### Fit it on the training data

##### Score on the training data

##### See all the results of training

##### What were the best params?

##### Make predictions for the test set

##### Score on the train and test set

##### Score the best model on the test data with the default scoring metric
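
The steps outlined above can be sketched as follows. The grid values, `cv=5`, and the split settings are assumptions made for illustration. The KNN model is left unscaled here to keep the grid keys simple.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y  # assumed split settings
)

# Parameter grid: key = hyperparameter name, value = values to try.
param_grid = {"n_neighbors": [1, 5, 15, 35]}

# Pass an estimator and a param_grid; cv=5 is spelled out for clarity.
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)

# Fit on the training data: every candidate is cross-validated,
# then the best one is refit on all of X_train.
grid.fit(X_train, y_train)

# See all the results of training.
results = pd.DataFrame(grid.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])

# What were the best params, and their mean cross-validated score?
print(grid.best_params_)
print(grid.best_score_)

# Make predictions for the test set with the refit best model.
preds = grid.predict(X_test)

# Score on the train and test sets with the default metric (accuracy).
print(grid.score(X_train, y_train))
print(grid.score(X_test, y_test))
```

Note that `best_score_` is the mean score across the validation folds, so it will usually differ from the score on the held-out test set.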

### Grid Searching a `Pipeline`

Here, each key in the parameter grid is the name of the pipeline step, then a double underscore, then the name of the parameter to search. For example, `knn__n_neighbors` searches the `n_neighbors` parameter of a step named `knn`.
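
A minimal sketch of the double-underscore convention is shown below. The step names `scaler` and `knn`, the grid values, and the split settings are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y  # assumed split settings
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# "<step name>__<parameter name>" targets a parameter of one pipeline step.
param_grid = {"knn__n_neighbors": [1, 5, 15, 35]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_estimator_)  # the whole pipeline, refit with the best params
```

Because the scaler is inside the pipeline, it is re-fit on the training portion of every fold, which avoids leaking information from the validation fold.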

In [11]:
# pipeline for scaling and knn


In [12]:
# parameter grid


In [13]:
# grid search object


In [14]:
# fit the train data


In [15]:
# what was the best estimator?


#### Problem

Set up a `Pipeline` with steps that:

1. Generate Polynomial Features for data: `poly`
2. Scale the data: `scaler`
3. Select Features: `selector`
4. Build Logistic Regression model: `model`

In [17]:
#pipeline


In [18]:
# parameter grid


In [19]:
# grid search


In [20]:
# score


In [21]:
# view confusion matrix
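
One possible approach to the problem is sketched here, not a reference solution. The grid values, `include_bias=False`, `max_iter=5000`, and the split settings are all assumptions. The confusion matrix is computed as raw counts with `confusion_matrix` so that the block runs without a display.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y  # assumed split settings
)

pipe = Pipeline([
    # include_bias=False avoids a constant column, which carries no signal.
    ("poly", PolynomialFeatures(include_bias=False)),
    ("scaler", StandardScaler()),
    ("selector", SelectKBest()),
    ("model", LogisticRegression(max_iter=5000)),
])

# Keys follow the "<step name>__<parameter name>" pattern.
param_grid = {
    "poly__degree": [1, 2],
    "selector__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, grid.predict(X_test))
print(cm)
```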
