Copyright 2020 Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Crossvalidation and Nested crossvalidation

The goal of all models we build is to **predict** and **generalize**.
If our model only works for the data we train it on, and doesn't work for anything else, it's not a very useful model.

One way to test generalization is to split the data into training and testing data.
The training data is used to estimate the model's parameters, and the testing data is used to determine how much we can trust the model's predictions on unseen data.
If the performance of the model is very good on the training data but very poor on the testing data, we say that the model has **overfit** the training data.
If the model's performance on the testing data is as good as or better than its performance on the training data, then we tentatively conclude that the model will generalize well.

But what if we're wrong?
What if we got lucky with a train/test split, such that the testing data is "easy" and very similar to the training data?
One way to increase our confidence is to train and test **repeatedly with different train/test splits**.
But again, we have a problem: what if our train/test splits overlap substantially?
We only get more information about model performance if the splits are different, not when they are the same.
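
The overlap problem can be seen in a small sketch using only the Python standard library. The dataset size, test-set size, and number of repeats below are hypothetical values chosen for illustration, not values taken from this notebook:

```python
import random

n_rows = 150      # hypothetical dataset size
test_size = 30    # hypothetical: 20% of rows held out in each split
n_repeats = 5     # hypothetical number of repeated train/test splits

rng = random.Random(0)  # fixed seed so the sketch is repeatable
times_tested = [0] * n_rows

for _ in range(n_repeats):
    # Each repeat draws a fresh random test set, independently of the others
    test_rows = rng.sample(range(n_rows), test_size)
    for row in test_rows:
        times_tested[row] += 1

never_tested = sum(1 for count in times_tested if count == 0)
tested_repeatedly = sum(1 for count in times_tested if count > 1)

print("rows never used for testing:", never_tested)
print("rows used for testing more than once:", tested_repeatedly)
```

Because the test sets are drawn independently, nothing stops some rows from being tested several times while others are never tested at all.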

### What you will learn

In this notebook you will learn about methods to avoid overfitting during training and to increase our confidence in model generalization. We will study the following:

- Crossvalidation
- Nested crossvalidation

### When to use crossvalidation

It's generally recommended to use crossvalidation in standard practice, rather than a single train/test split.
Fortunately crossvalidation is so common that many data science frameworks make it relatively easy to use.
An important consideration is what evaluation metrics mean when models are trained with crossvalidation, so keep evaluation metrics in mind as you progress through this notebook.

## Crossvalidation

**Crossvalidation** is a fairly simple yet elegant idea that solves these problems.
Crossvalidation lets us use **all** our data for both training and testing, by partitioning it into separate sets, typically called **folds**.

Figure 1 shows all the data split into 5 folds.
Crossvalidation training would create five models using these five folds, such that each model uses a different fold for testing (blue).
All the other folds are used for training.
For example, one model would use fold 1 for testing and folds 2-5 for training.

<!-- ![image.png](attachment:image.png) -->
<img src="attachment:image.png" width="500">
<center><b>Figure 1. Crossvalidation with five folds.</b> Source: <a href="https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation">Adapted from scikit-learn</a></center>

Using crossvalidation, all data is used for both training and testing **but not at the same time.**
This allows us to train with all the data while still being able to detect overfitting, because each model is tested on a fold it never saw during training.
It also lets us robustly test generalization because we've used all the data for testing as well.
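
As a minimal sketch of the fold logic described above, the following standard-library code builds five folds by hand. The tiny dataset size is hypothetical and is assumed to divide evenly into the folds; in practice a library would handle the partitioning:

```python
n_rows = 10   # hypothetical tiny dataset; assumed to divide evenly by n_folds
n_folds = 5

rows = list(range(n_rows))
fold_size = n_rows // n_folds

# Cut the rows into contiguous folds, as in Figure 1
folds = [rows[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

splits = []
for i, test_rows in enumerate(folds):
    # Every fold except fold i is used for training
    train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
    splits.append((train_rows, test_rows))
    print(f"model {i + 1}: test on {test_rows}, train on {train_rows}")

# Collect every row that appears in some test fold
all_test_rows = sorted(r for _, test_rows in splits for r in test_rows)
```

Each row lands in exactly one test fold, so `all_test_rows` covers the whole dataset, yet no model is ever tested on a row it was trained on.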

Crossvalidation works nicely with all of the standard performance metrics like $r^2$, accuracy, and precision/recall: because the predictions from separate models are on separate folds, we can calculate performance metrics as though the predictions came from a single model.
Alternatively, we can calculate performance on each fold separately to get a distribution of performance.
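
The two ways of scoring can be compared in a short sketch. The labels and predictions below are made up purely for illustration; they are not results from any model in this notebook:

```python
# Hypothetical (true labels, predicted labels) for each of three test folds
fold_results = [
    (["a", "a", "b", "b"], ["a", "a", "b", "a"]),
    (["a", "b", "b", "a"], ["a", "b", "b", "a"]),
    (["b", "a", "a", "b"], ["a", "a", "b", "b"]),
]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Option 1: score each fold separately to get a distribution of performance
per_fold = [accuracy(true, pred) for true, pred in fold_results]

# Option 2: pool all folds and score them as if one model made every prediction
pooled_true = [t for true, _ in fold_results for t in true]
pooled_pred = [p for _, pred in fold_results for p in pred]
pooled = accuracy(pooled_true, pooled_pred)

print("per-fold accuracy:", per_fold)
print("pooled accuracy:", pooled)
```

The pooled score gives a single summary, while the per-fold scores show how much performance varies from fold to fold.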

Finally, you might see the connections between crossvalidation and out-of-bag (OOB) error with bagging.
Crossvalidation could be viewed as a generalization of OOB, since crossvalidation can be used with any model.
Crossvalidation is also a simplification over OOB, since OOB requires keeping track of which aggregated models are "allowed" to make a prediction on a datapoint, whereas in crossvalidation, a single model makes predictions on its test fold, and that's it.

## Example: Crossvalidation

Let's take a look at the `iris` dataset, which consists of measurements of sepals and petals, as well as the class label for three species of iris.

| Variable     | Type    | Description                                  |
|:-------------|:--------|:---------------------------------------------|
| Sepal Length | Ratio   | Length of the leaves that support the petals |
| Sepal Width  | Ratio   | Width of the leaves that support the petals  |
| Petal Length | Ratio   | Length of petals                             |
| Petal Width  | Ratio   | Width of petals                              |
| Species      | Nominal | Setosa, Versicolour, Virginica               |

<div style="text-align:center;font-size: smaller">
 <b>Source:</b> This dataset was taken from the <a href="https://archive.ics.uci.edu/ml/datasets/iris">UCI Machine Learning Repository</a></div>
<br>

Because this is a familiar dataset, and because the focus is on crossvalidation, we will skip data exploration steps and go quickly to modeling.

### Load data

Import `pandas` so we can load a dataframe:

- `import pandas as pd`

In [16]:
import pandas as pd

Load the dataframe using a CSV file:

- Set `dataframe` to with `pd` do `read_csv` using
    - `"datasets/iris.csv"`
- `dataframe`

In [17]:
dataframe = pd.read_csv('datasets/iris.csv')

dataframe

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


### Prepare train/test sets

Let's separate our predictors (`X`) from our class label (`Y`), putting each into its own dataframe:

- Create `X` and set to with `dataframe` do `drop` using
    - freestyle `columns=["Species"]`
- Create `Y` and set to `dataframe [ ]` containing a list with `"Species"` inside

In [18]:
X = dataframe.drop(columns=["Species"])

Y = dataframe[['Species']]

### Train model with crossvalidation

We need libraries for Gaussian naive Bayes, crossvalidation, and `ravel`:

- `import sklearn.model_selection as model_selection`
- `import sklearn.naive_bayes as naive_bayes`
- `import numpy as np`

In [19]:
import sklearn.model_selection as model_selection
import sklearn.naive_bayes as naive_bayes
import numpy as np

There are several different options for crossvalidation output.
The most straightforward is probably `cross_val_predict`, which takes a model, data, and specifications for crossvalidation, and then trains the model and makes predictions on the test folds:

- Create variable `predictions`
- Set it to with `model_selection` do `cross_val_predict` using
    - with `naive_bayes` create `GaussianNB` using
    - `X`
    - with `np` do `ravel` using `Y`
    - freestyle `cv=10` (for 10 folds)
    
**Note:** we can also use a pipeline instead of a vanilla model if we need to scale variables, etc.

In [20]:
predictions = model_selection.cross_val_predict(naive_bayes.GaussianNB(),X,np.ravel(Y),cv=10)

Notice how much simpler this is than creating train/test splits!
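
The note about pipelines can be illustrated with a short sketch. It assumes that scikit-learn's bundled copy of the iris data is an acceptable stand-in for `datasets/iris.csv`, and the stage names `"scale"` and `"nb"` are arbitrary labels chosen here, not names from this notebook:

```python
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import sklearn.naive_bayes as naive_bayes
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp

# Assumption: the bundled iris data stands in for datasets/iris.csv
X_demo, y_demo = datasets.load_iris(return_X_y=True)

# A pipeline is passed to cross_val_predict exactly like a plain model
scaled_model = pipe.Pipeline([
    ("scale", pp.StandardScaler()),   # refit on the training folds of every split
    ("nb", naive_bayes.GaussianNB()),
])

pipeline_predictions = model_selection.cross_val_predict(
    scaled_model, X_demo, y_demo, cv=10
)
print(pipeline_predictions.shape)
```

Putting the scaler inside the pipeline means it is fit only on the training folds of each split, so no information from a test fold leaks into the scaling.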

### Evaluate the model

To measure performance, we need `sklearn.metrics`:

- `import sklearn.metrics as metrics`

In [21]:
import sklearn.metrics as metrics

We can get the accuracy by comparing the predictions to *all* of `Y`:

- with `metrics` do `accuracy_score` using
    - `Y`
    - `predictions`

In [22]:
metrics.accuracy_score(Y,predictions)

0.9533333333333334

And similarly we can get the recall and precision using all of `Y`:

- `print` with `metrics` do `classification_report` using
    - `Y`
    - `predictions`
    

In [23]:
print(metrics.classification_report(Y,predictions))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.92      0.94      0.93        50
   virginica       0.94      0.92      0.93        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150



Because we split the data 10 different ways and trained 10 different models, we can be reasonably confident that we will see similar performance on new data.

There are [other crossvalidation methods](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation) besides `cross_val_predict` that return metrics per fold if you would like to see the range of performance across folds.
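
One such method is `cross_val_score`, sketched below. As an assumption for the sake of a self-contained example, scikit-learn's bundled iris data is used in place of `datasets/iris.csv`:

```python
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import sklearn.naive_bayes as naive_bayes

# Assumption: the bundled iris data stands in for datasets/iris.csv
X_demo, y_demo = datasets.load_iris(return_X_y=True)

# One score per test fold (accuracy is the default score for classifiers)
fold_scores = model_selection.cross_val_score(
    naive_bayes.GaussianNB(), X_demo, y_demo, cv=10
)

print("scores per fold:", fold_scores)
print("mean:", fold_scores.mean())
print("range:", fold_scores.min(), "to", fold_scores.max())
```

A wide range across folds would suggest that performance depends heavily on which data the model happens to be tested on.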

## Nested crossvalidation

Not only can crossvalidation help us better estimate model generalization, but it can also help us find **hyperparameters** of our models.
A hyperparameter is something you specify when you create the model, as opposed to a parameter the model learns.
Common examples we've encountered are the value for `K` in KNN, regularization parameters in ridge regression and lasso regression, and the `C` parameter for margin softness in SVM.
Crossvalidation works for hyperparameters because the problem we face with them is much the same as with model fit: we can choose a hyperparameter that works well for some data, but it might not work well on other data.
So if we use crossvalidation, we can get a good idea of how well a particular value for a hyperparameter works across the dataset.

However, for hyperparameter crossvalidation to be really powerful, we need to explore multiple candidate values for the hyperparameter, and then use crossvalidation for each one.
**Grid search** is a simple way of defining candidate values for hyperparameters.
Simply stated, vanilla grid search takes a list of candidate hyperparameter values you define and creates a separate model to evaluate each of those hyperparameter values.

Grid search and crossvalidation work together like this:

- Grid search gives crossvalidation a hyperparameter value
- Crossvalidation builds as many models using that hyperparameter as there are folds
- The average performance across folds is used to score the hyperparameter
- Grid search gives crossvalidation another hyperparameter value and the process repeats
- Once all hyperparameter values have been scored, the best is returned by grid search
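
The steps above can be sketched as an explicit loop. This is only an illustration of the idea, not how any library implements grid search internally; it assumes scikit-learn's bundled iris data, a KNN classifier, and a hypothetical candidate list for `K`:

```python
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors

# Assumption: the bundled iris data is used as example data
X_demo, y_demo = datasets.load_iris(return_X_y=True)

candidate_k = [1, 5, 15]   # hypothetical candidate values for K
mean_scores = {}

for k in candidate_k:
    # Crossvalidation builds one model per fold using this value of K
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    fold_scores = model_selection.cross_val_score(knn, X_demo, y_demo, cv=10)
    # The average performance across folds scores the hyperparameter
    mean_scores[k] = fold_scores.mean()

# Once every candidate has been scored, keep the best one
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best K:", best_k)
```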

While grid search and crossvalidation are very powerful together, they potentially create another problem for us.
As we've discussed, we don't want to train and test with the same data, because then we don't know for sure our model will generalize.
The problem is that if we do grid search to find hyperparameters on the same data we test our model on, then we've trained and tested with the same data.
We could solve this problem by splitting our data into two sets, one for hyperparameter search + model training, and one for testing.
However, that approach takes us back to the train/test split problem - what if we get a lucky split?
We already solved the train/test split problem with crossvalidation, so can we solve this problem using crossvalidation too?

The answer is yes, we can solve train/test splits for hyperparameters and normal model training at the same time using **nested crossvalidation**.
Simply stated, nested crossvalidation partitions the data into folds and then partitions the training portion of each split into folds again.
The first set of folds (the **outer folds**) is used to train and test the model in the way we discussed in the last section.
The second set of folds (the **inner folds**) is used to evaluate the hyperparameters, e.g. using grid search.
The reason that nested crossvalidation works is that when the model is tested on a fold, it hasn't used that fold to estimate its hyperparameters or its parameters.
The process for one fold is shown in the figure below.
Notice that the test component of this fold is not used to set hyperparameters nor is it used to train the model.

<!-- ![image.png](attachment:image.png) -->
<img src="attachment:image.png" width="500">
<center><b>Figure 2. Nested crossvalidation.</b> Source: <a href="https://stats.stackexchange.com/questions/319253/k-nearest-neighbors-with-nested-cross-validation">StackExchange</a></center>
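
To make the nesting explicit, here is a minimal sketch that writes the outer loop by hand. It assumes scikit-learn's bundled iris data, a KNN classifier, a hypothetical candidate list for `K`, and five folds at both levels; these choices are for illustration only:

```python
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors

# Assumption: the bundled iris data is used as example data
X_demo, y_demo = datasets.load_iris(return_X_y=True)

outer_folds = model_selection.StratifiedKFold(
    n_splits=5, shuffle=True, random_state=1
)
outer_scores = []

for train_idx, test_idx in outer_folds.split(X_demo, y_demo):
    # Inner crossvalidation: grid search sees only the outer training rows
    search = model_selection.GridSearchCV(
        estimator=neighbors.KNeighborsClassifier(),
        param_grid={"n_neighbors": [1, 5, 15]},  # hypothetical candidates
        cv=5,
    )
    search.fit(X_demo[train_idx], y_demo[train_idx])

    # The outer test fold was used for neither hyperparameters nor parameters
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))

print("accuracy per outer fold:", outer_scores)
```

By default `GridSearchCV` refits the best candidate on all the rows it was given, so `search.score` evaluates a model trained on the full outer training portion.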

## Example: Nested crossvalidation

Let's continue our example with the `iris` dataset, but switch models to `SVC`, which has the hyperparameter `C`.

### Train model with nested crossvalidation

We need to import libraries for:

- SVM
- Scaling (SVM is very sensitive to whether the variables are standardized)
- Pipeline (to combine scaling and modeling)

So we need the following imports:

- `import sklearn.svm as svm`
- `import sklearn.preprocessing as pp`
- `import sklearn.pipeline as pipe`

In [24]:
import sklearn.svm as svm
import sklearn.preprocessing as pp
import sklearn.pipeline as pipe

We're going to make a pipeline so we can scale and train in one step.
However, we need to **name the stages** in order for grid search to work:

- Set `model` to with `pipe` create `Pipeline` using a list containing
    - a tuple (from LISTS; see picture below) containing
        - `"scale"`
        - with `pp` create `StandardScaler` 
    - a tuple containing
        - `"svm"`
        - with `svm` create `SVC` using
            - freestyle `random_state=1`
            - freestyle `kernel="rbf"` (for a radial basis function kernel)

![image.png](attachment:image.png)

In [25]:
model = pipe.Pipeline([('scale',(pp.StandardScaler())), ('svm',(svm.SVC(random_state=1,kernel="rbf")))])

This is our base model that we need to give to grid search, along with the parameter values we want grid search to try for `C`.
Let's try the list `[1, 10, 100]`!

- Set `gridSearch` to with `model_selection` create `GridSearchCV` using
    - freestyle `estimator=model`
    - freestyle `param_grid={'svm__C': [1, 10, 100]}`
    - freestyle `cv=10` (10 inner folds for grid search)


- Set `predictions` to with `model_selection` do `cross_val_predict` using
    - `gridSearch`
    - `X`
    - with `np` do `ravel` using  `Y`
    - freestyle `cv=10` (10 outer folds for model parameters)

In [26]:
gridSearch = model_selection.GridSearchCV(estimator=model,param_grid={'svm__C':[1,10,100]},cv=10)

predictions = model_selection.cross_val_predict(gridSearch,X,np.ravel(Y),cv=10)

### Evaluate the model

We can get the accuracy by comparing the predictions to *all* of `Y`:

- with `metrics` do `accuracy_score` using
    - `Y`
    - `predictions`

In [27]:
metrics.accuracy_score(Y,predictions)

0.9533333333333334

And similarly we can get the recall and precision using all of `Y`:

- `print` with `metrics` do `classification_report` using
    - `Y`
    - `predictions`
    

In [28]:
print(metrics.classification_report(Y,predictions))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        50
  versicolor       0.94      0.92      0.93        50
   virginica       0.92      0.94      0.93        50

    accuracy                           0.95       150
   macro avg       0.95      0.95      0.95       150
weighted avg       0.95      0.95      0.95       150



In this case, likely because the classification problem is so easy, the hyperparameter-tuned SVM shows no improvement over Gaussian naive Bayes.

### Hyperparameter

Interestingly, the nested crossvalidation approach in `sklearn` does not return the best hyperparameter values found.
The simplest option is to discover the hyperparameter values using all the data, but then only report the performance using nested crossvalidation above.
Alternatively, one could write custom code to calculate the optimal hyperparameter values for each fold.
To do the simple option:

- with `gridSearch` do `fit` using 
    - `X`
    - with `np` do `ravel` using `Y`

In [29]:
gridSearch.fit(X,np.ravel(Y))

Now the best hyperparameter values can be displayed:
    
- freestyle `gridSearch.best_params_`

In [30]:
gridSearch.best_params_

{'svm__C': 1}

We can see that `1` is the best value for `C` among the candidates we tried.
One strategy to improve on this might be to grid search using values closer to 1.
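
For the alternative mentioned earlier, finding the hyperparameter value chosen in each outer fold, one possible sketch uses `cross_validate` with `return_estimator=True`, which keeps the fitted grid search from every outer split. It assumes scikit-learn's bundled iris data stands in for `datasets/iris.csv`, and the variable names are illustrative:

```python
import sklearn.datasets as datasets
import sklearn.model_selection as model_selection
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp
import sklearn.svm as svm

# Assumption: the bundled iris data stands in for datasets/iris.csv
X_demo, y_demo = datasets.load_iris(return_X_y=True)

svm_pipeline = pipe.Pipeline([
    ("scale", pp.StandardScaler()),
    ("svm", svm.SVC(random_state=1, kernel="rbf")),
])
inner_search = model_selection.GridSearchCV(
    estimator=svm_pipeline, param_grid={"svm__C": [1, 10, 100]}, cv=10
)

# return_estimator=True keeps the fitted grid search from each outer split
results = model_selection.cross_validate(
    inner_search, X_demo, y_demo, cv=10, return_estimator=True
)

best_C_per_fold = [est.best_params_["svm__C"] for est in results["estimator"]]
print("best C in each outer fold:", best_C_per_fold)
print("accuracy in each outer fold:", results["test_score"])
```

If the chosen value changes from fold to fold, that is a sign the candidates perform similarly or that the choice is sensitive to the data.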

## Check your knowledge

**Hover to see the correct answer.**

1.  What is the primary goal of building models, as stated in the notebook?
- To perfectly fit the training data.
- <div title="Correct answer">To predict and generalize.</div>
- To create the most complex model possible.
- To achieve high performance only on seen data.

2.  What is a common problem when a model performs very well on training data but poorly on testing data?
- Underfitting.
- <div title="Correct answer">Overfitting.</div>
- Generalization.
- Crossvalidation.

3.  What does crossvalidation allow us to do with all our data?
- Use it all for training only.
- Use it all for testing only.
- <div title="Correct answer">Use it all for both training and testing, but not at the same time.</div>
- Use it to manually adjust model parameters.

4.  In a 5-fold crossvalidation, if the model uses Fold 1 for testing, which folds are used for training?
- Only Fold 1.
- Folds 1 and 2.
- <div title="Correct answer">Folds 2-5.</div>
- None of the folds.

5.  Which scikit-learn function is used to get predictions from a crossvalidated model?
- `cross_val_score`
- `train_test_split`
- `fit_predict`
- <div title="Correct answer">`cross_val_predict`</div>

6.  What is a **hyperparameter**?
- A parameter the model learns during training.
- <div title="Correct answer">Something you specify when you create the model.</div>
- A metric used to evaluate model performance.
- A type of crossvalidation.

7.  What is **Grid Search** primarily used for in conjunction with crossvalidation?
- To visualize model performance.
- To automatically split data into training and testing sets.
- <div title="Correct answer">To explore multiple candidate values for hyperparameters.</div>
- To prevent overfitting on the training data.

8.  What problem does **nested crossvalidation** solve?
- The issue of having too much data.
- Simplifying the model training process.
- <div title="Correct answer">Finding hyperparameters on the same data used to test the model.</div>
- Reducing the number of folds in crossvalidation.

9.  In the context of nested crossvalidation, what are the **outer folds** used for?
- Evaluating the hyperparameters.
- <div title="Correct answer">Training the model in the general way.</div>
- Visualizing the data distribution.
- Calculating individual fold performance metrics.

10. What is `SVC` sensitive to, as mentioned in the notebook, which necessitates the use of `StandardScaler` in the pipeline?
- The number of folds.
- The choice of kernel.
- <div title="Correct answer">Standardization of variables.</div>
- The `random_state` parameter.
