<a href="https://colab.research.google.com/github/nickstone1911/data-analysis-practice/blob/main/Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Business Analytics: Cross Validation in `scikit-learn`

In this code-along notebook, we will:

1. Demonstrate how to use `scikit-learn` and its `train_test_split` function to split a dataset into training and testing sets
2. Train a machine learning model on training data and evaluate its performance on testing data
3. Expand on the idea of training and testing data by using a more rigorous cross validation procedure
4. Practice splitting data into various testing, training, holdout, and cross validation datasets
5. Begin to discuss model/data leakage, a model-building problem we will return to in several lessons
6. Introduce model evaluation
>- We will have a deeper discussion of model evaluation in other lessons

---

## Resources
>- Chapter 5 in Provost and Fawcett

>- [Overfitting and Its Avoidance Slides](https://docs.google.com/presentation/d/1b14pJWpSYue7QYTf-sLEH48cuoTEwkxvZINHpwo-lBI/edit?usp=sharing)
>- [Holdout and Cross Validation Slides](https://docs.google.com/presentation/d/1pLlqpehomzpM0SIdNn85Zlo_l_rR_Fl6eLRjPfTwsQg/edit?usp=sharing)
>- [Cross-validation: evaluating estimator performance scikit-learn doc ](https://scikit-learn.org/stable/modules/cross_validation.html)

---


# Importing Initial Libraries




In [None]:
import pandas as pd
import numpy as np

# Load the `iris` Dataset
>- For this tutorial we will load the famous and commonly used `iris` dataset directly from `sklearn` datasets

---

# Load Data

In [None]:
from sklearn import datasets

In [None]:
iris = datasets.load_iris(as_frame=True)

iris_df = iris.frame

iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


## Define the Feature and Target Data

In [None]:
X = iris_df.drop('target', axis=1)

X.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
y = iris_df['target']

y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

---
# End of Video 1

---

# Train/Test Split Procedure

The following outlines the steps involved in splitting a dataset into training and testing data. Some steps, such as step 0., are there to remind us of the data prep that must be done prior to fitting models. We will skip this step in some tutorials because data prep will be covered in depth in other lessons.

0. Clean and adjust data as necessary for X and y
1. Import Libraries
2. Split Data in Train/Test for both X and y
3. Fit/Train Scaler on Training X Data
4. Scale X Test Data
5. Create Model
6. Fit/Train Model on X Train Data
7. Evaluate Model on X Test Data (by creating predictions and comparing to `y_test`)
8. Adjust Parameters as Necessary and repeat steps 5-7
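
The steps above can be sketched end to end in one cell. This is only a rough outline, assuming the iris data and the linear SVC used later in this notebook; the names `X_train_scaled`, `X_test_scaled`, and `test_accuracy` are placeholders chosen for this sketch.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps 0 and 1: load the data (iris needs no cleaning) and import libraries
iris = datasets.load_iris(as_frame=True)
X = iris.frame.drop('target', axis=1)
y = iris.frame['target']

# Step 2: split both X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Step 3: fit the scaler on the training features only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Step 4: scale the test features with the scaler fit on the training data
X_test_scaled = scaler.transform(X_test)

# Steps 5 and 6: create the model and fit it on the training data
model = SVC(kernel='linear', C=0.1).fit(X_train_scaled, y_train)

# Step 7: evaluate on the test data (accuracy for classifiers)
test_accuracy = model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Step 8 would repeat the last two blocks with a different parameter value. The sections below walk through each step one cell at a time.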

---

## Step 1: Import Libraries
Note: in many data mining projects this will be a standard import at the beginning of a notebook but for this introductory lesson we have it here for demonstration purposes.

In [None]:
from sklearn.model_selection import train_test_split

## Step 2: Split Data in Train/Test for both X and y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .4, random_state = 0)

Let's examine our split up data:

>- With the parameter `test_size = .4` we are stating that we want 40% of the data to be in the testing set.
>>- So, .4 * 150 (the total records) = 60 records we should expect in the testing data sets.

>- We are setting the `random_state=0` parameter so we all can get reproducible results.
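
To see what `random_state` buys us, here is a small sketch that splits the data three times. The second seed (42) is an arbitrary value picked for illustration, and the variable names are placeholders.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True, as_frame=True)

# Same random_state: the same rows land in the test set every time
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.4, random_state=0)
same_rows = X_test_a.index.equals(X_test_b.index)

# A different random_state shuffles the rows differently
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.4, random_state=42)
different_rows = not X_test_a.index.equals(X_test_c.index)

# Number of test rows, then the two comparisons
print(len(X_test_a), same_rows, different_rows)
```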

In [None]:
X_train.shape, X_test.shape

((90, 4), (60, 4))

In [None]:
y_train.shape, y_test.shape

((90,), (60,))

## Step 3: Fit/Train Scaler on Training Data

>- Note: it's often best practice to scale our data, so we will practice that in this first example. Later we may skip this step so we can move through some examples more quickly, but keep in mind that scaling is something we usually do.


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)

X_train[:5]

array([[ 0.18206758,  0.71103882,  0.45664061,  0.55799544],
       [-1.17402201,  0.00522823, -1.10334891, -1.19530695],
       [-0.04394735, -0.93585257,  0.77939706,  0.9337031 ],
       [-0.26996228, -0.93585257,  0.29526238,  0.18228779],
       [-0.26996228, -0.46531217, -0.02749407,  0.18228779]])

## Step 4: Scale Test Data

In [None]:
X_test = scaler.transform(X_test)
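
Why fit the scaler on the training data only? The sketch below is one way to check what happens, assuming the same split as above; the `_raw` and `_scaled` names are placeholders so the original variables are not overwritten.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# The scaler learns each column's mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train_raw)
X_train_scaled = scaler.transform(X_train_raw)

# The test rows are shifted and scaled with the training statistics
X_test_scaled = scaler.transform(X_test_raw)

print("train column means:", np.round(X_train_scaled.mean(axis=0), 3))
print("test column means: ", np.round(X_test_scaled.mean(axis=0), 3))
```

The training columns come out centered at zero, while the test columns are only approximately centered. That is expected: the test data is treated like future data the scaler has never seen.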

## Step 5: Create the Model

In this example, we are going to fit a support vector machine to the data. Any other classification model could be fit to this data as well.

In [None]:
from sklearn.svm import SVC

svc_model = SVC(kernel='linear', C = .1)

## Step 6: Fit/Train Model on Training Data

In [None]:
svc_model.fit(X_train, y_train)

## Step 7: Evaluate Model on X_test data

For this example, we are just going to look at overall model accuracy (the default) using the `score()` method.

>- In future lessons, we will cover other evaluation metrics for both regression and classification problems.
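
For a classifier, `score()` is a shortcut for predicting and then computing accuracy. A minimal sketch, assuming unscaled iris data to keep it short (the names `manual_accuracy` and `metric_accuracy` are placeholders):

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = SVC(kernel='linear', C=0.1).fit(X_train, y_train)

# Accuracy by hand: the share of test rows predicted correctly
y_pred = model.predict(X_test)
manual_accuracy = (y_pred == y_test.to_numpy()).mean()

# The same number from the metrics module and from score()
metric_accuracy = accuracy_score(y_test, y_pred)
score_accuracy = model.score(X_test, y_test)

print(manual_accuracy, metric_accuracy, score_accuracy)
```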

In [None]:
svc_model.score(X_test, y_test)

0.9166666666666666

## Step 8: Adjust parameters as necessary and repeat steps 5-7

>- We picked a small C value initially, let's see if picking a larger one improves the model accuracy.

In [None]:
from sklearn.svm import SVC

svc_model = SVC(kernel='linear', C = 1000)

svc_model.fit(X_train, y_train)

svc_model.score(X_test, y_test)

0.95

Ok, we improved our model with a higher C value! However, adjusting parameters against the same test set we use to report accuracy is a flawed methodology, so in the next lesson we will pick up on how to improve on this.

---
# End of Video 2

---

# Holdout Testing and Cross Validation

>- We left off the prior video with an example of manually adjusting the `C` parameter until we were satisfied with the accuracy reported
>- However, tuning a model this way can lead to what is known as model "leakage": the test data has influenced our choices, so our evaluation metrics no longer report generalization performance
>- What we want is to confidently report evaluation metrics that will generalize to future data

In the next few examples we will show a procedure to create another split on the data to create what is known as a `holdout` dataset. This data will never be seen by the model in training and therefore will serve as a better evaluation of our model performance.

---

# Holdout Testing

Because adjusting models and evaluating performance should never be done on the same testing and training data, let's create a new split of the data that the model will never see in training.

>- For the sake of clarity, and as a reminder of what we are working with, we are going to redefine our data for this process

In [None]:
X = iris_df.drop('target', axis=1)

y = iris_df['target']

### First, create the training and test sets as usual


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .4, random_state = 0)

X_test.shape, y_test.shape

((60, 4), (60,))

### Now, create a holdout set based on the test data

>- Here, we will be using `train_test_split` again, but only on the test data defined in the prior cell. This will create our final holdout test set for model evaluation.
>- We will use the `_eval` data to test parameters and make other model adjustments until satisfied
>- Then we will use the `_holdout` data to perform our final model evaluations which should be more generalizable

---

Here, we will set `test_size` = .5 which will take 50% of the test data we defined in the previous cell.
>- So we will get 60 * .5 = 30 rows for our holdout data

In [None]:
X_eval, X_holdout, y_eval, y_holdout = train_test_split(X_test, y_test, test_size= .5, random_state = 0)

In [None]:
X_eval.shape, y_eval.shape

((30, 4), (30,))

### Back to Steps 5-8
>- Note: We will skip the scaling in this example but you could go back and try it later to see if it makes a difference
>- In this example we go back through building the model and testing some hyperparameters
>- What is different here is that now once we are satisfied with our parameters, we will do a final evaluation on the holdout data.

We will test the following values for C:
>- .1, 1, 10, 100, 1000
>- We will run these all in the same cell and then jot down the notes in the markdown cell below this code cell
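
Instead of editing `C` by hand and rerunning the cell, the same comparison can be sketched as a loop. This assumes the splits defined above; `eval_accuracy` is a placeholder name for this sketch, and the holdout data is deliberately left untouched.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_eval, X_holdout, y_eval, y_holdout = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0
)

# Fit one model per C value and score it on the evaluation data only
eval_accuracy = {}
for C in [0.1, 1, 10, 100, 1000]:
    model = SVC(kernel='linear', C=C).fit(X_train, y_train)
    eval_accuracy[C] = model.score(X_eval, y_eval)

for C, accuracy in eval_accuracy.items():
    print(f"C={C}: {accuracy:.2%}")
```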

In [None]:
from sklearn.svm import SVC

svc_model = SVC(kernel='linear', C = 1000)

svc_model.fit(X_train, y_train)

svc_model.score(X_eval, y_eval)

0.9333333333333333

## Evaluation Summary

Here was the accuracy of our model based on the various values of C:

| C value 	| Accuracy 	|
|---------	|----------	|
| .1      	| 90%      	|
| 1       	| 93%      	|
| 10      	| 87%      	|
| 100     	| 93%      	|
| 1000    	| 93%      	|


Even though we had some ties for "best" parameter, let's use C=1.

>- Note: Once we pick the parameters based on the `_eval` data, we cannot go back and adjust them after seeing results on the holdout set.
>- If you do this, you will be creating the same problem we had prior to using the holdout data and creating model/data leakage


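Final evaluation on holdout data. Before running this cell, make sure `svc_model` was last fit with the chosen value (C=1), because `score()` uses whichever model was fit most recently.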

In [None]:
svc_model.score(X_holdout, y_holdout)

1.0

## Should we trust these numbers?

Wow, we got 100% accuracy on the holdout testing data with C=1. However, anytime a model produces 100% accuracy we should be very suspicious. Here are some reasons why we might want to pump the brakes on this evaluation:

>- We are dealing with a small dataset. The iris dataset only has 150 total records
>- When we split this data up twice it left us with some really small training, evaluation, and holdout datasets
>- It is likely that our randomly selected holdout data just happened to be data our model could predict perfectly


Because a single, randomly selected holdout dataset can give a misleading picture of our model's accuracy, `Cross Validation` is often used to evaluate the model on many different training and testing/holdout splits.
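
To make the idea of "many different splits" concrete, here is a small sketch using `KFold` on ten placeholder rows. The ten-row array and the choice of five folds are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder rows stand in for a training set
rows = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

held_out = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(rows), start=1):
    print(f"fold {fold}: train on {train_idx}, validate on {test_idx}")
    held_out.extend(test_idx.tolist())
    fold_sizes.append(len(test_idx))
```

Every row is held out exactly once, so the final score is an average over several small test sets rather than a single lucky (or unlucky) one. For classifiers, passing an integer `cv` to the scikit-learn cross-validation helpers uses a stratified version of this idea that keeps the class proportions similar in each fold.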

We will go over `Cross Validation` in the next section.

---
# End of Video 3

---

# Cross-Validation in `scikit-learn`

## Resources
>- [Cross-validation scikit-learn doc](https://scikit-learn.org/stable/modules/cross_validation.html)
>- [Metrics and Scoring scikit-learn doc](https://scikit-learn.org/stable/modules/model_evaluation.html)

---
## Cross Validation Flow Chart
[Cross_Validation Flow Chart](https://drive.google.com/file/d/11uVrnIIebuCNsQaq1aQFtUETHMwpQ63q/view?usp=sharing)

![grid_search_workflow.png](https://drive.google.com/uc?id=11uVrnIIebuCNsQaq1aQFtUETHMwpQ63q)




Let's bring our data definitions down to this section as we start our discussion on cross-validation.

In [None]:
X = iris_df.drop('target', axis=1)

y = iris_df['target']

Now, we run the train/test split as usual.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= .4, random_state = 0)

X_test.shape, y_test.shape

((60, 4), (60,))

---

## `cross_val_score`

`cross_val_score` is a commonly used function for performing cross-validation in `scikit-learn`. Here are some notes on `cross_val_score`:


### Import statement
```
from sklearn.model_selection import cross_val_score
```

### Common parameters
>- `estimator`: the model object to evaluate (for example, an `SVC` instance)
>- `X`: the data to fit. This will usually be our X_train
>- `y`: the target to predict. This will usually be our y_train
>- `scoring`: this is how you want to evaluate your model
>>- Common options for classification: accuracy, f1, precision, recall (for multiclass targets such as iris, use an averaged variant such as f1_macro)
>>- Common options for regression: explained_variance, neg_mean_absolute_error, neg_mean_squared_error, r2
>>- See documentation for more options
>- `cv`: this is usually an integer indicating how many folds we want
>>- default is 5-folds but we will change this occasionally
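
A minimal sketch of how these parameters fit together, assuming the iris training split from above. The variable names are placeholders, and `'f1_macro'` is used because plain `'f1'` assumes a binary target while iris has three classes.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = SVC(kernel='linear', C=1)

# No scoring argument: a classifier is scored with accuracy
accuracy_scores = cross_val_score(model, X_train, y_train, cv=5)

# An explicit scoring string: macro-averaged F1 across the three classes
f1_scores = cross_val_score(model, X_train, y_train, scoring='f1_macro', cv=5)

print("accuracy per fold:", accuracy_scores)
print("f1_macro per fold:", f1_scores)
```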

---

Now, let's practice using `cross_val_score`.

>- In this example we are going to start off with a `C` value and then make adjustments similar to our holdout testing example.
>- However, with cross validation, we don't have to worry about splitting the test set up multiple times because the cross validation takes care of this for us.
>- Our `X_test` and `y_test` datasets are the final holdout datasets when using `cross_val_score`
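
If you want to bring scaling back into the picture, one option is to wrap the scaler and the model in a pipeline so the scaler is refit inside each fold. This is a sketch of that approach using `make_pipeline`, which this notebook does not otherwise use; `scaled_svc` and `pipeline_scores` are placeholder names.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part of a fold never influences the scaling
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))

pipeline_scores = cross_val_score(scaled_svc, X_train, y_train, cv=10)
print(pipeline_scores.mean(), pipeline_scores.std())
```

Scaling all of `X_train` once before calling `cross_val_score` would let each validation fold leak into the scaler's mean and standard deviation, which is a small example of the data leakage discussed earlier.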

Imports

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.svm import SVC

Create the model (aka estimator) object with one particular C value. We are going to call this the `v1` model.

In [None]:
svc_model_v1 = SVC(kernel='linear', C = .1)

Evaluate using `cross_val_score` and model accuracy (the default for classification models)
>- Note: `cross_val_score` fits the model to the data so we don't need a separate call to the `fit()` method here
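
One detail worth knowing: `cross_val_score` fits copies of the estimator, one per fold, so the object we pass in is still unfitted afterwards. The sketch below checks this with `check_is_fitted`; the `fitted_after_cv` flag is a placeholder name.

```python
from sklearn import datasets
from sklearn.exceptions import NotFittedError
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = SVC(kernel='linear', C=0.1)
scores = cross_val_score(model, X_train, y_train, cv=10)

# check_is_fitted raises NotFittedError when the estimator has not been fit
try:
    check_is_fitted(model)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False

print("fitted after cross_val_score:", fitted_after_cv)
```

This is why a separate call to `fit()` is needed before scoring the chosen model on the final test data.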



In [None]:
scores_v1= cross_val_score(svc_model_v1, X_train, y_train, cv=10)

scores_v1

array([1.        , 1.        , 0.88888889, 1.        , 1.        ,
       1.        , 0.77777778, 1.        , 1.        , 1.        ])

Notice `cross_val_score` returns an array of the accuracies (or other evaluation metrics) of all the folds. So we can calculate the mean and standard deviation of the accuracy for the model.

In [None]:
scores_v1.mean(), scores_v1.std()

(0.9666666666666666, 0.07114582486036498)

Ok, now let's try one more value for `C`

In [None]:
svc_model_v2 = SVC(kernel='linear', C = 1)

In [None]:
scores_v2= cross_val_score(svc_model_v2, X_train, y_train, cv=10)

In [None]:
scores_v2.mean(), scores_v2.std()

(0.9888888888888889, 0.03333333333333335)

Finally, since C=1 is still producing the best results, we evaluate on the test data.

>- Now, we need to fit our model prior to the final evaluation

In [None]:
svc_model_v2.fit(X_train, y_train)

In [None]:
svc_model_v2.score(X_test, y_test)

0.9666666666666667

## Section Notes

>- Notice how our cross validation accuracy (98.9%) was only slightly higher than our final testing accuracy (96.7%). When we used a single holdout set, we had an initial accuracy of 93% and a final holdout accuracy of 100%, which caused us to pause and be suspicious of the results
>- This is generally what you will see in real world model building.
>>- Actually, the final accuracy isn't usually this close to the cross validation accuracy, but again this is likely due to our small iris dataset

>- Another function in `scikit-learn` is `cross_validate` which we will likely cover in another lesson
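
As a brief preview, here is a rough sketch of `cross_validate`, which can report several metrics at once along with training scores and timings. The choice of metrics and the `results` name are assumptions for illustration only.

```python
from sklearn import datasets
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

model = SVC(kernel='linear', C=1)

# Returns a dictionary of arrays, one entry per fold for each key
results = cross_validate(
    model,
    X_train,
    y_train,
    cv=10,
    scoring=['accuracy', 'f1_macro'],
    return_train_score=True,
)

print(sorted(results.keys()))
print(results['test_accuracy'].mean(), results['train_accuracy'].mean())
```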

---

# End of Video 4

---