# Cross Validation
> Holdout sets are a great start to model validation. However, using a single train and test set is often not enough. Cross-validation is considered the gold standard for validating model performance and is almost always used when tuning model hyper-parameters. This chapter focuses on performing cross-validation to validate model performance.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 3 exercises from the DataCamp course "Model Validation in Python". <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## The problems with holdout sets

### Two samples

<div class=""><p>After building several classification models based on the <code>tic_tac_toe</code> dataset, you realize that some models do not generalize as well as others. You have created training and testing splits just as you have been taught, so you are curious why your validation process is not working.</p>
<p>After trying a different training/test split, you noticed differing accuracies for your machine learning model. Before getting too frustrated with the varying results, you have decided to see what else could be going on.</p></div>

In [3]:
tic_tac_toe = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/11-model-validation-in-python/datasets/tic-tac-toe.csv')

In [4]:
tic_tac_toe

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive
...,...,...,...,...,...,...,...,...,...,...
953,o,x,x,x,o,o,o,x,x,negative
954,o,x,o,x,x,o,x,o,x,negative
955,o,x,o,x,o,x,x,o,x,negative
956,o,x,o,o,x,x,x,o,x,negative


Instructions 1/3
<ul>
<li>Create samples <code>sample1</code> and <code>sample2</code> with 200 observations that could act as possible testing datasets.</li>
</ul>

In [None]:
# Create two different samples of 200 observations 
sample1 = tic_tac_toe.sample(200, random_state=1111)
sample2 = tic_tac_toe.sample(200, random_state=1171)

Instructions 2/3
<ul>
<li>Use a list comprehension to find out how many observations these samples have in common.</li>
</ul>

In [None]:
# Print the number of common observations 
print(len([index for index in sample1.index if index in sample2.index]))

40
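
The list comprehension above checks membership one index at a time. As a minimal sketch of an equivalent approach, pandas' `Index.intersection` returns the shared labels directly. The `demo` frame below is a hypothetical stand-in (958 placeholder rows), not the real dataset, so the count it prints is not expected to match the 40 shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tic_tac_toe: only the index matters here
demo = pd.DataFrame({"value": np.arange(958)})

sample1 = demo.sample(200, random_state=1111)
sample2 = demo.sample(200, random_state=1171)

# Count shared rows with the list comprehension from the exercise
n_common_loop = len([index for index in sample1.index if index in sample2.index])

# Same count via an intersection of the two indices
common = sample1.index.intersection(sample2.index)
n_common = len(common)

print(n_common_loop, n_common)
```

Both counts agree because `sample()` draws without replacement by default, so each index label appears at most once per sample.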


Instructions 3/3
<ul>
<li>Use the <code>Series.value_counts()</code> method to print the values in both samples for column <code>Class</code>.</li>
</ul>

In [None]:
# Print the number of observations in the Class column for both samples 
print(sample1['Class'].value_counts())
print(sample2['Class'].value_counts())

positive    134
negative     66
Name: Class, dtype: int64
positive    123
negative     77
Name: Class, dtype: int64


**Notice that the number of positive observations differs between the two test samples (134 vs. 123). Sometimes creating a single test holdout sample is not enough to achieve the high levels of model validation you want. You need to use something more robust.**

### Potential problems

<div class=""><p>Which of the following statements are <strong>TRUE</strong> regarding potential problems with holdout samples:</p>
<ul>
<li><strong>A</strong>: Using different data splitting methods may lead to varying data in the final holdout samples.</li>
<li><strong>B</strong>: If you have limited data, your holdout accuracy may be misleading.</li>
<li><strong>C</strong>: There are no problems. Creating a single train and test sample is the only way to validate models.</li>
<li><strong>D</strong>: You shouldn't use holdout samples with limited data because you are limiting the potential training data.</li>
</ul></div>

<pre>
Possible Answers

A & D

C & D

<b>A & B</b>

A, B, & D

</pre>

**If our models are not generalizing well or if we have limited data, we should be careful using a single training/validation split. You should use the next lesson's topic: cross-validation.**
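
To make the risk concrete, here is a rough sketch of how a holdout score can move when only the split changes. The data is synthetic (`make_classification` with arbitrary settings), and the names `X_demo`, `y_demo`, and `accuracies` are assumptions for illustration, not part of the course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 200 rows, 9 features
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

accuracies = []
for seed in range(5):
    # Only the split changes; the model settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=25, random_state=1111)
    model.fit(X_tr, y_tr)
    accuracies.append(model.score(X_te, y_te))

print("Holdout accuracy per split:", np.round(accuracies, 3))
print("Spread (max - min):", round(max(accuracies) - min(accuracies), 3))
```

With a small dataset, each test set holds only a few dozen rows, so a handful of differently classified observations is enough to shift the reported accuracy.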

## Cross-validation

### scikit-learn's KFold()

<div class=""><p>You just finished running a colleague's code that creates a random forest model and calculates an out-of-sample accuracy. You noticed that your colleague's code did not have a random state, and the errors you found were completely different than the errors your colleague reported. </p>
<p>To get a better estimate for how accurate this random forest model will be on new data, you have decided to generate some indices to use for KFold cross-validation.</p></div>

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/11-model-validation-in-python/datasets/candy-data.csv')
X = df.drop(['competitorname', 'winpercent'], axis=1).to_numpy()
y = df['winpercent'].to_numpy()

Instructions
<ul>
<li>Call the <code>KFold()</code> method to split data using five splits, shuffling, and a random state of 1111. </li>
<li>Use the <code>split()</code> method of <code>KFold</code> on <code>X</code>.</li>
<li>Print the number of indices in both the train and validation indices lists.</li>
</ul>

In [59]:
from sklearn.model_selection import KFold

# Use KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# Create splits
splits = kf.split(X)

# Print the number of indices
for train_index, val_index in splits:
    print("Number of training indices: %s" % len(train_index))
    print("Number of validation indices: %s" % len(val_index))

Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17
Number of training indices: 68
Number of validation indices: 17


**This dataset has 85 rows. You have created five splits - each containing 68 training and 17 validation indices. You can use these indices to complete 5-fold cross-validation.**
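
As a quick check of what these indices guarantee, the sketch below confirms that the validation folds partition the data: each row lands in exactly one validation fold. It uses a placeholder array with 85 rows (an assumption, chosen only to match the row count above) and the hypothetical names `X_demo` and `kf_demo`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder array with 85 rows, matching the row count reported above
X_demo = np.zeros((85, 3))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1111)

val_indices = []
for train_index, val_index in kf_demo.split(X_demo):
    # Training and validation indices never overlap within a split
    assert len(np.intersect1d(train_index, val_index)) == 0
    val_indices.append(val_index)

# Across the five splits, every row is validated exactly once
all_val = np.sort(np.concatenate(val_indices))
print(np.array_equal(all_val, np.arange(85)))  # prints True
```

Note that `split()` returns a generator, so it is exhausted after one pass and has to be called again before a second loop.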

### Using KFold indices

<div class=""><p>You have already created <code>splits</code>, which contains indices for the candy-data dataset to complete 5-fold cross-validation. To get a better estimate for how well a colleague's random forest model will perform on new data, you want to run this model on the five different training and validation indices you just created. </p>
<p>In this exercise, you will use these indices to check the accuracy of this model using the five different splits. A for loop has been provided to assist with this process.</p></div>

In [60]:
splits = kf.split(X)

Instructions
<ul>
<li>Use <code>train_index</code> and <code>val_index</code> to call the correct indices of <code>X</code> and <code>y</code> when creating training and validation data.</li>
<li>Fit <code>rfc</code> using the training dataset</li>
<li>Use <code>rfc</code> to create predictions for validation dataset and print the validation accuracy</li>
</ul>

In [61]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfc = RandomForestRegressor(n_estimators=25, random_state=1111)

# Access the training and validation indices of splits
for train_index, val_index in splits:
    # Setup the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    # Fit the random forest model
    rfc.fit(X_train, y_train)
    # Make predictions, and print the error (mean squared error, so lower is better)
    predictions = rfc.predict(X_val)
    print("Split MSE: " + str(mean_squared_error(y_val, predictions)))

Split MSE: 151.5028145199104
Split MSE: 173.4624060357644
Split MSE: 132.7340977072911
Split MSE: 81.50364942339418
Split MSE: 217.17904656079338


**KFold() is a great method for accessing individual indices when completing cross-validation. One drawback is needing a for loop to work through the indices though. In the next lesson, you will look at an automated method for cross-validation using sklearn.**
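
Printing each split's error is useful, but a single summary is easier to compare across models. The sketch below collects the per-split errors in a list and reports their mean and standard deviation. It runs on synthetic regression data (an assumption, with arbitrary settings) rather than the candy dataset, and `split_errors` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the candy data (assumption): 85 rows, 11 features
X_demo, y_demo = make_regression(n_samples=85, n_features=11, noise=10.0, random_state=0)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=1111)
model = RandomForestRegressor(n_estimators=25, random_state=1111)

split_errors = []
for train_index, val_index in kf_demo.split(X_demo):
    # Refit on each training fold, then score on the held-out fold
    model.fit(X_demo[train_index], y_demo[train_index])
    predictions = model.predict(X_demo[val_index])
    split_errors.append(mean_squared_error(y_demo[val_index], predictions))

print("Mean MSE: %.2f" % np.mean(split_errors))
print("Std of MSE: %.2f" % np.std(split_errors))
```

A large standard deviation relative to the mean is a hint that any single split would have given an unreliable picture.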

## sklearn's cross_val_score()

### scikit-learn's methods

<div class=""><p>You have decided to build a regression model to predict the number of new employees your company will successfully hire next month. You open up a new Python script to get started, but you quickly realize that <code>sklearn</code> has <em>a lot</em> of different modules. Let's make sure you understand the names of the modules, the methods, and which module contains which method. </p>
<p>Follow the instructions below to load in all of the necessary methods for completing cross-validation using <code>sklearn</code>. You will use modules:</p>
<ul>
<li><code>metrics</code></li>
<li><code>model_selection</code></li>
<li><code>ensemble</code></li>
</ul></div>

Instructions
<ul>
<li>Load the method for calculating the scores of cross-validation.</li>
<li>Load the random forest regression method.</li>
<li>Load the mean square error metric.</li>
<li>Load the method for creating a scorer to use with cross-validation.</li>
</ul>

In [63]:
# Instruction 1: Load the cross-validation method
from sklearn.model_selection import cross_val_score

# Instruction 2: Load the random forest regression model
from sklearn.ensemble import RandomForestRegressor

# Instruction 3: Load the mean squared error method
# Instruction 4: Load the function for creating a scorer
from sklearn.metrics import mean_squared_error, make_scorer

**It is easy to see how all of the methods can get mixed up, but it is important to know the names of the methods you need. You can always review the scikit-learn documentation should you need any help.**

### Implement cross_val_score()

<div class=""><p>Your company has created several new candies to sell, but they are not sure if they should release all five of them. To predict the popularity of these new candies, you have been asked to build a regression model using the candy dataset. Remember that the response value is a head-to-head win-percentage against other candies. </p>
<p>Before you begin trying different regression models, you have decided to run cross-validation on a simple random forest model to get a baseline error to compare with any future results.</p></div>

Instructions
<ul>
<li>Fill in <code>cross_val_score()</code>. <ul>
<li>Use <code>X_train</code> for the training data, and <code>y_train</code> for the response.</li>
<li>Use <code>rfc</code> as the model, 10-fold cross-validation, and <code>mse</code> for the scoring function.</li></ul></li>
<li>Print the mean of the <code>cv</code> results.</li>
</ul>

In [64]:
rfc = RandomForestRegressor(n_estimators=25, random_state=1111)
mse = make_scorer(mean_squared_error)

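# Set up cross_val_score
# Note: X_train and y_train are preloaded in the course; this cell does not
# define them, so here they presumably carry over from the last KFold split above.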
cv = cross_val_score(estimator=rfc,
                     X=X_train,
                     y=y_train,
                     cv=10,
                     scoring=mse)

# Print the mean error
print(cv.mean())

130.91371947185584


**You now have a baseline score to build on. If you decide to build additional models or try new techniques, you should try to get an error lower than the 130.91 printed above. Lower errors indicate that your popularity predictions are improving.**
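
One detail worth knowing: a scorer built with `make_scorer(mean_squared_error)` returns raw MSE values, while scikit-learn's built-in scoring string `"neg_mean_squared_error"` returns the same values negated, following the library's convention that higher scores are better. The sketch below compares the two on synthetic data (an assumption, since `X_train` and `y_train` are not defined in this section).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 85 rows, 11 features
X_demo, y_demo = make_regression(n_samples=85, n_features=11, noise=10.0, random_state=0)
model = RandomForestRegressor(n_estimators=25, random_state=1111)

# Custom scorer: returns positive MSE values
mse_scores = cross_val_score(model, X_demo, y_demo, cv=10,
                             scoring=make_scorer(mean_squared_error))

# Built-in scoring string: returns negated MSE values
neg_mse_scores = cross_val_score(model, X_demo, y_demo, cv=10,
                                 scoring="neg_mean_squared_error")

print(mse_scores.mean())
print(-neg_mse_scores.mean())
```

Either form works for reporting a baseline. The negated form matters once a search utility has to maximize the score.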

## Leave-one-out-cross-validation (LOOCV)

### When to use LOOCV

<div class=""><p>Which of the following are reasons you might <strong>NOT</strong> run LOOCV on the provided <code>X</code> dataset? 
The <code>X</code> data has been loaded for you to explore as you see fit. </p>
<ul>
<li><strong>A</strong>: The <code>X</code> dataset has 122,624 data points, which might be computationally expensive and slow.</li>
<li><strong>B</strong>: You cannot run LOOCV on classification problems. </li>
<li><strong>C</strong>: You want to test different values for 15 different parameters</li>
</ul></div>

<pre>
Possible Answers

A & B

B & C

<b>A & C</b>

A
</pre>

**This many observations will definitely slow things down and could be computationally expensive, because LOOCV fits one model per observation. Testing many parameter values multiplies that cost again. If you don't have time to wait while your computer runs through that many models, you might want to use 5 or 10-fold cross-validation.**
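
The cost is simple arithmetic: k-fold cross-validation fits k models per parameter setting, and LOOCV fits one model per observation. The sketch below uses the 122,624 rows quoted in the question. The number of parameter settings (`n_candidates = 10`) is a hypothetical value picked for illustration.

```python
# Number of model fits needed for cross-validation
n_samples = 122_624   # size of X quoted in the question
n_candidates = 10     # hypothetical number of parameter settings to compare

fits_5_fold = 5 * n_candidates
fits_10_fold = 10 * n_candidates
fits_loocv = n_samples * n_candidates  # one fit per held-out observation

print(fits_5_fold, fits_10_fold, fits_loocv)  # 50 100 1226240
```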

### Leave-one-out-cross-validation

<div class=""><p>Let's assume your favorite candy is not in the candy dataset, and that you are interested in the popularity of this candy. Using 5-fold cross-validation will train on only 80% of the data at a time. The candy dataset <em>only</em> has 85 rows though, and leaving out 20% of the data could hinder our model. However, using leave-one-out-cross-validation allows us to make the most out of our limited dataset and will give you the best estimate for your favorite candy's popularity!</p>
<p>In this exercise, you will use <code>cross_val_score()</code> to perform LOOCV.</p></div>

Instructions
<ul>
<li>Create a scorer using <code>mean_absolute_error</code> for <code>cross_val_score()</code> to use. </li>
<li>Fill out <code>cross_val_score()</code> so that the model <code>rfr</code>, the newly defined <code>mae_scorer</code>, and LOOCV are used. </li>
<li>Print the mean and the standard deviation of <code>scores</code> using <code>numpy</code> (loaded as <code>np</code>).</li>
</ul>

In [65]:
from sklearn.metrics import mean_absolute_error, make_scorer

# Create scorer
mae_scorer = make_scorer(mean_absolute_error)

rfr = RandomForestRegressor(n_estimators=15, random_state=1111)

# Implement LOOCV: with cv equal to the number of rows, each fold holds out one observation
scores = cross_val_score(rfr, X=X, y=y, cv=X.shape[0], scoring=mae_scorer)

# Print the mean and standard deviation
print("The mean of the errors is: %s." % np.mean(scores))
print("The standard deviation of the errors is: %s." % np.std(scores))

The mean of the errors is: 9.464989603398694.
The standard deviation of the errors is: 7.265762094853885.
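
Passing `cv=X.shape[0]` works because, for a regressor, an integer `cv` builds an unshuffled `KFold` with that many folds. scikit-learn also provides a `LeaveOneOut` splitter that states the intent explicitly. The sketch below compares the two on a small synthetic dataset (an assumption, kept to 30 rows so it runs quickly).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in data (assumption): 30 rows, 5 features
X_demo, y_demo = make_regression(n_samples=30, n_features=5, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=15, random_state=1111)
mae_scorer = make_scorer(mean_absolute_error)

# LOOCV expressed as "one fold per row"
scores_int = cross_val_score(model, X_demo, y_demo, cv=X_demo.shape[0], scoring=mae_scorer)

# LOOCV expressed with the dedicated splitter
scores_loo = cross_val_score(model, X_demo, y_demo, cv=LeaveOneOut(), scoring=mae_scorer)

print(len(scores_loo))
print(np.allclose(scores_int, scores_loo))
```

Each score here is the absolute error on a single held-out row, which is why the standard deviation of LOOCV scores tends to be large relative to the mean.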


**You have come a long way with model validation techniques. The final chapter will wrap up model validation by discussing how to select the best model and give an introduction to parameter tuning.**