# Introduction
<hr style="border:2px solid black"> </hr>


**What?** k-fold CV, Leave-One-Out CV and Repeated Random Test-Train Splits



# Methods to estimate your ML algorithm accuracy
<hr style="border:2px solid black"> </hr>


- **k-Fold CV** is the gold standard for estimating how a machine learning algorithm performs on unseen data; k is typically set to 3, 5, or 10.
- **Train/test split** is the fastest option, which helps when the algorithm is slow to train, and it produces performance estimates with lower bias when the dataset is large.
- Techniques like **leave-one-out** and **repeated random splits** can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.
- **The bottom line?** If in doubt, use 10-Fold CV.
  


# Import modules
<hr style="border:2px solid black"> </hr>

In [1]:
from pandas import read_csv
from sklearn.model_selection import KFold
from IPython.display import Markdown, display
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
<hr style="border:2px solid black"> </hr>

In [4]:
filename = "../DATASETS/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class'] 
dataframe = read_csv(filename, names = names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
print("Data shapes: ", dataframe.shape)

Data shapes:  (768, 9)


# Train/test split
<hr style="border:2px solid black"> </hr>


- We can take our original dataset and split it into two parts. Train the algorithm on the first part, make predictions on the second part and evaluate the predictions against the expected results.

- **CONS**: A downside of this technique is that it can have a high variance. This means that differences in the training and test dataset can result in meaningful differences in the estimate of accuracy.
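
To make the variance point concrete, the sketch below repeats the same train/test split with different random seeds and reports how far the accuracy moves. It is only an illustration: `X_demo` and `Y_demo` are synthetic data from `make_classification` (an assumption, standing in for the diabetes CSV), and `split_accuracies` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's dataset (assumption: same shape, 768 x 8).
X_demo, Y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=0, random_state=0
)

def split_accuracies(X, Y, seeds, test_size=0.33):
    """Hypothetical helper: one hold-out accuracy per random seed."""
    scores = []
    for s in seeds:
        X_tr, X_te, Y_tr, Y_te = train_test_split(
            X, Y, test_size=test_size, random_state=s
        )
        model = LogisticRegression(max_iter=1000)
        model.fit(X_tr, Y_tr)
        scores.append(model.score(X_te, Y_te))
    return np.array(scores)

seed_scores = split_accuracies(X_demo, Y_demo, seeds=range(10))
print(f"min: {seed_scores.min():.3f}  max: {seed_scores.max():.3f}  "
      f"spread: {seed_scores.max() - seed_scores.min():.3f}")
```

The spread between the best and worst seed is a rough indication of how much a single split can mislead you.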



In [12]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)

print("Original X size", X.shape)
print("Train X size", X_train.shape, "percentage", (X_train.shape[0] / X.shape[0]) * 100 )
print("Test X size", X_test.shape, "percentage", (X_test.shape[0] / X.shape[0]) * 100 )

model = LogisticRegression( max_iter = 1000)
# Train your model
model.fit(X_train, Y_train)
# See how your model is doing on unseen data (_test)
result = model.score(X_test, Y_test)
print("Accuracy on test data: ", result*100.0)

Original X size (768, 8)
Train X size (514, 8) percentage 66.92708333333334
Test X size (254, 8) percentage 33.07291666666667
Accuracy on test data:  78.74015748031496


# k-Fold CV
<hr style="border:2px solid black"> </hr>


- CV is an approach that you can use to estimate the performance of a ML algorithm with less variance than a single train-test split.
- It works by splitting the dataset into k parts (folds). The algorithm is trained on k-1 folds and evaluated on the fold that was held out, and this is repeated so that each fold is held out exactly once. After running cross validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.
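
The procedure described above can be written out by hand to see what `cross_val_score` is doing for you. This is a minimal sketch on synthetic data (`X_demo`, `Y_demo` and the choice of 5 folds are assumptions made for illustration); it also checks that every sample is used for testing exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data, used here only to keep the example self-contained.
X_demo, Y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_redundant=0, random_state=0
)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=7)
fold_scores = []
test_counts = np.zeros(len(X_demo), dtype=int)  # how often each row is held out

for train_idx, test_idx in kfold_demo.split(X_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], Y_demo[train_idx])            # train on k-1 folds
    fold_scores.append(model.score(X_demo[test_idx], Y_demo[test_idx]))  # score held-out fold
    test_counts[test_idx] += 1

fold_scores = np.array(fold_scores)
print(f"Mean accuracy: {fold_scores.mean():.4f}, std: {fold_scores.std():.4f}")
print("Every sample tested exactly once:", bool((test_counts == 1).all()))
```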



In [15]:
num_folds = 10
seed = 7
kfold = KFold(n_splits=num_folds, shuffle = True, random_state = seed)
model = LogisticRegression(max_iter = 500)
results = cross_val_score(model, X, Y, cv = kfold)

for i in range(len(results)):
    print(str(i + 1) + "-fold has accuracy: ", results[i])

print(f"Mean accuracy: {results.mean()*100.0:.4f} with standard deviation of: {results.std()*100.0:.4f}")

1-fold has accuracy:  0.8311688311688312
2-fold has accuracy:  0.7402597402597403
3-fold has accuracy:  0.7402597402597403
4-fold has accuracy:  0.8051948051948052
5-fold has accuracy:  0.7922077922077922
6-fold has accuracy:  0.7792207792207793
7-fold has accuracy:  0.6623376623376623
8-fold has accuracy:  0.8051948051948052
9-fold has accuracy:  0.8289473684210527
10-fold has accuracy:  0.7368421052631579
Mean accuracy: 77.2163 with standard deviation of: 4.9684


# Leave-One-Out CV
<hr style="border:2px solid black"> </hr>


- You can configure cross validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). 

- This variation of cross validation is called leave-one-out cross validation. 

- The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. 

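- **CONS**: A downside is that it can be a computationally more expensive procedure than k-fold cross validation, because one model is trained per observation.

- The standard deviation of the scores is much larger than for the k-fold results above, but the two numbers are not directly comparable. Each leave-one-out fold scores a single observation, so every score is either 0 or 1 and the standard deviation reduces to `sqrt(p * (1 - p))`, where `p` is the mean accuracy.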
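
A quick way to interpret the leave-one-out standard deviation is to look at the individual scores. The sketch below uses a small synthetic dataset (`X_demo`, `Y_demo` are assumptions, chosen to keep the run short) and shows that each score is 0 or 1, so the standard deviation is fully determined by the mean accuracy `p` as `sqrt(p * (1 - p))`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset: leave-one-out fits one model per row.
X_demo, Y_demo = make_classification(
    n_samples=100, n_features=8, n_informative=5, n_redundant=0, random_state=0
)

loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, Y_demo, cv=LeaveOneOut()
)

p = loo_scores.mean()
print("Distinct per-fold scores:", sorted(set(loo_scores)))
print(f"std of scores:     {loo_scores.std():.4f}")
print(f"sqrt(p * (1 - p)): {np.sqrt(p * (1 - p)):.4f}")
```

The two printed values agree, which is why this standard deviation says little about how stable the accuracy estimate is.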



In [18]:
loocv = LeaveOneOut()
model = LogisticRegression(max_iter = 250)
results = cross_val_score(model, X, Y, cv = loocv)
print(f"Mean accuracy: {results.mean()*100.0:.4f} with standard deviation of: {results.std()*100.0:.4f}")

Mean accuracy: 77.6042 with standard deviation of: 41.6894


# Repeated Random Test-Train Splits
<hr style="border:2px solid black"> </hr>


- Another variation on k-fold cross validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross validation. 

- This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross validation.

- **CONS**: A downside is that repetitions may include much of the same data in the train or the test split from run to run, introducing redundancy into the evaluation.
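
The redundancy mentioned above can be inspected by counting how often each row lands in a test split. This sketch only looks at the indices produced by `ShuffleSplit`, so no model is trained; the sample count of 768 and the split settings are assumptions that mirror the values used in this notebook.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_samples = 768
splitter = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)

# Count how many of the 10 test splits each row appears in.
test_counts = np.zeros(n_samples, dtype=int)
for _, test_idx in splitter.split(np.zeros((n_samples, 1))):
    test_counts[test_idx] += 1

print("Rows never used for testing:", int((test_counts == 0).sum()))
print("Most test appearances for one row:", int(test_counts.max()))
print("Total test evaluations:", int(test_counts.sum()))
```

Unlike k-fold, nothing forces each row to be tested exactly once: the splits are drawn independently, so some rows can be tested several times while others may never be tested.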



In [20]:
n_splits = 10
test_size = 0.33
seed = 7
# ShuffleSplit draws n_splits independent random train/test splits
shuffle_split = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=seed)
model = LogisticRegression(max_iter = 500)
results = cross_val_score(model, X, Y, cv = shuffle_split)
print(f"Mean accuracy: {results.mean()*100.0:.4f} with standard deviation of: {results.std()*100.0:.4f}")

Mean accuracy: 76.5354 with standard deviation of: 2.2354


# Conclusions
<hr style="border:2px solid black"> </hr>


| Method                            | Variation of      | Speed                        | Variance                    |
| --------------------------------- | ----------------- | ---------------------------- | --------------------------- |
| Train/test split                  | NA                | Fastest                      | Higher than k-Fold          |
| k-Fold                            | k-Fold            | Slower than train/test split | Lower than train/test split |
| Leave-one-out                     | k-Fold            | Slower than k-Fold           | Higher than k-Fold          |
| Repeated random test-train splits | Train/test split  | Slower than train/test split | Lower than k-Fold           |
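
One rough way to read the Speed column is to count how many models each strategy has to fit. The sketch below asks each scikit-learn splitter for its number of splits via `get_n_splits`; the array shape (768 rows) and the settings are assumptions that mirror this notebook, and the single train/test split is entered by hand as one fit.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

# Only the number of rows matters here, so a zero array is enough.
X_shape_only = np.zeros((768, 8))

strategies = {
    "k-Fold (k=10)": KFold(n_splits=10, shuffle=True, random_state=7),
    "Leave-one-out": LeaveOneOut(),
    "Repeated random splits (10)": ShuffleSplit(n_splits=10, test_size=0.33, random_state=7),
}

n_fits = {"Train/test split": 1}  # a single split means a single fit
for name, cv in strategies.items():
    n_fits[name] = cv.get_n_splits(X_shape_only)

for name, count in n_fits.items():
    print(f"{name:30s} {count:4d} model fits")
```

The fit count is only a proxy for run time, since the training set size also differs between strategies.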




# References
<hr style="border:2px solid black"> </hr>


- https://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/

