# K-Fold Cross Validation
K-Fold Cross Validation helps detect the problem of "high variance" in a model's performance. The "high variance" problem occurs when a model's accuracy varies significantly from one test set to another, so the accuracy on a single test set can be misleading.

K-Fold Cross Validation is a resampling procedure for evaluating and selecting a machine learning model. It divides the entire data set into K folds and uses each fold as the testing set exactly once, training on the remaining K - 1 folds. Because the model is scored on multiple testing sets instead of a single one, the results give a better picture of the model's reliability.

### Example of K-Fold Cross Validation
Let's say K = 5. The data set is split into 5 folds.

In the first iteration, the first fold is used to test the model and the other four are used for training.  
In the second iteration, the second fold is used to test the model and the other four are used for training.  
This process is repeated until each of the 5 folds has been used as the testing set once.
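
The rotation described above can be sketched with scikit-learn's `KFold` splitter. This is only an illustration on a hypothetical toy array of 10 sample indices (not the ads data set), with shuffling left off so each test fold is a contiguous block:

```python
import numpy as np
from sklearn.model_selection import KFold

# hypothetical toy data: 10 samples, identified only by their index
samples = np.arange(10)

# 5 folds, no shuffling, so every test fold is a contiguous block of 2 samples
kfold = KFold(n_splits=5)

test_folds = []
for iteration, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # each iteration holds out one fold for testing and trains on the other four
    test_folds.append(test_idx.tolist())
    print(f"iteration {iteration}: test={test_idx.tolist()} train={train_idx.tolist()}")
```

Every sample lands in a test fold exactly once, which is what lets the 5 scores together cover the whole data set.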

In [1]:
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# import the data set
ads_df = pd.read_csv("datasets/social_network_ads.csv")

ads_df.head()

| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


In [3]:
# x is the Age and Estimated Salary columns
x = ads_df.iloc[:, [2, 3]].values

# y is the Purchased column
y = ads_df.iloc[:, 4].values

In [4]:
# split the data set into training and testing data sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

In [5]:
# import a Standardization Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

# feature scale the training and testing sets
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
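
As a rough check on what the scaling step does, the sketch below compares `StandardScaler` with the formula (x - mean) / std, computed by hand from the training rows only. The `train_demo` and `test_demo` arrays are made-up (age, salary) values for illustration, not rows from the data set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical (age, salary) rows, not taken from the ads data set
train_demo = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test_demo = np.array([[25.0, 30000.0]])

# fit on the training rows only, then reuse those statistics for the test rows
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)
test_scaled = scaler.transform(test_demo)

# the same transformation by hand, using the training mean and standard deviation
train_mean = train_demo.mean(axis=0)
train_std = train_demo.std(axis=0)
manual_train = (train_demo - train_mean) / train_std
manual_test = (test_demo - train_mean) / train_std

print(np.allclose(train_scaled, manual_train), np.allclose(test_scaled, manual_test))
```

Calling `transform` (not `fit_transform`) on the test rows keeps the test set from influencing the scaling statistics.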



# Kernel RBF Support Vector Machine

In [6]:
# import the support vector classifier class
from sklearn.svm import SVC

In [7]:
# create an RBF SVC classifier, then fit it to the training set
classifier = SVC(kernel="rbf", random_state=0)
classifier.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=0,
  shrinking=True, tol=0.001, verbose=False)

In [8]:
# predict the test set results
y_pred = classifier.predict(x_test)

y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])

# Confusion Matrix
The confusion matrix is great if you're only testing a single test set: it counts the correct and incorrect predictions for each class on that one set.

In [9]:
# import the confusion matrix function
from sklearn.metrics import confusion_matrix

In [10]:
# create a confusion matrix that compares the y_test (actual) to the y_pred (prediction)
cm = confusion_matrix(y_test, y_pred)

"""
Read the Confusion Matrix diagonally (rows are actual classes, columns are predicted classes):
main diagonal: 64 + 29 = 93 correct predictions
off diagonal: 4 + 3 = 7 incorrect predictions
"""
cm

array([[64,  4],
       [ 3, 29]])
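
The arithmetic in that reading can be reproduced directly from the matrix. The sketch below re-enters the values shown above by hand (the name `cm_demo` is introduced here only for illustration) and turns them into a single accuracy figure:

```python
import numpy as np

# the confusion matrix printed above, re-entered by hand
cm_demo = np.array([[64, 4],
                    [3, 29]])

# scikit-learn's layout for binary labels: rows = actual, columns = predicted
tn, fp, fn, tp = cm_demo.ravel()

correct = np.trace(cm_demo)          # main diagonal: 64 + 29
incorrect = cm_demo.sum() - correct  # off diagonal: 4 + 3
accuracy = correct / cm_demo.sum()

print(correct, incorrect, accuracy)
```

This gives one accuracy for one test set, which is the limitation that cross validation addresses.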

# K-Fold Cross Validation
Let's perform 10-Fold Cross Validation with the Kernel SVM on the training set. There's no need to fit or predict manually for this step, because cross_val_score trains and scores a copy of the model on each fold itself.

In [13]:
# import the k-fold cross validation
from sklearn.model_selection import cross_val_score

In [19]:
"""
perform k-fold cross validation on the kernel SVM classifier
- estimator = classifier is the model to evaluate
- X = x_train is the feature data that gets split into k folds, each used once as a testing set
- y = y_train is the actual values (labels) for the training set
- cv = 10 means to use 10 folds
- n_jobs = -1 means to use all the CPUs on the machine (helpful on large data sets)
"""
accuracies = cross_val_score(estimator=classifier, X=x_train, y=y_train, cv=10, n_jobs=-1)

# the 10 accuracies, one for each fold of the training set
accuracies

array([0.80645161, 0.96666667, 0.8       , 0.93333333, 0.86666667,
       0.83333333, 0.93333333, 0.93333333, 0.96666667, 0.96551724])
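
To show roughly what `cross_val_score` does internally, here is a sketch of the same loop written by hand. It assumes synthetic data from `make_classification` instead of the ads data set, and it relies on the documented behavior that an integer `cv` with a classifier uses stratified folds without shuffling; `X_demo`, `y_demo`, and the other names are introduced only for this example:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic two-feature binary classification data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

model = SVC(kernel="rbf", random_state=0)
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # train a fresh copy on 9 folds, then score accuracy on the held-out fold
    fold_model = clone(model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# the library call should produce the same 10 accuracies
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)
print(np.allclose(manual_scores, library_scores))
```

Each fold gets its own freshly cloned model, so the already fitted classifier is never reused across folds.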

In [18]:
# the mean (average) accuracy across the 10 folds of the training set
accuracies.mean()

0.9005302187615868

In [21]:
"""
the standard deviation of the accuracies across the 10 folds of the training set

As seen below, the standard deviation is about 6%. Since the mean
is about 90%, one standard deviation around the mean spans roughly
84% to 96%. A high mean accuracy suggests low bias and a small
standard deviation suggests low variance, which is great!
"""
accuracies.std()

0.06388957356626285
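
The range quoted in the comment can be recomputed from the fold accuracies. The sketch below re-enters the 10 values printed earlier by hand (`accuracies_demo` is a name used only here) and derives the mean, the standard deviation, and the interval of one standard deviation around the mean:

```python
import numpy as np

# the 10 fold accuracies printed earlier, re-entered by hand
accuracies_demo = np.array([0.80645161, 0.96666667, 0.8, 0.93333333, 0.86666667,
                            0.83333333, 0.93333333, 0.93333333, 0.96666667, 0.96551724])

mean_acc = accuracies_demo.mean()
std_acc = accuracies_demo.std()  # population standard deviation, NumPy's default

# one standard deviation either side of the mean
low, high = mean_acc - std_acc, mean_acc + std_acc
print(f"mean={mean_acc:.3f} std={std_acc:.3f} range=({low:.3f}, {high:.3f})")
```

Individual folds can still fall outside this interval, so it is best read as a summary of spread, not a hard bound.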