# ML Nb. 16. K-fold Cross Validation
***

### Compiled by Amit Purswani
LinkedIn: https://www.linkedin.com/in/amit-purswani-2a073777/
***

<b>GitHub Repositories</b>
1. Data Analysis:
https://github.com/kranemetal/Data-Analysis-Projects

2. Machine Learning:
https://github.com/kranemetal/MachineLearning
*******

## Cross Validation
Cross-validation is a statistical method for estimating how well a model generalizes. The dataset is partitioned into a fixed number of folds (or partitions), the model is trained and evaluated once per fold, and the per-fold results are averaged into an overall estimate.

Steps
1. Reserve a portion of the dataset.
2. Train the model on the rest of the dataset.
3. Test the model on the reserved portion.
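
The three steps above can be sketched as a single hold-out round. This is a minimal illustration only: the synthetic data from `make_classification`, the `LogisticRegression` model, and the 25% reserve size are assumptions chosen to keep the example self-contained, not part of this notebook's dataset or model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Social_Network_Ads dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Step 1: reserve a portion of the dataset (here 25%, an arbitrary choice)
X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step 2: train the model on the rest of the dataset
holdout_model = LogisticRegression().fit(X_fit, y_fit)

# Step 3: test the model on the reserved portion
holdout_score = holdout_model.score(X_held, y_held)
print(f"Held-out accuracy: {holdout_score:.3f}")
```

Cross-validation repeats these steps with a different reserved portion each time and averages the scores.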

## Methods of Cross Validation

### Validation
In this method, we train on 50% of the given dataset and use the remaining 50% for testing. The major drawback is that the model never sees half of the data during training, so it may miss important information contained in that half, i.e. higher bias.

### LOOCV (Leave One Out Cross Validation)
- In this method, we train on the whole dataset except for a single data point, test on that point, and repeat for every data point.
- An __advantage__ of this method is that almost all data points are used for training in every iteration, so it has low bias.
- The major __drawback__ of this method is higher variance in the test estimate, because each iteration is tested against a single data point. If that point is an outlier, the result for that iteration can swing sharply. Another drawback is execution time, since the model is fitted once per data point.
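
As a rough sketch of LOOCV, scikit-learn's `LeaveOneOut` splitter can be passed to `cross_val_score`. The synthetic data and the `LogisticRegression` model below are assumptions made to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (assumption: 40 samples so the loop stays fast)
X_demo, y_demo = make_classification(n_samples=40, n_features=4, random_state=0)

# One fit per data point: train on 39 samples, test on the one left out
loo_scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=LeaveOneOut())

print("Number of fits:", len(loo_scores))
print("Distinct per-iteration scores:", sorted(set(loo_scores)))
print(f"LOOCV accuracy: {loo_scores.mean():.3f}")
```

Each iteration scores either 0 or 1 because only one point is tested, which illustrates why individual LOOCV results vary so much, and the number of fits equals the number of data points.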

### K-Fold Cross Validation
In this method, we split the dataset into k subsets (known as folds), train on k-1 of them, and hold out the remaining one to evaluate the trained model. We iterate k times, reserving a different subset for testing each time.
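
To make the fold rotation concrete, here is a small sketch using scikit-learn's `KFold` on ten toy samples. The toy array and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)      # ten toy samples (assumption, for illustration)
kf = KFold(n_splits=5)    # k = 5, no shuffling

fold_sizes = []
tested = []
for fold, (train_idx, test_idx) in enumerate(kf.split(data), start=1):
    # k-1 folds are used for training, the remaining fold for testing
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested.extend(test_idx.tolist())
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
```

Across the k iterations every sample lands in the test fold exactly once.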

__Note:__
A value of k = 10 is commonly suggested: a lower value of k moves toward the simple validation method, while a higher value of k approaches the LOOCV method.
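
The effect of k on the split sizes can be checked with a short sketch. It assumes 280 samples, which is the size of the training split used later in this notebook (70% of 400 rows); the placeholder array carries no real data.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 280                    # 70% of the 400 rows in this notebook
placeholder = np.zeros(n_samples)  # only the length matters here

split_sizes = {}
for k in (2, 10, n_samples):
    # Look at the first split only; the other splits have the same sizes here
    train_idx, test_idx = next(KFold(n_splits=k).split(placeholder))
    split_sizes[k] = (len(train_idx), len(test_idx))
    print(f"k={k}: train on {len(train_idx)}, test on {len(test_idx)}")
```

With k = 2 each round trains on half of the data, as in the validation method, and with k equal to the number of samples each round tests on a single point, as in LOOCV.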

Source: GeeksforGeeks

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import dataset

In [2]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'C:\Users\krane\Desktop\datasets\Social_Network_Ads.csv')

### Basic checks on dataset

In [3]:
df.shape

(400, 3)

In [4]:
df.head()

|   | Age | EstimatedSalary | Purchased |
|---|-----|-----------------|-----------|
| 0 | 19  | 19000           | 0         |
| 1 | 35  | 20000           | 0         |
| 2 | 26  | 43000           | 0         |
| 3 | 27  | 57000           | 0         |
| 4 | 19  | 76000           | 0         |


In [5]:
df.isnull().sum()

Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Age              400 non-null    int64
 1   EstimatedSalary  400 non-null    int64
 2   Purchased        400 non-null    int64
dtypes: int64(3)
memory usage: 9.5 KB


In [7]:
df.describe()

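|       | Age       | EstimatedSalary | Purchased |
|-------|-----------|-----------------|-----------|
| count | 400.0     | 400.0           | 400.0     |
| mean  | 37.655    | 69742.5         | 0.3575    |
| std   | 10.482877 | 34096.960282    | 0.479864  |
| min   | 18.0      | 15000.0         | 0.0       |
| 25%   | 29.75     | 43000.0         | 0.0       |
| 50%   | 37.0      | 70000.0         | 0.0       |
| 75%   | 46.0      | 88000.0         | 1.0       |
| max   | 60.0      | 150000.0        | 1.0       |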


### Splitting independent variable X and dependent variable Y

In [8]:
x = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

### Splitting the train and test datasets

In [9]:
from sklearn.model_selection import train_test_split
# Hold out 30% of the rows for testing; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

### Feature Scaling

In [10]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
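
The scaler is fitted on the training set only, and the test set is transformed with the training statistics. The sketch below shows this on a few made-up rows; the values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (Age, EstimatedSalary) rows, not taken from the dataset
train_rows = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
test_rows = np.array([[30.0, 80000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_rows)  # learns mean and std from train only
test_scaled = scaler.transform(test_rows)        # reuses the training mean and std

print("Train column means:", train_scaled.mean(axis=0))
print("Train column stds: ", train_scaled.std(axis=0))
print("Scaled test row:   ", test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row is expressed in units of the training standard deviation.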

### Train Kernel SVM model on Train set

In [11]:
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0)  # Using RBF kernel
classifier.fit(x_train, y_train)

SVC(random_state=0)

### Predicting on Test set

In [12]:
y_pred = classifier.predict(x_test)

### Checking the confusion matrix and accuracy score

In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred) #confusion matrix(actual, predicted)
print(cm)
accuracy_score(y_test, y_pred) #accuracy score (actual, predicted)

[[72  7]
 [ 4 37]]


0.9083333333333333
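
As a quick check, the accuracy score can be recovered from the confusion matrix printed above: correct predictions sit on the diagonal. The sketch below hard-codes that matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above (rows = actual, columns = predicted)
cm_reported = np.array([[72, 7],
                        [4, 37]])

# Unpack in scikit-learn's order for a binary problem
tn, fp, fn, tp = cm_reported.ravel()

# Accuracy = correct predictions / all predictions
accuracy_from_cm = (tn + tp) / cm_reported.sum()
print(f"Test samples: {cm_reported.sum()}")
print(f"Accuracy from confusion matrix: {accuracy_from_cm:.4f}")
```

The matrix sums to 120, which matches the 30% test split of 400 rows, and (72 + 37) / 120 gives the 0.9083 reported by `accuracy_score`.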

### Applying K-fold cross validation

In [15]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X=x_train, y=y_train, cv=10) # K is 10, generally chosen value
print("Accuracy: {:.2f}%".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f}".format(accuracies.std()))

Accuracy: 90.71%
Standard Deviation: 0.05
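
One caveat about the cell above: `x_train` was scaled once before cross-validation, so every validation fold contributed to the scaling statistics. A possible alternative, sketched here under assumptions, is to wrap the scaler and the classifier in a pipeline so that scaling is refitted inside each fold. The synthetic data, the explicit `StratifiedKFold` splitter, and the shuffling are choices made for this example rather than part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the two-feature training split (assumption)
X_demo, y_demo = make_classification(
    n_samples=280, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

# Scaler and classifier are refitted together on the training folds of each split
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pipeline_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)

print("Accuracy: {:.2f}%".format(pipeline_scores.mean() * 100))
print("Standard Deviation: {:.2f}%".format(pipeline_scores.std() * 100))
```

On real data the pipeline would be passed the unscaled training features, and the difference from the pre-scaled approach is often small but not guaranteed to be.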


### <center>The End</center>