# K-Fold Cross Validation
K-Fold cross validation is one of the most popular validation methods used to **evaluate the performance** of a machine learning model. It gives an estimate that depends less on one particular split, since each item in the dataset is used for testing exactly once and for training in all the other rounds. <br> Accuracy is calculated for each train and test pair, and the **average** over all such combinations is taken as the **accuracy** of the model.
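
To make the splitting idea concrete, here is a minimal sketch in plain Python of how K folds can partition the sample indices. The helper names `k_fold_indices` and `k_fold_splits` are made up for this illustration and are not part of scikit-learn; the folds are contiguous and unshuffled, which is only one possible way to form them.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional item.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        # Training set = every fold except the one held out for testing.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_splits(n_samples=10, k=5))
for round_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"round {round_no}: test={test_idx} train={train_idx}")
```

Every index appears in exactly one test fold and in the training set of the remaining rounds, which is the property the description above relies on.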

### Aim

First, a model is trained and tested on a predefined split of the dataset into training and test sets.
Afterwards, cross validation is carried out to check the accuracy of the same model.

### Data 

The data contains the body measurements of the boys in a class and is used to build a model that predicts each student's sporting hobby. <br>
The measurements used are **Height, Weight, Chest, Biceps**. <br>
The target sports are **Cricket, Football and Basketball**. <br>

**Note**: The data used was created in an Excel sheet.

### Libraries Used

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection 
from sklearn import svm

### Loading the data 

In [2]:
### Data is read from the Data/ subfolder of the working directory
df = pd.read_csv("Data/Players.csv")
df.columns
df.head()

Unnamed: 0,Height,Weight,Chest,Biceps,Trait
0,170,60,50,18,Cricket
1,155,69,52,21,Cricket
2,168,62,53,19,Cricket
3,165,69,47,20,Cricket
4,157,68,50,20,Cricket


### Mapping of the Target data and Extracting the features


In [3]:
df = df.replace({'Cricket': 1, 'Football': 2, 'Basketball': 3})

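### as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
### Note: the 'Unnamed: 0' index column shown above is not dropped, so it stays among the features.
features = df.drop('Trait', axis=1).to_numpy()
target = df['Trait'].to_numpy()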

### Dataset Preparation for Training and Testing without Cross Validation

The dataset is split into two sets: **60%** for **training** and the remaining **40%** for **testing**.
The `model_selection` module from scikit-learn is used for this.

In [4]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(features, target, test_size=0.4, random_state=1)

### Training a Support Vector Classification Model

In [5]:
### A model based on SVC is created with the predefined training set
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

### Checking the accuracy of the model with the fixed test data

In [6]:
### Now measure its performance with the test data
clf.score(X_test, y_test)   

0.9152542372881356

The accuracy of the model on the current test data is **91.53%**.

### Cross Validation of the Model

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. Here **K** is set to **5**, so an array of five accuracy values is returned as the cross validation scores.

In [7]:
### Calculating the accuracy of the model on each fold (the default score for a classifier)
scores = model_selection.cross_val_score(clf, features, target, cv=5)

### To find the accuracy for each fold:
print(scores)

[ 0.93333333  1.          0.93333333  0.86666667  1.        ]
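
What `cross_val_score` does with `cv=5` can be sketched as an explicit loop. The example below is an illustration under assumptions, not the notebook's own code: it uses scikit-learn's bundled iris data as a stand-in for `Players.csv` (also four measurements and three classes), so its scores will differ from the ones printed above. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, which the loop reproduces with `StratifiedKFold`.

```python
import numpy as np
from sklearn import model_selection, svm
from sklearn.base import clone
from sklearn.datasets import load_iris

# Stand-in data (an assumption): iris instead of the Players.csv file.
X, y = load_iris(return_X_y=True)

clf = svm.SVC(kernel='linear', C=1)

# Same fold strategy that cross_val_score picks for a classifier with cv=5.
skf = model_selection.StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    fold_model = clone(clf)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library call should give the same five accuracies.
library_scores = model_selection.cross_val_score(clf, X, y, cv=5)

print(manual_scores)
print(library_scores)
```

Each fold trains a fresh copy of the model, so the classifier fitted earlier on the 60% split does not influence the cross validation scores.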


### Cross Validation Score / Mean Score 

In [8]:
### And the mean accuracy of all 5 folds:
print(scores.mean())

0.946666666667
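
The mean alone hides how much the folds disagree, so it can help to report the spread as well. The sketch below recomputes the mean from the five fold accuracies printed above (typed in by hand, rounded as shown) and adds the standard deviation, using only the standard library; the variable names are illustrative.

```python
from statistics import mean, pstdev

# The five fold accuracies from the output above, rounded as printed.
fold_scores = [0.93333333, 1.0, 0.93333333, 0.86666667, 1.0]

mean_accuracy = mean(fold_scores)
# Population standard deviation, matching NumPy's default std().
spread = pstdev(fold_scores)

print(f"mean accuracy:    {mean_accuracy:.4f}")
print(f"std across folds: {spread:.4f}")
```

In the notebook itself, `scores.std()` on the NumPy array should give the same spread.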


### Observation
The mean cross validation score is **higher** than the score on the fixed split, but this is **not guaranteed** for every use case.
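
One way to see why the comparison can go either way is to repeat the fixed 60/40 split with different random seeds and look at how much the single-split score moves. This is a rough sketch on scikit-learn's bundled iris data, used here as an assumed stand-in for `Players.csv`; the ten seeds are an arbitrary choice.

```python
from sklearn import model_selection, svm
from sklearn.datasets import load_iris

# Stand-in data (an assumption): iris instead of the Players.csv file.
X, y = load_iris(return_X_y=True)

# Score of a single 60/40 hold-out split, repeated for ten different seeds.
holdout_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.4, random_state=seed)
    model = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    holdout_scores.append(model.score(X_test, y_test))

# Mean accuracy over 5 folds for the same kind of model.
cv_mean = model_selection.cross_val_score(
    svm.SVC(kernel='linear', C=1), X, y, cv=5).mean()

print(f"hold-out range: {min(holdout_scores):.3f} to {max(holdout_scores):.3f}")
print(f"5-fold mean:    {cv_mean:.3f}")
```

Depending on the seed, a single split can land above or below the cross validation mean, which is why one fixed split is a noisier basis for comparison.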

### Conclusion
Cross validation gives a more reliable evaluation of the model's accuracy than a single fixed train and test split. <br>
When we tested our model with the fixed split, the accuracy was **91.53%**. <br>
With cross validation, the mean accuracy across the five folds was higher, at **94.67%**.