### Cross Validation

A single train-test split is not enough to judge a model. The score from one split depends on the split ratio and the random seed, so the same model can look better or worse depending on which rows happen to land in the test set. For example, one person evaluates on an 80-20 train-test split with random seed 123 and gets 78.5%, while someone else uses a 70-30 split with random seed 999 and gets 79.87%. The performance measured on a single train-test split can therefore be misleading. <br>
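To make the seed dependence concrete, here is a minimal sketch using only the standard library. The dataset (two overlapping 1-D Gaussian classes), the midpoint-threshold "model", and the helper names `make_data` and `holdout_accuracy` are all assumptions made for illustration; they are not part of the notebook's data or models.

```python
import random

def make_data(n=200, seed=0):
    """Hypothetical 1-D dataset: two overlapping Gaussian classes (labels 0 and 1)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        value = rng.gauss(1.0 if label == 1 else 0.0, 1.0)
        data.append((value, label))
    return data

def holdout_accuracy(data, test_fraction, seed):
    """Shuffle with `seed`, split once, fit a midpoint threshold, score on the test part."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test, train = rows[:n_test], rows[n_test:]
    # "Training": the threshold is the midpoint between the two class means.
    values0 = [v for v, c in train if c == 0]
    values1 = [v for v, c in train if c == 1]
    threshold = (sum(values0) / len(values0) + sum(values1) / len(values1)) / 2
    # "Testing": predict class 1 when the value is above the threshold.
    correct = sum(1 for v, c in test if (v > threshold) == (c == 1))
    return correct / len(test)

data = make_data()
# Same data, same model, same 80-20 ratio; only the split seed changes.
scores = {seed: holdout_accuracy(data, 0.2, seed) for seed in (1, 123, 999)}
for seed, acc in scores.items():
    print(f"seed={seed}: accuracy={acc:.3f}")
```

Any spread between the printed accuracies comes purely from which rows ended up in the test set.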

<b>We need to evaluate the model on several different partitions of the data and report an average score.</b> This is known as cross-validation. <br>

<b>K-fold cross-validation</b> divides the data into k folds (groups) of roughly equal size. We can do this on the training set only or on the entire dataset. In each iteration, one fold serves as the test data and the remaining k-1 folds serve as the training data. The process repeats until every fold has been used as the test data exactly once.<br>

The average of the performance scores for each iteration is the final score. <br>
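The fold rotation and the final averaging can be sketched in plain Python. This is an illustrative sketch, not what scikit-learn runs internally: `k_fold_indices` is a hypothetical helper, the folds are contiguous and unshuffled, and the per-fold scores are made-up numbers used only to show the averaging step. (For classifiers, scikit-learn's integer `cv` argument uses stratified folds instead.)

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample, so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: test={test_idx} train={train_idx}")

# Hypothetical per-fold scores, only to illustrate the final averaging step.
fold_scores = [0.80, 0.78, 0.82, 0.79, 0.81]
final_score = sum(fold_scores) / len(fold_scores)
print(f"cross-validation score: {final_score:.2f}")
```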

<ol>
    <li>Training set: The sample of data used to fit the model.</li>
    <li>Validation set: The sample of data used to provide an unbiased evaluation of a model that has been fit on the training data while tuning model hyperparameters. The evaluation becomes more biased as information from the validation set is incorporated into the model configuration.</li>
    <li>Test set: The sample of data used to provide an unbiased evaluation of the final model fit on the training dataset.</li>
</ol>
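The three sets listed above can be produced with a single shuffle followed by two cuts. The sketch below assumes a 60/20/20 split and a hypothetical helper named `three_way_split`; neither the ratios nor the helper come from the notebook.

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut the rows into disjoint train, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    n_val = int(len(rows) * val_fraction)
    test = rows[:n_test]                   # held out until the very end
    val = rows[n_test:n_test + n_val]      # used while tuning hyperparameters
    train = rows[n_test + n_val:]          # used to fit the model
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Because the three parts are disjoint, tuning against the validation set does not leak information from the test set.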

In [1]:
import pandas as pd

In [2]:
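# Load the data and one-hot encode the categorical columns
data = pd.read_csv("../data/04 - decisiontreeAdultIncome.csv")
data_prep = pd.get_dummies(data,drop_first=True)
# Features are all columns except the last; the target is the last column
x = data_prep.iloc[:,:-1]
y = data_prep.iloc[:,-1]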

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=1)

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=2)

from sklearn.svm import SVC
svc = SVC(kernel="rbf",gamma=0.5)

In [3]:
from sklearn.model_selection import cross_validate

# 10-fold cross-validation; return_train_score=True also records the training scores
cv_decision_tree = cross_validate(dtc,x,y,cv=10,return_train_score=True)
cv_random_forest = cross_validate(rfc,x,y,cv=10,return_train_score=True)
cv_svm = cross_validate(svc,x,y,cv=10,return_train_score=True)

In [7]:
cv_svm['test_score'].mean(), cv_random_forest['test_score'].mean(), cv_decision_tree["test_score"].mean()

(0.8036085674097743, 0.7975443368718358, 0.7808671281008731)

In [8]:
cv_svm['train_score'].mean(), cv_random_forest['train_score'].mean(), cv_decision_tree["train_score"].mean()

(0.8744798848765454, 0.9043423576250881, 0.9043535882172298)
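
The two result tuples above can be compared directly: the difference between the mean train score and the mean test score gives a rough indication of how much each model overfits. A small sketch of that arithmetic, with the values copied from the outputs above (the dictionary names are only for illustration):

```python
# Mean scores copied from the two outputs above (order: SVM, random forest, decision tree).
test_mean = {
    "svm": 0.8036085674097743,
    "random_forest": 0.7975443368718358,
    "decision_tree": 0.7808671281008731,
}
train_mean = {
    "svm": 0.8744798848765454,
    "random_forest": 0.9043423576250881,
    "decision_tree": 0.9043535882172298,
}

# Gap between train and test accuracy; a larger gap suggests more overfitting.
gaps = {name: train_mean[name] - test_mean[name] for name in test_mean}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: train-test gap = {gap:.4f}")
```

On these numbers the SVM has both the highest mean test score and the smallest train-test gap, while the decision tree has the largest gap.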