## Implementing Cross Validation on the Wine Quality dataset

`sklearn` provides two functions to simplify cross-validation:
- `cross_val_score`
- `cross_validate`

Cross-validation trains and scores a model on several different splits of the dataset, which helps determine whether the model is actually worth building.

The difference between `cross_val_score` and `cross_validate` is that the former returns only the test scores, whilst the latter returns a dictionary with additional information such as `fit_time`, `score_time` and `test_score`. It can also return the trained models under `estimator` when called with `return_estimator=True`.
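
To make the difference concrete, here is a minimal sketch of what each function returns. It uses synthetic data from `make_classification` rather than the wine dataset, and the names `X_demo`, `y_demo`, `demo_scores` and `demo_results` are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# cross_val_score returns only an array with one score per fold.
demo_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(demo_scores.shape)

# cross_validate returns a dict; the fitted models appear under
# 'estimator' only because return_estimator=True is passed.
demo_results = cross_validate(clf, X_demo, y_demo, cv=5, return_estimator=True)
print(sorted(demo_results.keys()))
```

Without `return_estimator=True`, the dictionary would contain only the timing and score entries.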

Let us see a practical implementation of cross-validation.

In [117]:
import warnings
warnings.filterwarnings('ignore')

In [118]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

## About the dataset

Taken from GitHub: https://github.com/tkeldenich/First_Project_with_Scikit-Learn_MachineLearning/blob/main/winequality-white.csv

The dataset rates each wine by its quality (the `quality` column).

In [119]:
data = pd.read_csv("winequality-white.csv", sep=";")
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [120]:
# shuffle the dataset to get a fair distribution
data = data.sample(frac=1).reset_index(drop=True)

In [121]:
# features: every column except the target (axis=1 is redundant when columns= is given)
X = data.drop(columns='quality')
y = data['quality']

In [122]:
decisionTree = DecisionTreeClassifier()
scores = cross_val_score(decisionTree, X, y, cv=10)
scores

array([0.60408163, 0.61020408, 0.58163265, 0.63673469, 0.64285714,
       0.63265306, 0.61632653, 0.64693878, 0.62372188, 0.64417178])

In [123]:
scores.mean()

0.6239322231960269
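
As a rough illustration of what `cross_val_score` does with `cv=10`, the loop below reproduces its scores by hand. It is a sketch under stated assumptions: for a classifier and an integer `cv`, scikit-learn splits with an unshuffled `StratifiedKFold`; the data is synthetic (`make_classification`), and `random_state=0` is fixed on the tree so both runs are comparable. The names `X_demo`, `y_demo`, `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_demo, y_demo):
    model = clone(clf)  # fresh, unfitted copy for every fold
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() on a classifier reports accuracy on the held-out fold
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

library_scores = cross_val_score(clf, X_demo, y_demo, cv=10)
print(np.allclose(manual_scores, library_scores))
```

Each fold is used exactly once for scoring while the other nine are used for training, which is why ten scores come back.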

In [124]:
# hold out 10% of the data as a final test set that cross-validation never sees
# (train_test_split was already imported above)
X_train_test, X_gtest, y_train_test, y_gtest = train_test_split(X, y, test_size=0.10)

In [125]:
from sklearn import tree
decisionTree = tree.DecisionTreeClassifier()

In [126]:
cv_results = cross_validate(decisionTree, X_train_test, y_train_test, cv=10, return_estimator=True)
cv_results['test_score']

array([0.60997732, 0.63718821, 0.62358277, 0.59637188, 0.63945578,
       0.62585034, 0.62131519, 0.64172336, 0.63409091, 0.62272727])

In [127]:
cv_results['test_score'].mean()

0.6252283034425891

In [128]:
# score each of the 10 fitted fold models on the held-out set
gtest_score = [estimator.score(X_gtest, y_gtest) for estimator in cv_results['estimator']]

In [129]:
sum(gtest_score) / len(gtest_score)

0.6146938775510205
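
The comparison between the cross-validation mean and the held-out mean can be sketched in one self-contained cell. This is only an illustration on synthetic data (`make_classification`), with fixed `random_state` values added for repeatability; `X_cv`, `X_holdout`, `results` and `holdout_scores` are hypothetical names, and reporting the standard deviation is an addition not present in the cells above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the wine dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Keep 10% aside; cross-validation only ever sees the remaining 90%.
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0
)

results = cross_validate(
    DecisionTreeClassifier(random_state=0), X_cv, y_cv, cv=10, return_estimator=True
)

# Score every fold's fitted model on the untouched hold-out set.
holdout_scores = np.array(
    [est.score(X_holdout, y_holdout) for est in results['estimator']]
)

print(f"CV mean accuracy:       {results['test_score'].mean():.3f}")
print(f"hold-out mean accuracy: {holdout_scores.mean():.3f} (std {holdout_scores.std():.3f})")
```

If the two means are close, as they are in the notebook (about 0.625 versus 0.615), the cross-validation estimate is a reasonable guide to performance on unseen data.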