# Introduction

In [1]:
"""
What? Data splitting strategies.

The three main techniques are:
    [1] evaluate models with a train/test split
    [2] k-fold cross-validation
    [3] stratified k-fold cross-validation

Date: 28/11/20
Reference: XGBoost with Python, Jason Brownlee
"""

# Import modules

In [1]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from IPython.display import Markdown, display
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

# Loading the dataset

In [3]:
# load data
dataset = loadtxt('../DATASETS/pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]

# Evaluate the model with a train/test split

In [4]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7) 

# fit model on training data
model = XGBClassifier()
model.fit(X_train, y_train)

# make predictions for test data
predictions = model.predict(X_test)

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 74.02%
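
A single train/test split gives one accuracy figure, and that figure depends on which rows happen to land in the test set. The sketch below illustrates this by repeating the split with different `random_state` values. It is an illustration under assumptions, not the notebook's own experiment: the data is synthetic (`make_classification`, sized like the diabetes table) and a `DecisionTreeClassifier` stands in for `XGBClassifier` so the example runs without the CSV file or xgboost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 768 rows and 8 features.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)

split_scores = []
for seed in range(5):
    # Same 67/33 split as above, but with a different shuffle each time.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed
    )
    # Assumption: a decision tree stands in for XGBClassifier.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_tr, y_tr)
    split_scores.append(accuracy_score(y_te, clf.predict(X_te)))

print("Accuracy per split:", np.round(split_scores, 3))
print("Spread (max - min): %.3f" % (max(split_scores) - min(split_scores)))
```

The spread between the best and worst split gives a rough feel for how noisy a single-split estimate can be.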


# Evaluate the model with k-fold cross-validation

In [7]:
"""
Compared to a single train/test split, k-fold cross-validation gives an
estimate of the algorithm's performance with less variance. It works by
splitting the dataset into k parts (e.g. k = 5 or k = 10). Each part is
called a fold. The algorithm is trained on k − 1 folds and tested on the
one fold that was held back. This is repeated so that each fold of the
dataset gets a turn as the held-back test set. The k scores are averaged.
This is usually the best option when you do not have much data.
"""


In [10]:
model = XGBClassifier()
kfold = KFold(n_splits=10, shuffle = True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: ", results.mean()*100 )
print("Standard deviation: ", results.std()*100)

Accuracy:  72.52563226247437
Standard deviation:  5.469806031642198
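
`cross_val_score` hides the loop over folds. A minimal sketch of the same procedure written out by hand may make the mechanics clearer: train on k − 1 folds, test on the held-back fold, repeat, then average. As assumptions for the sake of a self-contained example, the data is synthetic and a `DecisionTreeClassifier` stands in for `XGBClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 8 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
fold_scores = []
times_tested = np.zeros(len(X_demo), dtype=int)

for train_idx, test_idx in kfold.split(X_demo):
    # Fit a fresh model on the k - 1 training folds.
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    # Score it on the single held-back fold.
    preds = clf.predict(X_demo[test_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], preds))
    # Track how often each row is used for testing.
    times_tested[test_idx] += 1

fold_scores = np.array(fold_scores)
print("Mean accuracy: %.2f%%" % (fold_scores.mean() * 100))
print("Standard deviation: %.2f%%" % (fold_scores.std() * 100))
print("Every row tested exactly once:", bool((times_tested == 1).all()))
```

Creating a new model inside the loop matters: reusing a fitted model across folds would leak information from earlier training folds into later test folds.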


# Evaluate the model with stratified k-fold cross validation

In [None]:
"""
If you have many classes or the classes are imbalanced, it can be a good
idea to use stratified folds. Stratification keeps the class distribution
in each fold approximately the same as in the full dataset.
"""

In [11]:
model = XGBClassifier()
kfold = StratifiedKFold(n_splits=10, shuffle = True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: ", results.mean()*100 )
print("Standard deviation: ", results.std()*100)

Accuracy:  74.21565276828434
Standard deviation:  3.8262743861637145
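
To see what stratification changes, the sketch below counts the positive labels that fall into each test fold under `KFold` and under `StratifiedKFold`. The label vector is a made-up example (80 negatives, 20 positives), not the diabetes data, and `positives_per_fold` is a hypothetical helper defined only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumption: a toy imbalanced label vector, 80 negatives and 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
# The splitters only need X for its length, so placeholder features suffice.
X_demo = np.zeros((len(y_demo), 1))


def positives_per_fold(splitter):
    """Return the number of positive labels in each test fold."""
    return [
        int(y_demo[test_idx].sum())
        for _, test_idx in splitter.split(X_demo, y_demo)
    ]


plain_counts = positives_per_fold(KFold(n_splits=10, shuffle=True, random_state=7))
strat_counts = positives_per_fold(
    StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
)

print("KFold positives per fold:          ", plain_counts)
print("StratifiedKFold positives per fold:", strat_counts)
```

With plain `KFold` the count of positives can differ from fold to fold, while `StratifiedKFold` places two positives in each of the ten folds, matching the 20% rate of the full label vector.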
