<a href="https://colab.research.google.com/github/mrcrdg/jupyter_set/blob/master/Cross_Validation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cross Validation

## k-fold Cross Validation

### Toy example:

In [None]:
from numpy import array
from sklearn.model_selection import KFold

Create the sample data


In [None]:
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

Create the `KFold` splitter with three folds

In [None]:
kfold = KFold(n_splits=3)

List the contents of each fold:

In [None]:
for train, test in kfold.split(data):
  print("train: {}, test: {}".format(data[train], data[test]))

train: [0.3 0.4 0.5 0.6], test: [0.1 0.2]
train: [0.1 0.2 0.5 0.6], test: [0.3 0.4]
train: [0.1 0.2 0.3 0.4], test: [0.5 0.6]
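
To make the partitioning above less of a black box, here is a minimal sketch that rebuilds the same three folds by hand with NumPy and compares them with scikit-learn's splitter. It assumes the default, unshuffled `KFold` behaviour; the names `manual_splits` and `sklearn_splits` are introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
n_splits = 3

# Without shuffling, each test fold is a contiguous block of indices.
indices = np.arange(len(data))
folds = np.array_split(indices, n_splits)

manual_splits = []
for test_idx in folds:
    # Every index outside the current test block is used for training.
    train_idx = np.setdiff1d(indices, test_idx)
    manual_splits.append((train_idx, test_idx))
    print("train: {}, test: {}".format(data[train_idx], data[test_idx]))

# Cross-check against scikit-learn's own splitter.
sklearn_splits = list(KFold(n_splits=n_splits).split(data))
matches = all(
    np.array_equal(m_train, s_train) and np.array_equal(m_test, s_test)
    for (m_train, m_test), (s_train, s_test) in zip(manual_splits, sklearn_splits)
)
print("matches KFold:", matches)
```

Passing `shuffle=True` (optionally with `random_state`) to `KFold` would instead assign samples to folds at random, which is usually preferable when the data are ordered.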


### A more detailed example:

First, download the [Sonar dataset](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)), in which each sample is a sonar return labeled as a rock (`R`) or a mine (`M`).

In [None]:
! wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data

--2020-11-20 13:37:35--  https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 87776 (86K) [application/x-httpd-php]
Saving to: ‘sonar.all-data’


2020-11-20 13:37:36 (1.43 MB/s) - ‘sonar.all-data’ saved [87776/87776]



In [None]:
!head -10 sonar.all-data

0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1.0000,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.3250,0.3200,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.1840,0.1970,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.0530,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,0.6333,0.7060,0.5544,0.5320,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.

Perform the required imports

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

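Read and rearrange the data. The file has no header row, so `read_csv` with its default settings uses the first sample as column names; this is why the shape below reports 207 rows rather than the dataset's 208. Passing `header=None` would keep every sample.

In [None]:
sonar_dataset = pd.read_csv("sonar.all-data")
# Features: every column except the last; labels: the last column (R = rock, M = mine)
sonar_X = sonar_dataset.iloc[:,:-1].values
sonar_y = sonar_dataset.iloc[:,-1:].values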

In [None]:
sonar_X.shape

(207, 60)

Split data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sonar_X, sonar_y, 
                                                    random_state=2, 
                                                    test_size=.3)

Create the 3-fold splitter, plus a list to collect the model trained on each fold

In [None]:
k_fold = KFold(3)

lr_model = LogisticRegression()

models = []

Now train a new classifier on each training partition, then score it on the held-out validation fold and on the test set

In [None]:
for k, (train, validation) in enumerate(k_fold.split(X_train, y_train)):
    lr_model = LogisticRegression()
    lr_model.fit(X_train[train], y_train[train].ravel())
    models.append(lr_model)
    print("[fold {}], score on validation: {:.5f}, score on test: {:.5f}, ". 
          format(k, lr_model.score(X_train[validation], y_train[validation]),
                 lr_model.score(X_test, y_test)))

[fold 0], score on validation: 0.81250, score on test: 0.82540, 
[fold 1], score on validation: 0.66667, score on test: 0.76190, 
[fold 2], score on validation: 0.75000, score on test: 0.68254, 
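
The per-fold validation scores above are usually summarised by their mean and spread. The sketch below does this with `cross_val_score`, which runs the same fit-and-score loop internally. It uses a synthetic dataset from `make_classification` as a stand-in for the Sonar training split (an assumption made so the snippet runs without the downloaded file), so its numbers are not comparable with the output above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the Sonar training split (not the real data).
X_demo, y_demo = make_classification(n_samples=144, n_features=60, random_state=0)

k_fold = KFold(n_splits=3)

# cross_val_score clones the estimator, fits it on each training partition
# and returns the accuracy on the corresponding validation fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=k_fold)

print("per-fold accuracy:", scores)
print("mean: {:.5f}, std: {:.5f}".format(scores.mean(), scores.std()))
```

With the real data, `X_train` and `y_train.ravel()` would take the place of `X_demo` and `y_demo`.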


## Leave One Out Cross Validation - LOOCV

### Toy example:

In [None]:
from numpy import array
from sklearn.model_selection import LeaveOneOut

Create the sample data


In [None]:
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

Create the `LeaveOneOut` splitter

In [None]:
loocv = LeaveOneOut()

List the contents of each fold:

In [None]:
for train, test in loocv.split(data):
  print("train: {}, test: {}".format(data[train], data[test]))

train: [0.2 0.3 0.4 0.5 0.6], test: [0.1]
train: [0.1 0.3 0.4 0.5 0.6], test: [0.2]
train: [0.1 0.2 0.4 0.5 0.6], test: [0.3]
train: [0.1 0.2 0.3 0.5 0.6], test: [0.4]
train: [0.1 0.2 0.3 0.4 0.6], test: [0.5]
train: [0.1 0.2 0.3 0.4 0.5], test: [0.6]
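
The output above has one fold per sample, which suggests that leave-one-out is k-fold cross-validation with k equal to the number of samples. A small sketch that checks this on the toy data (the variable names are introduced only for this illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

loo_splits = list(LeaveOneOut().split(data))
kfold_splits = list(KFold(n_splits=len(data)).split(data))

# Compare the train/test index arrays of both splitters, fold by fold.
same = all(
    np.array_equal(loo_train, kf_train) and np.array_equal(loo_test, kf_test)
    for (loo_train, loo_test), (kf_train, kf_test) in zip(loo_splits, kfold_splits)
)

print("number of splits:", len(loo_splits))
print("identical to KFold(n_splits=len(data)):", same)
```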


### A more detailed example:

Continue using the [Sonar dataset](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks))

Perform the required imports

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

Read and rearrange the data (as before, the default `read_csv` call uses the first sample as the header row)

In [None]:
sonar_dataset = pd.read_csv("sonar.all-data")
sonar_X = sonar_dataset.iloc[:,:-1].values
sonar_y = sonar_dataset.iloc[:,-1:].values

Split data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(sonar_X, sonar_y, 
                                                    random_state=2, 
                                                    test_size=.3)

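Create the leave-one-out splitter, which is equivalent to N-fold cross-validation with N equal to the number of training samples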

In [None]:
loocv = LeaveOneOut()

lr_model = LogisticRegression()

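Now train the classifier on each partition. Each validation fold holds a single sample, so the validation score is either 1 (correct) or 0 (wrong)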

In [None]:
for k, (train, validation) in enumerate(loocv.split(X_train, y_train)):
    lr_model.fit(X_train[train], y_train[train].ravel())
    print("[fold {}], score on validation: {:.5f}, score on test: {:.5f}, ". 
          format(k, lr_model.score(X_train[validation], y_train[validation]),
                 lr_model.score(X_test, y_test)))

[fold 0], score on validation: 1.00000, score on test: 0.79365, 
[fold 1], score on validation: 1.00000, score on test: 0.77778, 
[fold 2], score on validation: 1.00000, score on test: 0.77778, 
[fold 3], score on validation: 1.00000, score on test: 0.77778, 
[fold 4], score on validation: 1.00000, score on test: 0.77778, 
[fold 5], score on validation: 1.00000, score on test: 0.79365, 
[fold 6], score on validation: 1.00000, score on test: 0.77778, 
[fold 7], score on validation: 1.00000, score on test: 0.77778, 
[fold 8], score on validation: 1.00000, score on test: 0.77778, 
[fold 9], score on validation: 1.00000, score on test: 0.77778, 
[fold 10], score on validation: 1.00000, score on test: 0.79365, 
[fold 11], score on validation: 1.00000, score on test: 0.77778, 
[fold 12], score on validation: 1.00000, score on test: 0.79365, 
[fold 13], score on validation: 0.00000, score on test: 0.77778, 
[fold 14], score on validation: 1.00000, score on test: 0.77778, 
[fold 15], score on 
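
Because each leave-one-out validation score is 0 or 1, the individual lines say little on their own; the usual summary is their mean, which is the LOOCV accuracy estimate. A minimal sketch, again on a small synthetic dataset used as a stand-in (an assumption, so the value it prints is unrelated to the Sonar results):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic stand-in; LOOCV fits one model per sample, so cost grows with the data size.
X_demo, y_demo = make_classification(n_samples=40, n_features=10, random_state=0)

loo_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=LeaveOneOut()
)

# Every entry is 0.0 or 1.0; their mean is the fraction of held-out samples predicted correctly.
print("number of folds:", len(loo_scores))
print("LOOCV accuracy: {:.5f}".format(loo_scores.mean()))
```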

# Exercise


The Banknote dataset involves predicting whether a given banknote is authentic from a number of measurements taken from a photograph of it.

It is a binary (2-class) classification problem, and the two classes do not have the same number of observations. There are 1,372 observations with 4 input variables and 1 output variable. For more information, check [this link](http://archive.ics.uci.edu/ml/datasets/banknote+authentication).
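
When classes are unbalanced, one option is a stratified splitter, which keeps the class proportions roughly equal across folds. The sketch below illustrates the idea with scikit-learn's `StratifiedKFold` on hypothetical labels (12 samples of class 0 and 6 of class 1); it is not part of the original exercise and does not use the Banknote data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical unbalanced labels: 12 samples of class 0 and 6 of class 1.
y_demo = np.array([0] * 12 + [1] * 6)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=3)

fold_ratios = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Share of class 1 in this test fold; the overall share is 6 / 18.
    ratio = y_demo[test_idx].mean()
    fold_ratios.append(ratio)
    print("test size: {}, share of class 1: {:.3f}".format(len(test_idx), ratio))
```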

## Download and read the data

In [None]:
!wget -c http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt

--2020-11-17 21:01:47--  http://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46400 (45K) [application/x-httpd-php]
Saving to: ‘data_banknote_authentication.txt’


2020-11-17 21:01:48 (453 KB/s) - ‘data_banknote_authentication.txt’ saved [46400/46400]



In [None]:
!head -10 data_banknote_authentication.txt

3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
4.3684,9.6718,-3.9606,-3.1625,0
3.5912,3.0129,0.72888,0.56421,0
2.0922,-6.81,8.4636,-0.60216,0
3.2032,5.7588,-0.75345,-0.61251,0
1.5356,9.1772,-2.2718,-0.73535,0


In [None]:
import pandas as pd

In [None]:
# The file has no header row, so header=None keeps the first record as data
banknote_dataset = pd.read_csv("data_banknote_authentication.txt", header=None)

In [None]:
banknote_dataset.head()

Unnamed: 0,0,1,2,3,4
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0
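
Since the file has no header row, its columns carry no meaningful names. The sketch below shows one way to load such data with explicit names and split it into features and labels. It reads three records copied from the `head -10` output above from an in-memory buffer, so it runs without the downloaded file; the column names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# First three records copied from the `head -10` output above.
sample = io.StringIO(
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "3.866,-2.6383,1.9242,0.10645,0\n"
)

# Assumed column names (illustrative; the file itself does not provide any).
columns = ["variance", "skewness", "curtosis", "entropy", "class"]

# header=None tells pandas that the first line is data, not column names.
banknote_sample = pd.read_csv(sample, header=None, names=columns)

X = banknote_sample.iloc[:, :-1].values  # the 4 input variables
y = banknote_sample.iloc[:, -1].values   # the class label as a 1-D array

print(banknote_sample.shape)
print(X.shape, y.shape)
```

For the full dataset, the file name `"data_banknote_authentication.txt"` would replace the in-memory buffer.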


## Tasks

1. Create Naive Bayes and Logistic Regression classifiers;
1. Perform cross-validation on both models to select the best one;
1. Create boxplots to show the variation of the scores across the k folds:
   * Check the [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) and [`cross_val_predict`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict) functions
1. Save the best model using Python's `pickle` library (see [this link](https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/) for reference).
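
As a starting point for the tasks above, here is a hedged sketch of the overall workflow on synthetic data: two candidate classifiers are compared with `cross_val_score`, the one with the higher mean score is refit, and it is round-tripped through `pickle`. The synthetic dataset, the choice of `GaussianNB` as the Naive Bayes variant, the 5 shuffled folds and all variable names are assumptions made for illustration, not the expected solution.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with 4 input variables (not the Banknote data).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Use the same folds for both models so their scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in candidates.items()
}
for name, fold_scores in scores.items():
    print("{}: mean={:.3f}, std={:.3f}".format(name, fold_scores.mean(), fold_scores.std()))

# Pick the model with the highest mean score and refit it on all the data.
best_name = max(scores, key=lambda name: scores[name].mean())
best_model = candidates[best_name].fit(X_demo, y_demo)

# Round-trip through pickle in memory; pickle.dump with an open file would persist it to disk.
restored = pickle.loads(pickle.dumps(best_model))
print("best model:", best_name)
```

The per-fold arrays in `scores` are the kind of input a boxplot function (for example matplotlib's `pyplot.boxplot`) can draw, one box per model.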