## Train Test Frameworks

This exercise practices the syntax of the sklearn functions and classes that split data into train and test sets. The goal is to get familiar with these splitting methods before moving on to the more complex activities at the end of the day.

In [1]:
# import numpy
import numpy as np

In [2]:
X = np.random.normal(0,1,20).reshape(10,2)
y = np.random.normal(0,1,10)

* print X

In [3]:
print(X)

[[ 0.25817009 -0.81034671]
 [-0.22760127  1.02725828]
 [ 0.10375772  3.09888924]
 [-1.69096629  0.11980993]
 [-0.5599216  -1.52798032]
 [ 0.54518442  0.66826113]
 [ 0.46466619  1.04766121]
 [-1.0813783   0.42422541]
 [ 0.39240399  2.22936732]
 [-2.2147161   0.23305425]]


* print y

In [4]:
print(y)

[ 0.06023418  1.53271793  0.00562859  0.87447505 -2.07477936  1.22257457
  0.37234604  0.29206053  0.74078487 -0.11492574]


_____________________________
### Holdout split

* import the `model_selection` module from sklearn, which provides the **train_test_split** function

In [5]:
import sklearn.model_selection as model_selection

* split the data into a train set and a test set, using a 70:30 or an 80:20 ratio

In [6]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.30)

* print X_train

In [7]:
print(X_train)

[[-2.2147161   0.23305425]
 [ 0.25817009 -0.81034671]
 [-1.0813783   0.42422541]
 [ 0.39240399  2.22936732]
 [-0.22760127  1.02725828]
 [ 0.54518442  0.66826113]
 [ 0.46466619  1.04766121]]


* split the data again, this time with the parameter `shuffle=False`, so the rows keep their original order and the first 70% go to the train set

In [8]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.30, shuffle=False)

* print X_train

In [9]:
print(X_train)

[[ 0.25817009 -0.81034671]
 [-0.22760127  1.02725828]
 [ 0.10375772  3.09888924]
 [-1.69096629  0.11980993]
 [-0.5599216  -1.52798032]
 [ 0.54518442  0.66826113]
 [ 0.46466619  1.04766121]]


* print the shape of X_train and X_test

In [10]:
print(X_train.shape)

(7, 2)


In [11]:
print(X_test.shape)

(3, 2)
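Because `train_test_split` shuffles by default, the shuffled split above changes on every run. A minimal sketch of making it repeatable with the `random_state` parameter follows. The seed values and the `X_demo`, `y_demo`, `split_a`, and `split_b` names are assumptions for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Seeded generator so the demo data itself is reproducible (seed chosen arbitrarily).
rng = np.random.default_rng(0)
X_demo = rng.normal(0, 1, size=(10, 2))
y_demo = rng.normal(0, 1, size=10)

# Two calls with the same random_state produce the same shuffled split.
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)                                # True
print(split_a[0].shape, split_a[1].shape)  # (7, 2) (3, 2)
```

Fixing `random_state` is mostly useful when results need to be compared across runs. Leaving it unset gives a different split each time.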


_________________________________
### K-fold split 

* import the **KFold** class from sklearn

In [12]:
from sklearn.model_selection import KFold

* instantiate KFold with k=5

In [14]:
kf = KFold(n_splits=5)

* iterate over train_index and test_index in kf.split(X) and print them

In [21]:
# make sure X and y are NumPy arrays so they can be indexed with the fold index arrays
X = np.array(X)
y = np.array(y)
for train_index, test_index in kf.split(X):
    # X_train, X_test = X[train_index], X[test_index]
    # y_train, y_test = y[train_index], y[test_index]
    # print(y_test)
    print(train_index)
    print(test_index)

[2 3 4 5 6 7 8 9]
[0 1]
[0 1 4 5 6 7 8 9]
[2 3]
[0 1 2 3 6 7 8 9]
[4 5]
[0 1 2 3 4 5 8 9]
[6 7]
[0 1 2 3 4 5 6 7]
[8 9]


* instantiate KFold with k=5 and shuffle=True

In [22]:
kf = KFold(n_splits=5, shuffle=True)

* iterate over train_index and test_index in kf.split(X) and print them

In [23]:
# make sure X and y are NumPy arrays so they can be indexed with the fold index arrays
X = np.array(X)
y = np.array(y)
for train_index, test_index in kf.split(X):
    # X_train, X_test = X[train_index], X[test_index]
    # y_train, y_test = y[train_index], y[test_index]
    # print(y_test)
    print(train_index)
    print(test_index)

[0 1 3 5 6 7 8 9]
[2 4]
[0 1 2 3 4 6 7 8]
[5 9]
[2 3 4 5 6 7 8 9]
[0 1]
[0 1 2 3 4 5 6 9]
[7 8]
[0 1 2 4 5 7 8 9]
[3 6]


_______________________________________
### Leave-One-Out split
This is the Leave-p-out technique from the previous readings with p=1: each observation is used as the test set exactly once.
- This is a popular method for tiny datasets.
- It is slow on bigger datasets, because the model has to be fit once per observation, and it can lead to overfitting of the final model.

* import the **LeaveOneOut** class from sklearn

In [24]:
from sklearn.model_selection import LeaveOneOut

* instantiate LeaveOneOut

In [25]:
loo = LeaveOneOut()

* iterate over train_index and test_index in loo.split(X) and print them

In [26]:
for train_index, test_index in loo.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(X_train, X_test, y_train, y_test)

TRAIN: [1 2 3 4 5 6 7 8 9] TEST: [0]
[[-0.22760127  1.02725828]
 [ 0.10375772  3.09888924]
 [-1.69096629  0.11980993]
 [-0.5599216  -1.52798032]
 [ 0.54518442  0.66826113]
 [ 0.46466619  1.04766121]
 [-1.0813783   0.42422541]
 [ 0.39240399  2.22936732]
 [-2.2147161   0.23305425]] [[ 0.25817009 -0.81034671]] [ 1.53271793  0.00562859  0.87447505 -2.07477936  1.22257457  0.37234604
  0.29206053  0.74078487 -0.11492574] [0.06023418]
TRAIN: [0 2 3 4 5 6 7 8 9] TEST: [1]
[[ 0.25817009 -0.81034671]
 [ 0.10375772  3.09888924]
 [-1.69096629  0.11980993]
 [-0.5599216  -1.52798032]
 [ 0.54518442  0.66826113]
 [ 0.46466619  1.04766121]
 [-1.0813783   0.42422541]
 [ 0.39240399  2.22936732]
 [-2.2147161   0.23305425]] [[-0.22760127  1.02725828]] [ 0.06023418  0.00562859  0.87447505 -2.07477936  1.22257457  0.37234604
  0.29206053  0.74078487 -0.11492574] [1.53271793]
TRAIN: [0 1 3 4 5 6 7 8 9] TEST: [2]
[[ 0.25817009 -0.81034671]
 [-0.22760127  1.02725828]
 [-1.69096629  0.11980993]
 [-0.5599216  -1

* print the number of splits

In [27]:
loo.get_n_splits(X)

10
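The number of splits equals the number of observations, which is why Leave-One-Out can be seen as K-fold with k set to the sample count. A minimal sketch comparing the two on placeholder data follows. The `X_demo`, `loo_splits`, and `kf_splits` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Placeholder data with the same shape as X in the exercise: 10 rows, 2 columns.
X_demo = np.arange(20).reshape(10, 2)

# Convert the index arrays to plain lists so the splits can be compared with ==.
loo_splits = [(train.tolist(), test.tolist())
              for train, test in LeaveOneOut().split(X_demo)]
kf_splits = [(train.tolist(), test.tolist())
             for train, test in KFold(n_splits=len(X_demo)).split(X_demo)]

print(len(loo_splits))          # 10
print(loo_splits == kf_splits)  # True
```

The comparison uses the unshuffled `KFold`. With `shuffle=True` the folds would still each hold one observation, but in a different order.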