## Train Test Frameworks

This exercise practices the syntax of the scikit-learn utilities that split data into train and test sets. The goal is to get familiar with these splitting methods before engaging with the more complex activities at the end of the day.

In [1]:
# import numpy
import numpy as np

In [2]:
X = np.random.normal(0,1,20).reshape(10,2)
y = np.random.normal(0,1,10)
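
The arrays above are drawn without a seed, so the values printed in this notebook will differ on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator; the seed `0` and the names `X_seeded` and `y_seeded` are illustrative choices, not part of the original exercise:

```python
import numpy as np

# Seed 0 is an arbitrary, illustrative choice; any fixed integer works.
rng = np.random.default_rng(0)

# 10 observations with 2 features each, plus one target value per observation.
X_seeded = rng.normal(0, 1, size=(10, 2))
y_seeded = rng.normal(0, 1, size=10)

print(X_seeded.shape, y_seeded.shape)
```

Re-running this cell produces the same arrays each time, which makes the printed splits further down comparable between runs.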

* print X

In [3]:
# print X
print(X)

[[ 1.08881477e+00  2.22694147e-01]
 [ 1.01627152e-03 -1.02068596e+00]
 [-7.03799520e-01 -8.92483276e-01]
 [-1.91559797e+00  1.25669814e-02]
 [-5.40307657e-01  3.53645517e-01]
 [-7.98893507e-01 -1.57217137e+00]
 [ 4.51386976e-01 -4.08586348e-01]
 [-2.19665228e+00 -6.39243448e-01]
 [-1.24786238e+00  3.09257261e-01]
 [-5.65443169e-01  5.28739896e-01]]


* print y

In [4]:
print(y)

[-0.59012587  0.15430381 -0.05352965  0.3014196   0.24058661  0.67283493
  0.54030159  0.23502545  0.37874257  0.36138157]


_____________________________
### Holdout split

* import the **train_test_split** function from sklearn

In [5]:
# import train_test_split
from sklearn.model_selection import train_test_split


* split the data into a train set and a test set without shuffling. A 70:30 or 80:20 ratio also works; the cell below uses test_size=0.33.

In [7]:
# split the data into train and test sets (test_size=0.33) without shuffling
# random_state has no effect when shuffle=False
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4, shuffle=False)

* print X_train

In [8]:
# print X_train
print("X_train:", X_train)

X_train: [[ 1.08881477e+00  2.22694147e-01]
 [ 1.01627152e-03 -1.02068596e+00]
 [-7.03799520e-01 -8.92483276e-01]
 [-1.91559797e+00  1.25669814e-02]
 [-5.40307657e-01  3.53645517e-01]
 [-7.98893507e-01 -1.57217137e+00]]


* split the data again, but now with shuffling (the default, shuffle=True)

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=4)

* print X_train

In [10]:
print("X_train:", X_train)

X_train: [[-7.03799520e-01 -8.92483276e-01]
 [ 4.51386976e-01 -4.08586348e-01]
 [ 1.08881477e+00  2.22694147e-01]
 [ 1.01627152e-03 -1.02068596e+00]
 [-7.98893507e-01 -1.57217137e+00]
 [-2.19665228e+00 -6.39243448e-01]]


* print X_train and X_test

In [13]:
# print X_train and X_test
print('X-train',X_train)
print('X-test',X_test)

X-train [[-7.03799520e-01 -8.92483276e-01]
 [ 4.51386976e-01 -4.08586348e-01]
 [ 1.08881477e+00  2.22694147e-01]
 [ 1.01627152e-03 -1.02068596e+00]
 [-7.98893507e-01 -1.57217137e+00]
 [-2.19665228e+00 -6.39243448e-01]]
X-test [[-1.91559797  0.01256698]
 [-1.24786238  0.30925726]
 [-0.54030766  0.35364552]
 [-0.56544317  0.5287399 ]]
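
To make the effect of `shuffle` easier to see than with random values, here is a minimal sketch on recognisable data. The arrays `X_demo` and `y_demo` and the other variable names are illustrative assumptions, not part of the original exercise; the `train_test_split` arguments are the same ones used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: row i of X_demo is [2i, 2i+1] and y_demo[i] == i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# shuffle=False: the first rows go to the train set, the last rows to the test set.
Xtr_ordered, Xte_ordered, ytr_ordered, yte_ordered = train_test_split(
    X_demo, y_demo, test_size=0.33, shuffle=False
)

# shuffle=True (the default): rows are permuted before splitting;
# random_state fixes the permutation so the split is repeatable.
Xtr_shuf, Xte_shuf, ytr_shuf, yte_shuf = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=4
)

print(Xtr_ordered.shape, Xte_ordered.shape)  # (6, 2) (4, 2)
print("ordered :", ytr_ordered, yte_ordered)
print("shuffled:", ytr_shuf, yte_shuf)
```

With 10 rows, `test_size=0.33` is rounded up to 4 test rows, which leaves 6 rows for training. This matches the 6-row `X_train` and 4-row `X_test` printed above.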


_________________________________
### K-fold split 

* import the **KFold** class from sklearn

In [14]:
# import KFold from sklearn
from sklearn.model_selection import KFold

* instantiate KFold with k=5 and shuffle=True

In [15]:
# instantiate KFold with k=5, shuffling the rows with a fixed random_state
kf = KFold(n_splits=5, shuffle=True, random_state=42)

* iterate over train_index and test_index in kf.split(X) and print them

In [18]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index)
    print("TEST:", test_index)
    # X_train, X_test = X[train_index], X[test_index]
    # y_train, y_test = y[train_index], y[test_index]
    # print(X_train.shape, y_train.shape)
    # print(X_test.shape, y_test.shape)

TRAIN: [0 2 3 4 5 6 7 9]
TEST: [1 8]
TRAIN: [1 2 3 4 6 7 8 9]
TEST: [0 5]
TRAIN: [0 1 3 4 5 6 8 9]
TEST: [2 7]
TRAIN: [0 1 2 3 5 6 7 8]
TEST: [4 9]
TRAIN: [0 1 2 4 5 7 8 9]
TEST: [3 6]


* instantiate KFold with k=5 and shuffle=False

In [21]:
kf = KFold(n_splits=5, shuffle=False)

* iterate over train_index and test_index in kf.split(X) and print them

In [22]:
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index)
    print("TEST:", test_index)

TRAIN: [2 3 4 5 6 7 8 9]
TEST: [0 1]
TRAIN: [0 1 4 5 6 7 8 9]
TEST: [2 3]
TRAIN: [0 1 2 3 6 7 8 9]
TEST: [4 5]
TRAIN: [0 1 2 3 4 5 8 9]
TEST: [6 7]
TRAIN: [0 1 2 3 4 5 6 7]
TEST: [8 9]
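
The index arrays printed above are meant to be used for slicing `X` and `y`. A minimal sketch of that step, assuming illustrative arrays `X_demo` and `y_demo` (hypothetical names, not from the exercise); it also counts how often each row lands in a test fold:

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each row is used as test data across all folds.
test_counts = np.zeros(len(X_demo), dtype=int)

for fold, (train_index, test_index) in enumerate(kf_demo.split(X_demo)):
    # Slice features and targets with the same index arrays.
    X_tr, X_te = X_demo[train_index], X_demo[test_index]
    y_tr, y_te = y_demo[train_index], y_demo[test_index]
    test_counts[test_index] += 1
    print(f"fold {fold}: train {X_tr.shape}, test {X_te.shape}")

print(test_counts)  # every row is tested exactly once
```

With 10 rows and 5 folds, each fold trains on 8 rows and tests on 2, and every row appears in exactly one test fold, whether or not the rows are shuffled.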


_______________________________________
### Leave-One-Out split
This is the Leave-p-out technique from the previous readings with p=1: each observation is used once, on its own, as the test set.
- This is a popular method for tiny datasets.
- It takes a lot of time with bigger datasets, because one model is fit per observation, and it can lead to overfitting in the final model.

* import the **LeaveOneOut** class from sklearn

In [23]:
# import LeaveOneOut from sklearn
from sklearn.model_selection import LeaveOneOut

* instantiate LeaveOneOut

In [25]:
# instantiate LeaveOneOut
loo = LeaveOneOut()

* iterate over train_index and test_index in loo.split(X) and print them

In [33]:
# iterate over the splits from loo.split(X) and print the train and test indices
for train_index, test_index in loo.split(X):
    print(train_index)
    print(test_index)

[1 2 3 4 5 6 7 8 9]
[0]
[0 2 3 4 5 6 7 8 9]
[1]
[0 1 3 4 5 6 7 8 9]
[2]
[0 1 2 4 5 6 7 8 9]
[3]
[0 1 2 3 5 6 7 8 9]
[4]
[0 1 2 3 4 6 7 8 9]
[5]
[0 1 2 3 4 5 7 8 9]
[6]
[0 1 2 3 4 5 6 8 9]
[7]
[0 1 2 3 4 5 6 7 9]
[8]
[0 1 2 3 4 5 6 7 8]
[9]


* print the number of splits

In [35]:
# print the number of splits: one per observation, so 10 here
# (len(train_index) would give 9, the size of each training set, not the number of splits)
print(loo.get_n_splits(X))

10
