
# Sklearn Train Test Split Examples

## Installation
scikit-learn and TensorFlow come preinstalled on Google Colab, so the install cell below is only needed in other environments.

In [None]:
# The PyPI package is named scikit-learn; sklearn is only the import name
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
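A quick way to confirm that the library is importable is to print its version. This is a minimal sketch; the variable name `sklearn_version` is only illustrative.

```python
import sklearn

# The import name is sklearn even though the PyPI package is scikit-learn
sklearn_version = sklearn.__version__
print("scikit-learn version:", sklearn_version)
```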


In [None]:
!pip list

Package                       Version
----------------------------- ------------------------------
absl-py                       1.1.0
alabaster                     0.7.12
albumentations                0.1.12
altair                        4.2.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arviz                         0.12.1
astor                         0.8.1
astropy                       4.3.1
astunparse                    1.6.3
atari-py                      0.2.9
atomicwrites                  1.4.1
attrs                         21.4.0
audioread                     2.1.9
autograd                      1.4
Babel                         2.10.3
backcall                      0.2.0
beautifulsoup4                4.6.3
bleach                        5.0.1
blis                          0.7.8
bokeh                         2.3.3
branca                        0.5.0
bs4                           0.0.1
CacheControl                  0.

## Example Data

In [None]:
# Loading of Data

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', version=1)
mnist.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [None]:
# Parsing of Data

X, y = mnist["data"], mnist["target"]

print('X type:', type(X))
print('X shape:', X.shape)
print('y type:', type(y))
print('y shape:', y.shape)

X type: <class 'pandas.core.frame.DataFrame'>
X shape: (70000, 784)
y type: <class 'pandas.core.series.Series'>
y shape: (70000,)


## Randomized Test-Train Split
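- Uses an 80:20 train-to-test ratio
- Purely random sampling is generally adequate for large datasets
- On small datasets it can introduce sampling bias, because a random subset may not reflect the class proportions of the whole dataset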
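The sampling-bias point can be illustrated with a small sketch. It assumes a hypothetical toy dataset of 20 labels (not MNIST) and repeats the random split with different seeds to show how far the class share of a tiny test set can drift from the overall share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 20 samples, 25% belong to the positive class
y_small = np.array([1] * 5 + [0] * 15)
X_small = np.arange(len(y_small)).reshape(-1, 1)

# Repeat the purely random 80:20 split with different seeds
test_positive_rates = []
for seed in range(20):
    _, _, _, y_te = train_test_split(
        X_small, y_small, test_size=0.2, random_state=seed
    )
    test_positive_rates.append(float(y_te.mean()))

print("overall positive rate:", y_small.mean())
print("test positive rate per seed:", test_positive_rates)
```

With only 4 test samples per split, the positive rate in the test set can only take the values 0, 0.25, 0.5, 0.75 or 1, so it often misses the overall rate of 0.25.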

### A) sklearn.model_selection.train_test_split
Returns the training and test subsets directly, with the same container type as the input (here a pandas DataFrame or Series)

In [None]:
# Split a single object: only X (the features) is passed, so the labels are not split here

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

In [None]:
print('X_train type:', type(X_train))

X_train type: <class 'pandas.core.frame.DataFrame'>


In [None]:
# Features and labels are passed as separate objects and split using the same row indices

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print('X_train type:', type(X_train))
print('y_train type:', type(y_train))

X_train type: <class 'pandas.core.frame.DataFrame'>
y_train type: <class 'pandas.core.series.Series'>
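To check that features and labels stay paired after the split, a small sketch on hypothetical toy arrays (not MNIST) can help. The arrays `X_demo` and `y_demo` are illustrative and built so that the label can be recovered from the feature row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Every feature row still sits next to its own label
print(np.array_equal(X_tr[:, 0], 2 * y_tr))  # True
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True
```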


In [None]:
X_train

Unnamed: 0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,pixel10,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
47339,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
67456,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
32557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37194,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6265,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
54886,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
860,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
y_train

47339    5
67456    4
12308    8
32557    0
664      2
        ..
37194    6
6265     6
54886    1
860      0
15795    0
Name: class, Length: 56000, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']

### B) sklearn.model_selection.ShuffleSplit
Generates train and test indices for each split. Every split is drawn independently at random, so test sets from different splits can overlap and some samples may never appear in any test set

In [None]:
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=5, test_size=.25, random_state=0)
for train_index, test_index in rs.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [ 4497  8301 63679 ... 42613 43567 68268] TEST: [10840 56267 14849 ... 68531 15063 16576]
TRAIN: [ 3837 37432  4677 ... 47654 24116 45100] TEST: [37411 20000  4661 ... 23120 63884 13496]
TRAIN: [  790 59804 34867 ... 41304 14060 49410] TEST: [38284 16570 42640 ... 27698 47188 49485]
TRAIN: [37356 17460 25151 ... 67281 47103  2768] TEST: [51288 28825 23427 ...  6366 51576 35778]
TRAIN: [ 5485 50715 55740 ...  7117 19098 19056] TEST: [68514 49648 31810 ... 35715 43982 47499]
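The overlap between ShuffleSplit test sets can be made visible by counting how often each sample is used for testing. This is a minimal sketch on a hypothetical array of 100 samples (not MNIST); the names `X_demo`, `rs_demo` and `counts` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(100).reshape(-1, 1)
rs_demo = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

# Count how many times each sample lands in a test set
counts = np.zeros(len(X_demo), dtype=int)
test_sizes = []
for _, test_idx in rs_demo.split(X_demo):
    counts[test_idx] += 1
    test_sizes.append(len(test_idx))

print("test set sizes:", test_sizes)
print("samples never tested:", int((counts == 0).sum()))
print("samples tested more than once:", int((counts > 1).sum()))
```

Five test sets of 25 samples use 125 test slots for only 100 samples, so at least some samples must appear in more than one test set.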


## Stratified Sampling
- Each class is represented in every split in proportion to its share of the whole dataset
- The resulting class proportions are nearly identical to those of the overall dataset (exact up to rounding)
- Example: for a dataset of 60 males and 40 females, a 10% split contains 6 males and 4 females
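The 60/40 example above can be reproduced with a short sketch. It assumes a hypothetical label array `gender` and uses train_test_split with stratify to draw the 10% split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical population: 60 males and 40 females
gender = np.array(["male"] * 60 + ["female"] * 40)
people = np.arange(len(gender)).reshape(-1, 1)

# Draw a stratified 10% split
_, sample, _, sample_gender = train_test_split(
    people, gender, test_size=0.1, stratify=gender, random_state=0
)

values, value_counts = np.unique(sample_gender, return_counts=True)
split_counts = dict(zip(values.tolist(), value_counts.tolist()))
print(split_counts)  # {'female': 4, 'male': 6}
```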

### A) sklearn.model_selection.train_test_split
Returns the training and test subsets directly; passing stratify=y keeps the class proportions of y in both subsets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
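One way to verify the effect of stratify is to compare class proportions before and after the split. The sketch below uses hypothetical imbalanced labels rather than MNIST, and `class_proportions` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: five classes with 50, 100, 150, 200 and 500 samples
y_imb = np.repeat(np.arange(5), [50, 100, 150, 200, 500])
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

def class_proportions(labels, n_classes=5):
    """Share of each class among the given labels."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

# Same split size and seed, without and with stratification
_, _, _, y_te_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42
)
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

print("overall         :", class_proportions(y_imb))
print("random test     :", class_proportions(y_te_plain))
print("stratified test :", class_proportions(y_te_strat))
print("stratified train:", class_proportions(y_tr_strat))
```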

### B) sklearn.model_selection.StratifiedKFold
Generates train and test indices using stratified K-fold. With shuffle=True the data is shuffled once before being divided into folds. The test folds are disjoint, so every sample appears in exactly one test set

In [None]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [    0     1     2 ... 69997 69998 69999] TEST: [   10    14    17 ... 69987 69989 69992]
TRAIN: [    0     1     3 ... 69997 69998 69999] TEST: [    2     6     7 ... 69982 69985 69994]
TRAIN: [    0     1     2 ... 69994 69995 69999] TEST: [    4     5     8 ... 69996 69997 69998]
TRAIN: [    0     1     2 ... 69997 69998 69999] TEST: [    3    20    22 ... 69981 69991 69995]
TRAIN: [    2     3     4 ... 69996 69997 69998] TEST: [    0     1    15 ... 69983 69993 69999]
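The claim that the test folds do not repeat can be checked directly. This is a minimal sketch on hypothetical balanced labels (30 samples, 3 classes), not on MNIST; it confirms that each index appears in exactly one test fold and that every fold keeps the class balance.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 3 classes with 10 samples each
y_demo = np.repeat([0, 1, 2], 10)
X_demo = np.zeros((len(y_demo), 1))

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_folds = [test_idx for _, test_idx in skf_demo.split(X_demo, y_demo)]

# Each index should show up in exactly one test fold
all_test = np.concatenate(test_folds)
print(sorted(all_test.tolist()) == list(range(30)))  # True

# Class counts inside every test fold
fold_class_counts = [np.bincount(y_demo[idx], minlength=3).tolist() for idx in test_folds]
print(fold_class_counts)
```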


### C) sklearn.model_selection.StratifiedShuffleSplit
Generates train and test indices from repeated stratified random splits rather than K-fold. A fresh shuffle is drawn for every split, so test sets from different splits can overlap

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
ss = StratifiedShuffleSplit(n_splits=5, test_size=.25, random_state=0)
for train_index, test_index in ss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [25991 37955 65840 ... 28397 26211 47858] TEST: [56892 60635 22817 ... 34597 46016  4112]
TRAIN: [27002 30188  2131 ... 51342 17982 69150] TEST: [57581 39372 30397 ...  6009  6410 66850]
TRAIN: [34669 60943 29288 ... 18047 50427 33681] TEST: [35752 23241 29056 ... 41065 35793  7706]
TRAIN: [33027   859 34066 ... 60200 52233 58905] TEST: [37828 34489 18589 ... 62464 61417 58984]
TRAIN: [15576 38750 65609 ...  6710 10869 36707] TEST: [41403 23450 49617 ... 19032 41571 25120]
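The two properties of StratifiedShuffleSplit, preserved class ratios and possibly repeated test indices, can be seen in a small sketch. It assumes hypothetical labels with a 60:40 class ratio (not MNIST); `ss_demo` and `counts` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels: 60 samples of class 0 and 40 samples of class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((len(y_demo), 1))

ss_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

counts = np.zeros(len(y_demo), dtype=int)  # how often each sample is tested
test_class_counts = []
for _, test_idx in ss_demo.split(X_demo, y_demo):
    counts[test_idx] += 1
    test_class_counts.append(np.bincount(y_demo[test_idx], minlength=2).tolist())

print("class counts per test set:", test_class_counts)
print("samples tested more than once:", int((counts > 1).sum()))
```

Every test set of 25 samples keeps the 60:40 ratio (15 and 10 samples), while the 125 test slots spread over 100 samples force some indices to repeat across splits.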
