# Data Splits for Predictive Modeling

The problems in this notebook extend the concepts covered in lecture 3: Data Splits for Predictive Modeling.

##### 1. Introduction to stratified splits 

##### a.

The variable `y` below could come from a binary classification problem. Using `sklearn`'s `train_test_split` function, make a $70\%-30\%$ train test split for the given `y` variable. Look at `y_train` and `y_test` after your split.

In [1]:
import numpy as np

In [2]:
y = [0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0]

##### ANSWER

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
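## test_size=.2 gives the 16/4 split printed below;
## use test_size=.3 for the 70%-30% split requested above
y_train, y_test = train_test_split(y,
                                      shuffle=True,
                                      random_state=892,
                                      test_size=.2)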

In [5]:
print("y_train", y_train)
print("y_test", y_test)

y_train [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0]


Now write a loop that makes this train test split and prints out the training and test sets. Loop through at least 30 times. Do you notice anything that could cause issues when training a classification algorithm on these data?

<i>Note: do not include a `random_state` for `train_test_split` in your loop; we want the possibility of a different split each time through the loop.</i>

In [6]:
## running this in a loop 40 times

for i in range(40):
    y_train, y_test = train_test_split(y,
                                      shuffle=True,
                                      test_size=.3)
    print(str(i)+"th","time through")
    print("y_train", y_train)
    print("y_test", y_test)
    print()

0th time through
y_train [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [1, 0, 0, 0, 0, 0]

1th time through
y_train [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [1, 0, 0, 0, 0, 0]

2th time through
y_train [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 0]

3th time through
y_train [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 0]

4th time through
y_train [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 1, 0, 0, 0, 0]

5th time through
y_train [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 1, 0]

6th time through
y_train [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 0]

7th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_test [0, 0, 0, 0, 0, 1]

8th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
y_test [0, 0, 0, 0, 0, 0]

9th time through
y_train [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 0]

10th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0,

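We can notice that sometimes the test set has no `1` observations in it, in which case we could not evaluate how well a classifier predicts `1`. It is also possible for both `1`s to land in the test set, leaving none in the training set. In that case it would be impossible to train a classification algorithm to predict `1`.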
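To gauge how often an unstratified split goes wrong, here is a small calculation. It is a sketch that assumes the split chooses 6 of the 20 observations uniformly at random for the test set, which is what a shuffled split with `test_size=.3` amounts to for this `y`. The variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

# Sizes taken from the y above and test_size=.3
n_total, n_ones, n_test = 20, 2, 6
n_zeros = n_total - n_ones
n_splits = comb(n_total, n_test)  # number of equally likely test sets

# Probability that the test set contains exactly k of the two 1s
p_ones_in_test = {
    k: Fraction(comb(n_ones, k) * comb(n_zeros, n_test - k), n_splits)
    for k in range(n_ones + 1)
}

for k, p in p_ones_in_test.items():
    print(f"{k} ones in the test set: {p} (about {float(p):.3f})")
```

Under this assumption, roughly 48% of splits leave the test set without a `1`, and roughly 8% leave the training set without one.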

##### b. 

Look up the `train_test_split` documentation and check out the `stratify` argument. What does this argument do? Can we use it to help with the issue identified in a.?

##### ANSWER

The `stratify` argument takes the values of a categorical variable, here `y`, and groups the data set we want to split according to those values. The train test split is then performed within each group, and the pieces are recombined into a single training set and a single test set. As a result the class proportions in the training and test sets approximately match those of the full data set, so each category appears in both sets (provided it has enough observations). Yes, this can help us with the issue we had in a.
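The idea can be sketched by hand in plain Python. This is an illustration only, not `sklearn`'s actual implementation: the function name `stratified_split_sketch` is hypothetical, and rounding each class's test count to the nearest integer is an assumption made for simplicity.

```python
import random
from collections import defaultdict

def stratified_split_sketch(labels, test_size=0.3, seed=None):
    """Split labels into (train, test), sampling each class separately."""
    rng = random.Random(seed)

    # Group the positions of the observations by their class label
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        indices_by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Assumption: round the per-class test count to the nearest integer
        n_test = round(test_size * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Recombine and shuffle so the classes are not grouped together
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return [labels[i] for i in train_idx], [labels[i] for i in test_idx]

y = [0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0]
y_train, y_test = stratified_split_sketch(y, test_size=0.3, seed=0)
print("y_train", y_train)
print("y_test", y_test)
```

For this `y`, the sketch places one of the two `1`s in each set, whichever seed is used. Only the positions of the observations change.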

##### c.

Implement `train_test_split` using `stratify=y` and see what results. Run it many times without a random seed. Are there any issues?

In [7]:
## running this in a loop 10 times

for i in range(10):
    y_train, y_test = train_test_split(y,
                                      shuffle=True,
                                      test_size=.3,
                                      stratify=y)
    print(str(i)+"th","time through")
    print("y_train", y_train)
    print("y_test", y_test)
    print()

0th time through
y_train [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 1, 0, 0]

1th time through
y_train [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 1, 0]

2th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_test [1, 0, 0, 0, 0, 0]

3th time through
y_train [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 1, 0, 0, 0]

4th time through
y_train [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 1, 0]

5th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
y_test [0, 0, 0, 1, 0, 0]

6th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 1]

7th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
y_test [0, 0, 0, 0, 0, 1]

8th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
y_test [0, 0, 0, 1, 0, 0]

9th time through
y_train [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_test [0, 0, 0, 1, 0, 0]
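Rather than reading the printed splits by eye, one option is to tally how many `1`s land in each set across many stratified splits. This is a minimal sketch, assuming `sklearn` is installed. The counter names are illustrative.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

y = [0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0]

# Tally the number of 1s in the training and test sets over repeated splits
train_ones = Counter()
test_ones = Counter()
for _ in range(100):
    y_train, y_test = train_test_split(y,
                                       shuffle=True,
                                       test_size=.3,
                                       stratify=y)
    train_ones[sum(y_train)] += 1
    test_ones[sum(y_test)] += 1

print("number of 1s in y_train:", dict(train_ones))
print("number of 1s in y_test:", dict(test_ones))
```

In the ten runs printed above, each set receives exactly one `1`. The class is therefore always represented, but by a single observation, which is very little for a model to learn from or be evaluated on.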



--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).