<div class="alert alert-block alert-success">
    <h1 align="center">Scikit-Learn Tips</h1>
    <h3 align="center">Tip 02: Stratify</h3>
</div>

* Stratification means that the `train_test_split` function returns training and test subsets whose class-label proportions match those of the input dataset.
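To make the idea concrete, here is a minimal sketch that stratifies by hand: it samples half of the rows from each class separately, so both halves keep the original class ratio. This is only an illustration of the concept, not how scikit-learn implements it internally, and the names `toy`, `train_part`, and `test_part` are hypothetical.

```python
import pandas as pd

# Toy imbalanced dataset: 10 'Not Fraud' rows and 2 'Fraud' rows
toy = pd.DataFrame({'feature': list(range(12)),
                    'target': ['Not Fraud'] * 10 + ['Fraud'] * 2})

# Draw 50% of the rows from each class separately (manual stratification)
train_part = toy.groupby('target').sample(frac=0.5, random_state=0)

# Everything that was not drawn goes to the other half
test_part = toy.drop(train_part.index)

print(train_part['target'].value_counts())
print(test_part['target'].value_counts())
```

Because each class is halved on its own, both parts should end up with 5 'Not Fraud' rows and 1 'Fraud' row.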

In [10]:
import pandas as pd
df = pd.DataFrame({'feature': list(range(12)), 'target': ['Not Fraud']*10 + ['Fraud']*2})

In [12]:
df['target'].value_counts()

Not Fraud    10
Fraud         2
Name: target, dtype: int64

In [13]:
X = df[['feature']]
y = df['target']

In [14]:
from sklearn.model_selection import train_test_split

## Not stratified

`y_train` contains **NONE** of the minority class, whereas `y_test` contains **ALL** of it. (This is bad: a model trained on this split never sees a single 'Fraud' example!)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

In [16]:
y_train

1    Not Fraud
7    Not Fraud
9    Not Fraud
3    Not Fraud
0    Not Fraud
5    Not Fraud
Name: target, dtype: object

In [17]:
y_test

6     Not Fraud
11        Fraud
4     Not Fraud
10        Fraud
2     Not Fraud
8     Not Fraud
Name: target, dtype: object
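A split like this can also be caught programmatically instead of by eye. The sketch below rebuilds the same data and reports which labels are absent from each subset. The helper `missing_classes` is a hypothetical name introduced for illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def missing_classes(y_full, y_subset):
    """Return the labels that occur in y_full but not in y_subset."""
    return set(y_full.unique()) - set(y_subset.unique())


df = pd.DataFrame({'feature': list(range(12)),
                   'target': ['Not Fraud'] * 10 + ['Fraud'] * 2})
X = df[['feature']]
y = df['target']

# Same unstratified split as above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

print("Missing from y_train:", missing_classes(y, y_train))
print("Missing from y_test:", missing_classes(y, y_test))
```

A non-empty set for `y_train` means the model would be trained without any example of that class.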

## Stratified

Class proportions are the **SAME** in `y_train` and `y_test`: each contains 5 'Not Fraud' and 1 'Fraud', matching the 10:2 ratio of the full dataset. (This is good!)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)

In [8]:
y_train

2     Not Fraud
8     Not Fraud
4     Not Fraud
1     Not Fraud
11        Fraud
9     Not Fraud
Name: target, dtype: object

In [9]:
y_test

0     Not Fraud
7     Not Fraud
3     Not Fraud
5     Not Fraud
10        Fraud
6     Not Fraud
Name: target, dtype: object
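Rather than counting rows by hand, the proportions can be compared directly with `value_counts(normalize=True)`. The sketch below rebuilds the same data and stratified split, then places the class shares of the full dataset and both subsets side by side. The `proportions` table is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'feature': list(range(12)),
                   'target': ['Not Fraud'] * 10 + ['Fraud'] * 2})
X = df[['feature']]
y = df['target']

# Same stratified split as above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Share of each class in the full data and in each subset
proportions = pd.DataFrame({
    'full': y.value_counts(normalize=True),
    'train': y_train.value_counts(normalize=True),
    'test': y_test.value_counts(normalize=True),
})
print(proportions)
```

All three columns should show about 0.83 for 'Not Fraud' and about 0.17 for 'Fraud'.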