## Create subsets with equal class distribution

The dataset `fraud.csv` contains various **customer features** and the binary variable **`label`, which indicates fraudulent behavior**.

In [1]:
import pandas as pd

# Read data
fraud = pd.read_csv("../resources/fraud.csv")
fraud.head()

Unnamed: 0,col1,col2,col3,col4,col5,label
0,-1.250814,0.411701,0.983945,-1.219923,-0.578289,0
1,1.479663,0.890066,-0.073772,2.12142,2.033026,1
2,-2.158867,-4.264348,-2.918629,-3.089199,-4.859321,0
3,-1.687921,-1.301744,-0.840542,-1.051996,-1.558555,0
4,0.686342,0.577495,1.477151,-1.920847,-0.955816,0


Your task is to divide the dataset into five **equally sized subsets** (`fraud1.csv`, `fraud2.csv`, ..., `fraud5.csv`).

In addition, the **binary variable `label`** should be **equally distributed** across all five subsets, i.e. each subset should contain (as nearly as possible) the same share of fraudulent samples.

In [2]:
from sklearn.model_selection import StratifiedKFold

# Pseudo split of the data for the stratified k-fold function
X = fraud.drop('label', axis=1)
y = fraud['label']

# Initialize k-fold object
kfold = StratifiedKFold(n_splits=5)

# Generate folds; the test indices of each fold form one subset
for i, (_, test_idx) in enumerate(kfold.split(X, y), start=1):
    # Store test set to file
    fraud.iloc[test_idx].to_csv(f"fraud{i}.csv")
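
Two defaults are worth knowing here: `StratifiedKFold` does not shuffle the rows unless `shuffle=True` is passed, and `to_csv` writes the row index as an extra column unless `index=False` is passed. The following is a minimal sketch of a variant that changes both, run on a hypothetical stand-in dataset (`toy`, with a made-up `col1` and 180 fraud cases among 1000 rows) and written to a temporary directory. It is one possible variation, not a replacement for the cell above.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for fraud.csv: 1000 rows, with all fraud cases at the end
toy = pd.DataFrame({
    "col1": [float(n) for n in range(1000)],
    "label": [0] * 820 + [1] * 180,
})
X = toy.drop("label", axis=1)
y = toy["label"]

# shuffle=True randomizes which rows land in which fold;
# random_state makes the assignment reproducible
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

stats = []
with tempfile.TemporaryDirectory() as tmp:
    for i, (_, test_idx) in enumerate(kfold.split(X, y), start=1):
        path = Path(tmp) / f"toy{i}.csv"
        # index=False keeps the row index out of the file
        toy.iloc[test_idx].to_csv(path, index=False)
        reloaded = pd.read_csv(path)
        stats.append((list(reloaded.columns), len(reloaded), reloaded["label"].mean()))

# Columns, row count, and fraud share of each reloaded subset
for columns, n_rows, fraud_share in stats:
    print(columns, n_rows, fraud_share)
```

Shuffling only changes which rows end up in which subset; the class proportions per subset are still preserved by the stratification.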

In the last step, you **check whether your procedure was successful** by reading the files back and comparing their size and share of fraudulent samples.

In [3]:
# Collect one row of statistics per file
rows = []

# Loop through files
for i in range(1, 6):
    # Read file
    filename = f"fraud{i}.csv"
    subset = pd.read_csv(filename)

    # Calculate and store statistics
    rows.append({
        'file': filename,
        'num_samples': subset.shape[0],
        # Fraction of fraudulent samples (mean of the 0/1 label)
        '% fraud': subset['label'].mean()
    })

# Create dataframe from the collected rows
results = pd.DataFrame(rows, columns=['file', 'num_samples', '% fraud'])

# Print results
results

Unnamed: 0,file,num_samples,% fraud
0,fraud1.csv,200,0.18
1,fraud2.csv,200,0.18
2,fraud3.csv,200,0.185
3,fraud4.csv,200,0.185
4,fraud5.csv,200,0.185
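
The table confirms the size and fraud share of each file, but not that the subsets are disjoint and together cover the whole dataset. This can be checked directly on the fold indices, as in the sketch below. It uses synthetic labels rather than the real data: 1000 rows with 183 fraud cases, the totals implied by the table above (36 + 36 + 37 + 37 + 37). The feature matrix `X` is a placeholder, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels (an assumption, not the real data): 1000 rows, 183 fraud cases
y = np.array([0] * 817 + [1] * 183)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

kfold = StratifiedKFold(n_splits=5)
test_folds = [test_idx for _, test_idx in kfold.split(X, y)]

# Every row index should appear in exactly one test fold
all_idx = np.concatenate(test_folds)
is_partition = len(all_idx) == len(y) and len(np.unique(all_idx)) == len(y)

# Size and number of fraud cases per fold
sizes = [len(idx) for idx in test_folds]
fraud_counts = [int(y[idx].sum()) for idx in test_folds]

print(is_partition, sizes, fraud_counts)
```

Since 183 is not divisible by 5, the fraud counts cannot be identical in all five subsets, which is consistent with the table showing both 0.18 and 0.185.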
