# Data Extraction
## Extracting and Saving the Holdout Data

We keep 20% of the data as a holdout set for comparing the models trained on the remaining 80%. The holdout rows are randomly sampled and removed from the original data set. We use stratified sampling so that both subsets maintain the class ratios seen in the original data set.

We also save these two new data sets (80% training, 20% holdout) to files so they are easy to reference later.
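To make the idea of stratified sampling concrete, here is a minimal standard-library sketch. It is an illustration only, not how `train_test_split` is implemented: the function name `stratified_split` and the toy labels are hypothetical, and rounding of per-class holdout sizes may differ from scikit-learn's. The point is simply that each class is sampled separately, so each class contributes roughly the same fraction of its rows to the holdout set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, holdout_fraction=0.2, seed=43):
    """Return (train_idx, holdout_idx), sampling each class separately.

    Hypothetical helper for illustration; not part of ml_utils or scikit-learn.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class.
        n_holdout = round(len(indices) * holdout_fraction)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Toy labels (made up for this sketch): 60 of "a", 30 of "b", 10 of "c".
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
train_idx, holdout_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print("train:  ", dict(train_counts))
print("holdout:", dict(holdout_counts))
```

In this toy case both subsets keep the 6:3:1 class ratio of the full list, which is the property the stratified split in this notebook relies on.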

In [10]:
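# Prepare the data
import ml_utils as mu
from sklearn.model_selection import train_test_split

# Load the full data set through the project's helper module
data = mu.get_training_data(get_full_set=True)

# Separate the features (X) from the class labels (y)
X, y = mu.split_x_and_y(data)

# Stratified 80/20 split; the fixed random_state makes it reproducible
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=43, stratify=y
)

# Re-attach the labels as a GroupID column so each file is self-contained
training_data = X_train.copy()
training_data['GroupID'] = y_train

holdout_data = X_holdout.copy()
holdout_data['GroupID'] = y_holdout

# Save both subsets for later reference
training_data.to_excel("data/training_data.xlsx", index=False)
holdout_data.to_excel("data/holdout_data.xlsx", index=False)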

In [11]:
# Values are printed as fractions of the total (0.53 means 53%)
print("Percentage of each group in training_data")
t = training_data.GroupID.value_counts()
print(t / t.sum())

print("Percentage of each group in holdout_data")
h = holdout_data.GroupID.value_counts()
print(h / h.sum())

Percentage of each group in training_data
1    0.535235
0    0.322148
3    0.073826
2    0.068792
Name: GroupID, dtype: float64
Percentage of each group in holdout_data
1    0.533333
0    0.320000
3    0.073333
2    0.073333
Name: GroupID, dtype: float64
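
As a quick check on how well the stratification held, the sketch below compares the two sets of proportions. The numbers are copied by hand from the output above rather than recomputed from the data, and the variable names are chosen only for this illustration.

```python
# Proportions copied from the printed output above (keyed by GroupID).
training = {1: 0.535235, 0: 0.322148, 3: 0.073826, 2: 0.068792}
holdout = {1: 0.533333, 0: 0.320000, 3: 0.073333, 2: 0.073333}

# Absolute difference in class share between the two subsets.
diffs = {group: abs(training[group] - holdout[group]) for group in training}
worst_group = max(diffs, key=diffs.get)

for group in sorted(diffs):
    print(f"GroupID {group}: |train - holdout| = {diffs[group]:.6f}")
print("Largest gap is for GroupID", worst_group)
```

Every class share differs by less than half a percentage point between the two subsets, with the largest gap in group 2, the smallest class in the training set.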
