# Data Set Curation

In data science, your model will only be as good as your data, and you often have less data than you'd like.
The following are standard practices and techniques for doing more with less, or at least for organizing your data so that your model
best reflects the real-world phenomena you're attempting to model.

## Train, Test and Tuning

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# 12 data points (rows of 2 features) holding the values 1 to 24
X = np.arange(1, 25).reshape(12, 2)
# totally made up labels
Y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

x_train, x_test, y_train, y_test = train_test_split(X, Y)

print(x_train)
print(len(x_train))
print()
print(x_test)
print(len(x_test))

Some options that come in handy...
    - _test_size_: the fraction of the data used for testing (the training set becomes 1 - test_size)
    - _random_state_: sets the random seed, so the same split can be reproduced
    - _stratify_: useful if you have an imbalanced data set, keeps the class proportions of y approximately equal between the test
     and train sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=4, stratify=Y)
print(x_train)
print(len(x_train))
print()
print(x_test)
print(len(x_test))
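
The effect of _stratify_ is easier to see on an imbalanced label set. The following is a minimal sketch on made-up data (`X_imb` and `Y_imb` are illustrative names, not part of the example above) with 90 samples of class 0 and 10 of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1
X_imb = np.arange(200).reshape(100, 2)
Y_imb = np.array([0] * 90 + [1] * 10)

# Stratified split: class proportions are preserved in both subsets
_, _, y_tr, y_te = train_test_split(
    X_imb, Y_imb, test_size=0.2, random_state=4, stratify=Y_imb
)

# np.bincount counts how many labels of each class ended up in a subset
print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Both subsets keep the 90/10 ratio (72/8 in train, 18/2 in test). Without `stratify`, a small test set could easily end up with no samples of the minority class at all.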

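If you want to generate a validation set, just run train_test_split twice!
Splitting the held-out test set from above (33% of the data) in half gives approximately:
    - x_train = 67% of the data
    - x_val = 16.5% of the data
    - x_test = 16.5% of the data

With only 12 data points, that works out to 8, 2 and 2 samples.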

In [None]:
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=4, stratify=y_test)
print(x_train)
print(len(x_train))
print()
print(x_val)
print(len(x_val))
print()
print(x_test)
print(len(x_test))
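
If you need specific proportions such as 70/15/15, the two calls can be wrapped in a small function. This is a sketch only: `train_val_test_split` is a hypothetical helper, not part of scikit-learn, and the data is made up. The second `test_size` is expressed as a fraction of the held-out portion, not of the full data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y, val_size=0.15, test_size=0.15, random_state=None):
    """Hypothetical helper: split into train/validation/test by fraction of the full set."""
    holdout = val_size + test_size
    # First split: carve off everything that is not training data
    x_train, x_hold, y_train, y_hold = train_test_split(
        X, y, test_size=holdout, random_state=random_state, stratify=y
    )
    # Second split: divide the held-out portion into validation and test
    x_val, x_test, y_val, y_test = train_test_split(
        x_hold, y_hold, test_size=test_size / holdout,
        random_state=random_state, stratify=y_hold
    )
    return x_train, x_val, x_test, y_train, y_val, y_test

# Made-up data: 100 samples with balanced labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

x_tr, x_va, x_te, y_tr, y_va, y_te = train_val_test_split(X_demo, y_demo, random_state=4)
print(len(x_tr), len(x_va), len(x_te))  # 70 15 15
```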

## Cross Validation

Another way to utilize your data set is to use Cross Validation. In this setup you use the entire labeled set and divide it into
train and test splits K times, so that every data point lands in a test split exactly once.

The split method of KFold takes an array and yields K pairs of index arrays into it, one (train, test) pair for each of the K splits.

In [None]:
from sklearn.model_selection import KFold
kf5 = KFold(n_splits=5, shuffle=False)

# enumerate keeps the fold counter in step with the loop
for fold, (train_index, test_index) in enumerate(kf5.split(X), start=1):
    print(f"Fold {fold}: \n\ttrain indices: {train_index} \n\ttest indices: {test_index}")
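
The index arrays are meant to be used to slice the data and labels for each fold. The following is a minimal sketch of that loop, reusing the same X and Y as above. The choice of `KNeighborsClassifier` is an assumption made purely for illustration, and since the labels are made up the resulting scores carry no meaning.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as above
X = np.arange(1, 25).reshape(12, 2)
Y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

kf = KFold(n_splits=5, shuffle=False)
scores = []  # accuracy on the test split of each fold
seen = []    # every index that has appeared in a test split

for train_index, test_index in kf.split(X):
    # Train a fresh model on this fold's training rows only
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_index], Y[train_index])
    # Evaluate on the rows held out for this fold
    scores.append(model.score(X[test_index], Y[test_index]))
    seen.extend(test_index.tolist())

print("per-fold accuracy:", scores)
print("mean accuracy:", np.mean(scores))
```

Averaging the per-fold scores gives a performance estimate that uses every labeled point for testing once, which helps when the data set is too small for a single fixed test set.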