# Data Set Curation

In data science, your model will only be as good as your data. You often don't have as much data as you'd like to have.
The following are standard practices and techniques to do more with less, or at least better organize your data to ensure your model
best reflects the real world phenomena you're attempting to model.

## Train, Test and Tuning

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split

# 12 data points of values 1 to 25
x = np.arange(1, 25).reshape(12, 2)
# totally made up labels
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

x_train, x_test, y_train, y_test = train_test_split(x, y)

print(x_train)
print(len(x_train))
print()
print(x_test)
print(len(x_test))

[[17 18]
 [ 7  8]
 [ 3  4]
 [ 5  6]
 [13 14]
 [ 9 10]
 [11 12]
 [21 22]
 [23 24]]
9

[[15 16]
 [ 1  2]
 [19 20]]
3


Some options that come in handy...
    - _test_size_: how much of the data is used in testing (training set because 1 - test_size)
    - _random_: set a different random seed
    - _stratify_: useful if you have an imbalanced data set, keeps the values of y approximately equal in proportion between test
     train sets

In [14]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=4, stratify=y)
print(x_train)
print(len(x_train))
print()
print(x_test)
print(len(x_test))

[[21 22]
 [ 1  2]
 [15 16]
 [13 14]
 [17 18]
 [19 20]
 [23 24]
 [ 3  4]]
8

[[11 12]
 [ 7  8]
 [ 5  6]
 [ 9 10]]
4


If you want to generate a validation set, just run train_test_split twice!
This means:
    - x_train = 70% of the data
    - x_val = 15% of the data
    - x_test = 15% of the data

In [15]:
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5, random_state=4, stratify=y)
print(x_train)
print(len(x_train))
print()
print(x_val)
print(len(x_val))
print()
print(x_test)
print(len(x_test))

[[21 22]
 [ 1  2]
 [15 16]
 [13 14]
 [17 18]
 [19 20]
 [23 24]
 [ 3  4]]
8

[[11 12]
 [ 7  8]]
2

[[ 9 10]
 [ 5  6]]
2


## Cross Validation