# Train-test split

It's time to talk about what is perhaps the most popular function in scikit-learn: `train_test_split`, which divides a dataset into separate subsets, typically one for training and one for testing.

By now, you should clearly understand the importance of performing this division, so we won't dwell too much on the why, but rather on the how.

We start by importing the function:

In [None]:
from sklearn.model_selection import train_test_split

In practice, the function is quite simple to use, but there are some details you should keep in mind to get the most out of it.

## Interface

This function has a somewhat peculiar interface: it accepts a variable number of arrays, all with the same number of samples, and returns a training portion and a test portion for each one, in the same order. To illustrate, look at the following:

In [None]:
import numpy as np

# Generate some example data
X1 = np.arange(0, 100)
X2 = np.arange(100, 200)
X3 = np.arange(200, 300)

print(f"Shapes: {X1.shape}, {X2.shape}, {X3.shape}")

# Split the data into training and testing sets
X1_train, X1_test, X2_train, X2_test, X3_train, X3_test = train_test_split(X1, X2, X3)

print("Shapes after splitting:")
print(f"X1_train: {X1_train.shape}, X1_test: {X1_test.shape}")
print(f"X2_train: {X2_train.shape}, X2_test: {X2_test.shape}")
print(f"X3_train: {X3_train.shape}, X3_test: {X3_test.shape}")
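
One detail the shapes alone do not show is that all arrays passed in a single call are split with the same indices, so corresponding elements stay together. A minimal sketch to check this, using two illustrative arrays `a` and `b` (names chosen here, not part of the example above) built so that `b[i] == a[i] + 100`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative arrays: each element of b is the matching element of a plus 100
a = np.arange(0, 100)
b = a + 100

# random_state only makes this sketch repeatable; it is explained further down
a_train, a_test, b_train, b_test = train_test_split(a, b, random_state=0)

# If the same indices were used for both arrays, the relationship still holds
pairs_kept_train = np.array_equal(b_train, a_train + 100)
pairs_kept_test = np.array_equal(b_test, a_test + 100)
print(pairs_kept_train, pairs_kept_test)
```

This is what makes the function safe to use with features and labels: `X_train[i]` and `y_train[i]` refer to the same original sample.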

But don't worry if that seems a bit complex. The most common use case is simpler, where you pass a dataset and its corresponding labels:

In [None]:
X = np.random.rand(100, 2)
y = np.random.randint(0, 2, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y)

## Arguments

### Set Sizes

By default, with no additional arguments, 75% of the samples go to the training set and 25% to the test set.

These values are modifiable, of course. You can use the `test_size` or `train_size` parameters to change the sizes, and both accept integer and float values. Usually you set only one of them, and the other set receives the remaining samples.

If you use an integer value, that exact number of samples will be used, for example:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10)

print("Shapes after splitting:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")

But you can also use floats between 0 and 1, which are interpreted as the proportion of the dataset to assign:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

print("Shapes after splitting:")
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}")
print(f"y_train: {y_train.shape}, y_test: {y_test.shape}")
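
The size rules can be put side by side in one small sketch. The helper `split_sizes` and the array `data` are hypothetical names introduced only for this illustration. It also shows a less common case: passing both `train_size` and `test_size`, which works as long as together they do not exceed the whole dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(100)  # 100 illustrative samples


def split_sizes(**kwargs):
    """Return (n_train, n_test) for the given size arguments."""
    train, test = train_test_split(data, **kwargs)
    return len(train), len(test)


default_sizes = split_sizes()                             # default: 75% / 25%
int_sizes = split_sizes(test_size=10)                     # exactly 10 test samples
float_sizes = split_sizes(train_size=0.5)                 # half for training, the rest for testing
both_sizes = split_sizes(train_size=0.5, test_size=0.25)  # the remaining 25 samples are not used

print(default_sizes, int_sizes, float_sizes, both_sizes)
# (75, 25) (90, 10) (50, 50) (50, 25)
```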

### Random seed

By default, the function assigns the samples to the two sets at random, so two executions will generally not give us the same results:

In [None]:
X1 = np.arange(0, 100)

X_train, X_test = train_test_split(X1, train_size=0.5)
print("First 10 elements of X_train:", X_train[:10])

X_train, X_test = train_test_split(X1, train_size=0.5)
print("First 10 elements of X_train:", X_train[:10])

If what you want is reproducibility, you can set a random seed using the `random_state` argument:

In [None]:
X1 = np.arange(0, 100)

X_train, X_test = train_test_split(X1, train_size=0.5, random_state=42)
print("First 10 elements of X_train:", X_train[:10])

X_train, X_test = train_test_split(X1, train_size=0.5, random_state=42)
print("First 10 elements of X_train:", X_train[:10])

```{note}
Typically, `random_state` is fixed during development and experimentation for several important reasons:

1. **Reproducibility**: By setting a fixed `random_state`, you ensure that your data splits remain consistent across multiple runs. This is crucial for debugging, iterating on your model, and comparing different approaches.

2. **Consistency in Results**: When you're tweaking hyperparameters or trying different model architectures, a fixed `random_state` helps isolate the impact of your changes. You can be confident that any differences in results are due to your modifications, not random variation in the data split.

3. **Easier Collaboration**: When working in a team, using a fixed `random_state` allows all team members to work with the same data splits, making it easier to compare results and reproduce each other's work.

4. **Debugging**: If you encounter issues or unexpected results, a fixed `random_state` makes it easier to recreate the problem and investigate its cause.

However, it's important to note that while fixing `random_state` is beneficial during development, it shouldn't be the end of your evaluation process. 

Once you've settled on a model, it's a good practice to test it with different random splits (by changing the `random_state` or not setting it) to ensure your model's performance is robust across different data divisions. And in a production environment, you might want to regularly retrain your model on fresh data, in which case you wouldn't use a fixed `random_state`.

Remember, the goal is to develop a model that generalizes well to unseen data, not one that performs well on a single, fixed split. Use a fixed `random_state` as a development tool, but don't rely on it exclusively for your final model evaluation.
```
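
The advice about checking several random splits can be sketched as follows. This is only an illustration under assumptions not taken from the text above: the data is synthetic (`make_classification`), the model is a plain `LogisticRegression`, and five seeds is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

scores = []
for seed in range(5):  # five different splits, one per seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # accuracy on this split's test set

print(f"Accuracy per split: {np.round(scores, 3)}")
print(f"Mean: {np.mean(scores):.3f}, std: {np.std(scores):.3f}")
```

If the scores vary little from one split to the next, the result probably does not depend on one lucky (or unlucky) division of the data.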

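### Stratification

When working with imbalanced datasets (those that have many more samples from one class than from the others), you can set the `stratify` argument so that both sets keep approximately the same class proportions as the original data. To see the effect, we first define a helper that prints the count and percentage of each class: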

In [None]:
def show_counts(y):
    """Print how many samples each class has and its share of the total."""
    unique, counts = np.unique(y, return_counts=True)
    for class_, count in zip(unique, counts):
        print(f"Class {class_}:\t{count:>5} ({count/len(y)*100:.2f}%)")

We create a sample dataset. Pay attention to how imbalanced it is: about 90% of the labels are "apple", while "orange" and "banana" account for about 5% each:

In [None]:
sample_size = 100000
X = np.random.rand(sample_size, 2)
y = np.random.choice(["apple", "orange", "banana"], sample_size, p=[0.90, 0.05, 0.05])

show_counts(y)

If we divide them without stratification, the class proportions in each set are left to chance, so they can drift away from the original ones:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=45)

print("Training split:")
show_counts(y_train)
print()
print("Test split:")
show_counts(y_test)

But if we do it by stratifying with `y`, both sets reproduce the original proportions almost exactly:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=60, stratify=y)

print("Training split:")
show_counts(y_train)
print()
print("Test split:")
show_counts(y_test)
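
On a dataset as large as the one above, a plain random split already lands close to the original proportions, so the benefit of `stratify` is easier to see on small datasets. A minimal sketch, assuming an illustrative dataset of 90 samples of class 0 and 10 of class 1 (the names `X_small` and `y_small` are made up for this example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small illustrative dataset: 90 samples of class 0 and 10 of class 1
X_small = np.arange(100).reshape(-1, 1)
y_small = np.array([0] * 90 + [1] * 10)

_, _, y_tr, y_te = train_test_split(
    X_small, y_small, test_size=0.2, random_state=0, stratify=y_small
)

# With stratification, the 10% minority share is kept in both sets
minority_train = int((y_tr == 1).sum())  # 8 of 80 training samples
minority_test = int((y_te == 1).sum())   # 2 of 20 test samples
print(minority_train, minority_test)
```

Without `stratify=y_small`, the number of minority samples in the test set would depend on the seed, and could even be zero.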

### Do not randomize

By default, the function shuffles the data before splitting it, but there will be occasions when this is not ideal, for example, when working with time series data. In these cases, a random split would place observations from the future in the training set, causing a *data leakage* problem. Scikit-learn allows us to disable shuffling by passing `shuffle=False`, in which case the first samples go to the training set and the last ones to the test set:

In [None]:
X = np.arange(20)

print("Original elements of X:", X)

X_train, X_test = train_test_split(X, shuffle=False, test_size=0.5)

print("First 10 elements of X_train:", X_train[:10])
print("First 10 elements of X_test:", X_test[:10])

But if we call it without `shuffle=False` (that is, with the default `shuffle=True`), the elements are mixed:

In [None]:
X_train, X_test = train_test_split(X, test_size=0.5)

print("First 10 elements of X_train:", X_train[:10])
print("First 10 elements of X_test:", X_test[:10])
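
Two consequences of `shuffle=False` are worth checking. The sketch below, which uses the illustrative names `series` and `labels`, first confirms that the split is equivalent to slicing the array at the cut point, and then shows that `stratify` cannot be combined with `shuffle=False`, since scikit-learn raises a `ValueError` in that case.

```python
import numpy as np
from sklearn.model_selection import train_test_split

series = np.arange(20)  # illustrative data, assumed to be already ordered in time

train, test = train_test_split(series, shuffle=False, test_size=0.25)

# Without shuffling, the split is equivalent to slicing at the cut point
same_as_slicing = np.array_equal(train, series[:15]) and np.array_equal(test, series[15:])
print(same_as_slicing)

# Stratification needs shuffling, so combining it with shuffle=False fails
labels = np.array([0, 1] * 10)
try:
    train_test_split(series, labels, shuffle=False, stratify=labels)
    stratify_error = None
except ValueError as err:
    stratify_error = str(err)
print(stratify_error)
```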


## Conclusion

I hope you now have a clear understanding of the `train_test_split()` function and how to adjust its arguments to meet the needs of your dataset. Remember that a sound split is what lets you estimate how a machine learning model will perform on unseen data.