# Split data at a given timestamp

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/temporian/blob/last-release/docs/src/recipes/split_timestamp.ipynb)

This recipe can be used to split an `EventSet` into two or more subsets at fixed timestamps.

The same procedure applies to multi-index data and to the default single empty index.

For example, to train a machine learning forecasting model, the data usually needs to be split at fixed timestamps into train, validation and test `EventSets`, in that chronological order.

## Example data

This toy example uses a single index column, `idx`, with two values, but the recipe also applies to any number of indexes, as mentioned above.

The second index value (`idx=2`) has two extra data points from the previous year. Both index values end on the same date.

In [None]:
import pandas as pd
import temporian as tp

sample_data = pd.DataFrame(
    data=[
        # date,  index=1, feature
        ["2020-01-01", 1, 1.0],
        ["2020-02-01", 1, 2.0],
        ["2020-03-01", 1, 3.0],
        ["2020-04-01", 1, 4.0],
        ["2020-05-01", 1, 5.0],
        ["2020-06-01", 1, 6.0],
        # date,  index=2, feature
        ["2019-11-01", 2, -20.0],
        ["2019-12-01", 2, -10.0],
        ["2020-01-01", 2, 10.0],
        ["2020-02-01", 2, 20.0],
        ["2020-03-01", 2, 30.0],
        ["2020-04-01", 2, 40.0],
        ["2020-05-01", 2, 50.0],
        ["2020-06-01", 2, 60.0],
    ],
    columns=[
        "timestamp",
        "idx",
        "feature",
    ],
)

sample_evset = tp.from_pandas(sample_data, indexes=["idx"])
sample_evset

## Solution

We want to split this into 3 separate `EventSets` as follows:
* **Train** data: all data points until `2020-03-01` (including it)
* **Validation**: data after `2020-03-01` (not including it) and until `2020-05-01` (including it)
* **Test**: data after `2020-05-01` (not including it)

So the proposed steps are:
1. Define the split boundaries as `datetime` values.
2. Filter train/validation/test with `before` and `after`, using those boundaries.

In [None]:
from datetime import datetime, timedelta

# Define boundaries for train/validation/test
train_until = datetime(2020, 3, 1)
val_until = datetime(2020, 5, 1)
one_milisecond = timedelta(milliseconds=1)

### Split based on timestamps

Using `before` and `after`, we can split the dataset. Note that both functions are
non-inclusive, so we add one millisecond to the upper boundary of the train and
validation sets so that events falling exactly on a boundary are included in the earlier set.

In [None]:
train_evset = sample_evset.before(train_until + one_milisecond)
val_evset = sample_evset.after(train_until).before(val_until + one_milisecond)
test_evset = sample_evset.after(val_until)
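
The boundary logic can be checked without Temporian. The sketch below is plain Python and only assumes the strict (non-inclusive) comparisons described above. `assign_splits` and `epsilon` are hypothetical names introduced for illustration and are not part of the Temporian API.

```python
from datetime import datetime, timedelta

train_until = datetime(2020, 3, 1)
val_until = datetime(2020, 5, 1)
epsilon = timedelta(milliseconds=1)  # same offset as in the recipe


def assign_splits(t):
    """Return the names of every subset a timestamp would fall into.

    Hypothetical helper: mirrors the strict bounds with plain comparisons.
    """
    splits = []
    if t < train_until + epsilon:
        splits.append("train")
    if train_until < t < val_until + epsilon:
        splits.append("val")
    if t > val_until:
        splits.append("test")
    return splits


# Events exactly on a boundary go to the earlier subset.
on_train_boundary = assign_splits(datetime(2020, 3, 1))
on_val_boundary = assign_splits(datetime(2020, 5, 1))
after_val = assign_splits(datetime(2020, 6, 1))

# Caveat: an event less than 1 ms after a boundary matches two subsets.
half_ms_after = assign_splits(train_until + timedelta(microseconds=500))

print(on_train_boundary, on_val_boundary, after_val, half_ms_after)
```

Under these assumptions, the one-millisecond offset gives a clean partition only when timestamps are coarser than one millisecond, which is the case for the monthly sample data. For finer-grained timestamps, a smaller offset would be needed.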

## Check results

In [None]:
train_evset

In [None]:
val_evset

In [None]:
test_evset
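
To know what to expect in the outputs above, the event counts per subset and index value can be worked out by hand. The following is a minimal plain-Python sketch that re-creates the sample timestamps and counts them against the intended inclusive boundaries. `split_name` and `expected_counts` are hypothetical names, and nothing here calls Temporian.

```python
from collections import Counter
from datetime import datetime

train_until = datetime(2020, 3, 1)
val_until = datetime(2020, 5, 1)

# (timestamp, idx) pairs copied from the sample data.
events = [(datetime(2020, month, 1), 1) for month in range(1, 7)]
events += [(datetime(2019, 11, 1), 2), (datetime(2019, 12, 1), 2)]
events += [(datetime(2020, month, 1), 2) for month in range(1, 7)]


def split_name(t):
    """Name the subset for a timestamp, with inclusive upper boundaries."""
    if t <= train_until:
        return "train"
    if t <= val_until:
        return "val"
    return "test"


# Count events per (subset, index value).
expected_counts = Counter((split_name(t), idx) for t, idx in events)

for (name, idx), count in sorted(expected_counts.items()):
    print(f"{name:5s} idx={idx}: {count} events")
```

If the split behaves as intended, the train subset should hold 3 events for `idx=1` and 5 for `idx=2`, the validation subset 2 events per index value, and the test subset 1 event per index value.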