# Split data by fraction

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/temporian/blob/last-release/docs/src/recipes/split_fraction.ipynb)

This recipe splits an `EventSet` into two or more subsets, each holding a specified fraction of the total number of events.

For example, to train a machine learning forecasting model, the data usually needs to be split into train, validation and test `EventSets`. In this case we'll use `60%` of the data for training, `20%` for validation, and `20%` for test.

## Example data

In [None]:
import temporian as tp
import numpy as np

T = 10
t = np.arange(0, T, 0.1)
signal_evset = tp.event_set(timestamps=t, features={"signal": np.sin(t)})

signal_evset.plot()

## Solution

We want to split this into 3 separate `EventSets` as follows:
* **Train** data: `60%` of the events, at the beginning of the series.
* **Validation**: `20%` of the events, following training data.
* **Test**: Remaining `20%` of the events.

The proposed steps are:
1. Get the total number of events and calculate split limits.
2. Get each event's position in the `EventSet`.
3. Split by comparing each event's position to the split limits.

### 1. Calculate split limits

In [None]:
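# Number of events in the EventSet. It has no index, so the empty tuple is its only index key.
n_events = len(signal_evset.get_index_value(()))

# Cumulative limits: the first 60% of events for training, the next 20% for validation.
train_until = int(n_events * 0.6)
val_until = train_until + int(n_events * 0.2)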
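
The same arithmetic extends to any number of subsets. The sketch below is plain Python and does not use Temporian; `split_limits` is a hypothetical helper name introduced here for illustration. It assumes the last subset takes all remaining events, so any remainder left by `int()` truncation ends up there.

```python
from itertools import accumulate


def split_limits(n_events, fractions):
    """Return the cumulative event counts that mark where each subset ends.

    Hypothetical helper, not part of Temporian. The last fraction is ignored
    because the last subset simply takes every remaining event.
    """
    # int() truncates, so each subset size is rounded down.
    sizes = [int(n_events * fraction) for fraction in fractions[:-1]]
    return list(accumulate(sizes))


print(split_limits(100, [0.6, 0.2, 0.2]))  # [60, 80]
print(split_limits(7, [0.6, 0.2, 0.2]))    # [4, 5]
```

With 7 events, the sizes come out as 4, 1 and 2, so the fractions are only approximate when the number of events is small.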

### 2. Get each event's position

The `enumerate()` operator creates a single-feature `EventSet` holding the zero-based position of each event. The result keeps the same index and sampling as the original `EventSet`, so the two can be combined directly.

In [None]:
sample_positions = signal_evset.enumerate()

### 3. Split based on positions

Now we compare `sample_positions` with the limits of each subset. Each comparison produces a boolean `EventSet` that can be passed directly to the `filter()` operator.

In [None]:
# Positions start at 0, so a strict upper bound selects exactly `train_until` training events.
train_evset = signal_evset.filter(sample_positions < train_until)
val_evset = signal_evset.filter((sample_positions >= train_until) & (sample_positions < val_until))
test_evset = signal_evset.filter(sample_positions >= val_until)
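
Because the positions produced by `enumerate()` start at 0, the choice between strict and non-strict comparisons decides which subset receives the event sitting on each limit. The following plain-Python sketch, which uses a list of integers as a stand-in for `sample_positions` and assumes 100 events, counts the events selected by each kind of comparison.

```python
# Stand-in for the zero-based positions of 100 events.
n_events = 100
train_until, val_until = 60, 80
positions = list(range(n_events))

# Half-open ranges: each limit is the first position of the next subset.
train = [p for p in positions if p < train_until]
val = [p for p in positions if train_until <= p < val_until]
test = [p for p in positions if p >= val_until]

print(len(train), len(val), len(test))  # 60 20 20

# A closed upper bound (p <= limit) pulls one extra event into the earlier subset.
train_closed = [p for p in positions if p <= train_until]
print(len(train_closed))  # 61
```

The half-open form gives subsets that are disjoint, cover every event, and match the `60%`, `20%`, `20%` target exactly for this input.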

## Check results

In [None]:
train_evset.plot()

In [None]:
val_evset.plot()

In [None]:
test_evset.plot()