# ts2ml

> Tools to Transform a Time Series into Features and Target Dataset

## Install

```sh
pip install ts2ml
```

## How to use

In [None]:
import pandas as pd
from ts2ml.core import add_missing_slots
from ts2ml.core import transform_ts_data_into_features_and_target

In [None]:
df = pd.DataFrame({
    'pickup_hour': [
        '2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
        '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00',
    ],
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1]
})
df

|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 03:00:00 | 1 | 1 |
| 3 | 2022-01-01 01:00:00 | 2 | 1 |
| 4 | 2022-01-01 02:00:00 | 2 | 2 |
| 5 | 2022-01-01 05:00:00 | 2 | 1 |


Some hourly slots are missing for each location. Let's fill them with zeros so that every location has one row per hour:

In [None]:
df = add_missing_slots(df, datetime_col='pickup_hour', entity_col='pickup_location_id', value_col='rides', freq='H')
df



|   | pickup_hour | pickup_location_id | rides |
|---|---|---|---|
| 0 | 2022-01-01 00:00:00 | 1 | 2 |
| 1 | 2022-01-01 01:00:00 | 1 | 3 |
| 2 | 2022-01-01 02:00:00 | 1 | 0 |
| 3 | 2022-01-01 03:00:00 | 1 | 1 |
| 4 | 2022-01-01 04:00:00 | 1 | 0 |
| 5 | 2022-01-01 05:00:00 | 1 | 0 |
| 6 | 2022-01-01 00:00:00 | 2 | 0 |
| 7 | 2022-01-01 01:00:00 | 2 | 1 |
| 8 | 2022-01-01 02:00:00 | 2 | 2 |
| 9 | 2022-01-01 03:00:00 | 2 | 0 |
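
For intuition, the zero-filling step can be approximated with plain pandas. The sketch below is not ts2ml's implementation. It assumes that every location is expanded to the full range between the earliest and latest timestamp in the whole dataset, which is what the output above suggests. The names `raw`, `all_hours`, `full_index`, and `filled` are illustrative only.

```python
import pandas as pd

# Same toy data as above, with the timestamps parsed as datetimes.
raw = pd.DataFrame({
    'pickup_hour': pd.to_datetime([
        '2022-01-01 00:00:00', '2022-01-01 01:00:00', '2022-01-01 03:00:00',
        '2022-01-01 01:00:00', '2022-01-01 02:00:00', '2022-01-01 05:00:00',
    ]),
    'pickup_location_id': [1, 1, 1, 2, 2, 2],
    'rides': [2, 3, 1, 1, 2, 1],
})

# Assumption: all locations share one hourly range, from the global
# minimum to the global maximum timestamp.
all_hours = pd.date_range(
    raw['pickup_hour'].min(),
    raw['pickup_hour'].max(),
    freq=pd.Timedelta(hours=1),
)

# Every (location, hour) combination that should exist after filling.
full_index = pd.MultiIndex.from_product(
    [sorted(raw['pickup_location_id'].unique()), all_hours],
    names=['pickup_location_id', 'pickup_hour'],
)

# Reindex onto the full grid; combinations absent from `raw` get 0 rides.
filled = (
    raw.set_index(['pickup_location_id', 'pickup_hour'])
       .reindex(full_index, fill_value=0)
       .reset_index()
)
```

With this toy data, `filled` has one row per location and hour, and the rows that were absent from `raw` carry `rides == 0`.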


Now, let's build features and targets to predict the number of rides in the next hour for each `pickup_location_id`, using the number of rides in the previous 3 hours:

In [None]:
features, targets = transform_ts_data_into_features_and_target(
    df,
    n_features=3,
    datetime_col='pickup_hour', 
    entity_col='pickup_location_id', 
    value_col='rides',
    n_targets=1,
    step_size=1,
    step_name='hour'
)



In [None]:
features

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id |
|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 |


In [None]:
targets

|   | target_rides_next_hour |
|---|---|
| 0 | 1.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |


In [None]:
Xy_df = pd.concat([features, targets], axis=1)
Xy_df

|   | rides_previous_3_hour | rides_previous_2_hour | rides_previous_1_hour | pickup_hour | pickup_location_id | target_rides_next_hour |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 3.0 | 0.0 | 2022-01-01 03:00:00 | 1 | 1.0 |
| 1 | 3.0 | 0.0 | 1.0 | 2022-01-01 04:00:00 | 1 | 0.0 |
| 2 | 0.0 | 1.0 | 2.0 | 2022-01-01 03:00:00 | 2 | 0.0 |
| 3 | 1.0 | 2.0 | 0.0 | 2022-01-01 04:00:00 | 2 | 0.0 |
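
The windowing idea behind these tables can be sketched with NumPy for a single location. This is a simplified illustration, not ts2ml's code: the helper `sliding_windows` is hypothetical, and its parameter names only mirror the arguments used above.

```python
import numpy as np

def sliding_windows(values, n_features, n_targets=1, step_size=1):
    """Split a 1-D series into lagged feature rows and the targets that follow them."""
    values = np.asarray(values, dtype=float)
    window = n_features + n_targets
    X, y = [], []
    # Slide a window of n_features + n_targets values over the series.
    for start in range(0, len(values) - window + 1, step_size):
        X.append(values[start:start + n_features])
        y.append(values[start + n_features:start + window])
    return np.array(X), np.array(y)

# Zero-filled hourly rides for location 1 (00:00 to 05:00).
rides_location_1 = [2, 3, 0, 1, 0, 0]
X, y = sliding_windows(rides_location_1, n_features=3)
```

The first two rows of `X` and `y` match the rows shown for location 1 above. This sketch keeps every complete window, so six values yield three rows, whereas the output above contains two rows per location; ts2ml's exact cutoff rule may differ from this sketch.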


## Another Example

A monthly spaced time series.

In [None]:
import pandas as pd
import numpy as np

# Generate timestamp index with monthly frequency
date_rng = pd.date_range(start='1/1/2020', end='12/1/2022', freq='MS')

# Create list of city codes
cities = ['FOR', 'SP', 'RJ']

# Create dataframe with random sales data; each city gets one consecutive block of 12 months
df = pd.DataFrame({
    'date': date_rng,
    'city': np.repeat(cities, len(date_rng)//len(cities)),
    'sales': np.random.randint(1000, 5000, size=len(date_rng))
})
df

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 4586 |
| 2 | 2020-03-01 | FOR | 1075 |
| 3 | 2020-04-01 | FOR | 1922 |
| 4 | 2020-05-01 | FOR | 2655 |
| 5 | 2020-06-01 | FOR | 4719 |
| 6 | 2020-07-01 | FOR | 3332 |
| 7 | 2020-08-01 | FOR | 3789 |
| 8 | 2020-09-01 | FOR | 1109 |
| 9 | 2020-10-01 | FOR | 1210 |


The city FOR only has data for 2020, SP only for 2021, and RJ only for 2022. Let's also simulate more missing slots by dropping 20% of the rows at random.
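
The one-year-per-city layout follows from how `np.repeat` is used when the dataframe is built: it repeats each city code 12 times in a row, so the 36 monthly timestamps are assigned in consecutive blocks. The short check below rebuilds only the `date` and `city` columns to make that visible; `city_col` and `coverage` are illustrative names.

```python
import numpy as np
import pandas as pd

date_rng = pd.date_range(start='2020-01-01', end='2022-12-01', freq='MS')
cities = ['FOR', 'SP', 'RJ']

# np.repeat yields FOR x12, then SP x12, then RJ x12 (not an alternating pattern).
city_col = np.repeat(cities, len(date_rng) // len(cities))

# Collect the set of years that each city ends up paired with.
coverage = {}
for city, year in zip(city_col, date_rng.year):
    coverage.setdefault(str(city), set()).add(int(year))

print(coverage)
```

Each city is paired with exactly one calendar year, which is why every other month in the three-year range counts as a missing slot for that city.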

In [None]:
# Generate random indices to drop
drop_indices = np.random.choice(df.index, size=int(len(df)*0.2), replace=False)

# Drop selected rows from dataframe
df = df.drop(drop_indices)
df.reset_index(drop=True, inplace=True)
df

|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 4586 |
| 2 | 2020-03-01 | FOR | 1075 |
| 3 | 2020-05-01 | FOR | 2655 |
| 4 | 2020-06-01 | FOR | 4719 |
| 5 | 2020-08-01 | FOR | 3789 |
| 6 | 2020-09-01 | FOR | 1109 |
| 7 | 2020-10-01 | FOR | 1210 |
| 8 | 2020-11-01 | FOR | 4000 |
| 9 | 2021-01-01 | SP | 1388 |


Now let's fill the missing slots with zeros:

In [None]:
df_full = add_missing_slots(df, datetime_col='date', entity_col='city', value_col='sales', freq='MS')
df_full



|   | date | city | sales |
|---|---|---|---|
| 0 | 2020-01-01 | FOR | 4944 |
| 1 | 2020-02-01 | FOR | 4586 |
| 2 | 2020-03-01 | FOR | 1075 |
| 3 | 2020-04-01 | FOR | 0 |
| 4 | 2020-05-01 | FOR | 2655 |
| ... | ... | ... | ... |
| 103 | 2022-08-01 | RJ | 0 |
| 104 | 2022-09-01 | RJ | 2561 |
| 105 | 2022-10-01 | RJ | 2882 |
| 106 | 2022-11-01 | RJ | 0 |


Let's build a dataset for training a machine learning model to predict the sales for the next month, for each city, based on the sales of the previous 3 months.

In [None]:
features, targets = transform_ts_data_into_features_and_target(
    df_full,
    n_features=3,
    datetime_col='date',
    entity_col='city',
    value_col='sales',
    n_targets=1,
    step_size=1,
    step_name='month'
)



In [None]:
pd.concat([features, targets], axis=1)

|   | sales_previous_3_month | sales_previous_2_month | sales_previous_1_month | date | city | target_sales_next_month |
|---|---|---|---|---|---|---|
| 0 | 4944.0 | 4586.0 | 1075.0 | 2020-04-01 | FOR | 0.0 |
| 1 | 4586.0 | 1075.0 | 0.0 | 2020-05-01 | FOR | 2655.0 |
| 2 | 1075.0 | 0.0 | 2655.0 | 2020-06-01 | FOR | 4719.0 |
| 3 | 0.0 | 2655.0 | 4719.0 | 2020-07-01 | FOR | 0.0 |
| 4 | 2655.0 | 4719.0 | 0.0 | 2020-08-01 | FOR | 3789.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 91 | 2449.0 | 2867.0 | 2848.0 | 2022-07-01 | RJ | 2280.0 |
| 92 | 2867.0 | 2848.0 | 2280.0 | 2022-08-01 | RJ | 0.0 |
| 93 | 2848.0 | 2280.0 | 0.0 | 2022-09-01 | RJ | 2561.0 |
| 94 | 2280.0 | 0.0 | 2561.0 | 2022-10-01 | RJ | 2882.0 |
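
A table like this is usually split by time before a model is trained, so that the model is evaluated only on months later than the ones it saw. The sketch below shows one possible split using pandas alone. It is not part of ts2ml: the frame `xy` copies the first four rows from the output above, and the cutoff date is an arbitrary choice for illustration.

```python
import pandas as pd

# First four rows of the combined features/target table shown above.
xy = pd.DataFrame({
    'sales_previous_3_month': [4944.0, 4586.0, 1075.0, 0.0],
    'sales_previous_2_month': [4586.0, 1075.0, 0.0, 2655.0],
    'sales_previous_1_month': [1075.0, 0.0, 2655.0, 4719.0],
    'date': pd.to_datetime(['2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01']),
    'city': ['FOR'] * 4,
    'target_sales_next_month': [0.0, 2655.0, 4719.0, 0.0],
})

# Hypothetical cutoff: rows before it are used for training, the rest for testing.
cutoff = pd.Timestamp('2020-06-01')
train = xy[xy['date'] < cutoff]
test = xy[xy['date'] >= cutoff]

# Lagged sales columns are the model inputs; the next-month column is the label.
feature_cols = [c for c in xy.columns if c.startswith('sales_previous_')]
X_train, y_train = train[feature_cols], train['target_sales_next_month']
X_test, y_test = test[feature_cols], test['target_sales_next_month']
```

Splitting on `date` rather than at random keeps future observations out of the training set, which matters for any forecasting task.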
