# Time Series Synchronization

## Introduction

When dealing with Time Series, we often face different Time Conventions. Even if we set aside the problems of *Timestamp Formatting* and *Time Zones*, important questions remain:

> &laquo;&nbsp;Do Timestamps in a Time Series stand for the **beginning** or the **end** of the sampling period?&nbsp;&raquo;

> &laquo;&nbsp;How can we cope with this issue properly?&nbsp;&raquo;

In this notebook we show what Pandas offers to deal with this issue and how to handle it properly.

## Imports

Only `pandas` is required in this notebook; `numpy` is used for convenience only.

In [1]:
import numpy as np
import pandas as pd

## Pandas Timestamps Convention

First we draw the reader's attention to the Pandas Timestamps Convention. By default, Pandas generates Timestamps vectors that are closed on both sides, whereas resampling operations are closed on the left for most frequencies, including the minute-based ones used in this notebook.

The reader may want to refer to the [Time Series tutorial][1] for a complete overview of how Pandas handles Timestamps.

[1]: https://pandas.pydata.org/pandas-docs/stable/timeseries.html

### Timestamps Vectors

Timestamps vectors can easily be created in many ways using the Pandas [`date_range`][1] function (see the [Offset Alias list][2]).

[1]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
[2]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

By default, a Timestamps Vector is closed on both the left and the right:

In [2]:
pd.date_range(start='2010-01-01 00:00:00', end='2010-01-01 02:00:00', freq='5T')

DatetimeIndex(['2010-01-01 00:00:00', '2010-01-01 00:05:00',
               '2010-01-01 00:10:00', '2010-01-01 00:15:00',
               '2010-01-01 00:20:00', '2010-01-01 00:25:00',
               '2010-01-01 00:30:00', '2010-01-01 00:35:00',
               '2010-01-01 00:40:00', '2010-01-01 00:45:00',
               '2010-01-01 00:50:00', '2010-01-01 00:55:00',
               '2010-01-01 01:00:00', '2010-01-01 01:05:00',
               '2010-01-01 01:10:00', '2010-01-01 01:15:00',
               '2010-01-01 01:20:00', '2010-01-01 01:25:00',
               '2010-01-01 01:30:00', '2010-01-01 01:35:00',
               '2010-01-01 01:40:00', '2010-01-01 01:45:00',
               '2010-01-01 01:50:00', '2010-01-01 01:55:00',
               '2010-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='5T')

Now we create a Timestamps Vector that uses the End of Period Convention, using the `closed='right'` switch (pandas 1.4 renamed this `date_range` argument to `inclusive`, and pandas 2.0 removed `closed`):

In [3]:
t = pd.date_range(start='2010-01-01 00:00:00', end='2010-01-01 02:00:00', freq='5T', closed='right', name='timevalue')
t

DatetimeIndex(['2010-01-01 00:05:00', '2010-01-01 00:10:00',
               '2010-01-01 00:15:00', '2010-01-01 00:20:00',
               '2010-01-01 00:25:00', '2010-01-01 00:30:00',
               '2010-01-01 00:35:00', '2010-01-01 00:40:00',
               '2010-01-01 00:45:00', '2010-01-01 00:50:00',
               '2010-01-01 00:55:00', '2010-01-01 01:00:00',
               '2010-01-01 01:05:00', '2010-01-01 01:10:00',
               '2010-01-01 01:15:00', '2010-01-01 01:20:00',
               '2010-01-01 01:25:00', '2010-01-01 01:30:00',
               '2010-01-01 01:35:00', '2010-01-01 01:40:00',
               '2010-01-01 01:45:00', '2010-01-01 01:50:00',
               '2010-01-01 01:55:00', '2010-01-01 02:00:00'],
              dtype='datetime64[ns]', name='timevalue', freq='5T')
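
Because the name of the `closed` switch depends on the pandas version, here is a minimal sketch of a version-independent way to obtain the same End of Period vector: build the fully closed vector and drop its first element. The names `t_full` and `t_right` are illustrative only, and `'5min'` is assumed as the newer spelling of the `'5T'` alias.

```python
import pandas as pd

# Build the fully closed vector first ('5min' is the newer spelling of '5T').
t_full = pd.date_range(start='2010-01-01 00:00:00',
                       end='2010-01-01 02:00:00', freq='5min')

# Dropping the first timestamp leaves only the period ends
# (End of Period Convention), whatever the pandas version.
t_right = t_full[1:].rename('timevalue')
print(t_right[0], t_right[-1], t_right.size)
```

This avoids choosing between `closed` and `inclusive`, at the cost of one extra slicing step.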

We then create a dummy [DataFrame][1] to hold a trial Time Series:

[1]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [4]:
df0 = pd.DataFrame(np.arange(t.size), index=t, columns=['data'])
df0

| timevalue | data |
|---|---|
| 2010-01-01 00:05:00 | 0 |
| 2010-01-01 00:10:00 | 1 |
| 2010-01-01 00:15:00 | 2 |
| 2010-01-01 00:20:00 | 3 |
| 2010-01-01 00:25:00 | 4 |
| 2010-01-01 00:30:00 | 5 |
| 2010-01-01 00:35:00 | 6 |
| 2010-01-01 00:40:00 | 7 |
| 2010-01-01 00:45:00 | 8 |
| 2010-01-01 00:50:00 | 9 |


With this convention, value `0` covers the period from `2010-01-01 00:00:00` to `2010-01-01 00:05:00`, and so on.
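
To make the convention explicit, the following sketch (not part of the original workflow) stores the same kind of data on a right-closed `IntervalIndex`, so that each value carries the period it covers. The names `end`, `start`, `intervals` and `s` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Period ends, as stored in the End of Period Convention.
end = pd.date_range('2010-01-01 00:05:00', periods=4, freq='5min')
# Each period starts one sampling step (5 minutes) earlier.
start = end - pd.Timedelta(minutes=5)

# Right-closed intervals (start, end]: the end timestamp belongs to the period.
intervals = pd.IntervalIndex.from_arrays(start, end, closed='right')
s = pd.Series(np.arange(end.size), index=intervals)

first = s.index[0]
print(first.left, first.right)
```

The first interval runs from `00:00:00` (excluded) to `00:05:00` (included), which is the period that value `0` stands for.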

## Resampling

### Resamplers are natively closed to the left

If we wish to resample this Time Series to create a new series of Half Hourly Means, we can use the [`resample`][1] method.
Notice that `resample` generates a [Resampler][2] object that actually does nothing until you provide it with an aggregate.

[1]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
[2]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.resample.Resampler.aggregate.html

In [5]:
df0.resample('30T')

DatetimeIndexResampler [freq=<30 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]

As we see, by default this method returns a Resampler that is **closed to the left**.

This creates a problem with our Time Series, which is closed to the right; see the result of `count`:

In [6]:
df0.resample('30T').count()

| timevalue | data |
|---|---|
| 2010-01-01 00:00:00 | 5 |
| 2010-01-01 00:30:00 | 6 |
| 2010-01-01 01:00:00 | 6 |
| 2010-01-01 01:30:00 | 6 |
| 2010-01-01 02:00:00 | 1 |


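The resampled Time Series is **now closed to the left** and the data are **not properly aligned**: each bin takes in the sample stamped at its left edge, which belongs to the previous period, and leaves out the sample stamped at its right edge.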
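
The misalignment can be inspected directly. The sketch below rebuilds the trial series (with `'5min'`/`'30min'` assumed as the newer spellings of `'5T'`/`'30T'`) and reports, for each default bin, the smallest and largest value it received; `left_bins` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same trial data as above: values 0..23 stamped at the END of each 5-minute period.
t = pd.date_range('2010-01-01 00:05:00', '2010-01-01 02:00:00', freq='5min')
df = pd.DataFrame({'data': np.arange(t.size)}, index=t)

# Default resampler for minute frequencies: bins are [start, end), labelled by start.
left_bins = df.resample('30min')['data'].agg(['min', 'max', 'count'])
print(left_bins)
```

The first bin stops at value `4`, so value `5`, which covers 00:25 to 00:30, ends up in the second half hour.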

### Unless we explicitly tell it to be closed to the right

We create a Resampler closed to the right, in the same way we have created the Timestamps Vector:

In [7]:
df0.resample('30T', closed='right')

DatetimeIndexResampler [freq=<30 * Minutes>, axis=0, closed=right, label=left, convention=start, base=0]

Now the aggregation operation behaves as expected:

In [8]:
df0.resample('30T', closed='right').count()

| timevalue | data |
|---|---|
| 2010-01-01 00:00:00 | 6 |
| 2010-01-01 00:30:00 | 6 |
| 2010-01-01 01:00:00 | 6 |
| 2010-01-01 01:30:00 | 6 |


But the Time Convention is not conserved: the resampled Time Series is still labelled with the left edge of each bin.

### But the resampled Time Series is labelled on the left unless we also force it

If we want to select the correct data and keep our Time Convention, we also have to force the Index Labels to use the right edge of each bin. This can be done using the switch `label='right'`:

In [9]:
df0.resample('30T', closed='right', label='right').count()

| timevalue | data |
|---|---|
| 2010-01-01 00:30:00 | 6 |
| 2010-01-01 01:00:00 | 6 |
| 2010-01-01 01:30:00 | 6 |
| 2010-01-01 02:00:00 | 6 |


The Time Series above is now properly aggregated and has kept its Timestamps Convention.
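
Counts only show how many samples land in each bin. As a complementary sketch, comparing means makes the effect on the values themselves visible. The frame is rebuilt here so the block is self-contained, and the names `default_mean` and `right_mean` are illustrative.

```python
import numpy as np
import pandas as pd

t = pd.date_range('2010-01-01 00:05:00', '2010-01-01 02:00:00', freq='5min')
df = pd.DataFrame({'data': np.arange(t.size)}, index=t)

# Default settings: bins [start, end), labelled with the bin start.
default_mean = df['data'].resample('30min').mean()

# End of Period settings: bins (start, end], labelled with the bin end.
right_mean = df['data'].resample('30min', closed='right', label='right').mean()

print(default_mean.iloc[0], right_mean.iloc[0])
```

With the default settings the first half hour averages values `0..4`; with `closed='right', label='right'` it averages `0..5` and is stamped `00:30:00`, consistent with the input convention.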

## Shifting

Another technique to deal with the *closed to the right* Time Convention is to [`shift`][1] our Time Series in order to comply with the native behaviour of Pandas:

[1]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html

### By default data are shifted

If we simply invoke `shift` with an `int`, data are shifted and `DatetimeIndex` remains intact:

In [10]:
df0.shift(-1)

| timevalue | data |
|---|---|
| 2010-01-01 00:05:00 | 1.0 |
| 2010-01-01 00:10:00 | 2.0 |
| 2010-01-01 00:15:00 | 3.0 |
| 2010-01-01 00:20:00 | 4.0 |
| 2010-01-01 00:25:00 | 5.0 |
| 2010-01-01 00:30:00 | 6.0 |
| 2010-01-01 00:35:00 | 7.0 |
| 2010-01-01 00:40:00 | 8.0 |
| 2010-01-01 00:45:00 | 9.0 |
| 2010-01-01 00:50:00 | 10.0 |


This has at least two important implications:

 - It destroys data that fall outside the `DatetimeIndex` scope;
 - Data may be upcast, here from `int64` to `float64`, in order to introduce the sentinel `NaN` for missing values.
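
Both implications can be checked on a small frame. The sketch below also uses the `fill_value` argument of `shift`, which is not used elsewhere in this notebook, to show that the upcast can be avoided while the loss of data cannot. The sentinel `-1` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

t = pd.date_range('2010-01-01 00:05:00', periods=5, freq='5min')
df = pd.DataFrame({'data': np.arange(5, dtype='int64')}, index=t)

# Plain integer shift: the last row has no successor and becomes NaN,
# which forces the column to float64.
shifted = df.shift(-1)

# With an explicit fill value the dtype is kept, but value 0 is still lost.
filled = df.shift(-1, fill_value=-1)

print(shifted['data'].dtype, filled['data'].dtype)
```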

### But setting `freq` switch shifts Timestamps instead of data

If we set `freq` switch, Timestamps are shifted instead of data. That is, the `DatetimeIndex` is updated instead of the data.

In [11]:
df0.shift(-1, freq='5T')

| timevalue | data |
|---|---|
| 2010-01-01 00:00:00 | 0 |
| 2010-01-01 00:05:00 | 1 |
| 2010-01-01 00:10:00 | 2 |
| 2010-01-01 00:15:00 | 3 |
| 2010-01-01 00:20:00 | 4 |
| 2010-01-01 00:25:00 | 5 |
| 2010-01-01 00:30:00 | 6 |
| 2010-01-01 00:35:00 | 7 |
| 2010-01-01 00:40:00 | 8 |
| 2010-01-01 00:45:00 | 9 |


This implies:

 - The original Time Convention is not conserved;
 - Data are preserved.

Furthermore, the effect can be cancelled by applying the inverse transformation, which is not possible with the first version of `shift`. The following table compares the effect of shifting backward and then forward:

In [12]:
ident0 = df0.shift(-1).shift(+1)
ident1 = df0.shift(-1, freq='5T').shift(+1, freq='5T')
pd.concat([ident0, ident1], axis=1, keys=['destructive', 'conservative'], names=['type', 'name'])

| timevalue | destructive (data) | conservative (data) |
|---|---|---|
| 2010-01-01 00:05:00 | NaN | 0 |
| 2010-01-01 00:10:00 | 1.0 | 1 |
| 2010-01-01 00:15:00 | 2.0 | 2 |
| 2010-01-01 00:20:00 | 3.0 | 3 |
| 2010-01-01 00:25:00 | 4.0 | 4 |
| 2010-01-01 00:30:00 | 5.0 | 5 |
| 2010-01-01 00:35:00 | 6.0 | 6 |
| 2010-01-01 00:40:00 | 7.0 | 7 |
| 2010-01-01 00:45:00 | 8.0 | 8 |
| 2010-01-01 00:50:00 | 9.0 | 9 |
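
The same comparison can be made programmatically instead of by eye. This is a small sketch on a ten-row frame; `lossless` and `lossy` are illustrative names.

```python
import numpy as np
import pandas as pd

t = pd.date_range('2010-01-01 00:05:00', periods=10, freq='5min', name='timevalue')
df = pd.DataFrame({'data': np.arange(t.size)}, index=t)

# Shifting the data back and forth loses the first value and changes the dtype.
ident0 = df.shift(-1).shift(+1)
# Shifting the index back and forth restores the original frame.
ident1 = df.shift(-1, freq='5min').shift(+1, freq='5min')

lossy = ident0.equals(df)
lossless = ident1.equals(df)
print(lossy, lossless)
```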


## Comparing both methods

We create a trial DataFrame of random numbers with 10 channels, covering one month sampled with a 5-minute period.
Data are stored using the closed to the right convention:

In [13]:
t0 = pd.date_range(start='2010-01-01 00:00:00', end='2010-02-01 00:00:00', freq='5T', closed='right')
x0 = np.random.randn(10*t0.size).reshape((t0.size,10))
data = pd.DataFrame(x0, index=t0)
data.iloc[:5,:]

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-01 00:05:00 | -1.542427 | 0.596901 | -0.562643 | 0.464154 | -0.444682 | -0.8617 | 0.458769 | -0.167584 | -0.817559 | -0.574182 |
| 2010-01-01 00:10:00 | -0.017982 | -3.042566 | 1.019357 | 1.300443 | 1.146274 | -0.698546 | -0.640792 | -1.104534 | -0.042355 | -0.205023 |
| 2010-01-01 00:15:00 | -0.124204 | -2.108244 | -0.925032 | 0.606554 | -1.806883 | 1.47213 | 0.577052 | 1.723222 | -0.809997 | 1.246302 |
| 2010-01-01 00:20:00 | 0.029185 | -1.976756 | -1.152296 | -0.49232 | 1.992786 | 0.788298 | 0.129595 | 0.591323 | -0.049326 | 0.950709 |
| 2010-01-01 00:25:00 | -1.171512 | -1.738278 | -1.555979 | -0.333753 | 0.298087 | 0.220905 | 0.256691 | 0.879146 | -0.38524 | -0.84993 |


Now we create half-hourly means for all channels, with the requirement to keep the time convention.

### Properly setting the resampler

Either we set up the resampler so that it selects the proper window and keeps the time convention:

In [14]:
d0 = data.resample('30T', closed='right', label='right').mean()
d0.iloc[:5,:]

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-01 00:30:00 | -0.344455 | -0.924595 | -0.359026 | 0.423126 | -0.077009 | 0.138514 | 0.272003 | 0.43145 | -0.46533 | 0.088234 |
| 2010-01-01 01:00:00 | -0.269854 | 0.065397 | 0.538902 | -0.32376 | -0.230892 | -0.061717 | -0.230343 | 0.260744 | -0.253195 | -0.155359 |
| 2010-01-01 01:30:00 | 0.259615 | 0.315796 | -0.332895 | 0.493858 | -0.890526 | -0.291227 | 0.177969 | -0.891898 | -0.825811 | 0.040503 |
| 2010-01-01 02:00:00 | -0.47483 | -0.24496 | -0.033778 | 0.709463 | -0.505943 | -0.595972 | 1.127461 | -0.087473 | -0.037939 | -0.377605 |
| 2010-01-01 02:30:00 | 0.071892 | 0.134703 | -0.402021 | 0.516357 | -0.003361 | -0.267273 | -0.725262 | 0.106016 | -0.769794 | 0.030131 |


### Using shifts to make resampler work without configuration

Or we shift the series backward, aggregate, and then shift the result forward:

In [15]:
d1 = data.shift(-1, freq='5T').resample('30T').mean().shift(+1, freq='30T')
d1.iloc[:5,:]

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-01 00:30:00 | -0.344455 | -0.924595 | -0.359026 | 0.423126 | -0.077009 | 0.138514 | 0.272003 | 0.43145 | -0.46533 | 0.088234 |
| 2010-01-01 01:00:00 | -0.269854 | 0.065397 | 0.538902 | -0.32376 | -0.230892 | -0.061717 | -0.230343 | 0.260744 | -0.253195 | -0.155359 |
| 2010-01-01 01:30:00 | 0.259615 | 0.315796 | -0.332895 | 0.493858 | -0.890526 | -0.291227 | 0.177969 | -0.891898 | -0.825811 | 0.040503 |
| 2010-01-01 02:00:00 | -0.47483 | -0.24496 | -0.033778 | 0.709463 | -0.505943 | -0.595972 | 1.127461 | -0.087473 | -0.037939 | -0.377605 |
| 2010-01-01 02:30:00 | 0.071892 | 0.134703 | -0.402021 | 0.516357 | -0.003361 | -0.267273 | -0.725262 | 0.106016 | -0.769794 | 0.030131 |


The second approach looks trickier, but there are use cases where it is the better solution, especially when you need to deal with multiple time conventions and join data together. In that case, we can shift all DataFrames to align them on the same convention, perform the operations, and then shift the results back to their original time conventions.
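
As an illustration of that use case, here is a minimal sketch with two hypothetical sources that describe the same periods: `begin` stamps each period at its start and `end` stamps it at its end. All names and values are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

starts = pd.date_range('2010-01-01 00:00:00', periods=12, freq='5min')

# Hypothetical sources: `begin` stamps period starts, `end` stamps period ends.
begin = pd.DataFrame({'a': np.arange(12.0)}, index=starts)
end = pd.DataFrame({'b': np.arange(12.0)}, index=starts + pd.Timedelta(minutes=5))

# Bring `end` to the start-of-period convention, then join on the common index.
joined = begin.join(end.shift(-1, freq='5min'))

# Aggregate with the default resampler, then label the result by period end again.
half_hourly = joined.resample('30min').mean().shift(+1, freq='30min')
print(half_hourly)
```

Joining without the shift would pair each row of `a` with the row of `b` from the preceding period.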

### Equality check

Both methods are equivalent; they return equal DataFrames:

In [16]:
d0.equals(d1)

True
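
`equals` requires exact equality of the values. When two pipelines might differ by floating-point rounding, a tolerance-based comparison is a reasonable alternative. The sketch below is an assumption-laden variant of the check above: it uses one day of data, 3 channels and an arbitrary seed rather than the notebook's month of 10 channels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
t0 = pd.date_range('2010-01-01 00:05:00', '2010-01-02 00:00:00', freq='5min')
data = pd.DataFrame(rng.standard_normal((t0.size, 3)), index=t0)

# Method 1: configure the resampler for the End of Period Convention.
d0 = data.resample('30min', closed='right', label='right').mean()
# Method 2: shift back, resample with defaults, shift forward.
d1 = data.shift(-1, freq='5min').resample('30min').mean().shift(+1, freq='30min')

# Compare labels exactly and values up to floating-point tolerance.
same_index = d0.index.equals(d1.index)
same_values = np.allclose(d0.to_numpy(), d1.to_numpy())
print(same_index, same_values)
```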