In [1]:
from pandas import DataFrame, Series
import pandas as pd
import sys
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline


from datetime import datetime
from datetime import timedelta
from dateutil.parser import parse

## Resampling and Frequency Conversion

Resampling refers to the process of converting a time series from one frequency to
another. Aggregating higher frequency data to lower frequency is called downsampling,
while converting lower frequency to higher frequency is called upsampling. Not all
resampling falls into either of these categories; for example, converting W-WED (weekly
on Wednesday) to W-FRI is neither upsampling nor downsampling.
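
To make the three cases concrete, here is a minimal sketch on made-up data. The series names (`daily`, `weekly`, `daily_again`, `wed`, `fri`) and their values are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily observations starting Monday 2000-01-03
daily = pd.Series(np.arange(14.0),
                  index=pd.date_range('2000-01-03', periods=14, freq='D'))

# Downsampling: daily -> weekly (weeks ending on Sunday), summing each week
weekly = daily.resample('W-SUN').sum()

# Upsampling: weekly -> daily, carrying each weekly value forward
daily_again = weekly.resample('D').ffill()

# Neither: weekly on Wednesday -> weekly on Friday (row count is unchanged)
wed = pd.Series([1.0, 2.0, 3.0],
                index=pd.date_range('2000-01-05', periods=3, freq='W-WED'))
fri = wed.resample('W-FRI').sum()

print(len(daily), len(weekly), len(daily_again))
print(len(wed), len(fri))
```

In the last case the values are only relabeled from each Wednesday to the following Friday, which is why it counts as neither upsampling nor downsampling.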

pandas objects are equipped with a resample method, which is the workhorse function
for all frequency conversion:

In [2]:
rng = pd.date_range('1/1/2000', periods=100, freq='D')

In [4]:
ts = Series(np.random.randn(len(rng)), index=rng)

In [6]:
ts.resample('M').mean()

2000-01-31    0.106928
2000-02-29   -0.105889
2000-03-31    0.212099
2000-04-30    0.515350
Freq: M, dtype: float64

In [8]:
ts.resample('M', kind='period').mean()

2000-01    0.106928
2000-02   -0.105889
2000-03    0.212099
2000-04    0.515350
Freq: M, dtype: float64
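
A period-indexed monthly result can also be reached without the kind argument. The sketch below is one possible alternative, not the method used above: it groups by the calendar month of each timestamp. It uses deterministic made-up values instead of the random series, and `idx`, `days`, and `monthly` are illustrative names.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random series above: values 0..99
idx = pd.date_range('2000-01-01', periods=100, freq='D')
days = pd.Series(np.arange(100.0), index=idx)

# Map each timestamp to its monthly period, then average within each period
monthly = days.groupby(days.index.to_period('M')).mean()

print(monthly)
```

The result has one row per month, indexed by period (2000-01 through 2000-04) rather than by month-end timestamp.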

resample is a flexible and high-performance method that can be used to process very
large time series. I’ll illustrate its semantics and use through a series of examples.

| Argument | Description |
| --- | --- |
| freq | String or DateOffset indicating the desired resampled frequency, e.g. 'M', '5min', or Second(15) |
| how='mean' | Function name or array function producing the aggregated value, for example 'mean', 'ohlc', np.max. Defaults to 'mean'. Other common values: 'first', 'last', 'median', 'ohlc', 'max', 'min'. Newer pandas versions replace this argument with a method call on the resampled object, as in resample('M').mean() |
| axis=0 | Axis to resample on, default axis=0 |
| fill_method=None | How to interpolate when upsampling, as in 'ffill' or 'bfill'. By default does no interpolation |
| closed | In downsampling, which end of each interval is closed (inclusive), 'right' or 'left'. Defaults to 'left' for most frequencies, including '5min'; month-end, quarter-end, year-end, and weekly frequencies default to 'right' |
| label | In downsampling, how to label the aggregated result, with the 'right' or 'left' bin edge. For example, the 9:30 to 9:35 5-minute interval could be labeled 9:30 or 9:35. Follows the same per-frequency defaults as closed ('left', or 9:30, in this example) |
| loffset=None | Time adjustment to the bin labels, such as '-1s' / Second(-1) to shift the aggregate labels one second earlier |
| limit=None | When forward or backward filling, the maximum number of periods to fill |
| kind=None | Aggregate to periods ('period') or timestamps ('timestamp'); defaults to the kind of index the time series has |
| convention=None | When resampling periods, the convention ('start' or 'end') for converting the low frequency period to high frequency. Defaults to 'start' |
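
As a rough sketch of the aggregation options listed for how, the snippet below applies a few of them through the method-style API used in this notebook's code cells. The one-minute series `minute_ts` and the names `bins`, `highest`, `bars`, and `spread` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute data: values 0..11
minute_ts = pd.Series(np.arange(12),
                      index=pd.date_range('2000-01-01', periods=12, freq='min'))

# The resampler object can be reused for several aggregations
bins = minute_ts.resample('5min')

highest = bins.max()                             # a named aggregation
bars = bins.ohlc()                               # open/high/low/close columns
spread = bins.agg(lambda x: x.max() - x.min())   # an arbitrary function

print(bars)
```

Here ohlc returns a DataFrame with one column per statistic, while max and agg each return a Series with one value per five-minute bin.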

### Downsampling

Aggregating data to a regular, lower frequency is a common time series task. The
data you’re aggregating doesn’t need to have a fixed frequency; the desired frequency defines
bin edges that are used to slice the time series into pieces to aggregate. For example,
to convert to monthly, 'M' or 'BM', the data need to be chopped up into one-month
intervals. Each interval is said to be half-open; a data point can only belong to one
interval, and the union of the intervals must make up the whole time frame. There are
a couple of things to think about when using resample to downsample data:

* Which side of each interval is closed
* How to label each aggregated bin, either with the start of the interval or the end

To illustrate, let’s look at some one-minute data:

In [9]:
rng = pd.date_range('1/1/2000', periods=12, freq='T')

In [10]:
rng


DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:01:00',
               '2000-01-01 00:02:00', '2000-01-01 00:03:00',
               '2000-01-01 00:04:00', '2000-01-01 00:05:00',
               '2000-01-01 00:06:00', '2000-01-01 00:07:00',
               '2000-01-01 00:08:00', '2000-01-01 00:09:00',
               '2000-01-01 00:10:00', '2000-01-01 00:11:00'],
              dtype='datetime64[ns]', freq='T')

In [11]:
ts = Series(np.arange(12), index=rng)

In [12]:
ts

2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

Suppose you wanted to aggregate this data into five-minute chunks or bars by taking
the sum of each group:

In [14]:
# ts.resample('5min', how='sum')
ts.resample('5min').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

The frequency you pass defines bin edges in five-minute increments. As the output above
shows, the left bin edge is inclusive by default for this frequency, so the 00:00 value is
included in the 00:00 to 00:05 interval while the 00:05 value opens the next one.
Passing closed='left' makes that choice explicit and gives the same result:

In [16]:
# ts.resample('5min', how='sum', closed='left')
ts.resample('5min', closed='left').sum()

2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64
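
For contrast, here is a sketch of what closing the bins on the right does to the same kind of data, and how label changes the bin names. The series is rebuilt so the snippet stands alone; `right_closed` and `right_labeled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same one-minute data as above: values 0..11
ts = pd.Series(np.arange(12),
               index=pd.date_range('2000-01-01', periods=12, freq='min'))

# Right-closed bins: 00:05 now falls in the (00:00, 00:05] interval,
# and 00:00 lands in a bin that starts at 23:55 on the previous day
right_closed = ts.resample('5min', closed='right').sum()

# Same bins, but each one is named by its right edge instead of its left
right_labeled = ts.resample('5min', closed='right', label='right').sum()

print(right_closed)
print(right_labeled)
```

Both results contain the sums 0, 15, 40, and 11; only the timestamps attached to the bins differ.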