# Timeseries with pandas

Working with time-series data is an important part of data analysis.

Starting with v0.8, the _pandas_ library has included a rich API for time-series manipulations.

The _pandas_ time-series API includes:

- Creating date ranges
  - From files
  - From scratch
- Manipulations: Shift, resample, filter
- Field accessors (e.g., hour of day)
- Plotting
- Time zones (localization and conversion)
- Dual representations (point-in-time vs interval)

In [12]:
# import packages and set up
from datetime import time, date
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

mpl.rc('figure', figsize=(10, 8))
pd.set_option('display.notebook_repr_html', True)  # render DataFrames as HTML tables (set to False for plain text)
pd.set_option('display.max_rows', 8)  # show a truncated summary instead of every row; printing everything is slow
%matplotlib inline

## Example using tick data
Sample trade ticks from 2011-11-01 to 2011-11-03 for a single security.

`parse_dates`: use a list or dict for flexible (possibly multi-column) date parsing

In [11]:
data = pd.read_csv('data.csv', 
                    parse_dates={'Timestamp': ['Date', 'Time']},
                    index_col='Timestamp')
data

ValueError: 'Date' is not in list
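
The `ValueError` above suggests that the file being read has no `Date` column, so the multi-column parse has nothing to combine. Newer pandas releases (2.2 and later, to my knowledge) also deprecate the dict form of `parse_dates`. A minimal sketch of the same idea that combines the columns after reading; the inline CSV and its values are hypothetical and only mimic the layout this notebook assumes:

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout: separate Date and Time columns.
csv_text = """Date,Time,Price,Volume
2011-11-01,09:30:00,104.10,100
2011-11-01,09:30:20,104.15,200
2011-11-01,09:31:05,104.05,150
"""

raw = pd.read_csv(io.StringIO(csv_text))

# Combine the two text columns into one timestamp and use it as the index.
raw['Timestamp'] = pd.to_datetime(raw['Date'] + ' ' + raw['Time'])
sample = raw.drop(columns=['Date', 'Time']).set_index('Timestamp')
print(sample)
```

Checking `raw.columns` before combining is a quick way to confirm what the file actually calls its date and time fields.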

In [None]:
ticks = data.loc[:, ['Price', 'Volume']]  # keep only price and volume (.ix was removed in pandas 1.0)
ticks.head()

## resample: regularization and frequency conversion

Resample the data so that the time interval between observations is constant.

In [None]:
bars = ticks.Price.resample('1min').ohlc()  # one open/high/low/close bar per minute (how= was removed from resample)
bars
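
To see what the resampling step produces, here is a small sketch on four made-up ticks (the timestamps and prices are hypothetical). Each one-minute bin gets its first, highest, lowest, and last price, and a minute with no ticks becomes a row of NaN:

```python
import pandas as pd

# Four hypothetical ticks at irregular times.
idx = pd.to_datetime([
    '2011-11-01 09:30:05',
    '2011-11-01 09:30:40',
    '2011-11-01 09:31:10',
    '2011-11-01 09:33:30',
])
price = pd.Series([10.0, 10.5, 10.2, 10.4], index=idx)

# Regularize to one-minute bars.
bars_demo = price.resample('1min').ohlc()
print(bars_demo)
```

The 09:32 row is empty because no tick fell in that minute, which is why the later sections deal with missing data.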

### Basic Calculation

In [None]:
minute_range = bars.high - bars.low  # minute price range: high - low
minute_range.describe()

In [None]:
minute_return = bars.close / bars.open - 1  # simple minute return; log returns would also work
pxopen = bars.open
minute_return.describe()

In [None]:
volume = ticks.Volume.resample('1min').sum()        # shares traded per minute, the denominator of VWAP (volume-weighted average price)
value = ticks.prod(axis=1).resample('1min').sum()   # price * volume per tick (axis=1 multiplies across columns), summed per minute
vwap = value / volume
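
As a worked check of the VWAP arithmetic, the sketch below uses three hypothetical ticks. For each minute, VWAP is the sum of price times volume divided by the sum of volume:

```python
import pandas as pd

# Three hypothetical ticks: two in the 09:30 minute, one in the 09:31 minute.
idx = pd.to_datetime([
    '2011-11-01 09:30:10',
    '2011-11-01 09:30:50',
    '2011-11-01 09:31:20',
])
ticks_demo = pd.DataFrame({'Price': [10.0, 11.0, 12.0],
                           'Volume': [100, 300, 50]}, index=idx)

# Traded value and traded volume per minute.
value_demo = (ticks_demo['Price'] * ticks_demo['Volume']).resample('1min').sum()
volume_demo = ticks_demo['Volume'].resample('1min').sum()

vwap_demo = value_demo / volume_demo
print(vwap_demo)  # 09:30 -> (10*100 + 11*300) / 400 = 10.75, 09:31 -> 12.0
```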

### Convenient indexing for time series data

In [None]:
vwap.loc['2011-11-01 9:27:20':'2011/11/01 09:32']  # string slices are parsed as timestamps; both date formats work
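
A small sketch of how such a string slice behaves on a regular one-minute index (the series here is synthetic). The start string resolves to 09:27:20, so the first row returned is 09:28, and the end string covers the whole 09:32 minute, so that row is included:

```python
import pandas as pd

# Synthetic one-minute series from 09:25 to 09:35.
idx = pd.date_range('2011-11-01 09:25', periods=11, freq='min')
s = pd.Series(range(11), index=idx)

# Mixed date formats are fine; both ends are parsed as timestamps.
window = s.loc['2011-11-01 9:27:20':'2011/11/01 09:32']
print(window)
```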

### Time-series data selection


In [None]:
bars.open.at_time('9:30')  # at_time: select data at same time but different days

In [None]:
bars.close.at_time('16:00')

In [None]:
filtered = vwap.between_time('10:00', '16:00')  # between_time: intraday time range
filtered

In [None]:
vol = volume.between_time('10:00', '16:00')
vol.head(20)
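
The two selectors can be compared on a synthetic half-hourly series spanning two days. `at_time` keeps the rows at one clock time on every day, while `between_time` keeps an intraday window and includes both endpoints by default:

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly series covering two days.
idx = pd.date_range('2011-11-01 09:00', '2011-11-02 17:00', freq='30min')
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

opens = s.at_time('9:30')                   # one row per day, at 09:30
session = s.between_time('10:00', '16:00')  # 10:00 through 16:00 on each day

print(opens)
print(len(session))
```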

### Handling missing data: fillna()


In [None]:
filtered.loc['2011-11-03':'2011-11-04'].head(20)

In [None]:
filled = filtered.ffill(limit=1)  # forward-fill at most one consecutive gap (fillna(method='pad') is deprecated)
# ffill/pad: propagate the last valid observation; bfill/backfill: propagate the next valid observation
filled.loc['2011-11-03':'2011-11-04'].head(20)

In [None]:
vol = vol.fillna(0.)
vol.head(20)
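
The effect of the fill limit is easiest to see on a tiny series with a two-minute gap (values are made up). With a limit of 1, only the first missing minute after a valid observation is filled:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-11-03 10:00', periods=5, freq='min')
s = pd.Series([104.0, np.nan, np.nan, 104.2, np.nan], index=idx)

padded = s.ffill(limit=1)   # carry the last valid price forward by one step only
zeroed = s.fillna(0.0)      # the treatment used for volume: a missing minute means nothing traded

print(padded)
```

Prices are padded because the last trade is still the best available quote, whereas volume is set to zero because no shares changed hands.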

### Simple plotting

In [None]:
filled.loc['2011-11-03':'2011-11-04'].plot()  # VWAP on the left axis
plt.ylim(103.5, 104.5)
vol.loc['2011-11-03':'2011-11-04'].plot(secondary_y=True, style='r')  # volume on the right axis, in red
plt.title('VWAP / Volume on 2011-11-03')

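### Lead/lag
`shift` realigns values against a fixed index.

`tshift` moves the index and leaves the values alone. It was removed in pandas 2.0; `shift(periods, freq=...)` does the same thing.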

In [None]:
ticks.head()

In [None]:
ticks.shift(1).head()

In [None]:
ticks.shift(-1).head()

In [None]:
ticks.shift(1, freq='min').head()  # move the index forward one minute (replaces tshift, removed in pandas 2.0)
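
The difference between moving values and moving the index shows up clearly on a three-row synthetic series. This sketch uses `shift` with a `freq` argument for the index move, which is the form current pandas supports:

```python
import pandas as pd

idx = pd.date_range('2011-11-01 09:30', periods=3, freq='min')
s = pd.Series([1.0, 2.0, 3.0], index=idx)

moved_values = s.shift(1)             # same index, values pushed down one row, NaN at the top
moved_index = s.shift(1, freq='min')  # same values, every timestamp one minute later

print(moved_values)
print(moved_index)
```

Moving the index never loses data, which makes it the safer choice for building lagged regressors that are later aligned by timestamp.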

## A stupidly simple strategy using regression

In [None]:
mr = minute_return.between_time('9:30', '16:00')  # minute returns during market hours
px = pxopen.between_time('9:30', '16:00')

In [None]:
lagged = minute_return.shift(1, freq='min').between_time('9:30', '16:00')  # previous minute's return, stamped one minute later
lagged.at_time('16:00')  # the lagged value at 16:00 is the minute return of 15:59

### Now let's play :)

In [None]:
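import statsmodels.api as sm  # pd.ols was removed in pandas 0.20; statsmodels is a common substitute
sm.OLS(mr, sm.add_constant(lagged.reindex(mr.index)), missing='drop').fit().summary()  # OLS on the previous minute's simple return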
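
The regression above depends on the return and its lag being matched by timestamp before fitting. A minimal sketch of that alignment on synthetic data, using `np.polyfit` with degree 1 as a stand-in for the OLS call (it fits the same line with an intercept). The series is built so that each return is exactly `-0.5` times the previous one plus `0.1`, which the fit should recover:

```python
import numpy as np
import pandas as pd

# Synthetic returns following r[t] = -0.5 * r[t-1] + 0.1 (an assumption for illustration).
idx = pd.date_range('2011-11-01 09:30', periods=10, freq='min')
values = [1.0]
for _ in range(9):
    values.append(-0.5 * values[-1] + 0.1)
ret = pd.Series(values, index=idx)

# Lag by moving the index one minute forward, then align on timestamps.
lagged_ret = ret.shift(1, freq='min')
pairs = pd.concat({'y': ret, 'x': lagged_ret}, axis=1).dropna()

slope, intercept = np.polyfit(pairs['x'], pairs['y'], 1)
print(slope, intercept)
```

A negative slope means a positive return tends to be followed by a negative one, which is the mean-reversion pattern the strategy tries to exploit.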

In [None]:
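import statsmodels.api as sm  # replaces pd.ols, removed in pandas 0.20
mr_vw = vwap / bars.open - 1           # volume-weighted return of each minute
mr_vw = mr_vw.between_time('9:30', '16:00')
lagged_vw = mr_vw.shift(1, freq='min').between_time('9:30', '16:00')
sm.OLS(mr_vw, sm.add_constant(lagged_vw.reindex(mr_vw.index)), missing='drop').fit().summary()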

In [None]:
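import statsmodels.api as sm  # replaces pd.ols, removed in pandas 0.20
# Volume itself is not tradable, but scaling by it does not affect the OLS if we assume we trade a fixed percentage of volume.
inter_vw = mr_vw * vol
inter_vw = inter_vw.between_time('9:30', '16:00')
lagged_inter_vw = inter_vw.shift(1, freq='min').between_time('9:30', '16:00')
sm.OLS(mr_vw, sm.add_constant(lagged_inter_vw.reindex(mr_vw.index)), missing='drop').fit().summary()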

#### Convert to percentage volume

In [None]:
vol_prop = vol.groupby(vol.index.day).transform(lambda x: x/x.sum())
vol_prop.head()
# vol_prop.resample('D').sum()   # optional check: the proportions should sum to 1 on each trading day
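
The conversion to daily proportions can be verified on a small synthetic series. This sketch groups by `index.date` rather than `index.day`; that is my variation, chosen because day-of-month would merge days from different months, although the two agree for data inside a single month:

```python
import pandas as pd

# Hypothetical minute volumes on two days.
idx = pd.to_datetime([
    '2011-11-01 10:00', '2011-11-01 10:01',
    '2011-11-02 10:00', '2011-11-02 10:01',
])
vol_demo = pd.Series([100.0, 300.0, 50.0, 150.0], index=idx)

# Each minute's share of that day's total volume.
prop = vol_demo.groupby(vol_demo.index.date).transform(lambda x: x / x.sum())

# Sanity check: the shares add up to 1 on every day.
daily_total = prop.groupby(prop.index.date).sum()
print(prop)
print(daily_total)
```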

In [None]:
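import statsmodels.api as sm  # replaces pd.ols, removed in pandas 0.20
inter_prop = mr_vw * vol_prop
inter_prop = inter_prop.between_time('9:30', '16:00')
lagged_inter_prop = inter_prop.shift(1, freq='min').between_time('9:30', '16:00')
sm.OLS(mr_vw, sm.add_constant(lagged_inter_prop.reindex(mr_vw.index)), missing='drop').fit().summary()

# Slightly different from the previous OLS, because total daily volume changes from day to day.
# The result is also consistent with mean reversion.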

In [None]:
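# Use OLS to predict the next return, assuming one trade each minute.
def pred(x, y, px, r=0.7):
    # px is accepted for compatibility with the calls below but is not used in the P&L.
    df = pd.concat({'x': x, 'y': y}, axis=1).dropna()  # align on timestamps, drop missing rows
    length = int(len(df) * r)                          # the first r of the sample is the training set
    train, test = df.iloc[:length], df.iloc[length:]
    # A degree-1 polyfit is an OLS fit with intercept; it stands in for pd.ols (removed in pandas 0.20).
    beta, intercept = np.polyfit(train['x'], train['y'], 1)
    ypred = beta * test['x'] + intercept
    signal = np.sign(ypred)                            # +1 long, -1 short, 0 flat
    pnl = signal * test['y']
    cumpnl = pnl.cumsum()
    return cumpnl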

In [None]:
p1=pred(lagged_vw, mr_vw, px)
p2=pred(lagged, mr, px)
p3=pred(lagged_inter_vw, mr_vw, px)
p4=pred(lagged_inter_prop, mr_vw, px)
plt.rcParams['figure.figsize'] = 12, 4
plt.figure(1)
plt.subplot(221)
p1.plot()
plt.subplot(222)
p2.plot()
plt.subplot(223)
p3.plot()
plt.subplot(224)
p4.plot()
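
The step from prediction to cumulative profit can be traced by hand on four made-up minutes. The position is the sign of the predicted return, the per-minute profit is the position times the realized return, and the curve plotted above is the running sum. The numbers below are hypothetical:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2011-11-03 15:00', periods=4, freq='min')
predicted = pd.Series([0.01, -0.02, 0.0, 0.03], index=idx)   # hypothetical model output
realized = pd.Series([0.02, 0.01, -0.01, -0.01], index=idx)  # hypothetical realized returns

signal = np.sign(predicted)   # +1 long, -1 short, 0 flat
pnl = signal * realized       # profit per minute, ignoring costs
cumpnl = pnl.cumsum()

print(cumpnl)  # 0.02, 0.01, 0.01, 0.00
```

This ignores transaction costs and assumes a fill at every minute, so it overstates what a real strategy would earn.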

#### Calculate Volume Deviation

In [None]:
hour = vol.index.hour
hourly_volume = vol.groupby(hour).mean()

In [None]:
hourly_volume.plot(kind='bar')   # volume is heaviest near the close
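
Grouping by a field accessor such as `index.hour` pools the same hour across all days. A short sketch with hypothetical volumes:

```python
import pandas as pd

idx = pd.to_datetime([
    '2011-11-01 10:15', '2011-11-01 10:45', '2011-11-01 15:30',
    '2011-11-02 10:05', '2011-11-02 15:55',
])
vol_demo = pd.Series([100.0, 200.0, 900.0, 300.0, 700.0], index=idx)

# Average volume by hour of day, pooled over both days.
by_hour = vol_demo.groupby(vol_demo.index.hour).mean()
print(by_hour)  # hour 10 -> 200.0, hour 15 -> 800.0
```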

Expanding window of hourly means for volume

In [None]:
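hourly = vol_prop.resample('h').mean()  # hourly mean of the volume proportion (use 'H' on pandas < 2.2)

def calc_mean(hr):
    hr = time(hour=hr)
    data = hourly.at_time(hr)       # the same hour on every day
    return data.expanding().mean()  # running mean over all days so far (replaces the removed pd.expanding_mean)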

df = pd.concat([calc_mean(hr) for hr in range(10, 16)])
df = df.sort_index()
df
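
An expanding mean at a fixed hour uses only the days seen so far, so it never looks ahead. A sketch with three hypothetical values for the 10:00 hour:

```python
import pandas as pd

idx = pd.to_datetime(['2011-11-01 10:00', '2011-11-02 10:00', '2011-11-03 10:00'])
at_ten = pd.Series([0.10, 0.20, 0.30], index=idx)  # hypothetical hourly volume shares

running = at_ten.expanding().mean()
print(running)  # 0.10, 0.15, 0.20
```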

Compute deviations from the hourly means

In [None]:
clean_vol = vol_prop.between_time('10:00', '15:59')   
dev = clean_vol - df.reindex(clean_vol.index, method='pad')  # be careful over day boundaries
dev
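
The reindex with `method='pad'` stretches each hourly mean across the minutes that follow it, so every minute is compared with the mean of its own hour. A sketch with hypothetical numbers:

```python
import pandas as pd

hourly_mean = pd.Series(
    [0.2, 0.4],
    index=pd.to_datetime(['2011-11-01 10:00', '2011-11-01 11:00']))

minute_idx = pd.to_datetime([
    '2011-11-01 10:00', '2011-11-01 10:30',
    '2011-11-01 11:00', '2011-11-01 11:30',
])
minute_vol = pd.Series([0.25, 0.15, 0.5, 0.4], index=minute_idx)

# Carry each hourly mean forward onto the finer index, then subtract.
baseline = hourly_mean.reindex(minute_vol.index, method='pad')
dev_demo = minute_vol - baseline
print(dev_demo)
```

Padding carries the last available value forward regardless of the date, so the first minutes of a day would inherit the previous day's final hourly mean unless a value exists at or before them on the same day. That is the day-boundary caveat noted in the cell above.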

In [None]:
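import statsmodels.api as sm  # replaces pd.ols, removed in pandas 0.20
inter_dev = mr_vw * dev   # predictor: volume-weighted return * volume deviation
inter_dev = inter_dev.between_time('10:00', '15:59')
lagged_dev = inter_dev.shift(1, freq='min').reindex(mr_vw.index)
sm.OLS(mr_vw, sm.add_constant(lagged_dev), missing='drop').fit().summary()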

### Period representation

A lot of time series data is better represented as intervals of time rather than points in time.

_pandas_ represents these intervals with `Period` and `PeriodIndex`.

#### Creating periods

In [None]:
p = pd.Period('2005', 'Y')  # annual period ('A' is the older alias, deprecated in pandas 2.2)
p

In [None]:
pd.Period('2006Q1', 'Q-MAR')

In [None]:
pd.Period('2007-1-1', 'B')

#### PeriodRange

In [None]:
prng = pd.period_range('2005', periods=7, freq='Y')  # seven annual periods ('A' is the older alias)
prng

#### Converting between representations

In [None]:
p

In [None]:
p.to_timestamp()

In [None]:
p.to_timestamp('M', 's')

In [None]:
p.to_timestamp('M', 'e')

In [None]:
prng.to_timestamp(how='e')

In [None]:
prng.to_timestamp('M', 'e')
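
The conversions above can be checked on a monthly period, where the start and end are easy to verify by eye. This sketch uses monthly frequency instead of annual, purely to keep the example independent of the alias changes between pandas versions:

```python
import pandas as pd

p_month = pd.Period('2005-03', freq='M')

start = p_month.to_timestamp(how='start')  # first instant of the period
end = p_month.to_timestamp(how='end')      # last instant of the period

# The same conversion on a whole PeriodIndex, and back again.
months = pd.period_range('2005-01', periods=3, freq='M')
starts = months.to_timestamp(how='start')
round_trip = starts.to_period('M')

print(start, end)
print(starts)
```

Converting to timestamps picks one point inside each interval, so the `how` argument decides whether a period is anchored at its beginning or its end.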