# DAML 03 - Time Series

Michal Grochmal <michal.grochmal@city.ac.uk>

When working with data we often define dimensions over which we want to analyze
(aggregate) it.  Dimensions are best known from data warehousing and the analytic queries
run over such warehouses.  One dimension that appears in almost every data analysis is the
time dimension.  Windowing, changing granularity or aggregating over specific intervals of the
time dimension is called time series analysis.

Time series analysis requires us to be able to change the time dimension quickly, and tailor
it to our current needs with little computation.  `pandas` provides the tools for this
through its time indexes: time stamps, time periods and time deltas.  Let's see how we build these.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
mpl.rcParams['figure.figsize'] = (12.5, 6.0)
import pandas as pd
pd.options.display.max_rows = 12

Python has the `datetime` module built into the standard library but it is quite limited.
There is also the [dateutil][] module, by Gustavo Niemeyer, which has a much better
date parser; and the [pytz][] module, by Stuart Bishop, which allows us to localize times and dates
within and between timezones.  `pandas` makes use of all three of these modules to build its
`Timestamp`, `Period` and `Timedelta` objects.

[dateutil]: https://dateutil.readthedocs.io/en/stable/ "dateutil documentation"
[pytz]: http://pythonhosted.org/pytz/ "pytz documentation"

In [None]:
# parsed with python's dateutil
pd.to_datetime('January, 2017'), pd.to_datetime('3rd of February 2016'), pd.to_datetime('6:31PM, Nov 11th, 2017')

In [None]:
date = pd.to_datetime('3rd of January 2018')
date.strftime('%A')

In [None]:
date + pd.to_timedelta(np.arange(12), 'D')

## Indexes on dates

We distinguish between three time definitions:
the **timestamp**, e.g. at what time did the plane land;
the **time period**, e.g. how many planes landed this Wednesday;
and the **time delta** (or duration), e.g. how long ago did the last plane land.
Each of these has a `pandas` object and index type:

*   The `DatetimeIndex` is composed of `Timestamp` objects and is the most basic date index type.
*   `PeriodIndex` uses `Period` objects which contain `start_time` and `end_time`
    attributes to check whether a timestamp falls within the period.
*   And the `TimedeltaIndex` is composed of `Timedelta` objects, which represent a duration of time.
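
As a minimal sketch of how the three objects relate, the cell below builds one of each
by hand and checks whether a timestamp falls within a period.  The dates are made up
for illustration, and the variable names (`stamp`, `period`, `delta`, `in_period`) are our own.

```python
import pandas as pd

# A timestamp: a single point in time.
stamp = pd.Timestamp('2017-12-06 18:31')

# A period: the whole month of December 2017.
period = pd.Period('2017-12', freq='M')

# A time delta: the duration between the start of the period and the timestamp.
delta = stamp - period.start_time

# A timestamp falls within a period if it lies between the period bounds.
in_period = period.start_time <= stamp <= period.end_time

print(period.start_time, period.end_time)
print(delta, in_period)
```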

We can understand periods as aggregates of timestamps.  Internally a period is defined by a single
timestamp (start of period) and a frequency (duration of the period).  All periods within a
`PeriodIndex` must have the same frequency.  The frequency (or duration, or offset) in `pandas`
is defined with letter codes.  The most important ones are:

* `D` - day
* `B` - day, business days only
* `W` - week
* `M` - month
* `A`/`Y` - year
* `H` - hour
* `T`/`min` - minute
* `S` - second

These can be combined in several ways (e.g. `BAS-APR`, year starting in April on the first business day).
It is nearly impossible to remember all combinations, so keep a link to the [offset documentation][offset]
handy.  Let's see how to create time based indexes:

[offset]: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases "frequency codes"

In [None]:
dates = pd.to_datetime(['3rd of January 2016', '2016-Jul-6', '20170708'])
dates

In [None]:
dates.to_period('D')

In [None]:
dates - dates[0]

In [None]:
pd.date_range('2017-11-29', '2017-12-03')

In [None]:
pd.date_range('2017-12-03', periods=6)

In [None]:
pd.date_range('2017-12-03', periods=10, freq='H')

In [None]:
pd.period_range('2017-09', periods=8, freq='M')

In [None]:
pd.timedelta_range(0, periods=10, freq='H')
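
Frequency codes can also carry an integer multiplier, and some codes skip dates altogether.
The sketch below is only an illustration with made-up dates: it builds one range with a
90 minute step and one range of business days (`B`), which jumps over the weekend.

```python
import pandas as pd

# A multiplier in front of the code changes the step: one stamp every 90 minutes.
every_90_min = pd.date_range('2017-12-03', periods=4, freq='90min')

# Business days only: 2017-12-01 is a Friday, so the next stamp is Monday the 4th.
business_days = pd.date_range('2017-12-01', periods=3, freq='B')

print(every_90_min)
print(business_days)
```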

## Fremont Bridge

The [Fremont bridge][fremont] in Seattle is possibly one of the most studied bridges in the world.
It sits between the Google office in Seattle and the Adobe offices.  It is preferred by cyclists
because it is a small bridge linking north Seattle and downtown (the other two bridges
are motorway bridges).  Bicycle counters were installed on both sides of the bridge in 2012, and
are collecting data to this day (as of the time of writing).

The data can be downloaded from <http://data.seattle.gov/> ([direct link][dataset]) but we
can use the `daml-03-04-fremont-bridge.csv` file I have already downloaded.  The file has a `Date` column
which we will parse as the index of our time series.  Then we will try to get an understanding
of the data (e.g. using `describe`) and see if we can plot something interesting.

[fremont]: http://www.openstreetmap.org/#map=17/47.64813/-122.34965
[dataset]: https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k

In [None]:
data = pd.read_csv('daml-03-04-fremont-bridge.csv', index_col='Date', parse_dates=True)
data.head()

In [None]:
data.columns = ['West', 'East']
data['Total'] = data['West'] + data['East']
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.plot(alpha=0.6)
plt.ylabel('bicycle count');

In [None]:
# probably a better granularity for the full dataset
weekly = data.resample('W').sum()
weekly.plot(style=[':', '--', '-'])
plt.ylabel('bicycle count');

In [None]:
# perhaps a rolling window can be better:
# 30 day centered rolling mean of the daily totals
daily = data.resample('D').sum()
daily.rolling(30, center=True).mean().plot(style=[':', '--', '-'])
plt.ylabel('mean daily count');

In [None]:
# and if we do it on weeks:
# 10 week centered rolling mean of the weekly totals
weekly = data.resample('W').sum()
weekly.rolling(10, center=True).mean().plot(style=[':', '--', '-'])
plt.ylabel('mean weekly count');

### Group By

Similar to an RDBMS, we can *group by* directly on a pandas data frame.
On a time series, *grouping by* can produce completely different time frames,
which is known as slicing and dicing the frame.  In general, a *group by*
first splits all rows into smaller groups and then applies an aggregation
to each of these smaller groups.
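
A minimal sketch of the split and aggregate steps on hypothetical toy data (not the bridge
dataset): two days of hourly readings are split by hour of the day, and each group is
aggregated with a mean.  The names `toy` and `by_hour` are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one reading every 60 minutes over two full days.
index = pd.date_range('2018-01-01', periods=48, freq='60min')
toy = pd.Series(np.arange(48), index=index)

# Split: rows that share the same hour of the day form a group (24 groups of 2 rows).
# Apply: the mean is computed for each group separately.
by_hour = toy.groupby(toy.index.hour).mean()

print(by_hour.head())
```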

On dates and times this division is quite evident: each day is formed of hours,
each hour of minutes and so on.  Let's first try to get a series of time stamps
from our data and see how we can divide it into divisions such as weeks, days
or hours.

In [None]:
series = data.index.to_series()
series

The date and time properties of the `Series` are on an attribute called `dt`.
This attribute has [several properties][properties] that can be used to aggregate over.

[properties]: http://pandas.pydata.org/pandas-docs/version/0.20/api.html#datetimelike-properties

In [None]:
series.dt.dayofyear

In [None]:
series.dt.dayofweek

In [None]:
series.groupby(series.dt.dayofweek).count()

In [None]:
series.groupby(series.dt.year).count()

In [None]:
# partial string indexing: select every row within the year 2012
series2012 = series.loc['2012']
series2012.groupby(series2012.dt.dayofweek).count()

OK, data collection started on a Wednesday so we do not have data for the first Monday and Tuesday.
2012 ended on Monday the 31st of December, so we can see that there is an extra Monday in there.  2016
was a leap year so we get an extra day (24 hours, hence 24 more data points).

But finding the missing data (remember the `data.info()` output above) from the `Series` is not going to be possible.
This is because the series is built from the index, which is complete.  Yet, finding the missing data from
the actual data frame is pretty easy.

In [None]:
data[(data['West'].isnull()) | (data['East'].isnull())]

That's actually pretty random.  Device malfunction perhaps?  I'll argue that we can safely ignore
these missing points and go back to our full dataset.

Knowing about grouping by we can slice the data in more ways now.  One thing to note is that several
of the properties of the `Series.dt` object are directly available from a `DatetimeIndex`.
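
As a small sketch on a made-up week of dates, the same property can be read through
`Series.dt` or straight from the `DatetimeIndex`, and both give the same values.

```python
import pandas as pd

# One week of daily stamps; 2018-01-01 is a Monday.
index = pd.date_range('2018-01-01', periods=7, freq='D')

# Through the `dt` accessor of a series of timestamps.
from_series = index.to_series().dt.dayofweek

# Directly from the index, no `dt` accessor needed.
from_index = index.dayofweek

# Both routes give the same day numbers (Monday is 0).
same = (from_series.values == from_index.values).all()
print(list(from_index), same)
```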

In [None]:
# hourly traffic
by_time = data.groupby(data.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, style=[':', '--', '-']);

In [None]:
# traffic by day of the week
by_weekday = data.groupby(data.index.dayofweek).mean()
by_weekday.index = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
by_weekday.plot(style=[':', '--', '-']);

In [None]:
# is it different on the weekend?
weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
by_time = data.groupby([weekend, data.index.time]).mean()
by_time

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(14, 9))
fig.subplots_adjust(hspace=0.3)
by_time.loc['Weekday'].plot(ax=ax[0], title='Weekdays',
                            xticks=hourly_ticks, style=[':', '--', '-'])
by_time.loc['Weekend'].plot(ax=ax[1], title='Weekends',
                            xticks=hourly_ticks, style=[':', '--', '-']);

## References and Extras

* [Python Data Science Handbook - Chapter 3: Pandas - Working with time series - Jake VanderPlas][1]
* [Is Seattle Really Seeing an Uptick In Cycling? - Jake VanderPlas][2]

[1]: https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html
[2]: https://jakevdp.github.io/blog/2014/06/10/is-seattle-really-seeing-an-uptick-in-cycling/