## CS102-4 - Further Computing

Prof. Götz Pfeiffer<br>
School of Mathematics, Statistics and Applied Mathematics<br>
NUI Galway

### 2. Aspects of Data Wrangling

# Week 8: Time Series

* `Pandas` contains a fairly extensive set of tools for working with dates, times, and time-indexed data.
* Date and time data comes in a few flavors:

  - **Time stamps** reference particular **moments in time** (e.g., March 17th, 2020 at 9:00am).
  - **Time intervals** and **periods** reference a length of time between a particular beginning and end point; for example, the year 2019.
  Periods usually reference a special case of time intervals in which each interval is of uniform length and does not overlap (e.g., 24 hour-long periods comprising days).
  - **Time deltas** or **durations** reference an **exact length of time** (e.g., a duration of 22.56 seconds).

* In this section, we will give a broad overview of how one should approach working with time series.
* We will start with a brief discussion of tools for dealing with dates and times in `Python`, before moving more specifically to a discussion of the tools provided by `Pandas`.
* We will review some short examples of working with time series data in Pandas.

## Dates and Times in `Python`

* The `Python` world has a number of available representations of dates, times, deltas, and timespans.

### Native Python dates and times: ``datetime`` and ``dateutil``

* Python's basic objects for working with dates and times reside in the built-in ``datetime`` module.
* Together with the third-party ``dateutil`` module, it lets you quickly perform a host of useful operations on dates and times.
* For example, you can manually build a date using the ``datetime`` type:

In [None]:
from datetime import datetime
datetime(year=2020, month=3, day=17)

* Or, using the ``dateutil`` module, you can parse dates from a variety of string formats:

In [None]:
from dateutil import parser
date = parser.parse("17th of March, 2020")
date

* A ``datetime`` object can be printed in a variety of formats, e.g. as the day of the week:

In [None]:
date.strftime('%A')

In [None]:
date.strftime('This year\'s national holiday is on %A, %B %d, %Y.')

* Here we've used one of the standard string format codes for printing dates (``"%A"``), which you can read about in the [strftime section](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) of Python's [datetime documentation](https://docs.python.org/3/library/datetime.html).
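* As a minimal sketch of how these format codes work in both directions (the variable names are illustrative only), the standard library's `strptime` parses a string with the same codes that `strftime` uses for output:

```python
from datetime import datetime

# Parse a string using explicit format codes:
# %d = day of month, %B = full month name, %Y = four-digit year
date = datetime.strptime("17 March 2020", "%d %B %Y")

# Format the same date with different codes
iso = date.strftime("%Y-%m-%d")   # year-month-day
weekday = date.strftime("%A")     # full weekday name (depends on the locale)
print(iso, weekday)
```

* Month and weekday names follow the current locale, so the printed names may differ on a machine that is not set to English.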

* Documentation of other useful date utilities can be found in [dateutil's online documentation](http://labix.org/python-dateutil).

* The power of ``datetime`` and ``dateutil`` lies in their flexibility and easy syntax: you can use these objects and their built-in methods to easily perform nearly any operation you might be interested in.
* Where they break down is when you wish to work with large arrays of dates and times:
just as lists of `Python` numerical variables are suboptimal compared to `NumPy`-style typed numerical arrays, lists of `datetime` objects are suboptimal compared to typed arrays of encoded dates.

### Typed arrays of times: `NumPy`'s ``datetime64`` and `timedelta64`

* In `numpy`, the ``datetime64`` dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
* The ``datetime64`` dtype requires a very specific input format:

In [None]:
import numpy as np
date = np.array('2020-03-17', dtype=np.datetime64)
date

* Once we have this date formatted, however, we can quickly do vectorized operations on it:

In [None]:
date + np.arange(16)

* Because of the uniform type in ``datetime64`` arrays, this type of operation can be accomplished much more quickly than if we were working directly with `Python`'s ``datetime`` objects, especially as arrays get large.
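* The speed difference can be checked with a rough timing sketch. The array size `n` and the helper names `shift_python` and `shift_numpy` are assumptions made for this illustration, and the measured times will vary from machine to machine:

```python
import timeit
from datetime import datetime, timedelta

import numpy as np

n = 20_000  # number of consecutive days (illustrative size)

# The same sequence of dates, once as a Python list and once as a typed array
py_dates = [datetime(2020, 3, 17) + timedelta(days=i) for i in range(n)]
np_dates = np.datetime64('2020-03-17') + np.arange(n)

def shift_python():
    # One Python-level addition per element
    return [d + timedelta(days=7) for d in py_dates]

def shift_numpy():
    # A single vectorized addition over the whole array
    return np_dates + np.timedelta64(7, 'D')

t_py = timeit.timeit(shift_python, number=5)
t_np = timeit.timeit(shift_numpy, number=5)
print(f"list of datetime: {t_py:.4f} s, datetime64 array: {t_np:.4f} s")
```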

* Arrays of durations have dtype `timedelta64`.
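* As a small sketch (the dates are chosen only for illustration), subtracting a ``datetime64`` value from an array of dates yields such an array of durations:

```python
import numpy as np

start = np.datetime64('2020-01-01')
dates = np.array(['2020-03-17', '2020-12-25'], dtype='datetime64[D]')

# Subtracting dates gives durations, here measured in days
durations = dates - start
print(durations.dtype)
print(durations)

# Durations can be converted to another unit, e.g. hours
hours = durations.astype('timedelta64[h]')
print(hours)
```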

* Both ``datetime64`` and ``timedelta64`` objects are built on a **fundamental time unit**.

* Because the ``datetime64`` object is limited to 64 bits, the range of encodable times is $2^{64}$ times this fundamental unit.

* Thus ``datetime64`` imposes a trade-off between time **resolution** and **maximum time span**.

* For example, with a time resolution of one nanosecond, one can encode a range of $2^{64}$ nanoseconds, that is, just under 600 years.

In [None]:
2**64/1000/1000/1000/60/60/24/365.25

* `NumPy` will infer the desired unit from the input; for example, here is a day-based datetime:

In [None]:
date = np.datetime64('2015-07-04')
date

In [None]:
np.array(date)

* Here is a minute-based datetime:

In [None]:
date = np.datetime64('2015-07-04 12:00')
date

In [None]:
np.array(date)

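* Notice that ``datetime64`` values are time zone naive: no time zone is attached or inferred (`NumPy` versions before 1.11 interpreted such strings in the local time zone of the computer executing the code).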
* You can force any desired fundamental unit using one of many format codes; for example, here we'll force a nanosecond-based time:

In [None]:
date = np.datetime64('2015-07-04 12:59:59.50', 'ns')
date

In [None]:
np.array(date)

* The following table, drawn from the [NumPy datetime64 documentation](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html), lists the available format codes along with the relative and absolute timespans that they can encode:

|Code    | Meaning     | Time span (relative) | Time span (absolute)   |
|--------|-------------|----------------------|------------------------|
| ``Y``  | Year        | ± 9.2e18 years       | [9.2e18 BC, 9.2e18 AD] |
| ``M``  | Month       | ± 7.6e17 years       | [7.6e17 BC, 7.6e17 AD] |
| ``W``  | Week        | ± 1.7e17 years       | [1.7e17 BC, 1.7e17 AD] |
| ``D``  | Day         | ± 2.5e16 years       | [2.5e16 BC, 2.5e16 AD] |
| ``h``  | Hour        | ± 1.0e15 years       | [1.0e15 BC, 1.0e15 AD] |
| ``m``  | Minute      | ± 1.7e13 years       | [1.7e13 BC, 1.7e13 AD] |
| ``s``  | Second      | ± 2.9e11 years       | [2.9e11 BC, 2.9e11 AD] |
| ``ms`` | Millisecond | ± 2.9e8 years        | [ 2.9e8 BC, 2.9e8 AD]  |
| ``us`` | Microsecond | ± 2.9e5 years        | [290301 BC, 294241 AD] |
| ``ns`` | Nanosecond  | ± 292 years          | [ 1678 AD, 2262 AD]    |
| ``ps`` | Picosecond  | ± 106 days           | [ 1969 AD, 1970 AD]    |
| ``fs`` | Femtosecond | ± 2.6 hours          | [ 1969 AD, 1970 AD]    |
| ``as`` | Attosecond  | ± 9.2 seconds        | [ 1969 AD, 1970 AD]    |

* For the types of data we see in the real world, a useful default is ``datetime64[ns]``, as it can encode a useful range of modern dates with a suitably fine precision.
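* The nanosecond row of the table can be checked with a short sketch. It assumes that ``datetime64`` stores a signed 64-bit count of units since the epoch 1970-01-01, so that half of the $2^{64}$ values lie on either side of 1970:

```python
import numpy as np

# Largest value a signed 64-bit integer can hold
i64_max = np.iinfo(np.int64).max

# Interpret it as a number of nanoseconds since 1970-01-01
latest = np.datetime64(int(i64_max), 'ns')
print(latest)

# Half-width of the representable range, in years
half_range_years = 2**63 / 1e9 / 60 / 60 / 24 / 365.25
print(round(half_range_years, 1))
```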

### Dates and times in `pandas`: best of both worlds

* Pandas provides a ``Timestamp`` object, which combines the ease-of-use of ``datetime`` and ``dateutil`` with the efficient storage and vectorized interface of ``numpy.datetime64``.

* From a group of these ``Timestamp`` objects, Pandas can construct a ``DatetimeIndex`` that can be used to index data in a ``Series`` or ``DataFrame``.

* For example, 
we can parse a flexibly formatted string date, and use format codes to output the day of the week:

In [None]:
import pandas as pd
date = pd.to_datetime("17th of March, 2020")
date

In [None]:
date.strftime('%A')

* Additionally, we can do `NumPy`-style vectorized operations directly on this same object:

In [None]:
dates = date + pd.to_timedelta(np.arange(12), 'D')
dates

In [None]:
dates.strftime('%A')

In [None]:
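# Recent pandas versions reject 'Y' as a timedelta unit (years vary in length),
# so step by whole calendar years with a DateOffset instead
dates = pd.date_range(date, periods=10, freq=pd.DateOffset(years=1))
dates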

In [None]:
dates.strftime('%A')

## Pandas Time Series: Indexing by Time

* We can construct a ``Series`` object that has time indexed data:

In [None]:
dates = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=dates)
data

* We can use any of the ``Series`` indexing patterns, passing values that can be coerced into dates:

In [None]:
data['2014-07-04':'2015-07-04']

* There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:

In [None]:
data['2015']
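* A minimal sketch of further partial-string selections on the same data, written with ``.loc`` (the names `by_year`, `by_month` and `in_range` are illustrative only):

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=dates)

by_year = data.loc['2015']                 # all entries from 2015
by_month = data.loc['2014-08']             # all entries from August 2014
in_range = data.loc['2014-08':'2015-07']   # both endpoints are included
print(by_year, by_month, in_range, sep='\n\n')
```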

## `Pandas` Time Series Data Structures

* This section will introduce the fundamental `Pandas` data structures for working with time series data:

  - For **time stamps**, `Pandas` provides the ``Timestamp`` type. As mentioned before, it is essentially a replacement for Python's native ``datetime``, but is based on the more efficient ``numpy.datetime64`` data type. The associated Index structure is ``DatetimeIndex``.
  - For **time periods**, `Pandas` provides the ``Period`` type. This encodes a fixed-frequency interval based on ``numpy.datetime64``. The associated index structure is ``PeriodIndex``.
  - For **time deltas** or **durations**, `Pandas` provides the ``Timedelta`` type. ``Timedelta`` is a more efficient replacement for `Python`'s native ``datetime.timedelta`` type, and is based on ``numpy.timedelta64``. The associated index structure is ``TimedeltaIndex``.
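* As a small sketch, the three scalar types can be constructed directly. The particular values below are illustrative and echo the examples given earlier in these notes:

```python
import pandas as pd

ts = pd.Timestamp('2020-03-17 09:00')      # a moment in time
period = pd.Period('2020-03', freq='M')    # the whole month of March 2020
delta = pd.Timedelta(seconds=22.56)        # an exact length of time

# A period has a start and an end, and a time stamp may fall inside it
inside = period.start_time <= ts <= period.end_time

# Adding a duration to a time stamp gives another time stamp
later = ts + delta
print(inside, later)
```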

* The most fundamental of these date/time objects are the ``Timestamp`` and ``DatetimeIndex`` objects.
* While these class objects can be invoked directly, it is more common to use the ``pd.to_datetime()`` function, which can parse a wide variety of formats.
* Passing a single date to ``pd.to_datetime()`` yields a ``Timestamp``; passing a series of dates by default yields a ``DatetimeIndex``:

In [None]:
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
                       '2015-Jul-6', '07-07-2015', '20150708'])
dates

* Any ``DatetimeIndex`` can be converted to a ``PeriodIndex`` with the ``to_period()`` function with the addition of a frequency code; here we'll use ``'D'`` to indicate daily frequency:

In [None]:
dates.to_period('D')

* A ``Timedelta``, or a ``TimedeltaIndex``, is created, for example, when dates are subtracted from one another:

In [None]:
dates[3] - dates[0]

In [None]:
dates - dates[0]

### Regular sequences: ``pd.date_range()``

* To make the creation of regular date sequences more convenient, `Pandas` offers a few functions for this purpose: ``pd.date_range()`` for timestamps, ``pd.period_range()`` for periods, and ``pd.timedelta_range()`` for time deltas.
* We've seen that `Python`'s ``range()`` and `NumPy`'s ``np.arange()`` turn a startpoint, endpoint, and optional stepsize into a sequence.
* Similarly, ``pd.date_range()`` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates.
* By default, the frequency is one day:

In [None]:
pd.date_range('2015-07-03', '2015-07-10')

* Alternatively, the date range can be specified with a startpoint and a number of periods:

In [None]:
pd.date_range('2015-07-03', periods=8)

* The spacing can be modified by altering the ``freq`` argument, which defaults to ``D``.
* For example, here we will construct a range of hourly timestamps:

In [None]:
pd.date_range('2015-07-03', periods=8, freq='H')

* To create regular sequences of ``Period`` or ``Timedelta`` values, the very similar ``pd.period_range()`` and ``pd.timedelta_range()`` functions are useful.
* Here are some monthly periods:

In [None]:
pd.period_range('2015-07', periods=8, freq='M')

* And a sequence of durations increasing by an hour:

In [None]:
pd.timedelta_range(0, periods=10, freq='H')

* All of these require knowledge of `Pandas` frequency codes, which we'll summarize in the next section.

## Frequencies and Offsets

* Fundamental to these Pandas time series tools is the concept of a frequency or date offset.
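* The following table summarizes the main codes available (`Pandas` 2.2 and later deprecate several of these spellings, for example ``H`` in favour of ``h``, ``T`` in favour of ``min``, and ``M`` in favour of ``ME``):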

| Code   | Description         | Code   | Description          |
|--------|---------------------|--------|----------------------|
| ``D``  | Calendar day        | ``B``  | Business day         |
| ``W``  | Weekly              |        |                      |
| ``M``  | Month end           | ``BM`` | Business month end   |
| ``Q``  | Quarter end         | ``BQ`` | Business quarter end |
| ``A``  | Year end            | ``BA`` | Business year end    |
| ``H``  | Hours               | ``BH`` | Business hours       |
| ``T``  | Minutes             |        |                      |
| ``S``  | Seconds             |        |                      |
| ``L``  | Milliseconds        |        |                      |
| ``U``  | Microseconds        |        |                      |
| ``N``  | Nanoseconds         |        |                      |

* The monthly, quarterly, and annual frequencies are all marked at the end of the specified period.
* By adding an ``S`` suffix to any of these, they instead will be marked at the beginning:

| Code    | Description            | Code    | Description            |
|---------|------------------------|---------|------------------------|
| ``MS``  | Month start            | ``BMS`` | Business month start   |
| ``QS``  | Quarter start          | ``BQS`` | Business quarter start |
| ``AS``  | Year start             | ``BAS`` | Business year start    |

* Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:

  - ``Q-JAN``, ``BQ-FEB``, ``QS-MAR``, ``BQS-APR``, etc.
  - ``A-JAN``, ``BA-FEB``, ``AS-MAR``, ``BAS-APR``, etc.

* In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:

  - ``W-SUN``, ``W-MON``, ``W-TUE``, ``W-WED``, etc.
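* A minimal sketch of two such codes in use, with start dates chosen only for illustration: a weekly frequency anchored on Mondays, and the month-start code ``MS``:

```python
import pandas as pd

# Weekly frequency, anchored on Mondays
mondays = pd.date_range('2015-07-01', periods=3, freq='W-MON')

# Month-start frequency instead of the default month-end marking
month_starts = pd.date_range('2015-07-03', periods=3, freq='MS')

print(mondays)
print(month_starts)
```

* In both cases the sequence begins at the first matching date on or after the given start date.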

* On top of this, codes can be combined with numbers to specify other frequencies.
* For example, for a frequency of 2 hours 30 minutes, we can combine the hour (``H``) and minute (``T``) codes as follows:

In [None]:
pd.timedelta_range(0, periods=9, freq="2H30T")

* All of these short codes refer to specific instances of `Pandas` time series offsets, which can be found in the ``pd.tseries.offsets`` module.
* For example, we can create a business day offset directly as follows:

In [None]:
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())

* For more discussion of the use of frequencies and offsets, see the ["DateOffset" section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects) of the Pandas documentation.
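* Offset objects can also be added directly to a time stamp. The following sketch uses ``BDay`` and ``MonthEnd`` from ``pandas.tseries.offsets``; the date is chosen for illustration because it falls on a Friday:

```python
import pandas as pd
from pandas.tseries.offsets import BDay, MonthEnd

friday = pd.Timestamp('2015-07-03')

next_business_day = friday + BDay(1)   # skips over the weekend
end_of_month = friday + MonthEnd(1)    # rolls forward to the end of the month
print(next_business_day, end_of_month)
```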

## Example: Water Levels around Kilcolgan

* The OPW publishes real-time data on [water levels](http://waterlevel.ie/) every 15 minutes.

* These data can be seen on their web site, or downloaded as a CSV file for further processing ...


(Uncomment the following 3 lines, then execute the cell to download up-to-date
versions of the files ...)

In [None]:
#!curl -o data/water02.csv http://waterlevel.ie/data/month/29002_0001.csv
#!curl -o data/water10.csv http://waterlevel.ie/data/month/29010_0001.csv
#!curl -o data/water11.csv http://waterlevel.ie/data/month/29011_0001.csv

* Each dataset has a datetime and a value column.

* Once the datasets are downloaded, we can use `Pandas` to read the CSV output into a ``DataFrame``.

* We will specify that we want to use the `datetime` column as an index, and we want these dates to be automatically parsed:

In [None]:
water02 = pd.read_csv('data/water02.csv', index_col='datetime', parse_dates=True)
water02.head()

In [None]:
water10 = pd.read_csv('data/water10.csv', index_col='datetime', parse_dates=True)
water10.head()

In [None]:
water11 = pd.read_csv('data/water11.csv', index_col='datetime', parse_dates=True)
water11.head()
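* The effect of ``index_col`` and ``parse_dates`` can be tried without downloading anything. The sketch below reads a small made-up CSV text from memory; the column names follow the description above, but the values are synthetic and not real OPW data:

```python
import io

import pandas as pd

# Synthetic stand-in for a downloaded file (made-up values)
csv_text = """datetime,value
2020-03-17 00:00,1.25
2020-03-17 00:15,1.27
2020-03-17 00:30,1.30
"""

sample = pd.read_csv(io.StringIO(csv_text),
                     index_col='datetime', parse_dates=True)

# The index is now a DatetimeIndex rather than a column of strings
print(sample.index)
```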

In [None]:
df = pd.concat([water10, water11, water02], axis=1)

In [None]:
df.columns = ['water10', 'water11', 'water02']

In [None]:
df.head()
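* With ``axis=1``, ``pd.concat`` places the frames side by side and aligns their rows on the time index. A minimal sketch with two synthetic frames (the names `station_a` and `station_b` and all values are made up) shows that time stamps missing from one frame are filled with `NaN`:

```python
import pandas as pd

# Two synthetic 15-minute series whose time ranges only partly overlap
idx_a = pd.date_range('2020-03-17 00:00', periods=4, freq='15min')
idx_b = pd.date_range('2020-03-17 00:15', periods=4, freq='15min')
a = pd.DataFrame({'value': [1.0, 1.1, 1.2, 1.3]}, index=idx_a)
b = pd.DataFrame({'value': [2.0, 2.1, 2.2, 2.3]}, index=idx_b)

# Rows are matched by time stamp; gaps become NaN
combined = pd.concat([a, b], axis=1)
combined.columns = ['station_a', 'station_b']
print(combined)
```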

### Visualizing the data

* We can gain some insight into the dataset by visualizing it.

In [None]:
import matplotlib.pyplot as plt
df.plot()

In [None]:
df[['water10', 'water11']].plot()

## References

* `datetime64`, `timedelta64`: [[doc]](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html)


* the ["Time Series/Date" section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) of the Pandas online documentation.


* ...

## Exercises

1.  Find time-based data on the current Covid-19 pandemic, then collate and plot them in a meaningful way.