## CS102-4 - Further Computing

Prof. Götz Pfeiffer<br>
School of Mathematics, Statistics and Applied Mathematics<br>
NUI Galway

### 2. Aspects of Data Wrangling

# Week 8: Time Series

* **Times** and **dates** can be fairly complex data to handle,
  as we have just been reminded by the start of Daylight Saving Time
  (not all days have 24 hours), the 2020 leap year (not every year
  has 365 days),
  and the question of how to determine the date of Easter.

* `Pandas` contains a fairly extensive set of tools for working with dates, times, and time-indexed data.

* Date and time data comes in a few flavors:

  - **Time stamps** reference particular **moments in time** (e.g., March 17th, 2021 at 9:00am).
  - **Time intervals** and **periods** reference a length of time between a particular beginning and end point; for example, the year 2020.
  - Periods usually reference a special case of several non-overlapping time intervals of uniform length (e.g., 24 hour-long periods comprising days).
  - **Time deltas** or **durations** reference an **exact length of time** (e.g., a duration of 22.56 seconds).

* Here, we will give a broad overview of how one should approach working with time series.

* We will start with a brief discussion of tools for dealing with dates and times in `Python`,
then see how these can be improved on in `numpy`, before moving on to a more specific discussion of the tools provided by `Pandas`.

* We will also see some short examples of working with time series data in `Pandas`.
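
* The three complications mentioned at the top (leap years, Easter, Daylight Saving Time) can each be checked in a line or two. The following is only a sketch: it assumes that `dateutil` and `pandas` are installed and that the time zone name `Europe/Dublin` is available in the local time zone database.

```python
import calendar

import pandas as pd
from dateutil.easter import easter

# Leap year: 2020 has 366 days.
is_leap_2020 = calendar.isleap(2020)
days_in_2020 = (pd.Timestamp("2021-01-01") - pd.Timestamp("2020-01-01")).days

# Easter Sunday (Western calendar) of a given year, as a datetime.date.
easter_2021 = easter(2021)

# Daylight Saving Time: on the day the clocks go forward,
# midnight to midnight is one hour short.
start = pd.Timestamp("2021-03-28", tz="Europe/Dublin")
end = pd.Timestamp("2021-03-29", tz="Europe/Dublin")
hours_in_day = (end - start) / pd.Timedelta(hours=1)

print(is_leap_2020, days_in_2020, easter_2021, hours_in_day)
```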

## Dates and Times in `Python`

* The `Python` world has a number of available representations of dates, times, deltas, and timespans.

### Native `Python` dates and times: `datetime` and `dateutil`

* Python's basic objects for working with dates and times reside in the built-in `datetime` module.

* Together with the third-party `dateutil` module, you can use it to quickly perform a host of useful operations on dates and times.

### Creating date objects

* For example, you can manually build a date using the ``datetime`` type:

In [None]:
from datetime import datetime
datetime(year=2021, month=3, day=17)

* Or, using the ``dateutil`` module, you can parse dates from a variety of string formats:

In [None]:
from dateutil import parser
date = parser.parse("17th of March, 2021")
date

### Printing Dates

* A ``datetime`` object can be **printed** in a variety of formats, e.g. as the day of the week:

In [None]:
date.strftime('%A')

In [None]:
date.strftime('This year\'s national holiday is on %A, %B %d, %Y.')

* Here we've used some of the standard string format codes for printing dates (e.g. ``"%A"`` for the day of the week), which you can read about in the [strftime section](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) of Python's [datetime documentation](https://docs.python.org/3/library/datetime.html).

* Documentation of other useful date utilities can be found in [dateutil's online documentation](http://labix.org/python-dateutil).
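
* The same format codes also work in the opposite direction: `datetime.strptime` parses a string according to an explicit format. A minimal sketch, assuming the default (English) locale for month and weekday names:

```python
from datetime import datetime

# Parse a string with an explicit format: day, full month name, year.
parsed = datetime.strptime("17 March 2021", "%d %B %Y")

# Format the result again, using different codes.
iso_text = parsed.strftime("%Y-%m-%d")
weekday = parsed.strftime("%A")

print(parsed, iso_text, weekday)
```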

### `Numpy` typed arrays of times

* In `numpy`, the ``datetime64`` dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
* The ``datetime64`` dtype requires a very specific input format:

In [None]:
import numpy as np
date = np.array('2021-03-17', dtype=np.datetime64)
date

* Once we have this date formatted, however, we can quickly do vectorized operations on it:

In [None]:
date + np.arange(16)

* Because of the uniform type in ``datetime64`` arrays, this type of operation can be accomplished much more quickly than if we were working directly with `Python`'s ``datetime`` objects, especially as arrays get large.

* Arrays of **durations** have dtype `timedelta64`.

* Both ``datetime64`` and ``timedelta64`` objects are built on a **fundamental time unit**.

* Because the ``datetime64`` object is limited to 64 bits, the range of encodable times is $2^{64}$ times this fundamental unit.

* Thus ``datetime64`` imposes a trade-off between time **resolution** and **maximum time span**.

* For example, with a time resolution of **one nanosecond**, one can encode a range of $2^{64}$ nanoseconds,
that is just under 600 years.

In [None]:
2**64/1000/1000/1000/60/60/24/365.25
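
* The 64-bit integer is signed and counts from the start of 1970, so these roughly 584 years are split into about 292 years on either side of that date. As a sketch, the two ends of the nanosecond range can be inspected directly:

```python
import numpy as np

max_int = np.iinfo(np.int64).max          # 2**63 - 1
latest = np.datetime64(max_int, "ns")     # last representable nanosecond

# The smallest int64 value is reserved for NaT ("not a time"),
# so the earliest valid time stamp is one step above it.
earliest = np.datetime64(-max_int, "ns")

print(earliest, latest)
```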

* `Numpy` will infer the desired unit from the input; for example, here is a day-based datetime:

In [None]:
date = np.datetime64('2021-04-01')
date

In [None]:
np.array(date)

* Here is a minute-based datetime:

In [None]:
date = np.datetime64('2021-04-01 09:00')
date

In [None]:
np.array(date)

* You can set the fundamental unit explicitly; for example, here we'll force a nanosecond-based time:

In [None]:
date = np.datetime64('2021-04-01 09:11:59.50', 'ns')
date

In [None]:
np.array(date)
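
* Units also matter in arithmetic. In the following sketch, adding a minute-based duration to a day-based date gives a result in the finer unit, and subtracting two dates gives a `timedelta64` that can be converted to another unit with `astype`:

```python
import numpy as np

day = np.datetime64("2021-04-01")              # unit: D
shifted = day + np.timedelta64(90, "m")        # result has unit m
gap = day - np.datetime64("2021-03-17")        # a timedelta64 in days
as_hours = gap.astype("timedelta64[h]")        # the same duration in hours

print(shifted, shifted.dtype, gap, as_hours)
```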

* The following table, drawn from the `numpy` [datetime64 documentation](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html), lists the available unit codes along with the relative and absolute timespans that they can encode:

|Code    | Meaning     | Time span (relative) | Time span (absolute)   |
|--------|-------------|----------------------|------------------------|
| ``Y``  | Year        | ± 9.2e18 years       | [9.2e18 BC, 9.2e18 AD] |
| ``M``  | Month       | ± 7.6e17 years       | [7.6e17 BC, 7.6e17 AD] |
| ``W``  | Week        | ± 1.7e17 years       | [1.7e17 BC, 1.7e17 AD] |
| ``D``  | Day         | ± 2.5e16 years       | [2.5e16 BC, 2.5e16 AD] |
| ``h``  | Hour        | ± 1.0e15 years       | [1.0e15 BC, 1.0e15 AD] |
| ``m``  | Minute      | ± 1.7e13 years       | [1.7e13 BC, 1.7e13 AD] |
| ``s``  | Second      | ± 2.9e11 years       | [2.9e11 BC, 2.9e11 AD] |
| ``ms`` | Millisecond | ± 2.9e8 years        | [ 2.9e8 BC, 2.9e8 AD]  |
| ``us`` | Microsecond | ± 2.9e5 years        | [290301 BC, 294241 AD] |
| ``ns`` | Nanosecond  | ± 292 years          | [ 1678 AD, 2262 AD]    |
| ``ps`` | Picosecond  | ± 106 days           | [ 1969 AD, 1970 AD]    |
| ``fs`` | Femtosecond | ± 2.6 hours          | [ 1969 AD, 1970 AD]    |
| ``as`` | Attosecond  | ± 9.2 seconds        | [ 1969 AD, 1970 AD]    |

* For the types of data we see in the real world, a useful default is ``datetime64[ns]``, as it can encode a useful range of modern dates with a suitably fine precision.

### Dates and times in `pandas`: best of both worlds

* Pandas provides a ``Timestamp`` object, which combines the ease-of-use of ``datetime`` and ``dateutil`` with the efficient storage and vectorized interface of ``numpy.datetime64``.

* From a group of these ``Timestamp`` objects, Pandas can construct a ``DatetimeIndex`` that can be used to index data in a ``Series`` or ``DataFrame``.

* For example, 
we can parse a flexibly formatted string date, and use format codes to output the day of the week:

In [None]:
import pandas as pd
date = pd.to_datetime("17th of March, 2021")
date

In [None]:
date.strftime('%A')

* Additionally, we can do `Numpy`-style vectorized operations directly on this same object:

In [None]:
dates = date + pd.to_timedelta(np.arange(12), 'D')
dates

In [None]:
dates.strftime('%A')

In [None]:
dates = date + 365.25 * pd.to_timedelta(np.arange(10), 'D')
dates

In [None]:
dates.strftime('%A')
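
* Multiplying by 365.25 days uses years of average length, so the resulting time stamps drift by six hours per step. If calendar years are wanted instead, one possible alternative (a sketch, not the only way) is `pd.DateOffset`, which keeps month and day fixed:

```python
import pandas as pd

date = pd.Timestamp("2021-03-17")

# DateOffset(years=k) moves forward by k calendar years.
anniversaries = pd.DatetimeIndex(
    [date + pd.DateOffset(years=k) for k in range(10)]
)
weekdays = anniversaries.strftime("%A")

print(anniversaries)
print(list(weekdays))
```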

## `Pandas` Time Series: Indexing by Time

* We can construct a ``Series`` object that has time indexed data:

In [None]:
dates = pd.DatetimeIndex(['2020-07-04', '2020-08-04',
                          '2021-07-04', '2021-08-04'])
data = pd.Series([0, 1, 2, 3], index=dates)
data

* We can use any of the ``Series`` indexing patterns, passing values that can be coerced into dates:

In [None]:
data['2020-07-04':'2021-07-04']

* There are additional special **date-only indexing** operations, such as passing a year to obtain a slice of all data from that year:

In [None]:
data['2020']
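
* The same idea works at other resolutions and with `.loc`. The sketch below rebuilds the small series from above and selects by month; note that both end points of a slice are included:

```python
import pandas as pd

dates = pd.DatetimeIndex(["2020-07-04", "2020-08-04",
                          "2021-07-04", "2021-08-04"])
data = pd.Series([0, 1, 2, 3], index=dates)

august_2020 = data.loc["2020-08"]          # all entries in August 2020
middle = data.loc["2020-08":"2021-07"]     # August 2020 up to and including July 2021

print(august_2020)
print(middle)
```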

## `Pandas` Time Series Data Structures

* The fundamental `Pandas` data structures for working with time series data:

  - For **time stamps**, `Pandas` provides the ``Timestamp`` type. As mentioned before, it is essentially a replacement for Python's native ``datetime``, but is based on the more efficient ``numpy.datetime64`` data type. The associated Index structure is ``DatetimeIndex``.
  - For **time periods**, `Pandas` provides the ``Period`` type. This encodes a fixed-frequency interval based on ``numpy.datetime64``. The associated index structure is ``PeriodIndex``.
  - For **time deltas** or **durations**, `Pandas` provides the ``Timedelta`` type. ``Timedelta`` is a more efficient replacement for `Python`'s native ``datetime.timedelta`` type, and is based on ``numpy.timedelta64``. The associated index structure is ``TimedeltaIndex``.

* The most fundamental of these date/time objects are the ``Timestamp`` and ``DatetimeIndex`` objects.
* While these classes can be instantiated directly, it is more common to use the ``pd.to_datetime()`` function, which can parse a wide variety of formats.
* Passing a single date to ``pd.to_datetime()`` yields a ``Timestamp``; passing a series of dates by default yields a ``DatetimeIndex``:

In [None]:
dates = pd.to_datetime([datetime(2020, 7, 3), '4th of July, 2020',
                       '2020-Jul-6', '07-07-2020', '20200708'])
dates

* Any ``DatetimeIndex`` can be converted to a ``PeriodIndex`` with the ``to_period()`` method, given a frequency code; here we'll use ``'D'`` to indicate daily frequency:

In [None]:
dates.to_period('D')

* A `Timedelta`, or a ``TimedeltaIndex``, is created, for example, when dates are subtracted from one another:

In [None]:
dates[3] - dates[0]

In [None]:
dates - dates[0]
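
* The three scalar types listed at the start of this section can also be created directly. A minimal sketch, reusing the examples from the introduction (the variable names are only illustrative):

```python
import pandas as pd

stamp = pd.Timestamp("2021-03-17 09:00")    # a moment in time
period = pd.Period("2020", freq="Y")         # the whole year 2020
delta = pd.Timedelta(seconds=22.56)          # an exact length of time

# A period has a start and an end, so we can test whether a stamp lies inside it.
inside = period.start_time <= stamp <= period.end_time

# Adding a duration to a time stamp gives another time stamp.
later = stamp + delta

print(stamp, period, delta, inside, later)
```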

## Example: Water Levels along the Dunkellin River

* The OPW publishes real-time data on [water levels](http://waterlevel.ie/) around the country every 15 minutes.

* These data can be seen on their web site, or downloaded as a CSV file for further processing ...


In [None]:
level11 = pd.read_csv("https://waterlevel.ie/data/month/29011_0001.csv")
level11.head()

* Each dataset has a `datetime` and a `value` column.

* Above, we used the `Pandas` function `read_csv` to download the data and put it into a `DataFrame`.

* We do the download again and this time specify that we want to use the `datetime` column as index, and we want these dates to be automatically parsed:

In [None]:
level11 = pd.read_csv("https://waterlevel.ie/data/month/29011_0001.csv", index_col='datetime', parse_dates=True)
level11.head()
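
* The effect of `index_col` and `parse_dates` can also be tried offline. The sketch below uses a small made-up sample with the same two column names; the values are purely illustrative and the exact layout of the real file is an assumption:

```python
import io

import pandas as pd

# Made-up sample with a `datetime` and a `value` column;
# these numbers are illustrative only, not real OPW measurements.
sample_csv = io.StringIO(
    "datetime,value\n"
    "2021-03-17 09:00,0.51\n"
    "2021-03-17 09:15,0.52\n"
    "2021-03-17 09:30,0.54\n"
)
sample = pd.read_csv(sample_csv, index_col="datetime", parse_dates=True)

print(type(sample.index).__name__)            # DatetimeIndex
print(sample.index[1] - sample.index[0])      # spacing between readings
```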

* Also, since these data come from Kilcolgan, we rename the column accordingly.

In [None]:
level11.columns = ['Kilcolgan']
level11.head()

* Let's get some additional data from 6 miles up the river.


In [None]:
level10 = pd.read_csv("https://waterlevel.ie/data/month/29010_0001.csv", index_col='datetime', parse_dates=True)
level10.columns = ['Craughwell']
level10.head()

### Visualizing the data

* We can gain some insight into the dataset by visualizing it.

In [None]:
level11.plot()

In [None]:
level10.plot()

* In order to compare the two curves more easily, let's join the data in a single `DataFrame`,
and plot them together:

In [None]:
levels = pd.concat([level11, level10], axis = 1)
levels.tail()
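
* With `axis=1`, `pd.concat` lines the rows up by their time stamps. A time stamp that occurs in only one of the frames gives a missing value (`NaN`) in the other column, as this sketch with made-up data shows:

```python
import pandas as pd

# Two made-up frames whose time stamps only partly overlap.
a = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2021-03-17 09:00", "2021-03-17 09:15", "2021-03-17 09:30"]),
)
b = pd.DataFrame(
    {"B": [10.0, 20.0]},
    index=pd.to_datetime(["2021-03-17 09:15", "2021-03-17 09:30"]),
)

both = pd.concat([a, b], axis=1)   # outer join on the index by default
print(both)
```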

In [None]:
levels.plot()

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15,5))
plt.plot(levels)
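
* Readings every 15 minutes can make for a noisy plot. One option, sketched here on made-up numbers rather than on the OPW data, is to use `resample` to reduce the series to one mean value per day before plotting:

```python
import numpy as np
import pandas as pd

# Two days of made-up readings at 15-minute intervals (96 per day).
index = pd.date_range("2021-03-01", periods=2 * 96, freq="15min")
readings = pd.Series(np.arange(2 * 96, dtype=float), index=index)

daily_mean = readings.resample("D").mean()   # one value per calendar day
print(daily_mean)
```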

## Example: Covid-19 Cases in Different Counties

* The Government publishes statistics relating to Covid-19 on its
[website](https://data.gov.ie/) and updates these regularly ...

In [None]:
covid = pd.read_csv("https://opendata-geohive.hub.arcgis.com/datasets/d9be85b30d7748b5b7c09450b8aede63_0.csv", index_col='TimeStamp', parse_dates=True)


In [None]:
covid.tail()

In [None]:
galway = covid[covid.CountyName == "Galway"]

In [None]:
galway.head()

In [None]:
galway.ConfirmedCovidCases

In [None]:
galway.ConfirmedCovidCases.plot()

In [None]:
donegal = covid[covid.CountyName == "Donegal"]
donegal.ConfirmedCovidCases.plot()

In [None]:
total = pd.DataFrame({
    'Donegal': donegal.ConfirmedCovidCases, 
    'Galway': galway.ConfirmedCovidCases
})

In [None]:
total.plot()
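
* Filtering county by county becomes tedious when many counties are to be compared. A possible shortcut, sketched here on a made-up table that only borrows the column names `CountyName` and `ConfirmedCovidCases`, is to `pivot` the long table into one column per county:

```python
import pandas as pd

# Made-up long-format table: one row per (day, county); not real case counts.
stamps = pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-02"])
long = pd.DataFrame(
    {
        "CountyName": ["Donegal", "Galway", "Donegal", "Galway"],
        "ConfirmedCovidCases": [100, 200, 110, 205],
    },
    index=stamps,
)

# Keep the time index, and turn each county into its own column.
wide = long.pivot(columns="CountyName", values="ConfirmedCovidCases")
print(wide)
```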

* But what's the daily increase? Calculate the difference between values on consecutive days!

In [None]:
daily = total.diff()
daily.tail()
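
* To see what `diff` does, here is a sketch on a short made-up series of cumulative counts. Each entry is replaced by its difference from the previous one, and the first entry has no predecessor:

```python
import pandas as pd

# Made-up cumulative counts on five consecutive days.
cumulative = pd.Series(
    [10, 12, 15, 15, 21],
    index=pd.date_range("2021-03-01", periods=5, freq="D"),
)

increase = cumulative.diff()   # first entry is NaN: there is no previous day
print(increase)
```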

In [None]:
daily.plot()

In [None]:
plt.figure(figsize=(15,5))
plt.plot(daily.loc['2021'])

## References

* `datetime64`, `timedelta64`: [[doc]](https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html)


* the ["Time Series/Date" section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) of the Pandas online documentation.


* ...

## Exercises

1. Compare the Galway Covid-19 data to the data from other counties.

2. Explore the Government's [Open Data Portal](https://data.gov.ie) for more data that are worth downloading and plotting ...