# [AHA! Activity Health Analytics](http://casas.wsu.edu/)
[Center for Advanced Studies of Adaptive Systems (CASAS)](http://casas.wsu.edu/)

[Washington State University](https://wsu.edu)
# L7 Time Series

## Learner Objectives
At the conclusion of this lesson, participants should have an understanding of:
* Time series data
* Utilizing a Pandas `DatetimeIndex`
* Resampling time series data

## Acknowledgments
Content used in this lesson is based upon information in the following sources:
* [Pandas website](https://pandas.pydata.org/pandas-docs/stable/timeseries.html)

## Time Series Overview
A time series is a series of data points indexed (or listed or graphed) in sequential time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. The spacing between time points can be any value, but common spacing values include:
* 1, ..., 60+ seconds
* 1, ..., 60+ minutes
* 1, ..., 12, ..., 24+ hours
* 1, ..., 7+ days
* 1, ..., 52+ weeks
* 1+ years

It is common to collect data at the most fine-grained spacing possible because you can always *downsample* the data later. For example, if your data is collected every minute, one day yields 1440 samples (60 minutes in an hour times 24 hours in a day). If you want to analyze the data hourly, you can aggregate the minute samples within each hour, perhaps by summing the values or taking an average, to yield 24 samples.
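The arithmetic above can be sketched with a small synthetic series. The date and the values below are made up for illustration; only the sample counts matter.

```python
import numpy as np
import pandas as pd

# One synthetic day of per-minute readings (values are illustrative only).
minute_index = pd.date_range("2013-07-18", periods=24 * 60, freq="min")
minute_series = pd.Series(np.arange(len(minute_index), dtype=float), index=minute_index)

# Downsample by averaging the 60 minute-level samples that fall in each hour.
hourly_series = minute_series.resample("60min").mean()

print(len(minute_series))  # 1440 samples at one-minute spacing
print(len(hourly_series))  # 24 samples at one-hour spacing
```

Swapping `.mean()` for `.sum()` would total the minute samples in each hour instead of averaging them.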

In this lesson, we are going to explore time series data through an example that covers indexing with a `MultiIndex`, indexing with a `DatetimeIndex`, and resampling.

## Time Series Example
We are going to work with the [sh1_hourly_activities.csv](https://raw.githubusercontent.com/gsprint23/aha/master/lessons/files/sh1_hourly_activities.csv) dataset. This dataset contains 7 days of hourly activity information from a single smart home resident. For each hour, the probability that the smart home resident was performing each of 13 activities is recorded. The 13 activities are as follows:
1. Bathe
1. Bed_Toilet_Transition
1. Cook
1. Eat
1. Enter_Home
1. Leave_Home
1. Personal_Hygiene
1. Relax
1. Sleep
1. Take_Medicine
1. Wash_Dishes
1. Work
1. Other_Activity

A combination of the date and hour uniquely identifies an activity distribution. Here is a sample of the format of the data:

|date|hour|Bathe|...|Other_Activity|
|-|-|-|-|-|
|7/18/2013|0:00:00|0.000001|...|0.000001|
|7/18/2013|1:00:00|0.000001|...|0.000001|
|...|...|...|...|...|
|7/18/2013|22:00:00|0|...|0|
|7/18/2013|23:00:00|0|...|0.875|
|7/19/2013|0:00:00|0|...|0.589|
|7/19/2013|1:00:00|0|...|0|
|...|...|...|...|...|

### MultiIndex
Initially, we may consider reading this data into a Pandas data frame with a hierarchical index (outer: date, inner: hour):

In [1]:
import pandas as pd
import numpy as np

fname = "files/sh1_hourly_activities.csv"  # forward slashes work on Windows, macOS, and Linux
hier_df = pd.read_csv(fname, header=0, index_col=[0, 1])
print(type(hier_df.index))
print(hier_df.shape, "Number of days:", hier_df.shape[0] // 24)
print(hier_df.head(n=5))

<class 'pandas.indexes.multi.MultiIndex'>
(168, 13) Number of days: 7
                      Bathe  Bed_Toilet_Transition      Cook       Eat  \
date      hour                                                           
7/18/2013 0:00:00  0.000001               0.000001  0.000001  0.000001   
          1:00:00  0.000001               0.000001  0.000001  0.000001   
          2:00:00  0.000001               0.000001  0.000001  0.000001   
          3:00:00  0.000000               0.119792  0.000000  0.000000   
          4:00:00  0.000001               0.000001  0.000001  0.000001   

                   Enter_Home  Leave_Home  Personal_Hygiene     Relax  \
date      hour                                                          
7/18/2013 0:00:00    0.000001    0.000001          0.000001  0.000001   
          1:00:00    0.000001    0.000001          0.000001  0.000001   
          2:00:00    0.000001    0.000001          0.000001  0.000001   
          3:00:00    0.000000    0.000000     
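
With a hierarchical index, rows are selected by level. The sketch below uses a tiny hand-made frame (the `Sleep` and `Work` values are hypothetical, not taken from the dataset) to show selection by the outer date level and by a full (date, hour) pair.

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV layout: two dates, two hours each.
csv_text = """date,hour,Sleep,Work
7/18/2013,0:00:00,0.9,0.0
7/18/2013,1:00:00,0.8,0.0
7/19/2013,0:00:00,0.7,0.1
7/19/2013,1:00:00,0.6,0.2
"""
mini_df = pd.read_csv(io.StringIO(csv_text), header=0, index_col=[0, 1])

# Select every hour of one date using the outer level.
one_day = mini_df.loc["7/18/2013"]
print(one_day)

# Select a single row with a (date, hour) tuple.
one_hour = mini_df.loc[("7/19/2013", "1:00:00")]
print(one_hour["Work"])
```

Note that both index levels hold plain strings here, so Pandas has no notion of their order in time or the spacing between them.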

### DatetimeIndex
Using a `MultiIndex` will work for this data; however, Pandas supports time series indexes directly with its [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.html). Using a `DatetimeIndex` instead of a standard index or a `MultiIndex` helps in several ways, including:
* Easy re-sampling with [`resample()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) (conform or convert time series to a particular frequency)
* Quick access to date fields via properties such as year, month, etc.
* Unioning of overlapping `DatetimeIndex` objects with the same frequency is very fast (important for fast data alignment)
* Partial string indexing
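
Two of these benefits, date field properties and partial string indexing, can be sketched on a small synthetic series. The index and values below are illustrative only and are not the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Two synthetic days of hourly values (illustrative only).
idx = pd.date_range("2013-07-18", periods=48, freq="60min")
s = pd.Series(np.arange(48, dtype=float), index=idx)

# Date fields are exposed as properties of the index.
print(s.index.hour[:3])
print(s.index.day.unique())

# Partial string indexing: a date string selects every timestamp on that day.
second_day = s.loc["2013-07-19"]
print(len(second_day))  # 24 hourly rows
```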

For more about working with time series data in Pandas, see the [Pandas website](https://pandas.pydata.org/pandas-docs/stable/timeseries.html).

Now, let's read in the data again, but this time let's set our index to be a `DatetimeIndex` constructed from the "date" and "hour" columns. We will set the `parse_dates` keyword of [`read_csv()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) to a dictionary, which tells Pandas to combine columns 0 and 1, parse the result as a single datetime column, and name that column "DateTime". Then, we will set this column to be the index of our data frame.

In [3]:
df = pd.read_csv(fname, header=0, parse_dates={"DateTime": [0, 1]})
print(type(df["DateTime"]))
df.set_index("DateTime", inplace=True)
print(type(df.index))
print(df.shape, "Number of days:", df.shape[0] // 24)
print(df.head(n=5))

<class 'pandas.core.series.Series'>
<class 'pandas.tseries.index.DatetimeIndex'>
(168, 13) Number of days: 7
                        Bathe  Bed_Toilet_Transition      Cook       Eat  \
DateTime                                                                   
2013-07-18 00:00:00  0.000001               0.000001  0.000001  0.000001   
2013-07-18 01:00:00  0.000001               0.000001  0.000001  0.000001   
2013-07-18 02:00:00  0.000001               0.000001  0.000001  0.000001   
2013-07-18 03:00:00  0.000000               0.119792  0.000000  0.000000   
2013-07-18 04:00:00  0.000001               0.000001  0.000001  0.000001   

                     Enter_Home  Leave_Home  Personal_Hygiene     Relax  \
DateTime                                                                  
2013-07-18 00:00:00    0.000001    0.000001          0.000001  0.000001   
2013-07-18 01:00:00    0.000001    0.000001          0.000001  0.000001   
2013-07-18 02:00:00    0.000001    0.000001          0.000
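
Newer Pandas releases deprecate combining columns through a dictionary passed to `parse_dates`, so the cell above may warn or fail depending on the installed version. One possible alternative, sketched here on a hypothetical miniature of the CSV (the values are made up), is to read the two columns as text and combine them with `pd.to_datetime()`. The explicit `format` string is an assumption based on the sample rows shown earlier.

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV layout (values are illustrative only).
csv_text = """date,hour,Sleep,Work
7/18/2013,0:00:00,0.9,0.0
7/18/2013,1:00:00,0.8,0.0
7/19/2013,0:00:00,0.7,0.1
"""
raw_df = pd.read_csv(io.StringIO(csv_text), header=0)

# Join the two text columns and parse them with an explicit format.
stamps = pd.to_datetime(raw_df["date"] + " " + raw_df["hour"], format="%m/%d/%Y %H:%M:%S")

# Replace the text columns with a DatetimeIndex named "DateTime".
mini_df = raw_df.drop(columns=["date", "hour"])
mini_df.index = pd.DatetimeIndex(stamps, name="DateTime")
print(type(mini_df.index).__name__)
print(mini_df)
```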

### Resampling
Lastly, let's learn how to resample our time series data. Suppose that instead of hourly activity information, we want it every two hours or daily. We can use the [`resample()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html) method to do this. We just need to specify the frequency to resample to (e.g. hourly, 2 hours, etc.) and how to aggregate the samples in each bin (e.g. mean, sum, etc.). The frequency rule is specified as a string made of an optional integer multiple followed by a unit alias. For example:
* H: hourly frequency
    * e.g. "2H" would be 2 hours
* T: minutely frequency
    * e.g. "5T" would be 5 minutes
* S: secondly frequency
    * e.g. "30S" would be 30 seconds
* D: calendar day frequency
    * e.g. "1D" would be daily
* W: weekly frequency
    * e.g. "4W" would be 4 weeks
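
As a minimal sketch of how rule strings are used, the example below resamples a synthetic hourly series (all ones, purely illustrative) into 2-hour and daily bins. Note that in Pandas 2.2 and later the uppercase aliases `H`, `T`, and `S` are deprecated in favor of `h`, `min`, and `s`, so the sketch uses the lowercase spelling, which should also be accepted by older releases.

```python
import numpy as np
import pandas as pd

# Two synthetic days of hourly values (illustrative only).
idx = pd.date_range("2013-07-18", periods=48, freq="60min")
hourly = pd.Series(np.ones(48), index=idx)

# Integer multiple + unit alias: "2h" groups the data into 2-hour bins.
two_hourly = hourly.resample("2h").sum()

# "1D" groups the data into calendar days.
daily = hourly.resample("1D").sum()

print(len(two_hourly))  # 24 bins of 2 hours each
print(daily.tolist())   # [24.0, 24.0]
```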

In [4]:
# resample to daily averages
activity = "Work"
daily_df = df.resample("1D").mean()
print(daily_df[activity])

DateTime
2013-07-18    0.005037
2013-07-19    0.014286
2013-07-20    0.006566
2013-07-21    0.013086
2013-07-22    0.009781
2013-07-23    0.011276
2013-07-24    0.004174
Freq: D, Name: Work, dtype: float64
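
The choice of aggregation changes what the resampled values mean. The sketch below computes several daily summaries at once with `agg()` on a synthetic series; the "Work" probabilities (0.5 from 09:00 up to 17:00, otherwise 0) are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic hourly probabilities for a hypothetical "Work" activity over two days.
idx = pd.date_range("2013-07-18", periods=48, freq="60min")
in_work_hours = (idx.hour >= 9) & (idx.hour < 17)
work = pd.Series(np.where(in_work_hours, 0.5, 0.0), index=idx, name="Work")

# Compute several daily summaries at once: one column per aggregation.
daily_summary = work.resample("1D").agg(["mean", "max", "sum"])
print(daily_summary)
```

Under the assumption that each hourly probability approximates the fraction of that hour spent on the activity, the daily sum can be read as a rough estimate of hours spent per day, while the daily mean stays on the same 0 to 1 scale as the original data.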
