# <img style="float: left; padding-right: 100px; width: 200px" src="../img/parrotai.png">ParrotAI IPT Program


## Module 2C: Working with time series data


**Authors:** Faustine, Davis Davis


---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Time series

Time series data is an important form of structured data in many different fields, such
as finance, economics, ecology, neuroscience, and physics. Anything that is observed
or measured at many points in time forms a time series.

## Date and Time Data Types

The Python standard library includes data types for date and time data, as well as
calendar-related functionality. The `datetime`, `time`, and `calendar` modules are the
main places to start. The `datetime.datetime` type, or simply `datetime`, is widely
used:

In [None]:
from datetime import datetime as dt

In [None]:
dt.now()

In [None]:
ts = dt(year=2016, month=12, day=19, hour=13, minute=30)
ts

You can format `datetime` objects and pandas `Timestamp` objects, which are introduced
later, as strings using `str` or the `strftime` method, passing a format specification:

In [None]:
str(ts)

In [None]:
ts.strftime("%d %B %Y")
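
As a small illustration of format specifications, the sketch below formats a `datetime` as text and parses it back with `strptime`, the standard-library counterpart of `strftime`. The variable names and the chosen format string are only examples.

```python
from datetime import datetime

# Example value; any datetime works here.
stamp = datetime(2016, 12, 19, 13, 30)

# Format the datetime as text using a format specification.
text = stamp.strftime("%Y-%m-%d %H:%M")
print(text)

# Parse the text back into a datetime using the same specification.
parsed = datetime.strptime(text, "%Y-%m-%d %H:%M")
print(parsed == stamp)
```

Using the same specification in both directions keeps the round trip lossless, as long as the format includes every field you care about.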

## Dates and times in pandas

### The ``Timestamp`` object

Pandas has its own date and time objects, which are compatible with the standard `datetime` objects but provide more functionality.

The `Timestamp` object can also be constructed from a string:

In [None]:
ts = pd.Timestamp('2016-12-19')
ts

Like with `datetime.datetime` objects, there are several useful attributes available on the `Timestamp`. For example, we can get the month (experiment with tab completion!):

In [None]:
ts.month

There is also a `Timedelta` type, which can be used, for example, to add intervals of time:


In [None]:
ts + pd.Timedelta('5 days')
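
Arithmetic also works in the other direction: subtracting two `Timestamp` objects gives a `Timedelta`. A minimal sketch, with example values chosen only for illustration:

```python
import pandas as pd

start = pd.Timestamp("2016-12-19")
end = start + pd.Timedelta(days=5, hours=6)

# Subtracting two Timestamps gives a Timedelta back.
elapsed = end - start
print(end)
print(elapsed)

# A Timedelta can be converted to a plain number, here hours.
hours = elapsed.total_seconds() / 3600
print(hours)
```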

Unfortunately, when working with real world data, you encounter many different `datetime` formats. Most of the time when you have to deal with them, they come in text format, e.g. from a `CSV` file. To work with those data in Pandas, we first have to *parse* the strings to actual `Timestamp` objects.

To convert string-formatted dates to `Timestamp` objects, use the `pandas.to_datetime` function:

In [None]:
pd.to_datetime("2016-12-09")

In [None]:
pd.to_datetime("09/12/2016")
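
A string such as `"09/12/2016"` is ambiguous: it could mean 12 September or 9 December. By default pandas reads the month first. The sketch below shows two ways to state the intended order explicitly, using the `dayfirst` and `format` parameters of `pd.to_datetime`; the variable names are only illustrative.

```python
import pandas as pd

raw = "09/12/2016"

# Default behaviour: the month is read first, giving 12 September 2016.
month_first = pd.to_datetime(raw)

# dayfirst=True reads the same string as 9 December 2016.
day_first = pd.to_datetime(raw, dayfirst=True)

# An explicit format removes the ambiguity altogether.
explicit = pd.to_datetime(raw, format="%d/%m/%Y")

print(month_first, day_first, explicit)
```

When the layout of a date column is known in advance, passing `format` is usually the safest choice.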

For the following demonstration of the time series functionality, we use a [Household Power Consumption data set](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption). The Household Power Consumption dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years. The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

- global_active_power: The total active power consumed by the household (kilowatts).
- global_reactive_power: The total reactive power consumed by the household (kilowatts).
- voltage: Average voltage (volts).
- global_intensity: Average current intensity (amps).
- sub_metering_1: Active energy for kitchen (watt-hours of active energy).
- sub_metering_2: Active energy for laundry (watt-hours of active energy).
- sub_metering_3: Active energy for climate control systems (watt-hours of active energy).

```python
df = pd.read_csv("data/household_power_consumption.txt", sep=';', header=0, low_memory=False, parse_dates={'datetime':[0,1]})
```

In [None]:
data = pd.read_csv("timeseries_data.csv", low_memory=False)

In [None]:
data.head()

Let us rename the columns to shorter names:

In [None]:
new_columns = {"Global_active_power": "P", "Global_reactive_power": "S",
               "Voltage": "V", "Global_intensity": "I"}
data = data.rename(columns=new_columns)

Next, we can mark all missing values, indicated with a '?' character, with a NaN value, which is a float. This will allow us to work with the data as arrays of floating-point values rather than mixed types, which are less efficient.

In [None]:
data=data.replace('?', np.nan)

We already know how to parse a date column with Pandas:



In [None]:
data['datetime'] = pd.to_datetime(data['datetime'])

With `set_index('datetime')`, we make the column of datetime values the index of the `DataFrame`. Both `Series` and `DataFrame` objects can carry such a datetime index.

In [None]:
data = data.set_index("datetime")

In [None]:
data.head()
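
The cleaning steps above (marking '?' as missing, parsing the datetime column, setting it as the index) can often be folded into the `read_csv` call itself. The following is a minimal sketch, assuming a hypothetical two-row sample in a semicolon-separated layout; the sample text and the name `sample` are made up for illustration and are not the course data file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; '?' marks a missing reading.
csv_text = "datetime;P\n2006-12-16 17:24:00;4.216\n2006-12-16 17:25:00;?\n"

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",
    na_values="?",             # treat '?' as missing while reading
    parse_dates=["datetime"],  # parse this column as datetimes
    index_col="datetime",      # and use it as the index
)
print(sample.dtypes)
print(sample.index)
```

Because the '?' markers never reach the column as text, `P` is read directly as a float column.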

## The DatetimeIndex

When we ensure the DataFrame has a `DatetimeIndex`, time-series related functionality becomes available:

In [None]:
data.index

Similar to a Series with datetime data, there are some attributes of the timestamp values available:

In [None]:
data.index.day

In [None]:
data.index.year
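
To see these attributes on something small enough to read in full, here is a sketch on a hypothetical four-day index built with `pd.date_range` (the dates are arbitrary):

```python
import pandas as pd

# Hypothetical index: four consecutive days crossing a month boundary.
idx = pd.date_range("2007-01-30", periods=4, freq="D")

print(idx.day)         # day of the month
print(idx.month)       # month number
print(idx.dayofweek)   # Monday is 0, Sunday is 6
print(idx.day_name())  # weekday names as text
```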

The `plot` method will also adapt its labels (when you zoom in, you can see the different levels of detail of the datetime labels). For example, let us plot the active power column:


In [None]:
data["P"].plot()

**Note** the type of error:
`TypeError: Empty 'DataFrame': no numeric data to plot` (the exact wording depends on the pandas version). Let us check the data type of each column:

In [None]:
data.dtypes

As you can see, every column has the `object` data type except `Sub_metering_3`. We have to convert the remaining columns to numeric, here with `astype(float)` (the `pd.to_numeric()` function is an alternative):

In [None]:
columns = ["P", "S", "V", "I", "Sub_metering_1", "Sub_metering_2"]
data[columns] = data[columns].astype(float)
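
An alternative worth knowing is `pd.to_numeric` with `errors="coerce"`, which converts text to numbers and turns anything unparseable into NaN in one step. A minimal sketch on a hypothetical toy column (not the course data):

```python
import pandas as pd

# Hypothetical toy column, with '?' marking a missing reading.
raw = pd.Series(["1.5", "?", "2.25"])

# errors="coerce" turns anything that cannot be parsed into NaN.
clean = pd.to_numeric(raw, errors="coerce")
print(clean.dtype)
print(clean.isna().sum())
```

Note that coercion silently hides every unparseable value, so it is worth checking how many NaN values appear afterwards.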

In [None]:
data[["P", "S"]].plot()

We have too much data to sensibly plot on one figure. Let's see how we can easily select part of the data or aggregate the data to other time resolutions in the next sections.

## Selecting data from a time series

We can use label-based indexing on a time series as expected:

In [None]:
data[pd.Timestamp("2007-01-01 09:00"):pd.Timestamp("2007-01-01 19:00")]["P"].plot()

But, for convenience, indexing a time series also works with strings:

In [None]:
data["2007-01-01 09:00":"2007-01-01 19:00"]['P'].plot()

A nice feature is **"partial string" indexing**, where we can do implicit slicing by providing a partial datetime string.

E.g. all data of 2006:

In [None]:
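# Use .loc for partial-string row selection; recent pandas treats data['2006'] as a column lookup
data.loc['2006', 'P'].plot()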

Or all data of January up to March 2007:

In [None]:
data['2007-01':'2007-03']["P"].plot()
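
The sketch below shows partial string indexing on a hypothetical daily series that is small enough to count by hand. It uses `.loc`, and the series `s` is made up for illustration.

```python
import pandas as pd

# Hypothetical daily series spanning the end of 2006 and early 2007.
idx = pd.date_range("2006-12-30", "2007-02-02", freq="D")
s = pd.Series(range(len(idx)), index=idx)

in_2006 = s.loc["2006"]                    # every row in the year 2006
in_january = s.loc["2007-01"]              # every row in January 2007
tail = s.loc["2007-01-31":"2007-02"]       # both slice end points are included

print(len(in_2006), len(in_january), len(tail))
```

Unlike ordinary Python slices, label-based slices include the end point, and a partial string such as `"2007-02"` covers the whole month.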

<div class="alert alert-success">

<b>Activity</b>:

 <ul>
  <li>select all data in January for all different years</li>
</ul>
</div>

In [None]:
data[data.index.month == 1]

<div class="alert alert-success">

<b>Activity</b>:

 <ul>
  <li>select all data in April, May and June for all different years</li>
</ul>
</div>

In [None]:
data[data.index.month.isin([4, 5, 6])]

<div class="alert alert-success">

<b>Activity</b>:

 <ul>
  <li>select all 'daytime' data (between 12h and 15h) for all days</li>
</ul>
</div>

In [None]:
data[(data.index.hour >= 12) & (data.index.hour < 15)]['P'].plot()
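
Selecting by time of day can also be written with the `between_time` method, which by default includes both end points. A minimal sketch on a hypothetical half-hourly series (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical series: one value every 30 minutes for two full days.
idx = pd.date_range("2007-01-01", periods=96, freq="30min")
s = pd.Series(range(96), index=idx)

# Rows from 12:00 through 14:59 on every day.
afternoon = s.between_time("12:00", "14:59")
print(sorted(set(afternoon.index.hour)))
print(len(afternoon))
```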

## The power of pandas: `resample`

A very powerful method is **`resample`: converting the frequency of the time series** (e.g. from hourly to daily data).

The time series has a frequency of 1 minute. We want to change this to daily:

In [None]:
data.resample('D').mean().head()

<div class="alert alert-info">
<b>REMEMBER</b>: <br><br>

The string to specify the new time frequency: http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases <br><br>

These strings can also be combined with numbers, e.g. `'10D'`.

</div>
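
To make the effect of the frequency string concrete, here is a sketch on a hypothetical series with one value every six hours. It resamples once to daily bins and once to two-day bins; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical series: one value every 6 hours for four days.
idx = pd.date_range("2007-01-01", periods=16, freq=pd.Timedelta(hours=6))
s = pd.Series(range(16), index=idx, dtype=float)

daily = s.resample("D").mean()       # one row per day
two_daily = s.resample("2D").mean()  # one row per two-day bin
print(daily)
print(two_daily)
```

Each output row is labelled with the start of its bin, and `mean` can be swapped for any other aggregation such as `sum` or `max`.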



In [None]:
data['P'].resample('M').mean().plot() 

<div class="alert alert-success">

<b>Activity</b>:

 <ul>
  <li>plot the monthly mean and median values of active power (`P`) for the years 2008-2010 <br><br></li>
</ul>
    
**Note** <br>You can create a new figure with `fig, ax = plt.subplots()` and add each of the plots to the created `ax` object (see documentation of pandas plot function)
</div>

In [None]:
subset = data['2008':'2010']['P']
fig, ax = plt.subplots()
subset.resample('M').mean().plot(ax=ax)
subset.resample('M').median().plot(ax=ax)
ax.legend(["mean", "median"])

In [None]:
subset.resample('M').agg(['mean', 'median']).plot()

<div class="alert alert-success">

<b>Activity</b>:

 <ul>
  <li>plot the monthly minimum and maximum of the daily average voltage (`V`)</li>
</ul>
</div>

In [None]:
daily = data['V'].resample('D').mean() # daily averages calculated

In [None]:
daily.resample('M').agg(['min', 'max']).plot() # monthly minimum and maximum values of these daily averages
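
The two-step logic (daily averages first, then monthly extremes of those averages) can be checked by hand on a tiny example. The sketch below uses hypothetical voltage readings and, as a swapped-in technique, groups the daily averages by calendar month with `to_period("M")` instead of a second `resample` call.

```python
import pandas as pd

# Hypothetical readings: two per day, from 30 January to 2 February.
idx = pd.date_range("2007-01-30", periods=8, freq=pd.Timedelta(hours=12))
v = pd.Series([240.0, 242.0, 238.0, 236.0, 241.0, 243.0, 239.0, 245.0], index=idx)

# Step 1: daily averages.
daily = v.resample("D").mean()

# Step 2: minimum and maximum of the daily averages within each month.
monthly = daily.groupby(daily.index.to_period("M")).agg(["min", "max"])
print(daily)
print(monthly)
```

The order of the steps matters: the monthly maximum of daily averages is generally lower than the monthly maximum of the raw readings.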