# Table of Contents
* [Lecture 8 - Introduction to Time Series Data](#Lecture-8---Introduction-to-Time-Series-Data)
	* &nbsp;
		* [Content](#Content)
		* [Learning Outcomes](#Learning-Outcomes)
* [Importing Time Series Data](#Importing-Time-Series-Data)
* [Converting into Time Series Data](#Converting-into-Time-Series-Data)
* [Filtering Time Series Data](#Filtering-Time-Series-Data)
* [Resampling](#Resampling)
	* &nbsp;
		* [Moving (rolling/running) statistics](#Moving-%28rolling/running%29-statistics)
	* [Shift operations](#Shift-operations)
		* [Exercise:](#Exercise:)


# Lecture 8 - Introduction to Time Series Data

---

### Content

1. Importing time series data
2. Time series data types and conversions
3. Time series filtering
4. Time series resampling
5. Plotting time series

### Learning Outcomes

At the end of this lecture, you should be able to:

* import time series data
* convert datasets into appropriate time series data types
* filter dataframes based on time series conditions
* perform resampling of time series data
* perform running averages on time series data
* visualise time series data

The overall goal of Pandas is to become "the most powerful and flexible open source data analysis / manipulation tool available in any language", and it is already well on its way toward realizing this. One of the domains in which Pandas excels and has become a proven tool is time series data analysis.

Time series data is a sequence of data points consisting of measurements made over a time interval. The interval is continuous, consecutive data points are equally spaced in time, and each moment in time has at most one data point.
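
As a minimal sketch of this definition, the snippet below builds a small, made-up daily series and checks the two properties described above: equal spacing and at most one value per timestamp. The dates and values are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example: ten daily observations starting on an arbitrary date.
index = pd.date_range(start='2016-01-01', periods=10, freq='D')
series = pd.Series(np.arange(10, dtype=float), index=index)

# Consecutive timestamps are the same distance apart ...
gaps = series.index.to_series().diff().dropna()
evenly_spaced = gaps.nunique() == 1

# ... and each moment in time carries at most one observation.
one_value_per_timestamp = series.index.is_unique

print(evenly_spaced, one_value_per_timestamp)
```

Real-world data such as stock prices often breaks the equal-spacing property (weekends and holidays have no observations), so checks like these can be a useful first step.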

Time series analysis is a substantial topic. The aim here is to provide a brief introduction to processing, manipulating and visualising time series data using a small subset of Pandas capabilities.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from matplotlib import rcParams  # rcParams gives us some control over plot rendering attributes
rcParams['figure.figsize'] = 15, 10

In [None]:
#this line enables the plots to be embedded into the notebook
%matplotlib inline

# Importing Time Series Data

Below is a dataset extracted from Yahoo Finance showing the daily Apple stock price movements from 1980 to February 2016.

In [None]:
ts_data = pd.read_csv('appleStockPrice.csv')
ts_data.head()

In [None]:
ts_data.tail()

Notice that the imported observations are in descending order with respect to Date. We will deal with this later.

Examine the data types for each of the columns.

In [None]:
ts_data.info()

Notice that the 'Date' column is an 'object' data type. This means that it has been interpreted as a 'string' rather than as a 'date' data type.

# Converting into Time Series Data

Below is an example of how we can convert a column that is interpreted as a string, into a datetime datatype.

In [None]:
ts_data['Date'] = pd.to_datetime(ts_data['Date'], format='%Y-%m-%d')
ts_data.info()

Notice the 'format' specification and how it exactly matches the layout of the original string.
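
To illustrate how the format string has to mirror the text, here is a small sketch using made-up dates written as day/month/year. The sample strings are assumptions for illustration and do not come from the Apple dataset.

```python
import pandas as pd

# Hypothetical date strings written as day/month/year.
raw = pd.Series(['25/12/2015', '01/02/2016'])

# The format string mirrors the text: %d = day, %m = month, %Y = four-digit year.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
print(parsed.dt.day.tolist(), parsed.dt.month.tolist())

# A format that does not match the text is rejected rather than guessed at.
try:
    pd.to_datetime(raw, format='%Y-%m-%d')
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

Specifying the format explicitly removes ambiguity: without it, a string such as '01/02/2016' could be read as either 1 February or 2 January.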

In [None]:
ts_data.info()
ts_data.head()

The Date column is now a datetime64 data type. Notice that the appearance of the Date column has not changed, which is why it is important to check that the data types are as you would like them to be for each column.



**Exercise:** Use the pd.read_clipboard() function to read the below data into a dataframe. Then convert the column Date into a datetime data type that has the format of Year/Month/Day as above:  

In [None]:
#ts_data['Date'].apply(lambda x: x.strftime('%d/%m/%Y'))
d = pd.read_clipboard()
d

We can perform more powerful manipulation and processing if we make the Date column the index.

In [None]:
ts_data = ts_data.set_index(['Date'])
ts_data.head()

It is important to know how to manually convert columns into datetime and make them into a dataframe index; however, when reading in a csv file, we can do all of the above automatically in future by specifying a couple of parameters. 

In [None]:
ts_data = pd.read_csv('appleStockPrice.csv', index_col='Date', parse_dates=True)
ts_data.info()
ts_data.head()
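
The two parameters can be tried out without the CSV file on disk. The sketch below reads a tiny in-memory extract; the rows and numbers are invented for illustration, and the column names are assumed to resemble those in the real file.

```python
import io
import pandas as pd

# Hypothetical three-row extract with made-up values (newest row first).
csv_text = """Date,Adj Close,Volume
2016-02-03,12.0,300
2016-02-02,11.0,200
2016-02-01,10.0,100
"""

# index_col picks the column to use as the index; parse_dates=True asks
# pandas to try to parse that index as dates.
sample = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

print(type(sample.index).__name__)
print(sample.index.is_monotonic_decreasing)
```

The index should come back as a DatetimeIndex, still in the descending order in which the rows were stored.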

# Filtering Time Series Data

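In [None]:
# all rows from 2016 (partial string indexing on the datetime index)
ts_data.loc['2016']

In [None]:
# all rows from February 2016
ts_data.loc['2016-02']

In [None]:
# the month can also be written without zero padding
ts_data.loc['2016-2']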

**Exercise:** Filter the above dataframe to only display values from October 2015 to December 2015. 

Because the index is in descending (the 'wrong') order, it is somewhat less intuitive to work with.

We can reorder the index to make things easier.

In [None]:
ts_data.sort_index(ascending=True, inplace=True)
ts_data.head()

**Exercise:** Filter the above dataframe to only display values after January 15 2015.


Filtering can also be done with *truncate()*, which is simply a convenience function equivalent to slicing. Below is an example of filtering the data to just the December 2015 and January 2016 observations:


In [None]:
ts_data.truncate(before='2015-12-1', after='2016-1-31')
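
The claim that truncating is equivalent to slicing can be checked on a small example. The sketch below uses made-up daily data rather than the Apple dataset, and compares *truncate()* with a label-based slice over the same dates.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering November 2015 to February 2016.
idx = pd.date_range('2015-11-01', '2016-02-29', freq='D')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# Keep December 2015 and January 2016 in two different ways.
by_truncate = df.truncate(before='2015-12-01', after='2016-01-31')
by_slice = df.loc['2015-12-01':'2016-01-31']

print(by_truncate.equals(by_slice))
print(len(by_truncate))
```

Both end points are included in the result, which differs from ordinary Python list slicing where the end is excluded.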

**Exercise:** Use the truncate function to filter the above dataframe to only display values after November 2015.

# Resampling

Resampling transforms time series data into a different frequency (e.g., converting hourly data into daily data). Pandas provides an easy way to perform these frequency conversion operations, which are extremely common in financial applications but are not limited to them.

Resampling requires two things: 1) the resampling time period, and 2) the reduction method to apply to each resampled group (the mean is a common choice). For those familiar with SQL, resampling is essentially a time-based **groupby** operation, followed by a reduction method on each of its groups.

Reduction can be: 'mean', 'median', 'sum', 'min', 'max', 'first', 'last', 'ohlc', or any other available NumPy or user-defined function.

A variety of built-in reduction time frequencies are available:

In [None]:
# 'M' is the month-end frequency alias (spelled 'ME' from pandas 2.2 onwards)
ts_data.loc['2015'].resample('M').mean()
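
The view of resampling as a time-based groupby followed by a reduction can be sketched on a small, made-up series. The data below is hypothetical, and the month-start alias 'MS' is used here only because it labels each group by the first day of the month.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for the first quarter of 2015.
idx = pd.date_range('2015-01-01', '2015-03-31', freq='D')
daily = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Group by calendar month and reduce each group with the mean.
monthly_groupby = daily.groupby([daily.index.year, daily.index.month]).mean()

# Resampling to month-start frequency gives the same numbers,
# labelled by the first day of each month.
monthly_resample = daily.resample('MS').mean()

print(np.allclose(monthly_groupby.values, monthly_resample.values))

# Several reductions can be requested at once.
summary = daily.resample('MS').agg(['min', 'max', 'mean'])
print(summary)
```

The groupby version needs the grouping keys spelled out by hand, whereas resample derives them from the frequency string.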

**Exercise:** Resample the above dataframe at the quarter-end frequency, using data between 1990 and 2010 and the median as the reduction method.

**Exercise:** Describe what the output of the code below means.

Visualising the data is as simple as calling *plot()* on the required column:

In [None]:
ts_data.head()

In [None]:
#ts_data[['Volume']].resample('M').sum().plot()
plt.plot(ts_data[['Volume']].resample('M').sum())

We can increase the size of the plot and render several plots at the same time:

In [None]:
rcParams['figure.figsize'] = 15, 10
ts_data.plot(subplots=True)

**Exercise:** The period leading up to the recent global financial crisis and its immediate aftermath are interesting to examine in more detail from the perspective of the adjusted closing price and the total volume of Apple shares traded. Render two separate plots for these columns for data from 2007 to 2010.

We can use resampling to reduce the frequency of the Apple share data to annual and plot the historical variation between the min, max and mean prices for Apple shares in each year:

In [None]:
# 'A' is the year-end frequency alias (spelled 'YE' from pandas 2.2 onwards)
plt.plot(ts_data[['Adj Close']].resample('A').mean())
plt.plot(ts_data[['Adj Close']].resample('A').min())
plt.plot(ts_data[['Adj Close']].resample('A').max())


**Exercise:** Render a graph that is the same as above, only this time use a 5 year frequency:

### Moving (rolling/running) statistics

A rolling average is a series of averages of different subsets of the full data set as defined by a filter window.

It is a widely used indicator that helps smooth out price movements by filtering out the noise from random fluctuations.

In [None]:
ts_data[['Adj Close']].rolling(window=5).mean().head(10)

In [None]:
ts_data[['Adj Close']].rolling(window=5).mean().plot(style='-g')
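
To make the idea of a filter window concrete, the sketch below computes a rolling mean on a short, made-up series and compares it with averages worked out by hand. The values and the window size of 3 are assumptions chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical prices, chosen so the averages are easy to verify by eye.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Each output value is the mean of the current value and the two before it.
rolled = prices.rolling(window=3).mean()

# The same averages computed by hand for every position with a full window.
manual = [prices.iloc[i - 2:i + 1].mean() for i in range(2, len(prices))]

print(rolled.tolist())
print(manual)

# min_periods=1 fills the start of the series using however many values exist.
relaxed = prices.rolling(window=3, min_periods=1).mean()
print(relaxed.tolist())
```

The first two results of the plain rolling mean are missing because a full window of three values is not yet available at those positions.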

**Exercise:** Generate rolling mean plots on the Volume column for the Apple share trading data. Determine the most 'useful' window size.

## Shift operations

“Shifting” refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for performing this operation.

If we wanted to calculate the difference in price from one observation to the next (something very common in time series analysis), pandas provides a method called shift(), which allows us to select a column and move the data in it up or down by a given number of rows.

In our case, the data is daily, so shifting the column by one places each day's price next to that of the previous trading day.

In [None]:
ts_data.head()

In [None]:

ts_data['shifted'] = ts_data['Adj Close'].shift(1)
ts_data
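
Once a shifted copy of a column exists, the change between consecutive rows is a simple subtraction. The sketch below shows this on a short, made-up price series; the dates and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical prices on five consecutive days.
idx = pd.date_range('2016-01-04', periods=5, freq='D')
price = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0], index=idx)

previous = price.shift(1)   # value from the previous row; the first row has none
change = price - previous   # row-to-row difference

# diff() is a shortcut for the same subtraction.
print(change.equals(price.diff()))
print(change.tolist())
```

Positive entries in the result mark rises relative to the previous row and negative entries mark falls; the first entry is missing because there is no earlier row to compare with.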

**Exercise**: Plot the positive and negative fluctuations of the adjusted closing price from one day to the next for the above dataset.

### Exercise: 

Read in the oil_price.csv dataset.

Convert the 'Year' feature into datetime and set it as the index.

Perform the same analysis as above using the 'shift' function and plot the InflationAdjustedPrice difference from one year to the next.
