<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Time Series: Autocorrelation


## Learning Objectives
 
**After this lesson, you will be able to:**
- Define autocorrelation and list some real-world examples.
- Use the Pandas `autocorr()` function to compute autocorrelation.
- Calculate and plot the ACF and PACF using StatsModels and Pandas.
- Explain why autocorrelation poses a problem for models that assume independence.
---    

## Lesson Guide

**Autocorrelation**
- [Autocorrelation](#A)
- [Plotting and Interpreting Autocorrelation Functions](#B)
- [Plotting and Interpreting Partial Autocorrelation Functions](#C)
- [Problems Posed by Autocorrelation](#D)
- [Interpreting the ACF and PACF](#E)
- [Independent Practice](#F)
----

<h2><a id = "A">Autocorrelation</a></h2>

In previous weeks, our analyses have been concerned with the correlation between two or more variables (height and weight, education and salary, etc.). In time series data, autocorrelation is a measure of _how correlated a variable is with itself_.

Specifically, autocorrelation measures how closely related earlier values are with values that occur later in time.

Examples of autocorrelation include:

* In stock market data, the stock price at one point is correlated with the stock price of the point that's directly prior in time. 
    
* In sales data, sales on a Saturday are likely correlated with sales on the next Saturday and the previous Saturday, as well as with other days to a greater or lesser extent.

> **Check:** What are some examples of autocorrelation that you can think of in the real world?

### How Do We Compute Autocorrelation?

${\Huge R(k) = \frac{\operatorname{E}[(X_{t} - \mu)(X_{t-k} - \mu)]}{\sigma^2}}^*$

To compute [autocorrelation](https://en.wikipedia.org/wiki/Autocorrelation), we fix a **lag**, _k_, which is the delta between the given point and the prior point used to compute the [correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient).

- With a _k_ value of one, we'd compute how correlated a value is with the prior one. 
- With a _k_ value of 10, we'd compute how correlated a variable is with one that's 10 time points earlier.

$^*$ Note that this formula assumes *stationarity* of the process.
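
To make the formula concrete, here is a minimal sketch that computes a sample version of $R(k)$ directly with NumPy. The helper name `autocorr_at_lag` and the toy weekly series are assumptions made for illustration; they are not part of the lesson's data set.

```python
import numpy as np

def autocorr_at_lag(x, k):
    """Sample version of R(k): lagged covariance divided by the variance."""
    x = np.asarray(x, dtype=float)
    if k == 0:
        return 1.0
    mu = x.mean()
    var = x.var()  # population variance (ddof=0), playing the role of sigma^2
    # Average the lagged products over the full series length n.
    return np.sum((x[k:] - mu) * (x[:-k] - mu)) / (len(x) * var)

# Hypothetical series with a repeating 7-step pattern plus a little noise.
rng = np.random.default_rng(0)
weekly = np.tile([5.0, 6.0, 7.0, 8.0, 9.0, 12.0, 3.0], 20)
toy = weekly + rng.normal(scale=0.5, size=weekly.size)

r1 = autocorr_at_lag(toy, 1)
r7 = autocorr_at_lag(toy, 7)
print("lag 1:", round(r1, 3))
print("lag 7:", round(r7, 3))
```

On this toy series, the lag-7 value should come out much larger than the lag-1 value, because values seven steps apart sit at the same position in the repeating pattern.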

### Guided Practice

In the last section, we looked at the Rossmann Drugstore data to learn how to handle time series data in Pandas. We'll use this same data set to look for autocorrelation. 

We'll import the data and reduce the scope down to one store. Also recall that we need to preprocess the data in Pandas (convert the time data to a `datetime` object and set it as the index of the DataFrame). 

In [None]:
import pandas as pd
import numpy as np
from datetime import timedelta
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = [15, 8]
plt.rcParams['font.size'] = 14
plt.style.use('fivethirtyeight')

In [None]:
pharm = pd.read_csv('data/rossmann.csv', skipinitialspace=True, low_memory=False)
pharm['Date'] = pd.to_datetime(pharm['Date'])
pharm = pharm.set_index('Date')

pharm_store1 = pharm[pharm['Store'] == 1]
print("{} rows by {} columns".format(pharm_store1.shape[0], pharm_store1.shape[1]))
pharm_store1.head()

### Computing Autocorrelation

To compute autocorrelation using the Pandas `.autocorr()` function, we enter the parameter for `lag`. Recall that **lag** is the delta between the given point and the prior point used to compute the autocorrelation. 

In [None]:
pharm_store1['Sales'].autocorr(lag=1)

In [None]:
pharm_store1['Sales'].autocorr(lag=10)

Just like with correlation between different variables, the data become more correlated as the absolute value of this number moves closer to one.
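
As a rough check on what `.autocorr()` is doing, the sketch below compares it with the ordinary Pearson correlation between a series and a copy of itself shifted by `lag` steps. The random-walk series is made up for illustration and stands in for any time-indexed `Series`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical daily series: a random walk, so neighbouring values are closely related.
s = pd.Series(
    rng.normal(size=200).cumsum(),
    index=pd.date_range("2020-01-01", periods=200, freq="D"),
)

lag = 1
via_autocorr = s.autocorr(lag=lag)
via_shift = s.corr(s.shift(lag))  # Pearson correlation with the lagged copy

print(via_autocorr, via_shift)
```

The two numbers should agree, which is a useful way to remember that autocorrelation is just correlation applied to a series and its own past.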

<h2><a id = "B">Plotting Autocorrelation Functions Using StatsModels and Pandas</a></h2>

Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrelations for data values at varying time lags. 
* If a time series is random, such autocorrelations should be near zero for any and all time-lag separations. 
* If a time series is non-random, then one or more of the autocorrelations will be significantly non-zero. 

The horizontal lines displayed in the plot correspond to 95% and 99% confidence bands. The dashed lines mark the 99% confidence band.
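
The randomness check can also be sketched numerically. The example below assumes the common large-sample approximation of $\pm 1.96/\sqrt{n}$ for a 95% band and counts how many of the first 30 lags fall outside it, once for pure noise and once for a made-up series with a period-7 pattern. Both series are synthetic and only meant to illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
noise = pd.Series(rng.normal(size=n))  # purely random series
# Synthetic series with a period-7 cycle plus some of the same noise.
seasonal = pd.Series(np.sin(2 * np.pi * np.arange(n) / 7)) + 0.3 * noise

# Approximate 95% band for the autocorrelations of a random series.
band_95 = 1.96 / np.sqrt(n)

lags = range(1, 31)
noise_acf = np.array([noise.autocorr(lag=k) for k in lags])
seasonal_acf = np.array([seasonal.autocorr(lag=k) for k in lags])

noise_outside = int((np.abs(noise_acf) > band_95).sum())
seasonal_outside = int((np.abs(seasonal_acf) > band_95).sum())
print("noise lags outside band:   ", noise_outside)
print("seasonal lags outside band:", seasonal_outside)
```

For the noise series, only a handful of lags (if any) should land outside the band by chance, while the seasonal series should exceed it at nearly every lag.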

More about correlograms [here](https://en.wikipedia.org/wiki/Correlogram). Details on the `Pandas` function `autocorrelation_plot()` and an example can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization-autocorrelation).

In [None]:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(pharm_store1.Sales);

StatsModels also comes with some convenient functions for calculating and plotting autocorrelation. Load up these two functions and try them out.

In [None]:
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

print(acf(pharm_store1.Sales.values))

In [None]:
plot_acf(pharm_store1.Sales.values, lags=30)
plt.show()

This plots the correlation between the series and a lagged copy of itself for the lags indicated on the horizontal axis. For instance, at lag `0`, the series is perfectly correlated with itself, so the first point is at `1.0`. The points that fall outside of the blue shaded band indicate significant correlation values. Big jumps in autocorrelation appear at lags that are multiples of seven. Our sales data are daily, so it makes a lot of sense that a single Monday's sales would be correlated with the prior Monday's (and the one before it... and so on).

> * `Pandas` `autocorr()` passes subseries of the original series to `np.corrcoef`. Inside this method, the **sample mean and sample variance of these subseries** are used to determine the correlation coefficient.
> * `StatsModels` `acf()`, in contrast, uses the **overall series sample mean and sample variance** to determine the correlation coefficient.
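
The difference between the two estimators is easiest to see on a short, trending series. The sketch below compares Pandas `autocorr()` with a hand-rolled function, `acf_full_series`, that follows the overall-mean, overall-variance recipe described in the second bullet. That helper is a hypothetical illustration written for this example, not StatsModels code.

```python
import numpy as np
import pandas as pd

def acf_full_series(x, k):
    """Lag-k autocorrelation using the mean and variance of the whole series.

    Hand-rolled sketch of the overall-mean estimator; not library code.
    """
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.sum(d[k:] * d[:-k]) / np.sum(d * d)

# Hypothetical short series with a perfect linear trend.
s = pd.Series(np.arange(20, dtype=float))

k = 5
subseries_version = s.autocorr(lag=k)       # uses subseries means and variances
full_version = acf_full_series(s, k)        # uses overall mean and variance
print("pandas autocorr:    ", subseries_version)
print("full-series version:", full_version)
```

For a perfectly linear series, the subseries version stays at 1 for every lag, while the full-series version shrinks as the lag grows. On long, roughly stationary series the two tend to be close.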

Our data set here isn't *stationary* (the mean, the variance, and/or the covariance vary over time), so it isn't appropriate to try to diagnose what forecasting model we should use. However, we can see the seasonality of the data set clearly in the ACF.

<h2><a id= "C">Partial Autocorrelation and the Partial Autocorrelation Function (PACF)</a></h2>

Another important chart for diagnosing your time series is the partial autocorrelation chart (PACF). This is similar to autocorrelation, but, instead of being just the correlation at increasing lags, it is the correlation at a given lag, _controlling for the effect of previous lags._

Given a time series $z_{t}$, the partial autocorrelation of lag $k$, denoted $\alpha(k)$, is the autocorrelation between $z_{t}$ and $z_{{t+k}}$ with the linear dependence of $z_{t}$ on $z_{{t+1}}$ through $z_{{t+k-1}}$ removed.

In other words, a partial autocorrelation is a summary of the relationship between an observation in a time series and observations at prior time steps, with the relationships of intervening observations removed. 

**The autocorrelation for an observation and an observation at a prior time step is comprised of both the direct correlation and indirect correlations. These indirect correlations are a linear function of the correlation of the observation, with observations at intervening time steps. It is these indirect correlations that the partial autocorrelation function seeks to remove.**

[A Gentle Introduction to Autocorrelation and Partial Autocorrelation](https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/)
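
One way to see what "controlling for previous lags" means is the regression view: regress $z_t$ on its first $k$ lags and read off the coefficient on the lag-$k$ term. The sketch below does this with NumPy on a simulated AR(1) process. The helper `pacf_by_regression` and the simulated data are assumptions made for illustration, and this is only one of several ways to estimate a partial autocorrelation.

```python
import numpy as np

def pacf_by_regression(x, k):
    """Coefficient on x_{t-k} when x_t is regressed on x_{t-1}, ..., x_{t-k}."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    y = x[k:]
    # Column j holds the series lagged by j + 1 steps, aligned with y.
    lagged = np.column_stack([x[k - j - 1 : n - j - 1] for j in range(k)])
    design = np.column_stack([np.ones(n - k), lagged])  # intercept + lags
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[-1]

# Simulated AR(1) process: each value depends directly only on the previous one.
rng = np.random.default_rng(0)
n, phi = 2000, 0.7
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

p1 = pacf_by_regression(x, 1)
p2 = pacf_by_regression(x, 2)
acf2 = np.corrcoef(x[2:], x[:-2])[0, 1]  # plain lag-2 autocorrelation
print("partial, lag 1:", round(p1, 3))
print("partial, lag 2:", round(p2, 3))
print("plain,   lag 2:", round(acf2, 3))
```

The plain lag-2 autocorrelation is clearly positive because the lag-1 dependence propagates forward, but the lag-2 partial autocorrelation should be close to zero once lag 1 is accounted for.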

In [None]:
# Load up the sister functions for partial autocorrelation from StatsModels.

from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_pacf

print(pacf(pharm_store1.Sales.values))

plot_pacf(pharm_store1.Sales.values, lags=30)
plt.show()

This plots the correlation at a given lag (indicated by the horizontal axis), controlling for all of the previous lags. We continue to see big jumps in correlation at the weekly time lags, an indicator that seasonality is still present in our time series. 

> **Check:** How might seasonality in a data set (monthly, weekly, etc.) show up in autocorrelation plots?

<h2><a id = "D">Problems Posed by Autocorrelation</a></h2>

Models like linear regression require that there be little or no autocorrelation in the data. That is, linear regression requires that the residuals (error terms) be independent of one another. So far, we have assumed that all of the observations in our models are independent, but this is unlikely with time series data, because the temporal ordering of the observations means that they will often contain autocorrelation. 

> **What are some problems that could arise when using autocorrelated data with a linear model?**
* Estimated regression coefficients are still unbiased, but they no longer have the minimum variance property.
* The MSE may seriously underestimate the true variance of the errors.
* The standard error of the regression coefficients may seriously underestimate the true standard deviation of the estimated regression coefficients.
* Statistical intervals and inference procedures are no longer strictly applicable.
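
These points can be illustrated with a small simulation. The sketch below repeatedly fits ordinary least squares to a made-up linear trend with AR(1) errors, then compares the actual spread of the slope estimates with the average standard error that the textbook OLS formula reports. All settings (`n`, `phi`, the true intercept and slope) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, phi = 100, 500, 0.8
x = np.arange(n, dtype=float)           # time index as the single regressor
X = np.column_stack([np.ones(n), x])    # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)

slopes, reported_se = [], []
for _ in range(n_sims):
    # AR(1) errors: each error carries over 80% of the previous one.
    shocks = rng.normal(size=n)
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = phi * e[t - 1] + shocks[t]
    y = 2.0 + 0.5 * x + e               # true slope is 0.5

    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    mse = resid @ resid / (n - 2)       # usual OLS estimate of the error variance
    slopes.append(beta[1])
    reported_se.append(np.sqrt(mse * XtX_inv[1, 1]))  # textbook slope standard error

mean_slope = np.mean(slopes)
true_sd = np.std(slopes)
avg_reported_se = np.mean(reported_se)
print("average slope estimate:   ", mean_slope)
print("spread of slope estimates:", true_sd)
print("average reported SE:      ", avg_reported_se)
```

The average slope should sit near the true value of 0.5, consistent with the coefficients remaining unbiased, while the reported standard error should be noticeably smaller than the actual spread of the estimates.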



**Important Takeaways**
* Autocorrelation is a measure of how dependent a data point is on previous data points.
* Investigating ACF and PACF plots can help us identify an appropriate forecasting model and look for seasonality in our time series data.
* Simple linear regression is not appropriate for autocorrelated data without adjustment, because the errors are no longer independent.

<h2><a id = "D">Differencing a Time Series and Stationarity</a></h2>

If a time series is **stationary**, the mean, variance, and autocorrelation will be constant over time. Forecasting methods typically assume that the time series you are forecasting is stationary, or at least approximately stationary.

The most common way to make a time series stationary is through "differencing." This procedure converts a time series into the difference between values.

$$ \Delta y_t = y_t - y_{t-1} $$

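This removes trends in the time series and stabilizes the mean over time. For a series with a linear trend, the differenced series has a roughly constant mean equal to the slope, which is zero only when there is no drift. In most cases, a single difference is enough, although some series need a second difference (or third, etc.) to remove trends.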
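
The effect of a first difference can be checked on a small synthetic example. The sketch below builds a made-up series with a linear trend plus noise, applies `.diff()`, and compares the mean of each half of the series before and after differencing. The slope, noise level, and series length are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
# Hypothetical series: linear trend with slope 0.5 plus noise.
y = pd.Series(0.5 * np.arange(n) + rng.normal(scale=2.0, size=n))

# y_t - y_{t-1}; the first value is NaN and is dropped.
first_diff = y.diff().dropna()

half = n // 2
y_gap = y.iloc[half:].mean() - y.iloc[:half].mean()
diff_gap = first_diff.iloc[half:].mean() - first_diff.iloc[:half].mean()

print("shift in mean of y between halves:   ", y_gap)
print("shift in mean of diff between halves:", diff_gap)
print("lag-1 autocorrelation of y:", y.autocorr(lag=1))
```

The mean of the original series drifts upward from one half to the next, while the mean of the differenced series should stay roughly constant near the slope of the trend.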

In [None]:
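# Seasonal (lag-7) difference: each value minus the value seven rows earlier
# (one week apart for daily data). Use periods=1 for the first difference
# shown in the formula above.
diff = pharm_store1['Sales'].diff(periods=7)

In [None]:
fig, ax = plt.subplots()
pharm_store1['Sales'].plot(ax=ax)
diff.plot(ax=ax)
ax.legend(['Sales', 'Seasonal Difference (lag 7)']);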

> **Check:** How does differencing help with problems of non-stationarity and autocorrelation in time series data?

<h2><a id = "E">Independent Practice</a></h2>

**Instructor Note:** These are optional and can be assigned as student practice questions outside of class.

### Import the European Retail data set, preprocess the data, and create an initial plot.

In [None]:
import pandas as pd
import numpy as np
from datetime import timedelta
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.rcParams['figure.figsize'] = [15, 8]
plt.rcParams['font.size'] = 14
plt.style.use('fivethirtyeight')

In [None]:
euro = pd.read_csv('./data/euretail.csv')
euro.head()

In [None]:
# Set Year column to index.

In [None]:
#  Use `.stack()` to stack the prescribed level(s) from columns to index.
stacked = euro.stack()
stacked.head(10)

In [None]:
# Plot the data.

### Use `plot_acf` and `plot_pacf` to look at the autocorrelation in the data set.

In [None]:
# Plot the ACF of the stacked data with 30 lags.

In [None]:
# Plot the PACF of the stacked data with 30 lags.

### Interpret your findings.

Our ACF and PACF plots show that there is still a lot of autocorrelation in the data set, which means that our retail data points are not independent. The PACF plot also suggests some seasonality in the data set.