# **Time Series Analysis**

We will now apply some basic time series prediction techniques to our data. We load the data the same way as in the `exploratory_data_analysis` notebook.

In [4]:
# Data science packages
import pandas as pd 

TSA_DATA_PATH = '../data/tsa_checkins.csv'

In [5]:
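# Load the data and reverse the row order
checkin_data = pd.read_csv(TSA_DATA_PATH)[::-1]
# Parse the date column so it can be compared against date bounds
checkin_data['Date'] = pd.to_datetime(checkin_data['Date'])
# Remove the COVID restriction period (bounds are month/day/year)
restrictions_start = '2/2/2020'
restrictions_end = '11/8/2021'
outside_restrictions = (checkin_data['Date'] < restrictions_start) | (checkin_data['Date'] > restrictions_end)
checkins_ex_covid = checkin_data[outside_restrictions]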

## Partial Autocorrelation Function

#### PACF vs ACF

To understand the partial autocorrelation function (PACF), we must first understand the autocorrelation function (ACF). The ACF measures how strongly the value at one time step is correlated with the value a given number of lags earlier. For example, consider the following three time steps:

In [6]:
# +---------------------+
# |                     |
# t - 2 ---> t - 1 ---> t

We can see that the previous time step, `t-1`, *directly* affects the current step `t`. The step from two periods ago, `t-2`, also *directly* affects `t`. However, `t-2` also *indirectly* affects `t`, because `t-2` affects `t-1`, which in turn affects `t`.

The ACF is essentially the ordinary Pearson correlation coefficient computed for the pairs `{t, t-1}` and `{t, t-2}`, so it accounts for the *indirect* effect as well as the direct one.
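To make the "Pearson correlation with a lagged copy" idea concrete, here is a minimal sketch. It uses a synthetic AR(1) series rather than the TSA data, and the helper `lagged_correlation` is a hypothetical name introduced only for this illustration. Note that textbook ACF estimators normalize by the full-sample mean and variance, so their values can differ slightly from this pairwise Pearson version.

```python
import numpy as np
import pandas as pd

# Synthetic AR(1) series (an assumption for illustration, not the TSA data):
# each value is 0.8 times the previous value plus Gaussian noise.
rng = np.random.default_rng(0)
n = 2000
values = np.zeros(n)
for i in range(1, n):
    values[i] = 0.8 * values[i - 1] + rng.normal()
series = pd.Series(values)


def lagged_correlation(s, lag):
    """Pearson correlation between s and a copy of s shifted by `lag` steps."""
    # shift() introduces NaNs at the start; corr() ignores incomplete pairs.
    return s.corr(s.shift(lag))


acf_lag1 = lagged_correlation(series, 1)
acf_lag2 = lagged_correlation(series, 2)
print(f"lag 1: {acf_lag1:.3f}, lag 2: {acf_lag2:.3f}")
```

In this synthetic series `t-2` only influences `t` through `t-1`, yet the lag-2 correlation is still clearly nonzero. That is the indirect effect the ACF picks up.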

The PACF instead focuses only on the *direct* relationship between each pair of lags. We use the PACF because the ACF value can be misleading: the `t-2` lag may look like a useful predictor only because of its indirect effect, which really comes from the `t-1` lag. By focusing on the direct effect alone, we can identify which lags are genuinely useful predictors.

#### Deriving the PACF 

We can derive the PACF using the *conditional correlation coefficient* or through regression.

Finding the PACF for $t_{i-3}$ with the *conditional correlation coefficient* method would yield...

$\frac{\mathrm{Cov}(t_i, t_{i-3} \mid t_{i-1}, t_{i-2})}{\sigma_{t_i \mid t_{i-1}, t_{i-2}} \, \sigma_{t_{i-3} \mid t_{i-1}, t_{i-2}}}$

We can generalize this to any number of terms, but we are essentially finding the correlation of this term (in our example $t_{i-3}$) with the current value, given that we already know the effect of all the lags in between (here $t_{i-1}$ and $t_{i-2}$).
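One common way to approximate this conditional correlation is to regress both $t_i$ and $t_{i-3}$ on the lags in between and then correlate the two sets of residuals. The sketch below does this with NumPy on a synthetic AR(1) series (not the TSA data). The helper `residuals` is a hypothetical name, and the residual approach assumes the relationships between lags are linear.

```python
import numpy as np

# Synthetic AR(1) series (an assumption for illustration, not the TSA data).
rng = np.random.default_rng(1)
n = 5000
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.8 * x[i - 1] + rng.normal()


def residuals(target, predictors):
    """Residuals of a least-squares fit of target on predictors plus an intercept."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


# Align the series with its first three lags (all arrays have length n - 3).
current = x[3:]   # t_i
lag1 = x[2:-1]    # t_{i-1}
lag2 = x[1:-2]    # t_{i-2}
lag3 = x[:-3]     # t_{i-3}

# Remove what the in-between lags explain from both t_i and t_{i-3}.
between = np.column_stack([lag1, lag2])
res_current = residuals(current, between)
res_lag3 = residuals(lag3, between)

plain_corr = np.corrcoef(current, lag3)[0, 1]             # ACF-style value
partial_corr = np.corrcoef(res_current, res_lag3)[0, 1]   # PACF-style value
print(f"plain: {plain_corr:.3f}, partial: {partial_corr:.3f}")
```

For this synthetic series the plain lag-3 correlation is sizeable, while the partial value should sit close to zero, since all of the lag-3 influence passes through the nearer lags.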

Using regression...

Let's now find the PACF for $t_{i-2}$. We first fit a model that predicts $t_i$ from just $t_{i-1}$:

$t_i = \beta_{0} + \beta_{1} t_{i-1} + \epsilon$

where $\epsilon$ is white noise. Not all of $t_i$ can be explained by $t_{i-1}$ alone, so we add $t_{i-2}$ to the model. Its coefficient, $\beta_{2}$, captures the direct contribution of $t_{i-2}$ and is the partial autocorrelation at lag 2:

$t_i = \beta_{0} + \beta_{1} t_{i-1} + \beta_{2} t_{i-2} + \epsilon$

**NOTE**: this is overly simplified, as the current value would likely need many more previous lags to explain it. The same idea generalizes to as many lags as we need: to find the PACF at lag $k$, fit the model with lags 1 through $k$ and read off the coefficient on $t_{i-k}$.
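The regression approach can be sketched for an arbitrary lag as follows. This is a minimal illustration under stated assumptions: `pacf_by_regression` is a hypothetical helper, it uses a plain least-squares fit from NumPy, and it is tested on a synthetic AR(2) series rather than the TSA data.

```python
import numpy as np


def pacf_by_regression(x, k):
    """Estimate the lag-k PACF as the coefficient on lag k when regressing on lags 1..k."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    target = x[k:]                                        # t_i
    # Column j holds t_{i-j}; every column has length n - k.
    lags = [x[k - j : n - j] for j in range(1, k + 1)]
    design = np.column_stack([np.ones(n - k)] + lags)     # intercept + lags
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[-1]                                       # coefficient on t_{i-k}


# Synthetic AR(2) series (an assumption for illustration):
# t_i = 0.5 * t_{i-1} + 0.3 * t_{i-2} + noise
rng = np.random.default_rng(2)
n = 5000
x = np.zeros(n)
for i in range(2, n):
    x[i] = 0.5 * x[i - 1] + 0.3 * x[i - 2] + rng.normal()

pacf_lag1 = pacf_by_regression(x, 1)
pacf_lag2 = pacf_by_regression(x, 2)
pacf_lag3 = pacf_by_regression(x, 3)
print(f"lag 1: {pacf_lag1:.3f}, lag 2: {pacf_lag2:.3f}, lag 3: {pacf_lag3:.3f}")
```

Because the synthetic series depends directly on only its first two lags, the lag-2 estimate should land near the true coefficient of 0.3 and the lag-3 estimate near zero. In practice, a library routine such as the `pacf` function in `statsmodels` would normally be used instead of a hand-rolled fit.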