# Week 2 - ARIMA

AutoRegressive Integrated Moving Average models (known as ARIMA models) are the classic time-series statistical model. These models can accommodate many of the idiosyncrasies that arise when observations are correlated through time. In this lecture, we will discuss each of the three components of the model (autoregression, integration, and moving averages), as well as how they can be used. We will also address the question of when to use each of these model features based on the properties of a specific data set.

By the end of this lesson, you should be able to:
- Describe the three components of an ARIMA model
- Explain why time series data does not meet the requirements of the OLS model
- Combine data and Python to implement an ARIMA model

## 1. Time Series Data
---

#### Time Series Data
- Time series data consists of repeated observations of a **single variable**, $y$, tracked over time, $t$.
$$
y = \{y_1, y_2, y_3, \ldots, y_t\}
$$

In [17]:
import pandas as pd
import plotly.express as px

data = pd.read_csv("https://github.com/dustywhite7/Econ8310/raw/master/DataSets/omahaNOAA.csv")
data = data.loc[len(data)-365*24:, ['DATE', 'HOURLYDRYBULBTEMPC']]
data.columns = ['date', 'temp_c']
data = data.loc[data['temp_c']!=0].copy() # temp=0 is a 'missing value', which is annoying but fixable; .copy() avoids a SettingWithCopyWarning on the next line
data['date'] = pd.to_datetime(data['date'])
# And plot it
fig = px.line(data, x='date', y='temp_c')
fig

- We seek to predict $y_{t+1}$ using the information from previous observations of $y$
- In order to estimate $y_{t+1}$, we need to find the effect of previous observations of $y$ on the upcoming period. We might write this model as
$$y_{t+1} = \alpha + \sum^t_{s=1} \beta_s * y_s + \epsilon$$
- However, this model violates our OLS model assumptions! We need tools to overcome this

#### Autocorrelation
- Autocorrelation occurs when the value of a series in one period is systematically related to its value in the next period
- The absence of autocorrelation is one of the primary assumptions of the OLS model, and is represented as $Cov(\epsilon_t, \epsilon_s) = 0, \forall t \ne s$
- When we work with time-series data, we frequently observe that one observation is correlated with the next observation in the sequence. Because observations are correlated, our data is not independent and identically distributed, and therefore the standard assumptions of OLS do not hold.
- We need to find a model that can eliminate the autocorrelation almost always seen in time series data
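
As a rough illustration of what autocorrelation looks like in practice, the sketch below compares the lag-1 autocorrelation of independent noise with that of a persistent series. Both series are simulated, and the persistence value of 0.9 is an arbitrary choice for the example rather than anything estimated from the Omaha data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Independent draws: no link between one period and the next
noise = pd.Series(rng.normal(size=n))

# Persistent series: each value keeps 90% of the previous value plus a new shock
# (0.9 is an assumed value chosen for illustration)
values = np.zeros(n)
for t in range(1, n):
    values[t] = 0.9 * values[t - 1] + noise.iloc[t]
persistent = pd.Series(values)

# Correlation between each observation and the one immediately before it
noise_ac = noise.autocorr(lag=1)
persistent_ac = persistent.autocorr(lag=1)
print(f"lag-1 autocorrelation, independent noise: {noise_ac:.2f}")
print(f"lag-1 autocorrelation, persistent series: {persistent_ac:.2f}")
```

The first value should sit near zero, while the second should be well above zero, which is the pattern that breaks the OLS independence assumption.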

#### Upgrading OLS for time-series
- One of the best ways to account for violations of the standard assumptions is to eliminate the violation of the assumption from our data, and then use OLS.
- This will enable us to take advantage of the interpretability of OLS models, while also using more interesting data to make forecasts.

## 2. AutoRegressive Models (AR)
---

#### What are Autoregressive Models?
- AR models are based on the premise that deviation from the underlying trend in the data persists in all future observations
- AR models are specified to contain a chosen number of lagged observations of $y$ as explanatory variables (they become our $x$ variables), and to use those lags to predict the next value of $y$ in the time-series.
- AR models imply a relationship between the current observation and observations infinitely far in the past, because each lag itself depends on the lags before it
$$y_{t} = \alpha + \sum^p_{i=1} \rho_i * y_{t-i} + \epsilon_t$$
- Here, $\rho_i$ is the correlation term between periods, $p$ is the number of lags, and $\epsilon_t$ is an error (shock) term
- All of the information from the past persists into every future observation
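
One way to see the AR equation at work is to simulate a series with known coefficients and then recover them by regressing each observation on its own lags. The sketch below assumes an AR(2) process with hypothetical coefficients 0.5 and 0.3 and fits it by least squares with `numpy.linalg.lstsq`. It is an illustration of the idea, not the estimation routine used later in this lesson.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
true_coefs = [0.5, 0.3]  # assumed AR(2) coefficients for the simulation

# Simulate y_t = 0.5 * y_{t-1} + 0.3 * y_{t-2} + shock
y = np.zeros(n)
for t in range(2, n):
    y[t] = true_coefs[0] * y[t - 1] + true_coefs[1] * y[t - 2] + rng.normal()

# Lagged copies of y become the explanatory (x) variables
df = pd.DataFrame({"y": y})
df["lag1"] = df["y"].shift(1)
df["lag2"] = df["y"].shift(2)
df = df.dropna()

# Least-squares fit of y on a constant and its two lags
X = np.column_stack([np.ones(len(df)), df["lag1"], df["lag2"]])
coefs, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
alpha_hat, lag1_hat, lag2_hat = coefs
print(f"estimated lag coefficients: {lag1_hat:.2f}, {lag2_hat:.2f}")
```

With a sample this long, the estimates should land close to the coefficients used to generate the data.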

#### Order for AR Models
- The number of lagged observations is called the **order** of the AR model
    - The model is an AR(p) Model, where p is the order of the model
    - We say AutoRegressive model of order $p$
- The AR coefficients/order (number of lags) tell us how quickly a model returns to its mean
    - If the coefficients on the AR variables add up to close to 1, then the model reverts to its mean slowly (persistent effect of past observations)
    - If the coefficients sum to near zero, then the model reverts to its mean quickly
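
To make the link between coefficient size and mean reversion concrete, the sketch below traces a single one-unit shock through two AR(1) models. The coefficients 0.95 and 0.2 are hypothetical values picked to contrast slow and fast reversion, and `shock_path` is a helper written only for this example.

```python
import numpy as np

def shock_path(coef, periods=20):
    """Deviation from the mean in each period after a one-unit shock, AR(1) case."""
    path = np.empty(periods)
    path[0] = 1.0
    for t in range(1, periods):
        # With no new shocks, the deviation shrinks by the AR coefficient each period
        path[t] = coef * path[t - 1]
    return path

slow = shock_path(0.95)  # coefficient near 1: the shock lingers
fast = shock_path(0.2)   # coefficient near 0: the shock fades almost immediately

for t in (1, 5, 10):
    print(f"t={t:2d}  coef=0.95: {slow[t]:.3f}   coef=0.2: {fast[t]:.6f}")
```

After ten periods the first series still retains more than half of the original shock, while the second has effectively returned to its mean.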

## 2. Integration in Time Series
---

One of the most pervasive problems in time-series data is a time trend: a pattern that explains the movement in the series
- It turns out that a time-series with a time trend is what we call non-stationary, with stationarity being an important element of assumption 4
- A non-stationary series is one with a non-uniform mean or variance over time

The good news is that we can use differenced (integrated) models to fix this problem

#### Integrated Models
- A process is called integrated when it is non-stationary but becomes stationary after differencing. A common example of a non-stationary process is one that contains a linear time trend.
- For example, a long-term series of stock prices

- We need to ensure that our data is stationary. To do so, we need to remove any time-trend from the data.
    - This is typically done through differencing: $y^s_t = y_t - y_{t-1}$
- The goal is to predict the difference between today and tomorrow by using the differences from each pair of days in the past
    - This removes the non-stationary / time trend component of the model
    - This allows us to have a valid model
- Differencing converts the non-stationary data into stationary data with the same mean throughout time

- The integration term $d$ represents the number of differencing operations performed on the data, typically with $d \in \{1, 2\}$
    - $I(1): y^s_t = y_t - y_{t-1}$
    - $I(2): y^s_t = (y_t - y_{t-1}) - (y_{t-1} - y_{t-2})$
        - I(2) resembles a difference-in-differences model or second derivative, rather than simply a subtraction of two previous periods.
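
The differencing operations above can be sketched with the pandas `diff` method. The two series here are made-up deterministic trends (one linear, one quadratic), chosen so the effect of each round of differencing is easy to read off.

```python
import numpy as np
import pandas as pd

t = np.arange(10)
linear = pd.Series(5 + 2 * t)   # linear time trend
quadratic = pd.Series(t ** 2)   # trend whose slope also changes over time

# I(1): one round of differencing removes a linear trend
linear_d1 = linear.diff().dropna()

# One round is not enough for the quadratic series...
quadratic_d1 = quadratic.diff().dropna()

# ...but I(2), the difference of the differences, flattens it
quadratic_d2 = quadratic.diff().diff().dropna()

print(linear_d1.tolist())
print(quadratic_d1.tolist())
print(quadratic_d2.tolist())
```

The first and third lists are constant, while the second still trends upward, which is the situation in which $d = 2$ is needed.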

Non-stationary data:

![Alt text](../Figures/non%20stationary%20data.png)

The same data converted into stationary data by calculating the difference between each pair of consecutive days:

![Alt text](../Figures/Integreated%20Model%20Pattern.png)

## 3. Moving Average Models
---

#### Moving Average Models
- While an AR(.) model accounts for previous values of the dependent variable, MA(.) models account for previous values of the error term
- In an MA model, we want to predict tomorrow's value by using the error term from today
- Like an AR model, we choose the order of our MA model by incorporating additional error terms from past periods in the model. In the case of the MA Model, the order is denoted as $q$
$$y_t = \alpha + \sum^q_{i=1} \theta_i * \epsilon_{t-i} + \epsilon_t$$

- An MA model suggests that the current value of a time-series depends linearly on previous error terms
    - Current value depends on how far away from the underlying trend previous periods fall
    - The larger $\theta$ becomes, the more persistent those error terms are
- Errors in the MA model are not correlated with errors in the next or previous period; they are independent observations

- AR models' effects last infinitely far into the future
    - Each observation is dependent on the observations before
- In an MA model, the effect of previous periods only persists for $q$ periods
    - Because each error is uncorrelated with previous errors
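
The contrast between the two models can be sketched by feeding a single one-unit error into each and watching how long it affects the series. The MA(2) weights (0.6, 0.3) and the AR(1) coefficient (0.6) are hypothetical values chosen only for this illustration.

```python
import numpy as np

periods = 8
shock = np.zeros(periods)
shock[0] = 1.0  # a single one-unit error in period 0, no errors afterwards

# MA(2): y_t = eps_t + 0.6 * eps_{t-1} + 0.3 * eps_{t-2}
thetas = [0.6, 0.3]  # assumed MA weights
ma_response = shock.copy()
for i, theta in enumerate(thetas, start=1):
    # Add the error from i periods ago, scaled by its weight
    ma_response[i:] += theta * shock[:-i]

# AR(1): y_t = 0.6 * y_{t-1} + eps_t
ar_response = np.zeros(periods)
for t in range(periods):
    previous = ar_response[t - 1] if t > 0 else 0.0
    ar_response[t] = 0.6 * previous + shock[t]

print("MA(2):", np.round(ma_response, 3))
print("AR(1):", np.round(ar_response, 3))
```

The MA(2) response is exactly zero from the third period after the shock onward, while the AR(1) response shrinks every period without ever reaching zero.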

#### Moving Average Orders
- An order of 1 means that we only use yesterday's error to predict today's outcome

- MA models may help us to model shocks to a macro system, while an AR component would predict the overall development trends

## 4. Implementing the ARIMA Model
---

As statisticians/economists/analysts, we combine these three tools for handling common time-series problems to form one of the most-used time-series models around: the AutoRegressive Integrated Moving Average (ARIMA) Model.

#### The ARIMA Model
- AutoRegressive Integrated Moving Average (ARIMA) Models allow us to:
    - Include lags of the dependent variable
    - Take differences to eliminate trends
    - Include lagged error terms

- An ARIMA model is said to have order $(p, d, q)$, where $p$, $d$, and $q$ are the parameters denoting the order of the autoregressive terms, integration terms, and moving average terms, respectively.
    - It is often a matter of guessing and checking to find the correct specification for a model
    - Usually, a model contains only AR terms or only MA terms, not both
    - Example: ARIMA(0, 1, 1)

#### ARIMA in Python

In [20]:
import statsmodels.api as sm

# Can't have missing data, but also don't want to drop hours, so we will
#   forward-fill the data with the last known temperature as our best guess
#   of missing data (ffill replaces the deprecated fillna(method='pad'))
data = data.ffill()

In [22]:
data

| | date | temp_c |
|---|---|---|
| 633345 | 2016-08-17 15:52:00 | 31.7 |
| 633346 | 2016-08-17 16:52:00 | 31.1 |
| 633347 | 2016-08-17 17:52:00 | 30.6 |
| 633348 | 2016-08-17 18:00:00 | 30.6 |
| 633349 | 2016-08-17 18:52:00 | 29.4 |
| ... | ... | ... |
| 642100 | 2017-05-01 19:52:00 | 8.9 |
| 642101 | 2017-05-01 20:52:00 | 6.1 |
| 642102 | 2017-05-01 21:52:00 | 7.2 |
| 642103 | 2017-05-01 22:52:00 | 5.6 |


##### Testing for Stationarity
- We can use the Augmented Dickey-Fuller Test to determine whether or not our data is stationary.
    - $H_0$: A unit root is present in our data
    - $H_A$: The data is stationary
- This can help us to determine whether or not differencing our data is required or sufficient for inducing stationarity

In [26]:
import statsmodels.tsa.stattools as st
st.adfuller(data['temp_c'].dropna()[-10000:], maxlag=30)

(-4.5366869902461575,
 0.0001687078834400472,
 30,
 8272,
 {'1%': -3.4311407797891897,
  '5%': -2.861889469588051,
  '10%': -2.566956017840825},
 22300.938828584087)

In [28]:
# Implementing an ARIMA(3,1,0) model
arima = sm.tsa.ARIMA(data['temp_c'], order=(3, 1, 0)).fit()
arima.summary()


An unsupported index was provided and will be ignored when e.g. forecasting.





| | | | |
|---|---|---|---|
| Dep. Variable: | temp_c | No. Observations: | 8303 |
| Model: | ARIMA(3, 1, 0) | Log Likelihood | -11726.725 |
| Date: | Sun, 05 Feb 2023 | AIC | 23461.45 |
| Time: | 13:36:10 | BIC | 23489.547 |
| Sample: | 0 - 8303 | HQIC | 23471.05 |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| ar.L1 | 0.2922 | 0.008 | 37.277 | 0.000 | 0.277 | 0.308 |
| ar.L2 | 0.2168 | 0.008 | 25.607 | 0.000 | 0.200 | 0.233 |
| ar.L3 | 0.0515 | 0.009 | 5.824 | 0.000 | 0.034 | 0.069 |
| sigma2 | 0.9872 | 0.009 | 109.267 | 0.000 | 0.969 | 1.005 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.01 | Jarque-Bera (JB): | 5180.37 |
| Prob(Q): | 0.93 | Prob(JB): | 0.0 |
| Heteroskedasticity (H): | 1.0 | Skew: | -0.19 |
| Prob(H) (two-sided): | 0.97 | Kurtosis: | 6.85 |


In [35]:
from datetime import timedelta
import plotly.graph_objects as go

# Generate forecast of next 10 hours
fcast = arima.forecast(steps=10)

# Build hourly timestamps that continue from the last observed date
last_time = data['date'].iloc[-1]
times = [last_time + timedelta(hours=i) for i in range(1, 11)]

# Generate data frame based on forecast
forecast = pd.DataFrame({'date': times, 'temp_c': fcast.to_numpy()})

# Plot the last 200 observations, then overlay the forecast as markers
fig = px.line(data[-200:], x='date', y='temp_c')
fig.add_trace(go.Scatter(x=forecast["date"], y=forecast["temp_c"], mode='markers', name='Forecast'))

fig


No supported index is available. Prediction results will be given with an integer index beginning at `start`.

