# ARIMA Model

Source:

ning.oreilly.com/library/view/machine-learning-for/9781119682363/c04.xhtml#head-2-16

In [4]:
!pip install pmdarima
!pip install --upgrade matplotlib
!pip install --upgrade pandas


An **ARMA method** consists of two parts:

- An autoregressive (AR) model
- A moving average (MA) model

Compared with the autoregressive and moving average models, ARMA models provide the most efficient linear model of stationary time series, since they are capable of modeling the unknown process with the minimum number of parameters (Zhang et al. 2015).

In particular, ARMA models are used to describe weakly stationary stochastic time series in terms of two polynomials. The first of these polynomials is for the autoregression, the second for the moving average. This method is often referred to as the ARMA(p,q) model, in which:

- p stands for the order of the autoregressive polynomial, and
- q stands for the order of the moving average polynomial.
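
For reference, the two polynomials can be written out explicitly. The symbols below ($c$ for a constant, $\varphi_i$ and $\theta_j$ for the AR and MA coefficients, $\varepsilon_t$ for white noise) are standard textbook notation assumed here, not notation taken from the source:

$$
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
$$

Setting $q = 0$ gives a pure AR(p) model, and setting $p = 0$ gives a pure MA(q) model.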

Here we will see how to simulate time series from AR(p), MA(q), and ARMA(p,q) processes as well as fit time series models to data based on insights gathered from the ACF and PACF.
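
As a minimal sketch of what simulating these processes can look like, the following uses only the Python standard library. The helper names `simulate_arma` and `sample_autocorr`, the coefficient values, and the burn-in length are illustrative assumptions, not part of the source notebook:

```python
import random


def simulate_arma(phi, theta, n, sigma=1.0, seed=0, burn_in=100):
    """Simulate n observations of a zero-mean ARMA(p, q) process.

    phi   -- list of AR coefficients (empty list gives a pure MA process)
    theta -- list of MA coefficients (empty list gives a pure AR process)
    """
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    total = n + burn_in
    eps = [rng.gauss(0.0, sigma) for _ in range(total)]  # white noise
    x = []
    for t in range(total):
        # Autoregressive part: weighted sum of the previous p values
        ar = sum(phi[i] * x[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        # Moving average part: weighted sum of the previous q noise terms
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        x.append(ar + eps[t] + ma)
    # Drop the burn-in so the start-up transient does not affect the sample
    return x[burn_in:]


def sample_autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (one point of the ACF)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return cov / var


ar1 = simulate_arma(phi=[0.7], theta=[], n=500)        # AR(1)
ma1 = simulate_arma(phi=[], theta=[0.5], n=500)        # MA(1)
arma11 = simulate_arma(phi=[0.7], theta=[0.5], n=500)  # ARMA(1,1)

print('AR(1) lag-1 autocorrelation:', round(sample_autocorr(ar1, 1), 3))
print('MA(1) lag-2 autocorrelation:', round(sample_autocorr(ma1, 2), 3))
```

For an AR(1) process the lag-1 autocorrelation should land near the AR coefficient, while for an MA(1) process the autocorrelation should be close to zero beyond lag 1. This is the kind of ACF pattern used to choose p and q.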

**Autoregressive integrated moving average (ARIMA)** models are considered a development of the simpler autoregressive moving average (ARMA) models and include the notion of integration.

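The main differences between ARMA and ARIMA methods are the notions of integration and differencing. An ARMA model is a stationary model, and it works very well with stationary time series. An ARIMA model extends it to non-stationary series by differencing the data until it becomes stationary and then fitting an ARMA model to the differenced series.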

ARIMA models have three main components, denoted as p, d, and q:

- p stands for the number of lag observations included in the ARIMA model, also called the lag order.
- d stands for the number of times that the raw values in a time series data set are differenced, also called the degree of differencing.
- q denotes the size of the moving average window, also called the order of the moving average.
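
To make the degree of differencing concrete, here is a small sketch in plain Python. The helper `difference` and the toy series are illustrative assumptions; they show that one round of differencing removes a linear trend and two rounds remove a quadratic one:

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y[t] = x[t] - x[t-1]."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


linear = [2 * t + 5 for t in range(8)]   # linear trend
quadratic = [t * t for t in range(8)]    # quadratic trend

print(difference(linear, d=1))     # [2, 2, 2, 2, 2, 2, 2]
print(difference(quadratic, d=2))  # [2, 2, 2, 2, 2, 2]
```

Each round of differencing shortens the series by one observation, which is why a model with d = 1 works with one fewer data point than the raw series.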

Let's now take a look at an extension of the ARIMA model in Python, called SARIMAX, which stands for seasonal autoregressive integrated moving average with exogenous factors.

Data scientists usually apply SARIMAX when they have to deal with time series data sets that have seasonal cycles, since SARIMAX models support both seasonality and exogenous factors.

SARIMAX requires another set of p, d, and q arguments for the seasonality aspect (conventionally written P, D, and Q) as well as a parameter called s, which is the periodicity of the seasonal cycle in your time series data set, that is, the number of time steps in one full cycle.
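
The role of s is easiest to see through seasonal differencing, which subtracts the value observed one full cycle earlier. The sketch below is plain Python with a hypothetical helper `seasonal_difference` and a synthetic hourly series (a daily sine cycle plus a linear trend); none of it comes from the source data set:

```python
import math


def seasonal_difference(series, s):
    """Seasonal differencing with period s: y[t] = x[t] - x[t-s]."""
    return [series[t] - series[t - s] for t in range(s, len(series))]


s = 24  # hourly data with a daily cycle, as in seasonal_order = (P, D, Q, 24)

# Synthetic series: daily sine cycle plus a slow upward trend of 0.1 per hour
hourly = [10 + 5 * math.sin(2 * math.pi * t / s) + 0.1 * t for t in range(3 * s)]

deseasonalized = seasonal_difference(hourly, s)

# The daily cycle cancels out; only the trend accumulated over one cycle remains
print(round(min(deseasonalized), 6), round(max(deseasonalized), 6))
```

With hourly data and a daily pattern, s = 24 is the natural choice. Weekly seasonality in the same data would correspond to s = 168.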

In [5]:
# Import necessary libraries
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_percentage_error
import math
import pandas as pd
import numpy as np

In [6]:
# Load data set
df = pd.read_csv('data/GEFCom2014-E.csv')
print(df.head())
df['Date'] = pd.to_datetime(df['Date'])
df['DateTime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='h')
df.set_index('DateTime', inplace=True)
df = df.rename_axis(None)
print(df)
df.shape


In [None]:
train = df['T'].iloc[:95000]
test = df['T'].iloc[95000:]
print('Train data shape: ', train.shape)
print('Test data shape: ', test.shape)
train = train.to_frame()
test = test.to_frame()
print(train.head())
print(test.head())

In [None]:
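# Fit the scaler on the training data only, mapping it to the range [0, 1]
scaler = MinMaxScaler()
train['T'] = scaler.fit_transform(train)
train.head()
 
# Apply the same (train-fitted) scaling to the test data;
# test values outside the training range fall outside [0, 1]
test['T'] = scaler.transform(test)
test.head()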
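
To show what the scaling step does, here is a plain-Python sketch of min-max scaling. The helpers `fit_min_max`, `scale`, and `unscale` and the toy temperatures are hypothetical stand-ins for the scaler used in the cell, chosen only to illustrate why the minimum and maximum are taken from the training data alone:

```python
def fit_min_max(values):
    """Learn the scaling parameters (minimum and maximum) from training data."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError('cannot scale a constant series')
    return lo, hi


def scale(values, lo, hi):
    """Map values so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(values, lo, hi):
    """Invert the scaling to recover the original units."""
    return [v * (hi - lo) + lo for v in values]


train_temps = [30.0, 50.0, 70.0]
test_temps = [40.0, 80.0]

lo, hi = fit_min_max(train_temps)          # parameters come from train only
scaled_train = scale(train_temps, lo, hi)  # [0.0, 0.5, 1.0]
scaled_test = scale(test_temps, lo, hi)    # [0.25, 1.25]

print(scaled_train, scaled_test)
print(unscale(scaled_test, lo, hi))        # [40.0, 80.0]
```

Reusing the training minimum and maximum for the test set avoids leaking information from the test period into the model, at the cost of some scaled test values falling outside the unit interval.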

In [None]:
# Specify the number of steps to forecast ahead
HORIZON = 3
print('Forecasting horizon:', HORIZON, 'hours')

In [None]:
# Define the order and seasonal order for the SARIMAX model
order = (4, 1, 0)
seasonal_order = (1, 1, 0, 24)

In [None]:
# Build and fit the SARIMAX model
model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)
results = model.fit()
 
print(results.summary())

In [None]:
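# Walk-forward validation:
# For each test timestamp, add the actual values at the following HORIZON-1 steps
# as extra columns, so each row holds the targets for one multi-step forecast
test_shifted = test.copy()
 
for t in range(1, HORIZON):
    # Shift by t hours (a Timedelta avoids the deprecated 'H' frequency alias)
    test_shifted['T+'+str(t)] = test_shifted['T'].shift(-t, freq=pd.Timedelta(hours=1))
    
test_shifted = test_shifted.dropna(how='any')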

We can now make predictions on the test data. For demonstration, we use a simpler model by lowering the non-seasonal autoregressive order from 4 to 2, while the seasonal order stays the same:

In [None]:
%%time
# Make predictions on the test data
training_window = 720
 
train_ts = train['T']
test_ts = test_shifted
 
history = [x for x in train_ts]
history = history[(-training_window):]
 
predictions = []
 
# Use a simpler model for demonstration
order = (2, 1, 0)
seasonal_order = (1, 1, 0, 24)
 
for t in range(test_ts.shape[0]):
    model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)
    model_fit = model.fit()
    yhat = model_fit.forecast(steps = HORIZON)
    predictions.append(yhat)
    obs = list(test_ts.iloc[t])
    # move the training window
    history.append(obs[0])
    history.pop(0)
    print(test_ts.index[t])
    print(t+1, ': predicted =', yhat, 'expected =', obs)
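
The rolling-window logic of the loop above can be isolated from the model itself. The sketch below is plain Python and swaps in a seasonal naive forecaster (repeat the value from one season earlier) as a stand-in for SARIMAX, so it runs instantly. The function names and the toy periodic series are assumptions made for illustration:

```python
def seasonal_naive_forecast(history, steps, s):
    """Stand-in model: each forecast repeats the value from s steps earlier."""
    extended = list(history)
    for _ in range(steps):
        extended.append(extended[-s])
    return extended[-steps:]


def walk_forward(train, test, horizon, window, s):
    """Forecast `horizon` steps ahead at each test point with a sliding window."""
    history = list(train)[-window:]  # keep only the most recent observations
    predictions = []
    # Stop early enough that every forecast has `horizon` actual values to compare
    for t in range(len(test) - horizon + 1):
        predictions.append(seasonal_naive_forecast(history, horizon, s))
        # Move the training window: add the newest observation, drop the oldest
        history.append(test[t])
        history.pop(0)
    return predictions


series = [0, 1, 2, 3] * 6  # perfectly periodic toy data with period 4
toy_train, toy_test = series[:16], series[16:]

toy_predictions = walk_forward(toy_train, toy_test, horizon=3, window=8, s=4)
for t, yhat in enumerate(toy_predictions):
    print(t + 1, ': predicted =', yhat, 'expected =', toy_test[t:t + 3])
```

In the notebook, the stand-in forecaster is replaced by refitting SARIMAX on `history` at every step, which is why the real loop is far slower than this sketch.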

In [None]:
# Compare predictions to actual temperature
eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, HORIZON+1)])
eval_df['timestamp'] = test.index[0:len(test.index)-HORIZON+1]
eval_df = pd.melt(eval_df, id_vars='timestamp',
                  value_name='prediction', var_name='h')
eval_df['actual'] = np.array(np.transpose(test_ts)).ravel()
eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])

In [None]:
# Compute the mean absolute percentage error (MAPE) for each forecast step
if HORIZON > 1:
    eval_df['APE'] = (eval_df['prediction'] -
        eval_df['actual']).abs() / eval_df['actual'].abs()
    print(eval_df.groupby('h')['APE'].mean())
 

In [None]:
# Print one-step forecast MAPE (scikit-learn expects y_true first, then y_pred)
one_step = eval_df[eval_df['h'] == 't+1']
print('One step forecast MAPE: ',
      mean_absolute_percentage_error(one_step['actual'], one_step['prediction'])*100, '%')
 
# Print multi-step forecast MAPE
print('Multi-step forecast MAPE: ',
      mean_absolute_percentage_error(eval_df['actual'], eval_df['prediction'])*100, '%')
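
As a quick check on what the metric measures, MAPE can be computed by hand. The helper `mape` and the toy numbers below are illustrative assumptions; the sketch also shows that the metric is not symmetric, because each error is divided by the actual value:

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (multiply by 100 for %)."""
    if len(actual) != len(predicted):
        raise ValueError('inputs must have the same length')
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


toy_actual = [100.0, 200.0, 400.0]
toy_predicted = [110.0, 180.0, 400.0]

# Errors relative to the actual values: 10/100, 20/200, 0/400
print('MAPE:', round(mape(toy_actual, toy_predicted) * 100, 2), '%')

# Swapping the arguments divides by the predictions instead and changes the result
print('Swapped:', round(mape(toy_predicted, toy_actual) * 100, 2), '%')
```

Because of the division by the actual value, MAPE is undefined when an actual value is zero and becomes unstable when actual values are close to zero.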