# ARIMA Model

Source:

ning.oreilly.com/library/view/machine-learning-for/9781119682363/c04.xhtml#head-2-16

In [1]:
!pip install pmdarima
!pip install --upgrade matplotlib
!pip install --upgrade pandas


An **ARMA model** consists of two parts:

- An autoregressive (AR) model
- A moving average (MA) model

Compared with the autoregressive and moving average models, ARMA models provide the most efficient linear model of stationary time series, since they are capable of modeling the unknown process with the minimum number of parameters (Zhang et al. 2015).

In particular, ARMA models are used to describe weakly stationary stochastic time series in terms of two polynomials. The first of these polynomials is for the autoregression, the second for the moving average. This method is often referred to as the ARMA(p,q) model, in which:

- p stands for the order of the autoregressive polynomial, and
- q stands for the order of the moving average polynomial.
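
Written out, one common form of the ARMA(p,q) model is shown below. The symbols $c$, $\varphi_i$, $\theta_j$, and $\varepsilon_t$ are notation assumed here for illustration and are not taken from the source:

$$
X_t = c + \sum_{i=1}^{p} \varphi_i X_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
$$

Here $\varepsilon_t$ is a white-noise shock at time $t$. Setting $q = 0$ leaves a pure AR(p) model, and setting $p = 0$ leaves a pure MA(q) model.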

Here we will see how to simulate time series from AR(p), MA(q), and ARMA(p,q) processes as well as fit time series models to data based on insights gathered from the ACF and PACF.
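
As a rough illustration of the simulation step, the sketch below generates AR(1), MA(1), and ARMA(1,1) series in plain Python and computes a sample autocorrelation by hand. The helper names `simulate_arma` and `autocorr`, the coefficients 0.7 and 0.5, and the burn-in length are all assumptions made for this example, not part of the original notebook.

```python
import random

def simulate_arma(ar, ma, n, seed=0, burn_in=200):
    """Simulate n points of an ARMA(p, q) process with standard normal shocks.

    ar holds the AR coefficients (lag 1 first), ma holds the MA coefficients.
    """
    rng = random.Random(seed)
    p, q = len(ar), len(ma)
    x = [0.0] * max(p, 1)     # past values, started at zero
    eps = [0.0] * max(q, 1)   # past shocks, started at zero
    out = []
    for _ in range(n + burn_in):
        e = rng.gauss(0.0, 1.0)
        ar_part = sum(ar[i] * x[-1 - i] for i in range(p))
        ma_part = sum(ma[j] * eps[-1 - j] for j in range(q))
        value = ar_part + ma_part + e
        x.append(value)
        eps.append(e)
        out.append(value)
    # Drop the burn-in so the zero start-up values do not affect the sample
    return out[burn_in:]

def autocorr(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var

ar1 = simulate_arma(ar=[0.7], ma=[], n=5000)
ma1 = simulate_arma(ar=[], ma=[0.5], n=5000)
arma11 = simulate_arma(ar=[0.7], ma=[0.5], n=5000)

for name, series in [('AR(1)', ar1), ('MA(1)', ma1), ('ARMA(1,1)', arma11)]:
    acf = [round(autocorr(series, lag), 2) for lag in (1, 2, 3)]
    print(name, 'sample ACF at lags 1-3:', acf)
```

The AR(1) autocorrelations should decay gradually with the lag, while the MA(1) autocorrelations should be close to zero beyond lag 1. That contrast is the kind of insight the ACF provides when choosing p and q.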

**Autoregressive integrated moving average (ARIMA)** models are considered a development of the simpler autoregressive moving average (ARMA) models and include the notion of integration.

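The main difference between ARMA and ARIMA models is the notion of integration, which is implemented through differencing. An ARMA model assumes a stationary series and works very well with stationary time series. An ARIMA model extends this to non-stationary series by differencing the data until it is stationary and then fitting an ARMA model to the differenced values.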

ARIMA models have three main components, denoted p, d, and q:

- p stands for the number of lag variables included in the ARIMA model, also called the lag order.
- d stands for the number of times that the raw values in a time series data set are differenced, also called the degree of differencing.
- q denotes the size of the moving average window, also called the order of the moving average.
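
To make the role of d concrete, here is a minimal sketch of repeated first-order differencing in plain Python. The `difference` helper and the toy quadratic series are assumptions chosen for illustration.

```python
def difference(series, d=1):
    """Apply first-order differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

quadratic = [t * t for t in range(6)]   # 0, 1, 4, 9, 16, 25
print(difference(quadratic, d=1))       # [1, 3, 5, 7, 9]
print(difference(quadratic, d=2))       # [2, 2, 2, 2]
```

One round of differencing turns the quadratic trend into a linear one, and a second round removes the trend entirely. Each round also shortens the series by one observation.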

Let's now take a look at an extension of the ARIMA model in Python, called SARIMAX, which stands for seasonal autoregressive integrated moving average with exogenous factors. 

Data scientists usually apply SARIMAX to time series data sets that have seasonal cycles, since the model supports both seasonality and exogenous factors.

In addition to the nonseasonal p, d, and q, SARIMAX requires a second set of p, d, and q arguments for the seasonal component (often written P, D, and Q) as well as a parameter called s, which is the periodicity of the seasonal cycle in your time series data set.
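
The seasonal differencing controlled by s can be sketched as subtracting the value observed one full cycle earlier. The example below is a plain-Python illustration under assumed inputs: the `seasonal_difference` helper and the synthetic series (a linear trend plus a 24-step sine pattern) are made up, with s = 24 chosen to mirror a daily cycle in hourly data.

```python
import math

def seasonal_difference(series, s):
    """Subtract the value observed s steps earlier: y[t] - y[t - s]."""
    return [series[t] - series[t - s] for t in range(s, len(series))]

S = 24  # assumed period: one day of hourly observations

# Synthetic series: a slow upward trend plus a repeating daily pattern
series = [0.1 * t + 10 * math.sin(2 * math.pi * t / S) for t in range(4 * S)]

deseasonalized = seasonal_difference(series, S)
print(len(series), len(deseasonalized))
print([round(v, 6) for v in deseasonalized[:3]])
```

The repeating pattern cancels out and only the per-cycle change in the trend remains, which is why a seasonal differencing order of 1 with the right s can remove a stable seasonal cycle.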

In [2]:
# Import necessary libraries
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_percentage_error
import math
import pandas as pd
import numpy as np

In [3]:
# Load data set
df = pd.read_csv('data/GEFCom2014-E.csv')
print(df.head())
# Build an hourly DatetimeIndex from the Date and Hour (1-24) columns
df['Date'] = pd.to_datetime(df['Date'])
df['DateTime'] = df['Date'] + pd.to_timedelta(df['Hour'], unit='h')
df.set_index('DateTime', inplace=True)
df = df.rename_axis(None)
print(df)
df.shape

     Date  Hour  load      T
0  1/1/04     1   NaN  37.33
1  1/1/04     2   NaN  37.67
2  1/1/04     3   NaN  37.00
3  1/1/04     4   NaN  36.33
4  1/1/04     5   NaN  36.00


                          Date  Hour    load      T
2004-01-01 01:00:00 2004-01-01     1     NaN  37.33
2004-01-01 02:00:00 2004-01-01     2     NaN  37.67
2004-01-01 03:00:00 2004-01-01     3     NaN  37.00
2004-01-01 04:00:00 2004-01-01     4     NaN  36.33
2004-01-01 05:00:00 2004-01-01     5     NaN  36.00
...                        ...   ...     ...    ...
2014-12-31 20:00:00 2014-12-31    20  4012.0  18.00
2014-12-31 21:00:00 2014-12-31    21  3856.0  16.67
2014-12-31 22:00:00 2014-12-31    22  3671.0  17.00
2014-12-31 23:00:00 2014-12-31    23  3499.0  15.33
2015-01-01 00:00:00 2014-12-31    24  3345.0  15.33

[96432 rows x 4 columns]


(96432, 4)
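
The last row of the output shows a timestamp of 2015-01-01 00:00:00 for Date 2014-12-31 and Hour 24. This follows from adding the hour as an offset to midnight of the given date. The standard-library sketch below reproduces that arithmetic; the `to_timestamp` helper and the `'12/31/14'` input string are assumptions for illustration (the raw CSV text of the final row is not shown above).

```python
from datetime import datetime, timedelta

def to_timestamp(date_str, hour):
    """Combine a month/day/year date string with an hour value in 1..24."""
    day = datetime.strptime(date_str, '%m/%d/%y')
    return day + timedelta(hours=hour)

print(to_timestamp('1/1/04', 1))     # 2004-01-01 01:00:00
print(to_timestamp('12/31/14', 24))  # 2015-01-01 00:00:00
```

Because hours run from 1 to 24 rather than 0 to 23, hour 24 rolls over to midnight of the following day.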

In [4]:
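# Use the first 95,000 hourly temperature readings for training
# and hold out the remaining 1,432 for testing
train = df['T'].iloc[:95000]
test = df['T'].iloc[95000:]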
print('Train data shape: ', train.shape)
print('Test data shape: ', test.shape)
train = train.to_frame()
test = test.to_frame()
print(train.head())
print(test.head())

Train data shape:  (95000,)
Test data shape:  (1432,)
                         T
2004-01-01 01:00:00  37.33
2004-01-01 02:00:00  37.67
2004-01-01 03:00:00  37.00
2004-01-01 04:00:00  36.33
2004-01-01 05:00:00  36.00
                         T
2014-11-02 09:00:00  36.33
2014-11-02 10:00:00  36.67
2014-11-02 11:00:00  36.67
2014-11-02 12:00:00  37.00
2014-11-02 13:00:00  37.33


In [5]:
# Fit the scaler on the train data only, scaling it to the range [0, 1]
scaler = MinMaxScaler()
train['T'] = scaler.fit_transform(train)
train.head()
 
# Apply the same scaling to the test data
# (values outside the train min/max fall outside [0, 1])
test['T'] = scaler.transform(test)
test.head()

                            T
2014-11-02 09:00:00  0.472435
2014-11-02 10:00:00  0.475391
2014-11-02 11:00:00  0.475391
2014-11-02 12:00:00  0.478261
2014-11-02 13:00:00  0.481130
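
The scaling step can be sketched by hand to show why the scaler is fit on the training data only. This is a plain-Python approximation of min-max scaling with made-up numbers; the helper names `fit_min_max`, `transform`, and `inverse_transform` are hypothetical and are not the scikit-learn API.

```python
def fit_min_max(values):
    """Learn the minimum and maximum from the training values."""
    return min(values), max(values)

def transform(values, lo, hi):
    """Map each value to (x - lo) / (hi - lo)."""
    return [(v - lo) / (hi - lo) for v in values]

def inverse_transform(scaled, lo, hi):
    """Undo the scaling to recover the original units."""
    return [s * (hi - lo) + lo for s in scaled]

train_vals = [10.0, 20.0, 30.0]  # made-up training temperatures
test_vals = [25.0, 35.0]         # made-up test temperatures

lo, hi = fit_min_max(train_vals)
print(transform(train_vals, lo, hi))  # [0.0, 0.5, 1.0]
print(transform(test_vals, lo, hi))   # [0.75, 1.25]
```

The test value 35.0 maps to 1.25 because it exceeds the training maximum. Reusing the training minimum and maximum keeps information about the test period out of the preprocessing step.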


In [6]:
# Specify the number of steps to forecast ahead
HORIZON = 3
print('Forecasting horizon:', HORIZON, 'hours')

Forecasting horizon: 3 hours


In [7]:
# Define the order and seasonal order for the SARIMAX model
order = (4, 1, 0)
seasonal_order = (1, 1, 0, 24)

In [8]:
# Build and fit the SARIMAX model
model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)
results = model.fit()
 
print(results.summary())



                                     SARIMAX Results                                      
Dep. Variable:                                  T   No. Observations:                95000
Model:             SARIMAX(4, 1, 0)x(1, 1, 0, 24)   Log Likelihood              290121.522
Date:                            Wed, 29 Jan 2025   AIC                        -580231.044
Time:                                    14:07:52   BIC                        -580174.276
Sample:                                01-01-2004   HQIC                       -580213.776
                                     - 11-02-2014                                         
Covariance Type:                              opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2787      0.003    108.275      0.000       0.274       0.284
ar.L2          0.1628      0.003   

In [9]:
# Walk-forward validation:
# Build one row per forecast origin holding the actual values
# for each of the HORIZON steps (columns T, T+1, ..., T+HORIZON-1)
test_shifted = test.copy()
 
for t in range(1, HORIZON):
    test_shifted['T+'+str(t)] = test_shifted['T'].shift(-t, freq='H')
    
test_shifted = test_shifted.dropna(how='any')
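
The effect of the shifting and `dropna` steps is a table in which each row holds HORIZON consecutive actual values. A minimal plain-Python sketch of the same idea, using a hypothetical `make_horizon_rows` helper and toy values:

```python
def make_horizon_rows(values, horizon):
    """Return rows [y[t], y[t+1], ..., y[t+horizon-1]], dropping incomplete rows."""
    return [values[t:t + horizon] for t in range(len(values) - horizon + 1)]

rows = make_horizon_rows([10, 11, 12, 13, 14], horizon=3)
for row in rows:
    print(row)
# [10, 11, 12]
# [11, 12, 13]
# [12, 13, 14]
```

With n observations and a horizon of h, this leaves n - h + 1 complete rows, since the last h - 1 starting points do not have enough future values.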

We can make predictions on the test data and use a simpler model (by specifying a different order and seasonal order) for demonstration:

In [None]:
%%time
# Make predictions on the test data
training_window = 720
 
train_ts = train['T']
test_ts = test_shifted
 
# Keep only the most recent training_window observations as the fitting history
history = list(train_ts)
history = history[-training_window:]
 
predictions = []
 
# Let's use a simpler model
order = (2, 1, 0)
seasonal_order = (1, 1, 0, 24)
 
for t in range(test_ts.shape[0]):
    model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)
    model_fit = model.fit()
    yhat = model_fit.forecast(steps = HORIZON)
    predictions.append(yhat)
    obs = list(test_ts.iloc[t])
    # move the training window
    history.append(obs[0])
    history.pop(0)
    print(test_ts.index[t])
    print(t+1, ': predicted =', yhat, 'expected =', obs)

2014-11-02 09:00:00
1 : predicted = [0.47806931 0.49952991 0.51475225] expected = [0.47243478260869565, 0.47539130434782617, 0.47539130434782617]
2014-11-02 10:00:00
2 : predicted = [0.49232    0.50633682 0.51530427] expected = [0.47539130434782617, 0.47539130434782617, 0.4782608695652174]
2014-11-02 11:00:00
3 : predicted = [0.48541104 0.49067928 0.49579837] expected = [0.47539130434782617, 0.4782608695652174, 0.48113043478260875]
2014-11-02 12:00:00
4 : predicted = [0.47802851 0.48140483 0.47747334] expected = [0.4782608695652174, 0.48113043478260875, 0.4869565217391304]
2014-11-02 13:00:00
5 : predicted = [0.48172161 0.47784227 0.47838743] expected = [0.48113043478260875, 0.4869565217391304, 0.49852173913043474]
2014-11-02 14:00:00
6 : predicted = [0.47709964 0.47754671 0.47098292] expected = [0.4869565217391304, 0.49852173913043474, 0.5014782608695653]
2014-11-02 15:00:00
7 : predicted = [0.48984456 0.4849518  0.48090806] expected = [0.49852173913043474, 0.5014782608695653, 0.50434



2014-11-04 13:00:00
53 : predicted = [0.61360816 0.62750561 0.63812576] expected = [0.6145217391304348, 0.6232173913043478, 0.6347826086956522]
2014-11-04 14:00:00
54 : predicted = [0.62866576 0.63949529 0.64381271] expected = [0.6232173913043478, 0.6347826086956522, 0.6347826086956522]
2014-11-04 15:00:00
55 : predicted = [0.6326513 0.6357827 0.629184 ] expected = [0.6347826086956522, 0.6347826086956522, 0.6289565217391304]
2014-11-04 16:00:00
56 : predicted = [0.6384536  0.63233155 0.61803837] expected = [0.6347826086956522, 0.6289565217391304, 0.6260869565217392]
2014-11-04 17:00:00
57 : predicted = [0.62766913 0.61240307 0.6089501 ] expected = [0.6289565217391304, 0.6260869565217392, 0.6232173913043478]
2014-11-04 18:00:00
58 : predicted = [0.61418225 0.6110252  0.59191924] expected = [0.6260869565217392, 0.6232173913043478, 0.6232173913043478]
2014-11-04 19:00:00
59 : predicted = [0.62596086 0.60947899 0.59641362] expected = [0.6232173913043478, 0.6232173913043478, 0.6145217391304


2014-11-05 06:00:00
70 : predicted = [0.53635759 0.53990144 0.53906323] expected = [0.5536521739130436, 0.5623478260869565, 0.5652173913043479]
2014-11-05 07:00:00
71 : predicted = [0.56146978 0.56437383 0.59257191] expected = [0.5623478260869565, 0.5652173913043479, 0.5797391304347826]
2014-11-05 08:00:00
72 : predicted = [0.56552932 0.59395371 0.62075586] expected = [0.5652173913043479, 0.5797391304347826, 0.6058260869565217]
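
The mechanics of the rolling window in the loop above can be isolated from the model itself. The sketch below swaps in a naive "repeat the last value" forecaster in place of SARIMAX so it runs instantly; `naive_forecast`, `walk_forward`, and the toy data are assumptions for illustration only.

```python
from collections import deque

def naive_forecast(history, steps):
    """Stand-in model: repeat the last observed value (this is not SARIMAX)."""
    return [history[-1]] * steps

def walk_forward(train, test_rows, window, horizon):
    """Forecast from each origin, then slide the fixed-size window forward."""
    history = deque(train[-window:], maxlen=window)
    predictions = []
    for row in test_rows:
        predictions.append(naive_forecast(history, horizon))
        # Reveal the true value for this step; the oldest value drops out
        history.append(row[0])
    return predictions

train = [1.0, 2.0, 3.0, 4.0]
test_rows = [[5.0, 6.0], [6.0, 7.0], [7.0, 8.0]]
preds = walk_forward(train, test_rows, window=3, horizon=2)
print(preds)  # [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]]
```

Each forecast uses only data available at its origin, and the history length stays constant, which is what `history.append(obs[0])` followed by `history.pop(0)` achieves in the notebook.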




In [None]:
# Compare predictions to actual temperature
eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, HORIZON+1)])
eval_df['timestamp'] = test.index[0:len(test.index)-HORIZON+1]
eval_df = pd.melt(eval_df, id_vars='timestamp',
                  value_name='prediction', var_name='h')
eval_df['actual'] = np.array(np.transpose(test_ts)).ravel()
eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])

In [None]:
# Compute the mean absolute percentage error (MAPE) for each horizon step
if HORIZON > 1:
    eval_df['APE'] = (eval_df['prediction'] -
                      eval_df['actual']).abs() / eval_df['actual']
    print(eval_df.groupby('h')['APE'].mean())
 

In [None]:
# Print one-step forecast MAPE
# (scikit-learn expects the arguments in the order y_true, y_pred)
one_step = eval_df[eval_df['h'] == 't+1']
print('One step forecast MAPE: ',
      mean_absolute_percentage_error(one_step['actual'], one_step['prediction']) * 100, '%')
 
# Print multi-step forecast MAPE
print('Multi-step forecast MAPE: ',
      mean_absolute_percentage_error(eval_df['actual'], eval_df['prediction']) * 100, '%')
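
MAPE averages the absolute error relative to the actual value, so the order of the arguments matters. The hand-rolled `mape` function below is a hypothetical sketch with made-up numbers, shown only to make the formula and the asymmetry visible.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (multiply by 100 for %)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0]
predicted = [110.0, 180.0]

print(mape(actual, predicted))  # 0.1
print(mape(predicted, actual))  # a different value: the divisor is now the prediction
```

scikit-learn's `mean_absolute_percentage_error` takes `y_true` first and `y_pred` second, and like this sketch it returns a fraction rather than a percentage.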