## Assignment 2 - Predicting inflation in Poland in May 2023

Team nickname: Pomeranians

Team members' names:
Justyna Chmielewska 124736,
Wiktoria Sikora 123859

**Dataset:** https://nbp.pl/statystyka-i-sprawozdawczosc/inflacja-bazowa/

**Description of the model used:**
Time Series Forecasting for Inflation Rate using **ARIMA Model**

*Introduction:*
This project focuses on predicting the inflation rate using the ARIMA (Autoregressive Integrated Moving Average) model. By understanding key concepts and steps in time series forecasting, we can effectively utilize ARIMA for accurate predictions.

- *Time Series Forecasting:*
Time series forecasting predicts future values of a variable from its historical values, recorded in sequential order over time. In this project the variable is the inflation rate, which measures changes in the general price level of goods and services over time.

- *ARIMA Model Overview:*
The ARIMA model combines three components: Autoregressive (AR), Integrated (I), and Moving Average (MA).

   The Autoregressive (AR) component captures the relationship between the current value and its past values. It considers the notion that the current inflation rate is influenced by its own historical values.

   The Integrated (I) component focuses on transforming the series into a stationary form by differencing consecutive observations. This step removes trends and ensures reliable modeling and forecasting.

   The Moving Average (MA) component considers the dependency between the current value and the residual errors from previous predictions. It helps capture random shocks or noise in the data.

- *Importance of Stationarity:*
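A stationary series has statistical properties that do not change over time. Trends and seasonality violate this assumption, so they are removed (for example by differencing) before the AR and MA parts of the model are fitted.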

- *ARIMA Model Selection:*
Selecting optimal ARIMA parameters (p, d, q) through analysis of autocorrelation and partial autocorrelation functions.

- *Model Training and Validation:*
Training the ARIMA model using historical data and validating its performance.

- *Forecasting and Evaluation:*
Using the trained ARIMA model to predict future inflation rates.

**ARIMA** is a valuable tool for predicting inflation rates. By following the steps of ARIMA modeling and considering stationarity, accurate forecasts can be achieved.
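
For reference, the three components described above are usually combined into a single equation. The notation below is the standard textbook form and is not taken from the notebook itself: $L$ is the lag operator ($L y_t = y_{t-1}$), $\phi_i$ and $\theta_j$ are the AR and MA coefficients, $c$ is a constant, and $\varepsilon_t$ is white noise.

$$
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right)(1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right)\varepsilon_t
$$

With $d = 0$ this reduces to an ARMA(p, q) model on the original series; with $d = 1$ the model is fitted to the month-to-month changes $y_t - y_{t-1}$.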

In [1]:
# loading libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# loading data
data = pd.read_excel('inflation_rate_poland_2001-2023.xlsx')
data

,Year,Month,Inflation
0,2001,1,7.4
1,2001,2,6.6
2,2001,3,6.2
3,2001,4,6.6
4,2001,5,6.9
...,...,...,...
264,2023,1,16.6
265,2023,2,18.4
266,2023,3,16.1
267,2023,4,14.7


In [3]:
# adding a "Date" column set to the last day of each month
from pandas.tseries.offsets import MonthEnd
data['Date'] = pd.to_datetime(data[['Year', 'Month']].assign(DAY=1)) + MonthEnd(1)
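
The date construction above can be checked on a small example. The sketch below uses a toy frame with the same column names as the notebook's data; the values are illustrative and are not taken from the NBP file.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Toy frame with the same column names as the notebook's data (illustrative values).
toy = pd.DataFrame({'Year': [2023, 2023, 2024], 'Month': [1, 2, 2]})

# to_datetime assembles dates from year/month/day columns; DAY=1 supplies the missing day.
first_of_month = pd.to_datetime(toy[['Year', 'Month']].assign(DAY=1))

# Adding MonthEnd(1) to the first day of a month rolls forward to that month's last day.
toy['Date'] = first_of_month + MonthEnd(1)
print(toy['Date'].dt.strftime('%Y-%m-%d').tolist())
# ['2023-01-31', '2023-02-28', '2024-02-29']
```

Note that `MonthEnd(1)` applied to a date that is already a month end would jump to the following month, which is why the day is fixed to 1 first.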

In [4]:
# sorting rows by "Date" in ascending order
data = data.sort_values(by=['Date'])
data

,Year,Month,Inflation,Date
0,2001,1,7.4,2001-01-31
1,2001,2,6.6,2001-02-28
2,2001,3,6.2,2001-03-31
3,2001,4,6.6,2001-04-30
4,2001,5,6.9,2001-05-31
...,...,...,...,...
264,2023,1,16.6,2023-01-31
265,2023,2,18.4,2023-02-28
266,2023,3,16.1,2023-03-31
267,2023,4,14.7,2023-04-30


In [5]:
# selecting data for modeling (.copy() avoids pandas' SettingWithCopyWarning on the in-place calls below)
df = data[['Date', 'Inflation']].copy()

In [6]:
# setting "Date" column as index
df.set_index('Date', inplace=True)

In [7]:
# discarding empty rows
df.dropna(subset=['Inflation'], inplace=True)
df


Date,Inflation
2001-01-31,7.4
2001-02-28,6.6
2001-03-31,6.2
2001-04-30,6.6
2001-05-31,6.9
...,...
2022-12-31,16.6
2023-01-31,16.6
2023-02-28,18.4
2023-03-31,16.1


**Stationarity:**
Stationarity is a crucial concept in time series forecasting. A stationary series has constant statistical properties, such as mean, variance, and autocovariance, over time. Stationarity is essential for accurate modeling because it allows us to assume that the patterns and relationships observed in the historical data will persist in the future. Inflation rate data often exhibits non-stationarity due to trends, seasonality, or other factors.

To determine the stationarity of the inflation series, we used the Augmented Dickey-Fuller (ADF) test. Its null hypothesis is that the series has a unit root, i.e. that it is non-stationary. If the P-value obtained from the test is lower than the significance level (0.05), we reject the null hypothesis and conclude that the series is stationary.

To summarize, if the P-value is greater than 0.05 (indicating non-stationarity), further investigation is required to determine the appropriate order of differencing. Conversely, if the P-value is equal to or below 0.05, we can consider the order of differencing as 0.

In [8]:
# ADF test (applied below to the first-differenced series, df.diff())
from statsmodels.tsa.stattools import adfuller

def series_transformation(series):
    result = adfuller(series.dropna(), regression='c', autolag='AIC')
    p_value = result[1]
    critical_value = result[4]['5%']
    
    if p_value <= 0.05 and result[0] < critical_value:
        print('P-value = {:.6f}, the series is likely stationary.'.format(p_value))
    else:
        print('P-value = {:.6f}, the series is likely non-stationary.'.format(p_value))
        
series_transformation(df.diff())

P-value = 0.000108, the series is likely stationary.


In [9]:
# estimating the order of differencing (d) with the ADF, KPSS and Phillips-Perron tests
from pmdarima.arima.utils import ndiffs
print(ndiffs(df['Inflation'], test='adf'))
print(ndiffs(df['Inflation'], test='kpss'))
print(ndiffs(df['Inflation'], test='pp'))

1
2
1


In [10]:
# automatic stepwise search for the ARIMA order that minimizes AIC
from pmdarima import auto_arima
stepwise_fit = auto_arima(df['Inflation'], trace=True, suppress_warnings=True)
stepwise_fit.summary()

Performing stepwise search to minimize aic
 ARIMA(2,2,2)(0,0,0)[0]             : AIC=371.871, Time=0.24 sec
 ARIMA(0,2,0)(0,0,0)[0]             : AIC=469.264, Time=0.04 sec
 ARIMA(1,2,0)(0,0,0)[0]             : AIC=429.725, Time=0.03 sec
 ARIMA(0,2,1)(0,0,0)[0]             : AIC=378.736, Time=0.03 sec
 ARIMA(1,2,2)(0,0,0)[0]             : AIC=inf, Time=0.12 sec
 ARIMA(2,2,1)(0,0,0)[0]             : AIC=372.683, Time=0.08 sec
 ARIMA(3,2,2)(0,0,0)[0]             : AIC=373.580, Time=0.16 sec
 ARIMA(2,2,3)(0,0,0)[0]             : AIC=inf, Time=0.26 sec
 ARIMA(1,2,1)(0,0,0)[0]             : AIC=370.699, Time=0.07 sec
 ARIMA(0,2,2)(0,0,0)[0]             : AIC=371.480, Time=0.07 sec
 ARIMA(2,2,0)(0,0,0)[0]             : AIC=406.732, Time=0.04 sec
 ARIMA(1,2,1)(0,0,0)[0] intercept   : AIC=inf, Time=0.23 sec

Best model:  ARIMA(1,2,1)(0,0,0)[0]          
Total fit time: 1.384 seconds


Dep. Variable:,y,No. Observations:,268.0
Model:,"SARIMAX(1, 2, 1)",Log Likelihood,-182.35
Date:,"Tue, 30 May 2023",AIC,370.699
Time:,23:38:17,BIC,381.45
Sample:,01-31-2001,HQIC,375.018
,- 04-30-2023,,
Covariance Type:,opg,,

,coef,std err,z,P>|z|,[0.025,0.975]
ar.L1,0.2912,0.049,5.912,0.000,0.195,0.388
ma.L1,-0.9717,0.010,-97.622,0.000,-0.991,-0.952
sigma2,0.2287,0.012,19.225,0.000,0.205,0.252

Ljung-Box (L1) (Q):,0.03,Jarque-Bera (JB):,1004.46
Prob(Q):,0.87,Prob(JB):,0.0
Heteroskedasticity (H):,2.95,Skew:,-0.42
Prob(H) (two-sided):,0.0,Kurtosis:,12.48


In [11]:
# manually chosen order: p=2, d=0, q=1 (note: this differs from the ARIMA(1,2,1) selected by auto_arima above)
from statsmodels.tsa.arima.model import ARIMA

# fitting the model
model = ARIMA(df['Inflation'], order=(2,0,1), freq='M')
model_fit = model.fit()
model_fit.summary()



Dep. Variable:,Inflation,No. Observations:,268.0
Model:,"ARIMA(2, 0, 1)",Log Likelihood,-180.01
Date:,"Tue, 30 May 2023",AIC,370.02
Time:,23:38:17,BIC,387.975
Sample:,01-31-2001,HQIC,377.231
,- 04-30-2023,,
Covariance Type:,opg,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,3.3317,1.194,2.790,0.005,0.991,5.673
ar.L1,1.9256,0.055,35.129,0.000,1.818,2.033
ar.L2,-0.9331,0.055,-16.987,0.000,-1.041,-0.825
ma.L1,-0.7459,0.082,-9.123,0.000,-0.906,-0.586
sigma2,0.2204,0.013,17.314,0.000,0.195,0.245

Ljung-Box (L1) (Q):,1.48,Jarque-Bera (JB):,311.79
Prob(Q):,0.22,Prob(JB):,0.0
Heteroskedasticity (H):,2.77,Skew:,0.37
Prob(H) (two-sided):,0.0,Kurtosis:,8.23


In [12]:
# in-sample one-step-ahead predictions over the whole training period
pred = model_fit.predict(start=0, end=len(df) - 1, typ='levels', dynamic=False)

In [13]:
# display last rows
pred.tail()

2022-12-31    17.519751
2023-01-31    16.346135
2023-02-28    16.310558
2023-03-31    18.407467
2023-04-30    15.578614
Freq: M, Name: predicted_mean, dtype: float64

In [14]:
# root mean squared error (square root of MSE; the `squared=False` flag is deprecated in newer scikit-learn releases)
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(df['Inflation'], pred))
rmse

0.5310259974689961

In [15]:
# mean absolute error (arguments in scikit-learn's order: y_true, then y_pred)
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(df['Inflation'], pred)
mae

0.34787320592283705

In [16]:
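# mean absolute percentage error
# note: the result is inf because the series contains months with an inflation rate of exactly 0.0 (division by zero)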
mape = np.mean(np.abs(df['Inflation'] - pred) / df['Inflation']) * 100
mape

inf
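
An `inf` result like this arises when at least one observed value is exactly zero, because the error is then divided by zero. Two common workarounds are sketched below: skipping the zero observations, or using a symmetric variant of the metric. Both function names are hypothetical and the toy values are illustrative, not the notebook's data.

```python
import numpy as np

def mape_ignoring_zeros(actual, predicted):
    """MAPE in percent, computed only over observations whose actual value is non-zero."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

def smape(actual, predicted):
    """Symmetric MAPE in percent; pairs where both values are zero contribute zero error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    denom = (np.abs(actual) + np.abs(predicted)) / 2
    ratio = np.divide(np.abs(actual - predicted), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return np.mean(ratio) * 100

# Toy values (illustrative only): one actual observation is exactly zero, one is negative.
actual = np.array([2.0, 0.0, -1.0, 4.0])
predicted = np.array([1.0, 0.5, -1.5, 4.0])
print(round(mape_ignoring_zeros(actual, predicted), 2))  # 33.33
print(round(smape(actual, predicted), 2))                # 76.67
```

Taking the absolute value of the whole ratio also keeps months with negative inflation from contributing negative terms, which the original formula would allow.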

In [17]:
# correlation
corr = np.corrcoef(pred, df['Inflation'])[0,1]
corr

0.9887431786658245

In [18]:
# in-sample predictions plus a forecast 5 months beyond the sample (May-September 2023)
forecast = model_fit.predict(start=0, end=len(df) + 4, typ='levels', dynamic=False)



In [19]:
# display forecasted values
forecast.tail(5)

2023-05-31    13.963218
2023-06-30    13.195508
2023-07-31    12.404729
2023-08-31    11.598389
2023-09-30    10.783612
Freq: M, Name: predicted_mean, dtype: float64