In [None]:
library(tseries)
library(tsibble)
library(forecast)

library(stats)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(corrplot)
library(broom)
library(ggpubr)


### Load the Dataset:

In [None]:
# LOAD
df = read.csv("data/Preprocessed_Dataset.csv", sep = ",")
df$Date <- as.Date(df$Date, format = "%Y-%m-%d")
str(df)

**Time Series Analysis (ARIMA Modeling)**

Assessing the behavior (data-generating process) of the time series: to get an overview of the data-generating process, we can look at the correlations among different lags before going further.

We analyze the top-selling Walmart store in the US.

Selecting the Walmart store with the highest sales (store 20):

In [None]:
df_20 <- df[df$Store == '20', ]
rownames(df_20) <- NULL
df_20

### What is Stationarity in a Time Series Context?

In time series analysis, a series is said to be stationary if its statistical properties do not change over time. More formally, a (weakly) stationary series has a constant mean, a constant variance, and an autocovariance that depends only on the lag between two observations, not on when they occur.

The reason we need to check for stationarity is that many time series models, such as ARIMA and SARIMA, assume that the time series is stationary. If the time series is not stationary, these models may produce incorrect or unreliable forecasts. Additionally, non-stationary time series may exhibit trends or seasonal patterns that can make it difficult to discern underlying patterns and relationships in the data. By checking for stationarity and transforming the data if necessary, we can make the time series more amenable to analysis and modeling.
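
As a minimal sketch of what "transforming the data if necessary" can look like, the example below uses simulated data (not the Walmart series) to contrast a random walk, which is non-stationary, with its first difference. The seed, series length, and the helper name `half_sd` are arbitrary choices made for illustration.

```r
# Minimal sketch: a random walk is non-stationary, but its first difference is stationary.
set.seed(42)                      # arbitrary seed, for reproducibility only
shocks <- rnorm(200)              # white-noise innovations (stationary)
random_walk <- cumsum(shocks)     # unit-root process: its variance grows with time

# First differencing recovers the innovations (all but the first one).
differenced <- diff(random_walk)

# Hypothetical helper: standard deviation of each half of a series.
half_sd <- function(x) {
  n <- length(x)
  c(first_half = sd(x[1:(n %/% 2)]), second_half = sd(x[(n %/% 2 + 1):n]))
}
print(half_sd(random_walk))
print(half_sd(differenced))
```

In an ARIMA model this differencing step corresponds to the middle entry of the order, so a series that needs one difference would be modeled with d = 1.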


### Tests to Validate Stationarity:

The Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test are statistical tests used to check whether a time series is stationary.

1. The ADF test is a hypothesis test that checks whether a time series has a unit root, which is a characteristic of non-stationarity. The null hypothesis of the test is that the time series has a unit root, while the alternative hypothesis is that it does not. If the test rejects the null hypothesis, it provides evidence that the time series is stationary.

2. The KPSS test is also a hypothesis test for stationarity, but it reverses the roles of the hypotheses. It checks whether the series is stationary around a constant level or around a deterministic trend; `kpss.test` from tseries tests level stationarity by default (`null = "Level"`). The null hypothesis is that the time series is stationary, while the alternative hypothesis is that it has a unit root and is non-stationary. If the test rejects the null hypothesis, it provides evidence that the time series is non-stationary.

In summary, the two tests complement each other: the ADF test takes a unit root (non-stationarity) as its null hypothesis, while the KPSS test takes stationarity as its null hypothesis. A series is most convincingly stationary when the ADF test rejects its null and the KPSS test does not.

**1. ADF TEST:**

In [None]:
# Perform ADF test on "Weekly_Sales"
adf_result <- adf.test(df_20$Weekly_Sales)
adf_result

# Print the test results
cat("ADF test p-value:", adf_result$p.value)
if (adf_result$p.value < 0.05) {
  cat("\nThe time series is stationary at the 5% significance level.")
} else {
  cat("\nThe time series is not stationary at the 5% significance level.")
}

**2. KPSS Test:**

In [None]:
# Perform KPSS test on "Weekly_Sales"
kpss_result <- kpss.test(df_20$Weekly_Sales)
kpss_result

# Print the test results
cat("KPSS test p-value:", kpss_result$p.value)
if (kpss_result$p.value < 0.05) {
  cat("\nThe time series is not stationary at the 5% significance level.")
} else {
  cat("\nThe time series is stationary at the 5% significance level.")
}

The ADF and KPSS tests suggest that the data is stationary.
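
Because the two tests have opposite null hypotheses, their results can agree or conflict. The sketch below combines two p-values into a single verdict. The function name `stationarity_verdict` and the 5% threshold are assumptions made for illustration, and the p-values in the example call are made up.

```r
# Hypothetical helper: combine ADF and KPSS p-values into one verdict.
# ADF null: unit root (non-stationary). KPSS null: stationary.
stationarity_verdict <- function(adf_p, kpss_p, alpha = 0.05) {
  adf_stationary  <- adf_p < alpha     # rejecting the ADF null supports stationarity
  kpss_stationary <- kpss_p >= alpha   # failing to reject the KPSS null supports it
  if (adf_stationary && kpss_stationary) {
    "stationary"
  } else if (!adf_stationary && !kpss_stationary) {
    "non-stationary"
  } else {
    "inconclusive"
  }
}

# Illustrative p-values only, not results from the Walmart data.
print(stationarity_verdict(adf_p = 0.01, kpss_p = 0.10))
```

With the objects from the cells above, the call would be `stationarity_verdict(adf_result$p.value, kpss_result$p.value)`. Note that tseries interpolates p-values from a table, so both tests report values bounded between 0.01 and 0.1.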

### Creating ts(Time Series) and tsibble Objects

To conduct the analysis, the data needs to be in ts format. We set the frequency to the number of weeks in a year (365.25/7), so one period represents one week. We keep all columns except the date column and set the start to the sixth period of 2010. The week number of the first date in the time series is actually 5, but on inspection, starting at the sixth period produced the correct ts and tsibble objects.
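
One possible reading of why period 6 works: `ts()` treats period 1 as the very start of the year, so period 6 begins after five full periods have elapsed. The sketch below uses a toy series (`toy` and `first_time` are illustrative names) to show the decimal-year time stamp that `start = c(2010, 6)` produces and the calendar date five weeks after 1 January 2010, which can be compared with `min(df_20$Date)`.

```r
# Sketch: how start = c(2010, 6) maps to a calendar position.
weeks_per_year <- 365.25 / 7

# A toy weekly series with the same start and frequency as in the text.
toy <- ts(1:3, start = c(2010, 6), frequency = weeks_per_year)

# Period 6 begins after 5 full periods, so the first time stamp is
# 2010 + 5 / weeks_per_year in decimal years.
first_time <- as.numeric(time(toy)[1])
print(first_time)

# The same offset in calendar terms: 5 weeks (35 days) after 1 January 2010.
print(as.Date("2010-01-01") + 5 * 7)
```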

In [None]:
paste("The length of the time series is", nrow(df_20), "weeks and it ranges from", min(df_20$Date), "to", max(df_20$Date))

Creating Time Series data:

In [None]:
ts <- ts(df_20[, 2:ncol(df_20)], start = c(2010, 6),  frequency = 365.25 / 7)
str(ts)

Create a tsibble object from the ts object and print its structure.

In [None]:
tsbl <- as_tsibble(ts, pivot_longer = FALSE)
str(tsbl)

### Modeling

Because the goal is to predict the weekly sales of a Walmart store, mid- to long-term projections are arguably more useful than short-term forecasts. Within a week, a month, or even a few months, the store is unlikely to be able to change much about its offerings or premises. Acting on a forecast may require significant changes that take several months: renovating, hiring new staff, or finding new vendors. When large sales are anticipated, Walmart might even pursue a new market opportunity such as expanding the branch. A forecast horizon of one year, or about 52 weeks, therefore appears appropriate. Week 43 of 2012 is the final week of the time series, so the forecasting question is: how can we expect weekly sales to vary between week 43 of 2012 and week 43 of 2013?

### 1. ARIMA Modelling:

A different method for forecasting time series is offered by ARIMA models. The two most popular methods for predicting time series are exponential smoothing and ARIMA models, both of which offer complementary approaches to the issue. While ARIMA models seek to describe the autocorrelations in the data, exponential smoothing models are based on a description of the trend and seasonality in the data.

An ARIMA model has three main parameters:
1. The autoregressive (AR) order p: the AR terms capture the effect of the past values of the series on its current value.
2. The moving average (MA) order q: the MA terms capture the effect of the past errors (residuals) on the current value of the series.
3. The integrated (I) order d: the number of times the series needs to be differenced in order to achieve stationarity.

The order of a model is written as (p, d, q). For example, ARIMA(3,0,1) includes 3 autoregressive (AR) terms, no differencing (d = 0), and 1 moving average (MA) term.
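
To make the order notation concrete, here is a small sketch that simulates a series from an ARIMA(3,0,1) process and fits a model of the same order with `stats::arima`. The coefficients, seed, and series length are arbitrary assumptions, and the data is simulated rather than taken from the Walmart series.

```r
# Sketch on simulated data (not the Walmart series): an ARIMA(3,0,1) model.
set.seed(1)  # arbitrary seed
sim <- arima.sim(model = list(ar = c(0.5, -0.2, 0.1), ma = 0.4), n = 500)

# order = c(p, d, q): 3 AR terms, no differencing, 1 MA term.
fit <- arima(sim, order = c(3, 0, 1), method = "ML")

# Coefficients are named ar1..ar3 and ma1, plus an intercept (the series mean).
print(names(coef(fit)))
```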

**Checking the AIC and BIC values to choose the best parameters for ARIMA:**

Among the candidate models, we prefer the one with the lowest AIC and BIC values.

In [None]:
AIC(
  arima(tsbl$Weekly_Sales,order=c(4,0,0)),
  arima(tsbl$Weekly_Sales,order=c(4,0,1)),
  arima(tsbl$Weekly_Sales,order=c(4,0,2)),
  arima(tsbl$Weekly_Sales,order=c(4,0,3)),
  arima(tsbl$Weekly_Sales,order=c(3,0,0)),
  arima(tsbl$Weekly_Sales,order=c(3,0,1)),
  arima(tsbl$Weekly_Sales,order=c(3,0,2)),
  arima(tsbl$Weekly_Sales,order=c(3,0,3))
)

Based on the AIC values, the best candidates are ARIMA(3,0,3) and ARIMA(3,0,2).

In [None]:
BIC(
  arima(tsbl$Weekly_Sales,order=c(4,0,0)),
  arima(tsbl$Weekly_Sales,order=c(4,0,1)),
  arima(tsbl$Weekly_Sales,order=c(4,0,2)),
  arima(tsbl$Weekly_Sales,order=c(4,0,3)),
  arima(tsbl$Weekly_Sales,order=c(3,0,0)),
  arima(tsbl$Weekly_Sales,order=c(3,0,1)),
  arima(tsbl$Weekly_Sales,order=c(3,0,2)),
  arima(tsbl$Weekly_Sales,order=c(3,0,3))
)

Based on the BIC values, the best candidates are ARIMA(3,0,3) and ARIMA(4,0,2).
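
The two comparisons can also be collected into one sorted table, which makes the ranking easier to read. The following is a sketch under stated assumptions: `compare_orders` is a hypothetical helper, and the example runs on simulated data so that it is self-contained.

```r
# Sketch: tabulate AIC and BIC for a grid of candidate (p, d, q) orders.
compare_orders <- function(x, p_values, q_values, d = 0) {
  grid <- expand.grid(p = p_values, q = q_values)
  grid$AIC <- NA_real_
  grid$BIC <- NA_real_
  for (i in seq_len(nrow(grid))) {
    # Some orders may fail to fit; keep NA for those instead of stopping.
    fit <- tryCatch(
      arima(x, order = c(grid$p[i], d, grid$q[i])),
      error = function(e) NULL
    )
    if (!is.null(fit)) {
      grid$AIC[i] <- AIC(fit)
      grid$BIC[i] <- BIC(fit)
    }
  }
  grid[order(grid$AIC), ]  # lowest AIC first, failed fits last
}

# Illustration on simulated data (arbitrary seed and coefficient).
set.seed(123)
x <- arima.sim(model = list(ar = 0.6), n = 200)
result <- compare_orders(x, p_values = 3:4, q_values = 0:3)
print(result)
```

In the notebook, the equivalent call would be `compare_orders(tsbl$Weekly_Sales, p_values = 3:4, q_values = 0:3)`.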

One can also use auto.arima to get the ARIMA values.

In [None]:
model1 <- auto.arima(tsbl$Weekly_Sales,stationary=FALSE,allowdrift=FALSE, seasonal=FALSE,stepwise=FALSE,approximation=FALSE)
summary(model1)

From auto.arima, we see that the best order is (0,0,4), which differs from the orders we chose using the AIC and BIC comparison.
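
One likely reason for the difference is that `auto.arima` searches a wider set of orders than the eight candidates compared by hand and, by default, ranks them by the small-sample corrected AIC (AICc) rather than by AIC or BIC. The sketch below computes AICc for a `stats::arima` fit. The helper name `aicc` and the simulated series are assumptions made for illustration.

```r
# Sketch: small-sample corrected AIC (AICc) for a fitted stats::arima model.
aicc <- function(fit) {
  k <- length(coef(fit)) + 1   # estimated coefficients plus the innovation variance
  n <- fit$nobs                # number of observations used in the fit
  AIC(fit) + (2 * k * (k + 1)) / (n - k - 1)
}

set.seed(7)  # arbitrary seed; simulated data, not the Walmart series
x <- arima.sim(model = list(ma = c(0.5, 0.3)), n = 150)
fit <- arima(x, order = c(0, 0, 2))
print(c(AIC = AIC(fit), AICc = aicc(fit)))
```

The correction term shrinks as the sample grows, so AIC and AICc tend to agree on long series and diverge on short ones.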

Plotting the residuals for model 1 to see whether they are white noise.

In [None]:
tsdisplay(residuals(model1))

The ACF and PACF plots suggest that the residuals are not white noise.
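
The visual check can be complemented by a Ljung-Box test, whose null hypothesis is that there is no autocorrelation up to a chosen lag. The sketch below runs it on a simulated, deliberately autocorrelated series. The lag of 10 and the seed are arbitrary assumptions.

```r
# Sketch: Ljung-Box test as a numerical complement to the ACF/PACF plots.
set.seed(2024)  # arbitrary seed; simulated data only
autocorrelated <- arima.sim(model = list(ar = 0.8), n = 300)

# Null hypothesis: no autocorrelation up to the chosen lag (white noise).
lb <- Box.test(autocorrelated, lag = 10, type = "Ljung-Box")
print(lb$p.value)  # a small p-value indicates remaining autocorrelation
```

For model residuals, a call along the lines of `Box.test(residuals(model1), lag = 10, type = "Ljung-Box", fitdf = 4)` would apply, where `fitdf` is the number of estimated ARMA coefficients (4 for an order of (0,0,4)).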

In [None]:
#Arima
arima_fit <- arima(tsbl$Weekly_Sales, order=c(3,0,2))
summary(arima_fit)

Forecasting with the auto.arima model and the manually chosen ARIMA model.

**1. Forecasting from Auto - Arima Model:**

In [None]:
#Auto - Arima Modelling
plot(forecast(model1, h=10))

**2. Forecasting from Arima Model:**

In [None]:
#Arima Modelling
plot(forecast(arima_fit, h=10))

**Judging visually from the two forecast plots, the ARIMA(3,0,2) predictions look better than those of the auto.arima model.**

### To calculate the accuracy:

In [None]:
# Training set: observations up to and including 2012-08-03.
# Compare against a Date (not POSIXct) so the types match the Date column.
df_subset <- subset(df_20, Date <= as.Date("2012-08-03"))
tail(df_subset)

# Holdout set: observations after 2012-08-03.
df_subset_1 <- subset(df_20, Date > as.Date("2012-08-03"))

paste("The length of the training series is", nrow(df_subset), "weeks and it ranges from", min(df_subset$Date), "to", max(df_subset$Date))

paste("The length of the holdout series is", nrow(df_subset_1), "weeks and it ranges from", min(df_subset_1$Date), "to", max(df_subset_1$Date))
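
A date-based holdout split like this one can be wrapped in a small helper so that the cutoff is always compared as a `Date`. This is a sketch only: `split_by_date` is a hypothetical name, and the toy data frame stands in for `df_20` with made-up values.

```r
# Hypothetical helper: split a data frame into training and holdout parts by date.
split_by_date <- function(data, cutoff, date_col = "Date") {
  cutoff <- as.Date(cutoff)            # compare Date with Date, not POSIXct
  is_train <- data[[date_col]] <= cutoff
  list(train = data[is_train, , drop = FALSE],
       test  = data[!is_train, , drop = FALSE])
}

# Toy weekly data standing in for df_20 (values are made up).
toy <- data.frame(
  Date = seq(as.Date("2012-07-06"), by = "week", length.out = 8),
  Weekly_Sales = c(10, 12, 11, 13, 14, 12, 15, 16)
)
parts <- split_by_date(toy, "2012-08-03")
print(nrow(parts$train))
print(nrow(parts$test))
```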

In [None]:

ts_1 <- ts(df_subset[, 2:ncol(df_subset)], start = c(2010, 6),  frequency = 365.25 / 7)
str(ts_1)
tsbl_1 <- as_tsibble(ts_1, pivot_longer = FALSE)
str(tsbl_1)


# The holdout series starts one period after the training series ends.
ts_2 <- ts(df_subset_1[, 2:ncol(df_subset_1)], start = tsp(ts_1)[2] + deltat(ts_1),  frequency = 365.25 / 7)
str(ts_2)
tsbl_2 <- as_tsibble(ts_2, pivot_longer = FALSE)
str(tsbl_2)

tsbl_2

In [None]:
#Arima
arima_fit_1 <- arima(tsbl_1$Weekly_Sales, order=c(3,0,2))
summary(arima_fit_1)
forecast_1 <- forecast(arima_fit_1, h=12)
plot(forecast_1)

In [None]:
accuracy(forecast_1, tsbl_2$Weekly_Sales)
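
To make the measures reported by `accuracy()` easier to interpret, the sketch below computes three of them by hand: RMSE, MAE, and MAPE. The function name `forecast_errors` is hypothetical and the numbers are made up purely for illustration.

```r
# Sketch: point-forecast error measures computed by hand.
forecast_errors <- function(actual, predicted) {
  e <- actual - predicted
  c(RMSE = sqrt(mean(e^2)),                 # root mean squared error
    MAE  = mean(abs(e)),                    # mean absolute error
    MAPE = mean(abs(e / actual)) * 100)     # mean absolute percentage error
}

# Made-up numbers purely for illustration.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 330)
metrics <- forecast_errors(actual, predicted)
print(round(metrics, 3))
```

With the objects from this notebook, the equivalent call would be `forecast_errors(tsbl_2$Weekly_Sales, as.numeric(forecast_1$mean))`, which should correspond to the test-set row of the `accuracy()` output.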