# Research Skills: Spatiotemporal Data Analysis
## Module 3 - More Time Series Forecasting, Time Series Regression and Classification

Sharon Ong, Department of Cognitive Science and Artificial Intelligence – Tilburg University

You will work with three time-series datasets for forecasting:
1. The monthly totals of international airline passengers (in thousands) from January 1949 to December 1960.
2. The US Macroeconomic Data for Q1 1959 to Q3 2009.
3. The Longley dataset, a multivariate time series of US macroeconomic variables from 1947 to 1962 that are known to be highly collinear.

You will also work with two datasets for time series classification:
1. The Italy Power Demand dataset, derived from twelve monthly electrical power demand time series from Italy. The classification task is to distinguish days in October to March (inclusive) from days in April to September.
2. The Basic Motions dataset. The data was generated as part of a student project where four students performed four activities whilst wearing a smart watch. The watch collects 3D accelerometer and 3D gyroscope data. It consists of four classes: walking, resting (labelled `standing` in the data), running and badminton.

The entry level exercises include
1. Vector Autoregressions
2. Forecasting with Exogenous Variables
3. Time Series Classification

The advanced level exercises include
4. Pipelines and Cross validation with Time Series

### 0. Setup
Please specify in the next cell whether you are working from Google Colab or from your own computer. Also indicate whether you already have the sktime and pmdarima libraries installed.

In [None]:
COLAB = True
SKTIME_INSTALLED = False
PDARIMA_INSTALLED = False

Now run the following to set up the notebook.

In [None]:
if COLAB:
#    from google.colab import drive
#    drive.mount('/content/drive')
#    # Load the contents of the directory
    !ls
#    # Change your working directory to the folder where you stored your files, e.g.
#    %cd /content/drive/My Drive/Colab Notebooks/STDA

if not SKTIME_INSTALLED:
    !pip install sktime
    !pip install statsmodels

if not PDARIMA_INSTALLED:
    !pip install pmdarima

import itertools
from os.path import join

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# 1. Multivariate Forecasting with Vector Autoregression

Multivariate forecasting involves predicting future values of multiple dependent variables based on their historical patterns and relationships.

In univariate time series, we focus on only a single time-dependent variable. In contrast, a multivariate time series deals with multiple time-dependent variables. Each variable depends not only on its own past values but also has some dependency on other variables.
For example, imagine a dataset that includes temperature, humidity, and air pressure measurements over time. Here, we have multiple variables (temperature, humidity, and air pressure) that interact with each other. Multivariate time series models consider these interdependencies to forecast future values for all variables simultaneously.

We will use the US Macroeconomic Data for Q1 1959 to Q3 2009. Four of the features in this dataset are:
* realgdp - Real gross domestic product
* realcons - Real personal consumption expenditures
* realinv - Real gross private domestic investment
* realgovt - Real federal consumption expenditures & gross investment

The following code loads and displays the dataset.

In [None]:
from sktime.datasets import load_macroeconomic
from sktime.utils.plotting import plot_series

y = load_macroeconomic()
y.plot(subplots = True)

We will use two variables in this dataset, `realgdp` and `realcons`. The following code extracts these two features. The `VAR` class assumes that the time series passed to it are stationary. Non-stationary or trending data can often be made stationary by first-differencing or some other transformation. Check for stationarity by applying an Augmented Dickey-Fuller test to each of these variables.


In [None]:
from sktime.param_est.stationarity import StationarityADF
y = y[["realgdp","realcons"]]
#
# Your code goes here
#

For the time series which are not stationary, make them stationary.  

In [None]:
#
# Your code goes here
#
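
As a rough illustration of what first-differencing does, the sketch below applies it to a synthetic random walk in plain NumPy. The series and the variable names are made up for this example; it is not the solution for the macroeconomic data, only a picture of the transformation and of how to undo it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random walk: each value is the previous value plus noise (non-stationary).
steps = rng.normal(loc=0.0, scale=1.0, size=200)
random_walk = np.cumsum(steps)

# First-differencing: d_t = y_t - y_{t-1}. One observation is lost.
diffed = np.diff(random_walk)

# Differencing a random walk recovers the underlying noise steps.
print(len(random_walk), len(diffed))
print(np.allclose(diffed, steps[1:]))

# Undo the differencing: start from the first level and cumulatively sum.
restored = np.concatenate(([random_walk[0]], random_walk[0] + np.cumsum(diffed)))
print(np.allclose(restored, random_walk))
```

Keep in mind that a model fitted on differenced data also forecasts on the differenced scale, so its predictions have to be summed back up in the same way before they can be compared with the original levels.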

The following code splits the data into training and test sets, then fits and evaluates a vector autoregression with no trend term (`trend='n'`) and up to 4 lags.

In [None]:
from sktime.forecasting.var import VAR
from sktime.split import temporal_train_test_split
from sktime.utils.plotting import plot_series
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
import numpy as np

test_size = 48
fh = np.arange(1, test_size+1)

y_train1, y_test1 = temporal_train_test_split(y, test_size=test_size)

# Initialize a vector autoregression with no trend term and up to 4 lags
forecaster = VAR(trend="n", maxlags=4)
forecaster.fit(y_train1)
print(forecaster.is_fitted)
print(forecaster.get_params())

# predict the results
y_pred = forecaster.predict(fh=fh)
print(forecaster.get_fitted_params())

# evaluate the results
print('MAPE for VAR: ', mean_absolute_percentage_error(y_test1, y_pred, symmetric=False))
# plot the results
plot_series(y_train1["realgdp"] , y_test1["realgdp"], y_pred["realgdp"] ,labels=["y_train", "y_val", "y_pred"], title = 'realgdp')
plot_series(y_train1["realcons"] , y_test1["realcons"], y_pred["realcons"] ,labels=["y_train", "y_val", "y_pred"], title = 'realcons')
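
To make the idea behind a vector autoregression more concrete, here is a minimal NumPy sketch of a VAR with a single lag, estimated by ordinary least squares on simulated data. The coefficient matrix `A_true`, the noise level and the helper `forecast_var1` are assumptions chosen for illustration; the `VAR` class itself handles lag selection, trend terms and more.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical coefficient matrix of a stable 2-variable VAR(1):
#   y_t = A @ y_{t-1} + noise
A_true = np.array([[0.5, 0.2],
                   [0.1, 0.4]])

n_obs = 5000
series = np.zeros((n_obs, 2))
for t in range(1, n_obs):
    series[t] = A_true @ series[t - 1] + rng.normal(scale=0.1, size=2)

# Regress each y_t on the lagged vector y_{t-1} (no intercept, like trend="n").
lagged = series[:-1]    # rows are y_{t-1}
current = series[1:]    # rows are y_t
coef, *_ = np.linalg.lstsq(lagged, current, rcond=None)
A_hat = coef.T          # lstsq solves lagged @ coef = current, so A = coef.T


def forecast_var1(last_value, A, n_steps):
    """Multi-step forecast: feed each prediction back in as the next input."""
    predictions = []
    state = last_value
    for _ in range(n_steps):
        state = A @ state
        predictions.append(state)
    return np.array(predictions)


var1_pred = forecast_var1(series[-1], A_hat, n_steps=4)
print(np.round(A_hat, 2))
print(var1_pred.shape)
```

Each row of `A_hat` describes how one variable depends on the previous values of both variables, which is the interdependency that a univariate model cannot capture.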


VARMAX is the vector form of ARMA, optionally with exogenous variables. Fit a VARMAX model, then predict and evaluate the results.

In [None]:
from sktime.forecasting.varmax import VARMAX
import numpy as np

test_size = 60
fh = np.arange(1, test_size+1)

y_train1, y_test1 = temporal_train_test_split(y,test_size=test_size)


forecaster = VARMAX(suppress_warnings=True)


#
# Your code goes here
#

Granger causality is a statistical hypothesis test that assesses whether past values of one time series help predict another time series. The following code runs the Granger causality test for lags 1 to 12. `grangercausalitytests` tests whether the second column of the input Granger-causes the first column. A p-value below 0.05 indicates that, at the 5% significance level, the second series helps predict the first.


In [None]:
from statsmodels.tsa.stattools import grangercausalitytests


gc_res = grangercausalitytests(y, 12)
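
The logic of the test can be sketched by hand for a single lag: fit one regression that uses only the target's own past, fit a second one that also uses the other series' past, and compare the residual sums of squares with an F-statistic. The simulated series `cause` and `effect` and the helper names below are assumptions for illustration; `grangercausalitytests` reports several test variants plus p-values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 500

# Synthetic data: `cause` is white noise; `effect` depends on its own lag
# and on the lagged value of `cause`.
cause = rng.normal(size=n_obs)
effect = np.zeros(n_obs)
for t in range(1, n_obs):
    effect[t] = 0.5 * effect[t - 1] + 0.8 * cause[t - 1] + rng.normal()


def rss(design, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)


def granger_f(target, candidate):
    """F-statistic for 'lag 1 of candidate helps predict target'."""
    ones = np.ones(len(target) - 1)
    restricted = np.column_stack([ones, target[:-1]])
    unrestricted = np.column_stack([ones, target[:-1], candidate[:-1]])
    rss_r = rss(restricted, target[1:])
    rss_u = rss(unrestricted, target[1:])
    dof = len(target) - 1 - unrestricted.shape[1]
    return (rss_r - rss_u) / (rss_u / dof)   # one restriction is tested


f_cause_to_effect = granger_f(effect, cause)   # true link in this simulation
f_effect_to_cause = granger_f(cause, effect)   # no link in this simulation
print(f"F (cause -> effect): {f_cause_to_effect:.1f}")
print(f"F (effect -> cause): {f_effect_to_cause:.1f}")
```

A large F-statistic corresponds to a small p-value. Note that the test is directional, so it is worth running it in both directions on the two macroeconomic variables.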

# 2. Univariate forecasting with exogenous variables

The Longley dataset is a multivariate time series of US macroeconomic variables from 1947 to 1962 that are known to be highly collinear. We will forecast total employment (TOTEMP) using its own lags and 5 other variables:

* GNPDEFL - Gross national product deflator
* GNP - Gross national product
* UNEMP - Number of unemployed
* ARMED - Size of armed forces
* POP - Population

The following code fits a SARIMAX model (seasonal ARIMA with exogenous variables). Compare this model with an ARIMA model without exogenous variables.

In [None]:
from sktime.datasets import load_longley
from sktime.forecasting.base import ForecastingHorizon
from sktime.split import temporal_train_test_split
from sktime.utils.plotting import plot_series
from sktime.forecasting.arima import AutoARIMA
from sktime.forecasting.sarimax import SARIMAX
from sktime.performance_metrics.forecasting import mean_absolute_percentage_error


y, X = load_longley()
y_train, y_test, X_train, X_test = temporal_train_test_split(y=y, X=X, test_size=6)
fh = ForecastingHorizon(y_test.index, is_relative=False)

forecaster = SARIMAX(order=(1, 0, 0), trend="t", seasonal_order=(1, 0, 0, 6))
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=fh, X=X_test)
print('MAPE for SARIMAX: ', mean_absolute_percentage_error(y_test, y_pred, symmetric=False))

#
# Your code goes here
#
# try univariate forecasting with ARIMA without exogenous variables

`AutoARIMA` is another forecaster which supports forecasting with exogenous variables.

In [None]:
forecaster = AutoARIMA(suppress_warnings=True)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=fh, X=X_test)

print('MAPE for AutoARIMA : ', mean_absolute_percentage_error(y_test, y_pred, symmetric=False))
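
Since the models in this notebook are compared by MAPE, it helps to see what the metric computes. The sketch below writes out the non-symmetric and symmetric variants in NumPy, following the usual textbook definitions; the helper names and the toy numbers are assumptions, and the result is a fraction rather than a percentage.

```python
import numpy as np


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.1 = 10%)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted) / np.abs(actual)))


def smape(actual, predicted):
    """Symmetric variant: the error is scaled by both actual and predicted."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    scale = np.abs(actual) + np.abs(predicted)
    return float(np.mean(2.0 * np.abs(actual - predicted) / scale))


toy_actual = [100.0, 200.0, 400.0]
toy_forecast = [110.0, 180.0, 400.0]

toy_mape = mape(toy_actual, toy_forecast)
toy_smape = smape(toy_actual, toy_forecast)
print(f"MAPE:  {toy_mape:.4f}")
print(f"sMAPE: {toy_smape:.4f}")
```

Because the error is divided by the actual value, MAPE becomes unstable when the actual values are at or near zero, which can happen with differenced series.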

Display the residual errors for the SARIMAX solution with a residual plot, Q-Q plot and histogram.

In [None]:
#
# Your code goes here
#
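
The numbers behind these three diagnostics can be computed without any plotting library. The following sketch uses synthetic stand-ins (`toy_actual`, `toy_forecast`) in place of `y_test` and the SARIMAX predictions, and the standard library's `NormalDist` for the theoretical quantiles; plotting the resulting arrays is left to you.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(7)

# Stand-ins for a test set and a forecast; replace with your own arrays.
toy_actual = np.linspace(60.0, 70.0, 50)
toy_forecast = toy_actual + rng.normal(scale=0.5, size=50)

# Residual plot data: residuals against time should show no pattern.
residuals = toy_actual - toy_forecast

# Histogram data: counts of residuals per bin.
counts, bin_edges = np.histogram(residuals, bins=8)

# Q-Q plot data: sorted standardized residuals vs standard-normal quantiles.
n = len(residuals)
standardized = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])

# If the residuals are roughly normal, the Q-Q points lie near a straight
# line, so the correlation between the two sets of quantiles is close to 1.
qq_corr = float(np.corrcoef(theoretical, standardized)[0, 1])
print(counts.sum(), round(qq_corr, 3))
```

Plotting `theoretical` against `standardized` gives the Q-Q plot, and `counts` with `bin_edges` gives the histogram.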


# 3. Time Series Classification

## 3.1 Univariate Time Series Classification
The following code loads the training and test data from the Italy Power Demand dataset. The dataset is univariate.

There are two classes: one for the period October to March (class 1) and one for April to September (class 2). Display each time series, using a different colour or line style for each class.

In [None]:
from sktime.datasets import load_basic_motions
from sktime.datasets import load_italy_power_demand
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier

from sktime.dists_kernels import FlatDist, ScipyDist
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt

In [None]:
X_train, y_train = load_italy_power_demand(split="train", return_type="numpy2d")
X_test, y_test = load_italy_power_demand(split="test", return_type="numpy2d")


#
# Your code goes here
#


We can train a KNN Time Series Classifier and evaluate the accuracy.

In [None]:
eucl_dist = FlatDist(ScipyDist())
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, distance=eucl_dist)
clf.get_params()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# for simplest evaluation, compare ground truth to predictions
accuracy_score(y_test, y_pred)
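
With a flat Euclidean distance, a k-nearest-neighbours time series classifier treats each series as a plain vector. The sketch below spells this out in NumPy on synthetic sine and cosine curves; the toy data and the helper `knn_predict` are assumptions for illustration and do not reproduce the internals of `KNeighborsTimeSeriesClassifier`.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 2.0 * np.pi, 24)


def make_class(n_series, template):
    """Noisy copies of a template curve, shape (n_series, len(template))."""
    return template + rng.normal(scale=0.3, size=(n_series, len(template)))


# Class 1 follows a sine curve, class 2 a cosine curve.
X_toy_train = np.vstack([make_class(20, np.sin(t)), make_class(20, np.cos(t))])
y_toy_train = np.array([1] * 20 + [2] * 20)
X_toy_test = np.vstack([make_class(10, np.sin(t)), make_class(10, np.cos(t))])
y_toy_test = np.array([1] * 10 + [2] * 10)


def knn_predict(X_fit, y_fit, X_new, k=3):
    """Majority vote among the k training series closest in Euclidean distance."""
    diffs = X_new[:, None, :] - X_fit[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))       # shape (n_new, n_fit)
    nearest = np.argsort(dists, axis=1)[:, :k]      # indices of k neighbours
    predictions = []
    for neighbour_labels in y_fit[nearest]:
        labels, counts = np.unique(neighbour_labels, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


y_toy_pred = knn_predict(X_toy_train, y_toy_train, X_toy_test, k=3)
toy_accuracy = float(np.mean(y_toy_pred == y_toy_test))
print(toy_accuracy)
```

A point-by-point distance like this is sensitive to shifts in time, which is one reason elastic distances such as dynamic time warping are popular alternatives.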



Train and evaluate the following classifiers on the dataset:

* IndividualBOSS
* RocketClassifier
* Catch22Classifier

In [None]:
from sktime.classification.dictionary_based import IndividualBOSS
from sktime.classification.feature_based import Catch22Classifier
from sklearn.ensemble import RandomForestClassifier
from sktime.classification.kernel_based import RocketClassifier


#
# Your code goes here
#


## 3.2 Multivariate time series classification

The following code loads the Basic Motions dataset and displays each time series with its class. The dataset consists of 6 variables from a 3D accelerometer and a 3D gyroscope. It consists of four classes: walking, resting (labelled `standing` in the data), running and badminton.


In [None]:
X_train, y_train = load_basic_motions(split="train", return_type="numpy3D")
X_test, y_test = load_basic_motions(split="test", return_type="numpy3D")

# individual time series for each variable
for b in np.arange(0, X_train.shape[1]):
    plt.figure(figsize=(12,4))
    for a in np.arange(0, len(y_train)):
        if(y_train[a] == 'badminton'):
            plt.plot(X_train[a,b],'g')
        elif(y_train[a] == 'running'):
            plt.plot(X_train[a,b],'r--')
        elif(y_train[a] == 'standing'):
            plt.plot(X_train[a,b],'b+-')
        elif(y_train[a] == 'walking'):
            plt.plot(X_train[a,b],'c.-')



Some time series classifiers only work with univariate time series. Try to train and evaluate the following classifiers on the dataset:
* KNeighborsTimeSeriesClassifier
* IndividualBOSS
* RocketClassifier
* Catch22Classifier

Do these classifiers all work with multivariate time series classification?

In [None]:
#
# Your code goes here
#
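
If a classifier only accepts univariate input, one possible workaround (among several) is to reshape the multivariate data first. The sketch below shows two options on a random stand-in array with the `numpy3D` layout `(n_instances, n_channels, n_timepoints)`; the array `X_toy` and its shape are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in with the numpy3D layout: (n_instances, n_channels, n_timepoints).
X_toy = rng.normal(size=(40, 6, 100))

# Option 1: concatenate the channels end to end, giving one long
# univariate series per instance.
X_concat = X_toy.reshape(X_toy.shape[0], -1)

# Option 2: split into one univariate dataset per channel, e.g. to fit one
# classifier per channel and combine their votes afterwards.
per_channel = [X_toy[:, c, :] for c in range(X_toy.shape[1])]

print(X_concat.shape)
print(len(per_channel), per_channel[0].shape)

# The first 100 values of a concatenated row are channel 0 of that instance.
print(np.array_equal(X_concat[0, :100], X_toy[0, 0]))
```

Concatenation discards the information that values at the same time step belong together, so a natively multivariate classifier is usually worth trying first.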

# 4. Advanced Section: Cross validation and Pipelines with Time Series

## 4.1 Cross validation for Time Series Classification

You can apply cross validation on time series for classification in the same way as you do for classifiers and regressors in sklearn.  You can also tune hyperparameters using `GridSearchCV`. The following code loads the Italy power demand dataset.

1. Initialize a `KNeighborsTimeSeriesClassifier`
2. Perform 5-fold cross validation with `cross_val_score`.
3. Split your data into training and testing.
4. Perform hyperparameter tuning on the training set, with `n_neighbors = [1, 3, 5]` and with 5 folds using `GridSearchCV`.
5. Fit the best parameters on the training set and evaluate on the test set.

In [None]:
# cross validation stuff
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import GridSearchCV
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.classification.feature_based import SummaryClassifier

X, y = load_italy_power_demand(return_type="numpy2d")

#
# Your code goes here
#
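
For intuition about what k-fold cross validation does, here is a hand-rolled version in NumPy with a simple 1-nearest-neighbour classifier on synthetic data. The data, the helper `one_nn_accuracy` and the fold construction are assumptions for illustration; in the exercise, `cross_val_score` and `GridSearchCV` take care of these steps.

```python
import numpy as np

rng = np.random.default_rng(11)

# Two-class toy data: 100 series of length 24 whose class means differ by 1.
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 24)),
                   rng.normal(1.0, 1.0, size=(50, 24))])
y_toy = np.array([1] * 50 + [2] * 50)


def one_nn_accuracy(X_fit, y_fit, X_eval, y_eval):
    """Accuracy of a 1-nearest-neighbour classifier with Euclidean distance."""
    sq_dists = ((X_eval[:, None, :] - X_fit[None, :, :]) ** 2).sum(axis=2)
    predictions = y_fit[np.argmin(sq_dists, axis=1)]
    return float(np.mean(predictions == y_eval))


# Shuffle the instance indices once, then cut them into 5 folds.
n_folds = 5
shuffled = rng.permutation(len(X_toy))
folds = np.array_split(shuffled, n_folds)

fold_scores = []
for i, eval_idx in enumerate(folds):
    # Every fold serves as the evaluation set exactly once.
    fit_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    score = one_nn_accuracy(X_toy[fit_idx], y_toy[fit_idx],
                            X_toy[eval_idx], y_toy[eval_idx])
    fold_scores.append(score)

print(np.round(fold_scores, 2), round(float(np.mean(fold_scores)), 2))
```

Shuffling is acceptable here because each series is a separate instance. For forecasting, where the observations of a single series are ordered in time, the folds have to respect that order.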

## 4.2 Cross validation for time series forecasting

In sktime there are three different types of temporal cross-validation splitters available:
* SingleWindowSplitter, which is equivalent to a single temporal train-test-split
* SlidingWindowSplitter, which uses a rolling window approach and “forgets” the oldest observations as we move further into the future
* ExpandingWindowSplitter, which uses an expanding window approach and keeps all observations in the training set as we move further into the future
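
The difference between the sliding and expanding strategies is easiest to see in the index sets they produce. The sketch below generates them with plain NumPy for a series of 10 observations; the function names and the simplification that `fh` means "the next `fh` consecutive steps" are assumptions made here and do not mirror the sktime splitter signatures exactly.

```python
import numpy as np


def sliding_windows(n_obs, window_length, fh, step_length=1):
    """Yield (train_idx, test_idx) with a fixed-length training window."""
    start = 0
    while start + window_length + fh <= n_obs:
        train = np.arange(start, start + window_length)
        test = np.arange(start + window_length, start + window_length + fh)
        yield train, test
        start += step_length


def expanding_windows(n_obs, initial_window, fh, step_length=1):
    """Yield (train_idx, test_idx) with a training window that keeps growing."""
    end = initial_window
    while end + fh <= n_obs:
        yield np.arange(0, end), np.arange(end, end + fh)
        end += step_length


sliding = list(sliding_windows(n_obs=10, window_length=4, fh=2, step_length=2))
expanding = list(expanding_windows(n_obs=10, initial_window=4, fh=2, step_length=2))

for train, test in sliding:
    print("sliding  ", train.tolist(), test.tolist())
for train, test in expanding:
    print("expanding", train.tolist(), test.tolist())
```

In both cases the test indices always come after the training indices, which is what distinguishes temporal cross validation from shuffled k-fold.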

The following code calls a sliding window splitter.

In [None]:
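from sktime.forecasting.model_evaluation import evaluate
from sktime.forecasting.arima import ARIMA
from sktime.datasets import load_shampoo_sales
from sktime.split import SlidingWindowSplitter, temporal_train_test_split
from sktime.utils.plotting import plot_windows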

y = load_shampoo_sales()
y_train, y_test = temporal_train_test_split(y=y, test_size=6)

forecaster = ARIMA()
cv = SlidingWindowSplitter(fh=6, window_length=12, step_length=1)

plot_windows(cv=cv, y=y_train)

out = evaluate(forecaster, cv, y=y)
print(out)


## 4.3 Pipelines for forecasting

We can transform the time series (e.g. with a log transformer or a `Detrender`) before running a forecasting algorithm.

1. Load the airline dataset and set the Forecasting Horizon to 36 months
2. Split your data into training and test.
3. Fit `ARIMA` or `AutoARIMA` on the training set with the appropriate hyperparameters and evaluate on the test set.
4. Make an instance of the `Detrender`, run `fit_transform` on the training set.
5. Fit `ARIMA` or `AutoARIMA` on the output of `fit_transform` with the appropriate hyperparameters. Make predictions up to the forecasting horizon. Perform an `inverse_transform` on the predicted results.
6. Plot the predicted time series from No. 3 and No. 5 along with the training and test set.
7. Apply a `Deseasonalizer` transformation together with the `Detrender` to the data.

In [None]:
from sktime.transformations.series.detrend import Deseasonalizer, Detrender
from sktime.datasets import load_airline
from sktime.split import temporal_train_test_split
from sktime.utils.plotting import plot_series
from sktime.forecasting.arima import AutoARIMA

from sktime.performance_metrics.forecasting import mean_absolute_percentage_error
import numpy as np

#
# Your code goes here
#
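
The transform, forecast, inverse-transform cycle from steps 4 and 5 can be sketched with a straight-line trend in NumPy. The synthetic series, the linear trend and the mean forecast used as a stand-in model are assumptions for illustration; they are not the airline solution and not how `Detrender` is implemented.

```python
import numpy as np

rng = np.random.default_rng(2)
n_fit, horizon = 60, 12
time_fit = np.arange(n_fit)
time_future = np.arange(n_fit, n_fit + horizon)

# Synthetic series: linear trend plus noise.
toy_train = 100.0 + 2.0 * time_fit + rng.normal(scale=3.0, size=n_fit)

# "fit": estimate a straight-line trend on the training data only.
slope, intercept = np.polyfit(time_fit, toy_train, deg=1)

# "transform": remove the fitted trend.
detrended = toy_train - (intercept + slope * time_fit)

# Forecast the detrended series with a simple stand-in model (its mean).
detrended_forecast = np.full(horizon, detrended.mean())

# "inverse transform": add the extrapolated trend back to the forecast.
toy_pred = detrended_forecast + (intercept + slope * time_future)

print(round(float(slope), 2), toy_pred.shape)
```

The trend is estimated on the training data only and then extrapolated, so no information from the test period leaks into the forecast.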


`TransformedTargetForecaster` creates a pipeline that runs the transformer and then the forecaster.

In [None]:
from sktime.forecasting.compose import TransformedTargetForecaster

pipe_y = TransformedTargetForecaster(
    steps=[
        ("detrend", Detrender()),
        ("forecaster", AutoARIMA(sp=12, suppress_warnings=True)),
    ]
)
pipe_y.fit(y=y_train)
y_pred = pipe_y.predict(fh=fh)
plot_series(y_train, y_test, y_pred ,labels=["y_train", "test", "y_pred"], title = 'with detrending ')

Run a pipeline which includes a `Detrender` and `Deseasonalizer` with `AutoARIMA` on the same dataset.  


In [None]:
#
# Your code goes here.
#
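
As a rough picture of what deseasonalizing means, the sketch below removes an additive monthly pattern by subtracting per-month averages and adds it back to a forecast. The synthetic series and the averaging approach are assumptions for illustration and are not how `Deseasonalizer` is implemented.

```python
import numpy as np

rng = np.random.default_rng(9)
sp = 12                                   # seasonal period for monthly data
n_years = 5
month = np.arange(sp * n_years) % sp      # calendar position of each point

# Synthetic series: constant level + repeating monthly pattern + noise.
seasonal_pattern = 10.0 * np.sin(2.0 * np.pi * np.arange(sp) / sp)
toy_series = 50.0 + seasonal_pattern[month] + rng.normal(size=sp * n_years)

# "fit": average deviation from the overall mean for each calendar month.
overall_mean = toy_series.mean()
seasonal_est = np.array(
    [toy_series[month == m].mean() - overall_mean for m in range(sp)]
)

# "transform": subtract the estimated seasonal component.
deseasonalized = toy_series - seasonal_est[month]

# "inverse transform" for 6 future months: continue the monthly cycle.
future_month = np.arange(sp * n_years, sp * n_years + 6) % sp
flat_forecast = np.full(6, deseasonalized.mean())
seasonal_pred = flat_forecast + seasonal_est[future_month]

print(round(float(toy_series.std()), 2), round(float(deseasonalized.std()), 2))
```

In a pipeline with both transformations, the inverse transforms are applied to the forecast in the reverse order of the forward transforms.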