# Workshop 5 Example 6
# Simple time series analysis

# Autoregression

https://machinelearningmastery.com/autoregression-models-time-series-forecasting-python/

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step.

A regression model, such as linear regression, models an output value based on a linear combination of input values.

For example:

$y = b_0 + b_1X$

Where $y$ is the prediction, $b_0$ and $b_1$ are coefficients found by optimizing the model on training data, and $X$ is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables.

For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

$X(t+1) = b_0 + b_1X(t-1) + b_2X(t-2)$

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).
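As a rough illustration of the idea (not part of the original workshop code), the sketch below simulates a series with known coefficients and then recovers them with ordinary least squares on two lag variables. The coefficient values, the seed, and the helper name `fit_ar2` are assumptions chosen for the example.

```python
import numpy as np

def fit_ar2(x):
    """Least-squares estimate of (b0, b1, b2) in x[t] = b0 + b1*x[t-1] + b2*x[t-2]."""
    x = np.asarray(x, dtype=float)
    # design matrix: a column of ones plus the two lag variables
    A = np.column_stack([np.ones(len(x) - 2), x[1:-1], x[:-2]])
    y = x[2:]
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

# simulate a series with known (assumed) coefficients
rng = np.random.default_rng(0)
b0, b1, b2 = 1.0, 0.6, 0.3
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = b0 + b1 * x[t - 1] + b2 * x[t - 2] + rng.normal(scale=0.5)

est = fit_ar2(x)
print('estimated b0, b1, b2:', np.round(est, 2))
```

The estimates should land close to the values used in the simulation, which is all an autoregression fit is: a linear regression whose inputs are earlier values of the same series.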

## Autocorrelation

An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step.

This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight the autoregression model can put on that variable when modeling.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.
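A small sketch of this idea, assuming a synthetic series with a 20-step cycle rather than the workshop data: the helper `lag_correlation` (a name made up for this example) computes the Pearson correlation between the series and a copy of itself shifted by a given lag.

```python
import numpy as np

def lag_correlation(x, lag):
    """Pearson correlation between x[t] and x[t - lag]."""
    x = np.asarray(x, dtype=float)
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

# synthetic series (an assumption): a cycle of 20 steps plus a little noise
rng = np.random.default_rng(1)
t = np.arange(400)
x = np.sin(2 * np.pi * t / 20) + 0.1 * rng.normal(size=t.size)

for lag in (1, 10, 20):
    print('lag %2d: correlation %+.2f' % (lag, lag_correlation(x, lag)))
```

Lags of 1 and 20 line up with the cycle and give a strong positive correlation, while a lag of 10 (half a cycle) gives a strong negative one. Lags with a correlation near zero would be poor candidates for lag variables.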

### Minimum Daily Temperatures Dataset
This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations.

In [0]:
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv
!wget https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv
  
# suppress library warnings (e.g. deprecation notices) to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [0]:
import pandas as pd
from matplotlib import pyplot

# load the data using pandas (read_csv replaces the removed Series.from_csv)
series = pd.read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# show the first couple of rows to visualize the dataframe
print(series.head())

# plot the data
series.plot()
pyplot.show()

### Quick Check for Autocorrelation

We can plot the observation at the previous time step (t-1) with the observation at the next time step (t+1) as a scatter plot.

This could be done manually by first creating a lag version of the time series dataset and using a built-in scatter plot function in the Pandas library.

Pandas provides a built-in plot to do exactly this, called the lag_plot() function.

In [0]:
import pandas as pd
from matplotlib import pyplot
from pandas.plotting import lag_plot

# read in the data using pandas (read_csv replaces the removed Series.from_csv)
series = pd.read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# plot a lag plot to check for autocorrelation
lag_plot(series)
pyplot.show()

We can also quantify the relationship with a statistic such as the **Pearson correlation coefficient**. This produces a number between -1 (negatively correlated) and +1 (positively correlated) that summarizes how correlated two variables are. Values close to zero indicate low correlation, and values above 0.5 or below -0.5 indicate high correlation.

Correlation can be calculated easily using the corr() function on the DataFrame of the lagged dataset.

In [0]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# get the values of the pandas dataframe
values = DataFrame(series.values)

# create dataframe with values and shifted values by 1
dataframe = concat([values.shift(1), values], axis=1)

# dataframe column names
dataframe.columns = ['t-1', 't+1']

# calculate the Pearson correlation coefficient
result = dataframe.corr()
print(result,'\n\nstrong positive correlation 0.775\n')

### Autocorrelation Plots
We can plot the correlation coefficient for each lag variable.

This can very quickly give an idea of which lag variables may be good candidates for use in a predictive model and how the relationship between the observation and its historic values changes over time.

Pandas provides a built-in plot called the autocorrelation_plot() function.

The plot provides the lag number along the x-axis and the correlation coefficient value between -1 and 1 on the y-axis. The plot also includes solid and dashed lines that indicate the 95% and 99% confidence interval for the correlation values. Correlation values above these lines are more significant than those below the line, providing a threshold or cutoff for selecting more relevant lag values.
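To make the confidence lines concrete: to the best of my understanding, pandas draws them at $\pm z/\sqrt{n}$, where $n$ is the number of observations and $z$ is about 1.96 for 95% and 2.576 for 99%. The sketch below computes those bands for a series of 3,650 points and checks them against white noise. The `autocorr` helper and the noise series are assumptions made for illustration.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.sum(d[lag:] * d[:-lag]) / np.sum(d * d)

n = 3650                      # same length as the temperature series
band95 = 1.96 / np.sqrt(n)    # 95% band
band99 = 2.576 / np.sqrt(n)   # 99% band
print('95%% band: +/-%.4f, 99%% band: +/-%.4f' % (band95, band99))

# white noise has no real autocorrelation, so only about 5% of lags
# should fall outside the 95% band by chance
rng = np.random.default_rng(2)
noise = rng.normal(size=n)
outside = sum(abs(autocorr(noise, k)) > band95 for k in range(1, 201))
print('lags outside the 95%% band: %d of 200' % outside)
```

With this many observations the bands are narrow, so even modest correlations in the temperature data stand well clear of them.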

In [0]:
from pandas import read_csv
from matplotlib import pyplot
from pandas.plotting import autocorrelation_plot

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# build and plot the autocorrelation
autocorrelation_plot(series)
pyplot.show()

Each peak in the autocorrelation plot is one year (about 365 lags) apart, which reflects the annual cycle in the temperatures.


### Persistence Model

Let’s say that we want to develop a model to predict the last 7 days of minimum temperatures in the dataset given all prior observations.

The simplest model that we could use to make predictions would be to persist the last observation. We can call this a persistence model and it provides a baseline of performance for the problem that we can use for comparison with an autoregression model.

We can develop a test harness for the problem by splitting the observations into training and test sets, with only the last 7 observations in the dataset assigned to the test set as “unseen” data that we wish to predict.

In [0]:
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error

# load the data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# create lagged dataset
values = DataFrame(series.values)
dataframe = concat([values.shift(1), values], axis=1)
dataframe.columns = ['t-1', 't+1']

# split into train and test sets
X = dataframe.values
train, test = X[1:len(X)-7], X[len(X)-7:]
train_X, train_y = train[:,0], train[:,1]
test_X, test_y = test[:,0], test[:,1]

# persistence model
def model_persistence(x):
	return x

# walk-forward validation
predictions = list()
for x in test_X:
  yhat = model_persistence(x)
  predictions.append(yhat)

# create a test score
test_score = mean_squared_error(test_y, predictions)
print('Test MSE: %.3f' % test_score)

# plot predictions vs expected
plt.plot(test_y, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.show()

### Autoregression Model
An autoregression model is a linear regression model that uses lagged variables as input variables. The time step value at timestep $t$, $X_t$, is calculated as

$$X_t=c+\sum_{i=1}^{p} \phi_i X_{t-i} +\epsilon_t $$

where the $\phi_i$ values are the model parameters, $c$ is a constant, and $\epsilon_t$ is noise. $p$ is the lag order (i.e. how many past time steps are used to calculate the new value).
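The equation can be iterated forward to produce multi-step forecasts: each predicted value is fed back in as if it were an observation, with the noise term set to zero. The sketch below does this for an AR(2) model. The parameter values and the helper name `ar_forecast` are assumptions for illustration, not fitted values from the workshop data.

```python
import numpy as np

def ar_forecast(history, c, phi, steps):
    """Iterate X_t = c + sum_i phi[i-1] * X_{t-i} forward with the noise term set to zero."""
    values = list(history)
    p = len(phi)
    for _ in range(steps):
        recent = values[-p:][::-1]          # X_{t-1}, X_{t-2}, ..., X_{t-p}
        values.append(c + float(np.dot(phi, recent)))
    return np.array(values[len(history):])

# assumed AR(2) parameters, chosen only for illustration
c, phi = 2.0, [0.5, 0.3]
forecast = ar_forecast([15.0, 14.0], c, phi, steps=200)
long_run_mean = c / (1 - sum(phi))
print('first forecasts:', np.round(forecast[:3], 2))
print('last forecast: %.3f, long-run mean: %.3f' % (forecast[-1], long_run_mean))
```

For a stable model the iterated forecast settles at $c / (1 - \sum_i \phi_i)$, which is why long-horizon autoregression forecasts tend to flatten out toward the series mean.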

The statsmodels library provides an autoregression model in the AutoReg class, which trains a linear regression model on the lags we specify. It replaces the older AR class, which chose a default lag on its own. Here we use 29 lags, roughly one month of daily history.

We can use this model by first creating the model AutoReg() and then calling fit() to train it on our dataset. This returns an AutoRegResults object.

Once fit, we can use the model to make a prediction by calling the predict() function for a number of observations in the future. This creates a single 7-day forecast, which is different from the persistence example above, where each prediction used the true observation from the previous day.

In [0]:
from pandas import read_csv
from matplotlib import pyplot as plt
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# split dataset
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train autoregression on the previous 29 observations
lag = 29
model = AutoReg(train, lags=lag)
model_fit = model.fit()
print('Lag: %s' % lag)
print('Coefficients: %s' % model_fit.params)

# make predictions (out-of-sample forecasts always build on earlier forecasts)
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
  print('predicted=%f, expected=%f' % (predictions[i], test[i]))

# calculate the error
error = mean_squared_error(test, predictions)
print('\nTest MSE: %.3f' % error)

# plot results
plt.plot(test, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.show()

Let's see what happens when we use autoregression to look far into the future.

In [0]:
from pandas import read_csv
from matplotlib import pyplot as plt
from statsmodels.tsa.ar_model import AutoReg
from sklearn.metrics import mean_squared_error
import numpy as np

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# split dataset
X = series.values
train = X

num_future_days = 2*365

# train autoregression on the previous 29 observations
# (AutoReg is the current statsmodels autoregression class)
lag = 29
model = AutoReg(train, lags=lag)
model_fit = model.fit()
print('Lag: %s' % lag)

# make predictions (out-of-sample forecasts always build on earlier forecasts)
predictions = model_fit.predict(start=len(train), end=len(train)+num_future_days, dynamic=False)

total = np.append(X,predictions)

# plot results
plt.figure(1)
plt.plot(total, color='blue')

plt.figure(2)
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')

plt.show()

# Moving Average (MA)

The moving-average (MA) model is a common approach for modeling univariate time series. It specifies that the output variable depends linearly on the current and various past values of a stochastic (imperfectly predictable) term. The value in a series is calculated as

$$X_t=\mu + \epsilon_t+\theta_1\epsilon_{t-1}+\cdots+\theta_q\epsilon_{t-q}$$

where $\mu$ is the mean of the series, the $\theta$'s are the model parameters, and the $\epsilon$'s are white noise error terms.
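As a sketch of what this equation generates, the code below simulates an MA(2) process directly from white noise and prints its sample autocorrelations. The parameter values ($\mu = 10$, $\theta = (0.8, 0.4)$) and the helper names are assumptions for illustration.

```python
import numpy as np

def simulate_ma(mu, theta, n, rng):
    """Generate X_t = mu + eps_t + theta_1*eps_{t-1} + ... + theta_q*eps_{t-q}."""
    q = len(theta)
    eps = rng.normal(size=n + q)
    weights = np.concatenate(([1.0], theta))   # weight 1 on the current noise term
    # 'valid' convolution yields one weighted sum of q+1 consecutive noise terms per output
    return mu + np.convolve(eps, weights, mode='valid')

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    d = x - x.mean()
    return np.sum(d[lag:] * d[:-lag]) / np.sum(d * d)

rng = np.random.default_rng(3)
x = simulate_ma(mu=10.0, theta=[0.8, 0.4], n=20000, rng=rng)
for lag in range(1, 5):
    print('lag %d autocorrelation: %+.3f' % (lag, autocorr(x, lag)))
```

Because each value shares noise terms only with its $q$ nearest neighbours, the autocorrelation of an MA($q$) process drops to roughly zero beyond lag $q$. That cutoff is a common hint for choosing $q$.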

In [0]:
from pandas import read_csv
from matplotlib import pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import numpy as np

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# split dataset
X = series.values
train, test = X[1:len(X)-7], X[len(X)-7:]

# train the moving average model: ARIMA with order (0, 0, 4) is an MA(4) model
model = ARIMA(train, order=(0, 0, 4))
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# make predictions
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=True)
for i in range(len(predictions)):
  print('predicted=%f, expected=%f' % (predictions[i], test[i]))

# calculate the error
error = mean_squared_error(test, predictions)
print('\nTest MSE: %.3f' % error)

# plot results
plt.plot(test, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.title('looking 7 points into the future')
plt.legend(loc='best')
plt.show()

Let's now see how well the model fits the training data.

In [0]:
from pandas import read_csv
from matplotlib import pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
import numpy as np

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# get the series values
X = series.values

# train the moving average model: ARIMA with order (0, 0, 4) is an MA(4) model
model = ARIMA(X, order=(0, 0, 4))
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# make in-sample, one-step-ahead predictions
predictions = model_fit.predict(start=0, end=len(X)-1, dynamic=False)

# calculate the error on the data the model was trained on
error = mean_squared_error(X, predictions)
print('\nIn-sample MSE: %.3f' % error)

# plot results
plt.figure(1)
plt.plot(X, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.xlim((1000,2000))
plt.show()

#  Autoregressive Integrated Moving Average (ARIMA)
The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

**AR**: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.

**I**: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.

**MA**: Moving Average. A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

What does it mean for data to be stationary?

The mean of the series should not be a function of time. The red graph below is not stationary because the mean increases over time.

<img src=https://imgur.com/LjtBXwf.png width="500">

The variance of the series should not be a function of time. This property is known as homoscedasticity. Notice in the red graph the varying spread of data over time.

<img src=https://imgur.com/v2Uye7X.png width="500">

Finally, the covariance of the i th term and the (i + m) th term should not be a function of time. In the following graph, you will notice the spread becomes closer as the time increases. Hence, the covariance is not constant with time for the ‘red series’.

<img src=https://i.imgur.com/6HVlvg2.png width="500">

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function:

**p**: The number of lag observations included in the model, also called the lag order.

**d**: The number of times that the raw observations are differenced, also called the degree of differencing.

**q**: The size of the moving average window, also called the order of moving average.
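The role of the differencing parameter d can be seen on a small synthetic example. The sketch below assumes a series with a linear trend of 0.5 per step plus noise, then applies first-order differencing (d = 1) with `np.diff`.

```python
import numpy as np

t = np.arange(100)
rng = np.random.default_rng(4)
# assumed example: a linear trend of 0.5 per step plus stationary noise
series = 3.0 + 0.5 * t + rng.normal(scale=1.0, size=t.size)

diff1 = np.diff(series, n=1)     # d = 1: x[t] - x[t-1]

half = len(series) // 2
print('raw series, mean of each half:  %.1f  %.1f'
      % (series[:half].mean(), series[half:].mean()))
half_d = len(diff1) // 2
print('differenced, mean of each half: %.2f  %.2f'
      % (diff1[:half_d].mean(), diff1[half_d:].mean()))
```

The raw series has a very different mean in its first and second halves, so it is not stationary. After differencing, both halves average close to the trend slope, and the mean no longer depends on time.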

Some well-known special cases arise naturally or are mathematically equivalent to other popular forecasting models. For example:

An ARIMA(0,1,0) model is given by $X_t = X_{t-1} + \epsilon_t$, which is a random walk.

An ARIMA(0,1,0) with a constant is given by $X_t = c + X_{t-1} + \epsilon_t$, which is a random walk with drift.

An ARIMA(0,0,0) - white noise model.

An ARIMA(0,1,2) - Damped Holt's model.

An ARIMA(0,1,1) - basic exponential smoothing model.

An ARIMA(0,2,2) model is given by $X_t=2X_{t-1}-X_{t-2}+(\alpha +\beta -2)\epsilon_{t-1}+(1-\alpha )\epsilon_{t-2}+\epsilon_t$ — which is equivalent to Holt's linear method with additive errors, or double exponential smoothing.
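The two random-walk cases are easy to simulate, which also shows why d = 1 suits them. This is a minimal sketch with an assumed drift of 0.2 and standard normal noise.

```python
import numpy as np

rng = np.random.default_rng(5)
n, c = 1000, 0.2                       # assumed length and drift
eps = rng.normal(size=n)

random_walk = np.cumsum(eps)           # X_t = X_{t-1} + eps_t
with_drift = np.cumsum(c + eps)        # X_t = c + X_{t-1} + eps_t

# first differencing (d = 1) recovers the underlying noise, shifted by c
recovered = np.diff(with_drift)
print('mean of differenced series: %.3f (drift c = %.1f)' % (recovered.mean(), c))
```

Differencing once turns either walk back into white noise (plus the constant c), which is exactly what an ARIMA(0,1,0) model assumes.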


In [0]:
from pandas import read_csv
from pandas import to_datetime
from matplotlib import pyplot

# load in the shampoo sales dataset
series = read_csv('shampoo.csv', header=0, index_col=0).squeeze('columns')

# the file stores dates like '1-01'; prefix '190' to get 1901-01
series.index = to_datetime('190' + series.index.astype(str), format='%Y-%m')

# show the first few lines of the pandas dataframe
print(series.head())

# plot the data
series.plot()
pyplot.show()

The shampoo sales dataset has a clear trend and is not stationary. This suggests that the time series will require differencing to make it stationary, with a difference order of at least 1.

Let’s also take a quick look at an autocorrelation plot of the time series. This is also built into Pandas. The example below plots the autocorrelation for a large number of lags in the time series.

In [0]:
from pandas import read_csv
from pandas import to_datetime
from matplotlib import pyplot as plt
from pandas.plotting import autocorrelation_plot

series = read_csv('shampoo.csv', header=0, index_col=0).squeeze('columns')

# the file stores dates like '1-01'; prefix '190' to get 1901-01
series.index = to_datetime('190' + series.index.astype(str), format='%Y-%m')
autocorrelation_plot(series)
plt.show()

We can see that there is a positive correlation with the first 10 to 12 lags, which is perhaps significant for the first 5 lags.

A good starting point for the AR parameter of the model may be 5.

Let’s start off with something simple. We will fit an ARIMA model to the entire Shampoo Sales dataset and review the residual errors.

First, we fit an ARIMA(5,1,0) model. This sets the lag value to 5 for autoregression, uses a difference order of 1 to make the time series stationary, and uses a moving-average order of 0.

In [0]:
from pandas import read_csv
from pandas import to_datetime
from pandas import DataFrame
from statsmodels.tsa.arima.model import ARIMA
from matplotlib import pyplot as plt

series = read_csv('shampoo.csv', header=0, index_col=0).squeeze('columns')

# the file stores dates like '1-01'; prefix '190' to get 1901-01
series.index = to_datetime('190' + series.index.astype(str), format='%Y-%m')

# fit model
model = ARIMA(series, order=(5,1,0))
model_fit = model.fit()

# print model summary
print(model_fit.summary())

# plot residual errors
residuals = DataFrame(model_fit.resid)
residuals.plot()
plt.title('residuals')
plt.show()

residuals.plot(kind='kde')
plt.title('density plot of residuals')
plt.show()
print(residuals.describe())

The density plot of the residuals suggests that the errors are roughly Gaussian.

Note that although we used the entire dataset for the analysis above, ideally we would perform this analysis on just the training dataset when developing a predictive model.

## Rolling Forecast ARIMA Model
A rolling forecast is required given the dependence on observations in prior time steps for differencing and the AR model. A crude way to perform this rolling forecast is to re-create the ARIMA model after each new observation is received.

We manually keep track of all observations in a list called history that is seeded with the training data and to which new observations are appended each iteration.
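The bookkeeping behind a rolling forecast is independent of the model used. The sketch below shows the pattern with a hypothetical stand-in forecaster (`drift_forecast`, invented for this example) in place of ARIMA: forecast one step, then append the true observation to the history before forecasting the next.

```python
def walk_forward(train, test, forecast_one_step):
    """Forecast each test point in turn, then reveal the true value to the history."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(forecast_one_step(history))
        history.append(observed)      # the observation, not the prediction, is appended
    return predictions

def drift_forecast(history):
    """Hypothetical stand-in model: last value plus the average past step."""
    step = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + step

train = [10.0, 12.0, 14.0, 16.0]
test = [18.0, 20.0, 22.0]
preds = walk_forward(train, test, drift_forecast)
mse = sum((p - o) ** 2 for p, o in zip(preds, test)) / len(test)
print(preds, mse)
```

Swapping `drift_forecast` for a function that refits a model on `history` and returns its one-step forecast gives the crude rolling scheme described above.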

In [0]:
from pandas import read_csv
from pandas import to_datetime
from matplotlib import pyplot
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# read in data with pandas
series = read_csv('shampoo.csv', header=0, index_col=0).squeeze('columns')

# the file stores dates like '1-01'; prefix '190' to get 1901-01
series.index = to_datetime('190' + series.index.astype(str), format='%Y-%m')

# extract the series values
X = series.values

# determine the train/test set size and split up the data
size = int(len(X) * 0.66)
train, test = X[0:size], X[size:len(X)]

history = [x for x in train]
predictions = list()
for t in range(len(test)):

    # define the model using the history, which includes every observation seen so far
    model = ARIMA(history, order=(6,1,1))

    # fit the model
    model_fit = model.fit()

    # forecast one step ahead (the result is already on the original scale,
    # even though differencing was used)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)

    # observed value
    obs = test[t]

    # append the observed value to the history
    history.append(obs)

    # print the predicted and observed values
    print('predicted=%f, expected=%f' % (yhat, obs))

# determine the error  
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)

# plot
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()

## Configuring an ARIMA Model
The classical approach for fitting an ARIMA model is to follow the **Box-Jenkins Methodology**.

This is a process that uses time series analysis and diagnostics to discover good parameters for the ARIMA model.

In summary, the steps of this process are as follows:

1. Model Identification. Use plots and summary statistics to identify trends, seasonality, and autoregression elements to get an idea of the amount of differencing and the size of the lag that will be required.

2. Parameter Estimation. Use a fitting procedure to find the coefficients of the regression model.

3. Model Checking. Use plots and statistical tests of the residual errors to determine the amount and type of temporal structure not captured by the model.

The process is repeated until a desirable level of fit is achieved on the training or test sets.

# Seasonal Autoregressive Integrated Moving-Average (SARIMA)

## What's wrong with ARIMA?

A limitation of ARIMA is that it does not support seasonal data, that is, a time series with a repeating cycle.

ARIMA expects data that is either not seasonal or has the seasonal component removed, e.g. seasonally adjusted via methods such as seasonal differencing.


## What is SARIMA?

Seasonal Autoregressive Integrated Moving Average, SARIMA or Seasonal ARIMA, is an extension of ARIMA that explicitly supports univariate time series data with a seasonal component.

It adds three new hyperparameters to specify the autoregression (AR), differencing (I) and moving average (MA) for the seasonal component of the series, as well as an additional parameter for the period of the seasonality.

Configuring a SARIMA requires selecting hyperparameters for both the trend and seasonal elements of the series.

## Trend Elements

There are three trend elements that require configuration, just as in ARIMA:
1. p: autoregression order
2. d: difference order
3. q: moving average order

## Seasonal Elements

There are four seasonal elements:
1. P: seasonal autoregression order
2. D: seasonal difference order
3. Q: seasonal moving average order
4. m: the number of time steps in a single seasonal period

Together, the notation for a SARIMA model is SARIMA(p,d,q)(P,D,Q)m.
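The seasonal difference order D works like d, except that each value is compared with the value one full period (m steps) earlier. Here is a minimal sketch on synthetic data, assuming a period of m = 12 and a helper name (`seasonal_difference`) invented for this example:

```python
import numpy as np

def seasonal_difference(x, m):
    """Return x[t] - x[t - m] for a seasonal period of m time steps."""
    x = np.asarray(x, dtype=float)
    return x[m:] - x[:-m]

m = 12                                   # assumed period, e.g. monthly data with a yearly cycle
t = np.arange(10 * m)
rng = np.random.default_rng(6)
series = 5.0 * np.sin(2 * np.pi * t / m) + rng.normal(scale=0.5, size=t.size)

adjusted = seasonal_difference(series, m)
print('std before: %.2f, after: %.2f' % (series.std(), adjusted.std()))
```

The repeating cycle cancels out, leaving only noise. Choosing m to match the true period matters, since differencing at the wrong lag would not remove the cycle.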

## Hyperparameter selection

- autocorrelation function (ACF) and partial autocorrelation function (PACF) plots give a good indication of some parameters
- grid search can be used across the trend and seasonal parameters
https://machinelearningmastery.com/how-to-grid-search-sarima-model-hyperparameters-for-time-series-forecasting-in-python/
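A grid search starts by enumerating candidate configurations. The sketch below builds every combination of trend and seasonal orders from small search ranges. The ranges and the function name `sarima_configs` are assumptions for this example.

```python
from itertools import product

def sarima_configs(p_values, d_values, q_values, P_values, D_values, Q_values, m_values):
    """List every (order, seasonal_order) pair in the given search ranges."""
    configs = []
    for p, d, q, P, D, Q, m in product(p_values, d_values, q_values,
                                       P_values, D_values, Q_values, m_values):
        configs.append(((p, d, q), (P, D, Q, m)))
    return configs

configs = sarima_configs([0, 1, 2], [0, 1], [0, 1, 2], [0, 1], [0, 1], [0, 1], [12])
print(len(configs), 'candidate configurations')
print(configs[0], configs[-1])
```

Each pair could then be passed to a SARIMA model as its order and seasonal order and scored on held-out data. The number of candidates grows multiplicatively, so the ranges are best kept small.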



## SARIMA in Python

Let's go back to the temperature data from earlier.

In [0]:
from pandas import read_csv
from matplotlib import pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
import numpy as np

# import data using pandas (read_csv replaces the removed Series.from_csv)
series = read_csv('daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# get the series values
X = series.values

# define model configuration
my_order = (1, 1, 1)
my_seasonal_order = (1, 1, 1, 10)

# define the model
model = SARIMAX(X, order=my_order, seasonal_order=my_seasonal_order)

# fit the model
model_fit = model.fit()
print('Coefficients: %s' % model_fit.params)

# make predictions
predictions = model_fit.predict(start=0, end=len(X)-1, dynamic=False)

# calculate the error
error = mean_squared_error(X, predictions)
print('\nTest MSE: %.3f' % error)

# plot results
plt.figure(1)
plt.plot(X, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.title('full data set')

# zoom in on a section
plt.figure(2)
plt.plot(X, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.xlim((1000,2000))
plt.title('zoomed')

# zoom in further
plt.figure(3)
plt.plot(X, color='blue', label='test')
plt.plot(predictions, color='red', label='prediction')
plt.legend(loc='best')
plt.xlim((1400,1500))
plt.title('zoomed more')

plt.show()

**Note**

The above example sets m=10, which does not match the yearly cycle in the data, so it behaves much like plain ARIMA. The matching value, m=365 for a year's worth of daily data, causes the code to run very slowly.

Take a look at the below example to get an idea of predictive power of SARIMA

In [0]:
# SARIMAX example with seasonal component
from statsmodels.tsa.statespace.sarimax import SARIMAX
from matplotlib import pyplot as plt
from random import random
import numpy as np

# contrived dataset with a period of 10 data points
X = [np.sin(x*2*np.pi/10) + random() for x in range(1, 100)]

# define model configuration
my_order = (1, 1, 1)
my_seasonal_order = (1, 1, 1, 10) # set m=10 because the period is 10 data points

# define the model
model = SARIMAX(X, order=my_order, seasonal_order=my_seasonal_order)

# fit the model
model_fit = model.fit()

# make predictions
num_future = 50
predictions = model_fit.predict(start=0, end=len(X)-1, dynamic=False)
future = model_fit.predict(start=len(X), end=len(X)+num_future, dynamic=True)

total = np.append(predictions, future)

# plot results
plt.figure(1)
plt.plot(X, color='blue', label='test')
plt.plot(total, color='red', label='prediction')
plt.legend(loc='best')
plt.show()

# Singular Spectrum Analysis

Singular spectrum analysis (SSA) is a method for finding structure in time series data. In its basic form, SSA involves no statistical modeling, and therefore inference about the significance of the suggested signal features cannot be made.

In time series analysis, SSA is a nonparametric spectral estimation method. It combines elements of classical time series analysis, multivariate statistics, multivariate geometry, dynamical systems and signal processing. Its roots lie in spectral decomposition of time series and random fields.

SSA can be an aid in the decomposition of time series into a sum of components, each having a meaningful interpretation. 

<img src=https://raw.githubusercontent.com/jojker/PML_Workshops/master/Summer%202019/Day%205%20-%20Goal%204%20-%20Scientific%20Insights%20from%20Learned%20Models/Ex%206%20-%20simple%20TS%20prediction/SSA_mod.jpg width="1000">

The figure shows a signal decomposed into its 21 most significant components.

More details on SSA are available on Wikipedia:

https://en.wikipedia.org/wiki/Singular_spectrum_analysis
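The core of basic SSA fits in a few lines of NumPy. The sketch below embeds the series in a trajectory matrix of lagged windows, takes its SVD, and rebuilds a series from selected components by averaging along anti-diagonals. It illustrates the general technique under assumed names (`ssa_decompose`, `ssa_reconstruct`) and is not the implementation used by the library in the stock example that follows.

```python
import numpy as np

def ssa_decompose(y, dim):
    """Embed y into a trajectory matrix with window `dim` and take its SVD."""
    y = np.asarray(y, dtype=float)
    k = len(y) - dim + 1
    # each column is a lagged window of the series, so traj[i, j] = y[i + j]
    traj = np.column_stack([y[i:i + dim] for i in range(k)])
    return np.linalg.svd(traj, full_matrices=False)

def ssa_reconstruct(u, s, vt, components):
    """Rebuild a series from the chosen components by averaging anti-diagonals."""
    dim, k = u.shape[0], vt.shape[1]
    part = (u[:, components] * s[components]) @ vt[components, :]
    total = np.zeros(dim + k - 1)
    counts = np.zeros(dim + k - 1)
    for i in range(dim):
        total[i:i + k] += part[i, :]   # entry (i, j) belongs to time index i + j
        counts[i:i + k] += 1
    return total / counts

# assumed test signal: trend plus oscillation plus noise
x = np.linspace(0, 5, 300)
rng = np.random.default_rng(7)
y = 2 * x + 2 * np.sin(5 * x) + 0.5 * rng.normal(size=x.size)

u, s, vt = ssa_decompose(y, dim=30)
smooth = ssa_reconstruct(u, s, vt, [0, 1, 2, 3])
full = ssa_reconstruct(u, s, vt, list(range(len(s))))
print('residual std with 4 components: %.2f' % (y - smooth).std())
```

Using every component reproduces the original series exactly. Keeping only the leading few gives a smoothed version, and the remainder is treated as noise.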

## Example with Stock Market Data

In [0]:
# install quandl
!pip install quandl

# snag some python libraries from github

# plotting functionality (mpl_utils.py)
!wget https://raw.githubusercontent.com/dmarienko/chaos/master/mpl_utils.py
  
# SSA functions (ssa_core.py)
!wget https://raw.githubusercontent.com/dmarienko/chaos/master/ssa_core.py

In [0]:
from ssa_core import ssa, ssa_predict, ssaview, inv_ssa, ssa_cutoff_order
from mpl_utils import set_mpl_theme

import matplotlib.pylab as plt
import quandl
import pandas as pd
import datetime
from datetime import timedelta
from dateutil import parser


%matplotlib inline

# customize mpl a bit
set_mpl_theme('light')


## some handy functions

# plotting function
def fig(w=16, h=5, dpi=96, facecolor=None, edgecolor=None):
    return plt.figure(figsize=(w, h), dpi=dpi, facecolor=facecolor, edgecolor=edgecolor)

# mean absolute percent error
def mape(f, t):
    return 100*((f - t)/t).abs().sum()/len(t)

# mean absolute error
def mae(f, t):
    return ((f - t)).abs().sum()/len(t)

Load adjusted close prices for Microsoft (MSFT) from Quandl.

In [0]:
symbol = 'MSFT'
data = quandl.get('WIKI/%s' % symbol, start_date='2012-01-01', end_date='2017-02-01')
closes = data['Adj. Close'].rename('close')

Split the series into train and test intervals and see how it looks on a chart.

In [0]:
test_date = '2017-01-01'

train_d = closes[:test_date]
test_d = closes[test_date:]

fig(10, 3)
plt.plot(train_d, label='Train')
plt.plot(test_d, 'r', label='Test')
plt.title('%s adjusted daily close prices' % symbol)
plt.legend()

We can see how SSA decomposes the original series into trend components and noise.

The charts show the original series, the series reconstructed from the first n components, and the residuals.

In statistics, a Q–Q (quantile-quantile) plot is a probability plot: a graphical method for comparing two probability distributions by plotting their quantiles against each other.
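To show what a normal Q–Q plot actually plots, the sketch below pairs the sorted sample values with the matching quantiles of a standard normal distribution. The helper names and the two test samples are assumptions for illustration. The correlation between the two sets of quantiles is used here as a rough summary of how straight the plot would look.

```python
import numpy as np
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = np.sort(np.asarray(sample, dtype=float))
    n = len(observed)
    # plotting positions (i - 0.5) / n avoid the infinite 0% and 100% quantiles
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    theoretical = np.array([std_normal.inv_cdf(p) for p in probs])
    return theoretical, observed

def qq_correlation(sample):
    """Correlation between theoretical and observed quantiles (near 1 for Gaussian data)."""
    theoretical, observed = normal_qq_points(sample)
    return np.corrcoef(theoretical, observed)[0, 1]

rng = np.random.default_rng(8)
gaussian = rng.normal(loc=0.0, scale=2.0, size=500)
skewed = rng.exponential(scale=2.0, size=500)

r_gaussian = qq_correlation(gaussian)
r_skewed = qq_correlation(skewed)
print('gaussian sample: %.3f, skewed sample: %.3f' % (r_gaussian, r_skewed))
```

Points from a Gaussian sample fall close to a straight line, while a skewed sample bends away from it. The same reading applies to the residuals in the charts.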


**What happens if you increase the number of components in the reconstruction?**

ex: ssaview(train_d.values, 120, [0,1,2,3,4,5,6,7,8,9,10])


**What happens if you change the embedding dimension?**

embedding dimension is the window size

ex: ssaview(train_d.values, 30, [0,1,2,3])

ex: ssaview(train_d.values, 480, [0,1,2,3])

In [0]:
fig(12,8)
ssaview(train_d.values, 120, [0,1,2,3])

# function ssaview()
"""
    Visualising tools for singular spectrum analysis

    Example:
    -------
    >>> import numpy as np
    >>>
    >>> x = np.linspace(0, 5, 1000)
    >>> y = 2*x + 2*np.sin(5*x) + 0.5*np.random.randn(1000)
    >>> ssaview(y, 15, [0,1])

    :param y: series
    :param dim: the embedding dimension
    :param k: components indexes for reconstruction
"""

Plot the residuals as a histogram to see whether they look Gaussian.

In [0]:
pc, _, v = ssa(train_d.values, 120)

#function ssa()
"""
    Singular Spectrum Analysis decomposition for a time series

    Example:
    -------
    >>> import numpy as np
    >>>
    >>> x = np.linspace(0, 5, 1000)
    >>> y = 2*x + 2*np.sin(5*x) + 0.5*np.random.randn(1000)
    >>> pc, s, v = ssa(y, 15)

    :param y: time series (array)
    :param dim: the embedding dimension
    :return: (pc, s, v) where
             pc is the matrix with the principal components of y
             s is the vector of the singular values of y given dim
             v is the matrix of the singular vectors of y given dim

"""

# series reconstruction for a given SSA decomposition using a vector of components
reconstructed = inv_ssa(pc, v, [0,1,2,3])

# actual values minus reconstructed values
noise = train_d.values - reconstructed

# plot the noise as a histogram
plt.hist(noise, 50);

It's possible to reduce the embedding space dimension by finding the minimal lag.

In [0]:
MAX_LAG_NUMBER = 120
n_co = ssa_cutoff_order(train_d.values, dim=MAX_LAG_NUMBER, show_plot=True)

# function ssa_cutoff_order()
"""
    Tries to find best cutoff for number of order when increment changes of informational entropy
    becomes little and the effective information saturates.

    :param x: series
    :param dim: embedding dimensions (200 by default)
    :param cutoff_pctl: percentile of changes (75%)
    :param show_plot: true if we need to see informational curve
    :return: cutoff number
"""

Using this minimal lag, we forecast the price and plot the results.

In [0]:
days_to_predict = 15

# using the best possible lag determined above and 8 reconstruction components
# predict the stock values and compare to the test set
forecast = ssa_predict(train_d.values, n_co, list(range(8)), days_to_predict, 1e-5)

# function ssa_predict()
"""
    Series data prediction based on SSA
    
    :param x: series to be predicted
    :param dim: the embedding dimension
    :param k: components indexes for reconstruction
    :param n_forecast: number of points to forecast
    :param e: minimum value to ensure convergence
    :param max_iter: maximum number of iterations
    :return: forecasted series
"""

In [0]:
# plot the predicted values for the test set

fig(10, 4)

prev_ser = closes[datetime.date.isoformat(parser.parse(test_date) - timedelta(120)):test_date]
plt.plot(prev_ser, label='Train Data')

test_d = closes[test_date:]
f_ser = pd.DataFrame(data=forecast, index=test_d.index[:days_to_predict], columns=['close'])
orig = pd.DataFrame(test_d[:days_to_predict])

plt.plot(orig, label='Test Data')
plt.plot(f_ser, 'r-', marker='.', label='Forecast')
plt.ylabel('$',rotation=0)
plt.legend()
plt.title('Forecasting %s for %d days, MAPE = %.2f%%' % (symbol, days_to_predict, mape(f_ser, orig)));

# Dynamic Mode Decomposition (DMD)


## Toy dataset
https://mathlab.github.io/PyDMD/build/html/tutorial1dmd.html

In this tutorial we will show how to apply dynamic mode decomposition on snapshots collected during the evolution of a generic system. We present a very simple system since the main purpose of this tutorial is to show the capabilities of the algorithm and the package interface.

First of all we import the DMD class from the pydmd package, we set matplotlib for the notebook and we import numpy.

In [0]:
!pip install pydmd

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
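# pseudo-inverse helper with a small cutoff for near-zero singular values
# (scipy.linalg.pinv2 is not available in recent SciPy releases, so NumPy's pinv is used)
def pinv(x): return np.linalg.pinv(x, rcond=10*np.finfo(float).eps)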

from pydmd import DMD

We create the input data by summing two different functions:

$f_1(x,t)=\mathrm{sech}(x+3)\exp(i2.3t)$

$f_2(x,t)=2\,\mathrm{sech}(x)\tanh(x)\exp(i2.8t)$

In [0]:
def f1(x,t): 
    return 1./np.cosh(x+3)*np.exp(2.3j*t)

def f2(x,t):
    return 2./np.cosh(x)*np.tanh(x)*np.exp(2.8j*t)

tnumber=512
tmax=8*np.pi
x = np.linspace(-5, 5, 128)
t = np.linspace(0, tmax, tnumber)

xgrid, tgrid = np.meshgrid(x, t)

X1 = f1(xgrid, tgrid)
X2 = f2(xgrid, tgrid)
X = X1 + X2
Xtrain=X[:int(tnumber/2),:]
Xtest=X[int(tnumber/2):,:]

The plots below represent these functions and the dataset.

In [0]:
titles = ['$f_1(x,t)$', '$f_2(x,t)$', '$f$']
data = [X1, X2, X]

fig = plt.figure(figsize=(17,6))
for n, title, d in zip(range(131,134), titles, data):
    plt.subplot(n)
    plt.pcolor(tgrid, xgrid, d.real)
    plt.title(title)
plt.colorbar()
plt.show()

The temporal snapshots are stored in the rows of the input matrix, whereas DMD expects them arranged by columns, so we transpose the matrix when fitting. We fit on the training half only and ask for a rank-2 decomposition (svd_rank=2), since the data is the sum of two functions.

In [0]:
dmd = DMD(svd_rank=2)
dmd.fit(Xtrain.T)
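
To make the fit step less of a black box, here is a minimal NumPy sketch of the standard exact DMD algorithm, applied to a smaller version of the same toy signal. It is an illustration only and not PyDMD's implementation: the helper name exact_dmd, the grid sizes, and the variable names are assumptions made for this example.

```python
import numpy as np

def exact_dmd(snapshots, rank):
    """Hypothetical helper: exact DMD of a (space x time) snapshot matrix."""
    # Pair each snapshot (column) with its successor in time.
    X0, X1 = snapshots[:, :-1], snapshots[:, 1:]
    # Truncated SVD of the first snapshot set.
    U, s, Vh = np.linalg.svd(X0, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank, :]
    # B = X1 V S^-1 (dividing by s scales column j by 1/s[j]).
    B = (X1 @ Vh.conj().T) / s
    # Low-dimensional operator U^H X1 V S^-1 and its eigendecomposition.
    A_tilde = U.conj().T @ B
    eigs, W = np.linalg.eig(A_tilde)
    # Exact DMD modes, one per column.
    modes = B @ W
    return eigs, modes

# Same toy signal as in this tutorial, on a smaller grid (sizes are assumptions).
x = np.linspace(-5, 5, 64)
t = np.linspace(0, 4 * np.pi, 128)
xg, tg = np.meshgrid(x, t)
data = (1 / np.cosh(xg + 3) * np.exp(2.3j * tg)
        + 2 / np.cosh(xg) * np.tanh(xg) * np.exp(2.8j * tg))

# Snapshots are rows of `data`, so transpose to get them in columns.
eigs, modes = exact_dmd(data.T, rank=2)
dt = t[1] - t[0]
freqs = np.sort(np.angle(eigs) / dt)
print(freqs)
```

Dividing the eigenvalue angles by the time step should recover frequencies close to 2.3 and 2.8, the exponents used in $f_1$ and $f_2$.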

The dmd object contains the principal information about the decomposition:

*   the attribute modes is a 2D numpy array whose columns are the identified low-rank structures;
*   the attribute dynamics is a 2D numpy array whose rows give the time evolution of each mode;
*   the attribute eigs holds the eigenvalues of the low-dimensional operator;
*   the attribute reconstructed_data holds the approximated system evolution.

Moreover, some helpful methods for the graphical representation are provided.

The eigenvalues tell us whether the modes are stable: an eigenvalue on the unit circle gives a mode whose amplitude stays constant over time, while an eigenvalue inside or outside the unit circle gives a mode that converges to zero or diverges, respectively. From the following output and plot, we can see that the two modes are stable.

In [0]:
for eig in dmd.eigs:
    print('Eigenvalue {}: distance from unit circle {}'.format(eig, np.abs(np.abs(eig) - 1)))

dmd.plot_eigs(show_axes=True, show_unit_circle=True)
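
The stability rule follows from the fact that, after n time steps, the amplitude of a mode is scaled by the modulus of its eigenvalue raised to the power n. The sketch below illustrates this with three made-up eigenvalues; the helper classify_eigenvalue and its tolerance are assumptions for this example, not part of PyDMD.

```python
import numpy as np

def classify_eigenvalue(eig, tol=1e-6):
    """Hypothetical helper: label a discrete-time eigenvalue by its modulus."""
    modulus = np.abs(eig)
    if modulus > 1 + tol:
        return 'diverging'
    if modulus < 1 - tol:
        return 'converging'
    return 'stable'

# Three illustrative eigenvalues: on, inside, and outside the unit circle.
examples = [np.exp(0.1j), 0.9 * np.exp(0.1j), 1.1 * np.exp(0.1j)]
labels = [classify_eigenvalue(eig) for eig in examples]
# Amplitude scaling of each mode after 100 time steps.
scalings = [np.abs(eig) ** 100 for eig in examples]

for label, scaling in zip(labels, scalings):
    print(label, scaling)
```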

We can plot the modes and the dynamics:

In [0]:
for mode in dmd.modes.T:
    plt.plot(x, mode.real)
    plt.title('Modes')
plt.show()

for dynamic in dmd.dynamics:
    plt.plot(t[:int(tnumber/2)], dynamic.real)
    plt.title('Dynamics')
plt.show()

Finally, we can reconstruct the original dataset as the product of modes and dynamics. We plot the evolution of each mode to emphasize their similarity with the input functions and we plot the reconstructed data.

In [0]:
fig = plt.figure(figsize=(17,6))

for n, mode, dynamic in zip(range(131, 133), dmd.modes.T, dmd.dynamics):
    plt.subplot(n)
    # time on the horizontal axis and space on the vertical axis, as in the other plots
    plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], (mode.reshape(-1, 1).dot(dynamic.reshape(1, -1))).real.T)
    
plt.subplot(133)
plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], dmd.reconstructed_data.T.real)
plt.colorbar()

plt.show()

We can also plot the absolute error between the data reconstructed from dimensionality reduction and the original one.

In [0]:
# absolute value of the (complex) difference between the training data and its reconstruction
plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], np.abs(Xtrain-dmd.reconstructed_data.T))
plt.colorbar()

The reconstructed system looks almost equal to the original one: the dynamic mode decomposition made possible the identification of the meaningful structures and the complete reconstruction of the system using only the collected snapshots.

# Now we can predict the future states

In [0]:
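def predict_one_step(X,dmd):
  # advance the state X by one time step with the operator modes * diag(eigs) * pinv(modes)
  # the public modes/eigs attributes are used instead of the private _modes/_eigs
  if dmd.modes.shape[0]!=X.shape[0]:
    X=X.T
  adjoint_modes = pinv(dmd.modes)
  return np.linalg.multi_dot( [dmd.modes, np.diag(dmd.eigs), adjoint_modes, X] )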
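
Applying the one-step operator k times is equivalent to raising the eigenvalues to the power k, because the pseudo-inverse of the modes times the modes is the identity when the modes are linearly independent. The sketch below checks this on random stand-ins; the arrays modes and eigs here are made-up assumptions for illustration, not the fitted dmd.modes and dmd.eigs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_space, rank, k = 6, 2, 5

# Hypothetical stand-ins for dmd.modes (columns are modes) and dmd.eigs.
modes = rng.standard_normal((n_space, rank)) + 1j * rng.standard_normal((n_space, rank))
eigs = np.exp(1j * np.array([0.3, 0.7]))
modes_pinv = np.linalg.pinv(modes)

# A state that lies in the span of the modes.
state = modes @ np.array([1.0 + 0.5j, -2.0])

# k applications of the one-step operator modes * diag(eigs) * pinv(modes) ...
one_step = modes @ np.diag(eigs) @ modes_pinv
stepped = state.copy()
for _ in range(k):
    stepped = one_step @ stepped

# ... agree with one application using the eigenvalues raised to the power k.
direct = modes @ (eigs ** k * (modes_pinv @ state))

print(np.allclose(stepped, direct))
```

This is why a k-step look-ahead can be computed either by looping over predict_one_step or in a single product.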

In [0]:
# we initialize with the first time point of the test set and step onwards from there
# Y must be complex: a real array would discard the imaginary part of every stored prediction
Y=np.zeros(Xtest.shape, dtype=complex)
initialization=Xtest[0,:]
for tndx in range(int(tnumber/2)):
  Y[tndx,:]=predict_one_step(initialization,dmd)
  initialization=Y[tndx,:]


fig = plt.figure(figsize=(6,6))
plt.subplot(111)
plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], Y.real)
plt.colorbar()

Predicting many steps ahead from a single initialization only works if the full complex state is carried from one step to the next. If the predictions are stored in a real-valued array, the imaginary part is dropped at every step and the rollout no longer captures the dynamics. What if we instead predict N points ahead while updating the initialization from the current ground truth? How far ahead can we look?

In [0]:
# now we do a running prediction where we continuously input the ground truth
look_ahead_amount=10
half=int(tnumber/2)
# complex array so that storing a prediction does not discard its imaginary part
Y=np.zeros(Xtest.shape, dtype=complex)
for tndx in range(half-look_ahead_amount):
  # start from the ground truth at tndx and step look_ahead_amount points ahead
  initialization=Xtest[tndx,:]
  for step in range(look_ahead_amount):
    initialization = predict_one_step(initialization,dmd)
  # the result is the prediction for time index tndx+look_ahead_amount
  Y[tndx+look_ahead_amount,:] = initialization
  
  
fig = plt.figure(figsize=(12,6))
plt.subplot(131)
plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], Y.real)
plt.colorbar()
plt.subplot(132)
plt.pcolor(tgrid[:int(tnumber/2),:], xgrid[:int(tnumber/2),:], Xtest.real)
plt.colorbar()
plt.subplot(133)
plt.pcolor(tgrid[11:int(tnumber/2),:], xgrid[11:int(tnumber/2),:], abs((Xtest[11:,:]-Y[11:,:]).real))
plt.colorbar()

Looking ahead 10 points from the current value does a decent job. How far can we extend this?

In [0]:
# let's step through several values for the look ahead, up to 1/2 the length of Xtest
n_look_ahead=int(tnumber/4)
# 1-D array holding one median relative error per look-ahead amount
prctiles=np.zeros(n_look_ahead)
for look_ahead_amount in range(n_look_ahead):
  Y=np.zeros(Xtest.shape, dtype=complex)
  for tndx in range(int(tnumber/2)):
    # start look_ahead_amount points before the target so the prediction lands on Xtest[tndx]
    initialization=X[tndx+int(tnumber/2)-look_ahead_amount,:]
    for step in range(look_ahead_amount):
      initialization = predict_one_step(initialization,dmd)
    Y[tndx,:] = initialization
  # median (50th percentile) relative error of the real part
  prctiles[look_ahead_amount]=np.percentile(np.divide(abs((Xtest-Y).real),abs(Xtest.real)),50)


# x and y must be 1-D: with arrays of shape (1, N) plt.plot draws N one-point lines, which are invisible
fig=plt.figure()
plt.subplot(111)
plt.plot(np.arange(prctiles.size),prctiles,linewidth=5.0)
plt.title('error')
plt.show()
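
The error measure used in the loop above is the median relative error of the real part. As a small sanity check, the sketch below isolates it in a helper and evaluates it on made-up numbers; the name median_relative_error and the sample values are assumptions for illustration.

```python
import numpy as np

def median_relative_error(truth, prediction):
    """Hypothetical helper: median of |truth - prediction| / |truth| on the real parts."""
    truth = np.real(truth)
    prediction = np.real(prediction)
    return np.median(np.abs(truth - prediction) / np.abs(truth))

# Made-up values: relative errors are 0.1, 0.1, 0.0 and 0.25.
truth = np.array([1.0, 2.0, 4.0, 8.0])
prediction = np.array([1.1, 1.8, 4.0, 10.0])
error = median_relative_error(truth, prediction)
print(error)
```

Using the median rather than the mean keeps the summary from being dominated by points where the true value is close to zero and the relative error becomes very large.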