<a href="https://colab.research.google.com/github/matinallfather/speech2text/blob/master/notebooks/M4_timeseries_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os
import sys

import seaborn as sns
import altair as alt

# Introduction to deep learning (transformer-based) time series forecasting

* **[NeuralProphet (Facebook)](https://arxiv.org/abs/2111.15397?fbclid=IwAR2vCkHYiy5yuPPjWXpJgAJs-uD5NkH4liORt1ch4a6X_kmpMqagGtXyez4)**
  * Hybrid forecasting framework based on PyTorch
  * Local context is introduced with auto-regression and covariate modules, which can be configured as classical linear regression or as Neural Networks
  * Otherwise, NeuralProphet retains the design philosophy of Prophet and provides the same basic model components.
* **[N-BEATS (ElementAI)](https://arxiv.org/abs/1905.10437):** Essentially, N-BEATS is a pure deep learning architecture based on a deep stack of ensembled feed forward networks that are also stacked by interconnecting backcast and forecast links.
  * Easy to use: The model is simple to understand and has a modular structure (blocks and stacks).
  * Multiple time-series: The model has the ability to generalize on many time-series.
* **[N-HiTS (ElementAI)](https://arxiv.org/pdf/2201.12886.pdf):** Extension of N-BEATS model.
  * Improves the accuracy of the predictions and reduces the computational cost. This is achieved by the model sampling the time series at different rates.
  * Multi-rate signal sampling: the model can learn short-term and long-term effects in the series.
* **[DeepAR (Amazon)](https://arxiv.org/abs/1704.04110?context=stat.ML):** A novel time series model that combines both deep-learning and autoregressive characteristics.
  * Multiple time series: DeepAR works really well with multiple time series: A global model is built by using multiple time series with slightly different distributions.
  * Rich set of inputs: Apart from historical data, DeepAR also allows the use of known future time sequences (a characteristic of auto-regressive models) and extra static attributes for series.
  * Automatic scaling: DeepAR scales the autoregressive input under the hood, so there is no need to scale it manually.
* **[Spacetimeformer](https://arxiv.org/abs/2109.12218):** Considers both temporal and spatial relationships.
  * Interesting when dealing with geospatial data, but I have little experience there.
* **[Temporal Fusion Transformer](https://arxiv.org/abs/1912.09363):** Temporal Fusion Transformer (TFT) is a transformer-based time series forecasting model published by Google.
  * Multiple time series: Like the aforementioned models, TFT supports building a model on multiple, heterogeneous time series.
  * Rich number of features: TFT supports 3 types of features: i) time-dependent data with known inputs into the future ii) time-dependent data known only up to the present and iii) categorical/static variables, also known as time-invariant features.
  * Interpretability: TFT gives much emphasis on interpretability. Specifically, by taking advantage of the Variable Selection component, the model can successfully measure the impact of each feature.
  * Prediction Intervals: Similar to DeepAR, TFT outputs a prediction interval along with the predicted values, by using quantile regression.

# Temporal Fusion Transformers

## Introduction

Temporal Fusion Transformer (TFT) is an **attention-based Deep Neural Network**, optimized for great performance and interpretability.

**Advantages and novelties:**

* Rich features:
  1. temporal data with known inputs into the future
  2. temporal data known only up to the present and
  3. exogenous categorical/static variables, also known as time-invariant features.
* Heterogeneous time series: Supports training on multiple time series, splits processing into 2 parts: local processing which focuses on the characteristics of specific events and global processing which captures the collective characteristics of all time series.
* Multi-horizon forecasting: Supports multi-step predictions. Apart from the actual prediction, TFT also outputs prediction intervals, by using the quantile loss function.
* Interpretability: At its core, TFT is a transformer-based architecture. It introduces a novel multi-head attention mechanism built on self-attention which, when analyzed, provides extra insight into feature importances.
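
The quantile loss mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the pinball loss averaged over all quantiles, not the exact code used by any TFT implementation; the function name `quantile_loss` and the toy numbers are assumptions made for this example.

```python
import numpy as np

def quantile_loss(y_true, y_pred, quantiles):
    """Average pinball loss; y_pred has one column per quantile."""
    y_true = np.asarray(y_true, dtype=float)[:, None]   # shape (n, 1)
    y_pred = np.asarray(y_pred, dtype=float)             # shape (n, n_quantiles)
    q = np.asarray(quantiles, dtype=float)[None, :]      # shape (1, n_quantiles)
    error = y_true - y_pred
    # under-prediction is weighted by q, over-prediction by (1 - q)
    return float(np.mean(np.maximum(q * error, (q - 1.0) * error)))

# two time steps, three predicted quantiles each (made-up values)
loss = quantile_loss(
    y_true=[10.0, 12.0],
    y_pred=[[8.0, 10.0, 13.0], [11.0, 12.0, 14.0]],
    quantiles=[0.1, 0.5, 0.9],
)
print(loss)
```

Because low quantiles are penalized mostly for over-prediction and high quantiles mostly for under-prediction, minimizing this loss pushes the predicted columns apart into an interval around the median.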

![](https://blogger.googleusercontent.com/img/a/AVvXsEjn-GEpuwiBa4Od21FBnTST8-z2jAgyw3rq68AYtrBosFLBgIaFnLC2NV8hwlj8xiuU4Bc5ZKNHrDPldINdgkr8Y2TmekuDp0oLKq9yYCrpooZfwpwKT9MVwQ11LGsXqBckgiPAxoWRdvxAE3RoRn4BHxVhJmnQkZT-w6DdYXEA3yP0xUSdbYDITSgOjQ=w400-h314)

## Format and structure

 TFT is designed to efficiently build feature representations for each input type (i.e., static, known, or observed inputs) for high forecasting performance. The major constituents of TFT (shown below) are:

1. **Gating mechanisms** to skip over any unused components of the model (learned from the data), providing adaptive depth and network complexity to accommodate a wide range of datasets.
2. **Variable selection networks** to select relevant input variables at each time step. While conventional DNNs may overfit to irrelevant features, attention-based variable selection can improve generalization by encouraging the model to anchor most of its learning capacity on the most salient features.
3. **Static covariate encoders** integrate static features to control how temporal dynamics are modeled. Static features can have an important impact on forecasts, e.g., a store location could have different temporal dynamics for sales (e.g., a rural store may see higher weekend traffic, but a downtown store may see daily peaks after working hours).
4. **Temporal processing** to learn both long- and short-term temporal relationships from both observed and known time-varying inputs. A sequence-to-sequence layer is employed for local processing as the inductive bias it has for ordered information processing is beneficial, whereas long-term dependencies are captured using a novel interpretable multi-head attention block. This can cut the effective path length of information, i.e., any past time step with relevant information (e.g. sales from last year) can be focused on directly.
5. **Prediction intervals** show quantile forecasts to determine the range of target values at each prediction horizon, which help users understand the distribution of the output, not just the point forecasts.
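
To make the gating idea from the first item more concrete, here is a simplified sketch of a gated skip connection. It only shows how a sigmoid gate can switch a block on or off; the gated residual network in the TFT paper additionally learns the gate from data and applies layer normalization. The names `gated_skip`, `transformed` and the gate values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, transformed, gate_logits):
    """Residual connection with a GLU-style gate: x + sigmoid(g) * transformed."""
    return x + sigmoid(gate_logits) * transformed

x = np.array([1.0, -2.0, 0.5])
transformed = np.array([4.0, 4.0, 4.0])  # output of some non-linear block (made up)

closed = gated_skip(x, transformed, np.full(3, -20.0))  # gate ~ 0: block is skipped
opened = gated_skip(x, transformed, np.full(3, 20.0))   # gate ~ 1: block fully used
print(closed, opened)
```

With the gate closed the layer reduces to the identity, which is what lets the network adapt its effective depth to the dataset.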

![](https://blogger.googleusercontent.com/img/a/AVvXsEiwv46s-50F64kN7H1UkpdWcu2-nhYXULnFFp4kKzDvJsVYJ6FiD8D6HZEAV_f03LRCJZzseotQySVgVTNWqhvcuSMtRCwnmJpkOrXw_G14sUhAx5P8qUXJjaLkAGnW4Pcgvm0o3PvvqDuj6s1koStjlIc10NVBNQzmY0tmjdNFCUJIhyq_Q2R3W5YgQg=w640-h456)

# TFT Implementation high level (DARTS)

[Darts](https://unit8co.github.io/darts/) is a Python library for easy manipulation and forecasting of time series. It contains a variety of models, from classics such as ARIMA to deep neural networks. The models can all be used in the same way, using fit() and predict() functions, similar to scikit-learn. The library also makes it easy to backtest models, combine the predictions of several models, and take external data into account.

Darts supports both univariate and multivariate time series and models. The ML-based models can be trained on potentially large datasets containing multiple time series, and some of the models offer a rich support for probabilistic forecasting.

While there is also a standalone version of TFT (e.g., the [standalone pytorch implementation](https://pypi.org/project/tft-torch/)), we will use the Darts implementation for this example, since it eases the integration of TFT into your traditional forecasting pipeline.

In [None]:
!pip install darts

In [None]:
from darts import TimeSeries, concatenate
from darts.dataprocessing.transformers import Scaler
from darts.models import TFTModel
from darts.metrics import mape, rmse

from darts.utils.statistics import check_seasonality, plot_acf
from darts.utils.timeseries_generation import datetime_attribute_timeseries
from darts.utils.likelihood_models import QuantileRegression

import warnings

warnings.filterwarnings("ignore")

Darts’ TFTModel incorporates the following main components from the original Temporal Fusion Transformer (TFT) architecture:

* gating mechanisms: skip over unused components of the model architecture
* variable selection networks: select relevant input variables at each time step.
* temporal processing of past and future input with LSTMs (long short-term memory)
* multi-head attention: captures long-term temporal dependencies
* prediction intervals: per default, produces quantile forecasts instead of deterministic values

## Training

TFTModel can be trained with past and future covariates. It is trained sequentially on fixed-size chunks consisting of an encoder and a decoder part:

* encoder: past input with input_chunk_length
  * past target: mandatory
  * past covariates: optional
* decoder: future known input with output_chunk_length
  * future covariates: mandatory (if none are available, consider TFTModel’s optional arguments add_encoders or add_relative_index)

In each iteration, the model produces a quantile prediction of shape (output_chunk_length, n_quantiles) on the decoder part.
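
The chunking described above can be pictured as a sliding window over the series. The following is only a sketch of that idea in NumPy, assuming a univariate series without covariates; `make_chunks` is a hypothetical helper and not part of Darts.

```python
import numpy as np

def make_chunks(values, input_chunk_length, output_chunk_length):
    """Split a 1-D series into aligned (encoder, decoder) windows."""
    values = np.asarray(values)
    total = input_chunk_length + output_chunk_length
    encoder, decoder = [], []
    for start in range(len(values) - total + 1):
        split = start + input_chunk_length
        encoder.append(values[start:split])          # past target seen by the encoder
        decoder.append(values[split:start + total])  # values the decoder must predict
    return np.stack(encoder), np.stack(decoder)

enc, dec = make_chunks(np.arange(10), input_chunk_length=4, output_chunk_length=2)
print(enc.shape, dec.shape)
print(enc[0], dec[0])
```

Each training sample pairs an encoder window with the decoder window that directly follows it.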

## Forecast

By default, TFTModel produces probabilistic quantile forecasts using QuantileRegression. This gives the range of likely target values at each prediction step. Most deep learning models in Darts, including TFTModel, support QuantileRegression and 16 other likelihoods to produce probabilistic forecasts by setting likelihood=MyLikelihood() at model creation.
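
A probabilistic forecast can be thought of as many sampled trajectories from which intervals are read off as quantiles. The sketch below mimics this with random numbers standing in for model samples; the array layout (time steps by samples) and the chosen quantiles are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
horizon, num_samples = 12, 200

# stand-in for sampled forecasts: one row per time step, one column per sample
samples = rng.normal(
    loc=np.linspace(10, 20, horizon)[:, None],
    scale=1.0,
    size=(horizon, num_samples),
)

# an 80% interval plus the median, computed across the sample axis
lower, median, upper = np.quantile(samples, [0.1, 0.5, 0.9], axis=1)
print(lower.shape, median.shape, upper.shape)
```

Plotting `lower` and `upper` as a band around `median` gives the kind of interval plot shown later in this notebook.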

## Toy example (Air Passengers)

Adapted from the [Darts package tutorial](https://unit8co.github.io/darts/examples/13-TFT-examples.html)



This data set is highly dependent on covariates. Knowing the month tells us a lot about the seasonal component, whereas the year determines the effect of the trend component.

Additionally, let’s convert the time index to integer values and use them as covariates as well.

All of the three covariates are known in the future, and can be used as future_covariates with the TFTModel.

In [None]:
# Read data
from darts.datasets import AirPassengersDataset

series = AirPassengersDataset().load()

In [None]:
series.head()

In [None]:
# we convert monthly number of passengers to average daily number of passengers per month
series = series / TimeSeries.from_series(series.time_index.days_in_month)
series = series.astype(np.float32)

In [None]:
# Create training and validation sets:
training_cutoff = pd.Timestamp("19571201")
train, val = series.split_after(training_cutoff)

In [None]:
# Normalize the time series (note: we avoid fitting the transformer on the validation set)
transformer = Scaler()
train_transformed = transformer.fit_transform(train)
val_transformed = transformer.transform(val)
series_transformed = transformer.transform(series)
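
Why the scaler is fitted on the training split only can be shown with plain NumPy. This sketch assumes min-max scaling to the range [0, 1]; the helper names `scale` and `unscale` and the toy values are made up.

```python
import numpy as np

train_values = np.array([2.0, 4.0, 6.0])
val_values = np.array([8.0, 10.0])

# "fit": the statistics come from the training split only
lo, hi = train_values.min(), train_values.max()

def scale(v):
    return (v - lo) / (hi - lo)

def unscale(v):
    return v * (hi - lo) + lo

train_scaled = scale(train_values)  # lies in [0, 1]
val_scaled = scale(val_values)      # may exceed 1: these values were unseen during fitting
print(train_scaled, val_scaled)
```

Fitting on the full series instead would leak the future value range into training, which is what the comment in the cell above warns against.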

In [None]:
# create year, month and integer index covariate series
covariates = datetime_attribute_timeseries(series, attribute="year", one_hot=False)

In [None]:
covariates = covariates.stack(datetime_attribute_timeseries(series, attribute="month", one_hot=False))

In [None]:
covariates = covariates.stack(
    TimeSeries.from_times_and_values(
        times=series.time_index,
        values=np.arange(len(series)),
        columns=["linear_increase"],
    )
)

covariates = covariates.astype(np.float32)

In [None]:
covariates

In [None]:
cov_train, cov_val = covariates.split_after(training_cutoff)

In [None]:
# transform covariates (note: we fit the transformer on train split and can then transform the entire covariates series)
scaler_covs = Scaler()
scaler_covs.fit(cov_train)
covariates_transformed = scaler_covs.transform(covariates)

The TFTModel can only be used if some future input is given. The optional parameters add_encoders and add_relative_index can be useful, especially if we don’t have any future input available. They generate encoded temporal data that is used as future covariates.

Since we already have future covariates defined in our example, both options are disabled in the model below (add_relative_index=False, add_encoders=None).

In [None]:
num_samples = 200
input_chunk_length = 24
forecast_horizon = 12

In [None]:
my_model = TFTModel(
    input_chunk_length=input_chunk_length,
    output_chunk_length=forecast_horizon,
    hidden_size=64,
    lstm_layers=1,
    num_attention_heads=4,
    dropout=0.1,
    batch_size=16,
    n_epochs=100,
    add_relative_index=False,
    add_encoders=None,
    likelihood=QuantileRegression(
        # quantiles= [ 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99]
    ),  # QuantileRegression is set per default
    # loss_fn=MSELoss(),
    random_state=42,
)

In what follows, we can just provide the whole covariates series as future_covariates argument to the model; the model will slice these covariates and use only what it needs in order to train on forecasting the target train_transformed:

In [None]:
my_model.fit(train_transformed, future_covariates=covariates_transformed, verbose=True)

We perform a one-shot prediction of 24 months using the “current” model - i.e., the model at the end of the training procedure:

In [None]:
# before starting, we define some constants
figsize = (9, 6)
lowest_q, low_q, high_q, highest_q = 0.01, 0.1, 0.9, 0.99
label_q_outer = f"{int(lowest_q * 100)}-{int(highest_q * 100)}th percentiles"
label_q_inner = f"{int(low_q * 100)}-{int(high_q * 100)}th percentiles"

In [None]:
def eval_model(model, n, actual_series, val_series):
    pred_series = model.predict(n=n, num_samples=num_samples)

    # plot actual series
    plt.figure(figsize=figsize)
    actual_series[: pred_series.end_time()].plot(label="actual")

    # plot prediction with quantile ranges
    pred_series.plot(
        low_quantile=lowest_q, high_quantile=highest_q, label=label_q_outer
    )
    pred_series.plot(low_quantile=low_q, high_quantile=high_q, label=label_q_inner)

    plt.title("MAPE: {:.2f}%".format(mape(val_series, pred_series)))
    plt.legend()

In [None]:
eval_model(my_model, 24, series_transformed, val_transformed)

Let’s backtest our TFTModel to see how it performs with a forecast horizon of 12 months over the last 3 years:

In [None]:
backtest_series = my_model.historical_forecasts(
    series_transformed,
    future_covariates=covariates_transformed,
    start=train.end_time() + train.freq,
    num_samples=num_samples,
    forecast_horizon=forecast_horizon,
    stride=forecast_horizon,
    last_points_only=False,
    retrain=False,
    verbose=True,
)
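
To see which periods such a backtest covers, the window arithmetic can be sketched in pure Python. The numbers assume the 144 monthly points of the Air Passengers series with the first 108 (January 1949 to December 1957) used for training; `forecast_windows` is a hypothetical helper, not a Darts function.

```python
def forecast_windows(series_length, start, forecast_horizon, stride):
    """Return (first, last) index pairs of each forecast, both inclusive."""
    windows = []
    first = start
    while first + forecast_horizon <= series_length:
        windows.append((first, first + forecast_horizon - 1))
        first += stride
    return windows

windows = forecast_windows(series_length=144, start=108, forecast_horizon=12, stride=12)
print(windows)
```

With the stride equal to the horizon, the forecasts tile the last three years without overlap, which is why they can be concatenated into one series afterwards.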

In [None]:
def eval_backtest(backtest_series, actual_series, horizon, start, transformer):
    plt.figure(figsize=figsize)
    actual_series.plot(label="actual")
    backtest_series.plot(
        low_quantile=lowest_q, high_quantile=highest_q, label=label_q_outer
    )
    backtest_series.plot(low_quantile=low_q, high_quantile=high_q, label=label_q_inner)
    plt.legend()
    plt.title(f"Backtest, starting {start}, {horizon}-months horizon")
    print(
        "MAPE: {:.2f}%".format(
            mape(
                transformer.inverse_transform(actual_series),
                transformer.inverse_transform(backtest_series),
            )
        )
    )

In [None]:
eval_backtest(
    backtest_series=concatenate(backtest_series),
    actual_series=series_transformed,
    horizon=forecast_horizon,
    start=training_cutoff,
    transformer=transformer,
)
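
The MAPE depends on the scale of the data, which is presumably why `eval_backtest` applies `inverse_transform` before computing it. The sketch below uses a hand-written `mape` helper (an assumption for illustration, not the Darts function) and a simple shift as a stand-in for scaling.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((actual - predicted) / actual)))

actual = np.array([100.0, 200.0])
predicted = np.array([110.0, 180.0])
original_scale = mape(actual, predicted)

# same absolute errors after shifting everything by 50 (stand-in for scaling)
shifted_scale = mape(actual - 50.0, predicted - 50.0)
print(original_scale, shifted_scale)
```

The absolute errors are identical in both cases, but the percentage error grows once the values move closer to zero.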

# PyTorch TFT implementation

Example partially adapted from [pytorch-forecasting](https://pytorch-forecasting.readthedocs.io/)

In [None]:
# Install an older version of PyTorch Lightning, since the current release causes problems (only temporary, probably not necessary for much longer)
!pip install pytorch_lightning==1.9.0

In [None]:
!pip install pytorch_forecasting

In [None]:
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
import torch

from pytorch_forecasting import Baseline, TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import SMAPE, PoissonLoss, QuantileLoss
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

## Data

## Data structure basics

In [None]:
example_data = pd.DataFrame(
    dict(
        time_idx=np.tile(np.arange(6), 3),
        target=np.array([0,1,2,3,4,5,20,21,22,23,24,25,40,41,42,43,44,45]),
        group=np.repeat(np.arange(3), 6),
        holidays = np.tile(['X','Black Friday', 'X','Christmas','X', 'X'],3),
    )
)
example_data

In [None]:
n_encode = 2
n_predict = 3

# create the time-series dataset from the pandas df
dataset = TimeSeriesDataSet(
    example_data,
    group_ids=["group"],
    target="target",
    time_idx="time_idx",
    max_encoder_length= n_encode,
    max_prediction_length=n_predict,
    time_varying_unknown_reals=["target"],
    time_varying_known_categoricals=["holidays"],  # holidays change over time, so they are not static
    target_normalizer=None
)

In [None]:
# pass the dataset to a dataloader
dataloader = dataset.to_dataloader(batch_size=1)

#load the first batch
x, y = next(iter(dataloader))

x

## Beer Sales data

We will use the [Stallion dataset from Kaggle](https://www.kaggle.com/datasets/utathya/future-volume-prediction) describing sales of various beverages. Our task is to make a six-month forecast of the sold volume by stock keeping units (SKU), that is products, sold by an agency, that is a store.

In [None]:
from pytorch_forecasting.data.examples import get_stallion_data
data = get_stallion_data()
data.head()

## Preprocessing

The dataset is already in the correct format but misses some important features. Most importantly, we need to add a time index that is incremented by one for each time step. Further, it is beneficial to add date features, which in this case means extracting the month from the date record.

In [None]:
# add time index
data["time_idx"] = data["date"].dt.year * 12 + data["date"].dt.month
data["time_idx"] -= data["time_idx"].min()

# add additional features
data["month"] = data.date.dt.month.astype(str).astype("category")  # categories have to be strings
data["log_volume"] = np.log(data.volume + 1e-8)
data["avg_volume_by_sku"] = data.groupby(["time_idx", "sku"], observed=True).volume.transform("mean")
data["avg_volume_by_agency"] = data.groupby(["time_idx", "agency"], observed=True).volume.transform("mean")

# we want to encode special days as one variable and thus need to first reverse one-hot encoding
special_days = [
    "easter_day",
    "good_friday",
    "new_year",
    "christmas",
    "labor_day",
    "independence_day",
    "revolution_day_memorial",
    "regional_games",
    "fifa_u_17_world_cup",
    "football_gold_cup",
    "beer_capital",
    "music_fest",
]
data[special_days] = data[special_days].apply(lambda x: x.map({0: "-", 1: x.name})).astype("category")
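
What the last line of this cell does is easier to see on a tiny frame. The example below applies the same `apply`/`map` pattern to a made-up DataFrame with two of the special-day columns; the toy values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"christmas": [0, 1, 0], "new_year": [1, 0, 0]})
cols = ["christmas", "new_year"]

# replace the 0/1 flags by "-" or the column's own name, then store as categories
toy[cols] = toy[cols].apply(lambda x: x.map({0: "-", 1: x.name})).astype("category")
print(toy)
```

After this step each column holds either `"-"` or its own name, so the columns can be merged into one categorical variable through `variable_groups`.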

In [None]:
data.sample(10, random_state=1337)

In [None]:
data.describe().T

## Create dataset and dataloaders

In [None]:
max_prediction_length = 6
max_encoder_length = 24
training_cutoff = data["time_idx"].max() - max_prediction_length

In [None]:
training = TimeSeriesDataSet(
    data[lambda x: x.time_idx <= training_cutoff],
    time_idx="time_idx",
    target="volume",
    group_ids=["agency", "sku"],
    min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
    max_encoder_length=max_encoder_length,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["agency", "sku"],
    static_reals=["avg_population_2017", "avg_yearly_household_income_2017"],
    time_varying_known_categoricals=["special_days", "month"],
    variable_groups={"special_days": special_days},  # group of categorical variables can be treated as one variable
    time_varying_known_reals=["time_idx", "price_regular", "discount_in_percent"],
    time_varying_unknown_categoricals=[],
    time_varying_unknown_reals=[
        "volume",
        "log_volume",
        "industry_volume",
        "soda_volume",
        "avg_max_temp",
        "avg_volume_by_agency",
        "avg_volume_by_sku",
    ],
    target_normalizer=GroupNormalizer(
        groups=["agency", "sku"], transformation="softplus"
    ),  # use softplus and normalize by group
    add_relative_time_idx=True,
    add_target_scales=True,
    add_encoder_length=True,
)

In [None]:
# create validation set (predict=True) which means to predict the last max_prediction_length points in time for each series
validation = TimeSeriesDataSet.from_dataset(training, data, predict=True, stop_randomization=True)

In [None]:
# create dataloaders for model
batch_size = 128  # set this between 32 to 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

## Create baseline model

Evaluating a Baseline model that predicts the next 6 months by simply repeating the last observed volume gives us a simple benchmark that we want to outperform.

In [None]:
# calculate baseline mean absolute error, i.e. predict next value as the last available value from the history
actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)
(actuals - baseline_predictions).abs().mean().item()
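
The idea behind this baseline can be reproduced without any model. The sketch below assumes two short toy series; it repeats each series' last observed value over the horizon and computes the mean absolute error, which is the same kind of number the cell above reports.

```python
import numpy as np

history = np.array([[3.0, 4.0, 5.0],
                    [10.0, 9.0, 8.0]])  # two series, three past steps
future = np.array([[6.0, 7.0],
                   [8.0, 6.0]])         # actual next two steps

# naive forecast: repeat each series' last observed value over the horizon
naive = np.repeat(history[:, -1:], future.shape[1], axis=1)
mae = float(np.abs(future - naive).mean())
print(naive, mae)
```

Any trained model should reach a lower error than this before it is worth its training cost.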

## Train the Temporal Fusion Transformer

In [None]:
# configure network and trainer
pl.seed_everything(42)
trainer = pl.Trainer(
    gpus=0,
    # clipping gradients is a hyperparameter and important to prevent divergence
    # of the gradient for recurrent neural networks
    gradient_clip_val=0.1,
)
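
Gradient clipping by norm rescales all gradients together when their combined norm exceeds a threshold. The following NumPy sketch illustrates the principle under that assumption; `clip_by_global_norm` is a made-up helper and the gradients are toy values.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]  # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

The direction of the update is preserved and only its length is capped, which keeps single large gradients from destabilizing recurrent layers.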

### Finding optimal learning rate

In [None]:
tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.03,
    hidden_size=16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

In [None]:
# find optimal learning rate
# NOTE: this currently raises an error and I don't know why. I hope it is due to the recent TFT / PyTorch Lightning update and gets fixed soon
res = trainer.tuner.lr_find(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
    max_lr=10.0,
    min_lr=1e-6,
)

## Train model

In [None]:
# configure network and trainer
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard
n_epochs = 30
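
The stopping rule configured above can be paraphrased in pure Python. This is a sketch of the usual patience logic in "min" mode, assuming a list of made-up validation losses; `stopping_epoch` is a hypothetical helper and not the Lightning callback itself.

```python
def stopping_epoch(val_losses, min_delta=1e-4, patience=10):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, waited = loss, 0   # improvement: remember it and reset the counter
        else:
            waited += 1              # no meaningful improvement this epoch
            if waited >= patience:
                return epoch
    return None

losses = [1.0, 0.8, 0.7] + [0.7] * 12  # improves for three epochs, then stalls
stopped_at = stopping_epoch(losses)
print(stopped_at)
```

With `n_epochs = 30` as an upper bound, training ends at whichever comes first: the epoch limit or ten epochs without improvement.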

In [None]:

trainer = pl.Trainer(
    max_epochs=n_epochs,
    gpus=0,
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=30,  # comment in for training, running validation every 30 batches
    # fast_dev_run=True,  # comment in to check that network or dataset has no serious bugs
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)

In [None]:
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    log_interval=10,  # uncomment for learning rate finder and otherwise, e.g. to 10 for logging every 10 batches
    reduce_on_plateau_patience=4,
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

In [None]:
# fit network
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

## Hyperparameter tuning

Would make sense, but takes too long in this case.

In [None]:
#import pickle
#from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# # create study
#study = optimize_hyperparameters(
#    train_dataloader,
#    val_dataloader,
#    model_path="optuna_test",
#    n_trials=200,
#    max_epochs=50,
#    gradient_clip_val_range=(0.01, 1.0),
#    hidden_size_range=(8, 128),
#    hidden_continuous_size_range=(8, 128),
#    attention_head_size_range=(1, 4),
#    learning_rate_range=(0.001, 0.1),
#    dropout_range=(0.1, 0.3),
#    trainer_kwargs=dict(limit_train_batches=30),
#    reduce_on_plateau_patience=4,
#    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
#)

# save study results - also we can resume tuning at a later point in time
#with open("test_study.pkl", "wb") as fout:
#    pickle.dump(study, fout)

## show best hyperparameters
#print(study.best_trial.params)

## Evaluate performance

In [None]:
# load the best model according to the validation loss
# (given that we use early stopping, this is not necessarily the last epoch)
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

In [None]:
# calculate mean absolute error on validation set
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])
predictions = best_tft.predict(val_dataloader)
(actuals - predictions).abs().mean()

In [None]:
# raw predictions are a dictionary from which all kinds of information including quantiles can be extracted
raw_predictions, x = best_tft.predict(val_dataloader, mode="raw", return_x=True)

In [None]:
for idx in range(10):  # plot 10 examples
    best_tft.plot_prediction(x, raw_predictions, idx=idx, add_loss_to_title=True);

### Worst performers

In [None]:
# calculate metric by which to display
predictions = best_tft.predict(val_dataloader)
mean_losses = SMAPE(reduction="none")(predictions, actuals).mean(1)
indices = mean_losses.argsort(descending=True)  # sort losses

# Only show the worst performers
for idx in range(10):  # plot 10 examples
    best_tft.plot_prediction(
        x, raw_predictions, idx=indices[idx], add_loss_to_title=SMAPE(quantiles=best_tft.loss.quantiles)
    );
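
The ranking step above can be mirrored in NumPy. This sketch assumes the common symmetric MAPE definition 2·|ŷ − y| / (|ŷ| + |y|) with a small constant against division by zero; the helper `smape` and the toy arrays are made up for illustration.

```python
import numpy as np

def smape(pred, actual):
    """Element-wise symmetric mean absolute percentage error."""
    return 2.0 * np.abs(pred - actual) / (np.abs(pred) + np.abs(actual) + 1e-8)

actual = np.array([[10.0, 10.0], [5.0, 5.0], [20.0, 20.0]])
pred = np.array([[10.0, 12.0], [1.0, 1.0], [19.0, 21.0]])

per_series = smape(pred, actual).mean(axis=1)  # one score per series
worst_first = np.argsort(-per_series)          # indices, largest error first
print(per_series, worst_first)
```

Sorting by a relative metric instead of an absolute one keeps high-volume series from dominating the list of worst performers.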

### Actuals vs predictions by variables

In [None]:
predictions, x = best_tft.predict(val_dataloader, return_x=True)
predictions_vs_actuals = best_tft.calculate_prediction_actual_by_variable(x, predictions)
best_tft.plot_prediction_actual_by_variable(predictions_vs_actuals);

## Prediction

### Predict on selected data

In [None]:
best_tft.predict(
    training.filter(lambda x: (x.agency == "Agency_01") & (x.sku == "SKU_01") & (x.time_idx_first_prediction == 15)),
    mode="quantiles",
)

In [None]:
raw_prediction, x = best_tft.predict(
    training.filter(lambda x: (x.agency == "Agency_01") & (x.sku == "SKU_01") & (x.time_idx_first_prediction == 15)),
    mode="raw",
    return_x=True,
)
best_tft.plot_prediction(x, raw_prediction, idx=0);

### Predict on new data

Note: because we have covariates in the dataset, predicting on new data requires us to define the known covariates upfront.

In [None]:
# select last 24 months from data (max_encoder_length is 24)
encoder_data = data[lambda x: x.time_idx > x.time_idx.max() - max_encoder_length]

# select last known data point and create decoder data from it by repeating it and incrementing the month
# in a real world dataset, we should not just forward fill the covariates but specify them to account
# for changes in special days and prices (which you absolutely should do but we are too lazy here)
last_data = data[lambda x: x.time_idx == x.time_idx.max()]
decoder_data = pd.concat(
    [last_data.assign(date=lambda x: x.date + pd.offsets.MonthBegin(i)) for i in range(1, max_prediction_length + 1)],
    ignore_index=True,
)

# add time index consistent with "data"
decoder_data["time_idx"] = decoder_data["date"].dt.year * 12 + decoder_data["date"].dt.month
decoder_data["time_idx"] += encoder_data["time_idx"].max() + 1 - decoder_data["time_idx"].min()

# adjust additional time feature(s)
decoder_data["month"] = decoder_data.date.dt.month.astype(str).astype("category")  # categories have to be strings

# combine encoder and decoder data
new_prediction_data = pd.concat([encoder_data, decoder_data], ignore_index=True)

In [None]:
new_raw_predictions, new_x = best_tft.predict(new_prediction_data, mode="raw", return_x=True)

for idx in range(10):  # plot 10 examples
    best_tft.plot_prediction(new_x, new_raw_predictions, idx=idx, show_future_observed=False);

## Interpret model

In [None]:
interpretation = best_tft.interpret_output(raw_predictions, reduction="sum")
best_tft.plot_interpretation(interpretation)

# Your turn (Bonus): Predicting stock prices with TFT (and make $$)

Your task:

* Download some stock data
* Train a TFT model
* Maybe use several stocks at once that might be related
* Other covariates possible?


## Getting data

In [None]:
# !pip install yfinance

In [None]:
import yfinance as yf

In [None]:
df_stocks = yf.download(tickers=['GOOGL'], period='10y', interval='1d') # , 'AAPL', 'GOOGL'

In [None]:
df_stocks.head()

In [None]:
df_stocks.dtypes

In [None]:
df_stocks = df_stocks.drop('Volume', axis=1)

In [None]:
alt.Chart(data = df_stocks.reset_index()).mark_line().encode(
    x='Date:T',
    y='Close:Q'
)

# Further resources and cool stuff

* [Timeseries Transformer on Huggingface](https://huggingface.co/docs/transformers/model_doc/time_series_transformer): The Time Series Transformer model is a vanilla encoder-decoder Transformer for time series forecasting.