
# Description of the paper
## Introduction
*TSMixer: An All-MLP Architecture for Time Series Forecasting* is a paper by Google research scientists that introduces a novel and powerful model for long-term time series forecasting.

A number of different techniques have been shown to be quite successful at this task. However, by the time this paper was published (September 2023), other authors had shown that, on some benchmarks, the results achieved by very complex state-of-the-art models could be matched by stacking simpler models or layers. These papers used MLPs as a base, whereas the competing approaches used Transformer-based models.

TSMixer is an attempt at a model of this kind with better performance than its predecessors. It achieves this using only stacked MLPs, arranged in a special Mixer architecture developed specifically for this paper. The aptly named Mixer is a combination of a Time Mixing layer and a Feature Mixing layer. Combined, these two layers allow the model, using only linear layers and simple non-linearities, to capture temporal patterns (which a traditional model like ARIMA would capture decently well) as well as cross-variate information. This describes the model's most basic form; the *extended* TSMixer model can additionally take future and static variables into account.

## Preprocessing the data

Before explaining how the layers are created in a TSMixer model specifically, it is important to lay out the basics of time series forecasting to better understand the dimensions of our model.

A time series is a collection of points recorded at regular intervals and ordered by time. Such series are often decomposed into components, such as the trend (the long-term progression of the series) and the seasonality (e.g. temperature is tied to the seasons). The ultimate goal of a problem of this kind is to predict the next value(s) of a time series starting from a certain point. Time series with a very strong trend or seasonality are often quite easy to predict, but real-life datasets often contain noise that weakens these components.

The task at hand is multi-step time series forecasting. This means that we want to predict the next p values of a time series from a given point onwards. Moreover, the *TSMixer* model is the multivariate variant of the model described in the paper; it can therefore work with time series that have multiple variables (or features).

To predict these p values, we take into account the w most recent values up to that point (in our setup, w is greater than p). w is the window length, and its choice has a strong influence on the overall result of the model.
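The windowing described above can be sketched as follows. This is a minimal illustration, not the paper's data pipeline: the helper name `make_windows` and the toy series are hypothetical, and the series is assumed to be stored as an array of shape (time steps, variables).

```python
import numpy as np

def make_windows(series, w, p):
    """Slice a (T, C) series into inputs of length w and targets of length p."""
    series = np.asarray(series, dtype=float)
    n = series.shape[0] - w - p + 1  # number of complete (input, target) pairs
    if n <= 0:
        raise ValueError("series is too short for the requested w and p")
    # Input i covers steps [i, i + w); its target covers the next p steps.
    inputs = np.stack([series[i : i + w] for i in range(n)])
    targets = np.stack([series[i + w : i + w + p] for i in range(n)])
    return inputs, targets

# Toy series (hypothetical data): 20 time steps, 2 variables.
toy = np.arange(40, dtype=float).reshape(20, 2)
X, Y = make_windows(toy, w=8, p=3)
print(X.shape, Y.shape)  # (10, 8, 2) (10, 3, 2)
```

Each input window is a w × (number of variables) matrix, which is the shape the layers described in the next section operate on.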

## Layers 

![paper](images/basicmodel.png)
*Architecture of the basic TSMixer model*

The *basic* TSMixer is composed of N Mixer layers followed by a temporal projection layer. Each Mixer layer consists of a time mixing layer and a feature mixing layer. Both of these have a residual connection and a normalisation step (for this specific model: a 2D batch pre-normalisation).


1.   Time mixing layer: The input matrix is first transposed. The layer then feeds each variable's full sequence of time steps (one column of the original input) to a fully connected layer (with ReLU and Dropout) that is shared across variables. The output is transposed back so that it can be used by the feature mixing layer.
2.   Feature mixing layer: feeds each row (the values of all variables at one time step) into an MLP that is shared across time steps.

At the end, the data goes through the temporal projection: it is transposed, fed into a fully connected layer (which reduces the time dimension to the prediction length) and transposed back to give the final forecast.
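To make the shapes concrete, here is a rough NumPy sketch of a forward pass through the layers described above. It is a simplification under stated assumptions, not the authors' implementation: normalisation and dropout are left out, the weights are random rather than trained, the feature mixing MLP is assumed to have two layers, and all function names and sizes (`L`, `C`, `H`, `P`, `N`) are hypothetical toy values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def time_mixing(x, W_t, b_t):
    # x: (L, C). Transpose so each row holds one variable's full history,
    # apply one fully connected layer along time, then transpose back.
    h = relu(x.T @ W_t + b_t)      # (C, L)
    return x + h.T                 # residual connection, (L, C)

def feature_mixing(x, W1, b1, W2, b2):
    # Each row (one time step, all variables) goes through a two-layer MLP.
    h = relu(x @ W1 + b1)          # (L, H)
    return x + (h @ W2 + b2)       # residual connection, (L, C)

def temporal_projection(x, W_p, b_p):
    # Map the time axis from the window length L to the prediction length P.
    return (x.T @ W_p + b_p).T     # (P, C)

# Toy sizes: window length, variables, hidden size, prediction length, blocks.
L, C, H, P, N = 8, 3, 5, 2, 2
rng = np.random.default_rng(0)
x = rng.normal(size=(L, C))

h = x
for _ in range(N):
    W_t, b_t = rng.normal(scale=0.1, size=(L, L)), np.zeros(L)
    W1, b1 = rng.normal(scale=0.1, size=(C, H)), np.zeros(H)
    W2, b2 = rng.normal(scale=0.1, size=(H, C)), np.zeros(C)
    h = feature_mixing(time_mixing(h, W_t, b_t), W1, b1, W2, b2)

y = temporal_projection(h, rng.normal(scale=0.1, size=(L, P)), np.zeros(P))
print(y.shape)  # (2, 3)
```

Note that the time mixing weights act only along the time axis and the feature mixing weights only along the variable axis, which is why the same small matrices can be reused for every variable and every time step respectively.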


## Implementation details

The diagram of the model does not show every choice the Google team made when training this model.



1.   Normalisation: The paper describes several normalisations applied inside and outside the model. Multivariate time series are commonly standardised before being passed to the model (generally with a StandardScaler), which ensures a uniform scale across all variables. Inside the model, RevIN (Reversible Instance Normalisation) is applied first, "*to ensure a fair comparison with the state-of-the-art PatchTST*". Instance normalisation uses the same underlying maths as batch normalisation, but the statistics are computed on each instance separately (an instance is a window of data); "reversible" means that the same statistics are used to normalise the input and to denormalise the output.

Finally, in the individual layers, the model uses batch or layer normalisation, applied at the start or at the end of the layer depending on the variant (in the 'basic' implementation, the authors use 2D batch pre-normalisation).

2.   Early stopping: The paper stops training early if "*the validation loss is not improved after 5 epochs*".

3.   Dataset: The 'basic' implementation of the TSMixer algorithm is mainly benchmarked on the ETT suite of datasets. For our result demonstration, we'll use the ETTh1 dataset specifically. This dataset has 7 features and 17420 time steps in total.
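As an illustration of the "reversible" instance normalisation mentioned in the first point, here is a minimal sketch. It is an assumption-laden simplification: the learnable scale and shift parameters of the full RevIN method are omitted, and the function names and toy window are hypothetical.

```python
import numpy as np

def instance_normalise(window, eps=1e-5):
    # window: (L, C). Statistics are computed per variable, over this window only.
    mean = window.mean(axis=0, keepdims=True)
    std = np.sqrt(window.var(axis=0, keepdims=True) + eps)
    return (window - mean) / std, (mean, std)

def instance_denormalise(values, stats):
    # Undo the normalisation with the statistics saved from the input window.
    mean, std = stats
    return values * std + mean

# Toy window (hypothetical data): 3 time steps, 2 variables on different scales.
window = np.array([[10.0, 100.0], [12.0, 140.0], [14.0, 180.0]])
normed, stats = instance_normalise(window)
restored = instance_denormalise(normed, stats)
```

In a forecasting model, the saved statistics of the input window would be applied to the model's predictions, so that the output is returned on the scale of that particular window.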

  


## Parameters



*   Optimizer: Adam
*   Loss: MSE
*   Evaluation metric: MAE
*   Input (or window) length: 512
*   Prediction length: 96
*   N° of epochs: 100 (with early stopping)

These are the "best" settings that the paper adopts from the findings of other papers. For the remaining parameters, however, a choice has to be made if we want to run the model a single time for a given prediction length. We could also search for the best values ourselves, but even a single run of the model is computationally expensive, so I am not able to do this at the moment. The paper reports benchmarks with different parameter values for each prediction length, so we take the "best" parameters for our prediction length:

*   Learning rate: 0.0001
*   Batch size: 32
*   Hidden size (the output dimension of the first fully connected layer in the feature mixing MLP): 512
*   Dropout: 0.9
*   N° of mixer layers: 6
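The settings listed above can be gathered in one place, together with the five-epoch patience quoted in the implementation details. The sketch below is only one possible way to organise them: the `TSMixerConfig` class, its field names and the `stopping_epoch` helper are hypothetical, and the validation losses are made-up numbers used to show the stopping rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TSMixerConfig:
    # Values copied from the parameter lists above; field names are hypothetical.
    input_length: int = 512
    prediction_length: int = 96
    max_epochs: int = 100
    patience: int = 5
    learning_rate: float = 1e-4
    batch_size: int = 32
    hidden_size: int = 512
    dropout: float = 0.9
    n_mixer_layers: int = 6

def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    stale = 0  # epochs since the validation loss last improved
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)

cfg = TSMixerConfig()
# Made-up validation losses: the best value is reached at epoch 3.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63, 0.64, 0.7, 0.8]
print(stopping_epoch(losses, cfg.patience))  # 8
```

With a patience of 5, training stops at epoch 8 here, five epochs after the last improvement, well before the 100-epoch limit.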






## Other models presented in the paper

*   TMix-Only: This is essentially the TSMixer model with only time mixing (no feature mixing). The benchmarks in the paper show that, for the datasets we use (ETT), the results of the TMix-Only model are comparable to those of TSMixer, which suggests that in this specific case cross-variate information is not really relevant.
*   TSMixer-Ext: This extended model can leverage future and static variables on top of historical ones. It works by swapping the feature mixing layer for a conditional one that takes the static variables into account. The structure of the model is also different (see the figure below). The normalisation used is a 2D layer post-normalisation. This model is benchmarked on the M5 dataset, a large-scale collection of time series that also contains static and future variables. Considering the sheer size of this dataset and our limited computing resources, we will not be able to provide results for this extended model on the whole M5 dataset.

![Plot 1](images/extmodel.png)
*Architecture of the Extended TSMixer model*

