
# Description of the paper
## Introduction
*TSMixer: An All-MLP Architecture for Time Series Forecasting* is a paper by Google researchers that introduces a novel and powerful model for long-term time series forecasting.

A number of different techniques have been shown to be quite successful at this task. However, by the time this paper was published (September 2023), other authors had shown that, on some benchmarks, the results achieved by very complex state-of-the-art models could be matched by stacking simpler models or layers. These papers used MLPs as a base, while the competing approaches used Transformer-based models.

TSMixer is an attempt at a model of this kind with better performance than its predecessors. It achieves this by stacking only MLPs in a Mixer architecture developed specifically for this paper. The aptly named Mixer combines a time mixing layer and a feature mixing layer. Together, these two layers allow the model, using only MLPs, to capture temporal patterns (which a traditional model like ARIMA would capture decently well) and cross-variate information. This describes the model's most basic form; the *extended* TSMixer model can also take future and static variables into account.

## Preprocessing the data

Before explaining how the layers are created in a TSMixer model specifically, it is important to lay out the basics of time series forecasting to better understand the dimensions of our model.

A time series is a sequence of data points collected at regular intervals and ordered in time. Such series are often decomposed into components such as trend (the long-term progression of the series) and seasonality (e.g. temperature is tied to the seasons). The ultimate goal is to predict the next value(s) of a time series starting from a certain point. Time series with a very strong trend or seasonality are often quite easy to predict, but real-life datasets often contain noise that weakens these components.

The task at hand is multi-step time series forecasting: we want to predict the next p values of a time series from a given point onwards. Moreover, *TSMixer* is the multivariate variant of the model described in the paper, so it can work with time series that have multiple variables (or features).

To predict these p values, we take into account the w most recent values up to that point, with w greater than p in our setting. w is the window length, and its choice is very important for the overall result of the model.
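The windowing described above can be sketched in a few lines of NumPy. This is only an illustration on a toy array: the function name `make_windows` and the toy sizes are assumptions and are not taken from the repository's dataset code.

```python
import numpy as np

def make_windows(series, w, p):
    """Split a (T, C) array into (input, target) pairs.

    Each input holds w consecutive time steps and each target holds
    the p time steps that immediately follow it.
    """
    n_samples = series.shape[0] - w - p + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this w and p")
    inputs = np.stack([series[i : i + w] for i in range(n_samples)])
    targets = np.stack([series[i + w : i + w + p] for i in range(n_samples)])
    return inputs, targets

# Toy multivariate series: 20 time steps, 3 variables.
toy = np.arange(60, dtype=float).reshape(20, 3)
x, y = make_windows(toy, w=8, p=4)
print(x.shape, y.shape)  # (9, 8, 3) (9, 4, 3)
```

Each sample is therefore a (w, C) input matrix paired with a (p, C) target matrix, which are the shapes the layers below operate on.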

## Layers 

![paper](images/basicmodel.png)

*Architecture of the basic TSMixer model*

The *basic* TSMixer is composed of N Mixer layers followed by a temporal projection layer. Each Mixer layer consists of a time mixing layer and a feature mixing layer. Both layers are residual and both apply a normalisation of some kind (for this specific model: a 2D batch pre-normalisation).


1.   Time mixing layer: The layer first transposes the input matrix so that time becomes the last dimension. It then feeds each variable's series (all the time steps of one variable) to a fully connected layer shared across variables (with ReLU and Dropout), and transposes the output back so that it can be used by the feature mixing layer.
2.   Feature mixing layer: feeds each row (the values of all variables at one time step) into an MLP shared across time steps.
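To make the two mixing directions concrete, here is a minimal NumPy sketch of one Mixer block in inference mode. It is a simplified illustration and not the code in *TSMixer.py*: normalisation and dropout are left out, the weights are random, and all names and sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def time_mixing(x, w_time, b_time):
    """x: (batch, w, C). One fully connected layer shared by all variables,
    applied along the time axis, plus a residual connection."""
    xt = np.swapaxes(x, 1, 2)            # (batch, C, w)
    xt = relu(xt @ w_time + b_time)      # mix the w time steps of each variable
    return x + np.swapaxes(xt, 1, 2)     # back to (batch, w, C)

def feature_mixing(x, w1, b1, w2, b2):
    """x: (batch, w, C). Two-layer MLP shared by all time steps,
    applied along the feature axis, plus a residual connection."""
    h = relu(x @ w1 + b1)                # (batch, w, hidden)
    return x + (h @ w2 + b2)             # (batch, w, C)

batch, w, C, hidden = 2, 16, 7, 32
x = rng.normal(size=(batch, w, C))
w_time, b_time = rng.normal(size=(w, w)) * 0.1, np.zeros(w)
w1, b1 = rng.normal(size=(C, hidden)) * 0.1, np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, C)) * 0.1, np.zeros(C)

out = feature_mixing(time_mixing(x, w_time, b_time), w1, b1, w2, b2)
print(out.shape)  # (2, 16, 7)
```

Because the time mixing weights act only on the time axis, that step never combines different variables; cross-variate information enters only through the feature mixing step.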

At the end, the data is fed into a temporal projection layer, which transposes the data, feeds it into a fully connected layer (that maps the time dimension from the window length to the prediction length) and then transposes the output back to obtain the final forecast.
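The temporal projection can be sketched the same way. The snippet below assumes a single linear layer along the time axis with random weights; the function name and the sizes (window length 512, prediction length 96, 7 variables) are illustrative choices rather than the repository's actual code.

```python
import numpy as np

def temporal_projection(x, w_out, b_out):
    """Map (batch, w, C) to (batch, p, C) with one linear layer along time."""
    xt = np.swapaxes(x, 1, 2)        # (batch, C, w)
    yt = xt @ w_out + b_out          # (batch, C, p)
    return np.swapaxes(yt, 1, 2)     # (batch, p, C)

rng = np.random.default_rng(0)
batch, w, p, C = 2, 512, 96, 7
x = rng.normal(size=(batch, w, C))
w_out = rng.normal(size=(w, p)) * 0.01
b_out = np.zeros(p)

forecast = temporal_projection(x, w_out, b_out)
print(forecast.shape)  # (2, 96, 7)
```

Every forecast step is a learned weighted sum of the w input steps of the same variable, so the number of variables is unchanged and only the time dimension shrinks.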


## Implementation details

The diagram of the model cannot explain all the choices the Google team made when training this model.



1.   Normalisation: The paper describes several normalisations applied both outside and inside the model. Multivariate time series are commonly normalised before being passed to the model (generally using a StandardScaler), which ensures a uniform scale across all variables. Inside the model, RevIN (Reversible Instance Normalisation) is applied first "*to ensure a fair comparison with the state-of-the-art PatchTST*". Instance normalisation uses the same underlying maths as batch normalisation, but the statistics are computed on every instance (an instance is a window of data) rather than across the batch. "Reversible" means that the same statistics are used to normalise the input and to denormalise the output.

     Finally, in the individual layers, the model uses batch or layer normalisation, at the start or end of the layer (depending on the model we choose: in the 'basic' implementation, the authors use 2D batch pre-normalisation).

2.   Early stopping: The paper applies early stopping if "*the validation loss is not improved after 5 epochs*".

3.   Dataset: The 'basic' implementation of the TSMixer algorithm is mainly benchmarked on the ETT suite of datasets. For our result demonstration, we'll use the ETTh1 dataset specifically. This dataset has 7 features and 17420 time steps in total.
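The instance normalisation from the first item can be illustrated with a short NumPy sketch. It is a simplified version written for this explanation: the learnable affine parameters of the original RevIN method are omitted, and the class name `SimpleRevIN` and the synthetic data are assumptions.

```python
import numpy as np

class SimpleRevIN:
    """Reversible instance normalisation without the learnable affine
    parameters of the original method (omitted here for brevity)."""

    def __init__(self, eps=1e-5):
        self.eps = eps

    def normalise(self, x):
        # x: (batch, w, C). Statistics are computed per window and per
        # variable, over the time axis only.
        self.mean = x.mean(axis=1, keepdims=True)
        self.std = np.sqrt(x.var(axis=1, keepdims=True) + self.eps)
        return (x - self.mean) / self.std

    def denormalise(self, y):
        # y: (batch, p, C). Reuses the statistics stored by normalise().
        return y * self.std + self.mean

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=(4, 512, 7))
revin = SimpleRevIN()
x_norm = revin.normalise(x)

# A real model would map x_norm to a forecast; here the first 96 steps
# stand in for it, to show that denormalise() inverts normalise().
restored = revin.denormalise(x_norm[:, :96, :])
print(np.allclose(restored, x[:, :96, :]))  # True
```

In a full model, `normalise` would be called on the input window and `denormalise` on the model's forecast, so that the forecast is returned in the scale of that particular window.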

  


## Parameters



*   Optimizer: Adam
*   Loss: MSE
*   Evaluation metric: MAE
*   Input (or window) length: 512
*   Prediction length: 96
*   N° of epochs: 100 (with early stopping)
*   Train/test/val split: 12/4/4 (months)
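As a rough illustration of what the 12/4/4 month split means in rows of hourly data, the sketch below converts months to row indices and counts the windows that fit in each segment. It assumes 30-day months and segments taken in the order train, validation, test; the function names are hypothetical and the counts only apply under these assumptions.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months, hourly sampling

def split_borders(train_months=12, val_months=4, test_months=4):
    """Return (start, end) row indices for the train/val/test segments."""
    train_end = train_months * HOURS_PER_MONTH
    val_end = train_end + val_months * HOURS_PER_MONTH
    test_end = val_end + test_months * HOURS_PER_MONTH
    return {
        "train": (0, train_end),
        "val": (train_end, val_end),
        "test": (val_end, test_end),
    }

def n_windows(start, end, w=512, p=96):
    """Number of (input, target) pairs that fit entirely inside a segment."""
    return max(end - start - w - p + 1, 0)

borders = split_borders()
print(borders)
# {'train': (0, 8640), 'val': (8640, 11520), 'test': (11520, 14400)}
print({name: n_windows(*b) for name, b in borders.items()})
# {'train': 8033, 'val': 2273, 'test': 2273}
```

Note that this simplified count keeps every window inside its own segment. Some ETT loaders instead let the validation and test input windows reach back into the preceding segment, which yields more evaluation samples.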

These are all the "best" parameters that the paper proposes based on other papers' findings. For the remaining parameters, however, a choice has to be made if we want to run the model a single time for a given prediction length. We could also search for the best values, but even a single run of the model is computationally expensive, so I am not able to do this at the moment. The paper reports benchmarks with different parameter values for each prediction length. We'll take the "best" parameters for our prediction length:

*   Learning rate: 0.0001
*   Batch size: 32
*   Hidden size (the output dimension of the first fully connected layer in the feature mixing MLP): 512
*   Dropout: 0.9
*   N° of mixer layers: 6






## Other models presented in the paper

*   TMix-Only: This is essentially the TSMixer model with only time mixing (no feature mixing). The benchmarks in the paper show that, for the datasets we use (ETT), the results of the TMix-Only model are comparable to those of TSMixer (suggesting that, in this specific case, cross-variate information is not really relevant).
*   TSMixer-Ext: This extended model can leverage future and static variables on top of historical ones. It works by swapping the feature mixing layer for a conditional one that takes the static variables into account. The structure of the model is also different (see the figure below). The normalisations used are post-normalisation 2D layer norms. This model is benchmarked on the M5 dataset, a large-scale collection of time series that also contains static and future variables. Considering the sheer size of this dataset and our limited computing resources, we will not be able to provide results for this extended model on the whole M5 dataset.

![Plot 1](images/extmodel.png)

*Architecture of the Extended TSMixer model*



# How to use the code

The code is stored in the *src* folder. The code files are divided into 3 main parts: the dataset creation, the model and the training. The dataset creation is done in the *etth1_dataset* file, which creates a time series dataset instance that can later be used with the DataLoader class. The model is defined in the *TSMixer.py* file and the training is done in the *train.py* file. The *main.py* file runs the whole algorithm and takes multiple arguments that can be tuned by the user. To run the model with the default parameters, simply run the *main.py* file on its own. The *main.py* file also contains the code to plot the results of the model on all the variables of the dataset.

The data is stored in the *data* folder. The *images* folder contains the images used in this notebook.

# Summary of the main results

It is important to note that, due to my limited computing resources, I was not able to run the model for 100 epochs on my own computer. I used Google Colab instead and was able to run it for 100 epochs on the ETTh1 dataset using a single GPU. This significantly sped up the process and helped me test some other configurations. However, a single GPU was not enough to perform a grid search for the best parameters. Once I had working code, I therefore ran it one configuration at a time, starting with the "best parameters" described by the authors of the paper for a prediction length of 96.
The results are as follows:

![paper](images/train_val_loss_best.png)

*Training and validation loss for the best parameters described by the paper's authors*

We can see from this graph that the training loss decreases and drops below the validation loss after about 20 epochs. A training loss that sits above the validation loss in the early epochs is unusual; a likely cause is the very high dropout rate used in the model (0.9), since dropout is applied during training but not during validation. Once the training loss falls below the validation loss, a growing gap between the two would point to overfitting rather than underfitting. The model still performs quite well on the validation set, and I will not try to explain this behaviour further here.

It is also worth noting that the scale of the losses seems to be a bit off compared with the paper's results. This may be because the delta I used for early stopping stopped the training too early, so that the values did not have time to reach the same scale as in the paper, but I doubt it.

We can also plot the model's predictions on the test set to observe its behaviour. As the model is multivariate, we'll focus on a subset of the feature space and plot the results for 2 variables of interest. First, we plot the result for the variable 'LULL':

![paper](images/lull.png)

*Results of the model on the test set for the variable 'LULL'*

Just from this visual result, one might say that the model performs quite well on the test set and thus generalises quite well. However, one has to compare this with the full series of the LULL variable to judge whether the model is good at predicting a series that is actually hard to predict. Here is the plot:

![paper](images/Lull_plot.png)

*Plot of the actual values of the variable 'LULL'*

This variable has noise but also seems to have a strong seasonality. A more complete analysis of this curve would be necessary, but since cross-variate information also has to be taken into account, such an analysis would be quite hard to do.

Let's plot the results for another, noisier variable, 'OT':

![paper](images/ot.png)

*Results of the model on the test set for the variable 'OT'*

We can clearly see that the model struggles quite a lot more with this variable. The results are quite noisy and don't seem to follow the actual values of the variable. Here is the plot of the actual values of the variable 'OT':

![paper](images/OT_plot.png)

*Plot of the actual values of the variable 'OT'*

This variable is quite noisy and doesn't really seem to have a clear trend or seasonality. This makes it quite a complex variable to predict for any time series forecasting model, so these results do not undermine the validity of our model.

Next, I lower the dropout rate to see if it improves the results.

![paper](images/loss_val_train_05.png)

*Training and validation loss for the best parameters described by the paper's authors using a dropout rate of 0.5*

This graph shows the results of the model using a dropout rate of 0.5. The picture is roughly the opposite of the previous one: the training loss is now always lower than the validation loss by a significant margin. This is hard to interpret as is, but it could be a sign of overfitting. We can try other parameters to see if they improve the results.
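One mechanism that may contribute to the different ordering of the two curves is that dropout is active only in training mode, so the training loss is computed on a perturbed network while the validation loss is computed on the full one. The sketch below illustrates standard inverted dropout on made-up activations; it is not a diagnosis of these particular runs, and the function and array names are assumptions.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of the units during training
    and rescale the rest; do nothing at evaluation time."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((1000, 512))  # stand-in for hidden activations

for rate in (0.9, 0.5):
    train_out = dropout(h, rate, training=True, rng=rng)
    eval_out = dropout(h, rate, training=False, rng=rng)
    kept = (train_out != 0).mean()
    print(f"rate={rate}: kept {kept:.2f} of units in training, "
          f"{(eval_out != 0).mean():.2f} at evaluation")
```

With a rate of 0.9 only about one unit in ten is active on each training step, against about one in two with a rate of 0.5, while every unit is active at evaluation time in both cases.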

After a quick round of hyperparameter testing, I can report a significant improvement in the results with a completely different set of parameters. It is important to note that, due to the complexity resulting from the high number of parameters, I was not able to perform a complete grid search, so this result may still be suboptimal. The parameters are as follows:

*   Input (or window) length: 512
*   Prediction length: 96
*   Learning rate: 0.00001
*   Batch size: 32
*   Hidden size: 64
*   Delta for early stopping: 0.01
*   Tolerance for early stopping: 5
*   Dropout: 0.6
*   N° of mixer layers: 4
*   N° of epochs: 29 (Early stopped)

![paper](images/loss_other.png)
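The delta and tolerance parameters listed above can be read as follows: training stops once the validation loss has failed to improve by more than delta for a given number of consecutive epochs. The sketch below shows one common way to write such a rule; it is an assumption about the semantics and not necessarily identical to the logic in *train.py*, and the loss values are made up for the example.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved by more than `delta`
    for `tolerance` consecutive epochs."""

    def __init__(self, tolerance=5, delta=0.01):
        self.tolerance = tolerance
        self.delta = delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.tolerance

# Made-up validation losses, for illustration only.
losses = [1.00, 0.80, 0.70, 0.695, 0.70, 0.695, 0.70, 0.71, 0.69, 0.68]
stopper = EarlyStopping(tolerance=5, delta=0.01)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at)  # 8
```

With a delta of 0.01, small improvements such as 0.70 to 0.695 do not reset the counter, which is why a large delta can end training while the loss is still slowly decreasing.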

The final graphs for the 96 last values of the test time series for each variable are stored in the *final_all_variables* subfolder in the *images* folder.