# Assignment 5: Extended Long Short-Term Memory (xLSTM)

*Author:* Philipp Seidl

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

In this assignment, we will explore the xLSTM architecture, a novel extension of the classic LSTM model. The paper can be found here: https://arxiv.org/abs/2405.04517

## Background
Recurrent Neural Networks (RNNs), particularly LSTMs, have proven highly effective in various sequence modeling tasks. However, the emergence of Transformers, with their parallel processing capabilities, has shifted the focus away from LSTMs, especially in large-scale language modeling.
The xLSTM architecture aims to bridge this gap by enhancing LSTMs with mechanisms inspired by modern LLMs (e.g. block structure, residual connections, ...). Furthermore, it introduces:
- Exponential gating with normalization and stabilization techniques, which improves gradient flow and memory capacity.
- Modifications to the LSTM memory structure, resulting in two variants:
    - sLSTM: Employs a scalar memory with a scalar update rule and a new memory mixing technique through recurrent connections.
    - mLSTM: Features a matrix memory, employs a covariance update rule, and is fully parallelizable, making it suitable for scaling.

By integrating these extensions into residual block backbones, xLSTM blocks are formed, which can then be residually stacked to create complete xLSTM architectures.
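
To make the exponential gating with normalization and stabilization more concrete, here is a minimal scalar sketch of a single sLSTM-style step, following the update equations given in the xLSTM paper. It assumes one scalar cell with an exponential forget gate; the names `slstm_step` and `sigmoid` and the argument layout are hypothetical and are not part of the `xlstm` repository.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def slstm_step(z_pre, i_pre, f_pre, o_pre, c_prev, n_prev, m_prev):
    """One stabilized scalar sLSTM-style step (illustrative sketch).

    z_pre, i_pre, f_pre, o_pre are pre-activations; c, n, m are the cell,
    normalizer, and stabilizer states.
    """
    # Stabilizer state: running maximum of the log-gates.
    m = max(f_pre + m_prev, i_pre)
    # Exponential gates, shifted by m so that exp() never sees a large argument.
    i_gate = math.exp(i_pre - m)
    f_gate = math.exp(f_pre + m_prev - m)
    # Cell and normalizer updates.
    c = f_gate * c_prev + i_gate * math.tanh(z_pre)
    n = f_gate * n_prev + i_gate
    # Normalized hidden state, gated by a sigmoid output gate.
    h = sigmoid(o_pre) * c / n
    return h, c, n, m


# A very large input-gate pre-activation would overflow a naive exp(i_pre),
# but the stabilized step stays finite.
h, c, n, m = slstm_step(z_pre=0.5, i_pre=1000.0, f_pre=0.0, o_pre=0.0,
                        c_prev=0.0, n_prev=0.0, m_prev=0.0)
print(h, c, n, m)
```

The shift by `m` rescales `c` and `n` by the same factor, so the ratio `c / n` that enters the hidden state is unchanged; only the intermediate values are kept in a safe numeric range.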

## Exercise 1: Environment Setup

When working with new architectures or specialized frameworks, it's essential to set up the environment correctly to ensure reproducibility. This exercise focuses on setting up the environment for working with the `xlstm` repository.

1. Visit and clone the official repository: [https://github.com/NX-AI/xlstm](https://github.com/NX-AI/xlstm).  
2. Set up the environment  
3. Document your setup:  
   - OS, Python version, Environment setup, CUDA version (if applicable), and GPU details.  
   - Note any challenges you faced and how you resolved them. 
4. Submit your setup as a bash script using the IPython `%%bash` magic. Ensure it is reproducible.

Getting only mLSTM to work is sufficient if you encounter issues with the sLSTM CUDA kernels.

> **Note**: Depending on your system setup, you may need to adjust the `environment_pt220cu121.yaml` file, for example to match your CUDA version. For this assignment, running on a GPU is recommended. If you don't have one, consider using [Colab](https://colab.research.google.com/notebooks/welcome.ipynb#recent=true) or other online resources.

> **Recommendations**: While the repository suggests using `conda`, we recommend `mamba` or `micromamba` instead, as they are considerably faster (unless you are using Colab). Learn more about them here: [https://mamba.readthedocs.io/en/latest/index.html](https://mamba.readthedocs.io/en/latest/index.html).
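
For the documentation step, the requested details (OS, Python version, CUDA version, GPU) can be collected programmatically. The following is a small sketch, not a required format; the helper name `describe_setup` is hypothetical, and the PyTorch-related entries are only filled in when `torch` is importable.

```python
import platform


def describe_setup():
    """Collect basic facts about the current environment as a dictionary."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        import torch  # optional: only reported when PyTorch is installed
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # CUDA version PyTorch was built with
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    else:
        info["gpu"] = None
    return info


for key, value in describe_setup().items():
    print(f"{key}: {value}")
```

Pasting the printed lines next to the setup script makes it easier for others to see which versions the script was actually tested with.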

In [None]:
%%bash
########## SOLUTION BEGIN ##########
# Step 1: check the capabilities of the current setup (PyTorch and CUDA versions).
# This part is Python, so it is passed to the interpreter through a heredoc.
python - <<'EOF'
import torch

print("PyTorch version:", torch.__version__)
print("CUDA version (used by PyTorch):", torch.version.cuda)
print("Is CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA capability of the device:", torch.cuda.get_device_capability())
EOF

# Step 2: clone the xlstm repo and enter it.
git clone https://github.com/NX-AI/xlstm.git
cd xlstm

# Step 3: create the environment. Before running this, I had to edit the
# environment file (environment_pt220cu121.yaml) to fit my setup.
conda env create -f environment_pt220cu121.yaml

# Step 4: activate the xlstm env. The shell hook is needed because
# `conda activate` is not set up in the non-interactive shell started by %%bash.
eval "$(conda shell.bash hook)"
conda activate xlstm

# Step 5: install the xlstm and mlstm_kernels packages into the active env.
pip install xlstm
pip install mlstm_kernels

# Problems:
# - I had many problems creating the environment until I figured out that I have
#   to edit the yaml file to adjust the CUDA, PyTorch, Python, and Jupyter
#   versions and to remove the requirements that are unnecessary for my setup.
# - I also had to rename xlstm to Xlstm, including in the import line, because
#   the import was picking up the wrong __init__.py file.

########## YOUR SOLUTION HERE ##########

In [5]:
# Verify your installation of xLSTM:
from omegaconf import OmegaConf
from dacite import from_dict
from dacite import Config as DaciteConfig
from Xlstm import xLSTMBlockStack, xLSTMBlockStackConfig
import os
import torch

DEVICE = "cuda" if torch.cuda.is_available() else 'cpu'

print(DEVICE)

use_slstm_kernels = True # set to True if you want to check if sLSTM cuda kernels are working

xlstm_cfg = f"""
mlstm_block:
  mlstm:
    conv1d_kernel_size: 4
    qkv_proj_blocksize: 4
    num_heads: 4
slstm_block:
  slstm:
    backend: {'cuda' if use_slstm_kernels else 'vanilla'}
    num_heads: 4
    conv1d_kernel_size: 4
    bias_init: powerlaw_blockdependent
  feedforward:
    proj_factor: 1.3
    act_fn: gelu
context_length: 32
num_blocks: 7
embedding_dim: 64
slstm_at: [] # empty = mLSTM only
"""
cfg = OmegaConf.create(xlstm_cfg)
cfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))
xlstm_stack = xLSTMBlockStack(cfg)

x = torch.randn(4, 32, 64).to(DEVICE)
xlstm_stack = xlstm_stack.to(DEVICE)
y = xlstm_stack(x)
y.shape == x.shape

cuda


True

## Exercise 2: Understanding xLSTM Hyperparameters
Explain key hyperparameters that influence the performance and behavior of the xLSTM architecture and explain how they influence total parameter count.
The explanation should include: proj_factor, num_heads, act_fn, context_length, num_blocks, embedding_dim, hidden_size, dropout, slstm_at, qkv_proj_blocksize, conv1d_kernel_size. Also include how the matrix memory size of mLSTM is determined.

########## SOLUTION BEGIN ##########

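The hyperparameters of xLSTM directly affect the performance, efficiency, and computational cost of the model, as well as its total parameter count.

For each hyperparameter, the effect on performance, efficiency, and parameter count is as follows:

 - `proj_factor`: sets the width of the projection inside a block (the inner dimension of the mLSTM block and the feedforward width of the sLSTM block) as a multiple of `embedding_dim`. Increasing it increases the model size, the computational cost, and the parameter count; reducing it reduces all three.

 - `num_heads`: splits the hidden dimension into independent heads. More heads let the model track several patterns in parallel, but each head becomes smaller. The effect on the parameter count depends on the block type: the gates of mLSTM have one output per head, so they grow slightly with the number of heads, while head-wise (block-diagonal) recurrent matrices in sLSTM shrink as the number of heads grows.

 - `act_fn`: the type of activation function used in the feedforward part affects convergence and computational cost. It has no effect on the parameter count.

 - `context_length`: increasing it helps to capture long temporal dependencies but requires more memory and computation. It should not add trainable weights to the xLSTM blocks themselves, so its effect on model size is indirect at most.

 - `num_blocks`: increasing the number of blocks usually improves performance but adds computational cost. The parameter count grows linearly with the number of blocks.

 - `embedding_dim`: increasing it gives the model more capacity and can improve performance, at the price of more memory and computation. It increases the parameter count, roughly quadratically for the dense projections.

 - `hidden_size`: increasing it improves the ability of the model to learn complex patterns but needs more memory, and it increases the parameter count. In the sLSTM configuration it appears to be tied to `embedding_dim` rather than chosen independently.

 - `dropout`: improves generalization but can slow down convergence. It has no effect on the parameter count.

 - `slstm_at`: lists the block indices at which an sLSTM block is used instead of an mLSTM block (an empty list means mLSTM only). It adds no parameters by itself, but because the two block types differ in size, the choice changes the total parameter count, the memory usage, and the speed.

 - `qkv_proj_blocksize`: sets the block size of the block-diagonal query, key, and value projections in mLSTM. Reducing it lowers the parameter count but may limit the ability of the model to learn; increasing it increases the parameter count.

 - `conv1d_kernel_size`: sets the width of the causal convolution inside a block. Increasing it gives a larger local context and can improve performance, but it also increases the computational cost and, slightly, the parameter count.

 - Matrix memory size in mLSTM: each head stores a matrix of size `head_dim × head_dim`, where `head_dim` is the inner dimension (`proj_factor · embedding_dim`) divided by `num_heads`. A larger memory allows the model to store more contextual information but increases memory usage. The memory is a state, not a weight, so it has no direct effect on the parameter count, although the hyperparameters that determine its size do.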
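
The relationship between these hyperparameters and the mLSTM sizes can be sketched with plain arithmetic. This is a rough estimate under stated assumptions, not a reproduction of the repository's exact layer layout: it assumes that the inner dimension is `proj_factor * embedding_dim` without any rounding, that the query, key, and value projections are block-diagonal with square blocks of size `qkv_proj_blocksize`, and that the convolution is depthwise. The helper `mlstm_sizes` and the value `proj_factor=2.0` are illustrative only.

```python
def mlstm_sizes(embedding_dim, proj_factor, num_heads,
                qkv_proj_blocksize, conv1d_kernel_size):
    """Estimate mLSTM inner sizes (illustrative helper, not part of xlstm)."""
    inner_dim = int(proj_factor * embedding_dim)  # assumption: no rounding applied
    if inner_dim % num_heads != 0:
        raise ValueError("inner_dim must be divisible by num_heads")
    head_dim = inner_dim // num_heads
    return {
        "inner_dim": inner_dim,
        "head_dim": head_dim,
        # Matrix memory: one head_dim x head_dim state per head (not a weight).
        "memory_entries_per_head": head_dim * head_dim,
        "memory_entries_total": num_heads * head_dim * head_dim,
        # Block-diagonal projection: inner_dim rows, blocksize weights per row,
        # for each of q, k, and v (biases ignored).
        "qkv_weights": 3 * inner_dim * qkv_proj_blocksize,
        # Depthwise convolution: one kernel per channel (bias ignored).
        "conv_weights": inner_dim * conv1d_kernel_size,
    }


sizes = mlstm_sizes(embedding_dim=64, proj_factor=2.0, num_heads=4,
                    qkv_proj_blocksize=4, conv1d_kernel_size=4)
print(sizes)
```

Under these assumptions, doubling `num_heads` halves `head_dim`, which quarters the memory per head and halves the total memory, while the estimated projection and convolution weights stay the same.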


## Exercise 3: Train an xLSTM model on the Trump Dataset from the previous exercise
Your task is to train an xLSTM model on the Trump Dataset from the previous exercise. 
- The goal is to achieve an average validation loss $\mathcal{L}_{\text{val}} < 1.35$. 
- You do not need to perform an extensive hyperparameter search, but you should document your runs. Log your runs with used hyperparameters using tools like wandb, neptune, mlflow, ... or a similar setup. Log training/validation loss and learning rate over steps as well as total trainable parameters of the model for each run.
- You can use the training setup from the previous exercises or any setup of your choice using high-level training libraries.
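
If none of the listed tracking tools is available, the required quantities can also be logged with a few lines of standard-library code. The sketch below is one possible setup, not a replacement for wandb, neptune, or mlflow; `JsonlRunLogger` and `count_trainable_parameters` are hypothetical helper names. The parameter counter only relies on the `parameters()`, `numel()`, and `requires_grad` interface that `torch.nn.Module` and its tensors provide.

```python
import json
import time


def count_trainable_parameters(model):
    """Sum the sizes of all parameters that require gradients.

    Works with any object exposing `parameters()` like `torch.nn.Module`.
    """
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


class JsonlRunLogger:
    """Append one JSON record per line to a log file (hypothetical helper)."""

    def __init__(self, path, hyperparameters, num_parameters):
        self.path = path
        # The first record stores the configuration of the run.
        self._write({
            "type": "config",
            "time": time.time(),
            "hyperparameters": hyperparameters,
            "trainable_parameters": num_parameters,
        })

    def log(self, step, **metrics):
        """Store metrics such as train_loss, val_loss, or lr for one step."""
        self._write({"type": "metrics", "step": step, **metrics})

    def _write(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
```

In a training loop this could be used as `logger.log(step, train_loss=loss.item(), lr=scheduler.get_last_lr()[0])`, assuming a PyTorch learning-rate scheduler named `scheduler`; each run then leaves a single file that can be loaded line by line for plotting.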

## Exercise 4: Utilizing a Pretrained Model (Bonus)

Foundation models, i.e. models pretrained on large amounts of data, are becoming more and more important. We can use these models and fine-tune them on our dataset, rather than training them from scratch.
Here are the things to consider:

- Model Selection: Choose a pretrained language model from an online repository. Hint: You can explore platforms like Hugging Face (huggingface.co), which host numerous pretrained models.

- Dataset: Use the Trump dataset with the same training and validation split as in previous exercises. You do not need to use character tokenization.

- Performance Evaluation: Evaluate the performance of the pretrained model on the validation set before and during fine-tuning. Report average-CE-loss as well as an example generated sequence with the same prompt for each epoch.
 
- Fine-tuning: Adjust the learning rate, potentially freeze some layers, train for a few epochs with a framework of your choice (e.g. [lightning](https://lightning.ai/docs/pytorch/stable/), [huggingface](https://huggingface.co/models), ...)

- Computational Resources: Be mindful of the computational demands of pretrained models. You might need access to GPUs. Try to keep the model size small, e.g. by choosing distilled versions or other small LMs.

- Hyperparameter Tuning: You can experiment with different learning rates and potentially other hyperparameters during fine-tuning, but there is no need to do this in depth.

By completing this exercise, you will gain experience with utilizing pretrained models, understanding their capabilities, and the process of fine-tuning. Decreasing the validation loss can be seen as a success for this exercise.
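
When the pretrained model uses its own (usually subword) tokenizer, its average CE loss per token is not directly comparable with a character-level loss, because one token covers several characters. One way to obtain comparable numbers is to divide the total negative log-likelihood by the number of characters instead of the number of tokens. The helpers below sketch this conversion; the function names are hypothetical, and the loss is assumed to be measured in nats.

```python
import math


def per_char_loss(avg_token_loss, num_tokens, num_chars):
    """Convert an average per-token CE loss into an average per-character loss."""
    total_nll = avg_token_loss * num_tokens  # total negative log-likelihood
    return total_nll / num_chars


def perplexity(avg_loss):
    """Perplexity that corresponds to an average CE loss given in nats."""
    return math.exp(avg_loss)


# Example: 100 tokens covering 400 characters with an average token loss of 3.0.
print(per_char_loss(3.0, num_tokens=100, num_chars=400))
print(perplexity(3.0))
```

Reporting which normalization was used, per token or per character, next to every loss value avoids comparing numbers that live on different scales.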

> **Note**: This is a standalone exercise and doesn't build upon the previous tasks.

In [None]:
########## SOLUTION BEGIN ##########

########## YOUR SOLUTION HERE ##########