### T5 (Text-to-Text Transfer Transformer): A Mathematical Overview

**Introduction**:
The T5 model, which stands for "Text-to-Text Transfer Transformer," represents a unified framework for tackling a wide variety of Natural Language Processing (NLP) tasks. Developed by Google AI, T5 reframes every NLP problem into a text-to-text format, where the model takes text as input and produces text as output. This approach leverages the power of the Transformer architecture and large-scale pre-training on a diverse dataset.

**1. Architectural Framework**:
T5 is built upon the standard Transformer encoder-decoder architecture. Key components include:

- **Encoder-Decoder Stack**: Unlike decoder-only models like GPT-2, T5 employs both an encoder (to process the input sequence) and a decoder (to generate the output sequence). Each consists of multiple layers.
- **Multi-Head Self-Attention with relational embeddings**: Both the encoder and decoder use multi-head self-attention mechanisms to weigh the importance of different parts of the sequence. The mathematical formulation is similar to the standard Transformer:

$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

where each head is computed as:

$$
\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)
$$

and the attention function is:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left( QK^T + R \right)V
$$
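Here, \( X \) represents the input embeddings, \( W_i^Q, W_i^K, W_i^V \) are learnable projection matrices for head \( i \), \( W^O \) is the learnable output projection, and \( R \) denotes the relational embeddings defined below. Note that, unlike the original Transformer, the scores \( QK^T \) are not divided by \( \sqrt{d_k} \).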
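The formulas above can be sketched in NumPy as follows. This is a minimal illustration, not NNTile's implementation: the function names, tensor layouts, and the random toy inputs are assumptions made for the example.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, bias):
    """Assumed layouts: x (n, d_model); w_q, w_k, w_v (h, d_model, d_kv);
    w_o (h * d_kv, d_model); bias (h, n, n), playing the role of R."""
    heads = []
    for i in range(w_q.shape[0]):
        q, k, v = x @ w_q[i], x @ w_k[i], x @ w_v[i]
        # No 1/sqrt(d_kv) scaling, matching the attention formula above.
        weights = softmax(q @ k.T + bias[i])
        heads.append(weights @ v)
    # Concatenate the heads along the feature axis and apply W^O.
    return np.concatenate(heads, axis=-1) @ w_o

# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
n, d_model, d_kv, h = 4, 8, 2, 3
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(h, d_model, d_kv)) for _ in range(3))
w_o = rng.normal(size=(h * d_kv, d_model))
bias = rng.normal(size=(h, n, n))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, bias)
print(out.shape)  # (4, 8)
```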

The decoder has two attention mechanisms:
1. **Masked Self-Attention**: Similar to the encoder's self-attention but with a causal mask to prevent attending to future tokens.
2. **Cross-Attention**: This mechanism allows the decoder to attend to the encoder's output:

$$
\text{CrossAttention}(X, E) = \text{softmax}\left( XW^Q(E W^K)^T + R \right) E W^V
$$

where \( X \) is the decoder's representation and \( E \) is the encoder's output. This cross-attention enables the decoder to focus on relevant parts of the input sequence when generating each output token.
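The two decoder attention mechanisms can be illustrated with a short NumPy sketch. To keep it brief, the projection matrices are omitted (treated as identity) and the additive term for cross-attention is set to zero; both simplifications are assumptions of this example, as are all function names.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax; exp(-inf) evaluates to 0, so masked positions get zero weight.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def causal_mask(n):
    # 0 on and below the diagonal, -inf above it: position t cannot see positions > t.
    return np.triu(np.full((n, n), -np.inf), k=1)

def attention(q, k, v, bias):
    return softmax(q @ k.T + bias) @ v

rng = np.random.default_rng(0)
n_dec, n_enc, d = 3, 5, 4
x = rng.normal(size=(n_dec, d))  # decoder representation
e = rng.normal(size=(n_enc, d))  # encoder output

# Masked self-attention: queries, keys and values all come from the decoder.
self_weights = softmax(x @ x.T + causal_mask(n_dec))

# Cross-attention: queries come from the decoder, keys and values from the encoder.
cross_out = attention(x, e, e, np.zeros((n_dec, n_enc)))

print(self_weights[0])  # the first position attends only to itself
print(cross_out.shape)  # (3, 4)
```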

The relational (relative position) embeddings $R$ in the attention mechanism are computed as:

$$
R_{i,j} = \text{emb}_r(f(i-j))
$$

where $\text{emb}_r$ is a learned embedding layer that maps the bucketed relative distance between positions $i$ and $j$ to a vector with one entry per attention head, so each head adds its own scalar bias to the score of the pair $(i, j)$.
$f$ is a bucketing function that maps relative distances to a fixed number of buckets, which limits the size of the embedding layer (e.g., by clamping large distances to a maximum value). $f$ is shared across all layers in the decoder and encoder stacks.
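A bucketing function of this kind can be sketched in pure Python. The version below follows the log-spaced bucketing scheme commonly used in T5 implementations; the parameter names `num_buckets` and `max_distance`, their default values, and the sign convention of the relative distance are assumptions and may differ from what NNTile uses.

```python
import math

def relative_position_bucket(rel, bidirectional=True, num_buckets=32, max_distance=128):
    """Map a signed relative distance `rel` to a bucket index in [0, num_buckets).

    Assumed scheme: small distances get one bucket each, larger distances share
    log-spaced buckets, and anything beyond `max_distance` is clamped.
    """
    bucket = 0
    if bidirectional:
        # Half of the buckets for each direction (used in the encoder).
        num_buckets //= 2
        if rel > 0:
            bucket += num_buckets
        rel = abs(rel)
    else:
        # Causal case (decoder): all positive distances collapse to 0.
        rel = max(-rel, 0)

    max_exact = num_buckets // 2
    if rel < max_exact:
        return bucket + rel  # exact bucket for small distances

    # Log-spaced buckets for larger distances, clamped to the last bucket.
    large = max_exact + int(
        math.log(rel / max_exact) / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    )
    return bucket + min(large, num_buckets - 1)

print([relative_position_bucket(d) for d in (-1000, -3, 0, 3, 1000)])
# [15, 3, 0, 19, 31]
```

The bucket index is then used as the lookup key for the learned embedding, so the embedding table needs only `num_buckets` rows regardless of the sequence length.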



**2. The Text-to-Text Framework**:
Although T5 is a classic encoder-decoder model with the addition of relational embeddings, its core innovation is the unified approach. Every task is converted into a text-to-text problem by adding a task-specific prefix to the input sequence. The model is trained to generate the target text based on this combined input.

Examples:
- **Translation (English to German)**: `translate English to German: That is good.` -> `Das ist gut.`
- **Summarization**: `summarize: [article text...]` -> `[summary text...]`
- **Question Answering**: `question: Who invented the lightbulb? context: Thomas Edison invented the lightbulb in 1879.` -> `Thomas Edison`
- **Sentiment Analysis**: `sst2 sentence: This movie was fantastic!` -> `positive`
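The prefix convention in the examples above amounts to simple string formatting. The sketch below shows one way to build such inputs; the task keys and the helper `build_input` are hypothetical names introduced only for this illustration.

```python
# Templates taken from the examples above; the dictionary keys are hypothetical.
TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "qa": "question: {question} context: {context}",
    "sst2": "sst2 sentence: {text}",
}

def build_input(task, **fields):
    """Return the text-to-text input string for a given task."""
    return TEMPLATES[task].format(**fields)

print(build_input("translate_en_de", text="That is good."))
# translate English to German: That is good.
print(build_input("sst2", text="This movie was fantastic!"))
# sst2 sentence: This movie was fantastic!
```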

**3. Pre-training Objective**:
T5 is pre-trained on a massive and diverse text corpus called C4 (Colossal Clean Crawled Corpus) using a self-supervised denoising objective inspired by Masked Language Modeling (MLM). Specifically, T5 uses **span corruption**:

- Randomly sample spans (contiguous sequences of tokens) from the input text.
- Replace each chosen span with a single unique sentinel token (e.g., `<X>`, `<Y>`, etc.).
- The model is trained to predict the original text of the corrupted spans, using the corresponding sentinel tokens as delimiters in the target sequence.

Example:
- Original: `Thank you for inviting me to your party last week.`
- Input: `Thank you <X> me to your party <Y> week.`
- Target: `<X> for inviting <Y> last <EOS>`

This pre-training task encourages the model to learn general language understanding and generation capabilities.
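The span corruption steps can be sketched as follows. For clarity the spans are passed in explicitly instead of being sampled at random, and tokens are whitespace-separated words; the helper `corrupt_spans` and its signature are assumptions of this example.

```python
def corrupt_spans(tokens, spans, sentinels=("<X>", "<Y>", "<Z>"), eos="<EOS>"):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` are half-open, non-overlapping index ranges into `tokens`.
    """
    if len(spans) > len(sentinels):
        raise ValueError("not enough sentinel tokens for the requested spans")
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, sorted(spans)):
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)             # the sentinel delimits the span in the target
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(eos)
    return inputs, targets

tokens = "Thank you for inviting me to your party last week.".split()
inputs, targets = corrupt_spans(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(inputs))   # Thank you <X> me to your party <Y> week.
print(" ".join(targets))  # <X> for inviting <Y> last <EOS>
```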

**4. Fine-tuning**:
After pre-training, the *same* T5 model is fine-tuned on various downstream tasks. The fine-tuning process also uses the text-to-text format, simply by providing task-specific examples with the appropriate prefixes (like `translate English to German:`, `summarize:`, etc.). The model learns to associate the prefix with the desired task and output format. The loss function during both pre-training and fine-tuning is typically the standard cross-entropy loss computed over the target sequence tokens:

$$
\mathcal{L} = -\sum_{t=1}^{n} \log P(y_t | y_1, \ldots, y_{t-1}, \text{input})
$$

where \( P(y_t | \ldots) \) is the probability of the target token \( y_t \) given the input and previously generated target tokens.
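As a small worked example of this loss, the sketch below sums the negative log-probabilities of the target tokens. The toy vocabulary of three tokens and the probability values are made up for illustration; in practice the distributions come from the decoder's softmax output.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Negative log-likelihood of a target sequence.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    already conditioned on the input and on the previous target tokens.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

# Two decoding steps over a toy vocabulary of three tokens.
step_probs = [
    [0.5, 0.25, 0.25],
    [0.25, 0.25, 0.5],
]
target_ids = [0, 1]

loss = sequence_nll(step_probs, target_ids)
print(loss)  # -(log 0.5 + log 0.25) = log 8, about 2.079
```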

**In Summary**:
T5 provides a flexible text-to-text framework that simplifies the approach to diverse NLP tasks. By combining the Transformer architecture, a large-scale denoising pre-training objective (span corruption), and a unified input/output format, T5 achieved state-of-the-art results on many benchmarks at the time of its release with a single model architecture. Its versatility makes it a foundational model in modern NLP research and applications.

### 1. Environment variable setting block:

The following block sets the environment variables that are read during program execution.

Users can change these environment variables between runs.

In [1]:
# Preliminary setup of experimental environment
import os
from pathlib import Path

# Set environment variables
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Limit CUDA visibility
os.environ["OMP_NUM_THREADS"] = "1" # Disable BLAS parallelism

# All StarPU environment variables are available at https://files.inria.fr/starpu/doc/html/ExecutionConfigurationThroughEnvironmentVariables.html
os.environ["STARPU_NCPU"] = "1" # Use only 1 CPU core
os.environ["STARPU_NCUDA"] = "1" # Use only 1 CUDA device
os.environ["STARPU_SILENT"] = "1" # Do not show lots of StarPU outputs
os.environ["STARPU_SCHED"] = "dmdasd" # Name StarPU scheduler to be used
os.environ["STARPU_FXT_TRACE"] = "0" # Do not generate FXT traces
os.environ["STARPU_WORKERS_NOBIND"] = "1" # Do not bind workers (it helps if several instances of StarPU run in parallel)
os.environ["STARPU_PROFILING"] = "1" # This enables logging performance of workers and bandwidth of memory nodes
os.environ["STARPU_HOME"] = str(Path.cwd() / "starpu") # Main directory in which StarPU stores its configuration files
os.environ["STARPU_PERF_MODEL_DIR"] = str(Path(os.environ["STARPU_HOME"]) / "sampling") # Main directory in which StarPU stores its performance model files
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CPU"] = "1" # Assume all CPU cores are equal
os.environ["STARPU_PERF_MODEL_HOMOGENEOUS_CUDA"] = "1" # Assume all CUDA devices are equal
os.environ["STARPU_HOSTNAME"] = "T5_example" # Force the hostname to be used when managing performance model files
os.environ["STARPU_FXT_PREFIX"] = str(Path(os.environ["STARPU_HOME"]) / "fxt") # Directory to store FXT traces if enabled


### 2. Data Preparation Block: 

This block uses the Python script "causal_lm_data_preparation.py". The script supports the following arguments:
- hf-dataset, (default=`"roneneldan/TinyStories"`): name of the dataset to be processed and prepared for training. By default, the "TinyStories" dataset from the Hugging Face Hub is used.
- dataset-path, (default=`".data"`): path to the directory where previously downloaded datasets are cached for future use.
- dataset-select, (`int`, default=`100`): number of records from the original dataset that are included in the training set.
- hf-tokenizer, (`str`, default=`"kimihailv/llama-1.3b"`): Hugging Face repository that provides the tokenizer.
- tokenizer-path, (`str`, default=`".model"`): path to the directory where previously downloaded tokenizers are saved.
- seq-len, (`int`, default=`1024`): length of the input token sequence for the training process.
- batch-size, (`int`, default=`1`): batch size for the training process, that is, the number of input sequences processed between optimizer steps.
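To make `seq-len` and `batch-size` concrete, here is a minimal sketch of cutting a flat token stream into batches. It is an illustration only: the helper `make_batches` is hypothetical, and dropping the trailing tokens that do not fill a whole batch is an assumption, not a description of what the data preparation script actually does.

```python
def make_batches(token_ids, seq_len, batch_size):
    """Cut a flat token stream into batches of `batch_size` sequences,
    each `seq_len` tokens long. Leftover tokens are dropped (assumption)."""
    tokens_per_batch = seq_len * batch_size
    num_batches = len(token_ids) // tokens_per_batch
    batches = []
    for b in range(num_batches):
        start = b * tokens_per_batch
        batch = [
            token_ids[start + s * seq_len : start + (s + 1) * seq_len]
            for s in range(batch_size)
        ]
        batches.append(batch)
    return batches

# 10 tokens with seq_len=2 and batch_size=2 give 2 full batches; 2 tokens are left over.
batches = make_batches(list(range(10)), seq_len=2, batch_size=2)
print(batches)  # [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
```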

In [2]:
!python3 ../wrappers/python/examples/causal_lm_data_preparation.py \
--hf-tokenizer="google/flan-t5-small" --seq-len=512 \
--batch-size=1 --dataset-select=16



Downloading data: 100%|██████████████████████| 249M/249M [00:04<00:00, 51.6MB/s]
Downloading data: 100%|██████████████████████| 248M/248M [00:03<00:00, 69.6MB/s]
Downloading data: 100%|██████████████████████| 246M/246M [00:03<00:00, 63.2MB/s]
Downloading data: 100%|██████████████████████| 248M/248M [00:03<00:00, 66.8MB/s]
Downloading data: 100%|████████████████████| 9.99M/9.99M [00:00<00:00, 27.8MB/s]
Generating train split: 2119719 examples [00:07, 279041.29 examples/s]
Generating validation split: 21990 examples [00:00, 259186.31 examples/s]
tokenizer_config.json: 100%|███████████████| 2.54k/2.54k [00:00<00:00, 31.2MB/s]
spiece.model: 100%|██████████████████████████| 792k/792k [00:00<00:00, 3.77MB/s]
tokenizer.json: 100%|██████████████████████| 2.42M/2.42M [00:00<00:00, 4.31MB/s]
special_tokens_map.json: 100%|█████████████| 2.20k/2.20k [00:00<00:00, 20.7MB/s]


Below we show an example of using the T5 model implemented in the NNTile framework. We explore the following scenarios:

- **Training the model from a random initial state and saving it to a checkpoint.**
- **Loading the model weights from a checkpoint and continuing training with a different data type.**
- **Training the remote model downloaded from the Hugging Face infrastructure.**

For the training and continued-training scenarios, the Python script "t5_lmhead_training.py" is used. The script supports the following arguments:

- remote_model_name, (str, default="google/flan-t5-small"): Specifies the name of the T5-based model on HuggingFace hub used to initialize the configuration and initial state of the NNTile model.  

- pretrained, (choices=["local", "remote"], default="remote"): Indicates the source of the pretrained model. "local" requires config-path (and optionally checkpoint-path), while "remote" downloads from the hub specified by remote_model_name.

- checkpoint-path, (str, default=""): Path to a saved checkpoint file to resume training or initialize a local model.

- config-path, (str, default=""): Path to the configuration JSON file, required if pretrained is "local" and no checkpoint-path is provided.

- save-checkpoint-path, (str, default=".model"): Path where the trained model checkpoint will be saved.

- optimizer, (choices=["sgd", "adam", "adamw"], default="adam"): Specifies the optimization algorithm to use during training.

- model-path, (str, default=".model"): Directory path used to cache models downloaded from HuggingFace hub.

- seq-len, (int, default=512): Length of the input token sequences for training.

- seq-len-tile, (int, default=-1): Tile size for the sequence length dimension. If -1, defaults to seq-len.

- batch-size, (int, default=1): Number of sequences processed between optimizer steps.

- minibatch-size, (int, default=-1): The size of smaller batches the full batch-size is divided into for gradient accumulation. If -1, defaults to batch-size.

- minibatch-size-tile, (int, default=-1): The tile size for the minibatch dimension that is processed by individual hardware units (CPU/GPU). If -1, defaults to minibatch-size.

- d-model-tile, (int, default=-1): Tile size for the model's hidden dimension (d_model). If -1, it's inferred from the loaded model configuration.

- d-ff-tile, (int, default=-1): Tile size for the feed-forward intermediate dimension (d_ff). If -1, it's inferred from the loaded model configuration.

- num-heads-tile, (int, default=-1): Tile size for the number of attention heads dimension. If -1, it's inferred from the loaded model configuration.

- num-labels, (int, default=2): Number of output classes for the sequence classification task.

- dtype, (choices=["fp32", "fp64", "tf32", "bf16", "fp32_fast_fp16", "fp32_fast_bf16"], default="fp32"): Data type used for model computations and storage.

- restrict, (choices=["cpu", "cuda", None], default=None): Restricts computations to specific hardware: "cpu" for CPU only, "cuda" for GPU only, or None to use all available resources.

- use-redux, (action="store_true"): Enables the use of reduction operations for potentially faster computation on certain hardware configurations.

- dataset-path, (str, default=".data"): Directory path where the dataset file is located.

- dataset-file, (str, default=""): Path to the dataset file (relative to dataset-path), expected in .npz format with 'input_ids' and 'labels' keys. If empty, dummy data is generated.

- lr, (float, default=1e-4): Learning rate for the optimizer.

- nepochs, (int, default=1): Number of times to iterate over the entire training dataset.

- logger, (action="store_true"): Enables NNTile's internal logger for debugging and performance monitoring.

- logger-server-addr, (str, default="localhost"): Network address of the NNTile logger server.

- logger-server-port, (int, default=5001): Network port of the NNTile logger server.
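The relationship between `batch-size`, `minibatch-size`, and `minibatch-size-tile`, including the `-1` defaults, can be sketched as follows. The helper `resolve_batching` is hypothetical, and the requirement that each size divides the next one evenly is an assumption of this sketch, not a documented constraint of the script.

```python
def resolve_batching(batch_size, minibatch_size=-1, minibatch_size_tile=-1):
    """Resolve the -1 defaults described above and derive the resulting counts."""
    if minibatch_size == -1:
        minibatch_size = batch_size           # no gradient accumulation
    if minibatch_size_tile == -1:
        minibatch_size_tile = minibatch_size  # one tile per minibatch
    # Assumption of this sketch: sizes must divide evenly.
    if batch_size % minibatch_size or minibatch_size % minibatch_size_tile:
        raise ValueError("sizes must divide evenly")
    return {
        "minibatch_size": minibatch_size,
        "minibatch_size_tile": minibatch_size_tile,
        # Minibatches accumulated before one optimizer step.
        "accumulation_steps": batch_size // minibatch_size,
        # Tiles that can be processed by separate hardware units.
        "tiles_per_minibatch": minibatch_size // minibatch_size_tile,
    }

print(resolve_batching(8))
# {'minibatch_size': 8, 'minibatch_size_tile': 8, 'accumulation_steps': 1, 'tiles_per_minibatch': 1}
print(resolve_batching(8, minibatch_size=4, minibatch_size_tile=2))
# {'minibatch_size': 4, 'minibatch_size_tile': 2, 'accumulation_steps': 2, 'tiles_per_minibatch': 2}
```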

### 1. Training from a random initial state and saving to a checkpoint

This requires the `pretrained` option to be set to `local` and `config-path` to point to a previously created `.json` configuration file.

In [3]:
!python ../wrappers/python/examples/t5_lmhead_training.py \
--restrict="cuda" --pretrained=local \
--config-path="../wrappers/python/examples/t5_config.json" \
--save-checkpoint-path=".model/nntile_checkpoint.pt" \
--optimizer="adam" --lr=1e-4 --dtype=bf16 --nepochs=1 \
--dataset-file="tinystories/train.bin"



Namespace(remote_model_name='google/flan-t5-small', pretrained='local', checkpoint_path='', config_path='/home/jovyan/sivtsov/nntile/wrappers/python/examples/t5_config.json', save_checkpoint_path='.model/nntile_checkpoint.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=1, minibatch_size=-1, minibatch_size_tile=-1, d_model_tile=-1, d_ff_tile=-1, num_heads_tile=-1, num_labels=2, dtype='bf16', restrict='cuda', use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
T5Config {
  "classifier_dropout": 0.0,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.0,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-05,
  "model_type": "t5",
  "num_decoder_l

### 2. Loading the model weights from a checkpoint and continuing training with a different data type

To do this, set the `pretrained` parameter to `local` again. The `config-path` parameter must point to a previously created configuration file in JSON format, and the `checkpoint-path` parameter must point to an existing checkpoint file in PyTorch format. After that, training can be continued.

In [4]:
!python ../wrappers/python/examples/t5_lmhead_training.py \
--restrict="cuda" --pretrained=local \
--checkpoint-path=".model/nntile_checkpoint.pt" \
--config-path="../wrappers/python/examples/t5_config.json" \
--save-checkpoint-path=".model/nntile_checkpoint_v1.pt" \
--optimizer="adam" --lr=1e-4 --dtype=bf16 --nepochs=1 \
--dataset-file="tinystories/train.bin"


Namespace(remote_model_name='google/flan-t5-small', pretrained='local', checkpoint_path='/home/jovyan/sivtsov/nntile/wrappers/python/examples/.model/nntile_checkpoint.pt', config_path='/home/jovyan/sivtsov/nntile/wrappers/python/examples/t5_config.json', save_checkpoint_path='/home/jovyan/sivtsov/nntile/notebooks/.model/nntile_checkpoint_v1.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=1, minibatch_size=-1, minibatch_size_tile=-1, d_model_tile=-1, d_ff_tile=-1, num_heads_tile=-1, num_labels=2, dtype='bf16', restrict='cuda', use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
  checkpoint = torch.load(args.checkpoint_path)
T5Config {
  "classifier_dropout": 0.0,
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 512,
  "decoder_start_token_id": 0,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.0,
  "eos_token_id": 1,
  "feed_forward_proj

### 3. Continuing to train the model loaded from the Hugging Face library

The NNTile framework currently supports continued training of a model loaded from a remote source,
in our example the Hugging Face Hub.

The weights of the loaded model are passed to the model implemented in NNTile. To run this scenario, set the `pretrained` parameter to `remote`.
The `config-path` and `checkpoint-path` parameters are no longer required,
as the model configuration and layer weights are obtained from the loaded model.
After that, training can be continued.
    

In [5]:
!python ../wrappers/python/examples/t5_lmhead_training.py \
--restrict="cuda" --pretrained=remote \
--save-checkpoint-path=".model/nntile_checkpoint_v2.pt" \
--optimizer="adam" --lr=1e-4 --dtype=bf16 --nepochs=1 \
--dataset-file="tinystories/train.bin"


Namespace(remote_model_name='google/flan-t5-small', pretrained='remote', checkpoint_path='', config_path='', save_checkpoint_path='.model/nntile_checkpoint_v2.pt', optimizer='adam', model_path='.model', seq_len=512, seq_len_tile=-1, batch_size=1, minibatch_size=-1, minibatch_size_tile=-1, d_model_tile=-1, d_ff_tile=-1, num_heads_tile=-1, num_labels=2, dtype='bf16', restrict='cuda', use_redux=False, dataset_path='.data', dataset_file='tinystories/train.bin', lr=0.0001, nepochs=1, logger=False, logger_server_addr='localhost', logger_server_port=5001)
config.json: 100%|█████████████████████████| 1.40k/1.40k [00:00<00:00, 10.8MB/s]
model.safetensors: 100%|██████████████████████| 308M/308M [00:02<00:00, 108MB/s]
Some weights of T5ForSequenceClassification were not initialized from the model checkpoint at google/flan-t5-small and are newly initialized: ['classification_head.dense.bias', 'classification_head.dense.weight', 'classification_head.out_proj.bias', 'classification_head.out_proj.wei