### T5 (Text-to-Text Transfer Transformer): A Mathematical Overview

**Introduction**:
The T5 model, which stands for "Text-to-Text Transfer Transformer," represents a unified framework for tackling a wide variety of Natural Language Processing (NLP) tasks. Developed by Google AI, T5 reframes every NLP problem into a text-to-text format, where the model takes text as input and produces text as output. This approach leverages the power of the Transformer architecture and large-scale pre-training on a diverse dataset.

**1. Architectural Framework**:
T5 is built upon the standard Transformer encoder-decoder architecture. Key components include:

- **Encoder-Decoder Stack**: Unlike decoder-only models like GPT-2, T5 employs both an encoder (to process the input sequence) and a decoder (to generate the output sequence). Each consists of multiple layers.
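- **Multi-Head Self-Attention with relational embeddings**: Both the encoder and decoder use multi-head self-attention to weigh the importance of different positions in the sequence. The formulation is close to the standard Transformer; the difference is that positional information enters through a relational (relative position) term added to the attention logits rather than through absolute position embeddings added to the input: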

$$
\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

where each head is computed as:

$$
\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)
$$

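and the attention function is:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left( QK^T + R \right)V
$$

Here, \( X \) represents the input embeddings (or the output of the previous layer), \( W_i^Q, W_i^K, W_i^V \) are learnable per-head projection matrices, \( W^O \) is the learnable output projection, and \( R \) is the matrix of relational embeddings added to the attention logits before the softmax. Unlike the original Transformer, the logits are not divided by \( \sqrt{d_k} \); T5 folds that scaling into its weight initialization.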
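
As an illustration of the attention function above, here is a minimal single-head sketch in plain Python. The helper names (`softmax`, `matmul`, `attention`) and the toy values are assumptions made for this example; a real model would use a tensor library and learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def attention(q, k, v, bias):
    """Compute softmax(Q K^T + R) V for one head.

    bias[i][j] plays the role of R_{i,j}. As in the formula above,
    the logits are not rescaled before the softmax.
    """
    logits = matmul(q, transpose(k))
    logits = [[l + b for l, b in zip(lrow, brow)] for lrow, brow in zip(logits, bias)]
    weights = [softmax(row) for row in logits]
    return matmul(weights, v), weights

# Toy example: 3 positions, head dimension 2 (arbitrary values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero_bias = [[0.0] * 3 for _ in range(3)]
out, weights = attention(q, k, v, zero_bias)
```

Each row of `weights` sums to 1, and a large positive entry in `bias` pulls almost all of a row's weight onto the corresponding key position.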

The decoder has two attention mechanisms:
1. **Masked Self-Attention**: Similar to the encoder's self-attention but with a causal mask to prevent attending to future tokens.
2. **Cross-Attention**: This mechanism allows the decoder to attend to the encoder's output:

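$$
\text{CrossAttention}(X, E) = \text{softmax}\left( XW^Q(E W^K)^T \right) E W^V
$$

where \( X \) is the decoder's representation and \( E \) is the encoder's output. Queries come from the decoder while keys and values come from the encoder, so the decoder can focus on relevant parts of the input sequence when generating each output token. No relational term \( R \) appears here: in T5 the relative position bias is applied in the self-attention layers only.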
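
The causal mask used by the decoder's masked self-attention can be sketched as an additive term on the logits. This is an illustrative sketch in plain Python; the function names `causal_mask` and `masked_softmax` are hypothetical and the uniform logits are placeholder values.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Additive mask: 0 where position i may attend to j (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(logits, mask):
    """Row-wise softmax of logits + mask; masked entries receive weight 0."""
    out = []
    for lrow, mrow in zip(logits, mask):
        shifted = [l + m for l, m in zip(lrow, mrow)]
        top = max(shifted)  # the diagonal is never masked, so top is finite
        exps = [math.exp(s - top) for s in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

logits = [[0.0] * 4 for _ in range(4)]  # uniform scores, for illustration only
weights = masked_softmax(logits, causal_mask(4))
```

With uniform logits, row `i` of `weights` spreads its mass evenly over positions `0..i` and assigns zero weight to every later position.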

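The relational embeddings $R$ in the attention mechanism are computed as:

$$
R_{i,j} = \text{emb}_r(f(i-j))
$$

where $\text{emb}_r$ is a learned embedding layer that maps the bucketed relative distance between positions $i$ and $j$ to a scalar bias, one per attention head, so that $R$ can be added directly to the attention logits $QK^T$.
$f$ is a bucketing function that maps relative distances to a fixed number of buckets, which limits the size of the embedding layer: small distances each get their own bucket, larger distances share progressively wider buckets, and distances beyond a maximum are clamped into the last bucket. The bucketing function $f$ is the same for every layer, and T5 also shares the embedding parameters across the layers of each stack.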
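
To make the role of $f$ concrete, here is a simplified bucketing sketch in plain Python. It is not T5's exact function: it ignores the sign of the distance, and the name `bucket`, its parameters (`num_exact`, `num_buckets`, `max_distance`), and their default values are assumptions chosen for this example. The `bias_table` stands in for the learned scalars of one attention head.

```python
import math

def bucket(relative_distance, num_exact=8, num_buckets=16, max_distance=64):
    """Map a relative distance to a bucket index in [0, num_buckets).

    Distances below num_exact each get their own bucket, larger distances
    share logarithmically spaced buckets, and anything at or beyond
    max_distance is clamped into the last bucket.
    """
    d = abs(relative_distance)  # simplification: direction is ignored
    if d < num_exact:
        return d
    # Position of d on a log scale between num_exact and max_distance.
    frac = math.log(d / num_exact) / math.log(max_distance / num_exact)
    idx = num_exact + int(frac * (num_buckets - num_exact))
    return min(idx, num_buckets - 1)

# Stand-in for learned scalars: one bias value per bucket, for one head.
bias_table = [0.1 * b for b in range(16)]

# R[i][j] = emb_r(f(i - j)) for a sequence of length 5.
n = 5
R = [[bias_table[bucket(i - j)] for j in range(n)] for i in range(n)]
```

Because the table is indexed by bucket rather than by raw distance, its size stays fixed no matter how long the sequence is.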



**2. The Text-to-Text Framework**:
Although T5 is architecturally a classic encoder-decoder model with the addition of relational embeddings, its core innovation is the unified approach. Every task is converted into a text-to-text problem by adding a task-specific prefix to the input sequence, and the model is trained to generate the target text from this combined input.

Examples:
- **Translation (English to German)**: `translate English to German: That is good.` -> `Das ist gut.`
- **Summarization**: `summarize: [article text...]` -> `[summary text...]`
- **Question Answering**: `question: Who invented the lightbulb? context: Thomas Edison invented the lightbulb in 1879.` -> `Thomas Edison`
- **Sentiment Analysis**: `sst2 sentence: This movie was fantastic!` -> `positive`
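
The examples above can be produced by a small formatting helper. The sketch below is one possible way to do it; the function name `to_text_to_text` and the task keys are hypothetical, while the prefix strings are the ones shown in the examples.

```python
def to_text_to_text(task, **fields):
    """Build the model input string for a task by filling in its prefix template."""
    # Task keys are illustrative; the prefixes match the examples in the text.
    templates = {
        "translate_en_de": "translate English to German: {text}",
        "summarize": "summarize: {text}",
        "qa": "question: {question} context: {context}",
        "sst2": "sst2 sentence: {text}",
    }
    return templates[task].format(**fields)

translation_input = to_text_to_text("translate_en_de", text="That is good.")
qa_input = to_text_to_text(
    "qa",
    question="Who invented the lightbulb?",
    context="Thomas Edison invented the lightbulb in 1879.",
)
```

The model itself never sees the task key, only the resulting string, which is what allows a single set of weights to serve every task.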

**3. Pre-training Objective**:
T5 is pre-trained on a massive and diverse text corpus called C4 (Colossal Clean Crawled Corpus) using a self-supervised denoising objective inspired by Masked Language Modeling (MLM). Specifically, T5 uses **span corruption**:

- Randomly sample spans (contiguous sequences of tokens) from the input text.
- Replace each chosen span with a single unique sentinel token (e.g., `<X>`, `<Y>`, etc.).
- The model is trained to predict the original text of the corrupted spans, using the corresponding sentinel tokens as delimiters in the target sequence.

Example:
- Original: `Thank you for inviting me to your party last week.`
- Input: `Thank you <X> me to your party <Y> week.`
- Target: `<X> for inviting <Y> last <EOS>`
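
The three steps and the example above can be reproduced with a short sketch. It assumes whitespace tokenization and takes the spans as explicit index pairs instead of sampling them; the function name `span_corrupt` and its parameters are hypothetical.

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>"), end_token="<EOS>"):
    """Apply span corruption to a token list.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Returns (input_tokens, target_tokens).
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted text
        inputs.append(sentinel)              # one sentinel replaces the whole span
        targets.append(sentinel)             # the same sentinel introduces the span
        targets.extend(tokens[start:end])    # followed by the original span text
        cursor = end
    inputs.extend(tokens[cursor:])           # remaining uncorrupted text
    targets.append(end_token)                # close the target sequence
    return inputs, targets

tokens = "Thank you for inviting me to your party last week.".split()
inputs, targets = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
```

Joining `inputs` and `targets` with spaces gives back the input and target strings of the example.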

This pre-training task encourages the model to learn general language understanding and generation capabilities.

**4. Fine-tuning**:
After pre-training, the *same* T5 model is fine-tuned on various downstream tasks. The fine-tuning process also uses the text-to-text format, simply by providing task-specific examples with the appropriate prefixes (like `translate English to German:`, `summarize:`, etc.). The model learns to associate the prefix with the desired task and output format. The loss function during both pre-training and fine-tuning is typically the standard cross-entropy loss computed over the target sequence tokens:

$$
\mathcal{L} = -\sum_{t=1}^{n} \log P(y_t | y_1, \ldots, y_{t-1}, \text{input})
$$

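where \( n \) is the length of the target sequence and \( P(y_t | \ldots) \) is the probability the model assigns to the target token \( y_t \) given the input and the preceding target tokens. During training the preceding tokens are the ground-truth ones (teacher forcing), not the model's own predictions.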
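
As a worked instance of the loss, the sketch below sums the negative log-probabilities of the target tokens. The function name `sequence_nll` and the toy distributions are assumptions for illustration; in practice the per-step distributions come from the decoder's softmax over the vocabulary.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Sum of -log P(y_t | ...) over the target positions.

    step_probs[t] is the predicted distribution over the vocabulary at step t,
    and target_ids[t] is the index of the correct token y_t.
    """
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))

# Toy vocabulary of size 2 and a target sequence of length 2.
step_probs = [[0.5, 0.5], [0.25, 0.75]]
target_ids = [0, 1]
loss = sequence_nll(step_probs, target_ids)  # -(log 0.5 + log 0.75)
```

The loss is zero only when the model assigns probability 1 to every target token, and it grows as probability mass moves away from the correct tokens.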

**In Summary**:
T5 provides a flexible text-to-text framework that simplifies the approach to diverse NLP tasks. By combining the Transformer encoder-decoder architecture, a large-scale denoising pre-training objective (span corruption), and a unified input/output format, T5 achieved state-of-the-art results on many benchmarks at the time of its release with a single model architecture. Its versatility makes it a foundational model in modern NLP research and applications.