# Transformers

Motivation for the attention mechanism: a model for processing text should (i) use parameter
sharing to cope with long input passages of differing lengths and (ii) contain connections
between word representations that depend on the words themselves. The transformer
acquires both properties by using **dot-product self-attention**.

Transformers capture connections between word representations through a mechanism called *self-attention*. This allows the model to weigh the importance of each word in a sequence relative to every other word, *depending on the words themselves*.

The dot-product self-attention mechanism computes the $n$th output as a weighted sum of the value vectors $v_m = \beta_v + \Omega_v x_m$:
$$\mathrm{sa}_n[x_1, \dots, x_N] = \sum_{m=1}^{N} a[x_m, x_n] v_m$$

The scalar weight $a[x_m,x_n]$ is the attention that the $n$th output pays to input $x_m$. The $N$
weights $a[\bullet,x_n]$ are non-negative and sum to one. Hence, self-attention can be thought
of as routing the values in different proportions to create each output.

----
### Self-attention mechanism:

Transformers, a type of neural network architecture commonly used in natural language processing (NLP), capture connections between word representations through a mechanism called **self-attention**. This allows the model to weigh the importance of each word in a sequence relative to every other word, depending on the words themselves. Let’s break this down step-by-step to explain how it works:

### 1. **Word Representations (Embeddings)**
Each word in a sequence is first converted into a vector representation, typically through an embedding layer. These embeddings encode the meaning of the word in a high-dimensional space, but at this stage, they don’t yet account for context or relationships with other words.

### 2. **Self-Attention Mechanism**
The key innovation in transformers is the self-attention mechanism, which establishes connections between word representations dynamically. Here’s how it happens:
- **Query, Key, and Value Vectors**: For each word embedding, the transformer computes three vectors: a **query (Q)**, a **key (K)**, and a **value (V)** vector. These are derived by multiplying the word embedding by learned weight matrices.
- **Attention Scores**: The model calculates a score for how much each word should "pay attention" to every other word by taking the dot product of the query vector of one word with the key vectors of all words (including itself). This score reflects how relevant one word is to another based on their content.
- **Softmax Normalization**: These scores are normalized using the softmax function, turning them into attention weights that sum to 1. This step ensures the model focuses more on highly relevant words.
- **Weighted Sum**: The value vectors of all words are then combined into a single representation for each word, weighted by the attention weights from the softmax step. This means the final representation of a word incorporates information from other words, with the influence depending on their relevance (as determined by the attention weights).

For example, in the sentence "The cat sat on the mat," the word "cat" might have a strong connection to "sat" because they are grammatically related, and the attention mechanism assigns a higher weight to "sat" when computing the updated representation of "cat."

### 3. **Dependency on the Word Itself**
The connections between word representations depend on the words themselves because the query, key, and value vectors are derived directly from the word embeddings. Different words produce different Q, K, and V vectors, leading to unique attention patterns. For instance:
- If "cat" is replaced with "dog," the attention scores and resulting connections shift based on the new embedding for "dog," which might emphasize different relationships in the sentence.
- This adaptability allows transformers to model context-sensitive relationships, unlike older models (e.g., RNNs) that process words sequentially without such flexible, content-based weighting.
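
The content dependence described above can be checked numerically. The following is a minimal sketch, assuming random (untrained) weight matrices and random stand-in embeddings rather than real word vectors; `attention_weights`, `X_cat`, and `X_dog` are hypothetical names chosen for illustration. Replacing the embedding of a single token changes the attention weights, while each row still sums to one.

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Row-wise softmax of scaled query-key dot products (one row per query token)."""
    Q, K = X @ W_Q, X @ W_K
    scores = Q @ K.T / np.sqrt(K.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W_Q = rng.normal(size=(4, 4))   # untrained, illustrative weights
W_K = rng.normal(size=(4, 4))

X_cat = rng.normal(size=(3, 4))  # stand-in embeddings for a 3-token sentence
X_dog = X_cat.copy()
X_dog[1] = rng.normal(size=4)    # swap the embedding of the second token only

A_cat = attention_weights(X_cat, W_Q, W_K)
A_dog = attention_weights(X_dog, W_Q, W_K)

print("Rows sum to one:", np.allclose(A_cat.sum(axis=1), 1.0))
print("Largest change in any weight:", np.abs(A_cat - A_dog).max())
```

Because the swapped token contributes a new key, the weights in every row can shift, not only the row belonging to the swapped token.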

### 4. **Multi-Head Attention**
Transformers use **multi-head attention**, where the self-attention process is repeated multiple times in parallel with different weight matrices. This allows the model to capture various types of relationships (e.g., syntactic, semantic) between words simultaneously, making the connections even richer and more nuanced.
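
A minimal sketch of multi-head attention, assuming the row convention (one token per row), random untrained weights, and no bias terms. The names `multi_head_self_attention`, `heads`, and `W_O` are hypothetical; `W_O` plays the role of the linear map that recombines the concatenated head outputs.

```python
import numpy as np

def softmax_rows(S):
    """Numerically stable softmax applied to each row."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def multi_head_self_attention(X, heads, W_O):
    """X: (N, D). heads: list of (W_Q, W_K, W_V), each (D, D_h). W_O: (H * D_h, D)."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        A = softmax_rows(Q @ K.T / np.sqrt(Q.shape[1]))  # (N, N) weights for this head
        outputs.append(A @ V)                             # (N, D_h)
    concat = np.concatenate(outputs, axis=1)              # (N, H * D_h)
    return concat @ W_O                                   # recombine the heads

rng = np.random.default_rng(0)
N, D, H = 3, 8, 2
D_h = D // H  # each head works in a smaller subspace
X = rng.normal(size=(N, D))
heads = [tuple(rng.normal(size=(D, D_h)) for _ in range(3)) for _ in range(H)]
W_O = rng.normal(size=(H * D_h, D))

out = multi_head_self_attention(X, heads, W_O)
print(out.shape)  # (3, 8)
```

Setting the per-head dimension to `D // H` keeps the total cost close to that of a single full-width head; this is a common choice, not something the text above requires.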

### 5. **Positional Encoding**
Since transformers don’t process words sequentially (unlike RNNs), they add **positional encodings** to the embeddings to provide information about word order. This ensures that the connections also account for the relative positions of words, but the core dependency still stems from the word content via self-attention.
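
As an illustration, the sketch below adds a fixed sinusoidal encoding to the embeddings. This is one common choice of absolute positional encoding and is an assumption here, since the text above does not specify a scheme; the function name is hypothetical and the embedding dimension is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """Return a (num_positions, dim) array of sin/cos features; dim is assumed even."""
    positions = np.arange(num_positions)[:, None]           # (N, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim / 2,)
    angles = positions * freqs                              # (N, dim / 2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

X = np.ones((5, 8))  # five identical token embeddings
X_pos = X + sinusoidal_positional_encoding(5, 8)

# Identical tokens become distinguishable once position is added.
print(np.allclose(X_pos[0], X_pos[1]))  # False
```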

### In Summary
The connections between word representations in a transformer depend on the words themselves because the self-attention mechanism dynamically computes how much each word influences the others based on their embeddings. This process creates contextualized representations. For example, "bank" in "I sat by the river bank" can attend strongly to "river" (suggesting the edge of a body of water), whereas in "I deposited cash at the bank" it can attend to "deposited" and "cash" (suggesting a financial institution). This flexibility and context-awareness are what make transformers effective for tasks such as translation and summarization.

------

<img src=../images/self-attention.png width=650>

The outputs therefore result from two chained linear transformations: the value vectors $\beta_{v} + \Omega_{v} x_m$ are computed independently for each input $x_m$,
and these vectors are combined linearly by the attention weights $a[x_m,x_n]$. However,
the overall self-attention computation is *nonlinear*, because the attention weights are themselves nonlinear functions of the input.
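
The two claims can be checked on a toy example. This is a minimal sketch assuming random untrained weights and no bias terms, so the queries, keys, and values are exactly linear in the input. With the attention weights held fixed, doubling the values doubles the output; doubling the input to the whole mechanism does not.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Return the attention weights and the output for inputs stored as rows of X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(K.shape[1])
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return A, A @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 4))
W_Q, W_K, W_V = (0.5 * rng.normal(size=(4, 4)) for _ in range(3))

A, out = self_attention(X, W_Q, W_K, W_V)
_, out_doubled = self_attention(2 * X, W_Q, W_K, W_V)

# With the weights A held fixed, the output is linear in the values.
linear_given_weights = np.allclose(A @ (2 * X @ W_V), 2 * out)
# The full mapping is not linear, because A itself changes with X.
linear_overall = np.allclose(out_doubled, 2 * out)

print(linear_given_weights, linear_overall)
```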

To recap, the self-attention ($QKV$) mechanism works as follows:

Each token (word or subword) in the input sequence is transformed into three vectors:
1. Query ($Q$) – Represents what the token is looking for in other tokens.
2. Key ($K$) – Represents what the token offers: it is matched against other tokens' queries to determine relevance.
3. Value ($V$) – Contains the actual information that will be passed forward after the attention weights are applied.


The attention mechanism computes the attention weights $A$ from the query and key matrices:
$$A = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)$$

where:
- $QK^T$ computes similarity between queries and keys.
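- $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, is a scaling factor that keeps the dot products from growing with the dimension, which stabilizes gradients through the softmax.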
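
To see why the scaling helps, the sketch below draws random queries and keys with unit-variance entries (an assumption made purely for illustration) and compares the spread of raw and scaled dot products. The raw dot products have a standard deviation of roughly $\sqrt{d_k}$, which would push the softmax into a saturated regime where gradients are small; dividing by $\sqrt{d_k}$ brings the spread back to roughly one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(1000, d_k))  # 1000 random queries with unit-variance entries
k = rng.normal(size=(1000, d_k))  # 1000 random keys

dots = np.sum(q * k, axis=1)      # one dot product per (query, key) pair

print("std of raw dot products:   ", dots.std())
print("std of scaled dot products:", (dots / np.sqrt(d_k)).std())
```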


Once we have the attention weights $A$, the final representation for each token is computed as:

$$\text{Output} = AV$$

This means:
- The attention weights tell us how much attention each token should pay to every other token.
- The weighted sum of value vectors produces the output representation.

```python
import numpy as np

# Define input (3 tokens, embedding size = 4)
X = np.array([[1, 0, 1, 0],  # Token 1
              [0, 1, 0, 1],  # Token 2
              [1, 1, 1, 1]]) # Token 3

# Define weight matrices (randomly initialized for simplicity)
np.random.seed(42)
W_Q = np.random.rand(4, 4)
W_K = np.random.rand(4, 4)
W_V = np.random.rand(4, 4)

# Compute Query, Key, and Value matrices
Q = X @ W_Q  # (3x4) @ (4x4) -> (3x4)
K = X @ W_K  # (3x4) @ (4x4) -> (3x4)
V = X @ W_V  # (3x4) @ (4x4) -> (3x4)

# Compute attention scores (scaled dot-product)
attention_scores = Q @ K.T  # (3x4) @ (4x3) -> (3x3)
scaled_scores = attention_scores / np.sqrt(K.shape[1])  # Scale by sqrt(d_k)

# Apply softmax over each row to get attention weights
attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=1, keepdims=True)

# Compute final output by multiplying attention weights with Value matrix
output = attention_weights @ V  # (3x3) @ (3x4) -> (3x4)

# Print results
print("Attention Weights:\n", attention_weights)
print("\nFinal Output (Weighted Sum of Values):\n", output)
```

```
Attention Weights:
 [[0.16875885 0.07392469 0.75731646]
 [0.21372837 0.19409159 0.59218004]
 [0.07229842 0.02876049 0.89894109]]

Final Output (Weighted Sum of Values):
 [[0.64157534 1.96921544 1.75396406 2.3889108 ]
 [0.59375779 1.76150136 1.58900757 2.13931733]
 [0.70437031 2.10776405 1.89519839 2.55911343]]
```


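### Hyper-network:
A hypernetwork is a secondary neural network that generates the weights of a primary network, instead of those weights being learned directly. Self-attention can be read this way: one branch of the computation (queries, keys, and softmax) produces the attention weights, which then act as the weights of a linear combination of the values computed in the other branch. More generally, hypernetworks that generate the weights of a Transformer are used to improve parameter efficiency and adaptability in large-scale models.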

***Dot-product self-attention***:

$$
\begin{aligned}
q_n &= \beta_q + \Omega_q x_n, \\
k_m &= \beta_k + \Omega_k x_m,
\end{aligned}
$$

$$
a[x_m, x_n] = \mathrm{softmax}_m \left[ k_m^{\top} q_n \right] = \frac{\exp \left[ k_m^{\top} q_n \right]}{\sum_{m'=1}^{N} \exp \left[ k_{m'}^{\top} q_n \right]},
$$

These are the attention weights; for each $x_n$, they are positive and sum to one. The dot product returns a measure of similarity between its inputs, so the weights $a[x_{\bullet},x_n]$ depend on the relative similarities
between the $n$th query and all of the keys.

Self-attention summary:

The $n$th output is a weighted sum of the same linear transformation $v_{\bullet} = \beta_v + \Omega_v x_{\bullet}$
applied to all of the inputs, where these attention weights are positive and sum to one.
The weights depend on a measure of similarity between input $x_n$ and the other inputs.
There is no activation function, but the mechanism is nonlinear due to the dot-product
and softmax operations used to compute the attention weights.

### Extensions to dot-product self-attention

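- positional encoding
  - Absolute positional encodings
  - Relative positional encodings
- scaled dot-product self-attention: the dot products are divided by $\sqrt{D_q}$, the dimension of the queries and keys, giving $\mathrm{Sa}[X] = V \cdot \mathrm{Softmax}\left[\frac{K^{\top}Q}{\sqrt{D_q}}\right].$ Here the tokens are stored as the columns of $X$ (so the softmax acts on each column), whereas the $\text{softmax}(QK^T/\sqrt{d_k})$ form above stores them as rows.
- multiple heads: $H$ self-attention mechanisms run in parallel, each with its own parameters, and their outputs are concatenated and recombined by a linear map $\Omega_c$: $\mathrm{MhSa}[X] = \Omega_c \left[ \mathrm{Sa}_1[X]^{\top}, \mathrm{Sa}_2[X]^{\top}, \dots, \mathrm{Sa}_H[X]^{\top} \right]^{\top}.$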
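
The formula $V \cdot \mathrm{Softmax}[K^{\top}Q/\sqrt{D_q}]$ and the form $\text{softmax}(QK^T/\sqrt{d_k})\,V$ used in the NumPy example describe the same computation under different storage conventions. The sketch below checks this numerically, assuming no bias terms and taking each $\Omega$ to be the transpose of the corresponding row-convention weight matrix; all variable names are hypothetical.

```python
import numpy as np

def softmax(S, axis):
    """Numerically stable softmax along the given axis."""
    S = S - S.max(axis=axis, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D = 3, 4
X_rows = rng.normal(size=(N, D))  # row convention: one token per row
X_cols = X_rows.T                 # column convention: one token per column
W_Q, W_K, W_V = (rng.normal(size=(D, D)) for _ in range(3))

# Row convention: softmax over each row of Q K^T.
Q, K, V = X_rows @ W_Q, X_rows @ W_K, X_rows @ W_V
out_rows = softmax(Q @ K.T / np.sqrt(D), axis=1) @ V          # (N, D)

# Column convention: Omega = W^T, softmax over each column of K^T Q.
Qc, Kc, Vc = W_Q.T @ X_cols, W_K.T @ X_cols, W_V.T @ X_cols  # each (D, N)
out_cols = Vc @ softmax(Kc.T @ Qc / np.sqrt(D), axis=0)       # (D, N)

print(np.allclose(out_rows, out_cols.T))
```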

The complete set of operations in a single transformer layer:
$$X \gets X + \mathrm{MhSa}[X]$$
$$X \gets \mathrm{LayerNorm}[X]$$
$$X_n \gets X_n + \mathrm{mlp}[X_n]$$
$$X \gets \mathrm{LayerNorm}[X]$$
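
The four updates can be sketched in NumPy as follows. This is an illustration under several assumptions, not a reference implementation: a single attention head stands in for $\mathrm{MhSa}$, the MLP has one hidden ReLU layer, LayerNorm normalizes each token across its embedding dimensions without learned scale and offset, and all weights are random. Every function name is hypothetical.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance (assumed variant)."""
    mean = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (stand-in for MhSa)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    S = Q @ K.T / np.sqrt(K.shape[1])
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V

def mlp(X, W1, b1, W2, b2):
    """The same two-layer ReLU network applied to every token independently."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def transformer_layer(X, attn_params, mlp_params):
    X = X + self_attention(X, *attn_params)  # X <- X + MhSa[X]
    X = layer_norm(X)                        # X <- LayerNorm[X]
    X = X + mlp(X, *mlp_params)              # X_n <- X_n + mlp[X_n]
    X = layer_norm(X)                        # X <- LayerNorm[X]
    return X

rng = np.random.default_rng(0)
N, D, D_hidden = 3, 8, 32
X = rng.normal(size=(N, D))
attn_params = tuple(0.5 * rng.normal(size=(D, D)) for _ in range(3))
mlp_params = (rng.normal(size=(D, D_hidden)), np.zeros(D_hidden),
              rng.normal(size=(D_hidden, D)), np.zeros(D))

Y = transformer_layer(X, attn_params, mlp_params)
print(Y.shape)  # (3, 8)
```

The output has the same shape as the input, which is what allows layers like this to be stacked.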

##### There are three types of transformer models:

An *encoder* transforms the text embeddings into a representation that can
support a variety of tasks. A *decoder* predicts the next token to continue the input text. *Encoder-decoders* are used in sequence-to-sequence tasks, where one text string is
converted into another (e.g., machine translation).

### BERT: (Bidirectional Encoder Representations from Transformers)
- BERT is a pretrained deep learning model designed to produce contextual word embeddings. It is pretrained with self-supervised objectives (such as predicting masked-out tokens) on large unlabeled text corpora and then fine-tuned for downstream tasks like question answering, sentiment analysis, and named entity recognition.

- BERT consists of:
    - Input Embeddings (Token, Segment, and Positional Embeddings)
    - Multiple Transformer Encoder Layers
    - Self-Attention Mechanism (Multi-Head Attention)
    - Feedforward Networks & Layer Normalization
    - Final Output Representations for NLP Tasks


$?$  The difference between encoder and decoder models:

The encoder aims to build a representation of the text that can be fine-tuned to solve a variety of more specific NLP tasks. Conversely, the decoder has one
purpose: to generate the next token in a sequence. It can generate a coherent text
passage by feeding the extended sequence back into the model.

The ***autoregressive*** formulation demonstrates the connection between maximizing the
joint probability of the tokens and the next token prediction task.

$$
\Pr(t_1, t_2, \dots, t_N) = \Pr(t_1) \prod_{n=2}^{N} \Pr(t_n | t_1, \dots, t_{n-1}).
$$
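
The factorization can be verified on a toy example. The sketch below builds a random joint distribution over three-token sequences from a three-word vocabulary (made-up numbers, small enough to store as a table), derives each conditional from the table, and multiplies them together. The function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 3, 3  # vocabulary size and sequence length (tiny, so the joint table fits in memory)
joint = rng.random(size=(V,) * N)
joint /= joint.sum()  # joint[t1, t2, t3] = Pr(t1, t2, t3)

def chain_rule_probability(tokens):
    """Multiply Pr(t1) * Pr(t2 | t1) * Pr(t3 | t1, t2), each derived from the joint table."""
    prob = 1.0
    for n in range(len(tokens)):
        prefix = tuple(tokens[:n])
        # Marginalize over all later positions with the prefix (and next token) fixed.
        prefix_and_next = joint[prefix + (tokens[n],)].sum()
        prefix_only = joint[prefix].sum()
        prob *= prefix_and_next / prefix_only  # Pr(t_n | t_1, ..., t_{n-1})
    return prob

tokens = (2, 0, 1)
print(joint[tokens], chain_rule_probability(tokens))
```

The two printed values agree because the product of conditionals telescopes back to the joint probability.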

$?$ The entire decoder network, using the masked self-attention mechanism, operates as follows:

The input text is tokenized, and the
tokens are converted to embeddings. The embeddings are passed into the transformer
network, but now the transformer layers use masked self-attention so that each position can
only attend to the current and previous tokens. Each of the output embeddings can be
thought of as representing a partial sentence, and for each, the goal is to predict the next
token in the sequence. Consequently, after the transformer layers, a single linear layer
maps each output embedding to the size of the vocabulary, followed by a softmax
function that converts these values to probabilities. During training, we aim to maximize
the sum of the log probabilities of the next token in the ground truth sequence at every
position, which is equivalent to minimizing a standard multiclass cross-entropy loss.
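
A minimal sketch of these steps, assuming a single masked attention layer in place of the full stack of transformer layers, random untrained weights, and made-up token ids. The names `masked_self_attention`, `next_token_loss`, `E`, and `W_vocab` are hypothetical. The loss is the mean negative log probability of the next token, which differs from the sum in the text only by a constant factor and a sign.

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """Causal self-attention: position n may attend only to positions m <= n."""
    N = X.shape[0]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones((N, N), dtype=bool), k=1)  # True where key index > query index
    scores = np.where(future, -np.inf, scores)          # exp(-inf) = 0 in the softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights, weights @ V

def next_token_loss(H, W_vocab, targets):
    """Mean cross-entropy between softmax(H @ W_vocab) and the true next tokens."""
    logits = H @ W_vocab
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
D, vocab_size = 8, 10
token_ids = np.array([3, 1, 4, 1, 5])    # made-up token ids; the last one is only a target
E = rng.normal(size=(vocab_size, D))     # made-up embedding table
W_Q, W_K, W_V = (0.5 * rng.normal(size=(D, D)) for _ in range(3))
W_vocab = rng.normal(size=(D, vocab_size))

X = E[token_ids[:-1]]                    # inputs: every token except the last
A, H = masked_self_attention(X, W_Q, W_K, W_V)
loss = next_token_loss(H, W_vocab, token_ids[1:])  # targets: inputs shifted by one

print(np.round(A, 2))  # lower-triangular attention weights
print("loss:", loss)
```

Because of the mask, changing a later token cannot affect the outputs at earlier positions, which is what makes it valid to train on all positions of a sequence at once.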

In text generation tasks, several decoding strategies can make the output text more coherent. For example,
beam search keeps track of multiple possible sentence completions to find the overall most
likely sequence of words (which is not necessarily found by greedily choosing the most
likely word at each step). Top-k sampling randomly draws the next word from only the
$K$ most likely possibilities, which prevents the system from accidentally choosing from the
long tail of low-probability tokens and ending up in an unnecessary linguistic dead end.
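
Top-k sampling is short enough to sketch directly. The example below assumes the model's next-token probabilities are already available as a vector (the numbers are made up) and uses a hypothetical helper `top_k_sample`. It keeps the $K$ most likely tokens, renormalizes their probabilities, and samples from them.

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Sample a token index from the k most likely entries of `probs`, renormalized."""
    top = np.argsort(probs)[-k:]       # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()  # renormalize over the kept tokens
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.40, 0.30, 0.15, 0.10, 0.04, 0.01])  # made-up next-token distribution

samples = [top_k_sample(probs, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices among the three most likely tokens
```

With `k=1` this reduces to greedy decoding, since only the single most likely token survives the cut.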

Translation between languages is an example of a sequence-to-sequence task. One common approach uses both an encoder (to compute a good representation of the source
sentence) and a decoder (to generate the sentence in the target language). This is aptly
called an *encoder-decoder* model.

The encoder-decoder model is built by modifying the transformer layers in the decoder. Originally,
these consisted of a masked self-attention layer followed by a neural network applied
individually to each embedding. A new attention layer is added between these two components, in which the decoder embeddings attend to the encoder
embeddings. This uses a variant of self-attention known as encoder-decoder attention or
**cross-attention**, where the queries are computed from the decoder embeddings and the
keys and values from the encoder embeddings.
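
A minimal sketch of cross-attention, assuming random untrained weights, no bias terms, and random stand-ins for the encoder and decoder embeddings; `cross_attention` is a hypothetical name. The only change from self-attention is where the queries, keys, and values come from.

```python
import numpy as np

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Decoder tokens (rows of X_dec) attend to encoder tokens (rows of X_enc)."""
    Q = X_dec @ W_Q  # queries from the decoder embeddings
    K = X_enc @ W_K  # keys from the encoder embeddings
    V = X_enc @ W_V  # values from the encoder embeddings
    scores = Q @ K.T / np.sqrt(K.shape[1])  # (N_dec, N_enc)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights, weights @ V

rng = np.random.default_rng(0)
D = 8
X_enc = rng.normal(size=(5, D))  # 5 source-sentence tokens
X_dec = rng.normal(size=(3, D))  # 3 target-sentence tokens generated so far
W_Q, W_K, W_V = (0.5 * rng.normal(size=(D, D)) for _ in range(3))

A, out = cross_attention(X_dec, X_enc, W_Q, W_K, W_V)
print(A.shape, out.shape)  # (3, 5) (3, 8)
```

The attention matrix is rectangular: one row per decoder token and one column per encoder token, so the source and target sentences may have different lengths.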

$??$ open research case: ***Interaction matrices for self-attention*** $??$