# Transformers
Transformers are a type of deep learning architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need". Transformers have become very popular in natural language processing (NLP) and other sequence-to-sequence tasks due to their high performance and scalability.

Transformers are primarily built around the self-attention mechanism, which allows the model to weigh the importance of different elements in a sequence when processing a particular element. This allows Transformers to effectively capture long-range dependencies and contextual information, which is crucial for many NLP tasks.

The Transformer architecture consists of two main components: the encoder and the decoder. Both are composed of a stack of identical layers, and each layer consists of sub-layers such as multi-head self-attention and a position-wise feed-forward network, each followed by a residual connection and layer normalization.

- **Encoder**: The encoder processes the input sequence and generates a continuous representation for each element in the sequence. It consists of a stack of layers, each containing a multi-head self-attention mechanism and a position-wise feed-forward network. The encoder is responsible for learning the contextualized representation of the input sequence.

- **Decoder**: The decoder generates the output sequence using the continuous representations generated by the encoder. It also consists of a stack of layers, each containing a multi-head self-attention mechanism, a position-wise feed-forward network, and an additional cross-attention layer that attends to the encoder's output. The decoder is responsible for generating the output sequence based on the input sequence and the learned contextual information.

Self-attention by itself has no notion of token order: permuting the input elements simply permutes the outputs in the same way. Transformers therefore add positional encodings to the input embeddings to inject information about the position of each element in the sequence.
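
As an illustration, here is a minimal NumPy sketch of the fixed sinusoidal encoding described in "Attention Is All You Need". The function name, the toy sizes, and the random embeddings are assumptions made for this example, and learned positional embeddings are a common alternative.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even (hypothetical helper, not a library function).
    """
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # even feature indices use sine
    encoding[:, 1::2] = np.cos(angles)  # odd feature indices use cosine
    return encoding

# Toy embeddings for 5 tokens with 8 features each (random stand-ins).
embeddings = np.random.default_rng(0).normal(size=(5, 8))
inputs = embeddings + sinusoidal_positional_encoding(5, 8)
```

The encoding is simply added to the embeddings, so two identical tokens at different positions end up with different input vectors.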

Transformers have become the foundation for many state-of-the-art NLP models, such as BERT, GPT, and T5, and have achieved top performance on a wide range of tasks, including machine translation, sentiment analysis, named entity recognition, and question-answering.

## How Do Self-Attention Models Differ From Attention Models?
An attention model and a self-attention model both refer to mechanisms used in deep learning to help neural networks focus on specific parts of the input when processing information. However, they differ in terms of their applications and how they're implemented.

- **Attention Model**: An attention model is typically used in sequence-to-sequence tasks, like machine translation, where the model has to map an input sequence (e.g., a sentence in one language) to an output sequence (e.g., the same sentence in another language). The attention mechanism computes a weighted sum of input elements (such as hidden states in a recurrent neural network), with the weights indicating the importance or relevance of each input element for the current output element. This allows the model to selectively focus on different parts of the input sequence when generating the output sequence.

- **Self-Attention Model**: A self-attention model, on the other hand, focuses on the relationships between elements within a single input sequence. Instead of mapping an input sequence to an output sequence, self-attention computes a weighted sum of the input elements, with the weights indicating the importance or relevance of each input element with respect to other elements in the same sequence. This allows the model to capture long-range dependencies and complex relationships within the input sequence.

One prominent example of a self-attention model is the Transformer architecture, which has gained widespread popularity in natural language processing tasks such as machine translation, text summarization, and sentiment analysis. Transformers rely on attention mechanisms (self-attention within the encoder and decoder, plus cross-attention from the decoder to the encoder output) to process sequences, without the need for recurrent or convolutional layers.

In summary, the main difference between an attention model and a self-attention model is their application focus. Attention models are used to capture relationships between input and output sequences, while self-attention models capture relationships within a single input sequence.
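
The distinction can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the `attend` helper and the toy `source` and `target` arrays are hypothetical, and the learned query/key/value projections are omitted so that only the origin of each input differs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Weighted sum of values, with weights from query-key similarity."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
source = rng.normal(size=(6, 4))  # e.g. states for 6 input tokens
target = rng.normal(size=(3, 4))  # e.g. states for 3 output tokens

# Attention between sequences: queries from the target, keys/values from the source.
cross_out, cross_w = attend(target, source, source)  # weights have shape (3, 6)

# Self-attention: queries, keys, and values all come from the same sequence.
self_out, self_w = attend(source, source, source)    # weights have shape (6, 6)
```

The computation is identical in both cases; what changes is whether the queries come from a different sequence than the keys and values.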

## Multi-Head Attention
Multi-head attention is a mechanism in the Transformer architecture that enables the model to attend to different parts of the input sequence simultaneously. It is a key component of the Transformer model, which aims to capture various relationships and dependencies within the input sequence more effectively than traditional single-head attention.

In a multi-head attention mechanism, the attention operation is performed multiple times in parallel, with each "head" having its own set of learnable parameters. Each head computes a separate attention operation, allowing the model to capture different aspects or relationships in the input data. The outputs of all the attention heads are then combined through concatenation and passed through a linear transformation.

Here's a step-by-step breakdown of the multi-head attention process:

- Project and split into heads: The input embeddings are linearly projected into queries, keys, and values, and each projection is split along the feature dimension into one part per attention head.

- Compute the self-attention for each head: Each head computes the self-attention independently, using its own set of learnable parameters (query, key, and value matrices). The self-attention mechanism involves calculating the dot product of the query and key, scaling it, applying a softmax operation, and finally taking a weighted sum of the value vectors.

- Concatenate the outputs: Once the self-attention has been computed for each head, the resulting output vectors are concatenated.

- Apply a linear transformation: The concatenated output is passed through a linear transformation (a fully connected layer) to produce the final output of the multi-head attention mechanism.
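
The steps above can be sketched roughly as follows. This is a minimal NumPy illustration for a single unbatched sequence, not a reference implementation: the function name, the weight names (`w_q`, `w_k`, `w_v`, `w_o`), and the toy sizes are assumptions, and biases, masking, and dropout are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); all weights: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project, then split each projection into heads.
    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)  # (4, 8)
```

Slicing one large projection into heads is equivalent to giving each head its own smaller projection matrices, which is why the output keeps the same shape as the input.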

The main advantage of multi-head attention is that it allows the model to focus on different parts of the input sequence simultaneously, capturing various relationships and dependencies more effectively than single-head attention. This results in a more expressive and powerful representation of the input data, which is crucial for complex sequence processing tasks such as machine translation, text summarization, and language understanding.

## Add and Norm vs Batch Norm
The Add and Norm layer and the Batch Norm layer both normalize activations in deep networks, but they serve different purposes and operate in different ways.

**Add and Norm (Layer Normalization):**
- Add and Norm is primarily used in Transformer-based models.

- It consists of two steps: residual connection (add) and layer normalization (norm).

- In the residual connection step, the input to a sub-layer is added to that sub-layer's output (e.g., the output of the multi-head attention layer or the position-wise feed-forward layer in a Transformer).

- In the layer normalization step, the mean and variance are computed along the feature dimension (across the hidden units) for each individual input in the batch, and the input is then normalized. The normalized input is then scaled and shifted using learnable parameters.

- Add and Norm helps stabilize the learning process, reduces internal covariate shift, and allows for deeper models by mitigating the vanishing gradient problem.

**Batch Normalization:**
- Batch Normalization is commonly used in convolutional neural networks (CNNs) and some recurrent neural networks (RNNs).

- It normalizes the activations of a layer across the batch dimension.

- The mean and variance are computed across the entire batch for each feature independently.

- Normalized activations are then scaled and shifted using learnable parameters.

- Batch Normalization helps reduce internal covariate shift, accelerates training, and allows for higher learning rates.

In summary, the key differences between Add and Norm (Layer Normalization) and Batch Normalization are:

- Add and Norm includes a residual connection, while Batch Normalization does not.
- Add and Norm normalizes along the feature dimension (across hidden units) for each input independently, while Batch Normalization normalizes across the batch dimension.
- Add and Norm is mainly used in Transformer models, while Batch Normalization is commonly used in CNNs and some RNNs.
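
The difference in normalization axes can be made concrete with a small NumPy sketch. The function names and toy shapes here are assumptions for illustration, and the batch normalization shown is the training-mode computation only (running statistics for inference are omitted).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics over the feature axis, separately for each input.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Residual connection (add) followed by layer normalization (norm).
    return layer_norm(x + sublayer_out, gamma, beta)

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Statistics over the batch axis, separately for each feature.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))             # (batch, features)
sublayer_out = rng.normal(size=(4, 6))  # stand-in for an attention or feed-forward output
gamma, beta = np.ones(6), np.zeros(6)   # learnable scale and shift, at their usual initial values

ln_out = add_and_norm(x, sublayer_out, gamma, beta)  # each row has mean ~0
bn_out = batch_norm_train(x, gamma, beta)            # each column has mean ~0
```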

## Why Use Add and Norm Instead of Batch Norm for Transformers?

Add and Norm (Layer Normalization) is preferred over Batch Normalization in Transformer models for the following reasons:

**Sequence length variability:** In sequence-to-sequence tasks, like those handled by Transformer models, input sequences can have varying lengths. Layer Normalization is more suitable for handling such variable-length sequences since it normalizes across the hidden units (features) for each individual input, independent of the other inputs in the batch. In contrast, Batch Normalization normalizes across the batch dimension and may not be as effective in handling input sequences with varying lengths.

**Autoregressive tasks:** Transformer models are often used for autoregressive tasks like language modeling or machine translation, where predictions are made one token at a time. In such cases, using Batch Normalization can be problematic because the batch at inference time is often very small (even a single sequence), making it difficult to compute meaningful normalization statistics. Layer Normalization, on the other hand, works well in this context because it normalizes across the feature dimension and doesn't rely on batch size.

**Residual connection:** The Add and Norm layer pairs a residual connection with Layer Normalization. The residual connection helps mitigate the vanishing gradient problem and allows for deeper models, and it is essential for the success of Transformer models. The residual connection itself is not part of Layer Normalization and could equally be combined with Batch Normalization, so this point describes how the sub-layer is built rather than an advantage of one normalization over the other.

Overall, Layer Normalization is better suited to the specific requirements of Transformer models, such as handling variable-length sequences and autoregressive decoding, and it combines naturally with residual connections in the Add and Norm layer. This is why it is used instead of Batch Normalization in Transformer models.
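
The batch-independence argument can be checked with a small experiment. This is a sketch under simplifying assumptions: the helper names and toy batches are hypothetical, the learnable scale and shift are omitted, and batch normalization is computed from the current batch only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-input statistics over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Per-feature statistics over the batch axis.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
sample = rng.normal(size=(1, 6))
# The same sample placed in two batches with very different companions.
batch_a = np.concatenate([sample, rng.normal(size=(3, 6))])
batch_b = np.concatenate([sample, rng.normal(loc=5.0, size=(3, 6))])

# Layer norm output for the sample is the same in both batches...
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
# ...while batch norm output depends on what else is in the batch.
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
```

Here `ln_same` is true and `bn_same` is false: the layer-normalized sample does not change with its batch, whereas the batch-normalized one does.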

## Masked Multi-Head Attention
Masked Multi-Head Attention is a variant of the multi-head attention mechanism used in Transformer models, specifically designed for autoregressive tasks, such as language modeling and text generation. In these tasks, the model generates the output sequence one token at a time, and at each step, the model should only attend to the tokens that have been generated so far. This is to prevent the model from "peeking" into the future tokens, which it should not have access to during the generation process.

The masked multi-head attention mechanism achieves this by applying a mask to the attention scores in the self-attention computation. This mask is an upper triangular matrix with zeros on and below the diagonal and negative infinity (or a very large negative number) above the diagonal. When this mask is added to the attention scores, it effectively forces the softmax function to output zero probabilities for the future tokens. As a result, the model can only attend to the current and previous tokens during the generation process, ensuring that it does not use any information from the future tokens.
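
A minimal NumPy sketch of such a mask follows. The helper name `causal_mask` and the random scores are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """0 on and below the diagonal, -inf strictly above it."""
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # stand-in for scaled query-key scores
weights = softmax(scores + causal_mask(4), axis=-1)
# Row i now has non-zero weights only for positions 0..i.
```

Because the diagonal is never masked, every row keeps at least one finite score, so the softmax stays well defined.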

In summary, masked multi-head attention is a technique used in autoregressive tasks to ensure that the model only attends to the current and previous tokens in the input sequence during the generation process, preventing it from using information from future tokens.

## Scaled Dot-Product Attention
Scaled dot-product attention is a mechanism used in self-attention models, such as the Transformer architecture, to compute the attention weights and context vectors for a given input sequence. It helps the model to learn which input elements are more relevant to others when processing a sequence.

Here's a step-by-step description of the scaled dot-product attention mechanism:

- **Linear Projections:** The input sequence is first projected into three different representations: Query (Q), Key (K), and Value (V). These projections are obtained by applying linear transformations (i.e., multiplying by weight matrices) to the input embeddings.

- **Dot Product:** The dot product is computed between the Query (Q) and the Key (K) matrices. This operation measures the similarity between each pair of elements in the input sequence. The higher the dot product, the more similar the elements are.

- **Scaling:** Dot products tend to grow in magnitude as the vector dimension increases, which can push the softmax into regions where its gradients are very small. To prevent this, the dot product is scaled down by dividing by the square root of the dimension of the Key (K) vectors (usually written `d_k`). This scaling helps maintain the stability of gradients during training.

- **Masking (optional):** If needed, a mask is applied to the scaled dot product matrix. In the case of the Transformer model, a mask is used to prevent the attention mechanism from attending to "future" positions in the input sequence.

- **Softmax:** The softmax function is applied to the masked (if used) and scaled dot product matrix, normalizing the values along each row. The resulting matrix represents the attention weights, where each element indicates the importance of the corresponding input element for the current element being processed.

- **Context Vector:** Finally, the attention weights matrix is multiplied by the Value (V) matrix. This operation computes a weighted sum of the input elements, effectively aggregating the information from the entire sequence according to the attention weights. The resulting matrix represents the context vectors, which contain the relevant information from the input sequence for each position.
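
Put together, the steps above might look like the following NumPy sketch for a single unbatched sequence. The function name, the additive mask convention, and the random projection matrices are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v); mask is additive (0 or -inf)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # dot product, then scaling
    if mask is not None:
        scores = scores + mask          # optional masking
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # context vectors and attention weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))            # 5 tokens, 16 features each
# Linear projections into Q, K, and V (random stand-ins for learned weights).
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
context, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
```

Returning the weights alongside the context vectors is a convenience for inspection; only the context vectors are passed on to the next layer.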

In summary, scaled dot-product attention allows a model to learn the relationships between elements in an input sequence by computing attention weights based on the similarity of Query and Key representations and using these weights to aggregate the Value representations into context vectors.
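
The motivation for the scaling step can be checked numerically. The sketch below assumes query and key components are independent with zero mean and unit variance, which is the idealized setting in which the dot product has variance equal to the key dimension. The sample count and dimension are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
# 10,000 random query/key pairs with unit-variance components (an assumption).
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))

raw = (q * k).sum(axis=-1)   # unscaled dot products
scaled = raw / np.sqrt(d_k)  # scaled as in scaled dot-product attention

raw_var = raw.var()          # roughly d_k
scaled_var = scaled.var()    # roughly 1
```

Keeping the score variance near 1 regardless of `d_k` stops the softmax from collapsing onto a single position as the key dimension grows.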