# Limitations of Embeddings
Embeddings convert tokens (e.g., words or subwords) into dense, fixed-dimensional vectors that capture some semantic and syntactic properties of the tokens. However, embeddings alone are not sufficient to model context-dependent relationships or long-range dependencies in sequences. Self-attention complements embeddings to address these shortcomings.

<b>Static Representation</b>:

1. Embeddings like word2vec or GloVe provide a single vector for each token, regardless of the context in which the token appears.
2. For example, the word "bank" will have the same embedding whether it appears in "river bank" or "financial bank," even though the meanings are different.
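The static behavior described above can be illustrated with a minimal sketch. The table `embedding_table` and its three-dimensional vectors are made-up values for illustration, not real word2vec or GloVe output.

```python
# Toy static embedding table: one fixed vector per token (values are made up).
embedding_table = {
    "river": [0.9, 0.1, 0.0],
    "financial": [0.0, 0.2, 0.9],
    "bank": [0.4, 0.5, 0.4],
}

def embed(sentence):
    """Look up each token's vector; the surrounding words are never consulted."""
    return [embedding_table[token] for token in sentence.split()]

river_bank = embed("river bank")
financial_bank = embed("financial bank")

# "bank" is the second token in both phrases and gets the identical vector.
print(river_bank[1] == financial_bank[1])  # True
```

Because the lookup ignores neighboring tokens, nothing in the vector for "bank" can distinguish the two senses.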

# Self-Attention
Self-attention builds upon embeddings by dynamically modeling the relationships between tokens in a sequence based on their context. Here's why self-attention is essential:

<b>Contextual Representations</b>:
1. Self-attention allows each token to consider every other token in the sequence when forming its representation.
2. For example, in "The bank by the river," self-attention can understand that "bank" refers to a riverbank based on the context provided by "river."

# How Embeddings and Self-Attention Work Together
1. <b>Initial Input</b>: Each token is first mapped to its embedding vector.
2. <b>Enhanced Representation</b>: Self-attention takes these embedding vectors as input and computes context-aware representations by aggregating information from all tokens in the sequence.
3. <b>Output</b>: The output is a richer, context-sensitive representation of each token, which is passed to subsequent layers in the Transformer.

# Steps in Self-Attention

# 1. Input Representation: 
Each word in the input sequence is embedded into a vector ($x_1, x_2, \dots, x_n$).

# 2. Linear Transformation to Queries, Keys, and Values:

1. <b>Query (Q): "What are we looking for?"</b>

   * The Query represents the "search term" or "question" for a specific element in the sequence.
   * Each token in the sequence generates its own Query vector, which essentially asks: "What information am I seeking in the other tokens?"
   * For example: in the sentence "The cat chased the mouse," the Query vector for "mouse" might seek information related to what actions or subjects involve "mouse."


2. <b>Key (K): "What do I offer?"</b>

   * The Key represents the "attributes" or "descriptors" of a token. It answers the question: "What information do I have to offer?"
   * Each token generates its own Key vector, which acts as a representation of what that token can contribute to others in the sequence.
   * For example: in the same sentence, the Key vector for "cat" might represent that it is a subject, and the Key vector for "chased" might highlight that it denotes an action.


3. <b>Value (V): "What do I contain?"</b>

   * The Value represents the actual "payload" or "content" of the token. It’s the information that will be aggregated and passed to the next layer based on how relevant the token is to others (determined by the Query and Key).
   * For example: the Value vector for "mouse" might contain semantic information about its role as an object in the context of the sentence.

![image.png](attachment:01f9ca1e-01cf-499a-92f3-e5ab9eef65a3.png)
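As a rough sketch of the linear transformation step, each of Q, K, and V can be obtained by multiplying the embedding matrix by its own weight matrix. The names `X`, `W_q`, `W_k`, `W_v` and all numeric values here are illustrative assumptions; in a trained model the weight matrices are learned parameters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Three tokens, each with a 4-dimensional embedding (toy values).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Hypothetical projection matrices of shape (d_model=4, d_k=2).
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]

# Every token gets its own Query, Key, and Value vector (one row each).
Q = matmul(X, W_q)
K = matmul(X, W_k)
V = matmul(X, W_v)

print(Q)  # [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
print(K)  # [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]
print(V)  # [[2.0, 1.0], [0.0, 2.0], [2.0, 2.0]]
```

The same embedding row produces three different vectors because each projection emphasizes different features of the token.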

# 3. Compute Attention Scores: (Interaction Between Queries, Keys, and Values)

## Step 1: Match Queries and Keys
* The attention mechanism computes a similarity score between each Query vector and every Key vector in the sequence.
* This score determines how relevant a particular token (represented by its Key) is to another token (represented by its Query).
![image.png](attachment:8268c352-2343-4788-9563-3b7ae336d59a.png)
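A minimal sketch of this matching step, assuming the similarity measure is a plain dot product and leaving out the scaling factor covered later in these notes. The toy `Q` and `K` values are illustrative.

```python
# Toy Query and Key vectors for three tokens (d_k = 2); values are illustrative.
Q = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
K = [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# scores[i][j] measures how well token i's Query matches token j's Key.
scores = [[dot(q, k) for k in K] for q in Q]

for row in scores:
    print(row)
# [0.0, 8.0, 4.0]
# [8.0, 0.0, 8.0]
# [4.0, 8.0, 8.0]
```

Row `i` of `scores` holds the raw relevance of every token in the sequence to token `i`.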

## Step 2: Compute Attention Weights
The similarity scores are passed through a softmax function to normalize them into attention weights, which are positive and sum to 1 for each Query. These weights indicate how much each token's Value contributes to the output for that Query.



![image.png](attachment:7db3bf4c-7dec-4713-9394-75d7eb0de41d.png)
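The softmax step can be sketched in a few lines. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; the input scores are toy values.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([1.0, 2.0, 3.0])
print([round(w, 3) for w in weights])  # [0.09, 0.245, 0.665]
```

Higher scores receive exponentially more weight, but every token keeps a nonzero share.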

## Step 3: Aggregate Values

The Value vectors are weighted by their corresponding attention weights and summed to produce the output for each token.


![image.png](attachment:b9fc16d2-2a08-4d21-ab6b-1d4bae311b03.png)
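Putting the three steps together, a single attention head might look like the sketch below. It uses plain Python lists instead of a tensor library, includes the $\sqrt{d_k}$ scaling discussed later in these notes, and the function name `attention` and the toy inputs are assumptions made for illustration.

```python
import math

def softmax(row):
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one sequence, using lists of rows."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Step 1: similarity of this Query with every Key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 2: normalize the scores into attention weights.
        weights = softmax(scores)
        # Step 3: weighted sum of the Value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
for row in out:
    print([round(x, 2) for x in row])
```

In this toy input each Query matches its own Key best, so each output row leans toward that token's own Value while still mixing in the other one.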

# Advantages
* <b>Scalability:</b> Processes all tokens in parallel, unlike RNNs.
* <b>Global Context:</b> Captures relationships between all tokens in the sequence.
* <b>Flexibility:</b> Handles variable-length sequences without architectural changes, although compute and memory grow quadratically with sequence length.

![image.png](attachment:6f40557d-9d77-4476-9115-c4fce6e960b1.png)

# Why Scale by $\sqrt{d_k}$?
The division by $\sqrt{d_k}$ in the scaled dot-product attention formula is essential for ensuring numerical stability and effective learning in the self-attention mechanism.
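For reference, the formula in question is usually written as follows, where $Q$, $K$, and $V$ are the matrices of Query, Key, and Value vectors and $d_k$ is the dimensionality of each Key vector:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$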


1. <b>Dot Product Magnitude Increases with Dimension:</b>

* The magnitude of the dot products in $QK^T$ grows with the dimensionality $d_k$. Specifically, if the elements of a query and a key are independent draws from a standard normal distribution, their dot product has mean 0 and variance $d_k$, so its typical magnitude is about $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ brings the variance back to 1.
* Without this scaling, the raw scores become large as $d_k$ increases, leading to unstable gradients during backpropagation.

2. <b>Softmax Sensitivity:</b>

* The softmax function is applied to the scores to compute attention weights. Large raw scores can push the softmax into a regime where it produces extremely sharp distributions (i.e., almost all weight goes to a single token). In this saturated regime the gradients of the softmax become very small, which hinders learning because the model can barely adjust its attention weights.
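Both points can be checked empirically with a small simulation. This is a sketch under the stated assumption of standard-normal vector entries; the helper `dot_variance`, the seed, and the example scores are illustrative choices.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def dot_variance(d_k, trials=2000):
    """Empirical variance of q . k for standard-normal q and k of size d_k."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

def softmax(row):
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
raw_var = dot_variance(d_k)   # should land close to d_k
scaled_var = raw_var / d_k    # dividing scores by sqrt(d_k) divides the variance by d_k
print(round(raw_var, 1), round(scaled_var, 2))

# Saturation: the same scores with and without scaling.
raw_scores = [8.0, 16.0, 24.0]
sharp = softmax(raw_scores)
soft = softmax([s / math.sqrt(d_k) for s in raw_scores])
print([round(w, 4) for w in sharp])  # nearly all weight on the last token
print([round(w, 4) for w in soft])   # weight is spread more evenly
```

The unscaled scores give an almost one-hot distribution, while the scaled scores leave meaningful weight on every token.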

# Semantic Meaning
Semantic meaning refers to the inherent or literal meaning of a word, phrase, or sentence, independent of the context in which it appears. It is the meaning that is directly tied to the dictionary definition and the general understanding of the term.

# Contextual Meaning:
Contextual meaning refers to the interpretation of a word, phrase, or sentence based on the context in which it appears. This meaning considers factors like:

* The surrounding words.
* The situation or environment.
* The speaker's intention or tone.

# Why Is It Called "Self-Attention"?

Self-attention is called "self-attention" because it computes the attention of a word (or token) in a sequence with respect to all the other words (or tokens) in the same sequence, including itself. This mechanism allows the model to dynamically focus on different parts of the input sequence when processing it, thereby learning the relationships between tokens within the sequence.

<b>Breaking Down the Name:</b>

1. "Self":
* The term "self" refers to the fact that the attention mechanism works within the same sequence of inputs.
* Each token in the sequence attends to itself and all other tokens in the sequence.
* For example, in a sentence like "The cat sat on the mat," the token "cat" calculates its attention score not only with the other tokens ("The," "sat," "on," etc.) but also with itself.

2. "Attention":
* Attention is the process of computing weights that signify the importance of other tokens relative to the current token.
* In self-attention, these weights determine how much each token contributes to the representation of the current token.

# Multi-Head Attention
Multi-head attention improves the model's capacity by enabling it to focus on different parts of the input sequence simultaneously. Instead of computing attention just once, it does so multiple times in parallel (each attention computation is called a "head"), each with its own learned projections. The outputs of the heads are then concatenated and projected back to the model dimension.

![image.png](attachment:41163283-5074-4445-9478-441c4e835e16.png)

![image.png](attachment:57add7e1-91fb-48d1-9e44-49e0a2a41eaa.png)

![image.png](attachment:e6b1e4d9-e252-414a-95b8-ed2b18434457.png)

![image.png](attachment:d7bf35a4-7ded-4d59-887d-9ffa8d24374f.png)
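A compact sketch of multi-head attention, assuming each head uses its own Query, Key, and Value projections and that the head outputs are concatenated and passed through an output projection. The random matrices stand in for learned parameters, and all function names here are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    shift = max(row)
    exps = [math.exp(s - shift) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters: small random values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head has its own projections, so it can attend differently.
        W_q = random_matrix(d_model, d_head, rng)
        W_k = random_matrix(d_model, d_head, rng)
        W_v = random_matrix(d_model, d_head, rng)
        head_outputs.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads token by token, then mix them with an output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    W_o = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, W_o)

rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]  # 3 tokens, d_model = 8
out = multi_head_attention(X, num_heads=2, rng=rng)
print(len(out), len(out[0]))  # 3 8
```

The output has the same shape as the input, which is what lets attention layers be stacked.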

# Positional Encoding
Positional encoding is a mechanism used in transformer models to provide information about the relative or absolute position of tokens in a sequence. Since transformers, unlike RNNs, do not process tokens sequentially, positional encodings are critical for capturing order and sequence structure.

1. <b>Why Positional Encoding?</b>
* Transformers process input sequences as a whole without assuming any order. To capture positional information, a positional encoding is added to the input embeddings. This helps the model differentiate between tokens based on their positions in the sequence.

![image.png](attachment:08095183-8ce2-443c-bf3b-093afc696032.png)
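One widely used choice is the sinusoidal encoding from the original Transformer paper; the sketch below assumes that scheme, which may differ from the one pictured. The function name `positional_encoding` and the sizes are illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sine, odd dimensions use cosine."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

# The encoding is added element-wise to the token embeddings (toy zero embeddings here).
embeddings = [[0.0] * 6 for _ in range(4)]
inputs = [[e + p for e, p in zip(e_row, p_row)]
          for e_row, p_row in zip(embeddings, pe)]
```

Each position gets a distinct vector, so two identical tokens at different positions no longer look the same to the attention layers.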

# Layer Normalization
Layer Normalization is a normalization technique used in deep learning to stabilize and accelerate training. Unlike batch normalization, which normalizes across the batch dimension, layer normalization normalizes across the feature dimensions for each individual data point. This makes it particularly useful in sequence models like transformers where batch sizes might be small or where sequences vary in length.

![image.png](attachment:2abf05c3-f37b-4cd9-a462-94858cf02144.png)

![image.png](attachment:955e5585-8c82-4d38-9b8a-cc55c549be0e.png)
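A minimal sketch of layer normalization for a single token's feature vector. The learnable scale `gamma` and shift `beta` and the small constant `eps` follow the usual formulation and are assumptions here rather than details taken from the figures above.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance,
    then apply an optional learned scale (gamma) and shift (beta)."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

# Each token (row) is normalized independently of the rest of the batch.
batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normalized = [layer_norm(row) for row in batch]
for row in normalized:
    print([round(v, 3) for v in row])
```

Both rows come out almost identical after normalization even though their raw scales differ by a factor of ten, and no statistics are shared across the batch.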