# Transformers 

Imagine you're reading a sentence. When you read each word, you naturally consider how it relates to other words around it. Transformer models work in a similar way, but they do this with mathematical relationships. Here's how:

1. First Pass - Breaking Things Down:

- The model takes your input (like a sentence), splits it into words or word pieces (tokens), and converts each one into a list of numbers
- It also adds information about where each word appears in the sentence (like labels saying "first word," "second word," etc.)

2. The Attention Mechanism (The Special Sauce):
This is the key innovation of transformers. Think of it like having multiple spotlights that can shine on different words at once:

- For each word, the model looks at ALL other words in the sentence simultaneously
- It figures out how important each connection is (like how "bark" might strongly connect to "dog" but weakly to "the")
- It can handle long-range connections that earlier models struggled with
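
To make "how important each connection is" concrete, here is a minimal sketch of turning raw relevance scores into attention weights with a softmax. The sentence and the scores are made up for illustration; a real model learns its scores from data.

```python
import numpy as np

def softmax(x):
    """Turn raw scores into positive weights that sum to 1."""
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical scores for how relevant each word is to "bark"
# in the sentence "the dog began to bark". These numbers are invented.
words = ["the", "dog", "began", "to", "bark"]
scores = np.array([0.1, 3.0, 1.0, 0.1, 2.0])

weights = softmax(scores)
for word, w in zip(words, weights):
    print(f"{word:>5}: {w:.2f}")
```

Because "dog" has the highest score, it receives the largest weight, while "the" and "to" receive very little.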


3. Processing Information:

- The model processes all these connections in parallel, rather than one after another
- This makes training much faster than with older models, such as RNNs, that had to process words one at a time


4. Making Sense of It All:

- After looking at all these connections, the model combines this information to understand the context
- This helps it predict what comes next or understand the meaning of ambiguous words

As a real-world analogy, imagine being in a room full of people having a conversation. You don't just listen to one person at a time; you're aware of everyone's contributions and how they relate to each other, all at once. That's similar to how transformers process information.

This architecture has revolutionized AI because it:

- Processes information more efficiently than older models
- Understands context better
- Can handle longer pieces of text
- Works well for many different language tasks



### Below is a Diagram of the Transformer Architecture
<img src='./dia.png' width='500' height='700' style='margin-left:auto; margin-right:auto' /> 

Now, let's break down each component in detail:

1. Input Processing Layer:
   - Input Embeddings: Converts words/tokens into dense vectors (typically 512-1024 dimensions)
   - Positional Encodings: Adds information about token position using sine and cosine functions
   - These combine to give each token both meaning and positional information
2. Encoder Block:
   - Multi-Head Self-Attention:
     - Projects each token into three vectors, a query, a key, and a value, which are stacked into the matrices Query (Q), Key (K), and Value (V)
     - Computes attention scores between all tokens using this formula: $QK^T/\sqrt{d_k}$
     - Uses multiple heads to capture different types of relationships
     - Each head focuses on different aspects of the relationships between words
   - Add & Normalize Layer:
     - Residual connection that adds the input to attention output
     - Layer normalization to stabilize the network
   - Feed-Forward Network:
     - Two linear transformations with a ReLU activation in between
     - Processes each position independently
     - Usually increases the dimensionality, then reduces it back to the model dimension
3. Decoder Block:
   - Masked Multi-Head Self-Attention:
     - Similar to encoder self-attention, but with a causal mask
     - Prevents each position from attending to later positions, so the model cannot look at future tokens during training
     - Essential for tasks like language generation
   - Cross-Attention:
     - Connects decoder to encoder outputs
     - Queries come from the decoder; keys and values come from the encoder
     - Allows decoder to focus on relevant parts of input
   - Feed-Forward & Normalization:
     - Similar structure to encoder blocks
     - Processes the combined attention information
4. Key Supporting Features: 
   - Attention Mechanism Math:
      $\text{Attention}(Q,K,V) = \text{softmax}\left(QK^T/\sqrt{d_k}\right)V$
     - where: Q = Query matrix, K = Key matrix, V = Value matrix, $d_k$ = dimension of keys
   - Scaling Factors:
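     - Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension; without it, large scores push the softmax into regions with very small gradients
     - Each attention head typically uses $d_k = 64$ (for example, a 512-dimensional model split across 8 heads)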
5. Output Layer:
   - Linear projection to the vocabulary size
   - Softmax function to convert the scores to probabilities
   - Final predictions for the target sequence
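
The input processing and attention math above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a full implementation: the function names are invented for this example, the random matrices `W_q`, `W_k`, `W_v` stand in for learned weights, `d_model` is assumed to be even, and the base of 10000 in the positional encoding follows the original Transformer paper rather than anything stated here.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_queries, n_keys)
    if causal:
        # Mask out future positions (j > i), as in the decoder.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

# Token embeddings (random here) plus position information.
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

# Hypothetical projection matrices; a trained model would learn these.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v, causal=True)
print(out.shape)          # (4, 8)
print(weights.round(2))   # entries above the diagonal are 0
```

Each row of `weights` sums to 1, and with `causal=True` every position only attends to itself and earlier positions. Multi-head attention would run several such heads on smaller projections and concatenate their outputs.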


Special Characteristics:

1. Parallelization:
- All positions are processed simultaneously
- Large training speedup compared to sequential models such as RNNs

2. Information Flow:
- Direct connections between any two positions
- Maximum path length is O(1) between any two tokens
- Helps with long-range dependencies

3. Training Optimizations:
- Label smoothing
- Learning rate warmup
- Dropout in various components
- Layer normalization
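
Learning rate warmup can take several forms. One well-known schedule, from the original Transformer paper, raises the rate linearly for a fixed number of steps and then decays it in proportion to the inverse square root of the step number. The sketch below assumes that schedule; the function name is made up for this example, and the defaults of 512 and 4000 are the paper's base settings, not values specified here.

```python
def warmup_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, warmup_lr(step))
```

The rate peaks at `step == warmup_steps`, where the two terms inside `min` are equal.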
     