# Coding attention mechanisms

<img src="../images/figure-3.1-three-main-stages-of-llm-chapter-3-focus-on-stage-1-step-2.webp" width="800px">

- We will look at <span style="color:#4ea9fb"><b>attention mechanisms</b> in isolation and focus on them at a mechanistic level</span>.
- We will implement <b>4 different variants of attention mechanisms</b>:
  - <span style="color:#4ea9fb"><b>Simplified self-attention</b></span>
  - <span style="color:#4ea9fb"><b>Self-attention with trainable weights</b></span>
  - <span style="color:#4ea9fb"><b>Causal attention</b></span>
    - Adds a mask to self-attention so that the model considers only the previous and current inputs in a sequence.
  - <span style="color:#4ea9fb"><b>Multi-head attention</b></span>
    - Organizes the attention mechanism into multiple heads, so the model can attend to different aspects of the input in parallel.

<img src="../images/figure-3.2-four-different-variants-of-attention-mechanism.webp" width="700px">
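The masking idea behind causal attention can be previewed with a small sketch. This is only an illustration in plain Python, not the implementation used later: the helper names `causal_mask` and `apply_causal_mask` are hypothetical, and zeroing future positions followed by renormalizing each row is assumed here as one simple way to apply the mask.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def apply_causal_mask(weights):
    """Zero out weights on future positions, then renormalize each row to sum to 1.

    Assumes every weight is positive, so each row keeps a nonzero diagonal entry.
    """
    mask = causal_mask(len(weights))
    masked = [
        [w if keep else 0.0 for w, keep in zip(row, mask_row)]
        for row, mask_row in zip(weights, mask)
    ]
    return [[w / sum(row) for w in row] for row in masked]


# Toy attention weights (hypothetical): 4 tokens, each attending uniformly to all 4
uniform = [[0.25] * 4 for _ in range(4)]
masked = apply_causal_mask(uniform)

for row in masked:
    print([round(w, 2) for w in row])
```

After masking, the first token can only attend to itself, while the last token still attends to all four positions.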

## 3.1 The problem with modeling long sequences

Say we are in the <b>pre-LLM era</b> and want to <b>develop a language translation model</b>.
- <span style="color:red">We cannot simply translate a text word by word, because the grammatical structures of the source and target languages differ.</span>
- To address this problem, deep neural networks (DNNs) generally use two submodules:
  - <span style="color:#4ea9fb"><b>encoder</b></span> (first reads and processes the entire text) and
  - <span style="color:#4ea9fb"><b>decoder</b></span> (then produces the translated text).

<img src="../images/figure-3.3-german-to-english-problem-with-word-for-word-translation.webp" width="700px">

<b>What is an RNN, and why were RNNs popular before transformers?</b>
- Before the advent of transformers, <span style="color:green"><b><i>recurrent neural networks</i> (RNNs) were the most popular encoder-decoder architecture for language translation</b></span>.
- <span style="color:#4ea9fb">An RNN is a type of neural network in which the output of the previous step is fed as input to the current step, which makes it well suited to sequential data such as text.</span>
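The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b); the function names and all weight values are hypothetical toy choices, not taken from a trained model.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, w_xh, w_hh, b):
    """Feed the sequence one element at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states


# Toy values (hypothetical): 3 tokens with 2-dim embeddings, hidden size 2
inputs = [[0.4, 0.1], [0.5, 0.9], [0.6, 0.3]]
w_xh = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[0.2, 0.0], [-0.1, 0.4]]
b = [0.0, 0.0]

states = run_rnn(inputs, w_xh, w_hh, b)
print(states[-1])  # final hidden state, influenced by every earlier token
```

Because each hidden state depends on the previous one, changing an early token changes every later state, which is what lets the network carry context through the sequence.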

<b>What does the RNN (encoder-decoder) do?</b>
- <span style="color:#4ea9fb">The encoder processes a sequence of words/tokens from the source language as input, using a hidden state (an intermediate neural network layer of the encoder) to generate a condensed (encoded) representation of the entire input sequence.</span>
- <span style="color:#4ea9fb">The decoder then uses this encoded representation (hidden state) to generate the translated text, one word at a time (i.e., token by token).</span>

<p style="color:black; background-color:#F5C780; padding:15px">💡 <b>Key idea of encoder-decoder RNNs</b><br><b>- Encoder</b>: Processes the entire input text into a hidden state (memory cell).<br><b>- Decoder</b>: Takes in this hidden state to produce the output, one word at a time.<br><b>- Hidden state:</b> Roughly comparable to an embedding vector</p>
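The encoder-decoder flow summarized above can be sketched as follows. This is a deliberately simplified toy, not a real translation model: the scalar weights `w_x` and `w_h`, the elementwise update in `step`, and the zero vector standing in for a start-of-sequence embedding are all assumptions made for illustration. A real decoder would also project each hidden state to vocabulary scores and pick a token.

```python
import math


def step(x, h, w_x, w_h):
    """Simplified recurrent update with scalar weights, applied elementwise (toy assumption)."""
    return [math.tanh(w_x * xi + w_h * hi) for xi, hi in zip(x, h)]


def encode(source_embeddings, w_x=0.8, w_h=0.5):
    """Read the whole source sequence and return only the final hidden state."""
    h = [0.0] * len(source_embeddings[0])
    for x in source_embeddings:
        h = step(x, h, w_x, w_h)
    return h  # everything the decoder will know about the input


def decode(context, num_steps, w_x=0.8, w_h=0.5):
    """Generate one output state per step, starting from the encoder's final state."""
    h = context
    prev_output = [0.0] * len(context)  # stands in for a start-of-sequence embedding
    outputs = []
    for _ in range(num_steps):
        h = step(prev_output, h, w_x, w_h)
        outputs.append(h)
        prev_output = h  # feed the previous output back in
    return outputs


# Toy inputs (hypothetical): 3-dim embeddings, a 2-token and a 20-token source
short_src = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
long_src = short_src * 10

context_short = encode(short_src)
context_long = encode(long_src)
translated = decode(context_long, num_steps=4)

print(len(context_short), len(context_long))  # same size regardless of input length
```

The point of the sketch is the bottleneck: a 2-token input and a 20-token input are both squeezed into a hidden state of the same fixed size before decoding starts.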

<b>What's the problem with encoder-decoder RNNs?</b>
- <span style="color:red">RNNs have a hard time capturing long-range dependencies in complex sentences.</span>
  - The decoder cannot directly access earlier hidden states of the encoder during the decoding phase.
  - Consequently, the decoder relies solely on the current hidden state, which is expected to encapsulate all relevant information but may not retain enough of it to generate the correct translation.
  - This leads to loss of context.
    - Although RNNs work fine for short sentences, <span style="color:red">they struggle with longer sentences as they don't have direct access to previous words in the input sequence.</span>
- <span style="color:#4ea9fb">This motivated the design of attention mechanisms.</span>

<img src="../images/figure-3.4-german-to-english-translation-using-RNNs-encoder-decoder.webp" width="700px">

## 3.2 Capturing data dependencies with attention mechanisms

<b>Why attention mechanisms?</b>
- <span style="color:red">One major shortcoming of the RNNs above is that they must remember the entire encoded input in a single hidden state before passing it to the decoder (figure 3.4).</span>
- <span style="color:#4ea9fb">Attention mechanisms address this issue by allowing the decoder to focus on (i.e., selectively access) different parts of the input sequence at each decoding step (figure 3.5).</span>

<img src="../images/figure-3.5-german-to-english-translation-using-RNNs-encoder-decoder-with-attention-mechanism.webp" width="700px">
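The idea of selectively accessing the input can be sketched as a weighted sum over all encoder hidden states. This is a minimal illustration in plain Python, assuming dot-product scoring followed by a softmax; that scoring rule is a simplification chosen here, and the function name `attend` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score every encoder state against the decoder state, then mix them."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy values (hypothetical): one hidden state per source token
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [1.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Unlike the single-hidden-state setup in figure 3.4, the decoder here sees every encoder state and the weights change with the decoder state, so a different decoding step can emphasize a different source token.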