Question: **What is the fundamental problem we are trying to solve with attention?**


Given a sequence of input tokens/vectors $\{x_1, \cdots, x_n \}$, we want to produce output vectors $y_1, \cdots, y_n$,
$$
y_i = f(x_1, x_2, \cdots, x_n)
$$

where:
1. Each $y_i$ is influenced by the entire input sequence $\{x_1, \cdots, x_n \}$ (Global context)
2. This influence should be selective/weighted: some $x_j$ should have more impact on $y_i$ than others
3. The weighting should be learned from data, not fixed. 

In other words, each output $y_i$ is a custom blend of all input vectors, where the blending weights are determined by the input data itself.

The next logical question is: what is the simplest possible mathematical operation that lets each output $y_i$ be influenced by all inputs $x_j$ while allowing for different weights of influence?

The most straightforward thing we could do is simply add them all together for each output $y_i$, like this:
$$
y_i = f(x_1 + x_2 + \cdots + x_n)
$$
But that's too simple. It treats all inputs equally, and we know from experience (reading a book or listening to music) that not everything is equally important at each moment.
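To make the limitation concrete, here is a minimal NumPy sketch of the unweighted sum. The toy values in `X` are made up, and $f$ is assumed to be the identity purely for illustration. Every output position ends up with the same vector, so nothing about position $i$ influences $y_i$.

```python
import numpy as np

# Toy input: n = 3 tokens, each a 2-dimensional vector (values are made up).
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 1.0]])

# Unweighted combination: x_1 + x_2 + x_3, shared by every output position.
pooled = X.sum(axis=0)

# With f assumed to be the identity, each y_i is just the pooled vector.
Y = np.tile(pooled, (X.shape[0], 1))

print(Y)  # three identical rows, each equal to [4. 3.]
```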

So the next question we should ask is: **How do we let the model decide what's important?**

>Think about how you read a sentence. When you see the word "bank", your brain automatically looks at the surrounding context to figure out if we're talking about a river bank or a financial bank. Your attention shifts to the relevant words.


What if we gave each input a weight? We could write:
$$
y_i = f(a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n)
$$
where the weight $a_{ij}$ tells us how much **attention** the $i$-th word should pay to the $j$-th word.
To compute $a_{ij}$ we need a way to measure the "relevance" of $x_j$ to $x_i$. We also want the coefficients to have two properties:

1. The coefficients should be close to zero for input tokens that have little influence on the output $y_i$ and largest for the inputs that have the most influence.
    - We constrain the coefficients to be non-negative to avoid situations in which one coefficient becomes large and positive while another compensates by becoming large and negative.
2. We also want to ensure that if an output pays more attention to a particular input, this comes at the expense of paying less attention to the other inputs. Therefore, we constrain the coefficients to sum to 1.
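
Written out, the two constraints look like this. The index names are a notational choice here: $a_{ij}$ is the weight that output $i$ places on input $j$.

$$
a_{ij} \ge 0, \qquad \sum_{j=1}^{n} a_{ij} = 1 \quad \text{for every } i
$$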


The next question is: how do we determine these weights $a_{ij}$? When you are reading and see the word "bank", how do you decide which context words matter? You'd probably look for some kind of relationship or compatibility between words, for example by checking whether nearby words are about finance or about rivers. In effect we ask a question like "What kind of bank are we talking about?" This question is the 'query': a representation of what information is needed to resolve the ambiguity. Then we look at the surrounding context for clues.

**How to compute importance/relevance between position $i$ and input $j$?**

To see how much the token represented by $x_i$ should attend to the token represented by $x_j$, we need a measure of similarity (why similarity? logically, what other concept could we use?). One simple measure is their dot product, $x_i \cdot x_j$.

And to impose the above constraints, we can define the weighting coefficients $a_{ij}$ by applying the softmax function to these dot products (could we use something else?).
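
The dot-product-plus-softmax recipe can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `fixed_attention` and the toy matrix `X` are assumptions made for the example, and $f$ is again taken to be the identity.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fixed_attention(X):
    """Parameter-free attention: weights come directly from input dot products."""
    scores = X @ X.T             # scores[i, j] = x_i . x_j
    A = softmax(scores, axis=1)  # row i holds the weights a_i1 ... a_in
    Y = A @ X                    # y_i = sum_j a_ij * x_j
    return Y, A

# Toy input: rows are x_1, x_2, x_3 (values are made up).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y, A = fixed_attention(X)

print(A.sum(axis=1))  # each row of weights sums to 1
```

Because softmax exponentiates and then normalizes, every weight is non-negative and each row sums to 1, which are exactly the two constraints above.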


As it stands, the transformation from input vectors $\{x_i\}$ to output vectors $\{y_i\}$ is fixed and has no capacity to learn: the weights depend only on the inputs themselves, with no trainable parameters.

We need a way to transform our vectors to capture the specific type of relationship we're looking for at each position. This leads us naturally to three ideas:

- For position $i$, we transform $x_i$ to represent "what type of relationship am I looking for?" - this becomes our query vector $q_i$.

- We transform each input $x_j$ to represent "what type of information do I offer?" - this becomes our key vector $k_j$.

- We then measure compatibility between query and key with some function $w_{ij} = f(q_i, k_j)$ to know how well the query matches each potential word in our context.


**How do we formulate a query?**
We could start by transforming our input vector into a 'question vector' using a learned linear transformation:

$$
q_i = W_Q x_i
$$
where $W_Q$ is a matrix we'll learn.
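
As a rough sketch, the query projection can be applied to all positions at once with one matrix product. Several things here are assumptions for illustration: the sizes `n`, `d_model`, `d_k`, the random stand-ins for learned weights, and a key matrix `W_K` defined by analogy with $W_Q$ (the text above only names $W_Q$). The dot product is used as the compatibility function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 3   # illustrative sizes, not prescribed by the text

X = rng.normal(size=(n, d_model))      # rows are x_1 ... x_n
W_Q = rng.normal(size=(d_k, d_model))  # learned in practice; random here
W_K = rng.normal(size=(d_k, d_model))  # assumed key projection, by analogy

Q = X @ W_Q.T     # row i is q_i = W_Q x_i
K = X @ W_K.T     # row j is k_j = W_K x_j (assumed form)
scores = Q @ K.T  # scores[i, j] = q_i . k_j, before any normalization

print(Q.shape, K.shape, scores.shape)  # (4, 3) (4, 3) (4, 4)
```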


In this approach each word forms its query only from itself. This would be too limiting: we'd only be asking the questions (what relationships we're looking for) based on isolated words. Therefore, when we're at position $i$, our query should be informed by the surrounding context:

$$
q_i = W_Q \, f(x_1, x_2, \ldots, x_n)
$$
where $f(\cdot)$ captures how the context influences our query formation.

The keys represent "what information do I offer?". In vanilla attention, this offering is independent of the query: each key is computed from its own input alone.

The standard explanation goes something like this: 
> *Think of it like a library. Each book (key) contains certain information, and this content exists independently of what someone might be looking for. But how relevant that book is depends on the specific question (query) being asked.*

The keys represent potential information that each position could offer, but ideally this representation should be able to adapt based on the query being made.  

---
**Aside:**
> In the standard attention mechanism, keys are computed independently of queries mainly for practical implementation reasons (computational efficiency and parallelization). However, conceptually, each position contains multifaceted information that could be relevant in different ways depending on what's being asked. A more sophisticated model might allow each position to dynamically emphasize different aspects of its information based on the nature of the query, similar to how a book might highlight different parts of its content depending on what the reader is looking for.

--- 

### Separating relevance from content

Attention computes an output vector $y_i$ for position $i$ as a weighted sum of all input vectors $\{x_1, \ldots, x_n\}$:

$$ y_i = \sum_j w_{ij} \cdot (\text{something from } x_j) $$

The weights $w_{ij}$ are determined by how relevant $x_j$ is to $x_i$, using queries ($q_i$) and keys ($k_j$):

$$ w_{ij} = \text{softmax}_j(q_i \cdot k_j) = \frac{\exp(q_i \cdot k_j)}{\sum_{l=1}^{n} \exp(q_i \cdot k_l)} $$

But what is the "something" we sum over? This is where **values** come in.
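
One common way to fill in the "something" is a third learned projection, $v_j = W_V x_j$. That form is an assumption here, since the text above has not yet defined values. The sketch below combines it with the query/key weights. The names `attention`, `W_K`, `W_V`, and all sizes are hypothetical, and random matrices stand in for learned ones.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Weights come from queries and keys; content comes from values."""
    Q = X @ W_Q.T                       # q_i = W_Q x_i
    K = X @ W_K.T                       # k_j = W_K x_j (assumed form)
    V = X @ W_V.T                       # v_j = W_V x_j (assumed form)
    weights = softmax(Q @ K.T, axis=1)  # w_ij = softmax over j of q_i . k_j
    Y = weights @ V                     # y_i = sum_j w_ij * v_j
    return Y, weights

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 4, 8, 3, 5       # illustrative sizes
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_k, d_model))
W_K = rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_v, d_model))

Y, weights = attention(X, W_Q, W_K, W_V)
print(Y.shape)  # (4, 5)
```

The sketch follows the unscaled formula given above. The standard Transformer formulation additionally divides the scores by $\sqrt{d_k}$ before the softmax.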



## Self-attention vs. cross-attention
**Encoder:**
1. Takes input s