---
title: Understanding Transformers
subject: Applied Linear Algebra
subtitle: Building intuition for Attention Mechanisms
short_title: Understanding Transformers
authors:
  - name: Renukanandan Tumu
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nandant@seas.upenn.edu
license: CC-BY-4.0
keywords: sample notes, ese 2030, linear algebra
---

## What are transformers?

Transformers are a popular architecture for machine-learning models today, most notably powering [Large Language Models (LLMs)](https://openai.com/index/better-language-models/). We will examine how the inner product sits at their core.

First, let's look at the architecture itself. The Transformer was first proposed by Vaswani et al. in [*Attention is all you need*](https://arxiv.org/abs/1706.03762v7). At the core of the Transformer is the attention mechanism.

The Transformer was originally proposed to help with machine translation: translating large documents, like books, from one language to another. This is one of the core topics of study in Natural Language Processing (NLP). At the time the Transformer was developed, the state-of-the-art approach was the [LSTM](https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735). Unfortunately, when translating an entire book, these approaches were unable to keep context from earlier in the book, or even from earlier in a sentence. This presented challenges when translating between languages that flip the order of the verb and the subject.

## Machine Translation Basics

To operate on words, we need mathematical representations for them. In machine translation, words are represented as vectors. For example, the word 'with' might be represented as:

```{math}
\text{with} \mapsto
\begin{bmatrix}
0 \\
1 \\
0
\end{bmatrix}
```
We can build dictionaries out of these vectors and collect them into a library of keys $K$. We can also collect the vectors that represent words in our target language into a set of values $V$. For our example, $K$ holds vectors that represent the meaning of some words, like 'with', 'above', and 'below'. If we are translating to French, $V$ holds vectors for the corresponding French words: 'avec', 'au-dessus', and 'ci-dessous'.
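
As a minimal sketch of this idea, the snippet below builds one-hot vectors for a three-word vocabulary. The helper `one_hot` and the word ordering are assumptions made for illustration; the ordering is chosen so that 'with' maps to the vector shown above.

```python
import numpy as np

# Hypothetical three-word vocabularies; the index order is an arbitrary choice
# made so that 'with' lands on the second basis vector.
english = ["above", "with", "below"]
french = ["au-dessus", "avec", "ci-dessous"]

def one_hot(index, size):
    """Return the standard basis vector e_index in R^size."""
    vec = np.zeros(size, dtype=int)
    vec[index] = 1
    return vec

# Dictionaries from words to their vector representations
keys = {word: one_hot(i, len(english)) for i, word in enumerate(english)}
values = {word: one_hot(i, len(french)) for i, word in enumerate(french)}

print(keys["with"])  # [0 1 0]
```

One-hot vectors are the simplest possible choice: every pair of distinct words has inner product zero, so no word is "similar" to any other.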


## What does Attention do?

Attention relates the words in our query $Q$ to a set of keys $K$, and maps the result into a domain of values $V$. In the case we just described, a query for the word 'with' matches the key for 'with' in the matrix $K$. That key in turn selects the value in $V$ for 'avec', the French translation of 'with'.



Let's look at the formula for Attention.

:::{prf:definition} Attention Definition
$$ \text{Attention}(Q, K, V) =  \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
$Q$ is the query, $K$ contains the keys, $V$ contains the values, and $d_k$ is the dimension of the keys.
:::
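
The definition can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: it assumes queries, keys, and values are stored as rows, and it scales by $\sqrt{d_k}$ as in Vaswani et al. The function names `softmax` and `attention` are chosen here for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max keeps the exponentials stable."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention with queries, keys, and values stored as rows."""
    Q, K, V = (np.atleast_2d(np.asarray(M, dtype=float)) for M in (Q, K, V))
    d_k = K.shape[-1]                  # dimension of each key
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of each query to each key
    weights = softmax(scores)          # each row sums to one
    return weights @ V                 # weighted combination of the values

# Small usage example with made-up values
K = np.eye(3)
V = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
out = attention(np.array([0.0, 1.0, 0.0]), K, V)
print(out.shape)  # (1, 2)
```

Each output row is a weighted average of the rows of `V`, with the largest weight on the value whose key best matches the query.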


## Worked Example
### Definitions
```{math}
\begin{gather}
Q = \text{with} = \begin{bmatrix}
0 &
1 &
0
\end{bmatrix}\\

K = \begin{bmatrix}
\text{above} &
\text{below} &
\text{with}
\end{bmatrix} = 
\begin{bmatrix}
1 & 0 & 0 \\
0 & 0 & 1 \\
0 & 1 & 0 \\
\end{bmatrix}\\

V = \begin{bmatrix}
\text{au-dessus} &
\text{ci-dessous} &
\text{avec}
\end{bmatrix}
\end{gather}
```


### Numerator
Let's examine this definition in parts. First, consider the numerator $QK^T$. Each entry is the inner product of the query with one of the keys, which we can interpret as the similarity between the query and that key.


In [1]:
import numpy as np
Q = np.array([0,1,0])
K = np.array([
    [1,0,0],
    [0,0,1],
    [0,1,0]
])
Q@K.T

array([0, 0, 1])

The only nonzero entry is the third one, which belongs to the key for 'with'. The inner product has selected the key that corresponds to the most similar word in the dictionary.

### Normalization and Softmax

The next part of the formula, $$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right), $$ has two pieces. The first is the scaling factor $\sqrt{d_k}$, the square root of the dimension of the keys; in our example $d_k=3$. The authors of the paper introduced this scaling because large inner products push the softmax into regions where its gradients are very small, which hampers training. The second piece is the softmax itself. For a vector $z$ with $n$ entries, it is defined as:

$$
\text{softmax}\left(z\right)_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}
$$

This generates a vector whose entries lie between zero and one and sum to one. One way to think about the softmax is as a differentiable approximation of the arg max function: it places the largest weight on the largest entry.

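In our example, the largest weight falls on the third entry. If we idealize the softmax as a hard maximum, the value we get at the end is:

$$\begin{bmatrix}0 & 0 & 1\end{bmatrix}$$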
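
The one-hot vector above is the idealized, hard-maximum outcome. As a numerical check, the sketch below evaluates the actual softmax on the scores from the worked example, assuming the $\sqrt{d_k}$ scaling from Vaswani et al. The weights are softer than a one-hot vector, but the third key still receives the largest weight.

```python
import numpy as np

scores = np.array([0.0, 0.0, 1.0])   # Q K^T from the worked example
d_k = 3                               # dimension of the keys

scaled = scores / np.sqrt(d_k)        # assumed scaling: sqrt(d_k), as in the paper
weights = np.exp(scaled) / np.exp(scaled).sum()

print(weights.round(3))               # approximately [0.264 0.264 0.471]
print(weights.argmax())               # 2
```

The softmax approaches the hard maximum only as the gap between the largest score and the others grows.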


### Multiplying by Value

Finally, we multiply the weight vector from above by $V$. This selects the third value, and we get the translated word $\text{avec}$!
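
The notes do not give $V$ numerically, so the sketch below makes an assumption for illustration: row $i$ of `V` is the one-hot code of the $i$-th French word, in the same order as the keys. Under that assumption, multiplying the idealized weights by `V` picks out the row for 'avec'.

```python
import numpy as np

french = ["au-dessus", "ci-dessous", "avec"]   # same order as the keys
V = np.eye(3)                                   # assumed: row i is the one-hot code of french[i]

hard_weights = np.array([0.0, 0.0, 1.0])        # idealized output of the softmax
output = hard_weights @ V                       # selects the third row of V

print(french[int(output.argmax())])             # avec
```

With the softer weights produced by the actual softmax, the output is a blend of all three rows of `V`, and 'avec' is still the largest component.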

## Looking at real words
To do this, we will use the `gensim` library, which provides pretrained sets of word embeddings. For this example, we will download GloVe word embeddings trained on Twitter data.

In [2]:
# The downloader module fetches pretrained embeddings by name
import gensim.downloader

In [3]:
glove_vectors = gensim.downloader.load('glove-twitter-25')

We can look up words, like screen, rock, and remote, by indexing into the embeddings as shown below:

In [4]:
screen = glove_vectors['screen']
rock = glove_vectors['rock']
remote = glove_vectors['remote']
print('Screen: ',screen)
print('rock: ',rock)
print('Remote: ',remote)

Screen:  [ 0.31098   -0.53336    0.9123    -0.15256    0.89117   -0.10692
  0.78658   -0.19013    0.93285    0.52754    0.31475    0.63839
 -3.4433    -0.65918    0.055317   1.3083     0.4009     0.0071616
 -0.27728    0.017544  -0.86985   -0.60072    1.3789     0.25096
 -1.4383   ]
rock:  [ 0.021641 -0.54484   0.78102  -0.10052   0.10482   0.53319   1.0586
  0.90491   0.2594   -0.73547  -0.12972  -0.59679  -3.4979   -0.61679
 -0.15259   0.072474 -0.45453  -0.14681   0.09392   0.32735  -0.68834
 -0.020972  0.21344  -0.63178   1.3292  ]
Remote:  [ 0.1125    0.38049   0.40254   0.089511  0.58018   0.067418  0.41842
 -0.46138   1.8756    0.99621   0.19743   0.27248  -2.7099   -0.65497
  0.54752   1.0199    0.58964   0.1559    0.93753  -0.28045  -0.8659
 -0.88299   0.85855  -0.39055  -1.0604  ]


Now we can calculate the cosine similarity, $\cos(\theta)$, by rearranging the formula:
$$
w\cdot v = \|w\| \|v\| \cos(\theta)
$$
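
Solving for $\cos(\theta)$ gives $\cos(\theta) = \frac{w\cdot v}{\|w\| \|v\|}$. A small helper makes this explicit; the name `cosine_similarity` and the two test vectors are chosen here for illustration, and the helper assumes neither vector is zero.

```python
import numpy as np

def cosine_similarity(w, v):
    """cos(theta) = (w . v) / (||w|| ||v||), assuming neither vector is zero."""
    return float(w @ v / (np.linalg.norm(w) * np.linalg.norm(v)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(round(cosine_similarity(a, b), 4))  # 0.7071 (a 45 degree angle)
print(cosine_similarity(a, -a))           # -1.0 (opposite directions)
```

Because the norms are divided out, the result depends only on the angle between the vectors, not on their lengths.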

In [5]:
screen@rock/(np.linalg.norm(screen)*np.linalg.norm(rock))

0.5682209

In [6]:
remote@rock/(np.linalg.norm(remote)*np.linalg.norm(rock))

0.4561119

In [7]:
screen@remote/(np.linalg.norm(screen)*np.linalg.norm(remote))

0.86155677

Based on this dataset, we can see that screen and remote are the most similar, remote and rock are the least similar, and screen and rock are not very similar either. The library also provides the ability to search the database for the most similar words to a given word.

In [8]:
glove_vectors.most_similar('remote')

[('keyboard', 0.9022769331932068),
 ('monitor', 0.9000157713890076),
 ('plug', 0.8990083932876587),
 ('engine', 0.8985315561294556),
 ('switch', 0.896084725856781),
 ('device', 0.888617753982544),
 ('charger', 0.8855589032173157),
 ('counter', 0.8843674063682556),
 ('automatic', 0.878197431564331),
 ('sync', 0.877406656742096)]

The most similar words to 'remote' in this dataset are shown above. Dot products and cosine similarity are key components of technologies which underpin the latest innovations in Large Language Models.
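
A similarity search of this kind can be sketched as a ranking by cosine similarity. The version below is a simplified illustration, not gensim's implementation, and the two-dimensional vectors are made up for the example rather than taken from the GloVe embeddings.

```python
import numpy as np

# Made-up 2-D vectors for illustration only; these are not real GloVe embeddings.
toy_vectors = {
    "remote":   np.array([1.0, 0.2]),
    "keyboard": np.array([0.9, 0.3]),
    "monitor":  np.array([0.8, 0.1]),
    "rock":     np.array([-0.2, 1.0]),
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`, highest first."""
    q = vectors[word] / np.linalg.norm(vectors[word])
    scored = [
        (other, float(q @ vec / np.linalg.norm(vec)))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar("remote", toy_vectors)
print([w for w, _ in ranked])  # ['monitor', 'keyboard']
```

This is the same computation as the numerator of attention: one query vector compared against every key by an inner product.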