---
title: "3.3 Case Study: ChatGPT, Transformers, and Inner Products"
subject: Inner Products and Norms
subtitle: Attention mechanisms are fancy inner products
short_title: "3.3 Case Study: ChatGPT and Inner Products"
authors:
  - name: Renukanandan Tumu
    affiliations:
      - Dept. of Electrical and Systems Engineering
      - University of Pennsylvania
    email: nandant@seas.upenn.edu
license: CC-BY-4.0
keywords: sample notes, ese 2030, linear algebra
---

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/nikolaimatni/ese-2030/HEAD?labpath=/02_Inner_Products_and_Norms/Case_Study_On_Transformers/048-transformer-case-study.ipynb)

## What are transformers?

Transformers are a popular architecture for machine-learning models, most notably powering [Large Language Models (LLMs)](https://openai.com/index/better-language-models/). We will examine how the inner product sits at their core.

First, let's look at the architecture itself. The Transformer was first proposed by Vaswani et al. in [*Attention is all you need*](https://arxiv.org/abs/1706.03762v7). At the core of the Transformer is the attention mechanism.

The Transformer was originally proposed to help with machine translation, for example translating large documents like books from one language to another. This is one of the core topics of study in Natural Language Processing (NLP). At the time the Transformer was developed, the state-of-the-art approach was the [LSTM](https://dl.acm.org/doi/10.1162/neco.1997.9.8.1735). Unfortunately, when translating an entire book, LSTM-based approaches were unable to keep context from earlier in the book, or even from earlier in a sentence. This presented challenges when translating between languages that flip the order of the verb and the subject.

## Machine Translation Basics

To operate on words, we need mathematical representations for them. In machine translation, words are represented as vectors. We can build a dictionary of these vectors for the source language and collect them into a library of keys $K$. Similarly, we collect the vectors that represent words in the target language into a set of values $V$. For our example, $K$ contains vectors that represent the meaning of the words 'with', 'above', and 'below'. Since we are translating to French, $V$ contains vectors for the corresponding French words 'avec', 'au-dessus', and 'ci-dessous'. Because we have only 3 keys and 3 values, we can represent them as vectors in $\mathbb{R}^3$ using one-hot encoding, where each word gets its own coordinate axis.
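
As a minimal sketch of the encoding described above, the snippet below builds one-hot vectors for a three-word vocabulary. The helper `one_hot` and its `scale` argument are hypothetical names introduced only for illustration; the scale of 5 mirrors the worked example later in this section, which uses 5 in place of 1.

```python
import numpy as np

def one_hot(vocab, scale=1.0):
    """Hypothetical helper: map each word to a scaled one-hot vector."""
    eye = scale * np.eye(len(vocab))
    return {word: eye[i] for i, word in enumerate(vocab)}

# Vocabulary order is an assumption chosen to match the vectors used later on.
english = one_hot(["above", "with", "below"], scale=5)

print(english["with"])                     # [0. 5. 0.]
print(english["with"] @ english["below"])  # 0.0, distinct words are orthogonal
```

Because distinct one-hot vectors are orthogonal, the dot product of two encoded words is nonzero only when the words are the same.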



## What does Attention do?

Attention relates the words in our query $Q$ to a set of keys $K$, and then maps the matches into a domain of values $V$.
In the translation example, the key matrix $K$ is a dictionary of words in the source language, English. The value matrix $V$ is a dictionary of words in the target language, French, where each word in $K$ aligns with its corresponding word in $V$.
The query is the word or set of words that we are looking to translate.

The dot product here acts as a lookup table, identifying which parts of the key matrix "match" our query. We can think of this in the same way we think of cosine similarity, where the dot product tells us how "aligned" two vectors are.
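
To make the idea of "alignment" concrete, here is a small sketch with made-up two-dimensional vectors (they are not taken from the translation example). One key points in the same direction as the query, one is orthogonal to it, and one points in the opposite direction.

```python
import numpy as np

# Hypothetical vectors, chosen only to illustrate alignment.
query = np.array([2.0, 1.0])
keys = np.array([
    [2.0, 1.0],    # same direction as the query
    [1.0, -2.0],   # orthogonal to the query
    [-2.0, -1.0],  # opposite direction
])

scores = keys @ query  # one dot product per key: 5, 0, -5
cosines = scores / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query))

print(scores)
print(np.round(cosines, 3))  # approximately 1, 0, -1
print(int(np.argmax(scores)))  # index of the best-matching key
```

The largest dot product picks out the key that is most aligned with the query, which is exactly the "lookup" behavior described above.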

## The Attention Mechanism

The core of the attention mechanism is the dot product. Let's look at the formula for Attention piecewise, using a simple example.
### Definitions
```{math}
\begin{gather}
Q = \begin{bmatrix}\text{with} \\ \text{below}\end{bmatrix} = \begin{bmatrix}
0 & 5 & 0 \\
0 & 0 & 5
\end{bmatrix}\\

K = \begin{bmatrix}
\text{above} &
\text{below} &
\text{with}
\end{bmatrix} = 
\begin{bmatrix}
5 & 0 & 0 \\
0 & 0 & 5 \\
0 & 5 & 0 \\
\end{bmatrix}\\

V = \begin{bmatrix}
\text{au-dessus} &
\text{ci-dessous} &
\text{avec}
\end{bmatrix}= 
\begin{bmatrix}
5 & 0 & 0 \\
0 & 5 & 0 \\
0 & 0 & 5 \\
\end{bmatrix}\\
\end{gather}
```
In code, this is:

In [1]:
import numpy as np
Q = np.array([[0,5,0],[0,0,5]])
K = np.array([
    [5,0,0],
    [0,0,5],
    [0,5,0]
])
V = np.array([
    [5, 0, 0],
    [0, 5, 0],
    [0, 0, 5]
])

### Part 1: Dot-Product

```{math}
QK^T = A
```
We'll lay out this matrix product so we can see which components of $Q$ and $K^T$ produce each entry of $A$:
```{math}
\begin{aligned}
\begin{array}{lll}
 \\
 \\
 \\
\end{array}&
\left[\begin{array}{lll}
k_{1,1} & k_{2,1} & k_{3,1}  \\
k_{1,2} & k_{2,2} & k_{3,2}  \\
k_{1,3} & k_{2,3} & k_{3,3}  \\
\end{array}\right] \\
\left[\begin{array}{lll}
q_{1,1} & q_{1,2} & q_{1,3}  \\
q_{2,1} & q_{2,2} & q_{2,3}  \\
\end{array}\right]&\left[\begin{array}{lll}
a_{1,1} & a_{1,2} & a_{1,3} \\
a_{2,1} & a_{2,2} & a_{2,3} \\
\end{array}\right] \\
&
\end{aligned}
```
Now we can replace these placeholders with data from our example.
```{math}
\begin{aligned}
\begin{array}{lll}
 \\
 \\
 \\
\end{array}&
\left[\begin{array}{lll}
5 & 0 & 0 \\
0 & 0 & 5 \\
0 & 5 & 0 \\
\end{array}\right] \\
\left[\begin{array}{lll}
0 & 5 & 0 \\
0 & 0 & 5
\end{array}\right]&\left[\begin{array}{lll}
0 & 0 & 25 \\
0 & 25 & 0 \\
\end{array}\right] \\
&
\end{aligned} \\
A = \left[\begin{array}{lll}
0 & 0 & 25 \\
0 & 25 & 0 \\
\end{array}\right]
```
This forms the core of the attention mechanism: matching items in the query to items in our database of keys, using nothing but dot products. Each entry $a_{i,j}$ of $A$ is the dot product of query $i$ with key $j$, so $A$ takes large values where a query is similar to the corresponding key in $K$, and small values where the two are dissimilar.
In code, this is:

In [2]:
A = Q@K.T
A

array([[ 0,  0, 25],
       [ 0, 25,  0]])
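
As a quick check of the claim that each entry of $A$ is a single dot product, the sketch below rebuilds $A$ one entry at a time and compares it with the matrix product. The name `A_loop` is introduced here only for illustration.

```python
import numpy as np

Q = np.array([[0, 5, 0], [0, 0, 5]])
K = np.array([[5, 0, 0], [0, 0, 5], [0, 5, 0]])

# Entry (i, j) is the dot product of query i with key j.
A_loop = np.zeros((Q.shape[0], K.shape[0]), dtype=int)
for i, q in enumerate(Q):
    for j, k in enumerate(K):
        A_loop[i, j] = np.dot(q, k)

print(np.array_equal(A_loop, Q @ K.T))  # True
```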

### Part 2: Scaling
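The next step is to scale the output by dividing by $d_k$, the dimension of the keys. For us, $d_k = 3$, because each key is in $\mathbb{R}^3$. (Vaswani et al. divide by $\sqrt{d_k}$ instead; dividing by $d_k$, as we do here, changes the numbers but not which key each query matches.) The formula here is: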
```{math}
\begin{gather}
\frac{A}{d_k} = B \\
\left[\begin{array}{lll}
0 & 0 & 25 \\
0 & 25 & 0 \\
\end{array}\right] / d_k = \left[\begin{array}{lll}
0 & 0 & 8.3 \\
0 & 8.3 & 0 \\
\end{array}\right]

\end{gather}
```
In code, this is:

In [3]:
B = A/3
B

array([[0.        , 0.        , 8.33333333],
       [0.        , 8.33333333, 0.        ]])
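
The scaling step above divides by $d_k$, whereas the original paper by Vaswani et al. divides by $\sqrt{d_k}$. The sketch below compares the two choices on our example; the names `B_text` and `B_paper` are introduced here only for illustration.

```python
import numpy as np

A = np.array([[0.0, 0.0, 25.0],
              [0.0, 25.0, 0.0]])
d_k = 3

B_text = A / d_k            # scaling used in this section
B_paper = A / np.sqrt(d_k)  # scaling used by Vaswani et al.

# Both scalings are positive, so the best-matching key per query is unchanged.
same_match = np.array_equal(B_text.argmax(axis=1), B_paper.argmax(axis=1))
print(same_match)  # True
```

Dividing by a positive constant never changes which score in a row is largest; it only changes how sharply the later steps separate the largest score from the rest.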

### Part 3: Softmax

The softmax operator is a nonlinear operator that maps a vector of values to a vector whose entries all lie between 0 and 1 and sum to 1. The intuition is that it is a smooth version of the maximum operator: the largest entry receives most of the weight. The formula for the softmax of a vector $z \in \mathbb{R}^n$ is below:
$$
\text{softmax}\left(z\right)_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}
$$
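
The formula can be written directly in NumPy. This is a minimal sketch rather than a library implementation; subtracting the maximum before exponentiating is a common trick that leaves the result unchanged while avoiding overflow for large inputs.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D array, following the formula above."""
    shifted = z - np.max(z)  # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

weights = softmax(np.array([0.0, 0.0, 25.0 / 3.0]))
print(weights)        # roughly 0.0002, 0.0002, 0.9995
print(weights.sum())  # sums to 1 up to floating-point rounding
```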

Our formula is:
```{math}
C = \text{softmax}(B)
```
The softmax is applied to each row of $B$ separately, turning the scores for one query into weights over the keys. This gives us a soft lookup table that identifies which part of the key database best represents the content in the query.

In code this is:

In [4]:
from scipy.special import softmax

C = softmax(B, axis=1)
print(np.around(C,3))
C

[[0. 0. 1.]
 [0. 1. 0.]]


array([[2.40253977e-04, 2.40253977e-04, 9.99519492e-01],
       [2.40253977e-04, 9.99519492e-01, 2.40253977e-04]])

Interpreting this result, we can see that the first query ("with") corresponds most to the third key ("with"), and the second query ("below") corresponds most to the second key ("below").
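
This interpretation can also be read off programmatically by taking the index of the largest weight in each row. The word lists below restate the row order of $Q$ and $K$ from the definitions; the variable names are introduced here only for illustration.

```python
import numpy as np

# Softmax output copied from the cell above.
C = np.array([[2.40253977e-04, 2.40253977e-04, 9.99519492e-01],
              [2.40253977e-04, 9.99519492e-01, 2.40253977e-04]])

query_words = ["with", "below"]        # rows of Q
key_words = ["above", "below", "with"] # rows of K

best_key = np.argmax(C, axis=1)  # index of the largest weight per query
for word, j in zip(query_words, best_key):
    print(word, "->", key_words[j])
# with -> with
# below -> below
```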

### Part 4: Values

Finally, we get our result by multiplying the matrix $C$ with our values $V$, where each item in $V$ corresponds to an item in $K$, our key database. Our formula for this is:

```{math}
D = CV \\
D = \left[\begin{array}{lll}
0.0002 & 0.0002 & 0.9995 \\
0.0002 & 0.9995 & 0.0002 \\
\end{array}\right] \left[\begin{array}{lll}
5 & 0 & 0 \\
0 & 5 & 0 \\
0 & 0 & 5 \\
\end{array}\right] \\
D \approx \left[\begin{array}{lll}
0 & 0 & 5 \\
0 & 5 & 0
\end{array}\right]
```
This completes the translation by relating the keys in our database to their translated counterparts. In code this is:

In [5]:
D = C@V
D

array([[1.20126988e-03, 1.20126988e-03, 4.99759746e+00],
       [1.20126988e-03, 4.99759746e+00, 1.20126988e-03]])

Now that we have calculated the attention output for our queries, we match each row of the result to the value it is most similar to. For this example, these values are $[0,0,5]$, corresponding to "avec", and $[0,5,0]$, corresponding to "ci-dessous". Checking with our definitions from above, these are the correct translations of "with" and "below".
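
One way to carry out this final matching step is to take the dot product of each output row with every value vector and keep the best match. This is a sketch under the assumption that the most similar value is chosen by largest dot product; the names `value_words` and `translation` are introduced here for illustration.

```python
import numpy as np

# Attention output copied from the cell above.
D = np.array([[1.20126988e-03, 1.20126988e-03, 4.99759746e+00],
              [1.20126988e-03, 4.99759746e+00, 1.20126988e-03]])
V = np.array([[5, 0, 0], [0, 5, 0], [0, 0, 5]])
value_words = ["au-dessus", "ci-dessous", "avec"]  # rows of V

# Score every output row against every value, then keep the best match.
nearest = np.argmax(D @ V.T, axis=1)
translation = [value_words[j] for j in nearest]
print(translation)  # ['avec', 'ci-dessous']
```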

## Putting it all together
If we combine the 4 steps we just completed, we get the full definition for Attention, which is below:
:::{prf:definition} Attention Definition
$$ \text{Attention}(Q, K, V) =  \text{softmax}\left(\frac{QK^T}{d_k}\right)V$$
$Q$ is the query, $K$ are the keys, $V$ are the values, and $d_k$ is the dimension of the keys. The original paper by Vaswani et al. scales by $\sqrt{d_k}$ rather than $d_k$.
:::
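
The four steps can be collected into a single function. The sketch below follows the definition as stated in this section, dividing by $d_k$; the function names `softmax_rows` and `attention` are hypothetical and use only NumPy.

```python
import numpy as np

def softmax_rows(x):
    """Apply the softmax to each row of a 2-D array."""
    shifted = x - x.max(axis=1, keepdims=True)  # avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / d_k) V, with d_k the dimension of each key."""
    d_k = K.shape[1]
    scores = Q @ K.T / d_k        # Parts 1 and 2: dot products, then scaling
    weights = softmax_rows(scores)  # Part 3: softmax over the keys
    return weights @ V            # Part 4: weighted combination of values

Q = np.array([[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])
K = np.array([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0], [0.0, 5.0, 0.0]])
V = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]])

D = attention(Q, K, V)
print(np.round(D, 2))  # rows are close to [0, 0, 5] and [0, 5, 0]
```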

## Looking at real words
To look at real words, we will use the library `gensim`, which provides pretrained sets of word embeddings. For this example, we will download GloVe word embeddings trained on Twitter data, which represent each word as a vector in $\mathbb{R}^{25}$.

In [6]:
import gensim.downloader

In [7]:
glove_vectors = gensim.downloader.load('glove-twitter-25')

We can look up words, like screen, rock, and remote, using the function calls below:

In [8]:
screen = glove_vectors['screen']
rock = glove_vectors['rock']
remote = glove_vectors['remote']
print('Screen: ',screen)
print('rock: ',rock)
print('Remote: ',remote)

Screen:  [ 0.31098   -0.53336    0.9123    -0.15256    0.89117   -0.10692
  0.78658   -0.19013    0.93285    0.52754    0.31475    0.63839
 -3.4433    -0.65918    0.055317   1.3083     0.4009     0.0071616
 -0.27728    0.017544  -0.86985   -0.60072    1.3789     0.25096
 -1.4383   ]
rock:  [ 0.021641 -0.54484   0.78102  -0.10052   0.10482   0.53319   1.0586
  0.90491   0.2594   -0.73547  -0.12972  -0.59679  -3.4979   -0.61679
 -0.15259   0.072474 -0.45453  -0.14681   0.09392   0.32735  -0.68834
 -0.020972  0.21344  -0.63178   1.3292  ]
Remote:  [ 0.1125    0.38049   0.40254   0.089511  0.58018   0.067418  0.41842
 -0.46138   1.8756    0.99621   0.19743   0.27248  -2.7099   -0.65497
  0.54752   1.0199    0.58964   0.1559    0.93753  -0.28045  -0.8659
 -0.88299   0.85855  -0.39055  -1.0604  ]


Now we can calculate the cosine similarity of two vectors $w$ and $v$ by solving $w\cdot v = \|w\| \|v\| \cos(\theta)$ for $\cos(\theta)$:
$$
\cos(\theta) = \frac{w\cdot v}{\|w\| \|v\|}
$$

In [9]:
screen@rock/(np.linalg.norm(screen)*np.linalg.norm(rock))

0.5682209

In [10]:
remote@rock/(np.linalg.norm(remote)*np.linalg.norm(rock))

0.4561119

In [11]:
screen@remote/(np.linalg.norm(screen)*np.linalg.norm(remote))

0.86155677

Based on these embeddings, we can see that screen and remote are the most similar (about 0.86), remote and rock are the least similar (about 0.46), and screen and rock fall in between (about 0.57). The library also provides the ability to search the database for the most similar words to a given word.

In [12]:
glove_vectors.most_similar('remote')

[('keyboard', 0.9022769331932068),
 ('monitor', 0.9000157713890076),
 ('plug', 0.8990083932876587),
 ('engine', 0.8985315561294556),
 ('switch', 0.896084725856781),
 ('device', 0.888617753982544),
 ('charger', 0.8855589032173157),
 ('counter', 0.8843674063682556),
 ('automatic', 0.878197431564331),
 ('sync', 0.877406656742096)]
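
Conceptually, a search like this ranks every other word by its cosine similarity to the target. The sketch below illustrates the idea on a tiny made-up vocabulary; it is not gensim's actual implementation, and the three-dimensional vectors are hypothetical values chosen for illustration, not GloVe embeddings.

```python
import numpy as np

def cosine_similarity(w, v):
    """cos(theta) = (w . v) / (||w|| ||v||)."""
    return float(w @ v / (np.linalg.norm(w) * np.linalg.norm(v)))

# Hypothetical embeddings, made up for this example.
embeddings = {
    "remote": np.array([1.0, 0.9, 0.1]),
    "keyboard": np.array([0.9, 1.0, 0.2]),
    "rock": np.array([-0.2, 0.1, 1.0]),
}

def most_similar(word, embeddings, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    target = embeddings[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("remote", embeddings))  # "keyboard" ranks above "rock"
```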

The most similar words to 'remote' in this dataset are shown above. Dot products and cosine similarity are key components of technologies which underpin the latest innovations in Large Language Models.
