# DAML Attention, Transformers and LLMs (and all the fancy buzzwords)

Generative AI, from massive models to agents.
These are models that have been trained on such a plethora of data
that they can solve distinct problems without re-training.
One can argue that, by solving language understanding,
the models have indirectly learned solutions to many
problems just by leveraging the language structure in the training data.

That is all fine and good: training on sequences of words
worked much better than anyone expected.  And we must have heard
at some point that it is the transformer architecture that made
training on sequences of words possible.
We will explore a simplified transformer
and see two things: that this architecture builds
a mapping between elements in a sequence,
and that it provides an easy way to scale the size of these models.
We will need a few packages:

- `pip install torch`
- `pip install transformers`

In [1]:
import math

import torch
from torch import nn
from torch.nn import functional as F
from transformers import AutoConfig, AutoTokenizer

BERT is the first widely adopted transformer model
(after a few small experimental ones).
We will replicate some of its architecture.

The "uncased" part of the model name is the version of BERT
where tokenisation of words ignores case, i.e. "octopus"
and "Octopus" are the same token.

In [2]:
model_name = "bert-base-uncased"

config = AutoConfig.from_pretrained(model_name)
(
    config.vocab_size,
    config.hidden_size,
    config.intermediate_size,
    config.max_position_embeddings,
    config.num_attention_heads,
)

(30522, 768, 3072, 512, 12)

The config of the model tells us how it is parametrized.
The most relevant configuration values are:

- `vocab_size`: the number of distinct tokens the tokenizer knows,
  i.e. the words and word pieces built from the training corpus
  plus a few extra special tokens.
- `hidden_size`: size of the embedding vector for each token.
- `intermediate_size`: internal size of the feed forward (`Linear`) layers after
  the attention blocks; we will ignore this for simplicity.
- `max_position_embeddings`: the context window the model was trained with,
  also the maximum value of a positional token (we ignore position for now).
- `num_attention_heads`: again for simplicity we will use a single attention head;
  a full model has several heads whose outputs are concatenated.
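To get a feel for these numbers, the arithmetic below works out two quantities that follow from the config values printed above: the number of weights in a `vocab_size` by `hidden_size` embedding table, and the width of one head if `hidden_size` is split evenly across the heads. The variable names are illustrative only.

```python
# Values copied from the bert-base-uncased config printed above.
vocab_size = 30522
hidden_size = 768
num_attention_heads = 12

# An embedding table stores one vector of hidden_size floats per token.
embedding_parameters = vocab_size * hidden_size

# Multi-head attention splits the hidden vector evenly across the heads.
head_size = hidden_size // num_attention_heads

print(f"embedding parameters: {embedding_parameters:,}")
print(f"size of one attention head: {head_size}")
```

The embedding table alone already holds tens of millions of weights, which is why the vocabulary size matters so much for small models.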

In [3]:
embedding_layer = nn.Embedding(config.vocab_size, config.hidden_size)
embedding_layer

Embedding(30522, 768)

This is an embedding layer just like the one in BERT,
with the difference that our layer has not been trained.
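What an embedding layer does is easier to see at toy scale. The sketch below uses made-up sizes (10 tokens, 4 values per token) instead of BERT's, and shows that the layer is a plain lookup: each token id selects one row of the weight matrix.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes (assumptions): 10 tokens in the vocabulary, 4 values per token.
toy_embedding = nn.Embedding(10, 4)

# One "sentence" of three token ids; token 3 appears twice.
token_ids = torch.tensor([[1, 3, 3]])
vectors = toy_embedding(token_ids)

print(vectors.size())  # one 4-value vector per token
# The lookup copies rows of the weight matrix, nothing more.
print(torch.equal(vectors[0, 0], toy_embedding.weight[1]))
# The same token id always yields the same vector.
print(torch.equal(vectors[0, 1], vectors[0, 2]))
```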

Another thing we need for our example is some text to feed
our simplified LM with.
From Project Gutenberg we take the book Alice in Wonderland;
it has more than enough text for a few examples.

In [4]:
with open("./lewis-carol-alice.txt", "r") as f:
    text = f.read()

print(text[512:1024])

ng this eBook.

Title: Alice’s Adventures in Wonderland

Author: Lewis Carroll

Release Date: January, 1991 [eBook #11]
[Most recently updated: October 12, 2020]

Language: English


Produced by: Arthur DiBianca and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***

[Illustration]




Alice’s Adventures in Wonderland

by Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0

Contents

 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.  


Since `vocab_size` enumerates tokens, we need to know the numbers that
BERT itself assigns to them.  Hence we use BERT's own tokenizer.
We also need to know the value of `max_position_embeddings`, as the tokenizer
has a safety feature that guards against feeding the model
more tokens than fit in the context window.

For BERT `max_position_embeddings` is 512.
We have not tokenized Alice in Wonderland yet, so we do not know
the number of tokens in the book.
Instead we will use the common rule of thumb that the average word
in English is about 4 characters long.
With whitespace that is about 5 characters per word,
so 2048 characters will result in roughly 2048 / 5 ≈ 410 words, fewer than 512
(the tokenizer may split a word into several tokens, so the token count can be somewhat higher).

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(text[:2048], return_tensors="pt")
inputs.input_ids.size(), inputs.input_ids[0][:100]

(torch.Size([1, 450]),
 tensor([  101,  1996,  2622,  9535, 11029, 26885,  1997,  5650,  1521,  1055,
          7357,  1999, 20365,  1010,  2011,  4572, 10767,  2023, 26885,  2003,
          2005,  1996,  2224,  1997,  3087,  5973,  1999,  1996,  2142,  2163,
          1998,  2087,  2060,  3033,  1997,  1996,  2088,  2012,  2053,  3465,
          1998,  2007,  2471,  2053,  9259, 18971,  1012,  2017,  2089,  6100,
          2009,  1010,  2507,  2009,  2185,  2030,  2128,  1011,  2224,  2009,
          2104,  1996,  3408,  1997,  1996,  2622,  9535, 11029,  6105,  2443,
          2007,  2023, 26885,  2030,  3784,  2012,  7479,  1012,  9535, 11029,
          1012,  8917,  1012,  2065,  2017,  2024,  2025,  2284,  1999,  1996,
          2142,  2163,  1010,  2017,  2097,  2031,  2000,  4638,  1996,  4277]))

Tokenized words can now be fed through the embedding layer.

The resulting matrix has size number of tokens in context (max 512)
times size of embeddings.
This matrix is the latent representation inside the LM.

In [6]:
embeddings = embedding_layer(inputs.input_ids)
embeddings.size()

torch.Size([1, 450, 768])

To turn the latent representation back into a tokenized representation (enum)
we just need a matrix that looks like the transpose of the
embedding layer.

If this reminds anyone of autoencoders,
it is not a coincidence.
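The "transpose of the embedding layer" remark can be taken literally. The sketch below is an illustration under assumptions, not what the notebook does next: it ties a decoding `Linear` layer to the very same weight matrix as a toy embedding (rows normalised to unit length), and checks that decoding an embedded token gives back the original token id.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

vocab_size, hidden_size = 50, 16  # toy sizes (assumptions)

embedding = nn.Embedding(vocab_size, hidden_size)
with torch.no_grad():
    # Unit-length rows: each row then matches itself better than any other row.
    embedding.weight.copy_(F.normalize(embedding.weight, dim=-1))

# nn.Linear stores its weight as [out_features, in_features], which is the
# same shape as the embedding table, so the two layers can share one tensor.
decoder = nn.Linear(hidden_size, vocab_size, bias=False)
decoder.weight = embedding.weight

token_ids = torch.tensor([[3, 14, 15, 9, 2]])
logits = decoder(embedding(token_ids))
recovered = logits.argmax(dim=-1)

print(logits.size())
print(recovered)
```

With the weights tied, the encode-then-decode round trip is the identity on token ids, which is the autoencoder picture in its simplest form.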

In [7]:
classification_layer = nn.Linear(config.hidden_size, config.vocab_size)
next_tokens = classification_layer(embeddings)

classification_layer, next_tokens.size()

(Linear(in_features=768, out_features=30522, bias=True),
 torch.Size([1, 450, 30522]))

We have something very similar to the enum/one-hot of tokens
as the output of our autoencoder language model.

The difference is that we do not have a one-hot-encoded matrix.
Instead we get a matrix with real values everywhere.
A `softmax` transformation followed by picking the largest value will take care of it.

In [8]:
next_token_ids = F.softmax(next_tokens, dim=-1).max(dim=-1).indices
next_token_ids.size()

torch.Size([1, 450])

`softmax` behaves like a soft version of `max`: it turns the real values
into probabilities that favour the largest value, and it is a differentiable function.
The fact that it is differentiable allows us to use some form
of Gradient Descent on the output directly.
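A small check of both properties, on three hand-picked values: `softmax` keeps the winner where `max` would put it, the probabilities sum to one, and a gradient flows back to the inputs.

```python
import torch
from torch.nn import functional as F

logits = torch.tensor([1.0, 2.0, 5.0], requires_grad=True)
probabilities = F.softmax(logits, dim=-1)

# The largest input also gets the largest probability.
same_winner = probabilities.argmax().item() == logits.argmax().item()
total = probabilities.sum().item()

# Differentiable: ask how the winning probability reacts to each input.
probabilities[2].backward()
gradient = logits.grad

print(probabilities)
print(same_winner, total)
print(gradient)
```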

And we now have a new set of tokens that can be transformed
back into words by walking through the enum of the tokenizer.

In [9]:
print(tokenizer.decode(next_token_ids[0]))

mesh panda rider soothing keyboardist dumping unrelated would intentions strategies protect iii¹ resumed circleworld [unused129]cated dumping rhythms recognized panda [unused737] unrelated cyprus [unused195] iii panda object ₊ sweep gunslingeryson prior unrelated panda equestrian cote overnightware sweep honored pharmaceuticals overnight impressive trim madam trough [unused666] westwood£ resumed ن£ necessary nationaltrom enigma [unused737]£⁷ pandazed unrelated panda rider soothing keyboardist 30 dale honoredcated dumping national oblast cote download madam soothing keyboardist madam genus madam ate trougheto tremendous tod iii panda object ₊ resumed trough bellevue quiz expected enhancement panda cushion unrelated pandagarhodon trougheto tod gunmen findcated dumping madam chopped bombed would intentions strategies protect iii¹ hire bombedworld [unused129]®ood bombed reduce resumedyse horribly dumping knockoutvanahini horribly gunslinger shout loyal bombed [unused764]ese resumed timeles

The output is complete junk!  That's expected, we never trained this thing.

But it also gives us one insight into how we can train it.
We have words (tokens) as input and we have words (tokens) as output.
So long as we use the same tokenizer on the input and output
we do not need to worry about the internals!
What the language model will do once trained is entirely dependent
on how we set up the training regime.

If we make the output match text further down in the
Alice in Wonderland book, and backpropagate the differences,
then we will have an LM that tells stories (most LLMs are trained that way).
If we make the output match a summary of the story
then we will have a model great at summarising text.
If we match against a translation we get a translator.

### Now we can add the middle - first: Attention

Yet LLMs often do the tasks described above all at once.
A simple autoencoder-style LM does not have enough parameters
to handle many tasks.  We need to add more parameters, more layers.

Enter the transformer layer.  The clever way to scale LMs
almost without limit.  The one thing that transforms LMs into LLMs.

In [10]:
query, key, value = embeddings, embeddings, embeddings
scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(config.hidden_size)
scores.size()

torch.Size([1, 450, 450])

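In general the transformer layer builds a self-attention mapping
within the latent values inside the LM.

We process the token embeddings against each other.
If we take the dot product of every pair of token embeddings we get large
scores where two embeddings have large values in the same places.
Dividing by the square root of the embedding size keeps the scores
from growing with the size of the embeddings.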
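The same score calculation on three hand-written embeddings of size 2 makes the pattern visible. The values are made up for illustration: tokens 0 and 1 point in the same direction, token 2 points elsewhere.

```python
import math

import torch

# Three toy token embeddings of size 2 (values chosen by hand).
toy_embeddings = torch.tensor([[[1.0, 0.0],
                                [1.0, 0.0],
                                [0.0, 1.0]]])

# Every token against every other token, scaled by sqrt(embedding size).
toy_scores = torch.bmm(toy_embeddings, toy_embeddings.transpose(1, 2))
toy_scores = toy_scores / math.sqrt(toy_embeddings.size(-1))

print(toy_scores[0])
```

Tokens 0 and 1 score high against each other and zero against token 2, and the score matrix is symmetric because query and key are the same tensor here.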

In [11]:
attention_weights = F.softmax(scores, dim=-1)
attention_weights.size()

torch.Size([1, 450, 450])

`softmax` once again smooths these attention scores.
For each token it turns the scores into weights that sum to one,
a soft mask in which only the high scores stand out.

Multiplying by that mask mixes the original
embedding values according to the attention weights.
Each token embedding is replaced by a weighted average of all
token embeddings, dominated by those with which its attention is high.
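A two-token example with hand-picked scores shows the weighted average at work. The numbers are assumptions for illustration only: token 0 attends almost only to itself, token 1 attends equally to both tokens.

```python
import torch
from torch.nn import functional as F

# Hand-picked attention scores for 2 tokens.
toy_scores = torch.tensor([[[4.0, 0.0],
                            [1.0, 1.0]]])
# Hand-picked value vectors, one per token.
toy_value = torch.tensor([[[10.0, 0.0],
                           [0.0, 10.0]]])

toy_weights = F.softmax(toy_scores, dim=-1)
toy_outputs = torch.bmm(toy_weights, toy_value)

print(toy_weights[0])  # each row sums to one
print(toy_outputs[0])  # each row is a mix of the value rows
```

Token 0 keeps nearly all of its own value vector, while token 1 ends up exactly halfway between the two.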

In [12]:
attention_outputs = torch.bmm(attention_weights, value)
attention_outputs.size()

torch.Size([1, 450, 768])

### And transformer

Finally we add the actual parameters, a simple NN layer.
The entire trick is that this layer does not operate
on the embeddings but on the attention-transformed embeddings.

In [13]:
linear_layer1 = nn.Linear(config.hidden_size, config.hidden_size)


def transformer1(embeddings: torch.Tensor) -> torch.Tensor:
    query, key, value = embeddings, embeddings, embeddings
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(
        config.hidden_size
    )
    attention_weights = F.softmax(scores, dim=-1)
    attention_outputs = torch.bmm(attention_weights, value)
    return linear_layer1(attention_outputs)


linear_layer2 = nn.Linear(config.hidden_size, config.hidden_size)


def transformer2(embeddings: torch.Tensor) -> torch.Tensor:
    query, key, value = embeddings, embeddings, embeddings
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(
        config.hidden_size
    )
    attention_weights = F.softmax(scores, dim=-1)
    attention_outputs = torch.bmm(attention_weights, value)
    return linear_layer2(attention_outputs)
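The two functions above differ only in the linear layer they use. As a sketch of an alternative layout, the same computation can be wrapped in an `nn.Module` so that each instance owns its parameters. The class name `SimpleTransformerLayer` and the toy sizes are assumptions made for this example.

```python
import math

import torch
from torch import nn
from torch.nn import functional as F


class SimpleTransformerLayer(nn.Module):
    """Parameter-free self-attention followed by one linear layer."""

    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        query, key, value = embeddings, embeddings, embeddings
        scores = torch.bmm(query, key.transpose(1, 2))
        scores = scores / math.sqrt(embeddings.size(-1))
        attention_weights = F.softmax(scores, dim=-1)
        attention_outputs = torch.bmm(attention_weights, value)
        return self.linear(attention_outputs)


# Toy sizes (assumptions) so the sketch runs without downloading anything.
hidden_size = 32
stack = nn.Sequential(*[SimpleTransformerLayer(hidden_size) for _ in range(3)])

toy_embeddings = torch.randn(1, 5, hidden_size)
outputs = stack(toy_embeddings)
n_parameters = sum(p.numel() for p in stack.parameters())

print(outputs.size(), n_parameters)
```

Adding a layer is then a matter of changing one number, and the parameter count grows linearly with the number of layers.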

### Put the middle in

Since each transformer layer has input and output
of the same latent size, we can stack as many of those as we want.
Stacking many transformer layers is exactly how LLMs scale.

In [14]:
def simple_causal_lm(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    embeddings = embedding_layer(inputs.input_ids)
    x1 = transformer1(embeddings)
    x2 = transformer2(x1)
    token_ids = F.softmax(classification_layer(x2), dim=-1).max(dim=-1).indices
    return tokenizer.decode(token_ids[0])

### A note on multi-head attention

The use of the `softmax` in the attention layer above is quite oversimplified.
A single set of attention scores and a single `softmax` calculation will
force one single main attention relationship for each input token.
This is not how language works.  There may be more than a single main
relationship between the words.

In a modern implementation the attention scores are computed on subsets
of the embeddings, and each subset gets its own `softmax` calculation.
The outputs are then concatenated.  This is called multi-head attention,
where each head is a calculation on one subset of the embeddings.
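A minimal sketch of the idea, assuming the common formulation in which the hidden vector is cut into equal slices and the per-head outputs are concatenated back to the full hidden size. Real implementations also apply learned projections to query, key and value, which are left out here; the function name is made up for this example.

```python
import math

import torch
from torch.nn import functional as F


def multi_head_self_attention(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    batch, seq_len, hidden = x.shape
    head_size = hidden // num_heads
    # Cut the hidden vector into slices: [batch, heads, seq_len, head_size].
    heads = x.reshape(batch, seq_len, num_heads, head_size).transpose(1, 2)
    # Each head gets its own scores and its own softmax.
    scores = heads @ heads.transpose(-2, -1) / math.sqrt(head_size)
    weights = F.softmax(scores, dim=-1)
    per_head = weights @ heads
    # Concatenate the heads back into one vector per token.
    return per_head.transpose(1, 2).reshape(batch, seq_len, hidden)


torch.manual_seed(0)
toy = torch.randn(1, 6, 24)  # toy sizes (assumptions)
four_heads = multi_head_self_attention(toy, num_heads=4)
one_head = multi_head_self_attention(toy, num_heads=1)

print(four_heads.size())
```

With `num_heads=1` the function reduces to the single-head calculation used in the rest of this notebook.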

In [15]:
simple_causal_lm(text[:2048])

'witches debuting amateur into links revelation [unused786] shaggy 1928ignon nova alba afforded outdoorlde twists sicily 216 revelation commencing interference debutingnsis [unused786] ब ɹ alba debuting plastics ⟨ 720 goth preserve 1860 [unused786] debuting schedule zack bullyital 720ela parenting bullyisance年 docking むcarbon mesh tax outdooraki tax darrell classroom wipe teamsnsis tax breed debuting naturally [unused786] debuting amateur into links mammoth alumnusela 216 revelation classroom bulb zack warner docking into links dockingcuit docking mayer む lucy pacific daphne alba debuting plastics ⟨ outdoor むら folding [unused161] grazed debuting [unused521] [unused786] debuting wetland equations む lucy daphne champagne pacing 216 revelation docking barnsley lobby shaggy 1928ignon nova alba afforded lithium lobby twists sicily badminton suez lobby nikolai outdoor diego thousand revelation清غ worthless thousand gothʑ [unused117] lobby disclose awards outdoor over worthless pouring lobby །

### We need to train this thing of course

After that it will produce reasonable text.  But how do we train it?
There are several competing ways, but the simplest option is to give half a context window
as input and then compare the output against the full context window.
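As a rough sketch of what a training loop could look like, the example below uses a closely related and very common variant: the target is the input shifted by one token, so the model learns to predict the next token. The toy sizes, the token ids and the choice of the Adam optimizer are all assumptions; the transformer layers are left out to keep it short.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# Toy sizes and token ids (assumptions, not BERT's real values).
vocab_size, hidden_size = 20, 16
token_ids = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8]])

toy_embedding = nn.Embedding(vocab_size, hidden_size)
toy_classifier = nn.Linear(hidden_size, vocab_size)
parameters = list(toy_embedding.parameters()) + list(toy_classifier.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.01)

# Input is the text without its last token, target is the text shifted by one.
lm_inputs, lm_targets = token_ids[:, :-1], token_ids[:, 1:]

losses = []
for step in range(50):
    logits = toy_classifier(toy_embedding(lm_inputs))
    # cross_entropy expects [n_items, n_classes] logits and [n_items] targets.
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), lm_targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Note that `cross_entropy` applies the `softmax` internally, so the raw output of the classification layer goes in directly.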

## What about agentic AI?

Do you remember the game we trained with RL?  It had the actions:

0. Move up
1. Move right
2. Move down
3. Move left

We can map similar actions onto something real, say medicine dosage.

0. Increase dose
1. Decrease dose
2. Treatment successful, stop treatment
3. Ask a human for help

Here the input is a report from the patient and the output is connected to an API
for a dosage system, and to an email address for action `3`.
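The wiring between action numbers and the outside world could be sketched as below. Everything here is hypothetical: the `Action` enum, the `dispatch` function and the returned messages stand in for a real dosage API and a real email client.

```python
from enum import IntEnum


class Action(IntEnum):
    INCREASE_DOSE = 0
    DECREASE_DOSE = 1
    STOP_TREATMENT = 2
    ASK_HUMAN = 3


def dispatch(action_id: int) -> str:
    """Translate a model output into a description of the side effect.

    A real system would call the dosage API or send an email here;
    this hypothetical stand-in only returns a message.
    """
    action = Action(action_id)  # raises ValueError for ids outside 0..3
    if action is Action.ASK_HUMAN:
        return "email: please review this patient report"
    return f"dosage API call: {action.name.lower()}"


for action_id in range(4):
    print(action_id, dispatch(action_id))
```

Rejecting anything outside the four known ids is a deliberate choice: the model's output should never reach the dosage system unchecked.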

In [17]:
N_ACTIONS = 4
agent_layer = nn.Linear(config.hidden_size, N_ACTIONS)


def simple_agentic_ai(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    embeddings = embedding_layer(inputs.input_ids)
    x1 = transformer1(embeddings)
    x2 = transformer2(x1)
    # the last layer is the only thing that changes
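    action_scores = F.softmax(agent_layer(x2), dim=-1)
    # average the per-token scores so the whole text yields a single action
    # (mean pooling is an assumption here, other pooling choices would work too)
    return action_scores.mean(dim=1).argmax(dim=-1)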

We first train all layers as a language model, then swap in the changed last layer, and only then train just the new layer.
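One way to train only the new layer in PyTorch is to freeze the parameters of the trained part and hand the optimizer just the new ones. The sketch below uses small stand-in layers, random features and random labels (all assumptions) and checks that a training step moves the new layer while the frozen one stays put.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

hidden_size, n_actions = 16, 4  # toy sizes (assumptions)

# Stand-ins for the already trained layers and the freshly added last layer.
trained_body = nn.Linear(hidden_size, hidden_size)
new_head = nn.Linear(hidden_size, n_actions)

# Freeze the trained part: no gradients, hence no updates.
for parameter in trained_body.parameters():
    parameter.requires_grad_(False)

# The optimizer only ever sees the parameters of the new layer.
optimizer = torch.optim.SGD(new_head.parameters(), lr=0.1)

body_before = trained_body.weight.detach().clone()
head_before = new_head.weight.detach().clone()

# One training step on a made-up batch: 8 inputs with random action labels.
features = torch.randn(8, hidden_size)
labels = torch.randint(0, n_actions, (8,))
loss = F.cross_entropy(new_head(trained_body(features)), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

body_unchanged = torch.equal(trained_body.weight, body_before)
head_changed = not torch.equal(new_head.weight, head_before)
print(body_unchanged, head_changed)
```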