# DAML Attention, Transformers and LLMs (and all the fancy buzzwords)

Generative AI, from massive models to agents.
These are models trained on such a plethora of data
that they can solve distinct problems without re-training.
One can argue that by solving language understanding
the models have indirectly learned solutions to many
problems, simply by leveraging the language structure in the training data.

That is all well and good: training on sequences of words
worked much better than anyone expected.  And we have probably heard
at some point that it is the transformer architecture that made
training on sequences of words possible.
We will explore a simplified transformer
and see that this architecture both builds
a mapping between elements in a sequence
and provides an easy way to scale the size of these models.
We will need a few packages:

- `pip install torch`
- `pip install transformers`

In [None]:
import math
import torch
from torch import nn
from torch.nn import functional as F
from transformers import AutoTokenizer, AutoConfig

BERT is the first widely adopted transformer model
(after a few small experimental ones).
We will replicate some of its architecture.

The "uncased" part of the model name is the version of BERT
where tokenisation of words ignores case, i.e. "octopus"
and "Octopus" are the same token.

In [None]:
model_name = "bert-base-uncased"

config = AutoConfig.from_pretrained(model_name)
config.vocab_size, config.hidden_size, config.intermediate_size, config.max_position_embeddings, config.num_attention_heads

The config of the model tells us how it is parametrized.
The most relevant configuration values are:

- `vocab_size`: the number of distinct tokens (words and word pieces) the tokenizer knows,
  including a few extra special tokens.
- `hidden_size`: the size of the embedding vector for each token.
- `intermediate_size`: the internal size of the feed-forward (`Linear`) layers after
  the attention blocks; we will ignore this for simplicity.
- `max_position_embeddings`: the context window the model was trained with,
  also the maximum value of a positional token (we ignore position for now).
- `num_attention_heads`: again for simplicity we will use a single attention head;
  a full model has several heads whose outputs are concatenated.
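
To get a feel for these numbers, the sketch below hard-codes the values published for `bert-base-uncased` (an assumption here; the cell above prints the authoritative ones) and derives two quantities from them: the size of the embedding table and the width of a single attention head. The name `bert_base` is purely illustrative.

```python
# Values as published for bert-base-uncased; hard-coded so the example
# runs without downloading the config (treat them as assumptions).
bert_base = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
}

# One embedding vector of length hidden_size per vocabulary entry.
embedding_params = bert_base["vocab_size"] * bert_base["hidden_size"]

# In multi-head attention the hidden vector is split evenly across heads.
head_size = bert_base["hidden_size"] // bert_base["num_attention_heads"]

print(embedding_params)  # 23440896
print(head_size)         # 64
```

So the embedding table alone holds over 23 million parameters before any attention layer is added.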

In [None]:
embedding_layer = nn.Embedding(config.vocab_size, config.hidden_size)
embedding_layer

This is an embedding layer with the same shape as BERT's,
with the difference that our layer has not been trained.

Another thing we need for our example is some text to feed
our simplified LM with.
From Project Gutenberg we take the book Alice in Wonderland;
it has more than enough text for a few examples.

In [None]:
with open("./lewis-carol-alice.txt", "r") as f:
    text = f.read()

print(text[512:1024])

Since token ids index into a vocabulary of `vocab_size` entries, we need the same
numbering that BERT itself uses.  Hence we use BERT's own tokenizer.
We also need to keep `max_position_embeddings` in mind: the tokenizer
knows the model's context window and warns us when we produce
more tokens than the context window allows.

For BERT `max_position_embeddings` is 512.
We have not tokenized Alice in Wonderland yet, so we do not know
the number of words in the book.
Instead we will use the common rule of thumb that the average word
in English is 4 characters long.
2048 characters should then result in fewer than 512 words,
since we also need to account for whitespace.
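
The arithmetic behind that estimate can be written down as a small helper. This is only a back-of-the-envelope sketch: `estimate_words` is a hypothetical name, and the 4-character average is the rule of thumb from the text, not a measured value.

```python
def estimate_words(n_chars: int, avg_word_len: int = 4) -> int:
    """Rough word count, assuming each word is followed by one whitespace."""
    return n_chars // (avg_word_len + 1)

print(estimate_words(2048))  # 409
```

Keep in mind that the tokenizer counts tokens, not words: punctuation, word pieces and the special `[CLS]` and `[SEP]` tokens all add to the total, so the real count can be noticeably higher than this estimate.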

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(text[:2048], return_tensors="pt")
inputs.input_ids.size(), inputs.input_ids[0][:100]

The token ids can now be fed through the embedding layer.

The resulting tensor has one row per token in the context (at most 512)
and one column per embedding dimension, with a leading batch dimension of 1.
This tensor is the latent representation inside the LM.

In [None]:
embeddings = embedding_layer(inputs.input_ids)
embeddings.size()

To turn the latent representation back into token ids we need a layer that produces one score per vocabulary entry:

In [None]:
classification_layer = nn.Linear(config.hidden_size, config.vocab_size)
classification_layer

In [None]:
next_tokens = classification_layer(embeddings)
next_tokens.size()

In [None]:
next_token_ids = F.softmax(next_tokens, dim=-1).max(dim=-1).indices
next_token_ids.size()

In [None]:
print(tokenizer.decode(next_token_ids[0]))
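
Picking the most probable token at every position is often called greedy decoding. Since softmax preserves the ordering of the scores, taking the maximum after softmax should select the same ids as taking `argmax` on the raw scores. A minimal check on toy data (the tensor sizes are arbitrary and unrelated to BERT):

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
# Toy logits: batch of 1, 5 positions, vocabulary of 10.
logits = torch.randn(1, 5, 10)

via_softmax = F.softmax(logits, dim=-1).max(dim=-1).indices
via_argmax = logits.argmax(dim=-1)

print(via_softmax.shape)
print(torch.equal(via_softmax, via_argmax))
```

The softmax is still useful whenever we want actual probabilities, for example to sample a token instead of always taking the top one.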

### Now we can add the middle - first: Attention

In [None]:
query, key, value = embeddings, embeddings, embeddings
scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(config.hidden_size)
scores.size()

In [None]:
attention_weights = F.softmax(scores, dim=-1)
attention_weights.size()

In [None]:
attention_outputs = torch.bmm(attention_weights, value)
attention_outputs.size()
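
The three cells above compute scaled dot-product attention, `softmax(Q Kᵀ / sqrt(d)) V`, with query, key and value all set to the embeddings. The same steps on a tiny random tensor make the shapes easier to follow; the sizes below are arbitrary toy values, not BERT's.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
# Toy input: batch of 1, sequence of 4 tokens, embedding size 8.
x = torch.randn(1, 4, 8)

# Similarity of every token with every other token, scaled by sqrt(d).
scores = torch.bmm(x, x.transpose(1, 2)) / math.sqrt(x.size(-1))
weights = F.softmax(scores, dim=-1)   # each row is a distribution over tokens
outputs = torch.bmm(weights, x)       # weighted average of the value vectors

print(scores.shape)     # torch.Size([1, 4, 4])
print(weights.sum(-1))  # every entry is (numerically) 1
print(outputs.shape)    # torch.Size([1, 4, 8])
```

Each output row is a mixture of all token embeddings, which is the "mapping between elements in a sequence" mentioned at the start.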

### And transformer

In [None]:
linear_layer1 = nn.Linear(config.hidden_size, config.hidden_size)

def transformer1(embeddings):
    query, key, value = embeddings, embeddings, embeddings
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(config.hidden_size)
    attention_weights = F.softmax(scores, dim=-1)
    attention_outputs = torch.bmm(attention_weights, value)
    return linear_layer1(attention_outputs)

linear_layer2 = nn.Linear(config.hidden_size, config.hidden_size)

def transformer2(embeddings):
    query, key, value = embeddings, embeddings, embeddings
    scores = torch.bmm(query, key.transpose(1, 2)) / math.sqrt(config.hidden_size)
    attention_weights = F.softmax(scores, dim=-1)
    attention_outputs = torch.bmm(attention_weights, value)
    return linear_layer2(attention_outputs)
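
In the functions above, query, key and value are all the raw input, and the only trainable part is the final linear layer. In the original transformer design each of the three first goes through its own learned linear projection. The following is a minimal sketch of that variant for a single head; the class name `SingleHeadAttention` and the toy sizes are assumptions for illustration, and this is not BERT's actual implementation.

```python
import math
import torch
from torch import nn
from torch.nn import functional as F

class SingleHeadAttention(nn.Module):
    """Hypothetical single-head self-attention with learned projections."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.output = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(x.size(-1))
        weights = F.softmax(scores, dim=-1)
        return self.output(torch.bmm(weights, v))

torch.manual_seed(0)
block = SingleHeadAttention(hidden_size=8)
out = block(torch.randn(1, 4, 8))
print(out.shape)  # torch.Size([1, 4, 8])
```

Because input and output have the same shape, such blocks can be stacked, which is what makes scaling these models straightforward.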

### Put the middle in

In [None]:
def simple_causal_lm(text: str):
    inputs = tokenizer(text, return_tensors="pt")
    embeddings = embedding_layer(inputs.input_ids)
    x1 = transformer1(embeddings)
    x2 = transformer2(x1)
    token_ids = F.softmax(classification_layer(x2), dim=-1).max(dim=-1).indices
    return tokenizer.decode(token_ids[0])

In [None]:
simple_causal_lm(text[:2048])
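
One caveat about the name: the attention used above lets every token look at every other token, including later ones. A causal LM usually hides future positions with a mask before the softmax. A minimal sketch of such a mask on toy data (sizes are arbitrary, and this masking is not part of the functions defined above):

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # toy batch: 1 sequence, 4 tokens, embedding size 8

scores = torch.bmm(x, x.transpose(1, 2)) / math.sqrt(x.size(-1))

# Lower-triangular mask: position i may look at positions 0..i only.
seq_len = x.size(1)
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = F.softmax(scores, dim=-1)
print(weights[0])  # entries above the diagonal are exactly 0
```

The first token can only attend to itself, so its attention weight is 1, and every later token distributes its weight over itself and the tokens before it.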

### We need to train this thing of course

After that it will produce reasonable text.  But how do we train it?
There are several competing ways, but the simplest option is to give half a context window
as input and then compare the output against the full context window.
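
As an illustration of what a training loop could look like, here is a sketch of a closely related and very common scheme, next-token prediction: the model sees all tokens but the last and is scored on predicting each following token. This is a swapped-in technique, not the half-window scheme described above, and the tiny embedding-plus-linear model, the random token ids and all sizes are stand-ins chosen so the example runs quickly.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, hidden_size = 50, 16                    # toy sizes, not BERT's

model = nn.Sequential(
    nn.Embedding(vocab_size, hidden_size),
    nn.Linear(hidden_size, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

token_ids = torch.randint(0, vocab_size, (1, 12))   # stand-in for tokenized text
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # target = next token

losses = []
for _ in range(20):
    logits = model(inputs)                          # (1, 11, vocab_size)
    # cross_entropy expects (N, classes) logits and (N,) class ids
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should go down
```

With the real layers from this notebook, the same loop would hand the parameters of the embedding, transformer and classification layers to the optimizer.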

## What about agentic AI?

Do you remember the game we trained with RL?  It had the actions:

0. Move up
1. Move right
2. Move down
3. Move left

We can map similar actions onto something real, say medicine dosage.

0. Increase dose
1. Decrease dose
2. Treatment successful, stop treatment
3. Ask a human for help

Here the input is a report from the patient and the output is connected to an API
for a dosage system, and to an email address for action `3`.

In [None]:
N_ACTIONS = 4
agent_layer = nn.Linear(config.hidden_size, N_ACTIONS)

def simple_agentic_ai(text: str):
    inputs = tokenizer(text, return_tensors="pt")
    embeddings = embedding_layer(inputs.input_ids)
    x1 = transformer1(embeddings)
    x2 = transformer2(x1)
    # the last layer is the only thing that changes;
    # pool over the sequence so we get one action per text, not one per token
    # (mean pooling is one simple choice)
    action_logits = agent_layer(x2.mean(dim=1))
    return action_logits.argmax(dim=-1)

First we train all layers on text, then we swap in the new last layer, and only then train just that new layer.
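
Training only the new layer is commonly done by freezing the already trained parameters. Below is a minimal sketch with stand-in layers: `pretrained_body` and `new_head` are hypothetical names, and the features and target actions are random placeholders rather than real patient data.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)
hidden_size, n_actions = 16, 4               # toy sizes

# Stand-ins for the already trained layers and the new head.
pretrained_body = nn.Linear(hidden_size, hidden_size)
new_head = nn.Linear(hidden_size, n_actions)

# Freeze the body: no gradients are computed for its parameters.
for param in pretrained_body.parameters():
    param.requires_grad_(False)

# Only the new head is handed to the optimizer.
optimizer = torch.optim.SGD(new_head.parameters(), lr=0.1)

body_before = pretrained_body.weight.detach().clone()
head_before = new_head.weight.detach().clone()

features = torch.randn(8, hidden_size)       # fake pooled text representations
actions = torch.randint(0, n_actions, (8,))  # fake target actions

loss = F.cross_entropy(new_head(pretrained_body(features)), actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(torch.equal(pretrained_body.weight, body_before))  # True
print(torch.equal(new_head.weight, head_before))         # False
```

After one optimizer step the frozen body is unchanged while the head has moved, which is the behaviour we want when reusing a trained LM for a new task.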