# Module 1 - Transformer Architecture: Attention & Transformer Fundamentals

## Introduction to Transformers

> **Introduction to Transformers**

Transformers are the main neural network architecture used in virtually all large language models today. They were first published in 2017 in the paper "Attention Is All You Need", and since then most large language models, including GPT (OpenAI), have been some kind of variation on them. Before Transformers existed, deep learning used many different model designs (combinations of internal layers).

After Transformers, at least for natural language processing (NLP), most model designs have followed the same basic building block. The focus has shifted to different training techniques, ways of generating data for them, and so on. No one has really made massive changes to the underlying architecture, so it's a pretty powerful architecture.

It allows the model to learn many different kinds of interactions between different aspects of its input, and it can be stacked on itself to different depths to get models of different quality. Even though there are variations today that make certain things faster or cheaper, the basic building blocks are the same. It's important to understand them in detail so that you can both interpret what the models are doing and evaluate new changes to the architecture.

The big innovation that unlocked the power of large language models was something called the **"attention mechanism"**. Attention, as the word implies, allows a computer, or a Transformer in this case, to **see how one word relates to the others in a sequence. It gives a score for how important every other word in the sequence is to the current word**.

![Screen Shot 2024-03-19 at 10.25.35.png](attachment:60d43075-4c55-4ac4-89a8-ffd1ad1941af.png)

> **The Transformer Block - Try to Predict the Next Word**

Transformers, like most language models, are **used to try to predict the next word**. The way Transformers do this is different from the way most other language models do it, but they still work on the same underlying principle. They **take in a sequence of tokens and change the information inside that sequence so that it can be used to predict the next word or token from the vocabulary**.

In a Transformer we **take the input tokens and convert them to word embeddings, so we have a sequence of word embedding vectors. We then go through a series of enrichment phases in which those vectors are transformed and build in more and more context, so that when we finally give them to our softmax classification (or prediction) layer it has a lot of information to work with**.
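
As a rough illustration of the first step, the snippet below turns tokens into a sequence of embedding vectors. It is a minimal sketch: the vocabulary, the embedding size, and the random values are all made up for the example, whereas a trained model learns its embedding table.

```python
import random

# Hypothetical toy vocabulary; real models use tens of thousands of subword tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d_model = 4  # embedding size (D), kept tiny for readability

random.seed(0)
# One row per vocabulary entry. In a trained model these values are learned.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    """Map a list of tokens to a list of embedding vectors (one per token)."""
    return [embedding_table[vocab[t]] for t in tokens]

sequence = embed(["the", "cat", "sat"])
print(len(sequence), len(sequence[0]))  # number of vectors, length of each vector
```

The result has one vector per input token, and every vector has the same length, which is the shape that the Transformer blocks keep working on.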


> **How Does the Process of a Transformer Block Work?**

If we look at the Transformer blocks, in which the tokens are enriched and then passed on to the next block, we can see what actually happens in the schematic.

If we're **considering a Transformer with just one Transformer block, the process would look something like this:**
- We would take an input sequence of tokens, transform them into word vectors so that we would have a series of word vectors.
- We would then add extra information, like positional encoding which we'll talk about in a moment, which gives information about the relative positions of each token to each other token in that sequence.
- And then we pass that into the Transformer Block.
- The **goal of the Transformer block is to enrich every token in that sequence with as much contextual information as possible. It does this through both the attention mechanism and with the transformations using a neural network**.
- We then process it further by adding a residual connection and normalizing the vectors in the sequence. Those vectors are then used in a linear and softmax combination at the output of our Transformer to try to predict the next token or classify the sequence.
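
Written as equations, one common arrangement of these steps looks like the following. This is a sketch of the post-normalization layout used in the original Transformer; other models place the normalization differently, and the symbol names ($X_0$, $X'$, $X_1$, $W_{\text{out}}$) are chosen here only for illustration.

$$
\begin{aligned}
X_0 &= \mathrm{Embed}(\text{tokens}) + \mathrm{PositionalEncoding} \\
X' &= \mathrm{LayerNorm}\big(X_0 + \mathrm{Attention}(X_0)\big) \\
X_1 &= \mathrm{LayerNorm}\big(X' + \mathrm{FFN}(X')\big) \\
P(\text{next token}) &= \mathrm{softmax}\big(x_{\text{last}}\, W_{\text{out}}\big)
\end{aligned}
$$

Here $X_0$, $X'$, and $X_1$ all have the same shape (one vector of size $D$ per token), and $x_{\text{last}}$ is the row of $X_1$ for the final position in the sequence.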

Most Transformers have many Transformer blocks, often dozens, but the process is exactly the same. At the end of each Transformer block the sequence, which now bears little resemblance to the natural language input that we started with but still has the same size and format, is passed on to the next Transformer block.

![Screen Shot 2024-03-19 at 10.46.40.png](attachment:59776df1-aa45-46a8-b3c8-376423be0f15.png)

> **Most Important Stages of the Transformer Block**

Let's look step-by-step at how these
preparation, enrichment, and prediction
stages work.

- **Stage 1: Attention Mechanism (Linear Transformation)**

So firstly we have the all-important
attention mechanism.

**The role of attention is to measure
the importance and relevance of each
word compared to each other word
in a sequence**.
This concept gets stretched slightly as we go from one block to the next. Traditional Transformer architectures might have dozens, if not a hundred or more, Transformer blocks, so after the first block you're not really comparing one word to another. You are still looking at the same sequence, with the same positions that each token started with, but the vectors now carry extra contextual information given to them by the previous blocks, and each block processes the sequence differently.
By doing this, **by adding more and more
layers of Transformer blocks we're able
to enrich these vectors and look deeper
into how the sequence interacts with
each other**.
Typical sequence lengths are on the order of many thousands, if not tens of thousands, of tokens these days, so there is a lot of information to be processed, and a single Transformer block is unable to capture all of the information in, say, a multi-paragraph context.
In addition, **attention is built mostly from linear operations (matrix multiplications, with a softmax to normalize the scores), as we'll see in just a moment, so on its own it adds little of the non-linear depth we associate with deep learning. In fact, Transformers are often discussed in terms of attention rather than deep layers of neurons, yet the majority of the parameters inside a large language model are taken up by the feed forward network that's used in each block**.

- **Stage 2: Position Wise Neural Network (Non Linear Transformation)**

Now, the feed forward neural network in a Transformer operates slightly differently from what you might expect. As we said before, **the tokens that are given to the Transformer are turned into word embeddings with a particular dimensionality, let's say 100. That means we'll have a sequence of vectors, each of them 100 dimensions long**.

The **"position-wise feed forward neural network"**, as it's referred to in the Transformer, will be 100 neurons wide at the input, and **this means that we pass each token vector to the neural network one by one. The weights and the structure of that neural network are identical every time it's applied to a vector in the sequence. So this position-wise network, meaning it is applied at each position in turn, transforms every token into the right format to be given to the next block in the Transformer or to the output block** at the end of the Transformer.

**This allows for non-linear transformations and also enables the Transformer to build up different levels of complexity and understanding, from how a noun and a verb might be related at the lower Transformer blocks all the way up to the sentiment of the context** at the final Transformer blocks.
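
To make "position-wise" concrete, here is a minimal sketch of a feed forward network applied to every vector in a sequence with one shared set of weights. The sizes, the random weights, and the choice of ReLU as the non-linearity are assumptions made for the example (ReLU is what the original Transformer used; other models use different activations).

```python
import random

random.seed(0)
d_model, d_ff = 4, 8  # D and I; toy sizes chosen for illustration

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# One shared set of weights, reused at every position in the sequence.
W1, b1 = rand_matrix(d_model, d_ff), [0.0] * d_ff
W2, b2 = rand_matrix(d_ff, d_model), [0.0] * d_model

def linear(x, W, b):
    """Compute x @ W + b for a single vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def feed_forward(x):
    """Expand to d_ff, apply ReLU (the non-linearity), project back to d_model."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

sequence = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [0.1, 0.2, 0.3, 0.4]]
# "Position-wise": the same function is applied to each token vector independently.
output = [feed_forward(x) for x in sequence]
```

Because the first and third input vectors are identical and the weights are shared, their outputs are identical too: the network has no access to the other positions, which is why attention is needed to mix information between tokens.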


- **Stage 3: Residual Connections and Normalization Layers (Backwards Propagation, Uninterrupted Signal & Stability)**

Another really important part of the Transformer block architecture is the residual connections and layer normalizations. **The residual connections in particular are very important because they allow gradients to flow freely backward during back propagation, and they also make sure that the signal of the input sequence isn't lost during the processing of these vectors as they become more and more enriched. We have an uninterrupted pathway for the original structure of the input sequence to go all the way through the Transformer**.

The **layer normalization is also vitally important: Transformers typically take a long time to train, and layer normalization helps keep that training stable**.
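
The "add and normalize" step can be sketched in a few lines. This is a simplified illustration: real layer normalization also applies a learned scale and shift, which are left out here, and the input numbers are invented.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance.
    Real layers also apply a learned scale and shift, omitted here."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    summed = [a + b for a, b in zip(x, sublayer_output)]
    return layer_norm(summed)

x = [1.0, 2.0, 3.0, 4.0]                  # vector entering the sub-layer
sublayer_output = [0.5, -0.5, 0.5, -0.5]  # stand-in for attention or feed forward output
y = add_and_norm(x, sublayer_output)
```

The residual addition keeps the original vector `x` in the result, so the sub-layer only has to learn a correction to it, and the normalization keeps the values in a consistent range from block to block.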

> **Input & Output of Transformer Block**

- *Input*: The **input is a series of natural language tokens that are converted to word embeddings. To make sure that we preserve the order of those tokens in the sequence, we also attach a type of positional encoding to our word embeddings**. There are a number of different types of positional encodings, and we will see some examples later in Module 3. **Once our tokens have been turned into word embeddings with positional encodings, we pass them into the Transformer blocks, which add different types of enrichment, complexity, and hopefully understanding to the vectors in the sequence and then pass them to the output of the Transformer**.

- *Output*: At the **output of the Transformer we have our vocabulary and a linear layer that, using the softmax function, either selects the next token to be generated based on the sequence of vectors that we've been building up in our Transformer blocks, or classifies the sequence using a classification scheme developed for the particular application**. There are a number of different ways to use the Transformer blocks that we've been describing in this section, and in the next section we'll talk about some of those approaches. They include **encoder models, where we don't generate any new tokens; decoder models, where we focus only on generating the next token; and encoder-decoder models, where we take one sequence in and output a completely different sequence based on the task**.
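
The output step can be illustrated with a small sketch: a linear projection from the final vector to one score per vocabulary entry, followed by a softmax. The vocabulary, the vector, and the matrix `W_out` are hand-picked, hypothetical values; in a real model the projection is learned and the vocabulary is far larger.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["the", "cat", "sat", "mat"]  # hypothetical vocabulary (V = 4)
final_vector = [0.2, -0.4, 0.9]       # output vector at the last position (D = 3)
# Output projection of shape D x V; learned in a real model, hand-picked here.
W_out = [[0.1, 0.3, -0.2, 0.0],
         [0.0, -0.5, 0.4, 0.1],
         [0.2, 0.1, 1.0, -0.3]]

logits = [sum(final_vector[i] * W_out[i][j] for i in range(len(final_vector)))
          for j in range(len(vocab))]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice: the most probable token
```

Picking the most probable token is only one option; sampling from `probs` instead gives more varied output.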

> **Different Types of Transformers Architecture**

There are many ways to construct different types of Transformer depending on how we organize these Transformer blocks. We're going to look at the common architectures that we see with Transformers and the different use cases and innovations that they involve.

The main types are the following:
- Encoder-only Transformers
- Decoder-only Transformers
- Encoder-decoder Transformers

> **Encoder-Decoder Model**

![Screen Shot 2024-03-19 at 12.00.53.png](attachment:6a6696bc-5c5e-4324-bbee-59cc8562ca13.png)


In the original Transformer paper, titled "Attention Is All You Need", the researchers from Google presented an architecture based on an encoder-decoder approach. The reason for this is that they wanted to do machine translation, for example from English to German. The **goal there was to input a sequence of English tokens and output a translated German sequence at the end. They achieved this with a series of encoder blocks, which are regular Transformer blocks as we've seen so far. The English tokens are put in, prepared, and transformed in the way that we saw previously, and the vectors that come out of the last encoder block are then used by the decoder side of the model in an attention mechanism called cross-attention.**

The way this works is that the **model first looks at the words that it has produced so far on the decoder side, and then, at the point where cross-attention is needed, it compares the vector that it has in the middle of its Transformer block with the cross-attention vectors from the encoder side.** We'll look at how attention takes these different types of vectors and combines them in just a moment, but you **can think of it as follows: first the encoder takes the English text and transforms it into enriched vectors, and then the decoder uses those enriched vectors and learns how the German words relate to the English words being translated. So encoder-decoder models typically take one type of sequence and convert it into another, such as another language or a programming language.** This could be translation or conversion, or something in between such as taking English or another natural language as input and outputting code, or converting one programming language to another, or it might be summarization.

**There are a number of different use cases for encoder-decoder models, and they're based on the concept of cross-attention**. We'll dive deeper into what cross-attention is and how it can be used later, when we see the attention mechanism in detail. But **essentially, the encoder provides an extra source of signal for the decoder so that it can achieve the task it is given, and during back propagation the decoder learns to rely on the signals from the encoder to achieve that task**.

> **Encoder-Only Model**

![Screen Shot 2024-03-19 at 12.16.47.png](attachment:7ce62af9-e98a-4d96-8a82-becc61e7fec5.png)

The second part of the Transformer architecture family is the encoder model. Google **produced a second Transformer architecture about a year after the original Transformer was released: Bidirectional Encoder Representations from Transformers, or BERT**. BERT was released with a couple of new innovations. **One was segment embeddings: you could take one sentence, separate it from a second sentence with the [SEP] token, and BERT would be able to compare the two sentences**.
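
As an illustration of how two sentences are combined, the sketch below packs a sentence pair into a single BERT-style input with segment ids. The `[CLS]` token at the start follows BERT's published input format; the helper name `pack_sentence_pair` and the whole-word tokens are hypothetical simplifications (BERT actually works on subword pieces).

```python
def pack_sentence_pair(tokens_a, tokens_b):
    """Join two tokenized sentences into one BERT-style input.

    Returns the token list and a parallel list of segment ids
    (0 for the first sentence, 1 for the second).
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = pack_sentence_pair(["the", "cat", "sat"], ["it", "purred"])
```

Each segment id selects a segment embedding that is added to the token's word embedding, which is how the model can tell which sentence a token belongs to.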

The way they trained BERT was also different: **they would intentionally mask different words in the sentence and train the model to predict them. Training also incorporated next sentence prediction**, meaning that **the model learned to tell whether or not the second sentence actually followed the first sentence that it saw**. In other words, *it could give a true or false answer for whether the first sequence it saw led on to the second sequence or not*.
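
The masking idea can be sketched as follows. This is a simplified illustration, not BERT's exact procedure: BERT selects about 15% of tokens and replaces only most of those with `[MASK]` (some are swapped for random tokens or left unchanged), whereas this hypothetical `mask_tokens` helper always uses `[MASK]`.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    that serves as the prediction targets.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
# A higher rate than the default is used here so a short sentence is likely to get a mask.
masked, targets = mask_tokens(sentence, mask_prob=0.3)
```

During training the model sees `masked` and is scored on how well it recovers the tokens stored in `targets`, using context from both the left and the right of each gap.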

BERT was excellent for fine-tuning and is still widely used for many natural language processing tasks. **BERT use cases: it is excellent for things like question answering, named entity recognition, and other more traditional types of natural language processing tasks**. BERT is still in use today and is much more lightweight than some of the larger models that we typically see.

> **Decoder-Only Models**

![Screen Shot 2024-03-19 at 12.22.59.png](attachment:01695576-e164-4211-a8c8-187a5d649855.png)

The third type of architecture based on the Transformer is the **decoder-only model. The most popular and well-known version of this is GPT (Generative Pre-trained Transformer). GPT is a type of Transformer that, as the name suggests, generates new words. You've probably heard the buzz term 'Generative AI', and GPT is a big part of the reason for that buzzword**.

The **whole aim of a decoder-only model is to try to predict the next word based on the sequence that it's currently processing. GPT takes all of the vectors that it's been working on and enriching and uses the softmax classification layer at the end of the Transformer blocks to try to predict the next token or the next word**.
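
Generating text with a decoder-only model is a loop: predict a token, append it to the sequence, and predict again. The sketch below shows that loop with a hypothetical stand-in for the model (`TOY_SCORES`, a hand-written lookup table keyed on the last token); a real Transformer would compute its scores from the whole sequence.

```python
# Hypothetical stand-in for a trained decoder: maps the last token to scores
# over the vocabulary. A real model would look at the whole sequence.
TOY_SCORES = {
    "<s>": {"the": 2.0, "cat": 0.1, "sat": 0.1, "</s>": 0.0},
    "the": {"the": 0.0, "cat": 1.5, "sat": 0.2, "</s>": 0.1},
    "cat": {"the": 0.1, "cat": 0.0, "sat": 1.8, "</s>": 0.3},
    "sat": {"the": 0.2, "cat": 0.1, "sat": 0.0, "</s>": 2.5},
}

def next_token_scores(tokens):
    return TOY_SCORES[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: take the highest score
        if best == "</s>":                  # stop at the end-of-sequence token
            break
        tokens.append(best)
    return tokens

result = generate(["<s>"])
```

The loop stops either when the model predicts the end-of-sequence token or when `max_new_tokens` is reached, whichever comes first.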

We've seen a huge number of applications based on these GPT or decoder-based models, and you'll probably be familiar with ChatGPT, Bard, Claude, LLaMA, MPT, and the list goes on. We'll be focusing a lot on GPT in this course, but the encoder-decoder models and encoder-only models are also very valuable and worth spending some time becoming familiar with.

> **Important Variables in Transformers**

![Screen Shot 2024-03-19 at 12.28.25.png](attachment:c29d9f7c-0838-4e0b-8440-a8ac9f8afbc7.png)

Now that we're at a point where we can look back over what Transformers are and how we might build them, it's also important to take into account some important variables that we'll hear again and again.

- Input 
    - **Vocabulary Size (V):** is the number of distinct tokens that the Transformer knows. This vocabulary makes up everything the model can read and write, and it lets the model combine tokens together to create new words.
    - **Embedding or Model Size (D):** is a **very important variable in Transformers and relates quite often to the size that the Transformer takes up**. We'll see parameter counts a little later, but **one of the most important variables in determining the parameter count of a model is the embedding or model size, which is the number of dimensions of the word embeddings that we use**. Many of the matrices and the neural network sizes inside the Transformer blocks are directly related to the model or embedding dimension.
    - **Sequence or context length (L):** is **vitally important not so much to the number of parameters that a Transformer has but to the amount of compute that you will need to actually run the Transformer**. We've seen context lengths grow from 512 in the original GPT model all the way up to hundreds of thousands with newer models like Claude.


- Internal Variables
    - **Number of Attention Heads (H):** we'll talk more about attention in the next section; the number of heads used in multi-head attention is also an important variable to keep track of.
    - **Intermediate or Internal Feed Forward Network Size (I):** is **the size of the intermediate (hidden) layer of the position-wise feed forward neural network**. These position-wise feed forward neural networks **take up about 66% of all of the parameters that are learned** in a Transformer.
    - **Number of Layers (N):** is also **vitally important as that's the number of Transformer blocks that a Transformer will have**.


- Training
    - **Batch Size (B):** is also very important to keep track of for all of these models. In fact, while Transformers are very much a deep learning entity, you'll notice that a number of things are quite different: it's not uncommon to see just one epoch, or a batch size of just one or two, for these sorts of models.
    - **Tokens Trained On (T):** the number of tokens that a Transformer is trained on reaches the millions, billions, and even trillions. This is a scale that we hadn't seen in deep learning up to this point, and it is very much part of what makes large language models large.
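
To see how these variables translate into a parameter count, here is a back-of-the-envelope sketch. It is an approximation under stated assumptions: biases, layer normalization parameters, and positional embeddings are ignored, the configuration values are hypothetical, and the feed forward size is set to the common choice of four times the model size.

```python
def block_parameter_counts(d_model, d_ff):
    """Approximate weight counts for one Transformer block (biases and
    layer-norm parameters are ignored because they are comparatively tiny)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # expand to d_ff, then project back
    return attention, feed_forward

def model_parameter_count(vocab_size, d_model, d_ff, n_layers):
    """Rough total: token embeddings plus n_layers identical blocks."""
    attention, feed_forward = block_parameter_counts(d_model, d_ff)
    return vocab_size * d_model + n_layers * (attention + feed_forward)

# Hypothetical configuration, using the common choice I = 4 * D.
V, D, I, N = 50_000, 768, 3072, 12
attention, feed_forward = block_parameter_counts(D, I)
ffn_share = feed_forward / (attention + feed_forward)
total = model_parameter_count(V, D, I, N)
print(f"feed forward share of each block: {ffn_share:.1%}")
print(f"approximate total parameters: {total:,}")
```

With I = 4 × D, the feed forward network accounts for two thirds of each block's weights, which is where the figure of roughly 66% comes from. The number of heads (H) does not appear because splitting the model dimension across heads leaves the size of the projection matrices unchanged.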

## Attention: The Power of LLMs

> **Attention: The Secret that Unlocked the Power of LLMs**

One of the goals of this module is to
understand how we can build and train
our own base or foundation Transformers.
However, before we get into that let's
take a moment to talk about attention.
It's one of the most important
components of Transformers and something
that can be quite complicated if you
haven't seen something like it before.

![Screen Shot 2024-03-19 at 12.52.06.png](attachment:e728cead-d49b-46c7-8535-7044c28b9f0a.png)

To start with, let's think about the vector that we're working with, which is going to be the current token that we're looking at. Let's assume that we're in the first layer, where the input word embedding vector corresponds directly to the vector that we're going to talk about in attention here.

**Attention is built out of three vector families: the query vector, the key vectors, and the value vectors**. We **actually have one query vector, which relates to the current token that we're looking at in the sequence. We have a number of key vectors, which come from all of the vectors in the sequence, and we also have a series of value vectors**. **We use a matrix multiplication of the word vector (or the enriched embedding vector, if you like) by the query matrix to give us our query vector, and the key and value vectors are produced in the same way with their own matrices. All of these matrices (the query matrix, the key matrix, and the value matrix) are made up of weights that are learned during back propagation**.
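
The projections can be sketched as three matrix multiplications. The matrices `W_q`, `W_k`, and `W_v` are random placeholders for the learned weights, and the sizes and input vectors are invented for the example.

```python
import random

random.seed(0)
d_model = 4  # toy embedding size (D)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(x, W):
    """Compute the vector-matrix product x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

# Three weight matrices; randomly initialized here, learned during training.
W_q, W_k, W_v = (rand_matrix(d_model, d_model) for _ in range(3))

# Three token vectors (word embeddings plus positional information).
sequence = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.4, 0.1, 0.0]]

current = 1                                   # index of the token of focus
query = project(sequence[current], W_q)       # one query for the current token
keys = [project(x, W_k) for x in sequence]    # one key per token
values = [project(x, W_v) for x in sequence]  # one value per token
```

In practice every token takes its turn as the current token, so a query is computed for each position and the whole calculation is done at once with matrices.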

The **idea behind attention** is that we take a single query vector and compare it with all of the key vectors that we generate from the sequence, effectively asking each one: how much are you, the key vector, related to me, the query vector? **We do this in parallel for all of the tokens in the sequence, so every time we do the attention calculation we focus on our query vector and broadcast it to all of the key vectors**. What we're asking is how similar, and how important, each key vector is to the query vector.

In the equation that you see here, we take the softmax of *Q* times *K* transpose. *Q* in this situation is the query vector, and *K* is the matrix of key vectors, transposed so that the product gives us another vector with one score per key. We then multiply that by the value vectors. Let's take a look at what happens in this situation.
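
For reference, the scaled dot-product attention from "Attention Is All You Need" is usually written with a scaling factor, where $d_k$ is the length of each key vector:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the vector size, which would otherwise push the softmax towards putting nearly all of its weight on a single token.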

> **The Inner Working of Calculating Attention**

![Screen Shot 2024-03-19 at 13.03.11.png](attachment:615aa6e8-f865-4b03-86bc-fb4959337787.png)

To calculate attention, in step one we take our **input vector, which if we're in the first layer is the word embedding vector with positional information, and we create three new types of vectors**: the query vector, the key vectors, and the value vectors. The query vector, as I said before, is built just from the current token. We then take a scaled dot product of the query vector with each of the key vectors, and this gives us the attention scores.

**We're going to have an attention score for each pairing of the current query vector with one of the key vectors, so we'll end up with a vector of attention scores that has one entry per token in the sequence. The query, key, and value vectors themselves have sizes set by the dimensionality of the model, which is another reason why that dimensionality is important here**. **A softmax then turns the query-times-key scores into attention weights, which are scaled from zero to one** and sum to one.

Then we do a weighted combination: for each position in the sequence, the attention weight for that token is multiplied by that token's value vector. So the weight at index zero scales the value vector at index zero, and so on for each token, and the scaled value vectors are added together. This then **gives us a full output vector for that particular token, blending information from across the entire sequence according to the attention weights**.
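
Here is a minimal sketch of attention for a single query vector, using hand-picked toy vectors rather than anything a trained model would produce. Note that there is one attention weight per token in the sequence, while the output vector has the same length as a value vector.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # One score per key: how well does each token match the query?
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)  # each in [0, 1], summing to 1
    # Weighted sum of the value vectors, one weight per token.
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, output

# Hand-picked toy vectors (three tokens, dimension 2).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [10.0, 0.0]]
weights, output = attend(query, keys, values)
```

The first and third keys point in the same direction as the query, so they receive equal and larger weights than the second key, and the output is pulled towards their value vectors.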

We'll take a moment to think about this one more time, as attention can be somewhat complicated. You can think of it as a kind of filing cabinet and lookup system, where **we have our query, which comes from the current token, and we look through the files to see how well each of the file labels (the key vectors) matches what we need; the contents of each file are the value vectors. Once we've figured out how much each file should contribute to the query, which is what the attention weights tell us, we combine it all together to get our output vector**. *This is where the notion of attention comes from: the attention weights tell us how much attention we should be paying to each token relative to the current token of focus, and the output vector is the resulting blend of their values*.

In the next section we'll take attention and the feed forward neural network and look at how we can actually build our own foundation models, and at what is required beyond attention: the larger architecture that we need to construct in order to train these models.

## Building a Transformer from Scratch

> **Foundation Model Training - Choosing the Right Options to Build your Model**

![Screen Shot 2024-03-19 at 13.25.08.png](attachment:a399301a-cd93-4767-a168-ee903b801bcb.png)

- **Note:** keep in mind that for task-specific performance you'll almost always need to fine-tune your model. This requires a much smaller amount of training data and is usually the recommended route for most people: take a large language model that has already been pre-trained as a foundation model and fine-tune on top of that.


Before you start training your own foundation model, let's look at some of the different options that you'll need to go through. You'll want to think about the model architecture, whether it's a decoder, an encoder, or a combination of the two, and you will also want to think about the type of tasks that you want the fine-tuned version of this foundation model to perform. This will inform some of the decisions that you make about the structure of your model and the different types of data.

You'll also want to think about how big you want the model to be and how rich its representation of language needs to be: the embedding dimensions, the number of blocks, etc. Then there is the type and availability of data that you have; most importantly, the wrangling of that data is going to be one of the most difficult things for you to overcome.

Finally, there is actually getting the compute resources: both the amount of time that you have allocated to train the model and the hardware that's available, which is not something you can take for granted these days, as GPUs are quite hard to come by, particularly the ones that are needed for foundation model training.



> **Different Types of Architecture**

![Screen Shot 2024-03-19 at 13.28.01.png](attachment:7f1f58e3-6b7b-409e-9785-6e169d77bdcb.png)

We've already seen the encoder-decoder model that Google produced in the "Attention Is All You Need" paper and the different generations that came after it, such as BERT, GPT, and T5, which we haven't spoken about yet but will look at later in the course.

Depending on the task that you want, if it's **classification, maybe you'll go with something like BERT**; if it's **generation, you'll probably want something like GPT**; and for **translation, you'll probably want an encoder-decoder like T5**. You'll also want to **think about the number of layers that you have and the context size that you can deal with**.

![Screen Shot 2024-03-19 at 13.50.05.png](attachment:c2b74fe6-730f-4290-b7f2-02cc267c991c.png)

> **The Data**

![Screen Shot 2024-03-19 at 13.31.43.png](attachment:4fdd296b-136b-423b-94bc-51bca35f2166.png)

**Most importantly, the data** is something that you'll have to fight for. There are a number of publicly available data sets, such as the well-known Pile data set, which is a combination of different openly available text resources. **However, if you train the same model that someone else has trained on the same data, you're not going to get much of an advantage, and you're better off just downloading the weights of that model**.

You'll want to start with at least something like the Pile to give the model a good understanding of the English language in this case, and of language in general. You **may then have proprietary or just curated data sets of your own that are more specific**, which might include things like transcriptions, digitized text, code examples, and other sources that you think are valuable for this foundation model to be trained on.

> **Training the Model**

![Screen Shot 2024-03-19 at 13.34.33.png](attachment:c12d5fa2-b763-42c0-9ac7-3859e12386cc.png)

Once you have all of your data and your
architecture and the compute ready to go
then you can breathe a sigh of relief as
now you're just back to training a
regular deep learning model.

**Large language models train more or less like every other deep learning model, except they're massive**: they take many weeks or months to train to a reasonable state, and they often require hundreds of GPUs to do so. However, **they often rely on fairly typical loss functions like cross entropy and optimizers like AdamW**, though new optimizers are being researched and developed by the community day by day.
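
For next-token prediction, cross entropy is simply the negative log of the probability that the model assigns to the correct token. The sketch below computes it for a single position from a list of hypothetical scores; real training averages this value over every position in every sequence of a batch.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigns to the correct next token."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical scores over a 4-token vocabulary for one position.
logits = [2.0, 0.5, -1.0, 0.1]
loss_correct = cross_entropy(logits, target_index=0)  # target is the top-scoring token
loss_wrong = cross_entropy(logits, target_index=2)    # target is the lowest-scoring token
```

The loss is small when the model already gives the correct token a high score and large when it does not, and the optimizer adjusts the weights to push it down.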

> **LLM Alignment**

![Screen Shot 2024-03-19 at 13.37.29.png](attachment:aef9dea1-c32b-47de-840f-327bc09db046.png)

Now that your model's been fully trained, you might wonder what to do next. Odds are that if you start to interact with your large language model you'll find that it **suffers from alignment problems. If you're unfamiliar with the alignment problem in large language models, in essence it boils down to just a few questions. Is the model accurate? Does the model behave well for what we want it to do: is it toxic, does it show negative biases or any sort of biases that would detract from the performance that we want? And does it hallucinate**, making up situations and examples when we need it to be as factual as possible? Or maybe you want it to be as creative as possible and it's just not very good at doing that. Either way, alignment is still an ongoing area where different types of tools and procedures are being investigated.

Really, what you want to do after you've built your foundation model is to look at fine-tuning methods. After we finish this module, in Module 2 we will talk about parameter-efficient fine-tuning methods, a new area of development where you can fine-tune models on different tasks in a very compute-efficient way.

## Summary

> **Summary of Module 1**

![Screen Shot 2024-03-19 at 13.52.09.png](attachment:4ec1c2fb-0d5c-4c7b-9eb5-4bc951a6e045.png)

This module focused on Transformers and
the building blocks that make up the
different types of large language models
that we see.

We saw that they are built from Transformer blocks and how those blocks combine the attention mechanism and position-wise feed forward neural networks to enrich the vectors that we give to them. Transformers can be encoder models or decoder models, and they can even be a combination of the two. We saw that the evolution of GPT from GPT-1 to GPT-4 required some slight changes in architecture and a lot of change in data.

Keep in mind that while we want to train
these models as base or foundation
models, if we really want to make the
most of them for the tasks that we have
at hand we're going to need to do some
sort of fine tuning so that we can
achieve state-of-the-art performance.

Now let's jump into the notebooks and
look at how we can build our own mini
Transformers from scratch.