# Large Language Models

<hr style="border:2px solid gray">

# Index <a id='index'></a>
1. [Introduction](#intro)
1. [What are LLMs?](#llms)
1. [Generating text with GPT-2 in Python](#gpt-2)


<hr style="border:2px solid gray">

# Introduction <a id='intro'></a> [^](#index)

Now that we have covered neural networks with all kinds of different ways to define locality, we come to discuss the application of machine learning that is now the most well known to the general public: large language models.

These models have transformed natural language processing with top performance for tasks such as translation, text summarisation, and answering questions. 

In this notebook, we will give a brief overview of LLMs, including the basic principles of how they work, some common architectures, and some extra training techniques used for them. 

You'll have a brief introduction to the HuggingFace `transformers` library, which implements many LLM architectures and gives access to pre-trained networks that you can use, and we will run a local instance of GPT-2 (one of the precursors to ChatGPT) for you to experiment with.

<hr style="border:2px solid gray">

# What are LLMs? <a id='llms'></a> [^](#index)

You will doubtless have heard of large language models (LLMs) such as ChatGPT, Claude, Gemini, and more. These models are at the cutting edge of AI and have attracted billions in investment, due to the impressive capabilities they demonstrate for a huge variety of tasks, particularly language generation.

While similar ideas existed using statistical models and recurrent neural networks, it was the introduction of the transformer architecture that led to the development of foundational models such as the **generative pre-trained transformer** (**GPT**) or **bidirectional encoder representations from transformers** (**BERT**) models that revolutionised a wide variety of NLP tasks.

The question we now ask is, how do LLMs work?

There are now many different types of LLMs but, broadly speaking, these were originally divided into two categories:

* decoder-only transformers, which are causal (each token can only attend to earlier tokens) and generate sequences one token at a time

* encoder-only transformers, which are non-causal and learn token context by randomly masking some sequence entries during training and learning to predict the masked tokens, but are not generally appropriate for iteratively generating text

Instead of the encoder-decoder architecture we expect for language processing, great success has been found using models based on just the encoder or just the decoder. For now, we will focus on decoder-only models, as this is the model type that is generally most effective for sequence generation tasks like chatbots, question answering, and agentic AI.

## Generative pre-trained transformers

First introduced by OpenAI in 2018, GPT models are based on the transformer architecture, using multi-head self-attention. In particular, GPT models are based on the transformer decoder, albeit **without** the cross-attention block that attends to encoder outputs and decoder hidden states in the original transformer architecture. This is because there is no encoder in this model. The schematic below illustrates a GPT model architecture. Let's break it down:

<center>
<img src='GPT-model-schematic.png' width=1000/>
</center>



* Input a prompt into the model, which is tokenized and embedded to the right dimensions for the model, and padded to the desired length

* For each layer in the decoder:
    * Apply multi-head self-attention with a causal mask (like in the original transformer decoder), so each element of the padded input sequence only attends to earlier elements

    * Apply the usual dropout, residual connections, and layer normalisation

    * Pass the self-attention output through a fully-connected neural network with the typical dropout, residual connections, and layer normalisation

* After all the decoder layers, apply a linear projection from the embedding dimension to the vocabulary dimension to produce an output value for each possible token, for the next token in the sequence

* Apply a softmax to these output values and select a token, either by taking the token with the maximum value or by treating the softmax outputs as probabilities and sampling randomly according to them

* Append the selected token to the original sequence, and pass this as input to the model again

* Repeat this process until you have generated the desired number of new tokens
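To make the causal mask in the steps above concrete, here is a minimal pure-Python sketch. The helper names `causal_mask` and `masked_softmax` and the uniform attention scores are illustrative assumptions, not part of any library; a real model would compute the scores from queries and keys.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out entries receive zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        # Masked positions are set to -inf, so exp(-inf) = 0 after the softmax
        masked = [s if keep else -math.inf for s, keep in zip(score_row, mask_row)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical attention scores: every pair of positions scores the same
scores = [[0.0] * 4 for _ in range(4)]
attention = masked_softmax(scores, causal_mask(4))
for row in attention:
    print([round(w, 2) for w in row])
```

Each row of `attention` sums to 1, and every entry above the diagonal is exactly zero, so a position never receives information from later positions.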

The output sequences are constructed token-by-token, which is referred to as **auto-regressive**. 
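The auto-regressive loop can be sketched without any neural network at all. In the example below, `toy_next_token_probs` is a hypothetical stand-in for the model's forward pass plus softmax, and `VOCAB` is a made-up vocabulary; only the loop structure is meant to mirror the steps described above.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]  # made-up toy vocabulary

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for a GPT forward pass: one probability per vocab entry."""
    # Strongly favour the vocabulary entry that follows the last token (wrapping around)
    last = VOCAB.index(tokens[-1])
    probs = [0.02] * len(VOCAB)
    probs[(last + 1) % len(VOCAB)] = 0.9
    return probs

def generate_tokens(prompt, max_new_tokens, do_sample=False, seed=0):
    """Generate token-by-token, feeding each new token back in as input."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        if do_sample:
            # Treat the outputs as probabilities and sample from them
            next_token = rng.choices(VOCAB, weights=probs, k=1)[0]
        else:
            # Greedy decoding: always take the most probable token
            next_token = VOCAB[probs.index(max(probs))]
        tokens.append(next_token)  # the extended sequence is the next input
    return tokens

print(generate_tokens(["the"], max_new_tokens=4))
```

With greedy decoding the result is fully determined by the prompt, whereas with `do_sample=True` it depends on the random draws.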

Of course, different LLMs will differ in more than just the number of layers; for example, they might use different attention mechanisms, activation functions, and so on. [This paper](https://link.springer.com/article/10.1007/s42452-025-07668-w) gives a nice overview of modern LLMs and some of the differences used for different models.

## LLM training

Ok, so the architecture of a GPT model is straightforward enough, but what do we mean by pre-training? Basically, before we learn any specific application, the model is trained on a large dataset of unlabelled data in an unsupervised fashion, so it learns how to generate data, and later we can **fine-tune** the output by training with labelled data to produce a model suited to a specific task. The total training procedure for these models breaks down as follows:

* During pre-training:
    * The model is given unlabelled data, i.e. this is an **unsupervised learning** task
    
    * During training, the model aims to maximise the likelihood it assigns to the true next token, given all previous tokens in the sequence; this means it learns which tokens are most likely to come next based on what is in the dataset. This is a standard language modelling loss function.
    
    * The model produced after pre-training is sometimes referred to as a **foundational model**, and can have additional tasks built on top of its outputs

    * Compared to previous RNN language models, only requiring unlabelled data is a big improvement as it is a lot less time consuming to gather and prepare compared to labelled data

* During fine-tuning:
    * After we have a foundational model, we can fine-tune to a specific task using labelled data, i.e. this is a **supervised learning** task
    
    * Typically, this involves adding an additional output head to the end of the model stack i.e. adding more layers with weights that can be learned
    
    * Fine-tuning is then optimizing the weights of the combined model to produce the desired output
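The pre-training objective described above can be written down explicitly. The notation here is an assumption introduced for illustration: $x_t$ is the $t$-th token of a training sequence of length $T$, and $p_\theta$ is the probability the model with weights $\theta$ assigns to a token. Maximising the likelihood of each token given the tokens before it is equivalent to minimising the negative log-likelihood:

$$
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_1, \ldots, x_{t-1}\right)
$$

In practice this is typically computed as a cross-entropy between the softmax of the model outputs at each position and the token that actually appears next in the text, averaged over many sequences.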

For example, GPT-3.5 is the foundational model of the original ChatGPT, which was subsequently fine-tuned to produce friendly, interactive outputs to act as a chatbot. 

The same training ideas are applied to pretty much all of the LLMs around today - a foundational model is trained on a large volume of text, and then specific models are trained for certain tasks by fine-tuning. For example, models that claim to be better at producing accurate results for coding problems are likely fine-tuned on a large volume of code available online.

Some examples of fine-tuning tasks, separated by decoder-only or encoder-only models:

* Decoder-only:
    * Chatbots - the labelled data presented may be transcripts of conversations or example interactions, to make the model more conversational and interactive

    * Code generation - fine-tuning likely uses a large volume of samples of code or documentation of various code packages

* Encoder-only:
    * Text summarisation - the labelled data may be inputs of articles, and the labels would be human-written summaries of these articles

## Fine-tuning mechanisms

After foundational model training, fine-tuning can proceed in a variety of ways. Examples include:

* Reinforcement learning from human feedback: prompts are given to the LLM, which generates several outputs; humans rank these according to how well they match the desired behaviour, e.g. by how human they sound, and the rankings are used as the training signal for fine-tuning

* Constitutional AI: a major part of the training of Anthropic's Claude models. A second AI model is trained to judge how well the LLM output obeys a "constitution", and the LLM is fine-tuned so that its output obeys that "constitution". This generally consists of preventing models from generating any offensive, violent, or discriminatory content.

* Few-shot, zero-shot, and in-context learning: few-shot and zero-shot learning refer to LLMs generalising to new tasks with only a few examples or no explicit examples respectively. In-context learning is where the model uses patterns present in the input prompt context (based on what it has previously learned about language structure) to infer appropriate output

Of course, there is a lot more that can go into training these models, with many possible different techniques. Many of these methods depend on what specific task the model is being fine-tuned for.

Now, you will have an opportunity to experiment with GPT-2, a foundational model preceding ChatGPT, using the HuggingFace `transformers` library.



<hr style="border:2px solid gray">

# Generating text with GPT-2 with HuggingFace `transformers` <a id='gpt-2'></a> [^](#index)

While we have implemented transformers ourselves in PyTorch, if we want to work with state-of-the-art models and get the best, up-to-date weights and datasets, then the HuggingFace ecosystem is an incredibly useful tool. This platform gives access to architectures and pre-trained weights for many of the most recent models, as well as a large library of datasets for different tasks and training. 

We won't spend time digging into this library in detail, so please read the [HuggingFace webpages](https://huggingface.co/) for more information and [these pages](https://huggingface.co/docs/transformers/index) for the documentation for `transformers`, but we will use it here to instantiate and experiment with a version of GPT-2, one of the precursor models to ChatGPT.

The code here is based on this [GitHub repo](https://github.com/karpathy/minGPT) from Andrej Karpathy, one of the founders of OpenAI.

The first time you run this cell, it will download the weights for the GPT-2 model from HuggingFace. This consists of about 1.5 billion weights, so this may take some time to download. We will define the model and the tokenizer we will need to generate data, imported from `transformers`. We can also inspect the model by printing it:

In [13]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

device = ('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
# device = 'cpu'

model_type = 'gpt2-xl'
model = GPT2LMHeadModel.from_pretrained(model_type)
model.config.pad_token_id = model.config.eos_token_id

model.to(device)
model.eval()
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=4800, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=1600)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=6400, nx=1600)
          (c_proj): Conv1D(nf=1600, nx=6400)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1600, out_features=50257, bias=False)
)


**Note: the `Conv1D` shown in this architecture is *not* a convolution, but instead is a custom definition used in the original model and is equivalent to `nn.Linear` with the weights transposed.**

**This means that if it says `Conv1D(nf, nx)`, this is equivalent to `nn.Linear(nx, nf)`.**
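The equivalence can be checked by hand on a tiny example. The sketch below uses plain Python lists rather than PyTorch; the functions `linear` and `conv1d_gpt2` are illustrative re-implementations written for this note, assuming the `Conv1D` weight is stored with shape `(nx, nf)` and the `nn.Linear` weight with shape `(out_features, in_features)`.

```python
def linear(x, weight, bias):
    """nn.Linear convention: weight has shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def conv1d_gpt2(x, weight, bias):
    """GPT-2 Conv1D convention (assumed): weight has shape (nx, nf), i.e. (in, out)."""
    nf = len(bias)
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j] for j in range(nf)]

# Toy sizes: nx = 3 input features, nf = 2 output features
conv_weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape (nx, nf)
bias = [0.5, -0.5]
x = [1.0, 0.0, -1.0]

# Transposing the Conv1D weight gives the matching nn.Linear weight, shape (nf, nx)
linear_weight = [list(col) for col in zip(*conv_weight)]

print(conv1d_gpt2(x, conv_weight, bias))
print(linear(x, linear_weight, bias))
```

Both calls produce the same two output values, which is the sense in which `Conv1D(nf, nx)` behaves like `nn.Linear(nx, nf)`.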

From this, we can see the architecture of this model. Let's step through it piece-by-piece:

* `wte` : input embedding layer to an embedding dimension of 1600

* `wpe` : positional embeddings; these are generated from a list of integers up to the length of the input sequence, with a maximum block size specified (here 1024), so each unique position is indexed and converted to an embedding with the same dimensions as the embedded sequence

* `drop` : dropout layer to be applied to the sum of the token embeddings and the position embeddings

* `h` : the full transformer used, consisting of 48 `GPT2Block` items. Each `GPT2Block` is a transformer layer, with the following composition: 

    * Multi-head self-attention block `GPT2Attention`, where individual parts are as follows:
    
        * `c_attn` is a single linear layer combining the projections for queries, keys, and values

        * `c_proj` is the final output linear layer to project from the combined attention output dimensions to the model dimension

        * `attn_dropout` is the dropout layer applied to the output of the attention mechanism

        * `resid_dropout` is the dropout layer applied to the output of the final linear projection
    <br></br>
    * Feed-forward network `GPT2MLP`, with the following individual components:

        * `c_fc` is a linear layer from the embedding dimension to a hidden dimension (here, it is 4 * embed_dim = 6400)

        * `c_proj` is the final linear layer from the hidden dimension to the embedding dimension

        * `act` is the activation function for the network, applied between the two linear layers, which here is the Gaussian Error Linear Unit, or GELU (described in [this paper](https://arxiv.org/abs/1606.08415))

        * `dropout` is the dropout layer applied to the output of the network

    * Each of these sub-layers has a residual connection around it

* `ln_f` is a LayerNorm applied after the transformer

* `lm_head` is the output linear layer to go from the embedding dimension to the vocab dimension to produce a value for each possible word at each step in the sequence

    
    


**As mentioned before, we can see that the `GPT2Block` is very similar to the transformer layers we defined previously, with masked multi-head self-attention on the input and a feed-forward network; the main differences from the original transformer paper are that a) this isn't an encoder-decoder and is instead most similar to just a decoder (without cross-attention), and b) there are 48 layers, so there are many weights with which to learn from the data.**
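As a sanity check on the "about 1.5 billion weights" figure, we can count the parameters directly from the layer sizes in the printed architecture. This is a back-of-the-envelope sketch that assumes every `Conv1D` layer has a bias, that each `LayerNorm` has a scale and a shift, and that `lm_head` shares its weights with `wte` (so it is not counted twice); the helper `linear_params` is written just for this example.

```python
# Sizes read off the printed architecture
vocab_size, max_positions, embed_dim, n_layers = 50257, 1024, 1600, 48
hidden_dim = 4 * embed_dim  # 6400

def linear_params(n_in, n_out, bias=True):
    """Number of weights (plus optional biases) in a fully-connected layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * embed_dim  # one scale and one shift per feature

# c_attn (queries, keys, values together) and the attention c_proj
attention = linear_params(embed_dim, 3 * embed_dim) + linear_params(embed_dim, embed_dim)
# c_fc and the MLP c_proj
mlp = linear_params(embed_dim, hidden_dim) + linear_params(hidden_dim, embed_dim)
block = 2 * layer_norm + attention + mlp  # ln_1, ln_2, attention, MLP

embeddings = vocab_size * embed_dim + max_positions * embed_dim  # wte + wpe
total = embeddings + n_layers * block + layer_norm  # plus the final ln_f

print(f"parameters per block: {block:,}")
print(f"total parameters:     {total:,}")
```

Under these assumptions the total comes to roughly 1.56 billion, with almost all of the parameters sitting in the 48 transformer blocks rather than in the embeddings.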

Here is a function to generate text using this model:

In [14]:
def generate(prompt='', num_samples=10, steps=20, do_sample=True, top_k = 40, temperature = 1.0):        
    # tokenize the input prompt into integer input sequence
    tokenizer = GPT2Tokenizer.from_pretrained(model_type)
    if prompt == '': 
        # to create unconditional samples...
        # huggingface/transformers tokenizer special cases these strings
        prompt = '<|endoftext|>'
    encoded_input = tokenizer(prompt, return_tensors='pt').to(device)
    x = encoded_input['input_ids']
    
    # we'll process all desired num_samples in a batch, so expand out the batch dim
    x = x.expand(num_samples, -1)

    # forward the model `steps` times to get samples, in a batch
    y = model.generate(x, max_new_tokens=steps, do_sample=do_sample, top_k=top_k, temperature = temperature)
    
    for i in range(num_samples):
        out = tokenizer.decode(y[i].cpu().squeeze())
        print('-'*80)
        print(out)

In [15]:
generate('', temperature = 1e-6)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long day of work.
--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long day of work.
--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long day of work.
--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long day of work.
--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long day of work.
--------------------------------------------------------------------------------
<|endoftext|>The first time I saw the movie, I was in the middle of a long

Let's quickly explain the steps of this function, and try each part in turn. Beginning with handling the input prompt:


In [5]:
prompt = 'When using LLMs,'

tokenizer = GPT2Tokenizer.from_pretrained(model_type)
if prompt == '': 
    # to create unconditional samples...
    # huggingface/transformers tokenizer special cases these strings
    prompt = '<|endoftext|>'
encoded_input = tokenizer(prompt, return_tensors='pt').to(device)
x = encoded_input['input_ids']

First, we take in our prompt, which is a string. In order to pass this through the model, we need to convert it to tokens using our tokenizer, which in this case is a pre-trained `GPT2Tokenizer` from HuggingFace. 

If we didn't pass any string and instead want to generate text unprompted, we need to set our prompt to the end-of-sequence token, which for the GPT-2 tokenizer is `<|endoftext|>`. This token marks the boundary between pieces of text, so starting from it asks the model to begin a new piece of text from scratch.

Our prompt goes into the tokenizer and we then send the outputs to the device we are running the model on. Let's look at the encoded input:

In [6]:
print(encoded_input)

{'input_ids': tensor([[ 2215,  1262, 27140, 10128,    11]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


Our tokenizer has divided our input into 5 individual tokens and returned the indices of these tokens in the tokenizer vocabulary, as well as an attention mask (which is all ones, as there is no padding to ignore), all on the device we have selected to run on. We can also look at how our prompt has been split into individual tokens using `tokenizer.decode`:

In [7]:
for token in x.flatten(): # need to flatten as x has shape (1, seq_len)
    print(tokenizer.decode(token))

When
 using
 LL
Ms
,


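While the first two words are not split up, "LLMs" is split into "LL" and "Ms", and the comma has been identified as a separate token. GPT-2's tokenizer uses byte-pair encoding, which builds tokens by merging character sequences that appear frequently in its training text, so the split of "LLMs" most likely reflects how rare that string was in the training text rather than anything grammatical such as the plural; without examining the tokenizer in more detail it is hard to be sure.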

Now, let's look at the next parts of the function:

In [8]:
num_samples = 10
steps = 10

x = x.expand(num_samples, -1)

First we need to define some parameters for the output we want, which are as follows:

* `num_samples` : the number of times we generate an output for the prompt, so we will get `num_samples` output sequences

* `steps` : the maximum number of tokens we will return for the new sequence

The tokenized prompt is then duplicated `num_samples` times using `.expand`, so we can pass it in as a batch to the model to generate samples.

Now, let's look at the step where we generate from the model:

In [9]:
do_sample = True
temperature = 1.0
top_k = 40

y = model.generate(x, max_new_tokens=steps, do_sample=do_sample, top_k=top_k, temperature = temperature)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The `generate` method produces output sequences based on the prompt. The arguments to this function are as follows:

* `x` : the input prompt (called `inputs` in the function definition)

* `max_new_tokens` : the maximum number of new tokens to generate

* `do_sample` : whether the output sequences should be sampled from the softmax of the model outputs, or if the selected token should be the one with the highest output value. If `True`, the softmax outputs are treated as probabilities for each possible token, and the output is randomly sampled based on these probabilities, according to the `torch.multinomial` function (see [the documentation](https://docs.pytorch.org/docs/stable/generated/torch.multinomial.html) for more details).

* `top_k` : only the `top_k` highest model outputs are kept before applying the softmax; the remaining outputs are set to negative infinity, so the "probability" for these tokens is exactly 0 after the softmax

* `temperature` : level of randomness in the token generation (you may have encountered this in using LLMs yourself). The model outputs are divided by `temperature` before the softmax. If we increase the value of `temperature` above 1, the softmax outputs for different tokens become more similar, so when we sample the next token in our sequence we have more randomness in the output. Similarly, if we decrease the value of `temperature` below 1, the softmax outputs become more different and the highest values become more likely to be selected, reducing the randomness of the output.
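The effect of `top_k` and `temperature` can be seen on a handful of made-up model outputs. The function `next_token_probs` below is a simplified pure-Python sketch written for this note, not the `transformers` implementation, and the `logits` values are invented for illustration.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Softmax over logits after optional top-k filtering and temperature scaling."""
    if top_k is not None and top_k < len(logits):
        # Keep the top_k largest logits; everything else is set to -inf
        # (ties at the threshold are all kept in this simplified version)
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= threshold else -math.inf for l in logits]
    # Divide by the temperature before the softmax
    scaled = [l / temperature for l in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]  # invented model outputs for four tokens
for temperature in (0.5, 1.0, 2.0):
    probs = next_token_probs(logits, temperature=temperature, top_k=3)
    print(temperature, [round(p, 3) for p in probs])
```

The token outside the top 3 always gets probability 0, and the probability of the most likely token shrinks as `temperature` grows, which is why sampling at high temperature produces more varied text.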


The `generate` method works iteratively; after a new token is generated, it is appended to the input sequence and the updated sequence is passed back through again (although the sequence length is capped by the max sequence size of the model, which is the limit of the model context) to generate another token. This is repeated until the maximum number of new tokens has been generated, or until every sequence in the batch has produced the end-of-sequence token.


Finally, we need to decode each of our output lines; we can decode whole sequences at a time, but need to do each individual sequence separately:

In [10]:
for i in range(num_samples):
    out = tokenizer.decode(y[i].cpu().squeeze())
    print('-'*80)
    print(out)

--------------------------------------------------------------------------------
When using LLMs, you'll have to be on the lookout for an
--------------------------------------------------------------------------------
When using LLMs, take note that most of the time you will not
--------------------------------------------------------------------------------
When using LLMs, consider that the underlying memory has to be addressed (
--------------------------------------------------------------------------------
When using LLMs, the data for the initialization should be obtained from
--------------------------------------------------------------------------------
When using LLMs, it is recommended to run a test suite to make
--------------------------------------------------------------------------------
When using LLMs, if you need the output data as a string (
--------------------------------------------------------------------------------
When using LLMs, always specify which of the

We transfer the model outputs `y` back to the cpu (in case we have been running on GPU), and remove any size 1 axes using `squeeze`. Finally, we pass the outputs through the tokenizer and print the output (with some dashes to separate individual lines).

<div style="background-color:#C2F5DD">

Now, feel free to play about with this model and generate some sequences based on your choice of prompt. Some interesting things to try include:

* Varying `temperature`, both increasing and decreasing; what happens in the limit of very large or very small values? 

* Verify that setting `do_sample` to `False` makes the output consistent

* Try varying `top_k` alongside `temperature`; for low temperature, you shouldn't see much change in the set of output tokens you observe even if you vary `top_k`, but if you increase the `temperature` you should see more and more possible tokens appearing in your output if you increase `top_k`

In [11]:
# Your prompting here

generate(prompt="What's the price of a loaf of bread?", num_samples=1, steps=50, do_sample=True, top_k = 40, temperature = 1.0)   


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--------------------------------------------------------------------------------
What's the price of a loaf of bread? One person says it is worth $5. An economist says it's worth $5. The former person thinks it doesn't matter what the price is. The latter person thinks that is a terrible thing, since we should strive to earn a living and
