## Large Language Models (LLMs)
### Introduction
You may have heard of or used ChatGPT recently. ChatGPT, Google Bard, and other conversational AIs are examples of Large Language Models, or LLMs for short. LLMs are deep learning models designed for human language tasks such as text generation and question answering. A large language model is a trained deep-learning model that processes and generates text in a human-like fashion. Behind the scenes, it is a large transformer model.

A transformer model is built around what we call an attention mechanism. This mechanism allows the model to weigh the importance of different words (or tokens) in a sequence, so that it can capture long-range dependencies and relationships.

This means that transformer models perform well because they learn the context surrounding the use of a word.

To process text input with a transformer model, you first need to tokenise it into a sequence of characters, sub-words, or whole words. These tokens are then encoded as integers, which we can then convert into embeddings. Embeddings are vector-space representations of the tokens that preserve their meaning: each token is represented by an array of real numbers, and tokens that are used in similar contexts end up with similar arrays.

The encoder in the transformer transforms the embeddings of all the tokens into a context vector.  The context vector allows the transformer decoder to generate output based on clues. For instance, our initial input would be in the form of a text prompt that can be passed on to the transformer decoder to produce the next most probable word given our input. We can then repeat this by reusing the same decoder, and using the previously predicted next-word as the input, and so on. By repeating this process, a transformer model can generate an entire passage of text word by word. The transformer model learns the rules of human language implicitly through the training examples presented to it. If the training data is large enough, it is possible for the model to learn not only the rules of grammar, but also semantics, and even whole concepts.
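
The generation loop described above can be sketched in a few lines. This is a toy illustration rather than the model built in this notebook: `next_token_probabilities` and `TOY_TABLE` are hypothetical stand-ins for a trained decoder, and they only look at the last character of the context.

```python
# Toy autoregressive loop. `next_token_probabilities` is a hypothetical
# stand-in for a trained decoder: it only looks at the last character.
TOY_TABLE = {
    "r": {"a": 1.0},
    "a": {"i": 1.0},
    "i": {"n": 1.0},
    "n": {".": 1.0},
}

def next_token_probabilities(context):
    # Fall back to ending the sentence when the last character is unknown.
    return TOY_TABLE.get(context[-1], {".": 1.0})

def generate(prompt, max_new_tokens=10):
    text = prompt
    for _ in range(max_new_tokens):
        probabilities = next_token_probabilities(text)
        # Greedy decoding: pick the most probable next token.
        next_token = max(probabilities, key=probabilities.get)
        # Feed the prediction back in as part of the next input.
        text += next_token
        if next_token == ".":
            break
    return text

generated = generate("r")
print(generated)
```

A real model would condition on the whole context rather than the last character, and would often sample from the probabilities instead of always taking the maximum.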

We will use `PyTorch` to construct our neural network. This will allow us to look at the details underlying the model rather than use TensorFlow, whose high-level interfaces can obscure some of the detail.

We will build neural networks with the `torch.nn` namespace, which provides the classes you need to build a neural network. Every module in PyTorch subclasses `nn.Module`, so it works a little differently from TensorFlow, but this modular approach has the advantage of letting us construct sophisticated neural networks relatively quickly.

Deep learning models can take a lot of resources and time to train, even on a small data set, so this example will take some time to train.

### Installing Python libraries

In [None]:
!pip install torch numpy matplotlib nltk

### Model architecture
The transformer architecture follows an encoder-decoder structure. The encoder is tasked with mapping an input sequence to a sequence of continuous representations. The decoder takes the output of the encoder together with the decoder output at the previous time step to generate an output sequence. When generating an output sequence, the transformer does not rely on mechanisms such as recurrence and convolutions. Instead, LLM architectures tend to consist of multiple layers, including feedforward layers, embedding layers, and attention layers.

A typical task for transformers is language translation, where we use a sequence-to-sequence model (e.g. from English to German), which has both an encoder and a decoder.

For models like those behind ChatGPT, where we wish to generate text based on a prompt, only the decoder element is required. This is because the input and output sequences are essentially the same: the model generates text token by token according to the previous context. Consequently, for text generation, we only need to focus on implementing the decoder portion of the transformer model.

- *Tokenisation*: 
We break our training data text into tokens. In our case, we will be working at the character level (this gives us more training examples than using word-based corpora, and also keeps the training manageable). Even so, we would still need a huge dataset to obtain good generative results.

- *Token Embedding*: 
The model first creates token embeddings. Each token (word, subword, or character) in the input sequence is represented by a high-dimensional vector, which is initialized randomly and learned during training. The embedding dimension is a hyperparameter.

- *Positional Encoding*: 
This helps the model to understand the order of the tokens in the sequence provided. Positional encoding is achieved by adding a vector to each token embedding, which encodes the token's position in the sequence.  This is important since words or tokens in natural language have an order to them that is driven by the syntax of the language in question, or the semantics.

- *Masking*: This layer stops the model from looking ahead, or learning from padding tokens (more on this later).

- *Decoder Stack*: The normalised output is then passed into a stack of decoders. Each decoder block in the stack consists of a self-attention mechanism (multi-headed self-attention) and a feed-forward neural network, with layer normalisation and residual connections throughout. Each decoder block refines learning over multiple layers.  After the data passes through the decoder stack, we reach the final part of the model, the output layer represented by the language model head.

- *Output layer (Language model head)*: This layer is responsible for predicting token probabilities.

### The dataset
Our dataset will consist of a custom collection of short texts we have gathered containing short sentences based on Haikus from various webpages. The idea is that the final model will be able to generate text similar to the chosen dataset given a text prompt we provide after training has completed.  The first step is to preprocess the data ready for training:

In [None]:
# Our haiku data
training_data = """
memorial day a shadow for each white cross.
spring rain as the doctor speaks i think of lilacs.
spring moonset a rice ball for breakfast.
sunny afternoon an old man lingers near the mailbox.
cinco de mayo horses roll in the shallows.
quitting time the smell of rain in the lobby.
waves slowly cresting towards shore a faint moon.
overnight rain the scent of orange blossoms in a desert town.
misty summer rain calling pheasant in Zen temple.
day is done poppies amidst the dying grass.
watching clouds the white petals of a crushed crocus.
mountain stream two well placed rocks the path home.
night shift in the parking lot car lights dim near morning.
a wild violet on the sunny hill noon time nap.
a sunny day pink haze of the cherry blossoms over the hill.
polished oak the freesia's shadow ends in coffee foam.
nobody here a table in the mountain speckled with petals.
first date not even noticing the new moon.
vanishing difference gliding geese settle onto their reflections.
distant hawk a gust of cherry petals crosses the lawn.
Orange sunrise peaks through The tubes and wires of father's Life support machine.
moonlessness so many ways I want to touch you.
earth day even the shadows at dusk smell green.
so cold a goose honks its way across the night sky.
damp straw the day old colt mesmerized by the radio.
prying at the window wind.
in winter's wind the call of a friend who now has cancer.
rainy bridge the river flowing faster.
spring breeze the balcony's shadows on my book.
winter drizzle all the passing faces look into the cafe.
smells of spring adrift in the morning air bubbles under ice.
spring morning your hand on my chest a bird.
crossroads the brown core of an apple.
hot afternoon a flirting couple in the memorial's shade.
last red in the sky a small girl's moon face rises over the counter.
day moon the woman with silver hair steps back into the shade.
morning delivery snow comes in with the FedEx man.
easter an anxious mother calls in the wind.
windless day the prolific weeds at the grave site.
spring day cool and grey cup of tea.
grey today the waves rush memories.
spring equinox Pray the ancestor grave In cold rain.
folding chair the newborn colt tries to stand.
scattered sun one chickadee louder than the rest.
orthopaedic clinic a three legged chair outside the entrance.
the plane's landing lights a wave of barking sweeps the neighbourhood.
the way grass parts as the pheasant passes spring's end.
at half mast a butterfly passes by the still flag.
"""
    
training_data = training_data.lower().replace('\n', ' ')
    

In [None]:
print(training_data[: 50])

### Preprocessing
#### Tokenisation
We first need to tokenise the text.  Tokenisation involves breaking down the text into smaller parts known as tokens. A token might be a single character, sub-word, or a whole word. The choice of token size depends on the specific task and method being used.

For instance, GPT-3 (the model family from which ChatGPT was developed) uses a form of tokenisation called *Byte Pair Encoding (BPE)*, which tokenises text into subwords. These subwords are sequences of characters that frequently appear together, for instance *str* in words like *str*ing, *str*ong, *str*eet, *str*ap. This approach represents a mid-point between character-level and word-level tokenisation.

We create a very simple tokeniser that will encode the characters `a-z` and the digits `0-9`. We will also encode the period `.`, the question mark `?`, and the whitespace character `" "`. Our texts are quite short, typically composed of a few words per line. Therefore, we will use character-level tokenisation. This means our model will be tasked with generating human-like language on the basis of characters to form complete words and phrases.
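
To make the idea of merging frequent character sequences concrete, here is a minimal sketch of two BPE-style merge steps on the four example words. It is a simplification under stated assumptions: real BPE tokenisers work on bytes, learn merges from word frequencies in a large corpus, and perform many thousands of merges. The helper names `most_frequent_pair` and `merge_pair` are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the commonest."""
    pair_counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pair_counts[(left, right)] += 1
    return pair_counts.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        merged_words.append(merged)
    return merged_words

# Start from individual characters, then apply two merges.
words = [list(word) for word in ["string", "strong", "street", "strap"]]
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))

print(words)
```

After two merges the shared prefix has become the single symbol `str` in every word, while the rarer endings are still split into characters.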

#### Vocabulary
Once we have tokenised the data, the unique set of tokens is known as the vocabulary of the language. The size of a vocabulary depends on the complexity of the language and the granularity of the tokenisation. Our character-level vocabulary is tiny (40 tokens, including the padding token). By contrast, GPT-3's tokeniser uses a vocabulary of 50,257 tokens! This large vocabulary includes a broad range of English words and common word parts, enabling GPT-3 to process and generate highly nuanced text. This is beyond the scope of our little model.
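
As a small illustration of how a vocabulary follows from the tokens, the sketch below derives a character-level vocabulary directly from a sample string. Note the assumption: the tokeniser used in this notebook works differently, with a fixed, predefined character set, so `sample_text`, `vocabulary`, and `char_to_index` here are purely illustrative names.

```python
# Build a character-level vocabulary from a small sample of text.
sample_text = "spring rain."

# The vocabulary is the sorted set of unique tokens (here, characters).
vocabulary = sorted(set(sample_text))

# Assign each token an integer index.
char_to_index = {character: index for index, character in enumerate(vocabulary)}

print(vocabulary)
print(len(vocabulary))
```

With word-level or subword tokens the same procedure applies, but the set of unique tokens grows far larger.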

#### Padding
For better understanding of the patterns governing human language we need some notion of memory to record the beginning of the sequence of words when we reach the end.

A transformer processes its input in batches, and every sequence in a batch must have the same length. However, real-world text comes in sequences of variable length. To accommodate this, we use a technique called padding. Padding involves extending short sequences with further tokens to match the length of the longest sequence in the batch of training examples.

For our model, we will use a type of padding known as left-padding. This involves prepending a special token (`<pad>`) to the beginning of shorter sequences until they match the length of the longest sequence in our dataset. Padding will ensure that every token in every sequence, regardless of its original length, will be used in the training process.
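
A minimal sketch of left-padding, together with the mask that marks which positions hold real tokens. It assumes the padding token has index 0; `left_pad` and `PAD_TOKEN` are illustrative names, not part of the notebook's classes.

```python
PAD_TOKEN = 0  # assumption: index 0 is reserved for the padding token

def left_pad(sequences, pad_token=PAD_TOKEN):
    """Prepend pad tokens so every sequence matches the longest one."""
    target_length = max(len(sequence) for sequence in sequences)
    return [
        [pad_token] * (target_length - len(sequence)) + sequence
        for sequence in sequences
    ]

batch = [[5, 6, 7], [8], [9, 10]]
padded = left_pad(batch)

# 1 marks a real token, 0 marks a padding position.
mask = [[0 if token == PAD_TOKEN else 1 for token in sequence] for sequence in padded]

print(padded)  # [[5, 6, 7], [0, 0, 8], [0, 9, 10]]
print(mask)    # [[1, 1, 1], [0, 0, 1], [0, 1, 1]]
```

Left-padding keeps the most recent real tokens at the end of every sequence, which is where the next token will be predicted.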

We create a class to group related functionality together for our simple tokeniser:

In [None]:
import string

class Tokeniser:
    def __init__(self):
        self.dictionary = {}
        self.reverse_dictionary = {}
        self.stopwords = string.punctuation.replace(".", "").replace("?", "")
        self.stopwords += 'â€™“”¦–\xa0¹'
        # Add the padding token
        self.__add_to_dict('<pad>')
        
        # Add the digits 0-9 to the dictionary
        for i in range(10):
            self.__add_to_dict(str(i))

        # Add space and punctuation to the dictionary
        self.__add_to_dict('.')
        self.__add_to_dict('?')
        self.__add_to_dict(' ')

        # Add characters to the dictionary as our vocabulary
        for i in range(26):
            self.__add_to_dict(chr(ord('a') + i))
        
    def __add_to_dict(self, character):
        if character not in self.dictionary:
            self.dictionary[character] = len(self.dictionary)
            self.reverse_dictionary[self.dictionary[character]] = character

    def tokenise(self, text):
        text = text.lower().replace('\n',' ').replace('–', '')
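        # Strip every character in the stop list
        for i in self.stopwords:
            text = text.replace(i, "")
        # Skip any remaining character that is not in the vocabulary, rather than raising a KeyError
        return [self.dictionary[c] for c in text if c in self.dictionary]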

    def character_to_token(self, character):
        return self.dictionary[character]

    def token_to_character(self, token):
        return self.reverse_dictionary[token]

    def size(self):
        return len(self.dictionary)

Let's see what this tokeniser produces after processing the data:

In [None]:
tokeniser = Tokeniser()

tokenised_training_data = tokeniser.tokenise(training_data)

print(tokenised_training_data[:50])

Tokenisation converts each token in the text into a unique integer identifier, or index, using a dictionary called a vocabulary. The vocabulary consists of a list of all unique tokens that the model will be trained to recognise.

In [None]:
# Show the mapping of characters to indices - the vocabulary
tokeniser.dictionary

We have two dictionaries, one whose key is represented by each character of our vocabulary, and another with keys associated with the integer values assigned to each vocabulary item. This allows us to map between the two representations.  For instance, we can convert the training data from integer values, back to the original text. We can use this approach later when generating the text character by character in the prediction step:

In [None]:
# Display the training data (reconstruct it from the indices character-by-character)
text = ""

for c in tokenised_training_data:
    text += tokeniser.reverse_dictionary.get(c, "")

print(text[:100]) 

### Input embedding
The next step is to turn tokens into a numerical representation that can be used by our machine learning model. This is performed by the embedding layer. When tokenising the text, we replaced the characters with integers, as we saw. However, to capture the relationship between different tokens, we need the embedding layer to act as a bridge between discrete tokens and a continuous vector space. A vector space is a set whose elements (called vectors) can be added together or multiplied by numbers called scalars (here, real numbers).

If our tokens are words, the embedding layer helps capture the similarity between words like "King" and "Queen" by placing their respective vectors close together in vector space. This closeness stems from the fact that "King" and "Queen" may often appear in similar contexts, i.e. surrounded by similar tokens related to being the head of state or a member of the royal court.  In short, the embedding layer transforms each token, represented as an integer, into a continuous vector in a high-dimensional space.  

The `torch.nn.Embedding` class allows us to create the embedding layer. During training, this layer's weights (representing our token embeddings) are updated to better capture the semantic relationships between different tokens.

In the code below, the variable `number_of_tokens` stores the total number of unique tokens our model can encounter in the input. This number typically equals the size of our token dictionary. The parameter `d_model` specifies the dimensionality (size) of the embedding vectors. Higher dimensions enable our model to encode more information about each token, but also increase the complexity and training time of the model:

In [None]:
import numpy as np  # used by the positional encoding and attention classes
import torch

class TokenEmbedding(torch.nn.Module):
    """
    PyTorch module that converts tokens into embeddings.

    Input dimension is: (batch_size, sequence_length)
    Output dimension is: (batch_size, sequence_length, d_model)
    """

    def __init__(self, d_model, number_of_tokens):
        super().__init__()
        self.embedding_layer = torch.nn.Embedding(
            num_embeddings=number_of_tokens,
            embedding_dim=d_model
        )

    def forward(self, x):
        return self.embedding_layer(x)

The input to `TokenEmbedding` will be a batch of sequences, with each token represented by an integer. The output is a batch of the same sequences, but with each integer replaced by a high-dimensional vector encapsulating semantic information.
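
Conceptually, an embedding layer is a lookup table with one row per token. The sketch below imitates that lookup in plain Python to show how the shapes change. It is only an illustration: `embedding_table` and `embed` are hypothetical names, and in `torch.nn.Embedding` the table is a weight matrix whose values are learned during training.

```python
import random

random.seed(0)
number_of_tokens, d_model = 5, 4

# One row per token id, initialised randomly (training would update these values).
embedding_table = [
    [random.uniform(-1, 1) for _ in range(d_model)]
    for _ in range(number_of_tokens)
]

def embed(batch):
    """Map (batch_size, sequence_length) ids to (batch_size, sequence_length, d_model)."""
    return [[embedding_table[token] for token in sequence] for sequence in batch]

batch = [[0, 3, 3], [4, 1, 0]]
embedded = embed(batch)

# batch_size, sequence_length, d_model
print(len(embedded), len(embedded[0]), len(embedded[0][0]))  # 2 3 4
```

The same token id always maps to the same vector, wherever it appears in the batch.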

### Positional encoding

Transformers work on the basis of self-attention, which computes relevance scores for the elements of the input sequence relative to each other. On its own, self-attention has no notion of sequence order: it compares every pair of tokens in the same way, regardless of where they sit in the sequence.

In natural language the order of elements is very important, otherwise we would find it difficult to communicate. Without taking into account the positional information of tokens in a language, a transformer might interpret different sentences as being equivalent if they contain similar words but in a different order. Therefore we need positional encoding to inject the concept of *sequence order* into the transformer model.

The positional encoding for a position *p* with dimension *i* in the input sequence is computed using the sine and cosine functions, as follows:

- $PE(p, 2i) = \sin\left(p / 10000^{2i / d_{model}}\right)$
- $PE(p, 2i+1) = \cos\left(p / 10000^{2i / d_{model}}\right)$

As we said, the term `d_model`, above, represents the dimensionality of the input and output vectors in our model.

Using large values for `d_model` means that each word will be represented by a larger vector, which will allow the model to capture more complex representations, but would also require more computational resources as a result. 

The *sin* and *cos* functions are applied alternately to each dimension of the positional encoding vector. The final positional encodings carry values between -1 and 1, and they have what is known as a *wavelength*, which increases with each dimension.

In short, we are using the *sin* and *cos* functions of different frequencies to generate the actual positional encoding. The patterns generated by the *sin* and *cos* functions allow the model to identify different positions and generalise to sequence lengths that it did not observe during training. 
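
As a worked example, the sketch below computes the encoding for the first few positions with a very small `d_model`. It takes *i* to be the index of each sin/cos pair, so that dimensions 2i and 2i+1 share a frequency. The function name `positional_encoding` is an illustrative choice and is separate from the class defined in this section.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even dimensions, cos on odd."""
    encoding = []
    for dimension in range(d_model):
        pair_index = dimension // 2  # the i in the formula
        angle = position / (10000 ** (2 * pair_index / d_model))
        if dimension % 2 == 0:
            encoding.append(math.sin(angle))
        else:
            encoding.append(math.cos(angle))
    return encoding

d_model = 4
for position in range(3):
    rounded = [round(value, 4) for value in positional_encoding(position, d_model)]
    print(position, rounded)
```

Position 0 always encodes to alternating 0 and 1, and the later dimensions change more slowly with position than the earlier ones, which is what the increasing wavelength means in practice.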

These positional encodings are added to the input embeddings before being processed by the transformer. This allows the transformer to learn to give more attention to adjacent words. This captures the idea that words occurring in proximity to others often have some syntactic or semantic relevance. In short, we can learn the semantics of a word, by the words that surround it. Though in our case, words are represented by the tokenised characters.

The class below creates the positional encoding, which we then add to the input embeddings:

In [None]:
class PositionalEncoding(torch.nn.Module):
    """
    Pytorch module that creates a positional encoding matrix. This matrix will later be added to the 
    transformer's input embeddings to provide a sense of position of the sequence elements.
    """

    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.d_model = d_model
        self.max_sequence_length = max_sequence_length
        self.positional_encoding = self.create_positional_encoding()

    def create_positional_encoding(self):
        """
        Creates a positional encoding matrix of size (max_sequence_length, d_model).
        """

        # Initialise positional encoding matrix
        positional_encoding = np.zeros((self.max_sequence_length, self.d_model))

        # Calculate positional encoding for each position and each dimension using the sin and cos functions.
        # i steps through the even indices (0, 2, 4, ...), so it already plays the role of "2i" in the formula.
        for pos in range(self.max_sequence_length):
            for i in range(0, self.d_model, 2):
                # Apply sin to even indices in the array; indices in Python start at 0 so i is even.
                positional_encoding[pos, i] = np.sin(pos / (10000 ** (i / self.d_model)))
                
                if i + 1 < self.d_model:
                    # Apply cos to odd indices in the array, using the same frequency as the preceding even index.
                    positional_encoding[pos, i + 1] = np.cos(pos / (10000 ** (i / self.d_model)))

        # Convert numpy array to PyTorch tensor and return it
        return torch.from_numpy(positional_encoding).float()

    def forward(self, x):
        """
        Adds the positional encoding to the input embeddings at the corresponding positions.
        """
        # Add positional encodings to input embeddings. 
        # The ":" indexing ensures we only add positional encodings up
        # to the length of the sequence in the batch. x.size(1) is the sequence length, so this 
        # is a way to make sure we are not adding extra positional encodings.
        return x + self.positional_encoding[ : x.size(1), : ]

### Masking layer
Now we come to the masking layer.  The padding tokens we used to increase the length of shorter sequences do not provide any semantic information, and so we want the model to ignore them when it is processing the input data.  This is the role of the masking layer. 

The mask consists of an array with the same length as the sequence, with ones at the positions corresponding to actual tokens and zeros at the positions containing padding tokens. Before normalising the attention scores, we apply the mask to the attention scores matrix, setting the scores for any padding positions to a very large negative number (e.g., -1e9). Using a large negative number is necessary because the scores are passed to a softmax function, which turns very large negative scores into weights that are effectively zero. This means that padding positions have no influence on the final output of the attention layer.
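
The effect of the large negative number can be checked with a few lines of plain Python. This is a sketch with made-up scores for a single row of the attention matrix; the `softmax` helper is written out by hand for illustration.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating, for numerical stability.
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]

scores = [2.0, 1.0, 0.5, 3.0]
mask = [0, 1, 1, 1]  # the first position is a padding token

# Replace the score at every masked position with a very large negative number.
masked_scores = [score if keep else -1e9 for score, keep in zip(scores, mask)]
weights = softmax(masked_scores)

print([round(weight, 4) for weight in weights])
```

The padding position receives a weight of zero, and the remaining weights still sum to one.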

### Attention layer
The attention mechanism is an important component and allows us to highlight the important elements of an input sequence to the model.  The attention mechanism calculates an attention score for each token considering all other tokens in the sequence.

Attention scores are derived using query, key, and value vectors. These vectors are generated by multiplying the input embeddings with learned matrices, and they determine the impact of one token on another. The attention scores for the tokens are calculated as the dot product between the query and key vectors. These scores are then scaled down by dividing by the square root of the query/key dimension (the head dimension) for more stable gradients.

For masked positions, the attention score will be set to a large negative number, making these positions unavailable in subsequent computations. In the `MaskedSelfAttention` class below, the operation `masked_fill(mask == 0, -1e9)` is applied to every position in `attention_weights` where the corresponding position in `mask` is equal to zero (i.e., a padding token). At this point, we replace the `attention_weights` value with `-1e9` to make sure that all padding tokens are ignored, or given no attention. The attention scores are then normalised using the softmax function. This forces them to fall in the range 0 to 1 and ensures that they sum to 1. These normalised attention scores are then multiplied with the value vectors and summed to obtain the output of the attention layer.

In summary, when the attention mechanism is presented with a sequence of tokens, it takes the query vector attributed to some specific token in the sequence and computes its score against the key of every token in the sequence. This allows it to capture how the token under consideration relates to the others in the sequence. After that, it scales the values according to the attention weights (calculated from the scores) to maintain focus on tokens relevant to the query. Finally, it produces an attention output for the token under consideration.
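
The steps summarised above can be sketched for a single query in plain Python. The vectors below are made-up values chosen for illustration; in the real model the queries, keys, and values come from learned linear layers, and the function name `attention` is an assumption of this sketch.

```python
import math

def softmax(scores):
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Score the query against every key, then scale.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is the attention-weighted sum of the value vectors.
    output = [
        sum(weight * value[d] for weight, value in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, output

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

weights, output = attention(query, keys, values)
print([round(weight, 3) for weight in weights])
print([round(value, 3) for value in output])
```

The query points in the same direction as the first key, so the tokens whose keys share that direction receive most of the weight and dominate the output.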


In [None]:
class MaskedSelfAttention(torch.nn.Module):
    """
    PyTorch module for a self attention layer.
    This layer is used in the MaskedMultiHeadedSelfAttention module.

    Input dimension is: (batch_size, sequence_length, embedding_dimension)
    Output dimension is: (batch_size, sequence_length, head_dimension)
    """

    def __init__(self, embedding_dimension, head_dimension):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.head_dimension = head_dimension
        self.query_layer = torch.nn.Linear(embedding_dimension, self.head_dimension)
        self.key_layer = torch.nn.Linear(embedding_dimension, self.head_dimension)
        self.value_layer = torch.nn.Linear(embedding_dimension, self.head_dimension)
        self.softmax = torch.nn.Softmax(dim=-1)

    def forward(self, x, mask):
        """
        Compute the self attention.

        x dimension is: (batch_size, sequence_length, embedding_dimension)
        output dimension is: (batch_size, sequence_length, head_dimension)
        mask dimension is: (batch_size, sequence_length)

        mask values are: 0 or 1. 0 means the token is masked, 1 means the token is not masked.
        """

        # x dimensions are: (batch_size, sequence_length, embedding_dimension)
        # query, key, value dimensions are: (batch_size, sequence_length, head_dimension)
        query = self.query_layer(x)
        key = self.key_layer(x)
        value = self.value_layer(x)

        # Calculate the attention weights.
        # attention_weights dimensions are: (batch_size, sequence_length, sequence_length)
        attention_weights = torch.matmul(query, key.transpose(-2, -1))

        # Scale the attention weights.
        attention_weights = attention_weights / np.sqrt(self.head_dimension)

        # Apply the mask to the attention weights, by setting the masked tokens to a very low value.
        # This will make the softmax output 0 for these values.
        mask = mask.reshape(attention_weights.shape[0], 1, attention_weights.shape[2])
        attention_weights = attention_weights.masked_fill(mask == 0, -1e9)

        # Softmax makes sure all scores are between 0 and 1 and the sum of scores is 1.
        # attention_scores dimensions are: (batch_size, sequence_length, sequence_length)
        attention_scores = self.softmax(attention_weights)

        # The attention scores are multiplied by the value
        # Values of tokens with high attention score get highlighted because they are multiplied by a larger number,
        # and tokens with low attention score get drowned out because they are multiplied by a smaller number.
        # Output dimensions are: (batch_size, sequence_length, head_dimension)
        return torch.bmm(attention_scores, value)

### Multi-head attention
The attention mechanism allows the model to focus on different parts of the input sequence when generating each token in the output sequence. Instead of having a single attention layer, we can use a group of them, each offering a different perspective on the input.

If you show a group of people a picture, and talk to them about what they notice, you will obtain different details — one might notice the colors, another the shapes in the picture, and another might comment on the composition. Each of these perspectives gives you information about the whole picture that might be missed by only having the perspective of one individual.

This idea has similarities with multi-head attention. Here we have multiple sets (or heads) of the attention mechanism, each of which independently focuses on different aspects of the input.

Each head may learn to pay attention to different positions of the input sequence and extract different types of information. For example, one attention head might learn to focus on structural aspects of the language such as its syntax, while another might learn semantic information.

The `MaskedMultiHeadedSelfAttention` class allows us to specify the embedding dimension (`embedding_dimension`) and the number of heads (`number_of_heads`):
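
The dimension bookkeeping behind multi-head attention is simple arithmetic, sketched below. Each head produces a vector of size `embedding_dimension // number_of_heads`, and the head outputs are concatenated before a final linear layer maps them back to `embedding_dimension`. The values in `head_outputs` are made up, and `concatenated_width` is a hypothetical helper.

```python
def concatenated_width(embedding_dimension, number_of_heads):
    """Width of the vector obtained by concatenating all head outputs."""
    head_dimension = embedding_dimension // number_of_heads
    return number_of_heads * head_dimension

# Pretend outputs for one token from each of two heads (made-up values).
head_outputs = [
    [0.1, 0.2, 0.3, 0.4],  # head 0
    [0.5, 0.6, 0.7, 0.8],  # head 1
]
concatenated = [value for head in head_outputs for value in head]

print(len(concatenated))          # 8
print(concatenated_width(8, 2))   # 8
print(concatenated_width(10, 3))  # 9
```

When the embedding dimension is not divisible by the number of heads, the concatenated width is smaller than the embedding dimension, so it is usual to choose values that divide evenly.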

In [None]:
class MaskedMultiHeadedSelfAttention(torch.nn.Module):
    """
    Pytorch module for a multi head attention layer.

    Input dimension is: (batch_size, sequence_length, embedding_dimension)
    Output dimension is: (batch_size, sequence_length, embedding_dimension)
    """

    def __init__(self, embedding_dimension, number_of_heads):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.head_dimension = embedding_dimension // number_of_heads
        self.number_of_heads = number_of_heads

        # Create the self attention modules
        self.self_attentions = torch.nn.ModuleList(
            [MaskedSelfAttention(embedding_dimension, self.head_dimension) for _ in range(number_of_heads)])

        # Create a linear layer to combine the outputs of the self attention modules
        self.output_layer = torch.nn.Linear(number_of_heads * self.head_dimension, embedding_dimension)

    def forward(self, x, mask):
        """
        Compute the multi head attention.

        x dimensions are: (batch_size, sequence_length, embedding_dimension)
        mask dimensions are: (batch_size, sequence_length)
        mask values are: 0 or 1. 0 means the token is masked, 1 means the token is not masked.
        """
        # Compute the self attention for each head
        # self_attention_outputs is a list of number_of_heads tensors, each with dimensions:
        # (batch_size, sequence_length, head_dimension)
        self_attention_outputs = [self_attention(x, mask) for self_attention in self.self_attentions]

        # Concatenate the self attention outputs along the last dimension
        # concatenated_self_attention_outputs dimensions are:
        # (batch_size, sequence_length, number_of_heads * head_dimension)
        concatenated_self_attention_outputs = torch.cat(self_attention_outputs, dim=2)

        # Apply the output layer to the concatenated self attention outputs
        # output dimensions are: (batch_size, sequence_length, embedding_dimension)
        return self.output_layer(concatenated_self_attention_outputs)

### The decoder layer
Our model also consists of several decoder layers. The first decoder takes the output of the positional encoding layer as input. Each further decoder then takes as input the output of the previous decoder.

The model therefore uses the output of previous layers to inform its understanding of the current layer. Each layer learns to represent different features of the input. The higher-level layers can capture more complex or abstract features, which are composed of simpler features captured by the lower-level layers.  So, in short, it is the stacking of multiple decoder layers, which enables the model to capture more complex patterns and understand deeper contextual relationships among the data.

The final layer is the language model head, which outputs the probabilities of the next tokens. We will discuss it later.

In the case of an autoregressive model like GPT-2, each decoder layer is composed of a self-attention mechanism and a feed-forward neural network. The self-attention mechanism allows the model to weigh the importance of different tokens in the input when predicting the next token, while the feed-forward network applies a further non-linear transformation to each token's representation independently, refining the features produced by the attention step.
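
Each of these two sub-layers is wrapped in layer normalisation and a residual connection. The sketch below shows that wrapping pattern in plain Python for a single token vector. It is simplified: `torch.nn.LayerNorm` also has a learned scale and shift, and `halve` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import math

def layer_norm(vector, epsilon=1e-5):
    """Normalise a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vector) / len(vector)
    variance = sum((value - mean) ** 2 for value in vector) / len(vector)
    return [(value - mean) / math.sqrt(variance + epsilon) for value in vector]

def residual_block(x, sublayer):
    """Normalise first, apply the sub-layer, then add the original input back."""
    transformed = sublayer(layer_norm(x))
    return [original + update for original, update in zip(x, transformed)]

def halve(vector):
    # Hypothetical sub-layer standing in for attention or the feed-forward network.
    return [0.5 * value for value in vector]

x = [1.0, 2.0, 3.0, 4.0]
normalised = layer_norm(x)
output = residual_block(x, halve)

print([round(value, 4) for value in normalised])
print([round(value, 4) for value in output])
```

Because the input is added back unchanged, each sub-layer only has to learn an adjustment to its input, which tends to make deep stacks easier to train.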

The decoder layer is shown below, where we make use of the `DecoderLayer` and `DecoderStack` classes. The class `DecoderLayer` encapsulates the processes executed within a single decoder layer, while the `DecoderStack` class handles the stacking of multiple `DecoderLayers`:

In [None]:
class DecoderLayer(torch.nn.Module):
    """
    PyTorch module for a decoder layer.

    A decoder layer consists of a multi-headed self attention layer, a feed forward layer, layer normalisation and dropout.

    Input dimension is: (batch_size, sequence_length, embedding_dimension)
    Output dimension is: (batch_size, sequence_length, embedding_dimension)
    """

    def __init__(
            self,
            embedding_dimension,
            number_of_heads,
            feed_forward_dimension,
            dropout_rate
    ):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.number_of_heads = number_of_heads
        self.feed_forward_dimension = feed_forward_dimension
        self.dropout_rate = dropout_rate

        self.multi_headed_self_attention = MaskedMultiHeadedSelfAttention(embedding_dimension, number_of_heads)
        self.feed_forward = FeedForward(embedding_dimension, feed_forward_dimension)
        self.dropout = torch.nn.Dropout(dropout_rate)
        self.layer_normalisation_1 = torch.nn.LayerNorm(embedding_dimension)
        self.layer_normalisation_2 = torch.nn.LayerNorm(embedding_dimension)

    def forward(self, x, mask):
        """
        Compute the decoder layer.

        x dimensions are: (batch_size, sequence_length, embedding_dimension)
        mask dimensions are: (batch_size, sequence_length)
        mask values are: 0 or 1. 0 means the token is masked, 1 means the token is not masked.
        """

        # Layer normalization 1
        normalised_x = self.layer_normalisation_1(x)

        # Multi headed self attention
        attention_output = self.multi_headed_self_attention(normalised_x, mask)

        # Residual output
        residual_output = x + attention_output

        # Layer normalization 2
        normalised_residual_output = self.layer_normalisation_2(residual_output)

        # Feed forward
        feed_forward_output = self.feed_forward(normalised_residual_output)

        # Dropout, only when training.
        if self.training:
            feed_forward_output = self.dropout(feed_forward_output)

        # Residual output
        return residual_output + feed_forward_output
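
To see what the two `LayerNorm` modules in `DecoderLayer` do to their input, the minimal sketch below applies `torch.nn.LayerNorm` to a random tensor. The tensor sizes are arbitrary values chosen for illustration and are not taken from the model above.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
batch_size, sequence_length, embedding_dimension = 2, 5, 8

# Random activations with a non-zero mean and a large spread
x = torch.randn(batch_size, sequence_length, embedding_dimension) * 3.0 + 5.0

layer_normalisation = torch.nn.LayerNorm(embedding_dimension)
normalised_x = layer_normalisation(x)

# Each token's embedding vector is normalised independently over the last dimension
per_token_mean = normalised_x.mean(dim=-1)
per_token_variance = normalised_x.var(dim=-1, unbiased=False)

print(normalised_x.shape)
print(per_token_mean.abs().max().item())   # close to 0
print(per_token_variance.mean().item())    # close to 1
```

The shape is unchanged, which is why the normalised tensor can be passed straight into the attention and feed forward sub-layers.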

The `DecoderStack` is composed of several decoder layers in sequence. The number of layers is set by the `number_of_layers` argument that we pass when instantiating the class:

In [None]:
class DecoderStack(torch.nn.Module):
    """
    Pytorch module for a stack of decoders.
    """

    def __init__(
            self,
            embedding_dimension,
            number_of_layers,
            number_of_heads,
            feed_forward_dimension,
            dropout_rate,
            max_sequence_length
    ):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.number_of_layers = number_of_layers
        self.number_of_heads = number_of_heads
        self.feed_forward_dimension = feed_forward_dimension
        self.dropout_rate = dropout_rate
        self.max_sequence_length = max_sequence_length

        # Create the decoder layers (stored under the attribute name encoder_layers)
        self.encoder_layers = torch.nn.ModuleList(
            [DecoderLayer(embedding_dimension, number_of_heads, feed_forward_dimension, dropout_rate) for _ in
             range(number_of_layers)])

    def forward(self, x, mask):
        decoder_outputs = x
        for decoder_layer in self.encoder_layers:
            decoder_outputs = decoder_layer(decoder_outputs, mask)

        return decoder_outputs

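The feed forward layer is composed of two fully connected (linear) layers with a *ReLU* activation function applied between them. The first layer expands each token's vector from `embedding_dimension` to `feed_forward_dimension`, and the second projects it back again. The non-linearity in between enables the model to learn the complex patterns and relationships present in the data: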

In [None]:
class FeedForward(torch.nn.Module):
    """
    Pytorch module for a feed forward layer.

    A feed forward layer is two fully connected layers with a ReLU activation function in between.
    """

    def __init__(self, embedding_dimension, feed_forward_dimension):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.feed_forward_dimension = feed_forward_dimension
        self.linear_1 = torch.nn.Linear(embedding_dimension, feed_forward_dimension)
        self.linear_2 = torch.nn.Linear(feed_forward_dimension, embedding_dimension)

    def forward(self, x):
        """
        Compute the feed forward layer.
        """
        return self.linear_2(torch.relu(self.linear_1(x)))
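
One property of this layer that is easy to miss is that it acts on each position of the sequence separately. The sketch below checks this with a stand-in network built from `torch.nn.Sequential`. The sizes are hypothetical, and the stand-in mirrors the structure of `FeedForward` rather than reusing the class itself.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
embedding_dimension, feed_forward_dimension = 8, 32

# Stand-in for the FeedForward module above: linear -> ReLU -> linear
feed_forward = torch.nn.Sequential(
    torch.nn.Linear(embedding_dimension, feed_forward_dimension),
    torch.nn.ReLU(),
    torch.nn.Linear(feed_forward_dimension, embedding_dimension),
)

# (batch_size, sequence_length, embedding_dimension)
x = torch.randn(2, 5, embedding_dimension)

with torch.no_grad():
    full_output = feed_forward(x)
    # Apply the same network to a single position on its own
    single_position_output = feed_forward(x[:, 3, :])

# The output at position 3 does not depend on the other positions
same_result = torch.allclose(full_output[:, 3, :], single_position_output, atol=1e-5)
print(full_output.shape, same_result)
```

Any mixing of information between positions therefore has to come from the attention sub-layer.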

### Language Model
The `LanguageModel` class, below, brings together all the layers we have created so far, including the token embedding, positional encoding, normalisation, and our stack of decoder layers. After training our first model, we may revisit this to tune the parameters, for instance by adjusting the number of decoder layers (`number_of_layers`) and the number of attention heads (`number_of_heads`).  We also add functions for saving and loading our trained model later:

In [None]:
class LanguageModel(torch.nn.Module):
    """
    Pytorch module for a language model.
    """

    def __init__(
            self,
            number_of_tokens,  # The number of tokens in the vocabulary
            max_sequence_length=512,  # The maximum sequence length to use for attention
            embedding_dimension=512,  # The dimension of the token embeddings
            number_of_layers=6,  # The number of decoder layers to use
            number_of_heads=4,  # The number of attention heads to use
            feed_forward_dimension=None,  # The dimension of the feed forward layer
            dropout_rate=0.1  # The dropout rate to use
    ):
        super().__init__()
        self.number_of_tokens = number_of_tokens
        self.max_sequence_length = max_sequence_length
        self.embedding_dimension = embedding_dimension
        self.number_of_layers = number_of_layers
        self.number_of_heads = number_of_heads

        if feed_forward_dimension is None:
            # GPT-2 paper uses 4 * embedding_dimension for the feed forward dimension
            self.feed_forward_dimension = embedding_dimension * 4
        else:
            self.feed_forward_dimension = feed_forward_dimension

        self.dropout_rate = dropout_rate

        # Create the token embedding layer
        self.token_embedding = TokenEmbedding(embedding_dimension, number_of_tokens)

        # Create the positional encoding layer
        self.positional_encoding = PositionalEncoding(embedding_dimension, max_sequence_length)

        # Create the normalization layer
        self.layer_normalisation = torch.nn.LayerNorm(embedding_dimension)

        # Create the decoder stack
        self.decoder = DecoderStack(
            embedding_dimension=embedding_dimension,
            number_of_layers=number_of_layers,
            number_of_heads=number_of_heads,
            feed_forward_dimension=self.feed_forward_dimension,
            dropout_rate=dropout_rate,
            max_sequence_length=max_sequence_length
        )

        # Create the language model head
        self.lm_head = LMHead(embedding_dimension, number_of_tokens)

    def forward(self, x, mask):
        # Compute the token embeddings
        # token_embeddings dimensions are: (batch_size, sequence_length, embedding_dimension)
        token_embeddings = self.token_embedding(x)

        # Compute the positional encoding
        # positional_encoding dimensions are: (batch_size, sequence_length, embedding_dimension)
        positional_encoding = self.positional_encoding(token_embeddings)

        # Post embedding layer normalization
        positional_encoding_normalised = self.layer_normalisation(positional_encoding)

        decoder_outputs = self.decoder(positional_encoding_normalised, mask)
        lm_head_outputs = self.lm_head(decoder_outputs)

        return lm_head_outputs
    
    def save_checkpoint(self, path):
        print(f'Saving checkpoint {path}')
        torch.save({
            'number_of_tokens': self.number_of_tokens,
            'max_sequence_length': self.max_sequence_length,
            'embedding_dimension': self.embedding_dimension,
            'number_of_layers': self.number_of_layers,
            'number_of_heads': self.number_of_heads,
            'feed_forward_dimension': self.feed_forward_dimension,
            'dropout_rate': self.dropout_rate,
            'model_state_dict': self.state_dict()
        }, path)
    
    @staticmethod
    def load_checkpoint(path) -> 'LanguageModel':
        checkpoint = torch.load(path)
        model = LanguageModel(
            number_of_tokens=checkpoint['number_of_tokens'],
            max_sequence_length=checkpoint['max_sequence_length'],
            embedding_dimension=checkpoint['embedding_dimension'],
            number_of_layers=checkpoint['number_of_layers'],
            number_of_heads=checkpoint['number_of_heads'],
            feed_forward_dimension=checkpoint['feed_forward_dimension'],
            dropout_rate=checkpoint['dropout_rate']
        )
        model.load_state_dict(checkpoint['model_state_dict'])
        return model


### Language model head

The `LanguageModel` class concludes with the `LMHead` layer. This layer is essentially a linear transformation that maps the output of the decoder stack from the embedding dimension to the size of the token vocabulary.

Recall that we want to assign a probability to each token in the vocabulary given the preceding context. This is the main role of `LMHead`: it maps each embedding vector to a vector with one entry per token in our vocabulary, which gives an unnormalised score for each of the tokens. These scores are then passed through a softmax function to convert them into probabilities.

So `LMHead` is the main component responsible for transforming the output of the decoder stack into the likelihood of each token being the next token in the sequence. The `LMHead` is implemented as a subclass of PyTorch's `torch.nn.Module`. It uses a basic type of neural network layer that applies a linear transformation to the input, mapping the decoder's output dimension to the number of tokens in the vocabulary:

In [None]:
class LMHead(torch.nn.Module):
    """
    Pytorch module for the language model head.
    The language model head is a linear layer that maps the embedding dimension to the vocabulary size.
    """

    def __init__(self, embedding_dimension, number_of_tokens):
        super().__init__()
        self.embedding_dimension = embedding_dimension
        self.number_of_tokens = number_of_tokens
        self.linear = torch.nn.Linear(embedding_dimension, number_of_tokens)

    def forward(self, x):
        """
        Compute the language model head.

        x dimensions are: (batch_size, sequence_length, embedding_dimension)
        output dimensions are: (batch_size, sequence_length, number_of_tokens)
        """
        # Compute the linear layer
        # linear_output dimensions are: (batch_size, sequence_length, number_of_tokens)
        linear_output = self.linear(x)

        return linear_output
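
The step from scores to probabilities can be tried in isolation. This is a minimal sketch using a plain `torch.nn.Linear` layer as a stand-in for `LMHead` and random values in place of real decoder outputs. All sizes are hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
batch_size, sequence_length, embedding_dimension, number_of_tokens = 2, 5, 8, 11

# Stand-in for the decoder stack output
decoder_outputs = torch.randn(batch_size, sequence_length, embedding_dimension)

# The same kind of layer that LMHead wraps
linear = torch.nn.Linear(embedding_dimension, number_of_tokens)

with torch.no_grad():
    logits = linear(decoder_outputs)                # unnormalised scores
    probabilities = torch.softmax(logits, dim=-1)   # one distribution per position

print(logits.shape)
print(probabilities.sum(dim=-1))  # every entry is (approximately) 1
```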

### Autoregressive wrapper

Recall that our model will take an initial prompt given by us to generate the next word. It will then take the sequence composed of our initial prompt plus the last predicted word to generate the next word, and so on.

To allow our language model to generate text one token at a time, we need an autoregressive model. An autoregressive model does exactly what we have described: it takes the output from previous steps and feeds it in as the input to subsequent steps.

To achieve this we will use the `AutoregressiveWrapper`. During training, the autoregressive wrapper takes a sequence of tokens as input, where the sequence length is one token more than the maximum sequence length allowed. The extra token lets the wrapper split the sequence into an input (all but the last token) and a target (all but the first token), so that the target at each position is the token that follows the input at that position.
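
The split into input and target is a pair of slices. The toy example below uses arbitrary token ids to show the one-step shift that the wrapper's `forward` method applies.

```python
import torch

# A toy batch holding one sequence of token ids (values are arbitrary)
x = torch.tensor([[10, 11, 12, 13, 14]])

# The input drops the last token; the target drops the first token
inp, target = x[:, :-1], x[:, 1:]

print(inp)     # tensor([[10, 11, 12, 13]])
print(target)  # tensor([[11, 12, 13, 14]])
```

At every position the target holds the token that comes next in the original sequence.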

The `AutoregressiveWrapper` class also includes a method for calculating the probabilities of the next token in the sequence. It generates these probabilities by applying a softmax function to the unnormalised scores output by the model for each token in the vocabulary, associated with the last token in the sequence. 

The `temperature` parameter, in the code below, is used to adjust the sharpness of the probability distribution. Lower temperature values sharpen the distribution, increasing the likelihood that the most probable token is chosen. With higher values for `temperature`, the distribution flattens and the choice of token becomes more random:

In [None]:
class AutoregressiveWrapper(torch.nn.Module):
    """
    Pytorch module that wraps a GPT model and makes it autoregressive.
    """

    def __init__(self, gpt_model):
        super().__init__()
        self.model = gpt_model
        self.max_sequence_length = self.model.max_sequence_length

    def forward(self, x, mask):
        """
        Autoregressive forward pass
        """
        inp, target = x[:, :-1], x[:, 1:]
        mask = mask[:, :-1]

        output = self.model(inp, mask)
        return output, target

    def next_token_probabilities(self, x, mask, temperature=1.0):
        """
        Calculate the token probabilities for the next token in the sequence.
        """
        logits = self.model(x, mask)[:, -1]

        # Apply the temperature
        if temperature != 1.0:
            logits = logits / temperature

        # Apply the softmax
        probabilities = torch.softmax(logits, dim=-1)

        return probabilities

    def save_checkpoint(self, path):
        self.model.save_checkpoint(path)

    @staticmethod
    def load_checkpoint(path) -> 'AutoregressiveWrapper':
        model = LanguageModel.load_checkpoint(path)
        return AutoregressiveWrapper(model)
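
The effect of `temperature` is easiest to see on a small example. The sketch below applies the same divide-then-softmax operation as `next_token_probabilities` to a toy vector of scores. The scores and the helper name `probabilities_at` are made up for illustration.

```python
import torch

# Toy unnormalised scores for a vocabulary of four tokens (arbitrary values)
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

def probabilities_at(temperature):
    # Same operation as next_token_probabilities: divide, then softmax
    return torch.softmax(logits / temperature, dim=-1)

cold = probabilities_at(0.5)     # sharper: more mass on the top token
neutral = probabilities_at(1.0)  # the unmodified distribution
hot = probabilities_at(2.0)      # flatter: closer to uniform

for name, p in [('0.5', cold), ('1.0', neutral), ('2.0', hot)]:
    print('temperature', name, [round(v, 3) for v in p.tolist()])
```

The ranking of the tokens never changes. Only the amount of probability given to the less likely tokens does.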

### Trainer
Now we have a model set up with all the necessary layers of the architecture. We can now begin the training phase. The `Trainer` is a helper class that loops over the epochs and shuffles the data at the start of each epoch. We do this to prevent the batches from being the same each time, which would cause the model to overfit to these specific batches.

To get started, we pass in the model, tokeniser, and chosen optimiser to our `Trainer` class. We also configure the loss function.

Next we create batches of sequences and their respective masks. The batch size is the number of sequences from the training data that we process together in each forward pass through the model (the final batch may be smaller). A larger batch size gives a smoother, less noisy estimate of the gradient at each step, but it also means we will need more memory to process each batch.

We then train for several epochs, where an epoch represents a complete pass over the whole training data. For each epoch we shuffle the sequences at random, and begin to create several batches of data. This involves creating the input and mask tensors for the batch. Remember that padding tokens are masked, meaning they will not be considered in the attention step.
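
As a small illustration of the batching and masking described above, the sketch below builds batches from three toy sequences. The value `pad_token = 0` is a hypothetical token id. In the notebook it is looked up with `tokeniser.character_to_token('<pad>')`.

```python
import torch

# Hypothetical token id for '<pad>'; in the notebook it comes from the tokeniser
pad_token = 0

# Three toy sequences, left-padded to the same length
sequences = [
    [0, 0, 5, 7, 9],
    [0, 4, 6, 8, 3],
    [2, 4, 6, 8, 1],
]

batch_size = 2
batches = []
for i in range(0, len(sequences), batch_size):
    sequence_tensor = torch.tensor(sequences[i: i + batch_size], dtype=torch.long)

    # 1 marks a real token, 0 marks a padding token
    mask_tensor = torch.ones_like(sequence_tensor)
    mask_tensor[sequence_tensor == pad_token] = 0

    batches.append((sequence_tensor, mask_tensor))

for sequence_tensor, mask_tensor in batches:
    print(sequence_tensor.shape, mask_tensor.tolist())
```

With three sequences and a batch size of two, the second batch holds a single sequence.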

Next we do a forward pass through the model with a batch. This means we let the model make predictions using the given data. The model predictions are then compared to the target value, which is the sequence shifted by one step so that the next token can be seen. 

For each batch we run the model on the input and mask tensors to compute the output, and then compute the loss from the model output and the target. The model outputs an unnormalised score for every token in the vocabulary at each position. The cross-entropy loss function converts these scores into probabilities and compares them with the correct next token. If the model assigns a low probability to the correct token, then we have a higher loss value.

Once we know the loss, we can backpropagate it and clip the gradients to prevent the issue of exploding gradients. We can then let the optimiser update the model parameters by stepping against the gradient, which is the direction that reduces the loss. In essence, we are calculating the direction in which the weights should be adjusted in order to improve the prediction of the model. If the training is going well, we should see the loss reduce over time.
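
The backpropagate, clip and update steps can be tried on a tiny stand-in model. In this sketch the model is a single `torch.nn.Linear` layer and the data is random. Only the order of the calls and the clipping threshold of `0.5` mirror the `Trainer` code.

```python
import torch

torch.manual_seed(0)

# A tiny stand-in model and optimiser (not the language model above)
model = torch.nn.Linear(4, 1)
optimiser = torch.optim.Adam(model.parameters(), lr=0.0001)

# Toy data, scaled up so that the gradients are large
x = torch.randn(16, 4) * 100.0
y = torch.randn(16, 1)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# clip_grad_norm_ rescales the gradients in place and returns the norm before clipping
max_norm = 0.5
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

weights_before = model.weight.detach().clone()
optimiser.step()       # update the weights using the clipped gradients
optimiser.zero_grad()  # clear the gradients before the next batch

print(norm_before.item(), norm_after.item())
print((model.weight.detach() - weights_before).abs().max().item())
```

The gradient norm after clipping is at most `max_norm`, however large it was before.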

Once the parameters have been updated, we reset the gradients so that the gradients from the previous batch are not used in the next step. Lastly, we append the loss to the list of losses, so that the average loss can be computed for each epoch, and we store these average loss values for plotting later, before moving on to the next batch:

In [None]:
class Trainer:
    def __init__(self, model, tokeniser: Tokeniser, optimiser=None):
        super().__init__()
        self.model = model
        if optimiser is None:
            self.optimiser = torch.optim.Adam(model.parameters(), lr=0.0001)
        else:
            self.optimiser = optimiser
        self.tokeniser = tokeniser
        self.loss_function = torch.nn.CrossEntropyLoss()

    def train(self, data: list[list[int]], epochs, batch_size):
        loss_per_epoch = []
        for epoch in range(epochs):
            losses = []

            # Shuffle the sequences
            random.shuffle(data)

            # Create batches of sequences and their respective mask.
            batches = []
            for i in range(0, len(data), batch_size):
                sequence_tensor = torch.tensor(data[i: i + batch_size], dtype=torch.long)

                # Create the mask tensor for the batch, where 1 means the token is not a padding token
                mask_tensor = torch.ones_like(sequence_tensor)
                mask_tensor[sequence_tensor == self.tokeniser.character_to_token('<pad>')] = 0

                batches.append((sequence_tensor, mask_tensor))

            # Train the model on each batch
            for batch in batches:
                self.model.train()

                # Use the batch tensors directly: each sequence already holds
                # max_sequence_length + 1 tokens, and the final batch may contain
                # fewer than batch_size sequences.
                input_tensor, mask_tensor = batch

                # Compute the model output
                model_output, target = self.model.forward(
                    x=input_tensor,
                    mask=mask_tensor
                )
                # Compute the losses
                # The loss is computed on the model output and the target
                loss = self.loss_function(model_output.transpose(1, 2), target)

                # Backpropagate the loss.
                loss.backward()

                # Clip the gradients. This is used to prevent exploding gradients.
                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 0.5)

                # Update the model parameters. The optimiser steps against the gradient to reduce the loss.
                self.optimiser.step()

                # Reset the gradients. This is done so that the gradients from the previous batch
                # are not used in the next step.
                self.optimiser.zero_grad()

                # Append the loss to the list of losses, so that the average loss can be computed for this epoch.
                losses.append(loss.item())

            # Print the loss
            epoch_loss = np.average(losses)
            loss_per_epoch.append(epoch_loss)
            print('Epoch:', epoch, 'Loss:', epoch_loss)

        return loss_per_epoch
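
The call to `transpose(1, 2)` in the loss computation is there because `torch.nn.CrossEntropyLoss` expects the class dimension to come second. The sketch below uses random stand-ins for the model output and target, with hypothetical sizes, and compares the transposed form with an equivalent flattened form.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
batch_size, sequence_length, number_of_tokens = 2, 5, 11

# Stand-ins for the model output (scores) and the shifted target tokens
model_output = torch.randn(batch_size, sequence_length, number_of_tokens)
target = torch.randint(0, number_of_tokens, (batch_size, sequence_length))

loss_function = torch.nn.CrossEntropyLoss()

# CrossEntropyLoss expects the class dimension second: (batch, classes, sequence)
loss = loss_function(model_output.transpose(1, 2), target)

# Equivalent: flatten the batch and sequence dimensions into one
flat_loss = loss_function(model_output.reshape(-1, number_of_tokens), target.reshape(-1))

print(loss.item(), flat_loss.item())
```

Both forms average the per-token loss over every position in the batch, so the two values agree.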

We will create two more helper functions: one to create the input sequences up to a certain maximum length, and one to tokenise and pad the training data. This will help to keep the training code simpler later on, and we can also use it in the generation step after training:

In [None]:
def create_training_sequences(max_sequence_length, tokenised_training_data):
    # Create sequences of length max_sequence_length + 1
    # The last token of each sequence is the target token
    sequences = []
    for i in range(0, len(tokenised_training_data) - max_sequence_length - 1):
        sequences.append(tokenised_training_data[i: i + max_sequence_length + 1])
    return sequences

def tokenise_and_pad_training_data(max_sequence_length, tokeniser, training_data):
    # Tokenise the training data
    tokenised_training_data = tokeniser.tokenise(training_data)
    for _ in range(max_sequence_length):
        # Prepend padding tokens
        tokenised_training_data.insert(0, tokeniser.character_to_token('<pad>'))
    return tokenised_training_data
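
To make the padding and windowing concrete, the toy example below restates the logic of the two helpers on a short list of token ids. The value `pad_token = 0` and the token ids are hypothetical. The loop bound is copied from `create_training_sequences`.

```python
# Toy re-statement of the two helpers above on a short list of token ids.
# pad_token = 0 is a hypothetical id; the notebook looks it up in the tokeniser.
pad_token = 0
max_sequence_length = 3
tokenised = [5, 6, 7, 8, 9]

# Prepend max_sequence_length padding tokens
padded = [pad_token] * max_sequence_length + tokenised

# Slide a window of max_sequence_length + 1 tokens over the padded data
sequences = []
for i in range(0, len(padded) - max_sequence_length - 1):
    sequences.append(padded[i: i + max_sequence_length + 1])

for sequence in sequences:
    print(sequence[:-1], '->', sequence[1:])
```

This produces the windows `[0, 0, 0, 5]`, `[0, 0, 5, 6]`, `[0, 5, 6, 7]` and `[5, 6, 7, 8]`. With this loop bound the window ending at the very last token (`9`) is not generated.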

### Model training
We will now load our training data again and then move onto the code for kickstarting the training of the actual model.  Training can take a long time with these models, so you can restrict the training set for now so that it runs within a reasonable amount of time. 

*Note*: this could still take a few hours! You can interrupt the code execution from the *Kernel* menu if you wish to go back and tweak the parameters, for example if training takes too long or is too computationally expensive for your hardware.

You can also tweak the amount of training data to use below (`character_limit`), which can also help reduce training time, but will impact the performance of the model predictions (not that the results will be great):

In [None]:
import numpy as np
import random

character_limit = len(training_data)
training_data = training_data[:character_limit]

# Set up the tokeniser and other parameters
tokeniser = Tokeniser()

# We can tweak the dimensions of the embedding to speed up (small values) or 
# improve training (large values -> 256, 512).
embedding_dimension = 256

max_sequence_length = 512 # You can also try smaller sequence lengths, e.g. 20

number_of_tokens = tokeniser.size()

# Create the model: you can play with these network configuration parameters to see how the model performs
model = AutoregressiveWrapper(
    LanguageModel(
        embedding_dimension=embedding_dimension,
        number_of_tokens=number_of_tokens,
        number_of_heads=4,
        number_of_layers=3,
        dropout_rate=0.1,
        max_sequence_length=max_sequence_length
    )
)

tokenised_and_padded_training_data = tokenise_and_pad_training_data(max_sequence_length, tokeniser, training_data)

sequences = create_training_sequences(max_sequence_length, tokenised_and_padded_training_data)

# Train the model
optimiser = torch.optim.Adam(model.parameters(), lr=0.0001)

# Set up our Trainer class and pass in the model, tokeniser, and optimiser to be used.
trainer = Trainer(model, tokeniser, optimiser)

# Start training, and return the loss for each epoch to plot later
loss_per_epoch = trainer.train(sequences, epochs=2, batch_size=8)

### Evaluate
Once training is complete we can plot the loss to see how it decreases over time. The plot uses a log scale on the y-axis, so that the smaller variations in the loss towards the end of the training phase remain visible.

In [None]:
# Plot the loss per epoch in log scale
from matplotlib import pyplot as plt

plt.plot(loss_per_epoch)

plt.yscale('log')

plt.ylabel('Loss')
plt.xlabel('Epoch')

plt.show()

### Predict
Now the model has been trained, we wish to see if it actually learnt to generate coherent words from the character sequences it has learnt (our tokens). We write another helper class for generating text, which takes the trained model and the tokeniser as parameters. The `generate` function is then used to pass the length of the sequence we wish to generate and a starting prompt in the form of a word or phrase.

We switch the model from *training* mode to *eval* mode (`self.model.eval()`). In the eval mode the model will not apply dropout. 

As you can see from the code comments, the approach is similar to training. We need to convert our prompt to tokens and produce a mask, and then ask the model for the next token. We keep doing this until we have generated a sequence of tokens up to our maximum setting.

Each prediction of the next token from the model is stored and returned at the end.  In other words, this is done by auto-regressively generating new tokens and adding them to the input sequence. After a token is added we run the new input sequence with the extra token through the model again, and we append a further predicted token. 

We continue until the maximum number of characters we wanted to generate is reached, or until we have generated the `eos_token` (end of sequence token). This is a token that can be custom defined as an indication to stop generating new tokens. For example, we could set this to the full stop character (`.`), but we will leave this blank for now:

In [None]:
# Pads a sequence on the left with a specified padding token until it reaches a desired final length
def pad_left(sequence, final_length, padding_token):
    return [padding_token] * (final_length - len(sequence)) + sequence

# Generator class for autoregressive token-by-token sequence generation
class Generator:
    def __init__(self, model, tokeniser):
        self.model = model              # Trained language model for generation
        self.tokeniser = tokeniser      # Tokeniser to convert between text and tokens

    def generate(
            self,
            max_tokens_to_generate: int,  # Maximum number of tokens to generate
            prompt: str = None,           # Optional prompt string to condition the generation
            temperature: float = 1.0,     # Sampling temperature (higher = more random)
            eos_token: int = None,        # Optional end-of-sequence token to stop early
            padding_token: int = 0):      # Token used for left-padding inputs

        self.model.eval()  # Set the model to evaluation mode (disables dropout etc.)

        # Convert the prompt into a list of tokens, or start with padding if no prompt is provided
        if prompt is None:
            start_tokens = [padding_token]  # padding_token is already a token id
        else:
            start_tokens = self.tokeniser.tokenise(prompt)

        # Pad the token sequence to fit model input size
        input_tensor = torch.tensor(
            pad_left(
                sequence=start_tokens,
                final_length=self.model.max_sequence_length + 1,
                padding_token=padding_token
            ),
            dtype=torch.long
        )

        # Ensure input tensor is 2D (batch of 1)
        if len(input_tensor.shape) == 1:
            input_tensor = input_tensor[None, :]

        out = input_tensor  # Initialise output with the input prompt

        for _ in range(max_tokens_to_generate):
            # Only use the most recent max_sequence_length tokens as input
            x = out[:, -self.model.max_sequence_length:]

            # Create attention mask: 1 for real tokens, 0 for padding
            mask = torch.ones_like(x)
            mask[x == padding_token] = 0

            # Use model to get probabilities for the next token
            next_token_probabilities = self.model.next_token_probabilities(
                x=x,
                temperature=temperature,
                mask=mask
            )

            # Sample one token according to the predicted distribution
            next_token = torch.multinomial(next_token_probabilities, num_samples=1)

            # Append the sampled token to the current sequence
            out = torch.cat([out, next_token], dim=1)

            # Stop generation if end-of-sequence token is generated
            if eos_token is not None and next_token.item() == eos_token:
                break

        # Convert generated token IDs back into characters and join into a string
        generated_tokens = out[0].tolist()
        return ''.join([self.tokeniser.token_to_character(token) for token in generated_tokens])
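
The sampling step in `generate` relies on `torch.multinomial`, which draws token ids in proportion to their probability. The sketch below uses a made-up distribution over four tokens and compares sampling with a greedy `argmax` choice. The greedy choice is shown only for contrast and is not used in the `Generator` above.

```python
import torch

torch.manual_seed(0)

# A toy next-token distribution, shape (batch_size=1, number_of_tokens=4)
next_token_probabilities = torch.tensor([[0.1, 0.6, 0.2, 0.1]])

# Sampling, as in generate(): tokens are drawn in proportion to their probability
sampled = [
    torch.multinomial(next_token_probabilities, num_samples=1).item()
    for _ in range(1000)
]
counts = torch.bincount(torch.tensor(sampled), minlength=4)

# Greedy alternative (not used above): always take the most probable token
greedy = torch.argmax(next_token_probabilities, dim=-1).item()

print(counts.tolist())
print(greedy)  # 1
```

Sampling lets less likely tokens appear some of the time, which is what makes repeated runs of the same prompt produce different text.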


Below, we create a `prompt` function to make re-running the generator simpler. Inside this function, we instantiate the `Generator`. The prompt we give is converted to tokens, and then padded so it has the correct sequence length. We want to generate a sequence of characters up to the value specified in the variable `max_tokens_to_generate`, so you can tweak this as desired:

In [None]:
def prompt(model, tokeniser, prompt_text, max_tokens_to_generate = 30):
    
    generator = Generator(model, tokeniser)
    
    generated_text = generator.generate(
        max_tokens_to_generate=max_tokens_to_generate,
        prompt=prompt_text,
        padding_token=tokeniser.character_to_token('<pad>')
    )

    print(generated_text.replace('<pad>', ''))


In [None]:
prompt(model, tokeniser, "a spring")

prompt(model, tokeniser, "a spider")

prompt(model, tokeniser, "a small bear")

prompt(model, tokeniser, "blue sky")

prompt(model, tokeniser, "August blue sky")

### Saving and loading the model
Once you have trained the model, it is useful if you can save it, so you do not have to spend the time training a new model.  We have added the save and load functionality already, so we can simply call the `save_checkpoint` method and pass in a filename for the trained model:

In [None]:
model.save_checkpoint('trained_model')

And to load our pre-trained model we can do this:

In [None]:
model = AutoregressiveWrapper.load_checkpoint('trained_model')

And to check it ran successfully, we can provide another prompt and generate text as before:

In [None]:
prompt(model, tokeniser, "August")
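
The checkpoint pattern used by `save_checkpoint` and `load_checkpoint` stores the constructor arguments alongside the weights, so that the model can be rebuilt before its weights are restored. The sketch below shows the same round trip on a tiny stand-in model. The file name `toy_checkpoint` and the sizes are made up for illustration.

```python
import os
import tempfile

import torch

torch.manual_seed(0)

# A tiny stand-in model (not the language model above)
embedding_dimension, number_of_tokens = 8, 11
model = torch.nn.Linear(embedding_dimension, number_of_tokens)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'toy_checkpoint')

    # Save the constructor arguments next to the weights
    torch.save({
        'embedding_dimension': embedding_dimension,
        'number_of_tokens': number_of_tokens,
        'model_state_dict': model.state_dict(),
    }, path)

    # Rebuild the model from the stored arguments, then restore the weights
    checkpoint = torch.load(path)
    restored = torch.nn.Linear(checkpoint['embedding_dimension'], checkpoint['number_of_tokens'])
    restored.load_state_dict(checkpoint['model_state_dict'])

same_weights = torch.equal(model.weight, restored.weight)
print(same_weights)
```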

We have implemented a text generator to generate text from a prompt using a transformer model commonly used in Large Language Models like ChatGPT. As you can see even with very little training, the model can generate whole words from characters.  Another thing you will note is the training time involved in training these models.  For best performance you need a very large dataset to produce human-like responses, which will take a very large amount of computational resources and time. The performance of our model is not great. There are other things we can do to improve the model. 

As we stack multiple layers in our model, each subsequent layer may be able to capture more and more complex concepts. The first layer might understand similar words for colours, like red and black. Higher layers might recognise that these colour words play a specific role in a sentence, such as communicating what things look like. And with even higher layers, the model may start to comprehend more complex semantic relationships, such as colours being associated with feelings, and so forth.

Although we are not likely to have the computational power to generate a much larger transformer model to obtain better results, let's experiment with the pre-trained LLM GPT2 for comparison.

### GPT2
Now that we have learnt about transformers and how we construct them, we will look at a pre-trained model, the GPT-2 model, and generate text based on an input sequence used as the text prompt as before. This exercise will cement some of the concepts we have covered on transformers, so you can compare our small model, with one trained on a much larger data set.  

GPT-2 is the successor to GPT, the original NLP framework by OpenAI. The complete GPT-2 model has approximately 1.5 billion parameters. The model was trained on data from 8 million web pages collected from outbound links from Reddit.

### Model architecture
The language model uses a transformer-based architecture. Unlike the encoder-decoder design described earlier, GPT-2 is a decoder-only model: it has no separate encoder, only a stack of decoder layers. Its key components, most of which we have seen before, are:

- *Input embedding*: The input text is converted to numerical representations that the model can work with. An embedding layer maps each token in the input sequence to a high-dimensional vector, and a second, positional embedding records where in the sequence each token sits.

- *Decoder layers*: GPT-2 consists of multiple identical decoder layers stacked on top of each other. Each layer has two sub-layers: a self-attention mechanism and a feed-forward network. The self-attention mechanism allows the model to weigh the importance of the different tokens in the input sequence, thereby capturing the dependencies and relationships between them. The feed-forward network processes the self-attention outputs to generate more complex representations.

- *Masked (causal) attention*: The self-attention in each decoder layer is masked so that every position can attend only to itself and to previous tokens. This conditioning on the preceding context is what enables autoregressive generation: the model predicts the next token in the sequence based on the tokens it has seen so far.

- *Output layer*: The final layer of GPT-2 is a linear transformation followed by a softmax activation function. This layer produces a probability distribution over the vocabulary for the next token in the sequence. It allows the model to generate text by sampling from the distribution, or by choosing the token with the highest probability.
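
To make the role of masking in self-attention more concrete, here is a minimal single-head sketch in PyTorch. It is not GPT-2's actual implementation: the function name `causal_self_attention`, the random projection matrices and the tiny sizes are all assumptions chosen purely for illustration. It shows only how a lower-triangular mask stops each position from attending to later tokens.

```python
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked self-attention over x of shape [seq_len, d_model]."""
    # Project the inputs into queries, keys and values
    q, k, v = x @ w_q, x @ w_k, x @ w_v

    # Scaled dot-product scores between every pair of positions
    scores = (q @ k.T) / math.sqrt(k.shape[-1])

    # Lower-triangular mask: position i may only look at positions 0..i
    seq_len = x.shape[0]
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))

    # Softmax turns each row of scores into attention weights that sum to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy example with made-up sizes and random (untrained) weights
torch.manual_seed(0)
seq_len, d_model = 4, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(weights)  # entries above the diagonal are zero
```

Each row of `weights` corresponds to one position in the sequence. The first row has a single non-zero entry because the first token has nothing earlier to attend to.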

We will first install some additional tools in the form of the `pytorch-transformers` library (an earlier release of what is now Hugging Face's `transformers` package) to bring in the GPT-2 tokeniser and language model head modules:

In [None]:
!pip install pytorch-transformers

We can now import the GPT-2 tokeniser and the language model head. Recall that GPT-2 tokenises text by sub-word, so we will import the tokeniser and use it to tokenise the text prompt before presenting it to the model for prediction:

In [None]:
import torch

from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel
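
As a rough illustration of what sub-word tokenisation means, the sketch below applies a handful of byte pair encoding (BPE) merges to single words. GPT-2's tokeniser is built on BPE, but it works on bytes and uses a large merge table learned from data. The function `bpe_encode` and the three merges here are hypothetical and exist only to show the mechanism.

```python
def bpe_encode(word, merges):
    """Split a word into sub-word units by applying BPE merges in priority order."""
    symbols = list(word)  # start from individual characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the adjacent pair (left, right) wherever it occurs
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table, ordered from highest to lowest priority
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_encode("lower", toy_merges))   # ['low', 'er']
print(bpe_encode("lowest", toy_merges))  # ['low', 'e', 's', 't']
```

Frequent character sequences end up as single tokens, while rarer words fall back to smaller pieces, so the vocabulary stays a manageable size without any word being out of reach.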

We now set up the tokeniser and model to be used for prediction:

In [None]:
tokeniser = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

We can print the model instance to obtain more information about its architecture. It differs slightly from what we have covered so far, but you will recognise some of the main layers, e.g. the embedding layers, the normalisation layers (note there is more than one), and the attention layer. Note the dimensionality of the layers:

In [None]:
print(model)
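
The layer dimensions in the printout can be turned into a rough parameter count. The short calculation below is a sketch under stated assumptions: the configuration values (vocabulary size, context length, hidden size, number of layers) are what I understand the `gpt2` checkpoint to use and are not taken from the text above, so check them against your own `print(model)` output.

```python
# Assumed configuration of the smallest GPT-2 checkpoint ("gpt2").
# Verify these values against the output of print(model).
vocab_size = 50257   # number of sub-word tokens
n_positions = 1024   # maximum sequence length
d_model = 768        # embedding / hidden size
n_layers = 12        # number of stacked decoder layers
d_ff = 4 * d_model   # width of the feed-forward hidden layer

def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # one scale and one shift vector

# Token embeddings plus positional embeddings
embeddings = vocab_size * d_model + n_positions * d_model

# Attention: joint query/key/value projection, then the output projection
attention = linear(d_model, 3 * d_model) + linear(d_model, d_model)

# Feed-forward network: expand to d_ff, then project back to d_model
feed_forward = linear(d_model, d_ff) + linear(d_ff, d_model)

per_layer = 2 * layer_norm + attention + feed_forward

# Assumes the output layer shares its weights with the token embedding,
# so it contributes no extra parameters; the last term is the final layer norm.
total = embeddings + n_layers * per_layer + layer_norm

print(f"per decoder layer: {per_layer:,}")  # 7,087,872
print(f"total parameters:  {total:,}")      # 124,439,808
```

Under these assumptions the checkpoint has roughly 124 million parameters, which puts the 1.5 billion of the complete GPT-2 model in perspective.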

To see how a model trained on such a large dataset of web pages linked from Reddit performs, we can give it a text prompt, as we have done before, to get it started:

In [None]:
# Number of new tokens to generate beyond the initial prompt
num_tokens_to_generate = 60

# Define the initial text prompt
prompt_text = "Once upon a time"

# Convert the prompt into a list of token IDs using the tokeniser
indexed_tokens = tokeniser.encode(prompt_text)

# Iteratively generate new tokens
for i in range(num_tokens_to_generate):
    # Convert the current list of token IDs into a tensor of shape [1, sequence_length]
    tokens_tensor = torch.tensor([indexed_tokens])
    
    # Disable gradient computation for inference (saves memory and speeds up computation)
    with torch.no_grad():
        outputs = model(tokens_tensor)  # Get the model’s output predictions for the next token

        predictions = outputs[0]  # Extract the raw logits from the output tuple
        # Get the most probable next token (argmax over the vocabulary at the last position)
        predicted_index = torch.argmax(predictions[0, -1, :]).item()

        # Append the predicted token to the sequence
        indexed_tokens = indexed_tokens + [predicted_index]

# Decode the final sequence of token IDs back into text
# (the last predicted token is already in indexed_tokens, so it is not appended again)
predicted_text = tokeniser.decode(indexed_tokens)

# Print the generated continuation of the prompt
print(predicted_text)


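Obviously, this is not quite the children's story we were expecting, and not the best performance. Part of the reason is that we always pick the single most probable token, which tends to produce repetitive text, and that `gpt2` is the smallest GPT-2 checkpoint. Larger models, trained on more data for longer, do better.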
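
One common alternative to always taking the most probable token is to sample from the predicted distribution instead. The sketch below is one possible way to do that with temperature scaling and top-k filtering. The helper name `sample_next_token` and the default values for `temperature` and `top_k` are assumptions made for illustration, not settings taken from this notebook.

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=40):
    """Sample a token ID from 1-D logits using temperature and top-k filtering."""
    # Keep only the k highest-scoring tokens
    top_values, top_indices = torch.topk(logits, k=min(top_k, logits.numel()))

    # A temperature below 1 sharpens the distribution, above 1 flattens it
    probs = torch.softmax(top_values / temperature, dim=-1)

    # Draw one token at random according to those probabilities
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()

# Toy logits over a five-token vocabulary (made-up values)
torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
samples = [sample_next_token(logits, top_k=3) for _ in range(20)]
print(samples)  # only token IDs 0, 1 and 2 can appear
```

To try it in the generation loop, the `torch.argmax(...)` line could be replaced with `predicted_index = sample_next_token(predictions[0, -1, :])`. The output then varies from run to run, and setting `top_k=1` recovers the greedy behaviour.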

### What have we learnt?
We built a tiny character-level transformer and saw, in miniature, how the core mechanics behind GPT-style language models work. Even a handful of training steps taught the network to turn random characters into whole words, demonstrating that the self-attention mechanism can pick up local patterns almost immediately. We also observed, however, that quality scales steeply with both data and parameters - without billions of tokens and millions of weights the model soon plateaus, generating short, repetitive fragments rather than coherent prose.

Stacking attention layers is central to the transformer’s power. Lower layers latch onto surface similarities: for example, colours such as *red* and *black* cluster together because they often occupy the same syntactic slots. Mid-level layers start to track roles, recognising that colour words typically modify nouns and convey appearance. Only in the uppermost layers do broader associations emerge, for instance linking colours to emotions or style. Each layer therefore builds on the abstractions of the one beneath it, turning raw character sequences into ever richer semantic representations.

We also saw the practical trade-offs of training large models. Bigger architectures demand far more computation and memory, and the cost of additional layers or training tokens soon outstrips what a single workstation can supply. To appreciate the performance ceiling we switched to GPT-2, a pre-trained transformer whose largest version has 1.5 billion parameters and a vast web-scale corpus behind it. Comparing its output with that of our toy model hopefully reinforces the link between model capacity, data volume, and the fluency of the generated text. Achieving state-of-the-art results requires huge computational power, large volumes of data, and a lot of training time (potentially months).