# Building a GPT model

In [2]:

# Get data using pywget
!python -m wget -o input/tinyshakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt


Saved under input/tinyshakespeare.txt


In [7]:
# Read and inspect text file
with open('./input/tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    texts = f.read()

# Check length of characters
print('Total character length: {0}\n\n'.format(len(texts)))

# Example texts
print('First 500 characters:\n\n{0}\n\n'.format(texts[:500]))

# Unique characters
chars = sorted(list(set(texts)))
print('Unique characters: {0}'.format(''.join(chars)))
print('Total unique characters: {0} characters'.format(len(chars)))


Total character length: 1115394


First 500 characters:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor


Unique characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Total unique characters: 65 characters


## 1: Tokenization
I will be using a simple character-to-index tokenizer.<br/>
More complex GPTs use tokenizers that work at the word or sub-word level.<br/>
Tokenizer packages we could use for that include `tiktoken` from OpenAI or `SentencePiece` from Google.

In [8]:
# To keep it simple, encoding and decoding are done at the character level
str_to_idx = {ch:i for i,ch in enumerate(chars)}
idx_to_str = {i:ch for i,ch in enumerate(chars)}

# encode & decode function
encode = lambda s: [str_to_idx[ch] for ch in s]
decode = lambda idx: "".join([idx_to_str[i] for i in idx])

# Example
print('Encoded Words: {0}'.format(encode("Hello World!")))
print('Decoded Words: {0}'.format(decode(encode("Hello World!"))))

Encoded Words: [20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42, 2]
Decoded Words: Hello World!


In [10]:
import torch
print('CUDA Availability: {0}'.format(torch.cuda.is_available()))

data = torch.tensor(encode(texts), dtype=torch.long)
print('Shape: {0}\nData Type:{1}'.format(data.shape, data.dtype))
print('First 500 encoded characters:\n\n{0}\n\n'.format(data[:500]))

CUDA Availability: True
Shape: torch.Size([1115394])
Data Type:torch.int64
First 500 encoded characters:

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59

## 2: Explaining using a Bigram Language Model
Check out `no-self-attenton-simple-bigram.py` to see the process of training a GPT using a simple language model.<br /><br />

A bigram language model is a statistical language model that predicts the probability of a token in a sequence based only on the previous token. It considers pairs of consecutive tokens (bigrams) and estimates the likelihood of a specific token given the one before it. The classic description uses words as tokens; in this notebook the tokens are characters.<br /><br />

This model has no self-attention, so it cannot capture dependencies between a token and anything earlier than the token immediately before it.<br /><br />

Example: Input -> "The fox jumps over the lazy ", Prediction -> "The fox jumps over the lazy `dog`"<br />
Here, the input is first encoded into token indices and then converted into embedding vectors by an Embedding layer.<br />
Depending on the level of granularity, the model can predict at the level of characters, sub-words, or words.<br /><br />
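
To make the idea concrete, here is a minimal sketch of a character-level bigram model in PyTorch. It is written in the style of Andrej Karpathy's lecture and is not necessarily the code in the script mentioned above; the class name, the batch shapes, and the random placeholder batch are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    """Each token looks up the scores for the next token directly from a table."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the unnormalized scores (logits) for "which token follows token i".
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            # A bigram model only uses the last token to predict the next one.
            probs = F.softmax(logits[:, -1, :], dim=-1)
            next_idx = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_idx], dim=1)
        return idx


torch.manual_seed(0)
vocab_size = 65  # the number of unique characters found in the dataset
model = BigramLanguageModel(vocab_size)

# Placeholder batch of random token indices, shape (batch, time).
xb = torch.randint(0, vocab_size, (4, 8))
yb = torch.randint(0, vocab_size, (4, 8))
logits, loss = model(xb, yb)

# Start from token 0 (the newline character) and sample 10 more tokens.
generated = model.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=10)
print(logits.shape, loss.item(), generated.shape)
```

Before training, the sampled text is random noise. Training adjusts the rows of the table so that likely next characters receive higher scores.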

Shout out to [Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy/featured) for the indescribable introduction to GPT.

## 3: Attention is all you need
Check out `decoder-transformer.py` to see the process of training a GPT using a larger language model.<br /><br />

### Positional Encoding
Before we get into attention, we need to address a specific issue in our model: it has no information about the position of tokens in a sequence.
Transformers do not inherently have a sense of order, since they process the entire sequence in parallel.
To solve this, we use a second embedding table that encodes the position of each token in the input sequence.
As a result, the Transformer can capture sequential information and relationships between different positions in the input sequence, making it suitable for tasks like language translation and language modeling.
This is called `positional encoding`.

Then we add the token embeddings and the positional encodings together, producing an input that carries both what each token is and where it sits in the sequence.
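
A minimal sketch of this step, assuming a learned position table (one `nn.Embedding` row per position). The sizes `block_size` and `n_embd` are illustrative values, not taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 8, 32  # illustrative sizes

token_embedding = nn.Embedding(vocab_size, n_embd)      # what each token is
position_embedding = nn.Embedding(block_size, n_embd)   # where each token sits

idx = torch.randint(0, vocab_size, (4, block_size))     # (B, T) token indices
B, T = idx.shape

tok_emb = token_embedding(idx)                  # (B, T, n_embd)
pos_emb = position_embedding(torch.arange(T))   # (T, n_embd)
x = tok_emb + pos_emb                           # pos_emb is broadcast over the batch
print(x.shape)
```

The same position vector is added to every sequence in the batch, so two identical characters at different positions end up with different input vectors.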

### Attention
We need a way for the model to take the past history into account.
We also need a way to gauge how relevant each earlier token (within the current context window) is to predicting the next token.

This is where the `Self-Attention` head comes into play.
Scaled Dot-Product Attention, introduced in [Attention Is All You Need](https://arxiv.org/abs/1706.03762), meets both requirements.
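
The following is a rough sketch of one masked self-attention head using scaled dot-product attention. The class name `Head` and the sizes are assumptions for illustration; the lower-triangular mask is what stops a position from looking at future tokens.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class Head(nn.Module):
    """One head of masked (causal) self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular matrix: position t may attend to positions <= t only.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)         # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])   # (B, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)   # how much each past token matters
        return weights @ v, weights


torch.manual_seed(0)
head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(2, 8, 32)   # (B, T, n_embd), e.g. token plus position embeddings
out, weights = head(x)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1 and is zero above the diagonal, so the output at a position is a weighted average of the values at that position and earlier ones.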

However, a single self-attention head is not enough to extract the intricate relationships between tokens.
Using multiple self-attention heads in parallel lets the model jointly attend to information from different representation subspaces at different positions, which a single head cannot do.
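
One possible way to sketch multi-head attention is to compute all heads in a single batched operation and then project the concatenated result back to the embedding size. The class below is an illustrative assumption, not the code from the accompanying script.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    """Several masked self-attention heads computed in parallel."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # queries, keys, values at once
        self.proj = nn.Linear(n_embd, n_embd)                  # mixes the heads back together
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        head_size = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=-1)
        # Reshape (B, T, C) -> (B, n_head, T, head_size) so each head works in its own subspace.
        q = q.view(B, T, self.n_head, head_size).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_size).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(head_size)   # (B, n_head, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                                         # (B, n_head, T, head_size)
        out = out.transpose(1, 2).contiguous().view(B, T, C)      # concatenate the heads
        return self.proj(out)


torch.manual_seed(0)
mha = MultiHeadAttention(n_embd=32, n_head=4, block_size=8)
x = torch.randn(2, 8, 32)
out = mha(x)
print(out.shape)
```

With `n_embd=32` and `n_head=4`, each head attends within its own 8-dimensional slice, and the final projection lets the heads exchange what they found.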

### Model
The transformer block now looks like this:

- LayerNorm
- MultiHeadAttention (Masked inputs, which makes it a decoder)
- Add(x)
- LayerNorm
- FeedForward
- Add(x)
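
The list above could be sketched as follows. To keep the example short it uses PyTorch's built-in `nn.MultiheadAttention` with a causal mask rather than a hand-written attention module; the feed-forward width of `4 * n_embd` follows the ratio used in the paper, and the remaining names and sizes are assumptions.

```python
import torch
import torch.nn as nn


class Block(nn.Module):
    """Transformer decoder block: LayerNorm -> attention -> Add, LayerNorm -> feed-forward -> Add."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        T = x.shape[1]
        # Boolean mask: True marks future positions that must NOT be attended to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                  # Add(x): first residual connection
        x = x + self.ffwd(self.ln2(x))    # Add(x): second residual connection
        return x


torch.manual_seed(0)
block = Block(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)   # (B, T, n_embd)
y = block(x)
print(y.shape)
```

The block keeps the input shape unchanged, which is what allows several blocks to be stacked one after another.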

Note: LayerNorm is used instead of BatchNorm in NLP tasks. Layer normalization normalizes each token's vector across its features, whereas batch normalization normalizes each feature across the batch dimension. In this case the features are the embedding dimensions of each token. It keeps activations on a consistent scale and is intended to reduce internal covariate shift (the distribution of a layer's inputs shifting as the parameters of earlier layers change during training).
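
A small check of what LayerNorm does to a batch of token vectors. The input shape and values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# (batch, time, features) with a deliberately shifted mean and a large scale.
x = torch.randn(4, 8, 32) * 3 + 5

layer_norm = nn.LayerNorm(32)   # normalizes over the last (feature) dimension
y = layer_norm(x)

# Every token vector is normalized on its own, independent of the rest of the batch.
per_token_mean = y.mean(dim=-1)
per_token_std = y.std(dim=-1, unbiased=False)
print(per_token_mean.abs().max().item(), per_token_std.mean().item())
```

Each token ends up with a mean close to 0 and a standard deviation close to 1 across its features, regardless of batch size.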

### Scaling Up
To scale up the model and reduce the validation loss further, we can repeat the transformer block multiple times. However, a larger model is more prone to `overfitting`. To counter this, we can use `regularization` techniques such as Dropout.<br /><br />
Dropout deactivates neurons during training by setting their outputs to zero. The effect is loosely similar to the masking in the decoder transformer blocks, which hides certain inputs so that the predictions cannot rely on them. The difference is that Dropout turns off a percentage of the neurons at random positions, which reduces the chance of neurons co-adapting to one another.
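
A quick illustration of how `nn.Dropout` behaves in training and evaluation mode. The dropout rate of 0.2 is an arbitrary value chosen for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.2)
x = torch.ones(2, 10)

dropout.train()
# Each entry is zeroed with probability 0.2; survivors are scaled by 1 / (1 - 0.2).
train_out = dropout(x)

dropout.eval()
# At inference time dropout is switched off and the input passes through unchanged.
eval_out = dropout(x)

print(train_out)
print(eval_out)
```

The rescaling of the surviving activations keeps their expected sum the same in both modes, so nothing else in the model has to change between training and inference.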

### Full Model
- Add(TokenEmbedding(x), PositionalEmbedding(x))
- multiple TransformerBlocks
- LayerNorm
- Linear
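
Putting the pieces together, the list above might look roughly like this. It is a compact sketch under stated assumptions, not the contents of the accompanying script: it relies on PyTorch's built-in `nn.MultiheadAttention`, and all hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """Pre-norm decoder block with masked self-attention, feed-forward, and dropout."""

    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x):
        T = x.shape[1]
        # True marks future positions that may not be attended to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.ffwd(self.ln2(x))


class GPTLanguageModel(nn.Module):
    def __init__(self, vocab_size, block_size, n_embd, n_head, n_layer, dropout=0.0):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.position_embedding(pos)  # Add(token, position)
        x = self.blocks(x)                                            # multiple TransformerBlocks
        logits = self.lm_head(self.ln_f(x))                           # LayerNorm, then Linear
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss


torch.manual_seed(0)
model = GPTLanguageModel(vocab_size=65, block_size=8, n_embd=32, n_head=4, n_layer=2)
xb = torch.randint(0, 65, (4, 8))   # placeholder batch of token indices
yb = torch.randint(0, 65, (4, 8))   # placeholder targets
logits, loss = model(xb, yb)
print(logits.shape, loss.item())
```

The final linear layer maps each position's vector back to one score per character in the vocabulary, which is what the cross-entropy loss compares against the targets.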

### Final
In the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762), there is both an encoder and a decoder transformer block. The main difference between them is whether the multi-head self-attention blocks mask their inputs.

Our model does not use the encoder transformer block, because the paper uses it specifically for language translation. There, the encoder reads the text in the source language and learns its intricate details, such as the relationships between tokens and their positions. Since it may look at the whole input at once, it does not need masked inputs.

The decoder, however, has to learn how to produce the translation in the target language one token at a time. So it masks its inputs, and the model is required to predict each token without being able to see future tokens.

Finally, the architecture proposed in the paper includes a component called the cross-attention head. Language translation requires the model to understand how to translate from x to y, so it needs a way to combine the encoder's output with what the decoder has generated so far. Self-attention focuses on relationships within a single token sequence, whereas cross-attention captures relationships and dependencies between two different sequences. In a way, cross-attention lets the model assimilate information from multiple sources of data effectively.
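
For comparison with self-attention, here is a hedged sketch of a single cross-attention head. The queries come from the decoder side, while the keys and values come from the encoder output. The class name, the tensor names, and the sizes are assumptions for illustration; the model in this notebook does not use cross-attention.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionHead(nn.Module):
    """One head of cross-attention between a decoder sequence and an encoder sequence."""

    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, decoder_x, encoder_out):
        q = self.query(decoder_x)     # (B, T_dec, head_size): what the decoder is looking for
        k = self.key(encoder_out)     # (B, T_enc, head_size): what the source offers
        v = self.value(encoder_out)   # (B, T_enc, head_size)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])   # (B, T_dec, T_enc)
        # No causal mask here: every decoder position may see the whole source sequence.
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights


torch.manual_seed(0)
head = CrossAttentionHead(n_embd=32, head_size=16)
encoder_out = torch.randn(2, 7, 32)   # e.g. 7 source-language tokens
decoder_x = torch.randn(2, 5, 32)     # e.g. 5 target-language tokens so far
out, weights = head(decoder_x, encoder_out)
print(out.shape, weights.shape)
```

The attention matrix is rectangular (target length by source length), which reflects that the two sequences can differ in length.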