## Building GPT from scratch, with notes


Andrej Karpathy, a rock star in the world of LLMs, made a [video](https://www.youtube.com/watch?v=kCc8FmEb1nY) about a year ago walking through how to build an attention-based transformer. It follows the seminal 2017 paper "Attention Is All You Need", which really kicked off the wave of innovation that has led us to this crazy era of LLMs.

The goal of this tutorial is self-learning: understanding how a transformer is built and trained, which will help me fine-tune models and understand the deeper nuances needed to make the right choices.

Also, I may just end up using a different data set than the Shakespeare text to train this model.

Ultimately, the goal is to fine-tune a model on a custom code repository, so we will get into fine-tuning algorithms too, like QLoRA. Anyway, I'm getting ahead of myself. Here we go.

I'll put the timestamp of the video in code comments where he says something noteworthy, or sometimes just as checkpoints in this notebook.

Omer
1.15.24

In [5]:
# manual step alert! I downloaded the full corpus of Khalil Gibran, saved it as a text file and cleaned it up a little; it lives at ../data/khalil.txt
# I downloaded it from https://archive.org/stream/the-complete-works-of-khalil-gibran/The%20complete%20works%20of%20Khalil%20Gibran_djvu.txt
# the following downloads the file in html format, yuck
#!wget https://archive.org/stream/the-complete-works-of-khalil-gibran/The%20complete%20works%20of%20Khalil%20Gibran_djvu.txt

In [14]:
# read it in and inspect
with open('../data/khalil.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [15]:
print("length of the data in characters: ", len(text))

length of the data in characters:  914761


In [16]:
print(text[:1000])

A TEAR AND A SMILE 


The Creation 
( = = C) 


The God separated a spirit from Himself and fashioned it into Beauty. He 
showered upon her all the blessings of gracefulness and kindness. He gave her 
the cup of happiness and said, “Drink not from this cup unless you forget the 
past and the future, for happiness is naught but the moment.” And He also gave 
her a cup of sorrow and said, “Drink from this cup and you will understand the 
meaning of the fleeting instants of the joy of life, for sorrow ever abounds.” 

And the God bestowed upon her a love that would desert he forever upon her 
first sigh of earthly satisfaction, and a sweetness that would vanish with her first 
awareness of flattery. 

And He gave her wisdom from heaven to lead to the all-righteous path, and 
placed in the depth of her heart and eye that sees the unseen, and created in he an 
affection and goodness toward all things. He dressed her with raiment of hopes 
spun by the angels of heaven from the sinews of the 

In [22]:
# identify all unique characters used in this corpus, since we are going to make a character based transformer
chars = sorted(list(set(text)))
vocabulary_size = len(chars)
print(''.join(chars))
print(vocabulary_size)


 !'()*+,-.0123459:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz|~©»é—‘’“”
87


We will now tokenize the text; here, each character is treated as a token. This converts characters, which computers don't understand, into numbers, which they do. Each token will have an essence and a meaning associated with it, but more on this later.

This is different from GPT, which uses sub-words as tokens. A great explanation of sub-word tokens, and why they work better than either full words or single characters as tokens, is in [this](https://www.superdatascience.com/podcast/subword-tokenization-with-byte-pair-encoding) short clip by Jon Krohn.
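
The clip linked above is about byte pair encoding (BPE). To get a feel for the idea, here is a minimal toy sketch of the core loop: start from characters, find the most frequent adjacent pair, and merge it into one token. This is not how tiktoken or sentencepiece are implemented; the helper names `most_frequent_pair` and `merge_pair`, the toy string, and the choice of 3 merges are all my own assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair of tokens that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# hypothetical toy corpus, chosen so that some pairs repeat
toy_text = "related unrelated relate"
tokens = list(toy_text)          # start at character level
for _ in range(3):               # 3 merges is an arbitrary choice
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)

print(tokens)
```

Each merge shortens the sequence while the joined tokens still spell out the original text, which is roughly the trade-off sub-word tokenizers make: a bigger vocabulary in exchange for shorter sequences.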

In [34]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder takes a string, and outputs a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder takes a list of int and returns a string 
                                                 #(the join combines the array of strings to make it a string)

print(encode("good boy!"))
print(decode(encode("good boy!")))


[57, 65, 65, 54, 1, 52, 65, 75, 2]
good boy!
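
One property of this kind of mapping is worth checking: decoding an encoded string should give back the same string, and any character that never appeared in the corpus cannot be encoded at all. Below is a small self-contained sketch of both behaviors. It uses a short stand-in string instead of the Gibran file and plain functions instead of lambdas, so the integer ids will not match the ones above.

```python
# stand-in corpus (an assumption for this sketch, not the real khalil.txt)
sample_text = "the cup of happiness"

chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

def encode(s):
    """Turn a string into a list of integer ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

# round trip: decode(encode(s)) should return s unchanged
round_trip_ok = decode(encode("happiness")) == "happiness"
print(round_trip_ok)

# 'z' never appears in sample_text, so the lookup in stoi fails
try:
    encode("zebra")
    unknown_char_rejected = False
except KeyError as err:
    unknown_char_rejected = True
    print("character not in vocabulary:", err)
```

The same limitation should apply to the real vocabulary: judging by the character list printed earlier, it appears to contain no capital X, so encoding a string with an X in it would presumably raise a `KeyError` too.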


We just made a character-level tokenizer! There are many others:
- Google uses [sentencepiece](https://github.com/google/sentencepiece)
- OpenAI uses [tiktoken](https://github.com/openai/tiktoken)

Both are sub-word tokenizers. For example, "related" may not be a token, but "re", "la", "ted" could be tokens. Yes, these pieces make no sense to humans, but when LLMs see the sub-words joined with each other, they can derive semantic meaning. For example, if we put the sub-word "un" in front of "related", the LLM would be able to see that "unrelated" and "related" are connected, and given the "essence" of what it knows about "un", it would assume that the full word "unrelated" is the opposite of whatever follows "un", which is "related". This will make more sense when we talk about stage 1 of the tokenizer.

As an example, let's try OpenAI's sub-word tokenizer...

In [44]:
import tiktoken
enc = tiktoken.get_encoding('gpt2')
print("the number of all the subtokens in gpt2 are {}".format(enc.n_vocab))
print("good boy encodes to {} in gpt2".format(enc.encode("good boy!")))
print("11274 decodes to {} in gpt2".format(enc.decode([11274])))
print("2933 decodes to {} in gpt2".format(enc.decode([2933])))
print("0 decodes to {} in gpt2".format(enc.decode([0])))

the number of all the subtokens in gpt2 are 50257
good boy encodes to [11274, 2933, 0] in gpt2
11274 decodes to good in gpt2
2933 decodes to  boy in gpt2
0 decodes to ! in gpt2


We will now encode the entire text data set. We will use the PyTorch library, specifically tensors.
Tensors are multi-dimensional, highly efficient arrays whose elements all share the same data type. We could build multi-dimensional arrays by nesting lists within lists, but that is far less efficient than a tensor. We will see why **multi-dimensional** is so important soon, in the transformer stages.
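
To make "multi-dimensional" a bit more concrete, here is a small sketch with made-up numbers (not the corpus) showing a nested Python list turned into a 2-D tensor, an operation along one dimension, and a flat tensor viewed as a grid. The variable names are just for illustration.

```python
import torch

# a nested Python list: 2 rows, 3 columns
nested = [[1, 2, 3], [4, 5, 6]]

# the same values as a tensor with a single dtype and a known shape
t = torch.tensor(nested, dtype=torch.long)
print(t.shape)   # 2 rows by 3 columns
print(t.dtype)   # torch.long is an alias for 64-bit integers

# operations can run along a chosen dimension without Python loops
col_sums = t.sum(dim=0)   # sum down each column
print(col_sums)

# a flat 1-D tensor can be viewed as a 2-D grid without copying the data
flat = torch.arange(12)
grid = flat.view(3, 4)    # 3 rows of 4 consecutive values
print(grid)
```

The encoded text here starts out as a flat 1-D tensor like `flat`; arranging values into rows and columns like `grid` is the kind of reshaping that the multi-dimensional part makes possible.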

In [55]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print("the shape of this encoded tensor with all of the text data is {}".format(data.shape))
print("by the way, the shape of the original data was {} and since this is character encoding, i.e.\
 1:1 mapping of the character to the number, this actually makes sense".format(len(text)))
print("the data type of this encoded tensor is {}".format(data.dtype))
print("here is a sample of the first 50 characters:\n{}".format(data[:50]))

the shape of this encoded tensor with all of the text data is torch.Size([914761])
by the way, the shape of the original data was 914761 and since this is character encoding, i.e. 1:1 mapping of the character to the number, this actually makes sense
the data type of this encoded tensor is torch.int64
here is a sample of the first 50 characters:
tensor([25,  1, 44, 29, 25, 42,  1, 25, 38, 28,  1, 25,  1, 43, 37, 33, 36, 29,
         1,  0,  0,  0, 44, 58, 55,  1, 27, 68, 55, 51, 70, 59, 65, 64,  1,  0,
         4,  1, 21,  1, 21,  1, 27,  5,  1,  0,  0,  0, 44, 58])
