## building GPT from scratch, with notes


Andrej Karpathy, a rock star in the world of LLMs, made a [video](https://www.youtube.com/watch?v=kCc8FmEb1nY) about a year ago walking through how to build a transformer based on the seminal 2017 paper "Attention Is All You Need", which kicked off the wave of innovation that has led us to this crazy era of LLMs.

The goal of this tutorial is self-learning: understanding how a transformer is built and trained, which will help me fine-tune models and understand the deeper nuances needed to make the right choices.

Also I may just end up using a different data set than Shakespeare text to train this model.

Ultimately, the goal is to fine-tune a model on a custom code repository, so we will get into fine-tuning algorithms too, like QLoRA. Anyway, I'm getting ahead of myself. Here we go.

I'll put the timestamp of the video in code comments where he says something noteworthy, or sometimes just as checkpoints in this notebook.

Omer
1.15.24

In [1]:
# manual step alert! I downloaded, created a text file and cleaned it up a little for the full corpus of Khalil Gibran under ../data/khalil.txt
# I downloaded it from https://archive.org/stream/the-complete-works-of-khalil-gibran/The%20complete%20works%20of%20Khalil%20Gibran_djvu.txt
# the following downloads the file in html format, yuck
#!wget https://archive.org/stream/the-complete-works-of-khalil-gibran/The%20complete%20works%20of%20Khalil%20Gibran_djvu.txt

In [2]:
# read it in and inspect
with open('../data/khalil.txt', 'r', encoding='utf-8') as file:
    text = file.read()

In [3]:
print("length of the data in characters: ", len(text))

length of the data in characters:  914761


In [4]:
print(text[:1000])

A TEAR AND A SMILE 


The Creation 
( = = C) 


The God separated a spirit from Himself and fashioned it into Beauty. He 
showered upon her all the blessings of gracefulness and kindness. He gave her 
the cup of happiness and said, “Drink not from this cup unless you forget the 
past and the future, for happiness is naught but the moment.” And He also gave 
her a cup of sorrow and said, “Drink from this cup and you will understand the 
meaning of the fleeting instants of the joy of life, for sorrow ever abounds.” 

And the God bestowed upon her a love that would desert he forever upon her 
first sigh of earthly satisfaction, and a sweetness that would vanish with her first 
awareness of flattery. 

And He gave her wisdom from heaven to lead to the all-righteous path, and 
placed in the depth of her heart and eye that sees the unseen, and created in he an 
affection and goodness toward all things. He dressed her with raiment of hopes 
spun by the angels of heaven from the sinews of the 

In [5]:
# identify all unique characters used in this corpus, since we are going to make a character based transformer
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !'()*+,-.0123459:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz|~©»é—‘’“”
87


We will now tokenize the text; here each character is treated as a token. This converts characters, which a model can't work with directly, into numbers, which it can. Each token will have an essence and a meaning associated with it, but more on this later.

That is different from GPT, which uses sub-words as tokens. A great explanation of sub-word tokens, and why they work better than full words or single characters as tokens, is in [this](https://www.superdatascience.com/podcast/subword-tokenization-with-byte-pair-encoding) short clip by Jon Krohn.

In [6]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder takes a string, and outputs a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder takes a list of int and returns a string 
                                                 #(the join combines the array of strings to make it a string)

print(encode("good boy!"))
print(decode(encode("good boy!")))


[57, 65, 65, 54, 1, 52, 65, 75, 2]
good boy!


We just made a character-level tokenizer! There are many others.
- Google uses [sentencepiece](https://github.com/google/sentencepiece)
- OpenAI uses [tiktoken](https://github.com/openai/tiktoken)

Both are sub-word tokenizers, e.g. "related" might not be a single token, but "re", "la", "ted" could be tokens. These pieces make little sense to humans, but when LLMs see sub-words joined together, they can derive semantic meaning. For example, if we put the sub-word "un" in front of "related", the LLM can see that "unrelated" and "related" are connected and, given the "essence" of what it knows about "un", infer that the full word "unrelated" is the opposite of whatever follows "un", which is "related". This will make more sense when we talk about stage 1 of the tokenizer.

As an example, let's try OpenAI's sub-word tokenizer...

In [7]:
import tiktoken
enc = tiktoken.get_encoding('gpt2')
print("the number of all the subtokens in gpt2 are {}".format(enc.n_vocab))
print("good boy encodes to {} in gpt2".format(enc.encode("good boy!")))
print("11274 decodes to {} in gpt2".format(enc.decode([11274])))
print("2933 decodes to {} in gpt2".format(enc.decode([2933])))
print("0 decodes to {} in gpt2".format(enc.decode([0])))

the number of all the subtokens in gpt2 are 50257
good boy encodes to [11274, 2933, 0] in gpt2
11274 decodes to good in gpt2
2933 decodes to  boy in gpt2
0 decodes to ! in gpt2
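To make the "un" + "related" idea above concrete, here is a minimal sketch of splitting a word into sub-word pieces. It is a greedy longest-match splitter over a tiny hand-written vocabulary, which is an assumption for illustration only: `toy_vocab` and `greedy_subword_split` are hypothetical names, and this is not the byte-pair-encoding procedure that tiktoken actually uses.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
toy_vocab = {"un", "re", "la", "ted"}

def greedy_subword_split(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # try the longest candidate first, shrinking until something matches
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # no piece matched: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

print(greedy_subword_split("related", toy_vocab))
print(greedy_subword_split("unrelated", toy_vocab))
```

Both words end in the same three pieces, which is the kind of shared structure a model can pick up on.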


We will now encode the entire text data set using the PyTorch library, specifically tensors.
Tensors are multi-dimensional, highly efficient arrays whose elements all share one data type. We could build multi-dimensional arrays out of nested Python lists, for example, but that is far less efficient than a tensor. We will see why **multi-dimensional** is so important soon, in the transformer stages.

In [8]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print("the shape of this encoded tensor with all of the text data is {}".format(data.shape))
print("by the way, the shape of the original data was {} and since this is character encoding, i.e.\
 1:1 mapping of the character to the number, this actually makes sense".format(len(text)))
print("the data type of this encoded tensor is {}".format(data.dtype))
print("here is a sample of the first 50 characters:\n{}".format(data[:50]))

the shape of this encoded tensor with all of the text data is torch.Size([914761])
by the way, the shape of the original data was 914761 and since this is character encoding, i.e. 1:1 mapping of the character to the number, this actually makes sense
the data type of this encoded tensor is torch.int64
here is a sample of the first 50 characters:
tensor([25,  1, 44, 29, 25, 42,  1, 25, 38, 28,  1, 25,  1, 43, 37, 33, 36, 29,
         1,  0,  0,  0, 44, 58, 55,  1, 27, 68, 55, 51, 70, 59, 65, 64,  1,  0,
         4,  1, 21,  1, 21,  1, 27,  5,  1,  0,  0,  0, 44, 58])


We will now split this data into training and validation (test) data, so we can check (aka validate) how closely the model's output matches Khalil Gibran's style on text it has not been trained on.

This is a standard concept in data science. If you want to learn more, try [here](https://www.obviously.ai/post/the-difference-between-training-data-vs-test-data-in-machine-learning).

In [9]:
n = int(0.9*len(data)) # n will be 90% of the (character) length of all the data, so 90% of 914761
train_data = data[:n] # train our model on the first 90% of the data
val_data = data[n:] # we will validate how accurate our model is on the last 10%.

We will now send "chunks" of the dataset to the model to train it. We can't send all of the data at once, as that would be computationally very expensive, so we train on *randomly sampled* chunks.

These chunks have a maximum length, because stage 1 of the transformer is limited by how many tokens can be sent to it in parallel. We will call this block_size. (It can also be called the context length, i.e. the number of input tokens the GPT can accept.)

In [10]:
block_size = 8

# we will send the following chunk for training..
print(train_data[:block_size+1]) # notice how we are sending 9 characters, not 8 

tensor([25,  1, 44, 29, 25, 42,  1, 25, 38])


In [11]:
#in the actual data this looks like this:
print(text[:block_size+1])

A TEAR AN


**Important note!** When we send 9 characters, they carry information about how they relate to each other, e.g.
- in the context of "A", a space likely comes next,
- in the context of "A ", "T" likely comes next,
- if it is "A T" then "E" will likely follow, and so on.

This is why the 9 pieces of data sent contain "8 relationships". The 8th example is:
- if the 8-character context "A TEAR A" comes, then what follows is likely "N".


In [12]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context}, the target: {target}")
    
print("see! 8 relationships!")

when input is tensor([25]), the target: 1
when input is tensor([25,  1]), the target: 44
when input is tensor([25,  1, 44]), the target: 29
when input is tensor([25,  1, 44, 29]), the target: 25
when input is tensor([25,  1, 44, 29, 25]), the target: 42
when input is tensor([25,  1, 44, 29, 25, 42]), the target: 1
when input is tensor([25,  1, 44, 29, 25, 42,  1]), the target: 25
when input is tensor([25,  1, 44, 29, 25, 42,  1, 25]), the target: 38
see! 8 relationships!


So, every time we send data into the transformer to train it, we will sample many such chunks randomly from different locations in the corpus and send them together as a batch.

We will be sending **many chunks all stacked up in a single tensor** to the transformer. We do this purely for efficiency, to keep the GPUs busy, as they are very good at processing data in parallel.

So while we may be processing these multiple RANDOMLY sampled chunks in parallel, the chunks are processed completely independently; they don't talk to each other.

Tidbit: this is what makes the Transformer model different from traditional recurrent neural network approaches like the LSTM (long short-term memory), which process data sequentially. The authors of the paper, mostly at Google, wanted to optimize for speed and scale, and that is why they didn't use traditional recurrent approaches.

Of course, there is a catch: when the tokens of a chunk are processed in parallel rather than one after another, the transformer has to figure out how they relate to each other to have complete context. It does that in one of the stages; more to come on that later.


Now, we will generalize the prior (serial) data chunks and introduce the **batch dimension** below:

In [13]:
torch.manual_seed(1337) # to make this deterministic for each "random" run
batch_size = 4 # how many independent sequences will we process in parallel
block_size = 8 # what is the maximum context length for predictions.

def get_batch(split):
    #generates a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size, ))
        # doc: https://pytorch.org/docs/stable/generated/torch.randint.html
        # torch.randint(low=0, high, size, ...)
        # explanation of the prior line of code:
        # (batch_size, )  --- (A)
        # this is the shape of the output, i.e. how many random numbers are generated.
        # in this case a tensor of 4 numbers gets spit out and stored in ix.
        # each of the 4 is an independent random starting POSITION in the data.
        # the upper bound for those positions is:
        # len(data) - block_size --- (B)
        # this is the first argument passed to the random integer (randint) function.
        # it tells the function to pick each POSITION between zero (inclusive)
        # and this number (exclusive).
        # Note how it is total length - block_size. imagine the full length of data was 38:
        # a sampled POSITION could then never reach 38-8 = 30, so it is at most 29.
        # this is because in the subsequent steps we extract the items from the
        # sampled position onward: x takes 8 items (positions 29..36) and y takes
        # the same window shifted by one (positions 30..37, the last item in the data),
        # because we have a block_size of 8. suppose we didn't have the -block_size part:
        # we could then randomly pick 35 or 36, the slices for those rows would come
        # back shorter than 8, and torch.stack would error out on the unequal lengths.
        #
        # so to bring it together: (B) bounds the random position to anywhere in the data
        # set that still leaves room for a full chunk, and (A) selects four such
        # POSITIONS and stores them in ix.
        
        
    x = torch.stack([data[i:i+block_size] for i in ix])
        # this takes the 4 positions identified in ix earlier,
        # pulls the next block_size (8) items from each and stacks them up as ROWS
        # in the tensor, so it is a 4x8 (rows x columns) tensor
    
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
        # the same stack as x, but with every window shifted by an offset of 1: the targets
    return x,y

xb,yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print("----")

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[55, 68, 23, 84,  1, 33, 64,  1],
        [69, 69,  1, 53, 68, 71, 55, 62],
        [55,  1, 65, 64, 55,  1, 73, 59],
        [75, 65, 71, 68,  1,  0, 66, 55]])
targets:
torch.Size([4, 8])
tensor([[68, 23, 84,  1, 33, 64,  1, 68],
        [69,  1, 53, 68, 71, 55, 62,  1],
        [ 1, 65, 64, 55,  1, 73, 59, 70],
        [65, 71, 68,  1,  0, 66, 55, 68]])
----
when input is [55] the target: 68
when input is [55, 68] the target: 23
when input is [55, 68, 23] the target: 84
when input is [55, 68, 23, 84] the target: 1
when input is [55, 68, 23, 84, 1] the target: 33
when input is [55, 68, 23, 84, 1, 33] the target: 64
when input is [55, 68, 23, 84, 1, 33, 64] the target: 1
when input is [55, 68, 23, 84, 1, 33, 64, 1] the target: 68
when input is [69] the target: 69
when input is [69, 69] the target: 1
when input is [69, 69, 1] the target: 53
when input is [69, 69, 1, 53] the target: 68
when input is [69, 69, 1, 53, 68] the target: 71
when input is [69, 
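The `len(data) - block_size` bound explained in the comments of `get_batch` can be checked on a toy example. This is a plain-Python sketch that uses a list of 38 integers as a stand-in for the encoded corpus (`toy_data`, `max_start` and `bad_start` are hypothetical names); it mimics the slicing only and does not call PyTorch.

```python
# Toy stand-in for the encoded corpus: 38 "tokens", as in the comment example.
toy_data = list(range(38))
block_size = 8

# torch.randint's upper bound is exclusive, so the largest start it can return
# for high = len(data) - block_size is one less than that.
max_start = len(toy_data) - block_size - 1

x = toy_data[max_start:max_start + block_size]          # inputs
y = toy_data[max_start + 1:max_start + block_size + 1]  # targets, shifted by one
print(max_start, len(x), len(y), y[-1])

# A start position past the bound leaves too little data for a full chunk.
bad_start = 35
short_x = toy_data[bad_start:bad_start + block_size]
print(len(short_x))
```

Even at the largest allowed start, both windows are full and the targets end exactly on the last item; a start of 35 would return a short slice.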

So, I want to talk about why we printed the "y". It holds the targets that the loss function of our neural network will compare the predictions against.
Take the first row printed as an example:

- x -> [55, 68, 23, 84,  1, 33, 64,  1]
- y -> [68, 23, 84,  1, 33, 64,  1, 68]

- if the input is 55, then the desired output is 68 (the value in y at the SAME index as 55 in x)
- if the input is 55, 68 then the desired output is 23 (the value in y at the SAME index as the last input value, 68, in x)


The loss function is applied at the very end of the neural network to measure what the NN spits out against what the correct answer should have been; the delta between the NN output and the actual answer is the "error". This "error" is backpropagated through the neural network layers to adjust their weights so that the error is minimized.

Here are more details on the [loss function](https://www.analyticsvidhya.com/blog/2022/06/understanding-loss-function-in-deep-learning/) in ML.
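As a rough illustration of how a loss turns "prediction vs. correct answer" into a single number, here is a small sketch of cross-entropy written in plain Python. The scores are made up, the helper names are hypothetical, and this is only meant to show the arithmetic, not how PyTorch implements it.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    biggest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(softmax(scores)[target_index])

# Hypothetical scores over a 3-character vocabulary; the correct next character is index 2.
confident_and_right = cross_entropy([0.1, 0.2, 5.0], target_index=2)
confident_and_wrong = cross_entropy([5.0, 0.2, 0.1], target_index=2)
print(confident_and_right, confident_and_wrong)

# With no preference at all among 87 characters (our vocab_size), the loss is ln(87).
uniform_loss = cross_entropy([0.0] * 87, target_index=0)
print(uniform_loss)
```

The loss is small when the highest score sits on the correct character and large when it sits on a wrong one, which is exactly the signal backpropagation uses.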


Coming back to the above example, we have 32 values in x and 32 values in y (the desired targets). Essentially, we have 32 relationships stored in x and y.

Repeating what we said earlier:

- if the input is 55, then the desired output is 68 (the value in y at the SAME index as 55 in x)
- if the input is 55, 68 then the desired output is 23 (the value in y at the SAME index as the last input value, 68, in x)
- and so on.

So this tensor below:
xb tensor([[55, 68, 23, 84,  1, 33, 64,  1],
        [69, 69,  1, 53, 68, 71, 55, 62],
        [55,  1, 65, 64, 55,  1, 73, 59],
        [75, 65, 71, 68,  1,  0, 66, 55]])
        
will feed into the transformer, which will process these batches simultaneously. During training, the correct integers to predict are looked up at the same positions in the other tensor:


yb tensor([[68, 23, 84,  1, 33, 64,  1, 68],
        [69,  1, 53, 68, 71, 55, 62,  1],
        [ 1, 65, 64, 55,  1, 73, 59, 70],
        [65, 71, 68,  1,  0, 66, 55, 68]])

In [14]:
# checkpoint in the YT video. We are at 22:30
# to recap:
print("the length of the vocabulary: {}".format(vocab_size))
print("the actual vocabulary: {}".format(''.join(chars)))


the length of the vocabulary: 87
the actual vocabulary: 
 !'()*+,-.0123459:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz|~©»é—‘’“”


#### Building a simple bigram language model

Andrej covers the bigram model in depth in the first twenty minutes of [this](https://www.youtube.com/watch?v=PaCmpygFfXo) video. He took a list of human names and had the model create artificial names based on how the characters in the real names follow each other.

The essential summary is:

*A bigram language model is a type of statistical language model that predicts the probability of a word in a sequence based on the previous word. It considers pairs of consecutive words (bigrams) and estimates the likelihood of encountering a specific word given the preceding word in a text or sentence.*

In this notebook the tokens are characters rather than words, so our bigram model predicts the next character from the current character alone.

![bigram](https://d33wubrfki0l68.cloudfront.net/d851cde48c4f2b7be4c1433aa1a5538cb77e2aee/232ee/images/introduction-n-gram-language-models_files/ngram-language-model-explained-with-examples.png)
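The bigram idea can be sketched by simply counting which character follows which. This is a count-based illustration in plain Python on the title line of the corpus; it is an assumption-level sketch with hypothetical names (`follow_counts`, `next_char_probs`), not the neural version built in this notebook.

```python
from collections import Counter, defaultdict

sample = "A TEAR AND A SMILE"  # the title line of the corpus printed earlier

# follow_counts[a][b] = how many times character b came right after character a
follow_counts = defaultdict(Counter)
for current_char, next_char in zip(sample, sample[1:]):
    follow_counts[current_char][next_char] += 1

def next_char_probs(char):
    """Estimated probability of each character following `char`."""
    counts = follow_counts[char]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(next_char_probs("A"))
```

In this tiny sample, "A" is followed by a space twice and by "R" and "N" once each, so a space gets half of the probability.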


#### nn.Embedding

Now, before we dive into the next piece of code, let's take a quick detour to understand PyTorch embeddings. This one stumped me for a while.

PS: a good resource where I learnt this was Jeff Heaton's excellent [video](https://www.youtube.com/watch?v=e6kcs9Uj_ps)

In [76]:
import torch
import torch.nn as nn
torch.manual_seed(42)
x = nn.Embedding(3,2)

# so embeddings are just lookup tables, and what we did with the line above is
# create a tensor (aka a matrix of numbers) of size 3x2 and fill it with
# a bunch of random numbers. to print this tensor out we can use:
x.weight

Parameter containing:
tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863]], requires_grad=True)

In [83]:
# now if we want to look up certain values from this "embedding" lookup table
# , we can call it like so:
y = torch.tensor([0]) # we have to look up the embedding using a tensor, 
                        # not just any old integer will do
x(y) 

tensor([[0.3367, 0.1288]], grad_fn=<EmbeddingBackward0>)

In [95]:
# you can see that we looked up the row specified in y, aka row "0", the first row
# (since python indexing is zero-based), and printed that out. 
# this is useful to take any entity, e.g. words, and REPRESENT them as 
# n-dimensional numerical vectors, e.g.
# cow = [0.2323, 0.343434, 0.434343] 
# this allows us to see how close the word is to another word semantically, like
# animal = [0.3, 0.35, 0.48], vs far apart from another word like:
# car = [0.8, 0.1, 0.03]
# you can see that the 3 numbers (aka the vector) representing animal are closer to
# those representing cow than to those representing car. you can use things like
# mean squared error to get a single number which represents the closeness or 
# far-apart-ness of these vectors!
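The closeness claim in the comments above can be checked numerically. The sketch below uses the same made-up cow/animal/car vectors and computes mean squared error as mentioned; cosine similarity is added as a commonly used alternative that is not part of the original notes.

```python
import math

# The illustrative vectors from the comments above (made-up numbers, not learned embeddings).
cow = [0.2323, 0.343434, 0.434343]
animal = [0.3, 0.35, 0.48]
car = [0.8, 0.1, 0.03]

def mean_squared_error(a, b):
    """Average squared difference; smaller means the vectors are closer."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; closer to 1 means a more similar direction."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

print(mean_squared_error(cow, animal), mean_squared_error(cow, car))
print(cosine_similarity(cow, animal), cosine_similarity(cow, car))
```

Both measures agree here: cow is nearer to animal than to car.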



In [97]:
# important note: the shape of what comes out of the embedding is the SAME as
# the shape of what went in, plus one extra trailing dimension. in the above
# example we sent it 1 index (when we sent it [0]) and we got out 1 row,
# [0.3367, 0.1288]. it's just that each index is now a vector of 2 numbers,
# because we made the embedding with a size of 2 each, like so: nn.Embedding(3,2)
# here's a slightly more involved example:

z = torch.tensor([0,1])
print(x.weight)
print(z.shape) # shape of the input indices




Parameter containing:
tensor([[ 0.3367,  0.1288],
        [ 0.2345,  0.2303],
        [-1.1229, -0.1863]], requires_grad=True)


torch.Size([2])

In [100]:
# if you scroll up the 0th index in the embedding is represented by [0.3367, 0.1288]
# and the 1th index is [0.2345, 0.2303]

print(z)
print(x(z))


# so with that, let's resume the Andrej video, we are at 23:07 

tensor([0, 1])
tensor([[0.3367, 0.1288],
        [0.2345, 0.2303]], grad_fn=<EmbeddingBackward0>)


In [107]:
# above, 0 is represented by a vector of 2 numbers [0.3367, 0.1288], 
# and similarly 1 is also represented by a vector of 2 numbers [0.2345, 0.2303]

# now, if we pass it a more complex tensor, see how that is represented 
# by the embeddings...
t1 = torch.tensor([[0,1,1], [1,0,0]])
print(t1)
print(x(t1))


tensor([[0, 1, 1],
        [1, 0, 0]])
tensor([[[0.3367, 0.1288],
         [0.2345, 0.2303],
         [0.2345, 0.2303]],

        [[0.2345, 0.2303],
         [0.3367, 0.1288],
         [0.3367, 0.1288]]], grad_fn=<EmbeddingBackward0>)
torch.Size([2, 3, 2])


In [108]:
# now, reading the first block of the output above: it is the
# 0th row of the embedding, then the 1th row of the embedding, and then again
# the 1th row of the embedding, matching the input row [0, 1, 1].
# the size is:
print(x(t1).size())

torch.Size([2, 3, 2])


In [None]:
# which means a tensor of TWO 3x2 matrices, where each of these two matrices
# represents one row (eg [0, 1, 1]) of the input tensor. 
# within each 3x2 matrix, each row is the 2-dimensional vector representation
# of one of the numbers in the input, e.g. 0, or 1, or 1.. so
# 0 is represented by [0.3367, 0.1288], the next
# 1 is represented by [0.2345, 0.2303], and the next
# 1 is represented by [0.2345, 0.2303]
# and therefore [0, 1, 1] is represented by:
# [0.3367, 0.1288],
# [0.2345, 0.2303],
# [0.2345, 0.2303]
# rinse and repeat for the next input [1, 0, 0]

# why am I going into this depth? because things are about to get crazy in 
# n dimensions when we call nn.Embedding in the next stage..
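The "lookup table" behaviour and the resulting shapes can be mimicked without PyTorch. The sketch below copies the 3x2 table printed earlier by hand and uses two hypothetical helpers (`embed`, `shape`) on nested lists; it only imitates what the embedding lookup does to shapes and is not how `nn.Embedding` is implemented.

```python
# The 3x2 table printed by x.weight above, copied by hand.
table = [
    [0.3367, 0.1288],
    [0.2345, 0.2303],
    [-1.1229, -0.1863],
]

def embed(indices):
    """Replace every integer in a (possibly nested) list with its row of the table."""
    if isinstance(indices, int):
        return table[indices]
    return [embed(item) for item in indices]

def shape(nested):
    """Shape of a regular nested list, e.g. [[0, 1, 1], [1, 0, 0]] -> (2, 3)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

t1 = [[0, 1, 1], [1, 0, 0]]
out = embed(t1)
print(shape(t1), "->", shape(out))
print(out[0])
```

The input shape (2, 3) is kept and a trailing dimension of 2 is appended, which is the same pattern that turns a 4x8 batch into 4x8x87 in the model.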

In [126]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):
    # the BigramLanguageModel will inherit from nn.Module class
    # so it would be a subclass of nn.Module
    
    #initializing method:
    def __init__(self,vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # to recap: 
        # the length of the vocabulary: 87
        # the actual vocabulary: 
        # !'()*+,-.0123459:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz|~©»é—‘’“”
        # nn.Embedding has been discussed above (aka its a lookup table), so
        # what we're doing is that we are representing each character in the 
        # vocabulary above into an 87 dimension vector
        
    
    def forward(self, idx, targets=None):
        # takes the inputs and target tensors, input renamed as idx
        # recall that the input was a stacked tensor of batch (rows) of 4
        # with 8 tokens (here, that is same as characters) across, eg
        # tensor([
        # [55, 68, 23, 84,  1, 33, 64,  1],
        # [69, 69,  1, 53, 68, 71, 55, 62],
        # [55,  1, 65, 64, 55,  1, 73, 59],
        # [75, 65, 71, 68,  1,  0, 66, 55]])
        
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        # so recall the long explanation of the pytorch embeddings above.
        # each of the rows in the input, aka the encoded chunk we are going to use to train
        # eg [55, 68, 23, 84,  1, 33, 64,  1]
        # is taken and then each of the numbers (eg 55) is going to be blown into its 
        # representation of an 87 number VECTOR, which is the 55th row
        # in the embedding table. 
        # so we shoved in 4x8 and should get a 4x8x87 matrix as an output.
        # we call this Batch x Time x Channel. The batch is 4, the Time is 8
        # and the channel is 87.
        
        # a peek forward: logits essentially represent the scores
        # of the next character in the sequence, and this is where the target
        # tensor will come in.. (24:12)
        
        # basically if i'm character 75 ("y" in our vocabulary) I know just by being "y"
        # how likely each character is to follow me.
        # say "a" is more likely to follow "y" than "z" is. (in fact,
        # let's assume z practically never follows y!) that would mean that
        # in row 75 of the Embedding (aka for y), the value at index 76 (the score
        # for the character z) would be very low, while the value at index 51
        # (the score for a) would be high. logits are raw scores, not probabilities:
        # softmax turns them into probabilities, e.g. 0.8 for a and close to 0 for z.
        
       
        # next step is to evaluate the loss function
        
    
        #loss = F.cross_entropy(logits, targets) (a) commented bc it fails,
                            # we need to do some transforms
        
        # this means: we have the identity of the next character in 'targets',
        # so how well are we predicting the next character in the 'logits'?
        
        # intuitively, if the loss is low (meaning accuracy is high), the
        # score for the correct next character should be very high (such as the
        # score for "a" in the example above) and every other value should be low.
        
        # the magic of using tensors with these crazy n dimensional arrays is that you
        # are doing a bunch of mathematical operations in parallel,
        # which GPUs support very nicely <-- this is what has enabled the massive
        # scaling and speedup of pre-training LLMs, which would be impractically
        # slow on CPUs alone.
        
        # the reason (a) was commented out is that if you look carefully at what
        # cross entropy is expecting 
        # https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
        # it is not expecting a torch.Size([4, 8, 87]), which is what we are passing
        # it rn. Instead it is expecting 87 (channels) as the 2nd dimension, not the 3rd!
               
            
            
        # generate() calls forward() without targets, so there is no loss to compute then
        if targets is None:
            loss = None
        else:

            # 

            B, T, C = logits.shape
            logits = logits.view(B*T, C) # C or channels needs to be 2nd dimension
            # so we are stretching out the array so its two dimensional and conforms to
            # cross_entropy(). 

            #doing the same to targets:
            targets = targets.view(B*T)

            # and now cross_entropy should work:
            loss = F.cross_entropy(logits, targets)

            # at this point 
            # print(logits.shape) is torch.Size([32, 87])
            # print(loss) is  tensor(4.7719, grad_fn=<NllLossBackward0>)
            # if the model spread its predictions evenly over all 87 characters,
            # the negative log likelihood loss would be
            # -ln(1/87) ==> 4.4659

            # our loss is higher, which means the initial predictions are not perfectly
            # diffuse (uniform): the random init puts some confidence on wrong characters
        
        return logits, loss

        
    def generate(self, idx, max_new_tokens):
        # idx is (B,T) array of indices in the current context
        # the job of generate() is to take in the batch (B,T) and extend it to
        # (B,T+1), (B,T+2) and so on in the time dimension.
        
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # self(idx) calls the forward function: calling an nn.Module object
            # goes through its __call__, which dispatches to forward()
            
            #focus only on the last step, i.e what comes after the last character
            logits = logits[:, -1, :] # becomes shape (B,C): the logits at the last time step
            
            #now, apply softmax to get probabilities
            # https://en.wikipedia.org/wiki/Softmax_function
            #converts a vector of K real numbers into a probability distribution of 
            # K possible outcomes. It is a generalization of the logistic function
            # to multiple dimensions 
            # The softmax function is often used as the last activation function 
            # of a neural network to normalize the output of a network 
            # to a probability distribution over predicted output classes
            
            probs = F.softmax(logits, dim=-1) # (B,C)
            
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # becomes (B,1)
            
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            # so we concatenate each newly sampled token onto the end of idx,
            # and repeat until max_new_tokens tokens have been generated.
            
            
        return idx
            

    
    
m = BigramLanguageModel(vocab_size) 
# the line above calls __init__ and creates m, an object of the
# BigramLanguageModel class
logits, loss = m(xb,yb)
# and this line runs forward() on our batch
print(logits.shape)
print(loss)

storage_idx = torch.zeros((1,1), dtype=torch.long)
# the starting context: a single token 0 (the newline character).
# generate() appends the generated tokens to it

# generating 100 tokens
print(decode(m.generate(storage_idx, 
                        max_new_tokens=100)[0].tolist()))

torch.Size([32, 87])
tensor(4.7719, grad_fn=<NllLossBackward0>)

Cug__>Om-B»n“MFxP1iw3a*U(qe'fwTqfEpWUc©Rd~e—DVc5:urRotUHDuO-!Ag_ rOYCuR.Wx5
G0tf3”@b>*3c=cWbql4xhoCd
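The generation loop (look up the scores for the last token, softmax, sample, append) can be sketched in plain Python. The 3x3 `toy_table` of scores is made up and stands in for the embedding table; `sample_next` is a hypothetical helper, and `random.Random.choices` stands in for `torch.multinomial`, so this is only an approximation of the cell above.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(scores, rng):
    """Pick one index at random, weighted by softmax(scores)."""
    probs = softmax(scores)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical bigram table for a 3-token vocabulary:
# row i holds the scores for whatever follows token i.
toy_table = [
    [0.0, 2.0, -1.0],
    [1.0, 0.0, 1.0],
    [3.0, -2.0, 0.0],
]

rng = random.Random(1337)   # seeded so repeated runs match
sequence = [0]              # start from token 0, like the notebook does
for _ in range(10):
    scores = toy_table[sequence[-1]]   # a bigram model only looks at the last token
    sequence.append(sample_next(scores, rng))
print(sequence)
```

Because the table here is arbitrary, the sampled sequence is as meaningless as the untrained output above; training is what would shape the scores.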


In [None]:
# the above prints out garbage because this is a totally random model.
# we have not trained it yet: no backpropagation has reduced the loss, so it
# generates from randomly initialized numbers.

# also we are sending the model the whole history so far just to predict the next 
# char, e.g. to predict the last U we are sending in the full history
# Cug__>Om-B»n“MFxP1iw3a*U(qe'fwTqfEpW 
# as well. in bigram the history doesn't matter but in the actual
# transformer model it will, so we want to keep this code as is instead 
# of sending a pruned single char input.

# stopping at 34:16