## Implementing a basic LLM from scratch

### Tokenization in Early LLM Stages

The first stage of a large language model (LLM) pipeline is tokenization: the text is split into units (words, subwords, or characters), and each unit is mapped to an integer ID. There are various methods to achieve this:

- **Byte Pair Encoding (BPE):** This method, used by OpenAI's GPT models, breaks words down into subword units. For example, a word like "depend" might be split into "de" and "pend," while common suffixes such as "ing" become units of their own.

- **Word-level encoding:** This simpler approach maps entire words to numbers by sorting a dictionary of known words and assigning each one an index. While straightforward, it cannot handle words that are missing from the dictionary.

BPE is preferred because it can fall back to smaller subword units, so even previously unseen words can be tokenized.
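To make the contrast concrete, here is a minimal sketch of both ideas in plain Python. It is a simplified, character-level illustration and not how tiktoken is implemented; the helper names (`build_word_vocab`, `learn_bpe_merges`, `apply_merge`, `bpe_tokenize`) and the toy corpus are assumptions made up for this example.

```python
from collections import Counter

def build_word_vocab(corpus):
    """Word-level vocabulary: sorted unique words mapped to integer ids."""
    words = sorted(set(corpus.split()))
    return {word: idx for idx, word in enumerate(words)}

def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe_merges(corpus, num_merges):
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    # Every word starts out as a tuple of single characters.
    word_counts = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        word_counts = {apply_merge(s, best): c for s, c in word_counts.items()}
    return merges

def bpe_tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

corpus = "depend depending pending spend spending"   # toy corpus
word_vocab = build_word_vocab(corpus)
merges = learn_bpe_merges(corpus, num_merges=6)

print("depends" in word_vocab)           # the word-level vocabulary has no entry for it
print(bpe_tokenize("depends", merges))   # BPE still splits it into known pieces
```

The word-level vocabulary has no ID for "depends", while the BPE sketch reuses subword pieces learned from related words and falls back to single characters for the rest.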


In [228]:
# Tiktoken is used for BPE
import tiktoken
print("tiktoken version:", tiktoken.__version__)
import torch
from torch.utils.data import Dataset, DataLoader
print("torch version: ", torch.__version__)

cuda_avail = torch.cuda.is_available()

print(f"Is CUDA available: {cuda_avail}")




tiktoken version: 0.8.0
torch version:  2.5.1+cu124
Is CUDA available: True


### Implementing BPE
- One available vocabulary is cl100k_base; its rank file, with 100256 entries, is hosted by OpenAI at "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"
- A smaller vocabulary is gpt2, with 50257 tokens: 50000 merges listed in "https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe", plus 256 byte-level tokens and the <|endoftext|> special token

In [229]:
tokenizer = tiktoken.get_encoding("gpt2")
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
decode_text = tokenizer.decode(integers)
print(decode_text)



[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


### Representing the Data
We will now define the input data and target data for our model:

- **Input Data:** The token IDs given to the language model (LLM) as input.
- **Target Data:** The same token sequence shifted by one position, so each target is the token that follows the corresponding input token.
- **Dataset:** The PyTorch Dataset class is an interface for accessing and managing data. A custom dataset must implement `__len__()` and `__getitem__()`.
- **DataLoader:** Loads data from the dataset in batches. It handles shuffling, batching, and parallel loading.
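The relationship between inputs and targets can be sketched without PyTorch. The function name `make_windows` and the stand-in token IDs below are made up for illustration; the loop mirrors the sliding-window logic used in the dataset class in the next cell.

```python
def make_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[start:start + max_length]
        target_chunk = token_ids[start + 1:start + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

token_ids = list(range(10, 20))   # stand-in token ids: 10, 11, ..., 19
for inputs, targets in make_windows(token_ids, max_length=4, stride=2):
    print(inputs, "->", targets)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [12, 13, 14, 15] -> [13, 14, 15, 16]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

A stride smaller than max_length makes consecutive windows overlap; a stride equal to max_length, as used later in this notebook, yields non-overlapping windows.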

In [230]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)    # tokenize the entire text

        # Slide a window of max_length tokens over the text in steps of stride
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]   # inputs shifted by one token
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):    # number of (input, target) pairs
        return len(self.input_ids)

    def __getitem__(self, idx):         # one (input, target) pair
        return self.input_ids[idx], self.target_ids[idx]

In [231]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")                    # GPT-2 BPE tokenizer
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)   # sliding-window dataset
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,        # drop the last batch if it is smaller than batch_size
        num_workers=num_workers     # number of subprocesses used for data loading
    )

    return dataloader

In [232]:
with open("The_Verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)      # turn the DataLoader into an iterator
first_batch = next(data_iter)
print(first_batch)
second_batch = next(data_iter)
print(second_batch)

[tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]), tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])]
[tensor([[  287,   262,  6001,   286],
        [  465, 13476,    11,   339],
        [  550,  5710,   465, 12036],
        [   11,  6405,   257,  5527],
        [27075,    11,   290,  4920],
        [ 2241,   287,   257,  4489],
        [   64,   319,   262, 34686],
        [41976,    13,   357, 10915]]), tensor([[  262,  6001,   286,   465],
        [13476,    11,   339,   550],
    

### Creating token embeddings
Each token ID is mapped to a vector of a fixed size (the embedding dimension). The vectors start out random and are learned during training; they are what the LLM actually operates on.
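One way to see what an embedding layer does is to compare it with a one-hot encoding followed by a matrix multiplication. The sketch below uses toy sizes and a seed chosen only for this illustration; both routes should give the same vectors, with the lookup simply being the cheaper one.

```python
import torch

torch.manual_seed(0)
vocab_size, embed_dim = 6, 3      # toy sizes, chosen only for illustration
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([2, 3, 5, 1])

# Lookup: select rows 2, 3, 5 and 1 of the weight matrix.
looked_up = embedding(token_ids)

# Equivalent but slower: one-hot vectors multiplied by the weight matrix.
one_hot = torch.nn.functional.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(looked_up, via_matmul))   # True
print(embedding.weight.requires_grad)          # True: the rows are trainable
```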

In [233]:
# Sample embedding layer
# Token IDs come from the BPE tokenizer; the embedding layer maps each ID to a vector
input_ids = torch.tensor([2,3,5,1]) 
vocab_size = 6
output_dim = 3
torch.manual_seed(123)  # fix the seed so the random weights are reproducible
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer)
print(embedding_layer.weight)
print(embedding_layer(torch.tensor([3]))) # prints the row at index 3 (the 4th row)
print(embedding_layer(input_ids))

Embedding(6, 3)
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


- The embedding layer above produces token embeddings that carry no information about a token's position
- Next we add that position information

### Position embeddings
- They have the same dimension as the token embeddings created above
- Relative positional embeddings encode how far apart tokens are
- Absolute positional embeddings assign a learned vector to each position; OpenAI's GPT models use this kind
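A small sketch of why position information is needed, assuming toy sizes and made-up token IDs: the same token ID yields the same token embedding wherever it appears, and only after adding an absolute positional embedding do the two occurrences differ.

```python
import torch

torch.manual_seed(0)
vocab_size, context_length, embed_dim = 10, 4, 3   # toy sizes for illustration

token_emb = torch.nn.Embedding(vocab_size, embed_dim)
pos_emb = torch.nn.Embedding(context_length, embed_dim)

# Token id 7 appears at positions 0 and 2.
token_ids = torch.tensor([7, 1, 7, 4])
tokens = token_emb(token_ids)                        # shape (4, 3)
positions = pos_emb(torch.arange(context_length))    # shape (4, 3)
combined = tokens + positions

print(torch.equal(tokens[0], tokens[2]))       # True: same token, same vector
print(torch.equal(combined[0], combined[2]))   # False: the positions differ
```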

In [234]:
vocab_size = 50257 # vocabulary size of the gpt2 tokenizer
output_dim = 256 # the largest GPT-3 model uses 12288
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Load a batch with the data loader; each token will get a 256-dimensional embedding
# 8 text samples with 4 tokens each
max_length = 4
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=max_length,
   stride=max_length, shuffle=False
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)


Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])
torch.Size([8, 4, 256])


In [235]:
# GPT models use a second embedding layer for positions, with the same embedding dimension as the token embedding layer
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)
print(torch.arange(context_length))

torch.Size([4, 256])
tensor([0, 1, 2, 3])


In [236]:
# We can now add these directly to the token embeddings:
# PyTorch broadcasts the 4 × 256 pos_embeddings tensor across the batch, adding it to
# the 4 × 256 token embedding tensor of each of the eight samples,
# so every token embedding receives the embedding of its position
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([8, 4, 256])


## Self Attention
### Implementing simple self-attention

- Self-attention computes a weight for every pair of tokens
- These weights capture how strongly the words relate to each other

In [237]:
# Toy input embeddings: one 3-dimensional vector per token
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

query = inputs[1]                            # the second input token serves as the query
attn_scores_2 = torch.empty(inputs.shape[0])

print("query: ", query, " ***** empty attention scores: ", attn_scores_2)

for i, x_i in enumerate(inputs):
    print("index: ", i, x_i)
    attn_scores_2[i] = torch.dot(x_i, query)


print("Final attention scores: ",attn_scores_2)


# Normalize the scores so that they sum to 1

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights scaled:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

query:  tensor([0.5500, 0.8700, 0.6600])  ***** empty attention scores:  tensor([6.1377e-43, 0.0000e+00, 2.8250e-42, 0.0000e+00, 3.6013e-43, 0.0000e+00])
index:  0 tensor([0.4300, 0.1500, 0.8900])
index:  1 tensor([0.5500, 0.8700, 0.6600])
index:  2 tensor([0.5700, 0.8500, 0.6400])
index:  3 tensor([0.2200, 0.5800, 0.3300])
index:  4 tensor([0.7700, 0.2500, 0.1000])
index:  5 tensor([0.0500, 0.8000, 0.5500])
Final attention scores:  tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
Attention weights scaled: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


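- In practice the scores are normalized with softmax instead. Like dividing by the sum, it makes the weights add up to 1, but it also keeps every weight positive and behaves better during gradient-based training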
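The difference between the two normalizations shows up as soon as a score is negative. The scores below are made up for illustration:

```python
import torch

scores = torch.tensor([2.0, -1.0, 0.5])   # made-up scores, one of them negative

sum_normalized = scores / scores.sum()
softmaxed = torch.softmax(scores, dim=0)

print(sum_normalized)   # sums to 1 but contains a negative "weight"
print(softmaxed)        # sums to 1 and every entry is positive
```

Negative weights are hard to interpret as shares of attention, which is one reason softmax is preferred.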

In [238]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights scaled :", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights scaled : tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


- The naive softmax above computes e^(x_i) / sum_j e^(x_j). In practice we use torch.softmax, which is implemented to stay numerically stable for very large or very small inputs
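The stability issue can be reproduced with deliberately large, made-up scores. The `softmax_stable` helper below is a sketch of the common subtract-the-maximum trick and is not claimed to be PyTorch's exact implementation; on this input it matches torch.softmax.

```python
import torch

def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

def softmax_stable(x):
    # Subtracting the maximum does not change the result mathematically,
    # but keeps every exponent at or below zero so exp() cannot overflow.
    shifted = x - x.max()
    return torch.exp(shifted) / torch.exp(shifted).sum(dim=0)

large_scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive = softmax_naive(large_scores)     # exp(1000.) overflows float32 to inf, so inf / inf = nan
stable = softmax_stable(large_scores)
builtin = torch.softmax(large_scores, dim=0)

print(naive)
print(stable)
print(builtin)
```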

In [239]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


In [240]:
query = inputs[1]         # the second input token serves as the query
print("query:", query)
context_vec_2 = torch.zeros(query.shape)
for i,x_i in enumerate(inputs):
    print(x_i)
    context_vec_2 += attn_weights_2[i]*x_i
print("context vector :",context_vec_2)

query: tensor([0.5500, 0.8700, 0.6600])
tensor([0.4300, 0.1500, 0.8900])
tensor([0.5500, 0.8700, 0.6600])
tensor([0.5700, 0.8500, 0.6400])
tensor([0.2200, 0.5800, 0.3300])
tensor([0.7700, 0.2500, 0.1000])
tensor([0.0500, 0.8000, 0.5500])
context vector : tensor([0.4419, 0.6515, 0.5683])


In [241]:
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [242]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [243]:
print("attention scores shape ", attn_scores.shape)

# attn_weights = torch.softmax(attn_scores, dim=0)
# print(attn_weights)
# print("\n ********** \n")

attn_weights = torch.softmax(attn_scores, dim = -1)
print(attn_weights)

attention scores shape  torch.Size([6, 6])
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [244]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


In [245]:
all_context_vecs = attn_weights @ inputs
print("context vectors for each words : ", all_context_vecs)
print("\n ********** \n")
print(inputs)

context vectors for each words :  tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

 ********** 

tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])


### Self-attention recap
- We start from the input embeddings, which combine the token embedding and the positional embedding of each word
- These embeddings are the input to self-attention
- Attention here means the weight each word receives with respect to every other word in the sentence
- Each word is an embedding vector, so we need a way to measure how strongly two vectors are linked
- The dot product provides this: the more closely two vectors align, the higher the value
- For one word (the query), we take its dot product with every word vector and normalize the results with softmax
- This gives the attention weights of that word with respect to all words in the sentence
- Multiplying each embedding by its weight and summing the results gives the context vector for that word
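The steps in this recap fit into one short function. The name `simple_self_attention` is made up for this sketch; the input values are the six toy embeddings used in the cells above, so the second context vector should match the one computed there.

```python
import torch

def simple_self_attention(x):
    """Self-attention without trainable weights for x of shape (num_tokens, dim)."""
    attn_scores = x @ x.T                               # pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)   # each row sums to 1
    return attn_weights @ x                             # weighted sums of the inputs

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],   # Your
     [0.55, 0.87, 0.66],   # journey
     [0.57, 0.85, 0.64],   # starts
     [0.22, 0.58, 0.33],   # with
     [0.77, 0.25, 0.10],   # one
     [0.05, 0.80, 0.55]]   # step
)

context = simple_self_attention(inputs)
print(context[1])   # tensor([0.4419, 0.6515, 0.5683])
```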

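### Scaled dot-product attention

### Computing the attention weights
Self-attention with trainable weights uses three weight matrices:
- Query
- Key
- Value

Each embedded input token is projected through these matrices to obtain its query, key, and value vectors. Dividing the query-key dot products by the square root of the key dimension before the softmax is what makes this scaled dot-product attention.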


In [246]:
x_2 = inputs[1]     # second input element
d_in = inputs.shape[1]      # input embedding size, d_in=3
d_out = 2         # output embedding size, d_out=2


print(f"inputs the general embeddings {inputs}")
print(f"inputs shape {inputs.shape}")
print(f"d_in {d_in} and d_out {d_out}")

inputs the general embeddings tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])
inputs shape torch.Size([6, 3])
d_in 3 and d_out 2


In [247]:
torch.manual_seed(123)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

print("weight matrix of query: ", W_query)
print(f"shape of W_query: {W_query.shape}")

query_2 = x_2 @ W_query 
key_2 = x_2 @ W_key 
value_2 = x_2 @ W_value
print(f"Product of x2 and Wquery x2 = {x_2} and query_2 = {query_2} ")

weight matrix of query:  Parameter containing:
tensor([[0.2961, 0.5166],
        [0.2517, 0.6886],
        [0.0740, 0.8665]])
shape of W_query: torch.Size([3, 2])
Product of x2 and Wquery x2 = tensor([0.5500, 0.8700, 0.6600]) and query_2 = tensor([0.4306, 1.4551]) 


In [248]:
query = inputs @ W_query 
keys = inputs @ W_key 
values = inputs @ W_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
print(f"keys: {keys} \n values: {values}")

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
keys: tensor([[0.3669, 0.7646],
        [0.4433, 1.1419],
        [0.4361, 1.1156],
        [0.2408, 0.6706],
        [0.1827, 0.3292],
        [0.3275, 0.9642]]) 
 values: tensor([[0.1855, 0.8812],
        [0.3951, 1.0037],
        [0.3879, 0.9831],
        [0.2393, 0.5493],
        [0.1492, 0.3346],
        [0.3221, 0.7863]])


In [249]:
keys_2 = keys[1]             # key vector of the second token
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

attn_scores_2 = query_2 @ keys.T       # scores of query 2 against all keys
print(attn_scores_2)

tensor(1.8524)
tensor([1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440])


In [250]:
# Scale the scores by the square root of the key dimension before applying softmax
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(d_k)
print(attn_weights_2)

2
tensor([0.1500, 0.2264, 0.2199, 0.1311, 0.0906, 0.1820])


In [251]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.3061, 0.8210])


### Query, Key and Value
- Three weight matrices are created: W_query, W_key, and W_value
- Each token embedding is projected through all three matrices
- Multiplying an embedding by W_query gives its query vector; W_key and W_value give its key and value vectors in the same way
- The dot products of a query with all keys give the attention scores, which are scaled and passed through softmax to get the attention weights
- Multiplying the attention weights with the values and summing the results gives the context vector
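The cells above divide the scores by the square root of the key dimension before the softmax. The sketch below illustrates why, using random stand-in vectors and a hypothetical key dimension of 64: with larger dimensions the raw dot products spread out more, and an unscaled softmax tends to put almost all weight on a single key.

```python
import torch

torch.manual_seed(0)
d_k = 64                       # hypothetical key dimension, for illustration only
query = torch.randn(d_k)
keys = torch.randn(6, d_k)     # six made-up key vectors

scores = keys @ query          # raw dot products; their spread grows with d_k
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

print("largest weight, unscaled:", unscaled.max().item())
print("largest weight, scaled:  ", scaled.max().item())
```

Dividing by a constant greater than 1 can only flatten the softmax output, so the scaled weights are never more peaked than the unscaled ones. A flatter distribution keeps the gradients from becoming very small during training.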


In [252]:
import torch.nn as nn
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        print(f"values : {values}")
        print(f"queries : {queries}")
        print(f"keys : {keys}")
        attn_scores = queries @ keys.T # omega
        print(f"attention scores q*k : {attn_scores}")
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        print(f"attention weights w = softmax(q*k): {attn_weights}")
        context_vec = attn_weights @ values
        return context_vec

In [253]:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(f"context vector {sa_v1(inputs)}")

values : tensor([[0.1855, 0.8812],
        [0.3951, 1.0037],
        [0.3879, 0.9831],
        [0.2393, 0.5493],
        [0.1492, 0.3346],
        [0.3221, 0.7863]], grad_fn=<MmBackward0>)
queries : tensor([[0.2309, 1.0966],
        [0.4306, 1.4551],
        [0.4300, 1.4343],
        [0.2355, 0.7990],
        [0.2983, 0.6565],
        [0.2568, 1.0533]], grad_fn=<MmBackward0>)
keys : tensor([[0.3669, 0.7646],
        [0.4433, 1.1419],
        [0.4361, 1.1156],
        [0.2408, 0.6706],
        [0.1827, 0.3292],
        [0.3275, 0.9642]], grad_fn=<MmBackward0>)
attention scores q*k : tensor([[0.9231, 1.3545, 1.3241, 0.7910, 0.4032, 1.1330],
        [1.2705, 1.8524, 1.8111, 1.0795, 0.5577, 1.5440],
        [1.2544, 1.8284, 1.7877, 1.0654, 0.5508, 1.5238],
        [0.6973, 1.0167, 0.9941, 0.5925, 0.3061, 0.8475],
        [0.6114, 0.8819, 0.8626, 0.5121, 0.2707, 0.7307],
        [0.8995, 1.3165, 1.2871, 0.7682, 0.3937, 1.0996]],
       grad_fn=<MmBackward0>)
attention weights w = softmax(q*

In [254]:
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec

In [255]:
torch.manual_seed(789)
sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)


### Why query, key, and value?
The terms “key,” “query,” and “value” in the context of attention mechanisms are borrowed from the domain of information retrieval and databases, where similar concepts are used to store, search, and retrieve information.

A query is analogous to a search query in a database. It represents the current item (e.g., a word or token in a sentence) the model focuses on or tries to understand. The query is used to probe the other parts of the input sequence to determine how much attention to pay to them.

The key is like a database key used for indexing and searching. In the attention mechanism, each item in the input sequence (e.g., each word in a sentence) has an associated key. These keys are used to match the query.

The value in this context is similar to the value in a key-value pair in a database. It represents the actual content or representation of the input items. Once the model determines which keys (and thus which parts of the input) are most relevant to the query (the current focus item), it retrieves the corresponding values.
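The database analogy can be sketched as a "soft" lookup. A dictionary returns exactly one value for an exactly matching key, whereas attention compares the query with every key and returns a blend of all values. The key, value, and query vectors below are made up for illustration.

```python
import torch

# Hard lookup: the query must match a key exactly and returns one value.
table = {"cat": 1.0, "dog": 2.0}
hard = table["dog"]

# Soft lookup: compare the query with every key and blend the values.
keys = torch.tensor([[1.0, 0.0],     # made-up key vector for "cat"
                     [0.0, 1.0]])    # made-up key vector for "dog"
values = torch.tensor([[1.0],
                       [2.0]])
query = torch.tensor([0.1, 0.9])     # mostly "dog", slightly "cat"

weights = torch.softmax(query @ keys.T, dim=-1)
soft = weights @ values

print(hard)      # 2.0
print(weights)   # more weight on the "dog" key than on the "cat" key
print(soft)      # a value between 1.0 and 2.0, closer to 2.0
```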

SelfAttention_v2 improves on SelfAttention_v1 by using PyTorch’s nn.Linear layers, which perform a plain matrix multiplication when the bias units are disabled. A significant advantage of nn.Linear over manually creating nn.Parameter(torch.rand(...)) is that nn.Linear uses a more suitable weight initialization scheme, which contributes to more stable and effective model training.
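One practical detail when comparing the two variants: nn.Linear stores its weight with shape (d_out, d_in), the transpose of the manual parameter. The sketch below uses random stand-in inputs and copies a manual weight matrix into an nn.Linear layer to show that both then compute the same projection.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out = 3, 2
x = torch.rand(6, d_in)                       # stand-in for six input embeddings

W = nn.Parameter(torch.rand(d_in, d_out))     # manual weight, shape (d_in, d_out)
linear = nn.Linear(d_in, d_out, bias=False)   # stores its weight as (d_out, d_in)

print(linear.weight.shape)                    # torch.Size([2, 3])

# Copy the manual weights into the layer; note the transpose.
with torch.no_grad():
    linear.weight.copy_(W.T)

print(torch.allclose(x @ W, linear(x)))       # True
```

This difference in layout and initialization is also why SelfAttention_v1 and SelfAttention_v2 produce different outputs for the same inputs unless the weights are transferred explicitly.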