<a href="https://colab.research.google.com/github/ishanBahuguna/LLM-from-scratch/blob/main/LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reading in a short story as a text sample into Python**

## Step 1 : Creating Tokens

In [None]:
with open("the-verdict.txt" , "r" , encoding="utf-8") as f:
  raw_text = f.read()

print("Total number of character: ", len(raw_text))
print(raw_text[:99])

FileNotFoundError: [Errno 2] No such file or directory: 'the-verdict.txt'

In [None]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)' , text)
print(result)

In [None]:
result = re.split(r'([,.]|\s)' , text)

print(result)

Whether to remove whitespace depends on the problem. Removing it reduces the number of tokens and therefore the computational cost, but keeping it is useful when whitespace carries meaning, for example the indentation in Python code.
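As a small illustration of this trade-off, the sketch below tokenizes a two-line Python snippet with the same `re.split` pattern, once keeping the whitespace tokens and once dropping them. The snippet stored in `code` is a made-up example and not part of the dataset.

```python
import re

# Hypothetical input: a tiny Python snippet where indentation carries meaning.
code = "if x:\n    y = 1"

# Split on punctuation and whitespace, dropping only the empty strings.
with_ws = [t for t in re.split(r'([,.]|\s)', code) if t]
# Additionally drop the tokens that consist of whitespace only.
without_ws = [t for t in with_ws if t.strip()]

print(with_ws)
print(without_ws)

# Joining the kept-whitespace tokens restores the snippet exactly;
# the stripped version no longer records how far "y = 1" was indented.
print("".join(with_ws) == code)
```

With whitespace removed the token list is shorter, but the original layout cannot be reconstructed from it.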

In [None]:
result = [item for item in result if item.strip()]
print(result)

In [None]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,:;.?_!()\']|--|\s)' , text)
result = [item for item in result if item.strip()]
print(result)

Applying the tokenization to raw_text

In [None]:
preprocessed = re.split(r'([,:;.?_!()\']|--|\s)' , raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

print(preprocessed[:90])


In [None]:
print(len(preprocessed))

We have tokenized the text. Next we build a vocabulary, which is the sorted set of unique tokens, and assign each token a unique integer called its token ID.

## Step 2 : Creating token IDs

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)


NameError: name 'preprocessed' is not defined

In [None]:
# creating vocab by assigning tokenID's to the tokens

vocab = {token : integer for integer , token in enumerate(all_words)}
print(vocab)

In [None]:
# e.g.: creating token IDs for a tiny example
# (stored in eg_vocab so the real vocab built above is not overwritten)

eg_word = sorted(("hello" , "how" , "." , "you"))
eg_vocab = {token : integer for integer , token in enumerate(eg_word)}

print(eg_vocab)

In [None]:
# print the first 51 entries of the vocabulary
for i , item in enumerate(vocab.items()):
  print(item)
  if i >= 50:
    break

Think of the vocab as an encoder that converts words into token IDs. Later we also need a decoder, so that the numeric output of the LLM can be converted back into text.

In [None]:
import re

class SimpleTokenizerV1:
  def __init__(self , vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s , i in vocab.items()}

  def encode(self , text):
    preprocessed = re.split(r'([,:;.?_!()\']|--|\s)' , text)

    # strip whitespace from each token and drop empty strings
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]

    # raises KeyError if a token is not in the vocab
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self , ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # Remove the space that join() puts before punctuation marks
    text = re.sub(r'\s+([,.?!"()\'])' , r'\1' , text)
    return text

In [None]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""

ids = tokenizer.encode(text)
print(ids)

In [None]:
tokenizer.decode(ids)

The simple tokenizer was able to encode and decode the training text successfully because every token was present in the vocab. But what if a word is not present in the vocab?

In [None]:
text = "Hello, do you like tea?" # Hello is not present in the vocab
print(tokenizer.encode(text))

The above call fails with a KeyError because "Hello" does not occur in the short story, so it is not in our small vocab. LLMs are trained on much larger datasets and also use the concept of special text tokens.

## Adding special text tokens to the vocab

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>"  , "<|unk|>"])

vocab = {token:integer for integer, token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

In [None]:
for i,item in enumerate(list(vocab.items())[-5:]):
  print(item)

In [None]:
# updated tokenizer class:

class SimpleTokenizerV2:
  def __init__(self , vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self , text):
    preprocessed = re.split(r'([,:;.?_!()\']|--|\s)' , text)

    # the first strip() removes leading and trailing whitespace from each token,
    # the second strip() (in the condition) filters out empty strings
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]

    # replace every token that is not in the vocab with <|unk|>
    preprocessed = [
        item if item in self.str_to_int
        else "<|unk|>" for item in preprocessed
    ]

    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self , ids):
    text = " ".join(self.int_to_str[i] for i in ids)
    # Remove the space that join() puts before punctuation marks
    text = re.sub(r'\s+([,.?!"()\'])' , r'\1' , text)
    return text

In [None]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello , do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1 , text2))

print(text)

In [None]:
tokenizer.encode(text)

In [None]:
tokenizer.decode(tokenizer.encode(text))

## Byte Pair Encoding

Here we directly use tiktoken, OpenAI's BPE tokenizer library, whose core is written in Rust.

In [None]:
!pip install tiktoken #used by openAI

In [None]:
import importlib.metadata
import tiktoken

print("tiktoken version : " , importlib.metadata.version("tiktoken"))

In [None]:
tokenizer = tiktoken.get_encoding('gpt2')

The usage of this tokenizer is similar to SimpleTokenizerV2 which we have implemented earlier

In [None]:
text = (
    "Hello , do you like tea? <|endoftext|> In the sunlit terraces "
    "of someunknownPlace."
)

integers = tokenizer.encode(text , allowed_special={"<|endoftext|>"})
print(integers)

In [None]:
strings = tokenizer.decode(integers)
print(strings)

Another example to illustrate how the BPE algorithm handles a made-up word by splitting it into known subword pieces

In [None]:
integers = tokenizer.encode("Akwirw ier")
print(integers)

strings = tokenizer.decode(integers)
print(strings)
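To make the idea behind BPE more concrete, here is a minimal sketch of a single training step on a toy corpus: count the adjacent symbol pairs and merge the most frequent one into a new symbol. This is a simplified illustration, not how tiktoken is implemented internally; the corpus and the helper names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Toy corpus (made up), split into single characters to start with.
toy_words = [list(w) for w in ["hug", "pug", "hugs", "bun"]]

pair = most_frequent_pair(toy_words)
toy_words = [merge_pair(w, pair) for w in toy_words]

print(pair)
print(toy_words)
```

Repeating this step many times builds up a vocabulary of frequent subwords, which is why an unseen word can still be encoded as a sequence of smaller known pieces instead of an unknown token.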

## Creating input-target pairs

Implement a data loader that fetches the input-target pairs using a sliding-window approach.

In [None]:
with open("the-verdict.txt" , "r" , encoding="utf-8") as f:
  raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print("Total number of tokens : " , len(enc_text))

In [None]:
enc_sample = enc_text[50:]

In [None]:
context_size = 4 # tiny, for illustration only; GPT-2 used a context size of 1024

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x : {x}")
print(f"y :      {y}")

In [None]:
for i in range(1 , context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]

  print(context , " ----> " , desired)

In [None]:
for i in range(1 , context_size+1):
  context = enc_sample[:i]
  desired = enc_sample[i]

  print(tokenizer.decode(context) , " ----> " , tokenizer.decode([desired]))

In [None]:
import torch

In [None]:
torch.__version__

In [None]:
import tiktoken

In [None]:
from torch.utils.data import Dataset , DataLoader
# todo : learn pytorch and Dataset class

class GPTDatasetV1(Dataset):
  def __init__(self , txt , tokenizer, max_length , stride):
    self.input_ids = []
    self.target_ids = []

    # tokenize the entire text
    token_ids = tokenizer.encode(txt , allowed_special={"<|endoftext|>"})

    # use a sliding window to chunk the text into overlapping sequences of max_length (autoregressive setup)
    for i in range(0 , len(token_ids) - max_length , stride):
      input_chunk = token_ids[i:i+max_length]
      target_chunk = token_ids[i+1:i+max_length+1]
      self.input_ids.append(torch.tensor(input_chunk))
      self.target_ids.append(torch.tensor(target_chunk))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self , idx):
    return self.input_ids[idx] , self.target_ids[idx]

Batches example (batch_size = 2, seven examples):

ex1
ex2

ex3
ex4

ex5
ex6

ex7 --> cannot fill a batch of size 2, so it is dropped when drop_last=True; on a small dataset an undersized final batch may make training unstable
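The grouping can be sketched in plain Python. This only mimics what the `drop_last` option does conceptually and is not the PyTorch implementation; `make_batches` is a hypothetical helper.

```python
def make_batches(examples, batch_size, drop_last):
    """Group examples into consecutive batches, optionally dropping a short final batch."""
    batches = [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

examples = [f"ex{i}" for i in range(1, 8)]  # ex1 ... ex7

print(make_batches(examples, batch_size=2, drop_last=True))
print(make_batches(examples, batch_size=2, drop_last=False))
```

With `drop_last=True` every batch has the same size; with `drop_last=False` the leftover example forms a fourth, smaller batch.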

In [None]:
# txt --> training text
# stride --> number of positions the window moves forward between consecutive samples (default 128)

def create_dataloader_v1(txt ,batch_size=2 , max_length=256
                         , stride=128 , shuffle=True , drop_last=True , num_workers=0 ):

  # initialize the tokenizer
  tokenizer = tiktoken.get_encoding("gpt2")

  # create dataset
  dataset = GPTDatasetV1(txt , tokenizer , max_length , stride)

  # create dataloader : todo
  dataloader = DataLoader(
      dataset,
      batch_size = batch_size,
      shuffle = shuffle,
      drop_last = drop_last,
      num_workers = num_workers
  )

  return dataloader

In [None]:
with open("the-verdict.txt" , "r" , encoding='utf-8') as f:
  raw_text = f.read()

FileNotFoundError: [Errno 2] No such file or directory: 'the-verdict.txt'

In [None]:
data_loader = create_dataloader_v1(raw_text , batch_size=1 , max_length=4 , stride=1 , shuffle=False)

data_iter = iter(data_loader)
first_batch = next(data_iter)
print(first_batch)

In [None]:
second_batch = next(data_iter)
print(second_batch)

In the two batches above, consecutive samples share most of their tokens (with stride=1 they overlap in three of four positions). Heavy overlap like this may lead to overfitting, so we use a larger stride.
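The effect of the stride can be seen without PyTorch. The sketch below applies the same windowing logic as `GPTDatasetV1` to a list of stand-in token IDs; `sliding_windows` is a hypothetical helper written for this illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; the target is the input shifted right by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1]))
    return pairs

toy_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 4):
    inputs_only = [x for x, _ in sliding_windows(toy_ids, max_length=4, stride=stride)]
    print(f"stride={stride}: {inputs_only}")
```

With `stride=1` neighbouring windows differ by a single token, while `stride=max_length` produces windows that do not overlap at all.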

In [None]:
data_loader = create_dataloader_v1(raw_text , batch_size=1 , max_length=4 , stride=4 , shuffle=False)

data_iter = iter(data_loader)
first_batch = next(data_iter)
print(first_batch)

In [None]:
second_batch = next(data_iter)
print(second_batch)

In [None]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=8 , max_length=4, stride=4
    , shuffle=False
)

data_iter = iter(dataloader)
inputs , targets = next(data_iter)
print("Inputs : \n" , inputs)
print("\nTargets : \n" , targets)

## Creating Token embeddings

TODO : LLMs are deep NN so learn how NN works

We convert token IDs into embedding vectors. The vectors are initialized with random values and optimized during LLM training as the weights are adjusted.

In [None]:
input_ids = torch.tensor([2,3,5,1])
# fox-2 , jumps-3 , over-5 , dog-1

vocab_size = 6 # for simplicity taken 6 otherwise use tiktoken vocab size
# tiktoken vocab size --> tokenizer.n_vocab
output_dim = 3 # for simplicity

torch.manual_seed(123) # fix the random seed so the initial weights are reproducible

# embedding_layer = torch.nn.Embedding(vocab_size , output_dim)
embedding_layer = torch.nn.Embedding(tokenizer.n_vocab , output_dim)

In [None]:
print(embedding_layer.weight) ## random values
# embedding_layer acts as a lookup table indexed by token ID

In [None]:
embedding_layer(torch.tensor([3])) # returns row 3 of the weight matrix (the 4th row, since indexing starts at 0)

In [None]:
embedding_layer(input_ids)
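The lookup-table view can be checked by hand. The sketch below uses a made-up 4 x 3 weight matrix in plain Python lists (not the random weights created above) and shows that picking a row gives the same result as multiplying a one-hot vector with the matrix; `lookup` and `one_hot_matmul` are hypothetical helpers.

```python
# Hypothetical 4-token vocabulary with 3-dimensional embeddings.
toy_weight = [
    [0.1, 0.2, 0.3],   # row for token ID 0
    [0.4, 0.5, 0.6],   # row for token ID 1
    [0.7, 0.8, 0.9],   # row for token ID 2
    [1.0, 1.1, 1.2],   # row for token ID 3
]

def lookup(token_id):
    """Embedding lookup: select the row that belongs to token_id."""
    return toy_weight[token_id]

def one_hot_matmul(token_id):
    """Same result via a one-hot vector multiplied with the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(toy_weight))]
    return [sum(one_hot[i] * toy_weight[i][j] for i in range(len(toy_weight)))
            for j in range(len(toy_weight[0]))]

print(lookup(2))
print(one_hot_matmul(2))
```

Both calls return the row for token ID 2, which is why an embedding layer can be thought of as an efficient replacement for a one-hot encoding followed by a linear layer.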

## Encoding word positions


Positions need to be encoded because the self-attention mechanism of an LLM is position-agnostic: on its own it treats the same token at different positions identically, which is not what we want, e.g. the two occurrences of "fox" in "fox jumps over a fox".

In [None]:
vocab_size = 50257 # size of tiktoken vocab
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size , output_dim)
# check out the implementation of embedding from the book

In [None]:
print("Weights of embedding layer : \n" , token_embedding_layer.weight)

In [None]:
max_length = 4

dataloader = create_dataloader_v1(
    raw_text , batch_size=8,
    max_length=max_length ,
    stride = max_length , shuffle=False
)

data_iter = iter(dataloader)
inputs , targets = next(data_iter) # the dataloader returns tensors that can be fed to the embedding layer

In [None]:
print("Token ids : \n " , inputs)

print("\n Input shape: \n" , inputs.shape)

print("\n Targets: \n" , targets)

print("\n Targets shape: \n" , targets.shape)

In [None]:
# the inputs are fed to the embedding layer
token_embeddings = token_embedding_layer(inputs)
token_embeddings.shape

In [None]:
token_embeddings[0,0]

For positional encoding, GPT simply uses another embedding layer (learned absolute position embeddings).
LLaMA instead uses rotary position embeddings (RoPE).

In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length , output_dim)
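In the GPT-style scheme, the position embedding of slot `i` is added element-wise to the embedding of whichever token sits in slot `i`. A minimal sketch with hand-picked numbers in plain Python follows; the values in `token_emb` and `pos_emb` are hypothetical and stand in for rows of the two embedding layers.

```python
# Hypothetical 2-dimensional embeddings (in the notebook these are rows of torch.nn.Embedding layers).
token_emb = {"fox": [1.0, 0.0], "jumps": [0.0, 1.0], "over": [1.0, 1.0], "a": [0.5, 0.5]}
pos_emb = [[0.0, 0.1], [0.0, 0.2], [0.0, 0.3], [0.0, 0.4], [0.0, 0.5]]  # one row per position

sentence = ["fox", "jumps", "over", "a", "fox"]

# Input embedding = token embedding + embedding of the position the token occupies.
input_emb = [
    [t + p for t, p in zip(token_emb[word], pos_emb[i])]
    for i, word in enumerate(sentence)
]

print(input_emb[0])   # "fox" at position 0
print(input_emb[4])   # "fox" at position 4
```

The two occurrences of "fox" share the same token embedding but end up with different input vectors, which gives the attention mechanism a way to tell them apart.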

# **Coding Attention Mechanism**

## Simple Attention Mechanism without trainable weights

In [None]:
import torch

Tensors are multidimensional arrays; here they hold the numeric representation of the text

0-D tensor : scalar

1-D tensor : vector

2-D tensor : matrix

3-D tensor : 3rd order tensor

and so on for higher orders


In [None]:
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your        (x^1)
     [0.55, 0.87, 0.66],  # journey     (x^2)
     [0.57, 0.85, 0.64],  # starts      (x^3)
     [0.22, 0.58, 0.33],  # with        (x^4)
     [0.77, 0.25, 0.10],  # one         (x^5)
     [0.05, 0.80, 0.55]]  # step        (x^6)
)


In [None]:
# dot product of two vectors

input_1 = inputs[0]
input_2 = inputs[1]

print("Input 1 : " , input_1)
print("Input 2 : " , input_2)

print("Multiplication : " , input_1 * input_2)

Input 1 :  tensor([0.4300, 0.1500, 0.8900])
Input 2 :  tensor([0.5500, 0.8700, 0.6600])
Multiplication :  tensor([0.2365, 0.1305, 0.5874])


In [None]:
dot_product = 0.43*0.55 + 0.15*0.87 + 0.89*0.66
print(dot_product)

#dot product computed with torch:
print(torch.dot(input_1 , input_2))

0.9544
tensor(0.9544)


In [None]:
# dot product of each token w.r.t x2:
input_2 = inputs[1]

for x_i in inputs:
  print(torch.dot(x_i , input_2))


# sum of all the dot products (the same value as attn_scores_2.sum() below)
res = 0

for x_i in inputs:
  res += torch.dot(x_i , input_2)


tensor(0.9544)
tensor(1.4950)
tensor(1.4754)
tensor(0.8434)
tensor(0.7070)
tensor(1.0865)


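Tensors have a fixed size, and it is generally considered inefficient to increase or decrease their size, whereas Python lists are dynamically sized. That is why the cell below preallocates the result with torch.empty and fills it in place.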


The dot product between two vectors is used to understand the similarity between them
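A quick way to see this is to compare a pair of vectors that point in almost the same direction with a pair that does not. The sketch below reuses three of the input vectors from above in plain Python; the cosine similarity is added here as an assumption-free normalization of the dot product (dot product divided by both vector lengths) and is not used elsewhere in this notebook.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by both vector lengths; 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

journey = [0.55, 0.87, 0.66]  # x^2
starts = [0.57, 0.85, 0.64]   # x^3, nearly the same direction as x^2
one = [0.77, 0.25, 0.10]      # x^5, points elsewhere

print(round(dot(journey, starts), 4), round(cosine(journey, starts), 4))
print(round(dot(journey, one), 4), round(cosine(journey, one), 4))
```

The pair (x^2, x^3) scores higher on both measures than the pair (x^2, x^5), matching the attention scores printed earlier.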

In [None]:
# dot products with respect to a query (x2 in the example above) = attention scores

input_query = inputs[1]
attn_scores_2 = torch.empty(inputs.shape[0])

for idx , x_i in enumerate(inputs):
  attn_scores_2[idx] = torch.dot(x_i , input_query)


print(attn_scores_2)


tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])


In a real attention mechanism, the attention weights are computed from trainable weight matrices that the neural network learns during training


In [None]:
# normalizing the attention scores so that they sum to 1:

attn_wt_2_tmp = attn_scores_2 / attn_scores_2.sum() # simple sum normalization; softmax is the usual choice

attn_wt_2_tmp


tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])

In [None]:
attn_wt_2_tmp.sum()

tensor(1.0000)

In [None]:
# todo: need for normalizing the scores or wts , pytorch , softmax function , significance of dot product

def softmax_naive(x):
  return torch.exp(x) / torch.exp(x).sum(dim = 0)

softmax_naive(attn_scores_2)

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])

In [None]:
attn_wts_2 = torch.softmax(attn_scores_2 , dim=0)
attn_wts_2

tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
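The naive formula and the library function agree here, but the naive formula can break down for large scores because the exponentials overflow. A common remedy is to subtract the maximum score first, which leaves the result unchanged mathematically. The plain-Python sketch below illustrates this; `naive_softmax` and `stable_softmax` are hypothetical helpers, and in plain Python the overflow surfaces as an `OverflowError`, whereas with tensors it would typically show up as `inf` and `nan` values.

```python
import math

def naive_softmax(xs):
    """exp(x_i) / sum_j exp(x_j), computed directly."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Subtract the maximum first; the result is mathematically identical."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]
print([round(w, 4) for w in stable_softmax(scores)])

large = [1000.0, 1001.0, 1002.0]
print(stable_softmax(large))        # works
try:
    naive_softmax(large)
except OverflowError as err:        # math.exp(1000.0) is too large for a float
    print("naive version failed:", err)
```

This is one reason to prefer a library implementation such as `torch.softmax` over a hand-written formula.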

Creating a context vector with the same length as the query vector: multiply each input vector by its attention weight and sum the results, which yields a single vector of the same length as the query.

In [None]:
torch.zeros(input_query.shape)

tensor([0., 0., 0.])

In [None]:
context_vec_2 = torch.zeros(input_query.shape)

# it is called self-attention because each token attends to all tokens of the same sequence (including itself), weighted by the attention weights
for idx , x_i in enumerate(inputs):
  print(f"{attn_wts_2[idx]}  ---->  {x_i}")
  context_vec_2 += attn_wts_2[idx] * x_i
  print(context_vec_2)

context_vec_2

# what is happening in above loop:
"""
inputs =   [[1,2,3],
           [4,5,6],
           [7,8,9],
           [10,11,12]]

attn_wts = [1,2,3,4]   (toy values; real attention weights sum to 1)

context_vec = (1 * [1,2,3]) + (2 * [4,5,6]) + (3 * [7,8,9]) + (4 * [10,11,12])
            = [70 , 80 , 90]
"""

0.13854756951332092  ---->  tensor([0.4300, 0.1500, 0.8900])
tensor([0.0596, 0.0208, 0.1233])
0.2378913015127182  ---->  tensor([0.5500, 0.8700, 0.6600])
tensor([0.1904, 0.2277, 0.2803])
0.23327402770519257  ---->  tensor([0.5700, 0.8500, 0.6400])
tensor([0.3234, 0.4260, 0.4296])
0.12399158626794815  ---->  tensor([0.2200, 0.5800, 0.3300])
tensor([0.3507, 0.4979, 0.4705])
0.10818186402320862  ---->  tensor([0.7700, 0.2500, 0.1000])
tensor([0.4340, 0.5250, 0.4813])
0.15811361372470856  ---->  tensor([0.0500, 0.8000, 0.5500])
tensor([0.4419, 0.6515, 0.5683])


tensor([0.4419, 0.6515, 0.5683])
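The weighted sum performed by the loop can be verified by hand on small integers. The sketch below uses toy values chosen for easy mental arithmetic (unlike real attention weights, they do not sum to 1); the names `toy_inputs`, `toy_weights` and `toy_context` are introduced only for this illustration.

```python
# Toy numbers chosen for easy mental arithmetic; real attention weights sum to 1.
toy_inputs = [[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]]
toy_weights = [1, 2, 3, 4]

toy_context = [0, 0, 0]
for w, x in zip(toy_weights, toy_inputs):
    # scale the whole input vector by its weight, then accumulate
    toy_context = [c + w * x_k for c, x_k in zip(toy_context, x)]

print(toy_context)  # [70, 80, 90]
```

The first component, for example, is 1*1 + 2*4 + 3*7 + 4*10 = 70.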

Creating context vector for all inputs

In [None]:
import torch
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your        (x^1)
     [0.55, 0.87, 0.66],  # journey     (x^2)
     [0.57, 0.85, 0.64],  # starts      (x^3)
     [0.22, 0.58, 0.33],  # with        (x^4)
     [0.77, 0.25, 0.10],  # one         (x^5)
     [0.05, 0.80, 0.55]]  # step        (x^6)
)

print(f"Input Tensor : \n\n{inputs}")
# creating empty tensor to store attention scores
attn_scores = torch.empty([inputs.shape[0] , inputs.shape[0]])
attn_wts = torch.empty([inputs.shape[0] , inputs.shape[0]])


# calculating attention scores for each input tensor
for i , x_i in enumerate(inputs):
  for j , x_j in enumerate(inputs):
    attn_scores[i][j] = torch.dot(x_i , x_j)

print(f"\n\n\nAttention scores before normalization: \n\n {attn_scores}")

# normalizing attention scores:
# for i , x_i in enumerate(attn_scores):
#   attn_wts[i] = torch.softmax(x_i , dim=0)

# dim=1 => softmax is applied row-wise, which is equivalent to the loop above
attn_wts = torch.softmax(attn_scores , dim=1)

# attn_wts = torch.softmax(attn_scores , dim=0) --> dim=0 => apply softmax column-wise

print(f"\nAttention scores after normalization: \n\n{attn_wts}")

# tensor that will hold one context vector per input vector
context_vec = torch.empty(inputs.shape)

print(f"\n\nContext vector initialization : \n\n{context_vec}")


for i in range(0 , inputs.shape[0]):
  # weighted sum of all input vectors, using row i of the attention weights
  temp = torch.zeros(inputs.shape[1])
  for idx , x_j in enumerate(inputs):
    temp += x_j * attn_wts[i][idx]

  if i == 1:
    print(f"\nContext vector : \n {temp}")

  context_vec[i] = temp


print(f"\n\nContext vector after processing with attention weights :\n\n{context_vec}")


Input Tensor : 

tensor([[0.4300, 0.1500, 0.8900],
        [0.5500, 0.8700, 0.6600],
        [0.5700, 0.8500, 0.6400],
        [0.2200, 0.5800, 0.3300],
        [0.7700, 0.2500, 0.1000],
        [0.0500, 0.8000, 0.5500]])



Attention scores before normalization: 

 tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

Attention scores after normalization: 

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])

In [None]:
# the same computation with matrix operations:

attn_scores = inputs @ inputs.T # matrix product with the transpose; much faster than two nested Python loops
attn_scores

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])

In [None]:
attn_wts = torch.softmax(attn_scores , dim = 1)
attn_wts.sum(dim=-1)

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

In [None]:
context_vec = attn_wts @ inputs
context_vec

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
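In matrix notation, with the six input vectors stacked as the rows of $X \in \mathbb{R}^{6 \times 3}$, the three cells above compute the following. The symbols $S$, $A$ and $Z$ are introduced here for illustration and correspond to `attn_scores`, `attn_wts` and `context_vec`.

$$
S = X X^{\top}, \qquad
A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{6} \exp(S_{ik})}, \qquad
Z = A X
$$

Row $i$ of $Z$ equals $\sum_{j} A_{ij}\, x^{(j)}$, which is the same weighted sum that the explicit loop builds for one query at a time.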

## Self attention with trainable weights

In [2]:
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89],  # Your        (x^1)
     [0.55, 0.87, 0.66],  # journey     (x^2)
     [0.57, 0.85, 0.64],  # starts      (x^3)
     [0.22, 0.58, 0.33],  # with        (x^4)
     [0.77, 0.25, 0.10],  # one         (x^5)
     [0.05, 0.80, 0.55]]  # step        (x^6)
)

In [None]:
x_2 = inputs[1]
d_in = inputs.shape[1]
d_out = 2

print(f"x_2 : {x_2} \n\n d_in : {d_in}\n\n d_out : {d_out}")

In [None]:
torch.manual_seed(123)

# wrapping the query weight matrix in nn.Parameter makes the tensor trainable (requires_grad=True)
W_query = torch.nn.Parameter(torch.rand(d_in , d_out))
W_query

In [None]:
W_key = torch.nn.Parameter(torch.rand(d_in , d_out))
W_key

In [None]:
W_value = torch.nn.Parameter(torch.rand(d_in , d_out))
W_value

In [None]:
query_2 = x_2 @ W_query
query_2

In [1]:
keys = inputs @ W_key
keys

NameError: name 'inputs' is not defined
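With `inputs` and the three weight matrices defined in the same session, the remaining steps for one query follow the same pattern as in the simple mechanism: project to queries, keys and values, score the query against every key, normalize, and take the weighted sum of the values. The plain-Python sketch below walks through this with small hand-written weight matrices (hypothetical values, not the random ones produced by `torch.rand`). Dividing the scores by the square root of the key dimension before the softmax is the scaling used in standard scaled dot-product attention; the notebook has not reached that step, so treat it as a preview rather than the author's implementation.

```python
import math

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

toy_inputs = [[0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64]]

# Hypothetical 3x2 weight matrices (d_in=3, d_out=2).
toy_W_q = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_W_k = [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]]
toy_W_v = [[0.5, 0.1], [0.2, 0.9], [0.3, 0.4]]

toy_query = matvec(toy_inputs[1], toy_W_q)             # query for the second token
toy_keys = [matvec(x, toy_W_k) for x in toy_inputs]    # one key per token
toy_values = [matvec(x, toy_W_v) for x in toy_inputs]  # one value per token

d_k = len(toy_keys[0])
toy_scores = [dot(toy_query, k) / math.sqrt(d_k) for k in toy_keys]  # scaled scores
toy_weights = softmax(toy_scores)                                    # sum to 1
toy_context = [sum(w * v[j] for w, v in zip(toy_weights, toy_values))
               for j in range(len(toy_values[0]))]

print(toy_weights)
print(toy_context)
```

The context vector now has `d_out` entries rather than `d_in`, because it is a weighted sum of the projected value vectors instead of the raw inputs.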