## The notebook is to implement a basic LLM fully from scratch

### Tokenization in Early LLM Stages

In the initial stages of large language models (LLMs), the process begins with tokenizing both words and characters. Tokenization refers to the technique of converting text into numerical representations. There are various methods to achieve this:

- **Byte Pair Encoding (BPE):** This method, used by OpenAI, breaks words down into subword units. For example, words like "depend" might be split into "de" and "pend," while suffixes such as "ing" are treated similarly.
  
- **Word-to-Vector Encoding:** This simpler approach converts entire words into numerical values by sorting a dictionary of words and mapping each to a number. While straightforward, it struggles to handle unseen words.
  
In contrast, BPE is preferred because it breaks words into smaller subword units, ensuring even previously unseen words can be tokenized effectively.


In [1]:
# Tiktoken is used for BPE
import tiktoken
print("tiktoken version:", tiktoken.__version__)
import torch
from torch.utils.data import Dataset, DataLoader
print("torch version: ", torch.__version__)

tiktoken version: 0.8.0
torch version:  2.4.1+cu124


### Implement the BPE
- Chosen cl100k_base it has total 100256 tokens placed in the url of open ai "https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken"

In [20]:
tokenizer = tiktoken.get_encoding("cl100k_base")
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)
decode_text = tokenizer.decode(integers)
print(decode_text)



[9906, 11, 656, 499, 1093, 15600, 30, 220, 100257, 763, 279, 7160, 32735, 7317, 2492, 1073, 1063, 16476, 17826, 13]
Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


### Representing the Data
We will now define the input data and target data for our model:

- **Input Data:** These are the tokens provided as input to the language model (LLM).
- **Target Data:** This is the same as the input data, but used for prediction — it represents the next word that the model is trying to predict.
- **Dataset:** The dataset in pytorch is an interface for accessing and managing data. len() and getitem() functions are needed for the dataset
- **DataLoader:** : This used to load the data from the dataset in batches. It handles shuffling, batches and parallelism

In [5]:
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt)    #1

        for i in range(0, len(token_ids) - max_length, stride):     #2
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):    #3
        return len(self.input_ids)

    def __getitem__(self, idx):         #4
        return self.input_ids[idx], self.target_ids[idx]

In [6]:
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")                         #1
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)   #2
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,     #3
        num_workers=num_workers     #4
    )

    return dataloader

In [18]:
with open("The_Verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)      #1
first_batch = next(data_iter)
print(first_batch)
first_batch = next(data_iter)
print(first_batch)

[tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]]), tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])]
[tensor([[  287,   262,  6001,   286],
        [  465, 13476,    11,   339],
        [  550,  5710,   465, 12036],
        [   11,  6405,   257,  5527],
        [27075,    11,   290,  4920],
        [ 2241,   287,   257,  4489],
        [   64,   319,   262, 34686],
        [41976,    13,   357, 10915]]), tensor([[  262,  6001,   286,   465],
        [13476,    11,   339,   550],
    

### Now we are going to create token embeddings
These are like weights for data features