# Large Language Model

1. In this notebook I implement a large language model from scratch.
2. The code below follows these steps from a transformer architecture:
    - Tokenize the training text input

## Preprocessing: Read and Process the Text, Convert to Tokens, Embed into Vectors

## Step 1: Create Tokens from the text read from a book

In [83]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total characters read from book", len(raw_text))
# Print the first 99 characters
print(raw_text[:99])

Total characters read from book 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 


### Tokenize all the text to be used for the LLM

In [84]:
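# Use regular expressions to split the text into tokens
import re

text = "Hello, This is a Test to split words from a large text.-- Just testing"
# Simpler alternative that splits on whitespace only:
# result = re.split(r'(\s)', text)

# Split on punctuation, double dashes, and whitespace; the capture group keeps the delimiters as tokens
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)

# Strip surrounding whitespace from each item and drop the items that end up empty
result = [item.strip() for item in result if item.strip()]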
print(result)

['Hello', ',', 'This', 'is', 'a', 'Test', 'to', 'split', 'words', 'from', 'a', 'large', 'text', '.', '--', 'Just', 'testing']
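
The parentheses around the pattern matter: `re.split` discards the delimiters unless the pattern contains a capture group. The sketch below contrasts the two behaviours on a short sample sentence; the sentence and the variable names are made up for illustration and are not part of the book's text.

```python
import re

sample = "Hello, world."  # hypothetical sample sentence

# Without a capture group, the matched delimiters are discarded
without_group = re.split(r'[,.]|\s', sample)

# With a capture group, each matched delimiter is kept in the result list
with_group = re.split(r'([,.]|\s)', sample)

# Drop empty strings and whitespace-only items, as in the cell above
tokens = [item.strip() for item in with_group if item.strip()]

print(without_group)
print(with_group)
print(tokens)
```

Keeping the punctuation as separate tokens is what lets the model see commas and periods as part of the text.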


In [85]:
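# Apply the tokenizer code to the entire raw text from the book.
# Note: unlike the test pattern above, this pattern omits "(" and ")", so a
# parenthesis next to a word stays attached to it (e.g. '(Though' in the output below).
preprocessed = re.split(r'([,.:;?_!"\']|--|\s)', raw_text)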

preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:99])
print("Total Tokens:", len(preprocessed))

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--']
Total Tokens: 4685


### Convert Tokens to Token IDs 

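- Build the vocabulary: a sorted list of all unique tokens, each assigned a unique index. Python sorts strings by code point, so punctuation and uppercase letters come before lowercase letters.
- The token ID of any token is its index in the vocabulary built so far.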
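
As a minimal sketch of these two points, the snippet below builds a vocabulary and its inverse from a toy token list. The tokens and the `toy_*` names are assumptions made for illustration; the real vocabulary is built from the book text in the cells that follow.

```python
# Toy token list (hypothetical, not taken from the book)
toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Unique tokens, sorted by code point ("." sorts before the letters), then numbered
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}

# Inverse mapping, needed later to turn IDs back into tokens
toy_inverse = {idx: token for token, idx in toy_vocab.items()}

toy_ids = [toy_vocab[t] for t in toy_tokens]
print(toy_vocab)
print(toy_ids)
print([toy_inverse[i] for i in toy_ids])
```

The repeated token "the" maps to the same ID both times, and the inverse mapping recovers the original token list.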

In [86]:
# Convert preprocessed to a Set so all elements are unique and then sort it
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1131


In [87]:
# Assign an integer ID to each token
# This encodes tokens as numbers; we will need a decoder to convert a number back to its token
vocab = {token: integer for integer, token in enumerate(all_words)}


In [88]:
# Print the first 51 vocabulary entries
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('!', 0)
('"', 1)
("'", 2)
('(I', 3)
('(Though', 4)
(')', 5)
(',', 6)
('--', 7)
('.', 8)
(':', 9)
(';', 10)
('?', 11)
('A', 12)
('Ah', 13)
('Among', 14)
('And', 15)
('Are', 16)
('Arrt', 17)
('As', 18)
('At', 19)
('Be', 20)
('Begin', 21)
('Burlington', 22)
('But', 23)
('By', 24)
('Carlo', 25)
('Chicago', 26)
('Claude', 27)
('Come', 28)
('Croft', 29)
('Croft)', 30)
('Destroyed', 31)
('Devonshire', 32)
('Don', 33)
('Dubarry', 34)
('Emperors', 35)
('Florence', 36)
('For', 37)
('Gallery', 38)
('Gideon', 39)
('Gisburn', 40)
('Gisburns', 41)
('Grafton', 42)
('Greek', 43)
('Grindle', 44)
('Grindles', 45)
('HAD', 46)
('Had', 47)
('Hang', 48)
('Has', 49)
('He', 50)


In [107]:
# Define a tokenizer class that encodes and decodes text using a provided vocabulary.
# The constructor accepts a vocabulary: a dictionary mapping each token we want to train on to its numerical ID.
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Instance attribute str_to_int stores the vocabulary: token string -> token ID
        self.str_to_int = vocab
        # Instance attribute int_to_str stores the inverse mapping: token ID -> token string
        self.int_to_str = {i: s for s, i in vocab.items()}

    # Split the text into tokens and look up each token's ID in the vocabulary
    def encode(self, text):
        # Split on punctuation, double dashes ("--"), and whitespace
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Map each token to its ID; raises KeyError for tokens missing from the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        # Remove the spaces that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text.strip()


## Instantiate Tokenizer

In [108]:
tokenizer = SimpleTokenizerV1(vocab)

# Test the encoder
text = """"It is the last of all the Greek," 
            said he. """
ids = tokenizer.encode(text)
print(ids)

[1, 58, 585, 989, 603, 723, 146, 989, 43, 6, 1, 852, 534, 8]


In [110]:
# Test the decoder
tokenizer.decode(ids)

'" It is the last of all the Greek," said he.'

## Special Context Tokens to Handle Unknown Words in Dictionary and End of Text Token

### End of Text tokens are embedded between texts from two different sources or unrelated texts.
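
A minimal sketch of both ideas on a toy vocabulary: unknown tokens fall back to the `<|unk|>` ID, and `<|endoftext|>` is placed between two unrelated token sequences. The vocabulary, the helper `to_ids`, and the two toy documents are hypothetical and only illustrate the mechanism.

```python
# Toy vocabulary that already contains the two special tokens (hypothetical IDs)
toy_vocab = {"<|endoftext|>": 0, "<|unk|>": 1, "hello": 2, "world": 3}

def to_ids(tokens, vocab):
    """Map tokens to IDs, using the <|unk|> ID for tokens missing from vocab."""
    unk_id = vocab["<|unk|>"]
    return [vocab.get(token, unk_id) for token in tokens]

doc_a = ["hello", "world"]
doc_b = ["hello", "moon"]  # "moon" is not in the toy vocabulary

# Separate the two unrelated documents with the end-of-text token
stream = doc_a + ["<|endoftext|>"] + doc_b
toy_stream_ids = to_ids(stream, toy_vocab)
print(toy_stream_ids)
```

Using `dict.get` with a default avoids the `KeyError` that a plain dictionary lookup raises for a token outside the vocabulary.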

In [115]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))

1133


In [120]:
list(vocab.items())[-5:]


[('younger', 1128),
 ('your', 1129),
 ('yourself', 1130),
 ('<|endoftext|>', 1131),
 ('<|unk|>', 1132)]

In [121]:
for i,item in enumerate(list(vocab.items())[-5:]):
    print(item)

('younger', 1128)
('your', 1129)
('yourself', 1130)
('<|endoftext|>', 1131)
('<|unk|>', 1132)


In [134]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        # Instance attribute str_to_int stores the vocabulary: token string -> token ID
        self.str_to_int = vocab
        # Instance attribute int_to_str stores the inverse mapping: token ID -> token string
        self.int_to_str = {i: s for s, i in vocab.items()}

    # Split the text into tokens and look up each token's ID in the vocabulary
    def encode(self, text):
        # Split on punctuation, double dashes ("--"), and whitespace
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace tokens that are not in the vocabulary with the <|unk|> token
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        # Map each token to its ID using the str_to_int dictionary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])

        # Remove the spaces that the join inserted before punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text.strip()

In [135]:
# Test SimpleTokenizerV2
tokenizer1 = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In sunlit terraces of palace"

# Join the two unrelated texts with the end-of-text token
final_text = " <|endoftext|> ".join((text1, text2))
print(final_text)

Hello, do you like tea? <|endoftext|> In sunlit terraces of palace


In [136]:
ids = tokenizer1.encode(final_text)
print(ids)

[1132, 6, 356, 1127, 629, 976, 11, 1131, 57, 957, 985, 723, 1132]


In [137]:
tokenizer1.decode(ids)

'<|unk|>, do you like tea? <|endoftext|> In sunlit terraces of <|unk|>'

### Some additional tokens used during training of LLMs
- [BOS] : Beginning of Sequence
- [EOS] : End of Sequence
- [PAD] : Padding that extends shorter sequences to a common length so they can be batched and trained in parallel
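
The sketch below shows one way these three tokens could be applied to a small batch of token-ID sequences. The ID values chosen for `PAD`, `BOS`, and `EOS` and the sequences themselves are assumptions for illustration; real tokenizers assign their own IDs.

```python
# Hypothetical IDs for the special tokens
PAD, BOS, EOS = 0, 1, 2

# Three token-ID sequences of different lengths (made-up values)
sequences = [[5, 6, 7], [8, 9], [10]]

# Mark the beginning and end of every sequence
wrapped = [[BOS] + seq + [EOS] for seq in sequences]

# Pad every sequence on the right up to the length of the longest one
max_len = max(len(seq) for seq in wrapped)
padded = [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]

for row in padded:
    print(row)
```

After padding, all rows have the same length, so the batch can be stored as a single rectangular tensor.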

## Byte-Pair Encoding: Words Can Be Broken into Subword Tokens
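
To make the idea concrete, here is a toy sketch of how byte-pair encoding learns subword units: start from single characters and repeatedly merge the most frequent adjacent pair. It is a simplified character-level illustration, not the byte-level tokenizer used by production LLMs; the word counts and the helper names `count_pairs` and `merge_pair` are hypothetical.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency (made-up counts), each word split into characters
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(4):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # ties go to the pair seen first
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

After four merges the frequent fragments "est" and "low" have become single symbols, while rarer words such as "lower" remain split into several subword pieces.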