# Working with text data

<img src="../images/three-main-stages-of-coding-an-llm-stage1-step1.png" width="800px">

## 2.1 Understanding word embeddings

<b>Why do we need embeddings?</b>
- <span style="color:red">Deep neural network (NN) models, including LLMs, cannot process raw text directly. Because text is categorical, it is not compatible with the mathematical operations used to train NNs.</span>
- So, we need a way <span style="color:#4ea9fb">to represent non-numeric data (words/text) as continuous-valued numbers, a format that NNs can understand and process</span>.

<b>What's an embedding?</b>
- The concept of <span style="color:#4ea9fb"><b>converting text (or other data) into numerical vector representations.</b></span>
- In other words, an embedding is a mapping from discrete objects (words, images, or entire documents) to points in a continuous, high-dimensional space.
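
To make the "mapping from discrete objects to points" idea concrete, here is a minimal sketch of an embedding as a lookup table. The three-word vocabulary, the dimension of 4, and the random initialization are illustrative assumptions, not values used elsewhere in these notes.

```python
import random

# Hypothetical toy vocabulary: each discrete word gets an integer row index.
toy_vocab = {"cat": 0, "dog": 1, "car": 2}
embedding_dim = 4  # illustrative; real models use hundreds or thousands

# One row of continuous values per word (randomly initialized here;
# in a trained model these values would be learned).
random.seed(0)
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in toy_vocab
]

def embed(word):
    """Map a discrete word to its continuous vector via a table lookup."""
    return embedding_table[toy_vocab[word]]

vector = embed("dog")
print(f"'dog' -> vector with {len(vector)} dimensions")
```

The lookup itself involves no arithmetic; the vectors only become meaningful once training adjusts the table entries.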

<b>Different types of embeddings</b>
- While *word embeddings* are the most common form of text embedding, there are other types of embeddings, such as subword/token, sentence, paragraph, and document embeddings.
  - Since GPT-like LLMs learn to generate one word at a time, we will focus on **word embeddings**.
- Refer to [https://prasanth.io/Knowledge/Tech/Embeddings](https://prasanth.io/Knowledge/Tech/Embeddings) for the different types of embeddings.
- For *retrieval-augmented generation*, sentence or paragraph embeddings are more popular choices.

<b>How to embed different data types?</b>
- Using a specific NN layer or another pretrained NN model, we can embed different data types, such as text, images, and video.
<p style="color:black; background-color:#F5C780; padding:15px">💡Different data types require different embedding models. <span style="color:red">An embedding model designed for text data would not be suitable for embedding audio or video data.</span></p>

<img src="../images/different-embedding-models-for-different-data-types.png" width="700px">

<b><i>Word2Vec</i> - Most popular word embedding</b>
- <span style="color:#4ea9fb">The main idea behind Word2Vec is that <b>words that appear in similar contexts tend to have similar meanings</b></span>. Consequently, when projected into two-dimensional word embeddings for visualization purposes, similar terms are clustered together.
- For more details, refer to [https://prasanth.io/Knowledge/Tech/Word2Vec](https://prasanth.io/Knowledge/Tech/Word2Vec)
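
One common way to quantify "similar terms are clustered together" is cosine similarity between embedding vectors. The sketch below uses hand-picked 2-D vectors purely for illustration; they are assumptions, not real Word2Vec output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-D embeddings: two birds placed close together, a country elsewhere.
toy_embeddings = {
    "eagle": [0.9, 0.8],
    "duck": [0.8, 0.9],
    "germany": [-0.7, 0.2],
}

sim_birds = cosine_similarity(toy_embeddings["eagle"], toy_embeddings["duck"])
sim_mixed = cosine_similarity(toy_embeddings["eagle"], toy_embeddings["germany"])
print(f"eagle vs duck:    {sim_birds:.3f}")
print(f"eagle vs germany: {sim_mixed:.3f}")
```

With these toy vectors, the two birds score much higher than the bird/country pair, which is the pattern a 2-D projection of trained embeddings is meant to reveal.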

<img src="../images/word-embeddings-projected-in-two-dimension-example.png" width="600px">

<b>Why don't we use <i>Word2Vec</i> for LLMs?</b>
- <span style="color:#4ea9fb">LLMs commonly produce their own embeddings as part of the input layer, and these embeddings are updated during training</span>.
- <span style="color:green">The advantage of optimizing the embeddings as part of the LLM training is that the embeddings are optimized to the specific data and task at hand</span>.
  - LLMs can also create contextualized output embeddings.

<b>What's an optimal embedding size?</b>
- It's <span style="color:#4ea9fb">a trade-off between performance and efficiency</span>.
- For more details on the embedding sizes of various GPTs, refer to https://prasanth.io/Knowledge/Tech/GPT-comparison.
  - For example, GPT-1 and GPT-2 Small (both 117M parameters) use an embedding size of 768 dimensions, whereas GPT-3 Davinci (175B parameters) uses an embedding size of 12,288 dimensions (16x the former).
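
To get a feel for the efficiency side of this trade-off, the sketch below counts the weights in the token embedding table alone, which is simply vocabulary size times embedding size. The vocabulary size of 50,257 is an assumption brought in for illustration (it is the size of the GPT-2 BPE vocabulary and is not stated in the text above).

```python
def embedding_table_params(vocab_size, embedding_dim):
    """Number of weights in a token embedding table (one row per token)."""
    return vocab_size * embedding_dim

# Assumption: a 50,257-token vocabulary for both models.
VOCAB_SIZE = 50_257

for name, dim in [("GPT-2 Small", 768), ("GPT-3 Davinci", 12_288)]:
    n_params = embedding_table_params(VOCAB_SIZE, dim)
    print(f"{name}: {n_params:,} embedding weights")
```

Because the table grows linearly with the embedding size, a 16x larger embedding means a 16x larger table for the same vocabulary.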

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters.

<b>What's tokenizing?</b>
- Split input text into individual tokens (or words or sub-words)

<p style="color:black; background-color:#F5C780; padding:15px">💡The image shown here is a slightly oversimplified version. <br>&nbsp;&nbsp;&nbsp;- <span style="color:red">Between <b>Token IDs</b> and <b>Token embeddings</b>, there's an intermediate sliding-window-based process.<br>&nbsp;&nbsp;&nbsp;- The <b>token embeddings</b> are added to <b>positional embeddings</b> to create the final <b>input embeddings</b> for the decoder.</span></p>

<img src="../images/tokenizing-text-block-diagram.png" width="600px">

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [17]:
import os
import urllib.request

file_path = "the-verdict.txt"
if not os.path.exists(file_path):
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt"
    )
    print(f"Downloading file from: '{url}' to '{file_path}'...")
    urllib.request.urlretrieve(url, file_path)
else:
    print(f"File '{file_path}' already exists. Skipping download.")

File 'the-verdict.txt' already exists. Skipping download.


In [18]:
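# Specify the encoding explicitly so the result does not depend on the platform default
with open(file_path, "r", encoding="utf-8") as f: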
    raw_text = f.read()

print(f"Total characters in text: {len(raw_text)}")
print(f"First 100 characters in text: \n{raw_text[:100]}")

Total characters in text: 20479
First 100 characters in text: 
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


- The **goal is to tokenize and embed this text for an LLM**
- Let's develop a simple tokenizer on a short sample text that we can later apply to the text above
- The following regular expression splits on whitespace

In [21]:
import re
text = "Hello, word. This, is a test."
# Split on whitespace character
result = re.split(r'(\s)', text) 
print(result)

['Hello,', ' ', 'word.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


- We don't only want to split on whitespaces but also commas and periods, so let's modify the regular expression to do that as well

In [22]:
# Split on whitespace, comma, and period characters
result = re.split(r'([.,]|\s)', text) 
print(result)

['Hello', ',', '', ' ', 'word', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
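
The parentheses in the pattern matter: when the pattern contains a capturing group, `re.split` also returns the matched delimiters. The sketch below contrasts the two behaviors on a short made-up string (`sample` is illustrative only).

```python
import re

sample = "a, b"

# No capturing group: delimiters are consumed and dropped.
without_group = re.split(r'[.,]|\s', sample)

# Capturing group: the matched delimiters are kept as separate items.
with_group = re.split(r'([.,]|\s)', sample)

print(without_group)  # ['a', '', 'b']
print(with_group)     # ['a', ',', '', ' ', 'b']
```

The empty string appears in both results because the comma and the space are adjacent, so there is nothing between the two delimiters.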


<b>Should we remove whitespace during tokenization?</b>
- <span style="color:green"><b>Removing whitespace reduces the memory and computing requirements</b></span>
- <span style="color:red">Keeping whitespace can be useful if we train models that are sensitive to the exact structure of the text (e.g., Python code, which is sensitive to indentation and spacing).</span>

- As we can see, splitting on punctuation creates empty strings; let's remove them along with the whitespace tokens

In [23]:
# Strip whitespace from each item and then filter out any empty strings
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'word', '.', 'This', ',', 'is', 'a', 'test', '.']


- This looks pretty good, but let's also handle other types of punctuation, such as colons, semicolons, question marks, quotation marks, and double dashes.

In [24]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip() != ""]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<img src="../images/simple-tokenization-example-of-sample-text.png" width="450px">

- This is pretty good, and we are now ready to apply this tokenization to the raw text loaded from `the-verdict.txt`

In [25]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip() != ""]
print (f"No. of tokens: {len(preprocessed)}")
print(f"First 30 tokens: \n{preprocessed[:30]}")

No. of tokens: 4690
First 30 tokens: 
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
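
Since the same split-and-filter steps are applied repeatedly, they can be wrapped in a small helper. This is only a convenience sketch; the names `TOKEN_PATTERN` and `tokenize` are hypothetical and not part of the notebook.

```python
import re

# Same pattern as used above, compiled once for reuse.
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return [piece.strip() for piece in TOKEN_PATTERN.split(text) if piece.strip()]

tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```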


## 2.3 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs using a vocabulary. Embedding layers can later process these IDs to generate token embedding vectors.

<img src="../images/vocabulary-to-convert-text-tokens-to-token-ids-example.png" width="650px">

- From these tokens, we can now build a vocabulary that consists of all the unique tokens

In [26]:
all_words = sorted(set(preprocessed)) # Sort individual tokens in alphabetical order
vocab_size = len(all_words)
print(f"Vocabulary size: {vocab_size}")

vocab = {token: id for id, token in enumerate(all_words)}

Vocabulary size: 1130


- Below are the first 20 entries in this vocabulary:

In [27]:
for i, item in enumerate(vocab.items()):
    if i < 20:
        print (item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="../images/tokenization-example-using-a-sample-small-vocabulary.png" width="600px">

- Putting it all together into a **simple text tokenizer** class

In [28]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Store the vocabulary for access in encode and decode methods
        self.str_to_int = vocab
        # Create an inverse vocabulary that maps token ids to the original text tokens
        self.int_to_str = {id: token for token, id in vocab.items()}

    def encode(self, text):
        """
        Preprocess the input text into token ids
        """
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        """
        Convert token ids back to the original text
        """
        text = " ".join([self.int_to_str[id] for id in ids])

        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

- The `encode` method turns text into token IDs
- The `decode` method turns token IDs back into text

<img src="../images/tokenizer-encoder-and-decoder-implementations-example.png" width="600px">

In [29]:
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
text

'"It\'s the last he painted, you know," \n           Mrs. Gisburn said with pardonable pride.'

In [30]:
# Let's use the vocab created from `the-verdict.txt`
tokenizer = SimpleTokenizerV1(vocab)

In [31]:
ids = tokenizer.encode(text)
print (ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


- We can decode the integers back into text

In [32]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [33]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
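
Note that the decoded string is not identical to the input: `It's` comes back as `It' s`, and the newline and indentation are gone. The self-contained sketch below reproduces this on a shorter, made-up sentence, using standalone `encode`/`decode` helpers that mirror the class methods (the helper names and the sample sentence are illustrative assumptions).

```python
import re

SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def encode(text, str_to_int):
    """Split text into tokens and look up each token's ID."""
    pieces = re.split(SPLIT_PATTERN, text)
    return [str_to_int[p.strip()] for p in pieces if p.strip()]

def decode(ids, int_to_str):
    """Join tokens with spaces, then remove spaces before punctuation."""
    text = " ".join(int_to_str[i] for i in ids)
    return re.sub(r'\s+([,.?!"()\'])', r"\1", text)

sample = "It's done, you know."

# Build a tiny vocabulary from the sample itself.
tokens = [p.strip() for p in re.split(SPLIT_PATTERN, sample) if p.strip()]
str_to_int = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
int_to_str = {i: tok for tok, i in str_to_int.items()}

round_trip = decode(encode(sample, str_to_int), int_to_str)
print(round_trip)            # It' s done, you know.
print(round_trip == sample)  # False
```

The difference arises because whitespace is discarded during encoding, and decoding only removes the space before a punctuation mark, not the one after it.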

<b>Problem with tokenizing words that are not in the vocabulary</b>
- <span style="color:red">The word "Hello" was not used in the `the-verdict.txt` text, so it's not in the vocabulary.</span> So, encoding it in the code cell below raises a `KeyError`.
- <p style="color:black; background-color:#F5C780; padding:15px">💡 So, large and diverse training sets are needed to extend the vocabulary when working with LLMs.</p>

In [39]:
flag = "Hello" in tokenizer.str_to_int.keys()
print (f"Checking if 'Hello' is in the vocabulary: {flag}")

text = "Hello, do you like tea?"
try:
    tokenizer.encode(text)
except KeyError as e:
    print(f"KeyError: {e}")

Checking if 'Hello' is in the vocabulary: False
KeyError: 'Hello'


## 2.4 Adding special context tokens

<b>Why do we need special tokens?</b>
- <span style="color:#4ea9fb"><b>Special tokens give the LLM additional context</b></span>, such as markers for unknown words and document boundaries.
- Some of these special tokens are:
   - `[UNK]` or `<|unk|>` (unknown): Represents unknown words (words that are not in the vocabulary).
   - `[EOS]` (end of sequence) or `<|endoftext|>`: Positioned at the end of a text. It acts as a marker for the LLM, signaling the end of a particular segment, such as a text or document. It is usually used when concatenating multiple unrelated texts, e.g., two different Wikipedia articles or two different books.
   - `[BOS]` (beginning of sequence): Positioned at the beginning of a text. It acts as a marker, signaling the beginning of a particular piece of content.
   - `[PAD]` (padding): When training LLMs with batch sizes larger than one, the texts in a batch might have varying lengths. To ensure all texts have the same length, we pad (extend) the shorter texts with the `[PAD]` token, up to the length of the longest text in the batch.

<b>What happens if we don't pad input sequences?</b>
- <span style="color:red">Without padding, we cannot process multiple sequences in parallel (as batches) since neural networks expect fixed-size inputs:</span>
  - Most deep learning frameworks require tensors with consistent dimensions for efficient computation
  - The model's internal matrices and vectors are designed for fixed-size inputs
  - Batched operations rely on regular shaped arrays/tensors
- This would force us to process sequences one at a time, which would:
  - Significantly slow down training and inference
  - Prevent utilizing parallel processing capabilities of GPUs
  - Increase computational costs
- <span style="color:green">With padding + attention masks, we can:
  - Process variable length sequences efficiently in batches
  - Tell the model to ignore padded tokens during attention computation
  - Maintain the semantic meaning of the original sequences</span>
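
The padding and masking idea above can be sketched in a few lines of plain Python. The `pad_batch` helper, the pad ID of 0, and the token IDs in the batch are hypothetical choices for illustration; deep learning frameworks provide their own utilities for this.

```python
def pad_batch(sequences, pad_id):
    """Pad token ID sequences to a common length and build attention masks.

    A mask value of 1 marks a real token, 0 marks padding to be ignored.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

PAD_ID = 0  # hypothetical ID reserved for the padding token
batch = [[5, 8, 2], [7], [3, 9]]  # made-up token IDs of varying lengths

padded, masks = pad_batch(batch, PAD_ID)
print(padded)  # [[5, 8, 2], [7, 0, 0], [3, 9, 0]]
print(masks)   # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

After padding, every row has the same length, so the batch forms a regular rectangular array, while the mask records which positions carry real tokens.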

<img src="../images/add-special-tokens-to-vocabulary-to-deal-with-certain-contexts.png" width="600px">

<img src="../images/appending-endoftext-token-to-independent-text-source.png" width="600px">

As observed earlier, tokenization fails if a token is not in the vocabulary. So, we add special tokens to the vocabulary to handle such cases.

In [60]:
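# Note: set iteration order is arbitrary, so token IDs can differ between runs;
# use sorted(set(preprocessed)) instead if reproducible IDs are needed
all_tokens = list(set(preprocessed))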
all_tokens.extend([
    "<|endoftext|>",  # Special token to indicate the end of a text sequence
    "<|unk|>" # Special token to indicate an unknown token/words (out-of-vocabulary words)
    ])
vocab = {token: id for id, token in enumerate(all_tokens)}
print (f"Vocabulary size: {len(vocab)}")

print ("\nLast 5 tokens in the vocabulary:")
for item in list(vocab.items())[-5:]:
    print (item)

Vocabulary size: 1132

Last 5 tokens in the vocabulary:
('poverty', 1127)
('landing', 1128)
('mere', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token

In [49]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        # Store the vocabulary for access in encode and decode methods
        self.str_to_int = vocab
        # Create an inverse vocabulary that maps token ids to the original text tokens
        self.int_to_str = {id: token for token, id in vocab.items()}

    def encode(self, text):
        """
        Preprocess the input text into token ids
        """
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        """
        Convert token ids back to the original text
        """
        text = " ".join([self.int_to_str[id] for id in ids])

        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

Let's try to tokenize text with the modified tokenizer `SimpleTokenizerV2`:

In [61]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join([text1, text2])

print (text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [64]:
print(tokenizer.encode(text))

[1131, 402, 975, 1052, 1059, 3, 41, 1130, 987, 709, 727, 123, 428, 709, 1131, 274]


In [63]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
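
Because every out-of-vocabulary word collapses into the same `<|unk|>` ID, the decoded text cannot tell "Hello" and "palace" apart. When debugging, it can help to list which tokens would be replaced before encoding. The helper `find_unknown_tokens` and the tiny `toy_vocab` below are hypothetical, included only as a sketch.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in text that are missing from vocab."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    return [tok for tok in tokens if tok not in vocab]

# Hypothetical miniature vocabulary for illustration.
toy_vocab = {
    ",": 0, ".": 1, "?": 2, "do": 3, "like": 4,
    "tea": 5, "you": 6, "<|endoftext|>": 7, "<|unk|>": 8,
}

unknown = find_unknown_tokens("Hello, do you like tea? <|endoftext|>", toy_vocab)
print(unknown)  # ['Hello']
```

The `<|endoftext|>` marker survives tokenization as a single token here because none of its characters appear in the split pattern.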

## 2.5 Byte pair encoding