# Working with text data

<img src="../images/three-main-stages-of-coding-an-llm-stage1-step1.png" width="800px">

## 2.1 Understanding word embeddings

<b>Why do we need embeddings?</b>
- <span style="color:red">Deep neural network (NN) models, including LLMs, cannot process raw text directly. Because text is categorical, it is not compatible with the mathematical operations used to implement and train NNs.</span>
- So, we need a way <span style="color:#4ea9fb">to represent non-numeric data (words/text) as continuous-valued vectors, a format that NNs can understand and process</span>.

<b>What's an embedding?</b>
- The concept of <span style="color:#4ea9fb"><b>converting text (or other data) into numerical vector representations.</b></span>
- In other words, an embedding is a mapping from discrete objects (words, images, or entire documents) to points in a continuous, high-dimensional space.

<b>Different types of embeddings</b>
- While *word embeddings* are the most common form of text embedding, there are other types of embeddings, such as subword/token, sentence, paragraph, and document embeddings.
  - Since GPT-like LLMs learn to generate one word at a time, we will focus on **word embeddings**.
- Refer to [https://prasanth.io/Knowledge/Tech/Embeddings](https://prasanth.io/Knowledge/Tech/Embeddings) for different types of embeddings.
- For *retrieval-augmented generation*, sentence or paragraph embeddings are more popular choices.

<b>How to embed different data types?</b>
- Using a specific NN layer or another pretrained NN model, we can embed different data types, such as text, images, and video.
<p style="color:black; background-color:#F5C780; padding:15px">💡Different data types require different embedding models. <span style="color:red">An embedding model designed for text data would not be suitable for embedding audio or video data.</span></p>

<img src="../images/different-embedding-models-for-different-data-types.png" width="700px">

<b><i>Word2Vec</i> - One of the most popular word embeddings</b>
- <span style="color:#4ea9fb">The main idea behind Word2Vec is that <b>words that appear in similar contexts tend to have similar meanings</b></span>. Consequently, when word embeddings are projected into two dimensions for visualization purposes, similar terms cluster together.
- For more details, refer to [https://prasanth.io/Knowledge/Tech/Word2Vec](https://prasanth.io/Knowledge/Tech/Word2Vec)

<img src="../images/word-embeddings-projected-in-two-dimension-example.png" width="600px">
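The clustering idea can be made concrete with a minimal sketch that compares vectors using cosine similarity. The four 3-dimensional vectors and the `cosine_similarity` helper below are made up for illustration; they are not taken from a trained Word2Vec model, which would learn its vectors from data and use many more dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real Word2Vec vectors are learned from data and have far more dimensions.
toy_embeddings = {
    "eagle": [0.9, 0.8, 0.1],
    "duck": [0.8, 0.9, 0.2],
    "germany": [0.1, 0.2, 0.9],
    "berlin": [0.2, 0.1, 0.8],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


sim_birds = cosine_similarity(toy_embeddings["eagle"], toy_embeddings["duck"])
sim_mixed = cosine_similarity(toy_embeddings["eagle"], toy_embeddings["germany"])
print(f"eagle vs duck:    {sim_birds:.3f}")
print(f"eagle vs germany: {sim_mixed:.3f}")
```

With these toy values, the two bird vectors point in nearly the same direction, while the bird and country vectors do not. That is the property a 2D projection makes visible as clusters.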

<b>Why don't we use <i>Word2Vec</i> for LLMs?</b>
- <span style="color:#4ea9fb">LLMs commonly produce their own embeddings as part of the input layer, and these embeddings are updated during training</span>.
- <span style="color:green">The advantage of optimizing the embeddings as part of the LLM training is that the embeddings are optimized to the specific data and task at hand</span>.
  - LLMs can also create contextualized output embeddings.

<b>What's an optimal embedding size?</b>
- It's <span style="color:#4ea9fb">a trade-off between performance and efficiency</span>.
- For more details on the embedding sizes of various GPTs, refer to https://prasanth.io/Knowledge/Tech/GPT-comparison.
  - For example, GPT-1 and GPT-2 Small (both 117M parameters) use an embedding size of 768 dimensions, whereas GPT-3 Davinci (175B parameters) uses an embedding size of 12,288 dimensions (16x the former).
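To get a feel for the efficiency side of this trade-off, the following sketch computes the size of the token-embedding matrix alone (vocabulary size times embedding dimension). It assumes the 50,257-token GPT-2 BPE vocabulary for both models and 32-bit floats; both are assumptions made for this back-of-the-envelope calculation.

```python
# Size of the token-embedding matrix = vocabulary size x embedding dimension.
# Assumption: the 50,257-token GPT-2 BPE vocabulary is used for both models.
vocab_size = 50_257

embedding_dims = {
    "GPT-2 Small": 768,
    "GPT-3 Davinci": 12_288,
}

for label, dim in embedding_dims.items():
    n_params = vocab_size * dim
    # 4 bytes per parameter assumes 32-bit floats
    size_mb = n_params * 4 / 1024**2
    print(f"{label:<14s}: {n_params:>12,d} parameters (~{size_mb:,.0f} MB in float32)")
```

Since the matrix grows linearly with the embedding dimension, a 16x larger embedding size means a 16x larger embedding matrix for the same vocabulary.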

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters.

<b>What's tokenizing?</b>
- Tokenizing means splitting the input text into individual tokens (words, subwords, or punctuation characters)

<p style="color:black; background-color:#F5C780; padding:15px">💡The image shown here is a slightly oversimplified version. <br>&nbsp;&nbsp;&nbsp;- <span style="color:red">Between <b>Token IDs</b> and <b>Token embeddings</b>, there's an intermediate sliding-window-based sampling process.<br>&nbsp;&nbsp;&nbsp;- The <b>token embeddings</b> are added to <b>positional embeddings</b> to create the final <b>input embeddings</b> for the decoder.</span></p>

<img src="../images/tokenizing-text-block-diagram.png" width="600px">

- Load raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [183]:
import os
import urllib.request

file_path = "the-verdict.txt"
if not os.path.exists(file_path):
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt"
    )
    print(f"Downloading file from: '{url}' to '{file_path}'...")
    urllib.request.urlretrieve(url, file_path)
else:
    print(f"File '{file_path}' already exists. Skipping download.")

File 'the-verdict.txt' already exists. Skipping download.


In [184]:
with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total characters in text: {len(raw_text)}")
print(f"First 100 characters in text: \n{raw_text[:100]}")

Total characters in text: 20479
First 100 characters in text: 
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


- The **goal is to tokenize and embed this text for an LLM**
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
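- The following regular expression splits the text on whitespace characters. Because the pattern is wrapped in a capturing group, the whitespace separators are kept in the result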

In [185]:
import re

text = "Hello, word. This, is a test."
# Split on whitespace character
result = re.split(r"(\s)", text)
print(result)

['Hello,', ' ', 'word.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


- We want to split not only on whitespace but also on commas and periods, so let's modify the regular expression accordingly

In [186]:
# Split on whitespace, comma, and period characters
result = re.split(r"([.,]|\s)", text)
print(result)

['Hello', ',', '', ' ', 'word', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


<b>Should we remove whitespaces or not during tokenization?</b>
- <span style="color:green"><b>Removing whitespaces reduces the memory and computing requirements</b></span>
- <span style="color:red">Keeping whitespaces can be useful, if we train models that are sensitive to the exact structure of the text (e.g., Python code, which is sensitive to indentation and spacing).</span>
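The difference between the two options can be sketched on a single line of Python code. The sample string and variable names are chosen only for this illustration.

```python
import re

# A line of Python code whose leading spaces carry meaning (indentation)
code_line = "    return x"

pieces = re.split(r"(\s)", code_line)

# Option 1: drop whitespace entirely
without_ws = [p.strip() for p in pieces if p.strip()]

# Option 2: keep whitespace tokens and only drop the empty strings
with_ws = [p for p in pieces if p != ""]

print(without_ws)
print(with_ws)
print("".join(with_ws) == code_line)
```

Dropping whitespace yields a shorter token list, but the indentation can no longer be recovered. Keeping the whitespace tokens makes the tokenization reversible at the cost of more tokens.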

- As the output of the split above shows, this creates empty strings and whitespace-only items; let's remove them

In [187]:
# Strip whitespace from each item and then filter out any empty strings
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'word', '.', 'This', ',', 'is', 'a', 'test', '.']


- This looks pretty good, but let's also handle other types of punctuation, such as question marks, quotation marks, and double dashes.

In [188]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip() != ""]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


<img src="../images/simple-tokenization-example-of-sample-text.png" width="450px">

- This is pretty good, and we are now ready to apply this tokenization to the raw text loaded from `the-verdict.txt`

In [189]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip() != ""]
print(f"No. of tokens: {len(preprocessed)}")
print(f"First 30 tokens: \n{preprocessed[:30]}")

No. of tokens: 4690
First 30 tokens: 
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs using a vocabulary. Embedding layers can later process these IDs to generate token embedding vectors.

<img src="../images/vocabulary-to-convert-text-tokens-to-token-ids-example.png" width="650px">

- From these tokens, we can now build a vocabulary that consists of all the unique tokens

In [190]:
all_words = sorted(set(preprocessed))  # Sort individual tokens in alphabetical order
vocab_size = len(all_words)
print(f"Vocabulary size: {vocab_size}")

vocab = {token: id for id, token in enumerate(all_words)}

Vocabulary size: 1130


- Below are the first 20 entries in this vocabulary:

In [191]:
for i, item in enumerate(vocab.items()):
    if i < 20:
        print(item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="../images/tokenization-example-using-a-sample-small-vocabulary.png" width="600px">

- Putting it all together into a **simple text tokenizer** class

In [192]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Store the vocabulary for access in encode and decode methods
        self.str_to_int = vocab
        # Create an inverse vocabulary that maps token ids to the original text tokens
        self.int_to_str = {id: token for token, id in vocab.items()}

    def encode(self, text):
        """
        Preprocess the input text into token ids
        """
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        """
        Convert token ids back to the original text
        """
        text = " ".join([self.int_to_str[id] for id in ids])

        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="../images/tokenizer-encoder-and-decoder-implementations-example.png" width="600px">

In [193]:
text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
text

'"It\'s the last he painted, you know," \n           Mrs. Gisburn said with pardonable pride.'

In [194]:
# Let's use the vocab created from `the-verdict.txt`
tokenizer = SimpleTokenizerV1(vocab)

In [195]:
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


- We can decode the integers back into text

In [196]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [197]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<b>Problem with tokenizing words that are not in the vocabulary</b>
- <span style="color:red">The word "Hello" was not used in the `the-verdict.txt` text, so it's not in the vocabulary.</span> So, the `encode` call in the code cell below raises a `KeyError`.

<p style="color:black; background-color:#F5C780; padding:15px">💡 So, large and diverse training sets are needed to extend the vocabulary when working with LLMs.</p>

In [198]:
flag = "Hello" in tokenizer.str_to_int.keys()
print(f"Checking if 'Hello' is in the vocabulary: {flag}")

text = "Hello, do you like tea?"
try:
    tokenizer.encode(text)
except KeyError as e:
    print(f"KeyError: {e}")

Checking if 'Hello' is in the vocabulary: False
KeyError: 'Hello'


## 2.4 Adding special context tokens

<b>Why do we need special tokens?</b>
- <span style="color:#4ea9fb"><b>Special tokens provide the LLM with additional context</b></span>, such as unknown words and document boundaries.
- Some of these special tokens are
   - `[UNK]` or `<|unk|>` (unknown): Represents unknown words (words that are not in the vocabulary)
   - `[EOS]` (end of sequence) or `<|endoftext|>`: Positioned at the end of a text. Acts as a marker for the LLM, signalling the end of a particular segment, such as a text or document. It is usually used when concatenating multiple unrelated texts, e.g., two different Wikipedia articles or two different books.
   - `[BOS]` (beginning of sequence): Positioned at the beginning of a text. Acts as a marker, signalling the beginning of a particular piece of content.
   - `[PAD]` (padding): When training LLMs with batch sizes larger than one, the input texts in a batch might have varying lengths. To ensure all texts have the same length, we pad (extend) the shorter texts with the `[PAD]` token, up to the length of the longest text in the batch.

<b>What happens if we don't pad input sequences?</b>
- <span style="color:red">Without padding, we cannot process multiple sequences of different lengths in parallel (as a batch), since a batch is stored as a single tensor with a fixed shape:</span>
  - Most deep learning frameworks require tensors with consistent dimensions for efficient computation
  - Every row of a batch tensor must therefore contain the same number of tokens
  - Batched operations rely on regularly shaped arrays/tensors
- This would force us to process sequences one at a time, which would:
  - Significantly slow down training and inference
  - Prevent utilizing parallel processing capabilities of GPUs
  - Increase computational costs
- <span style="color:green">With padding + attention masks, we can:
  - Process variable length sequences efficiently in batches
  - Tell the model to ignore padded tokens during attention computation
  - Maintain the semantic meaning of the original sequences</span>
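The padding and masking steps described above can be sketched in plain Python. The token IDs and the choice of `PAD_ID = 0` are assumptions made for this illustration; real tokenizers define their own padding ID.

```python
# Hypothetical token-ID sequences of different lengths (IDs are made up)
batch = [
    [5, 12, 7],
    [9, 3],
    [4, 8, 15, 16, 23],
]
PAD_ID = 0  # assumed ID reserved for the padding token

max_len = max(len(seq) for seq in batch)

# Right-pad every sequence to the length of the longest one
padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]

# 1 marks a real token, 0 marks padding that the model should ignore
attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]

for row, mask in zip(padded, attention_mask):
    print(row, mask)
```

After padding, all rows have the same length, so the batch can be stored as one rectangular tensor, and the mask records which positions hold real tokens.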

<img src="../images/add-special-tokens-to-vocabulary-to-deal-with-certain-contexts.png" width="600px">

<img src="../images/appending-endoftext-token-to-independent-text-source.png" width="600px">

As observed earlier, if a word is not in the vocabulary, tokenization fails. So, we need to add special tokens to the vocabulary.

In [199]:
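# Note: iterating over a set of strings has no guaranteed order, so the token IDs
# assigned below can differ between runs; use sorted(set(preprocessed)) for stable IDs
all_tokens = list(set(preprocessed))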
all_tokens.extend(
    [
        "<|endoftext|>",  # Special token to indicate the end of a text sequence
        "<|unk|>",  # Special token to indicate an unknown token/words (out-of-vocabulary words)
    ]
)
vocab = {token: id for id, token in enumerate(all_tokens)}
print(f"Vocabulary size: {len(vocab)}")

print("\nLast 5 tokens in the vocabulary:")
for item in list(vocab.items())[-5:]:
    print(item)

Vocabulary size: 1132

Last 5 tokens in the vocabulary:
('poverty', 1127)
('landing', 1128)
('mere', 1129)
('<|endoftext|>', 1130)
('<|unk|>', 1131)


- We also need to adjust the tokenizer accordingly so that it knows when and how to use the new `<|unk|>` token

In [200]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        # Store the vocabulary for access in encode and decode methods
        self.str_to_int = vocab
        # Create an inverse vocabulary that maps token ids to the original text tokens
        self.int_to_str = {id: token for token, id in vocab.items()}

    def encode(self, text):
        """
        Preprocess the input text into token ids
        """
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int else "<|unk|>" for item in preprocessed
        ]
        ids = [self.str_to_int[token] for token in preprocessed]
        return ids

    def decode(self, ids):
        """
        Convert token ids back to the original text
        """
        text = " ".join([self.int_to_str[id] for id in ids])

        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r"\1", text)
        return text

Let's try to tokenize text with the modified tokenizer `SimpleTokenizerV2`:

In [201]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join([text1, text2])

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [202]:
print(tokenizer.encode(text))

[1131, 402, 975, 1052, 1059, 3, 41, 1130, 987, 709, 727, 123, 428, 709, 1131, 274]


In [203]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

## 2.5 Byte pair encoding

<b>What's Byte Pair Encoding (BPE), and where is it used?</b>
- The BPE tokenizer is a subword tokenizer, which means it can split words into smaller parts.
- The BPE tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
  - The GPT-2 BPE tokenizer has a total vocabulary size of 50,257 tokens, with `<|endoftext|>` being assigned the largest token ID.
- <span style="color:green"><b>The BPE tokenizer can handle any unknown words.</b></span>
  - <span style="color:#4ea9fb"><b>The ability to break down unknown words into subword tokens ensures that the tokenizer, and consequently the LLM, can process any text data, even if it contains words that were not present in the training data.</b></span>
  
<b>How does BPE work and handle unknown words?</b>
- If the tokenizer encounters an unknown word, it can represent it as a sequence of subword tokens or characters.
- For more details, refer to https://prasanth.io/Knowledge/Tech/BPE
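To illustrate the idea, here is a minimal character-level sketch of BPE: it repeatedly merges the most frequent adjacent symbol pair in a toy corpus, then applies the learned merges to an unseen word. This is a simplified teaching version, not the implementation used by `tiktoken` (the GPT-2 tokenizer works on bytes and pre-splits text with a regular expression). The helper names and the toy corpus are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    words = [list(word) for word in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(symbols, pair) for symbols in words]
    return merges


def tokenize(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


toy_corpus = ["low", "lower", "lowest", "newer", "newest"]  # made-up training words
merges = learn_merges(toy_corpus, num_merges=4)

# "slowest" never appears in the toy corpus, yet it can still be tokenized
tokens = tokenize("slowest", merges)
print(merges)
print(tokens)
```

Because the tokenizer can always fall back to single characters, an unknown word is never rejected; it is just split into more, smaller pieces.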

In [204]:
from importlib.metadata import version
import tiktoken

print(f"tiktoken version: {version('tiktoken')}")

encoding_list = tiktoken.list_encoding_names()
print(f"Available encoding names: {encoding_list}")

tiktoken version: 0.8.0
Available encoding names: ['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base', 'o200k_base']


For details on the different `tiktoken` encodings, refer to https://www.datacamp.com/tutorial/tiktoken-library-python.

In [205]:
for tokenizer_temp in encoding_list:
    tokenizer_temp = tiktoken.get_encoding(tokenizer_temp)
    print(f"Vocabulary size for '{tokenizer_temp}': {tokenizer_temp.n_vocab}")
del tokenizer_temp

Vocabulary size for '<Encoding 'gpt2'>': 50257
Vocabulary size for '<Encoding 'r50k_base'>': 50257
Vocabulary size for '<Encoding 'p50k_base'>': 50281
Vocabulary size for '<Encoding 'p50k_edit'>': 50284
Vocabulary size for '<Encoding 'cl100k_base'>': 100277
Vocabulary size for '<Encoding 'o200k_base'>': 200019


We can find the encoding for the model by running the following code:

In [206]:
models = [
    "gpt-2",
    "gpt-3.5",
    "gpt-4",
    "gpt-4o",
    "gpt-4o-mini",
    "text-embedding-3-small",
]
for model in models:
    print(f"{model:<30s}: {tiktoken.encoding_for_model(model)}")

gpt-2                         : <Encoding 'gpt2'>
gpt-3.5                       : <Encoding 'cl100k_base'>
gpt-4                         : <Encoding 'cl100k_base'>
gpt-4o                        : <Encoding 'o200k_base'>
gpt-4o-mini                   : <Encoding 'o200k_base'>
text-embedding-3-small        : <Encoding 'cl100k_base'>


In [207]:
tokenizer = tiktoken.get_encoding("gpt2")

In [208]:
print(dir(tokenizer))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_core_bpe', '_encode_bytes', '_encode_only_native_bpe', '_encode_single_piece', '_mergeable_ranks', '_pat_str', '_special_tokens', 'decode', 'decode_batch', 'decode_bytes', 'decode_bytes_batch', 'decode_single_token_bytes', 'decode_tokens_bytes', 'decode_with_offsets', 'encode', 'encode_batch', 'encode_ordinary', 'encode_ordinary_batch', 'encode_single_token', 'encode_with_unstable', 'eot_token', 'max_token_value', 'n_vocab', 'name', 'special_tokens_set', 'token_byte_values']


In [209]:
print(f"No. of items in BPE tokenizer vocabulary: {tokenizer.n_vocab}")
print(f"Special tokens: {tokenizer.special_tokens_set} | {tokenizer._special_tokens}")
print(
    f"Largest token ID in the GPT-2 tokenizer: {tokenizer.decode([tokenizer.n_vocab-1])}"
)

No. of items in BPE tokenizer vocabulary: 50257
Special tokens: {'<|endoftext|>'} | {'<|endoftext|>': 50256}
Largest token ID in the GPT-2 tokenizer: <|endoftext|>


In [210]:
text = "Hello, do you like tea? In the sunlit terraces" "of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(f"No. of tokens: {len(integers)} | No. of characters: {len(text)}")
print(integers)

No. of tokens: 18 | No. of characters: 66
[15496, 11, 466, 345, 588, 8887, 30, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


<img src="../images/tiktoken-verified-in-openai-tokenizer-sample.png" width="800px">

Source: [https://platform.openai.com/tokenizer](https://platform.openai.com/tokenizer)

In [211]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
    "of someunknownPlace."
)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(f"No. of tokens: {len(integers)} | No. of characters: {len(text)}")
print(integers)

No. of tokens: 20 | No. of characters: 80
[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [212]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


<p style="color:black; background-color:#F5C780; padding:15px"><b>Exercise 2.1 Byte pair encoding of unknown words<br></b>
Try the BPE tokenizer from the tiktoken library on the unknown words “Akwirw ier” and
print the individual token IDs. Then, call the decode function on each of the resulting
integers in this list to reproduce the mapping shown in figure 2.11. Lastly, call the
decode method on the token IDs to check whether it can reconstruct the original
input, “Akwirw ier.”
</p>

<img src="../images/bpe-tokenizers-example.png" width="600px">

In [213]:
print(tokenizer.encode("Akwirw ier"))
print(tokenizer.decode(tokenizer.encode("Akwirw ier")))

[33901, 86, 343, 86, 220, 959]
Akwirw ier


## 2.6 Data sampling with a sliding window

<span style="color:#4ea9fb">After converting input text → tokens → token IDs, we need to generate the input-target (or input-output) pairs for training the LLM using <b>efficient data loaders</b><i> that iterate over the input dataset and return input-output pairs as PyTorch tensors (multidimensional arrays)</i></span>.

<img src="../images/input-target-pairs-for-llm-training.webp" width="600px">

Let's implement a data loader that fetches the input-target pairs (similar to the above image) from the training dataset using the sliding window approach.

In [214]:
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    raw_text = file.read()
print(f"Total characters: {len(raw_text)}")

enc_text = tokenizer.encode(raw_text)
print(f"Total tokens (in the training set): {len(enc_text)}")

Total characters: 20479
Total tokens (in the training set): 5145


Let's remove the first 50 tokens from the dataset for demonstration purposes, as this results in a slightly more interesting text passage.

In [215]:
enc_sample = enc_text[50:]
print(f"Sample of encoded text that's decoded:\n {tokenizer.decode(enc_sample[:50])}")

Sample of encoded text that's decoded:
  and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Th


<span style="color:#4ea9fb"><b><code>context_size</code> determines how many tokens are included in the input text</b></span>.

In [216]:
# Create an input-output pair for the next-word prediction task.
context_size = 4  # Determines how many tokens are included in the input text
x = enc_sample[:context_size]
y = enc_sample[1 : context_size + 1]
print("Creating an input-output pair for next word prediction task:")
print(f"x: {x}")
print(f"y: \t{y}")
print("")

print("Creating multiple input ----> output pairs for next word prediction task:")
for i in range(1, context_size + 1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f"{context} ----> {desired}")
print("Corresponding text:")
for i in range(1, context_size + 1):
    context = tokenizer.decode(enc_sample[:i])
    desired = tokenizer.decode(enc_sample[i : i + 1])
    print(f"{context} ----> {desired}")

Creating an input-output pair for next word prediction task:
x: [290, 4920, 2241, 287]
y: 	[4920, 2241, 287, 257]

Creating multiple input ----> output pairs for next word prediction task:
[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257
Corresponding text:
 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


<img src="../images/data-lodaer-sample-text.webp" width="800px">

- Let's create a dataset and a data loader that extract chunks of text from the input text dataset using a sliding window approach.

<p style="color:white; background-color:#4ea9fb; padding:15px"><b>Listing 2.5 A dataset for batched inputs and targets</b></p>

In [218]:
import torch
from torch.utils.data import Dataset, DataLoader


class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        """
        Use sliding window to chunk the tokenized input dataset into overlapping input-output sequences of max_length (a.k.a context_size)

        Args:
        - txt: The input text a.k.a the training dataset
        - tokenizer: The tokenizer object
        - max_length: The number of tokens in the input text (a.k.a. context_size)
        - stride: The number of tokens to move the window by (a.k.a. step_size). In other words, the stride determines how much the window moves to the right after each input-output pair is created.
        """
        self.input_ids = []
        self.target_ids = []

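        # Tokenizes the entire text; allowing <|endoftext|> keeps tiktoken from
        # raising an error if the text contains document-boundary markers
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})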

        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            output_chunk = token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(output_chunk))

    def __len__(self):
        """Returns the total number of input-output pairs (rows) in the dataset"""
        return len(self.input_ids)

    def __getitem__(self, idx):
        """Returns a single row from dataset"""
        return self.input_ids[idx], self.target_ids[idx]

<p style="color:white; background-color:#4ea9fb; padding:15px"><b>Listing 2.6 A dataloader to generate batches with input-target pairs</b></p>

In [229]:
def create_dataloader_v1(
    txt,
    batch_size=4,
    max_length=128,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0,
):
    """
    Create a DataLoader object for the GPTDatasetV1 class

    Args:
    - txt: The input text a.k.a the training dataset
    - batch_size: The number of input-output pairs to include in each batch
    - max_length: The number of tokens in the input text (a.k.a. context_size or input_size)
    - stride: The number of tokens to move the window by (a.k.a. step_size). In other words, the stride determines how much the window moves to the right after each input-output pair is created.
    - shuffle: Whether to shuffle the data or not
    - drop_last: Whether to drop the last incomplete batch or not
    - num_workers: The number of CPU processes to use for pre-processing the data.
    """
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(
        txt,
        tokenizer,
        max_length,
        stride,
    )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,  # drop the last batch if it's smaller than the specified batch_size to prevent loss spikes during training.
        num_workers=num_workers,  # The number of CPU processes to use for pre-processing the data.
    )
    return dataloader

Let's test the `dataloader` with `batch_size=1`, `max_length=4` (a.k.a. context_size), and `stride=1` to develop an intuition for how the `GPTDatasetV1` class and the `create_dataloader_v1` function work together.

Note: 
- `max_length=4` is quite small, and only chosen for demonstration purposes. <span style="color:#4ea9fb">It's common to train LLMs with input sizes of at least 256.</span>
- `batch_size=1` is also chosen for demonstration purposes. <span style="color:red">Small batch sizes require less memory during training, but lead to noisier model updates.</span><span style="color:#4ea9fb"> In practice, we use larger batch sizes to speed up training, and <b>the batch size is a trade-off and a hyperparameter to experiment with when training LLMs</b>, just like in regular deep learning.</span>
- `stride=1` is also chosen for demonstration purposes. With `stride=1`, the window shifts by one token per sample, so consecutive samples overlap in all but one token.

In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as file:
    text = file.read()

dataloader = create_dataloader_v1(
    text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)

first_batch = next(data_iter)
print(f"First batch:\n{first_batch}")  # (input_ids, target_ids)
second_batch = next(data_iter)
print(f"Second batch:\n{second_batch}")  # (input_ids, target_ids)

First batch:
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second batch:
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


<p style="color:black; background-color:#F5C780; padding:15px"><b>Exercise 2.2 Data loaders with different strides and context sizes<br></b>
To develop more intuition for how the data loader works, try to run it with different
settings such as <code>max_length=2</code> and <code>stride=2</code>, and <code>max_length=8</code> and <code>stride=2</code>.
</p>

In [None]:
dataloader = create_dataloader_v1(
    text, batch_size=2, max_length=2, stride=2, shuffle=False
)

data_iter = iter(dataloader)

first_batch = next(data_iter)
print(f"First batch:\n{first_batch}")  # (input_ids, target_ids)
second_batch = next(data_iter)
print(f"Second batch:\n{second_batch}")  # (input_ids, target_ids)

First batch:
[tensor([[  40,  367],
        [2885, 1464]]), tensor([[ 367, 2885],
        [1464, 1807]])]
Second batch:
[tensor([[1807, 3619],
        [ 402,  271]]), tensor([[ 3619,   402],
        [  271, 10899]])]


In [252]:
dataloader = create_dataloader_v1(
    text, batch_size=2, max_length=8, stride=2, shuffle=False
)

data_iter = iter(dataloader)

first_batch = next(data_iter)
print(f"First batch:\n{first_batch}")  # (input_ids, target_ids)
second_batch = next(data_iter)
print(f"Second batch:\n{second_batch}")  # (input_ids, target_ids)

First batch:
[tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138]]), tensor([[  367,  2885,  1464,  1807,  3619,   402,   271, 10899],
        [ 1464,  1807,  3619,   402,   271, 10899,  2138,   257]])]
Second batch:
[tensor([[ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [  402,   271, 10899,  2138,   257,  7026, 15632,   438]]), tensor([[ 3619,   402,   271, 10899,  2138,   257,  7026, 15632],
        [  271, 10899,  2138,   257,  7026, 15632,   438,  2016]])]


<img src="../images/sliding-window-sample-with-stride-of-1-and-4.webp" width="800px">

In [260]:
# Another example
dataloader = create_dataloader_v1(
    raw_text, batch_size=8, max_length=4, stride=4, shuffle=False
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f"Input tensor:\n {inputs}")
print(f"Target tensor:\n {targets}")

Input tensor:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Target tensor:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


<i>Note:</i> 
- By setting `stride=4` equal to `max_length=4`, <span style="color:green">we use the dataset fully (no token is skipped) while avoiding any overlap between the input chunks</span>, since <span style="color:red">more overlap could lead to increased overfitting</span>.
- <span style="color:red">The input and target tensors we see above contain token IDs; they are not the final input embeddings that are fed into the LLM</span>. The input embeddings are created by converting the token IDs into token embeddings and then adding positional embeddings (optional but recommended).
  - i.e., $\text{Token IDs} \neq \text{Token embeddings}$
  - i.e., $\text{Token embeddings} + \text{Positional embeddings} = \text{Input embeddings}$
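
The effect of `stride` on overlap can be sketched in plain Python. The helper `sliding_windows` below is a hypothetical, simplified stand-in for what `create_dataloader_v1` does internally (it is not the actual implementation), applied to the first nine token IDs shown in the outputs above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

# stride == max_length: consecutive input chunks share no tokens
no_overlap = sliding_windows(token_ids, max_length=4, stride=4)

# stride < max_length: consecutive input chunks share max_length - stride tokens
overlap = sliding_windows(token_ids, max_length=4, stride=2)

for inputs, targets in no_overlap:
    print(inputs, "->", targets)
```

With `stride=4` the two input chunks are disjoint, whereas with `stride=2` each chunk repeats the last two tokens of the previous one.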
  

## 2.7 Creating token embeddings

<img src="../images/creating-token-embeddings-flow-chart.webp" width="700px">

The last step in preparing input text for an LLM is to <span style="color:#4ea9fb"><i>convert token IDs into embedding vectors</i></span>. 
- As a preliminary step, we need to <b>initialize these embeddings with random values</b>. This initialization serves as the starting point from which the model learns (i.e., optimizes) the embeddings during training.
- <span style="color:#4ea9fb"><b><i>The embedding layer is essentially a lookup operation</i> that maps token IDs to embedding vectors. In other words, it retrieves rows from the embedding layer's weight matrix based on the token IDs.</b></span>
  - <span style="color:#4ea9fb">Each row in the output matrix is obtained via the lookup operation from the embedding layer's weight matrix.</span>

- Let's see how token ID to embedding vector conversion works with a simple example (`input_ids = [2,3,5,1]` after tokenization).

| ... | Reality | Demo |
| - | - | - |
| Vocabulary | 50,257 tokens in BPE tokenizer | 6 tokens |
| Embedding size | 12,288 in GPT-3 | 3 |

In [278]:
input_ids = torch.tensor([2, 3, 5, 1])  # => context_size or max_length = 4
print(f"1. Input tensor: {input_ids}")

vocab_size = 6
output_dim = 3

# Instantiate an embedding layer
torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)  # 6 x 3
print(f"\n2. Embedding layer's weight matrix")
print(embedding_layer.weight)

# Apply the layer to a token ID to obtain that token's embedding vector
print(
    f"\n3a. Embedding vector for a single token ID = 3: \n{embedding_layer(torch.tensor([3]))}"
)

print(
    f"\n3b. Embedding vector for all 4 input token IDs = {input_ids}: \n{embedding_layer(input_ids)}"
)

1. Input tensor: tensor([2, 3, 5, 1])

2. Embedding layer's weight matrix
Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

3a. Embedding vector for a single token ID = 3: 
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)

3b. Embedding vector for all 4 input token IDs = tensor([2, 3, 5, 1]): 
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


<img src="../images/create-token-embeddings-my-own-viz.webp" width="700px">
<img src="../images/emedding-layer-lookup-operation-simple-example.webp" width="600px">

<span style="color:#4ea9fb">The above embedding layer approach is essentially <b>a more efficient way</b> of implementing one hot encoding followed by matrix multiplication in a fully connected layer.</span>
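
The equivalence with one-hot encoding can be verified directly. The following is a minimal sketch that reuses the 6 x 3 demo setup and seed from above; the names `one_hot`, `lookup`, and `via_matmul` are introduced here for illustration only.

```python
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(6, 3)  # vocab_size=6, output_dim=3
input_ids = torch.tensor([2, 3, 5, 1])

# Lookup: select rows 2, 3, 5, 1 of the weight matrix
lookup = embedding_layer(input_ids)

# Equivalent but wasteful: one-hot encode (4 x 6), then multiply by the weights (6 x 3)
one_hot = torch.nn.functional.one_hot(input_ids, num_classes=6).float()
via_matmul = one_hot @ embedding_layer.weight

print(torch.allclose(lookup, via_matmul))  # True
```

The matrix multiplication spends most of its work multiplying by zeros, which is why the lookup is preferred, especially with a vocabulary of tens of thousands of tokens.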

Now that *we have created embedding vectors from token IDs*, let's add a small modification to the embedding vectors to <span style="color:#4ea9fb">encode positional information about a token within a text</span>.

## 2.8 Encoding word positions

- <span style="color:red">A minor shortcoming of LLMs is that their <b>self-attention mechanism (covered in chapter 3) has no notion of position or order for the tokens within a sequence</b>.</span>
- <span style="color:red">The embedding layer introduced in the previous section works such that <b>the same token ID always produces the same embedding vector, regardless of its position in the input text</b></span>, as shown in the figure below.

<img src="../images/embedding-layer-without-positional-awareness.webp" width="800px">

In principle, the deterministic, position-independent embedding of the token ID is good for reproducibility. However, <span style="color:red">since the self-attention mechanism of LLMs itself is also position-agnostic</span>, <span style="color:green">it's beneficial to inject additional positional information into the embeddings.</span>
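
The position-independence described above can be checked on a toy example. The sketch below uses an assumed small setup (6 tokens, 3 dimensions, 4 positions) and, as one illustrative way of injecting position, a second embedding layer indexed by position; all names and sizes are chosen for demonstration.

```python
import torch

torch.manual_seed(123)
vocab_size, output_dim, context_length = 6, 3, 4
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

input_ids = torch.tensor([2, 5, 2, 1])  # token ID 2 appears at positions 0 and 2

token_embeddings = token_embedding_layer(input_ids)
# Same token ID -> identical rows, so position is invisible at this stage
same_before = torch.equal(token_embeddings[0], token_embeddings[2])

pos_embeddings = pos_embedding_layer(torch.arange(context_length))
input_embeddings = token_embeddings + pos_embeddings
# After adding one vector per position, the two occurrences differ
same_after = torch.equal(input_embeddings[0], input_embeddings[2])

print(same_before, same_after)
```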

<b>What are the two broad categories of position-aware embeddings?</b>
1. <b>Relative positional embeddings (RPE):</b>
   - Instead of focusing on the absolute position of a token in the text, RPEs consider the relative position (or distance) of a token with respect to other tokens in the text. <span style="color:#4ea9fb">In RPE, the model learns the relationships in terms of <b>"how far apart"</b>, whereas in APE, the model learns the relationships in terms of <b>"at which exact position"</b>.</span>
   - The advantage is that the <span style="color:green">model can generalize better to input sequences of varying lengths, even if it has not seen such lengths during training</span>.
2. <b>Absolute positional embeddings (APE):</b>
   - APE are directly associated with specific (absolute) positions in the input text. 
   - For each position in the input text, a unique embedding is added to the token's embedding to convey its exact position in the text.
   - <span style="color:green">OpenAI's GPT models use APE that are optimized during the training process</span>, <span style="color:red">rather than being fixed or predefined like the positional encodings in the original transformer model</span>.
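
To make the "at which exact position" versus "how far apart" distinction concrete, the sketch below builds only the position indices each approach would work with, for an assumed context of 4 tokens. How a particular RPE scheme turns these offsets into embeddings or attention biases varies and is not shown here.

```python
context_length = 4

# Absolute: every slot in the sequence gets its own index
absolute_ids = list(range(context_length))

# Relative: entry [i][j] is the signed offset j - i between token i and token j
relative_offsets = [
    [j - i for j in range(context_length)] for i in range(context_length)
]

print(absolute_ids)
for row in relative_offsets:
    print(row)
```

An offset such as `-1` (the previous token) means the same thing wherever it occurs in the sequence, which is the intuition behind the generalization advantage mentioned above.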

<b>What are the benefits of considering position-aware embeddings?</b>
- Both types of positional embeddings aim to <span style="color:green">augment the capacity of LLMs to understand the order and relationships between tokens, ensuring more accurate context-aware predictions.</span> 
- <span style="color:#4ea9fb">The choice between RPE and APE depends on the specific task and the nature of the text data</span>.

<img src="../images/positional-embeddings-added-to-token-embeddings-example.webp" width="700px">

- Let's consider a more realistic embedding size and encode the input tokens into 256-dimensional embeddings (vector representations), with an additional 256-dimensional positional embedding for each position.

| ... | Reality | Previous example | New example |
| - | - | - | - |
| Vocabulary | 50,257 tokens in BPE tokenizer | 6 tokens | 50,257 tokens (assume token IDs are created by the BPE tokenizer) |
| Embedding size | 12,288 in GPT-3 | 3 | 256 (smaller than original GPT-3, but reasonable for demonstration purposes) |

In [None]:
### === DataLoader === ###
max_length = 4  # context_size
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=max_length,
    stride=max_length,  # No overlap between input-output pairs
    shuffle=False,
)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f"Input tensor (Token IDs):\n {inputs}")  # 8 x 4 (batch_size x max_length)
print(f"\n→ Inputs shape: \t\t\t{inputs.shape}")


### === Token Embedding Layer === ###
vocab_size = 50257
output_dim = 256  # embedding_size
token_embedding_layer = torch.nn.Embedding(
    vocab_size, output_dim
)  # 50257 tokens, 256 dimensions
print(f"\n→ Token embedding layer shape:\t\t{token_embedding_layer}")

### === Token Embeddings === ###
# ❗ Use the token embedding layer to obtain the token embeddings (i.e., embed the input token IDs into 256-dimensional vectors)
token_embeddings = token_embedding_layer(inputs)
print(
    f"\n→ Token embeddings shape: \t\t{token_embeddings.shape}"
)  # 8 x 4 x 256 (batch_size x max_length x embedding_size)


### === Positional Embedding Layer === ###
# ❗ For GPT model's APE approach, we need to create another embedding layer that has same embedding dimension as the `token_embedding_layer`
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)  # 4 x 256
print(f"\n→ Positional encoding layer shape:\t{pos_embedding_layer}")

### === Positional Embeddings === ###
pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # 4 x 256
print(f"\n→ Positional embeddings shape: \t\t{pos_embeddings.shape}")


### === Add Positional Embeddings to Token Embeddings === ###
input_embeddings = (
    token_embeddings + pos_embeddings
)  # 8 x 4 x 256 + 4 x 256 (broadcasting) => 8 x 4 x 256
print(f"\n→ Input embeddings shape: \t\t{input_embeddings.shape}")

Input tensor (Token IDs):
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

→ Inputs shape: 			torch.Size([8, 4])

→ Token embedding layer shape:		Embedding(50257, 256)

→ Token embeddings shape: 		torch.Size([8, 4, 256])

→ Positional encoding layer shape:	Embedding(4, 256)

→ Positional embeddings shape: 		torch.Size([4, 256])

→ Input embeddings shape: 		torch.Size([8, 4, 256])


- Let's do a quick sanity check of **token embeddings** on the first sequence of token IDs in the batch. 
  - From the results below, it's evident once again that the embedding layer is essentially a lookup operation that maps token IDs to embedding vectors.
  - Each token ID is now embedded into a 256-dimensional vector.

In [305]:
print(f"Token embeddings for the first item in the batch (manual):")
print(token_embedding_layer(torch.tensor([40, 367, 2885, 1464])))

print(f"\nToken embeddings for the first item in the batch (automated):")
print(token_embeddings[0])

Token embeddings for the first item in the batch (manual):
tensor([[ 0.4247,  0.6801, -0.8078,  ...,  0.1753, -0.6280, -1.5333],
        [ 0.5758,  1.0052, -0.5642,  ..., -0.1103,  2.8701,  1.1953],
        [ 0.4584,  1.3983, -1.3585,  ...,  0.2204,  0.4413, -0.7308],
        [-0.2103,  0.0598,  0.2861,  ...,  0.6133, -0.1733, -0.5151]],
       grad_fn=<EmbeddingBackward0>)

Token embeddings for the first item in the batch (automated):
tensor([[ 0.4247,  0.6801, -0.8078,  ...,  0.1753, -0.6280, -1.5333],
        [ 0.5758,  1.0052, -0.5642,  ..., -0.1103,  2.8701,  1.1953],
        [ 0.4584,  1.3983, -1.3585,  ...,  0.2204,  0.4413, -0.7308],
        [-0.2103,  0.0598,  0.2861,  ...,  0.6133, -0.1733, -0.5151]],
       grad_fn=<SelectBackward0>)


- Let's do a quick sanity check of **positional embeddings**. 
  - From the results below, it's evident that the positional embeddings are unique for each position in the input text.
  - Each position is now embedded into a 256-dimensional vector.
  - <span style="color:#4ea9fb">The positional embedding tensor consists of 4 position embeddings, each of size 256 (<code>output_dim</code>), one for each of the 4 token positions (<code>context_length</code>) in the input. This 4 x 256 tensor is broadcast across each of the 8 (<code>batch_size</code>) tokenized inputs in the batch.</span>
- <span style="color:#4ea9fb">As we know, <code>context_length</code> represents the supported input size of the LLM. Here, we set it equal to the <code>max_length</code> of the input text. In practice, the input text can be longer than <code>context_length</code>, in which case we have to truncate it.</span>

In [329]:
print(f"\nPositional embeddings (manual):")
print(
    pos_embedding_layer(torch.arange(context_length))
)  # [0, 1, 2, 3] as context_length = 4

print(f"\nPositional embeddings (automated):")
print(pos_embeddings)


Positional embeddings (manual):
tensor([[-1.1197, -0.8287,  0.6786,  ..., -0.1717,  0.9193, -1.7804],
        [ 0.0681, -0.8354, -0.6381,  ...,  0.9914,  1.2633,  0.0893],
        [ 0.2216, -0.8501, -0.4013,  ...,  0.4249,  0.3531,  0.4431],
        [-0.7114,  0.1228,  1.4046,  ...,  0.2221,  0.5282,  1.0612]],
       grad_fn=<EmbeddingBackward0>)

Positional embeddings (automated):
tensor([[-1.1197, -0.8287,  0.6786,  ..., -0.1717,  0.9193, -1.7804],
        [ 0.0681, -0.8354, -0.6381,  ...,  0.9914,  1.2633,  0.0893],
        [ 0.2216, -0.8501, -0.4013,  ...,  0.4249,  0.3531,  0.4431],
        [-0.7114,  0.1228,  1.4046,  ...,  0.2221,  0.5282,  1.0612]],
       grad_fn=<EmbeddingBackward0>)
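
The broadcasting step can be checked in isolation. The sketch below uses random stand-in tensors with the same shapes as `token_embeddings` and `pos_embeddings` (the values are not the ones printed above) and compares the broadcast sum with an explicit per-batch-item copy.

```python
import torch

torch.manual_seed(123)
batch_size, context_length, output_dim = 8, 4, 256

# Stand-in values with the same shapes as in the cells above
token_embeddings = torch.randn(batch_size, context_length, output_dim)
pos_embeddings = torch.randn(context_length, output_dim)

# Broadcasting: (8, 4, 256) + (4, 256) -> (8, 4, 256)
broadcast_sum = token_embeddings + pos_embeddings

# Explicit version: view the 4 x 256 positional matrix once per batch item
expanded_pos = pos_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
explicit_sum = token_embeddings + expanded_pos

print(broadcast_sum.shape)
print(torch.equal(broadcast_sum, explicit_sum))
```

Every sequence in the batch therefore receives the same four positional vectors, one per position.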


<span style="color:#4ea9fb">The <code>input_embeddings</code> tensor we created is used as the input for the main LLM layers</span>, which we will begin implementing in the next chapter.

<img src="../images/block-diagram-input-text-to-input-embeddings.webp" width="700px">

## Summary

- <b>LLMs require textual data to be converted into numerical vectors</b>, since <span style="color:red">they cannot process raw text data directly</span>. <span style="color:#4ea9fb"><b>Embeddings transform discrete data (like words or images) into continuous vectors</b></span> that can be processed by neural networks.
- <b>Raw input text > Tokens > Token IDs</b>
  - The first step in processing raw text data is <span style="color:#4ea9fb"><b>tokenization</b></span>, which involves breaking text into smaller units, such as words or subwords. These tokens are then converted into integer representations (<span style="color:#4ea9fb"><b>token IDs</b></span>) using a vocabulary.
  - <span style="color:#4ea9fb"><b>Special tokens</b></span> like `[UNK]`, `<|endoftext|>`, `[EOS]`, `[BOS]`, and `[PAD]` are added to the vocabulary to enhance the model's understanding and to handle various contexts, such as unknown words or the boundaries between unrelated text segments.
  - The BPE tokenizer used for LLMs like GPT-2 and GPT-3 is a subword tokenizer that can efficiently handle unknown words by splitting them into subword units or individual characters.
- <b>Token IDs > Input-Output pairs</b>
  - A <span style="color:#4ea9fb"><b>sliding window</b></span> approach is used to generate input-target pairs for training LLMs. This approach involves extracting chunks of text from the input text dataset using a sliding window.
- <b>Embeddings</b>
  - Embedding layers in PyTorch function as a lookup operation, retrieving embedding vectors based on token IDs. <span style="color:#4ea9fb"><b>The resulting embedding vectors provide continuous representations of the input tokens, which is crucial for training deep learning models like LLMs.</b></span>
  - While token embeddings provide consistent vector representations for each token, <span style="color:red">they lack a sense of the token's position in the input text</span>. <span style="color:green">To remedy this, two main types of positional embeddings exist</span>:
    - Absolute positional embeddings (APE) directly encode the position of a token in the input text.
    - Relative positional embeddings (RPE) consider the relative position of a token with respect to other tokens in the text.
  - OpenAI's GPT models use APE, which are added to the token embedding vectors and are optimized during the model training.