**LLM Workshop 2024 by Sebastian Raschka**

This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)

<br>
<br>
<br>
<br>

# 2) Understanding LLM Input Data

Packages that are being used in this notebook:

In [1]:
from importlib.metadata import version


print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.4.1
tiktoken version: 0.8.0


- This notebook provides a brief overview of the data preparation and sampling procedures to get input data "ready" for an LLM
- Understanding what the input data looks like is a great first step towards understanding how LLMs work

<img src="./figures/01.png" width="1000px">

<br>
<br>
<br>
<br>

# 2.1 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters

<img src="figures/02.png" width="800px">

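- Load the raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story; the cell below instead loads `how-to-do-great-work.txt` and keeps the lines for `the-verdict.txt` commented out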

In [35]:
# with open("the-verdict.txt", "r", encoding="utf-8") as f:
#     raw_text = f.read()

with open("how-to-do-great-work.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

Total number of characters: 59342
If you collected lists of techniques for doing great work in a lot of different fields, what would 


- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above

<img src="figures/03.png" width="690px">

- The following regular expression splits on whitespace and punctuation; because the pattern is wrapped in a capture group, the delimiters themselves are kept as tokens

In [36]:
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item for item in preprocessed if item]
print(preprocessed[:38])

['If', ' ', 'you', ' ', 'collected', ' ', 'lists', ' ', 'of', ' ', 'techniques', ' ', 'for', ' ', 'doing', ' ', 'great', ' ', 'work', ' ', 'in', ' ', 'a', ' ', 'lot', ' ', 'of', ' ', 'different', ' ', 'fields', ',', ' ', 'what', ' ', 'would', ' ', 'the']
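- To see the effect of the split on a small scale, here is a minimal sketch on a made-up sentence (the sentence and the variable names are illustrative assumptions, not part of the notebook); it also shows the variant that drops whitespace-only tokens, which is what the tokenizer class in section 2.2 does

```python
import re

# Hypothetical sample sentence, chosen only for illustration
sample = "Hello, world. Is this-- a test?"

# Same pattern as in the cell above; the capture group keeps the delimiters
pattern = r'([,.:;?_!"()\']|--|\s)'

# Variant 1: keep whitespace tokens, drop only empty strings
with_spaces = [item for item in re.split(pattern, sample) if item]

# Variant 2: additionally drop whitespace-only tokens
without_spaces = [
    item.strip() for item in re.split(pattern, sample) if item.strip()
]

print(with_spaces)
print(without_spaces)
```

- Keeping whitespace makes the text exactly reconstructible but inflates the token count; dropping it gives shorter sequences at the cost of losing the original spacing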


In [37]:
print("Number of tokens:", len(preprocessed))

Number of tokens: 23383


In [38]:
unique = set(preprocessed)
sorted(unique)

['\n',
 ' ',
 '!',
 '"',
 "'",
 '(',
 ')',
 ',',
 '.',
 '14',
 '15',
 '18',
 '21',
 '50',
 '7',
 '99%',
 ':',
 ';',
 '?',
 'A',
 'Affectation',
 'Am',
 'Ambition',
 'Ambitious',
 'An',
 'And',
 'Another',
 'As',
 'Aspies',
 'At',
 'Avoid',
 'Be',
 'Begin',
 'Being',
 'Believe',
 'Big',
 'Boldly',
 'Broken',
 'But',
 'By',
 'Can',
 'Changing',
 'Colleagues',
 'Competition',
 'Consciously',
 'Copernicus',
 'Corollary',
 'Curiosity',
 'Darwin',
 'Develop',
 'Do',
 'Doing',
 'Don',
 'Durer',
 'Einstein',
 'Either',
 'Elegance',
 'English',
 'Even',
 'Every',
 'Everyone',
 'Exception',
 'Fame',
 'Few',
 'Fields',
 'Finishing',
 'Five',
 'Following',
 'For',
 'Fortunately',
 'Four',
 'From',
 'God',
 'Good',
 'Great',
 'Growing',
 'Have',
 'Having',
 'He',
 'Here',
 'History',
 'How',
 'Husband',
 'I',
 'Ideally',
 'If',
 'In',
 'Indeed',
 'Inexperience',
 'Informality',
 'Instead',
 'Interest',
 'Is',
 'It',
 'Just',
 'Knowledge',
 'Laborious',
 'Learning',
 'Lego',
 'Let',
 'Lots',
 'Luck'

<br>
<br>
<br>
<br>

# 2.2 Converting tokens into token IDs

- Next, we convert the text tokens into token IDs that we can process via embedding layers later
- For this we first need to build a vocabulary

<img src="figures/04.png" width="900px">

- The vocabulary contains the unique tokens in the input text (words, punctuation, and whitespace characters)

In [39]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

1897


In [40]:
vocab = {token:integer for integer,token in enumerate(all_words)}

In [45]:
vocab['qualities']

1399
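- The mapping can be checked by hand on a tiny example; the sketch below uses a made-up token list (an assumption for illustration) and also builds the inverse mapping that is needed to turn IDs back into tokens

```python
# Hypothetical token list, standing in for `preprocessed`
toy_tokens = ["the", "quick", "fox", ",", "the", "lazy", "dog", "."]

# Sorting the unique tokens makes the ID assignment deterministic
toy_vocab = {
    token: integer for integer, token in enumerate(sorted(set(toy_tokens)))
}

# Inverse mapping: token ID -> token string
toy_inverse_vocab = {integer: token for token, integer in toy_vocab.items()}

# Tokens -> IDs -> tokens
toy_ids = [toy_vocab[token] for token in toy_tokens]
toy_roundtrip = [toy_inverse_vocab[i] for i in toy_ids]

print(toy_vocab)
print(toy_ids)
print(toy_roundtrip)
```

- Punctuation sorts before letters, which is why punctuation tokens receive the smallest IDs here, just as in the full vocabulary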

- Below are the first 51 entries in this vocabulary (IDs 0 through 50):

In [42]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break

('\n', 0)
(' ', 1)
('!', 2)
('"', 3)
("'", 4)
('(', 5)
(')', 6)
(',', 7)
('.', 8)
('14', 9)
('15', 10)
('18', 11)
('21', 12)
('50', 13)
('7', 14)
('99%', 15)
(':', 16)
(';', 17)
('?', 18)
('A', 19)
('Affectation', 20)
('Am', 21)
('Ambition', 22)
('Ambitious', 23)
('An', 24)
('And', 25)
('Another', 26)
('As', 27)
('Aspies', 28)
('At', 29)
('Avoid', 30)
('Be', 31)
('Begin', 32)
('Being', 33)
('Believe', 34)
('Big', 35)
('Boldly', 36)
('Broken', 37)
('But', 38)
('By', 39)
('Can', 40)
('Changing', 41)
('Colleagues', 42)
('Competition', 43)
('Consciously', 44)
('Copernicus', 45)
('Corollary', 46)
('Curiosity', 47)
('Darwin', 48)
('Develop', 49)
('Do', 50)


- Below, we illustrate the tokenization of a short sample text using a small vocabulary:

<img src="figures/05.png" width="800px">

- Let's now put it all together into a tokenizer class

In [43]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="figures/06.png" width="800px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers can then be embedded (later) as input for the LLM

In [44]:
tokenizer = SimpleTokenizerV1(vocab)

text = """The first step is to decide what to work on. The work you choose needs to have three qualities: it has to be something you have a natural aptitude for, that you have a deep interest in, and that offers scope to do great work.

In practice you don't have to worry much about the third criterion. Ambitious people are if anything already too conservative about it. So all you need to do is find something you have an aptitude for and great interest in. [1]

That sounds straightforward, but it's often quite difficult."""
ids = tokenizer.encode(text)
print(ids)

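KeyError: 'qualities:'

- The `KeyError` occurs because the regular expression in `encode` omits `:` and `;`, whereas the expression used to build the vocabulary in section 2.1 splits on them; `qualities:` therefore stays a single token that is not in the vocabulary
- Adding `:;` to the character class in `encode` (and in `decode`, so that the spaces before them are removed again) makes the two consistent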
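- A minimal sketch of a tokenizer whose vocabulary and `encode` method share one splitting pattern, so that text such as `qualities:` is split the same way in both places; `SPLIT_PATTERN`, `split_text`, `SimpleTokenizerSketch`, and the sample sentences are hypothetical names and data introduced here for illustration

```python
import re

# One pattern shared by vocabulary building and encoding (illustrative name)
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'


def split_text(text):
    # Split on punctuation/whitespace and drop whitespace-only pieces
    return [
        item.strip() for item in re.split(SPLIT_PATTERN, text) if item.strip()
    ]


class SimpleTokenizerSketch:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        return [self.str_to_int[s] for s in split_text(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that " ".join puts in front of punctuation
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)


# Hypothetical training text, standing in for the essay
sketch_sample = "It needs three qualities: aptitude, interest, and scope."
sketch_vocab = {
    tok: i for i, tok in enumerate(sorted(set(split_text(sketch_sample))))
}

sketch_tokenizer = SimpleTokenizerSketch(sketch_vocab)
sketch_ids = sketch_tokenizer.encode("three qualities: scope, and interest.")
sketch_text = sketch_tokenizer.decode(sketch_ids)

print(sketch_ids)
print(sketch_text)
```

- Words that never occurred in the text used to build the vocabulary would still raise a `KeyError` with this sketch; it only removes the mismatch between the two patterns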

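- We can decode the integers back into text (the outputs shown for the next two cells were recorded in an earlier run on a sentence from *The Verdict*)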

In [16]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [17]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

<br>
<br>
<br>
<br>

# 2.3 BytePair encoding

- GPT-2 used BytePair encoding (BPE) as its tokenizer
- It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words
- For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges
- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)
- In this lecture, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance
- (Based on an analysis [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb), I found that `tiktoken` is approx. 3x faster than the original tokenizer and 6x faster than an equivalent tokenizer in Hugging Face)
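- To give a feel for where the "trained BPE merges" come from, here is a simplified, character-level sketch of BPE training on a made-up word-frequency table; it repeatedly merges the most frequent adjacent pair of symbols. This is a toy illustration under stated assumptions (hypothetical corpus, characters instead of bytes, no pre-tokenization), not the GPT-2 or `tiktoken` implementation

```python
from collections import Counter


def get_pair_counts(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: word -> frequency
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}

# Start from individual characters
bpe_words = {tuple(word): freq for word, freq in toy_corpus.items()}

bpe_merges = []
for _ in range(4):
    pair_counts = get_pair_counts(bpe_words)
    # Ties are broken by first occurrence (dict insertion order)
    best_pair = max(pair_counts, key=pair_counts.get)
    bpe_merges.append(best_pair)
    bpe_words = merge_pair(bpe_words, best_pair)

print(bpe_merges)
print(list(bpe_words))
```

- After a few merges, frequent words such as `low` become single symbols while rarer words remain split into smaller pieces, which is the behavior the bullets above describe for unknown words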

In [None]:
# pip install tiktoken

In [46]:
import importlib.metadata  # the `metadata` submodule must be imported explicitly
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.8.0


In [47]:
tokenizer = tiktoken.get_encoding("gpt2")

In [48]:
text = (
    "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
     "of someunknownPlace."
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 1659, 617, 34680, 27271, 13]


In [49]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terracesof someunknownPlace.


- BPE tokenizers break down unknown words into subwords and individual characters:

<img src="figures/07.png" width="700px">

In [50]:
tokenizer.encode("Akwirw ier", allowed_special={"<|endoftext|>"})

[33901, 86, 343, 86, 220, 959]

<br>
<br>
<br>
<br>

# 2.4 Data sampling with a sliding window

- Above, we took care of the tokenization (converting text into tokens represented as token IDs)
- Now, let's talk about how we set up the data loading for LLMs
- We train LLMs to generate one word (more precisely, one token) at a time, so we prepare the training data such that the next token in a sequence is the target to predict

<img src="figures/08.png" width="800px">

- For this, we use a sliding window approach, changing the position by +1:

<img src="figures/09.png" width="900px">
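- The next-token setup can be sketched with plain Python lists; the token strings and `context_size` below are made-up values for illustration (a real pipeline would use token IDs)

```python
# Hypothetical tokens; in practice these would be token IDs
demo_tokens = ["LLMs", "learn", "to", "predict", "one", "word", "at", "a", "time"]
context_size = 4

# The targets are the inputs shifted by one position
x = demo_tokens[:context_size]
y = demo_tokens[1:context_size + 1]
print("x:", x)
print("y:", y)

# Each prefix of the window is a context; the token after it is the target
demo_pairs = []
for i in range(1, context_size + 1):
    context = demo_tokens[:i]
    target = demo_tokens[i]
    demo_pairs.append((context, target))
    print(context, "---->", target)
```

- A single window of length `context_size` therefore yields `context_size` prediction tasks, one per position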

- Note that in practice it's best to set the stride equal to the context length so that the inputs don't overlap (the targets are still always shifted by +1)

<img src="figures/10.png" width="800px">

In [51]:
from supplementary import create_dataloader_v1


dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[ 1532,   345,  7723,  8341],
        [  286,  7605,   329,  1804],
        [ 1049,   670,   287,   257],
        [ 1256,   286,  1180,  7032],
        [   11,   644,   561,   262],
        [16246,   804,   588,    30],
        [  314,  3066,   284,  1064],
        [  503,   416,  1642,   340]])

Targets:
 tensor([[  345,  7723,  8341,   286],
        [ 7605,   329,  1804,  1049],
        [  670,   287,   257,  1256],
        [  286,  1180,  7032,    11],
        [  644,   561,   262, 16246],
        [  804,   588,    30,   314],
        [ 3066,   284,  1064,   503],
        [  416,  1642,   340,    13]])
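- The chunking shown above can be reproduced with a few lines of plain Python; the function name `sliding_window` is hypothetical, and whether `create_dataloader_v1` is implemented exactly this way is an assumption. The token IDs are the first nine IDs from the output above

```python
def sliding_window(token_ids, max_length, stride):
    # Build (input, target) chunks; each target is its input shifted by +1
    inputs, targets = [], []
    # Stop early enough that every target chunk has full length
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])
        targets.append(token_ids[i + 1:i + max_length + 1])
    return inputs, targets


# First nine token IDs from the data loader output above
first_ids = [1532, 345, 7723, 8341, 286, 7605, 329, 1804, 1049]

# stride == max_length: consecutive inputs do not overlap
win_inputs, win_targets = sliding_window(first_ids, max_length=4, stride=4)
print(win_inputs)
print(win_targets)

# stride == 1: consecutive inputs overlap in all but one token
dense_inputs, dense_targets = sliding_window(first_ids, max_length=4, stride=1)
print(len(dense_inputs), "windows with stride=1")
```

- With `stride=4` the two input rows match the first two rows of the tensors above; a smaller stride produces more training examples from the same text at the cost of heavily overlapping inputs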


<br>
<br>
<br>
<br>

# Exercise: Prepare your own favorite text dataset