### section 2.2: tokenizing text

In [1]:
import re
import urllib.request

In [2]:
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/"
       "ch02/01_main-chapter-code/the-verdict.txt")

urllib.request.urlretrieve(url, "the-verdict.txt")

('the-verdict.txt', <http.client.HTTPMessage at 0x105ad5e50>)

In [3]:
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    verdict = f.read()


print(f"Number of characters: {len(verdict)}")
print("First 99 characters:")
verdict[:99]

Number of characters: 20479
First 99 characters:


'I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no '

In [4]:
# Split on whitespace and basic punctuation (the capture group keeps the delimiters)
result = re.split(r'([,.!]|\s)', "Some example sentence. Thanks for joining us!")

[item for item in result if item.strip()]

['Some', 'example', 'sentence', '.', 'Thanks', 'for', 'joining', 'us', '!']

In [5]:
# Get a little more complex

tmp = "Hello, matey! Here is some text? I think? Is this thing -- on?!"

result = re.split(r'([,.:;!_"()?]|--|\s)', tmp)
result = [item for item in result if item.strip()]
result

['Hello',
 ',',
 'matey',
 '!',
 'Here',
 'is',
 'some',
 'text',
 '?',
 'I',
 'think',
 '?',
 'Is',
 'this',
 'thing',
 '--',
 'on',
 '?',
 '!']

In [6]:
# Let's apply this to the Edith Wharton text

preprocessed = re.split(r'([,.:;!_"()?\']|--|\s)', verdict)
preprocessed = [item for item in preprocessed if item.strip()]
len(preprocessed)

4690

In [7]:
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


### section 2.3: Converting tokens into token IDs

We have a bunch of tokens, but we need to convert them into unique IDs. Out of the gate this is easy enough to do: get the tokens, remove duplicates, alphabetize them, and give each one a number.

In [8]:
all_words = sorted(set(preprocessed))
print(f"{len(all_words)} unique words in the text")

1130 unique words in the text


In [9]:
vocab = {word: idx for idx, word in enumerate(all_words)}

In [10]:
for word, idx in vocab.items():
    print(f"{word}: {idx}")
    if idx > 20:
        break

!: 0
": 1
': 2
(: 3
): 4
,: 5
--: 6
.: 7
:: 8
;: 9
?: 10
A: 11
Ah: 12
Among: 13
And: 14
Are: 15
Arrt: 16
As: 17
At: 18
Be: 19
Begin: 20
Burlington: 21


### Put all this logic into a class

The class is defined in `utils.py`.

In [11]:
from utils import SimpleTokenizerV1

In [12]:
tokenize = SimpleTokenizerV1(vocab)

In [13]:
tokenize.encode("look at me")

[642, 180, 663]

Some goofy sentences using random IDs

In [14]:
print(tokenize.decode([459, 123, 888, 1050]))
print(tokenize.decode([1035, 56, 837, 554]))

forehead absurdity sign unusual
true It resented hooded
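`SimpleTokenizerV1` lives in `utils.py` and isn't reproduced in these notes. As a rough sketch of what such a class might look like, assuming it reuses the regex from above and strips the extra space before punctuation when decoding, here is a stand-in. The name `SketchTokenizer` and the toy vocabulary are made up for illustration.

```python
import re


class SketchTokenizer:
    """Hypothetical stand-in for SimpleTokenizerV1 (the real class is in utils.py)."""

    # Same split pattern used on the full text earlier in these notes
    SPLIT_PATTERN = r'([,.:;!_"()?\']|--|\s)'

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {idx: tok for tok, idx in vocab.items()}

    def encode(self, text):
        tokens = [t for t in re.split(self.SPLIT_PATTERN, text) if t.strip()]
        # A token missing from the vocabulary raises KeyError here
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Drop the space that join() inserted before punctuation
        return re.sub(r'\s+([,.:;!?"()\'])', r'\1', text)


# Toy vocabulary built the same way as `vocab` above
toy_vocab = {tok: idx for idx, tok in enumerate(sorted({"Hello", ",", "world", "!"}))}
toy = SketchTokenizer(toy_vocab)

ids = toy.encode("Hello, world!")
roundtrip = toy.decode(ids)
print(ids)        # [2, 1, 3, 0]
print(roundtrip)  # Hello, world!
```

Joining on spaces and then patching up punctuation is why random IDs decode into space-separated words, as in the goofy sentences above.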


### section 2.4: Adding special context tokens

But we run into issues when a never-before-seen word shows up, because the V1 vocabulary has no ID for it. We handle this by adding a few special tokens to the vocabulary. `<|unk|>` is used when we don't know a word (the dict `.get` method lets us fall back to its ID when a lookup misses). In addition, we can tell the model that the end of a document has been reached using another token, `<|endoftext|>`.

There are many other special context tokens we could use; these are just a couple of examples.

In [15]:
all_words.extend(['<|endoftext|>', '<|unk|>'])

vocab = {word: idx for idx, word in enumerate(all_words)}

In [16]:
for item in enumerate(list(vocab.keys())[-5:]):
    print(item)

(0, 'younger')
(1, 'your')
(2, 'yourself')
(3, '<|endoftext|>')
(4, '<|unk|>')


In V2, we add the special tokens.

In [17]:
from utils import SimpleTokenizerV2

In [18]:
tokenize_v2 = SimpleTokenizerV2(vocab)

In [19]:
# 2 unknown words
tokenize_v2.encode("BOOM POW")

[1131, 1131]
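`SimpleTokenizerV2` also lives in `utils.py`, so its internals aren't shown here. The `.get` fallback could be sketched roughly as follows, assuming the same split regex as before. The function name `encode_with_unk` and the toy vocabulary are hypothetical.

```python
import re

# Toy vocabulary with the two special tokens appended at the end,
# mirroring how all_words was extended above
toy_vocab = {"here": 0, "I": 1, "am": 2, "<|endoftext|>": 3, "<|unk|>": 4}


def encode_with_unk(text, vocab):
    """Map tokens to IDs, falling back to the <|unk|> ID for unseen tokens."""
    tokens = [t for t in re.split(r'([,.:;!_"()?\']|--|\s)', text) if t.strip()]
    unk_id = vocab["<|unk|>"]
    return [vocab.get(tok, unk_id) for tok in tokens]


known_and_unknown = encode_with_unk("So here I am", toy_vocab)
with_separator = encode_with_unk("here <|endoftext|> BOOM", toy_vocab)
print(known_and_unknown)  # [4, 0, 1, 2]
print(with_separator)     # [0, 3, 4]
```

Note that `<|endoftext|>` survives the split as a single token because none of its characters are in the regex's punctuation set.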

In [20]:
# Doing end of text - these are 2 songs so we'd want to separate them

text1 = "So here I am, it's in my head"
text2 = "Eating seeds is a past time activity"

text = " <|endoftext|> ".join([text1, text2])

text

"So here I am, it's in my head <|endoftext|> Eating seeds is a past time activity"

In [21]:
tokenize_v2.decode(tokenize_v2.encode(text))

"<|unk|> here I am, it' s in my head <|endoftext|> <|unk|> <|unk|> is a past time activity"

### section 2.5: Byte pair encoding (BPE)

A much more sophisticated tokenization approach, used in training the GPT series. It was first described for text compression in 1994 by Philip Gage! It's quite complex to implement, so we just import it via `tiktoken`. It breaks unfamiliar words down into smaller subword pieces that are in the vocabulary, so new, unknown words can still be encoded and embedded. We don't end up with a bunch of `<|unk|>` tokens if the text has many new words.

[From Wikipedia](https://en.wikipedia.org/wiki/Byte_pair_encoding), the algorithm works by repeatedly finding the most common adjacent pair of characters and replacing it with an unused placeholder byte. It continues to do this until there are no more adjacent pairs appearing more than once or until a desired vocabulary size is reached.

The process ends up with a vocabulary of every individual character present plus the merged combinations that occur frequently. [This video](https://huggingface.co/learn/nlp-course/en/chapter6/5) has a really nice step-by-step walkthrough of the algorithm.
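To make the merge loop concrete, here is a toy sketch of the compression-style algorithm described above, run on the `aaabdaaabac` string from the Wikipedia article. It is not how `tiktoken` is implemented: it concatenates the two symbols instead of substituting a placeholder byte, and the function names are made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, max_merges):
    """Merge the most frequent adjacent pair until none repeats or max_merges is hit."""
    symbols = list(text)  # start from individual characters
    merges = []
    for _ in range(max_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # no pair appears more than once
            break
        symbols = merge_pair(symbols, best_pair)
        merges.append(best_pair)
    return symbols, merges


symbols, merges = toy_bpe("aaabdaaabac", max_merges=10)
print(symbols)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Three merges shrink 11 characters to 5 symbols. The single characters stay available as a fallback, which is what lets a real BPE vocabulary encode words it has never seen.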

In [22]:
import tiktoken
tiktokenize = tiktoken.get_encoding('gpt2')

In [23]:
text = ("Hello, do  you like tea? <|endoftext|> In the sunlit terraces of someUnknownPlace.")

integers = tiktokenize.encode(text, allowed_special={"<|endoftext|>"})

integers

[15496,
 11,
 466,
 220,
 345,
 588,
 8887,
 30,
 220,
 50256,
 554,
 262,
 4252,
 18250,
 8812,
 2114,
 286,
 617,
 20035,
 27271,
 13]

In [24]:
tiktokenize.decode(integers)

'Hello, do  you like tea? <|endoftext|> In the sunlit terraces of someUnknownPlace.'

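Note that the nonsense word made it back intact. BPE never needs an unknown token: `someUnknownPlace` isn't in the vocabulary as a single token, so it is split into smaller known pieces (the three IDs `617`, `20035`, `27271` just before the final period above), which decode back to the original string.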

### section 2.6: Data Sampling with a Sliding Window

With all of this text, we'll want to build training examples and have a way to load them into our training process.

In [25]:
import torch
print(f"Can this Mac use GPUs?: {torch.backends.mps.is_available()}")

from utils import create_dataload_v1

Can this Mac use GPUs?: True


In [26]:
tokenizer = tiktoken.get_encoding('gpt2')

Note the number of tokens (5,145) is slightly larger than with the original approach that split on whole words and punctuation (4,690). This is because BPE breaks some words into several subword tokens, and each piece is counted.

In [27]:
enc_text = tokenizer.encode(verdict)
len(enc_text)

5145

Building the input and target sets. Our context window is the number of tokens the model is given as context. It's kept very short right now so we can get a sense of what is going on visually. The targets are the inputs shifted one token to the right.

In [28]:
# sample for more interesting text
enc_sample = enc_text[50:]

context_window = 4

x = enc_sample[:context_window]
y = enc_sample[1:context_window + 1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


**Note** Tokenizer decode takes a list - make sure to wrap a single value in brackets.

* `tokenizer.decode(enc_sample[1])` returns an error
* must do `tokenizer.decode([enc_sample[1]])`

Let's instead decode these to see each context and the next token it should predict.

In [29]:
for i in range(1, context_window + 1):
    print(f"{tokenizer.decode(enc_sample[:i])} -----> {tokenizer.decode([enc_sample[i]])}")

 and ----->  established
 and established ----->  himself
 and established himself ----->  in
 and established himself in ----->  a


We build the dataloader and then make it into an iterator so we can pull out batches to investigate. Notice that from the first batch to the next, everything shifts over by 1 token, because `stride=1`.

In [30]:
dataloader = create_dataload_v1(verdict, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)

# first batch
next(data_iter)

[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]

In [31]:
# second_batch
next(data_iter)

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
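The `create_dataload_v1` helper comes from `utils.py` and isn't shown here. Judging from the batches above, the windowing it performs could be sketched in plain Python like this. The function name `sliding_windows` is hypothetical, and the real helper presumably wraps the result in a PyTorch `Dataset` and `DataLoader`.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (inputs, targets) pairs; targets are inputs shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


# First nine token IDs of the text, taken from the outputs in these notes
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
print(overlapping[0])  # ([40, 367, 2885, 1464], [367, 2885, 1464, 1807])
print(overlapping[1])  # ([367, 2885, 1464, 1807], [2885, 1464, 1807, 3619])

disjoint = sliding_windows(token_ids, max_length=4, stride=4)
print(disjoint[1])     # ([1807, 3619, 402, 271], [3619, 402, 271, 10899])
```

With `stride=1` consecutive windows share all but one token. Setting the stride equal to `max_length` gives windows whose inputs don't overlap at all.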

### section 2.7: making embeddings

This is more or less a placeholder. We produce a tensor of small, randomly generated numbers with the dimensions we need. Because we're building one vector for every token in our vocabulary, the vocabulary size sets the number of rows. The size we want each output vector to be sets the number of columns.

So as a toy example, we build a 6x3 tensor: an embedding matrix for a vocabulary of 6 tokens, where each output vector has length 3.

We'll be optimizing these weights as part of the LLM training process later, so for now we're just learning how to build the layer.

In [34]:
vocab_size = 6
output_dim = 3

In [35]:
torch.manual_seed(123)

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
embedding_layer.weight

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)

So if we want to get the embeddings of 4 words - ids `2`, `3`, `5` and `1` - we use them as indices of our embedding matrix. It's just a lookup.

In [36]:
embedding_layer(torch.tensor([2, 3, 5, 1]))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
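To see that this really is just row indexing, here is a plain-Python sketch with nested lists standing in for the tensor. The weight values are copied from the printout above. The comparison with a one-hot matrix multiply is an extra check added for illustration, and both function names are made up.

```python
# Rows copied from the embedding_layer.weight printout above
weights = [
    [0.3374, -0.1778, -0.1690],
    [0.9178, 1.5810, 1.3010],
    [1.2753, -0.2010, -0.1606],
    [-0.4015, 0.9666, -1.1481],
    [-1.1589, 0.3255, -0.6315],
    [-2.8400, -0.7849, -1.4096],
]


def lookup(ids, matrix):
    """Embedding lookup: pick row i of the matrix for each token ID i."""
    return [matrix[i] for i in ids]


def one_hot_matmul(ids, matrix):
    """Same result the slow way: multiply a one-hot vector by the matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = []
    for i in ids:
        one_hot = [1.0 if r == i else 0.0 for r in range(n_rows)]
        out.append(
            [sum(one_hot[r] * matrix[r][c] for r in range(n_rows)) for c in range(n_cols)]
        )
    return out


rows = lookup([2, 3, 5, 1], weights)
same = one_hot_matmul([2, 3, 5, 1], weights)
print(rows[0])       # [1.2753, -0.201, -0.1606]
print(rows == same)  # True
```

The lookup skips all the multiplications by zero, which is why an embedding layer is the efficient way to do this.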

Doing this with more realistic dimensions.

In [37]:
# Token IDs run from 0 to max_token_value, so the table needs one more row than that
vocab_size = tokenizer.max_token_value + 1
output_dim = 256
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

We create a new dataloader and extract an example batch. Note we increased the batch size and made the stride equal to the max length parameter. This means there won't be overlap between the inputs of consecutive examples.

In [38]:
max_length = 4
dataloader = create_dataload_v1(verdict, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print(f"Token IDs: {inputs}")
print(f"Input shape: {inputs.shape}")

Token IDs: tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Input shape: torch.Size([8, 4])


We can get a look into the actual tokens using the decoder. Note some of the odd outputs - these are the tokens that come out of BPE. `Gisburn` was broken down to `G` +  `is` + `burn`.

In [39]:
for row in inputs:
    for elem in row:
        print(tokenizer.decode([elem.item()]), end=" ")
    print()

I  H AD  always 
 thought  Jack  G is 
burn  rather  a  cheap 
 genius -- though  a 
 good  fellow  enough -- 
so  it  was  no 
 great  surprise  to  me 
 to  hear  that , 


With our training batch of IDs, the embedding layer replaces each token ID with its embedding vector of length 256, so the `[8, 4]` batch becomes `[8, 4, 256]`. The magic of tensors starts to take shape.

In [40]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([8, 4, 256])


### section 2.8: Encoding word positions

So off the rip, the embeddings above would be fine after training. However, there is a shortcoming of the LLM attention mechanism (which we'll get to in chapter 3): it doesn't recognize *where in the sequence* a token is, and position can have a major effect on meaning.

There are 2 main ways to handle position:

* absolute - add a position-specific embedding to the token embedding. For example, the first token of every input gets the same position-0 vector added to it, whatever that token is.
* relative - encodes distances between tokens instead of exact positions. This has a generalizability benefit.

GPT uses absolute positional embeddings that are optimized during training, so we'll be using that approach.

All this setup needs is a second embedding layer with the same vector length, with one row for each position we expect to see in a training example.

In [41]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
pos_embeddings.shape

torch.Size([4, 256])

And we just add these things together. Apparently this just worked in the original transformer paper. Some more reading on it [here](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/).

In [42]:
input_embeddings = token_embeddings + pos_embeddings
input_embeddings.shape

torch.Size([8, 4, 256])
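The shapes work out because the `[4, 256]` position tensor is broadcast across the batch dimension: every sequence in the batch gets the same position vectors added. Here is a small plain-Python sketch of that addition, with nested lists standing in for tensors. The numbers and the function name `add_positions` are made up for illustration.

```python
def add_positions(token_embeddings, pos_embeddings):
    """token_embeddings is [batch][position][dim]; pos_embeddings is [position][dim]."""
    return [
        [
            [tok + pos for tok, pos in zip(tok_vec, pos_vec)]
            for tok_vec, pos_vec in zip(sequence, pos_embeddings)
        ]
        for sequence in token_embeddings
    ]


# Batch of 2 sequences, 3 positions each, embedding dimension 2.
# The token with vector [1.0, 1.0] appears at position 0 in the first
# sequence and at positions 1 and 2 in the second.
tok = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[3.0, 3.0], [1.0, 1.0], [1.0, 1.0]],
]
pos = [[0.0, 0.5], [10.0, 10.5], [20.0, 20.5]]

combined = add_positions(tok, pos)
print(combined[0][0])  # [1.0, 1.5]
print(combined[1][1])  # [11.0, 11.5]
print(combined[1][2])  # [21.0, 21.5]
```

The same token ends up with a different input vector at each position, which is the information the attention layers would otherwise be missing.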

Before training to optimize the weights, we just need one more layer - attention! This is the most difficult part apparently. On to chapter 3!