# 2.4 Adding special context tokens

In the previous section, we implemented a simple tokenizer and applied it to a paragraph in our training set. In this section, we will modify this tokenizer to handle unknown words.

We will also discuss the use and addition of special context tokens that can enhance a model's understanding of context or other relevant information in the text. These special tokens can include, for example, markers for unknown words and document boundaries.

In particular, we will modify the vocabulary and the tokenizer from the previous section, producing SimpleTokenizerV2, to support two new tokens, <|unk|> and <|endoftext|>, as shown in Figure 2.9.

**Figure 2.9 We add special tokens to our vocabulary to handle specific contexts. For example, we add the <|unk|> token to represent new and unknown words that are not part of the training data and therefore not part of the existing vocabulary. In addition, we add an <|endoftext|> token that we can use to separate two unrelated text sources.**

![fig2.20](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-20.jpg?raw=true)

As shown in Figure 2.9, we can modify the tokenizer to use the <|unk|> token whenever it encounters a word that is not part of the vocabulary. In addition, we add a token between unrelated texts. For example, when training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source, as shown in Figure 2.10. This helps the LLM understand that, although these text sources are concatenated for training, they are in fact unrelated.

**Figure 2.10 When processing multiple independent text sources, we add <|endoftext|> tokens between these texts. These <|endoftext|> tokens act as markers that signal the start or end of a particular text segment, which allows the LLM to process and understand the text more effectively.**

![fig2.21](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-21.jpg?raw=true)
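The concatenation shown in Figure 2.10 can be sketched in a few lines. This is a minimal illustration only: the `documents` list and the `concatenate_documents` helper are hypothetical names introduced here, not part of the chapter's code.

```python
# Hypothetical example documents; any list of strings would work here.
documents = [
    "It was the best of times.",
    "Call me Ishmael.",
    "The sky above the port was grey.",
]

END_OF_TEXT = "<|endoftext|>"

def concatenate_documents(docs, separator=END_OF_TEXT):
    """Join independent texts into one string, with a marker token between them."""
    return f" {separator} ".join(docs)

training_text = concatenate_documents(documents)
print(training_text)
```

With three documents, the resulting string contains two separator tokens, one at each document boundary.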

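Now let's modify the vocabulary to include these two special tokens, <|unk|> and <|endoftext|>, by adding them to the list of all unique words that we created in the previous section: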

In [4]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token:integer for integer,token in enumerate(all_tokens)}
print(len(vocab.items()))



According to the output of the print statement, the new vocabulary size is 1161 (the vocabulary size in the previous section was 1159).

As an additional quick check, let's print the last 5 entries of our updated vocabulary:

In [5]:
for item in list(vocab.items())[-5:]:
    print(item)


The above code prints the following:

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)

Based on the output of the above code, we can confirm that the two new special tokens were successfully incorporated into the vocabulary. Next, we adjust the tokenizer from Listing 2.3 accordingly, as shown in Listing 2.4:

**Listing 2.4 A simple text tokenizer that handles unknown words**

In [2]:
import re

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split on punctuation, double dashes, and whitespace, keeping the delimiters
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace words that are not in the vocabulary with the <|unk|> token
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that the join inserts before punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text


Compared to the SimpleTokenizerV1 we implemented in Listing 2.3 in the previous section, the new SimpleTokenizerV2 replaces unknown words with <|unk|> tokens.

Now let's try this new tokenizer in practice. To do this, we will use a simple text sample that we create by concatenating two independent and unrelated sentences:

In [None]:
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
text = " <|endoftext|> ".join((text1, text2))
print(text)

The output is as follows:

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.

Next, let's tokenize the sample text using SimpleTokenizerV2 with the vocabulary we created in Listing 2.2 and extended above:

In [3]:
tokenizer = SimpleTokenizerV2(vocab)
print(tokenizer.encode(text))



This prints the following token IDs:

[1160, 5, 362, 1155, 642, 1000, 10, 1159, 57, 1013, 981, 1009, 738, 1013, 1160, 7]

We can see that the token ID list contains 1159, which is the <|endoftext|> delimiter token, and two 1160s, which are used to mark unknown words.

Let's de-tokenize the text as a quick sanity check:

In [None]:
print(tokenizer.decode(tokenizer.encode(text)))

The output looks like this:

<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.

By comparing the de-tokenized text above with the original input text, we know that the training dataset, Edith Wharton's short story The Verdict, does not contain the words "Hello" and "palace".
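Rather than comparing the two strings by eye, the same check can be sketched in code. The helper below is a hypothetical addition that reuses the splitting rule from Listing 2.4, and `toy_vocab` is a small made-up vocabulary for illustration, not the one built from the short story.

```python
import re

def find_unknown_words(text, vocab):
    """Return the tokens in `text` that are missing from `vocab`, in order of first appearance."""
    # Same splitting rule as SimpleTokenizerV2.encode
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    unknown = []
    for t in tokens:
        if t not in vocab and t not in unknown:
            unknown.append(t)
    return unknown

# Toy vocabulary for illustration only (an assumption, not the book's vocabulary)
toy_words = [",", ".", "?", "In", "do", "like", "of", "sunlit", "tea",
             "terraces", "the", "you", "<|endoftext|>", "<|unk|>"]
toy_vocab = {token: integer for integer, token in enumerate(toy_words)}

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
print(find_unknown_words(text, toy_vocab))  # ['Hello', 'palace']
```

Passing the real `vocab` instead of `toy_vocab` would list the words that SimpleTokenizerV2 maps to <|unk|> for a given input.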

So far, we have discussed tokenization, which is an important step in processing text as input to LLMs. Depending on the LLM, some researchers also consider other special tokens, such as:

- [BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.
- [EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For example, when combining two different Wikipedia articles or books, the [EOS] token indicates where one article ends and the next one begins.
- [PAD] (padding): When training LLMs with a batch size greater than 1, the batch may contain texts of different lengths. To ensure that all texts have the same length, the shorter texts are extended, or "padded", with the [PAD] token up to the length of the longest text in the batch.
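The padding idea can be illustrated with a short sketch. The `pad_batch` helper, the token IDs, and the choice of `pad_id=0` are all assumptions made for this example; a real pipeline would use the ID that its tokenizer assigns to the padding token.

```python
def pad_batch(batch, pad_id):
    """Right-pad each list of token IDs to the length of the longest list in the batch."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

# Hypothetical token IDs; pad_id=0 is an arbitrary choice for this sketch
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_id=0)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
```

After padding, every sequence has the same length, so the batch can be stored as a rectangular array.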

Note that the tokenizer used for GPT models does not need any of the tokens mentioned above; for simplicity, it uses only an <|endoftext|> token. The <|endoftext|> token is analogous to the [EOS] token mentioned above. In addition, <|endoftext|> is also used for padding. However, as we will explore in subsequent sections, when training on batched inputs, we typically use a mask, which means that we do not attend to the padded tokens. The specific token chosen for padding therefore becomes irrelevant.
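To see why the padding token's identity stops mattering once a mask is applied, consider the following sketch. It is an illustration under stated assumptions, not the implementation used in later sections: `padding_mask` and `masked_mean` are hypothetical helpers, and the per-token loss values are made up. The mask is built from the original sequence lengths rather than from the token IDs, so a genuine <|endoftext|> token inside the text is not confused with padding.

```python
def padding_mask(lengths, max_len):
    """Build one mask row per sequence: 1 for real tokens, 0 for padded positions."""
    return [[1] * n + [0] * (max_len - n) for n in lengths]

def masked_mean(values, mask):
    """Average only the values at positions where the mask is 1."""
    kept = [v for v, m in zip(values, mask) if m == 1]
    return sum(kept) / len(kept)

# Made-up per-token losses for one sequence of true length 2, padded to length 4.
# The last two values belong to padded positions and should be ignored.
per_token_loss = [2.0, 4.0, 9.0, 9.0]
mask = padding_mask([2], max_len=4)[0]

print(mask)                               # [1, 1, 0, 0]
print(masked_mean(per_token_loss, mask))  # 3.0
```

Because the masked positions never contribute to the result, replacing the padding token with any other ID would leave the average unchanged.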

Moreover, the tokenizer used for GPT models does not use an <|unk|> token for out-of-vocabulary words. Instead, GPT models use a byte pair encoding tokenizer, which breaks words down into subword units, as we will discuss in the next section.