# 2.3 Converting tokens into token IDs

In the previous section, we split a short story by Edith Wharton into individual tokens.
In this section, we will convert these tokens from Python strings to integer representations, generating so-called token IDs.
This conversion is an intermediate step before converting the token IDs into embedding vectors.

To map the previously generated tokens to token IDs, we first need to build a vocabulary.
This vocabulary defines how we map each unique word and special character to a unique integer, as shown in Figure 2.6.

**Figure 2.6 We build the vocabulary by splitting the entire text in the training dataset into individual tokens.
These individual tokens are then sorted alphabetically, and duplicate tokens are removed.
The unique tokens are then aggregated into a vocabulary that defines a mapping from each unique token to a unique integer value.
The vocabulary shown is intentionally kept small for illustration purposes and, for simplicity, does not include punctuation or special characters.**

![fig2.6](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-6.jpg?raw=true)
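
The steps in Figure 2.6 can be sketched on a toy example. The sentence below is an assumption chosen purely for illustration (it is not taken from the training text), and a plain whitespace split stands in for the tokenizer to keep the sketch short.

```python
# Toy illustration of the vocabulary-building steps (hypothetical sample sentence).
sample_text = "the quick brown fox jumps over the lazy dog"

tokens = sample_text.split()          # split the text into tokens
unique_tokens = sorted(set(tokens))   # remove duplicates and sort alphabetically

# Assign consecutive integers to the sorted unique tokens.
toy_vocab = {token: token_id for token_id, token in enumerate(unique_tokens)}

print(toy_vocab)
```

Although "the" occurs twice in the sentence, it receives a single entry, so the toy vocabulary has eight entries for nine tokens.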

In the previous section, we tokenized Edith Wharton's short story and assigned the resulting tokens to a Python variable called `preprocessed`.
Now, let's create a list of all unique tokens and sort them alphabetically to determine the size of the vocabulary:

In [1]:
import re
import requests

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
raw_text = response.text
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)  # split on punctuation, "--", and whitespace
preprocessed = [item.strip() for item in preprocessed if item.strip()]  # drop empty strings

all_words = sorted(set(preprocessed))  # unique tokens in alphabetical order
vocab_size = len(all_words)
print(vocab_size)

1159


Having determined that the vocabulary contains 1,159 unique tokens, we now create the vocabulary and print its first 51 entries for illustration.

### Code Example 2.2 Creating a vocabulary

In [2]:
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:  # stop after the first 51 entries (IDs 0 through 50)
        break

The output is as follows:

('!', 0) \
('"', 1) \
("'", 2) \
('(', 3) \
(')', 4) \
(',', 5) \
('--', 6) \
('.', 7) \
(':', 8) \
(';', 9) \
('?', 10) \
('A', 11) \
('Ah', 12) \
('Among', 13) \
('And', 14) \
('Are', 15) \
('Arrt', 16) \
('As', 17) \
('At', 18) \
('Be', 19) \
('Begin', 20) \
('Burlington', 21) \
('But', 22) \
('By', 23) \
('Carlo', 24) \
('Carlo;', 25) \
('Chicago', 26) \
('Claude', 27) \
('Come', 28) \
('Croft', 29) \
('Destroyed', 30) \
('Devonshire', 31) \
('Don', 32) \
('Dubarry', 33) \
('Emperors', 34) \
('Florence', 35) \
('For', 36) \
('Gallery', 37) \
('Gideon', 38) \
('Gisburn', 39) \
('Gisburns', 40) \
('Grafton', 41) \
('Greek', 42) \
('Grindle', 43) \
('Grindle:', 44) \
('Grindles', 45) \
('HAD', 46) \
('Had', 47) \
('Hang', 48) \
('Has', 49) \
('He', 50)
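
Entries such as 'Carlo;' and 'Grindle:' exist alongside 'Carlo' and 'Grindle' because the split pattern used here does not list the colon or the semicolon, so these characters stay attached to the preceding word. The following is a minimal sketch of that behavior; the sample string is made up for illustration and is not a quotation from the story.

```python
import re

sample = "Monte Carlo; and Mrs. Grindle: yes!"  # hypothetical sample string
pattern = r'([,.?_!"()\']|--|\s)'               # same split pattern as used for the vocabulary

# Split, then drop empty strings and whitespace-only items.
tokens = [item.strip() for item in re.split(pattern, sample) if item.strip()]
print(tokens)
```

The period and the exclamation mark become separate tokens, whereas "Carlo;" and "Grindle:" each remain a single token.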

As the printed entries show, the dictionary maps each individual token to a unique integer label.
Our next goal is to apply this vocabulary to convert new text into token IDs, as shown in Figure 2.7.

**Figure 2.7 Starting with a new text sample, we tokenize the text and use the vocabulary to convert the text tokens into token IDs.
The vocabulary is built from the entire training set and can be applied to the training set itself and to any new text sample.
For simplicity, the vocabulary shown does not include punctuation or special characters.**

![fig2.7](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-7.jpg?raw=true)

Later in the book, when we want to convert the output of a large language model (LLM) from numbers back to text, we will also need a way to convert token IDs back to text.
To do this, we can create a reverse version of the vocabulary that maps token IDs back to their corresponding tokens.
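
As a minimal sketch of this idea, the inverse mapping can be obtained by swapping the keys and values of the vocabulary dictionary. The three-entry vocabulary and the names `toy_vocab` and `inverse_vocab` below are assumptions for illustration only.

```python
# Hypothetical three-entry vocabulary used only for illustration.
toy_vocab = {"brown": 0, "fox": 1, "the": 2}

# Swap keys and values to obtain the inverse (token ID -> token) mapping.
inverse_vocab = {token_id: token for token, token_id in toy_vocab.items()}

ids = [2, 0, 1]
print(" ".join(inverse_vocab[i] for i in ids))  # the brown fox
```

The swap is only safe because every token ID in the vocabulary is unique; duplicate IDs would silently overwrite each other in the inverse dictionary.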

Let's implement a complete tokenizer class in Python, including an encode method that splits text into tokens and performs a string-to-integer mapping through the vocabulary to generate token IDs.
In addition, we implement a decode method that performs the reverse integer-to-string mapping to convert the token IDs back into text.

The code for this tokenizer is shown in Code Example 2.3:

### Code Example 2.3 Implementing a simple text tokenizer

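In [3]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        # Store the vocabulary (token -> ID) for use in encode.
        self.str_to_int = vocab
        # Build the inverse mapping (ID -> token) for use in decode.
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Split the text into tokens, then look up the ID of each token.
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        # Map the IDs back to tokens and join them with single spaces.
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the whitespace before the listed punctuation marks.
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text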

Using the SimpleTokenizerV1 Python class described above, we can now instantiate new tokenizer objects with an existing vocabulary, which we can then use to encode and decode text, as shown in Figure 2.8.

**Figure 2.8 Tokenizer implementations share two common methods: an encode method and a decode method.
The encode method takes in the sample text, splits it into individual tokens, and converts the tokens into token IDs via the vocabulary.
The decode method takes in token IDs, converts them back into text tokens, and concatenates the text tokens into natural text.**

![fig2.8](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-8.jpg?raw=true)

Let's try this out in practice by instantiating a new tokenizer object from the SimpleTokenizerV1 class and using it to tokenize a passage from Edith Wharton's short story:

In [4]:
tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn
said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

The code above prints the following token IDs:

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]

Next, let's see if we can convert these token IDs back into text using the decode method:

In [5]:
tokenizer.decode(ids)

This outputs the following text:

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

Based on the output above, we can see that the decode method converts the token IDs back into the original text, apart from minor whitespace differences.
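
The decoded string has a space after the opening quotation mark and after the apostrophe. This follows from how the decode method reassembles text: all tokens are joined with single spaces, and only the space before certain punctuation marks is removed afterwards. The sketch below replays these two steps on the token strings directly; the token list is written out by hand here for illustration rather than produced by the tokenizer.

```python
import re

# Tokens of the sample sentence, written out by hand for illustration.
tokens = ['"', 'It', "'", 's', 'the', 'last', 'he', 'painted', ',', 'you',
          'know', ',', '"', 'Mrs', '.', 'Gisburn', 'said', 'with',
          'pardonable', 'pride', '.']

joined = " ".join(tokens)                             # single space between all tokens
decoded = re.sub(r'\s+([,.?!"()\'])', r'\1', joined)  # drop the space before punctuation
print(decoded)
```

Because the substitution only looks at whitespace before a punctuation mark, the space that follows an opening quotation mark or an apostrophe is left in place.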

So far, so good.
We have built a tokenizer that can encode and decode text, and we have tested it on a snippet from the training set.
Now, let's apply it to a new example of text that was not included in our training set:

In [6]:
text = "Hello, do you like tea?"
tokenizer.encode(text)

Executing the code above results in the following error:

...
KeyError: 'Hello'

The problem is that the word "Hello" does not appear in the short story "The Verdict".
Hence, it is not contained in the vocabulary we built earlier.
This highlights the need for large and diverse training sets to extend the vocabulary when working on LLMs.
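
One way to see in advance which tokens would trigger this error is to compare the tokens of a text against the vocabulary keys. The sketch below is an illustration under stated assumptions: the helper name `find_unknown_tokens` and the miniature vocabulary `toy_vocab` are hypothetical and not part of the tokenizer class above.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that are missing from `vocab` (hypothetical helper)."""
    tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
    tokens = [item.strip() for item in tokens if item.strip()]
    return [token for token in tokens if token not in vocab]

# Hypothetical miniature vocabulary used only for this illustration.
toy_vocab = {",": 0, "?": 1, "do": 2, "like": 3, "tea": 4, "you": 5}

print(find_unknown_tokens("Hello, do you like tea?", toy_vocab))  # ['Hello']
```

An empty result means that every token has an entry, so a dictionary lookup on each token would not raise a KeyError.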

In the next section, we will test the tokenizer further on text that contains unknown words, and we will also discuss additional special tokens that can be used to provide further context for an LLM during training.