# 2.2 Tokenizing text

This section covers how we segment the input text into individual tokens, which is a required preprocessing step for creating embeddings for large language models (LLMs).
These tokens may be single words or special characters, including punctuation, as shown in Figure 2.4.

**Figure 2.4 A view of the text processing steps covered in this section, in the context of a large language model (LLM). Here, we split the input text into individual tokens, which may be words or special characters such as punctuation. In the following sections, we will convert the text into token IDs and create token embeddings.**

![fig2.4](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-4.jpg?raw=true)

The text we will tokenize for LLM training is The Verdict, a short story by Edith Wharton. It is in the public domain, so we are free to use it for this purpose.
The story is available on Wikisource at https://en.wikisource.org/wiki/The_Verdict and can be copied and pasted into a text file. I have saved it in a text file called “the-verdict.txt”; the code below fetches a copy of that file from the book’s GitHub repository using the requests library:

### Code Example 2.1: Loading a short story as sample text in Python

```python
import requests

# Download the raw text of the short story
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
response = requests.get(url)
raw_text = response.text
print("Total number of characters:", len(raw_text))
print(raw_text[:99])
```

```
Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
```

Alternatively, you can find the file named "the-verdict.txt" in the GitHub repository of this book at: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch02/01_main-chapter-code
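If you have saved the file locally, it can also be read with Python's built-in file handling. The following is a minimal sketch, assuming the file sits in the current working directory; the helper name `load_text` is hypothetical and not part of the book's code:

```python
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Read an entire text file into a single string (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Assumption: "the-verdict.txt" was saved in the current working directory
file_path = Path("the-verdict.txt")
if file_path.exists():
    raw_text = load_text(file_path)
    print("Total number of characters:", len(raw_text))
    print(raw_text[:99])
else:
    print(f"{file_path} not found; download or copy it first.")
```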

The first print call reports the total number of characters in the text, and the second shows its first 99 characters, as seen in the output of Code Example 2.1.

Our goal is to tokenize this 20,479-character short story into individual words and special characters, which we can then convert into embeddings for LLM training in the following chapters.

### Size of sample text

Note that when working with large language models (LLMs), it is common to process millions of articles and hundreds of thousands of books, amounting to many gigabytes of text.
For educational purposes, however, a small text sample such as a single book is sufficient. It demonstrates the main text processing steps clearly while still running in a reasonable amount of time on common consumer hardware.

How can we best split this text into a list of tokens?
We’ll explore this briefly, using Python’s regular expression library, the re module, for illustration.
(Note that you don’t have to learn or remember any regular expression syntax, as we’ll switch to using a prebuilt tokenizer later in this chapter.)

Using some simple example text, we can use the following re.split command to split the text on whitespace characters:

```python
import re

text = "Hello, world. This, is a test."
# Split on whitespace; the capturing group keeps each delimiter in the result
result = re.split(r'(\s)', text)
print(result)
```

```
['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
```

As the output above shows, the result is a list of words (some with punctuation still attached) and whitespace characters.
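The whitespace characters appear in the list because the pattern is wrapped in parentheses: with a capturing group, re.split returns the matched delimiters along with the text between them. A short comparison, with illustrative variable names that are not part of the original listings:

```python
import re

text = "Hello, world. This, is a test."

# Without a capturing group, the whitespace delimiters are discarded
without_group = re.split(r'\s', text)
print(without_group)
# ['Hello,', 'world.', 'This,', 'is', 'a', 'test.']

# With a capturing group, each matched delimiter becomes its own list item
with_group = re.split(r'(\s)', text)
print(with_group)
# ['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']
```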

Note that the simple tokenization scheme above is mostly useful for breaking up the sample text into individual words, but there are still some words connected to punctuation marks that we would like to list separately.

Let's modify the regular expression to split on spaces (\s) and commas and periods ([,.]):

```python
# Split on commas, periods, and whitespace, keeping the delimiters
result = re.split(r'([,.]|\s)', text)
print(result)
```

```
['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']
```

As the output above shows, the words and punctuation marks are now separate items in the list, just as we wanted.

One minor problem remains: the list still contains whitespace characters and empty strings.
Optionally, we can remove these redundant items safely, as follows:

```python
# Drop empty strings and whitespace-only items
result = [item.strip() for item in result if item.strip()]
print(result)
```

```
['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']
```

The resulting output, shown above, no longer contains whitespace or empty strings.
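The empty strings in the intermediate list come from re.split itself: when two delimiters are adjacent (for example, a comma followed by a space), the empty text between them is returned as an item. The sketch below uses a shorter, hypothetical sample string to show where they arise, and that the `if item.strip()` filter drops both empty strings and whitespace-only items:

```python
import re

sample = "a, b"  # hypothetical sample string, not taken from the story

pieces = re.split(r'([,.]|\s)', sample)
print(pieces)
# ['a', ',', '', ' ', 'b']  <- '' is the empty text between ',' and ' '

# An empty string and a whitespace-only string both strip to '', which is falsy
cleaned = [item.strip() for item in pieces if item.strip()]
print(cleaned)
# ['a', ',', 'b']
```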

### Whether to remove spaces

When developing a simple tokenizer, whether we encode whitespace as separate tokens or remove it depends on the application and its requirements.
Removing whitespace reduces memory and computational requirements. However, keeping it can be useful when training models that are sensitive to the exact structure of the text
(for example, Python code is sensitive to indentation and spacing).
Here, we remove whitespace to keep the tokenized output simple and concise.
Later, we will switch to a tokenization scheme that includes whitespace.
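To illustrate the trade-off, the following minimal sketch uses a hypothetical two-line Python snippet as sample text (the variable names are illustrative and not from the original listings). When the whitespace tokens are kept, joining the tokens reproduces the input exactly; once they are removed, the indentation can no longer be recovered:

```python
import re

# Hypothetical sample: a code fragment in which indentation carries meaning
code = "if x:\n    return y"

# Keep whitespace tokens: concatenating the pieces restores the original text
kept = re.split(r'(\s)', code)
print("".join(kept) == code)  # True

# Remove whitespace tokens: the newline and indentation are lost
stripped = [item for item in kept if item.strip()]
print(stripped)  # ['if', 'x:', 'return', 'y']
print(" ".join(stripped) == code)  # False
```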

The tokenization scheme we devised above works well on the simple sample text.
Now, let's modify it further so that it can also handle other types of punctuation,
such as question marks, quotation marks, and the double dash we saw in the opening of Edith Wharton's short story, as well as additional special characters:

```python
text = "Hello, world. Is this-- a test?"
# Also split on ?, _, !, quotation marks, parentheses, and the double dash
result = re.split(r'([,.?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
```

```
['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

As we can see from the results summarized in Figure 2.5,
our tokenization scheme can now successfully handle various special characters in the text.

**Figure 2.5 The tokenization scheme we have implemented so far splits text into individual words and punctuation marks. In the specific example shown in this figure, the sample text is split into 10 individual tokens.**

![fig2.5](https://github.com/datawhalechina/llms-from-scratch-cn/blob/main/Translated_Book/img/fig-2-5.jpg?raw=true)
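For convenience, the two steps (splitting and removing whitespace) can be wrapped in a small helper. This is only a sketch: the names `TOKEN_PATTERN` and `tokenize` are hypothetical and do not appear in the original listings, while the regular expression is the same one used above.

```python
import re

# Same pattern as in the example above: punctuation, double dash, or whitespace
TOKEN_PATTERN = r'([,.?_!"()\']|--|\s)'


def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(TOKEN_PATTERN, text)
    return [piece.strip() for piece in pieces if piece.strip()]


tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
print(len(tokens))  # 10
```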

Now that we have a basic tokenizer up and running, let's apply it to Edith Wharton's entire short story:

```python
# Tokenize the full story with the same pattern, then drop whitespace items
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(len(preprocessed))
```

```
4649
```

The print statement above outputs 4649, which is the number of tokens in the text (excluding whitespace).

Let's print the first 30 tokens for a quick visual inspection:

```python
print(preprocessed[:30])
```

```
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
```

The output above shows that our tokenizer appears to handle the text well, as all the words and special characters are neatly separated.
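One behavior worth being aware of: because the pattern treats the straight apostrophe as a delimiter, contractions are split into several tokens. A minimal sketch with a hypothetical sentence (not taken from the story):

```python
import re

sample = "I don't know"  # hypothetical sentence used only for illustration

pieces = re.split(r'([,.?_!"()\']|--|\s)', sample)
tokens = [item.strip() for item in pieces if item.strip()]
print(tokens)
# ['I', 'don', "'", 't', 'know']
```

This is acceptable for the simple tokenizer developed here, which we will replace with a prebuilt tokenizer later in this chapter.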