In [None]:
# Step 1: Creating Tokens

In [4]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print("Total number of characters in the file:", len(raw_text))
print(raw_text[:100])

Total number of characters in the file: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [2]:
""" 
Our goal is to tokenize this 20,479-character short story (stored in the file the-verdict.txt) into individual words and
special characters that we can then turn into embeddings for LLM training.
"""

"""
Note that it's common to process millions of articles and hundreds of thousands of books -- many gigabytes of text -- when working with LLMs.
However, for educational purposes, it's sufficient to work with smaller text samples like a single book to illustrate the main ideas behind 
the text processing steps and to make it possible to run it in reasonable time on consumer hardware.
"""

"""
How can we best split this text to obtain a list of tokens? For this, we go on a small excursion and use Python's regular expression 
library re for illustration purposes. (Note that you don't have to learn or memorize any regular expression syntax since we will transition
to a pre-built tokenizer later in this chapter.)
"""

# Using some simple example text, we can use the re.split command with the following syntax to split a text on whitespace characters:

In [11]:
# re is Python's regular expression module. Splitting on (\s) breaks the text at every whitespace character,
# and the capturing parentheses keep each whitespace character as its own list item.
import re

text = "Hello nishit, ,,world!!"
result = re.split(r'(\s)', text)

print(result)

['Hello', ' ', 'nishit,', ' ', ',,world!!']
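
The parentheses in the pattern are what keep the whitespace in the result. As a minimal sketch (the variable names `sample`, `without_group`, and `with_group` are illustrative and not part of the original notebook), compare the same split with and without a capturing group:

```python
import re

sample = "Hello, world."

# Without a capturing group, the matched separators are discarded.
without_group = re.split(r'\s', sample)

# With a capturing group, each matched separator is returned as its own item.
with_group = re.split(r'(\s)', sample)

print(without_group)  # ['Hello,', 'world.']
print(with_group)     # ['Hello,', ' ', 'world.']
```

Keeping the separators matters later, when punctuation is added to the pattern: those matches are the tokens we want to preserve.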


In [None]:
# The result is a list of words and whitespace characters; punctuation is still attached to the words.

# Let's modify the regular expression so that it splits on whitespace (\s), commas, and periods ([,.]):

In [31]:
result = re.split(r'([,.]|\s)', text)

for item in result:
    print(item)

# Commas are now separate list entries, as intended. The empty strings and spaces are filtered out in the next step.

Hello
 
nishit
,

 

,

,
world!!


In [None]:
# A small remaining issue is that the list still includes whitespace characters. Optionally, we can remove these redundant 
# characters safely as follows:

In [13]:
result = [item for item in result if item.strip()]
print(result)

['Hello', 'nishit', ',', ',', ',', 'world!!']
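
To see what the filtering step gives up, the sketch below splits a short, made-up snippet of Python code and compares the two choices. It is only an illustration under assumed names (`sample`, `pieces`, `kept`, `dropped`), not part of the original notebook:

```python
import re

# A made-up two-line code snippet where whitespace carries meaning.
sample = "if x:\n    return 1"

pieces = re.split(r'(\s)', sample)

# Keeping the whitespace tokens lets us rebuild the original text exactly.
kept = [p for p in pieces if p != ""]

# Dropping them loses the line break and the indentation.
dropped = [p for p in pieces if p.strip()]

print("".join(kept) == sample)  # True
print(" ".join(dropped))        # if x: return 1
```

With the whitespace tokens removed, the original layout can no longer be recovered from the token list alone.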


In [None]:
# REMOVING WHITESPACES OR NOT

"""
When developing a simple tokenizer, whether we should encode whitespaces as separate characters or just remove them depends
on our application and its requirements. Removing whitespaces reduces the memory and computing requirements. However, keeping 
whitespaces can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, which is
sensitive to indentation and spacing). Here, we remove whitespaces for simplicity and brevity of the tokenized outputs. Later, we will 
switch to a tokenization scheme that includes whitespaces.
"""

In [None]:
"""
The tokenization scheme we devised above works well on the simple sample text. Let's modify it a bit further so that it can also handle other 
types of punctuation, such as question marks, quotation marks, and the double-dashes we have seen earlier in the first 100 characters of 
Edith Wharton's short story, along with additional special characters:
"""

In [29]:
text = "Hello, world. Is this-- a test?"
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item for item in result if item.strip()]

# The two lines above perform the tokenization: split on punctuation, double-dashes, and whitespace,
# then drop empty and whitespace-only items.

print(result, '\n')

for item in result:
    print(item)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?'] 

Hello
,
world
.
Is
this
--
a
test
?


In [37]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:50])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself']
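
Since the same two lines are now applied to both the sample sentence and the full story, they could be wrapped in a small helper. The sketch below is one possible way to do that; `tokenize` and `TOKEN_PATTERN` are hypothetical names introduced here, and the sample string is taken from the opening of the story printed earlier, so the block runs without the text file:

```python
import re

# Hypothetical constant; the pattern itself is copied from the cells above.
TOKEN_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def tokenize(text):
    """Split text on punctuation, double-dashes, and whitespace, dropping empty items."""
    pieces = re.split(TOKEN_PATTERN, text)
    return [p.strip() for p in pieces if p.strip()]

sample = "I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow"
tokens = tokenize(sample)
print(tokens)
```

Calling `tokenize(raw_text)` should produce the same list as `preprocessed` above, assuming `raw_text` has been loaded as in the first cell.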
