### Lab 8.1 Tokenization

This week we will work up to building an RNN text generator. In today's lab you will explore different methods of text tokenization. Here is an overview of the task.

Imagine that our entire dataset consists of the following text:

    hello world hello a b c

We would first build a vocabulary of the words in the dataset:

    0: hello
    1: world
    2: a
    3: b
    4: c

Thus the dataset can be mapped to token indices:

    0 1 0 2 3 4

Now suppose that we have defined the sequence length (`seq_len`) to be 3. Each window of `seq_len` consecutive tokens is used as an input to our RNN, and the token that follows the window is the target. Here are the possible input sequences and targets:

    0 1 0 -> 2
    1 0 2 -> 3
    0 2 3 -> 4
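
The toy example above can be reproduced in a few lines of plain Python. This is only a sketch: the helper names `build_vocab` and `make_windows` are made up for illustration, and the vocabulary is numbered in order of first appearance to match the table above (the exercise solutions later sort the vocabulary instead).

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_windows(ids, seq_len):
    """Return (input, target) pairs for every window of seq_len tokens."""
    return [(ids[i:i + seq_len], ids[i + seq_len])
            for i in range(len(ids) - seq_len)]

tokens = "hello world hello a b c".split()
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
windows = make_windows(ids, seq_len=3)

for inputs, target in windows:
    print(*inputs, "->", target)
```

Six tokens with `seq_len = 3` give three windows, because the last three tokens have no following token to serve as a target.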

You will build a subclass of `Dataset` to find all possible sequences for a given dataset, either at the word or character level.

The following code will download the text of Shakespeare's sonnets and read it in as one long string.

In [6]:
from torch.utils.data import Dataset

In [None]:
!wget --no-clobber "https://www.dropbox.com/scl/fi/7r68l64ijemidyb9lf80q/sonnets.txt?rlkey=udb47coatr2zbrk31hsfbr22y&dl=1" -O sonnets.txt


--2025-02-25 20:46:43--  https://www.dropbox.com/scl/fi/7r68l64ijemidyb9lf80q/sonnets.txt?rlkey=udb47coatr2zbrk31hsfbr22y&dl=1
Resolving www.dropbox.com (www.dropbox.com)... 2620:100:601b:18::a27d:812, 162.125.8.18
Connecting to www.dropbox.com (www.dropbox.com)|2620:100:601b:18::a27d:812|:443... connected.
ERROR: cannot verify www.dropbox.com's certificate, issued by ‘CN=DigiCert TLS RSA SHA256 2020 CA1,O=DigiCert Inc,C=US’:
  Unable to locally verify the issuer's authority.
To connect to www.dropbox.com insecurely, use `--no-check-certificate'.
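
The log above shows the `wget` call failing on certificate verification. One possible fallback, sketched here under assumptions, is to fetch the file from Python with the standard library. The helper name `download_if_missing` is hypothetical, and if the cause is the local certificate store, this call may fail in the same way.

```python
import os
import urllib.request

def download_if_missing(url, path):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True
```

It would be called with the same URL as the `wget` command and `"sonnets.txt"` as the path. Skipping the download when the file exists mirrors the intent of `--no-clobber`.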


In [9]:
# Use a context manager so the file is closed after reading
with open("sonnets.txt") as f:
    text = f.read()

In [10]:
text = text.lower()

In [11]:
print(text[:1000])

i

 from fairest creatures we desire increase,
 that thereby beauty's rose might never die,
 but as the riper should by time decease,
 his tender heir might bear his memory:
 but thou, contracted to thine own bright eyes,
 feed'st thy light's flame with self-substantial fuel,
 making a famine where abundance lies,
 thy self thy foe, to thy sweet self too cruel:
 thou that art now the world's fresh ornament,
 and only herald to the gaudy spring,
 within thine own bud buriest thy content,
 and tender churl mak'st waste in niggarding:
   pity the world, or else this glutton be,
   to eat the world's due, by the grave and thee.

 ii

 when forty winters shall besiege thy brow,
 and dig deep trenches in thy beauty's field,
 thy youth's proud livery so gazed on now,
 will be a tatter'd weed of small worth held:
 then being asked, where all thy beauty lies,
 where all the treasure of thy lusty days;
 to say, within thine own deep sunken eyes,
 were an all-eating shame, and thriftless praise.


### Exercises

1. Prepare a vocabulary of the unique words in the dataset.  (For simplicity's sake you can leave the punctuation in.)

In [12]:
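# Read the sonnets.txt file.
# Note: this replaces the lowercased `text` from above, so the vocabulary
# below is case-sensitive.
with open("sonnets.txt", "r") as f:
    text = f.read()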

# Split text into words (punctuation remains attached)
words = text.split()

# Compute the unique words and sort them (optional)
vocab = sorted(set(words))

# Print out the vocabulary size and a few sample words
print("Vocabulary size:", len(vocab))
print("Sample words:", vocab[:20])

Vocabulary size: 4901
Sample words: ["''tis", "'Amen'", "'Fair,", "'Gainst", "'Had", "'I", "'No'", "'Now", "'Since", "'This", "'Thou", "'Thus", "'Thy", "'Tis", "'Truth", "'Will'", "'Will',", "'Will,'", "'Will.'", "'fore"]
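
The sample above shows how attached punctuation and capitalization inflate the vocabulary: `'Will'`, `'Will',`, `'Will,'` and `'Will.'` all count as different words. The exercise allows leaving this as is. For comparison, here is a sketch of one possible normalization (lowercasing and stripping punctuation from the ends of each word), run on a made-up sample string rather than the full sonnets. The helper name `normalized_vocab` is hypothetical.

```python
import string

def normalized_vocab(text):
    """Lowercase the text and strip leading/trailing punctuation from each word."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    # Drop tokens that were punctuation only and are now empty
    return sorted({w for w in words if w})

sample = "'Tis thy beauty's rose, thy Rose; 'tis"  # made-up sample
raw_vocab = sorted(set(sample.split()))
clean_vocab = normalized_vocab(sample)

print(len(raw_vocab), raw_vocab)
print(len(clean_vocab), clean_vocab)
```

Apostrophes inside a word, as in `beauty's`, survive because `str.strip` only removes characters from the two ends.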


2. Now you will make a Dataset subclass that can return sequences of tokens, encoded as integers.

In [13]:
class WordDataset(Dataset):
  def __init__(self,text,seq_len=100):
    self.seq_len = seq_len
    # Vocabulary: the sorted unique whitespace-separated words (as in exercise 1)
    words = text.split()
    self.vocab = sorted(set(words))
    # Lookup tables in both directions
    self.word_to_idx = {word: idx for idx, word in enumerate(self.vocab)}
    self.idx_to_word = {idx: word for idx, word in enumerate(self.vocab)}
    # The whole text as a list of word indices
    self.data = [self.word_to_idx[word] for word in words]

  def __len__(self):
    # Number of windows that still have a following token to use as the label
    # (clamped at 0 so that len() does not fail on texts shorter than seq_len)
    return max(0, len(self.data) - self.seq_len)

  def __getitem__(self,i):
    # seq_len token indices starting at i, plus the index of token i+seq_len as the label
    return (self.data[i:i+self.seq_len], self.data[i+self.seq_len])

  def decode(self,tokens):
    # Convert a sequence of token indices back into a space-separated string
    return ' '.join(self.idx_to_word[token] for token in tokens)

3. Verify that your class can successfully encode and decode sequences.

In [15]:
# sentence from sonnets.txt
text = "shall i compare thee to a summer's day"
seq_len = 5
dataset = WordDataset(text, seq_len=seq_len)
seq, label = dataset[0]


print("Encoded sequence (indices):", seq)
print("Label index:", label)


decoded_seq = dataset.decode(seq)
print("Decoded sequence:", decoded_seq)


decoded_label = dataset.idx_to_word[label]
print("Decoded label:", decoded_label)

decoded_text = dataset.decode(dataset.data)
print("Decoded text:", decoded_text)

Encoded sequence (indices): [4, 3, 1, 6, 7]
Label index: 0
Decoded sequence: shall i compare thee to
Decoded label: a
Decoded text: shall i compare thee to a summer's day


4. Do the exercise again, but this time at the character level.

In [16]:
class CharacterDataset(Dataset):
  def __init__(self,text,seq_len=100):
    self.seq_len = seq_len
    # Vocabulary: the sorted unique characters (including spaces and newlines)
    self.vocab = sorted(set(text))
    # Lookup tables in both directions
    self.char_to_idx = {char: idx for idx, char in enumerate(self.vocab)}
    self.idx_to_char = {idx: char for idx, char in enumerate(self.vocab)}
    # The whole text as a list of character indices
    self.data = [self.char_to_idx[c] for c in text]

  def __len__(self):
    # Number of windows that still have a following token to use as the label
    # (clamped at 0 so that len() does not fail on texts shorter than seq_len)
    return max(0, len(self.data) - self.seq_len)

  def __getitem__(self,i):
    # seq_len character indices starting at i, plus the index of character i+seq_len as the label
    return (self.data[i:i+self.seq_len], self.data[i+self.seq_len])

  def decode(self,tokens):
    # Convert a sequence of character indices back into a string
    return ''.join(self.idx_to_char[token] for token in tokens)
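
One difference between the two tokenizers is worth checking when you verify encoding and decoding. Splitting on whitespace and re-joining with single spaces loses the original line breaks and indentation, while the character-level round trip is exact. The sketch below shows this with two stand-alone helper functions (hypothetical names, mirroring the encode and decode logic of the classes above without depending on `torch`).

```python
def word_round_trip(text):
    """Encode text as word indices, then decode by joining with single spaces."""
    vocab = sorted(set(text.split()))
    to_idx = {w: i for i, w in enumerate(vocab)}
    ids = [to_idx[w] for w in text.split()]
    return " ".join(vocab[i] for i in ids)

def char_round_trip(text):
    """Encode text as character indices, then decode by concatenation."""
    vocab = sorted(set(text))
    to_idx = {c: i for i, c in enumerate(vocab)}
    ids = [to_idx[c] for c in text]
    return "".join(vocab[i] for i in ids)

sample = "shall i compare thee\n  to a summer's day"  # note the newline and indent

print(repr(word_round_trip(sample)))
print(repr(char_round_trip(sample)))
```

For a text generator this means a word-level model trained this way cannot reproduce the line structure of the sonnets, whereas a character-level model can.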

5. Compare the number of sequences for each tokenization method.

In [18]:
text = "shall i compare thee to a summer's day"
seq_len_word = 3
seq_len_char = 10

word_dataset = WordDataset(text, seq_len=seq_len_word)
num_word_sequences = len(word_dataset)

char_dataset = CharacterDataset(text, seq_len=seq_len_char)
num_char_sequences = len(char_dataset)

print("Word-level number of sequences:", num_word_sequences)
print("Character-level number of sequences:", num_char_sequences)

Word-level number of sequences: 5
Character-level number of sequences: 28
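
These counts follow directly from `__len__`: the number of sequences is the number of tokens minus `seq_len`. A quick check in plain Python (the helper name `num_sequences` is made up for illustration):

```python
def num_sequences(num_tokens, seq_len):
    """Number of (input, target) windows for a token stream of the given length."""
    return max(0, num_tokens - seq_len)

text = "shall i compare thee to a summer's day"
num_words = len(text.split())  # whitespace-separated words
num_chars = len(text)          # characters, including spaces

print(num_words, "words ->", num_sequences(num_words, 3), "sequences")
print(num_chars, "chars ->", num_sequences(num_chars, 10), "sequences")
```

The same text yields many more character tokens than word tokens, so the character-level dataset has more sequences even with a longer `seq_len`.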


6. Optional: implement the byte pair encoding algorithm to make a Dataset class that uses word parts.
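
As a possible starting point for the optional exercise, here is a minimal sketch of learning byte pair encoding merges on whitespace-split words. It is not a full solution: it does not build a `Dataset`, and the function names, the `</w>` end-of-word marker, and the tie-breaking rule (the first pair encountered wins) are all assumptions made for illustration.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (a common convention, assumed here)

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(text, num_merges):
    """Learn up to num_merges merge rules, most frequent adjacent pair first."""
    # Start from characters: each distinct word becomes a tuple of symbols
    word_freqs = Counter(tuple(word) + (END,) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen
        merges.append(best)
        # Apply the new merge to every word
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges

def bpe_tokenize(word, merges):
    """Split a word into word parts by applying the learned merges in order."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe("hello world hello a b c", num_merges=5)
print(merges)
print(bpe_tokenize("hello", merges))
print(bpe_tokenize("help", merges))
```

On the toy dataset from the overview, five merges are enough to turn the repeated word `hello` into a single token, while an unseen word such as `help` is split into the learned part `hel` plus leftover characters. To finish the exercise, the vocabulary would be the set of symbols produced by `bpe_tokenize` over the whole text, used in place of words or characters in the classes above.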