In [1]:
# Use python 3.7 or later (IMPORTANT)

import numpy as np
from collections import Counter
import random
from tqdm import tqdm
import regex as re
import math
# set seed
np.random.seed(42)
random.seed(42)

We use Jane Austen's "Emma" novel as the running example for the rest of our homework

> **Add blockquote**



In [2]:
## Load data
with open("./Emma.txt", "r", encoding='UTF8') as f:
    text = f.read()

In [3]:
# see dataset properties:
print(f"Total: {len(list(text))} characters")
print(f"number of unique characthers {len(set(text))}")

Total: 899644 characters
number of unique characthers 92


However, the Emma dataset is quite big. So, we will use this small dataset from [NYTimes](https://www.nytimes.com/2024/09/04/opinion/yuval-harari-ai-democracy.html) for debugging purpose. Use this only for debugging and not for reporting final results. Do your experiments with Emma dataset and the data kept in the multilingual-data folder.

In [4]:
text_ = f"""Democracy is a conversation. Its function and survival depend on the available information technology. For most of history, no technology existed for holding large-scale conversations among millions of people. In the premodern world, democracies existed only in small city-states like Rome and Athens, or in even smaller tribes. Once a polity grew large, the democratic conversation collapsed, and authoritarianism remained the only alternative.

Large-scale democracies became feasible only after the rise of modern information technologies like the newspaper, the telegraph and the radio. The fact that modern democracy has been built on top of modern information technologies means that any major change in the underlying technology is likely to result in a political upheaval.

This partly explains the current worldwide crisis of democracy. In the United States, Democrats and Republicans can hardly agree on even the most basic facts, such as who won the 2020 presidential election. A similar breakdown is happening in numerous other democracies around the world, from Brazil to Israel and from France to the Philippines.

In the early days of the internet and social media, tech enthusiasts promised they would spread truth, topple tyrants and ensure the universal triumph of liberty. So far, they seem to have had the opposite effect. We now have the most sophisticated information technology in history, but we are losing the ability to talk with one another, and even more so the ability to listen.

As technology has made it easier than ever to spread information, attention became a scarce resource, and the ensuing battle for attention resulted in a deluge of toxic information. But the battle lines are now shifting from attention to intimacy. The new generative artificial intelligence is capable of not only producing texts, images and videos, but also conversing with us directly, pretending to be human.

Over the past two decades, algorithms fought algorithms to grab attention by manipulating conversations and content. In particular, algorithms tasked with maximizing user engagement discovered by experimenting on millions of human guinea pigs that if you press the greed, hate or fear button in the brain, you grab the attention of that human and keep that person glued to the screen. The algorithms began to deliberately promote such content. But the algorithms had only limited capacity to produce this content by themselves or to directly hold an intimate conversation. This is now changing, with the introduction of generative A.I.s like OpenAI’s GPT-4.
"""

# The author(s) of this homework do not necessarily agree/endorse the views expressed in the text above.


Let's start with implementing the simplest form of the tokenizer: character level tokenizer. In fact many early NLP research were conducted using character level tokenization.

Notice the CharTokenizer class below. For the rest of the homework, your implementation will follow a similar structure.

In [5]:
class CharTokenizer:
    def __init__(self):
        """
        All of your tokenizer implementation should have tokenizer_map attribute
        tokenizer_map is a dictionary that maps individual numbers to corresponding tokens
        map_of_char maps individual characters to corresponding token indices. In this particular case, it is the reverse of tokenizer_map; however, it is not always the case. map_of_char is an optional attribute for your tokenizer implementation. You may not need it for your implementation.
        """
        self.tokenizer_map = None
        self.map_of_char = None


    def train(self, text):

        self.character_set = set(text)
        self.tokenizer_map = {i: char for i, char in enumerate(self.character_set)}
        self.map_of_char = {char: i for i, char in self.tokenizer_map.items()}

    def encode_text(self, text):
        return [self.map_of_char[char] for char in text]

    def decode_text(self, encoded_text):
        return "".join([list(self.map_of_char.keys())[i] for i in encoded_text])

chat_tokenizer = CharTokenizer()
chat_tokenizer.train(text_)

Now see the usage:

In [6]:
# First, initialize the tokenizer
char_tokenizer = CharTokenizer()

# Then train the tokenizer on the Emma text
char_tokenizer.train(text)

# Now encode some texts using the trained tokenizer
enc = char_tokenizer.encode_text(text_) # notice that we are using text_ here

# Now decode the encoded text
print(f"Decoded text:\n\n{char_tokenizer.decode_text(enc)}")

# If the decoded text is the same as the original text, then the implementation is correct
# use this sanity check to verify all of your implementations
assert char_tokenizer.decode_text(char_tokenizer.encode_text(text_)) == text_

Decoded text:

Democracy is a conversation. Its function and survival depend on the available information technology. For most of history, no technology existed for holding large-scale conversations among millions of people. In the premodern world, democracies existed only in small city-states like Rome and Athens, or in even smaller tribes. Once a polity grew large, the democratic conversation collapsed, and authoritarianism remained the only alternative.

Large-scale democracies became feasible only after the rise of modern information technologies like the newspaper, the telegraph and the radio. The fact that modern democracy has been built on top of modern information technologies means that any major change in the underlying technology is likely to result in a political upheaval.

This partly explains the current worldwide crisis of democracy. In the United States, Democrats and Republicans can hardly agree on even the most basic facts, such as who won the 2020 presidential electi

However, in reality, we do not use character level tokenization. What we use is subword level tokenization. This is why you will implement one of the most effective and popular subword level tokenization method called BPE algorithm.


First, you will implement the vanilla version of it.

## 1. Vanilla BPETokenizer:
This version does *not* have a preprocessing step. This is also called SentencePiece tokenization (do not confuse with the library with the same name) and was introduced into [this paper](https://arxiv.org/pdf/1808.06226)
Check out this pseudocode for a clearer understanding:

![Vanilla BPETokenizer](./images/vanilla%20BPE.png)

Note that when multiple bigrams have same frequency, ties can be broken by picking a random one.

In [7]:
class BPETokenizer:
    def __init__(self):
        """
        READ THIS DESCRIPTION:

        map_of_merged_tokens: is a dictionary that maps the merged bigram to a single token. According to the notations in the pseudocode above, it is a dictionary with keys (v_i, v_j) and values v_n. You need to keep this dictionary updated as you merge tokens in the training step. This dictionary is needed when decoding. When done right, it should look something like {(5, 68): 92, (68, 23): 93, (68, 35): 94, ...}. Here it means that token 5 and 68 were merged to form token 92, 68 and 23 were merged to form token 93, and so on. *The order in the dictionary refers to the order of merging.*
        tokenizer_map: is a dictionary that maps token numbers to their corresponding tokens (subwords). Its length should be equal to the target_vocab_size. After a successful training, tokenizer_map might look something like {..., '995: 'uten',996: 'ion of ', 997: ' turn', ...} Notice that this behavior is opposite to get_vocab method in huggingface tokenizers, where the subwords are the keys and the values are the token numbers.
        map_of_char: maps individual characters to corresponding token indices. Optional. Use it only if your implementation needs.
        """
        self.map_of_merged_tokens = None
        self.tokenizer_map = None
        self.map_of_char = None

    @staticmethod
    def replace_pairs(lst:list, bigram:tuple, c: int):

        """
        Helper function to replace bigram with a single token in the list.
        Whether to use this particular function or not depends on *your* implementations.
        It is possible that you may not need this function at all.
        """
        i = 0
        while i < len(lst) - 1:
            if lst[i] == bigram[0] and lst[i + 1] == bigram[1]:
                lst[i] = c
                del lst[i + 1]
            else:
                i += 1
        return lst

    def train(self, text: str, target_vocab_size: int):
      self.target_vocab_size = target_vocab_size
      character_set = set(text)
      vocabsize = len(character_set)
      assert self.target_vocab_size > vocabsize, "target vocab size must be greater than the number of unique characters in the text"

      self.map_of_char = {char: i for i, char in enumerate(character_set)}
      map_of_merged_tokens = {}
      tokenized = [self.map_of_char[char] for char in text]

      current_token_id = max(self.map_of_char.values()) + 1
      while vocabsize < target_vocab_size:
          bigram_counts = Counter((tokenized[i], tokenized[i + 1]) for i in range(len(tokenized) - 1))

          # Checking to see if bigram_counts is empty
          if not bigram_counts:
              break

          most_common_frequency = bigram_counts.most_common(1)[0][1]  #finding most common frequency
          most_frequent_bigrams = [bigram for bigram, count in bigram_counts.items() if count == most_common_frequency]

          # Another check to ensure there is at least one frequent bigram
          if not most_frequent_bigrams:
              break

          selected_bigram = random.choice(most_frequent_bigrams)
          map_of_merged_tokens[selected_bigram] = current_token_id
          tokenized = self.replace_pairs(tokenized, selected_bigram, current_token_id)

          if self.tokenizer_map is None:
              self.tokenizer_map = {}
          self.tokenizer_map[current_token_id] = selected_bigram
          current_token_id += 1
          vocabsize += 1

      self.map_of_merged_tokens = map_of_merged_tokens
      self.tokenizer_map = self.tokenizer_map

    def encode_text(self, text: str):

        ### TO DO: Write the implementation of the encode_text function ###
        """Given a text, return the tokenized text as a list of integers"""
        tokenized_text = [self.map_of_char[char] for char in text]
        tokenized = []

        i = 0
        while i < len(tokenized_text):
            if i < len(tokenized_text) - 1:
                bigram = (tokenized_text[i], tokenized_text[i + 1])
                if bigram in self.map_of_merged_tokens:
                    tokenized.append(self.map_of_merged_tokens[bigram])
                    i += 2
                    continue
            tokenized.append(tokenized_text[i])
            i += 1

        ### END ###
        return tokenized

    def decode_text(self, tokenized_text: list):
        """Given a tokenized text, return the original text as a string"""
        ### TO DO: Write the implementation of the decode_text function ###
        decoded = []
        reverse_map = {v: k for k, v in self.map_of_char.items()}

        for token in tokenized_text:
            if token in self.tokenizer_map:
                subword = self.tokenizer_map[token]
                decoded.extend(reverse_map[char] for char in subword)
            elif token in reverse_map:
                decoded.append(reverse_map[token])

        return "".join(decoded)
        ### END ###
        return decoded




In [8]:
tokenizer = BPETokenizer()

tokenizer.train(text_, 100) # just for checking if the training is running or not

In [9]:
encoded_text = tokenizer.encode_text("Life is beautiful")
tokenizer.decode_text(encoded_text)
assert tokenizer.decode_text(tokenizer.encode_text("Life is beautiful")) == "Life is beautiful"

### 2. BPE Tokenizer with preprocessing

However, in reality we apply a preprocessing step before the BPE tokenization pipeline.

In the preprocessing step, we split the text with a complex regex pattern to separate the texts on the space and punctuation boundary.

To see the effect of the regex, see this example:

In [10]:
# Uncomment and run this code:

re.findall(r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""", text_)

['Democracy',
 ' is',
 ' a',
 ' conversation',
 '.',
 ' Its',
 ' function',
 ' and',
 ' survival',
 ' depend',
 ' on',
 ' the',
 ' available',
 ' information',
 ' technology',
 '.',
 ' For',
 ' most',
 ' of',
 ' history',
 ',',
 ' no',
 ' technology',
 ' existed',
 ' for',
 ' holding',
 ' large',
 '-scale',
 ' conversations',
 ' among',
 ' millions',
 ' of',
 ' people',
 '.',
 ' In',
 ' the',
 ' premodern',
 ' world',
 ',',
 ' democracies',
 ' existed',
 ' only',
 ' in',
 ' small',
 ' city',
 '-states',
 ' like',
 ' Rome',
 ' and',
 ' Athens',
 ',',
 ' or',
 ' in',
 ' even',
 ' smaller',
 ' tribes',
 '.',
 ' Once',
 ' a',
 ' polity',
 ' grew',
 ' large',
 ',',
 ' the',
 ' democratic',
 ' conversation',
 ' collapsed',
 ',',
 ' and',
 ' authoritarianism',
 ' remained',
 ' the',
 ' only',
 ' alternative',
 '.\n\n',
 'Large',
 '-scale',
 ' democracies',
 ' became',
 ' feasible',
 ' only',
 ' after',
 ' the',
 ' rise',
 ' of',
 ' modern',
 ' information',
 ' technologies',
 ' like',
 ' the'

After splitting the whole dataset into small chunks, we apply the algorithm on each chunk, that is, when we count the bigrams, we do not count bigrams from two neigboring chunks. See the pseudocode below:
![BPETokenizer Preprocess](./images/BPE_preprocess.png)

However, note that here the example shows everything on "text" space ("hug", "pug", "i" etc) for the sake of interpretability. In reality, we do these merging on the token space (v1, v2, v3 ...), ie, on the token ids.  

In [65]:
class BPETokenizerwithPreprocessing:
    def __init__(self):
        self.target_vocab_size = None
        self.map_of_merged_tokens = None
        self.tokenizer_map = None
        self.map_of_char = None
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
        # this is the split boundary pattern used in GPT4 tokenizer. Feel free to ask ChatGPT to explain what this regex pattern does.



    @staticmethod
    def replace_pairs(lst:list, bigram:tuple, c: int):

        """
        Helper function to replace bigram with a single token in the list.
        Whether to use this particular function or not depends on *your* implementation of the tokenizer.
        It is possible that you may not need this function at all.
        """
        i = 0
        while i < len(lst) - 1:
            if lst[i] == bigram[0] and lst[i + 1] == bigram[1]:
                lst[i] = c
                del lst[i + 1]
            else:
                i += 1
        return lst


    ### OPTIONAL (not graded) ####

    # If you need, you can define another helper function to replace bigrams in the list of lists.


    #### END ####
    def preprocess_text(self, text: str):
        return re.findall(self.split_pattern, text)

    def train(self, text: str, target_vocab_size: int):

        self.target_vocab_size = target_vocab_size
        character_set = set(text)
        vocabsize = len(character_set)
        map_of_char = {char: i for i, char in enumerate(character_set)}
        self.map_of_char = map_of_char
        map_of_merged_tokens = {}
        print(f"Preprocessing the text")
        preprocessed = self.preprocess_text(text)

        #### TO DO: Write the loop and finish the implementation ####

        self.tokenizer_map = {i: char for char, i in map_of_char.items()}
        current_token_id = max(map_of_char.values()) + 1
        tokenized_chunks = [[map_of_char[char] for char in chunk] for chunk in preprocessed]
        while vocabsize < target_vocab_size:
            bigram_counts = Counter()
            for chunk in tokenized_chunks:
                bigram_counts.update((chunk[i], chunk[i + 1]) for i in range(len(chunk) - 1))

            if not bigram_counts:
                break  # No more bigrams to process
            most_frequent_bigram, _ = bigram_counts.most_common(1)[0]
            for i, chunk in enumerate(tokenized_chunks):
                tokenized_chunks[i] = self.replace_pairs(chunk, most_frequent_bigram, current_token_id)
            map_of_merged_tokens[most_frequent_bigram] = current_token_id
            self.tokenizer_map[current_token_id] = most_frequent_bigram
            current_token_id += 1
            vocabsize += 1

        self.map_of_merged_tokens = map_of_merged_tokens
        self.tokenizer_map = self.tokenizer_map # set this to the tokenizer_map you created in the loop above

        #### END ####
        self.map_of_merged_tokens = map_of_merged_tokens
        self.tokenizer_map =self.tokenizer_map # set this to the tokenizer_map you created in the loop above


    def encode_text(self, text: str):
        """Given a text, return the tokenized text"""
        ### TO DO: Write the implementation of the encode_text function ###
        tokenized = []
        pos = 0
        while pos < len(text):
            match = None
            for end in range(len(text), pos, -1):
                subword = text[pos:end]
                token_id = next((tid for tid, tsub in self.tokenizer_map.items() if tsub == subword), None)
                if token_id is not None:
                    match = token_id
                    pos = end
                    break
            if match is None:
                raise ValueError(f"Subword '{text[pos]}' not found in tokenizer vocabulary.")
            tokenized.append(match)
          #### END ####
        return tokenized

    def decode_text(self, tokenized_text: list):
        """Given a tokenized text, return the original text"""
        ### TO DO: Write the implementation of the encode_text function ###
        decoded_subwords = []
        for token_id in tokenized_text:
            decoded_subwords.append(self.tokenizer_map[token_id])
        decode_text = "".join(decoded_subwords)
        #### END ####
        return decode_text




In [12]:
tokenizer = BPETokenizerwithPreprocessing()
tokenizer.train(text, 500)
encoded = tokenizer.encode_text("Life is beautiful")
tokenizer.decode_text(encoded)


Preprocessing the text


'Life is beautiful'

In [15]:
#encoded = tokenizer.encode_text("인생은 아름답다") # if you try Korean text, it will raise an error

ValueError: Subword '인' not found in tokenizer vocabulary.

### Tokenizing in bytespace

So far we have only worked with English texts.

But of course, there are thousands of other languages in the world, and not all languages can be represented in the way English is represented so far. This is more true for character based langauges like Chinese. Even for Korean language, building from the character level [like the way we did in English] is not possible. Which is why we need an universal way to represent languages of various scripts. Utf-8 provides such a way.

Utf-8 is an encoding method that encodes **all** human written languages into variable length byte sequece of up to 4 bytes (256 bits). One way to think of it is to have an universal "alphabet" for all human written language, where the alphabet size is 256.

However, the alphabet analogy is not quite right. To see why, see below:

In [16]:
# Let's play with uft-8 encoding
s = "Life"
print(f"{s} is encoded as: {list(s.encode('utf-8'))}")
print(f"L is encoded as: {list('L'.encode('utf-8'))}")
print(f"i is encoded as: {list('i'.encode('utf-8'))}")
print(f"f is encoded as: {list('f'.encode('utf-8'))}")
print(f"e is encoded as: {list('e'.encode('utf-8'))}")


s = "생일"
print(f"{s} is encoded as: {list(s.encode('utf-8'))}")
print(f"생 is encoded as: {list('생'.encode('utf-8'))}")
print(f"ㅅ is encoded as: {list('ㅅ'.encode('utf-8'))}")
print(f"새 is encoded as: {list('새'.encode('utf-8'))}")
print(f"이 is encoded as: {list('이'.encode('utf-8'))}")
print(f"일 is encoded as: {list('일'.encode('utf-8'))}")

s = "🎂"
print(f"{s} is encoded as: {list(s.encode('utf-8'))}")

s= "🤞"
print(f"{s} is encoded as: {list(s.encode('utf-8'))}")

Life is encoded as: [76, 105, 102, 101]
L is encoded as: [76]
i is encoded as: [105]
f is encoded as: [102]
e is encoded as: [101]
생일 is encoded as: [236, 131, 157, 236, 157, 188]
생 is encoded as: [236, 131, 157]
ㅅ is encoded as: [227, 133, 133]
새 is encoded as: [236, 131, 136]
이 is encoded as: [236, 157, 180]
일 is encoded as: [236, 157, 188]
🎂 is encoded as: [240, 159, 142, 130]
🤞 is encoded as: [240, 159, 164, 158]


Did you see that utf-8 can also encode emojies and many languages? However, also note that the mapping from each alphabet to utf-8 encoding is not always very intuitive (look at the Korean example carefully). The reason English alphabet takes up lower values than non-English alphabet is because utf-8 is backward compatible (ASCII). For all non English languages, the encoded lists are usually long.

Let's look at some more examples:

In [17]:
# decoding
encoded = [236, 131, 157, 236, 157, 188]
print(f"Decoded: {bytes(encoded).decode('utf-8', errors='replace')}")

encoded = [150, 131, 157, 236, 157, 188]
print(f"Decoded: {bytes(encoded).decode('utf-8', errors='replace')}")

# the second one is not a valid utf-8 encoding, so it will be replaced with a question mark
# if your language model is not strong enough in modeling lanuguages, it will through more errors.


Decoded: 생일
Decoded: ���일


To know more about utf-8 encoding, check out the corresponding Wikipedia page.

Because of the obvious advantage of using utf-8 encoding, when we train a real encoder, we first project all the data into byte-space and then do the tokenizer training.

### 4. ByteSpacePreprocessed
BPE algorithm on byte space, however, the text first go through the regex preprocessing. Make sure to do the preprocessing first.

Make sure to do the preprocessing **before** the byte encoding (and not after).

In [25]:
class BPETokenizerByteSpacePreprocessed:
    def __init__(self):

        self.map_of_merged_tokens = None
        self.tokenizer_map = None
        self.map_of_char = None
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""


    ### OPTIONAL (not graded) ####

    # If you need, you can define another helper function to replace bigrams in the list of lists.


    #### END ####

    @staticmethod
    def replace_pairs(lst:list, bigram:tuple, c: int):

        """
        Helper function to replace bigram with a single token in the list.
        Whether to use this particular function or not depends on *your* implementation of the tokenizer.
        It is possible that you may not need this function at all.
        """
        i = 0
        while i < len(lst) - 1:
            if lst[i] == bigram[0] and lst[i + 1] == bigram[1]:
                lst[i] = c
                del lst[i + 1]
            else:
                i += 1
        return lst

    def preprocess_text(self, text: str):
        return re.findall(self.split_pattern, text)

    def train(self, text: str, target_vocab_size: int):
        self.target_vocab_size = target_vocab_size
        character_set = set(range(256))  # All possible byte values
        vocabsize = len(character_set)
        assert self.target_vocab_size > vocabsize, "target vocab size must be greater than 256"

        map_of_char = {char: i for i, char in enumerate(character_set)}
        self.map_of_char = map_of_char

        preprocessed = self.preprocess_text(text)
        byte_chunks = [list(chunk.encode("utf-8")) for chunk in preprocessed]

        self.tokenizer_map = {i: (char,) for char, i in map_of_char.items()}

        map_of_merged_tokens = {}
        current_token_id = max(map_of_char.values()) + 1
        while vocabsize < target_vocab_size:
            bigram_counts = Counter()
            for chunk in byte_chunks:
                bigram_counts.update((chunk[i], chunk[i + 1]) for i in range(len(chunk) - 1))

            if not bigram_counts:
                break  # No more bigrams to merge

            most_frequent_bigram, _ = bigram_counts.most_common(1)[0]

            # Updating all chunks with the new merged token because necessary
            for i, chunk in enumerate(byte_chunks):
                byte_chunks[i] = self.replace_pairs(chunk, most_frequent_bigram, current_token_id)

            # Update maps w new merged token
            map_of_merged_tokens[most_frequent_bigram] = current_token_id
            self.tokenizer_map[current_token_id] = most_frequent_bigram  # Store as tuple of ints

            current_token_id += 1
            vocabsize += 1
        ### END ###
        self.map_of_merged_tokens = map_of_merged_tokens
        self.tokenizer_map = self.tokenizer_map


    def encode_text(self, text: str):
        """Given a text, return the tokenized text"""
        ### Implement the encode_text function ###
        # You only need to sligthly modify existing code in your previous implementation
        """Given a text, return the tokenized text"""
        preprocessed = self.preprocess_text(text)
        byte_tokens = [list(chunk.encode("utf-8")) for chunk in preprocessed]

        tokenized = []
        for chunk in byte_tokens:
            i = 0
            while i < len(chunk):
                if i < len(chunk) - 1:
                    bigram = (chunk[i], chunk[i + 1])
                    if bigram in self.map_of_merged_tokens:
                        tokenized.append(self.map_of_merged_tokens[bigram])
                        i += 2
                        continue
                tokenized.append(chunk[i])
                i += 1
        ### END ###
        return tokenized

    def decode_text(self, tokenized_text: list):
        """Given a tokenized text, return the original text"""

        ### TO DO: Write the implementation of the decode_text function ###
        # make sure to use errors='replace' when decoding

        decoded_bytes = []
        for token_id in tokenized_text:
            token_value = self.tokenizer_map.get(token_id)
            if token_value is not None:
                decoded_bytes.extend(token_value)  # token_value is a tuple of byte values
            else:
                pass
        decoded_text = bytes(decoded_bytes).decode("utf-8", errors="replace")
        return decoded_text


        ### END ###




In [26]:
# Sanity check

with open("./kor_Emma.txt", "r") as f: # grab some Korean text from Wikipedia
    kor_text = f.read()

tokenizer = BPETokenizerByteSpacePreprocessed()
tokenizer.train(kor_text, 500)  # train the tokenizer on the Korean text
encoded = tokenizer.encode_text("Life is beautiful 🤞") # still manages to tokenize English
decoded = tokenizer.decode_text(encoded)

print(f"Decoded: {decoded}")
encoded = tokenizer.encode_text("Life is beautiful 🤞")
print(f"Encoded: {encoded}")  # Inspect the tokenized output

decoded = tokenizer.decode_text(encoded)
print(f"Decoded (with repr): {repr(decoded)}")

print(f"Decoded (with repr): {repr(decoded)}")  # Use repr to show any hidden whitespace characters

assert decoded == "Life is beautiful 🤞"

Decoded: Life is beautiful 🤞
Encoded: [76, 105, 102, 101, 32, 105, 115, 32, 98, 101, 97, 117, 116, 105, 102, 117, 108, 32, 240, 159, 164, 158]
Decoded (with repr): 'Life is beautiful 🤞'
Decoded (with repr): 'Life is beautiful 🤞'


In [27]:
# Try with Korean text;

encoded = tokenizer.encode_text("생일 축하해")
print(f"Encoded Korean tokens: {encoded}")  # Check the tokenized output

decoded = tokenizer.decode_text(encoded)
print(f"Decoded Korean text (with repr): {repr(decoded)}")  # Display any hidden differences

assert decoded == "생일 축하해"
# and of course it works with Korean as well!

Encoded Korean tokens: [236, 337, 257, 188, 256, 182, 149, 261, 152, 261, 180]
Decoded Korean text (with repr): '생일 축하해'


In [None]:
### TO DO ###

# What are some distinctions you notice between the tokens produced by byte-level BPE with preprocessing vs vanilla BPE?

# Train tokenizers using the Emma text first and do some analysis
# Feel free to run some codes and show some examples, plot graphs etc.
# citing papers/other resources is also fine, but make sure to explain the distinctions in your own words

After doing the comparison I find out that tokens are often readable subwords or characters, such as 'e', 'th', 'ing' under Vanilla BPE. A small number of tokens have very high frequencies, corresponding to common subwords. But for Byte-Level BPE tokens are byte sequences that may not directly correspond to readable characters, e.g., 'T', 'h', 'is'. There is a more uniform usage of tokens due to byte-level granularity. So in summary Vanilla BE is effective in producing meaningful tokens, which can be advantageous for language modeling tasks that benefit from understanding word structure. However, it may struggle with multilingual text not present in the training data.
Byte-Level BPE with Preprocessing is good across different languages and character sets, as it operates at the byte level but then the sequences are longer. So what you want to use really depends on what your goal is.

In [None]:
eval_text_en = """“Ever since the day—about four years ago—that Miss Taylor and I met
with him in Broadway Lane, when, because it began to drizzle, he darted
away with so much gallantry, and borrowed two umbrellas for us from
Farmer Mitchell’s, I made up my mind on the subject. I planned the
match from that hour; and when such success has blessed me in this
instance, dear papa, you cannot think that I shall leave off
match-making.”

“I do not understand what you mean by ‘success,’” said Mr. Knightley.
“Success supposes endeavour. Your time has been properly and delicately
spent, if you have been endeavouring for the last four years to bring
about this marriage. A worthy employment for a young lady’s mind! But
if, which I rather imagine, your making the match, as you call it,
means only your planning it, your saying to yourself one idle day, ‘I
think it would be a very good thing for Miss Taylor if Mr. Weston were
to marry her,’ and saying it again to yourself every now and then
afterwards, why do you talk of success? Where is your merit? What are
you proud of? You made a lucky guess; and _that_ is all that can be
said.”

“And have you never known the pleasure and triumph of a lucky guess?—I
pity you.—I thought you cleverer—for, depend upon it a lucky guess is
never merely luck. There is always some talent in it. And as to my poor
word ‘success,’ which you quarrel with, I do not know that I am so
entirely without any claim to it. You have drawn two pretty pictures;
but I think there may be a third—a something between the do-nothing and
the do-all. If I had not promoted Mr. Weston’s visits here, and given
many little encouragements, and smoothed many little matters, it might
not have come to any thing after all. I think you must know Hartfield
enough to comprehend that.”"""

eval_text_kor = """"4년 전쯤 테일러 양과 내가 브로드웨이 레인에서 그를 만난 날부터, 이슬비가 내리기 시작하자 그는 매우 용감하게 달려가서
Farmer Mitchell's에서 우산 두 개를 빌렸을 때부터, 나는 그 문제에 대해 마음먹었습니다. 나는 그 순간부터
결혼을 계획했습니다. 그리고 이런 성공이 이 경우에 나에게 축복이 되었을 때, 사랑하는 아빠, 당신은 내가
결혼을 그만둘 것이라고 생각할 수 없습니다."

"나는 당신이 '성공'이라는 말로 무슨 뜻인지 이해하지 못합니다." 나이틀리 씨가 말했습니다.
"성공은 노력을 전제로 합니다. 지난 4년 동안 이 결혼을 이루기 위해 노력했다면, 당신의 시간은 적절하고 신중하게
보내졌습니다. 젊은 여성의 마음에 가치 있는 일이었습니다! 하지만
만약 당신이 결혼을 한다고 상상한다면, 당신이 말하는 대로
그저 계획하는 것일 뿐이고, 어느 날 한가한 시간에 ‘웨스턴 씨가 테일러 양과 결혼하면 아주 좋은 일이 될 것 같아’라고 스스로에게 말하고, 그 후로 가끔씩 스스로에게 다시 말하는 것일 뿐이라면, 왜 성공에 대해 이야기하는 거지? 당신의 공로는 어디에 있지? 당신은 무엇을 자랑스러워하는 거지? 당신은 행운의 추측을 했고; _그게_ 말할 수 있는 전부야.

"그리고 당신은 행운의 추측의 즐거움과 승리를 결코 알지 못했니?—나는 당신을
불쌍히 여긴다.—나는 당신이 더 똑똑하다고 생각했다—왜냐하면, 행운의 추측은
결코 단순한 행운이 아니기 때문이다. 항상 어떤 재능이 거기에 있다. 그리고 당신이 다투는 나의 형편없는
단어 ‘성공’에 대해 말하자면, 나는 내가 그것에 대한 권리가 전혀 없다는 것을 모른다. 당신은 두 개의 예쁜 그림을 그렸다.
하지만 나는 세 번째 그림이 있을 수 있다고 생각한다—아무것도 하지 않는 것과
모든 것을 하는 것 사이의 무언가. 내가 웨스턴 씨의 방문을 홍보하지 않았고,
많은 작은 격려를 하지 않았고, 많은 작은 문제들을 해결하지 않았다면,
결국 아무것도 이루어지지 않았을지도 모릅니다. 당신은 하트필드를
충분히 알고 있을 테니 그걸 이해할 수 있을 겁니다."""

In [56]:
tokenizer1 = BPETokenizer()
tokenizer2 = BPETokenizerByteSpacePreprocessed()
tokenizer1.train(eval_text_en, 500)
tokenizer2.train(eval_text_en, 500)

In [57]:
test_words = ["KAIST", "tree", "déjà vu", "こんにちは"]

for word in test_words:
    print(f"\nTesting word: '{word}'")
    # Vanilla BPE Tokenization
    vanilla_encoded = tokenizer1.encode_text(word)
    vanilla_decoded = tokenizer1.decode_text(vanilla_encoded)
    print(f"Vanilla BPE Tokens: {vanilla_encoded}")
    print(f"Vanilla Decoded: '{vanilla_decoded}'")

    # Byte-Level BPE Tokenization
    byte_encoded = tokenizer2.encode_text(word)
    byte_decoded = tokenizer2.decode_text(byte_encoded)
    print(f"Byte-Level BPE Tokens: {byte_encoded}")
    print(f"Byte-Level Decoded: '{byte_decoded}'")


Testing word: 'KAIST'
Vanilla BPE Tokens: [48, 9, 23, 20, 40]
Vanilla Decoded: 'KAIST'
Byte-Level BPE Tokens: [75, 65, 73, 83, 84]
Byte-Level Decoded: 'KAIST'

Testing word: 'tree'
Vanilla BPE Tokens: [27, 12, 35, 35]
Vanilla Decoded: 'tree'
Byte-Level BPE Tokens: [116, 367, 101]
Byte-Level Decoded: 'tree'

Testing word: 'déjà vu'


KeyError: 'é'

# Multilingual Tokenization analysis:

You are given two files in the ./multilingual-data folder. The English one is a chunk of the original English Emma novel. The Korean one is machine translated version of that. Now you will think about some properties of the tokenizers when trained on different corpus.

In [24]:
with open("./kor_Emma.txt", "r") as f:
    kor_emma = f.read()

with open("./en_Emma.txt", "r") as f:
    eng_emma = f.read()


In [32]:
eval_text_en = """“Ever since the day—about four years ago—that Miss Taylor and I met
with him in Broadway Lane, when, because it began to drizzle, he darted
away with so much gallantry, and borrowed two umbrellas for us from
Farmer Mitchell’s, I made up my mind on the subject. I planned the
match from that hour; and when such success has blessed me in this
instance, dear papa, you cannot think that I shall leave off
match-making.”

“I do not understand what you mean by ‘success,’” said Mr. Knightley.
“Success supposes endeavour. Your time has been properly and delicately
spent, if you have been endeavouring for the last four years to bring
about this marriage. A worthy employment for a young lady’s mind! But
if, which I rather imagine, your making the match, as you call it,
means only your planning it, your saying to yourself one idle day, ‘I
think it would be a very good thing for Miss Taylor if Mr. Weston were
to marry her,’ and saying it again to yourself every now and then
afterwards, why do you talk of success? Where is your merit? What are
you proud of? You made a lucky guess; and _that_ is all that can be
said.”

“And have you never known the pleasure and triumph of a lucky guess?—I
pity you.—I thought you cleverer—for, depend upon it a lucky guess is
never merely luck. There is always some talent in it. And as to my poor
word ‘success,’ which you quarrel with, I do not know that I am so
entirely without any claim to it. You have drawn two pretty pictures;
but I think there may be a third—a something between the do-nothing and
the do-all. If I had not promoted Mr. Weston’s visits here, and given
many little encouragements, and smoothed many little matters, it might
not have come to any thing after all. I think you must know Hartfield
enough to comprehend that.”"""

eval_text_kor = """"4년 전쯤 테일러 양과 내가 브로드웨이 레인에서 그를 만난 날부터, 이슬비가 내리기 시작하자 그는 매우 용감하게 달려가서
Farmer Mitchell's에서 우산 두 개를 빌렸을 때부터, 나는 그 문제에 대해 마음먹었습니다. 나는 그 순간부터
결혼을 계획했습니다. 그리고 이런 성공이 이 경우에 나에게 축복이 되었을 때, 사랑하는 아빠, 당신은 내가
결혼을 그만둘 것이라고 생각할 수 없습니다."

"나는 당신이 '성공'이라는 말로 무슨 뜻인지 이해하지 못합니다." 나이틀리 씨가 말했습니다.
"성공은 노력을 전제로 합니다. 지난 4년 동안 이 결혼을 이루기 위해 노력했다면, 당신의 시간은 적절하고 신중하게
보내졌습니다. 젊은 여성의 마음에 가치 있는 일이었습니다! 하지만
만약 당신이 결혼을 한다고 상상한다면, 당신이 말하는 대로
그저 계획하는 것일 뿐이고, 어느 날 한가한 시간에 ‘웨스턴 씨가 테일러 양과 결혼하면 아주 좋은 일이 될 것 같아’라고 스스로에게 말하고, 그 후로 가끔씩 스스로에게 다시 말하는 것일 뿐이라면, 왜 성공에 대해 이야기하는 거지? 당신의 공로는 어디에 있지? 당신은 무엇을 자랑스러워하는 거지? 당신은 행운의 추측을 했고; _그게_ 말할 수 있는 전부야.

"그리고 당신은 행운의 추측의 즐거움과 승리를 결코 알지 못했니?—나는 당신을
불쌍히 여긴다.—나는 당신이 더 똑똑하다고 생각했다—왜냐하면, 행운의 추측은
결코 단순한 행운이 아니기 때문이다. 항상 어떤 재능이 거기에 있다. 그리고 당신이 다투는 나의 형편없는
단어 ‘성공’에 대해 말하자면, 나는 내가 그것에 대한 권리가 전혀 없다는 것을 모른다. 당신은 두 개의 예쁜 그림을 그렸다.
하지만 나는 세 번째 그림이 있을 수 있다고 생각한다—아무것도 하지 않는 것과
모든 것을 하는 것 사이의 무언가. 내가 웨스턴 씨의 방문을 홍보하지 않았고,
많은 작은 격려를 하지 않았고, 많은 작은 문제들을 해결하지 않았다면,
결국 아무것도 이루어지지 않았을지도 모릅니다. 당신은 하트필드를
충분히 알고 있을 테니 그걸 이해할 수 있을 겁니다."""

TESTING

In [28]:
tokenizer_kr = BPETokenizerByteSpacePreprocessed()
tokenizer_kr.train(kor_emma, target_vocab_size=1000)

In [58]:
tokenizer_en = BPETokenizerByteSpacePreprocessed()
tokenizer_en.train(eng_emma, target_vocab_size=1000)  # This is where i adjust vocab size for testing

In [29]:
combined_text = eng_emma + "\n" + kor_emma
tokenizer_combined = BPETokenizerByteSpacePreprocessed()
tokenizer_combined.train(combined_text, target_vocab_size=1000)

In [60]:
sample_text_en = eval_text_en
encoded_en = tokenizer_en.encode_text(sample_text_en)
decoded_en = tokenizer_en.decode_text(encoded_en)
assert decoded_en == sample_text_en
print("English tokenizer works correctly on English text.")

English tokenizer works correctly on English text.


In [59]:
sample_text_kr = eval_text_kor
encoded_kr = tokenizer_kr.encode_text(sample_text_kr)
decoded_kr = tokenizer_kr.decode_text(encoded_kr)
assert decoded_kr == sample_text_kr
print("Korean tokenizer works correctly on Korean text.")

Korean tokenizer works correctly on Korean text.


In [61]:
encoded_combined_en = tokenizer_combined.encode_text(sample_text_en)
decoded_combined_en = tokenizer_combined.decode_text(encoded_combined_en)
assert decoded_combined_en == sample_text_en
print("Combined tokenizer works correctly on English text.")
encoded_combined_kr = tokenizer_combined.encode_text(sample_text_kr)
decoded_combined_kr = tokenizer_combined.decode_text(encoded_combined_kr)
assert decoded_combined_kr == sample_text_kr
print("Combined tokenizer works correctly on Korean text.")

Combined tokenizer works correctly on English text.
Combined tokenizer works correctly on Korean text.


In [62]:
vocab_en = tokenizer_en.tokenizer_map
print("\nSample tokens from English tokenizer:")
for token_id in list(vocab_en.keys())[:20]:
    token = vocab_en[token_id]
    token_bytes = bytes(token)
    token_str = token_bytes.decode('utf-8', errors='replace')
    print(f"Token ID {token_id}: {repr(token_str)}")


Sample tokens from English tokenizer:
Token ID 0: '\x00'
Token ID 1: '\x01'
Token ID 2: '\x02'
Token ID 3: '\x03'
Token ID 4: '\x04'
Token ID 5: '\x05'
Token ID 6: '\x06'
Token ID 7: '\x07'
Token ID 8: '\x08'
Token ID 9: '\t'
Token ID 10: '\n'
Token ID 11: '\x0b'
Token ID 12: '\x0c'
Token ID 13: '\r'
Token ID 14: '\x0e'
Token ID 15: '\x0f'
Token ID 16: '\x10'
Token ID 17: '\x11'
Token ID 18: '\x12'
Token ID 19: '\x13'


In [64]:
vocab_kr = tokenizer_kr.tokenizer_map

print("\nSample tokens from Korean tokenizer:")
for token_id in list(vocab_kr.keys())[:20]:
    token = vocab_kr[token_id]
    token_bytes = bytes(token)
    token_str = token_bytes.decode('utf-8', errors='replace')
    print(f"Token ID {token_id}: {repr(token_str)}")


Sample tokens from Korean tokenizer:
Token ID 0: '\x00'
Token ID 1: '\x01'
Token ID 2: '\x02'
Token ID 3: '\x03'
Token ID 4: '\x04'
Token ID 5: '\x05'
Token ID 6: '\x06'
Token ID 7: '\x07'
Token ID 8: '\x08'
Token ID 9: '\t'
Token ID 10: '\n'
Token ID 11: '\x0b'
Token ID 12: '\x0c'
Token ID 13: '\r'
Token ID 14: '\x0e'
Token ID 15: '\x0f'
Token ID 16: '\x10'
Token ID 17: '\x11'
Token ID 18: '\x12'
Token ID 19: '\x13'


In [41]:
vocab_combined = tokenizer_combined.tokenizer_map

print("\nSample tokens from combined tokenizer:")
for token_id in list(vocab_combined.keys())[:20]:
    token = vocab_combined[token_id]
    token_bytes = bytes(token)
    token_str = token_bytes.decode('utf-8', errors='replace')
    print(f"Token ID {token_id}: {repr(token_str)}")


Sample tokens from combined tokenizer:
Token ID 0: '\x00'
Token ID 1: '\x01'
Token ID 2: '\x02'
Token ID 3: '\x03'
Token ID 4: '\x04'
Token ID 5: '\x05'
Token ID 6: '\x06'
Token ID 7: '\x07'
Token ID 8: '\x08'
Token ID 9: '\t'
Token ID 10: '\n'
Token ID 11: '\x0b'
Token ID 12: '\x0c'
Token ID 13: '\r'
Token ID 14: '\x0e'
Token ID 15: '\x0f'
Token ID 16: '\x10'
Token ID 17: '\x11'
Token ID 18: '\x12'
Token ID 19: '\x13'


TESTING END

In [None]:
### TODO: Open Ended

# Play around with the BPETokenizerByteSpacePreprocessed that you implemented using the Korean and English Emma data, which you can use for training the tokenizers, and write down if you see anything interesting.

# What does your findings imply for designing multilingual tokenizers?

# Feel free to nudge vocabulary size as well. You can use the eval_text in the above cell for validation.

# Do not hesitate to look into the actual tokens and see how much of it makes sense.

Seeing these results, I see that tokenizer performance depends on data as the tokenizer performs best on the data it was trained on. Also for english, the tokenizer merges frequent byte pairs corresponding to common English letter combinations (e.g., 'th', 'he', 'in'). But Korean characters are encoded in multiple bytes and the tokenizer merges byte pairs within characters. For the combined tokenizer, The tokenizer includes tokens from both languages and performs good enough as combined assertion works well too. Thus this implies that when designing multilingual tokenizers, its important to have a balanced training data consisting of both languages like I used emma and both korean and english here.  

## Wordpiece

Now, implement the wordpiece tokenizer algorithm, which is almost similar to BPE, except the objective function is a bit different.

Check the description below.

![Wordpiece Algorithm](./images/Wordpiece.png)


Make sure to:
1. Preprocess the text first
2. Work on the byte-space (utf-8 encoding)
3. When you estimate p(v_i, v_j), divide by N-1, instead of N where N is the total number of tokens in the corpus.

In [47]:
class WordPieceTokenizer:
    def __init__(self):

        self.map_of_merged_tokens = None
        self.tokenizer_map = None
        self.map_of_char = None
        self.split_pattern = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

    ### OPTIONAL ####
    # You can define another helper function to replace bigrams in the list of lists.


    #### END ####

    @staticmethod
    def replace_pairs(lst:list, bigram:tuple, c: int):

        """
        Helper function to replace bigram with a single token in the list.
        Whether to use this particular function or not depends on *your* implementation of the tokenizer.
        It is possible that you may not need this function at all.
        """
        i = 0
        while i < len(lst) - 1:
            if lst[i] == bigram[0] and lst[i + 1] == bigram[1]:
                lst[i] = c
                del lst[i + 1]
            else:
                i += 1
        return lst

    def preprocess_text(self, text: str):
        return re.findall(self.split_pattern, text)

    def train(self, text: str, target_vocab_size: int):

        self.target_vocab_size = target_vocab_size



        character_set = set([i for i in range(256)]) # We start with all the possible bytes, which is 256
        vocabsize = len(character_set)
        assert self.target_vocab_size > vocabsize, "target vocab size must be greater than 256"

        map_of_char = {char: i for i, char in enumerate(character_set)}
        self.map_of_char = map_of_char

        map_of_merged_tokens = {}

        ### TODO: Your implementation of wordpiece training goes here. ###
        # Make sure to preprocess the text first
        self.tokenizer_map = {char: (char,) for char in character_set}
        preprocessed = self.preprocess_text(text)
        byte_chunks = [list(chunk.encode("utf-8")) for chunk in preprocessed]
        corpus = [chunk[:] for chunk in byte_chunks]  # Copy of byte_chunks

        current_token_id = 256

        while vocabsize < self.target_vocab_size:
            tokens = [token for chunk in corpus for token in chunk]
            N = len(tokens)

            if N < 2:
                break  # Not enough tokens to form bigrams

            # Counting all the tokens and bigrams
            token_counts = Counter(tokens)
            bigram_counts = Counter()
            for chunk in corpus:
                bigram_counts.update(zip(chunk, chunk[1:]))
            delta_L_dict = {}
            for bigram, c_vi_vj in bigram_counts.items():
                vi, vj = bigram
                c_vi = token_counts[vi]
                c_vj = token_counts[vj]
                p_vi_vj = c_vi_vj / (N - 1)
                p_vi = c_vi / N
                p_vj = c_vj / N
                delta_L = c_vi_vj * math.log(p_vi_vj / (p_vi * p_vj) + 1e-8)
                delta_L_dict[bigram] = delta_L

            if not delta_L_dict:
                break
            best_bigram, max_delta_L = max(delta_L_dict.items(), key=lambda item: item[1])

            for i, chunk in enumerate(corpus):
                corpus[i] = self.replace_pairs(chunk, best_bigram, current_token_id)
            # Updating the tokenizer_map with new token
            vi, vj = best_bigram
            bytes_vi = self.tokenizer_map[vi]
            bytes_vj = self.tokenizer_map[vj]
            new_token_bytes = bytes_vi + bytes_vj
            self.tokenizer_map[current_token_id] = new_token_bytes

            if self.map_of_merged_tokens is None:
                self.map_of_merged_tokens = {}
            self.map_of_merged_tokens[best_bigram] = current_token_id
            current_token_id += 1
            vocabsize += 1

        ### END ###

        self.map_of_merged_tokens = map_of_merged_tokens
        self.tokenizer_map = self.tokenizer_map # set this to the tokenizer_map you created in the loop above


    def encode_text(self, text: str):
        """Given a text, return the tokenized text"""
        ### Implement the encode_text function ###
        # You only need to sligthly modify existing code in your previous implementation
        preprocessed = self.preprocess_text(text)
        byte_chunks = [chunk.encode("utf-8") for chunk in preprocessed]
        byte_seq_to_token_id = {v: k for k, v in self.tokenizer_map.items()}

        tokenized = []
        for byte_chunk in byte_chunks:
            i = 0
            while i < len(byte_chunk):
                matched = False
                # Startiing from longest possible substring
                for j in range(len(byte_chunk), i, -1):
                    sub_bytes = byte_chunk[i:j]
                    sub_bytes_tuple = tuple(sub_bytes)
                    if sub_bytes_tuple in byte_seq_to_token_id:
                        token_id = byte_seq_to_token_id[sub_bytes_tuple]
                        tokenized.append(token_id)
                        i = j
                        matched = True
                        break
                if not matched:
                    token_id = byte_chunk[i]
                    tokenized.append(token_id)
                    i += 1

        return tokenized

    def decode_text(self, tokenized_text: list):
        """Given a tokenized text, return the original text"""

        ### TO DO: Write the implementation of the decode_text function ###
        decoded_bytes = []
        for token_id in tokenized_text:
            token_value = self.tokenizer_map.get(token_id)
            if token_value is not None:
                decoded_bytes.extend(token_value)
            else:
                pass

        decoded = bytes(decoded_bytes).decode("utf-8", errors="replace")
        return decoded
        # make sure to use errors='replace' when decoding

        ### END ###




In [49]:
# Sanity check
tokenizer = WordPieceTokenizer()
tokenizer.train(text, 500)
encoded = tokenizer.encode_text("Life is beautiful 🤞")
decoded = tokenizer.decode_text(encoded)

assert decoded == "Life is beautiful 🤞"