> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations Github README file</a>.

# **Build Your Own Small Language Model, Lab 4: Are You Ready to Build Your Own Small Language Model?**

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_1/introduction_to_language_modeling_lab_4.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

This lab guides you through preparing a text dataset for training a small language model (SLM). The lab focuses on **pre-processing steps** such as tokenization, vocabulary creation, handling out-of-vocabulary words, and adding a padding token to the vocabulary for later use. The lab concludes with the implementation of a simple word tokenizer class.


In [None]:
# Packages used.
import tensorflow as tf
import pandas as pd

## Step 1: Load the dataset

To begin, you will load the dataset. This lab uses the [Africa Galore](https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json) dataset.

In [None]:
africa_galore = pd.read_json('https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json')
train_dataset = africa_galore['description']
print(train_dataset.shape)

Now, read the first paragraph:

In [None]:
print(train_dataset[0])

## Step 2: Convert the text sequence to tokens

This step will focus on *tokenization*, which is the process of converting sentences into smaller, manageable units known as *tokens*. You will be using the same function `split_text` developed in the second lab. This function splits words on whitespace so when you tokenize the sentence `'Bimpe didn't come home yesterday'`, you'll get a list of tokens like `['Bimpe', "didn't", 'come', 'home', 'yesterday']`. This type of tokenization is called word-level tokenization, because the tokens are words. It is one of the simplest forms of tokenization, as it treats each word as an individual unit of meaning:

In [None]:
def split_text(text: str) -> list[str]:
    """Splits a string into a list of words as tokens.

    Splits the text on single space characters.

    Args:
        text: The input text.

    Returns:
        A list of words as tokens. An empty string yields `['']`, and
        consecutive spaces yield empty-string tokens.
    """
    tokens = text.split(' ')
    return tokens

print(split_text('Here\'s how you tokenize! Awesome, isn\'t it?'))

This tokenizer is extremely naive. When splitting on space, you can see how punctuation marks are now part of the token (for example, `'tokenize!'` will be a different token from `'tokenize'`).
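To make this limitation concrete, the short sketch below re-declares `split_text` so that it runs on its own and checks which strings end up as tokens. The second input, with two consecutive spaces, is an illustrative example and does not come from the lab's dataset.

```python
def split_text(text: str) -> list[str]:
    """Splits text on single space characters (same logic as above)."""
    return text.split(' ')


tokens = split_text("Here's how you tokenize! Awesome, isn't it?")

# Punctuation stays attached to the neighbouring word.
has_plain_word = 'tokenize' in tokens
has_word_with_punctuation = 'tokenize!' in tokens
print(has_plain_word, has_word_with_punctuation)

# Two consecutive spaces produce an empty-string token.
double_space_tokens = split_text('two  spaces')
print(double_space_tokens)
```

Because the split happens only on the space character, the original text can be rebuilt exactly by joining the tokens with single spaces.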

In [None]:
tokens = [token for paragraph in train_dataset for token in split_text(paragraph)]
print('Total number of tokens (words) in our train dataset:', len(tokens))

Print out the first 20 tokens:

In [None]:
tokens[:20]

## Step 3: Create a vocabulary of unique tokens

The *vocabulary* is the set of unique tokens that the model is trained on. These are the building blocks the model uses to understand and generate text data.

In the cell below, write a function to build the vocabulary (i.e., the list of unique tokens). After you've attempted the function, run the "Run this to test your code" cell to check if your solution works. A solution is provided in the third cell, but try to write your own first before reviewing it:

In [None]:
def build_vocab(tokens: list[str])-> list[str]:
    # Your code here.
    # Build a vocabulary list from the set of tokens.
    vocab = ... # Enter code here.
    return vocab

In [None]:
# @title Run this to test your code.
def test_build_vocab():
    hint = """
           Hints:
           ======
            1. Create a unique set of tokens, e.g., if you have
               ['hello', 'world', 'world'], it becomes {'hello', 'world'}.
               There is a Python `set` function you can use.
            2. Convert the set to a list, e.g., {'hello', 'world'} becomes
               ['hello', 'world']. There is a Python `list` function that
               you can use.
          """

    result = build_vocab(['hello', 'world', 'world'])

    # A set has no guaranteed order, so compare the sorted tokens.
    if isinstance(result, list) and sorted(result) == ['hello', 'world']:
        print('Nice! Your answer looks correct.')
    elif isinstance(result, set):
        print('\033[1m\033[91mSorry, your answer is not correct. Make sure ' +
              'that you return a list, not a set.\033[0m')
        give_hints = input('Would you like some hints? Type Yes or No.')
        if give_hints.lower() in ['yes', 'y']:
            print(hint)
    else:
        print('\033[1m\033[91mSorry, your answer is not correct.\033[0m')
        give_hints = input('Would you like some hints? Type Yes or No.')
        if give_hints.lower() in ['yes', 'y']:
            print(hint)

test_build_vocab()
result = build_vocab(['hello', 'world', 'world'])
assert isinstance(result, list) and sorted(result) == ['hello', 'world'], '`build_vocab` function is not implemented correctly. Try again.'

In [None]:
# @title build_vocab function solution. Attempt it yourself before revealing the solution

def build_vocab(tokens: list[str]) -> list[str]:
    # Build a vocabulary list from the set of unique tokens.
    vocab = list(set(tokens))
    return vocab
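One detail worth knowing about `list(set(tokens))`: Python sets do not keep insertion order, so the order of the resulting vocabulary can differ between runs. The sketch below shows an alternative that is not the lab's solution. The hypothetical helper `build_ordered_vocab` relies on `dict.fromkeys`, which keeps each token at the position where it first appeared.

```python
def build_ordered_vocab(tokens: list[str]) -> list[str]:
    """Returns unique tokens in first-seen order (hypothetical helper)."""
    # Dictionary keys are unique and preserve insertion order.
    return list(dict.fromkeys(tokens))


ordered_vocab = build_ordered_vocab(['hello', 'world', 'world', 'hello'])
print(ordered_vocab)
```

The rest of this lab keeps using `build_vocab`. Both versions produce the same set of tokens and differ only in ordering.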

Run the cell below to create the vocabulary `vocab`, and count the number of unique tokens in `vocab`:

In [None]:
vocab = build_vocab(tokens)

# Size of the vocabulary (number of unique tokens).
vocab_size = len(vocab)

print(vocab_size)

Run the cell below to print out the first ten tokens in the `vocab`:

In [None]:
vocab[:10]

## Step 4: Add a special \<UNK\> token to the vocabulary to handle unknown tokens

When you use a model for generation, it may encounter unknown tokens. These are tokens that did not appear in the training data, such as spelling errors. To handle this, you can use the unknown token `<UNK>`. When the model encounters a token that is not in the vocabulary, the token is replaced with `<UNK>`. This prevents the model from failing on unknown tokens.

For example, if your vocabulary consists of the tokens `['Bimpe', 'rode', 'a', 'car', 'yesterday']` and you prompt the model with the phrase `'Bimpe rode a keke-marwa yesterday'`, it will substitute `'keke-marwa'` with `'<UNK>'` because it is an unknown token that does not exist in its vocabulary.
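The substitution described above can be sketched in a few lines. The helper name `replace_unknown` and the variable `example_vocab` are made up for this illustration. The lab itself performs the substitution later, during encoding.

```python
UNKNOWN_TOKEN = '<UNK>'
example_vocab = ['Bimpe', 'rode', 'a', 'car', 'yesterday']


def replace_unknown(text: str, vocab: list[str]) -> list[str]:
    """Replaces tokens missing from `vocab` with the unknown token."""
    known = set(vocab)  # Set membership checks are fast.
    return [token if token in known else UNKNOWN_TOKEN
            for token in text.split(' ')]


replaced = replace_unknown('Bimpe rode a keke-marwa yesterday', example_vocab)
print(replaced)
```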

You can *append* (add to the end) the unknown token `'<UNK>'` to the vocabulary:

In [None]:
UNKNOWN_TOKEN = '<UNK>'

vocab.append(UNKNOWN_TOKEN)
vocab_size = len(vocab)

print(vocab_size)

The size of the vocabulary has increased by 1.

Print out the last ten tokens in the vocabulary:

In [None]:
vocab[-10:]

You now see the special unknown token `'<UNK>'` at the very bottom.

## Step 5: Convert the tokens into token IDs (or indexes)

It is convenient to represent the tokens as numbers (or indexes), where each token is represented by a number between `0` and `vocab_size`.

There are two dictionaries to facilitate this mapping:

1. **`index_to_token`**: This dictionary maps an index (a number) back to its corresponding token (a string). Given an index between 0 and the vocabulary size, it returns the token at that position.
2. **`token_to_index`**: This dictionary maps each token in the vocabulary to its corresponding index.

Now, when you need to convert a token to a number, use `token_to_index`. And when you need to convert a number back to a token, use `index_to_token`.

Next, create a dictionary below called `index_to_token`, where the index is the key and the token is the value. This dictionary should be the reverse of the `token_to_index` dictionary. After implementing the dictionary, run the cell and verify that the tokens and their corresponding indexes match between `token_to_index` and `index_to_token`:

In [None]:
# Note the index here is starting with 1.
# The first index is reserved for another special <PAD> token, explained below.
token_to_index = {token: index+1 for index, token in enumerate(vocab)}

In [None]:
# Create a dictionary that maps an index (a number) back to
# its corresponding token in the vocab, starting the index with 1.
index_to_token = ... # Enter code here.

Get the index for the special unknown `'<UNK>'` token:

In [None]:
token_to_index[UNKNOWN_TOKEN]

Run the cell below to test your `index_to_token` dictionary. It should print `True` if your implementation is correct:

In [None]:
print(index_to_token[token_to_index[UNKNOWN_TOKEN]] == UNKNOWN_TOKEN)

In [None]:
# @title index_to_token solution. Attempt it yourself before revealing the solution
index_to_token = {index+1: token for index, token in enumerate(vocab)}

You have now completed the indexing step. Now, examine the first ten tokens and their indexes:

In [None]:
count = 0
for key, value in token_to_index.items():
    # `repr` returns a printable representational string.
    print(f'{repr(key)}: {value}')
    count += 1
    if count == 10:
        break

Next, examine the first ten indexes and their tokens:

In [None]:
count = 0
for key, value in index_to_token.items():
    # `repr` returns a printable representational string.
    print(f'{key}: {repr(value)}')
    count += 1
    if count == 10:
        break

Notice how `token_to_index` and `index_to_token` are inverses of each other.
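A quick way to check the inverse relationship is to look up every token's index and map it straight back. The sketch below does this on a three-token toy vocabulary (an assumption for illustration, not the Africa Galore vocabulary), using the same start-at-1 indexing as the lab.

```python
toy_vocab = ['sun', 'moon', '<UNK>']

# Indexes start at 1, as in the lab; 0 is reserved for '<PAD>'.
toy_token_to_index = {token: index + 1
                      for index, token in enumerate(toy_vocab)}
toy_index_to_token = {index: token
                      for token, index in toy_token_to_index.items()}

# Looking up a token's index and then that index's token
# returns the original token.
round_trip_ok = all(toy_index_to_token[toy_token_to_index[token]] == token
                    for token in toy_vocab)
print(round_trip_ok)
```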

Find the index of the token `'fans'`:

In [None]:
token_to_index['fans']

Find the index of the unknown token:

In [None]:
token_to_index[UNKNOWN_TOKEN]

Find the index of the token `'photosynthesis'`:

In [None]:
token_to_index['photosynthesis']

`'photosynthesis'` does not exist in the dataset, and hence not in `token_to_index`, which explains why the lookup above raises a `KeyError`. As discussed above, in such cases you want to return the index of the `'<UNK>'` token instead. You can use the [.get(key, value)](https://www.w3schools.com/python/ref_dictionary_get.asp) dictionary method, which returns a default `value` when the `key` is not in the dictionary.

Here you can see the unknown token index being returned for a word that does not exist in the vocabulary:

In [None]:
token_to_index.get('photosynthesis', token_to_index[UNKNOWN_TOKEN])

The returned value, 5451, is the token ID of the `'<UNK>'` token. Run the cell below to verify:

In [None]:
index_to_token[5451]

**The encode and decode functions**

Create two functions: `encode` and `decode`.
- The `encode` function takes a string of text and returns corresponding indices of the tokens. Whenever it encounters a token not in the vocab, it will return the index of the `'<UNK>'` token.
- The `decode` function takes a list of indices and returns the text associated with it.


In [None]:
def encode(text: str) -> list[int]:
    return [token_to_index.get(token, token_to_index[UNKNOWN_TOKEN])
            for token in split_text(text)]


def decode(numbers: list[int]) -> list[str]:
    return [index_to_token.get(number, UNKNOWN_TOKEN) for number in numbers]

The next step is to verify that encoding a string of text into indexes and then decoding those indexes returns the original tokens. Start by reading the first paragraph of the train dataset:

In [None]:
text = train_dataset[0]
print(text)

Encode the text and look at the first ten tokens:

In [None]:
encode(text)[:10]

Decode the encoded text back and check the expected outcome:


In [None]:
decode(encode(text)[:10])
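The round trip only reproduces the original tokens when every token is in the vocabulary. The sketch below illustrates this with a toy vocabulary and hypothetical `mini_`-prefixed helpers that mirror `encode` and `decode` above. An unseen word comes back as `'<UNK>'`, so its original spelling is lost.

```python
mini_unknown = '<UNK>'
mini_vocab = ['Bimpe', 'rode', 'a', 'car', 'yesterday', mini_unknown]
mini_token_to_index = {token: index + 1
                       for index, token in enumerate(mini_vocab)}
mini_index_to_token = {index: token
                       for token, index in mini_token_to_index.items()}


def mini_encode(text: str) -> list[int]:
    """Maps each space-separated token to its index, or to the <UNK> index."""
    return [mini_token_to_index.get(token, mini_token_to_index[mini_unknown])
            for token in text.split(' ')]


def mini_decode(numbers: list[int]) -> list[str]:
    """Maps each index back to its token."""
    return [mini_index_to_token.get(number, mini_unknown)
            for number in numbers]


known_round_trip = mini_decode(mini_encode('Bimpe rode a car'))
unknown_round_trip = mini_decode(mini_encode('Bimpe rode a keke-marwa'))
print(known_round_trip)
print(unknown_round_trip)
```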

Now encode all paragraphs in the train dataset:

In [None]:
encoded_tokens = [encode(text) for text in train_dataset]
len(encoded_tokens)

Run the cell below to convert the list of `encoded_tokens` to a numerical array.

Neural networks operate on and process numerical arrays.

> **NOTE**: Another word for a multi-dimensional numerical array is a *tensor*.

In [None]:
encoded_tensor = tf.convert_to_tensor(encoded_tokens, dtype=tf.int32)

What happened?

The cell above gave an error: `ValueError: can't convert non-rectangular Python sequence to Tensor`. But what does that mean?

This error happens because TensorFlow requires all the lists (or sequences) you're trying to convert into a tensor to have the same size. In other words, the lists must form a *rectangular shape*, as all the inner lists need to have the same number of elements.

Now look at this example: `[[1, 2, 3], [1, 2]]`.

If you try to turn this into a tensor, it will give an error. This is because the two inner lists have different lengths. The first list has three numbers, but the second one has only two.
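Before converting, you can check for this condition in plain Python. `is_rectangular` below is a hypothetical helper written for this illustration, not a TensorFlow function. It simply tests whether all inner lists share one length.

```python
def is_rectangular(sequences: list[list[int]]) -> bool:
    """Returns True if all inner lists have the same length."""
    # A set of the lengths has at most one element when they all agree.
    return len({len(sequence) for sequence in sequences}) <= 1


ragged_result = is_rectangular([[1, 2, 3], [1, 2]])     # Different lengths.
even_result = is_rectangular([[1, 2, 3], [4, 5, 6]])    # Same length.
print(ragged_result, even_result)
```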

To fix this, you need to make sure all the lists have the same length. This can be done in two ways:

1. **Padding**: You can add extra values (like `0` or `-1`) to the shorter list, making it the same length as the longer one. For example: `[[1, 2, 3], [1, 2, 0]]`.

   `0` was added to the second list as a placeholder. This way, both lists have three elements.

2. **Truncation**: You can shorten the longer list to match the shorter one. For example: `[[1, 2], [1, 2]]`.

   In this case, the extra element was removed from the first list, so both lists now have two elements.
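Both options can be sketched in plain Python. The helper names `pad_to_longest` and `truncate_to_shortest` are made up for this illustration, and both assume a non-empty list of sequences.

```python
def pad_to_longest(sequences: list[list[int]],
                   pad_value: int = 0) -> list[list[int]]:
    """Appends `pad_value` to shorter sequences until all lengths match."""
    max_length = max(len(sequence) for sequence in sequences)
    return [sequence + [pad_value] * (max_length - len(sequence))
            for sequence in sequences]


def truncate_to_shortest(sequences: list[list[int]]) -> list[list[int]]:
    """Cuts every sequence down to the length of the shortest one."""
    min_length = min(len(sequence) for sequence in sequences)
    return [sequence[:min_length] for sequence in sequences]


example = [[1, 2, 3], [1, 2]]
padded = pad_to_longest(example)
truncated = truncate_to_shortest(example)
print(padded)
print(truncated)
```

Padding keeps every token at the cost of extra placeholder values, while truncation discards information from the longer sequences.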

Next, add padding to make sure all the sequences in the train dataset have the same length. This will allow you to turn them into a proper tensor.

## Step 6: Add the special padding token to handle varying input lengths

The *padding* process ensures that sequences of varying lengths are all the same size. This is done by adding the special token `'<PAD>'` to shorter sequences. This way they align with the longest sequence in the dataset. Padding is necessary to create consistent input for the model (input of the same shape).



Before you can pad the sequences, you first need to update the vocabulary to include the special pad token `'<PAD>'`.

To do this, you will insert the `'<PAD>'` token at the beginning of the vocabulary list. This way, the index for `'<PAD>'` will be 0, making it the first token in the vocabulary.

Run the cell below to add the pad token to the beginning of the `vocab` list:

In [None]:
PAD_TOKEN = '<PAD>'
# Inserts pad token in the first position of the list.
vocab.insert(0, PAD_TOKEN)

vocab_size = len(vocab)
print(vocab_size)

Run the cell below to print out the first ten tokens:

In [None]:
vocab[:10]

Remember, you started the `token_to_index` and `index_to_token` dictionaries at index 1. You left index 0 free so that the `'<PAD>'` token can be added there. Now, add it to both dictionaries:

In [None]:
token_to_index['<PAD>'] = 0
index_to_token[0] = '<PAD>'

In [None]:
index_to_token[0]
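Starting the dictionaries at index 1 has a useful consequence: once `'<PAD>'` is inserted at position 0 of the list, every token's position in the list equals its index in `token_to_index`. The sketch below checks this on a toy vocabulary whose `small_`-prefixed names are assumptions for illustration.

```python
small_vocab = ['sun', 'moon', '<UNK>']

# Indexes start at 1, leaving 0 free for the pad token.
small_token_to_index = {token: index + 1
                        for index, token in enumerate(small_vocab)}

# Insert the pad token at the front of the list and register it at index 0.
small_vocab.insert(0, '<PAD>')
small_token_to_index['<PAD>'] = 0

# Every token now sits at the list position given by its index.
aligned = all(small_vocab[index] == token
              for token, index in small_token_to_index.items())
print(aligned)
```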

Before you can proceed with padding the sequences, it's useful to create a handy Python class that will bring together everything learned so far about tokenizing text and converting it into numbers.

Call this class `SimpleWordTokenizer`. This class will organize the code and help manage the vocabulary and the `token_to_index` and `index_to_token` dictionaries. It will also perform tokenization, encoding, and decoding as you have done above. There's nothing new here. It simply takes all the concepts you've learned and organizes them into a class structure to make the code cleaner and easier to reuse.

This `SimpleWordTokenizer` class provides a solid foundation for understanding tokenization methods used in language modeling. As you continue to explore the world of language modeling further, you'll come across other tokenization techniques that follow a similar structure. By building this class now, you'll have a strong base to build upon:

In [None]:
# Putting it all together.
class SimpleWordTokenizer:
    """A simple word tokenizer that can be initialized with texts
    or using a provided vocabulary list.

    The tokenizer splits text sequences based on whitespace, using the `encode`
    method to convert text into a sequence of indices and the `decode` method
    to convert indices back into text. It handles unknown words and includes
    padding functionality.

    Typical usage example:

        text = 'Hello there!'
        tokenizer = SimpleWordTokenizer(text)
        print(tokenizer.encode('Hello'))

    Attributes:
        UNKNOWN_TOKEN: A string constant representing the special
            token for unknown words.
        PAD_TOKEN: A string constant representing the special token
            for padding.
        texts: Input text dataset used to build the vocabulary if no 'vocab' is
            provided.
        vocab: A pre-defined vocabulary. Defaults to None. If None,
            the vocabulary is automatically inferred from the texts.
            The inferred vocabulary includes PAD_TOKEN at index 0
            and UNKNOWN_TOKEN at the last index.
        vocab_size: The total number of tokens in the vocabulary,
            including special tokens.
        token_to_index: A dictionary mapping tokens to their corresponding
            indices.
        index_to_token: A dictionary mapping indices to their corresponding
            tokens.
        pad_token_id: The index of the PAD_TOKEN in the vocabulary.
        unknown_token_id: The index of the UNKNOWN_TOKEN.
    """

    # Define constants.
    UNKNOWN_TOKEN: str = '<UNK>'
    PAD_TOKEN: str = '<PAD>'


    def __init__(self, texts: list[str] | str, vocab: list[str] | None = None):
        """Initializes the tokenizer with texts or using a provided vocabulary.

        Args:
          texts: Input text dataset.
          vocab: A pre-defined vocabulary. Defaults to None. If None,
                the vocab is automatically inferred from the texts.
        """

        if vocab is None:
            # Build the vocab from scratch.
            if isinstance(texts, str):
                texts = [texts]

            # Step 2: Convert text sequence to tokens.
            tokens = [token for text in texts
                      for token in self.split_text(text)]

            # Step 3: Create a vocabulary comprising of unique tokens.
            vocab = self.build_vocab(tokens)

            # Step 4 and 6: Add special unknown and pad token to the vocabulary.
            self.vocab = [self.PAD_TOKEN] + vocab + [self.UNKNOWN_TOKEN]

        else:
            self.vocab = vocab

        # Size of vocabulary.
        self.vocab_size = len(self.vocab)

        # Create token-to-index and index-to-token mappings.
        self.token_to_index = {token: index
                               for index, token in enumerate(self.vocab)}
        self.index_to_token = {index: token
                               for index, token in enumerate(self.vocab)}

        # Map the special tokens to their IDs.
        self.pad_token_id = self.token_to_index[self.PAD_TOKEN]
        self.unknown_token_id = self.token_to_index[self.UNKNOWN_TOKEN]


    def split_text(self, text: str) -> list[str]:
        """Splits a given text on spaces into tokens (words)."""
        return text.split(' ')


    def join_text(self, text_lists: list[str]) -> str:
        """Combines a list of tokens into a single string,
            with tokens separated by spaces.
        """
        return ' '.join(text_lists)


    def build_vocab(self, tokens: list[str]) -> list[str]:
        """Creates a vocabulary list from the set of tokens."""
        return list(set(tokens))


    # This is the same function from step 5.
    def encode(self, text: str) -> list[int]:
        """Encodes a text sequence into a list of indices based on the vocabulary.

        Args:
            text: The input text to be encoded.

        Returns:
            list: A list of indices corresponding to the tokens in the
                  input text.
        """

        # Step 5: Convert tokens into indexes.
        return [self.token_to_index.get(token,
                                        self.token_to_index[self.UNKNOWN_TOKEN])
                for token in self.split_text(text)]


    # This is mostly the same function that was developed in step 5.
    def decode(self, numbers: int | list[int]) -> str:
        """Decodes a list (or single index) of integers back into
        corresponding tokens from the vocabulary.

        Args:
            numbers: A single index or a list of indices to be
                     decoded into tokens.

        Returns:
            str: A string of decoded tokens corresponding to the input indices.
        """

        # If a single integer is passed, convert it into a list.
        if isinstance(numbers, int):
            numbers = [numbers]

        # Map indices to tokens.
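        # Map indices to tokens. Unknown indices fall back to the
        # unknown token string so that the result can be joined.
        tokens = [self.index_to_token.get(number, self.UNKNOWN_TOKEN)
                  for number in numbers]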

        # Join the decoded tokens into a single string.
        return self.join_text(tokens)

Verify that the class created returns the same vocabulary as the one made before:

In [None]:
tokenizer = SimpleWordTokenizer(train_dataset)
assert tokenizer.vocab == vocab
assert tokenizer.decode(tokenizer.encode(train_dataset[0])) == train_dataset[0]

The cell above uses `assert` statements to check that the tokenizer's vocabulary matches the one built earlier, and that the first paragraph from the training dataset remains the same after encoding and then decoding it. You can also inspect this yourself to check that everything works as expected.

Run the cell below to check the first paragraph in the training dataset. Apply the `tokenizer.encode` method from the tokenizer class to convert it into numbers. Then, use the `tokenizer.decode` method to convert it back into text. Finally, compare the decoded text with the original paragraph to ensure they match:

In [None]:
tokenizer.decode(tokenizer.encode(train_dataset[0]))

In [None]:
train_dataset[0]

Now you can simply use the tokenizer to encode the text data. It will perform steps 2-6 above:

In [None]:
encoded_tokens = [tokenizer.encode(text) for text in train_dataset]

## Reflection

This is the end of **Lab 4: Are You Ready to Build Your Own Small Language Model?**

This lab guided you through preparing a text dataset for training a small language model (SLM), focusing on:

- **Loading and exploring the dataset:** You examined the structure and content of the Africa Galore dataset, focusing on the text descriptions.

- **Tokenizing the text:** You used a simple word-level tokenization method to split the text into individual words, creating a vocabulary of unique tokens.

- **Handling unknown words:** You added a special `'<UNK>'` token to the vocabulary to represent words not encountered during training.

- **Creating numerical representations:** You mapped each token to a unique numerical index, creating `token_to_index` and `index_to_token` dictionaries to facilitate conversions between words and numbers.

- **Addressing varying sequence lengths:** You introduced padding, using the `'<PAD>'` token, to ensure all text sequences have the same length, a requirement for processing data in neural networks.

- **Building a tokenizer class:** You consolidated all the tokenization and encoding/decoding logic into a reusable `SimpleWordTokenizer` class. This class streamlines the process of converting text into numerical data that can be fed into a language model.

In the next lab, you will use this tokenizer class to tokenize the data that you will be training a small language model on.