> <p><small><small>This notebook is made available subject to the licence and terms set out in the <a href="http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations GitHub README file</a>.</small></small></p>

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C1-white-bg.png">

# Lab: Prepare The Dataset For Training an SLM

<a href='https://colab.research.google.com/github/google-deepmind/ai-foundations/blob/master/course_1/gdm_lab_1_4_prepare_the_dataset_for_training_a_slm.ipynb' target='_parent'><img src='https://colab.research.google.com/assets/colab-badge.svg' alt='Open In Colab'/></a>

Prepare the dataset so that a transformer model can process it.

30 minutes


## Overview

This lab guides you through preparing a text dataset for training a small language model (SLM). The SLM that you will train is a transformer model. Transformer models and, more generally, neural network models require data to be in a specific format before they can process it. Specifically, when processing text, you first have to split the text into tokens. Then, you have to translate these tokens into unique numeric IDs before a transformer model can process them.

In this lab, you will focus on the necessary **pre-processing steps**: tokenization, vocabulary creation, and mapping tokens to their IDs. This will build the foundation for training your SLM in the next lab.



### What you will learn:

By the end of this lab, you will know:
* The data format requirements of transformer models and how to map tokens to token IDs and vice versa.
* How to prepare a dataset for training a transformer model.

### Tasks

As in previous labs, you will load the Africa Galore dataset and tokenize it using a space tokenizer. You will then build all the ingredients necessary for converting a dataset such that it can be used for training a transformer model.


**In this lab, you will**:
* Load the dataset and tokenize it.
* Construct a list of all tokens in the dataset.
* Construct a list of unique tokens in the dataset.
* Create a mapping of tokens to token IDs and a mapping of token IDs to tokens.
* Define functions that can translate between tokens and their corresponding IDs.
* Define a Python class that encapsulates all methods necessary for preparing the data for a transformer model.


## How to use Google Colaboratory (Colab)

Google Colaboratory (also known as Google Colab) is a platform that allows you to run Python code in your browser. The code is written in **cells** that are executed on a remote server.

To run a cell, hover over a cell and click on the `run` button to its left. The run button is the circle with the triangle (▶). Alternatively, you can also click on a cell and use the keyboard combination Ctrl+Return (or ⌘+Return if you are using a Mac).

To try this out, run the following cell. This should print today's day of the week below it.

In [1]:
from datetime import datetime
print(f"Today is {datetime.today():%A}.")

Today is Saturday.


Note that the *order in which you run the cells matters*. When you are working through a lab, make sure to always run *all* cells in order. Otherwise, the code might not work. If you take a break while working on a lab, Colab may disconnect you. In that case, you have to execute all cells again before continuing your work. To make this easier, you can select the cell you are currently working on and then choose __Runtime → Run before__ from the menu above (or use the keyboard combination Ctrl/⌘ + F8). This will re-execute all cells before the current one.

## Imports



In this lab, you will use the [Pandas](http://pandas.pydata.org) package for reading the dataset, along with Python's built-in `re` and `textwrap` modules. Run the following cell to import these packages.

In [2]:
%%capture
!pip install "git+https://github.com/google-deepmind/ai-foundations.git@main"

import re # Used for splitting strings on spaces.

# Packages used.
import pandas as pd # For reading the dataset.
import textwrap # For adding linebreaks to paragraphs.

# For providing feedback.
from ai_foundations.feedback.course_1 import slm

## Loading and tokenizing the dataset

As in the previous labs, you will again use the [Africa Galore](https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json) dataset for the activities in this lab.

Run the following cell to download the dataset and print its first paragraph.

In [3]:
africa_galore = pd.read_json(
    "https://storage.googleapis.com/dm-educational/assets/ai_foundations/africa_galore.json"
)
dataset = africa_galore["description"]
print(f"Loaded Africa Galore dataset with {len(dataset)} paragraphs.")
print(f"\nFirst paragraph:")
print(textwrap.fill(dataset[0]))

Loaded Africa Galore dataset with 232 paragraphs.

First paragraph:
The Lagos air was thick with humidity, but the energy in the club was
electric. The band launched into a hypnotic Afrobeat groove, the drums
pounding out a complex polyrhythm, the horns blaring a soaring melody,
and the bass laying down a deep, funky foundation. A woman named Imani
moved effortlessly to the music, her body swaying in time with the
rhythm. The music seemed to flow through her, a powerful current of
energy and joy. All around her, people were dancing, singing, and
clapping, caught up in the infectious rhythm. The music was more than
just entertainment; it was a celebration of life, a connection to
their shared heritage, a vibrant expression of the soul of Lagos.


As with n-gram language models, you also have to tokenize sequences before you can use them to train a transformer model. You will again use a space tokenizer that splits sequences on spaces.

Run the following cell to define and test the space tokenizer that is implemented by the function `space_tokenize`. This function is almost identical to the function you have already seen. Instead of the string `split` function, it uses the [`re.split`](https://docs.python.org/3/library/re.html#re.split) function, since it's better at handling texts that contain multiple spaces.

In [4]:
def space_tokenize(text: str) -> list[str]:
    """Splits a string into a list of tokens.

    Splits text on space.

    Args:
        text: The input text.

    Returns:
        A list of tokens. Leading or trailing spaces in `text` produce
        empty-string tokens, and an empty `text` produces `['']`.
    """
    # Use `re` package so that splitting on multiple spaces also works.
    tokens = re.split(r" +", text)
    return tokens


# Tokenize an example text with the `space_tokenize` function.
space_tokenize("Kanga, a colorful printed cloth is more than just a fabric.")

['Kanga,',
 'a',
 'colorful',
 'printed',
 'cloth',
 'is',
 'more',
 'than',
 'just',
 'a',
 'fabric.']
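
To make the difference between `str.split` and `re.split` concrete, here is a small standalone sketch. The example strings are made up for illustration and are not taken from the dataset.

```python
import re

text = "Kanga  is   cloth"  # Hypothetical example with repeated spaces.

# Splitting on a single space leaves empty strings between repeated spaces.
plain_split = text.split(" ")
print(plain_split)  # ['Kanga', '', 'is', '', '', 'cloth']

# Splitting on one or more spaces treats each run of spaces as one separator.
regex_split = re.split(r" +", text)
print(regex_split)  # ['Kanga', 'is', 'cloth']

# Leading or trailing spaces still produce empty-string tokens.
padded_split = re.split(r" +", " Kanga ")
print(padded_split)  # ['', 'Kanga', '']
```

The last case is worth keeping in mind: if a paragraph starts or ends with a space, `space_tokenize` returns an empty string as one of its tokens. This is a likely reason why an empty string can show up as an entry in the vocabulary later in this lab.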

### Coding Activity 1: Build a list of all tokens in the dataset



------
> 💻 **Your task:**
>
> Complete the following cell to construct a `tokens` list that contains all tokens in the dataset in the order in which they appear.
>
> You will have to loop through all paragraphs in the dataset and then extract all tokens for each paragraph and add them to the `tokens` list.
>
> Once you have completed your implementation, run the next two cells to build the list and test your code.
------

In [9]:
tokens = []

for text in dataset:
    for token in space_tokenize(text):
        tokens.append(token)

# Add your code here.

print(f"Total number of tokens in the Africa Galore dataset: {len(tokens):,}")

Total number of tokens in the Africa Galore dataset: 19,065


In [10]:
# @title Run this cell to test your code

slm.test_build_tokens_list(tokens, space_tokenize, dataset)

✅ Nice! Your answer looks correct.


To remind yourself what the tokenized dataset looks like, run the following cell to print the first 30 tokens of the first paragraph in the dataset.

In [11]:
tokens[:30]

['The',
 'Lagos',
 'air',
 'was',
 'thick',
 'with',
 'humidity,',
 'but',
 'the',
 'energy',
 'in',
 'the',
 'club',
 'was',
 'electric.',
 'The',
 'band',
 'launched',
 'into',
 'a',
 'hypnotic',
 'Afrobeat',
 'groove,',
 'the',
 'drums',
 'pounding',
 'out',
 'a',
 'complex',
 'polyrhythm,']

### Coding Activity 2: Build the vocabulary

Transformer models use a fixed set of tokens that they can process and generate. This set of tokens is known as the **vocabulary**. In many cases, this vocabulary is set to the list of unique tokens that appear in the data that the model is being trained on.

------
> 💻 **Your task:**
>
> Complete the `build_vocabulary` function so that it returns the list of
> unique tokens that appear in `tokens`.
>
> Once you have implemented this function, run the next two cells to define the function and test your code.
------

In [12]:
def build_vocabulary(tokens: list[str]) -> list[str]:
    # Build a vocabulary list from the set of tokens.
    vocabulary = list(set(tokens))
    return vocabulary

In [13]:
# @title Run this cell to test your code
slm.test_build_vocabulary(build_vocabulary)

✅ Nice! Your answer looks correct.


You can now use the function that you have implemented to construct the vocabulary for the Africa Galore dataset.

Run the cell below to create the vocabulary and print its size, that is, the number of unique tokens in the dataset.

In [14]:
vocabulary = build_vocabulary(tokens)

vocabulary_size = len(vocabulary)

print(
    "Total number of unique tokens in the Africa Galore dataset:"
    f" {vocabulary_size:,}"
)

Total number of unique tokens in the Africa Galore dataset: 5,260


To get a sense of what the vocabulary looks like, run the following cell that prints the first 30 entries of the vocabulary.

In [15]:
vocabulary[:30]

['',
 'certain',
 'Kwame',
 'rhinoceros',
 'portion',
 'quantities',
 'Sofia,',
 'aftershave.',
 'experiment',
 'greens).',
 'create',
 'knowledge',
 'south,',
 'infectious',
 'glistening,',
 'Kilimanjaro',
 'donuts',
 'building',
 'magical',
 'dissolve',
 'power,',
 'Victoria',
 '10',
 'xima',
 'Camel',
 'riads',
 'Mohammed',
 'When',
 'clay',
 'coat.']

Note that, unlike when you printed the first 30 tokens in the dataset, there are no duplicate entries. Every token appears exactly once in the vocabulary.
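
The relationship between the full token list and the vocabulary can be illustrated with a toy example. The token list below is hypothetical, and sorting the vocabulary is a choice made for this sketch rather than something the `build_vocabulary` function above does.

```python
from collections import Counter

# Hypothetical toy token list; "the" and "was" each appear twice.
toy_tokens = ["the", "air", "was", "thick", "the", "club", "was", "electric."]

# Count how often each token occurs.
counts = Counter(toy_tokens)

# A set keeps one copy of each token; sorting makes the order reproducible.
toy_vocabulary = sorted(set(toy_tokens))

print(len(toy_tokens))      # 8 tokens in total
print(len(toy_vocabulary))  # 6 unique tokens
print(counts["the"])        # 2
print(toy_vocabulary)       # ['air', 'club', 'electric.', 'the', 'thick', 'was']
```

Because Python randomizes string hashing by default, the order of `list(set(tokens))` can change when the runtime restarts, so the 30 entries you see may differ from the ones printed above. Sorting the set is one way to get the same order every time.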

## Convert the tokens into token IDs (indices)

As discussed above, in order to train a transformer on a text dataset, you have to turn the text data into a list of **token IDs**. These IDs are numbers such that each token maps uniquely to a different number. The IDs should always be consecutive. That means that, if the vocabulary has size $k$, then each token should map to an ID between 0 and $k-1$.

In practice, the translation between tokens and IDs is implemented using two dictionaries:

1. **`token_to_index`**: This dictionary maps each token in the vocabulary to its corresponding ID (index). The index must be between 0 and $k-1$, where $k$ is the vocabulary size.
2. **`index_to_token`**: This dictionary maps an ID (index) back to its corresponding token (a string). Given an index between 0 and $k-1$, it returns the token at that position.

When you need to convert a token to a number, use `token_to_index`.
And when you need to convert a number back to a token, use `index_to_token`.

### Build `token_to_index`

The following cell shows you how to implement the construction of the `token_to_index` mapping from the vocabulary. If you are not familiar with the [`enumerate`](https://docs.python.org/3/library/functions.html#enumerate) function in Python, print the `index` and `token` on each iteration to get a sense of what it is doing and why this code is creating the correct mapping.

In [16]:
# Build the `token_to_index` dictionary.
token_to_index = {}

for index, token in enumerate(vocabulary):
    token_to_index[token] = index
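
If `enumerate` is unfamiliar, the following standalone sketch prints the pairs it produces. The three-token vocabulary and the `toy_` names are hypothetical and only serve as an illustration.

```python
# Hypothetical three-token vocabulary, used only for illustration.
toy_vocabulary = ["air", "club", "the"]

toy_token_to_index = {}
for index, token in enumerate(toy_vocabulary):
    # enumerate yields (position, item) pairs, counting from 0.
    print(index, token)
    toy_token_to_index[token] = index

print(toy_token_to_index)  # {'air': 0, 'club': 1, 'the': 2}
```

Since every token appears only once in the vocabulary, each token receives exactly one index, and the indices run from 0 up to the vocabulary size minus one.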

### Coding Activity 3: Build `index_to_token`

Next, create a dictionary called `index_to_token`, where the index is the key and the token is the value. This dictionary should be the inverse of the `token_to_index` dictionary. After implementing the dictionary, run the cell and verify that the tokens and their corresponding indices match between `token_to_index` and `index_to_token`.

------
> 💻 **Your task:**
>
> Complete the following cell such that it constructs the `index_to_token` mapping that maps each token ID to the corresponding token string.
>
> **Hint**: It may be useful to iterate through the entries in `token_to_index` using `token_to_index.items()` to obtain the pairs of tokens and indices.
>
> Once you have completed the cell, run the next two cells to construct the dictionary and test your code.
------

In [17]:
# Create a dictionary that maps an index (a number) back to
# its corresponding token in the vocab.
index_to_token = {}

for index, token in enumerate(vocabulary):
    index_to_token[index] = token

# Add your code here.

In [18]:
# @title Run this cell to test your code
slm.test_index_to_token(index_to_token, vocabulary)

✅ Nice! Your answer looks correct.


To see how `token_to_index` and `index_to_token` are inverse mappings, take a look at ten entries of both dictionaries:

In [19]:
print("token_to_index:\n")

count = 0
first_ten_indices = []
for token, token_id in token_to_index.items():
    print(f"'{token}': {token_id}")
    first_ten_indices.append(token_id)
    count += 1
    if count == 10:
        break

print("\n\n")
print("index_to_token:\n")
for token_id in first_ten_indices:
    print(f"{token_id}: '{index_to_token[token_id]}'")

token_to_index:

'': 0
'certain': 1
'Kwame': 2
'rhinoceros': 3
'portion': 4
'quantities': 5
'Sofia,': 6
'aftershave.': 7
'experiment': 8
'greens).': 9



index_to_token:

0: ''
1: 'certain'
2: 'Kwame'
3: 'rhinoceros'
4: 'portion'
5: 'quantities'
6: 'Sofia,'
7: 'aftershave.'
8: 'experiment'
9: 'greens).'


You should see above that the first ten tokens have the IDs zero to nine and that these IDs map back to exactly the same ten tokens.
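
Instead of comparing entries by eye, the two properties discussed in this section (consecutive IDs and inverse mappings) can also be checked programmatically. The sketch below uses a hypothetical toy vocabulary so that it runs on its own; the same checks could be applied to the lab's `token_to_index` and `index_to_token`.

```python
# Hypothetical toy vocabulary; the lab's real vocabulary works the same way.
toy_vocabulary = ["air", "club", "the", "was"]

toy_token_to_index = {token: index for index, token in enumerate(toy_vocabulary)}
toy_index_to_token = {index: token for token, index in toy_token_to_index.items()}

# The IDs should be exactly 0, 1, ..., k-1 for a vocabulary of size k.
ids_are_consecutive = sorted(toy_token_to_index.values()) == list(
    range(len(toy_vocabulary))
)

# Looking a token up and then looking its ID up should return the token.
mappings_are_inverse = all(
    toy_index_to_token[index] == token
    for token, index in toy_token_to_index.items()
)

print(ids_are_consecutive, mappings_are_inverse)  # True True
```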

## Encode and decode functions

Rather than manually translating between lists of tokens and lists of token indices, it can be much easier to convert between these two representations of your data by implementing an `encode` and a `decode` function.

- The `encode` function takes a string of text and returns the corresponding indices of the tokens.
- The `decode` function takes a list of indices and returns the text associated with it.

The following cell provides an implementation of these two functions. Run it to define both of them.

In [22]:
def encode(text: str) -> list[int]:
    """Encodes a text sequence into a list of indices based on the vocabulary.

    Args:
        text: The input text to be encoded.

    Returns:
        A list of indices corresponding to the tokens in the input text.
    """

    # Convert tokens into indices.
    indices = []
    for token in space_tokenize(text):
        token_index = token_to_index.get(token)
        indices.append(token_index)

    return indices


def decode(indices: int | list[int]) -> str:
    """Decodes a list (or single index) of integers back into tokens.

    Args:
        indices: A single index or a list of indices to be decoded into tokens.

    Returns:
        str: A string of decoded tokens corresponding to the input indices.
    """

    # If a single integer is passed, convert it into a list.
    if isinstance(indices, int):
        indices = [indices]

    # Map indices to tokens.
    tokens = []
    for index in indices:
        token = index_to_token.get(index)
        tokens.append(token)

    # Join the decoded tokens into a single string.
    return " ".join(tokens)

To verify that these functions are working as expected, you can encode a text so that its tokens are mapped to indices and then decode those indices. The decoding step should return the original text.

The following cell prints the first paragraph of the dataset again:

In [20]:
text = dataset[0]
print(text)

The Lagos air was thick with humidity, but the energy in the club was electric. The band launched into a hypnotic Afrobeat groove, the drums pounding out a complex polyrhythm, the horns blaring a soaring melody, and the bass laying down a deep, funky foundation. A woman named Imani moved effortlessly to the music, her body swaying in time with the rhythm. The music seemed to flow through her, a powerful current of energy and joy. All around her, people were dancing, singing, and clapping, caught up in the infectious rhythm. The music was more than just entertainment; it was a celebration of life, a connection to their shared heritage, a vibrant expression of the soul of Lagos.


Run the following cell to encode the paragraph above and look at the first ten indices.

In [23]:
encode(text)[:10]

[3635, 2603, 5157, 212, 316, 656, 3499, 3830, 1187, 494]

Now decode these indices to obtain the first ten tokens of the original paragraph. This should be the same as the beginning of the original paragraph above.

In [24]:
decode(encode(text)[:10])

'The Lagos air was thick with humidity, but the energy'

### Package the methods from above in a Python class

You have now implemented the most important pre-processing steps for preparing a dataset to be used for training a transformer model.

In order to save you from always having to go through these steps and implement these functions whenever you want to train a model, it can be useful to define a class, e.g., `SimpleWordTokenizer`, that includes methods for extracting the vocabulary, building the `token_to_index` and `index_to_token` mappings, and implementing the `encode` and `decode` methods.

The `SimpleWordTokenizer` class below provides a solid foundation for understanding tokenization methods used for preparing the input for language models. As you continue to explore the world of language models further, you will come across other tokenization techniques that follow a similar structure.

Make sure to go through each component of this class to remind yourself which steps are involved in constructing the vocabulary and translating between tokens and their corresponding IDs.

In [None]:
class SimpleWordTokenizer:
    """A simple word tokenizer that can be initialized with a corpus of texts
       or using a provided vocabulary list.

    The tokenizer splits the text sequence based on spaces,
    using the `encode` method to convert the text into a sequence of indices
    and the `decode` method to convert indices back into text.

    Typical usage example:

        corpus = "Hello there!"
        tokenizer = SimpleWordTokenizer(corpus)
        print(tokenizer.encode('Hello'))

    """

    def __init__(
        self, corpus: str | list[str], vocabulary: list[str] | None = None
    ):
        """Initializes the tokenizer with texts in corpus or with a vocabulary.

        Args:
            corpus: Input text dataset, either a single text or a list of
                texts.
            vocabulary: A pre-defined vocabulary. If None,
                the vocabulary is automatically inferred from the texts.
        """

        if vocabulary is None:
            # Build the vocabulary from scratch.
            if isinstance(corpus, str):
                corpus = [corpus]

            # Convert text sequence to tokens.
            tokens = []
            for text in corpus:
                for token in self.space_tokenize(text):
                    tokens.append(token)

            # Create a vocabulary consisting of the unique tokens.
            self.vocabulary = self.build_vocabulary(tokens)

        else:
            self.vocabulary = vocabulary

        # Size of vocabulary.
        self.vocabulary_size = len(self.vocabulary)

        # Create token-to-index and index-to-token mappings.
        self.token_to_index = {}
        self.index_to_token = {}
        # Loop through all tokens in the vocabulary. enumerate automatically
        # assigns a unique index to each token.
        for index, token in enumerate(self.vocabulary):
            self.token_to_index[token] = index
            self.index_to_token[index] = token

    def space_tokenize(self, text: str) -> list[str]:
        """Splits a given text on space into tokens.

        Args:
            text: Text to split on space.

        Returns:
            List of tokens after splitting `text`.
        """

        # Use re.split such that multiple spaces are treated as a single
        # separator.
        return re.split(" +", text)

    def join_text(self, text_list: list[str]) -> str:
        """Combines a list of tokens into a single string, with tokens separated
           by spaces.

        Args:
            text_list: List of tokens to be joined.

        Returns:
            String with all tokens joined with a space.

        """
        return " ".join(text_list)

    def build_vocabulary(self, tokens: list[str]) -> list[str]:
        """Create a vocabulary list from the list of tokens.

        Args:
            tokens: The list of tokens in the dataset.

        Returns:
            List of unique tokens (vocabulary) in the dataset.
        """
        return sorted(list(set(tokens)))

    def encode(self, text: str) -> list[int]:
        """Encodes a text sequence into a list of indices.

        Args:
            text: The input text to be encoded.

        Returns:
            A list of indices corresponding to the tokens in the input text.
        """

        # Convert tokens into indices.
        indices = []
        for token in self.space_tokenize(text):
            token_index = self.token_to_index.get(token)
            indices.append(token_index)

        return indices

    def decode(self, indices: int | list[int]) -> str:
        """Decodes a list (or single index) of integers back into tokens.

        Args:
            indices: A single index or a list of indices to be decoded into
                tokens.

        Returns:
            str: A string of decoded tokens corresponding to the input indices.
        """

        # If a single integer is passed, convert it into a list.
        if isinstance(indices, int):
            indices = [indices]

        # Map indices to tokens.
        tokens = []
        for index in indices:
            token = self.index_to_token.get(index)
            tokens.append(token)

        # Join the decoded tokens into a single string.
        return self.join_text(tokens)

To observe that this class performs the same processing as your previous implementations, run the following cell. This cell runs some tests that verify that the first paragraph from the dataset remains the same after encoding and then decoding it using the `SimpleWordTokenizer`.

In [None]:
tokenizer = SimpleWordTokenizer(dataset)
slm.test_simple_word_tokenizer(tokenizer, vocabulary, dataset)
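
The round trip of encoding and then decoding reproduces a text exactly only when its tokens are separated by single spaces, because decoding always joins tokens with one space. The sketch below illustrates this with `TinyTokenizer`, a hypothetical condensed stand-in for `SimpleWordTokenizer` that is included only so that the example runs on its own.

```python
import re


class TinyTokenizer:
    """Hypothetical, condensed stand-in for `SimpleWordTokenizer`."""

    def __init__(self, corpus: list[str]):
        tokens = [t for text in corpus for t in re.split(r" +", text)]
        self.vocabulary = sorted(set(tokens))
        self.token_to_index = {t: i for i, t in enumerate(self.vocabulary)}
        self.index_to_token = {i: t for t, i in self.token_to_index.items()}

    def encode(self, text: str) -> list[int]:
        return [self.token_to_index[t] for t in re.split(r" +", text)]

    def decode(self, indices: list[int]) -> str:
        return " ".join(self.index_to_token[i] for i in indices)


tiny = TinyTokenizer(["The Lagos air was thick"])

single_spaced = "The air was thick"
double_spaced = "The  air was thick"

# With single spaces, encoding and decoding reproduces the text exactly.
print(tiny.decode(tiny.encode(single_spaced)) == single_spaced)  # True

# Repeated spaces collapse to one space, so the round trip is not exact.
print(tiny.decode(tiny.encode(double_spaced)))  # The air was thick
```

For language modeling this normalization is usually harmless, but it is worth knowing that the decoded text can differ from the input in its whitespace.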

## Summary

This is the end of the **Prepare The Dataset For Training an SLM** lab.

This lab guided you through the steps necessary for preparing a text dataset to be used for training a small language model (SLM). You focused on:

- **Loading and exploring the dataset:** You examined the structure and content of the Africa Galore dataset and inspected example paragraphs in the dataset.

- **Tokenizing the text:** You used a simple word-level tokenization method to split the text into individual tokens and created a vocabulary of unique tokens.

- **Creating numerical representations:** You mapped each token to a unique numerical index by creating `token_to_index` and `index_to_token` dictionaries, which enable the conversion between tokens and token IDs.

- **Packaging the steps in a tokenizer class:** You examined a consolidated version of the tokenization and encoding/decoding logic in a reusable `SimpleWordTokenizer` class. This class streamlines the process of converting text into numerical data that can be fed into a language model and converting the output of a language model to human-readable texts.

In the next lab, you will use this tokenizer class to tokenize the data and use this data to train a small language model.

## Solutions

The following cells provide reference solutions to the coding activities above. If you really get stuck after trying to solve the activities yourself, you may want to consult these solutions.

However, we recommend that you *only* look at the solutions after you have tried to solve the activities above *multiple times*. The best way to learn challenging concepts in computer science and artificial intelligence is to debug your code piece-by-piece until it works rather than copying existing solutions.

If you feel stuck, you may want to first try to debug your code, for example, by adding additional print statements to see what your code is doing at every step. This will provide you with a much deeper understanding of the code and the materials. It will also make you practice how to solve challenging coding problems beyond this course.

To view the solutions for an activity, click on the arrow to the left of the activity name. If you consult the solutions, do not copy and paste them into the cells above. Instead, look at them and then type them manually into the cell. This will help you understand where you went wrong.

### Coding Activity 1

In [None]:
# Add this code to the cell in the coding activity above to build the list of
# tokens.
for paragraph in dataset:
    for token in space_tokenize(paragraph):
        tokens.append(token)

### Coding Activity 2

In [None]:
# Complete implementation of the build_vocabulary function.
def build_vocabulary(tokens: list[str]) -> list[str]:

    # Build a vocabulary list from the set of tokens.
    vocab = list(set(tokens))
    return vocab

### Coding Activity 3


In [None]:
# Add this code to the cell in the coding activity above to build the
# index_to_token mapping.
for token, index in token_to_index.items():
    index_to_token[index] = token