> <p><small><small>This Notebook is made available subject to the licence and terms set out in the <a href = "http://www.github.com/google-deepmind/ai-foundations">AI Research Foundations GitHub README file</a>.

<img src="https://storage.googleapis.com/dm-educational/assets/ai_foundations/GDM-Labs-banner-image-C1-white-bg.png">

# Build Your Own Small Language Model: Challenge Lab



## Challenge Scenario

### Cymbal Chat: Developing a chatbot for the Arabic-speaking market

Cymbal Chat is a small AI language-modeling startup that wants to expand its language coverage to include Arabic. Arabic grammar differs substantially from English grammar, and Arabic uses a different character set from the one the company's AI researchers are used to. You are going to help them with their first steps: exploring the use of character-based language models.

<br />

------
> **ℹ️ Info**
>
> Character-based language models sometimes work better than word-based models for Arabic NLP tasks [1]. This is because subwords can be attached to the beginning or end of a word to add meaning. For example, a noun (book) can be modified as:
>
>* **كتاب (kitāb)** - a book
>* **كتابي (kitāb-ī)** - my book
>* **كتابك (kitāb-uk)** - your (masc.) book
>* **كتابه (kitāb-uh)** - his book
>
>Arabic is read from *right to left*, and here the modifiers were added to the end of the host word (book).
>They can also be added before the host word, for example conjunctions like "و " (wa-, meaning 'and') and "ف " (fa-, meaning 'so' or 'then'), as well as prepositions such as "ب " (bi-, meaning 'with' or 'in') and "ل " (li-, meaning 'for' or 'to').
------
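
As a small illustration of the point above (a sketch, not part of the lab's graded code): a word-level vocabulary treats the four forms of "book" as four unrelated entries, whereas a character-level view exposes the stem they share. The variable names are illustrative only.

```python
# Four surface forms of the noun "book" from the list above.
forms = ["كتاب", "كتابي", "كتابك", "كتابه"]
stem = forms[0]

# A word-level vocabulary stores each form as a separate, unrelated token.
word_vocabulary = set(forms)

# A character-level vocabulary only needs the distinct characters.
char_vocabulary = set("".join(forms))

# Every form starts with the same stem characters. Index 0 is the first
# character read, even though it is displayed on the right.
shares_stem = all(form.startswith(stem) for form in forms)

print(len(word_vocabulary), len(char_vocabulary), shares_stem)
```

The character vocabulary stays small as more suffixed forms are added, while the word vocabulary grows with every new form.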

<br />

Exploring the use of character-based language models will involve these Tasks:

* **Task 1.** *Define helper functions and load data.*
* **Task 2.** *Character tokenizer*: Create a simple character level tokenizer for Arabic text.
* **Task 3.** *N-gram text generator*: Develop an n-gram based text generator as a baseline against which to evaluate a more sophisticated model.
* **Task 4.** *Data preparation*: Prepare an Arabic dataset so that it can be used for training a character-based transformer language model on restricted resources.

## Task 1.  Define helper functions and load data

You will use a small dataset of short children's stories in Arabic to explore the use of character-based language models for Cymbal Chat.

For this task, you will import the libraries required for this lab and load and display the dataset of Arabic stories.

✅ All code cells have already been written for you in all parts of this task. **You are not required to add or modify code for Task 1.**

<br />

------
> 💻 **Your task**:
>
> Run the cells in **Task 1** to load a small children's stories dataset.
-----

In [1]:
# Install logging packages for lab grading
%pip install --upgrade google-cloud-logging > /dev/null 2>&1
%pip install --force-reinstall protobuf==3.20.0 > /dev/null 2>&1

In [2]:
# Packages used.
from collections import Counter, defaultdict
import random
import re
from typing import Any, Literal
import itertools

import keras
import numpy as np
import pandas as pd
from google.cloud import logging as gcp_logging
from google.cloud.logging.handlers import CloudLoggingHandler

In [3]:
# Import the standard logging module and the Google Cloud Logging modules.
import logging
from google.cloud import logging as gcp_logging
from google.cloud.logging.handlers import CloudLoggingHandler

# Instantiate the Google Cloud Logging client
client = gcp_logging.Client()

# Create a specific logger for cloud logging
logger = logging.getLogger('my_cloud_logger')

# Prevent this logger from sending messages to its parent (the root logger),
# which has the default console handler
logger.propagate = False

# Create and add the Cloud Logging handler
handler = CloudLoggingHandler(client)
logger.addHandler(handler)

# Set the logging level
logger.setLevel(logging.INFO)

### Helper functions

------
> **ℹ️ Info: Working with Arabic text**
>
>An Arabic sentence like
>
>ذَهَبَ الطَالبُ الى المَدْرَسَة
>
>(meaning *"the student went to school"*) is read from *right to left*. The first character in the string is printed on the far right, where one starts reading, and programmatically still has string index `0`. The string can contain:
>
>* **Bidirectional control characters** that state that the text should be displayed right-to-left on a screen. (These control characters only affect how the text is displayed; they do not reverse the order of the characters in the Python string, although each one occupies its own position in it.)
>
>* **Diacritic characters** or markings that are placed above or below letters to indicate, for instance, short vowel sounds. They can help to disambiguate text.
-----
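
The points in the box above can be checked directly in Python. The sketch below is only an illustration: it uses the standard `unicodedata` module to detect diacritics, which is a different approach from the regular expressions used by the lab's helper functions, and the variable names are made up for this example.

```python
import unicodedata

# "dhahaba" (he went) written with escapes so every code point is explicit:
# each base letter is followed by a fatha diacritic (U+064E).
word = "\u0630\u064E\u0647\u064E\u0628\u064E"

# Index 0 is the first letter read (displayed on the far right).
first_character = word[0]

# Diacritics are separate code points, so they count towards len().
# Unicode classifies them as nonspacing marks (category "Mn").
base_letters = [ch for ch in word if unicodedata.category(ch) != "Mn"]

# Wrapping text in bidirectional controls adds characters to the string;
# the text inside keeps its relative order.
wrapped = "\u202E" + word + "\u202C"

print(len(word), len(base_letters), len(wrapped))
```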

<br />

The helper functions `clean_bidi_chars` and `remove_diacritics` will be used to preprocess text by removing bidirectional control characters, as well as diacritic markings that are not essential for this task.

In [4]:
def clean_bidi_chars(text: str) -> str:
   """Remove Unicode bidirectional control characters from text.

   Args:
       text: Input text that may contain BiDi control characters.

   Returns:
       Text with the embedding, override, and isolate control characters
       (U+202A to U+202E and U+2066 to U+2069) removed.
   """
   bidi_chars = re.compile(r"[\u202A-\u202E\u2066-\u2069]")
   return bidi_chars.sub("", text)


def remove_diacritics(text: str) -> str:
   """Remove Arabic diacritic characters to simplify text.

   Args:
       text: Arabic text that may contain diacritical marks.

   Returns:
       Text with all Arabic diacritics and tatweel removed.
   """
   arabic_diacritics = re.compile(r"[\u064B-\u065F\u0670\u0640]")
   return arabic_diacritics.sub("", text)


def display_arabic(text: str) -> str:
   """Format Arabic text for terminals with limited right-to-left support.

   Args:
       text: Arabic text to format for display.

   Returns:
       Text wrapped with RTL (right-to-left) override characters for improved
       terminal rendering.
   """
   return "\u202E" + text + "\u202C"


sentence = "ذَهَبَ الطَالبُ الى المَدْرَسَة"  # The student went to school
sentence_bidi_clean = clean_bidi_chars(sentence)
sentence_without_diacritics = remove_diacritics(sentence_bidi_clean)

print("EXAMPLE")
print(f"Text:           {sentence}")
print(f"Processed text: {display_arabic(sentence_without_diacritics)}")

EXAMPLE
Text:           ذَهَبَ الطَالبُ الى المَدْرَسَة
Processed text: ‮ذهب الطالب الى المدرسة‬


### Load and display the dataset.



In [5]:
# Load dataset of Arabic stories.
url = "https://storage.googleapis.com/dm-educational/assets/ai_foundations/200-stories-ar.json"
df = pd.read_json(url)

# Extract dataset from dataframe and preprocess text.
dataset = []
for story in df["story_ar"].to_list():
    story = clean_bidi_chars(story)
    story = remove_diacritics(story)
    dataset.append(story)

first_story = dataset[0]
print(f"First story after preprocessing:\n\t{display_arabic(first_story)}")

First story after preprocessing:
	‮الشمس طلعت! بطة صغيرة تستيقظ. تحب اللعب. ترى مزارعا قرب بركتها. المزارع لطيف. يلوح للبطة.

البطة تتهادى إلى صديقها، أرنب ناعم. الأرنب يأكل جزرة. تقول البطة: "مرحبا! هل تريد اللعب؟" الأرنب يومئ برأسه. هما سعيدان.

يريان صخرة ملساء ومسطحة. تبدو مكانا ممتعا للتزلج! يحاول الأرنب التزلج لكنه يقفز بدلا من ذلك. البطة تنزلق قليلا. من الممتع المحاولة!

قريبا، حان وقت الغداء. البطة تأكل بذورا لذيذة. الأرنب يأكل المزيد من الجزر. يأخذان قيلولة في الشمس الدافئة. هما صديقان حميمان.‬


## Task 2. Character Tokenizer

You need to convert natural language text to character-tokens and back again. For example, the text

`text = "الشمس طلعت"`

(meaning: *"The sun is out"*) should be split into a list of character tokens.  The first three character tokens are:

```
tokens[0] == "ا"
tokens[1] == "ل"
tokens[2] == "ش"
```

Remember that the text literal is formatted to appear on screen *right-to-left*. When the tokens are joined, the original text should be recreated.



### Complete a simple Arabic character tokenizer.

In this section, you will complete two methods in the `SimpleArabicCharacterTokenizer` class to
convert natural language text to character-tokens and back.

<br />

------
> 💻 **Your task**:
>
> Complete the **`character_tokenize`** method.
>
> Given text input like `text = "الشمس طلعت"` (meaning: *"The sun is out"*), the method should return a list of character level tokens, `tokens`, each a `str` object of length one. These must be the characters of the input text in the same order. For the example text, it means `tokens[0]` is `"ا"`.
>
> *Note that if you print `tokens` to standard output it may also appear in right-to-left order. Like this:*
> ```
> ["ا", "ل", "ش", "م", "س", " ", "ط", "ل", "ع", "ت"]
> ```
> *This is a consequence of the right-to-left encoding of the Arabic characters but it does not affect the reference order of the elements.*
-----
> 💻 **Your task**:
>
> Complete the **`join_text`** method.
>
> Given a list of tokens, this should return a string containing the characters appearing in the input tokens.
------
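
One consequence is worth seeing in code: splitting by character means splitting by Unicode code point, so a diacritic that has not been removed becomes a token of its own. The following is a minimal sketch using plain `list` and `str.join`; the variable names are illustrative and not part of the lab's API.

```python
text = "الشمس طلعت"  # "The sun is out", already without diacritics.

tokens = list(text)          # One token per Unicode code point.
rejoined = "".join(tokens)   # Joining restores the original string.

# With a diacritic (fatha, U+064E) after the first letter, the mark is
# split off as a separate token.
with_diacritic = text[0] + "\u064E" + text[1:]
tokens_with_diacritic = list(with_diacritic)

print(len(tokens), len(tokens_with_diacritic), rejoined == text)
```

This is one reason the dataset is passed through `remove_diacritics` before tokenization: it keeps the character vocabulary smaller.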


In [6]:
class SimpleArabicCharacterTokenizer:
    """A simple character tokenizer."""

    def __init__(self):
        pass

    def character_tokenize(self, text: str) -> list[str]:
        """Splits a given Arabic text into character tokens.

        Args:
            text: Text to split into character tokens.

        Returns:
            List of tokens after splitting `text`.
        """
        # Tokenize the string into character tokens by converting it to a list.
        tokens = list(text)
        return tokens

    def join_text(self, tokens: list[str]) -> str:
        """Combines a list of tokens into a single string.

        The tokens are concatenated in order into a single string.

        Args:
            tokens: List of tokens to be joined.

        Returns:
            String with all tokens joined together without intervening characters.
        """
        # Join the list of character tokens into a string.
        text = "".join(tokens)
        return text


### Test your tokenizer

You can experiment with your implementation of `SimpleArabicCharacterTokenizer` by running the cell below.

✅ All the code has been written for you in this part. **You are not required to add or modify code.**

In [7]:
# Test your tokenizer on the first line of the first story from the dataset.

# Create tokenizer.
tokenizer = SimpleArabicCharacterTokenizer()

# Get first line of first story.
first_story = dataset[0]
first_line = first_story.split("\n")[0]
print(f"First line of dataset:\n\t{first_line}")

# Tokenize the line of text.
first_line_tokens = tokenizer.character_tokenize(first_line)
print(f"First line tokens:\n\t{first_line_tokens}")

# Join the tokens to reform the line of text.
first_line_rejoined =  tokenizer.join_text(first_line_tokens)
print(f"First line rejoined:\n\t{first_line_rejoined}")

# Do not remove or modify this logging call, it will be used for tracking purposes
logger.info(f'Task 2: First line rejoined: {len(first_line_tokens)}')

First line of dataset:
	الشمس طلعت! بطة صغيرة تستيقظ. تحب اللعب. ترى مزارعا قرب بركتها. المزارع لطيف. يلوح للبطة.
First line tokens:
	['ا', 'ل', 'ش', 'م', 'س', ' ', 'ط', 'ل', 'ع', 'ت', '!', ' ', 'ب', 'ط', 'ة', ' ', 'ص', 'غ', 'ي', 'ر', 'ة', ' ', 'ت', 'س', 'ت', 'ي', 'ق', 'ظ', '.', ' ', 'ت', 'ح', 'ب', ' ', 'ا', 'ل', 'ل', 'ع', 'ب', '.', ' ', 'ت', 'ر', 'ى', ' ', 'م', 'ز', 'ا', 'ر', 'ع', 'ا', ' ', 'ق', 'ر', 'ب', ' ', 'ب', 'ر', 'ك', 'ت', 'ه', 'ا', '.', ' ', 'ا', 'ل', 'م', 'ز', 'ا', 'ر', 'ع', ' ', 'ل', 'ط', 'ي', 'ف', '.', ' ', 'ي', 'ل', 'و', 'ح', ' ', 'ل', 'ل', 'ب', 'ط', 'ة', '.']
First line rejoined:
	الشمس طلعت! بطة صغيرة تستيقظ. تحب اللعب. ترى مزارعا قرب بركتها. المزارع لطيف. يلوح للبطة.


Go to the **Task 2. Character Tokenizer** section of the lab instructions and click **Check my progress** to verify the objective.

## Task 3. Generating text from an n-gram model

Cymbal Chat wishes to create a simple baseline for text generation, using an n-gram model. They already have code to build an n-gram model, which they adapted from *Lab: Experiment with N-Gram Models* from the course *01 Build Your Own Small Language Model*.

For Task 3, you will write code to generate text from their n-gram model, given a prompt. Your instructions will appear in the "💻 **Your task**" box after their n-gram model functions. First, run Cymbal Chat's functions:

### N-gram model functions

Cymbal Chat already have code to build an n-gram model. Their function `build_ngram_model` takes a corpus of text documents and a simple Arabic character tokenizer and builds an n-gram model from the data.

Run the cell with Cymbal Chat's functions that build an n-gram model. **You are not required to modify any of these functions.**
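
To make the counts-to-probabilities idea concrete before reading the full functions, here is a toy sketch on a four-letter Latin string. It is not Cymbal Chat's code: it inlines the counting and normalization for a bigram model, and the names `counts` and `model` are chosen for this illustration only.

```python
from collections import Counter, defaultdict

text = "abac"  # Toy corpus in Latin letters to keep the counts easy to check.
n = 2          # Bigrams: each context is a single character.

# Count how often each next character follows each context.
counts = defaultdict(Counter)
for i in range(len(text) - n + 1):
    context, next_char = text[i:i + n - 1], text[i + n - 1]
    counts[context][next_char] += 1

# Normalize the counts of each context into conditional probabilities.
model = {
    context: {ch: c / sum(followers.values()) for ch, c in followers.items()}
    for context, followers in counts.items()
}

print(model)
```

Here `"a"` is followed once by `"b"` and once by `"c"`, so each gets probability 0.5, while `"b"` is always followed by `"a"`.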

In [10]:
def generate_character_ngrams(
        text: str,
        n: int,
        tokenizer: SimpleArabicCharacterTokenizer
) -> list[tuple[str, ...]]:
    """Generates character n-grams from a given text.

    Args:
        text: The input text string.
        n: The size of the n-grams.
        tokenizer: A tokenizer that converts text into character tokens.

    Returns:
        A list of n-grams, each represented as a tuple of character tokens.
    """

    # Tokenize text.
    tokens = tokenizer.character_tokenize(text)

    # Construct the list of n-grams.
    ngrams = []

    num_of_tokens = len(tokens)

    # The last n-gram will be tokens[num_of_tokens - n: num_of_tokens].
    for i in range(0, num_of_tokens - n + 1):
        ngrams.append(tuple(tokens[i:i+n]))

    return ngrams

def get_character_ngram_counts(
        dataset: list[str],
        n: int,
        tokenizer: SimpleArabicCharacterTokenizer
) -> dict[str, Counter]:
    """Computes the character n-gram counts from a dataset.

    This function takes a list of text strings (paragraphs or sentences) as
    input, constructs n-grams from each text, and creates a dictionary where:

    * Tokens are individual characters.
    * Keys represent n-1 token long contexts `context`.
    * Values are a Counter object `counts` such that `counts[next_token]` is the
      count of `next_token` following `context`.

    Args:
        dataset: The list of text strings in the dataset.
        n: The size of the n-grams to generate (e.g., 2 for bigrams, 3 for
            trigrams).
        tokenizer: A tokenizer that converts text into character tokens.

    Returns:
        A dictionary where keys are (n-1)-token contexts and values are Counter
        objects storing the counts of each next token for that context.
    """
    ngram_counts = defaultdict(Counter)

    # Loop through all paragraphs.
    for paragraph in dataset:
        # Loop through all n-grams for the paragraph.
        for ngram in generate_character_ngrams(paragraph, n, tokenizer):
            # Extract the context. This will be all but the last token.
            context = "".join(ngram[:-1])
            # Extract the next token. This will be the last token of the n-gram.
            next_token = ngram[-1]
            # Increment the counter for the context - next_token pair by 1.
            ngram_counts[context][next_token] += 1

    return dict(ngram_counts)

def build_ngram_model(
        dataset: list[str],
        n: int,
        tokenizer: SimpleArabicCharacterTokenizer
) -> dict[str, dict[str, float]]:
    """Builds an n-gram language model.

    This function takes a list of text strings (paragraphs or sentences) as
    input, counts n-grams in each text using `get_character_ngram_counts`, and
    converts the counts into probabilities. The resulting model is a dictionary,
    where keys are (n-1)-token contexts and values are dictionaries mapping
    possible next tokens to their conditional probabilities given the context.
    Contexts that never occur in the dataset have no entry in the model, so
    callers must handle unseen contexts themselves.

    Args:
        dataset: A list of text strings representing the dataset.
        n: The size of the n-grams (e.g., 2 for a bigram model).
        tokenizer: A tokenizer that converts text into character tokens.

    Returns:
        A dictionary representing the n-gram language model, where keys are
        (n-1)-token contexts and values are dictionaries mapping possible next
        tokens to their conditional probabilities.
    """
    # A dictionary to store P(B | A).
    ngram_model = {}

    # Use the n-gram counts computed by get_character_ngram_counts.
    ngram_counts = get_character_ngram_counts(dataset, n, tokenizer)

    # Compute Count(A) and P(B | A ) for observed contexts A.
    for context, next_tokens in ngram_counts.items():
        context_total_count = sum(next_tokens.values())
        ngram_model[context] = {}
        for token, count in next_tokens.items():
            ngram_model[context][token] = count / context_total_count

    return ngram_model

### Complete a function that generates text from an n-gram model

------
> 💻 **Your task:**
>
> Complete the `generate_text_from_ngram_model` function.
>
> You should add code to the `generate_text_from_ngram_model` function so that text can be generated using an n-gram model, `ngram_model`. Given a prompt like `start_prompt = "يوم واحد"`, an additional `n_tokens` should be generated, either:
>
> * by randomly sampling the next token (`sampling_mode = "random"`), or
> * by greedily picking the token with highest probability (`sampling_mode = "greedy"`)
>
> at each step.
>
> Further instructions are marked in the `generate_text_from_ngram_model` function that you should complete. The `argmax` function is useful for greedy sampling.
------
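
The difference between the two sampling modes can be seen on a single distribution. The sketch below uses a made-up next-token distribution (the tokens and probabilities are hypothetical) and the standard library's `random.choices`; it illustrates the two strategies and is not the graded function.

```python
import random

# A hypothetical next-token distribution for a single context.
distribution = {"a": 0.7, "b": 0.2, "c": 0.1}
candidates = list(distribution)
probabilities = list(distribution.values())

# Greedy: always the most probable token, so the output is deterministic.
greedy_token = max(distribution, key=distribution.get)

# Random: draw in proportion to the probabilities, so repeated calls vary.
random.seed(0)  # Fixed seed only to make this sketch repeatable.
sampled_tokens = random.choices(candidates, weights=probabilities, k=10)

print(greedy_token, sampled_tokens)
```

Greedy decoding tends to produce repetitive text because the same context always yields the same continuation, whereas random sampling trades some fluency for variety.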


In [11]:
import random
from typing import Literal

# This helper function is provided in the lab.
def argmax(arr: list[float]) -> int:
    """Get the index of the largest element in list of float elements."""
    return max(range(len(arr)), key=arr.__getitem__)

def generate_text_from_ngram_model(
        start_prompt: str,
        n_tokens: int,
        ngram_model: dict[str, dict[str, float]],
        tokenizer: SimpleArabicCharacterTokenizer,
        sampling_mode: Literal["random", "greedy"] = "random"
) -> str:
    """Generate text based on a starting prompt using an ngram model.

    Args:
        start_prompt: The initial prompt to start the generation.
        n_tokens: The number of tokens to generate after the prompt.
        ngram_model: An ngram model mapping contexts of n-1 tokens to distributions
            over next token.
        tokenizer: The tokenizer to encode and decode text.
        sampling_mode: Whether to use random or greedy sampling. Supported
            options are "random" and "greedy".

    Returns:
        The generated text from the prompt.
    """
    # Infer n-1 from the length of the keys in the model
    n_minus_1 = len(next(iter(ngram_model.keys())))

    # Tokenize the starting prompt.
    generated_tokens = tokenizer.character_tokenize(start_prompt)

    for _ in range(n_tokens):
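        # 1. Determine the context for the n-gram model: the last n-1 tokens.
        # (A slice of [-0:] would return the whole list, so handle n = 1.)
        context_tokens = generated_tokens[-n_minus_1:] if n_minus_1 > 0 else []
        context = "".join(context_tokens)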

        # 2. Use the n-gram model to get the conditional probabilities.
        # If context is not in model, we cannot proceed.
        if context not in ngram_model:
            break

        distribution = ngram_model[context]
        possible_next_tokens = list(distribution.keys())
        probabilities = list(distribution.values())

        # 3. Support "random" or "greedy" sampling.
        if sampling_mode == "greedy":
            # Greedily pick the token with the highest probability.
            next_token_index = argmax(probabilities)
            next_token = possible_next_tokens[next_token_index]
        elif sampling_mode == "random":
            # Randomly sample the next token based on its probability.
            next_token = random.choices(possible_next_tokens, weights=probabilities, k=1)[0]
        else:
            raise ValueError(f"Unsupported sampling_mode: {sampling_mode}")

        generated_tokens.append(next_token)

    # 4. Return the complete resulting sequence as a string.
    generated_text = tokenizer.join_text(generated_tokens)
    return generated_text


### Test your text generator

You can experiment with your implementation of `generate_text_from_ngram_model` by running this cell.

✅ All the code has been written for you in this part. **You are not required to add or modify code.**

In [12]:
# Create the tokenizer before it is used to build the model.
tokenizer = SimpleArabicCharacterTokenizer()

# Train n-gram model from dataset.
n = 4  # Size of n-grams.
ngram_model = build_ngram_model(dataset, n, tokenizer)

# Generate text.
start_prompt = "يوم واحد"
print(f"Start prompt is:\n\t{start_prompt}")
n_tokens = 15  # Specify the number of new tokens to generate.

generated_text = generate_text_from_ngram_model(
    start_prompt,
    n_tokens,
    ngram_model,
    tokenizer,
    sampling_mode="random")

print(f"Text generated is:\n\t{display_arabic(generated_text)}")

# Do not remove or modify this logging call, it will be used for tracking purposes
logger.info(f'Task 3: The total word count for the generated text is: {len(generated_text.split())}')

Start prompt is:
	يوم واحد
Text generated is:
	‮يوم واحدة. سيلعبان مودع‬


Go to the **Task 3. Generate Text from an n-gram Model** section of the lab instructions and click **Check my progress** to verify the objective.

## Task 4. Preparing dataset for training character-based language model  

Cymbal Chat wants to develop better language models. To `encode` tokens into indices and to `decode` indices back to text tokens, Cymbal Chat extended the `SimpleArabicCharacterTokenizer` that you developed previously. They adapted code from *Lab: Prepare The Dataset For Training an SLM* and *Lab: Train Your Own Small Language Model* from the course *01 Build Your Own Small Language Model*. They called their class `EnhancedTokenizer`.

The input to their transformer models is capped at a maximum number of tokens that the model can process. This number is called the model's **context length**. However, some of the stories that Cymbal Chat wants to train on have more tokens than the model's context length, so they cannot be passed to the model in one piece.

For Task 4, you have to write code that will break token sequences up into overlapping segments that will each fit in the context length.
This will support different sizes of transformer models including some with a limited maximum number of tokens. Your instructions will appear in the "💻 **Your task**" box after Cymbal Chat's enhanced tokenizer class. First, run Cymbal Chat's code:

### An enhanced character-based tokenizer

Cymbal Chat's `EnhancedTokenizer` class adds padding and unknown tokens to the vocabulary built by the simple Arabic character tokenizer.

Run the cell with Cymbal Chat's tokenizer class. **You are not required to modify the class or any of its methods.**
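
Before reading the class, it may help to see the encode and decode idea on a toy vocabulary. This sketch is standalone and does not use `EnhancedTokenizer`; it only mirrors the layout that class uses (padding token first, unknown token last), and the dictionary names are chosen for this illustration.

```python
PAD_TOKEN, UNKNOWN_TOKEN = "<PAD>", "<UNK>"

# Vocabulary built from a toy corpus, with the special tokens added at
# the two ends.
corpus_characters = sorted(set("abc"))
vocabulary = [PAD_TOKEN] + corpus_characters + [UNKNOWN_TOKEN]

token_to_index = {token: index for index, token in enumerate(vocabulary)}
index_to_token = {index: token for token, index in token_to_index.items()}
unknown_id = token_to_index[UNKNOWN_TOKEN]

# "z" is not in the vocabulary, so it is encoded as the unknown index.
encoded = [token_to_index.get(ch, unknown_id) for ch in "abz"]
decoded = "".join(index_to_token[i] for i in encoded)

print(encoded, decoded)
```

Note that decoding is lossy for out-of-vocabulary characters: the original `"z"` comes back as the literal `<UNK>` marker.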

In [13]:
class EnhancedTokenizer(SimpleArabicCharacterTokenizer):
    """
    Tokenizer provides two key additional functions:
    `encode` method to convert the text into a sequence of indices and the
    `decode` method to convert indices back into text.

    """
    # Define constants.
    UNKNOWN_TOKEN = "<UNK>"
    PAD_TOKEN = "<PAD>"

    def __init__(self, corpus: list[str], vocabulary: list[str] | None = None):
        """Initializes the tokenizer with texts in corpus or with a vocabulary.

        Args:
          corpus: Input text dataset
          vocabulary: A pre-defined vocabulary. If None,
              the vocabulary is automatically inferred from the texts.
        """
        super().__init__()
        if vocabulary is None:
            # Build the vocabulary from scratch.
            if isinstance(corpus, str):
                corpus = [corpus]

            # Convert input text sequence to tokens.
            tokens = []
            for text in corpus:
                tokens.extend(self.character_tokenize(text))

            # Create a vocabulary of unique tokens.
            vocabulary = self.build_vocabulary(tokens)

            # Add special unknown and pad tokens to the vocabulary list.
            self.vocabulary = (
                [self.PAD_TOKEN] + vocabulary + [self.UNKNOWN_TOKEN]
            )

        else:
            self.vocabulary = vocabulary

        # Size of vocabulary.
        self.vocabulary_size = len(self.vocabulary)

        # Create token-to-index and index-to-token mappings.
        self.token_to_index = {}
        self.index_to_token = {}
        # Loop through all tokens in the vocabulary. enumerate automatically
        # assigns a unique index to each token.
        for index, token in enumerate(self.vocabulary):
            self.token_to_index[token] = index
            self.index_to_token[index] = token

        # Map the special tokens to their IDs.
        self.pad_token_id = self.token_to_index[self.PAD_TOKEN]
        self.unknown_token_id = self.token_to_index[self.UNKNOWN_TOKEN]

    def build_vocabulary(self, tokens: list[str]) -> list[str]:
        """Create a vocabulary list from the list of tokens.

        Args:
            tokens: The list of tokens in the dataset.

        Returns:
            List of unique tokens (vocabulary) in the dataset.
        """
        # Build vocabulary of tokens.
        vocabulary = sorted(list(set(tokens)))
        return vocabulary

    def encode(self, text: str) -> list[int]:
        """Encodes a text sequence into a list of indices.

        Args:
            text: The input text to be encoded.

        Returns:
            A list of indices corresponding to the tokens in the input text.
        """
        # Convert tokens into indices, replacing out-of-vocabulary tokens
        # with the index of "<UNK>".
        indices = []
        for token in self.character_tokenize(text):
            if token in self.token_to_index:
                token_index = self.token_to_index[token]
            else:
                token_index = self.unknown_token_id
            indices.append(token_index)

        return indices

    def decode(self, indices: int | list[int]) -> str:
        """Decodes a list (or single index) of integers back into tokens.

        Args:
            indices: A single index or a list of indices to be
                decoded into tokens.

        Returns:
            A string of decoded tokens corresponding to the input indices.
        """
        # Accept a single index by wrapping it in a list.
        if isinstance(indices, int):
            indices = [indices]

        # Map indices to tokens, falling back to the unknown token string for
        # indices that are not in the vocabulary.
        tokens = []
        for index in indices:
            token = self.index_to_token.get(index, self.UNKNOWN_TOKEN)
            tokens.append(token)

        # Join the decoded tokens into a single string.
        text = self.join_text(tokens)
        return text

### Complete a function that segments an encoded sequence

------
> 💻 **Your task:**
>
> Complete the `segment_encoded_sequence` function.
>
> This function takes an encoded sequence of token ids, `sequence`, as input. It should segment the sequence into subsequences of at most `max_length` length with specified overlap `n_overlap`.
>
> For example, a text string might be encoded into the sequence of token indices:
>
> ```
> [17, 40, 30, 41, 29, 2, 33, 40, 35, 20]
> ```
>
> If a model's input context length is only three tokens, `max_length=3`, then the longest subsequence is three tokens long. If one token overlaps between subsequences, `n_overlap=1`, then `segment_encoded_sequence` should return the following `subsequences` list of lists:
> ```
> [ [17, 40, 30],
>   [30, 41, 29],
>   [29, 2, 33],
>   [33, 40, 35],
>   [35, 20] ]
> ```
>
> It is clear that the original sequence `[17, 40, 30, 41, 29, 2, 33, 40, 35, 20]` can be reconstructed by simple concatenation of the subsequences with overlaps removed.
>
> Detailed instructions are marked in the `segment_encoded_sequence` function that you should complete.
------
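
The reconstruction claim in the box above can be checked with a few lines. The helper below, `reconstruct_sequence`, is a hypothetical name introduced for this sketch and is not part of the lab; it assumes every subsequence after the first begins with exactly `n_overlap` repeated tokens.

```python
def reconstruct_sequence(
    subsequences: list[list[int]], n_overlap: int
) -> list[int]:
    """Rebuild the original sequence from overlapping subsequences."""
    if not subsequences:
        return []
    # Keep the first subsequence whole, then skip the first `n_overlap`
    # tokens of every later one, since they repeat the previous tail.
    sequence = list(subsequences[0])
    for subsequence in subsequences[1:]:
        sequence.extend(subsequence[n_overlap:])
    return sequence


subsequences = [[17, 40, 30], [30, 41, 29], [29, 2, 33], [33, 40, 35], [35, 20]]
print(reconstruct_sequence(subsequences, n_overlap=1))
```

A round trip like this is a convenient sanity check for your own `segment_encoded_sequence` implementation.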

In [14]:
def segment_encoded_sequence(
        sequence: list[int],
        max_length: int,
        n_overlap: int
) -> list[list[int]]:
    """Segment a long encoded sequence into overlapping subsequences of maximum
    length.

    Divides the input sequence into chunks of max_length tokens with specified
    overlap between consecutive segments. The final segment may be shorter than
    max_length if insufficient tokens remain.

    Args:
        sequence: List of token indices to segment.
        max_length: Maximum length for each subsequence.
        n_overlap: Number of tokens to overlap between consecutive segments.

    Returns:
        List of subsequences, each with at most max_length token indices. All
        segments except possibly the last will have exactly max_length tokens.
    """
    subsequences = []

    # Calculate the step size to move the window forward for each new segment.
    # This ensures the specified overlap between consecutive segments.
    step = max_length - n_overlap

    # Iterate through the sequence with the calculated step size.
    for i in range(0, len(sequence), step):
        # Extract the subsequence of max_length. Python's slicing handles the
        # end of the list gracefully, creating a shorter final segment if
        # i + max_length exceeds the sequence length.
        subsequence = sequence[i:i + max_length]
        subsequences.append(subsequence)

    return subsequences


### Test your segmenter

You can experiment with your implementation of `segment_encoded_sequence` by running this cell.

✅ All the code has been written for you in this part. **You are not required to add or modify code.**

In [15]:
def segment_encoded_sequence(
        sequence: list[int],
        max_length: int,
        n_overlap: int
) -> list[list[int]]:
    """Segment a long encoded sequence into overlapping subsequences of maximum
    length.

    Divides the input sequence into chunks of max_length tokens with specified
    overlap between consecutive segments. The final segment may be shorter than
    max_length if insufficient tokens remain.

    Args:
        sequence: List of token indices to segment.
        max_length: Maximum length for each subsequence.
        n_overlap: Number of tokens to overlap between consecutive segments.

    Returns:
        List of subsequences, each with at most max_length token indices. All
        segments except possibly the last will have exactly max_length tokens.
    """
    subsequences = []

    # 1. Repeatedly take a segment of (up to) `max_length` from the sequence.
    # 2. Ensure that there is an overlap of n_overlap tokens between
    #    consecutive subsequences.
    # 3. Handle the case when the final segment may be shorter than max_length.

    # Calculate the step size for moving the window.
    step = max_length - n_overlap

    # Iterate through the sequence with the calculated step size.
    for i in range(0, len(sequence), step):
        # Extract the subsequence of max_length.
        subsequence = sequence[i:i + max_length]
        subsequences.append(subsequence)

    return subsequences
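
As a quick experiment, the sketch below runs the same sliding-window logic on a toy sequence. The function is repeated in compact form only so that the snippet runs on its own; the values of `toy_sequence`, `max_length` and `n_overlap` are arbitrary choices for illustration.

```python
def segment_encoded_sequence(sequence, max_length, n_overlap):
    # Same logic as the cell above, written as a list comprehension.
    step = max_length - n_overlap
    return [sequence[i:i + max_length] for i in range(0, len(sequence), step)]


toy_sequence = list(range(10))  # Stand-in for a list of token ids.
segments = segment_encoded_sequence(toy_sequence, max_length=4, n_overlap=1)

for segment in segments:
    print(segment)
# [0, 1, 2, 3]
# [3, 4, 5, 6]
# [6, 7, 8, 9]
# [9]
```

Each segment starts with the last token of the previous one, which is the requested overlap of one token. Note that the final segment `[9]` contains only a token that already closes the previous segment; this follows from iterating over every start position up to the end of the sequence.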



Go to the sub-section **Build segment_encoded_sequence** in the section **Task 4. Prepare dataset for training character-based language model** of the lab instructions and click **Check my progress** to verify the objective.

### Complete a function that creates training inputs and targets

Now that you can segment a long sequence of token ids into subsequences of a maximum length, the last task is to take a dataset (of type `list[str]`) and turn it into training inputs and targets:

* `inputs`: an array of token sequences, each of length `context_length`.
* `targets`: an array of target sequences (inputs shifted by one position), each of length `context_length`.
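
The relationship between `inputs` and `targets` can be sketched with plain NumPy. In this illustration the segments, the value `context_length = 4` and the padding id `0` are made-up assumptions, and the padding loop is only a stand-in for the padding step that the lab provides.

```python
import numpy as np

context_length = 4
pad_token_id = 0  # Assumed padding id, for illustration only.

# Two hypothetical segments of at most context_length + 1 token ids.
segments = [[5, 6, 7, 8, 9], [9, 3, 2]]

# Pad each segment on the right so every row has context_length + 1 entries.
padded = np.full(
    (len(segments), context_length + 1), pad_token_id, dtype=np.int32
)
for row, segment in enumerate(segments):
    padded[row, :len(segment)] = segment

inputs = padded[:, :-1]   # Every token except the last one.
targets = padded[:, 1:]   # Every token except the first one.

print(inputs)
# [[5 6 7 8]
#  [9 3 2 0]]
print(targets)
# [[6 7 8 9]
#  [3 2 0 0]]
```

At every position, the target is the token that follows the corresponding input token, which is why the segments are one token longer than `context_length`.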

<br />

------
> 💻 **Your task:**
>
> Complete the `create_training_sequences` function.
>
> The function should encode and segment the complete dataset, convert the subsequences to padded `numpy` arrays, and then prepare the `inputs` and `targets` arrays for training a transformer model.
>
> Your task is to add code that creates a list of lists called `encoded_tokens` from an input `dataset`. Each inner list in `encoded_tokens` represents a sequence of scalars (e.g., integers for tokenized text).
>
> Code to then pad and format `encoded_tokens` into `inputs` and `targets` is provided for you.
------

In [16]:
import numpy as np
import keras
# The following are assumed to be defined or imported in your environment
# from the previous steps of the lab.
# - class EnhancedTokenizer
# - def segment_encoded_sequence(...)

def create_training_sequences(
        dataset: list[str],
        context_length: int,
        n_overlap: int,
        tokenizer: 'EnhancedTokenizer'  # Use quotes for forward reference if needed
) -> tuple[np.ndarray, np.ndarray]:
    """Create training input-target sequence pairs from text dataset.

    Encodes text data into token sequences, segments them into fixed-length
    overlapping windows, and creates input-target pairs for language modeling
    where targets are inputs shifted by one position.

    Args:
        dataset: List of text strings to process into training sequences.
        context_length: Maximum sequence length for model input.
        n_overlap: Number of tokens to overlap between consecutive segments.
        tokenizer: Tokenizer object with an encode method for text-to-tokens
            conversion.

    Returns:
        Tuple of (inputs, targets) where:
        - inputs: Array of token sequences of length context_length.
        - targets: Array of target sequences (inputs shifted by one position).
    """

    segmentation_length = context_length + 1
    # The segments are one token longer than the model's maximum input length
    # because the target (next) tokens to predict are the input tokens shifted
    # by one position.

    pad_token_id = tokenizer.pad_token_id
    encoded_tokens = []

    # 1. Iterate over the entries in the dataset.
    for text in dataset:
        # 2. For each dataset entry (text), encode it into a sequence of token ids.
        sequence_ids = tokenizer.encode(text)

        # 3. Segment the sequence of token ids into overlapping segments.
        #    This uses the `segment_encoded_sequence` function from the previous task.
        segments = segment_encoded_sequence(
            sequence_ids, segmentation_length, n_overlap
        )

        # 4. Include the segments in the list of all encoded tokens.
        encoded_tokens.extend(segments)

    # The code below is provided in the lab. It pads every segment in
    # `encoded_tokens` to `segmentation_length`, which is one token longer
    # than the maximum input length.
    padded_sequences = keras.preprocessing.sequence.pad_sequences(
            encoded_tokens,
            maxlen=segmentation_length,
            padding="post",
            value=pad_token_id)

    # Create inputs and targets from the padded sequences.
    # `inputs` are all tokens except the last one.
    inputs = padded_sequences[:, :-1]
    # `targets` are all tokens except the first one.
    targets = padded_sequences[:, 1:]
    return inputs, targets


### Test your training sequence creation function

You can experiment with your implementation of `create_training_sequences` by running this cell.

✅ All the code has been written for you in this part. **You are not required to add or modify code.**

In [17]:
import numpy as np
import keras

def create_training_sequences(
        dataset: list[str],
        context_length: int,
        n_overlap: int,
        tokenizer: 'EnhancedTokenizer' # Using quotes for forward reference
) -> tuple[np.ndarray, np.ndarray]:
    """Create training input-target sequence pairs from text dataset.

    Encodes text data into token sequences, segments them into fixed-length
    overlapping windows, and creates input-target pairs for language modeling
    where targets are inputs shifted by one position.

    Args:
        dataset: List of text strings to process into training sequences.
        context_length: Maximum sequence length for model input.
        n_overlap: Number of tokens to overlap between consecutive segments.
        tokenizer: Tokenizer object with encode method for text-to-tokens
            conversion.

    Returns:
        Tuple of (inputs, targets) where:
        - inputs: Array of token sequences of length context_length.
        - targets: Array of target sequences (inputs shifted by one position).
    """
    segmentation_length = context_length + 1
    pad_token_id = tokenizer.pad_token_id
    encoded_tokens = []

    # 1. Iterate over the entries in the dataset.
    for text in dataset:
        # 2. Encode the text into a sequence of token ids.
        sequence_ids = tokenizer.encode(text)

        # 3. Segment the sequence of token ids into overlapping segments.
        segments = segment_encoded_sequence(
            sequence_ids, segmentation_length, n_overlap
        )

        # 4. Include the segments in the list of encoded tokens.
        encoded_tokens.extend(segments)

    # The code below is provided in the lab.
    # Create padded sequences one token longer than the maximum input length.
    padded_sequences = keras.preprocessing.sequence.pad_sequences(
            encoded_tokens,
            maxlen=segmentation_length,
            padding="post",
            value=pad_token_id)

    # Create inputs and targets from padded sequences.
    inputs = padded_sequences[:, :-1]
    targets = padded_sequences[:, 1:]
    return inputs, targets
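
To see the whole pipeline on a tiny dataset without the lab environment, the sketch below uses simplified stand-ins. `ToyTokenizer`, `segment` and `toy_training_sequences` are hypothetical names, the tokenizer simply assigns one id per character with `0` assumed to be the padding id, and NumPy padding replaces the Keras call used above.

```python
import numpy as np


class ToyTokenizer:
    """Hypothetical stand-in for the lab's tokenizer: one id per character."""

    def __init__(self, texts):
        vocabulary = sorted(set("".join(texts)))
        self.pad_token_id = 0  # Assumed: id 0 is reserved for padding.
        self.char_to_id = {ch: i + 1 for i, ch in enumerate(vocabulary)}

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]


def segment(sequence, max_length, n_overlap):
    # Same sliding-window logic as segment_encoded_sequence.
    step = max_length - n_overlap
    return [sequence[i:i + max_length] for i in range(0, len(sequence), step)]


def toy_training_sequences(dataset, context_length, n_overlap, tokenizer):
    length = context_length + 1
    rows = []
    for text in dataset:
        rows.extend(segment(tokenizer.encode(text), length, n_overlap))
    # Right-pad every row to the same length (stand-in for pad_sequences).
    padded = np.full((len(rows), length), tokenizer.pad_token_id, dtype=np.int32)
    for r, row in enumerate(rows):
        padded[r, :len(row)] = row
    return padded[:, :-1], padded[:, 1:]


dataset = ["كتاب", "كتابي وكتابك"]
tokenizer = ToyTokenizer(dataset)
inputs, targets = toy_training_sequences(
    dataset, context_length=4, n_overlap=1, tokenizer=tokenizer
)

print(inputs.shape, targets.shape)
# The targets are the inputs shifted left by one position.
print(np.array_equal(inputs[:, 1:], targets[:, :-1]))
```

Both arrays have `context_length` columns, and the final check confirms the one-position shift between inputs and targets.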


Go to the sub-section **Build create_training_sequences** in the section **Task 4. Prepare dataset for training character-based language model** of the lab instructions and click **Check my progress** to verify the objective.

## Summary

This is the end of the Challenge Lab for the **Build Your Own Small Language Model** course. In this challenge lab, you have:

- Developed a character-based tokenizer for Arabic text.
- Built a function to generate text from an n-gram model, using both random and greedy sampling.
- Implemented functions to segment and prepare character-encoded data for training transformer-based language models of varying sizes.

## References

[1] Mohamed, M. and Al-Azani, S. (2025). Enhancing Arabic NLP Tasks through Character-Level Models and Data Augmentation. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 2744-2757). Retrieved from [https://aclanthology.org/2025.coling-main.186.pdf](https://aclanthology.org/2025.coling-main.186.pdf)
