[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W3E_BPE_Transduction.ipynb)

In [1]:
# Run in Colab to install local packages
!pip install spacy transformers datasets
!pip install sentencepiece datasets simplet5 tokenizers
!python -m spacy download en_core_web_sm


# BPE Tokenizer

*This exercise follows the Huggingface tutorial on BPE tokenization, [Build a Tokenizer from Scratch](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#build-a-tokenizer-from-scratch). Adapted from a notebook by Wietse de Vries.*

The [Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) library by Huggingface provides implementations of today’s most used tokenizers (especially subword-based ones) that are both easy to use and blazing fast (Rust-compiled code!).

You will start by exploring the impact of different vocabulary sizes on a subword tokenizer using the Tokenizers library, and how such tokenizers can be imported and used with spaCy. Finally, you will be asked to train a small transformer model to perform transduction from masculine to feminine words.

Exercise 1 is mandatory and will be part of your graded midterm portfolio. Exercise 2 is optional, but we highly recommend completing it, especially if you're interested in the "Modern Neural Networks meet Linguistic Theory" final project.

## Exercise 1: Byte Pair Encoding with Huggingface Tokenizers

In the following exercise, we will use a byte-pair encoding (BPE) tokenizer (see Jurafsky & Martin, Sec. 2.4.3, and [Sennrich et al., 2015](https://aclanthology.org/P16-1162/)) to create a vocabulary of frequent words and subwords, allowing us to handle less frequent words.

### Setup

The following code loads a BPE tokenizer and trainer, tells the system to use whitespace as a separator and defines `[UNK]` as a special token intended to handle unknown words.

In [3]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=20000)

### Corpus

The tokenizer creates a dictionary by concatenating characters and substrings into longer strings (possibly full words) based on frequency. So we need a corpus to learn what the most frequent words and substrings are.
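To make the merging idea concrete, here is a minimal pure-Python sketch of the BPE learning loop on a toy word-count table. It is an illustration only, not the Tokenizers implementation: the function name `learn_bpe_merges` and the toy counts are made up for this example, there is no end-of-word marker, and ties between equally frequent pairs are simply broken by first occurrence.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Learn up to num_merges BPE merges from a {word: count} mapping."""
    # Represent each word as a tuple of symbols, starting from single characters
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair (ties: first pair encountered, a simplification)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus statistics
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe_merges(toy_counts, num_merges=3)
print(merges)           # learned merges, in order
print(list(segmented))  # how each word is segmented after the merges
```

With only three merges, the frequent ending "est" already becomes a single symbol, while rarer character sequences stay split. A larger merge budget (a larger vocabulary) lets more whole words end up as single symbols.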

[Wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) is a collection of over 100 million tokens extracted from verified Good and Featured articles on the English Wikipedia. You can use the `train_from_iterator` method to train on data held in memory; here the data comes from the `wikitext` corpus in the [Huggingface Datasets library](https://huggingface.co/datasets/wikitext).

### Run the trainer

The command below trains the tokenizer on the data:

In [4]:
import datasets
dataset = datasets.load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train+test+validation"
)

# Build a generator to iterate over the dataset
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

### Test the tokenizer

Now that we have created a vocabulary, we can use it to tokenize a string into words and subtokens (for infrequent words).

The example shows that most of the words are included in the vocabulary created by training on Wikipedia text, but that the acronym *UG*, the name *Hanze*, and the words *Applied* and *initiating* are segmented into subword strings. This suggests that these words were seen rarely or not at all during training. (*UG* occurs 5 times in the training data and *Applied* over 200 times; note also that the encoding is case-sensitive.)

Try a few other examples to get a feeling for the lexical coverage of the tokenizer.

In [6]:
def show_tokens(text):
    output = tokenizer.encode(text)
    print(f"Tokens: {output.tokens}")
    number_of_words = len(tokenizer.pre_tokenizer.pre_tokenize_str(text))
    number_of_segments = len(output.tokens)
    print(f"{number_of_words} words and {number_of_segments} segments")

example = "The UG and the Hanze University of Applied Sciences are jointly initiating a pilot rapid testing centre, which will start on 18 January."
show_tokens(example)

Tokens: ['The', 'U', 'G', 'and', 'the', 'Han', 'ze', 'University', 'of', 'Appl', 'ied', 'Sciences', 'are', 'jointly', 'initi', 'ating', 'a', 'pilot', 'rapid', 'testing', 'centre', ',', 'which', 'will', 'start', 'on', '18', 'January', '.']
25 words and 29 segments


### Your Turn: Experiment with Vocabulary Size

The training data contains 103M tokens and about 267,000 unique types. By default, the trainer builds a vocabulary of at most 30,000 entries (words and subwords). This means that a fair amount of compression takes place. Even more compression can be achieved by setting `vocab_size` to a smaller value.

1. Choose an example text consisting of at least 100 words. You may want to ensure that it contains some rare words or tokens.

2. Experiment with various settings for vocab_size.

3. Count the number of words in the example, and the number of segments created by the BPE tokenizer. Note that if the number of segments goes up, more words are segmented into subwords.

4. What is the vocabulary size where the number of segments is approx. 150% of the number of words?

5. For this setting, what was the longest word in your example text that was not segmented?
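For steps 3 and 4, it can help to express the comparison as a single number. The sketch below computes the average number of segments per word from two counts. The helper name `segments_per_word` is made up for this example; the counts are the ones printed for the example sentence earlier (25 words, 29 segments).

```python
def segments_per_word(num_words, num_segments):
    """Return the average number of subword segments per pre-tokenized word."""
    if num_words == 0:
        raise ValueError("the text contains no words")
    return num_segments / num_words

# Counts reported for the example sentence: 25 words, 29 segments
ratio = segments_per_word(25, 29)
print(f"{ratio:.2f} segments per word")
```

A ratio of 1.0 means no word was split; the target in step 4 corresponds to a ratio of about 1.5.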

In [7]:
# TODO: Try with various vocab_sizes
# Important: You will need to redefine the tokenizer for every new vocab size,
# otherwise you might run into a "PanicError: no entry found for key" exception
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"],vocab_size=30000)

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

test_text = "Enter some English text containing at least 100 words"

show_tokens(test_text)

# Answer question 5 by going over the output, or write a
# few lines of code to provide the answer.

Tokens: ['Enter', 'some', 'English', 'text', 'containing', 'at', 'least', '100', 'words']
9 words and 9 segments


### Loading the BPE Tokenizer into spaCy

Now that you have experimented with creating tokenizers using Huggingface Tokenizers, you might want to move them to a more familiar environment. The following class lets you load a Huggingface tokenizer into spaCy: the `get_words_spaces` method records which tokens are followed by whitespace, so that spacing is preserved for tokens that are not word pieces.
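A spaCy `Doc` can be built from two parallel lists: the token strings (`words`) and one boolean per token saying whether a space follows it (`spaces`). The standard-library sketch below illustrates that alignment without spaCy. The helper names `align_words_spaces` and `rebuild` and the example segmentation are hypothetical, and the sketch assumes every token appears verbatim in the text.

```python
def align_words_spaces(text, words):
    """For each word, record whether it is directly followed by a space in text."""
    spaces = []
    cursor = 0
    for word in words:
        start = text.find(word, cursor)
        if start == -1:
            raise ValueError(f"{word!r} not found in text after position {cursor}")
        end = start + len(word)
        spaces.append(end < len(text) and text[end] == " ")
        cursor = end
    return spaces

def rebuild(words, spaces):
    """Concatenate words, adding a space after each word flagged True."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))

text = "Hanze University, Groningen"
words = ["Han", "ze", "University", ",", "Groningen"]  # hypothetical segmentation
spaces = align_words_spaces(text, words)
print(spaces)
print(rebuild(words, spaces) == text)
```

Word pieces inside a word get `False`, which is what lets the original text be reconstructed exactly from the two lists.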

### Your Turn: Fill in the missing code

Your task is to complete the `__call__` method of the `BPETokenizer` class to go from text to spaCy `Docs`, and finally to print the tokenized text.

In [9]:
from spacy.tokens import Doc
from spacy.vocab import Vocab
import spacy
from transformers import GPT2Tokenizer

class BPETokenizer:
    def __init__(self, tokenizer, vocab):
        self.tokenizer = tokenizer
        self.vocab = vocab

    def get_words_spaces(self, tokens, text):
        words = []
        spaces = []
        start_idx = 0

        for token in tokens:
            token_str = self.tokenizer.convert_ids_to_tokens(token)
            # GPT-2 marks a preceding space with 'Ġ'; strip it before searching
            # the raw text, where that marker never occurs
            surface = token_str.lstrip("Ġ")
            word_start = text.find(surface, start_idx)
            words.append(token_str)
            # Check if there’s a space between the current and next token
            word_end = word_start + len(surface)
            spaces.append(word_end < len(text) and text[word_end] == ' ')
            start_idx = word_end
        spaces[-1] = False  # The last token doesn't have a space after it
        return words, spaces

    def __call__(self, text):
        # Tokenize the text to get input IDs
        encoded = self.tokenizer.encode(text)
        tokens = encoded  # Use token IDs for tracking tokens

        # Use get_words_spaces to obtain the words and spaces
        words, spaces = self.get_words_spaces(tokens, text)

        # Return the spaCy Doc object
        return Doc(self.vocab, words=words, spaces=spaces)

# Initialize the Huggingface tokenizer (e.g., GPT2)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Now let's set up spaCy pipeline with the custom tokenizer
nlp = spacy.blank("en")
nlp.vocab = Vocab(strings=[tok for tok in tokenizer.get_vocab().keys()])
nlp.tokenizer = BPETokenizer(tokenizer, nlp.vocab)

# Test with the text
text = "Jeff Bezos is a billionaire who became famous after the Dutch bridge controversy."

# Tokenize the text using the custom tokenizer
doc = nlp(text)

# Print the tokenized text
print([token.text for token in doc])

['Jeff', 'ĠBezos', 'Ġis', 'Ġa', 'Ġbillionaire', 'Ġwho', 'Ġbecame', 'Ġfamous', 'Ġafter', 'Ġthe', 'ĠDutch', 'Ġbridge', 'Ġcontroversy', '.']


## Exercise 2: Tokenization in different languages

Most tokenizers we have seen so far are trained on English data. They often work reasonably well for the English language, but what about other languages? A major issue in multilingual Large LMs is that their shared subword vocabulary favors high-resource languages at the cost of the low-resource ones.

To understand this problem, you can run different subword tokenizers on translations of the same sentence in different languages.

Below, the first tokenizer belongs to the *multilingual* [BLOOM](https://huggingface.co/bigscience/bloom) model, which was trained on a mix of more than 40 languages. While the model developers strove to balance the amount of data across languages, the distribution remains strongly uneven, with most data being in English, followed by Chinese and French ([training data](https://huggingface.co/bigscience/bloom#training-data)). This imbalance is even stronger in newer, larger LMs.

The second tokenizer was instead trained on a 100 MB *monolingual* corpus ([GoldfishLM-NLD](https://huggingface.co/goldfish-models/nld_latn_100mb)). This corpus is much smaller than the previous one, but contains text in only one language.

In [10]:
# Import the three different types of tokenizers
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the multilingual (BLOOM) tokenizer
tokenizer_bloom = AutoTokenizer.from_pretrained("bigscience/bloom")

# Load monolingual tokenizers for several languages
# (Here English and Dutch, but many more Goldfish models are available!)
tokenizer_eng = AutoTokenizer.from_pretrained("goldfish-models/eng_latn_100mb")
tokenizer_nld = AutoTokenizer.from_pretrained("goldfish-models/nld_latn_100mb")


In [11]:
# Here are 4 sentences taken from the Flores-200 dataset (https://github.com/facebookresearch/flores).
# This dataset contains translations of the same sentences in 200 languages.

sentences_eng = [
    'The pilot was identified as Squadron Leader Dilokrit Pattavee.',
    'Local media reports an airport fire vehicle rolled over while responding.',
    '28-year-old Vidal had joined Barça three seasons ago, from Sevilla.',
    'Since moving to the Catalan-capital, Vidal had played 49 games for the club.'
]

sentences_nld = [
    'De piloot werd geïdentificeerd als majoor Dilokrit Pattavee.',
    'De lokale media meldt dat er tijdens een actie op de luchthaven een brandweerwagen is gekanteld.',
    'De 28-jaar oude Vidal is drie seizoenen geleden van Sevilla naar Barça overgestapt.',
    'Sinds hij verhuisde naar de Catalaanse hoofdstad heeft hij 49 wedstrijden gespeeld voor de club.'
    ]

In [12]:
for sentences,tokenizer_mono in zip([sentences_eng, sentences_nld],
                                    [tokenizer_eng, tokenizer_nld]):
  for sent in sentences:
    # print the token segmentations of each tokenizer
    subwords_bloom = tokenizer_bloom.tokenize(sent)
    subwords_mono  = tokenizer_mono.tokenize(sent)

    # for better visualization:
    subwords_bloom = [s.replace('Ġ','▁') for s in subwords_bloom]

    print(' '.join(subwords_bloom))
    print(' '.join(subwords_mono))

    # calculate the length and see which one is the shortest/longest
    # TODO: include the length results in your analysis
    print(len(subwords_bloom), len(subwords_mono))
    print()

The ▁pilot ▁was ▁identified ▁as ▁Squad ron ▁Leader ▁Dil ok rit ▁Pat ta ve e .
▁The ▁pilot ▁was ▁identified ▁as ▁Squadron ▁Leader ▁Di lok rit ▁Pat ta ve e .
16 15

Local ▁media ▁reports ▁an ▁airport ▁fire ▁vehicle ▁rolled ▁over ▁while ▁responding .
▁Local ▁media ▁reports ▁an ▁airport ▁fire ▁vehicle ▁rolled ▁over ▁while ▁responding .
12 12

28 -year-old ▁Vidal ▁had ▁joined ▁BarÃ§a ▁three ▁seasons ▁ago , ▁from ▁Sevilla .
▁28- year - old ▁Vid al ▁had ▁joined ▁Bar ç a ▁three ▁seasons ▁ago , ▁from ▁Sevilla .
13 18

Since ▁moving ▁to ▁the ▁Catalan -c apital , ▁Vidal ▁had ▁played ▁49 ▁games ▁for ▁the ▁club .
▁Since ▁moving ▁to ▁the ▁Catalan - capital , ▁Vid al ▁had ▁played ▁49 ▁games ▁for ▁the ▁club .
17 18

De ▁p ilo ot ▁werd ▁ge Ã¯d ent ifice erd ▁als ▁maj o or ▁Dil ok rit ▁Pat ta ve e .
▁De ▁piloot ▁werd ▁geïdentificeerd ▁als ▁ majoor ▁Di lok rit ▁Pat ta vee .
22 14

De ▁lok ale ▁media ▁mel dt ▁dat ▁er ▁ti jd ens ▁een ▁act ie ▁op ▁de ▁luch th aven ▁een ▁brand we er wagen ▁is ▁gek ant eld .


EXERCISE:

Consider the 40+ languages supported by BLOOM ([training data](https://huggingface.co/bigscience/bloom#training-data)).

(a) What do you expect to determine the tokenizer's behavior on different languages? Is training data size the only explanation? What other factors may determine the tokenizer behavior?

[*Provide a textual answer for (a) in a paragraph of 4-5 sentences.*]

(b) Based on your reflection, choose 2 languages on which you expect the multilingual tokenizer to do a decent job, and 2 others on which you expect it to work very poorly (i.e. to segment the text very aggressively).

For each of the 4 languages, provide the subword counts on the FLORES benchmark by the multilingual ([BLOOM](https://huggingface.co/bigscience/bloom)) tokenizer versus that of the monolingual ([Goldfish](https://huggingface.co/collections/goldfish-models/100mb-goldfish-66c3c17d7be2e67389bfa67f)) tokenizer.

Note: In a given language, a large difference between the multilingual tokenizer’s subword count and that of the corresponding monolingual tokenizer can be taken as a proxy of the underperformance of the multilingual tokenizer.
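One simple way to turn this proxy into a number is the ratio between the two subword counts. The sketch below is only an illustration: the helper name `relative_overhead` is made up, and the counts are the ones printed earlier for the first English sentence (16 vs. 15) and the first Dutch sentence (22 vs. 14).

```python
def relative_overhead(multilingual_counts, monolingual_counts):
    """Ratio of total multilingual subwords to total monolingual subwords."""
    total_mono = sum(monolingual_counts)
    if total_mono == 0:
        raise ValueError("monolingual counts sum to zero")
    return sum(multilingual_counts) / total_mono

# (multilingual, monolingual) counts for the first sentence in each language
overhead_eng = relative_overhead([16], [15])
overhead_nld = relative_overhead([22], [14])
print(f"English: {overhead_eng:.2f}, Dutch: {overhead_nld:.2f}")
```

A value close to 1 means the multilingual tokenizer segments the text about as compactly as the monolingual one; larger values indicate more aggressive segmentation. Passing lists with one count per sentence aggregates over a whole sample.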

[*Provide the counts in an easy-to-read format, by adding a short text to explain how you chose the languages.*]


## [Optional] Exercise 3: Lexicon-based Transduction System

In this exercise you will build a rule-based tool to transduce a given input text **from masculine to feminine**. You are provided with a list of pairs of masculine words and their feminine counterparts. To create a rule-based transducer, the following components are needed:

1. Extract a subset of sentences from `wikitext-103-raw-v1` that contain masculine words (words from the list, including gendered pronouns such as he/his/him). **Tip**: you can use spaCy's lemma annotations to avoid missing inflected forms of words.

Fill in the `is_masculine` function so that only sentences containing masculine words are preserved.
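When joining words into one alternation, two details matter: word boundaries (so that "he" does not match inside "the") and escaping (so that characters such as "-" are taken literally). The sketch below shows the pattern on a hypothetical three-word list, not the full lexicon; `sample_words` and `pattern` are names chosen for this example.

```python
import re

# Hypothetical three-word list standing in for the full lexicon
sample_words = ["he", "king", "son-in-law"]

# re.escape keeps special characters literal; \b anchors matches at word boundaries
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sample_words) + r")\b",
    re.IGNORECASE,
)

print(bool(pattern.search("The King arrived.")))      # matches "King"
print(bool(pattern.search("The theatre was full.")))  # "he" only occurs inside other words
```

Without the `\b` anchors, the second sentence would match because "The" contains "he".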

In [None]:
import re
import datasets

gender_lexicon = [
    ("Brother", "Sister"),
    ("Drake", "Duck"),
    ("Father", "Mother"),
    ("Gentleman", "Lady"),
    ("Husband", "Wife"),
    ("Man", "Woman"),
    ("Nephew", "Niece"),
    ("Son", "Daughter"),
    ("Wizard", "Witch"),
    ("Boy", "Girl"),
    ("Bull", "Cow"),
    ("Cock", "Hen"),
    ("Dog", "Bitch"),
    ("Drone", "Bee"),
    ("Gander", "Goose"),
    ("Horse", "Mare"),
    ("King", "Queen"),
    ("Monk", "Nun"),
    ("Sir", "Madam"),
    ("Stag", "Hind"),
    ("Stallion", "Mare"),
    ("Tutor", "Governess"),
    ("Brother-in-law", "Sister-in-law"),
    ("Son-in-law", "Daughter-in-law"),
    ("Maternal-uncle", "Maternal-aunt"),
    ("Step-son", "Step-daughter"),
    ("Steward", "Stewardess"),
    ("Widower", "Widow"),
    ("author", "authoress"),
    ("count", "countess"),
    ("heir", "heiress"),
    ("manager", "manageress"),
    ("patron", "patroness"),
    ("priest", "priestess"),
    ("baron", "baroness"),
    ("giant", "giantess"),
    ("host", "hostess"),
    ("lion", "lioness"),
    ("mayor", "mayoress"),
    ("poet", "poetess"),
    ("shepherd", "shepherdess"),
    ("actor", "actress"),
    ("conductor", "conductress"),
    ("hunter", "huntress"),
    ("prince", "princess"),
    ("traitor", "traitress"),
    ("master", "mistress"),
    ("benefactor", "benefactress"),
    ("founder", "foundress"),
    ("instructor", "instructress"),
    ("emperor", "empress"),
    ("tiger", "tigress"),
    ("waiter", "waitress"),
    ("murderer", "murderess"),
    ("hero", "heroine"),
    ("fox", "vixen"),
    ("sultan", "sultana"),
    ("grandfather", "grandmother"),
    ("manservant", "maidservant"),
    ("milkman", "milkwoman"),
    ("salesman", "saleswoman"),
    ("great-uncle", "great-aunt"),
    ("landlord", "landlady"),
    ("he", "she"),
    ("him", "her"),
    ("his", "her")
]

def is_masculine(text):
    # TODO: Fill your regex with words from the wordlist
    # (use '|'.join(...) to join them in the regex)
    regex = None
    return bool(re.search(regex, text, re.IGNORECASE))


dataset = datasets.load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train+test+validation"
)

# We consider only the first 200 characters to avoid long paragraphs
filtered_dataset = dataset.filter(lambda x: is_masculine(x["text"][:200]))
filtered_dataset = filtered_dataset.map(lambda x: {"text": x["text"][:200]})
filtered_dataset

2. Create a `feminize` function that takes a sentence from the filtered dataset and returns a feminized version of it, based on lexical pairs. Use it to create a new field "feminine_text" in the dataset.
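Because matching is case-insensitive, a plain string replacement would lose the capitalisation of the original word. One possible approach, sketched below for a single pair, is to pass a function as the replacement argument of `re.sub`. The helper name `swap_word` is hypothetical, and the capitalisation rules are deliberately simple.

```python
import re

def swap_word(text, source, target):
    """Replace whole-word occurrences of source by target, copying capitalisation."""
    def replacement(match):
        found = match.group(0)
        if found.isupper():
            return target.upper()
        if found[0].isupper():
            return target.capitalize()
        return target.lower()

    # flags must be passed by keyword: the fourth positional argument of re.sub is count
    return re.sub(
        r"\b" + re.escape(source) + r"\b", replacement, text, flags=re.IGNORECASE
    )

print(swap_word("The king met another King.", "king", "queen"))
```

The same function can be applied once per lexicon pair. Note that ambiguous pairs such as him/her and his/her cannot be told apart by a lexical rule alone.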

In [None]:
def feminize(text):
    """Returns a feminized version of text"""
    feminized_text = text
    for m, f in gender_lexicon:
        # TODO: fill in your regex to select word m (adapted from is_masculine)
        match_regex = None
        # TODO: fill in your regex to replace m by f
        substitute_regex = None
        feminized_text = re.sub(match_regex, substitute_regex, feminized_text, flags=re.IGNORECASE)
    return feminized_text

# TODO: Use filtered_dataset.map to add a feminized version of the text column

3. Rename the `text` field to `source_text` and the `feminine_text` field to `target_text` (this is needed for `SimpleT5` to work properly). Transform the dataset to Pandas DataFrame format and use the following code to train a simple neural transduction model.

*(More info on the [T5 model](https://huggingface.co/t5-small) and the [SimpleT5](https://github.com/Shivanandroy/simpleT5) library)*

In [None]:
import torch
from simplet5 import SimpleT5

# TODO: Convert the Huggingface Dataset into a Pandas DataFrame and split it into training
# and evaluation sets (you decide the sizes based on your computational resources)
train_df, eval_df = None, None

model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-small")
model.train(
    train_df=train_df,
    eval_df=eval_df,
    source_max_token_len=128,
    target_max_token_len=128,
    batch_size=8, max_epochs=3, use_gpu=torch.cuda.is_available()
)

4. Conclude by testing the model on a few examples of your choice.

In [None]:
model.load_model("t5", "<YOUR SAVED MODEL PATH>", use_gpu=torch.cuda.is_available())

text_to_feminize = "my brother thought that his uncle was a duke"
model.predict(text_to_feminize)