[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/ik-nlp-tutorials/blob/main/notebooks/W3E_BPE_Transduction.ipynb)

In [None]:
# Run in Colab to install local packages
!pip install spacy transformers sentencepiece tokenizers datasets simplet5
!python -m spacy download en_core_web_sm

# Training a BPE tokenizer and a Lexicon-based Transduction Model

*This exercise follows the Huggingface guide [Build a Tokenizer from Scratch](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#build-a-tokenizer-from-scratch) on BPE tokenization. Adapted from a notebook by Wietse de Vries.*

The [Tokenizers](https://huggingface.co/docs/tokenizers/python/latest/quicktour.html) library by Huggingface provides implementations of today’s most used tokenizers (especially subword-based ones) that are both easy to use and blazing fast (Rust-compiled code!).

You will start by exploring the impact of different vocabulary sizes on a subword tokenizer using the Tokenizers library, and how such tokenizers can be imported and used with spaCy. Finally, you will be asked to train a small transformer model to perform transduction from masculine to feminine words.

Exercise 1 is mandatory and will be part of your graded midterm portfolio. Exercise 2 is optional, but we highly recommend completing it, especially if you're interested in the "Modern Neural Networks meet Linguistic Theory" final project.

## Exercise 1: Byte Pair Encoding with Huggingface Tokenizers

In the following exercise, we will use a byte-pair encoding (BPE) tokenizer (see Jurafsky & Martin Sec. 2.4.3 and [Sennrich et al., 2016](https://aclanthology.org/P16-1162/)) to create a vocabulary of frequent words and subwords, allowing us to handle less frequent words.

### Setup

The following code loads a BPE tokenizer and trainer, tells the tokenizer to split the input on whitespace before applying BPE (the `Whitespace` pre-tokenizer also separates punctuation from words), and defines `[UNK]` as a special token intended to handle unknown symbols.

In [2]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=20000) 

### Corpus

The tokenizer creates a dictionary by concatenating characters and substrings into longer strings (possibly full words) based on frequency. So we need a corpus to learn what the most frequent words and substrings are. 
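The merging procedure described above can be illustrated with a small, self-contained sketch. This is a toy implementation in plain Python and not the algorithm as implemented in the Tokenizers library: the function name `learn_bpe`, the toy word counts, and the tie-breaking rule (first pair seen wins) are assumptions made for illustration.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary (toy sketch)."""
    # Represent each word as a tuple of symbols, starting from single characters
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Pick the most frequent pair (ties: the pair that was counted first)
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol
        new_corpus = {}
        for symbols, count in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# Hypothetical toy corpus: word -> frequency
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=4)
print(merges)
print(segmented)
```

With only four merges, the frequent word *low* already becomes a single symbol, while *newest* and *widest* share the subword *est*. A larger number of merges (that is, a larger vocabulary) lets more full words end up as single entries.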

[Wikitext-103](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) is a collection of over 100 million tokens extracted from verified Good and Featured articles on the English Wikipedia. The `wikitext` corpus is available in the Huggingface Datasets library, and you can use the tokenizer's `train_from_iterator` method to train directly on the data loaded in memory. Alternatively, you can download the corpus using wget, or directly from the webpage:

```shell
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip
unzip wikitext-103-raw-v1.zip
```

The unzipped data is 500 MB. Note that the file extension for the data files is .raw, but each file is just a (Unicode) text file. Because this confuses (Ubuntu) Linux, the files were renamed to .raw.txt. If you keep the original .raw filenames, adapt the path below accordingly.

### Run the trainer

The command below trains the tokenizer on the data:

In [3]:
# UNCOMMENT AND ADAPT PATH TO TRAIN ON MANUALLY DOWNLOADED DATASETS
# data = [f'wikitext-103-raw/wiki.{split}.raw.txt' for split in ['train','test','valid']]
# tokenizer.train(data, trainer)

import datasets
dataset = datasets.load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train+test+validation"
)

# Build a generator to iterate over the dataset
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

Reusing dataset wikitext (/home/gsarti/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)


### Test the tokenizer

Now that we have created a vocabulary, we can use it to tokenize a string into words and subtokens (for infrequent words).

The example shows that most of the words are included in the vocabulary created by training on Wikipedia text, but that the acronym *UG*, the name *Hanze*, and the words *Applied*, *jointly* and *initiating* are segmented into subword strings. This suggests that these words were either not seen during training or seen only infrequently (*UG* occurs 5 times in the training data and *Applied* over 200 times). Also note that the encoding is case-sensitive.

Try a few other examples to get a feeling for the lexical coverage of the tokenizer. 

In [4]:
def show_tokens(text):
    output = tokenizer.encode(text)
    print(f"Tokens: {output.tokens}")
    number_of_words = len(tokenizer.pre_tokenizer.pre_tokenize_str(text))
    number_of_segments = len(output.tokens)
    print(f"{number_of_words} words and {number_of_segments} segments")

example = "The UG and the Hanze University of Applied Sciences are jointly initiating a pilot rapid testing centre, which will start on 18 January."
show_tokens(example)

Tokens: ['The', 'U', 'G', 'and', 'the', 'Han', 'ze', 'University', 'of', 'Ap', 'pl', 'ied', 'Sciences', 'are', 'joint', 'ly', 'initi', 'ating', 'a', 'pilot', 'rapid', 'testing', 'centre', ',', 'which', 'will', 'start', 'on', '18', 'January', '.']
25 words and 31 segments
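The word count reported above is the number of pieces produced by the pre-tokenizer, so punctuation marks are counted as words. The sketch below approximates the `Whitespace` pre-tokenizer with the regular expression `\w+|[^\w\s]+` using only the standard library; the helper name `pre_tokenize` is an assumption for illustration and not part of the Tokenizers API.

```python
import re

# Approximation of the Whitespace pre-tokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) pairs, similar in shape to pre_tokenize_str."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]

example = (
    "The UG and the Hanze University of Applied Sciences are jointly "
    "initiating a pilot rapid testing centre, which will start on 18 January."
)
pieces = pre_tokenize(example)
print(len(example.split()), "whitespace-separated words")
print(len(pieces), "pre-tokenized pieces")
```

The difference between the two counts comes from the comma and the full stop, which are split off as separate pieces.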


### Your Turn: Experiment with Vocabulary Size

The training data contains 103 M tokens and has a vocabulary size of 267,000 unique types. The default setting for the trainer is to create a vocabulary of at most 30,000 entries (words and subwords). This means that a fair amount of compression takes place. Even more compression can be achieved by setting `vocab_size` to a smaller value.

1. Choose an example text consisting of at least 100 words. You may want to ensure that it contains some rare words or tokens. 

2. Experiment with various settings for vocab_size.

3. Count the number of words in the example, and the number of segments created by the BPE tokenizer. Note that if the number of segments goes up, more words are segmented into subwords.

4. What is the vocabulary size where the number of segments is approx. 150% of the number of words? 

5. For this setting, what was the longest word in your example text that was not segmented? 
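For questions 4 and 5, one possible approach is to group segments by the word they came from. The sketch below works on plain Python lists: it assumes you have a list of tokens and, for each token, the index of its source word (recent versions of Tokenizers are assumed to expose such a mapping on the encoding object as `word_ids`). The function name `segmentation_stats` and the toy input are hypothetical.

```python
def segmentation_stats(tokens, word_ids):
    """Summarise how heavily a text was segmented.

    tokens:   list of subword strings
    word_ids: for each token, the index of the word it came from
    """
    # Group the tokens by the word they belong to
    words = {}
    for token, word_id in zip(tokens, word_ids):
        words.setdefault(word_id, []).append(token)
    # Average number of segments per word
    ratio = len(tokens) / len(words)
    # A word is unsegmented when it maps to exactly one token
    unsegmented = [pieces[0] for pieces in words.values() if len(pieces) == 1]
    longest = max(unsegmented, key=len, default=None)
    return ratio, longest

# Hand-written toy input (hypothetical, not real tokenizer output)
tokens = ["The", "Han", "ze", "University", "is", "joint", "ly", "fund", "ed"]
word_ids = [0, 1, 1, 2, 3, 4, 4, 5, 5]
ratio, longest = segmentation_stats(tokens, word_ids)
print(f"{ratio:.2f} segments per word, longest unsegmented word: {longest}")
```

A ratio of 1.5 corresponds to the 150% target in question 4.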

In [5]:
# TODO: Try with various vocab_sizes
# Important: You will need to redefine the tokenizer for every new vocab size,
# otherwise you might run into a "PanicError: no entry found for key" exception
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=30000)

tokenizer.train_from_iterator(batch_iterator(), trainer=trainer, length=len(dataset))

test_text = "Enter some English text containing at least 100 words"

show_tokens(test_text)

# Answer question 5 by going over the output, or write a 
# few lines of code to provide the answer.

### Loading the BPE Tokenizer into spaCy

Now that you have experimented with creating several tokenizers using Huggingface Tokenizers, you might want to move them to a more familiar environment. The following class lets you load a Huggingface Tokenizer into spaCy: the `get_words_spaces` method records, for each token, whether it is followed by whitespace, so that word pieces belonging to the same word are not separated by spaces.
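To see how character offsets translate into whitespace flags, here is a minimal standalone sketch in plain Python. The tokens and offsets are written by hand for illustration, and the helper names `words_and_spaces` and `rebuild` are hypothetical. Unlike the class below, this sketch marks the last token as not followed by a space, so that the original string can be reconstructed exactly.

```python
def words_and_spaces(tokens, offsets):
    """Pair each token with a flag telling whether whitespace follows it."""
    spaces = []
    for i, (_, end) in enumerate(offsets):
        if i < len(offsets) - 1:
            next_start, _ = offsets[i + 1]
            # A gap between this token's end and the next start means a space
            spaces.append(next_start > end)
        else:
            # No trailing space after the last token in this sketch
            spaces.append(False)
    return list(tokens), spaces

def rebuild(words, spaces):
    """Reassemble the text by appending a space after each flagged word."""
    return "".join(w + (" " if s else "") for w, s in zip(words, spaces))

text = "Hanze jointly"
# Hand-written segmentation with (start, end) character offsets into text
tokens = ["Han", "ze", "joint", "ly"]
offsets = [(0, 3), (3, 5), (6, 11), (11, 13)]

words, spaces = words_and_spaces(tokens, offsets)
print(spaces)
print(rebuild(words, spaces) == text)
```

Only *ze* is followed by a space, because it is the only token whose end offset does not coincide with the start of the next token.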

### Your Turn: Fill in the missing code

Your task is to complete the `__call__` method of the `BPETokenizer` class to go from text to a spaCy `Doc`, and finally to print the tokenized text.

In [6]:
from spacy.tokens import Doc
from spacy.vocab import Vocab
import spacy

class BPETokenizer:
    def __init__(self, tokenizer, vocab):
        self.tokenizer = tokenizer
        self.vocab = vocab
    
    def get_words_spaces(self, tokens):
        words = []
        spaces = []
        for i, (text, (_, end)) in enumerate(
            zip(tokens.tokens, tokens.offsets)
        ):
            words.append(text)
            if i < len(tokens.tokens) - 1:
                # If next start != current end we assume a 
                # space in between
                next_start, _ = tokens.offsets[i + 1]
                spaces.append(next_start > end)
            else:
                spaces.append(True)
        return words, spaces

    def __call__(self, text):
        # TODO: Encode the texts to obtain tokens
        tokens = None
        # TODO: Use get_words_spaces to obtain the words and spaces
        words, spaces = None, None
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.vocab = Vocab(strings=[
    tok for tok in tokenizer.get_vocab().keys()
])
nlp.tokenizer = BPETokenizer(tokenizer, nlp.vocab)

text = "Jeff Bezos is a billionaire who became famous after the Dutch bridge controversy."
# TODO: Convert the text into a list of tokens and print them

## (Optional) Exercise 2: Lexicon-based Transduction System

In this exercise you will build a rule-based tool to transduce a given input text **from masculine to feminine**. You are provided with a list of pairs of masculine words and their feminine counterparts. To create a rule-based transducer, the following components will be needed:

1. Extract a subset of sentences from the `wikitext-103-raw-v1` dataset containing masculine words (words from the list, including gendered pronouns such as he/his/him). **Tip**: you can use spaCy's lemma annotations to avoid missing inflected forms of words.

Fill in the `is_masculine` function so that only sentences containing masculine words are preserved.
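When building such a regex, word boundaries matter: without them, *he* would match inside *The* and *king* inside *thinking*. The following is a minimal sketch on a tiny hypothetical word list (not the full lexicon), using only the standard `re` module.

```python
import re

# Tiny hypothetical word list, not the full lexicon from the exercise
masculine_words = ["he", "his", "king", "son-in-law"]

# Escape each word, join the alternatives with '|', and wrap them in word boundaries
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in masculine_words) + r")\b",
    re.IGNORECASE,
)

sentences = ["The King spoke.", "She is thinking.", "They met his son-in-law."]
for sentence in sentences:
    print(sentence, "->", bool(pattern.search(sentence)))
```

The second sentence is rejected even though it contains the character sequences *he* and *king*, because neither occurs as a whole word.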

In [None]:
import re
import datasets

gender_lexicon = [
    ("Brother", "Sister"),
    ("Drake", "Duck"),
    ("Father", "Mother"),
    ("Gentleman", "Lady"),
    ("Husband", "Wife"),
    ("Man", "Woman"),
    ("Nephew", "Niece"),
    ("Son", "Daughter"),
    ("Wizard", "Witch"),
    ("Boy", "Girl"),
    ("Bull", "Cow"),
    ("Cock", "Hen"),
    ("Dog", "Bitch"),
    ("Drone", "Bee"),
    ("Gander", "Goose"),
    ("Horse", "Mare"),
    ("King", "Queen"),
    ("Monk", "Nun"),
    ("Sir", "Madam"),
    ("Stag", "Hind"),
    ("Stallion", "Mare"),
    ("Tutor", "Governess"),
    ("Brother-in-law", "Sister-in-law"),
    ("Son-in-law", "Daughter-in-law"),
    ("Maternal-uncle", "Maternal-aunt"),
    ("Step-son", "Step-daughter"),
    ("Steward", "Hostess"),
    ("Widower", "Widow"),
    ("author", "authoress"),
    ("count", "countess"),
    ("heir", "heiress"),
    ("manager", "manageress"),
    ("patron", "patroness"),
    ("priest", "priestess"),
    ("baron", "baroness"),
    ("giant", "giantess"),
    ("host", "hostess"),
    ("lion", "lioness"),
    ("mayor", "mayoress"),
    ("poet", "poetess"),
    ("shepherd", "shepherdess"),
    ("actor", "actress"),
    ("conductor", "conductress"),
    ("hunter", "huntress"),
    ("prince", "princess"),
    ("traitor", "traitress"),
    ("master", "mistress"),
    ("benefactor", "benefactress"),
    ("founder", "foundress"),
    ("instructor", "instructress"),
    ("emperor", "empress"),
    ("tiger", "tigress"),
    ("waiter", "waitress"),
    ("murderer", "murderess"),
    ("hero", "heroine"),
    ("fox", "vixen"),
    ("sultan", "sultana"),
    ("grandfather", "grandmother"),
    ("manservant", "maidservant"),
    ("milkman", "milkwoman"),
    ("salesman", "saleswoman"),
    ("great-uncle", "great-aunt"),
    ("landlord", "landlady"),
    ("he", "she"),
    ("him", "her"),
    ("his", "her")
]

def is_masculine(text):
    # TODO: Fill your regex with words from the wordlist
    # (use '|'.join(...) to join them in the regex)
    regex = None
    return bool(re.search(regex, text, re.IGNORECASE))


dataset = datasets.load_dataset(
    "wikitext", "wikitext-103-raw-v1", split="train+test+validation"
)

# We consider only the first 200 characters to avoid long paragraphs
filtered_dataset = dataset.filter(lambda x: is_masculine(x["text"][:200]))
filtered_dataset = filtered_dataset.map(lambda x: {"text": x["text"][:200]})
filtered_dataset

2. Create a `feminize` function that takes a sentence from the filtered dataset and returns a feminized version of it, based on the lexical pairs. Use it to create a new field "feminine_text" in the dataset.
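One design question for the substitution step is how to handle capitalisation, since the lexicon mixes capitalised and lowercase entries. The sketch below shows one possible approach using a replacement function with `re.sub`; the helper name `swap_word` and the casing rules are assumptions for illustration, not the required solution. Note that the fourth positional argument of `re.sub` is `count`, so flags should be passed by keyword or compiled into the pattern, as done here.

```python
import re

def swap_word(text, source, target):
    """Replace whole-word occurrences of source with target, keeping capitalisation."""
    pattern = re.compile(r"\b" + re.escape(source) + r"\b", re.IGNORECASE)

    def replacement(match):
        found = match.group()
        # ALL CAPS stays all caps, Capitalised stays capitalised, else lowercase
        if found.isupper() and len(found) > 1:
            return target.upper()
        if found[0].isupper():
            return target.capitalize()
        return target.lower()

    return pattern.sub(replacement, text)

sentence = "The king and the King met HIS son."
print(swap_word(sentence, "king", "queen"))
print(swap_word(sentence, "his", "her"))
```

A replacement function is used instead of a fixed replacement string because the correct output depends on the casing of each individual match.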

In [None]:
def feminize(text):
    """Returns a feminized version of text"""
    feminized_text = text
    for m, f in gender_lexicon:
        # TODO: fill in your regex to select word m (adapted from is_masculine)
        match_regex = None 
        # TODO: fill in your regex to replace m by f
        substitute_regex = None
        feminized_text = re.sub(match_regex, substitute_regex, feminized_text, flags=re.IGNORECASE)
    return feminized_text

# TODO: Use filtered_dataset.map to add a feminized version of the text column

3. Rename the `text` field to `source_text` and the `feminine_text` field to `target_text` (this is needed for `SimpleT5` to work properly). Convert the dataset to a Pandas DataFrame and use the following code to train a simple neural transduction model.

*(More info on the [T5 model](https://huggingface.co/t5-small) and the [SimpleT5](https://github.com/Shivanandroy/simpleT5) library)*
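The renaming and splitting can be done in several ways. As a rough sketch, the snippet below performs both steps with Pandas on a hypothetical four-row stand-in for the real data; the example rows, the 75/25 split, and the random seed are all assumptions. With a Huggingface `Dataset`, you would first obtain the DataFrame, for example via its `to_pandas()` method.

```python
import pandas as pd

# Hypothetical miniature stand-in for the filtered and feminized dataset
df = pd.DataFrame({
    "text": ["he is a king", "his son left", "the actor bowed", "a monk sang"],
    "feminine_text": ["she is a queen", "her daughter left", "the actress bowed", "a nun sang"],
})

# SimpleT5 expects these two column names
df = df.rename(columns={"text": "source_text", "feminine_text": "target_text"})

# Shuffle reproducibly, then hold out 25% of the rows for evaluation
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
split_at = int(len(shuffled) * 0.75)
train_df, eval_df = shuffled.iloc[:split_at], shuffled.iloc[split_at:]
print(train_df.shape, eval_df.shape)
```

Shuffling before splitting avoids an evaluation set made up only of the last articles in the corpus.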

In [None]:
import torch
from simplet5 import SimpleT5

# TODO: Convert the Huggingface Dataset into a Pandas DataFrame and split it into training
# and evaluation sets (you decide the sizes based on your computational resources)
train_df, eval_df = None, None

model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-small")
model.train(
    train_df=train_df,
    eval_df=eval_df, 
    source_max_token_len=128, 
    target_max_token_len=128, 
    batch_size=8, max_epochs=3, use_gpu=torch.cuda.is_available()
)

4. Conclude by testing the model on a few examples of your choice.

In [None]:
model.load_model("t5", "<YOUR SAVED MODEL PATH>", use_gpu=torch.cuda.is_available())

text_to_feminize = "my brother thought that his uncle was a duke"
model.predict(text_to_feminize)