## Tokenization using Byte-Pair Encoding and a Unigram Language Model

Author: Pierre Nugues with help from Marcus Klang

In this assignment, you will create a tokenization program to handle subwords.

In many scripts from Asia, such as Chinese, Korean, or Japanese, tokenization cannot rely on white spaces. Byte-pair encoding and the unigram language model are techniques that are now common in machine translation to carry out tokenization at a subword level. Subword-level tokenization shows better multilingual capabilities.

You will follow two papers: 
* _Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates_ by Kudo (2018) (https://arxiv.org/pdf/1804.10959.pdf) and 
* _Byte Pair Encoding is Suboptimal for Language Model Pretraining_ by Bostrom and Durrett (2020) (https://aclanthology.org/2020.findings-emnlp.414.pdf). 

In addition, you will start from a clear and easy-to-understand description in _Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation_ by Wu et al. (2016) (https://arxiv.org/abs/1609.08144). You do not need to read these papers now.

You will use a small corpus to make it easier to test and correct your code. Note also that you will use _characters_ and not _bytes_ in this lab as this is simpler to implement. For a complete program, see the link at the end.

**In your report, be sure to answer all the questions. Please reuse the section titles of this notebook so that I can check your answers more easily.**

## Preliminaries

As an overall description of the subword tokenizers, read Section 4 (introduction paragraph) and Section 4.1 of _Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation_ by Wu et al. (2016), https://arxiv.org/abs/1609.08144.

In your report, in a few lines (10 to 15 lines or so) you will:

1. Outline the difference with tokenization as you saw it during the course;
2. Imagine how the tokens will be learned (this will be developed in the rest of the lab);
3. Summarize what could be the advantages for Asian languages, unknown words, and translation.

Commenting on Sections 4 and 4.1 in your report is **mandatory**. If you are curious, you can read the complete article.

## Design of the BPE Algorithm

The first algorithm to build the subwords from a corpus is byte-pair encoding (BPE), due to Gage (1994). In the lab, you will first read two sections of more recent articles as they are easier to understand and specifically targeted to natural language processing.

Read these two sections:

1. Section 3.1 of _Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates_ (https://arxiv.org/pdf/1804.10959.pdf) by Kudo (2018).
2. Section 2, algorithm 1 of _Byte Pair Encoding is Suboptimal for Language Model Pretraining_ (https://aclanthology.org/2020.findings-emnlp.414.pdf) by Bostrom and Durrett (2020).

In your report, **summarize** (10 to 15 lines or so) in your own words the byte-pair encoding (BPE) algorithm as described by Kudo (2018) and Bostrom and Durrett (2020) (only BPE, not the unigram language model).

## BPE Programming

You will now program a byte-pair encoding program in Python. You will do it step by step. The first part will be to extract the subwords from a corpus. Note that you will use the characters, not the bytes. 

In [1]:
import regex as re
from tqdm import tqdm
import math

First use a small corpus and then, if you have time, test your program on a larger one. Here we take the shortest novel by Selma Lagerlöf in our corpus.

In [2]:
import os
from zipfile import ZipFile
import requests

# Parameters for Selma dataset
SELMA_URL = "https://github.com/pnugues/ilppp/raw/master/programs/corpus/Selma.zip"

SELMA_FILES = [
    os.path.join("Selma", fname) 
    for fname in 
    [
        "bannlyst.txt", 
        "gosta.txt", 
        "herrgard.txt", 
        "jerusalem.txt", 
        "kejsaren.txt", 
        "marbacka.txt", 
        "nils.txt", 
        "osynliga.txt", 
        "troll.txt"
    ]
]

def download_and_extract_selma():
    """Downloads and unpacks Selma.zip"""
    
    # Download if not all files exist
    req = requests.get(SELMA_URL, stream=True)
    if req.status_code != 200:
        print("Failed to download file, got status: " + str(req.status_code))
        req.close()
    else:
        with open("Selma.zip", "wb") as fd:
            written = 0
            for chunk in req.iter_content(chunk_size=65536):
                fd.write(chunk)
                written += len(chunk)
                print("Downloading: %d bytes written to Selma.zip" % written)

        print("Selma.zip downloaded.")
        req.close()
        
        selma_zipfile = ZipFile("Selma.zip")
        selma_files_to_extract = [zi for zi in selma_zipfile.filelist if not zi.filename.startswith("__") and zi.filename.endswith(".txt")]
        for zi in selma_files_to_extract:
            selma_zipfile.extract(zi)
            print("Extracted: " + zi.filename)
            
        print("Done!")
        
# Download the archive unless all the corpus files already exist
if not all([os.path.exists(fname) for fname in SELMA_FILES]):
    download_and_extract_selma()
else:
    print("Selma has been downloaded.")
    
SELMA_FILES

Selma has been downloaded.


['Selma/bannlyst.txt',
 'Selma/gosta.txt',
 'Selma/herrgard.txt',
 'Selma/jerusalem.txt',
 'Selma/kejsaren.txt',
 'Selma/marbacka.txt',
 'Selma/nils.txt',
 'Selma/osynliga.txt',
 'Selma/troll.txt']

In [3]:
#FILE_PATH = '../../corpus/Selma.txt'
FILE_PATH = 'Selma/herrgard.txt'

Read the corpus and store it in the `corpus` string variable.

In [4]:
with open(FILE_PATH, encoding='utf8') as f:
    corpus = f.read().strip()

Replace all the whitespace sequences in `corpus`, including newlines and tabulations, with a single space.
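As a hedged illustration on a toy string (not the assignment corpus), the pattern `\s+` matches any run of spaces, tabs, and newlines. The sketch below uses the standard `re` module and the made-up names `sample` and `normalized`; the `regex` module imported above offers the same `sub()` call.

```python
import re

# Toy text with mixed whitespace; `sample` is an illustrative name.
sample = "De  senaste\tfem\n\nåren"

# Collapse every run of whitespace characters into a single space.
normalized = re.sub(r"\s+", " ", sample)
print(normalized)
```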

In [5]:
# Write your code


In [6]:
corpus[:100]

'Selma Lagerlöf En herrgårdssägen Bokutgåva Albert Bonniers förlag, Stockholm 1899. I. Det var en skö'

### BPE

#### Initial Vocabulary

Write the code (one instruction) to split the corpus into a list of characters and store the results in `corpus_l`. This is just a type conversion. Given the input:

`corpus = 'De senaste fem åren har cirka 25 000 unga'`

Return:

`corpus_l = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', 'e', ' ', 'f', 'e', 'm', ' ', ...]`

In [7]:
# Write your code


In [8]:
corpus_l[:15]

['S', 'e', 'l', 'm', 'a', ' ', 'L', 'a', 'g', 'e', 'r', 'l', 'ö', 'f', ' ']

Extract the set of characters that will serve as initial subword tokens:

1. Write a statement to extract the set of all the characters from `corpus_l`; 
2. Exclude the space from this set and call the resulting set: `char_set`.
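The two steps can be sketched on a toy string as follows. This is only an illustration under assumed names (`toy_l`, `toy_chars`), not the reference solution: a Python set keeps one copy of each character, and a set difference removes the space.

```python
# Toy token list; `toy_l` and `toy_chars` are illustrative names.
toy_l = list("De senaste fem")

# A set keeps one copy of each character; the difference drops the space.
toy_chars = set(toy_l) - {" "}
print(sorted(toy_chars))
```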

In [9]:
# Write your code


In [10]:
len(char_set)

67

Using code from the previous question, write an `initial_vocabulary()` function taking the `corpus_l` variable as input and returning the set of all characters appearing in the corpus (the initial character set), excluding the white space.

In [11]:
# Write your code here


In [12]:
initial_vocabulary(corpus_l)

{'!',
 ',',
 '-',
 '.',
 '1',
 '8',
 '9',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'R',
 'S',
 'T',
 'U',
 'V',
 'X',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'r',
 's',
 't',
 'u',
 'v',
 'x',
 'y',
 'z',
 '»',
 'Ä',
 'Å',
 'Ö',
 'ä',
 'å',
 'é',
 'ö',
 '–',
 '’'}

#### Counting

Write a `pair_count()` function that takes a list of tokens as input, possibly single characters or subword tokens, and that counts the adjacent pairs (bigrams). You will implement these counts as dictionaries: The key will be a pair (tuple) of adjacent symbols and the value, its frequency. Remember that you cannot cross whitespaces, i.e. a pair cannot include a whitespace.

Given the input

`['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', ...]`

`pair_count()` should return a dictionary:

`{('D', 'e'): 1, ('s', 'e'): 1, ('e', 'n'): 1, ('n', 'a'): 1, ...}`
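One possible way to count adjacent pairs is to zip the token list with itself shifted by one position and to skip every pair that touches a space. The function name `count_adjacent_pairs` and the toy input are assumptions made for this sketch.

```python
from collections import Counter

def count_adjacent_pairs(tokens):
    """Count adjacent token pairs, skipping any pair that touches a space."""
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left != " " and right != " ":
            counts[(left, right)] += 1
    return dict(counts)

toy_pairs = count_adjacent_pairs(list("De senaste"))
print(toy_pairs)
```

Because the pairs `('e', ' ')` and `(' ', 's')` are skipped, no bigram crosses a word boundary.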

In [13]:
# Write your code here


In [14]:
pairs = pair_count(corpus_l)

In [15]:
pairs

{('S', 'e'): 7,
 ('e', 'l'): 511,
 ('l', 'm'): 22,
 ('m', 'a'): 424,
 ('L', 'a'): 8,
 ('a', 'g'): 477,
 ('g', 'e'): 646,
 ('e', 'r'): 1281,
 ('r', 'l'): 213,
 ('l', 'ö'): 70,
 ('ö', 'f'): 217,
 ('E', 'n'): 21,
 ('h', 'e'): 773,
 ('r', 'r'): 177,
 ('r', 'g'): 140,
 ('g', 'å'): 327,
 ('å', 'r'): 295,
 ('r', 'd'): 413,
 ('d', 's'): 83,
 ('s', 's'): 218,
 ('s', 'ä'): 205,
 ('ä', 'g'): 143,
 ('e', 'n'): 3439,
 ('B', 'o'): 2,
 ('o', 'k'): 54,
 ('k', 'u'): 562,
 ('u', 't'): 295,
 ('t', 'g'): 43,
 ('å', 'v'): 2,
 ('v', 'a'): 1426,
 ('A', 'l'): 24,
 ('l', 'b'): 44,
 ('b', 'e'): 314,
 ('r', 't'): 381,
 ('o', 'n'): 1600,
 ('n', 'n'): 1070,
 ('n', 'i'): 231,
 ('i', 'e'): 22,
 ('r', 's'): 258,
 ('f', 'ö'): 980,
 ('ö', 'r'): 1317,
 ('l', 'a'): 873,
 ('g', ','): 192,
 ('S', 't'): 88,
 ('t', 'o'): 351,
 ('o', 'c'): 1223,
 ('c', 'k'): 794,
 ('k', 'h'): 23,
 ('h', 'o'): 1098,
 ('o', 'l'): 211,
 ('1', '8'): 1,
 ('8', '9'): 1,
 ('9', '9'): 1,
 ('9', '.'): 1,
 ('I', '.'): 6,
 ('D', 'e'): 429,
 ('e', 't'): 

Determine the most frequent pair and store it in `most_freq_pair`.
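A compact way to pick the key with the highest value in a dictionary is `max()` with the dictionary's `get` method as key function. The counts below are invented for the illustration; with ties, `max()` returns the first maximal key it encounters.

```python
# Invented toy counts; `toy_counts` and `best_pair` are illustrative names.
toy_counts = {("e", "n"): 5, ("d", "e"): 9, ("a", "n"): 4}

# max() iterates over the keys and compares them by their counts.
best_pair = max(toy_counts, key=toy_counts.get)
print(best_pair, "".join(best_pair))
```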

In [16]:
# write your code


In [17]:
most_freq_pair

('d', 'e')

In [18]:
''.join(most_freq_pair)

'de'

#### The First Iteration

We store the initial symbols in a `vocabulary` variable.

In [19]:
vocabulary = initial_vocabulary(corpus_l)

In [20]:
len(vocabulary)

67

Add your most frequent pair to the vocabulary: this is the result of one iteration.

In [21]:
# write your code here


In [22]:
len(vocabulary)

68

#### Incremental Construction
We will now incrementally build the vocabulary.

Create a `merge_bigrams()` function that takes a list of tokens, `corpus_l`, and a pair of subword tokens `(token_r, token_l)` as input and merges each adjacent sequence `token_r, token_l` in `corpus_l` into a new token, `token_new`. Your function will return a new list.

Given the input 

`corpus_l = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't', ...]`

`merge_bigrams(corpus_l, ('e', 'n'))` should return a list where all the sequences of 'e' and 'n' have been merged:

`['D', 'e', ' ', 's', 'en', 'a', 's', 't', ...]`

And reapplying `merge_bigrams(corpus_l, ('s', 'en'))` to this corpus should return

`['D', 'e', ' ', 'sen', 'a', 's', 't', ...]`

You will apply a greedy algorithm. Given the pair ('a', 'a') and the list ['a', 'a', 'a'], the result will be: ['aa', 'a']
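The greedy left-to-right behavior can be sketched with an index that jumps two positions after each merge, so that a merged token is never reused in an overlapping match. The name `merge_pair` is hypothetical and the function is a minimal sketch rather than the expected solution.

```python
def merge_pair(tokens, pair):
    """Greedy left-to-right merge of every adjacent occurrence of `pair`."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2          # skip both members of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_pair(["a", "a", "a"], ("a", "a")))
print(merge_pair(list("De senast"), ("e", "n")))
```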

In [23]:
# Write your code here


In [24]:
corpus_test = ['D', 'e', ' ', 's', 'e', 'n', 'a', 's', 't']
merge_bigrams(corpus_test, ('e', 'n'))

['D', 'e', ' ', 's', 'en', 'a', 's', 't']

In [25]:
merge_bigrams(merge_bigrams(corpus_test, ('e', 'n')), ('s', 'en'))

['D', 'e', ' ', 'sen', 'a', 's', 't']

#### Byte Pair Encoding (BPE): Building the Vocabulary

Now write a `BPE()` function following Algorithm 1 in _Byte Pair Encoding is Suboptimal for Language Model Pretraining_ by Bostrom and Durrett (2020). 

Your function will take `corpus_l` and the vocabulary size `k` as input. This size `k` will correspond to the count of new subwords added to the initial list of symbols. With your initial corpus, you should have found 67 symbols. With `k = 10`, you will add 10 subwords to this initial list. Note that Bostrom and Durrett (2020) define their $k_\text{Bostrom and Durrett}$ as `k + initial vocabulary`. 

Return the vocabulary of subword tokens in the form of a list: the initial vocabulary and the subwords you will create.

You will start from the initial vocabulary and `k` will be the number of symbols you add to this vocabulary.
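The loop (count pairs, pick the most frequent one, merge it, add the new subword) can be sketched end to end on a toy corpus. All names below (`count_pairs`, `merge`, `learn_bpe`, `toy_vocab`) are assumptions for this illustration, and the early exit when no pair remains is a design choice of the sketch, not something stated in the papers.

```python
from collections import Counter

def count_pairs(tokens):
    """Count adjacent pairs that do not include a space."""
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if " " not in (left, right):
            counts[(left, right)] += 1
    return counts

def merge(tokens, pair):
    """Greedy left-to-right merge of `pair` in `tokens`."""
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == pair:
            out.append(pair[0] + pair[1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_bpe(tokens, k):
    """Add up to k merged subwords to the initial character vocabulary."""
    vocab = set(tokens) - {" "}
    for _ in range(k):
        counts = count_pairs(tokens)
        if not counts:              # nothing left to merge
            break
        best = max(counts, key=counts.get)
        tokens = merge(tokens, best)
        vocab.add("".join(best))
    return vocab

toy_vocab = learn_bpe(list("aaab aab"), 2)
print(sorted(toy_vocab))
```

On this toy input, the first merge creates `'aa'`; the second merge then works on tokens that already contain `'aa'`, which is how longer subwords grow out of shorter ones.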

In [26]:
# Write your code here


We build a vocabulary of 50 subwords in addition to our initial set of symbols.

In [27]:
vocabulary = BPE(corpus_l, 50)
vocabulary

de en an tt ar st om on ll ör att ch ade ig er ng och var hon et för sk är ck han or na det ne så än in ej un ill den som fv på ed ag li enne henne id ra hade all ing ta 

{'!',
 ',',
 '-',
 '.',
 '1',
 '8',
 '9',
 ':',
 ';',
 '?',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'R',
 'S',
 'T',
 'U',
 'V',
 'X',
 '_',
 'a',
 'ade',
 'ag',
 'all',
 'an',
 'ar',
 'att',
 'b',
 'c',
 'ch',
 'ck',
 'd',
 'de',
 'den',
 'det',
 'e',
 'ed',
 'ej',
 'en',
 'enne',
 'er',
 'et',
 'f',
 'fv',
 'för',
 'g',
 'h',
 'hade',
 'han',
 'henne',
 'hon',
 'i',
 'id',
 'ig',
 'ill',
 'in',
 'ing',
 'j',
 'k',
 'l',
 'li',
 'll',
 'm',
 'n',
 'na',
 'ne',
 'ng',
 'o',
 'och',
 'om',
 'on',
 'or',
 'p',
 'på',
 'r',
 'ra',
 's',
 'sk',
 'som',
 'st',
 'så',
 't',
 'ta',
 'tt',
 'u',
 'un',
 'v',
 'var',
 'x',
 'y',
 'z',
 '»',
 'Ä',
 'Å',
 'Ö',
 'ä',
 'än',
 'är',
 'å',
 'é',
 'ö',
 'ör',
 '–',
 '’'}

In [28]:
len(vocabulary)

117

#### BPE Tokenizer

You will now use the vocabulary you obtained to tokenize a text stored in the corpus string.

You will implement a greedy technique building on Python's regular expression engine. You will write a function called `tokenize_bpe()` that takes two inputs, `corpus` and `vocabulary`, and returns the tokenized text in the form of a list.

    def tokenize_bpe(corpus, vocabulary):

        ...

        return tokens

Here are a few hints on how to write this function. Before you call a regular expression and apply it to a text, a regex engine compiles it into an efficient automaton (you do not need to call `compile()` as the automaton is automatically cached). The only thing you have to take care of is the length order of the strings. In the tokenization function:

1. Write a statement to order the strings in your vocabulary list,
   * first by decreasing length, and then
   * by alphabetic order.

   You will call this list `vocabulary_srt`. Knowing that, in the ASCII order, the upper case letters are placed before the lower case ones, the list `['D', 'e', 'sen', 'a', 's', 't']` will be sorted as `['sen', 'D', 'a', 'e', 's', 't']`.
2. Escape each string with `re.escape()` as some strings may include metacharacters, for instance `'a.'`, where the dot matches all the characters.
3. Convert this list into a regular expression that results in a disjunction of subword tokens. Remember that the disjunction operator (or) for regular expressions is the vertical bar (`|`), as in `a|b`, meaning match `a` or `b`.
4. Apply a regular expression function to tokenize your text: the corpus string. You will use `findall()` for this. You will return this result.
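The hints above can be sketched on a toy vocabulary. The sketch relies on the standard `re` module and on made-up names (`toy_vocab`, `vocab_sorted`, `pattern`); it assumes that sorting by the key `(-len(w), w)` is an acceptable way to get the longest strings first with alphabetic tie-breaking.

```python
import re

# Toy vocabulary, including one string with a metacharacter.
toy_vocab = ["D", "e", "sen", "a", "s", "t", "a."]

# Longest strings first, ties broken alphabetically.
vocab_sorted = sorted(toy_vocab, key=lambda w: (-len(w), w))

# Escape each string, then join them with the alternation operator.
pattern = "|".join(re.escape(w) for w in vocab_sorted)

# findall() scans left to right and tries the alternatives in order,
# so a longer subword wins over its own prefix.
tokens = re.findall(pattern, "De senast")
print(vocab_sorted)
print(tokens)
```

The space in the input matches no alternative, so `findall()` silently skips it.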

In [29]:
re.escape('a.')

'a\\.'

In [30]:
# Write your code here


In [31]:
print(tokenize_bpe(corpus, vocabulary)[:200])

['S', 'e', 'l', 'm', 'a', 'L', 'ag', 'er', 'l', 'ö', 'f', 'E', 'n', 'h', 'er', 'r', 'g', 'å', 'r', 'd', 's', 's', 'ä', 'g', 'en', 'B', 'o', 'k', 'u', 't', 'g', 'å', 'v', 'a', 'A', 'l', 'b', 'er', 't', 'B', 'on', 'n', 'i', 'er', 's', 'för', 'l', 'ag', ',', 'S', 't', 'o', 'ck', 'h', 'o', 'l', 'm', '1', '8', '9', '9', '.', 'I', '.', 'D', 'et', 'var', 'en', 'sk', 'ö', 'n', 'h', 'ö', 'st', 'd', 'ag', 'h', 'än', 'e', 'm', 'o', 't', 's', 'l', 'u', 't', 'et', 'a', 'f', 't', 'r', 'et', 't', 'i', 'o', 'ta', 'l', 'et', '.', 'P', 'å', 'den', 't', 'id', 'en', 'f', 'an', 'n', 's', 'i', 'U', 'p', 's', 'a', 'l', 'a', 'et', 't', 'h', 'ö', 'g', 't', ',', 'g', 'u', 'l', 't', 't', 'v', 'å', 'v', 'å', 'n', 'ing', 's', 'h', 'u', 's', ',', 'som', 'st', 'o', 'd', 'un', 'de', 'r', 'li', 'g', 't', 'en', 's', 'a', 'm', 't', 'på', 'en', 'li', 't', 'en', 'än', 'g', ',', 'l', 'å', 'ng', 't', 'b', 'or', 'ta', 'i', 'en', 'u', 't', 'k', 'an', 't', 'a', 'f', 'st', 'ade', 'n', '.', 'D', 'et', 'var', 'et', 't', 'r', 'ä',

## Unigram Language Model

You are now done with BPE and can consider the unigram language model.

Read these two sections:

1. Section 3.2 of _Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates_ (https://arxiv.org/pdf/1804.10959.pdf) by Kudo (2018).
2. Section 2, algorithm 2 and the related text of _Byte Pair Encoding is Suboptimal for Language Model Pretraining_ (https://aclanthology.org/2020.findings-emnlp.414.pdf) by Bostrom and Durrett (2020).

In your report, **summarize** (10 to 15 lines or so) in your own words the tokenization with a unigram language model as described by Kudo (2018) and Bostrom and Durrett (2020). You will notably consider two aspects:
1. How to obtain the subword vocabulary;
2. How to tokenize a text.

In your report, given what you have done on the byte-pair encoding, how would you build the “reasonably big seed vocabulary” needed for the unigram language model?

### Unigram Probabilities

Starting from the “reasonably big seed vocabulary”, you will now fit a unigram language model. You will start with a vocabulary of 50 subwords in addition to the character set and reduce it to 49, i.e. you will find one subword to discard.

Kudo (2018) proposes the expectation-maximization algorithm that we have not seen in the course on natural language processing. Instead, in this lab, you will approximate the language model with the BPE algorithm.

Write a `unigram_lm()` function that takes a corpus string and a vocabulary of subword tokens as input and returns a dictionary, where the keys are the subwords and the values are their relative frequencies:

    def unigram_lm(corpus, vocabulary):

        ...

        return unigram_probs

Your function will:

1. Tokenize your corpus with BPE (you can reuse the `tokenize_bpe()` function);
2. Estimate the probability of each subword (simply count the occurrences of the subwords and divide them by the length of the tokenized corpus);
3. Return this model as a dictionary.
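The estimation step amounts to computing relative frequencies over a token list. A minimal sketch, assuming the tokens are already available and using the hypothetical names `relative_frequencies` and `toy_probs`:

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each distinct token to its count divided by the total count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: n / total for tok, n in counts.items()}

toy_probs = relative_frequencies(["en", "s", "en", "a"])
print(toy_probs)
```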

In [32]:
# Write your code here


In [33]:
unigram_probs = unigram_lm(corpus, vocabulary)
unigram_probs

{'S': 0.0017218346445005196,
 'e': 0.02772747513730147,
 'l': 0.028687348473603484,
 'm': 0.028835782494681113,
 'a': 0.040186037306417295,
 'L': 0.0003760328533966652,
 'ag': 0.004017614170501212,
 'er': 0.010360694671218643,
 'ö': 0.010083617831873733,
 'f': 0.020790658552273515,
 'E': 0.0002671812379397358,
 'n': 0.0172579288506259,
 'h': 0.015773588639849588,
 'r': 0.03131957844738014,
 'g': 0.03128989164316461,
 'å': 0.021255751818316758,
 'd': 0.029498787788827866,
 's': 0.03803869180149424,
 'ä': 0.013754885953193805,
 'en': 0.023838503785067536,
 'B': 0.0008114393152243828,
 'o': 0.015832962248280638,
 'k': 0.02452130028202464,
 'u': 0.020948988174756322,
 't': 0.044886447973875615,
 'v': 0.01462569887684924,
 'A': 0.0006234228885260501,
 'b': 0.01510068774429766,
 'on': 0.005719657612191381,
 'i': 0.019652664390678344,
 'för': 0.009074266488545842,
 ',': 0.027133739052990945,
 'ck': 0.007857107515709267,
 '1': 9.8956014051754e-06,
 '8': 9.8956014051754e-06,
 '9': 1.97912028103

In [34]:
len(unigram_probs)

117

### Unigram Tokenization

You will now apply your unigram language model to tokenize a character sequence that does not include spaces, typically a single word in the Latin or Greek scripts or a sequence of words in Asian scripts, like Chinese or Korean.

Write a `tokenize_lm()` function that takes a character sequence, `char_seq`, and a dictionary of unigram probabilities, `unigram_probs`, as input and returns a pair `(prob, tokens)`: the segmentation probability and the list of subword tokens. You will only return the segmentation with the highest probability.

    def tokenize_lm(char_seq, unigram_probs):

        ...

        return max(candidates)

As an example, applying `tokenize_lm('senare', unigram_probs)` results in

`(2.0899522820189735e-07, ['s', 'en', 'ar', 'e'])`

Your function will cache (memoize) the results to speed up the computation. It will be similar to Norvig's function in the notebook _How to Do Things with Words.ipynb_. You can reuse it.
Python has a built-in memoization decorator that you can use: `@functools.lru_cache(maxsize=2**10)`. You can also use the newer `@functools.cache` decorator if you have Python 3.9 or higher. See here: https://docs.python.org/3/library/functools.html.
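The idea of a memoized segmentation can be sketched on invented probabilities: try every prefix that is in the vocabulary, recurse on the rest, and keep the candidate with the highest product. The names `best_segmentation` and `segment` and the toy values are assumptions; returning a probability of 0.0 when no segmentation exists is a choice made for this sketch only.

```python
import functools

def best_segmentation(word, probs):
    """Return (probability, tokens) of the most probable split of `word`."""
    @functools.lru_cache(maxsize=None)
    def segment(rest):
        if not rest:
            return 1.0, ()
        candidates = []
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in probs:
                tail_prob, tail_tokens = segment(rest[i:])
                candidates.append((probs[head] * tail_prob,
                                   (head,) + tail_tokens))
        # No candidate: the vocabulary cannot cover this string.
        return max(candidates) if candidates else (0.0, ())

    prob, tokens = segment(word)
    return prob, list(tokens)

# Invented probabilities for the illustration.
toy_probs = {"s": 0.1, "e": 0.1, "n": 0.1, "en": 0.2, "sen": 0.01}
result = best_segmentation("sen", toy_probs)
print(result)
```

Here the split `['s', 'en']` (0.1 × 0.2) beats both the single token `'sen'` (0.01) and the character-by-character split (0.001). The inner function returns tuples rather than lists because cached values are shared between calls and should not be mutated.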

In [35]:
import functools

def tokenize_lm(char_seq, unigram_probs):
    # The arguments of a cached function must be hashable and a dictionary
    # is not: that is why we define an inner cacheable function that only
    # takes the character sequence.
    # Use one of the two decorators below to have a faster answer:
    # @functools.lru_cache(maxsize=2**10)
    @functools.cache  # Available from Python 3.9
    def __tokenize_lm(char_seq):
        ...  # Write your code here

    return __tokenize_lm(char_seq)

In [36]:
tokenize_lm('senare', unigram_probs)

(2.0899522820189735e-07, ['s', 'en', 'ar', 'e'])

### Text Tokenization with Unigrams

The previous function applies to a sequence without spaces. You will now apply it to your corpus. Write a `tokenize_text_lm()` function that takes the whole `corpus` string and the unigram probabilities `unigram_probs` as input and returns the corpus log-likelihood and the tokenized subwords. 

This function is just an application of the functions you just wrote, where you will:
1. `split()` the string by whitespaces;
2. Break the tokens into subtokens and compute the probabilities of the resulting sequences;
3. Sum the logarithm of these probabilities. Use log10 to check your output with the numbers in the notebook. 

It is essential that you sum the logarithms of the probabilities. If you multiply the probabilities instead, you will get an underflow.
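The underflow is easy to reproduce with invented numbers: the product of 100 probabilities of 10⁻⁵ is 10⁻⁵⁰⁰, far below the smallest positive double-precision float, while the sum of their logarithms is simply −500.

```python
import math

# 100 invented segment probabilities.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p        # the running product eventually underflows to 0.0

log_sum = sum(math.log10(p) for p in probs)

print(product)   # the probability information is lost
print(log_sum)   # the log-likelihood is still usable
```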

In [37]:
# Write your code


In [38]:
init_loglikelihood, tokens = tokenize_text_lm(corpus, unigram_probs)

In [39]:
init_loglikelihood, tokens[:10]

(-183398.9777556855, ['_S', 'e', 'l', 'm', 'a', '_L', 'ag', 'er', 'l', 'ö'])

### Vocabulary Selection

You will now implement the final loop, where you will, at each iteration:
1. Select one subword from the vocabulary.
2. Compute the resulting log-likelihood of the corpus without this word.
3. Compute the loss, i.e. the log-likelihood reduction when the subword is removed from the current vocabulary.

You will always keep the single characters in your vocabulary to avoid unknown words.

Store the pairs `(loss, removed_subword)` in a list `logloss_word` and rank them by loss value.
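The selection loop can be sketched on a toy word list with invented probabilities. This is a simplified illustration under explicit assumptions: the helper `log_likelihood` and all toy names are hypothetical, and the probabilities of the remaining subwords are kept unchanged after a removal (they are not re-estimated), which may differ from what your own implementation does.

```python
import functools
import math

def log_likelihood(words, probs):
    """Sum of log10 probabilities of the best segmentation of each word."""
    @functools.lru_cache(maxsize=None)
    def best(rest):
        if not rest:
            return 0.0
        scores = [math.log10(probs[rest[:i]]) + best(rest[i:])
                  for i in range(1, len(rest) + 1) if rest[:i] in probs]
        return max(scores) if scores else -math.inf
    return sum(best(w) for w in words)

# Invented toy data.
toy_words = ["sen", "en", "se"]
toy_probs = {"s": 0.2, "e": 0.2, "n": 0.2, "en": 0.3, "se": 0.1}

base = log_likelihood(toy_words, toy_probs)
losses = []
for subword in toy_probs:
    if len(subword) == 1:       # single characters are always kept
        continue
    # Assumption: drop the subword without re-estimating the other values.
    reduced = {w: p for w, p in toy_probs.items() if w != subword}
    losses.append((base - log_likelihood(toy_words, reduced), subword))

losses.sort()
print(losses)
```

After sorting, the first element is the subword whose removal costs the least log-likelihood, here `'se'`, which makes it the natural candidate to discard.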

In [40]:
logloss_word = []

In [41]:
# Write your code here


100%|██████████| 117/117 [00:35<00:00,  3.32it/s]


In [42]:
sorted(logloss_word)

[(-92.75720750979963, 'tt'),
 (-63.05992010710179, 'ne'),
 (-38.08057148766238, 'ta'),
 (2.0738454080710653, 'enne'),
 (12.061332144396147, 'ag'),
 (20.017389423010172, 'ra'),
 (71.24347190588014, 'li'),
 (117.21043151628692, 'll'),
 (133.63334921817295, 'ed'),
 (169.48387613904197, 'id'),
 (172.16923521881108, 'ör'),
 (178.6423015303153, 'ng'),
 (192.93231688259402, 'ar'),
 (203.4787342007039, 'så'),
 (227.7244415571622, 'na'),
 (251.89493320914335, 'ch'),
 (263.89934408143745, 'all'),
 (264.8709145585017, 'ing'),
 (283.4139082902111, 'den'),
 (307.6344156662235, 'som'),
 (327.0449843176466, 'ade'),
 (337.8945769118727, 'sk'),
 (340.5907729867322, 'in'),
 (341.17597176364507, 'det'),
 (367.542386146466, 'fv'),
 (368.11862880177796, 'un'),
 (379.358788624726, 'et'),
 (379.7334326027485, 'hade'),
 (385.8579164824914, 'på'),
 (424.2208616450662, 'or'),
 (435.2485914659337, 'ig'),
 (437.26840572143556, 'ej'),
 (442.5478414479294, 'on'),
 (514.6311576679873, 'än'),
 (526.9518980007851, 'il

You will now reduce your vocabulary by one token, `out_candidate`. Write the piece of code to determine it.

In [43]:
# Write your code here


In [1]:
out_candidate

NameError: name 'out_candidate' is not defined

## Submission

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [45]:
STIL_ID = ["student_1", "student_2"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "5-BPE_solution.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the subword to discard.

In [2]:
import json
ANSWER = json.dumps({'out_candidate': out_candidate})
ANSWER

NameError: name 'out_candidate' is not defined

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [47]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [48]:
import bz2
ASSIGNMENT = 5
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [49]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
               verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': '56944abf570d9d98eff11924c0fd6620bb99d29b80755114c3a97144357d5d762990df47bff346d713d4ce17c839d3fc6cf7c3a47f4db6d773f7fdad11d98d20',
 'submission_id': '7f69d780-5eb5-4dc9-9c82-e43928bf9cba'}

## Turning in your assignment

Now you are done with the program. To complete this assignment, you will write a report where you will:
1. Describe the background as well as the algorithms you used. For this, summarize the articles as described in the notebook:
   * Preliminaries: subword tokenizers
   * Design of the BPE Algorithm
   * Unigram Language Model
2. Describe your program as well as your results

The whole report should be 2 to 3 pages long.

Submit your report as well as your **notebook** (for archiving purposes) to Canvas: https://canvas.education.lu.se/. To write your report, you can either
1. Write your text directly in Canvas, or
2. Use Latex and Overleaf (www.overleaf.com). This will probably help you structure your text. You will then upload a PDF file in Canvas.

The submission deadline is October 14, 2022.

## Curious?

If you are interested, you can improve this program and test it on larger corpora. You can also read a fine implementation of BPE by Andrej Karpathy: https://github.com/karpathy/minGPT/blob/master/mingpt/bpe.py