# SentencePiece and Byte Pair Encoding 

## Introduction to Tokenization

To process text with a neural network, we first need to **encode** the text as numbers, i.e. token ids (which index into embedding vectors like the ones we've been using in the previous assignments), since tensor operations act on numbers. Finally, if the output of the network is text, we need to **decode** the predicted token ids back to text.
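
As a minimal sketch of this encode/decode round trip, assume a made-up five-entry vocabulary. The names `vocab`, `token_to_id`, `encode`, and `decode` are illustrative only and do not come from any library.

```python
# Toy vocabulary: the tokens and their ids are made up for illustration.
vocab = ["<unk>", "tokens", "are", "tricky", "."]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def encode(tokens):
    """Map each token to its id, falling back to <unk> for unknown tokens."""
    return [token_to_id.get(tok, token_to_id["<unk>"]) for tok in tokens]

def decode(ids):
    """Map ids back to tokens."""
    return [vocab[i] for i in ids]

ids = encode(["tokens", "are", "tricky", "."])
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # ['tokens', 'are', 'tricky', '.']
```

Any token outside the vocabulary maps to id 0 here, which is one reason the choice of token granularity matters.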

To encode text, the first decision to make is the level of granularity at which to split it, because features are ultimately created from the resulting **tokens**. Many different experiments have been carried out using *words*, *morphological units*, *phonemic units*, and *characters*. For example,

- Tokens are tricky. (raw text)
- Tokens are tricky . ([words](https://arxiv.org/pdf/1301.3781))
- Token s _ are _ trick _ y . ([morphemes](https://arxiv.org/pdf/1907.02423.pdf))
- t oʊ k ə n z _ ɑː _ ˈt r ɪ k i. ([phonemes](https://www.aclweb.org/anthology/W18-5812.pdf), for STT)
- T o k e n s _ a r e _ t r i c k y . ([character](https://www.aclweb.org/anthology/C18-1139/))
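
The word-level and character-level splits in the list above can be roughly reproduced with the standard library. This is only a sketch: the regular expression is one simple assumption about what counts as a word, not a rule used by any particular tokenizer.

```python
import re

raw = "Tokens are tricky."

# Word level: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", raw)

# Character level: every character is a token; spaces are shown as "_" as in the list above.
chars = list(raw.replace(" ", "_"))

print(words)  # ['Tokens', 'are', 'tricky', '.']
print(chars)
```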

But how these units, such as words, are identified is largely determined by the language they come from. For example, in many European languages a space is used to separate words, while in some Asian languages there are no spaces between words. Compare English and Mandarin.

- Tokens are tricky. (original sentence)
- 令牌很棘手 (Mandarin)
- Lìng pái hěn jí shǒu (pinyin)
- 令牌 很 棘手 (Mandarin with spaces)


So the ability to **tokenize**, i.e. split text into meaningful fundamental units, is not always straightforward.
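
A quick illustration of the problem, using the sentences from the comparison above: splitting on white space works reasonably for English but returns the unsegmented Mandarin sentence as a single token.

```python
english = "Tokens are tricky."
mandarin = "令牌很棘手"
mandarin_spaced = "令牌 很 棘手"

print(english.split())          # ['Tokens', 'are', 'tricky.']
print(mandarin.split())         # ['令牌很棘手']
print(mandarin_spaced.split())  # ['令牌', '很', '棘手']
```

Note that even for English, the naive split leaves the period attached to the last word.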

There are also practical issues of how large our *vocabulary* of tokens, `vocab_size`, should be, considering memory limitations vs. coverage. A compromise needs to be made between the finest-grained models employing characters, which can be memory intensive, and more computationally efficient *subword* units such as [n-grams](https://arxiv.org/pdf/1712.09405) or larger units.
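
One side of this trade-off can be seen on a toy sentence (made up for illustration): the same text becomes a much longer sequence at the character level than at the word level, which is where the extra memory cost of character models comes from.

```python
sample = "the cat sat on the mat and the dog sat on the log"

word_tokens = sample.split()
char_tokens = list(sample)

print(len(word_tokens))  # 13
print(len(char_tokens))  # 49
```

The other side is not visible on such a small sample: a word vocabulary keeps growing as more text is seen, while a character vocabulary is bounded by the alphabet.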

In [SentencePiece](https://www.aclweb.org/anthology/D18-2012.pdf) Unicode characters are grouped together using either a [unigram language model](https://www.aclweb.org/anthology/P18-1007.pdf) (used in this week's assignment) or [BPE](https://arxiv.org/pdf/1508.07909.pdf), **byte-pair encoding**. We will discuss BPE, since BERT and many of its variants use a modified version of BPE (WordPiece) and its pseudocode is easy to implement and understand... hopefully!

## SentencePiece Preprocessing
### NFKC Normalization

Unsurprisingly, even using Unicode to initially tokenize text can be ambiguous, because the same character can be written as different sequences of code points, e.g.,

In [6]:
eaccent = '\u00E9'
e_accent = '\u0065\u0301'
print(f'{eaccent} = {e_accent} : {eaccent == e_accent}')

é = é : False


SentencePiece uses the Unicode standard Normalization form, [NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence), so this isn't an issue. Looking at our example from above again with normalization:

In [9]:
from unicodedata import normalize

norm_eaccent = normalize("NFKC", "\u00E9")
norm_e_accent = normalize("NFKC", "\u0065\u0301")
print(f"{norm_eaccent} = {norm_e_accent} : {norm_eaccent == norm_e_accent}")

é = é : True


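Normalization has actually changed the Unicode code points (the unique ids of Unicode characters) of one of these two strings: the decomposed sequence `0x65 0x301` is composed into the single code point `0xe9`.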

In [10]:
def get_hex_encoding(s):
    return " ".join(hex(ord(c)) for c in s)

def print_string_and_encoding(s):
    print(f"{s} : {get_hex_encoding(s)}")

In [11]:
for s in [eaccent, e_accent, norm_eaccent, norm_e_accent]:
    print_string_and_encoding(s)

é : 0xe9
é : 0x65 0x301
é : 0xe9
é : 0xe9


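This normalization has other side effects which may be considered useful, such as converting compatibility characters (for example the ligature ﬁ or full-width letters) to their plain equivalents. Note that standard NFKC leaves curly quotes such as “ unchanged; mapping them to the ASCII " would need an extra rule, and we would then lose the directionality of the quote.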
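
A quick check of a few NFKC mappings with Python's `unicodedata` module. The characters are chosen here purely for illustration; in this check the compatibility characters are rewritten, while the left curly quote is left unchanged by standard NFKC.

```python
from unicodedata import normalize

examples = {
    "ligature fi": "\ufb01",       # ﬁ
    "full-width A": "\uff21",      # Ａ
    "superscript two": "\u00b2",   # ²
    "left curly quote": "\u201c",  # “
}

for name, ch in examples.items():
    out = normalize("NFKC", ch)
    print(f"{name}: {ch!r} -> {out!r} (changed: {out != ch})")
```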

### Lossless Tokenization<sup>*</sup>

SentencePiece also ensures that when you tokenize and then detokenize your data, the original position of white space is preserved. (However, tabs and newlines are converted to spaces; please try this experiment yourself later on.)

To ensure this **lossless tokenization**, it replaces white space with ▁ (U+2581), so that simply joining the tokens and replacing each ▁ with a space restores the white space, even if there are consecutive symbols. But remember to normalize first and then replace spaces with ▁ (U+2581), as the following example shows.

In [15]:
s = "Tokenization is hard."
s_ = s.replace(" ", "\u2581")
s_n = normalize("NFKC", "Tokenization is hard.")

In [16]:
print(get_hex_encoding(s))
print(get_hex_encoding(s_))
print(get_hex_encoding(s_n))

0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x2581 0x69 0x73 0x2581 0x68 0x61 0x72 0x64 0x2e
0x54 0x6f 0x6b 0x65 0x6e 0x69 0x7a 0x61 0x74 0x69 0x6f 0x6e 0x20 0x69 0x73 0x20 0x68 0x61 0x72 0x64 0x2e
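
The round trip itself can be sketched as follows. This is a simplified illustration, not SentencePiece's implementation: single characters stand in for real subword pieces, and the helper names `to_pieces` and `from_pieces` are hypothetical.

```python
from unicodedata import normalize

def to_pieces(text):
    """Normalize, mark spaces with U+2581, and split into character pieces."""
    marked = normalize("NFKC", text).replace(" ", "\u2581")
    return list(marked)

def from_pieces(pieces):
    """Join the pieces and turn U+2581 back into spaces."""
    return "".join(pieces).replace("\u2581", " ")

original = "Tokenization  is hard."  # note the two consecutive spaces
pieces = to_pieces(original)
restored = from_pieces(pieces)
print(restored == original)  # True
```

Because every space is carried by an explicit symbol, no white space information is lost when the pieces are joined, however the text was split.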


## BPE Algorithm

Now that we have discussed the preprocessing that SentencePiece performs, we will get our data, preprocess it, and apply the BPE algorithm. We will show how this reproduces the tokenization produced by training SentencePiece on our example dataset (from this week's assignment).
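
Before turning to the data, here is a minimal sketch of how BPE learns merges, loosely following the pseudocode in the BPE paper linked above. The toy word frequencies, the number of merges, and the helper names `get_pair_counts` and `merge_pair` are assumptions made for illustration; SentencePiece's own implementation differs in its details.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    merged = "".join(pair)
    # Match the pair only when both symbols are whole, space-delimited symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words written as space-separated symbols, with made-up frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):  # learn three merges
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(vocab)
```

Each iteration merges the most frequent adjacent pair into a new symbol, so frequent character sequences such as `est` gradually become single tokens.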

### Preparing our Data
First, we get our SQuAD data and process it as above.

In [17]:
import ast

In [18]:
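def convert_json_examples_to_text(filepath):
    # Parse each line of the file as a Python literal (a dict with a bytes "text" field).
    with open(filepath) as f:
        example_jsons = [ast.literal_eval(line) for line in f]
    # Decode the "text" field of each example from bytes to str.
    texts = [example_json["text"].decode("utf-8") for example_json in example_jsons]
    text = "\n\n".join(texts)
    text = normalize("NFKC", text)  # same normalization as discussed above
    with open("example.txt", "w", encoding="utf-8") as fw:
        fw.write(text)
    return text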
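
The function above relies on `ast.literal_eval` being able to parse each line, including bytes literals. A small sketch with a hypothetical record (the real file's fields may differ; only the `"text"` key is implied by the code above):

```python
import ast

# Hypothetical line: a Python dict literal whose "text" value is a bytes literal.
line = "{'text': b'Tokenization is hard.', 'title': b'Example'}"

record = ast.literal_eval(line)
print(record["text"].decode("utf-8"))  # Tokenization is hard.
```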

In [19]:
text = convert_json_examples_to_text("data.txt")
print(text[:900])


In [2]:
%%capture
!pip install whatlies[all]

In [5]:
from whatlies.language import BytePairLang
