# Tokenization

The first step in almost any NLP pipeline is tokenization. Tokenization is the process by which a string of characters is chopped into a sequence of smaller parts, called tokens, each of which is then assigned a numerical identifier that the model will use. This is done because Machine Learning models know absolutely nothing about language, but are in general very good at crunching numbers.

As an example, imagine we had the following sentence:

```
I can't believe this. You can't touch the ground.
```

We could tokenize this text by splitting on whitespaces, getting the following sequence of tokens:

```
["I", "can't", "believe", "this.", "You", "can't", "touch", "the", "ground."]
```

Which would then be associated with different numerical identifiers:

```
[1, 2, 3, 4, 5, 2, 6, 7, 8]
```

Notice that the number `2`, which represents the token `can't`, shows up twice in the sequence. Notice also that this simple method has some obvious shortcomings. For example, the model sees `this.` as a single token, since there is no space to split on. It would then create a separate token for `this!`, which is undesirable because the model would not know that the two are closely related. In practice, tokenization methods are more sophisticated than simply splitting on whitespace (and in some languages, such as Japanese, this is not even an option, as words are not separated by whitespace at all). Most modern approaches learn the tokenization directly from data, usually by applying ideas from information theory and data compression. In this article we will use a tokenization method called Word Piece, which we explain in more detail below.
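
The whitespace approach described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `whitespace_tokenize` and `build_vocab` are made up for this example, and identifiers are simply handed out in order of first appearance, starting at 1.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split a string into tokens on runs of whitespace."""
    return text.split()


def build_vocab(tokens: list[str]) -> dict[str, int]:
    """Give each distinct token an id, in order of first appearance (from 1)."""
    vocab: dict[str, int] = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
    return vocab


sentence = "I can't believe this. You can't touch the ground."
tokens = whitespace_tokenize(sentence)
vocab = build_vocab(tokens)
ids = [vocab[token] for token in tokens]

print(ids)               # [1, 2, 3, 4, 5, 2, 6, 7, 8]
print("this." in vocab)  # True
print("this!" in vocab)  # False: a different string, so it would need its own id
```

The last two lines show the shortcoming in practice: `this.` and `this!` are unrelated entries as far as the vocabulary is concerned.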

https://blog.floydhub.com/tokenization-nlp

# Word Piece

The Word Piece model was popularized by [Wu et al.](https://arxiv.org/pdf/1609.08144.pdf), who used it in a neural machine translation system. It was initially created by Schuster and Nakajima to solve the problem of segmenting Korean and Japanese text (which, as I said above, are not whitespace-separated languages), and was later adopted as a general method for automatically segmenting text into sub-word units. But before going into the details of Word Piece, let's delve a little into the motivation for using sub-word units.

## Why should we use sub-word units?

But why would we want to use sub-word units instead of words? The answer is surprisingly simple and clever: using subwords allows our model to learn from the internal structure of a word, and allows us to deal better with rare words. Let's illustrate these claims with an example.

Imagine you had the following list of 10 words that need to be tokenized:

`increasing surprising beautiful delicate quick increasingly surprisingly beautifully delicately quickly`

Regular word-level tokenization would yield a total of 10 different tokens, one for each word:

`increasing surprising beautiful delicate quick increasingly surprisingly beautifully delicately quickly`

However, we can easily see that these words are adjectives and their adverb forms. In English, an adverb can usually be formed by adding `ly` to an adjective. If, on the other hand, we use sub-word units, our model might decide that the following tokenization is more useful:

`increasing surprising beautiful delicate quick ##ly`

That is only 6 tokens, which is already useful from a model size standpoint, since the vocabulary is smaller. It also makes it easier for the model to learn that words such as `increasing` and `increasingly` have related meanings, and that most words ending in `##ly` function as adverbs. But where this technique really shines is in how it helps us deal with rare words.

Imagine now that your model is faced with a word it never saw in training: `supernaturally`. If we used whitespace tokenization, we would have to treat this word as an unknown word (usually represented by a catch-all token such as `UNK`) and would probably not derive much meaning from it. However, if we use sub-word units, we can compose this new word from two tokens that occur much more frequently: `supernatural` and `##ly`. This way, even though the model never saw the word `supernaturally`, it can infer that it is an adverb that adds a supernatural quality to an action.

Supernatural is much more common than Supernaturally in English text, as we can see in the [Google Books NGram Viewer](https://books.google.com/ngrams/graph?content=Supernatural%2C+Supernaturally&case_insensitive=on&year_start=1800&year_end=2008&corpus=15&smoothing=3&share=&direct_url=t4%3B%2CSupernatural%3B%2Cc0%3B%2Cs0%3B%3Bsupernatural%3B%2Cc0%3B%3BSupernatural%3B%2Cc0%3B.t1%3B%2Csupernaturally%3B%2Cc0#t4%3B%2CSupernatural%3B%2Cc0%3B%2Cs0%3B%3Bsupernatural%3B%2Cc0%3B%3BSupernatural%3B%2Cc0%3B.t1%3B%2Csupernaturally%3B%2Cc0), and the same situation occurs with many other words. By using subword units we are able to handle those rare words better and to derive meaning from them.
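
To make the `supernatural` + `##ly` composition concrete, here is a minimal sketch of how a word can be split against a fixed sub-word vocabulary. It assumes the greedy longest-match-first strategy used by BERT-style Word Piece tokenizers at inference time. The function name `wordpiece_split` and the tiny hand-written `toy_vocab` are hypothetical; a real vocabulary is learned from data.

```python
def wordpiece_split(word: str, vocab: set[str],
                    unk_token: str = "<unk>", prefix: str = "##") -> list[str]:
    """Greedily split `word` into the longest pieces found in `vocab`.

    Pieces that do not start the word are looked up with `prefix` prepended.
    If some part of the word cannot be matched, the whole word becomes unknown.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hand-written toy vocabulary, for illustration only.
toy_vocab = {"increasing", "surprising", "beautiful", "delicate",
             "quick", "supernatural", "##ly"}

print(wordpiece_split("quickly", toy_vocab))         # ['quick', '##ly']
print(wordpiece_split("supernaturally", toy_vocab))  # ['supernatural', '##ly']
print(wordpiece_split("zebra", toy_vocab))           # ['<unk>']
```

With only seven vocabulary entries, the sketch covers all ten words from the list above plus `supernaturally`, while a word with no known pieces still falls back to the unknown token.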


# Let's Code!

To train our tokenizer we will use a library that is called, appropriately enough, `tokenizers`. This library contains
efficient implementations of most modern tokenizers, including Word Piece.

## Training the tokenizer

In the `tokenizers` library, a tokenizer is composed of the following components:

- **Normalizer:** Pre-processes the text before feeding it to subsequent steps. Examples of normalizations are Unicode normalization and lowercasing.
- **Pre-Tokenizer:** Creates the initial candidate split that is then refined by the tokenizer model. The most common choice is to pre-tokenize on whitespace.
- **Model:** The model that actually performs the tokenization.
- **Post-Processor:** In charge of adding any extra processing needed for a specific language model (for example, special tokens for classification or sentence separation).

Other than that, we only need to pick which special tokens our tokenizer will need to handle. For our model we made the following choices:

- **Normalizer:** We use the *BertNormalizer*, the same one used to train the BERT language model. This normalizer replaces all types of whitespace characters with a plain space, strips accents from characters and lowercases all characters. It also adds spaces around Chinese characters (so that they are split by the pre-tokenizer), but that won't be necessary for our dataset.
- **Pre-tokenizer:** We use the *BertPreTokenizer*. This pre-tokenizer simply splits on whitespace characters and punctuation.
- **Model:** We use the Word Piece model.
- **Post-Processor:** We have no need for any extra processing after the tokenization is done.

Finally, we let our model handle two special tokens: `<unk>`, for out-of-vocabulary tokens, and `<pad>`, for padding. We train our tokenizer to produce a vocabulary of 30,000 tokens.

```python
from tokenizers import Tokenizer, pre_tokenizers, decoders, trainers, models, normalizers

UNK_TOKEN = "<unk>"
PAD_TOKEN = "<pad>"
VOCAB_SIZE = 30000
SUBWORD_PREFIX = "##"

# Model: Word Piece, falling back to UNK_TOKEN for anything it cannot split
tokenizer = Tokenizer(models.WordPiece(unk_token=UNK_TOKEN))
tokenizer.add_special_tokens([UNK_TOKEN, PAD_TOKEN])

# Normalizer: clean up whitespace, strip accents and lowercase
tokenizer.normalizer = normalizers.BertNormalizer(
    clean_text=True,
    handle_chinese_chars=True,
    strip_accents=True,
    lowercase=True
)

# Pre-tokenizer: split on whitespace and punctuation
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
# Decoder: merges "##" continuation pieces back into words when decoding
tokenizer.decoder = decoders.WordPiece(prefix=SUBWORD_PREFIX)

trainer = trainers.WordPieceTrainer(
    vocab_size=VOCAB_SIZE,
    min_frequency=2,
    limit_alphabet=1000,
    special_tokens=[UNK_TOKEN, PAD_TOKEN],
    show_progress=True,
    continuing_subword_prefix=SUBWORD_PREFIX
)
```
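
To get a feel for what the normalizer and pre-tokenizer configured above do to a piece of text before the Word Piece model sees it, here is a rough approximation using only the Python standard library. It is not the library's implementation: the helper names `normalize` and `pre_tokenize` are invented for this sketch, the handling of Chinese characters is left out, and the regular expression treats `_` as part of a word, which the real pre-tokenizer may not.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Approximate the normalizer: plain spaces, lowercase, no accents."""
    text = re.sub(r"\s", " ", text)  # any whitespace character -> plain space
    text = text.lower()
    decomposed = unicodedata.normalize("NFD", text)  # split letters from accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def pre_tokenize(text: str) -> list[str]:
    """Approximate the pre-tokenizer: split on whitespace and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


print(normalize("Héllo,\tWörld!"))              # hello, world!
print(pre_tokenize("i can't believe this."))    # ['i', 'can', "'", 't', 'believe', 'this', '.']
```

Note how the apostrophe and the final period become candidate pieces of their own, so punctuation no longer sticks to the neighbouring word.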

We can then read our data into memory, keeping only the English-language reviews:

```python
import polars as pl

df_en = pl.read_parquet("../data/processed/reviews.parquet").filter(pl.col("language") == "en")
df_en.head()
```

| user (str) | date (date) | score (i8) | text (str) | length (u32) | game (str) | platform (str) | language (str) | set (str) |
|---|---|---|---|---|---|---|---|---|
| "doodlerman" | 2011-06-09 | 10 | "I'm one of tho…" | 1900 | "The Legend of …" | "Nintendo 64" | "en" | "train" |
| "Jacody" | 2010-11-25 | 10 | "Anyone who giv…" | 768 | "The Legend of …" | "Nintendo 64" | "en" | "test" |
| "Kaistlin" | 2011-04-25 | 10 | "I won't bore y…" | 176 | "The Legend of …" | "Nintendo 64" | "en" | "train" |
| "SirCaestus" | 2011-06-12 | 10 | "Everything in …" | 153 | "The Legend of …" | "Nintendo 64" | "en" | "train" |
| "StevenA" | 2010-03-21 | 10 | "This game is t…" | 504 | "The Legend of …" | "Nintendo 64" | "en" | "train" |


And finally we train our tokenizer on the loaded data, taking care to use only the
training set.

```python
tokenizer.train_from_iterator(
    df_en.filter(pl.col("set") == "train")["text"].to_list(), trainer=trainer
)
```

Now that we have trained our tokenizer, let's check how it tokenizes our example sentence from the beginning (this time ending with an exclamation mark):

```python
encoded = tokenizer.encode("I can't believe this. You can't touch the ground!")
print(encoded.tokens)
print(encoded.ids)
```

```
['i', 'can', "'", 't', 'believe', 'this', '.', 'you', 'can', "'", 't', 'touch', 'the', 'ground', '!']
[48, 1456, 8, 59, 2654, 1394, 15, 1385, 1456, 8, 59, 3229, 1356, 3318, 2]
```


Wait a minute... I thought we went through all this trouble to get subword units, but all those tokens are actually
complete words! What is happening here is that these words are frequent enough that it is worthwhile for the model to keep
them as single tokens. However, let's see what happens when we try to tokenize the sentence below:

```python
encoded = tokenizer.encode("Feast your eyes on this accursed nonsense.")

print(encoded.tokens)
print(encoded.ids)
```

```
['feast', 'your', 'eyes', 'on', 'this', 'accur', '##sed', 'nonsense', '.']
[16936, 1509, 4223, 1396, 1394, 4506, 5264, 5829, 15]
```


Here we can see that the tokenizer split the word `accursed` into two parts, `accur` and `##sed`.
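
The `##` prefix marks a piece that continues the previous token, which is what makes it possible to put the original words back together. A minimal sketch of that idea in plain Python follows. The helper name `merge_wordpieces` is made up for this example, and the library's own Word Piece decoder may additionally tidy up spacing around punctuation.

```python
def merge_wordpieces(tokens: list[str], prefix: str = "##") -> list[str]:
    """Glue continuation pieces (those starting with `prefix`) onto the previous token."""
    words: list[str] = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return words


tokens = ['feast', 'your', 'eyes', 'on', 'this', 'accur', '##sed', 'nonsense', '.']
print(merge_wordpieces(tokens))
# ['feast', 'your', 'eyes', 'on', 'this', 'accursed', 'nonsense', '.']
```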

# Saving our work

Now, let us save our work. First, let us tokenize all text and save that into our dataset.

```python
def tokenize_review(text: str) -> tuple[list[str], list[int]]:
    """Encode one review and return its tokens and their numerical ids."""
    encoded = tokenizer.encode(text)
    return encoded.tokens, encoded.ids

# df_en = df_en.sample(frac=0.01)
# df_en = df_en.drop(["tokens", "token_ids"])

df_en = df_en.with_columns(
    # Tokenize every review; each result holds (tokens, ids)
    pl.col("text").apply(tokenize_review).alias("results")
).with_columns([
    # Unpack the two parts of the result into separate columns
    pl.col("results").apply(lambda results: results[0]).alias("tokens"),
    # Ids fit in 16 bits because the vocabulary has 30,000 entries
    pl.col("results").apply(lambda results: results[1]).cast(pl.List(pl.UInt16)).alias("token_ids")
]).drop("results").with_columns(
    # Number of tokens per review
    pl.col("token_ids").arr.lengths().alias("n_tokens")
)

df_en.head()
```

| user (str) | date (date) | score (i8) | text (str) | length (u32) | game (str) | platform (str) | language (str) | set (str) | tokens (list[str]) | token_ids (list[u16]) | n_tokens (u32) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| "Sigone" | 2013-05-21 | 10 | "If you loved B…" | 722 | "Borderlands 2" | "PC" | "en" | "val" | ["if", "you", … "d"] | [1482, 1385, … 43] | 9471 |
| "dusty0923" | 2020-06-21 | 9 | "This game is n…" | 975 | "The Last of Us…" | "PlayStation 4" | "en" | "test" | ["this", "game", … "."] | [1394, 1370, … 15] | 9471 |
| "MegaOrca" | 2014-10-29 | 3 | "Disappointing.…" | 177 | "Sid Meier's Ci…" | "PC" | "en" | "train" | ["disappointing", ".", … "."] | [3115, 15, … 15] | 9471 |
| "Luigirific" | 2013-11-15 | 9 | "Positive : +Ac…" | 459 | "The Wonderful …" | "Wii U" | "en" | "train" | ["positive", ":", … "0"] | [3556, 27, … 17] | 9471 |
| "xAtomicLink" | 2011-07-11 | 10 | "This is the pe…" | 222 | "Earth Defense …" | "Xbox 360" | "en" | "test" | ["this", "is", … "."] | [1394, 1377, … 15] | 9471 |


We will also need to save the tokenizer itself for later usage in our development pipeline.
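
A minimal sketch of that step, assuming the `save` and `from_file` methods of the `tokenizers` library, which write and read a single JSON file. The file name `tokenizer.json` and the temporary directory are placeholders, and a tiny stand-in tokenizer is trained first so that the snippet runs on its own. In our pipeline you would call `save` on the `tokenizer` trained above instead.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny stand-in tokenizer, only so this snippet is self-contained.
toy_tokenizer = Tokenizer(models.WordPiece(unk_token="<unk>"))
toy_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
toy_trainer = trainers.WordPieceTrainer(
    vocab_size=100,
    special_tokens=["<unk>", "<pad>"],
    show_progress=False,
)
toy_tokenizer.train_from_iterator(
    ["this game is great", "this game is not great"], trainer=toy_trainer
)

# Placeholder location; pick a path inside your own project instead.
path = os.path.join(tempfile.mkdtemp(), "tokenizer.json")
toy_tokenizer.save(path)

# Later, for example in another script, load the tokenizer back from disk.
reloaded = Tokenizer.from_file(path)
print(reloaded.encode("this game").tokens == toy_tokenizer.encode("this game").tokens)
```

Because the whole configuration (normalizer, pre-tokenizer, model and vocabulary) is stored in one file, later stages of the pipeline only need that file to reproduce exactly the same tokenization.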