# WordPiece

## About

WordPiece was developed by Google and used in BERT. Since then it has been used by Transformer models based on BERT, such as DistilBERT, MobileBERT, Funnel Transformers, and MPNet.

## Training

Training follows the same overall procedure as BPE; only the criterion for choosing a merge differs.

We start from a small vocabulary that contains the special tokens used by the model and the initial alphabet. Subwords that do not start a word are marked with the prefix `##` (the WordPiece prefix), so `word` is initially split into `w ##o ##r ##d`.

Merging then proceeds based on a score. In contrast to BPE, which merges the most frequent pair, WordPiece scores each pair as follows:

score = (freq_of_pair)/(freq_of_first_element × freq_of_second_element)

This formula favors pairs that occur together often relative to how often their parts occur on their own. A pair made of two very common tokens gets a low score even if the pair itself is frequent.

### Example

Let's assume we have a corpus of words with their corresponding frequencies as below:

`("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)`

We start by splitting every word into characters, which gives the initial vocabulary with these frequencies:

`[("b", 4), ("h", 15), ("p", 17), ("##g", 20), ("##n", 16), ("##s", 5), ("##u", 36)]`
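The character-level counts above can be reproduced with a few lines of Python. This is only a sketch for the toy corpus in this example; the names `word_freqs` and `split_word` are made up for illustration and do not come from any library.

```python
from collections import Counter

# Word frequencies from the example corpus.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def split_word(word):
    # The first character stays bare; every later character gets the "##" prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]

# Each token is counted once per occurrence, weighted by the word frequency.
token_freqs = Counter()
for word, freq in word_freqs.items():
    for token in split_word(word):
        token_freqs[token] += freq

print(sorted(token_freqs.items()))
```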

The most frequent pair is `(("##u", "##g"), 20)`, but `("##u", 36)` is very frequent as well, so its score is only 20/(36*20)=1/36. The highest score goes to `(("##g", "##s"), 5)`, with a value equal to 5/(20*5)=1/20. So the first merge is `("##g", "##s")` into `"##gs"`, and the corpus becomes:
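To check the two scores quoted here, the formula can be written as a small helper. This is a minimal sketch; `pair_score` is a hypothetical name, and `Fraction` is used only so the results print as exact fractions.

```python
from fractions import Fraction

def pair_score(pair_freq, first_freq, second_freq):
    # score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
    return Fraction(pair_freq, first_freq * second_freq)

# ("##u", "##g"): pair seen 20 times, "##u" 36 times, "##g" 20 times.
print(pair_score(20, 36, 20))  # 1/36

# ("##g", "##s"): pair seen 5 times, "##g" 20 times, "##s" 5 times.
print(pair_score(5, 20, 5))    # 1/20
```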

`("h" "##u" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("h" "##u" "##gs", 5)`

And our vocabulary will be:

`["b", "h", "p", "##g", "##n", "##s", "##u", "##gs"]`

At this point, every remaining pair contains `##u`, and they all have the same score of 1/36. Since the scores are tied, we merge the first pair, `("h", "##u")`:

`("hu" "##g", 10), ("p" "##u" "##g", 5), ("p" "##u" "##n", 12), ("b" "##u" "##n", 4), ("hu" "##gs", 5)`

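The next pair will be `"hu" "##gs"`, with a score equal to 5/(15*5)=1/15. This beats `"hu" "##g"`, which scores 10/(15*15)=2/45, and the remaining pairs with `##u`, which score at most 1/21. This goes on until we reach the desired vocabulary size.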
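The two merges walked through above can be replayed with a short training loop. This is a simplified sketch for the toy corpus, not the implementation used by any library; the helper names `compute_scores` and `merge_pair` are made up, and ties are broken by taking the first pair encountered, which is an assumption that matches the choice made in the text.

```python
from collections import Counter
from fractions import Fraction

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Character-level split; non-initial characters carry the "##" prefix.
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}

def compute_scores(splits):
    token_freqs, pair_freqs = Counter(), Counter()
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for token in tokens:
            token_freqs[token] += freq
        for a, b in zip(tokens, tokens[1:]):
            pair_freqs[(a, b)] += freq
    # score = freq_of_pair / (freq_of_first_element * freq_of_second_element)
    return {
        (a, b): Fraction(f, token_freqs[a] * token_freqs[b])
        for (a, b), f in pair_freqs.items()
    }

def merge_pair(a, b, splits):
    # Drop the "##" of the second token when gluing the two pieces together.
    merged = a + (b[2:] if b.startswith("##") else b)
    for word, tokens in splits.items():
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        splits[word] = out
    return merged

num_merges = 2  # the two merges described in the text
merges = []
for _ in range(num_merges):
    scores = compute_scores(splits)
    best = max(scores, key=scores.get)  # on ties, max() keeps the first pair seen
    merges.append((best, scores[best]))
    merge_pair(best[0], best[1], splits)

for pair, score in merges:
    print(pair, score)
print(splits)
```

Raising `num_merges` continues the process; a real trainer would stop once the vocabulary reaches the desired size.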

## Tokenization Algorithm

In contrast to BPE, the merge rules are not saved; only the final vocabulary is kept. To tokenize a word, we find the longest subword in the vocabulary that matches the start of the word and split there, e.g. 'bugs' is first split into ('b', '##ugs'). We then repeat this on the remaining part of the word until we reach the end, so 'bugs' ends up as ('b', '##u', '##gs').

If at any point no vocabulary entry matches the remaining part of the word, the whole word is tokenized as ["[UNK]"].
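The greedy longest-match procedure, including the fallback to `[UNK]`, can be sketched as follows. The function name `wordpiece_tokenize` and the `vocab` set are illustrative assumptions; the vocabulary is the one from the training example after its two merges.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first and shrink it from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # Nothing in the vocabulary fits here: the whole word is unknown.
            return [unk_token]
        tokens.append(match)
        start = end
    return tokens

# Vocabulary from the training example after the two merges (assumed here).
vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu"}

print(wordpiece_tokenize("bugs", vocab))  # ['b', '##u', '##gs']
print(wordpiece_tokenize("hugs", vocab))  # ['hu', '##gs']
print(wordpiece_tokenize("mug", vocab))   # ['[UNK]']
```

Note that "mug" becomes a single `[UNK]` even though `##u` and `##g` are in the vocabulary, because the failure on "m" discards the whole word.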

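```python
from transformers import BertTokenizer

# Load the pretrained WordPiece tokenizer that ships with BERT.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = "This is a sample sentence."

# Calling the tokenizer returns input IDs, token type IDs, and the attention mask.
tokens = tokenizer(sentence)
print(tokens)
```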

{'input_ids': [101, 2023, 2003, 1037, 7099, 6251, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
