<a href="https://colab.research.google.com/github/juacardonahe/Curso_NLP/blob/main/1_FundamentosNLP/1.2_WordPiece/1_2_1_WordPieceAlgorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://raw.githubusercontent.com/juacardonahe/Curso_NLP/refs/heads/main/data/UnFieldB.png" width="40%">

# **Natural Language Processing (NLP)**
### Department of Electrical, Electronic, and Computer Engineering
#### Universidad Nacional de Colombia - Sede Manizales

#### Prepared by: Juan José Cardona H.
#### Reviewed by: Diego A. Perez

# **1.2.1 - WordPiece Algorithm**

**WordPiece** is the tokenization algorithm that Google developed to pretrain BERT. It has since been reused in several BERT-based Transformer models, such as DistilBERT, MobileBERT, Funnel Transformer, and MPNet. Its training procedure is very similar to BPE, but the way it tokenizes text with the trained vocabulary is different.

⚠️ *Google has never released the source code for its implementation of the WordPiece training algorithm.*

WordPiece starts from a small vocabulary that contains the special tokens used by the model and the initial alphabet. Because it marks subwords with a prefix (`##` for BERT), each word is initially split into characters, and the prefix is added to every character except the first. For instance, "word" gets split like this:

```
w ##o ##r ##d
```
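
The initial split can be reproduced in a few lines of Python. This is a minimal sketch; the helper name `initial_split` is hypothetical and is not part of any library:

```python
def initial_split(word, prefix="##"):
    """Split a word into characters, prefixing every character except the first."""
    return [ch if i == 0 else prefix + ch for i, ch in enumerate(word)]

print(" ".join(initial_split("word")))  # w ##o ##r ##d
```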



The WordPiece algorithm is iterative. According to the original paper, ["Japanese and Korean Voice Search (Schuster et al., 2012)"](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf), it can be summarized as follows:

1. Initialize the word unit inventory with the base characters.

2. Build a language model on the training data using the word inventory from 1.

3. Generate a new word unit by combining two units out of the current word inventory. The word unit inventory will be incremented by 1 after adding this new word unit. The new word unit is chosen from all the possible ones so that it increases the likelihood of the training data the most when added to the model.

4. Go to step 2 until a predefined limit of word units is reached or the likelihood increase falls below a certain threshold.
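
The loop above can be sketched in plain Python. Since Google's implementation is not public, this sketch does not train a language model. Instead it assumes a pair score that open re-implementations commonly use as a stand-in for the likelihood gain: `freq(pair) / (freq(first) * freq(second))`. The toy corpus `word_freqs` and all function names are hypothetical.

```python
from collections import defaultdict

def initial_split(word):
    # Step 1: characters, with "##" on every character except the first
    return [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]

def pair_scores(splits, word_freqs):
    # Assumed stand-in for the likelihood gain: freq(pair) / (freq(a) * freq(b))
    unit_freq, pair_freq = defaultdict(int), defaultdict(int)
    for word, freq in word_freqs.items():
        units = splits[word]
        for unit in units:
            unit_freq[unit] += freq
        for a, b in zip(units, units[1:]):
            pair_freq[(a, b)] += freq
    return {(a, b): f / (unit_freq[a] * unit_freq[b])
            for (a, b), f in pair_freq.items()}

def merge_pair(a, b, splits):
    # Step 3: replace every adjacent (a, b) with the merged unit
    merged = a + b[2:] if b.startswith("##") else a + b
    for word, units in splits.items():
        out, i = [], 0
        while i < len(units):
            if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(units[i])
                i += 1
        splits[word] = out
    return merged

def train_wordpiece(word_freqs, vocab_size):
    splits = {word: initial_split(word) for word in word_freqs}
    vocab = sorted({unit for units in splits.values() for unit in units})
    # Step 4: repeat until the vocabulary limit is reached or nothing is left to merge
    while len(vocab) < vocab_size:
        scores = pair_scores(splits, word_freqs)
        if not scores:
            break
        a, b = max(scores, key=scores.get)
        merged = merge_pair(a, b, splits)
        if merged not in vocab:
            vocab.append(merged)
    return vocab, splits

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, splits = train_wordpiece(word_freqs, vocab_size=8)
print(vocab[-1])       # ##gs
print(splits["hugs"])  # ['h', '##u', '##gs']
```

With this score, a pair of rare units that almost always occur together (here `##g` followed by `##s`) outranks pairs that are merely frequent, which is the main practical difference from BPE's raw pair counts.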

## **Implementing WordPiece Algorithm**

To see a WordPiece tokenizer in action, we'll load the pretrained `bert-base-uncased` tokenizer, which ships with a vocabulary of WordPiece units:

In [6]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [5]:
# Check if the tokenizer is loaded
print('Tokenizer loaded successfully! Vocabulary size:', len(tokenizer.vocab))

Tokenizer loaded successfully! Vocabulary size: 30522


The vocabulary size tells us how many unique tokens or subwords the tokenizer knows.

### **1. Tokenization example**
Let's tokenize a sample sentence to see how WordPiece breaks down words into subword units. The WordPiece tokenizer splits unknown words into smaller known pieces from its vocabulary.

In [7]:
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence
tokens = tokenizer.tokenize(sentence)

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Print results
print('Original sentence:', sentence)
print('Tokens:', tokens)
print('Token IDs:', token_ids)

Original sentence: The quick brown fox jumps over the lazy dog
Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Token IDs: [1996, 4248, 2829, 4419, 14523, 2058, 1996, 13971, 3899]


Now let's tokenize a single inflected word, 'playing', to check whether WordPiece splits it into subword units.

In [8]:
# Complex word example
word = "playing"

# Tokenize the word
tokens = tokenizer.tokenize(word)

# Convert tokens to IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Print results
print('Original word:', word)
print('Tokens:', tokens)
print('Token IDs:', token_ids)

Original word: playing
Tokens: ['playing']
Token IDs: [2652]


Here "playing" comes back as a single token (ID 2652): the whole word is present in the `bert-base-uncased` vocabulary, so no split is needed. WordPiece only breaks a word into smaller units when the full word is missing from the vocabulary. In that case, every piece after the first carries the "##" prefix, which marks it as a continuation of the previous token. For example, a vocabulary that contained "play" and "##ing" but not "playing" would produce "play" + "##ing".
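
Whether a word is kept whole or split depends only on what the vocabulary contains. The following is a minimal sketch of the greedy longest-match-first lookup that WordPiece uses at tokenization time, assuming a toy vocabulary. `toy_vocab` and `wordpiece_tokenize` are hypothetical names, not library API, and the real tokenizer adds further details such as a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily match the longest vocabulary entry at each position in the word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and try again
        if piece is None:
            return [unk_token]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"play", "playing", "##ing", "##ed", "##s"}
print(wordpiece_tokenize("playing", toy_vocab))  # ['playing']
print(wordpiece_tokenize("played", toy_vocab))   # ['play', '##ed']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because the longest match wins, "playing" stays intact as long as the full word is in the vocabulary, even though "play" and "##ing" are also available.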

### **2. IDs to text**
We can convert the token IDs back to text to see how the tokenizer reconstructs the input.

In [9]:
# Decode token IDs back to text
decoded_text = tokenizer.decode(token_ids)

print('Token IDs:', token_ids)
print('Decoded text:', decoded_text)

Token IDs: [2652]
Decoded text: playing
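
When a word has been split, decoding has to glue the continuation pieces back onto the token before them. A minimal sketch of that step, assuming tokens as plain strings (`wordpiece_detokenize` is a hypothetical helper; the library's `decode` may apply additional clean-up that is not shown here):

```python
def wordpiece_detokenize(tokens, prefix="##"):
    """Join tokens with spaces, attaching '##' pieces to the preceding token."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]  # continuation: no space before it
        else:
            words.append(token)               # start of a new word
    return " ".join(words)

print(wordpiece_detokenize(["play", "##ed", "out", "##side"]))  # played outside
```

Decoding cannot undo normalization done before tokenization: with an uncased tokenizer, the original capitalization is not recovered.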


### **3. Handle special tokens**
The BERT tokenizer adds special tokens like `[CLS]` and `[SEP]` when encoding sentences for model input. Let's see how this works.

In [10]:
# Encode the sentence with special tokens
encoded_input = tokenizer(sentence, add_special_tokens=True, return_tensors='pt')

# Decode to see special tokens
decoded_with_special = tokenizer.decode(encoded_input['input_ids'][0])

print('Encoded input IDs:', encoded_input['input_ids'])
print('Decoded with special tokens:', decoded_with_special)

Encoded input IDs: tensor([[  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
           102]])
Decoded with special tokens: [CLS] the quick brown fox jumps over the lazy dog [SEP]
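
In the output above, IDs 101 and 102 are the `[CLS]` and `[SEP]` tokens wrapped around the sentence. The layout BERT expects can be sketched without the library, assuming already-tokenized inputs. `build_bert_input` is a hypothetical helper, and the segment IDs (0 for the first sentence, 1 for the second) mirror what the tokenizer returns as `token_type_ids`.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Return ([CLS] A [SEP] [B [SEP]], segment ids) for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence plus its [SEP]
    return tokens, segment_ids

tokens, segments = build_bert_input(["the", "lazy", "dog"], ["it", "sleeps"])
print(tokens)    # ['[CLS]', 'the', 'lazy', 'dog', '[SEP]', 'it', 'sleeps', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```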


## **Summary**
The WordPiece algorithm trains a language model on the base vocabulary, picks the pair of units whose merge increases the likelihood of the training data the most, adds the merged unit to the vocabulary, and retrains the language model on the new vocabulary. These steps repeat until the desired vocabulary size is reached or the likelihood gain falls below a threshold.

## **Notes**
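- The WordPiece tokenizer splits words that are not in its vocabulary into subword units (e.g., 'play' + '##ing' when only those pieces are available). The '##' prefix marks a piece that continues the current word. Words that are in the vocabulary, such as 'playing' in `bert-base-uncased`, are kept as a single token.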
- The `bert-base-uncased` tokenizer converts all text to lowercase before tokenizing.
- Special tokens such as `[CLS]` and `[SEP]` serve specific roles in BERT: `[CLS]` opens the input and is used for classification, and `[SEP]` marks the end of each sentence.

You can experiment by changing the `sentence` or `word` variables to see how different inputs are tokenized!