## Byte-Pair Encoding

This notebook shows how to use the `BPETokenization` class. We first walk through a step-by-step example, then provide a cell that runs the tokenizer over every document in the corpus.

### Ensure visibility for the class file

In [1]:
import os, sys

LIB_PATH = os.path.join(os.getcwd(), '../')
sys.path.append(LIB_PATH)

### Step-by-step example

In [2]:
from atividade_1.bpe_tokenization import BPETokenization

bpe = BPETokenization(vocab_size=1000)

The BPE algorithm relies on three major steps:
1. Convert the text to its byte representation, storing each byte as an integer ID for convenience;
2. Count the occurrences of each pair of adjacent IDs;
3. Merge the most frequent pair into a new token ID.

We repeat this process until the vocabulary size reaches the value given by the `vocab_size` constructor parameter. In code, the steps look like this:
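The loop described above can be sketched with plain functions. This is a minimal, self-contained illustration, not the `BPETokenization` implementation: the names `count_pairs`, `merge`, and `train_bpe` are hypothetical, and starting new token IDs at 256 (just past the raw byte range) is an assumption.

```python
from collections import Counter


def count_pairs(ids):
    """Count each adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pairs remain."""
    ids = list(text.encode("utf-8"))
    merges = {}      # (left_id, right_id) -> new_id
    next_id = 256    # assumption: IDs 0..255 are reserved for raw bytes
    while next_id < vocab_size:
        counts = count_pairs(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        ids = merge(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return ids, merges


ids, merges = train_bpe("VASCO VASCO VASCO DA GAMA", vocab_size=260)
```

With `vocab_size=260` the sketch performs four merges, which is enough to collapse each `VASCO` into a single token. Reserving 0..255 for raw bytes keeps a merged ID from being confused with a byte value.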

In [3]:
# The first step is to convert from a string to a list of byte ids
text = "VASCO VASCO VASCO DA GAMA"
print(f"Original Text: {text}")
byte_ids = bpe.text_to_byte_ids(text)
print(f"Byte IDs: {byte_ids}")

Original Text: VASCO VASCO VASCO DA GAMA
Byte IDs: [86, 65, 83, 67, 79, 32, 86, 65, 83, 67, 79, 32, 86, 65, 83, 67, 79, 32, 68, 65, 32, 71, 65, 77, 65]


In [4]:
# Then we count the pairs of adjacent bytes
pair_counts = bpe.get_pair_counts(byte_ids)
print(f"Pair Counts: {pair_counts}")

Pair Counts: {(86, 65): 3, (65, 83): 3, (83, 67): 3, (67, 79): 3, (79, 32): 3, (32, 86): 2, (32, 68): 1, (68, 65): 1, (65, 32): 1, (32, 71): 1, (71, 65): 1, (65, 77): 1, (77, 65): 1}


In [5]:
# Next, we find the most frequent pair
most_freq = bpe.find_most_frequent_pair(pair_counts)
print(f"Most frequent pair: {most_freq}")

Most frequent pair: (86, 65)
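Five pairs are tied at a count of 3 in the output above, yet `(86, 65)` is reported. That is consistent with breaking ties in favour of the pair seen first, although how `find_most_frequent_pair` actually resolves ties is not shown here. The sketch below illustrates that behaviour with the standard library only; it is an assumption about the tie-breaking rule, not the class's code.

```python
from collections import Counter

byte_ids = list("VASCO VASCO VASCO DA GAMA".encode("utf-8"))
pair_counts = Counter(zip(byte_ids, byte_ids[1:]))

# max() returns the first maximal key it encounters, and Counter iterates in
# insertion order, so the earliest of the tied pairs wins.
first_seen = max(pair_counts, key=pair_counts.get)

# An explicit, order-independent alternative: highest count, then smallest pair.
deterministic = min(pair_counts, key=lambda p: (-pair_counts[p], p))
```

Whichever rule is used, it should be applied identically during training and encoding so that the same text always produces the same token IDs.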


In [6]:
# Lastly, we merge the most frequent pair into a new token ID.
# max(byte_ids)+1 is used only for illustration: 87 is also the byte value of "W".
token_ids = bpe.merge_pair(byte_ids, most_freq, max(byte_ids)+1)
print(f"Token IDs after the first merge: {token_ids}")

Token IDs after the first merge: [87, 83, 67, 79, 32, 87, 83, 67, 79, 32, 87, 83, 67, 79, 32, 68, 65, 32, 71, 65, 77, 65]


### Testing on the corpus provided

I uploaded the `corpus.zip` file to my personal Google Drive so that it can be retrieved with the `gdown` library and unzipped with `zipfile`:

In [7]:
import gdown

file_id = '1LtxrgoRfNivPry38pb28hiKDYDCXRSPs'  

download_url = f"https://drive.google.com/uc?id={file_id}"

output = 'corpus.zip'

gdown.download(download_url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1LtxrgoRfNivPry38pb28hiKDYDCXRSPs
From (redirected): https://drive.google.com/uc?id=1LtxrgoRfNivPry38pb28hiKDYDCXRSPs&confirm=t&uuid=c00b2734-2726-44fe-a803-b5b4f944b5f3
To: /home/user/unb/unb_mestrado/2_semestre/topicos_nlp/nlp/atividade_1/corpus.zip
100%|██████████| 31.7M/31.7M [00:04<00:00, 6.80MB/s]


'corpus.zip'

In [8]:
from zipfile import ZipFile

with ZipFile("corpus.zip", "r") as f:
    f.extractall("corpus/")

To process all elements in the corpus, just execute the following cell:

In [9]:
import json

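def corpus_generator(corpus_path: str):
    """Yield the "text" field of every JSON file in corpus_path."""
    for file_name in os.listdir(corpus_path):
        # Read as UTF-8 explicitly so the result does not depend on the platform default
        with open(os.path.join(corpus_path, file_name), "r", encoding="utf-8") as f:
            yield json.load(f)["text"]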

In [10]:
bpe = BPETokenization(vocab_size=20000)
bpe.train(corpus_generator("corpus"))

0it [00:03, ?it/s]

No more pairs to merge. Stopping training.





In [11]:
len(bpe.reverse_vocab)

4183

These transformations are encapsulated in the `encode` method, and a `decode` method converts token IDs back to human-readable text:

In [13]:
eae = bpe.encode(text="O rato roeu a roupa do rei de Roma, enquanto a grande e vegetariana flor de lis dançava nas nuvens de Shikamaru.")
eae

Pair (97, 32) already in vocabulary. Stopping encoding.


[0,
 79,
 32,
 114,
 97,
 116,
 111,
 32,
 114,
 111,
 101,
 117,
 32,
 97,
 32,
 114,
 111,
 117,
 112,
 97,
 32,
 100,
 111,
 32,
 114,
 101,
 105,
 32,
 100,
 101,
 32,
 82,
 111,
 109,
 97,
 44,
 32,
 101,
 110,
 113,
 117,
 97,
 110,
 116,
 111,
 32,
 97,
 32,
 103,
 114,
 97,
 110,
 100,
 101,
 32,
 101,
 32,
 118,
 101,
 103,
 101,
 116,
 97,
 114,
 105,
 97,
 110,
 97,
 32,
 102,
 108,
 111,
 114,
 32,
 100,
 101,
 32,
 108,
 105,
 115,
 32,
 100,
 97,
 110,
 195,
 167,
 97,
 118,
 97,
 32,
 110,
 97,
 115,
 32,
 110,
 117,
 118,
 101,
 110,
 115,
 32,
 100,
 101,
 32,
 83,
 104,
 105,
 107,
 97,
 109,
 97,
 114,
 117,
 46,
 1]

In [14]:
bpe.decode(ids=eae)

'O rato roeu a roupa do rei de Roma, enquanto a grande e vegetariana flor de lis dançava nas nuvens de Shikamaru.'
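
Decoding can be sketched as expanding each merged ID back into its two parts until only raw bytes remain, then decoding those bytes as UTF-8. The `merges` table and the `expand`/`decode` helpers below are hypothetical and do not reflect the internals of `BPETokenization` (for instance, the sketch ignores the `0` and `1` IDs that appear at the start and end of the encoded output above).

```python
def expand(token_id, parents):
    """Return the raw bytes represented by `token_id`."""
    if token_id < 256:  # assumption: IDs below 256 are raw bytes
        return bytes([token_id])
    left, right = parents[token_id]
    return expand(left, parents) + expand(right, parents)


def decode(ids, merges):
    """Map token IDs back to text using a (left, right) -> new_id merge table."""
    parents = {new_id: pair for pair, new_id in merges.items()}
    raw = b"".join(expand(i, parents) for i in ids)
    # errors="replace" guards against ID sequences that split a multi-byte character.
    return raw.decode("utf-8", errors="replace")


# Hypothetical merge table: 256 = "ç" (bytes 195, 167), 257 = "ça".
merges = {(195, 167): 256, (256, 97): 257}
decoded = decode([100, 97, 110, 257, 118, 97], merges)
```

The example reuses the two bytes `195, 167` seen in the encoded output, which together form the UTF-8 encoding of `ç`; this is why decoding has to join all bytes before converting to text.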