## Byte-Pair Encoding

This notebook shows how to use the `BPETokenization` class. We walk through one example step by step, then provide a cell that processes every document in the corpus.

### Make the class file importable

In [None]:
import os, sys

# Add the parent directory to the import path so that `atividade_1` can be imported
LIB_PATH = os.path.join(os.getcwd(), '../')
sys.path.append(LIB_PATH)

### Step-by-step example

In [20]:
from atividade_1.bpe_tokenization import BPETokenization

# k is the number of merges we want to perform
bpe = BPETokenization(k=3)

The BPE algorithm relies on three major steps:
1. Convert the text to its byte representation, which we encode as integer IDs for convenience;
2. Count how often each pair of adjacent IDs occurs;
3. Merge the most frequent pair into a new token ID.

We repeat steps 2 and 3 `k` times, where `k` is a given integer. In code, the steps look as follows:

In [21]:
# The first step is to convert from a string to a list of byte ids
text = "VASCO VASCO VASCO DA GAMA"
print(f"Original Text: {text}")
byte_ids = bpe.text_to_byte_ids(text)
print(f"Byte IDs: {byte_ids}")

Original Text: VASCO VASCO VASCO DA GAMA
Byte IDs: [86, 65, 83, 67, 79, 32, 86, 65, 83, 67, 79, 32, 86, 65, 83, 67, 79, 32, 68, 65, 32, 71, 65, 77, 65]
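The IDs printed above match the UTF-8 byte values of the text. As a minimal sketch, assuming the class starts from UTF-8 bytes (the function below is a hypothetical stand-in, not the actual `text_to_byte_ids` method), the conversion could look like this:

```python
# Hypothetical stand-in for the first step; assumes UTF-8 is the byte encoding.
def text_to_byte_ids(text: str) -> list:
    """Return the UTF-8 bytes of `text` as a list of integers in 0..255."""
    return list(text.encode("utf-8"))


ascii_ids = text_to_byte_ids("VASCO")
accented_ids = text_to_byte_ids("é")  # a non-ASCII character takes more than one byte

print(ascii_ids)     # [86, 65, 83, 67, 79]
print(accented_ids)  # [195, 169]
```

Working on bytes rather than characters means any text can be represented with at most 256 base IDs, at the cost of splitting accented characters across several IDs.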


In [22]:
# Then we count the adjacent byte pairs
pair_counts = bpe.get_pair_counts(byte_ids)
print(f"Pair Counts: {pair_counts}")

Pair Counts: {(86, 65): 3, (65, 83): 3, (83, 67): 3, (67, 79): 3, (79, 32): 3, (32, 86): 2, (32, 68): 1, (68, 65): 1, (65, 32): 1, (32, 71): 1, (71, 65): 1, (65, 77): 1, (77, 65): 1}
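Counting adjacent pairs can be sketched with `collections.Counter` over the sequence zipped with itself shifted by one position. This is an illustrative stand-in under that assumption, not necessarily how `get_pair_counts` is written in the class:

```python
from collections import Counter


def get_pair_counts(ids):
    """Count every adjacent (left, right) pair in a sequence of IDs."""
    # zip(ids, ids[1:]) yields (ids[0], ids[1]), (ids[1], ids[2]), ...
    return Counter(zip(ids, ids[1:]))


byte_ids = list("VASCO VASCO VASCO DA GAMA".encode("utf-8"))
pair_counts = get_pair_counts(byte_ids)

print(pair_counts[(86, 65)])      # occurrences of the pair ("V", "A")
print(sum(pair_counts.values()))  # a sequence of n IDs has n - 1 adjacent pairs
```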


In [23]:
# Next, we find the most frequent pair
most_freq = bpe.find_most_frequent_pair(pair_counts)
print(f"Most frequent pair: {most_freq}")

Most frequent pair: (86, 65)
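Five pairs are tied at a count of 3 in this example, so the result depends on how ties are broken. The output above is consistent with picking the first pair seen. A minimal sketch of that behavior, assuming first-seen tie-breaking (the real `find_most_frequent_pair` may differ):

```python
def find_most_frequent_pair(pair_counts):
    """Return the pair with the highest count; ties go to the first pair inserted."""
    # max() keeps the first maximal item it meets, and dicts iterate in insertion order.
    return max(pair_counts, key=pair_counts.get)


# A subset of the counts printed earlier, in the same order.
pair_counts = {(86, 65): 3, (65, 83): 3, (83, 67): 3, (32, 86): 2, (32, 68): 1}
most_freq = find_most_frequent_pair(pair_counts)
print(most_freq)  # (86, 65)
```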


In [24]:
# Lastly, we merge this frequent pair into a new token id.
# Note: max(byte_ids)+1 is 87 here, which is also the byte value of "W".
token_ids = bpe.merge_pair(byte_ids, most_freq, max(byte_ids)+1)
print(f"Token IDs after the first merge: {token_ids}")

Token IDs after the first merge: [87, 83, 67, 79, 32, 87, 83, 67, 79, 32, 87, 83, 67, 79, 32, 68, 65, 32, 71, 65, 77, 65]
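The merge itself is a single left-to-right pass that replaces each occurrence of the pair with the new ID. The sketch below is a hypothetical stand-in for `merge_pair`. It also uses 256 as the new ID, which is a choice made for this example only: the step-by-step cell passes `max(byte_ids)+1`, which is 87 here, and 87 is also the byte value of `W`, so a merged token and a literal `W` could not be told apart.

```python
def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    merged = []
    i = 0
    while i < len(ids):
        # Look one position ahead to see whether the pair starts here.
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip both members of the pair
        else:
            merged.append(ids[i])
            i += 1
    return merged


byte_ids = list("VASCO DA GAMA".encode("utf-8"))
# Assumption for this sketch: new IDs start at 256, outside the 0..255 byte range.
token_ids = merge_pair(byte_ids, (86, 65), 256)
print(token_ids)
```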


These transformations are encapsulated in the `encode` method, and a `decode` method converts the token IDs back into human-readable text:

In [25]:
# here we do k merges
encoded_ids = bpe.encode(text)
print(f"Encoded IDs: {encoded_ids}")

decoded_text = bpe.decode(encoded_ids)
print(f"Decoded Text: {decoded_text}")

Encoded IDs: [89, 79, 32, 89, 79, 32, 89, 79, 32, 68, 65, 32, 71, 65, 77, 65]
Decoded Text: XCO XCO XCO DA GAMA
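In the recorded output above, the decoded text is `XCO XCO XCO DA GAMA` rather than the original string. One possible cause, assuming `encode` allocates merge IDs the same way as the step-by-step call (`max(byte_ids)+1`), is that the new IDs 87, 88 and 89 coincide with the byte values of `W`, `X` and `Y`. The following sketch is not the class implementation. It shows a round trip under two assumptions: merge IDs start at 256, and the merge table is kept for decoding.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def encode(text, k):
    """Apply up to k merges; return the token IDs and the merge table."""
    ids = list(text.encode("utf-8"))
    merges = {}  # new_id -> (left_id, right_id)
    for step in range(k):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # fewer than two IDs left, nothing to merge
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # assumption: stay outside the raw byte range
        merges[new_id] = pair
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


def decode(ids, merges):
    """Undo the merges, newest first, then decode the bytes as UTF-8."""
    for new_id in sorted(merges, reverse=True):
        left, right = merges[new_id]
        expanded = []
        for token in ids:
            if token == new_id:
                expanded.extend((left, right))
            else:
                expanded.append(token)
        ids = expanded
    return bytes(ids).decode("utf-8")


text = "VASCO VASCO VASCO DA GAMA"
encoded_ids, merges = encode(text, k=3)
decoded_text = decode(encoded_ids, merges)
print(encoded_ids)
print(decoded_text)
```

With this ID scheme the encoded sequence has the same length and shape as the one printed above, and decoding returns the original text.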


### Testing on the corpus provided

I uploaded the `corpus.zip` file to my personal Google Drive, so it can be retrieved with the `gdown` library and unzipped with `zipfile`:

In [5]:
import gdown

file_id = '1LtxrgoRfNivPry38pb28hiKDYDCXRSPs'  

download_url = f"https://drive.google.com/uc?id={file_id}"

output = 'corpus.zip'

gdown.download(download_url, output, quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1LtxrgoRfNivPry38pb28hiKDYDCXRSPs
From (redirected): https://drive.google.com/uc?id=1LtxrgoRfNivPry38pb28hiKDYDCXRSPs&confirm=t&uuid=024b405f-3ed4-47a6-9520-5e39e4ff46f2
To: /home/user/unb/unb_mestrado/2_semestre/topicos_nlp/nlp/atividade_1/corpus.zip
100%|██████████| 31.7M/31.7M [00:04<00:00, 7.25MB/s]


'corpus.zip'

In [6]:
from zipfile import ZipFile

with ZipFile("corpus.zip", "r") as f:
    f.extractall("corpus/")

Let's take a look at one file from the corpus:

In [7]:
import json

with open("corpus/240.json", "r") as f:
    json_file = json.load(f)

json_file

{'id': '240',
 'text': 'Alexandre é um prenome popular da língua portuguesa. É cognato ao nome Alexander, da língua inglesa. Em países lusófonos, pessoas chamadas Alexandre são normalmente apelidadas de Alex. == Origem == O primeiro registro conhecido do nome foi feito no grego micênico: encontrou-se a versão feminina do nome, Alexandra, escrito em Linear B.Chadwick, John, The Mycenaean World, Nova Iorque: Imprensa da Universidade de Cambrígia, 1976, 1999. == Variações em outros idiomas == * Albanês – Aleksandër, Aleks, Leka i Madh, Lekë (no norte da Albânia), Sandër, Skëndër, Skander (ver Skanderbeg) * Amárico – Eskender * Árabe – الاسكندر / اسكندر (Iskandar), Skandar, Skender * Bielorrusso – Аляксандp (Aliaksandr), Алeсь (Ales\'), Алелька (Alyel\'ka) * Catalão – Alexandre, Àlex, Xandre * Inglês – Alexander, Alec, Alex, Sandy, Andy, Alexis, Alexa, Sandra, Xander * Gaélico escocês – Alasdair, Alastair, Alistair, Alisdair * Galego – Alexandre, Álex * Georgiano – ალექსანდრე (Alexandre), 

In [9]:
encoded_240 = bpe.encode(json_file["text"])
len(encoded_240)

2158

In [10]:
decoded_240 = bpe.decode(encoded_240)
print(decoded_240)

�ehe�re é um prenome popular da língua portuguesa. É cognato ao nome �ehea�er, da língua inglesa. Em países lusófonos, pessoas chamadas �ehe�re são normalmente apelidadas de �ehe. == Origem == O primeiro registro conhecido do nome foi feito no grego micênico: encontrou-se a versão feminina do nome, �ehe�ra, escrito em Linear B.Chadwick, John, The Mxlcenaean World, Nova Iorque: Imprensa da Universidade de Cambrígia, 1976, 1999. == Variações em outros idiomas == * Albanês �r� �eksa�ër, �eks, Leka i Madh, Lekë (no norte da Albânia), Sa�ër, Skëndër, Ska�er (ver Ska�erbeg) * Amárico �r� Eskender * Àlrabe �r� الاسكندر / اسكندر (Iska�ar), Ska�ar, Skender * Bielorrusso �r� Алякрlандp (Aliaks�r), Алeрlь (�es'), Алелька (Alxlel'ka) * Catalão �r� �ehe�re, �rlehe, X�re * Inglês �r� �ehea�er, �ec, �ehe, Sa�xl, Andxl, �eheis, �ehea, S�ra, Xa�er * Gaélico escocês �r� Alasdair, Alastair, Alistair, Alisdair * Galego �r� �ehe�re, Àllehe * Georgiano �r� ალექსანდრე (�ehe�re), ალეკო (�eko), ლექსო

To process all elements in the corpus, just execute the following cell:

In [None]:
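for file in sorted(os.listdir("corpus/")):
    # Skip anything in the folder that is not a JSON document
    if not file.endswith(".json"):
        continue
    # The corpus contains non-ASCII text, so read it explicitly as UTF-8
    with open(os.path.join("corpus", file), "r", encoding="utf-8") as f:
        json_file = json.load(f)
    print(file)
    encoded = bpe.encode(json_file["text"])
    print(len(encoded))
    decoded = bpe.decode(encoded)
    print(decoded)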
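Printing every decoded document produces a lot of output. As an alternative, the sketch below collects one summary row per file: the UTF-8 byte count and the token count. The helper name `summarize_corpus` and its `encode` parameter are hypothetical. The demo runs on a temporary folder with a byte-level stand-in encoder so that it is self-contained; with the real corpus one would pass `"corpus/"` and `bpe.encode` instead.

```python
import json
import tempfile
from pathlib import Path


def summarize_corpus(corpus_dir, encode):
    """Map each JSON file name to (UTF-8 byte count, token count)."""
    summary = {}
    for path in sorted(Path(corpus_dir).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            text = json.load(f)["text"]
        summary[path.name] = (len(text.encode("utf-8")), len(encode(text)))
    return summary


# Demo on a throwaway folder holding a single document shaped like the corpus files.
with tempfile.TemporaryDirectory() as tmp:
    doc = {"id": "240", "text": "Alexandre é um prenome"}
    Path(tmp, "240.json").write_text(json.dumps(doc), encoding="utf-8")
    # Stand-in encoder: plain UTF-8 bytes, no merges.
    summary = summarize_corpus(tmp, lambda text: list(text.encode("utf-8")))

print(summary)
```

Comparing the two numbers per file gives a quick view of how much the merges shorten each document.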