# DEMO 3: **Byte-Pair Encoding Tokenization** (encoding & decoding)
---

This demo runs the encoding and decoding process of a Byte-Pair Encoding (BPE) tokenizer trained on the TinyStories dataset. It was run on a 2023 MacBook Pro with an M3 Pro chip.

The implementation uses a greedy approach for inference. Greedy inference keeps tokenization fast at runtime, and some studies, such as [Greed is All You Need: An Evaluation of Tokenizer Inference Methods](https://arxiv.org/pdf/2403.01289) (2024), show that it also yields good benchmark results, especially on morphologically motivated tasks.
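To make "greedy" concrete, here is a minimal sketch of one common greedy strategy, longest-prefix matching: at each position, take the longest vocabulary entry that matches the remaining bytes. This is an illustration under assumptions, not necessarily how `BPETokenizer` works internally; the function `greedy_encode` and the toy vocabulary are hypothetical.

```python
def greedy_encode(data: bytes, token_to_id: dict[bytes, int]) -> list[int]:
    """Greedy longest-prefix matching over a byte string (illustrative only)."""
    max_len = max(len(token) for token in token_to_id)
    ids = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then shrink until a token matches.
        for length in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i : i + length]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i += length
                break
        else:
            raise ValueError(f"no token covers the byte at offset {i}")
    return ids


# Hypothetical toy vocabulary, unrelated to the TinyStories vocab.
toy_vocab = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
print(greedy_encode(b"abcab", toy_vocab))  # [4, 3]
```

The sketch never backtracks: once `b"abc"` is chosen, the remaining bytes are matched independently, which is what makes the approach cheap.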

## **BPETokenizer**
---

Let's instantiate a BPE tokenizer with the `BPETokenizer` class. We can load the vocab and merges files created by running `2_pre_tokenization_training.ipynb`, or the ones provided in `./sample_data/bpe_tokenizer` (produced by running that notebook on the `TinyStories` `train.txt` data):

In [2]:
from pathlib import Path


input_dir = Path("./sample_data/bpe_tokenizer")

In [3]:
from bpe_transformer.tokenization.bpe_tokenizer import BPETokenizer


# input_dir is already a Path, so the joined paths need no extra Path() wrapper.
bpe = BPETokenizer.from_files(
    vocab_filepath=input_dir / "vocab.pkl",
    merges_filepath=input_dir / "merges.pkl",
)

In [5]:
l_50 = 50 * "="
l_50_2 = 50 * "-"
print(l_50)
print("BPE TOKENIZER")
print(l_50)
print("Vocab:")
print(l_50_2)
print(bpe.vocab)
print(f"\nLength: {len(bpe.vocab)}\n")
print(l_50_2)
print("Merges:")
print(l_50_2)
print(bpe.merges)
print(f"\nLength: {len(bpe.merges)}\n")
print(l_50)
print("\n")

BPE TOKENIZER
Vocab:
--------------------------------------------------

Length: 10000

--------------------------------------------------
Merges:
--------------------------------------------------
[(b' ', b't'), (b'h', b'e'), (b' ', b'a'), (b' ', b's'), (b' ', b'w'), (b'n', b'd'), (b' t', b'he'), (b'e', b'd'), (b' ', b'b'), (b' t', b'o'), (b' a', b'nd'), (b' ', b'h'), (b' ', b'f'), (b'i', b'n'), (b' ', b'T'), (b' w', b'a'), (b'r', b'e'), (b'i', b't'), (b'o', b'u'), (b' ', b'l'), (b' ', b'd'), (b' ', b'c'), (b' ', b'p'), (b'a', b'y'), (b' ', b'm'), (b'e', b'r'), (b' wa', b's'), (b'o', b'm'), (b' T', b'he'), (b' ', b'he'), (b'i', b's'), (b'a', b'r'), (b' ', b'n'), (b'i', b'm'), (b'o', b'n'), (b' s', b'a'), (b'i', b'd'), (b'l', b'l'), (b' h', b'a'), (b' ', b'g'), (b'a', b't'), (b' ', b'S'), (b'in', b'g'), (b'o', b't'), (b'e', b'n'), (b'a', b'n'), (b'l', b'e'), (b'o', b'r'), (b'i', b'r'), (b'a', b'm'), (b'e', b't'), (b' ', b'H'), (b' ', b'it'), (b' t', b'h'), (b'i', b'g'), (b' The', b'y')

### **Encoding**

Now that we have loaded our vocab, let's try encoding a simple string.

In [20]:
text = "Hello, I am encoding my text. How are you?"

print(text)
print(len(text))

Hello, I am encoding my text. How are you?
42


Try encoding it based on our vocab:

In [19]:
encode_text = bpe.encode(text=text)

print(encode_text)
print(len(encode_text))

[1183, 44, 338, 740, 835, 1262, 5686, 622, 799, 983, 46, 2687, 483, 349, 63]
15


After encoding, the length of our text representation dropped from 42 characters to 15 token IDs. Let's see our tokens:

In [25]:
tokens = [bpe.vocab[token] for token in encode_text]
print(tokens)

[b'Hello', b',', b' I', b' am', b' en', b'co', b'ding', b' my', b' te', b'xt', b'.', b' How', b' are', b' you', b'?']
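The reduction from 42 to 15 can also be expressed as a compression ratio in bytes per token. The short sketch below recomputes it from the values printed above; the variable names are illustrative, and the IDs are copied from the output rather than produced by calling the tokenizer.

```python
text = "Hello, I am encoding my text. How are you?"
# Token IDs copied from the encoding output shown above.
token_ids = [1183, 44, 338, 740, 835, 1262, 5686, 622, 799, 983, 46, 2687, 483, 349, 63]

n_bytes = len(text.encode("utf-8"))  # ASCII text: one byte per character
ratio = n_bytes / len(token_ids)
print(f"{n_bytes} bytes / {len(token_ids)} tokens = {ratio:.2f} bytes per token")
# 42 bytes / 15 tokens = 2.80 bytes per token
```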


### **Decoding**

Now, let's try decoding the string we just encoded:

In [29]:
decode_text = bpe.decode(ids=encode_text)

print(decode_text)
print(len(decode_text))

assert decode_text == text

Hello, I am encoding my text. How are you?
42


The decoded text is identical to the `text` input string we defined at the beginning.
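Conceptually, decoding a byte-level BPE sequence amounts to concatenating the bytes of each token and decoding the result as UTF-8. The following sketch shows that idea on the IDs and tokens printed earlier; `decode_ids` is a hypothetical helper, and the use of `errors="replace"` is an assumption, since the internals of `BPETokenizer.decode` are not shown in this demo.

```python
def decode_ids(ids: list[int], id_to_token: dict[int, bytes]) -> str:
    """Join token bytes and decode as UTF-8 (illustrative sketch)."""
    raw = b"".join(id_to_token[i] for i in ids)
    # An arbitrary ID sequence may end mid-character; replace invalid bytes
    # with U+FFFD instead of raising UnicodeDecodeError.
    return raw.decode("utf-8", errors="replace")


# IDs and tokens copied from the outputs shown earlier in this demo.
ids = [1183, 44, 338, 740, 835, 1262, 5686, 622, 799, 983, 46, 2687, 483, 349, 63]
tokens = [b"Hello", b",", b" I", b" am", b" en", b"co", b"ding", b" my",
          b" te", b"xt", b".", b" How", b" are", b" you", b"?"]
id_to_token = dict(zip(ids, tokens))

print(decode_ids(ids, id_to_token))
# Hello, I am encoding my text. How are you?
```

Joining at the byte level before decoding matters because a single token may hold only part of a multi-byte character.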

TODO: iterator for long text chunk encoding.
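
One possible shape for the iterator mentioned in the TODO is a generator that encodes one chunk at a time and yields IDs lazily, so the full text never has to sit in memory. This is only a sketch: `encode_iterable` is a hypothetical name, and `byte_encode` is a stand-in encoder used to keep the example self-contained.

```python
from collections.abc import Callable, Iterable, Iterator


def encode_iterable(
    chunks: Iterable[str], encode: Callable[[str], list[int]]
) -> Iterator[int]:
    """Lazily yield token IDs, encoding one chunk at a time (sketch)."""
    for chunk in chunks:
        yield from encode(chunk)


def byte_encode(text: str) -> list[int]:
    # Stand-in encoder (raw UTF-8 bytes); a real tokenizer's encode goes here.
    return list(text.encode("utf-8"))


lines = ["Once upon a time\n", "there was a cat.\n"]
ids = list(encode_iterable(lines, byte_encode))
print(len(ids))  # 34
```

With the tokenizer from this demo, the call might look like `encode_iterable(file_handle, lambda s: bpe.encode(text=s))`, since an open text file iterates line by line. Note that tokens can never span chunk boundaries in this scheme, so the resulting IDs may differ from encoding the whole text at once.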