# Exploring the BPE Tokenizer

## Imports

In [None]:
import os
import sys

# Add the parent directory (the repository root) to the import path so that `src` resolves.
root_path = os.path.abspath('..')
if root_path not in sys.path:
    sys.path.append(root_path)

import src.utils.byte_pair_encoding_tokenizer as bpe

## Initialize BPE Tokenizer

In [5]:
tokenizer = bpe.CustomBPETokenizer(
    ["[PAD]", "[UNK]", "[START]", "[END]"],
    "../bpe_tokenizers/ted_hrlr_translate_pt_to_en",
)

## Explore

### Step 1: Tokenizing the Input

Given the input sentence:

In [27]:
inputs = ["this is a sentence to be tokenized"]  # avoid shadowing the built-in `input`
print(f"Input: {inputs}")

Input: ['this is a sentence to be tokenized']


The `tokenize` method converts it into a sequence of integer token IDs:

In [28]:
tokenized = tokenizer.tokenize(inputs)
print(f"Output: {tokenized}")

Output: <tf.RaggedTensor [[2, 693, 186, 120, 7380, 165, 248, 165, 2399, 1609, 3]]>


Each integer is the index of a token in the vocabulary. For the complete generated vocabulary, see [bpe_tokenizers/ted_hrlr_translate_pt_to_en/vocab.txt](../bpe_tokenizers/ted_hrlr_translate_pt_to_en/vocab.txt).
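The mapping from IDs to vocabulary entries can be sketched as follows. This is a minimal illustration, assuming `vocab.txt` holds one token per line and that a token's ID is its zero-based line number (consistent with `[START]` and `[END]` appearing as IDs 2 and 3 above). The `load_vocab` helper and the toy vocabulary are hypothetical and not part of the repository.

```python
import tempfile
from pathlib import Path


def load_vocab(path):
    """Read a one-token-per-line vocabulary file into a list (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Toy vocabulary standing in for the real vocab.txt (assumed layout: one token per line).
toy_tokens = ["[PAD]", "[UNK]", "[START]", "[END]", "this", "Ġis"]

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = Path(tmp) / "vocab.txt"
    vocab_path.write_text("\n".join(toy_tokens) + "\n", encoding="utf-8")
    vocab = load_vocab(vocab_path)

# Under the stated assumption, a token ID is simply an index into the list.
ids = [2, 4, 5, 3]
looked_up = [vocab[i] for i in ids]
print(looked_up)  # ['[START]', 'this', 'Ġis', '[END]']
```

Pointing `load_vocab` at the real vocabulary file should reproduce the tokens shown in the next step, provided the file follows the assumed layout.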

### Step 2: Unveiling Tokens via Lookup

The `lookup` method shows which text each of these integer tokens corresponds to:

In [30]:
tokens = tokenizer.lookup(tokenized)
tokens = tokens.to_list()
# Flatten the nested list and decode each byte string to `str` for display.
decoded_tokens = [token.decode('utf-8') for sublist in tokens for token in sublist]
print(f"Tokens: {decoded_tokens}")

Tokens: ['[START]', 'this', 'Ġis', 'Ġa', 'Ġsentence', 'Ġto', 'Ġbe', 'Ġto', 'ken', 'ized', '[END]']


#### Key Observations:

* **Use of `Ġ`**: This character stands in for the space before a word, so it marks where a new word begins. The first word, 'this', has no preceding space and therefore no `Ġ`.
* **Single vs. Multi-Token Words**: Frequent words get a single token, while rare or unseen words (like 'tokenized') are split into smaller subword tokens: 'Ġto', 'ken', and 'ized'.
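Both observations can be checked directly on the token list printed above. The sketch below is illustrative only: `group_into_words` and `join_tokens` are hypothetical helpers, not methods of the tokenizer, and the `[START]`/`[END]` tokens are left out for simplicity.

```python
SPACE_MARKER = "Ġ"  # marks a token that is preceded by a space


def group_into_words(tokens):
    """Group tokens into words: a token starting with the marker opens a new word."""
    words = []
    for token in tokens:
        if token.startswith(SPACE_MARKER) or not words:
            words.append([token])
        else:
            words[-1].append(token)
    return words


def join_tokens(tokens):
    """Concatenate tokens and turn each marker back into a space."""
    return "".join(tokens).replace(SPACE_MARKER, " ")


# Tokens copied from the lookup output above, without the special tokens.
tokens = ["this", "Ġis", "Ġa", "Ġsentence", "Ġto", "Ġbe", "Ġto", "ken", "ized"]

words = group_into_words(tokens)
multi_token_words = [join_tokens(w).strip() for w in words if len(w) > 1]

print(multi_token_words)    # ['tokenized']
print(join_tokens(tokens))  # this is a sentence to be tokenized
```

Only 'tokenized' spans more than one token here; every other word in the sentence maps to a single vocabulary entry.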

### Step 3: Detokenize

The `detokenize` method recovers the original sentence:

In [33]:
detokenized = tokenizer.detokenize(tokenized)
print(f"Detokenized: {detokenized}")

Detokenized: [b'this is a sentence to be tokenized']
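The result is printed as byte strings (note the `b'...'` prefix). A minimal sketch of turning them into regular Python strings and comparing against the original input is shown below. It assumes the result can be iterated as a list of `bytes`; if `detokenize` returns a TensorFlow tensor, a conversion such as `.numpy()` would likely be needed first. The value is copied from the output above so that the snippet runs on its own.

```python
# Value copied from the printed output; stands in for the real `detokenize` result.
detokenized_bytes = [b"this is a sentence to be tokenized"]

# Decode each byte string to a regular Python string.
decoded_sentences = [s.decode("utf-8") for s in detokenized_bytes]
print(decoded_sentences)  # ['this is a sentence to be tokenized']

# Round-trip check against the original input sentence.
original = ["this is a sentence to be tokenized"]
print(decoded_sentences == original)  # True
```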
