# Generating a BPE Tokenizer

> Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens.
>
> — [*Attention Is All You Need* by Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762)

Key details:
* A single BPE tokenizer was used for both the source and target languages. We will do the same.
* The vocabulary size was about 37000 tokens. In our case, we limit the vocabulary to 8000 tokens.
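
To make the idea of merge rules concrete, here is a minimal sketch of how BPE training learns them: start from single characters, then repeatedly merge the most frequent adjacent pair. The function name `learn_bpe_merges` and the whitespace-based word splitting are assumptions made for illustration; this is not the implementation in `src.utils.byte_pair_encoding_tokenizer`.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of sentences (illustrative helper)."""
    # Split on whitespace and represent each word as a tuple of characters.
    word_counts = Counter(
        tuple(word) for sentence in corpus for word in sentence.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # Every word is a single symbol; nothing left to merge.
        # Pick the most frequent pair (ties go to the pair seen first).
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_counts = Counter()
        for symbols, count in word_counts.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best_pair:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            merged_counts[tuple(merged)] += count
        word_counts = merged_counts
    return merges


print(learn_bpe_merges(["ab ab ab", "abc"], 5))
# → [('a', 'b'), ('ab', 'c')]
```

In this sketch the number of merges is given directly. A vocabulary limit such as 8000 bounds the total number of symbols instead: the initial characters plus one new symbol per merge, plus any special tokens.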

## Imports

In [20]:
import sys
import os

import tensorflow as tf
import tensorflow_text as text
import tensorflow_datasets as tfds

root_path = os.path.abspath(os.path.join('..'))
if root_path not in sys.path:
    sys.path.append(root_path)

import src.utils.byte_pair_encoding_tokenizer as bpe

## Download the dataset

In [21]:
dataset, _ = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True)

## Generate a corpus

Because the tokenizer vocabulary is shared between the source and target, we build a single corpus from the dataset that contains both the Portuguese source sentences and the English target sentences.

In [22]:
corpus_generator = (
    sentence.decode('utf-8')  # Decoding bytes to string
    for example in dataset['train']
    for sentence in (example['pt'].numpy(), example['en'].numpy())
)

## Generate vocabulary and merge rules

In [None]:
vocab, merge_rules = bpe.bpe_from_dataset(
    corpus_generator,
    8000,  # Vocabulary size limit
    ["[PAD]", "[UNK]", "[START]", "[END]"]  # Special tokens
)
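
As a rough illustration of how merge rules are typically used at encoding time, the sketch below applies an ordered list of merges to a single word. It assumes the rules are a list of `(left, right)` string pairs applied in the order they were learned. The actual format of `merge_rules` returned by `bpe.bpe_from_dataset` is not shown in this notebook, and `apply_merges` is a hypothetical helper, not part of that module.

```python
def apply_merges(word, merges):
    """Tokenize one word by applying merge rules in order (illustrative helper)."""
    symbols = list(word)  # Start from individual characters.
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the pair wherever it occurs; otherwise keep the symbol as-is.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Assumed toy rules, not taken from the trained tokenizer.
toy_merges = [("a", "b"), ("ab", "c")]
print(apply_merges("abcab", toy_merges))
# → ['abc', 'ab']
```

Each resulting symbol would then be looked up in the vocabulary, with symbols missing from it mapped to a token such as `[UNK]`.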

## Save vocabulary and merge rules

In [None]:
bpe.save("bpe_tokenizers/ted_hrlr_translate_pt_to_en", vocab, merge_rules)