# Tokenization
This notebook builds a subword tokenizer using TensorFlow Text's `text.BertTokenizer`. It is based on the [Subword Tokenizer Tutorial](https://www.tensorflow.org/text/guide/subwords_tokenizer#setup) from TensorFlow.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import os
import pathlib
import pandas as pd
import tensorflow as tf
import tensorflow_text as text
from tensorflow_text.python.ops import bert_tokenizer
from tensorflow_text.tools.wordpiece_vocab import wordpiece_tokenizer_learner_lib as learner
import config
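
The `config` module is imported but not shown in this notebook. As a rough guide to what the cells below expect, here is a hypothetical sketch of such a module. Every value is an assumption: the reserved tokens, the window length, and the tokenizer directory are inferred from the outputs printed in this notebook, while the file paths, vocabulary size, buffer size, and validation share are purely illustrative.

```python
# config.py (hypothetical sketch; all values are assumptions, not the author's settings)

# Tokenizer and vocabulary settings.
BERT_TOKENIZER_PARAMS = dict(lower_case=True)  # assumed: the decoded text in this notebook is lowercased
RESERVED_TOKENS = ["[PAD]", "[UNK]", "[START]", "[END]"]  # matches the first four vocabulary entries printed
VOCAB_SIZE = 8000   # assumed target vocabulary size
LEARN_PARAMS = {}   # assumed: no extra arguments passed to the vocabulary learner

# File locations (illustrative paths only).
RAW_DATA_PATH = "data/shakespeare.txt"
VOCAB_PATH = "vocab.txt"
TOKENIZER_PATH = "tokenizer"  # consistent with the "tokenizer/assets" log line
TRAIN_DATA_PATH = "data/train"
VAL_DATA_PATH = "data/val"

# Dataset settings.
MAX_TOKENS = 128        # inferred: sample windows hold MAX_TOKENS + 1 = 129 tokens
BUFFER_SIZE = 10_000    # assumed shuffle buffer size
VALIDATION_SHARE = 0.1  # assumed fraction of windows held out for validation
```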

## Load dataset
Load the source text from the [concatenated works of Shakespeare](https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt).

In [3]:
with open(config.RAW_DATA_PATH, 'r') as file:
    shakespeare_plays = file.read()

In [4]:
sample = shakespeare_plays[:147]
print(sample)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?


## Vocabulary
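Split the text into words, count them, and learn a WordPiece vocabulary from the counts. The vocabulary is then written to `config.VOCAB_PATH`, one token per line.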

In [5]:
tokenizer = bert_tokenizer.BasicTokenizer(**config.BERT_TOKENIZER_PARAMS)
words_dataset = tokenizer.tokenize(shakespeare_plays)
word_counts = learner.count_words(words_dataset)
vocab = learner.learn(word_counts, config.VOCAB_SIZE, config.RESERVED_TOKENS, **config.LEARN_PARAMS)

In [6]:
print(vocab[:10])
print(vocab[100:110])
print(vocab[-10:])

with open(config.VOCAB_PATH, "w") as f:
    for token in vocab:
        print(token, file=f)

['[PAD]', '[UNK]', '[START]', '[END]', '!', '$', '&', "'", ',', '-']
['well', 'was', 'which', 'there', 'how', 'am', 'then', '##ed', '##ing', 'man']
['##.', '##3', '##:', '##;', '##?', '##[', '##]', '##j', '##q', '##v']
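
Entries that start with `##` are word-internal pieces: they can only continue a word, never start one. To illustrate how such a vocabulary is applied, here is a minimal pure-Python sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_split` and the toy vocabulary are hypothetical; this is an illustration of the idea, not the TensorFlow Text implementation.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical, much smaller than the learned one).
toy_vocab = {"[UNK]", "speak", "walk", "##ed", "##ing", "##s"}

print(wordpiece_split("walked", toy_vocab))    # ['walk', '##ed']
print(wordpiece_split("speaking", toy_vocab))  # ['speak', '##ing']
print(wordpiece_split("zebra", toy_vocab))     # ['[UNK]']
```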


## Tokenizer
Build and test the tokenizer.

In [7]:
tokenizer = text.BertTokenizer(config.VOCAB_PATH, **config.BERT_TOKENIZER_PARAMS)

In [8]:
tokens = tokenizer.tokenize(sample).merge_dims(-2, -1)
print(tokens)

<tf.RaggedTensor [[140, 606, 12, 196, 76, 1417, 178, 539, 8, 170, 53, 147, 10, 72, 12, 147,
  8, 147, 10, 140, 606, 12, 47, 80, 72, 1917, 361, 45, 269, 115, 45, 4344,
  14]]>


In [9]:
txt_tokens = tf.gather(vocab, tokens)
txt_tokens = tf.strings.reduce_join(txt_tokens, separator=" ", axis=-1).numpy()[0].decode("utf-8")
print(txt_tokens)

first citizen : before we proceed any further , hear me speak . all : speak , speak . first citizen : you are all resolved rather to die than to famish ?


## Customization and export
Define a custom tokenizer class that can be exported as a SavedModel and used by the GPT model. It also cleans up the output after detokenization by dropping reserved tokens.

In [10]:
def cleanup_text(reserved_tokens, token_txt):
    # Drop the reserved tokens, except for "[UNK]".
    bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
    bad_token_re = "|".join(bad_tokens)

    bad_cells = tf.strings.regex_full_match(token_txt, bad_token_re)
    result = tf.ragged.boolean_mask(token_txt, ~bad_cells)

    # Join them into strings.
    result = tf.strings.reduce_join(result, separator=" ", axis=-1)

    return result


class CustomTokenizer(tf.Module):
    def __init__(self, config):
        self.tokenizer = text.BertTokenizer(
            config.VOCAB_PATH, **config.BERT_TOKENIZER_PARAMS
        )
        self._reserved_tokens = config.RESERVED_TOKENS
        self._vocab_path = tf.saved_model.Asset(config.VOCAB_PATH)

        vocab = pathlib.Path(config.VOCAB_PATH).read_text().splitlines()
        self.vocab = tf.Variable(vocab)

        ## Create the signatures for export:

        # Include a tokenize signature for a batch of strings.
        self.tokenize.get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string)
        )

        # Include `detokenize` signatures for:
        #   * `Tensors` with shape [batch, tokens]
        #   * `RaggedTensors` with shape [batch, tokens]
        self.detokenize.get_concrete_function(
            tf.TensorSpec(shape=[None, None], dtype=tf.int64)
        )
        self.detokenize.get_concrete_function(
            tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64)
        )

    @tf.function
    def tokenize(self, strings):
        enc = self.tokenizer.tokenize(strings)
        enc = enc.merge_dims(-2, -1)
        return enc

    @tf.function
    def detokenize(self, tokenized):
        words = self.tokenizer.detokenize(tokenized)
        return cleanup_text(self._reserved_tokens, words)
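
To make the cleanup step easier to follow, here is a pure-Python sketch of the same logic applied to a single list of words. The helper name `cleanup_words` and the example input are hypothetical; `cleanup_text` above performs the equivalent operation on batched string tensors.

```python
import re


def cleanup_words(reserved_tokens, words):
    """Drop reserved tokens (except "[UNK]") and join the rest with spaces."""
    # Escape the tokens so that "[" and "]" are matched literally.
    bad_tokens = [re.escape(tok) for tok in reserved_tokens if tok != "[UNK]"]
    bad_token_re = re.compile("|".join(bad_tokens))
    # Keep only words that are not exactly equal to a reserved token.
    kept = [w for w in words if not bad_token_re.fullmatch(w)]
    return " ".join(kept)


reserved = ["[PAD]", "[UNK]", "[START]", "[END]"]
words = ["[START]", "hello", "[UNK]", "!", "[END]", "[PAD]"]
print(cleanup_words(reserved, words))  # hello [UNK] !
```

Keeping `[UNK]` in the output makes it visible where the tokenizer could not represent a word, instead of silently dropping it.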

In [11]:
tokenizer = CustomTokenizer(config)
tf.saved_model.save(tokenizer, config.TOKENIZER_PATH)
reloaded_tokenizer = tf.saved_model.load(config.TOKENIZER_PATH)

INFO:tensorflow:Assets written to: tokenizer/assets


In [12]:
tokens = reloaded_tokenizer.tokenize(["Hello TensorFlow!"])
tokens.numpy()

array([[ 647,  650,  736,   63,  866, 2003, 4975,    4]])

In [13]:
round_trip = reloaded_tokenizer.detokenize(tokens)
print(round_trip.numpy()[0].decode("utf-8"))

hello tensorflow !


## Save tokenized dataset
Split the tokenized text into fixed-length windows of `config.MAX_TOKENS + 1` tokens, shuffle them, and divide them into training and validation sets. Both sets are then stored on disk.

In [15]:
tokens = tokenizer.tokenize(shakespeare_plays).numpy()
n_tokens = tokens.shape[1]
print(f"Number of tokens: {n_tokens:,}")

Number of tokens: 1,167,156


In [20]:
n_samples = 100_000
sample_len = config.MAX_TOKENS + 1
indices = tf.random.uniform((n_samples,), minval=0, maxval=n_tokens - sample_len, dtype=tf.dtypes.int32)

In [36]:
tokens

array([[140, 606,  12, ...,  62, 181,  10]])

In [41]:
# Use a name that does not shadow the built-in `range`.
positions = tf.range(0, 0 + sample_len, 1)
tf.gather(tokens[0], positions)

<tf.Tensor: shape=(129,), dtype=int64, numpy=
array([ 140,  606,   12,  196,   76, 1417,  178,  539,    8,  170,   53,
        147,   10,   72,   12,  147,    8,  147,   10,  140,  606,   12,
         47,   80,   72, 1917,  361,   45,  269,  115,   45, 4344,   14,
         72,   12, 1917,   10, 1917,   10,  140,  606,   12,  140,    8,
         47,  119,  844,  861,   51, 1099,  651,   45,   43,  520,   10,
         72,   12,   76,  119,    7,   36,    8,   76,  119,    7,   36,
         10,  140,  606,   12,   96,  123,  400,   67,    8,   44,   76,
          7,   91,   64, 2334,   93,   82,  188, 2712,   10,   51,    7,
         36,   17, 4671, 3638, 2997,   14,   72,   12,   73,   99, 3376,
         84,    7,   36,   13,   96,   54,   57,  201,   12,  177,    8,
        177,    4,  216,  606,   12,  118,  252,    8,   87, 1464,   10,
        140,  606,   12,   76,   80, 2035,  107,  215])>

In [42]:
def get_sample_from_index(i):
    # Gather the `sample_len` consecutive tokens that start at index `i`.
    positions = tf.range(i, i + sample_len, 1)
    return tf.gather(tokens[0], positions)

dataset = (
    tf.data.Dataset.from_tensor_slices(indices)
    .map(get_sample_from_index, tf.data.AUTOTUNE)
)

In [134]:
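dataset = (
    tf.data.Dataset.from_tensor_slices(tokens[0])
    .batch(config.MAX_TOKENS + 1, drop_remainder=True)
    # Keep one fixed shuffle order so that the take/skip split below
    # does not mix windows between the training and validation sets.
    .shuffle(config.BUFFER_SIZE, reshuffle_each_iteration=False)
)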
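
As a sanity check on the batching step, the sketch below reproduces the windowing in plain Python and works out how many windows it yields. It assumes `config.MAX_TOKENS` is 128, which is inferred from the 129-token sample printed earlier; the helper name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(token_ids, window):
    """Split a token list into non-overlapping windows, dropping the remainder."""
    n_windows = len(token_ids) // window
    return [token_ids[i * window:(i + 1) * window] for i in range(n_windows)]


n_tokens = 1_167_156  # token count printed earlier in this notebook
window = 128 + 1      # assumes config.MAX_TOKENS == 128

# Number of full windows and number of trailing tokens that are dropped.
print(n_tokens // window, n_tokens % window)  # 9047 93

# Small demonstration on a toy sequence.
print(chunk_tokens(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```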

In [43]:
examples = list(dataset.take(3))
detokenized_examples = tokenizer.detokenize(examples).numpy()
for ex in detokenized_examples:
    print(ex.decode("utf-8") + "\n")

eunuch ; peace ! she hath betray ' d me and shall die the death . mardian : death of one person can be paid but once , and that she has discharged : what thou wouldst do is done unto thy hand : the last she spake was ' antony ! most noble antony ! ' then in the midst a tearing groan did break the name of antony ; it was divided between her heart and lips : she render ' d life , thy name so buried in her . mark antony : dead , then ? mardian : dead . mark antony : unarm , eros ; the long day ' s task is done , and we must sleep . that thou depart

subject as i am , against thy oath and true allegiance sworn , should raise so great a power without his leave , or dare to bring thy force so near the court . york : buckingham : that is too much presumption on thy part : but if thy arms be to no other end , the king hath yielded unto thy demand : the duke of somerset is in the tower . york : upon thine honour , is he prisoner ? buckingham : upon mine honour , he is prisoner . york : then , b

In [44]:
n_samples = dataset.cardinality().numpy()
val_size = int(n_samples * config.VALIDATION_SHARE)
val_dataset = dataset.take(val_size)
train_dataset = dataset.skip(val_size)

In [45]:
val_dataset.save(config.VAL_DATA_PATH)
train_dataset.save(config.TRAIN_DATA_PATH)
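
Each stored window holds `config.MAX_TOKENS + 1` tokens. One plausible reason, assumed here since the training code is not part of this notebook, is next-token prediction: the first `MAX_TOKENS` tokens serve as the model input and the same window shifted by one position serves as the target. A minimal sketch with a hypothetical helper `split_input_target`:

```python
def split_input_target(window):
    """Turn one window of N + 1 tokens into an (input, target) pair of length N."""
    inputs = window[:-1]   # all tokens except the last
    targets = window[1:]   # the same tokens shifted left by one position
    return inputs, targets


# Toy window using the first five token ids printed in this notebook.
window = [140, 606, 12, 196, 76]
inputs, targets = split_input_target(window)
print(inputs)   # [140, 606, 12, 196]
print(targets)  # [606, 12, 196, 76]
```

With this layout, `targets[i]` is the token that follows `inputs[i]` in the original text.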