# Tokenization
This notebook builds a WordPiece subword tokenizer using TensorFlow Text's `text.BertTokenizer`. It is based on the [Subword Tokenizer Tutorial](https://www.tensorflow.org/text/guide/subwords_tokenizer#setup) from TensorFlow.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import re
import os
import pathlib
import pandas as pd
import tensorflow as tf
import tensorflow_text as text
from tensorflow_text.python.ops import bert_tokenizer
from tensorflow_text.tools.wordpiece_vocab import (
    wordpiece_tokenizer_learner_lib as learner,
)
import config

## Load dataset
Load the source text from the [concatenated works of Shakespeare](https://cs.stanford.edu/people/karpathy/char-rnn/shakespear.txt). Replace newline characters (`\n`) with special tokens so that the line structure of the text survives tokenization. Double newlines are replaced first and get their own token.

In [3]:
with open(config.RAW_DATA_PATH, "r") as file:
    shakespeare_plays = file.read()

In [4]:
sample = shakespeare_plays[:147]
print(sample)

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?


In [5]:
shakespeare_plays = shakespeare_plays.replace("\n\n", config.DOUBLEN_TOKEN)
shakespeare_plays = shakespeare_plays.replace("\n", config.NEWLINE_TOKEN)
sample = shakespeare_plays[:185]
print(sample)

First Citizen: NEWLINE Before we proceed any further, hear me speak. DOUBLEN All: NEWLINE Speak, speak. DOUBLEN First Citizen: NEWLINE You are all resolved rather to die than to famish?
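The order of the two replacements above matters. The following is a minimal pure-Python sketch of why, assuming the token strings are `" NEWLINE "` and `" DOUBLEN "`. These values are guessed from the printed sample; the real ones live in `config`. The `decode` helper is hypothetical and only illustrates the reverse mapping.

```python
# Assumed token strings; the notebook reads the real values from `config`.
NEWLINE_TOKEN = " NEWLINE "
DOUBLEN_TOKEN = " DOUBLEN "


def encode(text):
    """Replace double newlines first, then the remaining single newlines."""
    return text.replace("\n\n", DOUBLEN_TOKEN).replace("\n", NEWLINE_TOKEN)


def encode_swapped(text):
    """Wrong order: every newline is consumed before '\n\n' can match."""
    return text.replace("\n", NEWLINE_TOKEN).replace("\n\n", DOUBLEN_TOKEN)


def decode(text):
    """Hypothetical inverse of `encode`."""
    return text.replace(DOUBLEN_TOKEN, "\n\n").replace(NEWLINE_TOKEN, "\n")


raw = "All:\nSpeak, speak.\n\nFirst Citizen:"
encoded = encode(raw)
swapped = encode_swapped(raw)
print(encoded)
print(swapped)
```

With the swapped order, the paragraph break is encoded as two single-newline tokens, so the double-newline token never appears.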


## Vocabulary
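Generate the vocabulary: split the text into words with `BasicTokenizer`, count the words, and learn a WordPiece vocabulary from the counts. The vocabulary is then written to disk, one token per line.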

In [6]:
tokenizer = bert_tokenizer.BasicTokenizer(**config.BERT_TOKENIZER_PARAMS)
words_dataset = tokenizer.tokenize(shakespeare_plays)
word_counts = learner.count_words(words_dataset)
vocab = learner.learn(
    word_counts,
    config.VOCAB_SIZE,
    config.RESERVED_TOKENS,
    **config.LEARN_PARAMS
)
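As a rough picture of the counting step, the sketch below counts words in plain Python. It is a simplification, not the `learner.count_words` implementation: the regular expression only approximates how `BasicTokenizer` splits on whitespace and punctuation, and the `count_words` helper here is a hypothetical stand-in.

```python
import re
from collections import Counter


def count_words(text):
    """Split into runs of word characters and single punctuation marks, then count."""
    words = re.findall(r"\w+|[^\w\s]", text)
    return Counter(words)


counts = count_words("Speak, speak. Speak!")
print(counts)
```

Case is preserved in this sketch, which matches the mixed-case entries (`To`, `That`) in the learned vocabulary.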

In [7]:
print(vocab[:10])
print(vocab[100:110])
print(vocab[-10:])

with open(config.VOCAB_PATH, "w") as f:
    for token in vocab:
        print(token, file=f)

['!', '$', '&', "'", ',', '-', '.', '3', ':', ';']
['shall', 'are', 'To', 'thee', 'by', 'we', 'That', 'on', 'no', 'our']
['##U', '##V', '##W', '##X', '##Z', '##[', '##]', '##j', '##q', '##v']
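Entries prefixed with `##` are continuation pieces: they can only appear inside a word, never at its start. WordPiece tokenizers split each word by greedy longest-match-first lookup against the vocabulary. The sketch below illustrates that idea in plain Python, with a toy vocabulary and an assumed `[UNK]` fallback token; neither is taken from this notebook's `config`.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"fair", "##er", "speak", "##s"}
print(wordpiece("fairer", toy_vocab))
print(wordpiece("speaks", toy_vocab))
print(wordpiece("zounds", toy_vocab))
```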


## Tokenizer
Build and test the tokenizer.

In [8]:
tokenizer = text.BertTokenizer(
    config.VOCAB_PATH, **config.BERT_TOKENIZER_PARAMS
)

In [9]:
tokens = tokenizer.tokenize(sample).merge_dims(-2, -1)
print(tokens)

<tf.RaggedTensor [[205, 698, 8, 65, 992, 105, 2366, 221, 641, 4, 222, 77, 184, 6, 66, 355,
  8, 65, 804, 4, 184, 6, 66, 205, 698, 8, 65, 151, 101, 99, 2311, 467, 69,
  325, 153, 69, 5270, 10]]>


In [10]:
txt_tokens = tf.gather(vocab, tokens)
txt_tokens = (
    tf.strings.reduce_join(txt_tokens, separator=" ", axis=-1)
    .numpy()[0]
    .decode("utf-8")
)
print(txt_tokens)

First Citizen : NEWLINE Before we proceed any further , hear me speak . DOUBLEN All : NEWLINE Speak , speak . DOUBLEN First Citizen : NEWLINE You are all resolved rather to die than to famish ?


In [11]:
txt_tokens = tokenizer.detokenize(tokens)
print(txt_tokens)

<tf.RaggedTensor [[b'First', b'Citizen', b':', b'NEWLINE', b'Before', b'we', b'proceed',
  b'any', b'further', b',', b'hear', b'me', b'speak', b'.', b'DOUBLEN',
  b'All', b':', b'NEWLINE', b'Speak', b',', b'speak', b'.', b'DOUBLEN',
  b'First', b'Citizen', b':', b'NEWLINE', b'You', b'are', b'all',
  b'resolved', b'rather', b'to', b'die', b'than', b'to', b'famish', b'?']]>
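Detokenization returns whole words because continuation pieces are merged back into the word they belong to. A minimal sketch of that merge on plain Python strings is shown below; `merge_wordpieces` is a hypothetical helper, not part of TensorFlow Text.

```python
def merge_wordpieces(pieces):
    """Attach every '##' piece to the word before it."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and append
        else:
            words.append(piece)
    return words


merged = merge_wordpieces(["You", "are", "a", "fair", "##er", "than"])
print(merged)
```

A `##` piece with no preceding word is kept as-is, which is what happens when a sample window starts in the middle of a word.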


In [12]:
def encode_newlines(txt: tf.Tensor):
    """Replace newline symbols with special tokens."""
    result = tf.strings.regex_replace(txt, "\n\n", config.DOUBLEN_TOKEN)
    result = tf.strings.regex_replace(result, "\n", config.NEWLINE_TOKEN)
    return result


def cleanup_text(txt_tokens: tf.RaggedTensor):
    """Remove special tokens and concatenate the words into a coherent string."""
    result = tf.strings.reduce_join(txt_tokens, separator=" ", axis=-1)
    result = tf.strings.regex_replace(result, config.DOUBLEN_TOKEN, "\n\n")
    result = tf.strings.regex_replace(result, config.NEWLINE_TOKEN, "\n")
    result = tf.strings.regex_replace(result, config.NEWLINE_TOKEN.strip(), "")
    return result


txt_clean = cleanup_text(txt_tokens)
print(txt_clean[0].numpy().decode("utf-8"))

First Citizen :
Before we proceed any further , hear me speak .

All :
Speak , speak .

First Citizen :
You are all resolved rather to die than to famish ?


## Customization and export
Define a custom tokenizer class that can be exported as a SavedModel and used by the GPT model. It encodes newlines before tokenization and cleans up the output after detokenization.

In [13]:
class CustomTokenizer(tf.Module):
    def __init__(self, config):
        self.tokenizer = text.BertTokenizer(
            config.VOCAB_PATH, **config.BERT_TOKENIZER_PARAMS
        )
        self._reserved_tokens = config.RESERVED_TOKENS
        self._vocab_path = tf.saved_model.Asset(config.VOCAB_PATH)

        vocab = pathlib.Path(config.VOCAB_PATH).read_text().splitlines()
        self.vocab = tf.Variable(vocab)

        ## Create the signatures for export:

        # Include a tokenize signature for a batch of strings.
        self.tokenize.get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string)
        )

        # Include `detokenize` signatures for:
        #   * `Tensors` with shape [batch, tokens]
        #   * `RaggedTensors` with shape [batch, tokens]
        self.detokenize.get_concrete_function(
            tf.TensorSpec(shape=[None, None], dtype=tf.int64)
        )
        self.detokenize.get_concrete_function(
            tf.RaggedTensorSpec(shape=[None, None], dtype=tf.int64)
        )

    @tf.function
    def tokenize(self, strings):
        strings = encode_newlines(strings)
        enc = self.tokenizer.tokenize(strings)
        enc = enc.merge_dims(-2, -1)
        return enc

    @tf.function
    def detokenize(self, tokenized):
        words = self.tokenizer.detokenize(tokenized)
        return cleanup_text(words)

In [14]:
tokenizer = CustomTokenizer(config)
tf.saved_model.save(tokenizer, config.TOKENIZER_PATH)
reloaded_tokenizer = tf.saved_model.load(config.TOKENIZER_PATH)

INFO:tensorflow:Assets written to: tokenizer/assets


In [15]:
tokens = reloaded_tokenizer.tokenize(["Hello\nTensorFlow!\n\n:]"])
tokens.numpy()

array([[ 167, 5422,   65, 4051,   78,  675, 5715, 2470,    0,   66,    8,
          38]])

In [16]:
round_trip = reloaded_tokenizer.detokenize(tokens)
print(round_trip[0].numpy().decode("utf-8"))

Hello
TensorFlow !

: ]


## Save tokenized dataset
Tokenize the full text, draw `config.N_SAMPLES` windows of `config.MAX_TOKENS + 1` consecutive tokens at random start positions, and split the windows into training and validation sets. Store both tokenized datasets on disk.

In [17]:
tokens = tokenizer.tokenize(shakespeare_plays).numpy()
n_tokens = tokens.shape[1]
print(f"Number of tokens: {n_tokens:,}")

Number of tokens: 1,348,491


In [28]:
sample_len = config.MAX_TOKENS + 1
indices = tf.random.uniform(
    (config.N_SAMPLES,),
    minval=0,
    maxval=n_tokens - sample_len,
    dtype=tf.dtypes.int32,
)

In [39]:
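def get_sample_from_index(i):
    """Return the window of `sample_len` consecutive tokens starting at index `i`."""
    positions = tf.range(i, i + sample_len, 1)  # avoid shadowing built-in `range`
    return tf.gather(tokens[0], positions)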


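dataset = (
    tf.data.Dataset.from_tensor_slices(indices)
    .map(get_sample_from_index, tf.data.AUTOTUNE)
    # Keep one fixed order so that take/skip below yield disjoint splits.
    .shuffle(config.BUFFER_SIZE, reshuffle_each_iteration=False)
)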
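The window sampling can be sketched without TensorFlow. The example below assumes that the extra token in `MAX_TOKENS + 1` exists so that each window can later be split into inputs and targets shifted by one position. That is a common setup for next-token prediction, but this notebook does not show it. The `sample_windows` helper and the toy token list are illustrative only.

```python
import random


def sample_windows(tokens, n_samples, max_tokens, seed=0):
    """Draw `n_samples` windows of `max_tokens + 1` consecutive tokens."""
    rng = random.Random(seed)
    sample_len = max_tokens + 1
    last_start = len(tokens) - sample_len
    # `randint` includes both ends, unlike `tf.random.uniform`, whose maxval is exclusive.
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [tokens[s:s + sample_len] for s in starts]


tokens = list(range(100))  # toy token ids
windows = sample_windows(tokens, n_samples=4, max_tokens=8)
for window in windows:
    inputs, targets = window[:-1], window[1:]  # assumed next-token split
    print(inputs, "->", targets)
```

Because start positions are drawn independently, windows can overlap each other.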

In [40]:
examples = list(dataset.take(3))
detokenized_examples = tokenizer.detokenize(examples).numpy()
for ex in detokenized_examples:
    print(ex.decode("utf-8"))
    print("=" * 80)

you will live , resolve it you .
Sharp physic is the last : but , O you powers
That give heaven countless eyes to view men ' s acts ,
Why cloud they not their sights perpetually ,
If this be true , which makes me pale to read it ?
Fair glass of light , I loved you , and could still ,
Were not this glorious casket stored with ill :
But I must tell you , now my thoughts revolt
For he ' s no man on whom perfections wait
That , knowing sin within , will touch the gate .
You are a fair
##er than mankind .
The gods confound - - hear me , you good gods all - -
The Athenians both within and out that wall !
And grant , as Timon grows , his hate may grow
To the whole race of mankind , high and low ! Amen .

First Servant :
Hear you , master steward , where ' s our master ?
Are we undone ? cast off ? nothing remaining ?

FLAVIUS :
Alack , my fellows , what should I say to you ?
Let me be recorded by the righteous gods ,
I am as poor as you .

First Servant :
Such a
me cast my love on him ?

LUCET

In [41]:
n_samples = dataset.cardinality().numpy()
val_size = int(n_samples * config.VALIDATION_SHARE)
val_dataset = dataset.take(val_size)
train_dataset = dataset.skip(val_size)

In [43]:
val_dataset.save(config.VAL_DATA_PATH)
train_dataset.save(config.TRAIN_DATA_PATH)
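The `take`/`skip` split can be pictured with plain lists. It partitions the data only if the dataset yields its elements in the same order on every pass; the sketch below assumes that fixed order. The `take_skip_split` helper and the share of 0.2 are illustrative, not values from `config`.

```python
def take_skip_split(samples, validation_share):
    """First `val_size` samples go to validation, the rest to training."""
    val_size = int(len(samples) * validation_share)
    return samples[:val_size], samples[val_size:]


samples = list(range(10))
val, train = take_skip_split(samples, validation_share=0.2)
print(val)
print(train)
```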