# **Creating the Tokenizers**

This notebook explores the tokenization of the Kiche and Spanish sentences. The tokenizer used is the BERT tokenizer, a subword tokenizer. Splitting words into subword units lets the tokenization better account for morphemes, which matters because Kiche and Spanish are both highly synthetic languages.

The code in this notebook borrows heavily from the [TensorFlow tutorial for subword tokenization](https://www.tensorflow.org/text/guide/subwords_tokenizer). I have indicated which code is from the tutorial.
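To make the idea of subword tokenization concrete, here is a minimal pure-Python sketch of the greedy longest-match-first lookup that WordPiece tokenizers use to split a word. The function name `wordpiece_tokenize` and the toy vocabulary are assumptions made for illustration; they are not part of the `Tokenizer` module or of the vocabularies generated later in this notebook.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# A toy vocabulary (hypothetical, for illustration only).
toy_vocab = {"confia", "##ba", "sobre", "##vi", "##niendo", "que"}

print(wordpiece_tokenize("confiaba", toy_vocab))       # ['confia', '##ba']
print(wordpiece_tokenize("sobreviniendo", toy_vocab))  # ['sobre', '##vi', '##niendo']
print(wordpiece_tokenize("xyz", toy_vocab))            # ['[UNK]']
```

A piece that begins with `##` continues the previous piece, so a word that is missing from the vocabulary can still be represented by smaller units it does contain.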

In [1]:
# Imports the Tokenizer module from the Translation folder/package
from Translation import Tokenizer as tok

import tensorflow as tf

# This will be imported to generate the wordpiece vocabulary
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

## Preprocessing and Exploration

In this section, I load and preprocess the data, then examine the first and last five examples for each language.

In [2]:
filepath = 'Bilingual Corpus.csv'
source = 'Kiche'
target = 'Spanish'

all_ki, all_sp = tok.preclean(filepath, source, target)

In [3]:
# First 5 Kiche sentences
all_ki.head()

0    weta’m chi ri utaqanik are k’aslemal ri maj uk...
1    qeta’m chi ronojel ri tiktalik koq’ik jetaq ri...
2    we k’u kape jun achi chirij ri sib’alaj k’o uc...
3    pune’ ri e are’ man e q’ui taj , xa’ e kieb’ o...
4    kraj k’u ne xekikamisaj juwinaq waqlajuj aj is...
Name: Kiche, dtype: object

In [4]:
# Last 5 Kiche sentences
all_ki.tail()

37083    man xinwil ta pa ri tinimit ri’ jun templo rum...
37084       i le jun chkech le e pareyib’ , le espanyolib’
37085    xkik’am b’ik ri jesús pa ri ulew ub’i’nam gólg...
37086      na kinb’eta pa le nimatijob’al rumal xinkosik .
37087    e are’ k’ut ri kekanaj kan chiwe , kinya’ na n...
Name: Kiche, dtype: object

In [5]:
# First 5 Spanish sentences
all_sp.head()

0    y sé que su mandamiento es vida eterna ; así q...
1    porque ya sabemos que todas las criaturas gime...
2    mas si sobreviniendo otro más fuerte que él , ...
3    siendo vosotros pocos hombres en número , y pe...
4    los hombres de hai mataron a unos treinta y se...
Name: Spanish, dtype: object

In [6]:
# Last 5 Spanish sentences
all_sp.tail()

37083    y no vi en ella templo ; porque el señor dios ...
37084    y lo otro es que para los sacerdores , los esp...
37085    y le llevaron al lugar de gólgota , que declar...
37086           no voy a la escuela porque estoy cansado .
37087     “haré que aquellos de ustedes que sobrevivan ...
Name: Spanish, dtype: object

## Generating the Wordpiece Vocabulary

This section generates a wordpiece vocabulary file for each language; these files store the wordpieces (the "morphemes") learned by `bert_vocab.bert_vocab_from_dataset`. The `reserved_tokens` are the strings `"[PAD]"`, `"[UNK]"`, `"[START]"`, and `"[END]"`; they occupy the first four positions of each vocabulary.

Much of the code is from the TensorFlow tutorial.

In [7]:
# The arguments for the tokenizer are defined
bert_tokenizer_params=dict(lower_case=True)

bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size = 8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=tok.reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={})
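Setting `lower_case=True` does more than lower-casing: according to the `text.BertTokenizer` documentation, it also applies NFD normalization and strips accent characters. The sketch below approximates that preprocessing with the standard library only. The function name `normalize_like_bert` is hypothetical, and the real tokenizer may differ in details such as case folding.

```python
import unicodedata


def normalize_like_bert(text):
    """Lower-case, apply NFD normalization, and drop combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Category "Mn" (nonspacing mark) covers the combining accents that NFD splits off.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_like_bert("Y sé que así hablo"))  # y se que asi hablo
print(normalize_like_bert("señor"))               # senor
```

This is why accented Spanish words appear without their accents in the generated vocabulary and in the tokenized output.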

In [8]:
# The data is converted into tensors
kiche, spanish = tok.convert(all_ki, all_sp)

In [9]:
# Here's where the magic happens
# The wordpiece vocabulary is generated

ki_vocab, sp_vocab = tok.generate_vocab(kiche, spanish, bert_vocab_args)

In [10]:
# Prints out some vocab in Kiche
print(ki_vocab[:10])
print(ki_vocab[100:110])
print(ki_vocab[1000:1010])
print(ki_vocab[-10:])

['[PAD]', '[UNK]', '[START]', '[END]', '!', "'", '(', ')', '*', ',']
['taj', 'xeb', 'konojel', 'ob', 'chik', 'aretaq', 'je', 'wach', 'kab', 'anik']
['##en', '##uk', 'amoreyib', 'eliy', 'kal', 'kuchomaj', 'mexa', 'onik', 'amonib', 'kanajinaq']
['##h', '##v', '##¡', '##¿', '##–', '##—', '##‘', '##’', '##“', '##”']


In [11]:
# Prints out some vocab in Spanish
print(sp_vocab[:10])
print(sp_vocab[100:110])
print(sp_vocab[1000:1010])
print(sp_vocab[-10:])

['[PAD]', '[UNK]', '[START]', '[END]', '!', "'", '(', ')', ',', '-']
['te', 'si', 'cuando', 'una', 'todo', 'entonces', 'hijo', 'asi', 'casa', 'hijos']
['hallo', 'harina', 'hubiera', 'llamaba', 'menor', 'pos', 'ramas', 'sabes', 'subir', 'tomando']
['##¡', '##«', '##»', '##¿', '##–', '##—', '##‘', '##’', '##“', '##”']


In [12]:
# Writes the Kiche vocab into a file
tok.write_vocab_file('ki_vocab.txt', ki_vocab)

In [13]:
# Writes the Spanish vocab into a file
tok.write_vocab_file('sp_vocab.txt', sp_vocab)
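The implementation of `tok.write_vocab_file` is not shown in this notebook. A minimal sketch, assuming it follows the helper of the same name in the TensorFlow tutorial, writes one token per line so that a token's line number is its ID. The explicit `encoding="utf-8"` is my own addition, since the vocabulary contains non-ASCII characters such as `’`.

```python
import os
import tempfile


def write_vocab_file(filepath, vocab):
    """Write one token per line; the zero-based line number becomes the token ID."""
    with open(filepath, "w", encoding="utf-8") as f:
        for token in vocab:
            print(token, file=f)


# Usage with a few tokens taken from the printed Kiche vocabulary.
sample_vocab = ["[PAD]", "[UNK]", "[START]", "[END]", "taj", "##’"]
path = os.path.join(tempfile.mkdtemp(), "sample_vocab.txt")
write_vocab_file(path, sample_vocab)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines == sample_vocab)  # True
```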

## Building the Tokenizer

In this section, the tokenizers are built using the `CustomTokenizer` class defined in the `Tokenizer` module. Its constructor holds a `BertTokenizer` object, which is created from the vocabulary file passed to it.

Much of this code comes from the TensorFlow tutorial.

In [14]:
# Generates the tokenizers for Kiche and Spanish
ki_tok = tok.CustomTokenizer(tok.reserved_tokens, 'ki_vocab.txt')
sp_tok = tok.CustomTokenizer(tok.reserved_tokens, 'sp_vocab.txt')

### Kiche Tokenization

I will explore the tokenization of Kiche in this subsection.

In [15]:
# These are the untokenized sentences
for ki_sentence in kiche.batch(3).take(1):
    for ex in ki_sentence:
        print(ex)
        
# \xe2\x80\x99 represents ’

tf.Tensor(b'weta\xe2\x80\x99m chi ri utaqanik are k\xe2\x80\x99aslemal ri maj uk\xe2\x80\x99isik . xaq jeri\xe2\x80\x99 ronojel ri kinb\xe2\x80\x99ij are ri\xe2\x80\x99 ri ub\xe2\x80\x99im ri tataxel chwe kinb\xe2\x80\x99ij .', shape=(), dtype=string)
tf.Tensor(b'qeta\xe2\x80\x99m chi ronojel ri tiktalik koq\xe2\x80\x99ik jetaq ri kub\xe2\x80\x99an jun ixoq are kak\xe2\x80\x99oji\xe2\x80\x99 ral .', shape=(), dtype=string)
tf.Tensor(b'we k\xe2\x80\x99u kape jun achi chirij ri sib\xe2\x80\x99alaj k\xe2\x80\x99o uchuq\xe2\x80\x99ab\xe2\x80\x99 choch , kesax ri\xe2\x80\x99 ri uch\xe2\x80\x99eyab\xe2\x80\x99al ri ku\xe2\x80\x99l uk\xe2\x80\x99u\xe2\x80\x99x chirij , xuquje\xe2\x80\x99 kelaq\xe2\x80\x99ax b\xe2\x80\x99ik ronojel ri jastaq rech .', shape=(), dtype=string)
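The tensors print as raw UTF-8 bytes, which is why the apostrophe shows up as `\xe2\x80\x99`. A short standard-library check, using a fragment of the first sentence above, confirms which character those three bytes encode:

```python
import unicodedata

raw = b'weta\xe2\x80\x99m chi ri utaqanik'
text = raw.decode("utf-8")
apostrophe = text[4]  # the character right after "weta"

print(text)                          # weta’m chi ri utaqanik
print(unicodedata.name(apostrophe))  # RIGHT SINGLE QUOTATION MARK
print(text.encode("utf-8") == raw)   # True
```

For a scalar string tensor, the same decoding can be done with `.numpy().decode('utf-8')`, as in the round-trip cells later in this notebook.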


In [16]:
# The tokenized sentences
ki_tokenized = ki_tok.tokenize(ki_sentence)

for ex in ki_tokenized.to_list():
    print(ex)

[2, 528, 57, 38, 64, 60, 1267, 70, 36, 57, 195, 60, 224, 164, 57, 1029, 11, 135, 153, 57, 99, 60, 147, 57, 67, 70, 60, 57, 60, 163, 57, 272, 60, 625, 138, 147, 57, 67, 11, 3]
[2, 945, 57, 38, 64, 99, 60, 4876, 934, 57, 83, 338, 60, 111, 57, 78, 72, 226, 70, 157, 57, 198, 57, 375, 11, 3]
[2, 76, 36, 57, 46, 277, 72, 158, 495, 60, 110, 57, 97, 36, 57, 40, 1360, 57, 77, 57, 476, 9, 1076, 60, 57, 60, 400, 57, 576, 229, 57, 65, 60, 304, 57, 37, 164, 57, 46, 57, 49, 495, 9, 66, 57, 2671, 57, 143, 27, 57, 83, 99, 60, 152, 79, 11, 3]
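Each ID list above starts with 2 and ends with 3. These are the positions of `"[START]"` and `"[END]"` in the reserved tokens, which sit at the front of the vocabulary. The sketch below shows the idea in plain Python; `add_start_end` is a hypothetical helper, and the actual `CustomTokenizer` presumably does the equivalent on ragged tensors.

```python
# Reserved tokens in the order they appear at the top of each vocabulary.
reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

START_ID = reserved_tokens.index("[START]")  # 2
END_ID = reserved_tokens.index("[END]")      # 3


def add_start_end(token_ids):
    """Wrap a list of token IDs with the start and end markers."""
    return [START_ID] + list(token_ids) + [END_ID]


print(add_start_end([528, 57, 38]))  # [2, 528, 57, 38, 3]
```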


In [17]:
# The tokens are matched with their wordpieces

# Lookup each token id in the vocabulary
ki_txt_tokens = tf.gather(ki_vocab, ki_tokenized)
# Join with spaces
tf.strings.reduce_join(ki_txt_tokens, separator=' ', axis=-1)

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'[START] weta \xe2\x80\x99 m chi ri utaqanik are k \xe2\x80\x99 aslemal ri maj uk \xe2\x80\x99 isik . xaq jeri \xe2\x80\x99 ronojel ri kinb \xe2\x80\x99 ij are ri \xe2\x80\x99 ri ub \xe2\x80\x99 im ri tataxel chwe kinb \xe2\x80\x99 ij . [END]',
       b'[START] qeta \xe2\x80\x99 m chi ronojel ri tiktalik koq \xe2\x80\x99 ik jetaq ri kub \xe2\x80\x99 an jun ixoq are kak \xe2\x80\x99 oji \xe2\x80\x99 ral . [END]',
       b'[START] we k \xe2\x80\x99 u kape jun achi chirij ri sib \xe2\x80\x99 alaj k \xe2\x80\x99 o uchuq \xe2\x80\x99 ab \xe2\x80\x99 choch , kesax ri \xe2\x80\x99 ri uch \xe2\x80\x99 eya ##b \xe2\x80\x99 al ri ku \xe2\x80\x99 l uk \xe2\x80\x99 u \xe2\x80\x99 x chirij , xuquje \xe2\x80\x99 kelaq \xe2\x80\x99 ax b \xe2\x80\x99 ik ronojel ri jastaq rech . [END]'],
      dtype=object)>

In [18]:
# The detokenized, reassembled sentences; the extra reduce_join merges all three into one string
ki_words = ki_tok.detokenize(ki_tokenized)
tf.strings.reduce_join(ki_words, separator=' ', axis=-1)

<tf.Tensor: shape=(), dtype=string, numpy=b'weta \xe2\x80\x99 m chi ri utaqanik are k \xe2\x80\x99 aslemal ri maj uk \xe2\x80\x99 isik . xaq jeri \xe2\x80\x99 ronojel ri kinb \xe2\x80\x99 ij are ri \xe2\x80\x99 ri ub \xe2\x80\x99 im ri tataxel chwe kinb \xe2\x80\x99 ij . qeta \xe2\x80\x99 m chi ronojel ri tiktalik koq \xe2\x80\x99 ik jetaq ri kub \xe2\x80\x99 an jun ixoq are kak \xe2\x80\x99 oji \xe2\x80\x99 ral . we k \xe2\x80\x99 u kape jun achi chirij ri sib \xe2\x80\x99 alaj k \xe2\x80\x99 o uchuq \xe2\x80\x99 ab \xe2\x80\x99 choch , kesax ri \xe2\x80\x99 ri uch \xe2\x80\x99 eyab \xe2\x80\x99 al ri ku \xe2\x80\x99 l uk \xe2\x80\x99 u \xe2\x80\x99 x chirij , xuquje \xe2\x80\x99 kelaq \xe2\x80\x99 ax b \xe2\x80\x99 ik ronojel ri jastaq rech .'>
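Detokenization reverses the wordpiece split: pieces that start with `##` are glued back onto the preceding piece, and the reserved markers are dropped. The following is a rough pure-Python sketch of that step, not the code used by `CustomTokenizer`; the helper name `merge_wordpieces` is an assumption.

```python
def merge_wordpieces(pieces, reserved=("[PAD]", "[START]", "[END]"), prefix="##"):
    """Rebuild a sentence from wordpieces, dropping reserved marker tokens."""
    words = []
    for piece in pieces:
        if piece in reserved:
            continue  # markers are not part of the text
        if piece.startswith(prefix) and words:
            words[-1] += piece[len(prefix):]  # continuation of the previous word
        else:
            words.append(piece)
    return " ".join(words)


# Pieces taken from the third Kiche sentence above.
pieces = ["[START]", "ri", "uch", "’", "eya", "##b", "’", "al", "[END]"]
print(merge_wordpieces(pieces))  # ri uch ’ eyab ’ al
```

Note that the apostrophe stays separated by spaces, as in the output above, because the tokenizer treats it as punctuation and splits it off before the wordpiece step.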

### Spanish Tokenization

I will explore the tokenization of Spanish in this subsection.

In [26]:
# These are the untokenized sentences
for sp_sentence in spanish.batch(3).take(1):
    for ex in sp_sentence:
        print(ex)

tf.Tensor(b'y s\xc3\xa9 que su mandamiento es vida eterna ; as\xc3\xad que , lo que yo hablo , como el padre me lo ha dicho , as\xc3\xad hablo .', shape=(), dtype=string)
tf.Tensor(b'porque\xc2\xa0ya\xc2\xa0sabemos que todas las criaturas gimen , y est\xc3\xa1n de parto hasta ahora .', shape=(), dtype=string)
tf.Tensor(b'mas si sobreviniendo otro m\xc3\xa1s fuerte que \xc3\xa9l , le venciere ,\xc2\xa0le\xc2\xa0toma todas sus armas en que confiaba , y reparte sus despojos .', shape=(), dtype=string)
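Some of the Spanish sentences contain the byte pair `\xc2\xa0` where a space would be expected. A quick standard-library check on a fragment of the second sentence shows that this is a non-breaking space, which Python still treats as whitespace when splitting:

```python
import unicodedata

raw = b'porque\xc2\xa0ya\xc2\xa0sabemos que todas'
text = raw.decode("utf-8")
nbsp = text[6]  # the character right after "porque"

print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(text.split())            # ['porque', 'ya', 'sabemos', 'que', 'todas']
```

The tokenized output further down suggests that the tokenizer likewise separates these words.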


In [27]:
# The tokenized sentences
sp_tokenized = sp_tok.tokenize(sp_sentence)

for ex in sp_tokenized.to_list():
    print(ex)

[2, 49, 70, 65, 73, 659, 83, 229, 1020, 23, 107, 65, 8, 81, 65, 93, 387, 8, 86, 64, 134, 92, 81, 118, 288, 8, 107, 387, 10, 3]
[2, 87, 194, 1326, 65, 139, 77, 4040, 31, 5134, 110, 8, 49, 216, 63, 1960, 123, 152, 10, 3]
[2, 91, 101, 94, 6221, 4459, 213, 91, 504, 65, 64, 8, 88, 4730, 8, 88, 700, 139, 79, 1245, 68, 65, 3037, 252, 8, 49, 42, 174, 2368, 179, 79, 2705, 10, 3]


In [28]:
# The tokens are matched with their wordpieces

# Lookup each token id in the vocabulary
sp_txt_tokens = tf.gather(sp_vocab, sp_tokenized)
# Join with spaces
tf.strings.reduce_join(sp_txt_tokens, separator=' ', axis=-1)

<tf.Tensor: shape=(3,), dtype=string, numpy=
array([b'[START] y se que su mandamiento es vida eterna ; asi que , lo que yo hablo , como el padre me lo ha dicho , asi hablo . [END]',
       b'[START] porque ya sabemos que todas las criaturas g ##ime ##n , y estan de parto hasta ahora . [END]',
       b'[START] mas si sobre ##vi ##niendo otro mas fuerte que el , le venciere , le toma todas sus armas en que confia ##ba , y r ##e ##par ##te sus despojos . [END]'],
      dtype=object)>

In [29]:
# The detokenized, reassembled sentences; the extra reduce_join merges all three into one string
sp_words = sp_tok.detokenize(sp_tokenized)
tf.strings.reduce_join(sp_words, separator=' ', axis=-1)

<tf.Tensor: shape=(), dtype=string, numpy=b'y se que su mandamiento es vida eterna ; asi que , lo que yo hablo , como el padre me lo ha dicho , asi hablo . porque ya sabemos que todas las criaturas gimen , y estan de parto hasta ahora . mas si sobreviniendo otro mas fuerte que el , le venciere , le toma todas sus armas en que confiaba , y reparte sus despojos .'>

## Saving and Exporting the Model

In this section, the tokenizers are saved and exported. Both tokenizers are attached as attributes of a single `tf.Module` object, so they are exported together and tokenize consistently when reloaded.

Again, the code comes from the TensorFlow tutorial.

In [30]:
tokenizers = tf.Module()
tokenizers.ki = ki_tok
tokenizers.sp = sp_tok

model_name = 'kiche_spanish_tokens'
tf.saved_model.save(tokenizers, model_name)

INFO:tensorflow:Assets written to: kiche_spanish_tokens\assets


## Reloading and Testing the Tokenizers

The saved and exported models are reloaded and tested.

Again, the code comes from the TensorFlow tutorial.

### Kiche Tokenization

In [31]:
# The tokenizer is reloaded
reloaded_tokenizers = tf.saved_model.load(model_name)
reloaded_tokenizers.ki.get_vocab_size().numpy()

4985

In [32]:
# Tokenizes a phrase meaning "Hello TensorFlow!"
tokens = reloaded_tokenizers.ki.tokenize(['Saqirik TensorFlow!'])
tokens.numpy()

array([[   2, 4865,  218, 4617, 1128, 4976, 1546,  715,    4,    3]],
      dtype=int64)

In [33]:
# The tokens are matched with their corresponding word pieces
text_tokens = reloaded_tokenizers.ki.lookup(tokens)
text_tokens

<tf.RaggedTensor [[b'[START]', b'sedron', b'te', b'##nwaj', b'##or', b'##v', b'##it',
  b'##w', b'!', b'[END]']]>

In [34]:
# The phrase is detokenized
round_trip = reloaded_tokenizers.ki.detokenize(tokens)

print(round_trip.numpy()[0].decode('utf-8'))

saqirik tensorflow !


### Spanish Tokenization

In [35]:
# The tokenizer is reloaded
reloaded_tokenizers = tf.saved_model.load(model_name)
reloaded_tokenizers.sp.get_vocab_size().numpy()

6678

In [36]:
# Tokenizes a phrase meaning "Hello TensorFlow!"
tokens = reloaded_tokenizers.sp.tokenize(['Hola TensorFlow!'])
tokens.numpy()

array([[   2,   32, 3609, 1606,  709,  147, 2431,  218, 6665,    4,    3]],
      dtype=int64)

In [37]:
# The tokens are matched with their corresponding word pieces
text_tokens = reloaded_tokenizers.sp.lookup(tokens)
text_tokens

<tf.RaggedTensor [[b'[START]', b'h', b'##ola', b'tenido', b'amor', b'era', b'##gada',
  b'gran', b'##w', b'!', b'[END]']]>

In [38]:
# The phrase is detokenized
round_trip = reloaded_tokenizers.sp.detokenize(tokens)

print(round_trip.numpy()[0].decode('utf-8'))

hola tensorflow !


## Zipping the Tokenizers for Future Use

The exported tokenizers are zipped so that they can be reused in the transformer and seq2seq translators.

Once again, the code comes from the TensorFlow tutorial.

In [39]:
!zip -r {model_name}.zip {model_name}

updating: kiche_spanish_tokens/ (164 bytes security) (stored 0%)
updating: kiche_spanish_tokens/assets/ (164 bytes security) (stored 0%)
updating: kiche_spanish_tokens/assets/ki_vocab.txt (164 bytes security) (deflated 59%)
updating: kiche_spanish_tokens/assets/sp_vocab.txt (164 bytes security) (deflated 60%)
updating: kiche_spanish_tokens/saved_model.pb (164 bytes security) (deflated 91%)
updating: kiche_spanish_tokens/variables/ (164 bytes security) (stored 0%)
updating: kiche_spanish_tokens/variables/variables.data-00000-of-00001 (164 bytes security) (deflated 51%)
updating: kiche_spanish_tokens/variables/variables.index (164 bytes security) (deflated 33%)
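
The `!zip` command depends on a `zip` executable being installed. As a portable alternative, the same archive can be produced from Python with `shutil.make_archive`. The sketch below runs against a small stand-in directory in a temporary folder (the file contents are hypothetical), so it does not touch the real export.

```python
import os
import shutil
import tempfile
import zipfile

work_dir = tempfile.mkdtemp()
model_name = "kiche_spanish_tokens"

# Stand-in for the exported SavedModel directory (hypothetical contents).
assets_dir = os.path.join(work_dir, model_name, "assets")
os.makedirs(assets_dir)
with open(os.path.join(assets_dir, "ki_vocab.txt"), "w", encoding="utf-8") as f:
    f.write("[PAD]\n[UNK]\n[START]\n[END]\n")

# Roughly equivalent to `zip -r kiche_spanish_tokens.zip kiche_spanish_tokens`.
archive_path = shutil.make_archive(
    base_name=os.path.join(work_dir, model_name),  # ".zip" is appended automatically
    format="zip",
    root_dir=work_dir,    # directory that contains the model folder
    base_dir=model_name,  # folder to archive, kept as the top-level entry
)

with zipfile.ZipFile(archive_path) as zf:
    names = zf.namelist()
print(names)
```

To archive the real export, `root_dir` would be the directory that holds `kiche_spanish_tokens`, for example `"."` when the notebook is run from that location.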
