# Train Tokenizer
There are two popular encoding choices: character encoding and subword encoding. Subword encoding models are nearly identical to character encoding models. The primary difference is that a subword encoding model accepts a subword-tokenized text corpus and emits subword tokens in its decoding step.
Preparation of the tokenizer is made simple by the [process_asr_text_tokenizer.py script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py) in NeMo. We leverage this script to build the text corpus from the manifest directly, then create a tokenizer using that corpus.

## Subword Tokenization

If you are familiar with Natural Language Processing, you have probably come across the term "<i>subword</i>" frequently. So what is a subword in the first place? Simply put, it is either a single character or a group of characters. Subwords are combined according to a tokenization-detokenization algorithm to form characters, words, or entire sentences.
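
To make the idea concrete, here is a minimal sketch of tokenization and detokenization with a tiny hand-written vocabulary. It assumes a greedy longest-match split and a `##` prefix for pieces that continue a word. Both the vocabulary and the helper names `tokenize_word` and `detokenize` are hypothetical, chosen only for illustration, and this is not how the NeMo script or any particular library implements it.

```python
def tokenize_word(word, vocab):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece of the vocabulary fits: treat the word as unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def detokenize(pieces):
    """Glue continuation pieces back onto the preceding piece."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return " ".join(words)


# Hypothetical toy vocabulary.
vocab = {"hel", "##lo", "##l", "w", "##or", "##ld"}

tokens = [p for w in "hello world".split() for p in tokenize_word(w, vocab)]
print(tokens)              # ['hel', '##lo', 'w', '##or', '##ld']
print(detokenize(tokens))  # hello world
```

Eleven characters (including the space) become five subword tokens, and detokenization recovers the original text.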

Many subword tokenization-detokenization algorithms exist. Each builds its vocabulary from a large text corpus and can then convert text to subwords and back. Some of the most commonly used subword tokenization methods are [Byte Pair Encoding](https://arxiv.org/abs/1508.07909), [WordPiece](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf) and [SentencePiece](https://www.aclweb.org/anthology/D18-2012/).
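
As a rough illustration of how such a vocabulary can be built from text, the sketch below learns Byte Pair Encoding merges on a toy word list: it repeatedly finds the most frequent pair of adjacent symbols and merges it into a new symbol. The function name `learn_bpe_merges`, the tie-breaking rule, and the word list are assumptions made for this example. Real implementations add details such as word-boundary markers and far more efficient counting.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself so the
        # result is deterministic (an arbitrary choice for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_words = ["low", "low", "lower", "lowest"]
merges = learn_bpe_merges(toy_words, 3)
print(merges)  # [('o', 'w'), ('l', 'ow'), ('low', 'e')]
```

After three merges the frequent stem `low` has become a single symbol, which is the sense in which learned subwords capture regularities of the language.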

## The necessity of subword tokenization for ASR

It has been found via extensive research in the domain of Neural Machine Translation and Language Modelling that subword tokenization not only reduces the length of the tokenized representation (thereby making sentences shorter and more manageable for models to learn), but also boosts the accuracy of prediction of correct tokens.

The [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) loss function is commonly used to train acoustic models, but this loss function has a few limitations:

 - **Generated tokens are conditionally independent of each other**. In other words, the probability of the character "l" being predicted after "hel##" does not depend on the previous token, so any other token could be predicted just as well unless the model has future information!
 - **The length of the generated (target) sequence cannot exceed that of the source sequence.**

It turns out that subword tokenization helps alleviate both of these issues!

 - Sophisticated subword tokenization algorithms build their vocabularies based on large text corpora. To accurately tokenize such large volumes of text with minimal vocabulary size, the subwords that are learned inherently model the interdependency between tokens of that language to some degree. 
 
Looking at the previous example, the token `hel##` is a single token that represents the relationship `h` => `e` => `l`. When the model predicts the single token `hel##`, it implicitly predicts this relationship, even though the subsequent token can be either `l` (for `hell`) or `##lo` (for `hello`) and is predicted independently of the previous token!

 - By reducing the target sentence length by subword tokenization (target sentence here being the characters/subwords transcribed from the audio signal), we entirely sidestep the sequence length limitation of CTC loss!

This means we can perform a larger number of pooling steps in our acoustic models, thereby improving execution speed while simultaneously reducing memory requirements.
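
The length argument can be checked with a little arithmetic. The sketch below assumes that CTC needs at least one encoder frame per target token, plus one extra blank frame between identical neighbouring tokens, and that pooling by a factor `downsampling` divides the frame count by floor division. The frame count, the downsampling factor, and the function names are hypothetical and not taken from NeMo.

```python
def min_ctc_frames(target):
    """Smallest number of encoder frames from which CTC can emit `target`.

    One frame per token, plus one blank frame between every pair of
    identical neighbouring tokens (otherwise the repeat would collapse).
    """
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def encoder_frames(num_input_frames, downsampling):
    """Frames left after pooling by `downsampling` (floor division)."""
    return num_input_frames // downsampling


# Hypothetical utterance: 80 feature frames, acoustic model downsamples 8x.
frames = encoder_frames(80, 8)

char_target = list("hello world")                      # 11 characters, "ll" repeats
subword_target = ["hel", "##lo", "w", "##or", "##ld"]  # 5 subwords

char_ok = frames >= min_ctc_frames(char_target)
subword_ok = frames >= min_ctc_frames(subword_target)
print(frames, min_ctc_frames(char_target), char_ok)        # 10 12 False
print(frames, min_ctc_frames(subword_target), subword_ok)  # 10 5 True
```

With 8x pooling the character target no longer fits into the available frames, while the subword target fits with room to spare.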

First, download the tokenizer creation script from the NeMo repository.


In [11]:
import os

BRANCH = 'main'
if not os.path.exists("scripts/process_asr_text_tokenizer.py"):
    # -p avoids an error message if the scripts/ directory already exists
    !mkdir -p scripts
    !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py

The script above takes a few important arguments:

 - either `--manifest` or `--data_file`: If your text data lies inside an ASR manifest file, use `--manifest`. If instead the text data is in a plain text file with one text sample per line, use `--data_file`. In either case, you can pass several comma-separated paths to concatenate different manifests or different data files.

 - `--data_root`: The output directory (whose subdirectories will be created if not present) where the tokenizers will be placed.

 - `--vocab_size`: The size of the tokenizer vocabulary. Larger vocabularies can accommodate almost entire words, but the decoder size of any model will grow proportionally.

 - `--tokenizer`: Can be either `spe` or `wpe`. `spe` refers to the Google `sentencepiece` library tokenizer. `wpe` refers to the HuggingFace BERT Word Piece tokenizer. Please refer to the papers above for the relevant technique in order to select an appropriate tokenizer.

 - `--no_lower_case`: When this flag is passed, it will force the tokenizer to create separate tokens for upper and lower case characters. By default, the script will turn all the text to lower case before tokenization (and if upper case characters are passed during training/inference, the tokenizer will emit a token equivalent to Out-Of-Vocabulary). Used primarily for the English language. 

 - `--spe_type`: The `sentencepiece` library has a few implementations of the tokenization technique, and `spe_type` refers to these implementations. Currently supported types are `unigram`, `bpe`, `char`, `word`. Defaults to `bpe`.

 - `--spe_character_coverage`: The `sentencepiece` library considers how much of the original vocabulary it should cover in its "base set" of tokens (akin to the lower and upper case characters of the English language). For almost all languages with small base token sets (fewer than 1000 tokens), this should be kept at its default of 1.0. For languages with larger base token sets (for example Japanese, Mandarin, or Korean), the suggested value is 0.9995.

 - `--spe_sample_size`: If the dataset is too large, consider using a sampled dataset indicated by a positive integer. By default, any negative value (default = -1) will use the entire dataset.

 - `--spe_train_extremely_large_corpus`: When training a sentencepiece tokenizer on very large amounts of text, the tokenizer will sometimes run out of memory or won't be able to hold so much data in RAM. At some point you might receive the following error: "Input corpus too large, try with train_extremely_large_corpus=true". If your machine has large amounts of RAM, it might still be possible to build the tokenizer using this flag. Note that the process will fail silently if it runs out of RAM.

 - `--log`: Whether the script should display log messages.
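
To give a feel for what building the text corpus from a manifest involves, here is a minimal sketch. It assumes the usual NeMo manifest layout of one JSON object per line with a `text` field, accepts comma-separated manifest paths, and lower-cases the text by default. The function `manifest_to_corpus`, the file names, and the sample sentences are hypothetical; this is not the script's actual implementation.

```python
import json
import os
import tempfile


def manifest_to_corpus(manifest_paths, corpus_path, lower_case=True):
    """Write the `text` field of every manifest entry to one line of a corpus file."""
    num_lines = 0
    with open(corpus_path, "w", encoding="utf-8") as out:
        # Several manifests can be given as a comma-separated string.
        for path in manifest_paths.split(","):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    text = json.loads(line)["text"]
                    if lower_case:
                        text = text.lower()
                    out.write(text + "\n")
                    num_lines += 1
    return num_lines


# Small demo with a made-up manifest in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    manifest = os.path.join(tmp, "train_manifest.json")
    entries = [
        {"audio_filepath": "a.wav", "duration": 1.2, "text": "Guten Morgen"},
        {"audio_filepath": "b.wav", "duration": 0.8, "text": "Hallo Welt"},
    ]
    with open(manifest, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

    corpus = os.path.join(tmp, "document.txt")
    num_lines = manifest_to_corpus(manifest, corpus)
    with open(corpus, encoding="utf-8") as f:
        corpus_lines = f.read().splitlines()

print(num_lines, corpus_lines)  # 2 ['guten morgen', 'hallo welt']
```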

In [12]:
!python ./scripts/process_asr_text_tokenizer.py --manifest=./data/processed/train_manifest_merged.json \
         --data_root=./data/processed/tokenizer \
         --vocab_size=1024 \
         --tokenizer="spe" \
         --log

INFO:root:Corpus already exists at path : ./data/processed/tokenizer/text_corpus/document.txt
[NeMo I 2022-05-31 04:45:16 sentencepiece_tokenizer:307] Processing ./data/processed/tokenizer/text_corpus/document.txt and store at ./data/processed/tokenizer/tokenizer_spe_bpe_v1024
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./data/processed/tokenizer/text_corpus/document.txt --model_prefix=./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer --vocab_size=1024 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=bpe --character_coverage=1.0 --bos_id=-1 --eos_id=-1 --normalization_rule_name=nmt_nfkc_cf
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./data/processed/tokenizer/text_corpus/document.txt
  input_format: 
  model_prefix: ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/tokenizer
  model_type: BPE
  vocab_size: 1024
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuf

That's it! Our tokenizer is now built and stored inside the `data_root` directory that we provided to the script.

We can inspect the tokenizer vocabulary itself. To keep it manageable, we will print just the first 10 tokens of the vocabulary:

In [13]:
!head -n 10 ./data/processed/tokenizer/tokenizer_spe_bpe_v1024/vocab.txt

##en
##er
##ch
d
##ei
##ie
s
##un
a
w
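
A quick way to read this listing is to separate the entries by their `##` prefix. The sketch below works on the ten tokens printed above and assumes, as in the WordPiece convention, that `##` marks a piece that continues a word while unprefixed entries start a word. That reading of `vocab.txt` is an assumption and worth checking against the tokenizer itself.

```python
# The ten tokens printed by the cell above.
sample_vocab = ["##en", "##er", "##ch", "d", "##ei", "##ie", "s", "##un", "a", "w"]

# Assumed convention: "##" marks a piece that continues a word.
continuation = [t for t in sample_vocab if t.startswith("##")]
word_initial = [t for t in sample_vocab if not t.startswith("##")]

print(continuation)  # ['##en', '##er', '##ch', '##ei', '##ie', '##un']
print(word_initial)  # ['d', 's', 'a', 'w']
```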
