# Tokenizers

* Outlines creating and using tokenizers with the `AutoTokenizer` class from the Hugging Face `Transformers` library
* All classes and functions are imported from the `Transformers` library

## Setup

In [1]:
# Checkpoint matching the cased BERT tokenizer loaded in the cells below
model_provider = "google-bert"
model_name = "bert-base-cased"
model = f"{model_provider}/{model_name}"

---

## What is a Tokenizer?

* Tokenizers translate text into data that a model can process
* Models can only process numbers
* Tokenizers therefore convert text inputs into numerical data
* In NLP tasks, the data that is generally processed is raw text, for example: `Jim Henson was a puppeteer`
* The goal is to find the most meaningful representation of the text as numbers to feed to the model
* Where possible, this should also be the smallest representation

---

## Types of Tokenizers

### Word-based

* Word-based tokenizers split raw text into words and assign a numerical representation for each of them
* There are different methods to split text into words
* For example, using Python's split function on whitespace:

In [2]:
tokenized_text = "Jim Henson was a puppeteer".split()

In [3]:
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']


* There are also variations of word tokenizers that have extra rules for punctuation
* With this kind of tokenizer, one can end up with some pretty large "vocabularies"
* **A vocabulary is defined by the total number of independent tokens that one has in the corpus (collection of texts)**
* Each word gets assigned an ID
	* Starting from 0
	* And going up to the size of the vocabulary minus one
* The model uses these IDs to identify each word
* Words like "dog" are represented differently from words like "dogs"
	* The model will initially have no way of knowing that "dog" and "dogs" are similar
	* It will identify the two words as unrelated
* The same applies to other similar words, like "run" and "running"
	* The model will not see these as being similar initially
* Finally, one needs a custom token to represent words that are not in the vocabulary
* This is known as the "unknown" token, often represented as:
	* \[UNK\]
	* \<unk\>
* It is generally a bad sign if one sees that the tokenizer is producing a lot of these tokens
	* As it was not able to retrieve a sensible representation of a word
	* And one is losing information along the way
* When crafting the vocabulary, the goal is for the tokenizer to map as few words as possible to the unknown token
* One way to reduce the number of unknown tokens is to go one level deeper, using a character-based tokenizer
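
The ID assignment and the unknown-token fallback described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library builds its vocabulary: the two-sentence corpus, the choice of ID 0 for `[UNK]`, and the helper name `encode_words` are all assumptions made for the example.

```python
# Toy word-level tokenizer: whitespace splitting plus an ID lookup table.
# The corpus and every name below are illustrative assumptions.
corpus = ["Jim Henson was a puppeteer", "Jim was a dog person"]

UNK = "[UNK]"

# Reserve ID 0 for the unknown token, then number words by first appearance.
vocab = {UNK: 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)


def encode_words(text):
    """Map each whitespace-separated word to its ID, falling back to [UNK]."""
    return [vocab.get(word, vocab[UNK]) for word in text.split()]


print(vocab)
print(encode_words("Jim was a puppeteer"))  # [1, 3, 4, 5]
print(encode_words("Jim has two dogs"))     # [1, 0, 0, 0]
```

In the second call, "dogs" maps to the unknown token even though "dog" is in the vocabulary, which is the information loss the bullets above warn about.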

---

### Character-based

* Character-based tokenizers split the text into characters, rather than words
* This has two primary benefits:
    * The vocabulary is much smaller
    * There are far fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters
* This approach is not perfect either
* Since the representation is now based on characters rather than words, one could argue that, intuitively, it is less meaningful:
	* Each character does not mean a lot on its own
	* Whereas that is the case with words
* Another thing to consider is that one will end up with a huge number of tokens to be processed by the model:
	* Whereas a word would only be a single token with a word-based tokenizer
	* It can easily turn into 10 or more tokens when converted into characters
* To get the best of both worlds, one can use a third technique that combines the two approaches: *subword tokenization*
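
The trade-off can be seen on the example sentence used earlier. The snippet below is a small sketch in plain Python; the word "sweater" is an arbitrary choice made for the illustration, picked because it does not occur in the sentence as a word but can be spelled from its characters.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()
char_tokens = list(text)  # each character, spaces included, is one token

print(len(word_tokens))  # 5
print(len(char_tokens))  # 26

# A word never seen as a whole can still be spelled from known characters.
char_vocab = set(char_tokens)
unseen_word = "sweater"
print(unseen_word in set(word_tokens))              # False
print(all(ch in char_vocab for ch in unseen_word))  # True
```

The sequence grows from 5 tokens to 26, which is the cost paid for avoiding unknown tokens.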

---

### Subword

* Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords
* For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly"
* These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly"
* These subwords end up providing a lot of semantic meaning:
	* For instance, "tokenization" can be split into "token" and "ization"
	* Both tokens carry semantic meaning while being space-efficient (only two tokens are needed to represent a long word)
* This allows one to have relatively good coverage with small vocabularies, and close to no unknown tokens
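
To make the splitting step concrete, here is a toy greedy longest-match-first splitter in the style of WordPiece-type tokenizers. It is a sketch under stated assumptions: the vocabulary is hand-picked rather than learned from word frequencies, the function name `split_into_subwords` is hypothetical, and the `##` prefix for word-internal pieces follows the convention used by BERT tokenizers.

```python
# Hypothetical subword vocabulary, chosen by hand for illustration only.
# "##" marks a piece that continues a word rather than starting one.
toy_vocab = {"annoying", "token", "##ly", "##ization", "##s", "[UNK]"}


def split_into_subwords(word, vocab):
    """Greedy longest-match-first split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no known piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


print(split_into_subwords("annoyingly", toy_vocab))    # ['annoying', '##ly']
print(split_into_subwords("tokenization", toy_vocab))  # ['token', '##ization']
print(split_into_subwords("xyz", toy_vocab))           # ['[UNK]']
```

A real subword tokenizer learns its vocabulary from a corpus, so that frequent words stay whole; only the lookup side is shown here.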

---

## Loading and Saving

* Loading and saving tokenizers is as simple as it is with models
* It is based on the same two methods:
	* `from_pretrained`
	* `save_pretrained`
* These methods will load or save the algorithm used by the tokenizer (a bit like the architecture of the model) as well as its vocabulary (a bit like the weights of the model)
* Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except one uses the `AutoTokenizer` class:

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [6]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

---

## Encoding

* **Translating text to numbers is known as encoding**
* Encoding is done in a two-step process:
	* The tokenization
	* Followed by the conversion to input IDs
* The first step is to split the text into words (or parts of words, punctuation symbols, etc.), called tokens
* There are multiple rules that can govern that process, which is why one needs to instantiate the tokenizer using the name of the model, to make sure one uses the same rules that were used when the model was pre-trained
* The second step is to convert those tokens into numbers, so one can build a tensor out of them and feed them to the model
* To do this, the tokenizer has a vocabulary, which is the part one downloads when one instantiates it with the `from_pretrained` method

---

## Tokenization

* The tokenization process is done by the `tokenize` method of the tokenizer
* The output of this method is a list of strings, known as tokens
* This tokenizer is a subword tokenizer: it splits the words until it gets tokens that can be represented by its vocabulary
* That is the case here with `Transformer`, which is split into two tokens:
	* `Trans`
	* `##former`

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)

In [8]:
print(tokens)

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']


---

## From Tokens to Input IDs

* The conversion to input IDs is handled by the `convert_tokens_to_ids` tokenizer method
* These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model

In [9]:
ids = tokenizer.convert_tokens_to_ids(tokens)

In [10]:
print(ids)

[7993, 170, 13809, 23763, 2443, 1110, 3014]
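
Conceptually, this step is a dictionary lookup. The sketch below rebuilds it without the library, using only the token and ID pairs printed in these notes; the names `toy_vocab`, `manual_ids`, and `manual_input_ids` are made up for the illustration. It also shows where the extra `101` and `102` in the earlier `tokenizer(...)` output come from: they are the IDs of BERT's `[CLS]` and `[SEP]` special tokens, which `convert_tokens_to_ids` does not add.

```python
# Token-to-ID pairs copied from the outputs printed in these notes.
toy_vocab = {
    "Using": 7993, "a": 170, "Trans": 13809, "##former": 23763,
    "network": 2443, "is": 1110, "simple": 3014,
}
CLS_ID, SEP_ID = 101, 102  # IDs of BERT's [CLS] and [SEP] special tokens

pieces = ["Using", "a", "Trans", "##former", "network", "is", "simple"]

# Second step of encoding: one dictionary lookup per token.
manual_ids = [toy_vocab[piece] for piece in pieces]
print(manual_ids)  # [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Calling the tokenizer directly also wraps the sequence in special tokens.
manual_input_ids = [CLS_ID] + manual_ids + [SEP_ID]
print(manual_input_ids)
```

The last list matches the `input_ids` returned when the tokenizer was called on the same sentence.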


---

## Decoding

* Decoding is going the other way around: from vocabulary indices, one wants to get a string
* This can be done with the `decode` method
* Note that the `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence
* This behavior will be extremely useful when one uses models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization)

In [11]:
decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014])

In [12]:
print(decoded_string)

Using a transformer network is simple
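
The grouping that `decode` performs can be approximated by hand. The function below is a simplified sketch, not the library's implementation: `join_wordpieces` is a hypothetical helper that only merges `##` continuation pieces and joins the result with spaces, applied here to the token list printed in the Tokenization section.

```python
def join_wordpieces(tokens):
    """Merge "##" continuation pieces into the preceding token, then join with spaces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)


pieces = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
print(join_wordpieces(pieces))  # Using a Transformer network is simple
```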
