# Tokenization

Transformer models cannot receive raw strings as input; instead, they assume the text has been tokenized and encoded as numerical vectors.

**Tokenization** is the step of breaking down a string into the atomic units used in the model. There are several tokenization strategies one can adopt, and the optimal splitting of words into subunits is usually learned from the corpus. Let's consider two extreme cases: character and word tokenization.

## Libraries

In [1]:
import pandas as pd

In [2]:
import torch

In [3]:
import torch.nn.functional as F

In [4]:
from transformers import AutoTokenizer

The ``AutoTokenizer`` class belongs to a larger set of "auto" classes whose job is to automatically retrieve the model's configuration, pretrained weights, or vocabulary from the name of the checkpoint. This allows you to quickly switch between models, but if you wish to load the specific class manually, you can do so as well. For example, we could have loaded the DistilBERT tokenizer as follows:

In [5]:
from transformers import DistilBertTokenizer
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

## Character Tokenization

The simplest tokenization scheme is to feed each character individually to the model.

In Python, str objects are really arrays under the hood, which allows us to quickly implement character-level tokenization with just one line of code:

In [6]:
text = 'Tokenizing text is a core task of NLP.'

In [7]:
tokenized_text = list(text)
print(tokenized_text)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'i', 'n', 'g', ' ', 't', 'e', 'x', 't', ' ', 'i', 's', ' ', 'a', ' ', 'c', 'o', 'r', 'e', ' ', 't', 'a', 's', 'k', ' ', 'o', 'f', ' ', 'N', 'L', 'P', '.']


### Numericalization

Models also expect each character to be converted to an integer, a process sometimes called **numericalization**.

In [8]:
token2ids = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2ids)

{' ': 0, '.': 1, 'L': 2, 'N': 3, 'P': 4, 'T': 5, 'a': 6, 'c': 7, 'e': 8, 'f': 9, 'g': 10, 'i': 11, 'k': 12, 'n': 13, 'o': 14, 'r': 15, 's': 16, 't': 17, 'x': 18, 'z': 19}


This gives us a mapping from each character in our vocabulary to a unique integer. We can now use ``token2ids`` to transform the tokenized text to a list of integers:

In [9]:
input_ids = [token2ids[token] for token in tokenized_text]
print(input_ids)

[5, 14, 12, 8, 13, 11, 19, 11, 13, 10, 0, 17, 8, 18, 17, 0, 11, 16, 0, 6, 0, 7, 14, 15, 8, 0, 17, 6, 16, 12, 0, 14, 9, 0, 3, 2, 4, 1]
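One way to check that this mapping loses no information is to invert it and decode the IDs back into the original string. The snippet below is a minimal, self-contained sketch; the name `ids2token` is our own choice for illustration and does not come from any library.

```python
# Rebuild the character-level vocabulary from the example sentence.
text = 'Tokenizing text is a core task of NLP.'
tokenized_text = list(text)
token2ids = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
input_ids = [token2ids[token] for token in tokenized_text]

# Invert the mapping (ID -> character); `ids2token` is a hypothetical name.
ids2token = {idx: ch for ch, idx in token2ids.items()}

# Decoding is a dictionary lookup per ID followed by a join.
decoded_text = ''.join(ids2token[idx] for idx in input_ids)
print(decoded_text == text)  # True
```

Because every character has its own ID, the round trip is exact. This will no longer hold once rare tokens are collapsed into a shared unknown token.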


### One Hot Encoding

The last step is to convert ``input_ids`` to a 2D tensor of one-hot vectors. One-hot vectors are frequently used in machine learning to encode categorical data, which can be either ordinal or nominal. For example, suppose we wanted to encode the names of characters in the Transformers TV series. One way to do this would be to map each name to a unique ID, as follows:

In [10]:
categorical_df = pd.DataFrame(
    {'Name': ['Bumblebee', 'Optimus Prime', 'Megatron',], 'Label ID': [0, 1, 2,]})
categorical_df

Unnamed: 0,Name,Label ID
0,Bumblebee,0
1,Optimus Prime,1
2,Megatron,2


The problem with this approach is that it creates a fictitious ordering between the names, and neural networks are really good at learning these kinds of relationships. So instead, we create a new column for each category and assign a 1 where the category is true, and a 0 otherwise. In pandas, this can be implemented with the ``get_dummies()`` function as follows:

In [11]:
pd.get_dummies(categorical_df['Name'])

Unnamed: 0,Bumblebee,Megatron,Optimus Prime
0,True,False,False
1,False,False,True
2,False,True,False


The rows of this ``DataFrame`` are the one-hot vectors, which have a single "hot" entry with a 1 and 0s everywhere else. Now, looking at our ``input_ids``, we have a similar problem: the elements create an ordinal scale. This means that adding or subtracting two IDs is a meaningless operation, since the result is a new ID that represents another random token.

On the other hand, the result of adding two one-hot encodings can easily be interpreted: the two entries that are "hot" indicate that the corresponding tokens co-occur. We can create the one-hot encodings in PyTorch by converting ``input_ids`` to a tensor and applying the ``one_hot()`` function as follows:

In [12]:
input_ids = torch.tensor(input_ids)
input_ids

tensor([ 5, 14, 12,  8, 13, 11, 19, 11, 13, 10,  0, 17,  8, 18, 17,  0, 11, 16,
         0,  6,  0,  7, 14, 15,  8,  0, 17,  6, 16, 12,  0, 14,  9,  0,  3,  2,
         4,  1])

In [13]:
one_hot_encodings = F.one_hot(input_ids, num_classes=len(token2ids))
one_hot_encodings.shape

torch.Size([38, 20])

In [14]:
one_hot_encodings

tensor([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0,

For each of the 38 input tokens we now have a one-hot vector with 20 dimensions, since our vocabulary consists of 20 unique characters.

Note: It is important to always set ``num_classes`` in the ``one_hot()`` function because otherwise the one-hot vectors may end up being shorter than the length of the vocabulary (and need to be padded with zeros manually).
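To make the note above concrete, here is a plain-Python sketch of one-hot encoding. It is not PyTorch's implementation: the helper `one_hot` below is a hypothetical stand-in that only mimics the documented behaviour of inferring the number of classes as one more than the largest ID when no value is given.

```python
def one_hot(ids, num_classes=None):
    """Return one one-hot row per ID (illustrative stand-in, not torch's one_hot)."""
    if num_classes is None:
        # Mimic the inferred default: largest ID present + 1.
        num_classes = max(ids) + 1
    return [[1 if col == idx else 0 for col in range(num_classes)] for idx in ids]

vocab_size = 20       # size of the character vocabulary built earlier
ids = [5, 14, 12, 8]  # IDs for 'T', 'o', 'k', 'e'

inferred = one_hot(ids)              # width inferred from the data
explicit = one_hot(ids, vocab_size)  # width fixed to the vocabulary size

print(len(inferred[0]))  # 15: too short, because the largest ID seen is 14
print(len(explicit[0]))  # 20
```

Any input that happens not to contain the highest ID in the vocabulary would produce vectors that are too narrow, which is why passing the vocabulary size explicitly is the safer habit.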

By examining the first vector, we can verify that a 1 appears in the location indicated by input_ids[0]:

In [15]:
print(f'Token: {tokenized_text[0]}')

Token: T


In [16]:
print(f'Tensor index: {input_ids[0]}')

Tensor index: 5


In [17]:
print(f'One-hot: {one_hot_encodings[0]}')

One-hot: tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
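As a small illustration of the earlier point about adding encodings, the sketch below contrasts adding two IDs with adding two one-hot vectors. It uses plain Python lists rather than tensors, the helper `one_hot_vector` is hypothetical, and the ID values are copied from the character vocabulary printed above.

```python
num_classes = 20
id_T, id_o = 5, 14  # IDs of 'T' and 'o' in the character vocabulary

def one_hot_vector(idx, size):
    # Hypothetical helper: a list of zeros with a single 1 at position idx.
    return [1 if pos == idx else 0 for pos in range(size)]

# Adding the IDs yields 19, which is the ID of an unrelated character ('z').
print(id_T + id_o)  # 19

# Adding the one-hot vectors keeps both tokens identifiable.
summed = [a + b for a, b in zip(one_hot_vector(id_T, num_classes),
                                one_hot_vector(id_o, num_classes))]
hot_positions = [pos for pos, value in enumerate(summed) if value == 1]
print(hot_positions)  # [5, 14]
```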


Notice that character-level tokenization ignores any structure in the text and treats the whole string as a stream of characters. Although this helps deal with misspellings and rare words, the main drawback is that linguistic structures such as words need to be learned from the data. This requires significant compute, memory, and data. For this reason, character tokenization is rarely used in practice. Instead, some structure of the text is preserved during the tokenization step.

## Word Tokenization

Instead of splitting the text into characters, we split it into words and map each word to an integer. Using words from the outset enables the model to skip the step of learning words from characters, and thereby reduces the complexity of the training process.

One simple class of word tokenizers uses whitespace to tokenize the text. We can do this by applying Python's ``split()`` string method directly to the raw text (just like we did to measure the tweet lengths):

In [18]:
tokenized_text = text.split()
print(tokenized_text)

['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP.']


One potential problem with this tokenization scheme is that punctuation is not accounted for, so 'NLP.' is treated as a single token. Given that words can include declensions, conjugations, or misspellings, the size of the vocabulary can easily grow into the millions!

Some word tokenizers have extra rules for punctuation. One can also apply stemming or lemmatization, which normalizes words to their stem (e.g., 'great', 'greater', and 'greatest' all become 'great'), at the expense of losing some information in the text.
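As a rough sketch of what such an extra rule might look like, the regular expression below separates punctuation from words. This is one simple choice made for illustration, not the rule set of any particular tokenizer.

```python
import re

text = 'Tokenizing text is a core task of NLP.'

# \w+ matches a run of word characters; [^\w\s] matches a single character
# that is neither a word character nor whitespace (i.e. punctuation).
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Tokenizing', 'text', 'is', 'a', 'core', 'task', 'of', 'NLP', '.']
```

With this rule, 'NLP' and '.' become separate tokens, so 'NLP' and 'NLP.' no longer occupy two different vocabulary entries.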

#### Problem with having a large vocabulary

Having a large vocabulary is a problem because it requires neural networks to have an enormous number of parameters.

To illustrate this, suppose we have 1 million unique words and want to compress the 1-million-dimensional input vectors to 1-thousand-dimensional vectors in the first layer of our neural network. This is a standard step in most NLP architectures, and the resulting weight matrix of this first layer would contain 1 million x 1 thousand = 1 billion weights. This is already comparable to the largest GPT-2 model, which has around 1.5 billion parameters in total!
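The arithmetic can be checked in a couple of lines. The memory estimate assumes the weights are stored as 32-bit floats (4 bytes each), which is our assumption and is not stated in the text above.

```python
vocab_size = 1_000_000  # unique words, i.e. the input dimension
hidden_size = 1_000     # output dimension of the first layer

# A dense layer (ignoring biases) has one weight per (input, output) pair.
num_weights = vocab_size * hidden_size
print(f'{num_weights:,}')  # 1,000,000,000

# Assumption: 4 bytes per weight (float32).
size_gb = num_weights * 4 / 1e9
print(size_gb)  # 4.0
```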

A common approach to tackle this is to limit the vocabulary and discard rare words by considering, say, the 100,000 most common words in the corpus. Words that are not part of the vocabulary are classified as "unknown" and mapped to a shared UNK token. This means that we lose some potentially important information in the process of word tokenization, since the model has no information about words associated with UNK.
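The following is a minimal sketch of this idea on a made-up toy corpus, keeping only the 3 most frequent words instead of 100,000. The `[UNK]` spelling, the reserved ID 0, and the helper `encode` are illustrative choices rather than a fixed convention.

```python
from collections import Counter

# Toy corpus, invented for this example.
corpus = 'the cat sat on the mat the dog sat on the log'.split()
max_vocab_size = 3  # keep only the 3 most frequent words

counts = Counter(corpus)

# Reserve ID 0 for the shared unknown token, then add the frequent words.
vocab = {'[UNK]': 0}
for word, _ in counts.most_common(max_vocab_size):
    vocab[word] = len(vocab)

def encode(words):
    # Words outside the vocabulary all collapse to the [UNK] ID.
    return [vocab.get(word, vocab['[UNK]']) for word in words]

encoded = encode('the cat sat on the sofa'.split())
print(encoded)
```

Both 'cat' (too rare to make the cut) and 'sofa' (never seen) end up with ID 0, so after encoding the model cannot tell them apart.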

Another alternative is subword tokenization:

## Subword Tokenization

The basic idea behind subword tokenization is to combine the best aspects of character and word tokenization. On the one hand, we want to split rare words into smaller units to allow the model to deal with complex words and misspellings. On the other hand, we want to keep frequent words as unique entities so that we can keep the length of our inputs to a manageable size.

The main distinguishing feature of subword tokenization (as well as word tokenization) is that it is learned from the pretraining corpus using a mix of statistical rules and algorithms.

### WordPiece

This is a subword tokenization algorithm used by BERT and DistilBERT tokenizers.
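To give a feel for how a trained WordPiece vocabulary is applied to a word, here is a sketch of greedy longest-match-first splitting. It covers only the splitting step, not how the vocabulary is learned, and it uses a tiny hypothetical vocabulary (`toy_vocab`) invented for this example rather than the real DistilBERT vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single word into subword units."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece fits: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, chosen only for this illustration.
toy_vocab = {'token', '##izing', 'text', 'nl', '##p', '##s'}

for word in ['tokenizing', 'text', 'nlp', 'tokens', 'xyz']:
    print(word, '->', wordpiece_tokenize(word, toy_vocab))
```

Frequent words such as 'text' stay whole, rarer ones are assembled from known pieces, and a word with no matching pieces falls back to the unknown token.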

Hugging Face Transformers provides a convenient ``AutoTokenizer`` class that allows you to quickly load the tokenizer associated with a pretrained model: we just call its ``from_pretrained()`` method, providing the ID of a model on the Hub or a local file path. Let's start by loading the tokenizer for DistilBERT:

In [19]:
model_ckpt = 'distilbert-base-uncased'

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

The first call to ``AutoTokenizer.from_pretrained()`` displays a progress bar showing which parameters of the pretrained tokenizer are loaded from the Hugging Face Hub. On subsequent calls, the tokenizer is loaded from the local cache, usually at ~/.cache/huggingface.

Passing our example text to the tokenizer gives the following:

In [21]:
encoded_text = tokenizer(text)
print(encoded_text)

{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Here, the tokens are mapped to integers in the ``input_ids`` field. The role of ``attention_mask`` will be discussed later.

We can convert the ``input_ids`` back into tokens by using the tokenizer's ``convert_ids_to_tokens()`` method:

In [22]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']


We observe the following:

1. Special ``[CLS]`` and ``[SEP]`` tokens have been added to the start and end of the sequence. These tokens differ from model to model, but their main role is to indicate the start and end of a sequence.

2. The tokens have each been lowercased, which is a feature of this particular checkpoint.

3. "tokenizing" and "NLP" have each been split into two tokens, which makes sense since they are not common words. The ``##`` prefix in ``##izing`` and ``##p`` means that the preceding string is not whitespace; any token with this prefix should be merged with the previous token when you convert the tokens back to a string.

The tokenizer loaded through ``AutoTokenizer`` has a ``convert_tokens_to_string()`` method for converting tokens back to a string:

In [23]:
print(tokenizer.convert_tokens_to_string(tokens))

[CLS] tokenizing text is a core task of nlp. [SEP]
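The merging rule for the ``##`` prefix can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the library's implementation; the helper `merge_wordpieces` is hypothetical and, unlike the output above, it does not remove the space before punctuation.

```python
def merge_wordpieces(tokens):
    """Join tokens with spaces, gluing '##' pieces onto the previous token."""
    words = []
    for token in tokens:
        if token.startswith('##') and words:
            words[-1] += token[2:]  # continuation: append without a space
        else:
            words.append(token)
    return ' '.join(words)

tokens = ['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core',
          'task', 'of', 'nl', '##p', '.', '[SEP]']
merged = merge_wordpieces(tokens)
print(merged)  # [CLS] tokenizing text is a core task of nlp . [SEP]
```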


The tokenizer also has several attributes that provide information about it. For example, we can inspect the vocabulary size:

In [24]:
tokenizer.vocab_size

30522

and the corresponding model's maximum context size:

In [25]:
tokenizer.model_max_length

512

Another interesting attribute to know about is the names of the fields that the model expects in its forward pass:

In [26]:
tokenizer.model_input_names

['input_ids', 'attention_mask']

When using pretrained models, it is really important to make sure that you use the same tokenizer that the model was trained with. From the model's perspective, switching the tokenizer is like shuffling the vocabulary. If everyone around you started swapping random words like "house" for "cat," you'd have a hard time understanding what was going on too!

## Tokenizing the Whole Dataset

To tokenize the whole corpus, we'll use the ``map()`` method of our ``DatasetDict`` object. This method provides a convenient way to apply a processing function to each element in a dataset.

The following function tokenizes a batch of examples:

In [27]:
def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

In [28]:
from datasets import load_dataset

In [29]:
emotion = load_dataset('emotion')



In [30]:
print(tokenize(emotion['train'][:2]))

{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


Here we can see the result of padding: the first element of ``input_ids`` is shorter than the second, so zeros have been added to that element to make them the same length. These zeros correspond to the ``[PAD]`` token in the vocabulary, and the set of special tokens also includes the ``[CLS]`` and ``[SEP]`` tokens that we encountered earlier:

|Special Token|``[PAD]``|``[UNK]``|``[CLS]``|``[SEP]``|``[MASK]``|
|:-|-:|-:|-:|-:|-:|
|Special Token ID|0|100|101|102|103|


Also note that in addition to returning the encoded tweets as ``input_ids``, the tokenizer returns a list of ``attention_mask`` arrays. This is because we do not want the model to get confused by the additional padding tokens: the attention mask allows the model to ignore the padded parts of the input.
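To make the relationship between padding and the attention mask concrete, here is a plain-Python sketch that pads a batch by hand. It imitates the shape of the tokenizer's output but is not how the library implements it; the helper `pad_batch` is hypothetical, and the second sequence is a shortened, made-up example.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

batch = pad_batch([[101, 1045, 2134, 2102, 2514, 26608, 102],
                   [101, 1045, 2064, 2175, 102]])
print(batch['input_ids'][1])       # [101, 1045, 2064, 2175, 102, 0, 0]
print(batch['attention_mask'][1])  # [1, 1, 1, 1, 1, 0, 0]
```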

Now the ``tokenize()`` function can be applied across all the splits in the corpus in a single line of code:

In [31]:
emotion_encoded = emotion.map(tokenize, batched=True, batch_size=None)


By default, the ``map()`` method operates individually on every example in the corpus, so setting ``batched=True`` will encode the tweets in batches. Because we've set ``batch_size=None``, our ``tokenize()`` function will be applied on the full dataset as a single batch. This ensures that the input tensors and attention masks have the same shape globally, and we can see that this operation has added new ``input_ids`` and ``attention_mask`` columns to the dataset:
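The effect of the batch size on the padded shape can be imitated without any library. The sketch below assumes, as with ``padding=True``, that each batch is padded to its own longest sequence; the sequence lengths and the helper names are invented for illustration.

```python
def padded_width(sequences):
    # When padding to the longest sequence, the width is the longest length in the batch.
    return max(len(seq) for seq in sequences)

# Toy token-ID sequences of different lengths (contents are placeholders).
lengths = [7, 23, 5, 12, 9, 30]
dataset = [[1] * n for n in lengths]

def widths_per_batch(data, batch_size):
    """Padded width of each batch when the data is processed in chunks."""
    return [padded_width(data[i:i + batch_size])
            for i in range(0, len(data), batch_size)]

print(widths_per_batch(dataset, batch_size=2))             # [23, 12, 30]
print(widths_per_batch(dataset, batch_size=len(dataset)))  # [30]
```

With small batches, each batch ends up with a different width, whereas a single batch covering the whole dataset yields one global width.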

In [32]:
print(emotion_encoded['train'].column_names)

['text', 'label', 'input_ids', 'attention_mask']


In [34]:
emotion_encoded['train'][:2]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake'],
 'label': [0, 0],
 'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772,
   2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,


Alternatively, data collators can be used to dynamically pad the tensors in each batch. Padding globally, as we did here, will come in handy in the next section, where we extract a feature matrix from the whole corpus.