# Tokenizer

* [Tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html)

> A tokenizer is in charge of preparing the inputs for a model. 

## Fast Tokenizer

Use the **Fast** tokenizers rather than the pure-Python ones.

> Most of the tokenizers are available in two flavors: a full python implementation and a “Fast” implementation based on the Rust library tokenizers. 

## Base Classes
[PreTrainedTokenizerFast](https://huggingface.co/transformers/main_classes/tokenizer.html#pretrainedtokenizerfast) implements the common methods for encoding string inputs into model inputs. It relies on PreTrainedTokenizerBase.

* **Tokenizing** <br>splitting strings into sub-word token strings, encoding tokens into integer ids, and decoding ids back to tokens.
* **Managing new tokens** <br>adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).
* **Managing special tokens** (mask, CLS/beginning-of-sentence, etc.)<br>adding them, assigning them to attributes in the tokenizer for easy access, and making sure they are not split during tokenization.
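
The three responsibilities above can be illustrated with a toy example. This is a minimal sketch, not the library's implementation: `ToyTokenizer` and its vocabulary are hypothetical, and it splits on whitespace instead of producing sub-word tokens.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer, for illustration only."""

    def __init__(self, words):
        # Special tokens get the first ids and named attributes for easy access.
        self.cls_token, self.sep_token, self.unk_token = "[CLS]", "[SEP]", "[UNK]"
        self.vocab = {}
        for token in [self.cls_token, self.sep_token, self.unk_token, *words]:
            self.vocab.setdefault(token, len(self.vocab))

    def add_tokens(self, new_tokens):
        """Append unseen tokens to the vocabulary and return how many were added."""
        added = 0
        for token in new_tokens:
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
                added += 1
        return added

    def encode(self, text):
        """Lowercase, split on whitespace, map tokens to ids, wrap in [CLS] ... [SEP]."""
        tokens = [self.cls_token, *text.lower().split(), self.sep_token]
        unk_id = self.vocab[self.unk_token]
        return [self.vocab.get(token, unk_id) for token in tokens]

    def decode(self, token_ids):
        """Map ids back to token strings and join them with spaces."""
        id_to_token = {i: t for t, i in self.vocab.items()}
        return " ".join(id_to_token[i] for i in token_ids)


toy = ToyTokenizer(["a", "tokenizer", "prepares", "inputs"])
toy_ids = toy.encode("A tokenizer prepares inputs")
print(toy_ids)
print(toy.decode(toy_ids))

before = toy.encode("a model")   # "model" is unknown, so it maps to the [UNK] id
toy.add_tokens(["model"])
after = toy.encode("a model")    # "model" now has its own id
print(before, after)
```

The real tokenizers expose the same ideas through `encode`, `decode` and `add_tokens`, with special tokens available as attributes such as `cls_token`.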

In [1]:
import tensorflow as tf
import transformers

from transformers import (
    TFDistilBertForSequenceClassification,
    DistilBertTokenizerFast
)



# Configurations

In [2]:
model_name = 'distilbert-base-uncased'
max_sequence_length = 256

In [3]:
model = TFDistilBertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)
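# truncation, padding, max_length and return_tensors are arguments of the
# tokenizer call, not of from_pretrained(), so they are passed when tokenizing,
# e.g. tokenizer(text, truncation=True, padding=True,
#                max_length=max_sequence_length, return_tensors="tf").
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)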


# Tokenize

Note that the ```tokenizer``` returns a ```transformers.tokenization_utils_base.BatchEncoding``` instance, which you may need to convert into a dictionary before feeding it into the model.
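
A minimal sketch of that conversion, assuming only that `transformers` is installed. To avoid downloading a model, the `BatchEncoding` is built by hand from the ids this notebook prints for its example sentence; in practice it would come from the tokenizer call.

```python
from transformers import BatchEncoding

# Hand-built stand-in for the object returned by tokenizer(...).
encoding = BatchEncoding({
    "input_ids": [101, 1037, 19204, 17629, 2003, 1999, 3715, 1997,
                  8225, 1996, 20407, 2005, 1037, 2944, 1012, 102],
    "attention_mask": [1] * 16,
})

# BatchEncoding behaves like a mapping, so dict() yields a plain dictionary.
as_dict = dict(encoding)
print(type(as_dict))
print(list(as_dict))
```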

## call

The [```__call__```](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.__call__) method, invoked as ```tokenizer(input)```, generates the ```token ids``` and ```attention masks``` to feed into the model.

Attention masks are flags that tell the model whether each token should be used (1) or ignored (0), for example because it is padding.
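
How the mask relates to padding can be sketched in plain Python. This is an illustration, not the library's code: `pad_batch` is a hypothetical helper, and the padding id 0 is an assumption that matches the `[PAD]` id of the BERT uncased vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[101, 1037, 19204, 17629, 102], [101, 2944, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

In the single-sentence example that follows there is nothing to pad, which is why every mask value is 1.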




In [4]:
tokenized = tokenizer("A tokenizer is in charge of preparing the inputs for a model.")
print(type(tokenized))
print(tokenized)

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [101, 1037, 19204, 17629, 2003, 1999, 3715, 1997, 8225, 1996, 20407, 2005, 1037, 2944, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
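
Options such as `padding`, `truncation` and `max_length` take effect when they are passed to the tokenizer call. The sketch below assumes the `tokenizers` and `transformers` packages are installed. To run without downloading a model, it wraps a hypothetical six-entry word-level vocabulary in `PreTrainedTokenizerFast`; with a pretrained tokenizer the call looks the same.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical tiny vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "a": 2, "tokenizer": 3, "prepares": 4, "inputs": 5}
backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
toy_fast = PreTrainedTokenizerFast(
    tokenizer_object=backend, pad_token="[PAD]", unk_token="[UNK]"
)

batch = toy_fast(
    ["a tokenizer prepares inputs", "a tokenizer"],
    padding=True,      # pad to the longest sequence in the batch
    truncation=True,   # cut sequences longer than max_length
    max_length=3,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The first sentence is truncated to three tokens and the second is padded to the same length, so its last mask value is 0.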


## encode

The [encode](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.encode) method generates only the ```token ids```, without the attention masks.

In [6]:
ids = tokenizer.encode("A tokenizer is in charge of preparing the inputs for a model.")
print(ids)

[101, 1037, 19204, 17629, 2003, 1999, 3715, 1997, 8225, 1996, 20407, 2005, 1037, 2944, 1012, 102]


## decode

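The [decode](https://huggingface.co/transformers/internal/tokenization_utils.html#transformers.tokenization_utils_base.PreTrainedTokenizerBase.decode) method converts the ```ids``` back into a string. Special tokens such as ```[CLS]``` and ```[SEP]``` are kept unless ```skip_special_tokens=True``` is passed.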

In [7]:
tokenizer.decode(ids)

'[CLS] a tokenizer is in charge of preparing the inputs for a model. [SEP]'