### Tokenize and Preprocess

In [13]:
from transformers import BertTokenizer
from transformers import AutoTokenizer  # AutoTokenizer loads the fast (Rust-backed) tokenizer when one is available

- Here’s an example using the BERT tokenizer, which is a WordPiece tokenizer:

In [2]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"
tokenized_sequence = tokenizer.tokenize(sequence)
print(tokenized_sequence)

['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
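
The split above can be reproduced with a small sketch of greedy longest-match-first matching, which is the general idea behind WordPiece inference. This is an illustration under assumptions, not the library's implementation: `wordpiece_tokenize` and `toy_vocab` are hypothetical names, the vocabulary contains only the pieces seen in the output above, and words are separated by plain whitespace instead of BERT's real pre-tokenization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces by repeatedly taking the longest vocab match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary: only the pieces from the output above.
toy_vocab = {"A", "Titan", "R", "##T", "##X", "has", "24", "##GB",
             "of", "V", "##RA", "##M"}

sequence = "A Titan RTX has 24GB of VRAM"
tokens = [piece
          for word in sequence.split()
          for piece in wordpiece_tokenize(word, toy_vocab)]
print(tokens)
```

With this toy vocabulary, "RTX" has no whole-word entry, so the loop falls back to 'R' and then continues with the ## pieces.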


- Each token can be converted to an ID, and a list of IDs can be decoded back into text. Calling the tokenizer directly also adds the special tokens [CLS] and [SEP]:

In [15]:
inputs = tokenizer(sequence, return_offsets_mapping=True)
encoded_sequence = inputs['input_ids']
print(encoded_sequence)
decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)

[101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
[CLS] A Titan RTX has 24GB of VRAM [SEP]


- With a fast tokenizer, you can also check which tokens came from the same word (word_ids()) and map each token back to its character span in the original text (offset_mapping):

In [16]:
print(inputs.tokens())
print(inputs.word_ids())
print(inputs['offset_mapping'])

['[CLS]', 'A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M', '[SEP]']
[None, 0, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6, None]
[(0, 0), (0, 1), (2, 7), (8, 9), (9, 10), (10, 11), (12, 15), (16, 18), (18, 20), (21, 23), (24, 25), (25, 27), (27, 28), (0, 0)]
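
As a rough sketch of how these two outputs can be combined, the snippet below merges the character spans of all tokens that share a word ID and slices the original string to recover each word. The `word_ids` and `offsets` lists are copied from the output above so the example runs without downloading a model; `word_spans` is a hypothetical helper variable.

```python
sequence = "A Titan RTX has 24GB of VRAM"

# Values copied from inputs.word_ids() and inputs['offset_mapping'] above.
word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6, None]
offsets = [(0, 0), (0, 1), (2, 7), (8, 9), (9, 10), (10, 11), (12, 15),
           (16, 18), (18, 20), (21, 23), (24, 25), (25, 27), (27, 28), (0, 0)]

# Merge the spans of all tokens that belong to the same word.
word_spans = {}
for word_id, (start, end) in zip(word_ids, offsets):
    if word_id is None:
        continue  # special tokens ([CLS], [SEP]) map to no word
    if word_id in word_spans:
        first_start, _ = word_spans[word_id]
        word_spans[word_id] = (first_start, end)  # extend the span to this piece
    else:
        word_spans[word_id] = (start, end)

words = [sequence[start:end] for start, end in word_spans.values()]
print(words)
```

This kind of alignment is useful for token-level tasks, where labels given per word have to be spread over several sub-word tokens.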


### Batch Tokenize

- Padding: With padding=True, the tokenizer pads every sequence in the batch to the length of the longest one and returns an attention mask that marks the padding positions with 0.
- Truncation: At the other end of the spectrum, a sequence may be too long for the model to handle. Set the truncation parameter to True to truncate it to the maximum length accepted by the model.
- Return tensors: Set the return_tensors parameter to "pt" for PyTorch or "tf" for TensorFlow.

In [6]:
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
padded_sequences = tokenizer([sequence_a, sequence_b],
                             padding=True, truncation=True,
                             return_tensors="pt")
print(padded_sequences)

{'input_ids': tensor([[ 101, 1188, 1110,  170, 1603, 4954,  119,  102,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0],
        [ 101, 1188, 1110,  170, 1897, 1263, 4954,  119, 1135, 1110, 1120, 1655,
         2039, 1190, 1103, 4954,  138,  119,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
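
To make the padding and attention-mask logic concrete, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `truncate` and `pad_batch` are hypothetical helpers, the ID lists are copied from the output above, the pad ID 0 is taken from that output, and truncation is simplified to "keep the final [SEP] and cut the rest".

```python
def truncate(ids, max_length):
    """Cut a sequence to max_length while keeping its last ID (the [SEP] token)."""
    if len(ids) <= max_length:
        return ids
    return ids[:max_length - 1] + ids[-1:]


def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Pad every sequence to the longest one and build matching attention masks."""
    if max_length is not None:
        batch_ids = [truncate(ids, max_length) for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids))
                      for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# IDs copied from the two rows of the output above, without the padding.
short_ids = [101, 1188, 1110, 170, 1603, 4954, 119, 102]
long_ids = [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110,
            1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]

batch = pad_batch([short_ids, long_ids])
print(batch["attention_mask"][0])

# Hypothetical limit of 10 tokens, to show truncation and padding together.
truncated = pad_batch([short_ids, long_ids], max_length=10)
print(truncated["input_ids"][1])
```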


### Processing Sentence Pairs

- Some models classify pairs of sentences or answer questions. These tasks require two different sequences to be joined in a single input_ids entry, which is usually done with the help of special tokens such as the classifier ([CLS]) and separator ([SEP]) tokens. The token_type_ids then mark which sequence each token belongs to. For example, the BERT tokenizer builds its two-sequence input as follows:

In [7]:
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"
encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
print(decoded)
print(encoded_dict["token_type_ids"])

[CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
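
The layout of the pair can be sketched by hand. The helper `build_pair` below is hypothetical and only mirrors the pattern visible in the output: segment 0 covers [CLS], the first sequence, and the first [SEP]; segment 1 covers the second sequence and the final [SEP]. For simplicity the sentences are split on whitespace, so the token counts differ from the real WordPiece output above.

```python
def build_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Join two token lists BERT-style and mark which segment each token is in."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0: [CLS] + A + [SEP]; segment 1: B + final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


# Whitespace split is an assumption made for illustration, not real WordPiece.
tokens_a = "HuggingFace is based in NYC".split()
tokens_b = "Where is HuggingFace based?".split()

pair_tokens, pair_type_ids = build_pair(tokens_a, tokens_b)
print(pair_tokens)
print(pair_type_ids)
```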


Small tips:
- For models that use the function apply_chunking_to_forward(), chunk_size defines the number of output embeddings that are computed in parallel, and thus the trade-off between memory and time complexity. If chunk_size is set to 0, no feed-forward chunking is done.
    https://huggingface.co/docs/transformers/glossary#feed-forward-chunking
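
The idea behind feed-forward chunking can be illustrated without any framework. Because a feed-forward layer acts on each position independently, processing the sequence in slices gives the same result as processing it at once, while only one slice of intermediate values is held at a time. The sketch below is a toy under assumptions: `feed_forward` and `chunked_forward` are hypothetical stand-ins, not the library's apply_chunking_to_forward(), and plain Python lists replace tensors.

```python
def feed_forward(rows):
    """Toy position-wise layer: every row (one token embedding) is transformed
    on its own, here with a scale, a shift, and a ReLU."""
    return [[max(0.0, 2.0 * x + 1.0) for x in row] for row in rows]


def chunked_forward(forward_fn, chunk_size, rows):
    """Apply forward_fn to chunk_size rows at a time; 0 disables chunking."""
    if chunk_size == 0:
        return forward_fn(rows)
    output = []
    for start in range(0, len(rows), chunk_size):
        # Only this slice's intermediate values exist at any one moment.
        output.extend(forward_fn(rows[start:start + chunk_size]))
    return output


# Six hypothetical token embeddings of width four.
hidden_states = [[float(i + j) - 3.0 for j in range(4)] for i in range(6)]

full = chunked_forward(feed_forward, 0, hidden_states)
chunked = chunked_forward(feed_forward, 2, hidden_states)
print(full == chunked)
```

A smaller chunk size lowers peak memory at the cost of more sequential steps, which is the trade-off described in the tip above.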