# Putting it All Together

Previously, we did the tokenizer's work step by step. However, the Hugging Face Transformers API can handle all of this for us with a single high-level call.

When you call your tokenizer directly on a sentence, you get back inputs that are ready to pass through your model.

As shown below, it can tokenize a single sentence:

In [1]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

# model_inputs now contains everything the model needs as input.
# The exact keys depend on the model, so load the tokenizer and the model from the same checkpoint.
model_inputs = tokenizer(sequence)



In [2]:
model_inputs

{'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## It can handle multiple sequences at a time with no change in API

In [3]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [4]:
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]}

## It can pad according to several padding objectives

```
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
```
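To make the padding objectives concrete, here is a minimal sketch in plain Python that mimics them on the token IDs printed earlier. The helper `pad_batch` is hypothetical (it is not part of Transformers), and it assumes a padding ID of 0 and padding on the right, which is what this checkpoint's tokenizer is expected to use.

```python
# Token IDs copied from the tokenizer output shown earlier in this section.
batch = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]

PAD_ID = 0  # assumption: ID of the padding token for this checkpoint


def pad_batch(batch, target_length=None, pad_id=PAD_ID):
    """Hypothetical helper: right-pad each sequence and build attention masks.

    target_length=None mimics padding="longest"; an integer mimics
    padding="max_length" together with max_length=<that integer>.
    """
    if target_length is None:
        target_length = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        # Padding never shortens a sequence, so the pad count is clamped at 0.
        n_pad = max(target_length - len(ids), 0)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch(batch)                      # like padding="longest"
padded_8 = pad_batch(batch, target_length=8)   # like padding="max_length", max_length=8
print(padded_8["input_ids"][1])  # [101, 2061, 2031, 1045, 999, 102, 0, 0]
print(len(padded_8["input_ids"][0]))  # 16
```

Note that with a target length of 8 the first sequence keeps all 16 IDs: padding alone does not shorten anything, which is why truncation is a separate option.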

## Conversely, it can truncate according to different truncating objectives

```
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
```
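The effect of `max_length=8` with `truncation=True` can be sketched in plain Python. The helper `truncate_ids` below is hypothetical and rests on two assumptions: tokens are dropped from the right, and the special tokens at both ends count toward `max_length`.

```python
# Full tokenizer output for the first sentence, copied from earlier in this section.
full_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

CLS_ID, SEP_ID = 101, 102      # first and last IDs of the output above
content_ids = full_ids[1:-1]   # the sentence's own tokens, without special tokens


def truncate_ids(content_ids, max_length):
    """Hypothetical helper: keep the leading content tokens that fit, then re-add special tokens."""
    budget = max(max_length - 2, 0)  # two slots are reserved for the special tokens
    return [CLS_ID] + content_ids[:budget] + [SEP_ID]


truncated = truncate_ids(content_ids, max_length=8)
print(truncated)       # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
print(len(truncated))  # 8
```

A sequence that is already short enough, such as "So have I!" (6 IDs including special tokens), passes through unchanged under the same rule.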

## The tokenizer object can handle the conversion to specific framework tensors which can then be directly sent to the model

The code below asks the tokenizer to return tensors for different frameworks:

```
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
```
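The calls above pass `padding=True` because a tensor has to be rectangular: every row needs the same length. As a rough illustration in plain Python, the hypothetical helper `batch_shape` below accepts the padded batch and rejects the unpadded one. It only checks shapes and does not build real tensors.

```python
def batch_shape(batch):
    """Hypothetical helper: return (rows, columns), or raise if row lengths differ."""
    lengths = {len(row) for row in batch}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch), lengths.pop())


# Unpadded IDs for the two sentences, copied from earlier in this section.
unpadded = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    [101, 2061, 2031, 1045, 999, 102],
]
# The same batch padded to the longest sequence (padding ID 0 is an assumption).
padded = [unpadded[0], unpadded[1] + [0] * 10]

print(batch_shape(padded))  # (2, 16)

try:
    batch_shape(unpadded)
except ValueError as err:
    print("cannot form a tensor:", err)
```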

## Special Tokens

If we look at the input IDs returned by the tokenizer, we see that they differ slightly from what we got earlier with the manual steps.

One token ID was added at the beginning, and one at the end.

In [5]:
sequence = "I've been waiting for a HuggingFace course my whole life."

print("Tokenizer Outputs")
model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])
print("")
print("Manual Steps Output")
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

Tokenizer Outputs
[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

Manual Steps Output
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


In [8]:
print("Tokenizer Outputs Decoded")
print(tokenizer.decode(model_inputs["input_ids"]))
print("")
print("Manual Steps Output Decoded")
print(tokenizer.decode(ids))

Tokenizer Outputs Decoded
[CLS] i've been waiting for a huggingface course my whole life. [SEP]

Manual Steps Output Decoded
i've been waiting for a huggingface course my whole life.


The tokenizer added the special token `[CLS]` at the beginning and the special token `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well.

Note that some models don’t add special tokens, or add different ones; models may also add these special tokens only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
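The difference between the two outputs can be checked by hand. The sketch below uses only the IDs printed above and assumes that 101 and 102 are the IDs of `[CLS]` and `[SEP]` for this checkpoint, as the decoded output suggests. Other checkpoints may use different IDs or a different layout.

```python
# IDs copied from the "Manual Steps Output" and "Tokenizer Outputs" printed above.
manual_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
tokenizer_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

CLS_ID = 101  # assumption: ID of [CLS] for this checkpoint
SEP_ID = 102  # assumption: ID of [SEP] for this checkpoint

# Wrapping the manual IDs in the two special tokens reproduces the tokenizer output.
rebuilt = [CLS_ID] + manual_ids + [SEP_ID]
print(rebuilt == tokenizer_ids)  # True
```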

## Wrap up: handling multiple sequences (padding) and very long sequences (truncation) with the main API

In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

