# Putting It All Together

* This section ties together every step from tokenizer to model, using the `AutoTokenizer` and `AutoModelForSequenceClassification` classes from the Hugging Face `Transformers` library
* All Hugging Face classes and functions used here are imported from the `transformers` package

## Setup

In [4]:
model_provider = "distilbert"
model_name = "distilbert-base-uncased"
model = f"{model_provider}/{model_name}"

---

## Mimicking the Pipeline Function

* The Transformers API handles all of the preceding steps through one high-level call: invoking the tokenizer object directly
* Calling the tokenizer directly on a sentence returns inputs that are ready to pass to the model:

In [5]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

* Here, the `model_inputs` variable contains everything the model needs to run; for DistilBERT, that is the input IDs and the attention mask
* As the examples below show, calling the tokenizer this way is flexible
* First, it can tokenize a single sequence:

In [6]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

It also handles multiple sequences at a time, with no change in the API:

In [7]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

It can pad according to several objectives:

In [8]:
# Pad every sequence to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Pad every sequence to the model max length (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Pad every sequence to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
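
The padding strategies above can be illustrated without the library. The following is a minimal pure-Python sketch, not the Transformers implementation: `pad_batch` is a hypothetical helper, the pad ID of 0 is an assumption (it matches the `[PAD]` token in BERT-style vocabularies), and the shortened ID lists are illustrative placeholders rather than real tokenizer output.

```python
from typing import Dict, List, Optional

PAD_ID = 0  # assumed pad token ID; a real tokenizer exposes it as tokenizer.pad_token_id


def pad_batch(batch: List[List[int]], max_length: Optional[int] = None) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to a common length and build the matching attention mask.

    With max_length=None the target is the longest sequence in the batch
    (the idea behind padding="longest"); otherwise it is the given length
    (the idea behind padding="max_length").
    """
    target = max(len(ids) for ids in batch) if max_length is None else max_length
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # padding never shortens a sequence
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Illustrative ID lists (hypothetical, shortened for readability)
batch = [[101, 1045, 2310, 102], [101, 2061, 102]]

longest = pad_batch(batch)
fixed = pad_batch(batch, max_length=8)

print(longest["input_ids"])     # the second row gains one pad ID
print(fixed["attention_mask"])  # zeros mark the padded positions
```

As in the sketch, padding alone does not shorten anything: a sequence longer than the target keeps its full length unless truncation is also enabled.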

It can also truncate sequences:

In [9]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Truncate sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Truncate sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
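
What truncation does to a single encoded sequence can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's algorithm: `truncate_with_specials` is a hypothetical helper, and it assumes a BERT-style layout in which the first and last IDs are special tokens that should survive truncation. The ID list is the one this document prints for the example sentence in the Special Tokens section.

```python
from typing import List


def truncate_with_specials(ids: List[int], max_length: int) -> List[int]:
    """Cut a sequence to max_length while keeping its first and last (special) IDs."""
    if len(ids) <= max_length:
        return ids  # already short enough, nothing to do
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    body = ids[1:-1]               # content tokens without the special IDs
    kept = body[: max_length - 2]  # drop content tokens from the right
    return [ids[0]] + kept + [ids[-1]]


full = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
        2607, 2026, 2878, 2166, 1012, 102]

short = truncate_with_specials(full, max_length=8)
print(short)  # 8 IDs: the first special ID, six content IDs, the last special ID
```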

* The `tokenizer` object can convert its output to framework-specific tensors, which can then be sent directly to the model
* In the following code sample, the `return_tensors` argument selects the output type:
	* `"pt"` returns PyTorch tensors
	* `"np"` returns NumPy arrays

In [10]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
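
Tensors must be rectangular, which is why `padding=True` accompanies `return_tensors` in the cell above. The following sketch shows that constraint in plain Python; `is_rectangular` is a hypothetical helper, and the ID lists are illustrative placeholders, not real tokenizer output.

```python
from typing import List


def is_rectangular(batch: List[List[int]]) -> bool:
    """Return True when every row has the same length, as a 2-D tensor requires."""
    return len({len(row) for row in batch}) <= 1


ragged = [[101, 1045, 2310, 102], [101, 2061, 102]]     # unpadded, lengths 4 and 3
padded = [[101, 1045, 2310, 102], [101, 2061, 102, 0]]  # second row padded with an assumed pad ID of 0

print(is_rectangular(ragged))  # False
print(is_rectangular(padded))  # True
```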

---

## Special Tokens

The input IDs returned by calling the tokenizer directly differ slightly from those produced by running `tokenize` and `convert_tokens_to_ids` as separate steps:

In [11]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

In [12]:
print(model_inputs["input_ids"])
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


* One token ID was added at the beginning, and one at the end
* Let us decode the two sequences of IDs above to see what this is about:

In [13]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


* The tokenizer added the special token `[CLS]` at the beginning and the special token `[SEP]` at the end
* The model was pretrained with these tokens, so inference needs them too in order to produce consistent results
* Note that some models do not add special tokens, or add different ones
* Models may also add these special tokens only at the beginning, or only at the end
* In any case, the tokenizer knows which ones are expected and adds them automatically
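
The relationship between the two ID lists printed above can be checked by hand. This sketch assumes the BERT-style IDs visible in that output (101 for `[CLS]`, 102 for `[SEP]`); `add_special_ids` is a hypothetical helper, not a library function.

```python
from typing import List

CLS_ID = 101  # ID at the start of model_inputs["input_ids"] in the output above
SEP_ID = 102  # ID at the end of model_inputs["input_ids"] in the output above


def add_special_ids(ids: List[int]) -> List[int]:
    """Wrap content IDs as [CLS] ... [SEP], as a BERT-style tokenizer does for one sequence."""
    return [CLS_ID] + ids + [SEP_ID]


# IDs from tokenizer.convert_tokens_to_ids(tokens), copied from the output above
ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

# IDs from tokenizer(sequence)["input_ids"], copied from the output above
expected = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

print(add_special_ids(ids) == expected)  # True
```

With a loaded tokenizer, these values are better read from `tokenizer.cls_token_id` and `tokenizer.sep_token_id` than hard-coded, since other models use different special tokens.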

---

## From Tokenizer to Model

* The sections above covered the individual steps the `tokenizer` object performs when applied to text
* One final example shows how its main API handles:
	* Multiple sequences
	* Padding
	* Very long sequences
	* Truncation
	* Multiple types of tensors

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)
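
The `output` object holds raw classification scores (logits), one row per input sequence. A common next step is to convert them to probabilities with a softmax. The sketch below does this in plain Python so that it runs without downloading the checkpoint; the logit values are hypothetical placeholders, not results from this model, and `softmax` is a local helper.

```python
import math
from typing import List


def softmax(row: List[float]) -> List[float]:
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for two sequences and two classes (illustration only)
logits = [[-1.0, 2.0], [-3.0, 3.0]]

probs = [softmax(row) for row in logits]
predicted = [row.index(max(row)) for row in probs]

print(probs)
print(predicted)  # index of the highest-probability class for each sequence
```

With PyTorch tensors, the equivalent call is `torch.nn.functional.softmax(output.logits, dim=-1)`, and `model.config.id2label` maps each class index to its label name.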