<a href="https://colab.research.google.com/github/ngzhiwei517/Transformers/blob/main/Chapter2/Behind_the_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

When you call the tokenizer, it does three things:

Tokenizes the text:
Breaks sentences into smaller pieces called tokens (words, subwords, or punctuation).

Maps tokens to numbers:
Every token has a unique ID in the model's vocabulary.
Example (illustrative ID): "HuggingFace" → [12345]

Adds special tokens (depending on the model):
For example, [CLS] and [SEP] tokens for BERT.
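
The three steps above can be sketched with a toy tokenizer. This is a minimal illustration, not the library's implementation: the vocabulary `TOY_VOCAB`, its IDs, and the helper `toy_encode` are hypothetical, and the splitting rule is a plain regular expression rather than a real subword algorithm.

```python
import re

# Hypothetical toy vocabulary; real models ship far larger vocabularies.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}

def toy_encode(text):
    # Step 1: split the lowercased text into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: map each token to its ID, falling back to [UNK] if it is unknown.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # Step 3: wrap the sequence in BERT-style special tokens.
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

print(toy_encode("I hate this so much!"))  # [2, 4, 5, 6, 7, 8, 9, 3]
```

A word missing from the toy vocabulary, such as "love", maps to the `[UNK]` ID here. Real subword tokenizers usually split such words into smaller known pieces instead.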



In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

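The `AutoTokenizer.from_pretrained(checkpoint)` call in the next cell tells Hugging Face:

"Give me the tokenizer that was used when this model was trained."

The tokenizer is downloaded once and then cached, so later calls reuse the local copy.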

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

🔹 return_tensors="pt"
👉 Returns the output as PyTorch tensors, not plain Python lists or NumPy arrays.

Other options:

"tf" → TensorFlow

"np" → NumPy arrays

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
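
Tensors must be rectangular, so `padding=True` pads the shorter sentence to the length of the longest one in the batch, and the returned attention mask marks which positions hold real tokens. A rough pure-Python sketch of that idea follows. It assumes a pad ID of 0 and uses a hypothetical helper `pad_batch` with made-up token IDs; none of these come from the library.

```python
def pad_batch(sequences, pad_id=0):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs for one longer and one shorter sentence.
batch = pad_batch([[2, 4, 5, 6, 7, 8, 9, 3], [2, 4, 5, 3]])
print(batch["input_ids"])
print(batch["attention_mask"])
```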

In [None]:
from transformers import AutoModel

# AutoModel loads only the base model, without a task-specific head.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [None]:
outputs = model(**inputs)
# The shape is (batch size, sequence length, hidden size).
print(outputs.last_hidden_state.shape)

In [None]:
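from transformers import AutoModelForSequenceClassification

# This class adds a sequence classification head on top of the base model,
# so the outputs contain logits instead of raw hidden states.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)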

In [None]:
print(outputs.logits.shape)

In [None]:
print(outputs.logits)

In [None]:
import torch

# Softmax over the last dimension turns each row of logits into probabilities that sum to 1.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
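
Logits are unnormalized scores, so a row of logits does not sum to 1. As a minimal sketch of what the softmax step computes, here is the same operation in plain Python. The logit values are illustrative, not the model's actual output, and `softmax` is a helper written for this example.

```python
import math

def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for two sentences and two labels (made-up values).
example_logits = [[-1.5, 1.5], [4.0, -3.0]]
example_probs = [softmax(row) for row in example_logits]
for row in example_probs:
    print([round(p, 4) for p in row])
```

Each printed row sums to 1, and the larger logit in a row always receives the larger probability.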

In [None]:
# id2label maps each output index to its human-readable label.
model.config.id2label
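
To turn probabilities into readable predictions, pick the highest-scoring index in each row and look it up in the label mapping. The sketch below assumes a mapping of the form `{0: "NEGATIVE", 1: "POSITIVE"}` and uses illustrative probability values; read the real mapping from `model.config.id2label`.

```python
# Assumed label mapping; read the real one from model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Illustrative probabilities for two sentences (made-up values).
probs = [[0.04, 0.96], [0.999, 0.001]]

labels = []
for row in probs:
    # Index of the largest probability in this row.
    best = max(range(len(row)), key=lambda i: row[i])
    labels.append((id2label[best], row[best]))

print(labels)  # [('POSITIVE', 0.96), ('NEGATIVE', 0.999)]
```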