<a href="https://colab.research.google.com/github/ngzhiwei517/Transformers/blob/main/Chapter2/Behind_the_pipeline_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

When you call the tokenizer, it does three things:

Tokenizes the text:
Breaks each sentence into smaller pieces called tokens (words, subwords, or punctuation).

Maps tokens to numbers:
Each token is mapped to an integer ID from the model's vocabulary.
Example: "HuggingFace" → [12345] (an illustrative ID; a real tokenizer may split the word into several subword tokens, each with its own ID)

Adds special tokens (depending on the model):
For example, [CLS] and [SEP] tokens for BERT.
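
The three steps above can be sketched in plain Python. This is a minimal illustration, not the real tokenizer: `TOY_VOCAB`, `split_words`, `wordpiece`, and `toy_encode` are hypothetical names, and every ID in the toy vocabulary is made up. It assumes a WordPiece-style greedy longest-match split, which is the general approach BERT-family tokenizers use.

```python
import re

# Hypothetical toy vocabulary; every ID here is made up for illustration.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "hugging": 6, "##face": 7, "!": 8}

def split_words(text):
    # Lowercase the text, then separate words from punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def wordpiece(word):
    # Greedy longest-match-first split; continuation pieces carry "##".
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in TOY_VOCAB:
                break
            end -= 1
        if end == start:
            return ["[UNK]"]  # no known piece fits this word
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(text):
    # Step 1: tokenize into subword pieces.
    tokens = [piece for word in split_words(text) for piece in wordpiece(word)]
    # Step 2: map each token to its vocabulary ID.
    ids = [TOY_VOCAB[token] for token in tokens]
    # Step 3: wrap the sequence in BERT-style special tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]
    return tokens, ids

tokens, ids = toy_encode("I love HuggingFace!")
print(tokens)
print(ids)
```

Note how the single word "HuggingFace" becomes two tokens here, which is why one word does not always correspond to one ID.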



In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Calling AutoTokenizer.from_pretrained(checkpoint) tells Hugging Face:

"Give me the tokenizer that was used when this model was trained."

The tokenizer files are downloaded once and cached, so later calls reuse the local copy.

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

🔹 return_tensors="pt"
👉 Returns the output as PyTorch tensors instead of plain Python lists, which is what the tokenizer returns when return_tensors is omitted.

Other options:

"tf" → TensorFlow tensors

"np" → NumPy arrays

# **🔹 padding=True**

👉 Ensures all sentences are the same length by adding special [PAD] tokens.

The longest sentence in the batch sets the reference length.

🔍 Step 1: Tokenize both sentences (the IDs below are illustrative; actual values depend on the model's vocabulary)
"I love AI" → [101, 1045, 2293, 1034, 102] → length = 5

"I hate math!" → [101, 1045, 5223, 4667, 999, 102] → length = 6

So the longest sequence has length 6.



**🔧 Step 2: Padding the shorter sentence**

To make both sequences the same length, the shorter one (length = 5) needs to be padded to length = 6.

So the first sentence becomes:
[101, 1045, 2293, 1034, 102, 0]
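
The padding step can be sketched in a few lines of plain Python. This is a simplified illustration, assuming right-side padding with pad ID 0 (the default behaviour for BERT-style tokenizers); `pad_batch` is a hypothetical helper, not a Transformers function. It also builds the attention mask that tells the model which positions are real tokens (1) and which are padding (0).

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Append pad IDs so the row reaches max_len.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# The two illustrative sequences from the steps above.
batch = pad_batch([
    [101, 1045, 2293, 1034, 102],
    [101, 1045, 5223, 4667, 999, 102],
])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because every row now has the same length, the batch can be stacked into a single rectangular tensor.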



# **truncation=True**

👉 Cuts off extra tokens if your input is too long for the model to handle.

Most transformer models have a maximum input length: 512 tokens for BERT and DistilBERT, 1024 for GPT-2.

If your sentence or paragraph is longer than that (for example, 600 tokens for a model with a 512-token limit), the tokenizer drops the tokens beyond the limit.

💬 Example:
Imagine this input:

"The movie was slow at first, but the ending was absolutely amazing. One of the best twists I've ever seen!"

If the tokenizer cuts it off before "the ending was absolutely amazing...", then the model might wrongly predict that the sentiment is negative — because it only sees the "slow at first" part. 😬
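
Here is a rough sketch of what truncation does to a single sequence. It assumes the sequence already starts with [CLS] (ID 101) and ends with [SEP] (ID 102), and that tokens are dropped from the right; `truncate` is a hypothetical helper, and real tokenizers support other truncation strategies as well.

```python
def truncate(ids, max_length, cls_id=101, sep_id=102):
    """Keep [CLS] and [SEP]; drop content tokens from the right if needed.

    Assumes ids starts with cls_id, ends with sep_id, and max_length >= 2.
    """
    if len(ids) <= max_length:
        return ids
    content = ids[1:-1]
    # Reserve two positions for the special tokens.
    return [cls_id] + content[: max_length - 2] + [sep_id]

# 600 made-up content tokens plus the two special tokens = 602 IDs.
long_ids = [101] + list(range(1000, 1600)) + [102]
short_ids = truncate(long_ids, max_length=512)
print(len(long_ids), len(short_ids))
```

Everything after the cut is invisible to the model, which is why a review whose sentiment flips near the end can be misclassified.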



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

In [None]:
from transformers import AutoModel

# AutoModel loads the base model without a task-specific head.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [None]:
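# The base model returns one hidden-state vector per token:
# shape is (batch_size, sequence_length, hidden_size).
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)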

In [None]:
from transformers import AutoModelForSequenceClassification

# This class adds a sequence classification head on top of the base model.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

In [None]:
# Logits have shape (batch_size, num_labels): one raw score per class.
print(outputs.logits.shape)

In [None]:
print(outputs.logits)

In [None]:
import torch

# Softmax over the last dimension turns each row of logits into probabilities.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
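
To see what the softmax call above computes, here is a small pure-Python sketch. The logits are hypothetical values chosen for illustration, not actual model output, and `softmax` is a hand-written helper rather than the PyTorch function.

```python
import math

def softmax(logits):
    # Subtracting the max improves numerical stability
    # and does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence over two classes.
probs = softmax([-1.5, 1.5])
print(probs)
```

Each output is between 0 and 1 and the values sum to 1, so the larger logit ends up with the larger probability.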

In [None]:
# Maps each output index to its label name.
model.config.id2label