<a href="https://colab.research.google.com/github/ngzhiwei517/Transformers/blob/main/Chapter2/Behind_the_pipeline_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

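When you call the tokenizer, it does three things:

Tokenizes the text:
Breaks each sentence into smaller pieces called tokens (words, subwords, or punctuation).

Maps tokens to numbers:
Every token in the model’s vocabulary has a unique integer ID.
Example (the ID is made up for illustration): "HuggingFace" → [12345]. In practice, a rare word like this is usually split into several subword tokens, each with its own ID.

Adds special tokens (depending on the model):
For example, BERT-style models add [CLS] at the start and [SEP] at the end.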
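
The three steps above can be sketched with a toy tokenizer. This is only an illustration: the vocabulary, the IDs, and the helper names `toy_tokenize` and `toy_encode` are made up here, and real tokenizers rely on subword algorithms such as WordPiece rather than a simple word split.

```python
import re

# Hypothetical vocabulary: these IDs are invented for the example.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "this": 6, "course": 7, "!": 8,
}

def toy_tokenize(text):
    """Step 1: lowercase the text and split it into words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    """Steps 2 and 3: add special tokens, then map every token to its ID."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] ID.
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]

print(toy_tokenize("I love this course!"))
print(toy_encode("I love this course!"))
```

A word that is missing from the toy vocabulary, such as "pizza", maps to the `[UNK]` ID here; a subword tokenizer would instead try to split it into known pieces.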



In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Calling AutoTokenizer.from_pretrained(checkpoint) tells Hugging Face:

"Give me the tokenizer that was used when this model was trained."

The tokenizer files are downloaded once and cached, so later calls reuse the local copy.

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

🔹 return_tensors="pt"
👉 Returns the output as PyTorch tensors, not plain Python lists or NumPy arrays.

Other options:

"tf" → TensorFlow

"np" → NumPy arrays

# **🔹 padding=True**

👉 Ensures all sentences in the batch have the same length by appending special [PAD] tokens to the shorter ones.

The longest sentence in the batch sets the reference length.

**🔍 Step 1: Tokenize both sentences** (the token IDs shown are illustrative)

"I love AI" → [101, 1045, 2293, 1034, 102] → length = 5

"I hate math!" → [101, 1045, 5223, 4667, 999, 102] → length = 6

So the longest sequence has length 6.



**🔧 Step 2: Padding the shorter sentence**

To make both sequences the same length, the shorter one (length = 5) needs to be padded to length = 6.

So the first sentence becomes:
[101, 1045, 2293, 1034, 102, 0]
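
The padding step can be reproduced in a few lines of plain Python. This is a minimal sketch of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, and the pad ID of 0 is assumed from the example above.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 1045, 2293, 1034, 102],       # "I love AI"
    [101, 1045, 5223, 4667, 999, 102],  # "I hate math!"
]
ids, mask = pad_batch(batch)
print(ids[0])
print(mask[0])
```

The second return value plays the role of the attention mask the tokenizer returns alongside the padded IDs.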



# **truncation=True**

truncation=True cuts off extra tokens if an input is too long for the model to handle.

Most transformer models have a maximum input length: 512 tokens for BERT and DistilBERT, 1024 for GPT-2.

If your sentence or paragraph is longer than that (for example, 600 tokens for a model limited to 512), everything beyond the limit is cut off.

💬 Example:
Imagine this input:

"The movie was slow at first, but the ending was absolutely amazing. One of the best twists I've ever seen!"

By default, truncation drops tokens from the end of the sequence. If the tokenizer cuts this input off before "the ending was absolutely amazing...", the model might wrongly predict that the sentiment is negative, because it only sees the "slow at first" part. 😬
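
To make the cut-off concrete, here is a simplified sketch of right-side truncation on a list of token IDs. The helper `truncate_ids` is hypothetical, the placeholder ID 2000 is arbitrary, and the sketch assumes the sequence already ends with the [SEP] ID (102); the real tokenizer handles this internally.

```python
def truncate_ids(ids, max_length, sep_id=102):
    """Keep at most max_length IDs, preserving the closing [SEP] token."""
    if len(ids) <= max_length:
        return list(ids)
    # Drop tokens from the end, then put [SEP] back as the last token.
    return list(ids[: max_length - 1]) + [sep_id]

# A 600-token input: [CLS], 598 placeholder IDs, [SEP].
long_ids = [101] + [2000] * 598 + [102]
short = truncate_ids(long_ids, max_length=512)
print(len(long_ids), "->", len(short))
```

Everything after the first 511 IDs is lost, which is why a verdict that only appears late in a long review can disappear from the model's view.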



In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Tokenize both sentences as one padded, truncated batch of PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)  # attention_mask: 1 = real token, 0 = padding (ignored)

Key points:

input_ids: the ID of each token, plus the special tokens 101 ([CLS], marks the start) and 102 ([SEP], marks the end).

Padding tokens (ID 0) are added to the shorter sentence, "I hate this so much!", to match the length of the longer one.

attention_mask tells the model which tokens are real (1) and which are padding (0), so it ignores padding during processing.



In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

This snippet loads the pre-trained DistilBERT checkpoint without its classification head.

AutoModel loads just the base transformer, with no task-specific head, so its output is a set of hidden states rather than predictions.

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

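The output has shape (batch size, sequence length, hidden size). If the input were a single sentence with 8 tokens, the shape would be:

(1, 8, 768)

This means:

1 sequence

8 tokens

768 hidden state dimensions

For the two-sentence batch above, the first dimension is 2, and the second is the padded length of the longer sentence.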

✅ If you want a complete model for a task, you should use:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

This gives you the base model plus a task-specific head, so it can now predict sentiment, the task this checkpoint was fine-tuned for.

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])

| Dimension | Meaning |
| --- | --- |
| 2 (first) | You gave 2 input sentences (a batch of size 2) ✅ |
| 2 (second) | The model predicts 2 classes (negative or positive sentiment) ✅ |


In [None]:
print(outputs.logits)

🔁 For each input sentence, the model outputs 2 numbers: the logits (raw scores) for the two classes.

The printed tensor holds the logits for the two sentences:

First sentence: class 0 (negative) = -1.5607 and class 1 (positive) = 1.6123

Second sentence: class 0 (negative) = 4.1692 and class 1 (positive) = -3.3464



***But these are logits, not probabilities yet! 🤓***

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

What does this mean?

For Sentence 1, the model is 95.98% sure it’s positive sentiment.

For Sentence 2, the model is 99.95% sure it’s negative sentiment.
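
To see where these percentages come from, softmax can be computed by hand on the logits quoted earlier. This is a plain-Python sketch of the same formula that `torch.nn.functional.softmax` applies along the last dimension; the helper name `softmax` and the hard-coded logits are only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first keeps the exponentials numerically stable.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the printed tensor: one row per sentence.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
```

Each row sums to 1, so the larger value in a row can be read directly as the model's confidence in that class.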

In [None]:
model.config.id2label
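
The mapping lets you turn the highest-scoring class index into a readable label. Below is a small sketch that assumes the mapping is {0: "NEGATIVE", 1: "POSITIVE"}, in line with the class order described above; check `model.config.id2label` on your own copy before relying on it. The helper `predict_labels` is hypothetical.

```python
# Assumed mapping for this checkpoint; verify it with model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Logits copied from the printed tensor: one row per sentence.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

def predict_labels(logit_rows, id2label):
    """Pick the index of the largest logit in each row and look up its label."""
    labels = []
    for row in logit_rows:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(id2label[best_index])
    return labels

print(predict_labels(logits, id2label))
```

Because softmax preserves the ordering of the scores, taking the largest logit gives the same label as taking the largest probability.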