# 🤗 Hugging Face Transformers: Master the pipeline() API 🚀  
## Build NLP & Vision Applications in Minutes

Welcome to this hands-on tutorial on the 🤗 Hugging Face **Transformers** library — one of the most powerful and accessible toolkits for Natural Language Processing (NLP), Computer Vision, and Speech applications.

In this notebook, we’ll explore how to leverage the `pipeline()` API to quickly solve a wide range of tasks such as:
- Sentiment Analysis ❤️😡
- Zero-Shot Text Classification 🔮
- Text Generation ✍️🧠
- Fill-Mask Predictions 🧩
- Named Entity Recognition (NER) 🏷️
- Question Answering ❓✅
- Summarization 📰✂️
- Translation 🌍
- Image Classification 📷🧠
- Automatic Speech Recognition (ASR) 🎙️🗣️

---

### 🎯 What You'll Learn

By the end of this tutorial, you'll be able to:
✅ Understand what Hugging Face `pipelines` are and how they simplify ML workflows  
✅ Use pre-trained models for NLP, Vision, and ASR tasks with just a few lines of code  
✅ Interpret model outputs and understand the underlying logic  
✅ Explore deeper components like tokenizers, attention masks, and AutoModel classes



Let’s get started! 👇


## ⚙️ Step 1: Environment Setup

Before we start using Hugging Face Transformers, make sure the necessary libraries are installed:

- `transformers`: The main Hugging Face library
- `datasets`: For loading benchmark datasets (optional for some pipelines)
- `evaluate`: For metrics (optional)
- `sentencepiece`: Needed for some multilingual models like MarianMT and mBART

In [None]:
#!pip install datasets evaluate transformers[sentencepiece]

## 🤖 What is `pipeline()` in Hugging Face?

The `pipeline()` API is a high-level interface provided by Hugging Face Transformers.

It allows you to:
✅ Quickly use powerful pre-trained models  
✅ Perform common ML tasks (e.g., sentiment analysis, summarization)  
✅ Skip low-level model/tokenizer loading unless needed  
✅ Handle preprocessing and postprocessing automatically

Think of it as a plug-and-play tool for working with Transformers!

### ✨ Example Pipelines You’ll Explore
- `"sentiment-analysis"`
- `"zero-shot-classification"`
- `"text-generation"`
- `"fill-mask"`
- `"ner"` (Named Entity Recognition)
- `"question-answering"`
- `"summarization"`
- `"translation"`
- `"image-classification"`
- `"automatic-speech-recognition"`


## 🧠 Sentiment Analysis

Sentiment analysis is one of the most common NLP tasks. It involves determining whether a given piece of text expresses a **positive**, **negative**, or sometimes **neutral** opinion.

The Hugging Face `pipeline("sentiment-analysis")` makes it incredibly easy to get started with just a few lines of code.

### ✅ Use Cases:
- Product and movie review classification  
- Social media sentiment monitoring  
- Customer support analytics

### 🔹 Example 1: Single-Line Sentiment Analysis

You can quickly test the sentiment of a single sentence or short passage using the pipeline. Here's a basic example.


In [None]:
from transformers import pipeline

# Load the sentiment analysis pipeline
#Default Model: distilbert/distilbert-base-uncased-finetuned-sst-2-english
sentiment_pipeline = pipeline("sentiment-analysis")

# Analyze a short sentence
sentiment_pipeline(["I absolutely loved this product! It exceeded my expectations."])


### 🔹 Example 2: Batch Sentiment Analysis (Multiple Sentences)

The pipeline also supports **batch processing**. You can analyze several reviews or opinions in one call.

🗒️ What does it do? It processes multiple inputs in a single call, which is useful for analyzing reviews, survey responses, or social media posts in bulk.
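
When a pipeline receives a list, it returns one result per input, in the same order, each as a dict with `label` and `score` keys. The sketch below shows one way to pair inputs with results and flag low-confidence predictions. The `results` values are made-up placeholders rather than real model output, and the `flag_uncertain` helper and its `0.75` threshold are assumptions chosen for illustration.

```python
# Hypothetical pipeline output: one dict per input, in input order.
# The scores below are placeholders, not real model predictions.
texts = ["Great product!", "Arrived broken.", "It works as intended."]
results = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.62},
]

def flag_uncertain(texts, results, threshold=0.75):
    """Pair each text with its prediction and mark low-confidence ones."""
    rows = []
    for text, result in zip(texts, results):
        rows.append({
            "text": text,
            "label": result["label"],
            "score": result["score"],
            "uncertain": result["score"] < threshold,
        })
    return rows

for row in flag_uncertain(texts, results):
    marker = "?" if row["uncertain"] else " "
    print(f"{marker} {row['label']:<8} {row['score']:.2f}  {row['text']}")
```

A threshold like this can be handy with a two-class model, which has no way to answer "neutral" for mixed reviews.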

In [None]:
# Run sentiment analysis on a batch of 3 texts with varying tone
sentiment_pipeline(
    [
        # 1. Clearly positive, 2. Clearly negative, 3. More neutral/mixed
        "I absolutely loved this product! It exceeded my expectations in every way and the customer service was fantastic. Highly recommend!",
        "This was a complete disappointment. The item arrived broken, and no one responded to my support emails. I will never order from this company again.",
        "It works as intended, though there’s nothing particularly special about it."
    ]
)


## 🧠 Zero-Shot Classification with Hugging Face Transformers

The `zero-shot-classification` pipeline allows us to classify text into custom labels *without* the need for training on those specific categories. It uses models like BART or RoBERTa fine-tuned on NLI (Natural Language Inference) to evaluate whether a given label is an appropriate description of the text. This is especially useful when predefined labels are unavailable or dynamic.

🔹 **Use case:** Categorize reviews, support tickets, or social media posts into topics like *"billing"*, *"technical issue"*, or *"general feedback"* — even if the model was never trained on those exact classes.
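
To make the NLI idea concrete, here is a library-free sketch of the scoring step. Each candidate label is turned into a hypothesis sentence (the template below matches the pipeline's documented default `hypothesis_template`), the NLI model scores how strongly the text entails each hypothesis, and in single-label mode those entailment scores are normalized with a softmax. The logits are placeholders, and `build_hypotheses` and `rank_labels` are hypothetical helper names, not library functions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    scores = softmax(entailment_logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)

labels = ["billing issue", "technical problem", "customer support"]
print(build_hypotheses(labels))

# Placeholder logits standing in for the NLI model's "entailment" output
# for each (text, hypothesis) pair. They are NOT real model values.
fake_entailment_logits = [-1.2, 0.3, 2.1]
for label, score in rank_labels(labels, fake_entailment_logits):
    print(f"{label}: {score:.3f}")
```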


In [None]:
from transformers import pipeline
#Default Model:  facebook/bart-large-mnli
text_classifier = pipeline("zero-shot-classification")
text_classifier(
    # Note the trailing space: adjacent string literals are joined as-is
    "I contacted support three times about my account being locked, but no one "
    "got back to me. This is really frustrating.",
    candidate_labels=["billing issue", "technical problem", "customer support"],
)

#### 🎓 Education Domain



In [None]:
#🎓 Education Domain

text_classifier(
    "Although the course materials are well-structured, the instructor rarely responds to student questions on the discussion board.",
    candidate_labels=["course content", "instructor feedback", "platform usability", "technical issue"],
)

#### 💰 Finance Domain



In [None]:
# 💰 Finance Domain

text_classifier(
    "I noticed an unexpected charge on my credit card and I haven’t received any explanation from the bank yet.",
    candidate_labels=["fraud", "billing error", "customer support", "loan inquiry"],
)

## ✍️ Text Generation

Text generation is a core capability of generative language model families such as GPT-2, GPT-Neo, and Falcon. Given a starting prompt, the model **generates a coherent continuation** based on language patterns it learned during pretraining.

This is widely used for:
- AI writing assistants ✍️  
- Story or poetry generation 📖  
- Code autocompletion 🤖  
- Educational content or chatbot responses 💬  
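
Generation proceeds one token at a time, and parameters such as `temperature` and `top_k` shape how each next token is picked. The following is a minimal sketch of top-k sampling with temperature over a toy distribution. The logits are placeholders, `sample_next_token` is a hypothetical helper, and a real model would score tens of thousands of tokens rather than four.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=3, rng=None):
    """Sample one token from the top_k highest-scoring candidates."""
    rng = rng or random.Random()
    # Keep only the top_k candidates by logit.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution.
    scaled = [score / temperature for _, score in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [token for token, _ in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token logits (placeholders, not real model output).
toy_logits = {"build": 2.0, "train": 1.5, "deploy": 1.0, "banana": -3.0}

rng = random.Random(0)
print([sample_next_token(toy_logits, temperature=0.7, top_k=3, rng=rng) for _ in range(5)])
# top_k=1 always returns the single most likely token (greedy decoding).
print(sample_next_token(toy_logits, top_k=1))
```

Lower temperatures and smaller `top_k` values make the output more predictable, while higher values make it more varied.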



In [None]:
from transformers import pipeline

# Load the text generation pipeline
#Default model: openai-community/gpt2
text_generator = pipeline("text-generation")

# You can pass arguments such as max_length, temperature, and top_k to control length and creativity.
text_generator("This tutorial will teach you how to use the Hugging Face Transformers library to")


### 🔁 Try a Lighter Model: `distilgpt2`

If you're working with limited hardware or want faster inference, try a distilled version of GPT-2.


In [None]:
from transformers import pipeline

# This is a lighter version of GPT-2 for faster generation
text_generator = pipeline("text-generation", model="distilgpt2")
# Generate a continuation using the smaller model and adjusting parameters
text_generator("This tutorial will teach you how to use the Hugging Face Transformers library to",
               num_return_sequences=2,
               max_length=50,
)

### ⚡ Try a Larger Instruction-Tuned Model: `tiiuae/falcon-7b-instruct`

The `Falcon` models are powerful, open-access LLMs. The `instruct` variant is designed for following human instructions — making it great for question answering, tutoring, and natural completions.
Running it will take quite a while on a regular CPU, and it needs a lot of memory.

In [None]:
from transformers import pipeline

# Load the Falcon 7B instruction-following model
text_generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")

# Generate a response that simulates instruction-following
text_generator("This tutorial will teach you how to use the Hugging Face Transformers library to", max_length=50)
# Do you have enough memory? On a GPU with about 22 GiB this run failed with:
# OutOfMemoryError: CUDA out of memory. Tried to allocate 316.00 MiB. GPU 0 has a total capacity of 22.16 GiB of which 47.38 MiB is free.

## 🧩 Fill-Mask: Masked Language Modeling

The `fill-mask` pipeline predicts missing words in a sentence. It replaces a mask token such as `[MASK]` or `<mask>` (depending on the model) with the words that are most probable in context, and returns the top predictions with their confidence scores.

It’s primarily useful for:
- Exploring language model predictions
- Completing partially written sentences
- Pre-training tasks (like BERT-style MLM)
- Classification, as a trick: compare the scores the model assigns to candidate label words at the masked position
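
Conceptually, the pipeline ends up with a probability for every vocabulary token at the masked position and keeps the `top_k` best. The sketch below mimics that last step with placeholder probabilities. `fill_mask_top_k` is a hypothetical helper, and the output keys mirror a subset of those the real pipeline returns (`token_str`, `score`, `sequence`).

```python
import heapq

def fill_mask_top_k(sentence, token_probs, top_k=2, mask_token="<mask>"):
    """Return the top_k completions of `sentence`, best first."""
    best = heapq.nlargest(top_k, token_probs.items(), key=lambda kv: kv[1])
    return [
        {
            "token_str": token,
            "score": prob,
            "sequence": sentence.replace(mask_token, token, 1),
        }
        for token, prob in best
    ]

# Placeholder probabilities for the masked position (not real model output).
toy_probs = {"NLP": 0.41, "complex": 0.22, "vision": 0.09, "cooking": 0.01}
sentence = "Transformers are a powerful tool for solving <mask> tasks."
for candidate in fill_mask_top_k(sentence, toy_probs):
    print(candidate)
```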

Let’s see it in action with some examples.


In [None]:
from transformers import pipeline

# Load the fill-mask pipeline
# Default Model: distilbert/distilroberta-base
fill_mask = pipeline("fill-mask")

fill_mask("Transformers are a powerful tool for solving <mask> tasks.", top_k=2)


### 🔍 Use Case: Domain-Specific Language Understanding

We’ll now use the fill-mask pipeline to complete sentences in different domains:
- 💰 Finance
- 🎓 Education
- ⚖️ Legal

This helps assess how well the model understands specialized language or context.


In [None]:
# 💰 Finance Domain
print("Finance Example:")
print(fill_mask("The company's quarterly <mask> exceeded market expectations."))

# 🎓 Education Domain
print("\nEducation Example:")
print(fill_mask("Students are encouraged to submit their assignments before the <mask>."))

# ⚖️ Legal Domain
print("\nLegal Example:")
print(fill_mask("According to the new law, all businesses must <mask> with the updated regulations."))

## 🕵️ Named Entity Recognition (NER)

Named Entity Recognition identifies key elements in text such as names of people, organizations, locations, and more. For example, we can detect and group entities like "Yoshua Bengio" (Person), "Mila" (Organization), and "Montreal" (Location).

This can be useful for tagging and highlighting critical content.

NER is used in:
- Information extraction from documents
- Search engine enhancement
- News tagging and trend analysis
- Resume or medical record parsing
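
Under the hood, an NER model labels every token, and a grouping step then merges neighboring tokens into whole entities. Below is a simplified sketch of that grouping for BIO-style tags (`B-` begins an entity, `I-` continues it, `O` is outside). It is an illustration only, not the pipeline's actual aggregation code, and the tags are written by hand rather than predicted.

```python
def group_entities(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current_type
        if not continues and current_words:
            # Close the span in progress before starting anything new.
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix in ("B", "I"):
            current_type = entity_type
            current_words.append(token)
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans

tokens = ["Yoshua", "Bengio", ",", "founder", "of", "Mila", "in", "Montreal"]
# Hand-written tags for illustration (not model predictions).
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "O", "B-LOC"]
print(group_entities(tokens, tags))
```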


In [None]:
from transformers import pipeline

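# Load the NER pipeline and group sub-tokens into whole entities (e.g., full names vs individual tokens)
# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
#Default Model: dbmdz/bert-large-cased-finetuned-conll03-english
ner = pipeline("ner", aggregation_strategy="simple")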

ner("Yoshua Bengio, founder of Mila, gave a keynote in Montreal last Monday to highlight the institute's latest breakthroughs in responsible AI research.")


## ❓ Question Answering (Extractive QA)

The `question-answering` pipeline performs extractive QA: it searches the given context for the span that best answers the question and returns that span together with a confidence score.

It’s widely used in:
- Chatbots
- Document search assistants
- Educational tools and tutoring systems


In [None]:
from transformers import pipeline

# Load the QA pipeline
# Default Model: distilbert/distilbert-base-cased-distilled-squad
qa = pipeline("question-answering")
# Provide context and ask a question
qa(
    question="Who gave a keynote in Montreal?",
    context="Yoshua Bengio, founder of Mila, gave a keynote in Montreal last "
    "Monday to highlight the institute's latest breakthroughs in responsible AI "
    "research.",
)

## 📰 Text Summarization

Summarization is the task of generating a concise version of a longer text while preserving its meaning.

Two common types:
- **Extractive**: Selects key sentences from the original
- **Abstractive**: Generates new phrasing, like how a human would summarize

Great for:
- Summarizing articles, emails, research papers
- Creating executive summaries or news digests
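
To illustrate the **extractive** idea from the list above, here is a toy summarizer that scores each sentence by how frequent its words are in the whole text and keeps the best ones. This is a deliberately naive sketch with a hypothetical `extractive_summary` helper. The summarization pipeline's default model is abstractive and works very differently.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=1):
    """Pick the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)

toy_text = "Mila studies AI. Mila hosts AI experts. Cats sleep."
print(extractive_summary(toy_text, num_sentences=1))
```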


In [None]:
from transformers import pipeline

# Load the summarization pipeline
#Default Model : sshleifer/distilbart-cnn-12-6
text_summarizer = pipeline("summarization")
text_summarizer(
    """
    Mila, the Quebec Artificial Intelligence Institute, continues to play a leading
    role in shaping the future of ethical AI. Under the guidance of renowned researcher
    Yoshua Bengio, the institute has launched several initiatives focused on
    transparency, fairness, and social responsibility in AI development. These efforts
    include partnerships with universities, public institutions, and tech companies
    to promote the responsible use of machine learning models in healthcare, education,
    and the public sector.

    Last week, Mila hosted an international symposium in Montreal, drawing experts from
    over 30 countries to discuss regulatory frameworks and AI governance. During the
    event, Bengio emphasized the need for collaborative research and stronger safeguards
    to ensure that AI systems align with democratic values. The symposium concluded
    with a declaration urging governments and industry leaders to adopt ethical standards
    for AI that prioritize human well-being over purely commercial goals.
"""
)

## 🌍 Translation

Hugging Face Transformers can translate text between languages using models such as MarianMT and mBART.

Works well for:
- Multilingual apps or chatbots
- Educational content delivery
- Quick translation without external APIs


In [None]:
from transformers import pipeline

# Load a French-to-English translation pipeline
# (without model=..., use a task name that includes the language pair, e.g. "translation_en_to_fr")
text_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

# Translate the following French sentence into English
text_translator("Mila, l’Institut québécois d’intelligence artificielle, est reconnu mondialement pour ses recherches de pointe en apprentissage profond et en intelligence artificielle responsable.")

## 🖼️ Image Classification

Image classification involves predicting the main object or scene depicted in an image. Models such as ViT (Vision Transformer) and ConvNeXt are available for this task within Hugging Face.

This is useful for:
- Object recognition
- Content moderation
- Sorting images into categories

In [None]:
from PIL import Image
import requests

# Load the image from the URL for COCO dataset
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

image

In [None]:
from transformers import pipeline

# Load the image classification pipeline
#Default Model: google/vit-base-patch16-224
img_classifier = pipeline(task="image-classification")

image_classes_labels = img_classifier(image_url)
#Let's print the results
for class_label in image_classes_labels:
    print(f"Label: {class_label['label']}, Score: {class_label['score']:.4f}")

## 🎧 Automatic Speech Recognition (ASR)

ASR is the process of converting spoken language into written text using models like Wav2Vec2 or Whisper.

Perfect for:
- Transcribing audio files or lectures
- Creating subtitles or captions
- Building voice interfaces


The audio sample is sourced from the Open Speech Repository, which provides freely usable speech files in multiple languages for use in speech recognition and other applications.


In [None]:
from transformers import pipeline
import requests
from IPython.display import Audio
import tempfile
import os

# URL of the audio sample (download it next to this notebook first):
# "https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav"

# Local path to the downloaded audio file
audio_file = "OSR_us_000_0010_8k.wav"

# Play the audio in notebook
Audio(audio_file)


In [None]:
# Load the ASR pipeline
# The task default is facebook/wav2vec2-base-960h; here we request Whisper explicitly
asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3"
)
transcript = asr(
  audio_file,
  #return_timestamps=True
  chunk_length_s=30,
  #stride_length_s=5,
)
print(transcript)

# Display the transcription
print("Transcription:", transcript["text"])

## 🔍 Under the Hood: What Happens Inside `pipeline()`?

Behind the scenes, `pipeline()` uses the following components:
- `AutoTokenizer`: Converts raw text into tokens
- `AutoModel`: Loads the pre-trained neural network
- Post-processing: Decodes and formats the model output
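
The three stages can be mimicked with a toy, library-free pipeline. Everything in this sketch is an assumption made for illustration: the five-word vocabulary, the hand-picked per-token scores that stand in for a neural network, and the function names. Only the overall shape (text to IDs, IDs to logits, logits to a labeled score) reflects what `pipeline()` does.

```python
import math

TOY_VOCAB = {"[UNK]": 0, "loved": 1, "great": 2, "broken": 3, "disappointment": 4}
# Hypothetical per-token scores for (NEGATIVE, POSITIVE); a stand-in for a real model.
TOY_WEIGHTS = {0: (0.0, 0.0), 1: (-1.0, 2.0), 2: (-0.5, 1.5), 3: (2.0, -1.0), 4: (2.5, -1.5)}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def preprocess(text):
    """Tokenizer stage: text -> list of integer IDs."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]

def forward(input_ids):
    """Model stage: IDs -> raw logits, one per class."""
    logits = [0.0, 0.0]
    for token_id in input_ids:
        neg, pos = TOY_WEIGHTS[token_id]
        logits[0] += neg
        logits[1] += pos
    return logits

def postprocess(logits):
    """Post-processing stage: logits -> {'label', 'score'}."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

def toy_pipeline(text):
    """Chain the three stages, as pipeline() does internally."""
    return postprocess(forward(preprocess(text)))

print(toy_pipeline("I loved it, great product!"))
print(toy_pipeline("Arrived broken. A disappointment."))
```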

Before we walk through these steps manually to better understand the pipeline internals, let's take a closer look at tokenizers.


## 🔍 Understanding Tokenizers in Transformers

Before feeding text into a Transformer model, it must be **preprocessed** into numerical form. That’s the job of the **Tokenizer**.

### 🧾 What a Tokenizer Does:
1. **Splits** raw text into smaller units called *tokens*  
   - Example: `"Transformers are awesome"` → `["Transform", "##ers", "are", "awesome"]`
2. **Maps** tokens to numerical IDs using a vocabulary  
   - Example: `"Transform"` → `20145`, `"##ers"` → `2543`
3. **Adds special tokens** required by the model  
   - Like `[CLS]`, `[SEP]`, or `<s>`, `</s>` depending on the architecture
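
The three steps above can be sketched with a tiny WordPiece-style tokenizer. It splits each word greedily into the longest pieces found in the vocabulary, marks continuation pieces with `##`, maps pieces to IDs, and wraps the result in `[CLS]` and `[SEP]`. The vocabulary and its IDs are made up for this example, and `wordpiece_tokenize` and `encode` are hypothetical helpers rather than library functions.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary with made-up IDs (real vocabularies hold tens of thousands of entries).
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "transform": 3, "##ers": 4, "are": 5, "awesome": 6}

def encode(text, vocab):
    """Split into tokens, add special tokens, and map everything to IDs."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Transformers are awesome", toy_vocab)
print(tokens)
print(ids)
```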

---

### 🛠️ Output of the Tokenizer:
The tokenizer returns a dictionary with:
- `input_ids`: Tokenized integer IDs of the text
- `attention_mask`: 1s for real tokens, 0s for padding
- (Optional) `token_type_ids`: For tasks like QA with multiple segments

Example:
```python
{'input_ids': [101, 999, 2003, 2307, 102], 'attention_mask': [1, 1, 1, 1, 1]}
```


Recall the sentiment pipeline from earlier:

In [None]:
from transformers import pipeline

text_classifier = pipeline("sentiment-analysis")
text_classifier(
     ["I absolutely loved this product! It exceeded my expectations in every way and the customer service was fantastic. Highly recommend!",
     "This was a complete disappointment. The item arrived broken, and no one responded to my support emails. I will never order from this company again.",
     "It works as intended, though there’s nothing particularly special about it."
     ]
)

Let's walk through these steps to better understand the pipeline internals.

### 🧾 Step 1: Preprocessing with a Tokenizer

Transformers can’t process raw text directly — it first needs to be converted into token IDs.

We use `AutoTokenizer` to:
- Split text into tokens (words or subwords)
- Convert tokens to integer IDs using a vocabulary
- Add special tokens like `[CLS]` and `[SEP]`


In [None]:
from transformers import AutoTokenizer

# Load tokenizer for a classification task
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Encode a sample text
reviews =  ["I absolutely loved this product! It exceeded my expectations in every way and the customer service was fantastic. Highly recommend!",
     "This was a complete disappointment. The item arrived broken, and no one responded to my support emails. I will never order from this company again.",
     "It works as intended, though there’s nothing particularly special about it."
]
reviews_inputs = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
print(reviews_inputs)

In [None]:
# View token IDs
print(reviews_inputs["input_ids"])

### 🧠 Step 2: Load the Pretrained Model

Now that we have tokenized input, we feed it into a model. Let’s load the underlying architecture using `AutoModel`.

#### Why You Must Use the Right Tokenizer
Each pretrained model was trained with a specific tokenizer. Using a different one leads to:

- Misaligned token IDs

- Poor or invalid predictions

- Misinterpreted input length or structure

✅ Always load the tokenizer from the same checkpoint as the model


In [None]:
from transformers import AutoModel

# Load the base transformer model (no classification head)
model = AutoModel.from_pretrained(checkpoint)

# Run the model with encoded inputs
model_outputs = model(**reviews_inputs)

# Inspect last hidden states (output embeddings)
print(model_outputs.last_hidden_state.shape)

#### 🧠 Choosing the Right Model Architecture in 🤗 Transformers

The base class `AutoModel` only returns **hidden states** — it's like the "engine" of a Transformer.

To solve specific tasks (e.g., classification, QA, generation), 🤗 Transformers provides `AutoModelFor*` variants that **add a task-specific head** — a final layer trained to produce usable outputs for that task.

Below is a breakdown of what each variant adds compared to the base `AutoModel`.

---

##### 🔧 Core Model Variants (Compared to `AutoModel`)

- **`AutoModel`**
  - 🔹 Outputs: `last_hidden_state`, `pooler_output` (if available)
  - ❌ No task head → requires manual postprocessing
  - ➕ Ideal for embedding extraction or when building custom heads

---

- **`AutoModelForCausalLM`**
  - ➕ **Language Modeling Head**: Linear layer projecting to vocabulary size
  - 🔹 Used for: **Next-token prediction / text generation**
  - 🔁 Decoder-style models (e.g., GPT, Falcon)
  - 🧠 Adds: `lm_head = nn.Linear(hidden_size, vocab_size, bias=False)`

---

- **`AutoModelForMaskedLM`**
  - ➕ **Masked LM Head**: Linear + activation to predict masked tokens
  - 🔹 Used for: **Fill-mask tasks**, masked token recovery (BERT)
  - 🧠 Adds (BERT, simplified): a small transform block followed by `decoder = nn.Linear(hidden_size, vocab_size)`, stored under `cls.predictions`

---

- **`AutoModelForMultipleChoice`**
  - ➕ **Choice Scoring Head**: Classification layer over each choice
  - 🔹 Used for: **Multiple-choice QA tasks** like SWAG or RACE
  - 🧠 Adds: `classifier = nn.Linear(hidden_size, 1)` applied across choices

---

- **`AutoModelForQuestionAnswering`**
  - ➕ **QA Span Head**: One linear layer with two outputs per token, the start and end logits
  - 🔹 Used for: **Extractive question answering**
  - 🧠 Adds:
    - `qa_outputs = nn.Linear(hidden_size, 2)`
    - Output shape: `(batch_size, sequence_length, 2)` → (start, end)

---

- **`AutoModelForSequenceClassification`**
  - ➕ **Classification Head**: Linear + dropout for label prediction
  - 🔹 Used for: **Sentiment analysis, intent detection, etc.**
  - 🧠 Adds:
    - `dropout = nn.Dropout(p)`
    - `classifier = nn.Linear(hidden_size, num_labels)`

---

- **`AutoModelForTokenClassification`**
  - ➕ **Token-wise Classifier Head**: Predicts a label for each token
  - 🔹 Used for: **NER, POS tagging, chunking**
  - 🧠 Adds: `classifier = nn.Linear(hidden_size, num_labels)` applied over sequence

---

📌 Summary: All these variants wrap `AutoModel` with a **task-specific head**, so you don’t need to build one manually.
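
To see how little a head adds, here is a pure-Python sketch of a classification head: a single linear map from a hidden vector of size `hidden_size` to `num_labels` raw scores. The hidden state and weights are arbitrary placeholders (a real head uses trained parameters and tensor operations), and `linear` is a hypothetical helper.

```python
def linear(vector, weights, bias):
    """Compute weights @ vector + bias for one input vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

# Pretend this is the hidden state of the first token: hidden_size = 4.
hidden_state = [0.5, -1.0, 2.0, 0.0]

# A classification head for num_labels = 2: weight shape (2, 4), bias shape (2,).
# Values are arbitrary placeholders, not trained parameters.
head_weights = [
    [0.1, 0.2, -0.3, 0.0],
    [-0.1, 0.0, 0.4, 0.2],
]
head_bias = [0.0, 0.1]

logits = linear(hidden_state, head_weights, head_bias)
print(logits)  # one raw score per label
```

The other heads follow the same pattern and differ mainly in the output size and in which hidden states they are applied to.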

> 🎥 Learn more in my next tutorial, which is being prepared for recording.

### 🧩 Step 3: Use a Task-Specific Model (Sequence Classification)

Instead of raw embeddings, we often want task-specific predictions — like classifying sentiment.

We can use `AutoModelForSequenceClassification` to load a model with the classification head.


In [None]:
from transformers import AutoModelForSequenceClassification

# Load the model for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model_outputs = model(**reviews_inputs)
print("Outputs shape:",model_outputs.logits.shape)
#torch.Size([3, 2]) Three sentences two labels
print("Output score:",model_outputs.logits)


In [None]:
import torch

# Convert raw logits to probabilities and show predictions
predictions = torch.nn.functional.softmax(model_outputs.logits, dim=-1)
print(predictions)

### 🏷️ Mapping Predictions to Human-Readable Labels

To make sense of model outputs, we need to translate index-based predictions into labels.


In [None]:
# View class index → label mapping
model.config.id2label

Let's translate the predicted scores into labels:

In [None]:
# Get the predicted label index
predicted_label_indices = torch.argmax(predictions, dim=-1)
# Map index to label
predictions_labels = [model.config.id2label[index.item()] for index in predicted_label_indices]
print(predictions_labels)


## 🧱 Creating a Transformer from Scratch

While most use cases rely on **pretrained models**, it’s also possible to manually configure and initialize a transformer from scratch — starting with random weights.

We’ll use the `BertConfig` class to manually define the architecture, then instantiate a `BertModel` using this configuration.

> ⚠️ Note: This model is **untrained**. It has randomly initialized weights and won't produce useful outputs until it is trained on large datasets.


In [None]:
# Manually define a BERT configuration
from transformers import BertConfig, BertModel

# Create a default BERT config
bert_config = BertConfig()

# Print out the configuration parameters
print(bert_config)

# Initialize a model with random weights
model = BertModel(bert_config)


#### 📌 Why This Matters

- This initializes a BERT model **without pretrained weights**
- Useful for academic experiments or custom model design
- Not practical for production — requires a large dataset and compute to train

🔁 Instead, we often load a pretrained model using `from_pretrained()` to save time, cost, and carbon footprint.


Loading a Transformer model that is already trained is simple with the `from_pretrained()` method:


In [None]:
from transformers import BertModel

# The practical alternative — load a pretrained model
# Load weights trained on a large corpus
model = BertModel.from_pretrained("bert-base-cased")

bert_folder ="bert_model"

model.save_pretrained(bert_folder)


In [None]:
!ls bert_model

Alternatively, use the `AutoModel` class instead of `BertModel` to write checkpoint-agnostic code. This lets us swap one checkpoint for another, provided the new checkpoint was trained for a similar task.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)

## 🔡 Tokenization, Tokenizer, and AutoTokenizer

Tokenization is the first step in turning raw text into something a model can understand. Using `AutoTokenizer`, we can easily convert text into token IDs — and back.

Let’s see how encoding and decoding work:


In [None]:
from transformers import BertTokenizer

# Load tokenizer for a specific model
model_checkpoint = "bert-base-cased"
tokenizer = BertTokenizer.from_pretrained(model_checkpoint)
tokens = tokenizer("It works as intended, though there’s nothing particularly special about it.")
print(tokens)

### 🤖 Use `AutoTokenizer` for Generic Model Loading

Instead of hard-coding the tokenizer class, prefer `AutoTokenizer` — it auto-detects the correct tokenizer class based on the checkpoint.


In [None]:
from transformers import AutoTokenizer

# Load tokenizer generically
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokens = tokenizer("It works as intended, though there’s nothing particularly special about it.")

print("Tokens:", tokens)


Let's examine the tokenizer....

In [None]:
# Save tokenizer to disk
tokenizer.save_pretrained("my_bert_tokenizer")

In [None]:
# Preview saved tokenizer files
!ls my_bert_tokenizer

In [None]:
text ="It works as intended, though there’s nothing particularly special about it."
text_tokens = tokenizer.tokenize(text)

print(text_tokens)

In [None]:
text_tokens_ids = tokenizer.convert_tokens_to_ids(text_tokens)

print(text_tokens_ids)

In [None]:
# Decode back to text
text_decoded = tokenizer.decode(text_tokens_ids)
# The same call works on a hard-coded list of IDs
text_decoded = tokenizer.decode([1135, 1759, 1112, 3005, 117, 1463, 1175, 787, 188, 1720, 2521, 1957, 1164, 1122, 119])
print("Decoded text:", text_decoded)

## 🧿 Special Tokens

Transformer models rely on special tokens for structural understanding:
- `[CLS]`: Start of input (for classification)
- `[SEP]`: Separator between sentences (for QA)
- `[PAD]`: Padding
- `[MASK]`: Used in masked language modeling

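`[CLS]` and `[SEP]` are added automatically when you call the tokenizer. `[PAD]` appears only when padding is requested, and `[MASK]` is something you place in the text yourself.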
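
As a rough sketch of what "adding special tokens" means for a BERT-style model, the helper below wraps one sequence, or a pair of sequences, in `[CLS]` and `[SEP]` and builds the matching `token_type_ids`. The IDs 101 and 102 are the ones BERT vocabularies use for these tokens. `add_special_tokens` is a hypothetical helper, and the inner token IDs are arbitrary placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # IDs used by BERT vocabularies for [CLS] and [SEP]

def add_special_tokens(ids_a, ids_b=None):
    """Wrap one sequence, or a pair, in [CLS]/[SEP] the way BERT expects."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID]
    token_type_ids = [0] * len(input_ids)
    if ids_b is not None:
        # The second segment gets its own closing [SEP] and segment ID 1.
        input_ids += ids_b + [SEP_ID]
        token_type_ids += [1] * (len(ids_b) + 1)
    return {"input_ids": input_ids, "token_type_ids": token_type_ids}

# Arbitrary placeholder IDs standing in for real tokens.
print(add_special_tokens([2023, 4031]))
print(add_special_tokens([2040, 2435], [2023, 4031, 2003]))
```

The pair form is what a QA model sees: the question as the first segment and the context as the second.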


In [None]:
# Can you spot the special token IDs?
text_input_tokens = tokenizer(text, return_tensors="pt")
print(text_input_tokens["input_ids"])


In [None]:
# View model's special tokens
print("Model's special tokens:")
print(tokenizer.special_tokens_map)
print(tokenizer.special_tokens_map.values())
print(tokenizer.convert_tokens_to_ids(tokenizer.special_tokens_map.values()))

In [None]:
# Tokenize a new review
review = "The product is fine, but delivery was delayed."
review_tokens = tokenizer(review)
print("Tokenized:", review_tokens)


## 🎯 Sequence Classification with Attention Mask

Now let’s bring everything together and demonstrate a full classification pipeline using:

- `AutoTokenizer` to tokenize text
- `AutoModelForSequenceClassification` to load a pretrained classifier
- `attention_mask` to handle padded sequences
- `argmax` and `id2label` to get the predicted class


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer for sequence classification
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Sample sentence for classification
text ="It works as intended, though there’s nothing particularly special about it."

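# Tokenize and prepare tensors
# Note: tokenize() + convert_tokens_to_ids() does not add [CLS]/[SEP];
# calling tokenizer(text) directly would add them.
text_tokens = tokenizer.tokenize(text)
text_tokens_ids = tokenizer.convert_tokens_to_ids(text_tokens)

# The model expects a batch dimension, hence the extra list
text_input_ids = torch.tensor([text_tokens_ids])
print("Input IDs:", text_input_ids)

model_output = model(text_input_ids)
print("Logits:", model_output.logits)

# Map the highest-scoring class index to its label
predicted_index = torch.argmax(model_output.logits, dim=-1).item()
print("Predicted label:", model.config.id2label[predicted_index])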


## 🎯 Attention Masks & Padding
#### 🎯 What Is an Attention Mask?

Transformer models use **attention masks** to ignore padded tokens during computation.

#### Why?
Inputs of different lengths must be padded to the same length for batch processing. The attention mask marks each position:
- 1 → real token
- 0 → padding

This ensures the model **ignores padding** when computing attention weights.
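
One common way to implement this, shown here as a simplified pure-Python sketch, is to push the attention scores of padded positions to negative infinity before the softmax so that their weights become exactly zero. `masked_softmax` is a hypothetical helper, the scores are placeholders, and the sketch assumes at least one real token per sequence.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over `scores`, giving zero weight where attention_mask is 0.

    Assumes at least one position has attention_mask == 1.
    """
    # Padding positions get -inf so exp() turns them into exactly 0.
    masked = [s if keep else float("-inf") for s, keep in zip(scores, attention_mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Placeholder attention scores for a 4-token sequence whose last 2 tokens are padding.
scores = [2.0, 1.0, 3.0, 3.0]
mask = [1, 1, 0, 0]
print(masked_softmax(scores, mask))
```

Without the mask, the two padded positions would receive the largest weights here and distort the result.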

Let’s inspect the attention mask in action.




In [None]:
# Check tokenizer's pad token ID
tokenizer.pad_token_id

In [None]:
# Sample token ID list for a trimmed sentence
reviews_ids = [
    [101, 1045, 7078, 3866, 2023, 4031],
    [101, 2023, 4031, 2003, 2307, 102]
]


In [None]:
#trimmed reviews
reviews_ids = [
    [101,  1045,  7078,  3866,  2023,  102],
    [101,  2023,  2001,  1037,   102,   tokenizer.pad_token_id],
    [  101,  2009,  2573, tokenizer.pad_token_id, tokenizer.pad_token_id,tokenizer.pad_token_id]
]

attention_mask = [
    [1, 1, 1, 1, 1, 1],   # All tokens valid
    [1, 1, 1, 1, 1, 0],   # One pad
    [1, 1, 1, 0, 0, 0]    # Three pads
]

model_outputs = model(torch.tensor(reviews_ids), attention_mask=torch.tensor(attention_mask))
print(model_outputs.logits)

In [None]:
# Sample input reviews with different lengths
reviews = [
    "This movie was amazing!",
    "Terrible acting and awful plot. I don't recommend it"
]

# Tokenize with padding
reviews_ids = tokenizer(reviews, padding=True, return_tensors="pt")
reviews_ids['attention_mask']


### Longer Sequences

Transformer models have a limit on the length of the sequences we can pass them. Most models handle sequences of up to 512 or 1024 tokens and will fail when asked to process longer ones. There are two solutions to this problem:

- Use a model with a longer supported sequence length.
- Truncate your sequences.
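
Truncation itself is just slicing, as the sketch below shows. It also includes overlapping windows, a common alternative that this notebook does not cover, which splits a long sequence into several model-sized pieces instead of discarding the tail. Both helpers are hypothetical and operate on plain lists of IDs.

```python
def truncate(input_ids, max_length):
    """Keep only the first max_length IDs."""
    return input_ids[:max_length]

def sliding_windows(input_ids, max_length, stride):
    """Split a long sequence into overlapping windows of at most max_length IDs.

    Consecutive windows share `stride` IDs so context is kept across each cut.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    for start in range(0, len(input_ids), step):
        windows.append(input_ids[start:start + max_length])
        if start + max_length >= len(input_ids):
            break  # the last window already reaches the end
    return windows

ids = list(range(10))  # stand-in for 10 token IDs
print(truncate(ids, 4))
print(sliding_windows(ids, max_length=4, stride=1))
```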

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Sample input reviews with different lengths
review = "I absolutely loved this product!"
reviews =  ["I absolutely loved this product! ",
     "This was a complete disappointment.I will never order from this company again.",
     "It works as intended."
]

# Tokenize without padding: each sequence keeps its own length
review_tokens_ids = tokenizer(reviews)
print(review_tokens_ids)

## 📏 Padding Strategies

When batching sequences of different lengths, we need to **pad** or **truncate** them so that every row has the same shape.

Hugging Face’s `tokenizer()` supports several padding options:
- `padding=True` or `padding="longest"`: pads to the length of the longest sequence in the batch
- `padding="max_length"`: pads to the model’s maximum length (e.g., 512 for BERT)
- `padding="max_length", max_length=<value>`: pads to a user-defined length

We can also **truncate** sequences that are too long using `truncation=True`.

Let’s compare them in action using a batch of example reviews.

In [None]:
# 1️⃣ Padding up to the length of the longest sequence in the batch
reviews_tokens = tokenizer(reviews, padding="longest")
print("Padding: longest")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])


# 2️⃣ Padding up to the model's max length (e.g., 512 for BERT/DistilBERT)
reviews_tokens = tokenizer(reviews, padding="max_length")
print("Padding: max_length")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])


# 3️⃣ Padding to a user-defined max length (e.g., 8 tokens)
# Sequences longer than max_length are left as-is unless truncation=True is also set
reviews_tokens = tokenizer(reviews, padding="max_length", max_length=8)
print("Padding: max_length=8")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])


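# 4️⃣ Truncating sequences that exceed the model’s max length
reviews_tokens = tokenizer(reviews, truncation=True)
print("Truncation: model max length")
print("Input IDs:",reviews_tokens['input_ids'])


# 5️⃣ Truncating to a custom max length (e.g., 8 tokens)
# Sequences longer than max_length are cut; shorter ones are left unchanged
model_inputs = tokenizer(reviews, max_length=8, truncation=True)
print("Truncation: max_length=8")
print("Input IDs:",model_inputs['input_ids'])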



### 🧠 Summary: When to Use What

- Use `padding="longest"` for memory-efficient batching in training
- Use `padding="max_length"` when exporting models or working with fixed-length inputs
- Use `max_length=...` and `truncation=True` to limit inputs explicitly (e.g., for mobile inference)

These options ensure your inputs are shaped correctly for model inference or training.
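
To see the padding logic without any library involved, here is a minimal pure-Python sketch. The helper `pad_batch` is hypothetical and only approximates what `padding="longest"` and `padding="max_length"` produce; `pad_id=0` is an assumption that matches BERT-style vocabularies, and the token IDs are illustrative.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad every sequence in `batch` and build the matching attention masks.

    With max_length=None the target is the longest sequence in the batch
    (similar to padding="longest"); otherwise it is the given fixed length.
    Hypothetical helper for illustration, not part of the transformers API.
    """
    target = max(len(seq) for seq in batch) if max_length is None else max_length
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max(target - len(seq), 0)  # longer sequences are left untouched
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two illustrative sequences of different lengths
batch = [[101, 1045, 3866, 102], [101, 2009, 102]]

longest = pad_batch(batch)               # pad to the longest sequence (4)
fixed = pad_batch(batch, max_length=6)   # pad to a fixed length of 6

print(longest["input_ids"])
print(fixed["attention_mask"])
```

Padding to the longest sequence adds a single pad token here, while the fixed length of 6 adds five in total, which is why `"longest"` is usually the cheaper choice for batching.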


## 🔁 Tokenizer Output Formats: PyTorch, TensorFlow, NumPy

The Hugging Face `tokenizer()` can return outputs in different formats depending on the framework you're working with:

- `return_tensors="pt"` → returns **PyTorch tensors**
- `return_tensors="tf"` → returns **TensorFlow tensors**
- `return_tensors="np"` → returns **NumPy arrays**

This makes it seamless to integrate Transformers into any ML workflow.

Let’s compare them using the same review inputs.


In [None]:
# 1️⃣ PyTorch tensors
reviews_tokens = tokenizer(reviews, padding=True, return_tensors="pt")
print("🔶 PyTorch Tensors")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])

# 2️⃣ TensorFlow tensors (requires TensorFlow to be installed)
reviews_tokens = tokenizer(reviews, padding=True, return_tensors="tf")
print("🔷 TensorFlow Tensors")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])

# 3️⃣ NumPy arrays
reviews_tokens = tokenizer(reviews, padding=True, return_tensors="np")
print("🔶 NumPy Arrays")
print("Input IDs:",reviews_tokens['input_ids'])
print("Attention Mask:",reviews_tokens['attention_mask'])


### 🧠 When to Use Each Format

- Use `"pt"` for integration with **PyTorch** models and training loops
- Use `"tf"` when working in **TensorFlow/Keras**
- Use `"np"` for rapid prototyping or non-deep learning preprocessing

This flexibility lets the same tokenizer feed PyTorch, TensorFlow, or NumPy code 💡


In [None]:
# Tokenize a review and check its components

print("Original review text:",review)
review_tokenized = tokenizer(review)
print("Tokens Input IDs:",review_tokenized["input_ids"])
print("Attention Mask :",review_tokenized["attention_mask"])
print("=========")

review_tokens = review_tokenized.tokens()
print("Review tokens:",review_tokens)

review_tokens_ids = tokenizer.convert_tokens_to_ids(review_tokens)
print("Review tokens converted to token IDs:",review_tokens_ids)

review_tokens_ids_withoutspecial = tokenizer.encode(review, add_special_tokens=False)
print("Review tokens converted to token IDs without special tokens:",review_tokens_ids_withoutspecial)

print("=========")

print("Review Tokens decoded:",tokenizer.decode(review_tokens_ids))
# This checkpoint is uncased, so the recovered text comes back lowercased
print("Recover original review:",tokenizer.decode(review_tokens_ids, skip_special_tokens=True))

## ✅ Final Classification Pipeline

Let’s classify a review and convert the raw model logits into a human-readable label using:

- `softmax()` for probabilities
- `argmax()` for predicted class
- `id2label` for the actual label name
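
The three steps listed above can be sketched in plain Python on made-up numbers. The logits are invented for illustration, and the `id2label` mapping is an assumption about how a two-class sentiment model labels its outputs; with a real model, read it from `model.config.id2label`.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label mapping for a two-class sentiment model
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for a single review
logits = [-1.5, 2.5]

probs = softmax(logits)
# argmax: index of the highest probability
pred_index = max(range(len(probs)), key=probs.__getitem__)

print(id2label[pred_index], round(probs[pred_index], 3))
```

Softmax does not change which class wins, so `argmax` on the raw logits would pick the same label; the probabilities are useful when you also want a confidence score.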


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

short_reviews =  ["I absolutely loved this product! ",
     "This was a complete disappointment. I will never order from this company again.",
     "It works as intended."
]

# Tokenize our short reviews
short_reviews_tokens = tokenizer(short_reviews, padding=True, truncation=True, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    model_outputs = model(**short_reviews_tokens)

print("Model Logits Output:")
print(model_outputs.logits)

# Apply softmax to get probabilities
probs = torch.nn.functional.softmax(model_outputs.logits, dim=-1)

# Get the predicted class index for each review and map it to its label name
pred_index = torch.argmax(probs, dim=-1)
pred_labels = [model.config.id2label[index.item()] for index in pred_index]

print("Predicted Labels:", pred_labels)

## ✅ Final Summary & What's Next

🎉 Congratulations! You’ve just completed a comprehensive hands-on tour of the 🤗 **Hugging Face Transformers** library using the powerful and beginner-friendly `pipeline()` API.

---

### 🔍 What You’ve Learned:

- ✅ What the `pipeline()` function is and how it abstracts away model/tokenizer complexity
- ✅ How to run real-world tasks like:
  - `sentiment-analysis` for opinion mining  
  - `zero-shot-classification` for dynamic tagging  
  - `text-generation` for creative writing or assistants  
  - `fill-mask` for understanding model prediction  
  - `ner` and `question-answering` for knowledge extraction  
  - `summarization`, `translation`, `image-classification`, and `automatic-speech-recognition`
- ✅ How tokenizers work and why they must match the model
- ✅ What special tokens, attention masks, and padding do
- ✅ How to decode logits into predicted labels manually
- ✅ The role of `AutoModel` vs `AutoModelFor*` variants

---

### 📦 Takeaways:

- Hugging Face makes **state-of-the-art NLP, vision, and speech** tasks accessible in minutes
- `pipeline()` is perfect for prototyping and learning
- For production and research, knowing what happens under the hood gives you real power

---

### 🚀 What's Next?

Here are some recommended next steps to continue your journey:

- 📚 Fine-tune your own models using the `Trainer` API
- 🧪 Explore Hugging Face Datasets and Evaluate libraries
- 🧠 Dive deeper into model internals with `transformers` + `accelerate`
- 🎓 Subscribe and stay tuned for **my next video**

---

### 🙌 Support the Channel

If you found this notebook helpful:
- 👍 Like & Subscribe to the [almoghalisAI YouTube channel](https://www.youtube.com/@almoghalisAI)
- ⭐ Star the repo on [GitHub](https://github.com/almoghalisAI)
- 💬 Share your feedback and questions in the comments

Thanks for learning with us! See you in the next tutorial 👋
