# 02 - NLP Pipelines Deep Dive

This notebook covers all major NLP pipeline types:
- Text Classification (sentiment, topic, etc.)
- Token Classification (NER, POS tagging)
- Question Answering
- Text Generation
- Summarization
- Translation
- Fill-Mask
- Zero-Shot Classification
- Feature Extraction (Embeddings)
- Text-to-Text Generation (T5, FLAN-T5)

In [None]:
from transformers import pipeline
import torch

---
## 1. Text Classification

Classify an entire text into predefined categories.

**Use Cases:**
- Sentiment analysis
- Spam detection
- Topic categorization
- Intent detection

In [None]:
# Sentiment Analysis
sentiment = pipeline("sentiment-analysis")

texts = [
    "I absolutely love this new feature!",
    "This is the worst experience I've ever had.",
    "It's okay, nothing special."
]

results = sentiment(texts)
for text, result in zip(texts, results):
    print(f"{result['label']:8} ({result['score']:.2f}): {text}")

In [None]:
# Multi-class Classification (5-star ratings)
rating_classifier = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)

reviews = [
    "Excellent product! Exceeded all expectations.",
    "It works but could be better.",
    "Terrible quality, complete waste of money."
]

for review in reviews:
    result = rating_classifier(review)[0]
    stars = result['label']  # e.g., "5 stars"
    print(f"{stars}: {review[:50]}...")
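
Star labels are easier to aggregate once they are numbers. The sketch below assumes labels of the form `"5 stars"` / `"1 star"`, as in the comment above; `stars_to_int` and `sample_results` are hypothetical names with made-up scores, not real model output.

```python
def stars_to_int(label):
    """Parse a label such as '5 stars' or '1 star' into an integer."""
    return int(label.split()[0])

# Hypothetical pipeline outputs (scores are illustrative only)
sample_results = [
    {"label": "5 stars", "score": 0.81},
    {"label": "3 stars", "score": 0.47},
    {"label": "1 star", "score": 0.76},
]

ratings = [stars_to_int(r["label"]) for r in sample_results]
average_rating = sum(ratings) / len(ratings)
print(f"Ratings: {ratings}, average: {average_rating:.1f}")
```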

In [None]:
# Get top-k predictions
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# Return all class scores
result = classifier("This movie was pretty good but had some issues.", top_k=None)
print("All class scores:")
for item in result:
    print(f"  {item['label']}: {item['score']:.4f}")
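
To see where these scores come from, here is a minimal sketch of the usual last step of a classifier: softmax over raw logits, then sorting. The logits are invented for illustration and `rank_labels` is a hypothetical helper, not part of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, labels, top_k=None):
    """Return label/score dicts sorted by score, optionally cut to top_k."""
    scores = softmax(logits)
    ranked = sorted(
        ({"label": label, "score": score} for label, score in zip(labels, scores)),
        key=lambda item: item["score"],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]

# Hypothetical logits for a two-class sentiment model
example_logits = [-0.4, 1.1]
example_labels = ["NEGATIVE", "POSITIVE"]

all_scores = rank_labels(example_logits, example_labels)
for item in all_scores:
    print(f"  {item['label']}: {item['score']:.4f}")
```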

---
## 2. Token Classification (NER, POS)

Classify each token in a sequence.

**Use Cases:**
- Named Entity Recognition (NER)
- Part-of-Speech tagging
- Chunking

In [None]:
# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")  # merges sub-word tokens into whole entities (replaces the deprecated grouped_entities=True)

text = "Elon Musk founded SpaceX in Hawthorne, California. Tesla's headquarters moved to Austin, Texas in 2021."

entities = ner(text)
print("Entities found:")
for entity in entities:
    print(f"  {entity['entity_group']:12} | {entity['word']:20} | Score: {entity['score']:.3f}")

In [None]:
# Without grouping - see individual tokens
ner_ungrouped = pipeline("ner", aggregation_strategy="none")  # replaces the deprecated grouped_entities=False

text = "Barack Obama was born in Hawaii."
entities = ner_ungrouped(text)

print("Token-level entities:")
for entity in entities:
    print(f"  {entity['entity']:12} | {entity['word']:15} | Score: {entity['score']:.3f}")

In [None]:
# Entity types explained:
# B-PER = Beginning of Person name
# I-PER = Inside/Continuation of Person name
# B-ORG = Beginning of Organization
# B-LOC = Beginning of Location
# B-MISC = Miscellaneous entity

# Real-world example: Extract entities for analysis
news = """
Microsoft announced today that CEO Satya Nadella will visit the European Union 
headquarters in Brussels next week to discuss AI regulations with Margrethe Vestager.
"""

entities = ner(news)
print("\nExtracted from news article:")
for e in entities:
    print(f"  [{e['entity_group']}] {e['word']}")
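
The B-/I- scheme described above can be merged into whole entities with a short loop. This is a simplified word-level sketch; the real pipeline also has to handle sub-word tokens and scores. `group_bio_tags` and the hand-written tags are hypothetical.

```python
def group_bio_tags(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    entities = []
    current = None  # [entity_type, list_of_words] for the span being built
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new span, closing any open one
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = [entity_type, [token]]
        elif prefix == "I" and current and current[0] == entity_type:
            current[1].append(token)
        else:
            # "O" (or an I- tag with no matching open span) closes the span
            if current:
                entities.append((current[0], " ".join(current[1])))
            current = None
    if current:
        entities.append((current[0], " ".join(current[1])))
    return entities

# Hand-written example tags, not model output
tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
grouped = group_bio_tags(tokens, tags)
print(grouped)
```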

---
## 3. Question Answering

Extract answers from a given context.

**Use Cases:**
- Customer support bots
- Document search
- Knowledge extraction

In [None]:
qa = pipeline("question-answering")

context = """
Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
The company is headquartered in New York City. They created the Transformers library,
which has become the most popular library for natural language processing. The library
supports PyTorch, TensorFlow, and JAX frameworks. As of 2024, the Hugging Face Hub
hosts over 500,000 models.
"""

questions = [
    "When was Hugging Face founded?",
    "Who founded Hugging Face?",
    "Where is the company headquartered?",
    "How many models are on the Hub?"
]

for question in questions:
    result = qa(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']} (confidence: {result['score']:.3f})")
    print()

In [None]:
# Get multiple possible answers
result = qa(
    question="Who founded the company?",
    context=context,
    top_k=3  # Get top 3 answers
)

print("Top 3 possible answers:")
for i, ans in enumerate(result, 1):
    print(f"{i}. {ans['answer']} (score: {ans['score']:.3f}, position: {ans['start']}-{ans['end']})")
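
Extractive QA models typically score every position as a possible answer start and end, then pick the best-scoring span. The sketch below shows that selection step on invented word-level scores; `best_spans` and its `max_answer_len` parameter are hypothetical, and a real pipeline works on token logits and maps them back to the character offsets printed above.

```python
def best_spans(start_scores, end_scores, top_k=3, max_answer_len=15):
    """Score every valid (start, end) pair and return the top_k spans."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):  # end never precedes start
            candidates.append((start_score + end_scores[end], start, end))
    candidates.sort(reverse=True)
    return candidates[:top_k]

# Invented scores for illustration only
words = ["Hugging", "Face", "was", "founded", "in", "2016"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.3, 4.0]
end_scores = [0.0, 0.2, 0.0, 0.1, 0.1, 5.0]

top = best_spans(start_scores, end_scores, top_k=2)
for score, start, end in top:
    print(f"{' '.join(words[start:end + 1])!r} (score: {score:.1f})")
```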

---
## 4. Text Generation

Generate text continuations.

**Use Cases:**
- Chatbots
- Creative writing
- Code completion
- Auto-complete

In [None]:
generator = pipeline("text-generation", model="gpt2")

prompt = "The key to successful machine learning is"

# Basic generation
result = generator(prompt, max_length=50, num_return_sequences=1)
print("Generated text:")
print(result[0]['generated_text'])

In [None]:
# Generation with different sampling strategies

# 1. Greedy (deterministic)
result_greedy = generator(
    prompt,
    max_new_tokens=30,
    do_sample=False  # Greedy decoding
)
print("Greedy:")
print(result_greedy[0]['generated_text'])
print()

In [None]:
# 2. With temperature (controls randomness)
# Lower = more focused, Higher = more creative/random
result_temp = generator(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.9  # Higher = more random
)
print("Temperature=0.9 (creative):")
print(result_temp[0]['generated_text'])
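
Temperature divides the logits before softmax, which sharpens or flattens the distribution. A minimal sketch with invented logits (`softmax_with_temperature` is a hypothetical helper):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]  # invented values
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the top token; higher ones spread it across the alternatives.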

In [None]:
# 3. Top-k sampling
result_topk = generator(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    top_k=50  # Only sample from top 50 tokens
)
print("Top-k=50:")
print(result_topk[0]['generated_text'])
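
Conceptually, top-k sampling keeps only the k most probable next tokens and renormalizes before sampling. A small sketch on an invented distribution (`top_k_filter` is a hypothetical helper):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, renormalize, zero out the rest."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # invented next-token probabilities
filtered_k = top_k_filter(toy_probs, k=2)
print([round(p, 3) for p in filtered_k])
```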

In [None]:
# 4. Nucleus (top-p) sampling
result_topp = generator(
    prompt,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.92  # Sample from the smallest token set whose cumulative probability reaches 0.92
)
print("Top-p=0.92 (nucleus):")
print(result_topp[0]['generated_text'])
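
Unlike top-k, nucleus sampling keeps a variable number of tokens: as many as are needed to reach the probability threshold. A sketch on an invented distribution (`top_p_filter` is a hypothetical helper):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

toy_probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # invented next-token probabilities
filtered_p = top_p_filter(toy_probs, p=0.8)
print([round(x, 3) for x in filtered_p])
```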

In [None]:
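# 5. Multiple sequences with beam search
result_beam = generator(
    prompt,
    max_new_tokens=30,
    num_beams=5,  # Beam search with 5 beams
    do_sample=False,  # Pure beam search; the pipeline's defaults may otherwise enable sampling
    num_return_sequences=3,
    no_repeat_ngram_size=2  # Never repeat the same 2-gram
)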
print("Beam search (3 sequences):")
for i, seq in enumerate(result_beam, 1):
    print(f"{i}. {seq['generated_text']}")
    print()
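
Beam search keeps several partial sequences alive at each step instead of committing to the single best token. The toy example below uses a made-up next-token table (`TOY_MODEL`) in place of a real model, to show a case where that changes the result.

```python
import math

# Hypothetical next-token table: last token -> {next token: probability}
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
}

def beam_search(num_beams, steps):
    """Keep the num_beams highest-scoring partial sequences at each step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for log_prob, tokens in beams:
            for token, prob in TOY_MODEL.get(tokens[-1], {}).items():
                candidates.append((log_prob + math.log(prob), tokens + [token]))
        if not candidates:  # no beam can be extended further
            break
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
    return beams

for width in (1, 2):
    log_prob, tokens = beam_search(num_beams=width, steps=2)[0]
    print(f"num_beams={width}: {' '.join(tokens[1:])} (prob: {math.exp(log_prob):.2f})")
```

With one beam (greedy), the search commits to "the" and ends at "the cat"; with two beams it finds the more probable "a cat".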

---
## 5. Summarization

Condense long text into shorter summaries.

**Use Cases:**
- News summarization
- Document TL;DR
- Meeting notes
- Email summarization

In [None]:
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = """
Artificial intelligence has made remarkable strides in recent years, transforming 
industries from healthcare to finance. Machine learning models can now diagnose 
diseases with accuracy rivaling human doctors, while natural language processing 
enables chatbots to handle customer service inquiries with increasing sophistication.

However, these advances come with significant challenges. Concerns about AI bias, 
job displacement, and the potential misuse of AI-generated content have led to 
calls for stricter regulation. The European Union has proposed the AI Act, which 
would establish strict rules for high-risk AI applications.

Despite these challenges, investment in AI continues to grow. Tech giants and 
startups alike are racing to develop more powerful and efficient AI systems. 
The development of large language models like GPT-4 and Claude has demonstrated 
the potential for AI systems that can engage in complex reasoning and creative tasks.

Looking ahead, experts predict that AI will become increasingly integrated into 
everyday life, from autonomous vehicles to personalized education. The key 
challenge will be ensuring that these powerful tools are developed and deployed 
responsibly, with appropriate safeguards to protect privacy and prevent misuse.
"""

summary = summarizer(article, max_length=100, min_length=30, do_sample=False)
print("Summary:")
print(summary[0]['summary_text'])

In [None]:
# Control summary length
short_summary = summarizer(article, max_length=50, min_length=20)
print("Short summary:")
print(short_summary[0]['summary_text'])
print()

long_summary = summarizer(article, max_length=150, min_length=80)
print("Longer summary:")
print(long_summary[0]['summary_text'])

---
## 6. Translation

Convert text between languages.

**Use Cases:**
- Content localization
- Real-time translation
- Multi-language support

In [None]:
# English to French
translator_en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")

text = "Machine learning is transforming the way we interact with technology."
result = translator_en_fr(text)
print(f"English: {text}")
print(f"French:  {result[0]['translation_text']}")

In [None]:
# English to German
translator_en_de = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")

result = translator_en_de(text)
print(f"German:  {result[0]['translation_text']}")

In [None]:
# English to Spanish
translator_en_es = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

result = translator_en_es(text)
print(f"Spanish: {result[0]['translation_text']}")

---
## 7. Fill-Mask (Masked Language Modeling)

Predict masked tokens in text.

**Use Cases:**
- Autocomplete
- Text correction
- Understanding model knowledge

In [None]:
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Note: Use [MASK] token for BERT-based models
text = "Paris is the [MASK] of France."

predictions = fill_mask(text)
print(f"Input: {text}")
print("\nPredictions:")
for pred in predictions:
    print(f"  {pred['token_str']:15} (score: {pred['score']:.4f})")

In [None]:
# RoBERTa uses <mask> token
fill_mask_roberta = pipeline("fill-mask", model="roberta-base")

text = "The <mask> jumped over the lazy dog."
predictions = fill_mask_roberta(text)

print(f"Input: {text}")
print("\nPredictions:")
for pred in predictions[:5]:
    print(f"  {pred['token_str']:15} (score: {pred['score']:.4f})")
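
Because the mask token differs between model families, it can help to write inputs with a neutral placeholder and substitute the right token later. A small sketch; `with_mask` and the `{mask}` placeholder are hypothetical conventions, not part of `transformers`.

```python
def with_mask(template, mask_token, placeholder="{mask}"):
    """Replace a neutral placeholder with the model's own mask token."""
    return template.replace(placeholder, mask_token)

template = "Paris is the {mask} of France."
bert_text = with_mask(template, "[MASK]")
roberta_text = with_mask(template, "<mask>")
print(bert_text)
print(roberta_text)

# With a loaded pipeline, the token can be read from its tokenizer instead of
# being hard-coded, e.g. with_mask(template, fill_mask.tokenizer.mask_token)
```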

In [None]:
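# Multiple masks
text = "The [MASK] [MASK] is a large language model."
# Each mask is predicted independently, so the combined words may not fit together.
# Recent transformers versions return one list of predictions per mask.
for i, mask_preds in enumerate(fill_mask(text), 1):
    print(f"Mask {i}: {[p['token_str'] for p in mask_preds[:3]]}")
# For multi-token gaps, consider an autoregressive or text-to-text model instead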

---
## 8. Zero-Shot Classification

Classify text into categories without training on those specific categories.

**Use Cases:**
- Flexible categorization
- Rapid prototyping
- When you don't have labeled training data

In [None]:
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The new MacBook Pro features an M3 chip with improved performance and battery life."
candidate_labels = ["technology", "sports", "politics", "entertainment", "business"]

result = zero_shot(text, candidate_labels)

print(f"Text: {text}")
print("\nClassification:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label:15} {score:.4f}")

In [None]:
# Multi-label classification
text = "The company announced record profits while also laying off 10% of its workforce."
labels = ["business", "employment", "finance", "positive news", "negative news"]

result = zero_shot(text, labels, multi_label=True)  # Allow multiple labels

print(f"Text: {text}")
print("\nMulti-label classification:")
for label, score in zip(result['labels'], result['scores']):
    marker = "✓" if score > 0.5 else " "
    print(f"  {marker} {label:15} {score:.4f}")
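
Why do multi-label scores not sum to 1? A rough sketch of the idea, assuming the pipeline scores each candidate label with an NLI model and that `multi_label` changes only how those logits are normalized. The logits below are invented, and the real pipeline's details may differ.

```python
import math

def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical NLI logits per candidate label: (contradiction, entailment)
nli_logits = {
    "business": (-1.0, 3.0),
    "employment": (-0.5, 2.5),
    "sports": (2.0, -2.0),
}

# Single-label: softmax over entailment logits across labels (scores sum to 1)
names = list(nli_logits)
single = dict(zip(names, softmax([nli_logits[n][1] for n in names])))

# Multi-label: each label gets its own contradiction-vs-entailment softmax
multi = {n: softmax(list(pair))[1] for n, pair in nli_logits.items()}

for n in names:
    print(f"  {n:12} single={single[n]:.3f}  multi={multi[n]:.3f}")
```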

In [None]:
# Custom use case: Support ticket routing
tickets = [
    "I can't log into my account, password reset isn't working.",
    "When will my order arrive? It's been 2 weeks.",
    "I want a refund, the product is damaged.",
    "How do I upgrade my subscription plan?"
]

categories = ["login issues", "shipping", "refunds", "billing", "product issues"]

print("Support Ticket Router:\n")
for ticket in tickets:
    result = zero_shot(ticket, categories)
    print(f"Ticket: {ticket[:50]}...")
    print(f"Route to: {result['labels'][0]} ({result['scores'][0]:.2f})")
    print()
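
In practice a router usually needs a fallback for tickets the model is unsure about. A minimal sketch that works on the `labels`/`scores` structure used above; `route_ticket`, the 0.5 threshold, and the `"manual review"` queue are hypothetical choices, and the sample results are made up.

```python
def route_ticket(result, threshold=0.5, fallback="manual review"):
    """Return the top label, or a fallback queue when confidence is low."""
    top_label, top_score = result["labels"][0], result["scores"][0]
    return top_label if top_score >= threshold else fallback

# Made-up results in the same shape as the zero-shot pipeline output
confident = {"labels": ["shipping", "refunds"], "scores": [0.82, 0.10]}
uncertain = {"labels": ["billing", "login issues"], "scores": [0.31, 0.29]}

print(route_ticket(confident))
print(route_ticket(uncertain))
```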

---
## 9. Feature Extraction (Embeddings)

Get vector representations (embeddings) of text.

**Use Cases:**
- Semantic search
- Clustering
- Similarity comparison
- RAG (Retrieval Augmented Generation)

In [None]:
feature_extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

text = "Machine learning is a subset of artificial intelligence."

# Get embeddings
embeddings = feature_extractor(text)

import numpy as np
embeddings = np.array(embeddings)

print(f"Embedding shape: {embeddings.shape}")
print(f"  - Batch size: {embeddings.shape[0]}")
print(f"  - Sequence length: {embeddings.shape[1]}")
print(f"  - Hidden dimension: {embeddings.shape[2]}")

In [None]:
# Get sentence embedding by mean pooling
def get_sentence_embedding(text, extractor):
    features = extractor(text)
    # Mean pool across sequence dimension
    embedding = np.array(features).mean(axis=1).squeeze()
    return embedding

sentence_embedding = get_sentence_embedding(text, feature_extractor)
print(f"Sentence embedding shape: {sentence_embedding.shape}")
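
A plain mean works for a single sentence. When several sentences are batched and padded to the same length, the padding positions should be excluded from the average. The sketch below shows this in plain Python on invented vectors; `masked_mean_pool` is a hypothetical helper, and the mask follows the usual 1 = real token, 0 = padding convention.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Two real tokens followed by one padding position (invented values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)
```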

In [None]:
# Compute similarity between sentences
from numpy.linalg import norm

def cosine_similarity(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

sentences = [
    "I love machine learning.",
    "AI is fascinating.",
    "The weather is nice today.",
    "Deep learning is powerful."
]

embeddings = [get_sentence_embedding(s, feature_extractor) for s in sentences]

# Compare first sentence to all others
print(f"Comparing: '{sentences[0]}'\n")
for i, sent in enumerate(sentences[1:], 1):
    sim = cosine_similarity(embeddings[0], embeddings[i])
    print(f"  vs '{sent}': {sim:.4f}")
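
The same similarity score is the core of semantic search: embed the query, then sort the documents by similarity to it. A self-contained sketch with tiny invented vectors standing in for real embeddings (`rank_by_similarity` and `toy_corpus` are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, corpus):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), text) for text, vec in corpus.items()]
    return sorted(scored, reverse=True)

# Invented 3-dimensional "embeddings" for illustration only
toy_corpus = {
    "AI is fascinating.": [0.9, 0.1, 0.0],
    "The weather is nice today.": [0.0, 0.2, 0.9],
    "Deep learning is powerful.": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query, toy_corpus)
for score, sentence in ranking:
    print(f"  {score:.4f}  {sentence}")
```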

---
## 10. Text-to-Text Generation (T5, FLAN-T5)

General-purpose text transformation.

**Use Cases:**
- Paraphrasing
- Grammar correction
- Style transfer
- Multi-task NLP

In [None]:
# T5-style models take the task as a text prefix; FLAN-T5 also follows plain instructions
t5 = pipeline("text2text-generation", model="google/flan-t5-base")

# Summarization
result = t5("summarize: The quick brown fox jumps over the lazy dog. The dog was sleeping in the sun.")
print(f"Summary: {result[0]['generated_text']}")

# Translation
result = t5("translate English to German: Hello, how are you?")
print(f"Translation: {result[0]['generated_text']}")

# Question answering
result = t5("answer: What is the capital of France?")
print(f"Answer: {result[0]['generated_text']}")

---
## 🎯 Pipeline Selection Guide

| I want to... | Use this pipeline | Recommended model |
|--------------|-------------------|-------------------|
| Analyze sentiment | `sentiment-analysis` | distilbert-base-uncased-finetuned-sst-2-english |
| Extract entities | `ner` | dslim/bert-base-NER |
| Answer questions | `question-answering` | deepset/roberta-base-squad2 |
| Generate text | `text-generation` | gpt2, meta-llama/Llama-2-7b-hf |
| Summarize text | `summarization` | facebook/bart-large-cnn |
| Translate | `translation_xx_to_yy` | Helsinki-NLP/opus-mt-* |
| Classify flexibly | `zero-shot-classification` | facebook/bart-large-mnli |
| Get embeddings | `feature-extraction` | sentence-transformers/* |

## Next Steps

Continue to [03_multimodal_pipelines.ipynb](03_multimodal_pipelines.ipynb) for vision and audio pipelines!