# Text Summarization with Hugging Face 📄

Text summarization creates concise versions of longer documents while preserving the most important information. From news articles to research papers, summarization helps us quickly understand the essence of lengthy content.

## What is Text Summarization?

**Text Summarization** condenses text while maintaining key meaning:
- **Input**: Long document or article
- **Output**: Shorter text containing main points
- **Types**: Extractive (select sentences) vs Abstractive (generate new text)
- **Examples**: News summaries, research abstracts, meeting notes
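
To make the extractive vs abstractive distinction concrete, here is a minimal extractive sketch in plain Python. It scores sentences by average word frequency and returns the top ones verbatim, so nothing new is generated. The function name `extractive_summary`, the regex-based sentence split, and the "skip words of three letters or fewer" filter are illustrative assumptions, not part of any library.

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences and return them in original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    if len(sentences) <= num_sentences:
        return " ".join(sentences)

    # Word frequencies over the whole document; short words are skipped
    # as a crude stand-in for a stop-word list (an assumption of this sketch).
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, keep the best, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in keep)
```

An abstractive model, like the pipelines used in the rest of this notebook, instead generates new wording and may produce sentences that never appear in the source.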

## Learning Objectives

By the end of this notebook, you'll know how to:
1. Use pre-trained summarization models effectively
2. Control summary length and style parameters
3. Compare different summarization approaches
4. Handle various document types and domains
5. Evaluate and improve summary quality
6. Build production-ready summarization systems

Let's start condensing information efficiently! 🚀

In [None]:
# Import essential libraries
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, Trainer, DataCollatorForSeq2Seq
from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
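import warnings

# rouge-score is optional; the evaluation section checks for it again before scoring
try:
    from rouge_score import rouge_scorer
except ImportError:
    rouge_scorer = None

# Silences library warnings to keep output short; comment this out when debugging
warnings.filterwarnings('ignore')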

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

print("Libraries loaded successfully!")

## 1. Basic Text Summarization

Let's start with pre-trained summarization models:

In [None]:
# Basic text summarization with pre-trained models
print("📄 Basic Text Summarization")
print("=" * 29)

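# Load summarization pipeline. Naming the checkpoint keeps results reproducible;
# sshleifer/distilbart-cnn-12-6 is what the bare "summarization" task resolves to
# at the time of writing. Passing `device` uses the GPU detected above when available.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device=device)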

# Sample news articles
sample_articles = [
    {
        "title": "AI Technology Breakthrough",
        "text": """
        Artificial intelligence researchers at Stanford University have announced a significant breakthrough 
        in natural language processing that could revolutionize how computers understand human communication. 
        The new model, called GPT-Advanced, demonstrates unprecedented accuracy in language translation, 
        text summarization, and question answering tasks. According to the research team, led by Dr. Sarah 
        Johnson, the model achieved a 95% accuracy rate on standard benchmarks, surpassing previous 
        state-of-the-art models by 15%. The breakthrough comes from a novel training approach that 
        combines supervised learning with reinforcement learning techniques. The researchers trained 
        the model, which has over 100 billion parameters, on a diverse dataset, making it one of the 
        largest language models ever created. Industry experts believe this advancement could have 
        far-reaching implications for various applications, including virtual assistants, automated 
        customer service, and educational tools. The research paper has been submitted to the 
        International Conference on Machine Learning and is expected to be published next month.
        """
    },
    {
        "title": "Climate Change Report",
        "text": """
        The latest report from the Intergovernmental Panel on Climate Change (IPCC) presents alarming 
        findings about the current state of global warming and its projected impacts on human civilization. 
        The comprehensive study, involving over 200 scientists from 50 countries, indicates that global 
        temperatures have risen by 1.2 degrees Celsius since pre-industrial times, with the last decade 
        being the warmest on record. The report highlights several critical tipping points that could 
        trigger irreversible changes in Earth's climate system, including the collapse of major ice 
        sheets in Antarctica and Greenland, the shutdown of ocean circulation patterns, and the 
        widespread loss of tropical forests. Scientists warn that without immediate and drastic 
        reductions in greenhouse gas emissions, these tipping points could be reached within the 
        next two decades. The economic implications are staggering, with potential damages estimated 
        at over $20 trillion globally by 2050. The report calls for unprecedented international 
        cooperation to transition to renewable energy sources and implement carbon capture technologies.
        """
    }
]

for i, article in enumerate(sample_articles, 1):
    print(f"\n📰 Article {i}: {article['title']}")
    print(f"Original length: {len(article['text'].split())} words")
    
    # Generate summary
    summary = summarizer(article['text'], max_length=100, min_length=30, do_sample=False)
    summary_text = summary[0]['summary_text']
    
    print(f"Summary length: {len(summary_text.split())} words")
    print(f"Compression ratio: {len(summary_text.split()) / len(article['text'].split()):.2f}")
    print(f"\n📝 Summary: {summary_text}")
    print("-" * 60)
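
The `max_length` and `min_length` arguments used above are counted in model tokens, and a fixed `max_length=100` can exceed the length of a short input. One possible way to avoid that is to derive the bounds from the input size. The helper below is a sketch under stated assumptions: `suggest_length_bounds`, its ratios, and the cap of 128 tokens are hypothetical choices, not library defaults.

```python
def suggest_length_bounds(n_input_tokens: int, max_ratio: float = 0.5,
                          min_ratio: float = 0.3, cap: int = 128) -> dict:
    """Return min_length/max_length (in tokens) scaled to the input size."""
    if n_input_tokens <= 0:
        raise ValueError("n_input_tokens must be positive")
    # Upper bound: a fraction of the input, never above `cap`, never below 2.
    max_length = max(2, min(cap, int(n_input_tokens * max_ratio)))
    # Lower bound: a fraction of the upper bound, so it always stays below it.
    min_length = max(1, int(max_length * min_ratio))
    return {"min_length": min_length, "max_length": max_length}

# Usage sketch (assumes the `summarizer` pipeline defined earlier in this notebook):
# n_tokens = len(summarizer.tokenizer(article["text"])["input_ids"])
# summary = summarizer(article["text"], do_sample=False, **suggest_length_bounds(n_tokens))
```

Because the word counts printed above are whitespace-split words, they will usually be somewhat lower than the token counts the model actually sees.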

## 2. Controlling Summary Parameters

Learn how to adjust summary length and decoding strategy:

In [None]:
# Advanced summarization with parameter control
print("⚙️ Controlling Summary Parameters")
print("=" * 34)

# Sample long text for testing different parameters
long_text = """
Machine learning has transformed numerous industries over the past decade, from healthcare and 
finance to transportation and entertainment. In healthcare, ML algorithms are being used to 
analyze medical images, predict patient outcomes, and discover new drugs. For example, 
Google's DeepMind developed an AI system that can detect over 50 eye diseases with 
unprecedented accuracy. In finance, algorithmic trading systems process millions of 
transactions per second, while fraud detection systems protect consumers from unauthorized 
activities. The transportation industry has seen remarkable advances with autonomous vehicles, 
where companies like Tesla and Waymo are leading the development of self-driving cars. 
Entertainment platforms like Netflix and Spotify use recommendation algorithms to personalize 
content for billions of users worldwide. However, the rapid adoption of ML also raises 
important ethical considerations, including privacy concerns, algorithmic bias, and job 
displacement. Researchers and policymakers are working together to develop guidelines and 
regulations that ensure AI systems are fair, transparent, and beneficial to society. The 
future of machine learning looks promising, with emerging technologies like quantum computing 
and neuromorphic chips potentially unlocking even more powerful AI capabilities.
"""

print(f"Original text: {len(long_text.split())} words\n")

# Test different summary lengths
summary_configs = [
    {"name": "Short Summary", "max_length": 50, "min_length": 20},
    {"name": "Medium Summary", "max_length": 100, "min_length": 40},
    {"name": "Long Summary", "max_length": 150, "min_length": 80}
]

for config in summary_configs:
    # min_length/max_length are measured in model tokens, not words
    print(f"🔧 {config['name']} ({config['min_length']}-{config['max_length']} tokens):")
    
    summary = summarizer(
        long_text, 
        max_length=config['max_length'], 
        min_length=config['min_length'], 
        do_sample=False
    )
    
    summary_text = summary[0]['summary_text']
    actual_length = len(summary_text.split())
    compression = actual_length / len(long_text.split())
    
    print(f"   Actual length: {actual_length} words")
    print(f"   Compression: {compression:.2f} ({compression*100:.1f}% of original)")
    print(f"   Text: {summary_text}")
    print()

# Test different sampling strategies
print("🎲 Different Sampling Strategies:")
print("-" * 35)

sampling_configs = [
    {"name": "Deterministic", "do_sample": False, "temperature": None},
    {"name": "Creative (High Temp)", "do_sample": True, "temperature": 1.2},
    {"name": "Conservative (Low Temp)", "do_sample": True, "temperature": 0.3}
]

for config in sampling_configs:
    print(f"\n🔄 {config['name']}:")
    
    if config['temperature']:
        summary = summarizer(
            long_text, 
            max_length=80, 
            min_length=40,
            do_sample=config['do_sample'],
            temperature=config['temperature']
        )
    else:
        summary = summarizer(
            long_text, 
            max_length=80, 
            min_length=40,
            do_sample=config['do_sample']
        )
    
    print(f"   {summary[0]['summary_text']}")
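
The sampling loop above repeats the pipeline call in both branches. A small helper that assembles the keyword arguments once is one way to tidy this up and to add beam-search options. This is a sketch: `build_generation_kwargs` is a hypothetical name, while `do_sample`, `temperature`, `top_p`, `num_beams`, and `no_repeat_ngram_size` are standard generation arguments that the pipeline forwards to the model.

```python
def build_generation_kwargs(config: dict, max_length: int = 80, min_length: int = 40) -> dict:
    """Merge length bounds with a strategy config, omitting unset (None) options."""
    allowed = {"do_sample", "temperature", "top_p", "num_beams", "no_repeat_ngram_size"}
    kwargs = {"max_length": max_length, "min_length": min_length}
    for key, value in config.items():
        # Ignore display-only keys such as "name" and options left as None.
        if key in allowed and value is not None:
            kwargs[key] = value
    # Temperature and top_p only have an effect when sampling is enabled.
    if not kwargs.get("do_sample", False):
        kwargs.pop("temperature", None)
        kwargs.pop("top_p", None)
    return kwargs

# Usage sketch (assumes `summarizer` and `long_text` from the cells above):
# summary = summarizer(long_text, **build_generation_kwargs(
#     {"do_sample": False, "num_beams": 4, "no_repeat_ngram_size": 3}))
```

Sampled summaries vary from run to run, so deterministic decoding is usually the easier choice when outputs need to be compared or evaluated.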

## 3. Multi-Document Summarization

Summarize multiple related documents into a cohesive summary:

In [None]:
# Multi-document summarization
print("📚 Multi-Document Summarization")
print("=" * 33)

# Related articles about a common topic
related_articles = [
    {
        "source": "Tech News Daily",
        "content": """
        Apple announced its latest iPhone 15 series featuring significant improvements in camera 
        technology and battery life. The new devices include advanced computational photography 
        capabilities powered by the A17 Pro chip. Industry analysts expect strong sales during 
        the holiday season, with pre-orders already exceeding initial supply. The company also 
        introduced new sustainability features, including recycled materials and carbon-neutral 
        packaging.
        """
    },
    {
        "source": "Business Weekly",
        "content": """
        Apple's stock price surged 5% following the iPhone 15 announcement, with investors 
        showing confidence in the company's continued innovation. Market research firms predict 
        the new iPhone will capture significant market share from Android competitors. The 
        improved camera features and longer battery life address key consumer demands. Apple's 
        services revenue is also expected to benefit from increased hardware sales.
        """
    },
    {
        "source": "Consumer Reports",
        "content": """
        Early reviews of the iPhone 15 highlight exceptional photo quality and impressive 
        all-day battery performance. The new titanium design feels premium while being 
        lighter than previous models. However, the higher price point may limit adoption 
        among budget-conscious consumers. The improved durability and water resistance 
        features received positive feedback from reviewers.
        """
    }
]

class MultiDocumentSummarizer:
    def __init__(self):
        self.summarizer = pipeline("summarization")
    
    def summarize_collection(self, documents, strategy="concatenate"):
        """Summarize multiple documents using different strategies"""
        
        if strategy == "concatenate":
            # Combine all documents and summarize together
            combined_text = " ".join([doc["content"] for doc in documents])
            summary = self.summarizer(combined_text, max_length=120, min_length=50, do_sample=False)
            return summary[0]['summary_text']
        
        elif strategy == "individual_then_combine":
            # Summarize each document individually, then combine summaries
            individual_summaries = []
            
            for doc in documents:
                summary = self.summarizer(doc["content"], max_length=50, min_length=20, do_sample=False)
                individual_summaries.append(summary[0]['summary_text'])
            
            # Combine individual summaries
            combined_summaries = " ".join(individual_summaries)
            final_summary = self.summarizer(combined_summaries, max_length=100, min_length=40, do_sample=False)
            return final_summary[0]['summary_text']
        
        elif strategy == "extractive":
            # Extract key sentences from each document
            key_sentences = []
            for doc in documents:
                sentences = doc["content"].split('.')
                # Simple heuristic: take first two sentences from each document
                key_sentences.extend(sentences[:2])
            
            combined_extract = ". ".join([s.strip() for s in key_sentences if s.strip()])
            if len(combined_extract) > 100:  # Only summarize if long enough
                summary = self.summarizer(combined_extract, max_length=80, min_length=30, do_sample=False)
                return summary[0]['summary_text']
            else:
                return combined_extract
        
        else:
            raise ValueError(f"Unknown strategy: {strategy!r}")
    
    def analyze_coverage(self, documents, summary):
        """Fraction of each source's unique words that also appear in the summary"""
        def unique_words(text):
            # Strip surrounding punctuation so "life." and "life" count as the same word
            return {w.strip('.,;:!?"\'()') for w in text.lower().split()} - {""}
        
        coverage = {}
        summary_words = unique_words(summary)
        
        for doc in documents:
            doc_words = unique_words(doc["content"])
            overlap = len(summary_words.intersection(doc_words))
            coverage[doc["source"]] = overlap / len(doc_words) if doc_words else 0
        
        return coverage

# Initialize multi-document summarizer
multi_summarizer = MultiDocumentSummarizer()

print("📰 Source Articles:")
for i, article in enumerate(related_articles, 1):
    print(f"   {i}. {article['source']}: {len(article['content'].split())} words")

print(f"\nTotal content: {sum(len(doc['content'].split()) for doc in related_articles)} words\n")

# Test different summarization strategies
strategies = ["concatenate", "individual_then_combine", "extractive"]

for strategy in strategies:
    print(f"🔄 Strategy: {strategy.replace('_', ' ').title()}")
    
    summary = multi_summarizer.summarize_collection(related_articles, strategy)
    coverage = multi_summarizer.analyze_coverage(related_articles, summary)
    
    print(f"   Summary ({len(summary.split())} words): {summary}")
    print(f"   Source coverage: {', '.join([f'{source}: {score:.2f}' for source, score in coverage.items()])}")
    print()
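
The "concatenate" strategy works here because the three articles are short. With longer collections the combined text can exceed the model's input limit. A common workaround is to split the text into sentence-aligned chunks, summarize each chunk, and then summarize the partial summaries. The sketch below assumes a word budget as a rough proxy for the token limit; `chunk_by_sentences`, `summarize_long`, and the `summarize_fn` callable are hypothetical names introduced for illustration.

```python
import re
from typing import Callable, List

def chunk_by_sentences(text: str, max_words: int = 400) -> List[str]:
    """Group whole sentences into chunks of roughly max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long(text: str, summarize_fn: Callable[[str], str], max_words: int = 400) -> str:
    """Summarize each chunk, then summarize the joined partial summaries once more."""
    chunks = chunk_by_sentences(text, max_words)
    if not chunks:
        return ""
    partial = [summarize_fn(chunk) for chunk in chunks]
    if len(partial) == 1:
        return partial[0]
    return summarize_fn(" ".join(partial))

# Usage sketch (assumes a `summarizer` pipeline like the ones used in this notebook):
# fn = lambda t: summarizer(t, max_length=60, min_length=20, do_sample=False)[0]["summary_text"]
# print(summarize_long(" ".join(doc["content"] for doc in related_articles), fn))
```

Passing the summarizer in as a callable keeps the chunking logic independent of any particular model, which also makes it easy to test with a stub.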

## 4. Domain-Specific Summarization

Handle different types of documents with specialized approaches:

In [None]:
# Domain-specific summarization
print("🎯 Domain-Specific Summarization")
print("=" * 35)

class DomainSpecificSummarizer:
    def __init__(self):
        self.summarizer = pipeline("summarization")
    
    def summarize_research_paper(self, abstract, content, max_length=100):
        """Summarize academic research papers"""
        # Prepend the abstract so the model sees the paper's own framing first
        full_text = f"Abstract: {abstract} Content: {content}"
        summary = self.summarizer(full_text, max_length=max_length, min_length=40, do_sample=False)
        return summary[0]['summary_text']
    
    def summarize_news_article(self, headline, content, max_length=80):
        """Summarize news articles focusing on key facts"""
        # Include headline context
        full_text = f"Headline: {headline}. Article: {content}"
        summary = self.summarizer(full_text, max_length=max_length, min_length=30, do_sample=False)
        return summary[0]['summary_text']
    
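    def summarize_meeting_notes(self, content, focus="action_items"):
        """Summarize meeting notes with specific focus"""
        # Note: summarization checkpoints such as BART are not instruction-tuned, so the
        # prefix below is read as part of the input text rather than obeyed as an instruction.
        # Treat it as a weak hint at best and check the output.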
        if focus == "action_items":
            prompt_text = f"Focus on action items and decisions: {content}"
        elif focus == "discussion":
            prompt_text = f"Focus on main discussion points: {content}"
        else:
            prompt_text = content
        
        summary = self.summarizer(prompt_text, max_length=100, min_length=30, do_sample=False)
        return summary[0]['summary_text']
    
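    def summarize_product_reviews(self, reviews, aspect="overall"):
        """Summarize product reviews focusing on specific aspects"""
        # As with meeting notes, the "Focus on ..." prefix is only a weak hint for
        # models that were not trained to follow instructions.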
        combined_reviews = " ".join(reviews)
        
        if aspect == "pros":
            prompt_text = f"Focus on positive aspects: {combined_reviews}"
        elif aspect == "cons":
            prompt_text = f"Focus on negative aspects: {combined_reviews}"
        else:
            prompt_text = combined_reviews
        
        summary = self.summarizer(prompt_text, max_length=80, min_length=25, do_sample=False)
        return summary[0]['summary_text']

# Initialize domain-specific summarizer
domain_summarizer = DomainSpecificSummarizer()

# Test cases for different domains
test_cases = [
    {
        "type": "research_paper",
        "data": {
            "abstract": "This study investigates the effectiveness of transformer models in natural language processing tasks.",
            "content": """
            We conducted experiments on five benchmark datasets including GLUE and SuperGLUE. 
            Our proposed model achieved state-of-the-art results on text classification and 
            question answering tasks. The key innovation is a novel attention mechanism that 
            reduces computational complexity by 40% while maintaining performance. We trained 
            models with different sizes and evaluated their performance across multiple metrics. 
            Results show significant improvements in both accuracy and efficiency compared to 
            existing approaches.
            """
        }
    },
    {
        "type": "news_article",
        "data": {
            "headline": "Major Breakthrough in Renewable Energy Storage",
            "content": """
            Scientists at MIT have developed a revolutionary battery technology that could store 
            renewable energy for weeks rather than hours. The new lithium-metal batteries use 
            a novel electrolyte design that prevents dendrite formation, a major cause of 
            battery degradation. Initial tests show the batteries can retain 90% of their 
            capacity after 1000 charge cycles. This breakthrough could make renewable energy 
            much more reliable and cost-effective for grid storage applications.
            """
        }
    },
    {
        "type": "meeting_notes",
        "data": {
            "content": """
            Team discussed the Q4 product roadmap. Sarah presented the user research findings 
            showing high demand for mobile features. John raised concerns about technical 
            feasibility and timeline. Decision made to prioritize mobile app development. 
            Action items: Sarah to create detailed requirements by Friday, John to assess 
            technical resources needed, Mike to prepare budget estimates. Next meeting 
            scheduled for next Tuesday to review progress.
            """
        }
    },
    {
        "type": "product_reviews",
        "data": {
            "reviews": [
                "Great product with excellent battery life and fast performance. Camera quality is outstanding.",
                "Love the design and build quality. However, the price is quite high for the features offered.",
                "Amazing display quality and smooth user interface. Charging speed could be better though.",
                "Good value for money. Some software bugs need to be fixed but overall satisfied."
            ]
        }
    }
]

# Process each test case
for i, case in enumerate(test_cases, 1):
    print(f"\n📋 Test Case {i}: {case['type'].replace('_', ' ').title()}")
    
    if case['type'] == 'research_paper':
        summary = domain_summarizer.summarize_research_paper(
            case['data']['abstract'], 
            case['data']['content']
        )
        print(f"   📄 Research Summary: {summary}")
    
    elif case['type'] == 'news_article':
        summary = domain_summarizer.summarize_news_article(
            case['data']['headline'], 
            case['data']['content']
        )
        print(f"   📰 News Summary: {summary}")
    
    elif case['type'] == 'meeting_notes':
        action_summary = domain_summarizer.summarize_meeting_notes(
            case['data']['content'], 
            focus="action_items"
        )
        print(f"   📝 Action Items: {action_summary}")
    
    elif case['type'] == 'product_reviews':
        overall_summary = domain_summarizer.summarize_product_reviews(
            case['data']['reviews'], 
            aspect="overall"
        )
        print(f"   ⭐ Review Summary: {overall_summary}")
    
    print("-" * 50)
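
Because the text prefixes used above are not guaranteed to steer a standard summarization model, a more dependable way to focus a summary is to filter the input before summarizing it. The sketch below keeps only sentences that mention chosen keywords. `select_sentences` and the keyword lists are hypothetical, and the sentence split is deliberately naive.

```python
import re
from typing import Iterable, List

def select_sentences(text: str, keywords: Iterable[str]) -> List[str]:
    """Return the sentences that mention at least one keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in lowered)]

# Usage sketch (assumes `domain_summarizer` and the meeting notes from the cell above):
# focused = " ".join(select_sentences(notes, ["action item", "decision"]))
# summary = domain_summarizer.summarize_meeting_notes(focused, focus="none")
```

If the filtered text is already short, it may be better to return it directly than to summarize it again.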

## 5. Summary Quality Evaluation

Measure and improve summarization performance:

In [None]:
# Summary quality evaluation and comparison
print("📊 Summary Quality Evaluation")
print("=" * 31)

class SummaryEvaluator:
    def __init__(self):
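        # Initialize different models for comparison.
        # The bare "summarization" task loads the library's default checkpoint
        # (a distilled BART at the time of writing); bart-large-cnn is the full-size model.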
        self.models = {
            "Default": pipeline("summarization"),
            "BART": pipeline("summarization", model="facebook/bart-large-cnn"),
        }
        
        try:
            from rouge_score import rouge_scorer
            self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
            self.rouge_available = True
        except ImportError:
            print("Note: ROUGE scorer not available. Install with 'pip install rouge-score'")
            self.rouge_available = False
    
    def generate_summaries(self, text, max_length=100):
        """Generate summaries using different models"""
        summaries = {}
        
        for model_name, model in self.models.items():
            try:
                summary = model(text, max_length=max_length, min_length=30, do_sample=False)
                summaries[model_name] = summary[0]['summary_text']
            except Exception as e:
                summaries[model_name] = f"Error: {str(e)[:50]}..."
        
        return summaries
    
    def calculate_basic_metrics(self, original_text, summary):
        """Calculate basic summary metrics"""
        original_words = len(original_text.split())
        summary_words = len(summary.split())
        
        return {
            'compression_ratio': summary_words / original_words,
            'original_length': original_words,
            'summary_length': summary_words,
            'reduction_percent': (1 - summary_words / original_words) * 100
        }
    
    def calculate_rouge_scores(self, reference, summary):
        """Calculate ROUGE scores if available"""
        if not self.rouge_available:
            return {"rouge1": "N/A", "rouge2": "N/A", "rougeL": "N/A"}
        
        scores = self.rouge_scorer.score(reference, summary)
        return {
            'rouge1': scores['rouge1'].fmeasure,
            'rouge2': scores['rouge2'].fmeasure,
            'rougeL': scores['rougeL'].fmeasure
        }
    
    def evaluate_coherence(self, summary):
        """Rough readability proxy (sentence length plus connectives), not a true coherence measure"""
        sentences = summary.split('.')
        sentence_count = len([s for s in sentences if s.strip()])
        
        # Simple heuristics for coherence
        avg_sentence_length = len(summary.split()) / max(sentence_count, 1)
        has_connecting_words = any(word in summary.lower() for word in 
                                 ['however', 'therefore', 'moreover', 'furthermore', 'additionally'])
        
        coherence_score = min(1.0, avg_sentence_length / 15)  # Normalize to 0-1
        if has_connecting_words:
            coherence_score += 0.1
        
        return min(1.0, coherence_score)

# Initialize evaluator
evaluator = SummaryEvaluator()

# Test article for evaluation
evaluation_text = """
The global semiconductor industry is facing unprecedented challenges due to supply chain 
disruptions and increasing demand for electronic devices. Major chip manufacturers like 
TSMC and Intel are investing billions of dollars in new fabrication facilities to meet 
growing demand. However, the complexity of modern chip manufacturing means that new 
facilities take several years to become operational. The shortage has affected numerous 
industries, from automotive to consumer electronics, with car manufacturers being 
particularly hard hit. Some automakers have had to temporarily halt production lines 
due to lack of chips. Government initiatives in the United States, Europe, and Asia 
are aimed at building domestic chip manufacturing capabilities to reduce dependence 
on foreign suppliers. Industry experts predict that the shortage will gradually ease 
over the next two years as new capacity comes online, but the experience has highlighted 
the strategic importance of semiconductors in the modern economy.
"""

# Reference summary for comparison (human-written)
reference_summary = """
The semiconductor industry faces supply chain disruptions and high demand, leading to 
shortages affecting automotive and electronics sectors. Major manufacturers are investing 
in new facilities, while governments promote domestic production capabilities. The shortage 
should ease within two years as new capacity becomes available.
"""

print(f"📄 Original Text: {len(evaluation_text.split())} words")
print(f"📝 Reference Summary: {len(reference_summary.split())} words\n")

# Generate summaries with different models
generated_summaries = evaluator.generate_summaries(evaluation_text)

# Evaluate each summary
print("🔍 Model Comparison:")
print("=" * 20)

for model_name, summary in generated_summaries.items():
    if not summary.startswith("Error:"):
        print(f"\n🤖 {model_name} Model:")
        print(f"   Summary: {summary}")
        
        # Basic metrics
        basic_metrics = evaluator.calculate_basic_metrics(evaluation_text, summary)
        print(f"   Length: {basic_metrics['summary_length']} words ({basic_metrics['reduction_percent']:.1f}% reduction)")
        print(f"   Compression: {basic_metrics['compression_ratio']:.3f}")
        
        # ROUGE scores (if available)
        rouge_scores = evaluator.calculate_rouge_scores(reference_summary, summary)
        if isinstance(rouge_scores['rouge1'], float):
            print(f"   ROUGE-1: {rouge_scores['rouge1']:.3f}")
            print(f"   ROUGE-2: {rouge_scores['rouge2']:.3f}")
            print(f"   ROUGE-L: {rouge_scores['rougeL']:.3f}")
        
        # Coherence score
        coherence = evaluator.evaluate_coherence(summary)
        print(f"   Coherence: {coherence:.3f}")
    else:
        print(f"\n❌ {model_name} Model: {summary}")

# Summary quality tips
print("\n💡 Summary Quality Tips:")
print("=" * 24)
quality_tips = [
    "Maintain key information while reducing length",
    "Ensure logical flow and coherence",
    "Preserve important entities and facts",
    "Use appropriate compression ratios for the task",
    "Consider domain-specific requirements",
    "Evaluate summaries with multiple metrics"
]

for i, tip in enumerate(quality_tips, 1):
    print(f"   {i}. {tip}")
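
To see what the ROUGE-1 numbers above measure, here is a from-scratch sketch of unigram overlap. It is a simplified illustration, not the `rouge_score` implementation: it applies no stemming, so its values will differ from the scorer configured with `use_stemmer=True`. The function name `rouge1` and the tokenization regex are assumptions of this sketch.

```python
import re
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 between two texts."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fmeasure": f1}
```

Recall rewards covering the reference, precision penalizes extra words, and the F-measure balances the two. None of them checks whether the summary is factually consistent with the source.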

## 🎯 Key Takeaways

**What you've learned about text summarization:**

✅ **Basic Summarization**: Use pre-trained models for quick text condensation  
✅ **Parameter Control**: Adjust summary length and sampling strategies  
✅ **Multi-Document Processing**: Combine and summarize multiple related sources  
✅ **Domain Adaptation**: Handle different document types with specialized approaches  
✅ **Quality Evaluation**: Measure and compare summary performance with multiple metrics  
✅ **Compression Strategies**: Balance information retention with length reduction  
✅ **Real-world Applications**: Apply these techniques to news, research papers, meeting notes, and reviews  

## 🔧 Best Practices

1. **Choose appropriate models** for your specific domain and use case
2. **Set reasonable length parameters** based on source content and requirements
3. **Evaluate summaries systematically** using both automatic metrics and human judgment
4. **Consider the target audience** when determining summary style and detail level
5. **Handle edge cases** like very short or very long input documents
6. **Preserve critical information** like names, dates, and key facts
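
For practice 6, one lightweight check is to confirm that every number in a summary also occurs in the source, since abstractive models sometimes alter figures. The sketch below is a heuristic only: `unsupported_numbers` is a hypothetical helper, it compares digit strings literally, and it does not catch numbers written out as words.

```python
import re

def unsupported_numbers(source: str, summary: str) -> list:
    """Return numbers that appear in the summary but nowhere in the source."""
    # Matches integers and decimals such as 2023, 4.2 or 1,000.
    pattern = r"\d+(?:[.,]\d+)*"
    source_numbers = set(re.findall(pattern, source))
    return [n for n in re.findall(pattern, summary) if n not in source_numbers]

# Usage sketch (assumes a source text and a generated summary are available):
# flagged = unsupported_numbers(evaluation_text, summary)
# if flagged:
#     print(f"Check these figures against the source: {flagged}")
```

A similar comparison can be run on names or dates, although those generally need an entity extractor rather than a regular expression.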

## 🚀 Next Steps

Ready for the next challenge?

**Continue to**: `10_fine_tuning_basics.ipynb` - Learn about customizing models for your specific needs!

**Practice**: Try summarizing documents from your domain (legal contracts, research papers, news articles)!

Great work mastering text summarization! 📚