
# 📘 Text Summarization using Transformers (BART)

Welcome to this project!  
In this notebook, we’ll build a **text summarization system** using **BART (Bidirectional and Auto-Regressive Transformers)**, loaded through the Hugging Face Transformers library.  

We’ll go through:  
1. Loading the dataset  
2. Preprocessing text  
3. Fine-tuning a pretrained model  
4. Evaluating with ROUGE metrics  
5. Testing on real examples  

---


In [None]:

# Install dependencies
# accelerate is required by the Trainer API in recent transformers releases
!pip install transformers datasets evaluate rouge_score torch accelerate --quiet


## 🔧 Step 1: Import Libraries
We’ll use **Hugging Face Transformers** for the model,  
**datasets** for data, and **evaluate** for metrics like ROUGE.

In [None]:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import evaluate
import numpy as np


## 📂 Step 2: Load Dataset
We’ll use the **CNN/DailyMail dataset**, which pairs news articles (`article`) with human-written summaries (`highlights`).

In [None]:

dataset = load_dataset("cnn_dailymail", "3.0.0")

print(dataset)
print("Sample keys:", dataset["train"][0].keys())
print("\nArticle preview:\n", dataset["train"][0]["article"][:500])
print("\nReference summary:\n", dataset["train"][0]["highlights"])


## 🤖 Step 3: Load Pretrained Model & Tokenizer
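We’ll use **facebook/bart-large-cnn**, a BART-large checkpoint that has already been fine-tuned on CNN/DailyMail, so the fine-tuning below continues training from a strong summarization baseline.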

In [None]:

model_name = "facebook/bart-large-cnn"

tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


## 🧹 Step 4: Preprocess Data
We need to tokenize both the **input article** and the **target summary**.

In [None]:

max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    # Tokenize the articles; longer inputs are truncated to max_input_length tokens
    model_inputs = tokenizer(
        examples["article"], max_length=max_input_length, truncation=True
    )

    # Tokenize the reference summaries as targets (text_target replaces as_target_tokenizer)
    labels = tokenizer(
        text_target=examples["highlights"], max_length=max_target_length, truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize every split and drop the raw text columns
tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
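
Because the preprocessing above does not pad, the examples in a batch have different lengths and must be padded when they are batched. The sketch below illustrates the general idea in plain NumPy: inputs are padded with a pad token ID, while labels are padded with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper `pad_batch`, the token IDs, and the pad ID of 1 are assumptions for illustration; in practice a collator such as `DataCollatorForSeq2Seq` from Transformers takes care of this.

```python
import numpy as np

def pad_batch(sequences, pad_value):
    """Right-pad variable-length ID lists to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

# Hypothetical token IDs for two examples of different lengths.
input_ids = [[0, 713, 16, 2], [0, 713, 16, 10, 1296, 2]]
labels = [[0, 250, 2], [0, 250, 765, 2]]

PAD_TOKEN_ID = 1      # assumed pad ID, for illustration only
LABEL_PAD_ID = -100   # ignored by PyTorch's cross-entropy loss by default

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)

# Mark real tokens with 1 and padding with 0 (valid here because
# none of the real tokens share the pad ID).
attention_mask = (padded_inputs != PAD_TOKEN_ID).astype(np.int64)

print(padded_inputs)
print(padded_labels)
print(attention_mask)
```

Padding per batch rather than to a fixed 1024 tokens keeps short articles cheap to process.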


## 📊 Step 5: Define Evaluation Metrics
We’ll use **ROUGE** to measure summary quality.

In [None]:

rouge = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]

    # Replace -100 (padding that the loss ignores) with the pad token ID so the arrays can be decoded
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Report ROUGE scores as percentages
    return {k: round(v * 100, 4) for k, v in result.items()}
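
To make the metric less of a black box, here is a minimal sketch of ROUGE-1 F1, which measures unigram overlap between a generated summary and its reference. It assumes lowercasing and whitespace tokenization only; the `rouge_score` package applies its own tokenization and optional stemming, so its numbers will generally differ. The helper name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between a prediction and a reference (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words recovered by the prediction
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 5 of 6 words overlap on each side
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence.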


## ⚙️ Step 6: Training Setup
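We’ll fine-tune for **1 epoch** on a 2,000-example subset of the training split and evaluate on 500 validation examples, which keeps the demo short.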

In [None]:

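# predict_with_generate is only supported by the Seq2Seq variants of Trainer and TrainingArguments
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    weight_decay=0.01,
    save_total_limit=2,
    num_train_epochs=1,
    predict_with_generate=True,  # call generate() during evaluation so ROUGE scores real summaries
    logging_dir="./logs",
)

# Pads inputs and labels per batch; label padding uses -100 so the loss ignores it
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(2000)),  # subset for demo
    eval_dataset=tokenized_datasets["validation"].select(range(500)),
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)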
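
As a rough sanity check on how long training will run, the number of batches can be worked out from the subset sizes and batch sizes above. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; the helper `batches_per_epoch` is hypothetical and not part of the Trainer API.

```python
import math

def batches_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))

num_train_epochs = 1
train_steps = batches_per_epoch(2000, 2) * num_train_epochs  # optimizer steps
eval_batches = batches_per_epoch(500, 2)                     # generation batches

print(train_steps)   # 1000
print(eval_batches)  # 250
```

Evaluation batches are slower than their count suggests, since each one runs beam-search generation rather than a single forward pass.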


## 🚀 Step 7: Fine-Tune the Model
This may take time depending on hardware.

In [None]:

trainer.train()


## 📈 Step 8: Evaluate Model
We’ll check ROUGE scores.

In [None]:

results = trainer.evaluate()
print(results)


## 📝 Step 9: Test on New Example
Now let’s test summarization on a sample article.

In [None]:

sample_text = dataset["test"][0]["article"]

inputs = tokenizer(sample_text, return_tensors="pt", max_length=1024, truncation=True).to(device)

summary_ids = model.generate(
    **inputs,  # pass input_ids together with the attention mask
    max_length=150,
    min_length=40,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True
)

print("Original Text:\n", sample_text[:500], "...")
print("\nGenerated Summary:\n", tokenizer.decode(summary_ids[0], skip_special_tokens=True))
print("\nReference Summary:\n", dataset["test"][0]["highlights"])
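
The `length_penalty=2.0` setting deserves a short illustration. In beam search, a finished hypothesis is typically scored by dividing its summed token log-probabilities by its length raised to `length_penalty`. Because log-probabilities are negative, a larger exponent moves long sequences closer to zero and so favors them. The sketch below is a simplified model of that scoring rule, not the library's implementation; `beam_score` and the two example beams are hypothetical.

```python
def beam_score(sum_log_probs, length, length_penalty=1.0):
    """Length-normalized score of a finished beam (higher is better)."""
    return sum_log_probs / (length ** length_penalty)

# Two hypothetical finished beams: (sum of token log-probs, length in tokens).
short_beam = (-6.0, 10)
long_beam = (-9.0, 20)

winners = {}
for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(*short_beam, length_penalty=penalty)
    long_score = beam_score(*long_beam, length_penalty=penalty)
    winners[penalty] = "short" if short_score > long_score else "long"
    print(f"length_penalty={penalty}: short={short_score:.4f}, long={long_score:.4f}")

print(winners)
```

With no normalization (`0.0`) the shorter beam wins on raw log-probability, while values of `1.0` and above shift the preference to the longer beam in this example.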


## 💾 Step 10: Save and Reload Model
Save the fine-tuned model for reuse or deployment.

In [None]:

model.save_pretrained("./bart-summarizer")
tokenizer.save_pretrained("./bart-summarizer")

# Load later
loaded_model = BartForConditionalGeneration.from_pretrained("./bart-summarizer")
loaded_tokenizer = BartTokenizer.from_pretrained("./bart-summarizer")



# ✅ Conclusion
- We fine-tuned **BART** for summarization  
- Evaluated with **ROUGE metrics**  
- Tested summarization on unseen data  
- Saved the model for reuse or deployment  

This project demonstrates an **end-to-end NLP workflow** using modern Transformers 🚀
