# 📝 Abstractive Text Summarization Project

This project demonstrates **abstractive text summarization** using a transformer-based model (BART) with a custom dataset (WikiHow). The workflow includes:
- Data loading and preprocessing
- Model initialization
- Training and evaluation
- Inference on new articles
- Analysis using ROUGE scores

The notebook follows the structure of the analysis in Lee Lwhieldon's GitHub repository for the SAMSum dataset, adapted to support a custom dataset.

## 1️⃣ Libraries Installation & Import

In [ ]:
!pip install transformers datasets rouge_score pandas scikit-learn

In [ ]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_metric
import torch

## 2️⃣ Dataset Loading & Preprocessing

Here, we use a **custom WikiHow dataset**: a downloaded CSV with an `article` column (the full text) and a `steps` column (the step-by-step instructions, used as the target summary).

In [ ]:
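from datasets import Dataset  # wraps the DataFrames in a format Trainer can index

# Load dataset
df = pd.read_csv('wikihow_articles.csv')  # Replace with your dataset path
df = df.dropna(subset=['article', 'steps'])  # rows with missing text cannot be tokenized

# Split train and validation sets
train_df, val_df = train_test_split(df, test_size=0.1, random_state=42)

# Initialize tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

# Preprocessing function (receives a batch of rows as a dict of lists)
def preprocess_data(batch):
    inputs = tokenizer(batch['article'], max_length=1024, truncation=True, padding='max_length')
    targets = tokenizer(batch['steps'], max_length=150, truncation=True, padding='max_length')
    # Replace padding in the labels with -100 so the loss ignores those positions
    inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in targets['input_ids']
    ]
    return inputs

# Tokenize and drop the raw text columns so only model inputs remain
train_data = Dataset.from_pandas(train_df, preserve_index=False).map(
    preprocess_data, batched=True, remove_columns=list(train_df.columns))
val_data = Dataset.from_pandas(val_df, preserve_index=False).map(
    preprocess_data, batched=True, remove_columns=list(val_df.columns))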
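
When targets are padded to a fixed length, the padded label positions are commonly set to `-100` so that the cross-entropy loss skips them. The following is a minimal pure-Python sketch of that masking step. The helper name `mask_padding` and the toy token ids are assumptions made for illustration only; in the notebook the pad id would come from `tokenizer.pad_token_id`.

```python
IGNORE_INDEX = -100  # value that PyTorch's cross-entropy loss ignores by default

def mask_padding(label_ids, pad_token_id):
    """Return a copy of label_ids with every padding id replaced by IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Toy batch: the ids and pad id 1 are illustrative, not real tokenizer output
labels = [[0, 713, 16, 2, 1, 1], [0, 42, 2, 1, 1, 1]]
masked = mask_padding(labels, pad_token_id=1)
print(masked)
```

Without this masking, the model is also trained to predict padding tokens, which tends to dilute the loss on short targets.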

## 3️⃣ Model Initialization

In [ ]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

## 4️⃣ Training the Model

In [ ]:
training_args = TrainingArguments(
    output_dir='./outputs',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    weight_decay=0.01,
    save_total_limit=2,
    logging_dir='./logs'
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer
)

trainer.train()
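
To get a feel for how long a run with these settings takes, it can help to estimate the number of optimizer updates up front. The sketch below assumes a single device and no gradient accumulation; the function name `total_training_steps` and the example dataset sizes are hypothetical.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer updates for one device without gradient accumulation."""
    # The last batch of an epoch may be smaller, so round up
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs

# Example: 9,000 training rows (an assumed size) with the settings above
steps = total_training_steps(num_examples=9000, batch_size=2, num_epochs=3)
print(steps)
```

A batch size of 2 keeps memory use low for a large BART checkpoint, at the cost of many small updates per epoch.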

## 5️⃣ Evaluation & Analysis

We evaluate on the validation set using **ROUGE metrics**, which score the overlap between generated and reference summaries.

In [ ]:
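# Note: recent `datasets` releases deprecate load_metric in favor of the `evaluate` package
rouge = load_metric('rouge')

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    # The plain Trainer returns logits (possibly inside a tuple), not token ids
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]
    pred_ids = pred_ids.argmax(axis=-1)  # teacher-forced predictions, not beam-search output
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    decoded_preds = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    # Report the mid F-measure of each ROUGE variant as a plain float
    return {name: score.mid.fmeasure for name, score in result.items()}

# Attach the metric function, then run evaluation on the validation set
trainer.compute_metrics = compute_metrics
print(trainer.evaluate())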
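
To make the metric less of a black box, here is a simplified sketch of ROUGE-1 F-measure based on clipped unigram overlap. It is not the `rouge_score` implementation: it assumes plain whitespace tokenization and lowercasing, with no stemming, and the name `rouge1_f` is hypothetical.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure using whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each word's count to the smaller of the two
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words that were recovered
    return 2 * precision * recall / (precision + recall)

score = rouge1_f("boil water add tea", "boil water then add tea leaves")
print(round(score, 3))  # 0.8
```

Here all four predicted words appear in the six-word reference, so precision is 1.0, recall is 4/6, and the F-measure is 0.8.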

## 6️⃣ Inference on New Articles

In [ ]:
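def generate_summary(article_text):
    # Move the inputs to the model's device (the model may sit on a GPU after training)
    inputs = tokenizer(article_text, return_tensors='pt', max_length=1024, truncation=True).to(model.device)
    summary_ids = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=150,
        num_beams=4,         # beam search width
        length_penalty=2.0,  # values above 0 favor longer summaries
        early_stopping=True
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)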

sample_article = "How to make a cup of tea? First, boil water. Then add tea leaves..."
print(generate_summary(sample_article))
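
The effect of `length_penalty` in the call above can be illustrated with a simplified sketch of length-normalized beam scoring, in which a hypothesis's summed log-probability is divided by its length raised to the penalty. The function name `beam_score` and the numbers are assumptions for illustration; the library's actual beam-search bookkeeping involves more detail.

```python
def beam_score(sum_log_probs, length, length_penalty):
    """Length-normalized score: summed log-probability divided by length ** penalty."""
    return sum_log_probs / (length ** length_penalty)

# Two hypothetical beams: a short one and a longer, less probable one
short_raw, short_len = -4.0, 4
long_raw, long_len = -6.0, 8

# With no penalty the short beam ranks higher; with 2.0 the longer beam overtakes it
no_penalty = (beam_score(short_raw, short_len, 0.0), beam_score(long_raw, long_len, 0.0))
with_penalty = (beam_score(short_raw, short_len, 2.0), beam_score(long_raw, long_len, 2.0))
print(no_penalty)
print(with_penalty)
```

Because log-probabilities are negative, dividing by a larger power of the length moves the score closer to zero, which is why a positive penalty favors longer summaries.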

## 7️⃣ Save & Load Trained Model

In [ ]:
# Save model
model.save_pretrained('./outputs/final_model')
tokenizer.save_pretrained('./outputs/final_model')

# Load model later
# model = BartForConditionalGeneration.from_pretrained('./outputs/final_model')
# tokenizer = BartTokenizer.from_pretrained('./outputs/final_model')

## 8️⃣ Project Description for GitHub

**Project Name:** Abstractive Text Summarization with BART

**Description:**
This project implements abstractive text summarization using a transformer model (BART) on a custom dataset (WikiHow). It includes preprocessing, fine-tuning, evaluation with ROUGE, and inference. The notebook reports training logs, ROUGE scores, and sample predictions.

**Dataset:** WikiHow articles (custom CSV file, `article` and `steps` columns)

**Features:**
- Preprocessing pipeline for text summarization
- Fine-tuning of the BART model (`facebook/bart-large-cnn`)
- ROUGE evaluation
- Inference function for new articles
- Save and load trained model

**Usage:**
1. Clone repository
2. Place the dataset CSV file in the root folder
3. Run the notebook cells sequentially
4. Use `generate_summary(article_text)` for new predictions

**Author:** Your Name