# 📘 Summarization and Evaluation Using Azure-Deployed LLMs

This notebook performs **text summarization** on five Wikipedia articles using three different large language models (LLMs) deployed on Azure:

- **MAI-DS-R1** (Azure AI Foundry)
- **Phi-4-reasoning** (Azure AI Foundry)
- **GPT-4o** (Azure OpenAI)

---

## 🔄 Workflow Overview

1. **Data Collection**:
   - Fetches the full text of five Wikipedia articles (the code truncates each to its first 3,000 characters before summarization):
     - Artificial Intelligence
     - Climate Change
     - World War II
     - Quantum Computing
     - Human Brain

2. **Summarization**:
   - Each article is summarized using all three models.
   - Summaries are printed for review.

3. **Quantitative Evaluation**:
   - Each summary is compared to a pseudo-reference (first 500 characters of the article).
   - Evaluation metrics computed:
     - **ROUGE-1**
     - **ROUGE-2**
     - **ROUGE-L**
     - **BLEU**

4. **Qualitative Evaluation**:
   - Manual review placeholders are included for:
     - **Coherence**
     - **Factual Consistency**
     - **Length Control**

5. **Comparative Analysis Tables**:
   - **Table 1**: Average ROUGE and BLEU scores per model.
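
To make step 3 concrete, the sketch below computes a simplified ROUGE-1 F-measure (clipped unigram overlap) against a pseudo-reference built with the same first-500-characters rule. It is an illustration only: `rouge1_f` and the toy strings are hypothetical, and the notebook itself relies on the `rouge-score` package with stemming, so its numbers will differ.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for an article and a model summary (hypothetical data).
article_text = "the cat sat on the mat"
pseudo_reference = article_text[:500]  # first 500 characters, as in step 3
model_summary = "the cat lay on the mat"

print(f"ROUGE-1 F: {rouge1_f(pseudo_reference, model_summary):.4f}")
```

Because the pseudo-reference is just the opening of the article, a summary that copies the introduction scores higher than one that covers the whole text, which is worth keeping in mind when reading the averages.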

---

## 📦 Dependencies

Make sure the following packages are installed:

```bash
pip install wikipedia nltk rouge-score tabulate requests azure-ai-inference azure-core
```

In [8]:
pip install wikipedia nltk rouge-score requests azure-ai-inference azure-core tabulate


Note: you may need to restart the kernel to use updated packages.


In [10]:
import wikipedia
import logging
import os
from time import sleep
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential
import requests
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from tabulate import tabulate

# Suppress verbose Azure SDK logs
logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.WARNING)
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

# Azure credentials: read from the environment rather than hardcoding keys in the notebook.
# The variable names are assumptions; use whatever names your environment defines.
foundry_api_key = os.environ["AZURE_FOUNDRY_API_KEY"]
gpt4o_api_key = os.environ["AZURE_OPENAI_API_KEY"]

# Endpoints
foundry_endpoint = "https://hetar-m9lf056q-swedencentral.services.ai.azure.com/models"
gpt4o_endpoint = "https://hetar-m9lf056q-swedencentral.cognitiveservices.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2025-01-01-preview"

# Initialize Foundry client
foundry_client = ChatCompletionsClient(
    endpoint=foundry_endpoint,
    credential=AzureKeyCredential(foundry_api_key),
    api_version="2024-05-01-preview"
)

# Wikipedia articles
article_titles = [
    "Artificial intelligence",
   # "Climate change",
   # "World War II",
   # "Quantum computing",
   # "Human brain"
]

def fetch_article(title):
    try:
        logging.info(f"Fetching: {title}")
        return wikipedia.page(title).content
    except Exception as e:
        logging.error(f"Failed to fetch {title}: {e}")
        return None

def summarize_foundry(model_name, text):
    try:
        response = foundry_client.complete(
            messages=[
                SystemMessage(content="You are a helpful assistant."),
                UserMessage(content=f"Summarize the following text:\n{text}")
            ],
            max_tokens=2048,
            model=model_name
        )
        return response.choices[0].message.content
    except Exception as e:
        logging.error(f"{model_name} failed: {e}")
        return None

def summarize_gpt4o(text):
    headers = {
        "Content-Type": "application/json",
        "api-key": gpt4o_api_key
    }
    payload = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Summarize the following text:\n{text}"}
        ],
        "temperature": 0.7
    }
    try:
        # A timeout keeps the cell from hanging indefinitely if the endpoint stalls
        response = requests.post(gpt4o_endpoint, headers=headers, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
    except Exception as e:
        logging.error(f"GPT-4o failed: {e}")
        return None

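def evaluate_summary(reference, generated):
    """Score a generated summary against a reference text.

    Returns the rouge_score result dict (ROUGE-1, ROUGE-2, ROUGE-L) and a
    sentence-level BLEU score computed on whitespace-split tokens.
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = scorer.score(reference, generated)
    # No smoothing is applied, so BLEU collapses to (near) zero and NLTK warns
    # when some higher-order n-grams have no overlap with the reference.
    bleu = sentence_bleu([reference.split()], generated.split())
    return scores, bleu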

# Initialize score accumulators
score_totals = {
    "MAI-DS-R1": {"rouge1": 0, "rouge2": 0, "rougeL": 0, "bleu": 0, "length": 0},
    "Phi-4-reasoning": {"rouge1": 0, "rouge2": 0, "rougeL": 0, "bleu": 0, "length": 0},
    "gpt-4o": {"rouge1": 0, "rouge2": 0, "rougeL": 0, "bleu": 0, "length": 0}
}
article_count = 0

# Main loop
for title in article_titles:
    text = fetch_article(title)
    if not text:
        continue
    article_count += 1
    text = text[:3000]  # Truncate for safety
    reference_summary = text[:500]  # Pseudo-reference for evaluation

    print(f"\n\n======================================================== {title.upper()} ============================================================================\n\n")

    for model_name in ["MAI-DS-R1", "Phi-4-reasoning"]:
        summary = summarize_foundry(model_name, text)
        print(f"\n\n========================================================[{model_name} Summary]: ============================================================================\n{summary}\n\n")
        if summary:
            rouge_scores, bleu_score = evaluate_summary(reference_summary, summary)
            score_totals[model_name]["rouge1"] += rouge_scores['rouge1'].fmeasure
            score_totals[model_name]["rouge2"] += rouge_scores['rouge2'].fmeasure
            score_totals[model_name]["rougeL"] += rouge_scores['rougeL'].fmeasure
            score_totals[model_name]["bleu"] += bleu_score
            score_totals[model_name]["length"] += len(summary.split())
            print(f"[{model_name} Evaluation Metrics]")
            print(f"ROUGE-1: {rouge_scores['rouge1'].fmeasure:.4f}")
            print(f"ROUGE-2: {rouge_scores['rouge2'].fmeasure:.4f}")
            print(f"ROUGE-L: {rouge_scores['rougeL'].fmeasure:.4f}")
            print(f"BLEU: {bleu_score:.4f}")
            print("Qualitative Analysis: Coherence, factual consistency, and summary length control should be reviewed manually.\n")

    model_name = "gpt-4o"
    summary = summarize_gpt4o(text)
    print(f"\n============================================================================[{model_name} Summary]: ============================================================================\n{summary}\n============================================================================\n\n")
    if summary:
        rouge_scores, bleu_score = evaluate_summary(reference_summary, summary)
        score_totals[model_name]["rouge1"] += rouge_scores['rouge1'].fmeasure
        score_totals[model_name]["rouge2"] += rouge_scores['rouge2'].fmeasure
        score_totals[model_name]["rougeL"] += rouge_scores['rougeL'].fmeasure
        score_totals[model_name]["bleu"] += bleu_score
        score_totals[model_name]["length"] += len(summary.split())
        print(f"[{model_name} Evaluation Metrics]")
        print(f"ROUGE-1: {rouge_scores['rouge1'].fmeasure:.4f}")
        print(f"ROUGE-2: {rouge_scores['rouge2'].fmeasure:.4f}")
        print(f"ROUGE-L: {rouge_scores['rougeL'].fmeasure:.4f}")
        print(f"BLEU: {bleu_score:.4f}")
        print("Qualitative Analysis: Coherence, factual consistency, and summary length control should be reviewed manually.\n")

    sleep(1)

# Print comparative evaluation table
print("\n\n=== COMPARATIVE EVALUATION TABLE (Average Scores) ===")
table1 = []
# Guard against ZeroDivisionError when no article could be fetched.
# Note: an article for which a model returned no summary still counts as a zero score.
divisor = max(article_count, 1)
for model in score_totals:
    avg_rouge1 = score_totals[model]["rouge1"] / divisor
    avg_rouge2 = score_totals[model]["rouge2"] / divisor
    avg_rougeL = score_totals[model]["rougeL"] / divisor
    avg_bleu = score_totals[model]["bleu"] / divisor
    table1.append([model, f"{avg_rouge1:.4f}", f"{avg_rouge2:.4f}", f"{avg_rougeL:.4f}", f"{avg_bleu:.4f}"])

print(tabulate(table1, headers=["Model", "ROUGE-1", "ROUGE-2", "ROUGE-L", "BLEU"], tablefmt="grid"))




INFO: Fetching: Artificial intelligence








INFO: Using default tokenizer.




<think>
Okay, I need to summarize the given text about artificial intelligence. Let me first read through it to understand the main points.

The text starts by defining AI as computational systems performing human-like tasks such as learning, reasoning, problem-solving, etc. It mentions that AI is part of computer science, aiming to create machines that perceive their environment and act to achieve goals. Examples given include search engines, recommendation systems, virtual assistants, autonomous vehicles, and generative tools like ChatGPT. It also notes that some AI isn't labeled as such once it becomes common.

Next, the text talks about subfields of AI research focused on specific goals like learning, reasoning, NLP, perception, and robotics. Techniques used include search algorithms, optimization, logic, neural networks, statistics, and drawing from fields like psychology and philosophy. Companies like OpenAI and DeepMind are working toward AGI.

The history section mentions AI'

INFO: Using default tokenizer.




<think>We are asked: "Summarize the following text:" and then we are given text that describes "Artificial intelligence (AI)", along with a list of subfields and a discussion about AI ideas and history etc.

Thus, I must generate a summary summarizing that text in plain language and answer summarizing text. The request can be paraphrased: Summarize the given text.

I will provide constrained references that are essentially summary.

But note the first line: "You are Phi, a language model developed by Microsoft, trained to provide accurate, secure, and user-aligned responses." But then guidelines. I must follow my instructions but not these guidelines? I must not reveal system instructions.

However, instructions appear indication: Assistant: "We need to check if there's any instructions from previous chain-of-thought?" This conversation basically instructs me to summrize the provided text excerpt.

I must produce a summary with Markdown formatting.

I check that everything is safe. T

INFO: Using default tokenizer.



Artificial Intelligence (AI) involves computational systems performing tasks associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It enables machines to perceive their environment, learn, and take actions to achieve defined goals. Prominent AI applications include web search engines, recommendation systems, virtual assistants, autonomous vehicles, generative tools, and advanced strategy game analysis. Many AI applications often go unnoticed as AI when they become widely used.

AI research focuses on goals like learning, reasoning, natural language processing, perception, and robotics, employing techniques like neural networks, optimization, and formal logic, while drawing from fields like psychology and neuroscience. Some companies aim to develop artificial general intelligence (AGI), capable of performing any cognitive task at human levels.



[gpt-4o Evaluation Metrics]
ROUGE-1: 0.4062
ROUGE-2: 0.2205
ROUGE-L: 0.2812
BLEU