
# 📊 Evaluation of PathSlide2Report (Demo Mode)

This is a **demo version** of the evaluation notebook.  
It comes pre-filled with **synthetic data** so recruiters can immediately see results without downloading TCGA data.

---
**Modes Available:**
- ✅ **Demo Mode (default)** → Uses mock summaries + diagnoses.
- 🔬 **Real Mode** → Runs full evaluation with TCGA data (if available).
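
One way the two modes could be selected is a flag that falls back to demo data whenever the TCGA metadata file is missing. This is a minimal sketch, not the notebook's actual switch: `select_mode`, the `PATHSLIDE_MODE` environment variable, and the `data/tcga_metadata.csv` path are all hypothetical names chosen for illustration.

```python
import os
from pathlib import Path

# Hypothetical location of the real TCGA metadata; adjust to your setup.
TCGA_METADATA_PATH = Path("data/tcga_metadata.csv")

def select_mode(requested=None, metadata_path=TCGA_METADATA_PATH):
    """Return "real" only when it is requested and the data file exists, else "demo"."""
    # PATHSLIDE_MODE is an assumed environment variable, not part of the project.
    requested = (requested or os.environ.get("PATHSLIDE_MODE", "demo")).lower()
    if requested == "real" and Path(metadata_path).is_file():
        return "real"
    return "demo"

mode = select_mode()
print(f"Running in {mode} mode")
```

Falling back to demo mode rather than raising keeps the notebook runnable from top to bottom on a machine without the dataset.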

---
Evaluation includes:
- BLEU / ROUGE (classic NLP metrics)
- Embedding-based similarity to ground truth diagnoses
- GPT-as-a-Judge scoring (1–5 scale)


## 1. Demo Data Setup

In [None]:

import pandas as pd

# Create synthetic logs (as if from Streamlit app)
logs = pd.DataFrame([
    {
        "metadata": "Slide001",
        "baseline_summary": "H&E stained liver tissue with normal architecture.",
        "rag_summary": "Liver tissue H&E at 40x magnification, showing intact histological structure."
    },
    {
        "metadata": "Slide002",
        "baseline_summary": "H&E stained colon tissue, features unclear.",
        "rag_summary": "Colon tissue H&E slide showing glandular structures consistent with adenocarcinoma."
    }
])

# Create synthetic ground truth metadata
df_meta = pd.DataFrame([
    {"slide_id": "Slide001", "diagnosis": "Normal liver tissue"},
    {"slide_id": "Slide002", "diagnosis": "Colon adenocarcinoma"}
])

print("Demo logs:")
display(logs)
print("Demo metadata:")
display(df_meta)


## 2. BLEU / ROUGE Evaluation

In [None]:

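from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

# Align summaries with diagnoses by slide ID instead of relying on row order.
eval_df = logs.merge(df_meta, left_on="metadata", right_on="slide_id", how="inner")
reference = eval_df["diagnosis"].tolist()
generated_baseline = eval_df["baseline_summary"].tolist()
generated_rag = eval_df["rag_summary"].tolist()

# Build the scorer once. Smoothing keeps BLEU from collapsing to ~0 when a short
# reference shares no higher-order n-grams with the candidate.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
smooth = SmoothingFunction().method1

def evaluate_summary(reference, candidate):
    # Lowercase before tokenizing so "Normal" and "normal" count as a match.
    bleu = sentence_bleu(
        [reference.lower().split()],
        candidate.lower().split(),
        smoothing_function=smooth,
    )
    scores = scorer.score(reference, candidate)
    return bleu, scores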

results = []
for ref, base, rag in zip(reference, generated_baseline, generated_rag):
    bleu_base, rouge_base = evaluate_summary(ref, base)
    bleu_rag, rouge_rag = evaluate_summary(ref, rag)
    results.append({
        "Reference": ref,
        "Baseline": base,
        "RAG": rag,
        "BLEU (Baseline)": bleu_base,
        "BLEU (RAG)": bleu_rag,
        "ROUGE-1 (Baseline)": rouge_base["rouge1"].fmeasure,
        "ROUGE-1 (RAG)": rouge_rag["rouge1"].fmeasure,
        "ROUGE-L (Baseline)": rouge_base["rougeL"].fmeasure,
        "ROUGE-L (RAG)": rouge_rag["rougeL"].fmeasure
    })

pd.DataFrame(results)
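
For readers unfamiliar with ROUGE-L: it scores the longest common subsequence (LCS) of tokens shared by the reference and the candidate. The sketch below is a simplified pure-Python illustration that uses lowercase whitespace tokenization. It skips the stemming and tokenizer that `rouge_score` applies, so its numbers may differ from the library's. `lcs_length` and `rouge_l_f1` are illustrative names, not part of any library.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """F1 of LCS-based precision and recall (simplified ROUGE-L)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("Colon adenocarcinoma", "H&E stained colon tissue, features unclear."))
```

Because the diagnoses are only two or three tokens long, precision is low for any full-sentence summary, which is worth keeping in mind when reading the ROUGE columns.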


## 3. Embedding-Based Similarity (Demo Mode)

In [None]:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import matplotlib.pyplot as plt

model = SentenceTransformer("all-MiniLM-L6-v2")

baseline_emb = model.encode(logs["baseline_summary"].tolist())
rag_emb = model.encode(logs["rag_summary"].tolist())
gt_emb = model.encode(df_meta["diagnosis"].tolist())

baseline_scores = [cosine_similarity([b],[g])[0][0] for b,g in zip(baseline_emb, gt_emb)]
rag_scores = [cosine_similarity([r],[g])[0][0] for r,g in zip(rag_emb, gt_emb)]

logs["baseline_score"] = baseline_scores
logs["rag_score"] = rag_scores

plt.figure(figsize=(6,4))
plt.bar(["Baseline", "RAG+Metadata"], [np.mean(baseline_scores), np.mean(rag_scores)], color=["red","green"])
plt.title("Average Similarity to Ground Truth (Demo)")
plt.ylabel("Cosine Similarity")
plt.show()

logs[["metadata","baseline_score","rag_score"]]
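
The similarity reported here is the cosine of the angle between two embedding vectors: their dot product divided by the product of their norms. A minimal pure-Python sketch of that formula follows. The toy vectors stand in for real embeddings, and `cosine` is an illustrative helper rather than the scikit-learn function the notebook calls.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Since the measure depends only on direction, scaling a vector leaves the score unchanged.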


## 4. GPT-as-a-Judge (Demo Mode Stub)

In [None]:

# Calling GPT requires an API key, so this demo uses hand-written placeholder
# scores to illustrate what recruiters would see. They are not model outputs.

gpt_eval_df = pd.DataFrame([
    {"slide_id": "Slide001", "baseline_gpt_score": 4, "rag_gpt_score": 5},
    {"slide_id": "Slide002", "baseline_gpt_score": 2, "rag_gpt_score": 5}
])

avg_base = gpt_eval_df["baseline_gpt_score"].mean()
avg_rag = gpt_eval_df["rag_gpt_score"].mean()

print("Average GPT Score (Baseline):", avg_base)
print("Average GPT Score (RAG+Metadata):", avg_rag)

plt.figure(figsize=(6,4))
plt.bar(["Baseline", "RAG+Metadata"], [avg_base, avg_rag], color=["red","green"])
plt.title("Average GPT Evaluation Scores (Demo)")
plt.ylabel("Score (1–5)")
plt.ylim(0,5)
plt.show()

gpt_eval_df
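
In Real Mode the simulated scores would come from an LLM judge. The sketch below shows only the two pieces around the API call: building a grading prompt and parsing a 1-5 score from the reply. The prompt wording, the `Score: N` reply format, and the helper names are assumptions for illustration. The model call itself is left out because it depends on your client and API key.

```python
import re

def build_judge_prompt(diagnosis, summary):
    """Assemble a grading prompt (hypothetical wording)."""
    return (
        "You are grading a pathology slide summary against the ground-truth diagnosis.\n"
        f"Diagnosis: {diagnosis}\n"
        f"Summary: {summary}\n"
        "Rate how well the summary agrees with the diagnosis on a 1-5 scale.\n"
        "Reply in the form 'Score: N'."
    )

def parse_judge_score(reply):
    """Extract an integer score in 1..5, or None if the reply has no valid score."""
    match = re.search(r"Score:\s*([1-5])\b", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt(
    "Colon adenocarcinoma",
    "Colon tissue H&E slide showing glandular structures consistent with adenocarcinoma.",
)
print(prompt)
print(parse_judge_score("Score: 4"))
```

Returning `None` for unparseable replies lets the caller drop or retry those rows instead of silently averaging in a bad value.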


## 5. Final Results Summary

In [None]:

print("### 📌 Results Summary (Demo Mode)")
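emb_base, emb_rag = np.mean(baseline_scores), np.mean(rag_scores)
print(f"- Embedding-based similarity: Baseline={emb_base:.3f}, RAG={emb_rag:.3f}")
print(f"- GPT-as-a-Judge (simulated): Baseline={avg_base:.2f}, RAG={avg_rag:.2f}")
# Derive the verdict from the numbers instead of hard-coding it.
rag_wins = emb_rag > emb_base and avg_rag > avg_base
verdict = "✅ Overall, RAG+Metadata shows higher alignment with diagnoses than Baseline" if rag_wins else "⚠️ RAG+Metadata does not lead on every metric"
print(f"\n{verdict} in this demo.")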
