# 03 · Automated Evaluators for the Email Summarizer

This notebook sits in the **Measure** phase of the workshop. Earlier notebooks surfaced failure modes and produced labeled traces. Here we explain why automated evaluators matter, compare reference-free and reference-based approaches for the email summarizer, and show how programmatic checks and LLM-as-judge workflows accelerate iteration.


## Why Automated Evaluators?
- Manual re-labeling after every prompt tweak is slow and inconsistent. Automated evaluators let us re-measure a new run of summaries in minutes rather than hours.
- The Analyze → Measure → Improve loop depends on trustworthy metrics. Code or LLM judges give us repeatable estimates of how often the summarizer still fails.
- Automation is especially valuable for the Enron email summarizer, where most prompt iterations are small wording changes; we do not want to hand-label the same trace set after every tweak.


## Reference-Free vs Reference-Based Metrics
Reference-free metrics inspect the model output directly, while reference-based metrics compare it to a trusted target (for example, human-written summary bullets). In practice we stack both to cover different failure surfaces.

| Metric Type        | Summarizer Example                                        | What It Checks                                | Strengths                                   | Considerations |
|--------------------|-----------------------------------------------------------|------------------------------------------------|---------------------------------------------|----------------|
| Reference-Free     | Flag summaries that sound too casual for executive readers | Tone, structure, presence of a CTA             | Cheap, deterministic, easy to debug         | Misses nuanced factual errors |
| Reference-Based    | Compare generated bullets to analyst-written gold bullets  | Coverage of key decisions / action items       | Precise fidelity check                       | Needs curated references per email |
| Hybrid             | Require both tone check and coverage check                | Tone + correctness in one report               | Broader failure coverage                     | Higher maintenance across metrics |

The rest of this notebook uses synthetic data shaped like our email summarizer traces to illustrate each style of evaluator.
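
As a rough illustration of the reference-based row in the table, the sketch below scores how many gold bullets a generated summary covers using simple token overlap. The function name `bullet_coverage`, the 0.6 threshold, and the sample texts are all hypothetical, and token overlap is only a stand-in for whatever matching a real coverage evaluator would use.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude normalizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bullet_coverage(summary: str, gold_bullets: list[str], threshold: float = 0.6) -> float:
    """Fraction of gold bullets whose tokens mostly appear in the summary.

    A bullet counts as covered when at least `threshold` of its tokens
    occur in the summary. The threshold is an assumed value, not a tuned one.
    """
    if not gold_bullets:
        return float("nan")
    summary_tokens = _tokens(summary)
    covered = 0
    for bullet in gold_bullets:
        bullet_tokens = _tokens(bullet)
        if not bullet_tokens:
            continue  # an empty bullet can never be covered
        overlap = len(bullet_tokens & summary_tokens) / len(bullet_tokens)
        if overlap >= threshold:
            covered += 1
    return covered / len(gold_bullets)


# Hypothetical gold bullets and generated summary.
gold = [
    "Finance will review the numbers tomorrow",
    "Ops owns the migration checklist",
]
generated = "Finance will review the numbers tomorrow; decisions were documented."
coverage = bullet_coverage(generated, gold)
print(coverage)  # 0.5: the first bullet is covered, the second is not
```

Unlike the tone check, this metric cannot run without a curated `gold` list per email, which is the maintenance cost noted in the table.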


## Programmatic Evaluators for Informal Tone
When the failure definition is objective (e.g., whether a summary contains any phrase from an agreed list of informal expressions), a programmatic evaluator is fast and reliable. Below we flag summaries that include slang we have agreed is unacceptable for client-facing communications.


In [None]:
import pandas as pd

INFORMAL_KEYWORDS = {
    "hey team",
    "super pumped",
    "lol",
    "cheers",
    "you guys",
    "gonna",
}

sample_summaries = pd.DataFrame([
    {
        "summary_id": "S-001",
        "summary": "Team — Here's the recap: Decisions were documented, and Finance will review numbers tomorrow. No informal tone here.",
        "label": "Human Pass",
    },
    {
        "summary_id": "S-002",
        "summary": "Hey team! Super pumped about the vendor shortlist. You guys should ping me if anything feels off.",
        "label": "Human Fail (too casual)",
    },
    {
        "summary_id": "S-003",
        "summary": "The group confirmed the migration timeline. Cheers, and let's lock in the training invites.",
        "label": "Human Fail (casual sign-off)",
    },
    {
        "summary_id": "S-004",
        "summary": "Team — All action items remain with Ops; no tone violations were spotted.",
        "label": "Human Pass",
    },
])


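def detect_informal_tone(text: str) -> list[str]:
    """Return the informal keywords found in `text` (case-insensitive substring match)."""
    lowered = text.lower()
    # Sort the hits so the output is stable; set iteration order is not guaranteed.
    return sorted(kw for kw in INFORMAL_KEYWORDS if kw in lowered)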


sample_summaries["informal_hits"] = sample_summaries["summary"].apply(detect_informal_tone)
sample_summaries["flagged_informal"] = sample_summaries["informal_hits"].apply(bool)
sample_summaries[["summary_id", "flagged_informal", "informal_hits", "label"]]


The heuristic catches rows S-002 and S-003 because they contain agreed-upon informal phrases. Programmatic evaluators like this are ideal for specification failures we can encode as deterministic rules.
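
One caveat: plain substring matching can over-trigger, since `lol` also appears inside words such as `lollipop`. A possible refinement, sketched below, compiles the keywords into a single word-boundary regex. The names `INFORMAL_PATTERN` and `find_informal_phrases` are hypothetical; the keyword set is copied from the cell above.

```python
import re

INFORMAL_KEYWORDS = {"hey team", "super pumped", "lol", "cheers", "you guys", "gonna"}

# Longest keywords first so multi-word phrases are tried before shorter ones.
_alternatives = "|".join(
    re.escape(kw) for kw in sorted(INFORMAL_KEYWORDS, key=lambda kw: (-len(kw), kw))
)
# \b anchors each keyword at word boundaries, so "lol" no longer matches "lollipop".
INFORMAL_PATTERN = re.compile(rf"\b(?:{_alternatives})\b", re.IGNORECASE)


def find_informal_phrases(text: str) -> list[str]:
    """Return the distinct informal phrases in `text`, lowercased and sorted."""
    return sorted({match.group(0).lower() for match in INFORMAL_PATTERN.finditer(text)})


casual_hits = find_informal_phrases("Hey team! Super pumped about the vendor shortlist.")
lollipop_hits = find_informal_phrases("She brought a lollipop to the kickoff.")
print(casual_hits)    # ['hey team', 'super pumped']
print(lollipop_hits)  # []
```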

For nuanced behaviors such as “Did the summary capture every decision from the thread without hallucinating?” we lean on an LLM-as-judge.


### Dataset Setup and Stratified Splits
We'll start from the merged human-labeled set `../data/llm-judge-sample-full.json` (path relative to this notebook) and carve out 15%/40%/45% train/validation/test splits while preserving the pass/fail balance.


In [None]:
from pathlib import Path
import pandas as pd

RANDOM_SEED = 42
DATA_PATH = Path("../data/llm-judge-sample-full.json")

judge_df = pd.read_json(DATA_PATH)
judge_df["human_judgement"] = judge_df["human_judgement"].str.upper()

label_counts = (
    judge_df["human_judgement"]
    .value_counts()
    .rename_axis("label")
    .reset_index(name="count")
)
label_counts


In [None]:
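from sklearn.model_selection import train_test_split

SPLIT_FRACTIONS = (0.15, 0.40, 0.45)


def stratified_split_sklearn(df, label_col, fractions, seed):
    """Stratified train/val/test split.

    Assumed implementation: this helper is not defined elsewhere in the notebook,
    so this is a sketch of what it plausibly does, not the original code.
    """
    train_frac, val_frac, test_frac = fractions
    # First carve off the train split, then divide the remainder into val/test.
    train_df, rest_df = train_test_split(
        df, train_size=train_frac, stratify=df[label_col], random_state=seed
    )
    val_df, test_df = train_test_split(
        rest_df,
        train_size=val_frac / (val_frac + test_frac),
        stratify=rest_df[label_col],
        random_state=seed,
    )
    parts = {"train": train_df, "val": val_df, "test": test_df}
    return {name: part.reset_index(drop=True) for name, part in parts.items()}


splits = stratified_split_sklearn(
    judge_df,
    label_col="human_judgement",
    fractions=SPLIT_FRACTIONS,
    seed=RANDOM_SEED,
)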

split_summary = (
    pd.concat(
        {
            split_name: df["human_judgement"].value_counts()
            for split_name, df in splits.items()
        },
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .rename_axis("label")
    .sort_index()
)

split_summary

In [None]:
# Persist each split computed in the previous cell so later runs can reload the same partition.

for split_name, df in splits.items():
    output_path = Path(f"../data/llm-judge-split-{split_name}.json")
    output_path.write_text(df.to_json(orient="records", indent=2))
    print(f"Wrote {len(df)} rows to {output_path}")

### Few-shot Prompt Builder
Select a handful of pass/fail examples from the train split so the judge can learn what logically coherent summaries look like.


In [None]:
from typing import List
from textwrap import dedent
import random


def sample_few_shot_examples(
    df: pd.DataFrame,
    label_col: str,
    per_label: dict,
    seed: int = RANDOM_SEED,
) -> pd.DataFrame:
    rng = random.Random(seed)
    selections = []
    for label, quota in per_label.items():
        pool = df[df[label_col] == label]
        if pool.empty:
            continue
        sample_size = min(len(pool), quota)
        if sample_size == 0:
            continue
        selections.append(
            pool.sample(n=sample_size, random_state=rng.randint(0, 10**6))
        )
    if not selections:
        return pd.DataFrame()
    return pd.concat(selections, ignore_index=True)


FEW_SHOT_SPEC = {"PASS": 2, "FAIL": 3}

few_shot_examples = sample_few_shot_examples(
    splits["train"], "human_judgement", FEW_SHOT_SPEC, seed=RANDOM_SEED
)
few_shot_examples[["email_id", "human_judgement", "summary"]]


In [None]:
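def render_example_block(row: pd.Series) -> str:
    # Join the parts line by line instead of dedenting an f-string: a multi-line
    # email interpolated into an indented f-string defeats textwrap.dedent,
    # which only strips whitespace common to every line.
    return "\n".join(
        [
            f"### Example ({row['human_judgement']})",
            "Email:",
            str(row["email"]).strip(),
            "",
            "Generated Summary:",
            str(row["summary"]).strip(),
            "",
            "Human Rationale:",
            str(row["human_reasoning"]).strip(),
        ]
    )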


BASE_PROMPT_HEADER = dedent(
    """
    You are an expert executive-communication editor judging whether a model-produced email summary preserves logical coherence with the source email.

    Definitions:
    - PASS: The summary follows the email's chronology, keeps the cause-and-effect relationships intact, and does not contradict or omit key decisions or next steps.
    - FAIL: The summary scrambles the story, breaks causal links, introduces contradictions, or drops essential commitments.

    Use the examples below to anchor your decisions. Each example includes the original email, the model's summary, and why a human labeled it PASS or FAIL.
    """
).strip()

example_blocks = "\n\n".join(
    render_example_block(row)
    for _, row in few_shot_examples.iterrows()
)

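# Assemble the prompt from parts that are already dedented. Interpolating the
# multi-line header and examples into an indented f-string would leave the
# template lines indented, because textwrap.dedent only strips whitespace
# common to every line.
EVALUATION_INSTRUCTIONS = dedent(
    """
    Now evaluate the candidate summary below for logical coherence.

    Email:
    __EMAIL__

    Model Summary:
    __SUMMARY__

    Respond in JSON with keys "reasoning" and "label" (either "PASS" or "FAIL").
    """
).strip()

judge_prompt_template = "\n\n".join(
    [BASE_PROMPT_HEADER, example_blocks, EVALUATION_INSTRUCTIONS]
)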

print(judge_prompt_template[:1000])


### Judge Agent (Pydantic AI)
Instantiate the judge agent and define the `JudgeOutput` schema. The schema is passed as `output_type` on each call, so every judgment comes back with validated `reasoning` and `label` fields.


In [None]:
import os
from typing import Literal
from pydantic import BaseModel

try:
    from pydantic_ai import Agent
    from pydantic_ai.exceptions import UnexpectedModelBehavior
except ModuleNotFoundError:
    Agent = None
    UnexpectedModelBehavior = Exception
    print("Install pydantic-ai (pip install pydantic-ai) to enable judge calls.")


class JudgeOutput(BaseModel):
    reasoning: str
    label: Literal["PASS", "FAIL"]


if Agent is not None:
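    # Pydantic AI model names take a "provider:model" form; override via the JUDGE_MODEL env var.
    JUDGE_MODEL = os.getenv("JUDGE_MODEL", "openai:gpt-5-mini")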
    judge_agent = Agent(
        JUDGE_MODEL,
        system_prompt="You are an email summarization evaluator focused on logical coherence.",
    )
    print(f"Judge agent ready with model: {JUDGE_MODEL}")
else:
    JUDGE_MODEL = None


### Batch Scoring Helpers
We parallelize judging with a `ThreadPoolExecutor` so the notebook can score 20-row batches quickly while still capturing structured outputs for accuracy checks.
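
The thread-pool pattern can be tried in isolation, without API credentials, by swapping in a stub for the judge call. In the sketch below, `fake_judge` and `score_with_threads` are hypothetical names and the stub's labeling rule is arbitrary; only the `run_in_executor` plus `asyncio.gather` structure is the point. In a notebook, replace `asyncio.run(...)` with `await score_with_threads(rows)`, since a notebook already has a running event loop.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_judge(row: dict) -> dict:
    """Stand-in for a blocking judge call; sleeps instead of hitting an API."""
    time.sleep(0.05)
    label = "FAIL" if "lol" in row["summary"].lower() else "PASS"
    return {"email_id": row["email_id"], "predicted_label": label}


async def score_with_threads(rows: list[dict], max_workers: int = 4) -> list[dict]:
    """Run the blocking calls in worker threads and await them together."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [loop.run_in_executor(executor, fake_judge, row) for row in rows]
        # gather returns results in submission order, not completion order.
        return await asyncio.gather(*futures)


rows = [
    {"email_id": "E-1", "summary": "Finance will review the numbers tomorrow."},
    {"email_id": "E-2", "summary": "lol, nothing was decided."},
    {"email_id": "E-3", "summary": "Ops owns the migration checklist."},
]
results = asyncio.run(score_with_threads(rows))
print(results)
```

Because results come back in submission order, they can be joined to the input rows by position as well as by `email_id`.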


In [None]:
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List

import pandas as pd


def render_prompt(prompt_template: str, email_text: str, summary_text: str) -> str:
    return (
        prompt_template
        .replace("__EMAIL__", email_text.strip())
        .replace("__SUMMARY__", summary_text.strip())
    )


def _judge_single(agent: Any, prompt_template: str, row: dict) -> dict:
    prompt = render_prompt(prompt_template, row["email"], row["summary"])
    try:
        run = agent.run_sync(prompt, output_type=JudgeOutput)
        predicted_label = run.output.label
        reasoning = run.output.reasoning
    except UnexpectedModelBehavior as exc:
        predicted_label = "ERROR"
        reasoning = f"Schema mismatch: {exc}"
    except Exception as exc:
        predicted_label = "ERROR"
        reasoning = str(exc)

    return {
        "email_id": row.get("email_id"),
        "human_label": row.get("human_judgement"),
        "predicted_label": predicted_label,
        "reasoning": reasoning,
    }


async def score_rows(
    agent: Any,
    prompt_template: str,
    rows: List[dict],
    *,
    max_workers: int = 8,
) -> pd.DataFrame:
    if Agent is None:
        raise RuntimeError("Install pydantic-ai to score summaries.")

    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        tasks = [
            loop.run_in_executor(executor, _judge_single, agent, prompt_template, row)
            for row in rows
        ]
        results = await asyncio.gather(*tasks)

    return pd.DataFrame(results)


def compute_accuracy(results: pd.DataFrame) -> Dict[str, float]:
    mask = results["predicted_label"].isin({"PASS", "FAIL"})
    evaluated = results.loc[mask]

    total = len(evaluated)
    if total == 0:
        return {
            "coverage": 0.0,
            "records_evaluated": 0,
            "accuracy": float("nan"),
            "tpr": float("nan"),
            "tnr": float("nan"),
            "balanced_accuracy": float("nan"),
        }

    tp = ((evaluated["human_label"] == "PASS") & (evaluated["predicted_label"] == "PASS")).sum()
    tn = ((evaluated["human_label"] == "FAIL") & (evaluated["predicted_label"] == "FAIL")).sum()
    fp = ((evaluated["human_label"] == "FAIL") & (evaluated["predicted_label"] == "PASS")).sum()
    fn = ((evaluated["human_label"] == "PASS") & (evaluated["predicted_label"] == "FAIL")).sum()

    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    accuracy = (tp + tn) / total
    balanced = (tpr + tnr) / 2 if not (pd.isna(tpr) or pd.isna(tnr)) else float("nan")

    return {
        "coverage": len(evaluated) / len(results) if len(results) else 0.0,
        "records_evaluated": total,
        "accuracy": accuracy,
        "tpr": tpr,
        "tnr": tnr,
        "balanced_accuracy": balanced,
    }


In [None]:
if Agent is not None:
    sample_rows = (
        splits["val"]
        .sample(n=min(20, len(splits["val"])), random_state=RANDOM_SEED)
        .to_dict("records")
    )
    # Run this cell after configuring API credentials for the selected model.
    preview = await score_rows(judge_agent, judge_prompt_template, sample_rows, max_workers=8)
    display(preview)
else:
    print("Install pydantic-ai and instantiate `judge_agent` before scoring.")


In [None]:
metrics = compute_accuracy(preview)
metrics
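
To make the numbers returned by `compute_accuracy` concrete, here is a small worked example on made-up labels (the `human` and `judge` lists are hypothetical, not drawn from the dataset). It treats PASS as the positive class, as the helper above does, and excludes `ERROR` rows before computing rates.

```python
human = ["PASS", "PASS", "PASS", "FAIL", "FAIL", "FAIL", "FAIL", "PASS"]
judge = ["PASS", "PASS", "FAIL", "FAIL", "FAIL", "PASS", "FAIL", "ERROR"]

# Keep only rows where the judge produced a usable label.
pairs = [(h, j) for h, j in zip(human, judge) if j in {"PASS", "FAIL"}]

tp = sum(h == "PASS" and j == "PASS" for h, j in pairs)  # 2
tn = sum(h == "FAIL" and j == "FAIL" for h, j in pairs)  # 3
fp = sum(h == "FAIL" and j == "PASS" for h, j in pairs)  # 1
fn = sum(h == "PASS" and j == "FAIL" for h, j in pairs)  # 1

coverage = len(pairs) / len(human)      # 7 / 8
tpr = tp / (tp + fn)                    # 2 / 3: human PASS rows the judge also passed
tnr = tn / (tn + fp)                    # 3 / 4: human FAIL rows the judge also failed
accuracy = (tp + tn) / len(pairs)       # 5 / 7
balanced_accuracy = (tpr + tnr) / 2     # 17 / 24

print(f"coverage={coverage:.3f} tpr={tpr:.3f} tnr={tnr:.3f}")
print(f"accuracy={accuracy:.3f} balanced_accuracy={balanced_accuracy:.3f}")
```

Balanced accuracy averages the two per-class rates, so it is less swayed than plain accuracy when PASS and FAIL counts are uneven.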


## Homework
- Experiment with different few-shot examples to see how they affect judge accuracy on the validation set.
- Experiment with prompt wording to see how it affects judge accuracy on the validation set.
- Optional: Use DSPy optimizers to search for the combination of few-shot examples and prompt wording that maximizes judge accuracy.