# 03 · Automated Evaluators for the Email Summarizer

This notebook sits in the **Measure** phase of the workshop. Earlier notebooks surfaced failure modes and produced labeled traces. Here we explain why automated evaluators matter, compare reference-free and reference-based approaches for the email summarizer, and show how programmatic checks and LLM-as-judge workflows accelerate iteration.


## Why Automated Evaluators?
- Manual re-labeling after every prompt tweak is slow and inconsistent. Automated evaluators let us re-measure a new run of summaries in minutes rather than hours.
- The Analyze → Measure → Improve loop depends on trustworthy metrics. Code or LLM judges give us repeatable estimates of how often the summarizer still fails.
- Automation is especially valuable for the Enron email summarizer because many prompts need small wording changes; we do not want to hand-label the same trace set after every tweak.


## Reference-Free vs Reference-Based Metrics
Reference-free metrics inspect the model output directly, while reference-based metrics compare it to a trusted target (for example, human-written summary bullets). In practice we stack both to cover different failure surfaces.

| Metric Type        | Summarizer Example                                        | What It Checks                                | Strengths                                   | Considerations |
|--------------------|-----------------------------------------------------------|------------------------------------------------|---------------------------------------------|----------------|
| Reference-Free     | Flag summaries that sound too casual for executive readers | Tone, structure, presence of a CTA             | Cheap, deterministic, easy to debug         | Misses nuanced factual errors |
| Reference-Based    | Compare generated bullets to analyst-written gold bullets  | Coverage of key decisions / action items       | Precise fidelity check                       | Needs curated references per email |
| Hybrid             | Require both tone check and coverage check                | Tone + correctness in one report               | Broader failure coverage                     | Higher maintenance across metrics |

The rest of this notebook uses synthetic data shaped like our email summarizer traces to illustrate each style of evaluator.
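
To make the distinction in the table concrete, here is a minimal sketch of one check from each family. The function names, the "Decisions:" section heading, and the verbatim-match rule are illustrative assumptions, not the workshop's actual evaluators.

```python
import re


def reference_free_has_decisions_section(summary: str) -> bool:
    """Reference-free: inspects only the summary (assumed 'Decisions:' heading)."""
    return bool(re.search(r"^decisions:", summary, flags=re.IGNORECASE | re.MULTILINE))


def reference_based_coverage(summary: str, gold_bullets: list[str]) -> float:
    """Reference-based: fraction of gold bullets found verbatim (case-insensitive)."""
    if not gold_bullets:
        return 1.0
    lowered = summary.lower()
    found = sum(1 for bullet in gold_bullets if bullet.lower() in lowered)
    return found / len(gold_bullets)


# Hypothetical summary and gold bullets for illustration only.
summary = "Team, recap below.\nDecisions:\n- Confirm vendor shortlist"
gold = ["Confirm vendor shortlist", "Publish migration timeline"]

structure_ok = reference_free_has_decisions_section(summary)
coverage = reference_based_coverage(summary, gold)
print(structure_ok, coverage)  # True 0.5
```

The structure check passes without any reference, while the coverage check can only report the missing bullet because a gold list exists for this email.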


## Programmatic Evaluators for Informal Tone
When the failure definition is objective (e.g., the presence or absence of an agreed-upon informal phrase), a programmatic evaluator is fast and reliable. Below we flag summaries that include slang we have agreed is unacceptable for client-facing communications.


In [4]:
import pandas as pd

INFORMAL_KEYWORDS = {
    "hey team",
    "super pumped",
    "lol",
    "cheers",
    "you guys",
    "gonna",
}

sample_summaries = pd.DataFrame([
    {
        "summary_id": "S-001",
        "summary": "Team — Here's the recap: Decisions were documented, and Finance will review numbers tomorrow. No informal tone here.",
        "label": "Human Pass",
    },
    {
        "summary_id": "S-002",
        "summary": "Hey team! Super pumped about the vendor shortlist. You guys should ping me if anything feels off.",
        "label": "Human Fail (too casual)",
    },
    {
        "summary_id": "S-003",
        "summary": "The group confirmed the migration timeline. Cheers, and let's lock in the training invites.",
        "label": "Human Fail (casual sign-off)",
    },
    {
        "summary_id": "S-004",
        "summary": "Team — All action items remain with Ops; no tone violations were spotted.",
        "label": "Human Pass",
    },
])


def detect_informal_tone(text: str):
    lowered = text.lower()
    hits = [kw for kw in INFORMAL_KEYWORDS if kw in lowered]
    return hits


sample_summaries["informal_hits"] = sample_summaries["summary"].apply(detect_informal_tone)
sample_summaries["flagged_informal"] = sample_summaries["informal_hits"].apply(bool)
sample_summaries[["summary_id", "flagged_informal", "informal_hits", "label"]]


|   | summary_id | flagged_informal | informal_hits | label |
|---|------------|------------------|---------------|-------|
| 0 | S-001 | False | [] | Human Pass |
| 1 | S-002 | True | [hey team, you guys, super pumped] | Human Fail (too casual) |
| 2 | S-003 | True | [cheers] | Human Fail (casual sign-off) |
| 3 | S-004 | False | [] | Human Pass |


The heuristic catches rows S-002 and S-003 because they contain agreed-upon informal phrases. Programmatic evaluators like this are ideal for specification failures we can encode as deterministic rules.
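
One caveat with plain substring matching is that a short keyword such as "lol" also matches inside longer words. A possible refinement, sketched here under the assumption that whole-word matching is what we want, uses regex word boundaries. `detect_informal_tone_strict` and `INFORMAL_PHRASES` are hypothetical names introduced for this sketch.

```python
import re

# Assumption: the same phrases as the keyword set above, repeated so the sketch runs alone.
INFORMAL_PHRASES = ["hey team", "super pumped", "lol", "cheers", "you guys", "gonna"]

# \b anchors stop "lol" from matching inside "lollipop".
INFORMAL_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in sorted(INFORMAL_PHRASES)) + r")\b",
    flags=re.IGNORECASE,
)


def detect_informal_tone_strict(text: str) -> list[str]:
    """Return the distinct informal phrases found, in sorted order."""
    return sorted({match.lower() for match in INFORMAL_PATTERN.findall(text)})


print(detect_informal_tone_strict("Hey team! Super pumped about the shortlist."))
# ['hey team', 'super pumped']
print(detect_informal_tone_strict("The lollipop vendor confirmed delivery."))
# []
```

Sorting the hits also keeps the output order stable across runs, which iterating over a set does not guarantee.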

For nuanced behaviors such as “Did the summary capture every decision from the thread without hallucinating?” we lean on an LLM-as-judge.


## LLM-as-Judge Evaluators
### Why we need them
- Capturing every decision/action item requires understanding context, synonyms, and implicature—hard to encode with pure string rules.
- We want interpretable reasoning about misses so we can inspect disagreements.
- LLM judges align with Chapter 5 guidance: give each failure mode a narrowly scoped Pass/Fail task, backed by few-shot examples and structured output.

### Prompt anatomy for the summarizer (Decision Coverage criterion)
1. **Task framing** – “Decide if the summary captures all committed decisions without casual tone.”
2. **Definitions** – Spell out what counts as a Pass vs Fail, including partial coverage and tone violations.
3. **Few-shot examples** – Pull from the labeled email summaries (train split only).
4. **Structured output** – JSON with `reasoning` and `answer` so downstream code can parse and log.


### Prompt Skeleton
```
You are an evaluator for internal executive email summaries. Decide whether the summary captures every committed decision from the source email while keeping a professional tone.

Definitions:
- Pass: All decisions/action items from the source are present (paraphrases allowed) and the tone stays professional.
- Fail: Any decision/action item is missing, hallucinated, or the tone slips into casual language.

Few-shot examples (from train split):
<example>
Source decisions: ["Share revised headcount plan", "Schedule budget review"]
Summary: "Team — We confirmed the revised headcount plan and scheduled the budget review for Thursday."
Label: Pass
</example>
<example>
Source decisions: ["Send updated financial model"]
Summary: "Hey team! Super pumped to send the model tomorrow."
Label: Fail (casual tone)
</example>

Return JSON: {"reasoning": "...", "answer": "Pass" or "Fail"}

```
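
Because the skeleton asks for JSON with `reasoning` and `answer`, downstream code should validate the reply before logging it. The following is a minimal sketch of such a parser; `parse_judge_output` is a hypothetical helper, and the strictness of the checks is an assumption rather than a workshop requirement.

```python
import json

VALID_ANSWERS = ("Pass", "Fail")


def parse_judge_output(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the two expected fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Judge reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("Judge reply must be a JSON object")
    answer = payload.get("answer")
    if answer not in VALID_ANSWERS:
        raise ValueError(f"Unexpected answer: {answer!r}")
    reasoning = payload.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("Missing or empty reasoning")
    return {"reasoning": reasoning.strip(), "answer": answer}


# Hypothetical judge reply for illustration.
reply = '{"reasoning": "Both decisions are present; tone is professional.", "answer": "Pass"}'
parsed = parse_judge_output(reply)
print(parsed["answer"])  # Pass
```

Raising on malformed replies, rather than silently defaulting to Fail, keeps parsing problems from being counted as judge verdicts.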


To keep this notebook self-contained we simulate both the labeled dataset and the judge behavior. In practice you would replace the stub with a real LLM call (e.g., `openai.responses`) and store the prompt template under `prompts/`.


In [None]:
import numpy as np

rng = np.random.default_rng(seed=7)

ACTION_LIBRARY = [
    ("Schedule finance review call", "Set up a finance check-in"),
    ("Share updated revenue snapshot", "Send the refreshed revenue numbers"),
    ("Confirm vendor shortlist", "Lock in the shortlist of vendors"),
    ("Publish migration timeline", "Circulate the migration timeline"),
    ("Finalize legal sign-off", "Secure legal approval"),
    ("Send onboarding packet", "Deliver the onboarding materials"),
]

INFORMAL_SNIPPETS = [
    "Hey team — quick blast!",
    "Super pumped about this!",
    "You guys crushed it",
    "lol let's keep momentum",
    "Cheers,",
]

records = []
for trace_id in range(1, 101):
    action_count = int(rng.integers(2, 4))
    action_indices = rng.choice(len(ACTION_LIBRARY), size=action_count, replace=False)
    canonical_actions = [ACTION_LIBRARY[i][0] for i in action_indices]
    paraphrase_actions = [ACTION_LIBRARY[i][1] for i in action_indices]

    paraphrase_flag = bool(rng.random() < 0.4)
    missed_action_flag = bool(rng.random() < 0.3)
    informal_flag = bool(rng.random() < 0.2)
    informal_intensity = float(rng.choice([0.6, 0.8, 1.0])) if informal_flag else 0.0

    intro = "Team — here is the recap from the thread."
    informal_hits = []
    if informal_flag:
        snippet = rng.choice(INFORMAL_SNIPPETS)
        intro = snippet
        informal_hits.append(snippet.lower())

    summary_actions = []
    literal_matches = 0
    paraphrased_count = 0
    for canonical, paraphrase in zip(canonical_actions, paraphrase_actions):
        text_choice = canonical
        if paraphrase_flag and rng.random() < 0.6:
            text_choice = paraphrase
            paraphrased_count += 1
        summary_actions.append(text_choice)
        if text_choice == canonical:
            literal_matches += 1

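    if missed_action_flag and len(summary_actions) > 1:
        dropped = summary_actions.pop()
        # Keep the counters in sync with the actions that remain in the summary.
        if dropped in canonical_actions:
            literal_matches -= 1
        else:
            paraphrased_count -= 1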

    summary_body = "Decisions:\n" + "\n".join(f"- {text}" for text in summary_actions)
    closing = "We will revisit next sync for status checks."

    summary_text = f"{intro}\n\n{summary_body}\n\n{closing}"

    action_coverage = len(summary_actions) / len(canonical_actions)
    literal_coverage = literal_matches / len(canonical_actions)
    semantic_coverage = action_coverage
    paraphrase_ratio = paraphrased_count / len(summary_actions) if summary_actions else 0.0

    human_pass = not missed_action_flag and not informal_flag

    records.append(
        {
            "trace_id": f"T{trace_id:03d}",
            "summary": summary_text,
            "reference_actions": canonical_actions,
            "summary_actions": summary_actions,
            "action_coverage": action_coverage,
            "literal_coverage": literal_coverage,
            "semantic_coverage": semantic_coverage,
            "paraphrase_ratio": paraphrase_ratio,
            "paraphrase_flag": paraphrase_flag,
            "missed_action_flag": missed_action_flag,
            "informal_flag": informal_flag,
            "informal_intensity": informal_intensity,
            "informal_hits": informal_hits,
            "human_pass": human_pass,
        }
    )

labeled_summaries = pd.DataFrame.from_records(records)

# Introduce five borderline flips to mimic human ambiguity
flip_indices = rng.choice(labeled_summaries.index, size=5, replace=False)
labeled_summaries.loc[flip_indices, "human_pass"] = ~labeled_summaries.loc[flip_indices, "human_pass"]

labeled_summaries.head()


We now mimic the Chapter 5 discipline: carve out disjoint splits before writing any prompts.
- **Train (15%)** – pool of clear Pass/Fail examples for few-shot snippets.
- **Dev (40%)** – iterate here and inspect disagreements.
- **Test (45%)** – untouched until we freeze the judge.


In [None]:
split_rng = np.random.default_rng(seed=2024)
indices = split_rng.permutation(len(labeled_summaries))

train_size = 15
dev_size = 40

train_idx = indices[:train_size]
dev_idx = indices[train_size:train_size + dev_size]
test_idx = indices[train_size + dev_size:]

splits = {
    "train": labeled_summaries.iloc[train_idx].reset_index(drop=True),
    "dev": labeled_summaries.iloc[dev_idx].reset_index(drop=True),
    "test": labeled_summaries.iloc[test_idx].reset_index(drop=True),
}

{k: len(v) for k, v in splits.items()}
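
A quick sanity check that the three splits are disjoint and cover every row can catch slicing mistakes early. The sketch below re-creates the slicing on bare row positions so it runs on its own; `make_splits` and `assert_disjoint_and_complete` are hypothetical helpers, not part of the notebook's code.

```python
import numpy as np


def make_splits(n_rows: int, train_size: int, dev_size: int, seed: int) -> dict:
    """Shuffle row positions once, then slice into three non-overlapping blocks."""
    order = np.random.default_rng(seed).permutation(n_rows)
    return {
        "train": order[:train_size],
        "dev": order[train_size:train_size + dev_size],
        "test": order[train_size + dev_size:],
    }


def assert_disjoint_and_complete(parts: dict, n_rows: int) -> None:
    """Fail loudly if any row is shared between splits or left out."""
    combined = np.concatenate(list(parts.values()))
    assert len(combined) == n_rows, "splits do not cover every row exactly once"
    assert len(np.unique(combined)) == n_rows, "a row appears in more than one split"


parts = make_splits(n_rows=100, train_size=15, dev_size=40, seed=2024)
assert_disjoint_and_complete(parts, n_rows=100)
print({name: len(idx) for name, idx in parts.items()})
# {'train': 15, 'dev': 40, 'test': 45}
```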


The train split is the only pool we draw few-shot examples from, in the style shown in the prompt skeleton. The cell below previews three train rows.


In [None]:
train_examples = splits["train"][['trace_id', 'summary', 'human_pass']].head(3)
train_examples
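
To turn train rows into the `<example>` blocks used by the prompt skeleton, a small formatter is enough. This is a sketch under the assumption that each row carries `reference_actions`, `summary`, and `human_pass`, as in the synthetic dataset; `render_few_shot` is a hypothetical helper.

```python
import json


def render_few_shot(examples: list[dict]) -> str:
    """Format labeled examples as <example> blocks for the judge prompt."""
    blocks = []
    for ex in examples:
        label = "Pass" if ex["human_pass"] else "Fail"
        decisions = json.dumps(ex["reference_actions"], ensure_ascii=False)
        summary = json.dumps(ex["summary"], ensure_ascii=False)  # quoted, newlines escaped
        blocks.append(
            "<example>\n"
            f"Source decisions: {decisions}\n"
            f"Summary: {summary}\n"
            f"Label: {label}\n"
            "</example>"
        )
    return "\n".join(blocks)


# Hypothetical row, mirroring the second example in the prompt skeleton.
examples = [
    {
        "reference_actions": ["Send updated financial model"],
        "summary": "Hey team! Super pumped to send the model tomorrow.",
        "human_pass": False,
    },
]
rendered = render_few_shot(examples)
print(rendered)
```

With the DataFrame from this notebook, the input could be built with `splits["train"].head(3).to_dict("records")`.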


### Simulating Judge Behaviour
We use a heuristic stub to stand in for the LLM. It exposes the two core levers affected when we edit few-shot examples:
- **Coverage sensitivity** – whether the judge recognises paraphrased decisions.
- **Tone strictness** – how severely the judge penalises casual language.

The baseline configuration mirrors a weak prompt that over-indexes on literal string matches and overlooks mild slang. The improved configuration reflects refined examples that highlight paraphrases and tone violations.


In [None]:
from typing import Dict


def run_judge(df: pd.DataFrame, *, coverage_column: str, coverage_threshold: float, informal_threshold: float, name: str) -> pd.DataFrame:
    rows = []
    for row in df.itertuples(index=False):
        coverage_value = getattr(row, coverage_column)
        coverage_ok = coverage_value >= coverage_threshold
        coverage_reason = (
            f"Coverage {coverage_value:.0%} ≥ {coverage_threshold:.0%}" if coverage_ok
            else f"Coverage {coverage_value:.0%} < {coverage_threshold:.0%}"
        )

        tone_ok = True
        tone_reason = "Tone acceptable"
        if row.informal_intensity >= informal_threshold:
            tone_ok = False
            tone_reason = "Informal tone flagged"
        elif row.informal_intensity > 0:
            tone_reason = "Mild casual tone allowed"

        passes = coverage_ok and tone_ok
        rows.append(
            {
                "trace_id": row.trace_id,
                "judge_answer": "Pass" if passes else "Fail",
                "judge_reasoning": f"{coverage_reason}; {tone_reason}",
            }
        )

    judged = df.merge(pd.DataFrame(rows), on="trace_id")
    judged["judge_name"] = name
    return judged


BASELINE_CONFIG = {
    "coverage_column": "literal_coverage",
    "coverage_threshold": 0.70,
    "informal_threshold": 0.95,
    "name": "Baseline few-shot set",
}

IMPROVED_CONFIG = {
    "coverage_column": "semantic_coverage",
    "coverage_threshold": 0.95,
    "informal_threshold": 0.30,
    "name": "Improved few-shot set",
}


In [None]:
def evaluate_split(df: pd.DataFrame, config: Dict) -> Dict[str, float]:
    judged = run_judge(df, **config)
    # Read labels from the judged frame so labels and predictions share one index.
    labels = judged["human_pass"].astype(int)
    preds = (judged["judge_answer"] == "Pass").astype(int)

    tp = int(((labels == 1) & (preds == 1)).sum())
    tn = int(((labels == 0) & (preds == 0)).sum())
    fp = int(((labels == 0) & (preds == 1)).sum())
    fn = int(((labels == 1) & (preds == 0)).sum())

    total = len(df)
    accuracy = (tp + tn) / total if total else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0

    return {
        "judge_name": config["name"],
        "TP": tp,
        "FP": fp,
        "TN": tn,
        "FN": fn,
        "Accuracy": accuracy,
        "TPR": tpr,
        "TNR": tnr,
    }


def confusion_table(df: pd.DataFrame, config: Dict) -> pd.DataFrame:
    judged = run_judge(df, **config)
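    # Both columns come from the judged frame, so rows stay aligned even if df has a non-default index.
    ctab = pd.crosstab(judged["human_pass"], judged["judge_answer"], rownames=["Human"], colnames=["Judge"], dropna=False)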
    return ctab
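
As a check on the definitions used in `evaluate_split`, here is a toy calculation with hand-countable numbers. The counts are hypothetical and `rates_from_counts` is a helper introduced only for this sketch; "Pass" is treated as the positive class, as in the cell above.

```python
def rates_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, TPR and TNR with 'Pass' as the positive class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # share of human Passes the judge passes
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,  # share of human Fails the judge fails
    }


# Hypothetical counts: 20 human Passes (18 judged Pass), 20 human Fails (15 judged Fail).
toy = rates_from_counts(tp=18, fp=5, tn=15, fn=2)
print(toy)  # {'accuracy': 0.825, 'tpr': 0.9, 'tnr': 0.75}
```

A low TPR means the judge rejects summaries humans accept; a low TNR means it waves through summaries humans reject.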


#### Baseline judge on dev split
The baseline prompt under-penalises slang and misses paraphrased decisions. We see this in the confusion matrix and metrics.


In [None]:
baseline_confusion = confusion_table(splits["dev"], BASELINE_CONFIG)
baseline_metrics = evaluate_split(splits["dev"], BASELINE_CONFIG)
baseline_confusion


In [None]:
baseline_metrics


Inspect a few disagreements to understand failure patterns (the summary text is left out of this view for readability).


In [None]:
baseline_judged_dev = run_judge(splits["dev"], **BASELINE_CONFIG)
dev_disagreements = baseline_judged_dev.assign(human_pass=splits["dev"]["human_pass"])
dev_disagreements = dev_disagreements[dev_disagreements["human_pass"] != (dev_disagreements["judge_answer"] == "Pass")]
dev_disagreements[['trace_id', 'human_pass', 'judge_answer', 'judge_reasoning']].head(5)
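
When reviewing disagreements it can help to separate the two error directions, since each points at a different prompt fix. The sketch below tags each row on a toy frame; `label_disagreements` and the "false pass" / "false fail" names are assumptions made for illustration.

```python
import pandas as pd


def label_disagreements(df: pd.DataFrame) -> pd.Series:
    """Tag each row as 'agree', 'false pass' (judge too lenient) or 'false fail' (judge too strict)."""
    judge_pass = df["judge_answer"].eq("Pass")
    human_pass = df["human_pass"].astype(bool)
    kind = pd.Series("agree", index=df.index)
    kind[judge_pass & ~human_pass] = "false pass"
    kind[~judge_pass & human_pass] = "false fail"
    return kind


# Hypothetical judged rows for illustration.
toy = pd.DataFrame(
    {
        "trace_id": ["T001", "T002", "T003", "T004"],
        "human_pass": [True, False, True, False],
        "judge_answer": ["Pass", "Pass", "Fail", "Fail"],
    }
)
toy["disagreement"] = label_disagreements(toy)
print(toy["disagreement"].tolist())
# ['agree', 'false pass', 'false fail', 'agree']
```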


After reviewing the disagreements we add sharper few-shot examples: one that demonstrates professional tone while paraphrasing a decision, and another that calls out slang like “super pumped.” That guides the judge toward semantic coverage and stricter tone enforcement.


In [None]:
improved_confusion = confusion_table(splits["dev"], IMPROVED_CONFIG)
improved_metrics = evaluate_split(splits["dev"], IMPROVED_CONFIG)
improved_confusion


In [None]:
improved_metrics


In [None]:
import pandas as pd

comparison = pd.DataFrame([baseline_metrics, improved_metrics]).set_index("judge_name")[["Accuracy", "TPR", "TNR"]]
comparison


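In this simulation, accuracy on the dev split should rise once the improved examples teach the judge to accept paraphrases (raising TPR) and reject informal tone (raising TNR). The comparison table above shows the size of each change.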

With the prompt frozen we move to the test split for an unbiased estimate.


In [None]:
test_confusion = confusion_table(splits["test"], IMPROVED_CONFIG)
test_metrics = evaluate_split(splits["test"], IMPROVED_CONFIG)
test_confusion


In [None]:
test_metrics


## Correcting Success Rates
An imperfect judge biases raw success rates. We use the test-set TPR/TNR to correct the observed pass rate on a mock production batch and quantify uncertainty with bootstrap sampling.
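
The correction implemented in `rogan_gladen` below follows from a short identity. Writing $\theta$ for the true pass rate (a symbol introduced here for illustration), the judge passes true Passes at rate TPR and true Fails at rate $1 - \mathrm{TNR}$, so:

$$
p_{\text{obs}} = \theta \cdot \mathrm{TPR} + (1 - \theta)(1 - \mathrm{TNR})
\quad\Longrightarrow\quad
\hat{\theta} = \frac{p_{\text{obs}} + \mathrm{TNR} - 1}{\mathrm{TPR} + \mathrm{TNR} - 1}
$$

The denominator shrinks toward zero as the judge approaches chance level ($\mathrm{TPR} + \mathrm{TNR} = 1$), which is why the estimate becomes unstable for weak judges.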


In [None]:
def rogan_gladen(p_obs: float, tpr: float, tnr: float) -> float:
    denominator = tpr + tnr - 1
    if denominator == 0:
        return float("nan")
    corrected = (p_obs + tnr - 1) / denominator
    return float(min(max(corrected, 0.0), 1.0))


production_sample = labeled_summaries.sample(n=30, random_state=321).copy()
production_results = run_judge(production_sample, **IMPROVED_CONFIG)

observed_pass_rate = (production_results["judge_answer"] == "Pass").mean()
corrected_pass_rate = rogan_gladen(observed_pass_rate, test_metrics["TPR"], test_metrics["TNR"])

{"observed_pass_rate": observed_pass_rate, "corrected_pass_rate": corrected_pass_rate}
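
A round-trip check can build intuition for the correction: start from a known true pass rate, compute the rate an imperfect judge would be expected to report, then recover the original. All numbers are hypothetical, and `expected_observed_rate` and `correct_rate` are names introduced for this sketch (the latter repeats the logic of `rogan_gladen` so the block runs alone).

```python
def expected_observed_rate(true_rate: float, tpr: float, tnr: float) -> float:
    """Pass rate an imperfect judge is expected to report for a given true rate."""
    return true_rate * tpr + (1 - true_rate) * (1 - tnr)


def correct_rate(p_obs: float, tpr: float, tnr: float) -> float:
    """Same correction as rogan_gladen above, clipped to [0, 1]."""
    denominator = tpr + tnr - 1
    if denominator == 0:
        return float("nan")
    return min(max((p_obs + tnr - 1) / denominator, 0.0), 1.0)


true_rate, tpr, tnr = 0.60, 0.90, 0.80  # hypothetical values
p_obs = expected_observed_rate(true_rate, tpr, tnr)
recovered = correct_rate(p_obs, tpr, tnr)
print(round(p_obs, 3), round(recovered, 3))  # 0.62 0.6
```

Here the raw judge rate overstates the true rate by two points, and the correction removes that bias when TPR and TNR are known exactly. In practice they are estimated from the test split, which is what the bootstrap interval accounts for.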


In [None]:
def bootstrap_corrected_rate(test_df: pd.DataFrame, observed_rate: float, *, draws: int = 5000, seed: int = 99) -> Dict[str, float]:
    rng = np.random.default_rng(seed)
    labels = test_df["human_pass"].astype(int).to_numpy()
    preds = (run_judge(test_df, **IMPROVED_CONFIG)["judge_answer"] == "Pass").astype(int).to_numpy()
    n = len(test_df)

    samples = []
    for _ in range(draws):
        idx = rng.integers(0, n, size=n)
        sampled_labels = labels[idx]
        sampled_preds = preds[idx]

        tp = ((sampled_labels == 1) & (sampled_preds == 1)).sum()
        fn = ((sampled_labels == 1) & (sampled_preds == 0)).sum()
        tn = ((sampled_labels == 0) & (sampled_preds == 0)).sum()
        fp = ((sampled_labels == 0) & (sampled_preds == 1)).sum()

        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        tnr = tn / (tn + fp) if (tn + fp) else 0.0
        denominator = tpr + tnr - 1
        if denominator == 0:
            continue
        corrected = (observed_rate + tnr - 1) / denominator
        samples.append(min(max(corrected, 0.0), 1.0))

    if not samples:
        return {"lower": float("nan"), "upper": float("nan")}

    lower, upper = np.percentile(samples, [2.5, 97.5])
    return {"lower": float(lower), "upper": float(upper)}

ci = bootstrap_corrected_rate(splits["test"], observed_pass_rate, draws=3000)
ci


## Takeaways & Next Steps
- Start with programmatic checks for crisp, rule-based spec failures (tone, required boilerplate).
- Use LLM-as-judge evaluators for nuanced behaviors like decision coverage; align them with few-shot iteration on a dev split, then freeze and score test.
- Correct raw success rates with judge accuracy stats and attach confidence intervals before comparing prompt variants.
- To plug this into the real workshop data, swap the synthetic dataset for traces from `data/email_annotations.duckdb`, move the judge stub into `tools/`, and call your LLM provider instead of the heuristic.
