# 01 · Email Eval Brief & Prompt Foundations

This notebook introduces GenAI evaluation fundamentals using the curated Enron email slice from `data/curated_emails.csv`. The goals are to clarify why GenAI evals differ from classic ML metrics, align on the Analyze → Measure → Improve loop, and prepare rubric + prompt assets for the remainder of the workshop.

## What Are GenAI Evals?

- **Definition**: GenAI evaluations systematically measure the quality of high-dimensional outputs (long-form text, multi-field JSON, images, etc.) produced by generative models such as large language models (LLMs). A good evaluation can be interpreted without ambiguity, is reproducible, and provides actionable insights to improve model performance.
- **Contrast with classic ML**: Traditional ML metrics (accuracy, precision, recall) evaluate low-dimensional outputs (e.g., class labels, scalar predictions). GenAI outputs are complex and multifaceted, requiring more nuanced evaluation approaches.

In GenAI applications, the evaluation and testing process is non-trivial and often requires human judgment. This is because generative models can produce a wide range of outputs, and the quality of these outputs can be subjective and context-dependent.

### Why Are GenAI Evals Hard?

1. **Subjective, multifaceted outputs**: Many different summaries and styles can be correct, so there is no single reference answer to compare against; rubrics are needed to normalize judgments.
2. **Non-determinism**: Identical prompts can produce different answers, so we care about failure rates and distributions, not single runs.
3. **Specification drift**: Requirements emerge as we inspect outputs, meaning evaluation criteria evolve alongside prompts and product goals.
4. **Label scarcity**: Golden answers are expensive. Unlike traditional ML datasets, we often lack large-scale labeled data for training and evaluation. Reference-free rules and Subject Matter Expert (SME) participation become crucial.
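
The non-determinism point can be made concrete with a small sketch. It assumes each run of a prompt can be graded pass/fail, and it reports a failure rate with a Wilson score interval instead of trusting a single run. The `fake_summary_fails` function is a hypothetical stand-in for "call the model, then apply the rubric"; the 20% rate and the 50 runs are illustrative values, not measurements from this workshop.

```python
import math
import random


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial failure rate (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = failures / runs
    denom = 1 + z**2 / runs
    centre = (p_hat + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def fake_summary_fails(rng: random.Random, true_failure_rate: float = 0.2) -> bool:
    """Hypothetical stand-in for one model call followed by a rubric check."""
    return rng.random() < true_failure_rate


rng = random.Random(0)  # fixed seed so the sketch is repeatable
runs = 50
failures = sum(fake_summary_fails(rng) for _ in range(runs))
low, high = wilson_interval(failures, runs)
print(f"{failures}/{runs} runs failed; approx. 95% interval: {low:.2f} to {high:.2f}")
```

The width of the interval is a reminder that a handful of runs says little about the true failure rate.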

## The Three Gulfs<sup>1</sup>

The three major challenges in building effective GenAI systems can be framed as three "gulfs" that separate developer intentions from the LLM pipeline and its data (inputs and outputs).

![Three Gulfs](3_gulfs.jpeg)

- **Gulf of Comprehension**: Hard to understand pipeline behaviour at scale → we sample data, inspect outputs, and identify failure modes.
- **Gulf of Specification**: What we mean ≠ what we specify → prompts/instructions must be explicit, data-informed.
- **Gulf of Generalization**: Clear prompts still fail on new inputs → measurable evals help us detect and mitigate drift.

We will repeatedly traverse **Analyze → Measure → Improve** to bridge these gulfs throughout the workshop.

## The Analyze - Measure - Improve Cycle

To bridge the three gulfs, we iterate through the Analyze → Measure → Improve cycle shown below.

![The Analyze - Measure - Improve Cycle](AMI-cycle.jpeg)

## Load the Curated Dataset

Notebook `00-Obtain-Candidate-Set.ipynb` produced `data/curated_emails.csv` via:
- `TIME_WINDOW = (2001-03-01, 2001-06-30)`
- `ACTION_KEYWORDS`, `MAX_RECIPIENTS = 6`
- `BROADCAST_SUBJECT_KEYWORDS`, `MAX_SUBJECT_FREQUENCY = 50`
- Quote-depth heuristics (`MIN_QUOTE_MARKERS = 2`) and `MAX_BODY_CHARS = 5000`
- Hand curation down to the final 100 emails
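
To illustrate how thresholds like these act on a record, here is a minimal sketch. It is not the code from notebook 00. The constant names and values come from the list above, the field names (`sent_at`, `to_count`, `body_char_len`) match columns in the curated CSV, and the inclusive comparisons and the sample records are assumptions.

```python
from datetime import datetime

# Values quoted in the list above; inclusive bounds are an assumption.
TIME_WINDOW = (datetime(2001, 3, 1), datetime(2001, 6, 30, 23, 59, 59))
MAX_RECIPIENTS = 6
MAX_BODY_CHARS = 5000


def passes_basic_filters(email: dict) -> bool:
    """Apply three of the listed constraints to one email record."""
    sent_at = datetime.strptime(email["sent_at"], "%Y-%m-%d %H:%M:%S")
    in_window = TIME_WINDOW[0] <= sent_at <= TIME_WINDOW[1]
    few_recipients = email["to_count"] <= MAX_RECIPIENTS
    short_enough = email["body_char_len"] <= MAX_BODY_CHARS
    return in_window and few_recipients and short_enough


# Hypothetical records for illustration only.
candidates = [
    {"sent_at": "2001-05-21 15:42:00", "to_count": 2, "body_char_len": 1600},
    {"sent_at": "2001-08-02 09:00:00", "to_count": 1, "body_char_len": 900},    # outside window
    {"sent_at": "2001-04-05 06:12:00", "to_count": 12, "body_char_len": 1645},  # too many recipients
]
kept = [email for email in candidates if passes_basic_filters(email)]
print(f"Kept {len(kept)} of {len(candidates)} candidates")
```

The keyword, subject-frequency, and quote-depth rules are left out because their exact semantics are not described here.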

Run the cell below to load the slice (or supply your own CSV via `CANDIDATE_PATH`).

In [None]:
import pandas as pd
from pathlib import Path
import ipywidgets as widgets
from IPython.display import HTML, display
import html

CANDIDATE_PATH = Path("../data/curated_emails.csv")

emails_df = pd.read_csv(CANDIDATE_PATH)
print(f"Loaded {len(emails_df):,} emails from {CANDIDATE_PATH}")
display(emails_df.head())

Loaded 100 emails from ../data/curated_emails.csv


| Unnamed: 0 | file | raw_char_len | subject | normalized_subject | from_raw | from_email | to_raw | to_emails | to_count | cc_raw | ... | body | body_char_len | body_line_count | quote_separator_count | quote_line_count | action_hit | sent_at | quote_markers | email_hash | selection_rank |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | kean-s/archiving/untitled/4815. | 2123 | Re: Memo on policy paper conclusions and staff... | memo on policy paper conclusions and staff call | richard.shapiro@enron.com | richard.shapiro@enron.com | james.steffes@enron.com, steven.kean@enron.com | james.steffes@enron.com;steven.kean@enron.com | 2 |  | ... | FYI\n\n\n\n\n"Seabron Adamson" <seabron.adamso... | 1600 | 52 | 0 | 0 | True | 2001-05-21 15:42:00 | 0 | adefd40358cea160939f9384747401cb087ca097cb15fb... | 1 |
| 1 | beck-s/inbox/313. | 2124 | FW: Power Trading Audit Notification | power trading audit notification | mechelle.atwood@enron.com | mechelle.atwood@enron.com | sally.beck@enron.com | sally.beck@enron.com | 1 |  | ... | I will forward each of these per our discussio... | 1600 | 25 | 1 | 0 | True | 2001-05-09 18:57:58 | 1 | e4019a268ed896bc819979a07b9feefb2c1ac608b6d610... | 2 |
| 2 | zipper-a/deleted_items/374. | 2267 | FW: Investinme Course Offering | investinme course offering | jake.staffel@enron.com | jake.staffel@enron.com | andy.zipper@enron.com | andy.zipper@enron.com | 1 | lydia.cannon@enron.com | ... | Andy,\n\nFYI I have signed up for this course.... | 1616 | 33 | 1 | 0 | True | 2001-05-30 13:33:57 | 1 | f2a2f1d72e667d0c54bad9af97ec3ab9516d98e07fec41... | 3 |
| 3 | mann-k/all_documents/4288. | 3731 | Re: Fountain Valley Transformer Contract - FT ... | fountain valley transformer contract - ft pier... | kay.mann@enron.com | kay.mann@enron.com | sheila.tweed@enron.com | sheila.tweed@enron.com | 1 | john.schwartzenburg@enron.com | ... | My recollection is that the original ABB contr... | 3161 | 73 | 0 | 0 | True | 2001-03-27 10:12:00 | 0 | 448f097af989426c758c65ddd969494679a7e3492b0035... | 4 |
| 4 | stokley-c/chris_stokley/sent/233. | 2174 | Meeting with Houston Settlements | meeting with houston settlements | chris.stokley@enron.com | chris.stokley@enron.com | caroline.emmert@enron.com | caroline.emmert@enron.com | 1 |  | ... | Caroline, \n\t\n\tWhile I agree with you that ... | 1645 | 25 | 0 | 0 | True | 2001-04-05 06:12:00 | 0 | 52daff89cdeed399fd6e9360d1d13601a8a5798ef68eea... | 5 |


### Interactive Email Explorer

Use the widget below to browse individual emails. This makes it easier to inspect quoted history, commitments, and tone without scrolling through raw tables.

In [13]:
email_count = len(emails_df)
if email_count == 0:
    raise ValueError("emails_df is empty; nothing to explore.")

_email_card_style_injected = False


def _ensure_email_card_style() -> None:
    """Inject a lightweight CSS card style once per session."""
    global _email_card_style_injected
    if _email_card_style_injected:
        return
    card_style = """
    <style>
        .email-card {
            font-family: var(--jp-content-font-family, 'Segoe UI', system-ui, sans-serif);
            border: 1px solid rgba(15, 23, 42, 0.12);
            border-radius: 12px;
            padding: 0.9rem 1.1rem;
            background: rgba(244, 246, 252, 0.9);
            box-shadow: 0 4px 12px rgba(15, 23, 42, 0.06);
            color: #1f2933;
            max-width: 100%;
        }
        .email-card__header {
            display: flex;
            flex-wrap: wrap;
            justify-content: space-between;
            align-items: baseline;
            gap: 0.75rem;
            margin-bottom: 0.75rem;
            font-weight: 600;
        }
        .email-card__header span:last-child {
            font-size: 0.85rem;
            font-weight: 500;
            color: #52606d;
        }
        .email-card__meta {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
            gap: 0.45rem 1.2rem;
            margin-bottom: 0.9rem;
        }
        .email-card__meta-label {
            font-size: 0.75rem;
            letter-spacing: 0.05em;
            text-transform: uppercase;
            color: #617180;
            margin-bottom: 0.1rem;
        }
        .email-card__meta-value {
            font-size: 0.95rem;
            line-height: 1.35;
            color: #1f2933;
            word-break: break-word;
        }
        .email-card__body {
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace;
            background: rgba(255, 255, 255, 0.85);
            border-radius: 10px;
            padding: 0.9rem 1rem;
            border: 1px solid rgba(15, 23, 42, 0.12);
            white-space: pre-wrap;
            overflow-y: auto;
            max-height: 440px;
        }
        .email-card__body--truncated::after {
            content: '\u2026';
            color: #8996a7;
        }
        .email-card__hint {
            font-size: 0.8rem;
            color: #55606f;
            margin-top: 0.5rem;
        }
    </style>
    """
    display(HTML(card_style))
    _email_card_style_injected = True


def _first_text(*values, default: str = "") -> str:
    """Return the first value that is neither missing (None/NaN) nor blank.

    pandas stores empty CSV cells as NaN, which is truthy, so a plain
    `a or b` chain would never fall through to `b`.
    """
    for value in values:
        if value is None or pd.isna(value):
            continue
        text = str(value).strip()
        if text:
            return text
    return default


def _clean_subject(value) -> str:
    text = _first_text(value, default="(no subject)").replace("\n", " ")
    return text if len(text) <= 80 else f"{text[:77]}..."


def _format_email(idx: int, show_full_body: bool = False) -> widgets.HTML:
    row = emails_df.iloc[idx]
    meta_fields = [
        ("Subject", _first_text(row.get("subject"), default="(no subject)")),
        ("From", _first_text(row.get("from_email"), row.get("from_raw"), default="unknown")),
        ("To", _first_text(row.get("to_emails"), row.get("to_raw"), default="—")),
        ("CC", _first_text(row.get("cc_emails"), row.get("cc_raw"), default="—")),
        ("Characters", f"{int(row.get('body_char_len', 0)):,}"),
        (
            "Quote markers",
            f"{int(row.get('quote_separator_count', 0) + row.get('quote_line_count', 0)):,}",
        ),
    ]
    meta_html = "".join(
        f"<div><div class='email-card__meta-label'>{html.escape(label)}</div>"
        f"<div class='email-card__meta-value'>{html.escape(str(value))}</div></div>"
        for label, value in meta_fields
    )

    body_raw = _first_text(row.get("body"))
    truncated = not show_full_body and len(body_raw) > 1500
    body_slice = body_raw if not truncated else body_raw[:1500].rstrip()
    body_class = "email-card__body"
    if truncated:
        body_class += " email-card__body--truncated"
    if body_slice:
        body_html = html.escape(body_slice)
        body_block = f"<pre class='{body_class}'>{body_html}</pre>"
    else:
        body_block = f"<pre class='{body_class}'><em>empty body</em></pre>"

    hint = (
        '<div class="email-card__hint">Body truncated. Enable "Show full body" to view the remainder.</div>'
        if truncated
        else ""
    )

    card_html = f"""
    <div class='email-card'>
        <div class='email-card__header'>
            <span>Email #{idx}</span>
            <span>Row {idx + 1} of {email_count}</span>
        </div>
        <div class='email-card__meta'>{meta_html}</div>
        {body_block}
        {hint}
    </div>
    """
    return widgets.HTML(card_html)


slider = widgets.IntSlider(
    min=0,
    max=email_count - 1,
    step=1,
    value=0,
    description="Email index",
    continuous_update=False,
    readout=False,
    layout=widgets.Layout(width="340px"),
)
show_full_body = widgets.Checkbox(
    value=False, description="Show full body", indent=False
)

subject_options = [
    (f"{idx:03} — {_clean_subject(emails_df.iloc[idx].get('subject'))}", idx)
    for idx in range(email_count)
]
jump_to = widgets.Dropdown(
    options=subject_options,
    value=0,
    description="Jump to",
    layout=widgets.Layout(width="60%"),
)

header = widgets.HTML(
    f"<h3 style='margin-bottom:0.2rem'>Explore filtered emails ({email_count} total)</h3>"
)
controls = widgets.VBox(
    [widgets.HBox([slider, jump_to]), widgets.HBox([show_full_body])]
)
output = widgets.Output()

display(widgets.VBox([header, controls, output]))


def _render(idx: int, reveal_full: bool) -> None:
    _ensure_email_card_style()
    with output:
        output.clear_output(wait=True)
        display(_format_email(idx, reveal_full))


def _sync_dropdown(idx: int) -> None:
    if jump_to.value == idx:
        return
    jump_to.unobserve(_on_jump, names="value")
    jump_to.value = idx
    jump_to.observe(_on_jump, names="value")


def _on_slider(change: dict) -> None:
    if change["name"] != "value" or change["new"] is None:
        return
    _sync_dropdown(change["new"])
    _render(change["new"], show_full_body.value)


def _on_jump(change: dict) -> None:
    if change["name"] != "value" or change["new"] is None:
        return
    if slider.value != change["new"]:
        slider.value = change["new"]
    else:
        _render(change["new"], show_full_body.value)


def _on_show_full_body(change: dict) -> None:
    if change["name"] != "value":
        return
    _render(slider.value, change["new"])


slider.observe(_on_slider, names="value")
jump_to.observe(_on_jump, names="value")
show_full_body.observe(_on_show_full_body, names="value")

_render(slider.value, show_full_body.value)


## Our GenAI Problem: Email Summaries for Commitments

We want concise summaries that capture:
- Commitments made (tasks/action items)
- Owners and due dates (if present)
- Enough context to bring a new reader up to speed or remind existing participants, without hallucinated details

These long emails—with quoted history intact—provide enough context to practice failure discovery and eval design.

## SME Alignment Models

There are two common SME models for rubric definition and adjudication:
- **Benevolent Dictator**: one lead defines rubric/labels, others review. Fast, consistent—ideal for tight timelines (our workshop default).
- **Committee/Consensus**: multiple SMEs negotiate rubric and adjudicate disagreements. Useful for fuzzy domains, but slower.

For this workshop we assume a *benevolent dictator* model to keep exercises crisp.

## Rubric for Email Summaries

### 1. Commitments Captured
**Pass:** All explicit tasks/requests identified with enough detail to act on  
**Fail:** Missing commitment OR hallucinated task not in original

**Examples:**
- *Pass*: "Send Q3 report by Friday." → captured correctly
- *Fail*: "Can someone review?" → summary assigns the review to John (hallucinated owner)
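
The pass/fail rule above can be sketched as a set comparison between SME-labeled commitments and the commitments a summary reports. This is a deliberately naive illustration: it matches on normalized exact text, whereas real summaries paraphrase and need a human or LLM judge. The function names and sample strings are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def grade_commitments(expected: list[str], captured: list[str]) -> dict:
    """Pass only if nothing is missing and nothing extra was added."""
    expected_set = {normalize(item) for item in expected}
    captured_set = {normalize(item) for item in captured}
    missing = sorted(expected_set - captured_set)
    hallucinated = sorted(captured_set - expected_set)
    return {
        "verdict": "pass" if not missing and not hallucinated else "fail",
        "missing": missing,
        "hallucinated": hallucinated,
    }


# Hypothetical SME label and two hypothetical model outputs.
expected = ["Send Q3 report by Friday"]
print(grade_commitments(expected, ["send Q3 report  by Friday"]))
print(grade_commitments(expected, ["Send Q3 report by Friday", "John will review the draft"]))
```

Returning the missing and hallucinated items alongside the verdict keeps each failure traceable to a specific commitment.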

FILL IN MORE BELOW

## Prompt Engineering Best Practices

When drafting your first summary prompt:
1. **Role & Objective** – e.g., “You are an operations analyst summarizing corporate email threads.”
2. **Instructions** – enumerate required/forbidden behaviours (mention rubric criteria).
3. **Context** – delimit the email body clearly.
4. **Examples** – (optional) include one labeled summary if possible (even synthetic).
5. **Reasoning steps** – (optional) ask the model to extract commitments before writing prose.
6. **Output format** – enforce bullet lists or structured JSON if you need programmatic checks.
7. **Safety clauses** – remind the model not to invent owners/dates.
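
As an illustration of item 6, structured output makes format checks cheap to automate. The sketch below assumes the JSON shape that this notebook's email template asks for (`summary` as a string, `commitments` as an array of strings). The function name and sample payloads are hypothetical, and it uses only the standard library.

```python
import json


def check_summary_json(raw: str) -> list[str]:
    """Return a list of format problems; an empty list means the shape is valid."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    summary = payload.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("`summary` must be a non-empty string")
    commitments = payload.get("commitments")
    if not isinstance(commitments, list) or not all(isinstance(c, str) for c in commitments):
        problems.append("`commitments` must be an array of strings")
    return problems


# Hypothetical model responses.
print(check_summary_json('{"summary": "Kay confirms the contract terms.", "commitments": []}'))
print(check_summary_json('{"summary": "", "commitments": "none"}'))
```

A check like this only validates shape; whether the content is faithful to the email is a separate rubric question.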

We will iterate on this prompt in Notebook 03 with automated evaluators.

### Example Prompt Template

```plaintext
# Startup Idea Generator

## Role
You are a startup advisor identifying viable business opportunities from customer problems.

## Instructions
✅ **Do:**
- Ground ideas in real problems from the chat
- Specify target customers and revenue model
- Assess with existing technology only

❌ **Don't:**
- Suggest ideas needing non-existent tech
- Ignore legal/ethical issues
- Propose vague concepts like "AI for everything"

## Context
{chat_input}

## Example
**Input:** "I waste 10 minutes daily finding matching Tupperware lids"

**Output:**
- **Idea**: RFID kitchen containers that show lid matches via app
- **Customer**: Busy families (12M households)  
- **Revenue**: $89 starter kit + $5/mo subscription
- **Differentiation**: Existing brands lack smart features

## Process
1. Extract core pain point
2. Identify target customer
3. Design feasible solution
4. Define revenue model
5. Check competition

## Output Format
- **Idea**: [One-liner]
- **Problem**: [Pain point + scale]
- **Solution**: [How it works]
- **Customer**: [Who + market size]
- **Revenue**: [Monetization]
- **Competition**: [Alternatives + differentiation]

## Safety
- If vague → ask clarifying questions
- If unclear → state "Need more context" 
- Don't invent numbers or assumptions
```

### Prompt Exercise

FILL IN PROMPT ENGINEERING EXERCISE HERE AS NEEDED

In [14]:
from textwrap import dedent

PROMPT_TEMPLATE = dedent("""
FILL IN ACCORDING TO PROMPT ENGINEERING BEST PRACTICES
You are a ...
                         

Email metadata:
Subject: {subject}
From: {from_line}
To: {to_line}
Cc: {cc_line}

Email body (delimited by triple backticks):
```
{body}
```

Return your answer as JSON with keys `summary` (string) and `commitments` (array of strings).
""")

print(PROMPT_TEMPLATE)


FILL IN ACCORDING TO PROMPT ENGINEERING BEST PRACTICES
You are a ...


Email metadata:
Subject: {subject}
From: {from_line}
To: {to_line}
Cc: {cc_line}

Email body (delimited by triple backticks):
```
{body}
```

Return your answer as JSON with keys `summary` (string) and `commitments` (array of strings).



**Reminder:** If you iterate on the template, save it under a new filename (e.g., `prompts/email_summary_prompt_v2.txt`) and pass that path via `--prompt` when you run `tools/generate_email_traces.py`. Each run records the prompt path and checksum.
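
A checksum ties each trace to the exact prompt text that produced it. The following is a minimal sketch of how such a fingerprint could be computed. SHA-256 is an assumption here, since the algorithm used by `tools/generate_email_traces.py` is not stated, and the file contents are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def prompt_checksum(path: Path) -> str:
    """Hex SHA-256 digest of the prompt file's bytes (algorithm is an assumption)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder prompt versions written to a scratch directory.
    v1 = Path(tmp) / "email_summary_prompt_v1.txt"
    v2 = Path(tmp) / "email_summary_prompt_v2.txt"
    v1.write_text("You are an operations analyst.\n", encoding="utf-8")
    v2.write_text("You are an operations analyst. Do not invent owners.\n", encoding="utf-8")
    digest_v1 = prompt_checksum(v1)
    digest_v2 = prompt_checksum(v2)
    print(v1.name, digest_v1[:12])
    print(v2.name, digest_v2[:12])
```

Any edit to the prompt, however small, changes the digest, so two traces with the same checksum were generated from identical prompt text.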


In [None]:
from pathlib import Path

version = "v1"
PROMPT_PATH = Path(f"../prompts/email_summary_prompt_{version}.txt")
PROMPT_PATH.parent.mkdir(parents=True, exist_ok=True)
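# Write with an explicit encoding so the file reads back identically with encoding="utf-8".
PROMPT_PATH.write_text(PROMPT_TEMPLATE.strip() + "\n", encoding="utf-8")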
print(f"Saved prompt to {PROMPT_PATH.resolve()}")

### Prompt Output Example
The trace generator (Notebook 02) expects a structured response matching this Pydantic model:

In [3]:
from pydantic import BaseModel, Field
from typing import List


class SummaryPayload(BaseModel):
    summary: str = Field(..., description="Concise email summary")
    commitments: List[str] = Field(
        default_factory=list, description="Explicit commitments or action items"
    )

### Try the Prompt with Pydantic AI
Run the cell below after setting an API key (e.g., `OPENAI_API_KEY`). It loads the saved prompt template, fills it with one curated email (the first by default), and validates the response against `SummaryPayload`.

In [None]:
# FILL EMAIL INDEX:
emails_index = 0  # @param {type:"integer"}
import os
from pathlib import Path

import pandas as pd
from pydantic_ai import Agent
from pydantic_ai.exceptions import UnexpectedModelBehavior

DATA_PATH = Path("../data/curated_emails.csv")
# Point this at the prompt file you want to test. The save cell above writes
# email_summary_prompt_{version}.txt, so adjust the filename if you saved a new version.
PROMPT_TEMPLATE_PATH = Path("../prompts/email_summary_prompt.txt")

prompt_template = PROMPT_TEMPLATE_PATH.read_text(encoding="utf-8")
emails_df = pd.read_csv(DATA_PATH)
row = emails_df.iloc[emails_index]  # use the index chosen at the top of the cell


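def _fmt(*values, default="Unknown"):
    """Return the first non-missing, non-blank value as text, else `default`.

    Empty CSV cells load as NaN, which is truthy, so `a or b` would not
    fall back to `b`; checking each candidate explicitly avoids that.
    """
    for value in values:
        if value is None or pd.isna(value):
            continue
        text = str(value).strip()
        if text:
            return text
    return default


body_value = row.get("body")
prompt = prompt_template.format(
    subject=_fmt(row.get("subject"), default="No subject"),
    from_line=_fmt(row.get("from_email"), row.get("from_raw")),
    to_line=_fmt(
        row.get("to_emails"),
        row.get("to_raw"),
        default="(no direct recipients recorded)",
    ),
    cc_line=_fmt(
        row.get("cc_emails"), row.get("cc_raw"), default="(no cc recipients recorded)"
    ),
    # Keep the body verbatim; only replace a missing value with an empty string.
    body="" if body_value is None or pd.isna(body_value) else str(body_value),
)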

model_name = os.environ.get("PYDANTIC_AI_MODEL", "openai:gpt-4o-mini")
agent = Agent(model_name, system_prompt="")

try:
    # Use await instead of run_sync() to work with Jupyter's existing event loop
    result = await agent.run(prompt, output_type=SummaryPayload)
except UnexpectedModelBehavior as exc:
    raise RuntimeError(
        "Model response did not match SummaryPayload. Review your prompt or model settings."
    ) from exc

result.output

SummaryPayload(summary="Forwarded memo from Seabron Adamson to Rick Shapiro (and copied to James Steffes and Steven Kean). The memo pertains to last week's RTO conference call and includes suggested topics for a 2:00 p.m. CDT discussion with Steve Kean. Seabron asks Rick to respond and requests a later quick chat about next steps for the RTOs. Attachment: '01-05-21 SA Memo to Rick Shapiro on RTO conference call.doc'.", commitments=['Please respond to seabron.adamson@frontier-economics.com. Owner: Richard Shapiro. Due: Not stated.', "Discuss/cover the memo's suggested topics with Steve K. at 2:00 p.m. CDT on 05/21/2001. Owner: Unknown. Due: 05/21/2001 2:00 p.m. CDT.", 'Have a quick chat about how to move to the next step with the RTOs (schedule at a later time). Owner: Richard Shapiro. Due: Not stated.'])

In [8]:
from pprint import pprint
pprint(result.output.summary)

('Forwarded memo from Seabron Adamson to Rick Shapiro (and copied to James '
 "Steffes and Steven Kean). The memo pertains to last week's RTO conference "
 'call and includes suggested topics for a 2:00 p.m. CDT discussion with Steve '
 'Kean. Seabron asks Rick to respond and requests a later quick chat about '
 "next steps for the RTOs. Attachment: '01-05-21 SA Memo to Rick Shapiro on "
 "RTO conference call.doc'.")


In [9]:
from pprint import pprint
pprint(result.output.commitments)

['Please respond to seabron.adamson@frontier-economics.com. Owner: Richard '
 'Shapiro. Due: Not stated.',
 "Discuss/cover the memo's suggested topics with Steve K. at 2:00 p.m. CDT on "
 '05/21/2001. Owner: Unknown. Due: 05/21/2001 2:00 p.m. CDT.',
 'Have a quick chat about how to move to the next step with the RTOs (schedule '
 'at a later time). Owner: Richard Shapiro. Due: Not stated.']


## References
1. Shankar, Shreya, et al. "Steering Semantic Data Processing with DocWrangler." Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 2025.

## What’s Next

- Notebook 02: convert manual notes into open/axial codes (failure taxonomy).
- Notebook 03: turn rubric items into automated evaluators (LLM-as-a-judge + checks).
- Optional Homework 01a: generate synthetic emails to stress-test new failure modes.