# 01 · Email Eval Brief & Prompt Foundations

This notebook introduces GenAI evaluation fundamentals using the curated Enron email slice from `data/filtered_emails.csv`. The goals are to clarify why GenAI evals differ from classic ML metrics, align on the Analyze → Measure → Improve loop, and prepare rubric + prompt assets for the remainder of the workshop.

## What Are GenAI Evals?

- **Definition**: GenAI evaluations systematically measure the quality of high-dimensional outputs (long-form text, multi-field JSON, images, etc.) produced by generative models such as large language models (LLMs). A good evaluation can be interpreted without ambiguity, is reproducible, and provides actionable insights to improve model performance.
- **Contrast with classic ML**: Traditional ML metrics (accuracy, precision, recall) focus on low-dimensional outputs (e.g., class labels, scalar predictions). GenAI outputs are complex and multifaceted, requiring more nuanced evaluation approaches.

Evaluating and testing GenAI applications is non-trivial and often requires human judgment: generative models can produce a wide range of outputs for the same task, and the quality of those outputs is often subjective and context-dependent.

### Why Are GenAI Evals Hard?

1. **Subjective, multifaceted outputs**: Multiple correct summaries/styles exist; we need rubrics to normalize judgments.
2. **Non-determinism**: Identical prompts can produce different answers, so we care about failure rates and distributions, not single runs.
3. **Specification drift**: Requirements emerge as we inspect outputs, meaning evaluation criteria evolve alongside prompts and product goals.
4. **Label scarcity**: Golden (reference) answers are expensive to produce, so reference-free checks and input from subject-matter experts (SMEs) become crucial.
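
To make the non-determinism point concrete, here is a minimal sketch that reports a failure rate with an uncertainty range instead of a single pass/fail verdict. It assumes each run of the same prompt has already been judged pass or fail; the function name `failure_rate_with_interval` and the 20 sample outcomes are hypothetical, and the Wilson score interval is one reasonable choice of interval, not something this workshop prescribes.

```python
import math
from typing import Sequence


def failure_rate_with_interval(passed: Sequence[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (failure_rate, lower, upper) using a Wilson score interval."""
    n = len(passed)
    if n == 0:
        raise ValueError("need at least one judged run")
    failures = sum(1 for ok in passed if not ok)
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical outcomes: 20 runs of one prompt on one email, 3 judged as failures.
runs = [True] * 17 + [False] * 3
rate, low, high = failure_rate_with_interval(runs)
print(f"failure rate {rate:.0%} (95% interval {low:.0%} to {high:.0%})")
```

With only 20 runs the interval is wide, which is a useful reminder that a handful of spot checks says little about the true failure rate.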

## The Three Gulfs<sup>1</sup>

The three major challenges in building effective GenAI systems can be pictured as three "gulfs" that separate the developer's intentions from the LLM pipeline and its data (inputs and outputs).

![Three Gulfs](3_gulfs.jpeg)

- **Gulf of Comprehension**: Hard to understand pipeline behaviour at scale → we sample data, inspect outputs, and identify failure modes.
- **Gulf of Specification**: What we mean ≠ what we specify → prompts/instructions must be explicit, data-informed.
- **Gulf of Generalization**: Clear prompts still fail on new inputs → measurable evals help us detect and mitigate drift.

We will repeatedly traverse **Analyze → Measure → Improve** to bridge these gulfs throughout the workshop.

## The Analyze - Measure - Improve Cycle

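Each pass through the cycle has three stages: **Analyze** (sample data, inspect outputs, and identify failure modes), **Measure** (use evals to quantify how often those failures occur), and **Improve** (revise prompts and instructions in response). Repeating the loop is how we narrow the three gulfs.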

![The Analyze - Measure - Improve Cycle](AMI-cycle.jpeg)

## Load the Curated Dataset

Notebook `00-Obtain-Candidate-Set.ipynb` produced `data/filtered_emails.csv` via:
- `TIME_WINDOW = (2001-03-01, 2001-06-30)`
- `ACTION_KEYWORDS`, `MAX_RECIPIENTS = 6`
- `BROADCAST_SUBJECT_KEYWORDS`, `MAX_SUBJECT_FREQUENCY = 50`
- Quote-depth heuristics (`MIN_QUOTE_MARKERS = 2`) and `MAX_BODY_CHARS = 5000`

Run the cell below to load the slice (or supply your own CSV via `CANDIDATE_PATH`).

In [1]:
from pathlib import Path

import pandas as pd
from IPython.display import display

CANDIDATE_PATH = Path('../data/filtered_emails.csv')
if not CANDIDATE_PATH.exists():
    raise FileNotFoundError(f'Missing {CANDIDATE_PATH}. Run notebook 00 or adjust the path.')

emails_df = pd.read_csv(CANDIDATE_PATH)
print(f'Loaded {len(emails_df):,} emails from {CANDIDATE_PATH}')
display(emails_df.head())


Loaded 12,020 emails from ../data/filtered_emails.csv


| | file | raw_char_len | subject | normalized_subject | from_raw | from_email | to_raw | to_emails | to_count | cc_raw | ... | cc_count | date_raw | body | body_char_len | body_line_count | quote_separator_count | quote_line_count | action_hit | sent_at | quote_markers |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sanchez-m/deleted_items/21. | 3578 | Fw: Fw: FW: Dr. Seuss (Don't delete | dr. seuss (don't delete | dpriese@worldnet.att.net | dpriese@worldnet.att.net | tim.deanna@enron.com, cornella.deb@enron.com, ... | tim.deanna@enron.com;cornella.deb@enron.com;ti... | 6 | | ... | 0 | Thu, 31 May 2001 08:39:20 -0700 (PDT) | ----- Original Message -----\nFrom: patbelle <... | 2540 | 237 | 1 | 226 | True | 2001-05-31 15:39:20 | 227 |
| 1 | farmer-d/logistics/112. | 5492 | Re: FW: FW: 05/01 ENA Gas Sales on HPL, to D &... | 05/01 ena gas sales on hpl, to d & h gas company | tnray@aep.com | tnray@aep.com | j..farmer@enron.com | j..farmer@enron.com | 1 | | ... | 0 | Wed, 20 Jun 2001 13:46:49 -0700 (PDT) | Elizabeth,\nHas contract expired? Unable to ge... | 4920 | 189 | 7 | 157 | True | 2001-06-20 20:46:49 | 164 |
| 2 | mann-k/_sent_mail/1401. | 5216 | Re: Lake Austin Spa Resort | lake austin spa resort | kay.mann@enron.com | kay.mann@enron.com | cindyv@lakeaustin.com | cindyv@lakeaustin.com | 1 | | ... | 0 | Wed, 7 Mar 2001 02:16:00 -0800 (PST) | I think what I will do is send out an email to... | 4759 | 205 | 2 | 157 | True | 2001-03-07 10:16:00 | 159 |
| 3 | mann-k/all_documents/3934. | 5220 | Re: Lake Austin Spa Resort | lake austin spa resort | kay.mann@enron.com | kay.mann@enron.com | cindyv@lakeaustin.com | cindyv@lakeaustin.com | 1 | | ... | 0 | Wed, 7 Mar 2001 02:16:00 -0800 (PST) | I think what I will do is send out an email to... | 4759 | 205 | 2 | 157 | True | 2001-03-07 10:16:00 | 159 |
| 4 | mann-k/sent/2999. | 5210 | Re: Lake Austin Spa Resort | lake austin spa resort | kay.mann@enron.com | kay.mann@enron.com | cindyv@lakeaustin.com | cindyv@lakeaustin.com | 1 | | ... | 0 | Wed, 7 Mar 2001 02:16:00 -0800 (PST) | I think what I will do is send out an email to... | 4759 | 205 | 2 | 157 | True | 2001-03-07 10:16:00 | 159 |
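
In the preview above, rows 2 to 4 carry the same sender, recipient, timestamp, and body, filed under three folders of the `mann-k` mailbox. How common such copies are in the full slice is not stated here, so the following is only a sketch of how one might collapse them before sampling emails for evaluation. It assumes that sender, `sent_at`, and body together identify a message; `DEDUP_KEYS` and the toy frame are illustrative, not part of notebook 00.

```python
import pandas as pd

# Toy frame mirroring the preview: one email filed under three folders, plus another email.
preview = pd.DataFrame({
    "file": [
        "mann-k/_sent_mail/1401.",
        "mann-k/all_documents/3934.",
        "mann-k/sent/2999.",
        "farmer-d/logistics/112.",
    ],
    "from_email": ["kay.mann@enron.com"] * 3 + ["tnray@aep.com"],
    "sent_at": ["2001-03-07 10:16:00"] * 3 + ["2001-06-20 20:46:49"],
    "body": ["I think what I will do is send out an email to..."] * 3
    + ["Elizabeth,\nHas contract expired? Unable to ge..."],
})

# Assumption: these three columns together identify a message.
DEDUP_KEYS = ["from_email", "sent_at", "body"]

# Keep the first copy of each message and renumber the rows.
deduped = preview.drop_duplicates(subset=DEDUP_KEYS, keep="first").reset_index(drop=True)
print(f"{len(preview)} rows -> {len(deduped)} unique emails")
```

Leaving duplicates in place would let a single email count several times in any failure rate computed later.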


In [2]:
summary_cols = ['body_char_len', 'quote_separator_count', 'quote_line_count', 'to_count', 'cc_count']
summary = emails_df[summary_cols].describe().T
summary['nonzero_pct'] = (emails_df[summary_cols] > 0).mean() * 100
summary


| | count | mean | std | min | 25% | 50% | 75% | max | nonzero_pct |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| body_char_len | 12020.0 | 2420.222962 | 997.85616 | 296.0 | 1601.0 | 2194.0 | 3049.25 | 4999.0 | 100.0 |
| quote_separator_count | 12020.0 | 0.473877 | 0.888716 | 0.0 | 0.0 | 0.0 | 1.0 | 10.0 | 29.517471 |
| quote_line_count | 12020.0 | 2.075208 | 11.282374 | 0.0 | 0.0 | 0.0 | 0.0 | 226.0 | 6.281198 |
| to_count | 12020.0 | 1.301913 | 0.920474 | 0.0 | 1.0 | 1.0 | 1.0 | 6.0 | 96.663894 |
| cc_count | 12020.0 | 0.512646 | 1.074676 | 0.0 | 0.0 | 0.0 | 1.0 | 6.0 | 25.274542 |


### Interactive Email Explorer

Use the widget below to browse individual emails. This makes it easier to inspect quoted history, commitments, and tone without scrolling through raw tables.

In [None]:

try:
    import html
    import ipywidgets as widgets
    from IPython.display import HTML, display
except ImportError as exc:
    raise ImportError('ipywidgets is required for the email explorer. Install with `pip install ipywidgets`.') from exc

email_count = len(emails_df)
if email_count == 0:
    raise ValueError('emails_df is empty; nothing to explore.')

_email_card_style_injected = False


def _ensure_email_card_style() -> None:
    """Inject a lightweight CSS card style once per session."""
    global _email_card_style_injected
    if _email_card_style_injected:
        return
    card_style = """
    <style>
        .email-card {
            font-family: var(--jp-content-font-family, 'Segoe UI', system-ui, sans-serif);
            border: 1px solid rgba(15, 23, 42, 0.12);
            border-radius: 12px;
            padding: 0.9rem 1.1rem;
            background: rgba(244, 246, 252, 0.9);
            box-shadow: 0 4px 12px rgba(15, 23, 42, 0.06);
            color: #1f2933;
            max-width: 100%;
        }
        .email-card__header {
            display: flex;
            flex-wrap: wrap;
            justify-content: space-between;
            align-items: baseline;
            gap: 0.75rem;
            margin-bottom: 0.75rem;
            font-weight: 600;
        }
        .email-card__header span:last-child {
            font-size: 0.85rem;
            font-weight: 500;
            color: #52606d;
        }
        .email-card__meta {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
            gap: 0.45rem 1.2rem;
            margin-bottom: 0.9rem;
        }
        .email-card__meta-label {
            font-size: 0.75rem;
            letter-spacing: 0.05em;
            text-transform: uppercase;
            color: #617180;
            margin-bottom: 0.1rem;
        }
        .email-card__meta-value {
            font-size: 0.95rem;
            line-height: 1.35;
            color: #1f2933;
            word-break: break-word;
        }
        .email-card__body {
            font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace;
            background: rgba(255, 255, 255, 0.85);
            border-radius: 10px;
            padding: 0.9rem 1rem;
            border: 1px solid rgba(15, 23, 42, 0.12);
            white-space: pre-wrap;
            overflow-y: auto;
            max-height: 440px;
        }
        .email-card__body--truncated::after {
            content: '\u2026';
            color: #8996a7;
        }
        .email-card__hint {
            font-size: 0.8rem;
            color: #55606f;
            margin-top: 0.5rem;
        }
    </style>
    """
    display(HTML(card_style))
    _email_card_style_injected = True


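def _first_text(row: pd.Series, *columns: str, default: str = '') -> str:
    """Return the first non-missing, non-blank value among `columns` as text."""
    for column in columns:
        value = row.get(column)
        # Missing cells load as NaN, which is truthy, so `value or fallback` would print 'nan'.
        if value is not None and not pd.isna(value) and str(value).strip():
            return str(value)
    return default


def _clean_subject(value: object) -> str:
    """Collapse a subject to a single line of at most 80 characters."""
    if value is None or pd.isna(value):
        return '(no subject)'
    text = str(value).strip().replace('\n', ' ')
    if not text:
        return '(no subject)'
    return text if len(text) <= 80 else f"{text[:77]}..."


def _format_email(idx: int, show_full_body: bool = False) -> widgets.HTML:
    row = emails_df.iloc[idx]
    # '\u2014' is the dash shown when an address field is empty.
    meta_fields = [
        ('Subject', _first_text(row, 'subject', default='(no subject)')),
        ('From', _first_text(row, 'from_email', 'from_raw', default='unknown')),
        ('To', _first_text(row, 'to_emails', 'to_raw', default='\u2014')),
        ('CC', _first_text(row, 'cc_emails', 'cc_raw', default='\u2014')),
        ('Characters', f"{int(row.get('body_char_len', 0)):,}"),
        ('Quote markers', f"{int(row.get('quote_separator_count', 0) + row.get('quote_line_count', 0)):,}"),
    ]
    meta_html = ''.join(
        f"<div><div class='email-card__meta-label'>{html.escape(label)}</div>"
        f"<div class='email-card__meta-value'>{html.escape(str(value))}</div></div>"
        for label, value in meta_fields
    )

    # A missing body is NaN (a float), which has no .strip(); go through the helper instead.
    body_raw = _first_text(row, 'body').strip()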
    truncated = not show_full_body and len(body_raw) > 1500
    body_slice = body_raw if not truncated else body_raw[:1500].rstrip()
    body_class = 'email-card__body'
    if truncated:
        body_class += ' email-card__body--truncated'
    if body_slice:
        body_html = html.escape(body_slice)
        body_block = f"<pre class='{body_class}'>{body_html}</pre>"
    else:
        body_block = f"<pre class='{body_class}'><em>empty body</em></pre>"

    hint = '<div class="email-card__hint">Body truncated. Enable "Show full body" to view the remainder.</div>' if truncated else ''

    card_html = f"""
    <div class='email-card'>
        <div class='email-card__header'>
            <span>Email #{idx}</span>
            <span>Row {idx + 1} of {email_count}</span>
        </div>
        <div class='email-card__meta'>{meta_html}</div>
        {body_block}
        {hint}
    </div>
    """
    return widgets.HTML(card_html)


slider = widgets.IntSlider(
    min=0,
    max=email_count - 1,
    step=1,
    value=0,
    description='Email index',
    continuous_update=False,
    readout=False,
    layout=widgets.Layout(width='340px')
)
show_full_body = widgets.Checkbox(value=False, description='Show full body', indent=False)

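# Pad indices to the width of the largest one so labels line up (12,020 rows need 5 digits).
index_width = len(str(email_count - 1))
subject_options = [
    (f"{idx:0{index_width}d} \u2014 {_clean_subject(subject)}", idx)
    for idx, subject in enumerate(emails_df['subject'])
]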
jump_to = widgets.Dropdown(
    options=subject_options,
    value=0,
    description='Jump to',
    layout=widgets.Layout(width='60%')
)

header = widgets.HTML(f"<h3 style='margin-bottom:0.2rem'>Explore filtered emails ({email_count} total)</h3>")
controls = widgets.VBox([
    widgets.HBox([slider, jump_to]),
    widgets.HBox([show_full_body])
])
output = widgets.Output()

display(widgets.VBox([header, controls, output]))


def _render(idx: int, reveal_full: bool) -> None:
    _ensure_email_card_style()
    with output:
        output.clear_output(wait=True)
        display(_format_email(idx, reveal_full))


def _sync_dropdown(idx: int) -> None:
    if jump_to.value == idx:
        return
    jump_to.unobserve(_on_jump, names='value')
    jump_to.value = idx
    jump_to.observe(_on_jump, names='value')


def _on_slider(change: dict) -> None:
    if change['name'] != 'value' or change['new'] is None:
        return
    _sync_dropdown(change['new'])
    _render(change['new'], show_full_body.value)


def _on_jump(change: dict) -> None:
    if change['name'] != 'value' or change['new'] is None:
        return
    if slider.value != change['new']:
        slider.value = change['new']
    else:
        _render(change['new'], show_full_body.value)


def _on_show_full_body(change: dict) -> None:
    if change['name'] != 'value':
        return
    _render(slider.value, change['new'])


slider.observe(_on_slider, names='value')
jump_to.observe(_on_jump, names='value')
show_full_body.observe(_on_show_full_body, names='value')

_render(slider.value, show_full_body.value)



In [None]:
emails_df.head()

## Our GenAI Problem: Email Summaries for Commitments

We want concise summaries that capture:
- Commitments made (tasks/action items)
- Owners and due dates (if present)
- Relevant context, while avoiding hallucinations or personal data leaks

These long emails—with quoted history intact—provide enough context to practice failure discovery and eval design.

## SME Alignment Models

Before labeling or writing rubrics, align subject-matter experts (SMEs) on decision power:

- **Benevolent Dictator**: one lead defines rubric/labels, others review. Fast, consistent—ideal for tight timelines (our workshop default).
- **Committee/Consensus**: multiple SMEs negotiate rubric and adjudicate disagreements. Useful for fuzzy domains, but slower.

For this workshop we assume a *benevolent dictator* model to keep exercises crisp, but highlight trade-offs so participants can adapt to their org.

## Rubric for Email Summaries

| Criterion | Pass Definition | Fail Examples |
| --- | --- | --- |
| **Commitments captured** | All explicit tasks/requests appear with enough detail to act | Missing follow-up request or deliverable |
| **Owner & due date accuracy** | Owners/dates match source text; absent info → clearly noted | Hallucinated owner/date; misattributed responsibility |
| **No hallucinations** | Summary sticks to source facts; qualifiers for uncertainty | Invented tasks, fabricated numbers |
| **Clarity & brevity** | ≤120 words, easy to scan, actionable tone | Rambling, unclear, or overlong |

Use this rubric when manually tagging outputs or crafting evaluators. Adjust wording for your domain as needed.
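
Of the four criteria, only the length limit in "Clarity & brevity" can be checked without reading the source email. As a minimal sketch of such a reference-free check, the function below counts whitespace-separated words against the 120-word limit from the table. The name `check_brevity` and the returned dictionary layout are hypothetical, and the check says nothing about whether a summary is easy to scan or actionable.

```python
import re

MAX_WORDS = 120  # limit taken from the "Clarity & brevity" rubric row


def check_brevity(summary: str, max_words: int = MAX_WORDS) -> dict:
    """Pass when the summary is non-empty and within the word limit."""
    word_count = len(re.findall(r"\S+", summary))
    return {
        "criterion": "clarity_brevity_length",
        "word_count": word_count,
        "passed": 0 < word_count <= max_words,
    }


# Hypothetical summary text, written by hand for illustration.
result = check_brevity("- Owner: Unknown. Send the revised draft. Due: Not stated.")
print(result)
```

The other three criteria depend on comparing the summary with the email, so they need a human reviewer or a more elaborate evaluator.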

### Manual Inspection Exercise

Pick 2–3 emails from the dataframe above and answer:
1. What commitments/tasks are present? Who owns them and by when?
2. Which rubric criteria are likely to trip up a naive summarizer?
3. What extra context (if any) would you need before trusting an automated summary?

Document your observations—they seed the failure taxonomy in Notebook 02.
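
The summary table earlier reports that only about 6% of emails have a non-zero `quote_line_count`, so picking 2–3 rows uniformly at random will usually miss quoted history. One possible way to choose emails for this exercise is a seeded draw from both groups, sketched below. The helper `sample_for_review`, its defaults, and the toy frame are assumptions for illustration; on the real data you would pass `emails_df`.

```python
import pandas as pd


def sample_for_review(df: pd.DataFrame, per_group: int = 2, seed: int = 7) -> pd.DataFrame:
    """Draw a reproducible sample covering emails with and without quoted lines."""
    has_quotes = df["quote_line_count"] > 0
    picks = []
    for flag in (True, False):
        group = df[has_quotes == flag]
        if group.empty:
            continue  # nothing to draw from in this group
        picks.append(group.sample(n=min(per_group, len(group)), random_state=seed))
    if not picks:
        return df.iloc[0:0]  # empty frame with the same columns
    return pd.concat(picks)


# Toy stand-in for emails_df; only the column used for grouping matters here.
toy = pd.DataFrame({
    "subject": [f"email {i}" for i in range(8)],
    "quote_line_count": [0, 0, 226, 0, 157, 0, 0, 12],
})
review_set = sample_for_review(toy)
print(review_set)
```

Fixing the seed means everyone in the session inspects the same rows, which makes it easier to compare notes afterwards.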

## Prompt Engineering Best Practices (Chapter 2 Recap)

When drafting your first summary prompt:
1. **Role & Objective** – e.g., “You are an operations analyst summarizing corporate email threads.”
2. **Instructions** – enumerate required/forbidden behaviours (mention rubric criteria).
3. **Context** – delimit the email body clearly.
4. **Examples** – include one labeled summary if possible (even synthetic).
5. **Reasoning steps** – ask the model to extract commitments before writing prose.
6. **Output format** – enforce bullet lists or structured JSON if you need programmatic checks.
7. **Safety clauses** – remind the model not to invent owners/dates.

We will iterate on this prompt in Notebook 03 with automated evaluators.
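
To illustrate why item 6 matters, here is a small sketch of parsing a structured response so that later checks can run on fields instead of free text. The schema (`task`, `owner`, `due`), the `Commitment` class, and `parse_commitments` are hypothetical; the notebook does not fix an output format, and the example response is hand-written, not produced by a model.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("task", "owner", "due")  # hypothetical schema, not defined by the notebook


@dataclass(frozen=True)
class Commitment:
    task: str
    owner: str
    due: str


def parse_commitments(raw_output: str) -> list[Commitment]:
    """Parse a response expected to be a JSON array of commitment objects."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of commitments")

    commitments = []
    for position, item in enumerate(payload):
        if not isinstance(item, dict):
            raise ValueError(f"item {position} is not a JSON object")
        # Every field must be a non-blank string; absent info should be spelled out, not omitted.
        missing = [
            name for name in REQUIRED_FIELDS
            if not isinstance(item.get(name), str) or not item[name].strip()
        ]
        if missing:
            raise ValueError(f"item {position} is missing fields: {missing}")
        commitments.append(Commitment(**{name: item[name].strip() for name in REQUIRED_FIELDS}))
    return commitments


# Hand-written example response for illustration.
example = '[{"task": "Send the revised draft", "owner": "Unknown", "due": "Not stated"}]'
parsed = parse_commitments(example)
print(parsed[0])
```

A response that fails to parse is itself a measurable failure, separate from whether the extracted commitments are correct.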

### Starter Prompt Scaffold

Fill in the template below during the live session or as homework. Adapt to your organizational style.

In [None]:
from textwrap import dedent

PROMPT_TEMPLATE = dedent("""
You are an operations analyst summarizing internal Enron email conversations.

Instructions:
1. List every explicit commitment or request as a bullet.
2. Note the responsible owner and due date if stated; otherwise say `Owner: Unknown` or `Due: Not stated`.
3. Do not invent facts.

Email body (delimited by triple backticks):
```
{{email_body}}
```

Think step by step to extract commitments before writing the summary.
""")

print(PROMPT_TEMPLATE)
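
The scaffold marks the insertion point as `{{email_body}}`. Python's `str.format` treats doubled braces as escaped literals, so calling `.format(email_body=...)` on the template would leave `{email_body}` in the prompt and never insert the email. A minimal sketch of filling the template by literal replacement follows. The helper `render_prompt`, the shortened stand-in template, and the sample email text are hypothetical, and replacing backtick fences inside the body is one assumed way to keep the delimiter intact.

```python
PLACEHOLDER = "{{email_body}}"  # marker used in the scaffold
FENCE = "`" * 3  # triple backtick, built this way so it can appear inside a fenced example


def render_prompt(template: str, email_body: str) -> str:
    """Insert the email body into the template by literal replacement."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    # Assumption: swapping fences for triple single quotes is acceptable for summarization.
    safe_body = email_body.replace(FENCE, "'''")
    return template.replace(PLACEHOLDER, safe_body)


# Shortened stand-in for PROMPT_TEMPLATE with the same placeholder and delimiters.
template = f"Email body (delimited by triple backticks):\n{FENCE}\n{PLACEHOLDER}\n{FENCE}\n"
prompt = render_prompt(template, "Kay, please send the contract by Friday.")
print(prompt)

# For comparison: str.format collapses the doubled braces and ignores the argument.
print(template.format(email_body="ignored"))
```

If you prefer a templating library that uses double braces natively, the placeholder can stay as written; the point is only to pick one substitution mechanism and use it consistently.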


## References
1. Shankar, Shreya, et al. "Steering Semantic Data Processing with DocWrangler." Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. 2025.

## What’s Next

- Notebook 02: convert manual notes into open/axial codes (failure taxonomy).
- Notebook 03: turn rubric items into automated evaluators (LLM-as-a-judge + checks).
- Optional Homework 01a: generate synthetic emails to stress-test new failure modes.