# Explore Email Dataset

We start with the full Enron email dump (`emails.csv`, ~517k raw messages) and build a
small, workshop-ready subset of emails that already contain rich conversational
context in their quoted history. The goal is to hand facilitators a handful of
messages that:

- revolve around action-oriented language (deadlines, follow ups, approvals)
- keep recipient lists manageable (true conversations, not company-wide blasts)
- include enough quoted history to surface commitments and potential misses
- stay short enough (≤5k characters) for a 60-minute tutorial

Rather than reconstructing threads, we treat each long email as the unit of
analysis. By the end, we will have a scored table (`candidate_df`) packed with the
most useful conversation-heavy emails, ready for manual evaluation and LLM judge
demos.


We cap subject frequency at `MAX_SUBJECT_FREQUENCY = 50` so recurring broadcast topics (e.g., newsletters) don’t dominate the slice.
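
As a rough illustration of the subject-frequency cap, the sketch below applies the same `value_counts` + `map` idea to a toy table. The `toy` frame and `TOY_MAX_SUBJECT_FREQUENCY` are hypothetical stand-ins chosen so the effect is visible on a few rows; they are not values from the workshop data.

```python
import pandas as pd

# Toy data (hypothetical): one subject repeats far more often than the others.
toy = pd.DataFrame(
    {"normalized_subject": ["weekly newsletter"] * 4 + ["gas deal", "gas deal", None]}
)
TOY_MAX_SUBJECT_FREQUENCY = 2  # hypothetical small cap for the demo

# Missing subjects share one placeholder bucket so they are counted together.
subjects = toy["normalized_subject"].fillna("(no subject)")
subject_counts = subjects.value_counts()

# Keep a row only if its subject occurs at most the capped number of times.
kept = toy[subjects.map(subject_counts) <= TOY_MAX_SUBJECT_FREQUENCY]
print(kept)
```

Here the four "weekly newsletter" rows exceed the cap and are dropped, while the two "gas deal" rows and the single missing-subject row survive.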

We apply a deterministic filter stack: `TIME_WINDOW`, `ACTION_KEYWORDS`, `MAX_RECIPIENTS`, `BROADCAST_SUBJECT_KEYWORDS`, `MAX_SUBJECT_FREQUENCY`, `MIN_QUOTE_MARKERS`, and `MAX_BODY_CHARS` to end up with email-sized conversations.

### What We'll Do
- Load the raw Enron CSV and take a quick peek at the MIME payloads.
- Parse headers/body into lightweight features (participants, body length, quote markers).
- Apply a focused filter stack: time window, action keywords, recipient cap, broadcast stoplist, subject frequency cap, long/quoted requirement, and a 5k character ceiling.
- Inspect distributions and rank the remaining emails by conversational depth.
- Preview the top candidates so facilitators can export them for labeling and evals.


In [None]:
import re
from email.utils import getaddresses, parseaddr
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import Markdown, display

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 160)

DATA_PATH = Path("../data/emails.csv")
TIME_WINDOW = (pd.Timestamp("2001-03-01"), pd.Timestamp("2001-06-30"))
ACTION_KEYWORDS = [
    "deadline",
    "deliver",
    "deliverable",
    "please review",
    "fyi",
    "action item",
    "follow up",
    "schedule",
    "update",
    "approve",
]
ACTION_PATTERN = re.compile(
    "|".join(re.escape(k) for k in ACTION_KEYWORDS), flags=re.IGNORECASE
)
BROADCAST_SUBJECT_KEYWORDS = [
    "newsletter",
    "announcement",
    "enron mentions",
    "organizational announcement",
]
MAX_RECIPIENTS = 6
MIN_BODY_CHARS = 800
MIN_QUOTE_MARKERS = 2
MAX_BODY_CHARS = 5000
MAX_SUBJECT_FREQUENCY = 50
plt.style.use("seaborn-v0_8")

## Load raw emails

We ingest the full `emails.csv`, compute quick size stats, and immediately apply a coarse filter: a message must match at least one action keyword and have a raw length (headers included) of at least `MIN_BODY_CHARS` characters. Only those messages carry into later steps.
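
One thing worth keeping in mind about the keyword filter is that it matches substrings, not whole words. The sketch below compares that behavior with a stricter word-boundary variant; `bounded_pattern` and the sample sentences are hypothetical and only meant to show the difference, not to suggest the notebook should switch.

```python
import re

# A subset of the notebook's ACTION_KEYWORDS, for illustration only.
keywords = ["deliver", "fyi", "update", "follow up"]

# Same construction as the notebook: a plain alternation, so substrings match.
substring_pattern = re.compile(
    "|".join(re.escape(k) for k in keywords), flags=re.IGNORECASE
)

# Hypothetical stricter variant: require word boundaries around each keyword.
bounded_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)

samples = [
    "Please follow up with legal by Friday.",
    "The deliverables are attached.",  # contains "deliver" only as a substring
    "Lunch menu for the cafeteria.",
]
results = [
    (bool(substring_pattern.search(s)), bool(bounded_pattern.search(s)))
    for s in samples
]
for text, (loose, strict) in zip(samples, results):
    print(f"substring={loose!s:5} bounded={strict!s:5} {text}")
```

For a coarse first pass the looser substring match is usually acceptable, since later filters narrow the set further.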

In [None]:
raw_df = pd.read_csv(DATA_PATH, usecols=["file", "message"]).rename(
    columns={"message": "raw_message"}
)
raw_df["raw_char_len"] = raw_df["raw_message"].str.len()
print(
    f"Loaded {len(raw_df):,} rows (~{raw_df.raw_char_len.sum() / 1e6:,.1f}M characters)"
)

keyword_regex = "|".join(re.escape(k) for k in ACTION_KEYWORDS)
raw_df = raw_df[
    raw_df["raw_message"].str.contains(keyword_regex, case=False, regex=True, na=False)
]
raw_df = raw_df[raw_df["raw_char_len"] >= MIN_BODY_CHARS]
print(f"After quick keyword/length filter: {len(raw_df):,} rows remain")

display(raw_df.head(3))

### Peek at a raw message

Before parsing, glance at one raw MIME payload so you remember what the quotes and headers look like in the original corpus.
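
If it helps to see the header/body split on something small first, here is a minimal sketch using the standard library's `email` package as a cross-check. The `raw` string is a made-up message shaped like the corpus, not a real row, and the notebook itself uses its own lightweight splitter rather than this parser.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message, shaped like the corpus but not taken from it.
raw = (
    "Message-ID: <123.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: RE: Budget approval\n"
    "\n"
    "Approved.\n"
    "\n"
    " -----Original Message-----\n"
    "> Can you approve the budget by Friday?\n"
)

msg = message_from_string(raw)

# Headers are everything before the first blank line; the payload is the rest.
subject = msg["Subject"]
recipients = [addr for _, addr in getaddresses([msg["To"]])]
sent = parsedate_to_datetime(msg["Date"])
body = msg.get_payload()

print(subject, recipients, sent.isoformat())
print(body)
```

The quoted history (`-----Original Message-----` and `>` lines) lives entirely in the body, which is why the later features count markers there.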

In [None]:
raw_df.sample(1)["raw_message"].values[0]

In [None]:
# Render a sampled raw message as a layperson-friendly preview
sample_row = raw_df.sample(1).iloc[0]
raw_message = sample_row["raw_message"]

header_lines: list[str] = []
body_lines: list[str] = []
found_break = False
for line in raw_message.splitlines():
    if not found_break and line.strip() == "":
        found_break = True
        continue
    if found_break:
        body_lines.append(line.rstrip())
    else:
        header_lines.append(line.rstrip())

header_preview = "\n".join(header_lines)
body_preview = "\n".join(body_lines[:40]).strip()
if len(body_lines) > 40:
    body_preview += "\n..."
body_preview = body_preview or "[empty body]"

email_title = sample_row["file"]

markdown = (
    f"### Sample email: {email_title}\n\n"
    "**Header preview**\n\n"
    "```\n"
    f"{header_preview}\n"
    "```\n\n"
    "**Body preview**\n\n"
    "```\n"
    f"{body_preview}\n"
    "```"
)

display(Markdown(markdown))

## Parse headers and derive features

Split headers/body, normalize the subject, and extract lightweight features (participants, body length, quote markers) that we'll use for filtering and scoring.
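
To make the quote-marker features concrete, the sketch below runs the same two regular expressions used in this section on a made-up body. The `body` text is hypothetical; the patterns are copied from the cell that follows.

```python
import re

# Same patterns as the parsing cell in this section.
SEPARATOR_RE = re.compile(r"-{3,}\s*Original Message", flags=re.IGNORECASE)
QUOTE_LINE_RE = re.compile(r"^(>+)", flags=re.MULTILINE)

# Hypothetical body with one forwarded-message separator and two quoted lines.
body = (
    "Sounds good, I will send the draft Monday.\n"
    "\n"
    " -----Original Message-----\n"
    "From: Bob\n"
    "> Can you send the draft?\n"
    ">> Earlier: we need it before the deadline.\n"
    "Not quoted > just a comparison.\n"
)

separators = len(SEPARATOR_RE.findall(body))   # "-----Original Message" banners
quoted_lines = len(QUOTE_LINE_RE.findall(body))  # lines that START with ">"
print(separators, quoted_lines, separators + quoted_lines)
```

Note that a `>` in the middle of a line is not counted, and neither is a quoted line with leading whitespace, so the count is a conservative proxy for conversational depth.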

In [None]:
HEADER_BREAK = re.compile(r"\r?\n\r?\n")
SUBJECT_PREFIX_RE = re.compile(r"^(re|fw|fwd):\s*", flags=re.IGNORECASE)
SEPARATOR_RE = re.compile(r"-{3,}\s*Original Message", flags=re.IGNORECASE)
QUOTE_LINE_RE = re.compile(r"^(>+)", flags=re.MULTILINE)
WHITESPACE_RE = re.compile(r"\s+")


def split_headers_body(raw: str) -> tuple[str, str]:
    if not isinstance(raw, str):
        return "", ""
    parts = HEADER_BREAK.split(raw, maxsplit=1)
    header_block = parts[0] if parts else ""
    body = parts[1] if len(parts) > 1 else ""
    return header_block, body


def parse_header_block(block: str) -> dict[str, str]:
    headers: dict[str, str] = {}
    current_key: str | None = None
    for line in block.splitlines():
        if not line:
            current_key = None
            continue
        if line.startswith((" ", "\t")) and current_key:
            headers[current_key] += f" {line.strip()}"
            continue
        if ":" not in line:
            current_key = None
            continue
        key, value = line.split(":", 1)
        current_key = key.strip().lower()
        headers[current_key] = value.strip()
    return headers


def normalize_subject(subject: str) -> str:
    if not subject:
        return ""
    cleaned = subject
    for _ in range(5):
        updated = SUBJECT_PREFIX_RE.sub("", cleaned)
        if updated == cleaned:
            break
        cleaned = updated
    cleaned = WHITESPACE_RE.sub(" ", cleaned)
    return cleaned.strip().lower()


def parse_email_fields(raw: str) -> dict[str, object]:
    header_block, body = split_headers_body(raw)
    headers = parse_header_block(header_block)

    subject = headers.get("subject", "").strip()
    normalized_subject = normalize_subject(subject)

    date_raw = headers.get("date", "").strip() or None
    from_raw = headers.get("from", "").strip() or None
    from_email = parseaddr(from_raw)[1].lower() if from_raw else None

    to_raw = headers.get("to", "").strip() or None
    to_emails = [addr.lower() for _, addr in getaddresses([to_raw])] if to_raw else []
    cc_raw = headers.get("cc", "").strip() or None
    cc_emails = [addr.lower() for _, addr in getaddresses([cc_raw])] if cc_raw else []

    body_clean = body.strip()
    quote_separator_count = len(SEPARATOR_RE.findall(body))
    quote_line_count = len(QUOTE_LINE_RE.findall(body))

    return {
        "subject": subject or None,
        "normalized_subject": normalized_subject or None,
        "from_raw": from_raw,
        "from_email": from_email,
        "to_raw": to_raw,
        "to_emails": ";".join(to_emails) or None,
        "to_count": len(to_emails),
        "cc_raw": cc_raw,
        "cc_emails": ";".join(cc_emails) or None,
        "cc_count": len(cc_emails),
        "date_raw": date_raw,
        "body": body_clean,
        "body_char_len": len(body_clean),
        "body_line_count": body_clean.count("\n") + 1 if body_clean else 0,
        "quote_separator_count": quote_separator_count,
        "quote_line_count": quote_line_count,
        "action_hit": bool(ACTION_PATTERN.search(f"{subject}\n{body}"))
        if subject or body
        else False,
    }

In [None]:
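# Build the feature frame in one pass; expanding each dict via apply(pd.Series) is much slower.
parsed_records = pd.DataFrame(
    raw_df["raw_message"].map(parse_email_fields).tolist(), index=raw_df.index
)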
parsed_df = pd.concat([raw_df.drop(columns=["raw_message"]), parsed_records], axis=1)
parsed_df["sent_at"] = pd.to_datetime(
    parsed_df["date_raw"], errors="coerce", utc=True
).dt.tz_localize(None)
parsed_df = parsed_df.dropna(subset=["body"])
parsed_df = parsed_df[parsed_df["body_char_len"] > 0]

print(f"Parsed rows: {len(parsed_df):,}")
display(
    parsed_df[
        [
            "file",
            "sent_at",
            "from_email",
            "subject",
            "body_char_len",
            "quote_separator_count",
        ]
    ].head(5)
)

In [None]:
parsed_df["quote_separator_count"].value_counts()

In [None]:
# Caching the parsed DataFrame for future use
parsed_df.to_csv("../data/parsed_emails.csv", index=False)

In [None]:
parsed_df = pd.read_csv("../data/parsed_emails.csv")

In [None]:
parsed_df.head()

## Filter for workshop-ready emails

Limit the dataset to action-heavy, conversational emails using a series of transparent filters: time window, keyword confirmation, recipient cap, broadcast removal, subject frequency cut, long/quoted requirement, and a 5k ceiling.

Filters applied in order:
1. Time window (`TIME_WINDOW`)
2. Keyword confirmation (`ACTION_KEYWORDS`)
3. Recipient cap (`MAX_RECIPIENTS`)
4. Broadcast subject removal (`BROADCAST_SUBJECT_KEYWORDS`)
5. Subject frequency cap (`MAX_SUBJECT_FREQUENCY`)
6. Conversation depth (`MIN_QUOTE_MARKERS` or extended length)
7. Body length cap (`MAX_BODY_CHARS`)
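
The ordered list above can be thought of as a funnel: each stage takes the survivors of the previous one and records how many remain. Below is a minimal sketch of that pattern on a toy table; the `toy` frame, the `stages` list, and the precomputed `quote_markers` column are hypothetical simplifications, and only three of the seven filters are shown.

```python
import pandas as pd

# Toy data (hypothetical) with just the columns the three demo stages need.
toy = pd.DataFrame(
    {
        "to_count": [1, 12, 2, 3],
        "cc_count": [0, 3, 1, 0],
        "quote_markers": [3, 5, 0, 2],
        "body_char_len": [900, 1500, 400, 7000],
    }
)

# Each stage is a label plus a function that returns a boolean mask.
stages = [
    ("recipients <= 6", lambda df: (df["to_count"] + df["cc_count"]) <= 6),
    ("quote markers >= 2", lambda df: df["quote_markers"] >= 2),
    ("body <= 5000 chars", lambda df: df["body_char_len"] <= 5000),
]

funnel = [("start", len(toy))]
current = toy
for label, make_mask in stages:
    # Apply the stage to whatever survived the previous stages.
    current = current[make_mask(current)]
    funnel.append((label, len(current)))

print(pd.DataFrame(funnel, columns=["stage", "emails"]))
```

Because the stages run in sequence, the order affects the per-stage counts (though not the final set), which is why the notebook reports a snapshot after every step.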

In [None]:
def snapshot(df: pd.DataFrame, stage: str) -> dict[str, object]:
    return {
        "stage": stage,
        "emails": len(df),
        "median_body_chars": int(df["body_char_len"].median()) if not df.empty else 0,
        "median_quote_separators": int(df["quote_separator_count"].median())
        if not df.empty
        else 0,
    }


filtered_df = parsed_df.copy()
filtered_df["sent_at"] = pd.to_datetime(filtered_df["sent_at"], errors="coerce")
progress = [snapshot(filtered_df, "parsed")]

if TIME_WINDOW:
    start, end = TIME_WINDOW
    time_mask = filtered_df["sent_at"].between(start, end, inclusive="both")
    filtered_df = filtered_df[time_mask]
    progress.append(snapshot(filtered_df, f"time window {start.date()}→{end.date()}"))

filtered_df = filtered_df[filtered_df["action_hit"]]
progress.append(snapshot(filtered_df, "keyword confirmed post-parse"))

recipient_mask = (
    filtered_df["to_count"].fillna(0) + filtered_df["cc_count"].fillna(0)
) <= MAX_RECIPIENTS
filtered_df = filtered_df[recipient_mask]
progress.append(snapshot(filtered_df, f"recipient count ≤ {MAX_RECIPIENTS}"))


broadcast_mask = ~filtered_df["normalized_subject"].fillna("").str.contains(
    "|".join(re.escape(k) for k in BROADCAST_SUBJECT_KEYWORDS), case=False, regex=True
)
filtered_df = filtered_df[broadcast_mask]
progress.append(snapshot(filtered_df, "broadcast subjects dropped"))

subject_counts = filtered_df["normalized_subject"].fillna("(no subject)").value_counts()
freq_mask = (
    filtered_df["normalized_subject"].fillna("(no subject)").map(subject_counts)
    <= MAX_SUBJECT_FREQUENCY
)
filtered_df = filtered_df[freq_mask]
progress.append(snapshot(filtered_df, f"subject frequency ≤ {MAX_SUBJECT_FREQUENCY}"))

conversation_mask = (
    filtered_df["quote_separator_count"] + filtered_df["quote_line_count"]
    >= MIN_QUOTE_MARKERS
) | (filtered_df["body_char_len"] >= MIN_BODY_CHARS * 1.5)
filtered_df = filtered_df[conversation_mask]
progress.append(snapshot(filtered_df, "long/quoted conversations"))

filtered_df = filtered_df[filtered_df["body_char_len"] <= MAX_BODY_CHARS]
progress.append(snapshot(filtered_df, f"body length ≤ {MAX_BODY_CHARS:,}"))

progress_df = pd.DataFrame(progress)
display(progress_df)

## Explore the filtered distribution

Check how body lengths and quote markers behave after the filters so we know the remaining sample is both rich in context and manageable for participants.

Look for a right-skewed distribution (a few long emails), and watch for extreme tails that would overwhelm participants.
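
Alongside the histograms, a couple of summary numbers can confirm the shape. The sketch below uses a made-up series of body lengths; the values are hypothetical, and the checks (mean above median, positive skewness, upper quantiles) are one reasonable way to spot a right-skewed distribution, not the only one.

```python
import pandas as pd

# Hypothetical body lengths: mostly short emails plus a couple of long ones.
lengths = pd.Series(
    [900, 950, 1000, 1100, 1200, 1300, 1500, 2000, 3500, 4900],
    name="body_char_len",
)

mean_len = lengths.mean()
median_len = lengths.median()
skewness = lengths.skew()  # positive values indicate a longer right tail
upper_quantiles = lengths.quantile([0.5, 0.9, 0.99])

print(f"mean={mean_len:.0f} median={median_len:.0f} skew={skewness:.2f}")
print(upper_quantiles)
```

On the real data the same calls can be run on `filtered_df["body_char_len"]`; a 99th percentile close to `MAX_BODY_CHARS` would suggest the ceiling is doing real work.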

In [None]:
scored_df = filtered_df.copy()
scored_df["quote_markers"] = (
    scored_df["quote_separator_count"] + scored_df["quote_line_count"]
)
scored_df = scored_df.sort_values(
    ["quote_markers", "body_char_len"], ascending=[False, False]
)
preview_cols = [
    "file",
    "subject",
    "from_email",
    "to_emails",
    "body_char_len",
    "quote_markers",
]
display(scored_df[preview_cols].head(10))
print(f"Final candidate emails: {len(scored_df):,}")

candidate_df = scored_df.copy()

In [None]:
if not filtered_df.empty:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    filtered_df["body_char_len"].plot(kind="hist", bins=40, ax=axes[0], color="#1f77b4")
    axes[0].set_title("Body length distribution")
    axes[0].set_xlabel("Characters")

    quote_markers = (
        filtered_df["quote_separator_count"] + filtered_df["quote_line_count"]
    )
    quote_markers.plot(
        kind="hist",
        # Size the bins from the combined count that is actually plotted.
        bins=range(0, int(quote_markers.max()) + 3),
        ax=axes[1],
        color="#ff7f0e",
    )
    axes[1].set_title("Quote marker distribution")
    axes[1].set_xlabel("Markers per email")

    plt.tight_layout()
    plt.show()
else:
    print("No emails matched the current filter settings.")

## Review top candidate emails

Inspect a few high-scoring messages to confirm the slice feels right for the workshop.

Use these previews to pick golden examples; export them straight into the labeling notebook.

In [None]:
candidate_df.head()

Rank the filtered emails by conversational depth, then preview a few to confirm they are suitable for the workshop exercises.

## Next steps

- Export `candidate_df` for annotation or golden-set authoring.
- Drop selected emails into the evaluation notebooks to practice the Analyze→Measure→Improve loop.
- Adjust the thresholds (`ACTION_KEYWORDS`, `MAX_RECIPIENTS`, `MIN_QUOTE_MARKERS`, `MAX_BODY_CHARS`) if the balance between context and readability needs tuning.
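
When tuning a threshold, it can help to sweep a few values and see how many emails each one keeps before committing to a setting. Here is a minimal sketch for the recipient cap; the `toy` frame and the candidate caps are hypothetical, and on the real data the same idea would be applied to the frame as it stands just before the recipient filter.

```python
import pandas as pd

# Toy data (hypothetical): recipient counts for five emails.
toy = pd.DataFrame({"to_count": [1, 2, 5, 8, 3], "cc_count": [0, 1, 2, 4, 0]})
recipients = toy["to_count"] + toy["cc_count"]  # 1, 3, 7, 12, 3

# Count how many emails survive each candidate cap.
sweep = pd.DataFrame(
    [
        {"max_recipients": cap, "emails_kept": int((recipients <= cap).sum())}
        for cap in (2, 4, 6, 8)
    ]
)
print(sweep)
```

A flat stretch in the sweep (here, caps 4 and 6 keep the same emails) indicates the threshold can move within that range without changing the slice.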

In [None]:
def render_email(row: pd.Series, body_lines: int = 40) -> None:
    def first_present(*values: object, default: str) -> object:
        # After the CSV round trip, missing fields come back as NaN, which is
        # truthy, so `value or fallback` would render "nan". Check explicitly.
        return next((v for v in values if pd.notna(v) and v != ""), default)

    subject = first_present(row.get("subject"), default="(no subject)")
    sender = first_present(row.get("from_email"), row.get("from_raw"), default="unknown")
    to_line = first_present(row.get("to_emails"), row.get("to_raw"), default="—")
    cc_line = first_present(row.get("cc_emails"), row.get("cc_raw"), default="—")
    meta_lines = [
        f"### {subject}",
        f"- File: `{row['file']}`",
        f"- From: {sender}",
        f"- To: {to_line}",
        f"- CC: {cc_line}",
        f"- Sent: {row['sent_at']:%Y-%m-%d %H:%M}"
        if pd.notna(row["sent_at"])
        else "- Sent: unknown",
        f"- Characters: {row['body_char_len']:,}",
        f"- Quote markers: {row['quote_separator_count'] + row['quote_line_count']:,}",
    ]
    body_lines_list = row["body"].splitlines()
    body_preview = "\n".join(body_lines_list[:body_lines]).strip()
    if len(body_lines_list) > body_lines:
        body_preview += "\n..."
    markdown = "\n\n".join(meta_lines) + f"\n\n```\n{body_preview}\n```"
    display(Markdown(markdown))


if not candidate_df.empty:
    top_rows = (
        candidate_df.assign(
            quote_markers=lambda df: df["quote_separator_count"]
            + df["quote_line_count"]
        )
        .sort_values(["quote_markers", "body_char_len"], ascending=[False, False])
        .iloc[97:98]  # Show the 98th-ranked email for variety
    )
    for _, row in top_rows.iterrows():
        render_email(row, 1000)
else:
    print("No emails available to preview.")

In [None]:
candidate_df["body_char_len"].describe()

In [None]:
candidate_df.shape

In [None]:
candidate_df.to_csv("../data/filtered_emails.csv", index=False)
print("Filtered emails saved to filtered_emails.csv")

In [None]:
candidate_df = pd.read_csv("../data/filtered_emails.csv")

In [None]:
print(candidate_df.shape)  # a bare expression is only echoed when it ends the cell
candidate_df.sample(30).to_csv("../data/sample_filtered_emails_sample.csv", index=False)

In [None]:
import hashlib
import pandas as pd
from pathlib import Path

filtered_path = Path("../data/filtered_emails.csv")
if filtered_path.exists():
    df = pd.read_csv(filtered_path)

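    # Assumption: `file` is a unique mailbox path and `raw_char_len` includes
    # folder-specific headers, so hashing them would make every row distinct.
    HASH_EXCLUDE = {"email_hash", "file", "raw_char_len"}

    def compute_email_hash(row):
        joined = "||".join(
            str(row.get(col, "")) for col in df.columns if col not in HASH_EXCLUDE
        )
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()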

    df["email_hash"] = df.apply(compute_email_hash, axis=1)
    before = len(df)
    df = df.drop_duplicates(subset="email_hash")
    removed = before - len(df)
    if removed:
        print(f"Removed {removed} duplicate emails detected via email_hash")
    df.to_csv(filtered_path, index=False)
    display(df.head())
else:
    print("filtered_emails.csv not found; run earlier cells first.")

In [None]:
print(df.shape)  # a bare expression is only echoed when it ends the cell
df.sample(30).to_csv("../data/filtered_emails_sample.csv", index=False)

In [None]:
df.shape