# Homework 01a: Synthetic Email Seed Set

We start with the following process (**define dimensions → generate tuples → materialize scenarios**) but adapt it to email summarization so that downstream notebooks can practice open coding without waiting for production traces.


## Setup
- Python 3.10+
- `pandas`, `pydantic`, `litellm` (only required if you flip on live LLM calls)
- `.env` with `OPENAI_API_KEY` or compatible provider key (optional)

> **Tip:** If you’re running in an offline environment, leave `USE_LLM = False` and the deterministic fallback will fabricate emails using the same template library we’d feed to a model.


In [6]:
from __future__ import annotations

import json
from itertools import product
from pathlib import Path
from datetime import datetime, timedelta
import random

import pandas as pd
from pydantic import BaseModel, Field

try:
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    pass

try:
    from litellm import completion
except ImportError:
    completion = None

USE_LLM = False  # Flip to True to call an LLM for tuple and email generation
MODEL_NAME = "gpt-5-mini"
SEED = 20251015
rng = random.Random(SEED)
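
The seeded generator is what makes the offline path repeatable. As a minimal sketch (the `draw` helper is hypothetical and only illustrates the behavior of `random.Random`), two generators built from the same seed produce the same sequence:

```python
import random

SEED = 20251015  # same value as the notebook's SEED


def draw(seed: int) -> list[int]:
    """Draw five numbers from a fresh generator seeded with `seed`."""
    local_rng = random.Random(seed)
    return [local_rng.randrange(100) for _ in range(5)]


first = draw(SEED)
second = draw(SEED)
print(first == second)  # True
```

Because the notebook shares a single `rng` across cells, its state advances with every draw. Re-running a later cell without re-running this setup cell will therefore produce different samples.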

## Part 1: Define Dimensions & Generate Initial Tuples


### Identify Key Dimensions
The four axes below come directly from the workshop facilitation brief. Keeping the original spelling intact (e.g. `dimplomatic`) helps surface real-world quirks during evaluator design.

1. **Designation:** Associate, Executive, Manager, Sr Manager, SVP, VP, P, C suite, CEO
2. **Tone:** Friendly, formal, crass, dimplomatic, stern, abusive, sympathetic, pleading, commanding, persuasive
3. **Email Context:** Initiated, single reply, reply to thread, broadcast email
4. **Intent:** Informative, Warning, Ordering, Follow Up, Circle Back, Defer, Postpone, Organize, Motivate, Encourage, Reward, Praise, Congratulate, Scold, Explain, Dissuade
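
To get a feel for the size of the design space, the sketch below counts every combination of the four axes. `DIMENSIONS` is an illustrative name that restates the lists above; it is not something the notebook defines at this point.

```python
from itertools import product

# Illustrative restatement of the four axes (spelling kept as in the brief).
DIMENSIONS = {
    "Designation": ["Associate", "Executive", "Manager", "Sr Manager", "SVP",
                    "VP", "P", "C suite", "CEO"],
    "Tone": ["Friendly", "formal", "crass", "dimplomatic", "stern", "abusive",
             "sympathetic", "pleading", "commanding", "persuasive"],
    "EmailContext": ["Initiated", "single reply", "reply to thread",
                     "broadcast email"],
    "Intent": ["Informative", "Warning", "Ordering", "Follow Up", "Circle Back",
               "Defer", "Postpone", "Organize", "Motivate", "Encourage",
               "Reward", "Praise", "Congratulate", "Scold", "Explain",
               "Dissuade"],
}

# Size of the full cross product: 9 * 10 * 4 * 16.
total_combinations = 1
for values in DIMENSIONS.values():
    total_combinations *= len(values)
print(total_combinations)  # 5760

# itertools.product walks the same space lazily, one tuple at a time.
first_combo = next(product(*DIMENSIONS.values()))
print(first_combo)  # ('Associate', 'Friendly', 'Initiated', 'Informative')
```

With 5,760 possible tuples, picking a few dozen to a few hundred is a coverage decision rather than an exhaustive sweep.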


### Compose LLM Prompt for Dimension Tuples
Homework 2 used an LLM to enumerate diverse tuples before rolling into grounded theory. We do the same here, swapping the domain.

```text
You are helping an email summarization team seed synthetic coverage before real logs arrive.
Return 20 unique combinations covering designation, tone, email_context, and intent.
Balance the set—don’t over-represent any single tone or context. Keep values to the provided list.
```

We parse responses into structured objects so they’re easy to audit or regenerate.


In [None]:
class DimensionTuple(BaseModel):
    designation: str = Field(..., alias="Designation")
    tone: str = Field(..., alias="Tone")
    email_context: str = Field(..., alias="EmailContext")
    intent: str = Field(..., alias="Intent")


class DimensionTupleBatch(BaseModel):
    tuples: list[DimensionTuple]


def request_dimension_tuples(prompt: str, total: int = 40) -> list[DimensionTuple]:
    if not USE_LLM:
        raise RuntimeError("LLM usage disabled; set USE_LLM = True to call the API.")
    if completion is None:
        raise ImportError(
            "litellm is required for live LLM calls. Install with `pip install litellm`."
        )

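    system = "You format structured JSON for synthetic email coverage."
    # Put the format instruction on its own line and name the aliased keys that
    # DimensionTuple expects, so the parsed response validates.
    user = (
        prompt
        + f"\nReturn exactly {total} tuples as JSON: "
        + '{"tuples": [{"Designation": "...", "Tone": "...", '
        + '"EmailContext": "...", "Intent": "..."}]}'
    )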
    response = completion(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    parsed = json.loads(response.choices[0].message.content)
    return DimensionTupleBatch(**parsed).tuples


def fallback_dimension_tuples(total: int = 160) -> list[DimensionTuple]:
    # We start with ~100 traces; we over-sample to prune later.
    dimensions = {
        "Designation": [
            "Associate",
            "Executive",
            "Manager",
            "Sr Manager",
            "SVP",
            "VP",
            "P",
            "C suite",
            "CEO",
        ],
        "Tone": [
            "Friendly",
            "formal",
            "crass",
            "dimplomatic",
            "stern",
            "abusive",
            "sympathetic",
            "pleading",
            "commanding",
            "persuasive",
        ],
        "EmailContext": [
            "Initiated",
            "single reply",
            "reply to thread",
            "broadcast email",
        ],
        "Intent": [
            "Informative",
            "Warning",
            "Ordering",
            "Follow Up",
            "Circle Back",
            "Defer",
            "Postpone",
            "Organize",
            "Motivate",
            "Encourage",
            "Reward",
            "Praise",
            "Congratulate",
            "Scold",
            "Explain",
            "Dissuade",
        ],
    }

    all_combos = list(product(*dimensions.values()))
    if total > len(all_combos):
        total = len(all_combos)
    sampled = rng.sample(all_combos, total)
    return [
        DimensionTuple(
            Designation=combo[0],
            Tone=combo[1],
            EmailContext=combo[2],
            Intent=combo[3],
        )
        for combo in sampled
    ]


DIMENSION_PROMPT = (
    "Enumerate structured tuples covering designation, tone, email_context, intent; "
    "use only the provided categorical values and keep the set balanced."
)

if USE_LLM:
    raw_tuples = request_dimension_tuples(DIMENSION_PROMPT, total=60)
else:
    raw_tuples = fallback_dimension_tuples(total=160)

raw_tuples[:3]
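
The prompt asks the model to keep values to the provided list, but nothing above enforces it. The sketch below is one way to check a tuple after the fact; `ALLOWED` and `find_invalid_fields` are hypothetical names, and the input is assumed to be a plain dict keyed by the aliases (for example the output of `t.dict(by_alias=True)`).

```python
# Allowed values per dimension, restated from the dimension lists above.
ALLOWED = {
    "Designation": {"Associate", "Executive", "Manager", "Sr Manager", "SVP",
                    "VP", "P", "C suite", "CEO"},
    "Tone": {"Friendly", "formal", "crass", "dimplomatic", "stern", "abusive",
             "sympathetic", "pleading", "commanding", "persuasive"},
    "EmailContext": {"Initiated", "single reply", "reply to thread",
                     "broadcast email"},
    "Intent": {"Informative", "Warning", "Ordering", "Follow Up", "Circle Back",
               "Defer", "Postpone", "Organize", "Motivate", "Encourage",
               "Reward", "Praise", "Congratulate", "Scold", "Explain",
               "Dissuade"},
}


def find_invalid_fields(record: dict) -> list:
    """Return the keys whose value is missing or outside the allowed set."""
    return [key for key, allowed in ALLOWED.items()
            if record.get(key) not in allowed]


good = {"Designation": "VP", "Tone": "stern",
        "EmailContext": "Initiated", "Intent": "Warning"}
# A model may "correct" the spelling or invent a title; both are rejected.
bad = {"Designation": "Director", "Tone": "diplomatic",
       "EmailContext": "Initiated", "Intent": "Warning"}

print(find_invalid_fields(good))  # []
print(find_invalid_fields(bad))   # ['Designation', 'Tone']
```

The comparison is exact and case-sensitive, so a tuple with `"Formal"` instead of `"formal"` would also be flagged. Whether to reject or normalize such values is a design choice the notebook leaves open.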

### Normalize Tuples
Just like Homework 2, we move the tuples into a dataframe so we can pivot, de-duplicate, or hand them to other teammates for review.


In [None]:
tuple_df = pd.DataFrame([t.dict(by_alias=True) for t in raw_tuples])
tuple_df.insert(0, "tuple_id", range(1, len(tuple_df) + 1))
tuple_df.head()

In [None]:
tuple_df.groupby(["Intent"]).size().rename("count").sort_values(ascending=False)
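
The cell above checks balance for `Intent` only. A minimal stdlib sketch of the same idea across several dimensions follows; `dimension_counts` is a hypothetical helper and the sample records are hand-written, not notebook output.

```python
from collections import Counter


def dimension_counts(records: list, keys: list) -> dict:
    """Count how often each value appears, per dimension."""
    return {key: Counter(record[key] for record in records) for key in keys}


# Tiny hand-written sample for illustration only.
sample_tuples = [
    {"Tone": "stern", "EmailContext": "Initiated"},
    {"Tone": "stern", "EmailContext": "single reply"},
    {"Tone": "formal", "EmailContext": "Initiated"},
    {"Tone": "pleading", "EmailContext": "broadcast email"},
]

counts = dimension_counts(sample_tuples, ["Tone", "EmailContext"])
print(counts["Tone"]["stern"])  # 2

# Allowed values that never appear are coverage gaps.
all_contexts = {"Initiated", "single reply", "reply to thread", "broadcast email"}
missing_contexts = sorted(all_contexts - set(counts["EmailContext"]))
print(missing_contexts)  # ['reply to thread']
```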

## Part 2: Generate Synthetic Emails
We can ask an LLM to draft each email or reuse deterministic scaffolding similar to `generate_synthetic_queries.py`. Both options return the same output shape.


### Compose LLM Prompt for Emails

```text
You write realistic internal corporate emails.
Given designation, tone, email_context, intent, craft an email capturing commitments, owners, and due dates when appropriate.
Return JSON with subject, recipients, body, and a list of commitments (task, owner, due_date or null).
Use Enron-style energy industry flavor and keep commitment bullets explicit.
```


In [None]:
class Commitment(BaseModel):
    task: str
    owner: str
    due_date: str | None


class SyntheticEmail(BaseModel):
    email_id: str
    tuple_id: int
    designation: str
    tone: str
    email_context: str
    intent: str
    subject: str
    recipients: list[str]
    body: str
    commitments: list[Commitment]


def request_email(tuple_row: dict) -> SyntheticEmail:
    if not USE_LLM:
        raise RuntimeError("LLM usage disabled; set USE_LLM = True to call the API.")
    if completion is None:
        raise ImportError(
            "litellm is required for live LLM calls. Install with `pip install litellm`."
        )

    system = (
        "You draft corporate emails with explicit commitments for evaluation drills."
    )
    user = (
        "Draft one email as JSON with fields subject, recipients, body, commitments. "
        "Recipients should be 1-3 plausible employees. "
        # Commitment requires a due_date key, so ask for an explicit null.
        "Commitments must be a list of {task, owner, due_date}; "
        "use null for due_date when there is no deadline. "
        "Use the following metadata: "
        f"Designation={tuple_row['Designation']}, Tone={tuple_row['Tone']}, "
        f"EmailContext={tuple_row['EmailContext']}, Intent={tuple_row['Intent']}."
    )
    response = completion(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    payload = json.loads(response.choices[0].message.content)
    return SyntheticEmail(
        email_id="SYNTH-XXXX",  # placeholder, overwritten below
        tuple_id=int(tuple_row["tuple_id"]),
        designation=tuple_row["Designation"],
        tone=tuple_row["Tone"],
        email_context=tuple_row["EmailContext"],
        intent=tuple_row["Intent"],
        subject=payload["subject"],
        recipients=payload["recipients"],
        body=payload["body"],
        commitments=[Commitment(**c) for c in payload["commitments"]],
    )
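
`request_email` passes the model's reply straight to `json.loads`, which fails if the reply is wrapped in a Markdown code fence. Whether a given model does that is an assumption to verify against real output. If it does, a tolerant parser along these lines may help; `extract_json` is a hypothetical helper, not part of `litellm`.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built here to keep the sketch readable


def extract_json(text: str) -> dict:
    """Parse JSON from model output, tolerating one surrounding code fence."""
    stripped = text.strip()
    match = re.fullmatch(
        rf"{FENCE}(?:json)?\s*(.*?)\s*{FENCE}", stripped, flags=re.DOTALL
    )
    if match:
        stripped = match.group(1)
    return json.loads(stripped)


plain = '{"subject": "Update", "commitments": []}'
fenced = FENCE + "json\n" + plain + "\n" + FENCE

print(extract_json(plain))                          # {'subject': 'Update', 'commitments': []}
print(extract_json(fenced) == extract_json(plain))  # True
```

Text that is neither bare JSON nor a single fenced block still raises `json.JSONDecodeError`, which is usually the right signal to retry or skip that row.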

In [None]:
# Deterministic fallback mirroring Homework 2's scripted generator
PROJECTS = [
    "Q2 Liquidity Review",
    "West Power Expansion",
    "Reg Compliance Sprint",
    "Asset Divestiture Plan",
    "Houston Data Center Move",
    "Pipeline Integrity Audit",
    "Retail Pricing Refresh",
    "Risk Playbook Draft",
    "Trading Floor Renovation",
    "Client Onboarding Revamp",
    "Quarterly Earnings Prep",
    "Grid Reliability Study",
]

LOCATIONS = [
    "Houston",
    "Calgary",
    "London",
    "Singapore",
    "Portland",
    "Austin",
    "Los Angeles",
    "Denver",
]

DEPARTMENTS = [
    "Trading Ops",
    "Risk & Compliance",
    "Finance",
    "Human Capital",
    "Corporate Comms",
    "Market Strategy",
    "IT Services",
    "Legal",
]

PEOPLE = [
    "Jordan Blake",
    "Priya Patel",
    "Morgan Shah",
    "Luis Gomez",
    "Taylor Singh",
    "Avery Chen",
    "Samira Ali",
    "Noah Figueroa",
    "Chris Morales",
    "Harper Diaz",
    "Emerson Pike",
    "Micah Sloan",
    "Riley Tan",
    "Jamie Brooks",
    "Dana Carpenter",
    "Quinn Harper",
    "Rowan Ibarra",
    "Casey Wu",
    "Alex Rivera",
    "Peyton Ellis",
    "Devon Clarke",
    "Sydney Romero",
    "Morgan Caldwell",
    "Alicia Benton",
]

SENDER_DIRECTORY = {
    "Associate": ["Alex Rivera", "Peyton Ellis", "Morgan Shah"],
    "Executive": ["Priya Patel", "Luis Gomez", "Avery Chen"],
    "Manager": ["Jordan Blake", "Harper Diaz", "Chris Morales"],
    "Sr Manager": ["Taylor Singh", "Samira Ali", "Noah Figueroa"],
    "SVP": ["Emerson Pike", "Micah Sloan", "Rowan Ibarra"],
    "VP": ["Jamie Brooks", "Dana Carpenter", "Quinn Harper"],
    "P": ["Casey Wu", "Devon Clarke"],
    "C suite": ["Riley Tan", "Sydney Romero"],
    "CEO": ["Morgan Caldwell", "Alicia Benton"],
}

DESIGNATION_LABELS = {
    "Associate": "Associate",
    "Executive": "Executive Director",
    "Manager": "Manager",
    "Sr Manager": "Senior Manager",
    "SVP": "Senior Vice President",
    "VP": "Vice President",
    "P": "President",
    "C suite": "C-Suite Leader",
    "CEO": "Chief Executive Officer",
}

TONE_PROFILES = {
    "friendly": {
        "greeting": "Hey {salutation},",
        "closing": "Thanks!\n{sender}\n{signature}",
        "bridge": [
            "Hope you're having a solid week.",
            "Quick note to keep everyone aligned.",
        ],
    },
    "formal": {
        "greeting": "Hello {salutation},",
        "closing": "Regards,\n{sender}\n{signature}",
        "bridge": [
            "Providing the latest visibility on our work.",
            "Documenting the decisions below for your awareness.",
        ],
    },
    "crass": {
        "greeting": "Listen {salutation},",
        "closing": "Do it.\n{sender}\n{signature}",
        "bridge": [
            "We're not sugarcoating this.",
            "Cut through the noise and get this handled.",
        ],
    },
    "dimplomatic": {
        "greeting": "Team {salutation},",
        "closing": "Truly,\n{sender}\n{signature}",
        "bridge": [
            "Balancing a few competing priorities here.",
            "Let's land on a path that works for everyone.",
        ],
    },
    "stern": {
        "greeting": "Attention {salutation},",
        "closing": "Non-negotiable.\n{sender}\n{signature}",
        "bridge": [
            "This situation needs immediate discipline.",
            "Zero tolerance for further slippage.",
        ],
    },
    "abusive": {
        "greeting": "This is beyond acceptable, {salutation}.",
        "closing": "Fix it.\n{sender}\n{signature}",
        "bridge": [
            "We've flagged this twice already.",
            "I shouldn't have to repeat myself.",
        ],
    },
    "sympathetic": {
        "greeting": "Hi {salutation},",
        "closing": "Appreciate the lift here.\n{sender}\n{signature}",
        "bridge": [
            "I know the landscape is heavy right now.",
            "Thanks for leaning in despite the crunch.",
        ],
    },
    "pleading": {
        "greeting": "Please, {salutation},",
        "closing": "Really counting on you.\n{sender}\n{signature}",
        "bridge": [
            "I'm asking for a quick assist.",
            "We need this to stay afloat.",
        ],
    },
    "commanding": {
        "greeting": "Team {salutation},",
        "closing": "Execute.\n{sender}\n{signature}",
        "bridge": [
            "Treat the directives below as locked.",
            "Follow the sequence exactly as written.",
        ],
    },
    "persuasive": {
        "greeting": "Folks {salutation},",
        "closing": "Let's make this win.\n{sender}\n{signature}",
        "bridge": [
            "Hear me out—this unlocks serious upside.",
            "The rationale below should get you on board.",
        ],
    },
}

INTENT_SUBJECTS = {
    "Informative": ["Update: {project}", "FYI | {project} status"],
    "Warning": [
        "Escalation: {project} risk at {location}",
        "Heads-Up: {project} blocker",
    ],
    "Ordering": ["Directive: {project}", "Action Order | {project}"],
    "Follow Up": ["Follow-Up: {project} next steps", "Checking back on {project}"],
    "Circle Back": ["Circling Back: {project}", "Looping back on {project}"],
    "Defer": ["Deferral: {project}", "Holding on {project}"],
    "Postpone": ["Postponed: {project}", "Schedule Shift | {project}"],
    "Organize": ["Organize: {project} logistics", "Planning {project}"],
    "Motivate": ["Push on {project}", "Momentum for {project}"],
    "Encourage": ["Boost for {project} team", "Keep going on {project}"],
    "Reward": ["Reward Actions | {project}", "Incentives for {project}"],
    "Praise": ["Applause for {project}", "Shout-out | {project}"],
    "Congratulate": ["Congrats: {project}", "Celebrating {project}"],
    "Scold": ["Course Correct: {project}", "Not acceptable: {project}"],
    "Explain": ["Explainer: {project}", "Background on {project}"],
    "Dissuade": ["Reconsider {project}", "Against fast-tracking {project}"],
}

THREAD_SNIPPETS = [
    "Can you confirm the staffing plan before the steering call?",
    "We promised Finance a cleaner forecast this time.",
    "Compliance pinged us about the overdue attestations.",
    "The Houston desk is asking where the briefing stands.",
    "Let's not repeat last quarter's surprises.",
]

PREVIOUS_SNIPPETS = [
    "Reminder that Finance needs numbers before noon tomorrow.",
    "Do we have a story for leadership yet?",
    "Legal still has open questions we owe answers on.",
    "Ops is concerned about the warehouse impact.",
]

INTENT_TASK_LIBRARY = {
    "Informative": [
        "Summarize the latest metrics for {project} covering {location} operations",
        "Publish the weekly recap deck for {project}",
    ],
    "Warning": [
        "Investigate the risk flagged on {project} in {location}",
        "Draft the mitigation memo for the blocker hitting {project}",
    ],
    "Ordering": [
        "Complete the compliance checklist for {project}",
        "Deliver the handover pack for {project}",
    ],
    "Follow Up": [
        "Collect the missing approvals tied to {project}",
        "Confirm vendor responses for {project}",
    ],
    "Circle Back": [
        "Respond to outstanding questions on {project}",
        "Close the feedback loop with {location} stakeholders on {project}",
    ],
    "Defer": [
        "Log deferred action items affecting {project}",
        "Share the revised timeline after deferring {project} work",
    ],
    "Postpone": [
        "Notify partners about the postponed {project} rollout",
        "Update the launch calendar to reflect the postponed {project}",
    ],
    "Organize": [
        "Coordinate agenda and materials for the {project} workshop",
        "Schedule prep sessions for the {project} review",
    ],
    "Motivate": [
        "Collect success stories to fuel morale on {project}",
        "Set milestone shout-outs for the {project} pod",
    ],
    "Encourage": [
        "Send encouragement notes to analysts covering {project}",
        "Pair junior leads with mentors for {project}",
    ],
    "Reward": [
        "Draft recognition packets for the {project} contributors",
        "Submit bonus nominations for {project} performers",
    ],
    "Praise": [
        "Highlight the accomplishments from {location} supporting {project}",
        "Write the praise snippet for the leadership newsletter on {project}",
    ],
    "Congratulate": [
        "Send congratulations to the {project} task force",
        "Compose the celebration note for the {location} crew on {project}",
    ],
    "Scold": [
        "Address recurring misses tied to {project}",
        "Document corrective steps for {location} issues dragging {project}",
    ],
    "Explain": [
        "Prepare the explainer on why {project} shifted",
        "Outline the background for stakeholders new to {project}",
    ],
    "Dissuade": [
        "Build the case against fast-tracking {project}",
        "Summarize the downsides of the vendor plan impacting {project}",
    ],
}

INTENT_MAIN_LINES = {
    "Informative": "Sharing the latest on {project} so you have the same view I do.",
    "Warning": "We're facing a material risk on {project} and need containment now.",
    "Ordering": "Treat the directives below as non-optional for {project}.",
    "Follow Up": "Looping back to close the loop on {project} touches.",
    "Circle Back": "Time to circle back and settle the loose ends around {project}.",
    "Defer": "We're deferring portions of {project}; details below.",
    "Postpone": "We'll postpone active execution on {project} until we reset dependencies.",
    "Organize": "Let's orchestrate {project} cleanly with the plan below.",
    "Motivate": "We need a fresh burst of energy around {project}.",
    "Encourage": "Keep the momentum on {project}; a few nudges to help.",
    "Reward": "Recognizing the extra lift on {project}; log the rewards below.",
    "Praise": "Calling out standout work on {project} so it's visible.",
    "Congratulate": "Congratulations are in order for {project} milestones.",
    "Scold": "Performance on {project} is below the line—course correction required.",
    "Explain": "Here's the context you asked for regarding {project}.",
    "Dissuade": "Let me lay out why we should hold back on {project}.",
}

INTENT_DUE_PROBABILITY = {
    "Informative": 0.5,
    "Warning": 0.9,
    "Ordering": 0.95,
    "Follow Up": 0.85,
    "Circle Back": 0.8,
    "Defer": 0.7,
    "Postpone": 0.8,
    "Organize": 0.75,
    "Motivate": 0.6,
    "Encourage": 0.5,
    "Reward": 0.6,
    "Praise": 0.4,
    "Congratulate": 0.5,
    "Scold": 0.9,
    "Explain": 0.65,
    "Dissuade": 0.7,
}

BASE_DATE = datetime(2001, 3, 1)


def synthesize_email_offline(tuple_row: dict) -> SyntheticEmail:
    designation = tuple_row["Designation"]
    tone = tuple_row["Tone"]
    context = tuple_row["EmailContext"]
    intent = tuple_row["Intent"]

    sender = rng.choice(SENDER_DIRECTORY[designation])
    project = rng.choice(PROJECTS)
    location = rng.choice(LOCATIONS)
    department = rng.choice(DEPARTMENTS)
    recipients = rng.sample([p for p in PEOPLE if p != sender], rng.choice([1, 2, 3]))
    salutation = (
        "team" if context.lower() == "broadcast email" else recipients[0].split()[0]
    )
    profile = TONE_PROFILES[tone.lower()]

    def build_commitments():
        templates = INTENT_TASK_LIBRARY[intent]
        count = (
            1
            if intent in {"Reward", "Praise", "Congratulate", "Motivate", "Encourage"}
            else rng.choice([1, 2, 3])
        )
        out = []
        for _ in range(count):
            template = rng.choice(templates)
            owner = rng.choice([p for p in PEOPLE if p not in {sender}])
            due_date = None
            if rng.random() < INTENT_DUE_PROBABILITY[intent]:
                offset = rng.choice([1, 2, 3, 5, 7, 10, 14, 21])
                due_date = (BASE_DATE + timedelta(days=offset)).date().isoformat()
            out.append(
                Commitment(
                    task=template.format(project=project, location=location),
                    owner=owner,
                    due_date=due_date,
                )
            )
        return out

    commitments = build_commitments()

    subject_template = rng.choice(INTENT_SUBJECTS[intent])
    subject = subject_template.format(project=project, location=location)
    if context.lower() in {"single reply", "reply to thread"}:
        subject = "Re: " + subject
    elif context.lower() == "broadcast email":
        subject = "[Broadcast] " + subject

    greeting = profile["greeting"].format(salutation=salutation.title())
    bridge = rng.choice(profile["bridge"])
    context_line = {
        "initiated": "Starting a new thread so it doesn't get buried elsewhere.",
        "single reply": "Thanks for the note earlier—responding inline here.",
        "reply to thread": "Jumping back into this thread with the updates below.",
        "broadcast email": "Sharing broadly so everyone has the same guidance.",
    }[context.lower()]
    main_line = INTENT_MAIN_LINES[intent].format(project=project, location=location)

    commitment_block = "\n".join(
        f"- {c.task} (Owner: {c.owner}{'; Due: ' + c.due_date if c.due_date else ''})"
        for c in commitments
    )

    closing = profile["closing"].format(
        sender=sender,
        signature=f"{DESIGNATION_LABELS[designation]}, {department}",
    )

    body_parts = [
        greeting,
        bridge,
        context_line,
        main_line,
        "Commitments:",
        commitment_block,
    ]

    # "\n\n".join already puts a blank line between parts, so no empty
    # spacer strings are needed before the quoted history or the closing.
    if context.lower() in {"single reply", "reply to thread"}:
        body_parts.append("> " + rng.choice(THREAD_SNIPPETS))
        body_parts.append("> " + rng.choice(PREVIOUS_SNIPPETS))

    body_parts.append(closing)

    body = "\n\n".join(body_parts)

    return SyntheticEmail(
        email_id="SYNTH-XXXX",
        tuple_id=int(tuple_row["tuple_id"]),
        designation=designation,
        tone=tone,
        email_context=context,
        intent=intent,
        subject=subject,
        recipients=recipients,
        body=body,
        commitments=commitments,
    )
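
The commitment bullets are the part of each email that a summarizer most needs to recover, so it helps to see their exact shape. The sketch below isolates the due-date arithmetic and the bullet formatting used in `synthesize_email_offline`; `render_commitment` is a hypothetical standalone helper written for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

BASE_DATE = datetime(2001, 3, 1)  # same anchor date as the notebook


def render_commitment(task: str, owner: str, due_date: Optional[str]) -> str:
    """Format one bullet; the due-date clause is dropped when there is none."""
    due = f"; Due: {due_date}" if due_date else ""
    return f"- {task} (Owner: {owner}{due})"


# A 7-day offset from the base date, as an ISO date string.
due = (BASE_DATE + timedelta(days=7)).date().isoformat()
print(due)  # 2001-03-08

print(render_commitment("Publish the weekly recap deck", "Luis Gomez", due))
# - Publish the weekly recap deck (Owner: Luis Gomez; Due: 2001-03-08)
print(render_commitment("Publish the weekly recap deck", "Luis Gomez", None))
# - Publish the weekly recap deck (Owner: Luis Gomez)
```

Emails without a due date are deliberate: they let an evaluator check that a summary does not invent a deadline.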

### Generate Emails
Set `USE_LLM = True` to call your preferred model; otherwise we run the deterministic fallback so downstream notebooks always have material.


In [None]:
synthetic_batch: list[SyntheticEmail] = []

for _, row in tuple_df.iterrows():
    row_dict = row.to_dict()
    if USE_LLM:
        email = request_email(row_dict)
    else:
        email = synthesize_email_offline(row_dict)
    email.email_id = f"SYNTH-{int(row['tuple_id']):04d}"
    synthetic_batch.append(email)

len(synthetic_batch)
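
The loop above chooses one path for the whole batch, so a single failed API call stops the run. One possible variation, sketched here with hypothetical stand-in functions in place of `request_email` and `synthesize_email_offline`, is to fall back per row:

```python
from typing import Callable


def generate_with_fallback(
    row: dict,
    primary: Callable[[dict], dict],
    fallback: Callable[[dict], dict],
) -> tuple:
    """Try `primary`; on any exception use `fallback`. Returns (result, source)."""
    try:
        return primary(row), "llm"
    except Exception:  # deliberately broad for a sketch; narrow it in real use
        return fallback(row), "offline"


# Stand-ins for illustration only.
def flaky_llm(row: dict) -> dict:
    raise RuntimeError("simulated API failure")


def offline(row: dict) -> dict:
    return {"subject": f"Update: {row['Intent']}"}


result, source = generate_with_fallback({"Intent": "Warning"}, flaky_llm, offline)
print(source, result)  # offline {'subject': 'Update: Warning'}
```

Recording the source alongside each email matters here: mixing LLM-written and templated emails in one batch without a label would make later error analysis harder to interpret.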

### Preview Sample Email


In [None]:
sample = synthetic_batch[rng.randrange(len(synthetic_batch))]
{
    "email_id": sample.email_id,
    "subject": sample.subject,
    "tone": sample.tone,
    "context": sample.email_context,
    "intent": sample.intent,
    "body": sample.body,
    "commitments": [c.dict() for c in sample.commitments],
}

### Serialize for Downstream Analysis
Export both CSV and JSONL just like Homework 2 so Notebook 02 can do open/axial coding and Notebook 03 can replay traces programmatically.


In [None]:
synthetic_df = pd.DataFrame(
    [
        {
            "email_id": email.email_id,
            "tuple_id": email.tuple_id,
            "designation": email.designation,
            "tone": email.tone,
            "email_context": email.email_context,
            "intent": email.intent,
            "subject": email.subject,
            "recipients": "; ".join(email.recipients),
            "body": email.body,
            "commitments_json": json.dumps(
                [c.dict() for c in email.commitments], ensure_ascii=False
            ),
        }
        for email in synthetic_batch
    ]
)

output_dir = Path("../data")
output_dir.mkdir(exist_ok=True)

csv_path = output_dir / "synthetic_emails.csv"
jsonl_path = output_dir / "synthetic_emails.jsonl"

synthetic_df.to_csv(csv_path, index=False)
with jsonl_path.open("w", encoding="utf-8") as fp:
    for email in synthetic_batch:
        # .dict() converts nested Commitment models recursively, so the
        # recipients list and commitments are already plain JSON-ready values.
        record = email.dict()
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")

csv_path, jsonl_path
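
Before handing the files to the next notebook, a quick round-trip check can confirm that the JSONL export reads back unchanged. This is a minimal sketch with hand-written records and a temporary directory, so it does not touch `../data`.

```python
import json
import tempfile
from pathlib import Path

# Hand-written records in the same shape as the export; not real output.
records = [
    {
        "email_id": "SYNTH-0001",
        "recipients": ["Priya Patel"],
        "commitments": [
            {"task": "Publish the weekly recap deck", "owner": "Luis Gomez",
             "due_date": None},
        ],
    },
    {"email_id": "SYNTH-0002", "recipients": ["Avery Chen", "Casey Wu"],
     "commitments": []},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "synthetic_emails.jsonl"
    # Write one JSON object per line, as the export cell does.
    with path.open("w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    # Read it back, skipping any blank lines.
    with path.open(encoding="utf-8") as fp:
        loaded = [json.loads(line) for line in fp if line.strip()]

print(loaded == records)  # True
```

Note that `None` is written as JSON `null` and comes back as `None`, so missing due dates survive the round trip. In the CSV export the commitments are stored as a JSON string in `commitments_json` and need a `json.loads` call per row when read back.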