## 🎨 Small Business Email Conversation Generator

This notebook demonstrates how to build a synthetic data generation pipeline for creating realistic email conversations between small business owners and their customers or professional contacts. The generated conversations can be used for training language models to better understand and handle small business communications.

In [1]:
from gretel_client.navigator import DataDesigner

### ✍️ Setting System Instruction
The system instructions guide our LLMs to generate authentic email conversations that match real-world business communications. These instructions ensure consistency in tone, style, and content while maintaining the natural variation you'd expect in small business emails.

In [2]:
system_instruction="\
    You are an expert at writing realistic emails. When given information about a person's job, \
    personality, and goals, write emails that sound exactly like they would write them. Match \
    their natural communication style - whether formal or casual, direct or diplomatic. \
    Include their common phrases and speech patterns. Consider their role, industry,\
    and relationship with the recipient. Make the tone and length appropriate for their objective. Avoid generic business language unless it matches how they actually write. Your emails should be indistinguishable from ones the person would write themselves."
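
Because the string above uses backslash continuations inside the quotes, the indentation of each continued line becomes part of the prompt text. This is usually harmless, but if you want a compact prompt, the following is a minimal sketch of one way to collapse the extra whitespace. `normalize_prompt` is a hypothetical helper, not part of the Gretel SDK, and the sample string is shortened for illustration.

```python
def normalize_prompt(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


# A backslash continuation inside a string literal keeps the next line's indentation.
raw = "\
    You are an expert at writing realistic emails. \
    Match their natural communication style."

cleaned = normalize_prompt(raw)
print(repr(raw[:12]))  # shows the leading spaces kept by the continuation
print(cleaned)
```

You could apply the same helper to `system_instruction` before passing it to `DataDesigner`.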

### ⚙️ Data Designer Configuration
Instead of using a single configuration file, we build our pipeline step by step. This gives us fine-grained control over how we guide LLMs to produce authentic small business email conversations. The modular approach allows quick iteration and refinement of our data generation process.

In [3]:
# Create our Data Designer instance
data_designer = DataDesigner(
    model_suite='apache-2.0',
    special_system_instructions=system_instruction,
    endpoint="https://api.gretel.cloud",
    cache="yes"
)

[19:58:02] [INFO] 🦜 Using apache-2.0 model suite
Logged in as kirit.thadaka@gretel.ai ✅


### Use Structured Outputs to Define the Email Structure

We use Pydantic to ensure our generated emails follow a consistent structure while maintaining flexibility in content. This helps maintain quality and reliability in our synthetic data.

In [4]:
from typing import Literal
from pydantic import BaseModel, Field

class Email(BaseModel):
    """A single email in the email chain."""
    role: Literal["business_owner", "customer"] = Field(..., description="Which role is writing the message.")
    subject: str = Field(..., description="Email subject.")
    content: str = Field(..., description="Email contents.")


class EmailConversation(BaseModel):
    """An email conversation between two people."""
    conversation: list[Email] = Field(..., description="List of all messages in the email chain.")
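
For reference, a record that satisfies `EmailConversation` should serialize to JSON shaped like the sample below. The sketch uses only the standard library so it runs without Pydantic. `check_conversation` is a hypothetical helper that mirrors the model's constraints by hand, and the sample emails are invented.

```python
import json

ALLOWED_ROLES = {"business_owner", "customer"}
REQUIRED_FIELDS = ("role", "subject", "content")


def check_conversation(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload fits the schema."""
    emails = payload.get("conversation")
    if not isinstance(emails, list):
        return ["'conversation' must be a list of emails"]
    problems = []
    for i, email in enumerate(emails):
        if not isinstance(email, dict):
            problems.append(f"email {i}: must be an object")
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(email.get(field), str):
                problems.append(f"email {i}: '{field}' must be a string")
        if email.get("role") not in ALLOWED_ROLES:
            problems.append(f"email {i}: unexpected role {email.get('role')!r}")
    return problems


# Invented sample in the shape the structured output is expected to take.
sample = json.loads("""
{
  "conversation": [
    {"role": "customer", "subject": "Brake noise in Camry - need appointment",
     "content": "Hi, my brakes started squealing this week. Any openings Friday?"},
    {"role": "business_owner", "subject": "Re: Brake noise in Camry - need appointment",
     "content": "Friday at 9 works. Can you drop the car off the night before?"}
  ]
}
""")

print(check_conversation(sample))
```

In the notebook itself the Pydantic models perform this validation; the hand-written check only illustrates what they enforce.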

### 🌱 Categorical Seed Columns
Our seed columns define the context and characteristics of each email conversation. They include:

#### Business Context

* Business type and specific business category
* Business name (generated later from the sampled business type, not sampled as a seed)
* Owner's communication style

#### Customer Context

* Customer personality and communication style
* Relationship with the business
* Previous interaction history

#### Conversation Parameters

* Primary goal of the email exchange
* Urgency level
* Overall tone
* Number of exchanges

Each seed category is carefully designed to create authentic variations in email conversations while maintaining realistic business scenarios.

In [5]:
data_designer.add_categorical_seed_column(
    name="business_type",
    description="The type of small business the owner operates",
    values=[
        "Food & Beverage",
        "Local Retail",
        "Skilled Trade",
        "Creative Services",
        "Health & Wellness"
    ],
    subcategories=[
        {
            "name": "specific_business",
            "values": {
                "Food & Beverage": [
                    "Family-run Bakery",
                    "Local Coffee Shop",
                    "Food Truck",
                    "Small Catering Business",
                    "Specialty Food Store"
                ],
                "Local Retail": [
                    "Independent Bookstore",
                    "Local Plant Nursery",
                    "Vintage Clothing Shop",
                    "Craft Supply Store",
                    "Local Pet Supply Store"
                ],
                "Skilled Trade": [
                    "Local Plumber",
                    "Independent Electrician",
                    "Small Carpentry Shop",
                    "Auto Repair Shop",
                    "Local Landscaping Service"
                ],
                "Creative Services": [
                    "Freelance Photographer",
                    "Independent Graphic Designer",
                    "Local Wedding Planner",
                    "Custom Jewelry Maker",
                    "Independent Interior Designer"
                ],
                "Health & Wellness": [
                    "Independent Massage Therapist",
                    "Local Yoga Studio",
                    "Family Chiropractor",
                    "Personal Training Business",
                    "Small Wellness Center"
                ]
            },
            "num_new_values_to_generate": 2
        }
    ]
)
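
Conceptually, a subcategory is sampled conditionally on its parent value, so a "Skilled Trade" row can never be paired with a bakery. The sketch below illustrates that idea with the standard library. It is not Data Designer's implementation, the uniform sampling is an assumption, and the mapping is abbreviated.

```python
import random

# Abbreviated copy of the parent -> subcategory mapping from the cell above.
SUBCATEGORIES = {
    "Food & Beverage": ["Family-run Bakery", "Local Coffee Shop"],
    "Skilled Trade": ["Local Plumber", "Auto Repair Shop"],
}


def sample_seed(rng: random.Random) -> dict:
    """Pick a parent category first, then a subcategory that belongs to it."""
    business_type = rng.choice(sorted(SUBCATEGORIES))
    specific_business = rng.choice(SUBCATEGORIES[business_type])
    return {"business_type": business_type, "specific_business": specific_business}


rng = random.Random(0)  # fixed seed so repeated runs draw the same rows
for _ in range(3):
    print(sample_seed(rng))
```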

In [6]:
data_designer.add_categorical_seed_column(
    name="owner_personality",
    description="The communication style and personality of the business owner",
    values=[
        "Professional and Formal",
        "Warm and Personal",
        "Direct and Efficient",
        "Creative and Casual",
        "Patient and Educational"
    ]
)

In [7]:
data_designer.add_categorical_seed_column(
    name="customer_personality",
    description="The personality type and communication style of the customer",
    values=[
        "Polite and Clear",
        "Frustrated but Respectful",
        "Demanding and Direct",
        "Detail-oriented",
        "Casual and Friendly",
        "Confused and Seeking Help"
    ]
)

In [8]:
data_designer.add_categorical_seed_column(
    name="conversation_goal",
    description="The primary objective of the email exchange in a small business context",
    values=[
        # Scheduling & Availability
        "Appointment Scheduling",     # For services like massage, training, repairs
        "Custom Order Discussion",    # For bakeries, jewelers, craftspeople
        "Consultation Request",       # For designers, wellness providers, trades
        
        # Service-Related
        "Quote Request",             # For trades, creative services, catering
        "Service Modification",      # Change to existing booking/order
        "Availability Check",        # Stock check, service times, special items
        
        # Customer Support
        "Progress Update Request",   # For ongoing projects, custom orders
        "Last-Minute Changes",       # Emergency scheduling, order modifications
        "Service Issue Resolution",  # Quality concerns, timing issues
        
        # Business Development
        "Collaboration Inquiry",     # Local business partnerships, events
        "Special Event Planning",    # Private bookings, workshops, classes
        "Custom Project Proposal",   # For creative services, skilled trades
        
        # Customer Experience
        "Follow-up Care",           # Post-service check-ins, maintenance tips
        "Detailed Instructions",     # Care instructions, product use guidance
        "Personal Recommendations",  # Product/service suggestions based on history
        
        # Administrative
        "Payment Arrangement",       # Deposits, payment plans, invoicing
        "Business Policy Question",  # Hours, booking policies, COVID protocols
        "Local Delivery Options"     # Delivery zones, timing, special handling
    ]
)

In [9]:
data_designer.add_categorical_seed_column(
    name="conversation_tone",
    description="The overall tone of the email exchange",
    values=[
        "Professional",
        "Friendly",
        "Formal",
        "Apologetic",
        "Appreciative",
        "Resolution-focused"
    ]
)

In [10]:
data_designer.add_categorical_seed_column(
    name="urgency_level",
    description="The urgency level and timing context of the small business email exchange",
    values=[
        # Same Day/Immediate
        "Emergency Service Needed",      # Urgent repairs, last-minute catering cancelation
        "Same-Day Modification",         # Changes to today's appointments/orders
        "Time-Critical Question",        # Questions about today's service/product
        
        # Near-Term
        "This Week Required",           # Scheduling for this week, stock inquiries
        "48-Hour Response Needed",      # Quote needed within 2 days, upcoming appointment
        "Weekend Preparation",          # Planning for weekend events/services
        
        # Standard Timing
        "Regular Booking",              # Normal appointment scheduling
        "Standard Order Process",       # Typical product orders/inquiries
        "Routine Inquiry",             # General questions about services/products
        
        # Future Planning
        "Advance Booking",             # Events, large projects, seasonal orders
        "Project Planning",            # Custom work, renovations, design projects
        "Seasonal Planning",           # Holiday orders, seasonal services
        
        # Follow-up Based
        "Post-Service Check",          # After service completion
        "Maintenance Schedule",        # Regular service planning
        "Project Update"               # Progress updates on ongoing work
    ]
)

In [11]:
data_designer.add_categorical_seed_column(
    name="business_relationship",
    description="The nature of the customer-business relationship in a small business context",
    values=[
        # New Relationships
        "First-Time Local Customer",             # Local resident trying service for first time
        "Word-of-Mouth Referral",               # Referred by existing customer
        "Social Media Discovery",               # Found business through local social media
        "Local Event Introduction",             # Met at farmer's market/community event
        
        # Established Relationships
        "Weekly Regular",                       # E.g., Standing yoga class, weekly bread order
        "Monthly Service Client",               # Regular maintenance, monthly appointments
        "Seasonal Customer",                    # Holiday orders, seasonal services
        "Project-Based Client",                 # Ongoing renovation, wedding planning
        
        # Community Connections
        "Fellow Local Business Owner",          # Other business owner in community
        "Local Family Customer",                # Family with multiple service needs
        "Neighborhood Regular",                 # Lives/works nearby, frequent casual visits
        "Community Event Partner",              # Collaborated on local events
        
        # Special Circumstances
        "Multi-Generation Customer",            # Family has used business for years
        "Former Regular Returning",             # Coming back after moving/break
        "Special Needs Client",                 # Requires specific accommodations
        "Custom Order Regular"                  # Regular custom/specialized orders
    ]
)

In [12]:

data_designer.add_categorical_seed_column(
    name="email_length",
    description="Number of email exchanges in the conversation",
    values=[2, 3, 4, 5, 6]
)
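
To get a feel for how much variety these seed columns provide, you can multiply the number of values in each column. The counts below are taken from the cells above. They exclude the two extra `specific_business` values that Data Designer is asked to generate per business type, and `business_type` is not counted separately because each `specific_business` implies it.

```python
import math

# Number of hand-written values per seed column, counted from the cells above.
SEED_SIZES = {
    "specific_business": 25,  # 5 business types x 5 listed businesses each
    "owner_personality": 5,
    "customer_personality": 6,
    "conversation_goal": 18,
    "conversation_tone": 6,
    "urgency_level": 15,
    "business_relationship": 16,
    "email_length": 5,
}

total = math.prod(SEED_SIZES.values())
print(f"{total:,} distinct seed combinations")  # 97,200,000
```

A batch of a few hundred records therefore samples only a tiny fraction of the possible contexts, which is one reason near-duplicate conversations are unlikely.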

### ✨ Generated Data Columns
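We use four generated columns to create our email conversations. The first, `business_name`, produces a short name for the sampled business; the other three are described below: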

#### Email Objective

* Creates specific, detailed scenarios for email communication
* Ensures the objective matches the business type and conversation goal
* Maintains realism and relevance

#### Email Subject

* Generates appropriate subject lines based on context
* Reflects urgency and tone
* Matches natural email writing patterns

#### Email Contents

* Generates the full email exchange
* Maintains consistent personality traits
* Includes appropriate business details
* Reaches natural conclusions

Each generated column builds upon the previous ones to create coherent, realistic email conversations that feel authentic to small business communications.

In [13]:
data_designer.add_generated_data_column(
    name="business_name",
    generation_prompt=(
        "Generate a single, creative business name for a {specific_business}. The name should:\n"
        "- Be 1-3 words maximum\n"
        "- Not include quotes, explanations, or alternatives\n"
        "- Not include 'LLC', 'Inc.', or other business designations\n"
        "- Be appropriate for the business type: {business_type}\n"
        "\n"
        "Examples:\n"
        "For a bakery: Sweet Crumb\n"
        "For a yoga studio: Zenspace Yoga\n"
        "For a bookstore: Page & Porter\n"
        "\n"
        "Provide ONLY the business name, with no additional text, quotes, or explanation."
    )
)
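
The `{specific_business}` and `{business_type}` placeholders are filled in per record from the seed columns. As a rough mental model, and assuming the placeholders behave like Python format fields (the actual template engine inside Data Designer may differ), the substitution looks like this. The template is shortened and the row is invented.

```python
# Shortened version of the business-name prompt above.
template = (
    "Generate a single, creative business name for a {specific_business}. "
    "Be appropriate for the business type: {business_type}"
)

# One invented row of sampled seed values; unused keys are simply ignored.
row = {
    "business_type": "Food & Beverage",
    "specific_business": "Food Truck",
    "urgency_level": "Routine Inquiry",
}

prompt = template.format(**row)
print(prompt)
```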

In [14]:
data_designer.add_generated_data_column(
    name="email_objective",
    generation_prompt=(
        "Based on this small business context:\n"
        "- Business Type: {business_type} (Specifically: {specific_business})\n"
        "- Conversation Goal: {conversation_goal}\n"
        "- Customer Type: {business_relationship}\n"
        "\n"
        "Generate ONE specific reason for customer contact that:\n"
        "- Includes real-world constraints or complications\n"
        "- References specific services/products (not generic ones)\n"
        "- Mentions timing or scheduling elements\n"
        "- Includes customer-specific context\n"
        "\n"
        "Examples:\n"
        "BAD: 'Need to order a custom cake for next week'\n"
        "GOOD: 'Need gluten-free birthday cake for 15 people by next Saturday - daughter has celiac disease'\n"
        "\n"
        "BAD: 'Kitchen sink needs repair'\n"
        "GOOD: 'Kitchen sink backing up after dishwasher runs - getting worse over past week'\n"
        "\n"
        "Only provide the objective, no additional context or explanation."
    )
)

In [15]:
data_designer.add_generated_data_column(
    name="email_subject",
    generation_prompt=(
        "Create an email subject line based on:\n"
        "- Email Objective: {email_objective}\n"
        "- Urgency Level: {urgency_level}\n"
        "- Conversation Tone: {conversation_tone}\n"
        "\n"
        "The subject line should:\n"
        "- Be concise (4-8 words)\n"
        "- Only include essential details\n"
        "- Avoid redundant information\n"
        "- Match how real people write subject lines (often informal or incomplete sentences)\n"
        "- Include urgency indicators only for genuinely urgent matters\n"
        "- Sound natural, not corporate\n"
        "- Reflect how the specific customer personality would write it\n"
        "\n"
        "Examples of good vs bad subject lines:\n"
        "BAD: 'Update Needed: Custom Wedding Cake for June 10th - Design Changes & Delivery Date'\n"
        "GOOD: 'Wedding Cake Changes - June 10th Delivery'\n"
        "\n"
        "BAD: 'Urgent: Custom Diagnostic Check & Brake Noise Issue - 2015 Toyota Camry'\n"
        "GOOD: 'Brake noise in Camry - need appointment'\n"
        "\n"
        "Only provide the subject line, no quotes or additional context."
    )
)

In [16]:
data_designer.add_generated_data_column(
    name="email_contents",
    generation_prompt=(
        "Generate an email conversation between a small business owner and customer with these parameters:\n"
        "\n"
        "CONTEXT:\n"
        "- Business Type: {business_type} (Specifically: {specific_business})\n"
        "- Owner's Style: {owner_personality}\n"
        "- Customer's Style: {customer_personality}\n"
        "- Relationship: {business_relationship}\n"
        "- Overall Tone: {conversation_tone}\n"
        "- Business Name: {business_name}\n"
        "\n"
        "OBJECTIVE:\n"
        "{email_objective}\n"
        "\n"
        "REQUIREMENTS:\n"
        "1. AUTHENTICITY:\n"
        "- Include realistic friction points (scheduling conflicts, clarifying questions)\n"
        "- Add natural pauses between replies (don't resolve everything immediately)\n"
        "- Include occasional typos or informal language when it fits the personality\n"
        "- Reference practical details like forms, paperwork, or requirements\n"
        "\n"
        "2. BUSINESS CONTEXT:\n"
        "- Include industry-specific details (insurance for health services, deposits for custom work)\n"
        "- Reference real-world constraints (availability, supply issues, weather impacts)\n"
        "- Mention practical next steps (parking info, what to bring, preparation needed)\n"
        "\n"
        "3. CONVERSATION FLOW:\n"
        "- Don't resolve everything in the first exchange\n"
        "- Include natural back-and-forth about details\n"
        "- Let some questions lead to follow-up emails\n"
        "- Sometimes leave minor points unaddressed (like in real emails)\n"
        "\n"
        "4. FORMATTING:\n"
        "- Vary signature styles (full signature first, then casual sign-off)\n"
        "- Include realistic spacing/formatting issues\n"
        "- Sometimes reference missing attachments or follow-up documents\n"
        "\n"
        "BAD EXAMPLES TO AVOID:\n"
        "- Perfect, formal language in every email\n"
        "- Resolving everything in one exchange\n"
        "- Including every possible detail upfront\n"
        "- Overly corporate or template-like language\n"
        "\n"
        "\n"
        "REALISTIC ELEMENTS TO INCLUDE:\n"
        "1. Common Email Scenarios:\n"
        "- Missing information that requires follow-up\n"
        "- Schedule conflicts that need resolution\n"
        "- Clarification of details (sizes, colors, timing)\n"
        "- Alternative suggestions when first choice isn't available\n"
        "\n"
        "2. Natural Business Constraints:\n"
        "- Limited availability for popular times\n"
        "- Minimum notice periods for services\n"
        "- Stock limitations or seasonal availability\n"
        "- Standard policies (deposits, cancellations)\n"
        "\n"
        "3. Communication Patterns:\n"
        "- Shorter initial emails, longer responses\n"
        "- Questions that lead to follow-up emails\n"
        "- Gradual gathering of requirements\n"
        "- References to phone calls or in-person visits when needed\n"
        "\n"
        "4. Real-world Details:\n"
        "- Forms that need to be filled out\n"
        "- Payment timing and methods\n"
        "- Preparation instructions\n"
        "- Parking or location details\n"
        "\n"
        "EXAMPLES OF NATURAL PROGRESSION:\n"
        "BAD:\n"
        "Customer: [Gives every detail perfectly]\n"
        "Owner: [Accepts everything, perfect solution]\n"
        "\n"
        "GOOD:\n"
        "Customer: [Asks about service]\n"
        "Owner: [Asks clarifying questions, mentions constraints]\n"
        "Customer: [Provides more details, adjusts request]\n"
        "Owner: [Offers solution with specific next steps]\n"
        "\n"
        "The conversation must contain exactly {email_length} emails and reach a natural conclusion or next step."
    ),
    data_config={"type": "structured", "params": {"model": EmailConversation}},
    llm_type="judge"
)
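
Because `email_subject` references `{email_objective}` and `email_contents` references both `{business_name}` and `{email_objective}`, the generated columns have to be produced in dependency order. The sketch below shows one way to derive such an order from the placeholders alone. It is an illustration under the assumption that placeholders are format-style fields, not a description of how Data Designer schedules its steps, and the prompts are reduced to their placeholders.

```python
from graphlib import TopologicalSorter
from string import Formatter

# Abbreviated prompts: only the placeholders matter for this sketch.
PROMPTS = {
    "business_name": "{specific_business} {business_type}",
    "email_objective": (
        "{business_type} {specific_business} "
        "{conversation_goal} {business_relationship}"
    ),
    "email_subject": "{email_objective} {urgency_level} {conversation_tone}",
    "email_contents": (
        "{business_type} {specific_business} {owner_personality} "
        "{customer_personality} {business_relationship} {conversation_tone} "
        "{business_name} {email_objective} {email_length}"
    ),
}


def placeholders(template: str) -> set[str]:
    """Return the field names referenced as {name} in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Map each generated column to the other generated columns it depends on.
generated = set(PROMPTS)
graph = {column: placeholders(prompt) & generated for column, prompt in PROMPTS.items()}

# static_order() yields every column after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Several valid orders exist; the only guarantee is that each column appears after the columns its prompt references.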

### 👀 Generating and Reviewing Email Previews
Preview mode lets you quickly validate and refine your email generation setup. Each preview run creates 10 sample email conversations.

In [17]:
preview = data_designer.generate_dataset_preview(verbose_logging=True)

[19:58:03] [INFO] 🚀 Generating dataset preview
[19:58:03] [INFO] 🦜 Step 1: Generate seed category values
[19:58:04] [INFO]   |   |-- ✨ Generating values for seed subcategory `specific_business` when `business_type` is Food & Beverage
[19:58:04] [INFO]   |   |-- ✨ Generating values for seed subcategory `specific_business` when `business_type` is Local Retail
[19:58:04] [INFO]   |   |-- ✨ Generating values for seed subcategory `specific_business` when `business_type` is Skilled Trade
[19:58:05] [INFO]   |   |-- ✨ Generating values for seed subcategory `specific_business` when `business_type` is Creative Services
[19:58:05] [INFO]   |   |-- ✨ Generating values for seed subcategory `specific_business` when `business_type` is Health & Wellness
[19:58:06] [INFO]   |-- Model usage: [{"model": "Qwen/Qwen2.5-7B-Instruct", "prompt_tokens": 2261, "completion_tokens": 95, "request_count": 5, "total_tokens": 2356}]
[19:58:06] [INFO] 🎲 Step 2: Sample data seeds
[19:58:06] [INFO]   |-- 🎲 Randomly sam

### 🔎 Inspecting Individual Conversations

Run the display cell below to examine individual email conversations. Each run shows you a different preview record.

In [22]:
preview.display_sample_record()
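
When reviewing previews, one thing worth checking is whether the model honored the requested `email_length`, since that constraint lives only in the prompt. The sketch below flags mismatches. `conversation_issues` is a hypothetical helper, and the record layout (a parsed `email_contents` dict next to an `email_length` integer) is an assumption about how you have loaded the data.

```python
def conversation_issues(record: dict) -> list[str]:
    """Flag conversations whose length or participants do not match the request."""
    emails = record["email_contents"]["conversation"]
    issues = []
    if len(emails) != record["email_length"]:
        issues.append(f"expected {record['email_length']} emails, found {len(emails)}")
    # Both parties should appear at least once in a real exchange.
    missing = {"business_owner", "customer"} - {email["role"] for email in emails}
    if missing:
        issues.append(f"no email from: {', '.join(sorted(missing))}")
    return issues


# Invented record: three emails were requested, but only the customer wrote two.
record = {
    "email_length": 3,
    "email_contents": {
        "conversation": [
            {"role": "customer", "subject": "Quote?", "content": "..."},
            {"role": "customer", "subject": "Re: Quote?", "content": "..."},
        ]
    },
}
print(conversation_issues(record))
```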

### 🤔 Like what you see?

Submit a batch workflow!

In [20]:
# Submit batch job
batch_job = data_designer.submit_batch_workflow(num_records=100)
df = batch_job.fetch_dataset(wait_for_completion=True)
print("\nGenerated dataset shape:", df.shape)

By following these steps and using the SDK's interactive preview loop, you can refine prompts and seed values until the generated conversations are realistic and match your domain-specific requirements.

In [21]:
# Inspect first 10 records of the generated dataset
df.head(10)
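
Beyond eyeballing the first rows, it can help to check how evenly the seed values are represented in the batch. The sketch below counts values over a list of dicts using only the standard library; the records are invented. If `df` is a pandas DataFrame, as the `.shape` and `.head` calls suggest, `df.to_dict("records")` yields such a list, and `df["conversation_tone"].value_counts()` gives the same counts directly.

```python
from collections import Counter


def value_counts(records: list[dict], column: str) -> Counter:
    """Count how often each value of `column` appears across the records."""
    return Counter(record[column] for record in records)


# Invented records standing in for rows of the generated dataset.
records = [
    {"conversation_tone": "Friendly", "email_length": 2},
    {"conversation_tone": "Formal", "email_length": 4},
    {"conversation_tone": "Friendly", "email_length": 3},
]
for value, count in value_counts(records, "conversation_tone").most_common():
    print(f"{value}: {count}")
```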