# 📊 Structured Outputs: Turning Text into Data

**Scenario:** Imagine you're building a CRM (Customer Relationship Management) system. Every day, you receive hundreds of emails from customers with their contact information, requests, and feedback. Manually copying this data into your database would take hours. What if AI could automatically extract names, emails, phone numbers, and requests into a structured format you can directly save to your database?

This is the power of **structured outputs**: getting AI to return data in a predictable, machine-readable format like JSON.

In this notebook, you'll learn how to:
- Extract structured data from unstructured text
- Use JSON mode for reliable data extraction
- Work with Pydantic for type-safe outputs
- Build practical data extraction applications

---

## 🎯 What You'll Build

By the end of this notebook, you'll have worked through:
1. **JSON Mode** - Ask the AI to return valid JSON
2. **Data Extraction** - Pull structured info from text
3. **Pydantic Models** - Add type safety and validation
4. **Multiple Entities** - Extract lists of items
5. **Real-world Applications** - Invoice processor, email analyzer, review analyzer

---

## 📦 Setup: Install Required Packages

We'll use:
- **`litellm`** - AI model interface
- **`python-dotenv`** - Load API keys
- **`pydantic`** - Type-safe data validation

In [1]:
# Install required packages
!pip install -q litellm python-dotenv pydantic

In [2]:
import os
from dotenv import load_dotenv

# Load API key from .env file (if it exists)
load_dotenv()

# Configuration
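# Fall back to the model used in the recorded run if DEFAULT_MODEL is not set
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "openrouter/google/gemini-2.0-flash-001")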
DEFAULT_TEMPERATURE = 0.7
DEFAULT_MAX_TOKENS = 500

print(f"✅ Using model: {DEFAULT_MODEL}")

✅ Using model: openrouter/google/gemini-2.0-flash-001


## 🛠️ Helper Functions

Let's create functions to work with structured outputs.

In [3]:
from litellm import completion
import json
from pydantic import BaseModel
from typing import Optional, List

import litellm
import logging

litellm.suppress_debug_info = True
logging.getLogger("litellm").setLevel(logging.CRITICAL)

def extract_json(
    prompt: str,
    system_message: Optional[str] = None,
    temperature: float = DEFAULT_TEMPERATURE
) -> dict:
    """
    Extract structured data as JSON.
    
    Args:
        prompt: Extraction instructions
        system_message: Optional system instruction
        temperature: Lower = more consistent
    
    Returns:
        Parsed JSON as a dictionary
    """
    messages = []
    
    if system_message:
        messages.append({"role": "system", "content": system_message})
    
    messages.append({"role": "user", "content": prompt})
    
    response = completion(
        model=DEFAULT_MODEL,
        messages=messages,
        temperature=temperature,
        response_format={"type": "json_object"}  # Force JSON output
    )
    
    return json.loads(response.choices[0].message.content)

print("✅ Helper functions loaded!")

✅ Helper functions loaded!


---

## 📧 Part 1: Basic JSON Extraction (Contact Information)

**Scenario:** You receive an email from a potential customer. You need to extract their name, email, phone, and company into your CRM database.

**Without AI:** You'd manually copy-paste each field.

**With AI:** Extract everything automatically in one step.

In [4]:
email_text = """
Hi there,

I'm Sarah Johnson from TechCorp Inc. I'm interested in your services. 
You can reach me at sarah.j@techcorp.com or call me at (555) 123-4567.
We're based in San Francisco and looking to start a project next month.

Best regards,
Sarah
"""

prompt = f"""Extract contact information from this email and return as JSON:

{email_text}

Return a single JSON object (not an array) with these fields: name, email, phone, company, city"""

print(f"[Email:]")
print(email_text)

contact_data = extract_json(prompt, temperature=0.3)

print(f"✓ Extracted Data:")
print(json.dumps(contact_data, indent=2))

[Email:]

Hi there,

I'm Sarah Johnson from TechCorp Inc. I'm interested in your services. 
You can reach me at sarah.j@techcorp.com or call me at (555) 123-4567.
We're based in San Francisco and looking to start a project next month.

Best regards,
Sarah

✓ Extracted Data:
{
  "name": "Sarah Johnson",
  "email": "sarah.j@techcorp.com",
  "phone": "(555) 123-4567",
  "company": "TechCorp Inc.",
  "city": "San Francisco"
}


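### 💡 Key Insight

**JSON mode** (`response_format={"type": "json_object"}`) asks the model to return syntactically valid JSON, which is much more reliable than asking it to "format as JSON" in the prompt alone. It does not enforce your field names or types, and support varies by model and provider, so keep naming the fields in the prompt.

**Why this matters:** You can parse the response with `json.loads` and, once it is validated, save it to a database without hand-written text parsing.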
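
Even with JSON mode on, a response can still be cut off by a token limit, and some models wrap their JSON in a Markdown code fence. The sketch below shows one way to parse defensively. `parse_json_object` is a hypothetical helper name introduced here for illustration; it is not part of `litellm`.

```python
import json
from typing import Optional

FENCE = "`" * 3  # Markdown code-fence marker


def parse_json_object(raw: str) -> Optional[dict]:
    """Return the parsed JSON object, or None if `raw` is not a JSON object.

    Hypothetical helper for illustration; not part of litellm.
    """
    text = raw.strip()
    # Some models wrap JSON in a Markdown code fence; drop the fence lines.
    if text.startswith(FENCE):
        lines = text.splitlines()
        # Remove the opening fence line and, if present, the closing one.
        lines = lines[1:-1] if lines[-1].strip() == FENCE else lines[1:]
        text = "\n".join(lines)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # truncated or otherwise malformed output
    # JSON mode asks for an object; reject arrays, strings, and numbers.
    return data if isinstance(data, dict) else None


print(parse_json_object('{"name": "Sarah Johnson"}'))  # parsed dict
print(parse_json_object('{"name": "Sarah Joh'))        # None (truncated)
```

Returning `None` instead of raising keeps the calling code simple; a production version might log the raw text so failed extractions can be inspected later.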

---

## 🔒 Part 2: Type-Safe Extraction with Pydantic

**The Problem:** JSON is flexible, but that means fields might be missing or have wrong types.

**The Solution:** **Pydantic**, a Python library that validates data types and provides better error messages.

**What this means:** Instead of just getting a dictionary, you get a Python object with guaranteed fields and types. If the AI returns invalid data, Pydantic will catch it immediately.

In [5]:
# Define a Pydantic model for contact information

class Contact(BaseModel):
    name: str
    email: str
    phone: Optional[str] = None  # Optional field
    company: Optional[str] = None
    city: Optional[str] = None

# Extract and validate
contact_data = extract_json(prompt, temperature=0.3)
contact = Contact(**contact_data)  # Validate with Pydantic

print(f"✓ Validated Contact:")
print(f"Name: {contact.name}")
print(f"Email: {contact.email}")
print(f"Phone: {contact.phone}")
print(f"Company: {contact.company}")
print(f"City: {contact.city}")

✓ Validated Contact:
Name: Sarah Johnson
Email: sarah.j@techcorp.com
Phone: (555) 123-4567
Company: TechCorp Inc.
City: San Francisco


### ❓ Discussion Question #1

What happens if the AI fails to extract a required field (like `email`)? Try removing the email from the text and see what error Pydantic gives you. How would you handle this in a production application?
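
As a starting point for that question, here is a minimal sketch of catching the validation failure instead of letting it crash the program. `validate_contact` is a hypothetical helper introduced here, and the input dict simulates a model response with the `email` field missing; the exact error wording depends on the installed Pydantic version.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Contact(BaseModel):
    name: str
    email: str
    phone: Optional[str] = None
    company: Optional[str] = None
    city: Optional[str] = None


def validate_contact(data: dict) -> Optional[Contact]:
    """Return a Contact, or None (after reporting) if validation fails."""
    try:
        return Contact(**data)
    except ValidationError as exc:
        # Each error records the field path ("loc") and a message ("msg").
        for err in exc.errors():
            field = ".".join(str(part) for part in err["loc"])
            print(f"Invalid field '{field}': {err['msg']}")
        return None


# Simulated model output with the required `email` field missing.
result = validate_contact({"name": "Sarah Johnson", "phone": "(555) 123-4567"})
print(result)
```

In a production pipeline the `None` branch is where you might retry the extraction, queue the email for manual review, or store the partial record with a flag.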

---

## 🛒 Part 3: Extracting Multiple Entities (Shopping List)

**Scenario:** A customer sends you a message listing multiple products they want to buy. You need to extract all products with their quantities and prices.

**What this means:** Instead of extracting one item, you're extracting a **list** of items.

In [13]:
# Define Pydantic models for products
class Product(BaseModel):
    product_name: str
    quantity: int
    price: float

class Order(BaseModel):
    products: List[Product]
    total: float

order_text = """
I'd like to order:
- 2 Buddha Bowls at $12.99 each
- 1 Green Smoothie for $6.50
- 3 Avocado Toasts at $9.99 each

Please confirm the total.
"""

prompt = f"""Extract all products from this order and calculate the total.

{order_text}

Return a single JSON object (not an array) with:
- products: array of {{product_name, quantity, price}}
- total: sum of all items"""

print(f"[Order Text:]")
print(order_text)

order_data = extract_json(prompt)
order = Order(**order_data)

print(f"✓ Extracted Order:")
for product in order.products:
    print(f"  • {product.quantity}x {product.product_name} @ ${product.price} = ${product.quantity * product.price:.2f}")
print(f"\nTotal: ${order.total:.2f}")

[Order Text:]

I'd like to order:
- 2 Buddha Bowls at $12.99 each
- 1 Green Smoothie for $6.50
- 3 Avocado Toasts at $9.99 each

Please confirm the total.

✓ Extracted Order:
  • 2x Buddha Bowl @ $12.99 = $25.98
  • 1x Green Smoothie @ $6.5 = $6.50
  • 3x Avocado Toast @ $9.99 = $29.97

Total: $62.45


### 🎯 Challenge Task #1: Validate the Math

The AI calculated a total. But is it correct? Write code to:
1. Calculate the actual total from the products list
2. Compare it to the AI's total
3. Print a warning if they don't match

This is important for production systems: always validate AI outputs!

In [7]:
# Your experimentation space for Challenge #1

# calculated_total = sum(p.quantity * p.price for p in order.products)
# if abs(calculated_total - order.total) > 0.01:  # Allow small floating point differences
#     print(f"⚠️  Warning: Total mismatch! AI said ${order.total}, actual is ${calculated_total}")
# else:
#     print(f"✅ Total verified: ${calculated_total}")
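
The tolerance in the commented hint exists because binary floats cannot represent most decimal prices exactly. An alternative sketch, using the standard-library `decimal` module so the comparison can be exact, is shown here with the line items and total copied from the sample run above (the variable names are illustrative):

```python
from decimal import Decimal

# (quantity, unit price) pairs from the sample order; prices are strings so
# Decimal sees exact values instead of binary floating-point approximations.
line_items = [(2, "12.99"), (1, "6.50"), (3, "9.99")]
reported_total = Decimal("62.45")  # total returned by the model in the sample run

computed_total = sum(
    (qty * Decimal(price) for qty, price in line_items),
    Decimal("0"),
)

if computed_total == reported_total:
    print(f"Total verified: ${computed_total}")
else:
    print(f"Total mismatch: model said ${reported_total}, computed ${computed_total}")
```

When the prices come from a Pydantic model with `float` fields, converting through a string, as in `Decimal(str(p.price))`, is one way to get the short decimal form rather than the full binary expansion.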

---

## 📄 Part 4: Complex Nested Structures (Resume Parsing)

**Scenario:** You're building a job application system. Candidates submit resumes as text. You need to extract:
- Personal info (name, email, phone)
- Work experience (list of jobs with company, title, dates)
- Education (list of degrees)
- Skills (list of skills)

This is a **nested structure**: objects containing lists of other objects.

In [15]:
# Define nested Pydantic models
class WorkExperience(BaseModel):
    company: str
    title: str
    start_date: str
    end_date: str
    description: Optional[str] = None

class Education(BaseModel):
    school: str
    degree: str
    year: str

class Resume(BaseModel):
    name: str
    email: str
    phone: Optional[str] = None
    work_experience: List[WorkExperience]
    education: List[Education]
    skills: List[str]

resume_text = """
JOHN DOE
john.doe@email.com | (555) 987-6543

Senior Software Engineer at TechCorp (2020-2024)
- Led development of cloud infrastructure

Software Engineer at StartupXYZ (2018-2020)
- Built mobile applications

BS Computer Science, Stanford University, 2018

Python, JavaScript, React, AWS, Docker
"""

prompt = f"""Extract all information from this resume and return as structured JSON:

{resume_text}

Return a single JSON object with these exact fields:
- name (string)
- email (string)  
- phone (string)
- work_experience (array of objects with: company, title, start_date, end_date, description)
- education (array of objects with: school, degree, year)
- skills (array of strings)

Use these exact field names."""

print(f"[Resume:]")
print(resume_text)

resume_data = extract_json(prompt)
resume = Resume(**resume_data)

print(f"✓ Parsed Resume:")
print(f"\n→ Candidate: {resume.name}")
print(f"→ Contact: {resume.email} | {resume.phone}")

print(f"\n→ Work Experience:")
for job in resume.work_experience:
    print(f"  • {job.title} at {job.company} ({job.start_date} - {job.end_date})")

print(f"\n→ Education:")
for edu in resume.education:
    print(f"  • {edu.degree}, {edu.school} ({edu.year})")

print(f"\n→ Skills: {', '.join(resume.skills)}")

[Resume:]

JOHN DOE
john.doe@email.com | (555) 987-6543

Senior Software Engineer at TechCorp (2020-2024)
- Led development of cloud infrastructure

Software Engineer at StartupXYZ (2018-2020)
- Built mobile applications

BS Computer Science, Stanford University, 2018

Python, JavaScript, React, AWS, Docker

✓ Parsed Resume:

→ Candidate: JOHN DOE
→ Contact: john.doe@email.com | (555) 987-6543

→ Work Experience:
  • Senior Software Engineer at TechCorp (2020 - 2024)
  • Software Engineer at StartupXYZ (2018 - 2020)

→ Education:
  • BS Computer Science, Stanford University (2018)

→ Skills: Python, JavaScript, React, AWS, Docker


### ❓ Discussion Question #2

Resume formats vary widely. Some use bullet points, some use paragraphs, some have sections in different orders. How robust is this extraction? What would happen with a very different resume format? How could you improve reliability?
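
One common way to improve reliability is to retry when the output fails validation. The sketch below assumes a hypothetical `extract_with_retry` helper that takes the extraction and validation steps as callables, so it can be tried without an API key; the stand-in extractor fails once and then succeeds. Catching `ValueError` covers both `json.JSONDecodeError` and Pydantic's `ValidationError`, since both are `ValueError` subclasses.

```python
from typing import Callable, Optional


def extract_with_retry(
    extract: Callable[[], dict],
    validate: Callable[[dict], object],
    max_attempts: int = 3,
) -> Optional[object]:
    """Call `extract` until `validate` accepts the result or attempts run out.

    Hypothetical helper: `validate` is expected to raise ValueError on bad data.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return validate(extract())
        except ValueError as exc:
            print(f"Attempt {attempt} failed: {exc}")
    return None  # every attempt failed; the caller decides what to do next


# Stand-in extractor: the first response lacks `skills`, the second is complete.
responses = iter([
    {"name": "John Doe"},
    {"name": "John Doe", "skills": ["Python"]},
])


def fake_extract() -> dict:
    return next(responses)


def require_skills(data: dict) -> dict:
    if "skills" not in data:
        raise ValueError("missing 'skills'")
    return data


print(extract_with_retry(fake_extract, require_skills))
```

In the notebook, `extract` could be `lambda: extract_json(prompt)` and `validate` could be `lambda d: Resume(**d)`. Retries cost extra API calls, so a small `max_attempts` is usually a sensible default.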

---

## 💼 Real-World Applications

### 1. Invoice Processor
**What it does:** Extract line items, quantities, prices, and totals from invoices

**How to build it:**
- Define schema for invoice items
- Extract all products with prices
- Validate totals match
- Save to accounting database

In [16]:
# Example: Invoice Processor
class InvoiceItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor: str
    items: List[InvoiceItem]
    subtotal: float
    tax: float
    total: float

invoice_text = """
INVOICE #INV-2024-001
Date: January 15, 2024
From: Office Supplies Co.

Items:
1. Printer Paper (5 reams) @ $8.99 each = $44.95
2. Blue Pens (Box of 12) @ $3.50 each = $3.50
3. Notebooks (10 units) @ $2.99 each = $29.90

Subtotal: $78.35
Tax (8%): $6.27
Total: $84.62
"""

prompt = f"""Extract all information from this invoice:

{invoice_text}

Return a single JSON object with these exact fields:
- invoice_number (string)
- date (string)
- vendor (string)
- items (array of objects with: description, quantity, unit_price, total)
- subtotal (number)
- tax (number)
- total (number)

IMPORTANT: Use 'unit_price' exactly, do not use 'price'.
Use these exact field names."""

invoice_data = extract_json(prompt)
invoice = Invoice(**invoice_data)

print(f"✓ Invoice #{invoice.invoice_number}")
print(f"Vendor: {invoice.vendor}")
print(f"Date: {invoice.date}\n")

print("Items:")
for item in invoice.items:
    print(f"  • {item.description}: {item.quantity} × ${item.unit_price} = ${item.total}")

print(f"\nSubtotal: ${invoice.subtotal}")
print(f"Tax: ${invoice.tax}")
print(f"→ Total: ${invoice.total}")

✓ Invoice #INV-2024-001
Vendor: Office Supplies Co.
Date: January 15, 2024

Items:
  • Printer Paper (5 reams): 5 × $8.99 = $44.95
  • Blue Pens (Box of 12): 1 × $3.5 = $3.5
  • Notebooks (10 units): 10 × $2.99 = $29.9

Subtotal: $78.35
Tax: $6.27
→ Total: $84.62
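
The "validate totals match" step from the build list is not shown in the cell above. A minimal sketch of it follows, working on a plain dict so it runs on its own; `check_invoice` and its `tolerance` parameter are hypothetical names, and the sample values are copied from the invoice above.

```python
def check_invoice(invoice: dict, tolerance: float = 0.01) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    # 1. Each line total should equal quantity x unit price.
    for item in invoice["items"]:
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["total"]) > tolerance:
            problems.append(
                f"{item['description']}: expected {expected:.2f}, got {item['total']:.2f}"
            )
    # 2. The subtotal should equal the sum of the line totals.
    items_sum = sum(item["total"] for item in invoice["items"])
    if abs(items_sum - invoice["subtotal"]) > tolerance:
        problems.append(f"subtotal: expected {items_sum:.2f}, got {invoice['subtotal']:.2f}")
    # 3. The grand total should equal subtotal plus tax.
    grand = invoice["subtotal"] + invoice["tax"]
    if abs(grand - invoice["total"]) > tolerance:
        problems.append(f"total: expected {grand:.2f}, got {invoice['total']:.2f}")
    return problems


sample = {
    "items": [
        {"description": "Printer Paper (5 reams)", "quantity": 5, "unit_price": 8.99, "total": 44.95},
        {"description": "Blue Pens (Box of 12)", "quantity": 1, "unit_price": 3.50, "total": 3.50},
        {"description": "Notebooks (10 units)", "quantity": 10, "unit_price": 2.99, "total": 29.90},
    ],
    "subtotal": 78.35,
    "tax": 6.27,
    "total": 84.62,
}

print(check_invoice(sample))  # → []
```

Collecting every problem, rather than stopping at the first, gives a reviewer the full picture when an invoice is flagged.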


### 2. Email Action Item Extractor
**What it does:** Read emails and extract action items with deadlines

**What this means:** Instead of reading a long email thread and manually making a to-do list, AI does it for you. For example, from "Please send the report by Friday" it extracts: Action: "Send report", Deadline: "Friday"

In [17]:
# Example: Email Action Item Extractor
class ActionItem(BaseModel):
    task: str
    deadline: str
    priority: str
    assigned_to: str

class EmailAnalysis(BaseModel):
    subject: str
    sender: str
    action_items: List[ActionItem]

email = """
From: boss@company.com
Subject: Q1 Planning Meeting Follow-up

Hi Team,

Great meeting today. Here are the next steps:

1. Sarah needs to finalize the budget by Friday. This is high priority.
2. Mike, please update the slide deck by Wednesday. Medium priority.
3. Everyone should review the new policy doc by next Monday. Low priority.

Thanks!
"""

prompt = f"""Extract action items from this email:

{email}

Return a single JSON object with these exact fields:
- subject (string)
- sender (string)
- action_items (array of objects with: task, deadline, priority, assigned_to)

Priority should be "high", "medium", or "low".
Use these exact field names."""

email_data = extract_json(prompt)
analysis = EmailAnalysis(**email_data)

print(f"[Subject:] {analysis.subject}")
print(f"[Sender:] {analysis.sender}\n")

print("Action Items:")
for item in analysis.action_items:
    print(f"  • [Task:] {item.task}")
    print(f"    Priority: {item.priority.upper()} | Deadline: {item.deadline} | Assigned: {item.assigned_to}")
    print("="*60)

[Subject:] Q1 Planning Meeting Follow-up
[Sender:] boss@company.com

Action Items:
  • [Task:] finalize the budget
    Priority: HIGH | Deadline: Friday | Assigned: Sarah
  • [Task:] update the slide deck
    Priority: MEDIUM | Deadline: Wednesday | Assigned: Mike
  • [Task:] review the new policy doc
    Priority: LOW | Deadline: next Monday | Assigned: Everyone
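
The prompt asks for a priority of "high", "medium", or "low", but the `priority: str` field accepts any string. One possible tightening, sketched here as a variant of the `ActionItem` model, is to declare the field as a `typing.Literal` so Pydantic rejects anything else:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class ActionItem(BaseModel):
    task: str
    deadline: str
    # Only these three exact (case-sensitive) strings pass validation.
    priority: Literal["high", "medium", "low"]
    assigned_to: str


ok = ActionItem(
    task="finalize the budget",
    deadline="Friday",
    priority="high",
    assigned_to="Sarah",
)
print(ok.priority)

try:
    ActionItem(
        task="update the slide deck",
        deadline="Wednesday",
        priority="urgent",  # not one of the allowed values
        assigned_to="Mike",
    )
except ValidationError:
    print("rejected: 'urgent' is not an allowed priority")
```

Because the match is case-sensitive, a response of `"High"` would also be rejected, so lowercasing the value before validation may be worth considering.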


### 3. Product Review Analyzer
**What it does:** Extract sentiment, rating, pros, and cons from product reviews

In [18]:
# Example: Product Review Analyzer
class Review(BaseModel):
    product_name: str
    rating: int  # 1-5 stars
    sentiment: str  # "positive", "neutral", "negative"
    pros: List[str]
    cons: List[str]
    summary: str

review_text = """
I bought the UltraBook Pro laptop last month and I'm mostly happy with it. 
The battery life is amazing - easily lasts 12 hours. The screen is beautiful and bright.
Performance is great for coding and video editing.

However, it's quite expensive at $1,500. Also, the keyboard feels a bit mushy compared to my old laptop.
The trackpad is sometimes unresponsive.

Overall, I'd give it 4 out of 5 stars. Good laptop but not perfect.
"""

prompt = f"""Analyze this product review:

{review_text}

Return a single JSON object with these exact fields: product_name, rating (integer 1-5), sentiment ("positive", "neutral", or "negative"), pros (array), cons (array), summary (1 sentence)"""

review_data = extract_json(prompt)
review = Review(**review_data)

print(f"✓ Review Analysis")
print(f"Product: {review.product_name}")
print(f"Rating: {'⭐' * review.rating} ({review.rating}/5)")
print(f"Sentiment: {review.sentiment.upper()}\n")

print(f"✓ Pros:")
for pro in review.pros:
    print(f"  ✓ {pro}")

print(f"\n✗ Cons:")
for con in review.cons:
    print(f"  ✗ {con}")

print(f"\n→ Summary: {review.summary}")

✓ Review Analysis
Product: UltraBook Pro
Rating: ⭐⭐⭐⭐ (4/5)
Sentiment: POSITIVE

✓ Pros:
  ✓ amazing battery life
  ✓ beautiful and bright screen
  ✓ great performance for coding and video editing

✗ Cons:
  ✗ expensive
  ✗ mushy keyboard
  ✗ unresponsive trackpad

→ Summary: Good laptop but not perfect.
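
The `# 1-5 stars` comment on `rating` documents the expected range but does not enforce it, so a rating of 40 would pass validation and print forty stars. A minimal sketch of enforcing the range with Pydantic's `Field` constraints follows; `RatedReview` is a trimmed, hypothetical variant of the `Review` model.

```python
from pydantic import BaseModel, Field, ValidationError


class RatedReview(BaseModel):
    product_name: str
    # `...` marks the field as required; ge/le enforce 1 <= rating <= 5.
    rating: int = Field(..., ge=1, le=5)


print(RatedReview(product_name="UltraBook Pro", rating=4).rating)

try:
    RatedReview(product_name="UltraBook Pro", rating=40)
except ValidationError:
    print("rejected: rating out of range")
```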


---

## 🚀 Final Challenge: Build a Meeting Notes Extractor

**Your Task:** Create a system that processes meeting transcripts and extracts structured information.

**Requirements:**
1. Extract:
   - Meeting title
   - Date
   - Attendees (list)
   - Key decisions made (list)
   - Action items with owners and deadlines
   - Next meeting date
2. Use Pydantic for validation
3. Format the output nicely

**Test with this meeting transcript:**

In [12]:
meeting_transcript = """
Product Planning Meeting - January 20, 2024

Attendees: Sarah (PM), John (Engineering), Lisa (Design), Mike (Marketing)

Sarah: Let's discuss the Q1 roadmap. We've decided to prioritize the mobile app redesign.

Lisa: I'll have the mockups ready by next Friday, January 26th.

John: Engineering will need 2 weeks after that for implementation. We also decided to postpone 
the analytics dashboard to Q2.

Mike: I'll prepare the launch campaign. Can someone send me the feature list by Wednesday?

Sarah: I'll send that. Let's meet again on February 5th to review progress.
"""

# Your experimentation space for the Final Challenge

# Define your Pydantic models here:
# class MeetingNotes(BaseModel):
#     ...

# Extract and display the data

---

## 🎓 Key Takeaways

Congratulations! You now understand how to extract structured data from text:

1. **JSON Mode** - Requests syntactically valid JSON output
2. **Pydantic** - Adds type safety and validation
3. **Multiple Entities** - Extract lists of items
4. **Nested Structures** - Handle complex data hierarchies
5. **Real-world Applications** - Invoice processor, email analyzer, review parser

### 🔑 Best Practices:

| Scenario | Temperature | Validation | Notes |
|----------|-------------|------------|-------|
| Contact Extraction | Low (0.2) | Pydantic | Accuracy is critical |
| Invoice Processing | Low (0.2) | Pydantic + Math | Verify calculations |
| Review Analysis | Medium (0.5) | Pydantic | Some interpretation needed |
| Resume Parsing | Low (0.3) | Pydantic | Consistent format important |

### ⚠️ Important Considerations:

- **Always validate**: Don't trust AI outputs blindly
- **Handle missing data**: Use `Optional` fields in Pydantic
- **Test edge cases**: What if the text is ambiguous?
- **Low temperature**: Use 0.2-0.3 for consistent extraction; the `extract_json` helper defaults to 0.7, so pass `temperature` explicitly
- **Clear prompts**: Specify exact field names and formats

### 🚀 Next Steps:

- **Combine** with vision AI to extract data from images
- **Add** error handling for failed extractions
- **Build** a complete data pipeline (extract → validate → save to database)
- **Experiment** with different models and compare accuracy
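
As a rough sketch of the "save to database" end of such a pipeline, the example below stores one contact in SQLite using only the standard library. The table layout and the in-memory database are assumptions made for the demo, and the dict stands in for a record that has already passed Pydantic validation.

```python
import sqlite3

# Stand-in for a validated record (values from the Part 1 example).
contact = {
    "name": "Sarah Johnson",
    "email": "sarah.j@techcorp.com",
    "phone": "(555) 123-4567",
    "company": "TechCorp Inc.",
    "city": "San Francisco",
}

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit
conn.execute(
    "CREATE TABLE contacts ("
    "name TEXT NOT NULL, email TEXT NOT NULL UNIQUE, "
    "phone TEXT, company TEXT, city TEXT)"
)
# Named placeholders keep extracted text out of the SQL string itself.
conn.execute(
    "INSERT INTO contacts VALUES (:name, :email, :phone, :company, :city)",
    contact,
)
conn.commit()

rows = conn.execute("SELECT name, email FROM contacts").fetchall()
print(rows)  # → [('Sarah Johnson', 'sarah.j@techcorp.com')]
```

Parameterized queries matter here because the values originate from untrusted email text; building the SQL with string formatting would expose the pipeline to injection.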

---

## 📚 Additional Resources

- [Structured Outputs: Everything You Should Know](https://learnwithparam.com/blog/structured-output-making-llms-application-ready)
- [Pydantic Documentation](https://docs.pydantic.dev/)
- [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs)

Happy building! 🎉