# Nested Models and Alternative Schema Types

While basic structured output allows us to extract simple data with flat schemas, real-world applications often require handling complex, hierarchical data structures. In this notebook, we will learn how to work with nested Pydantic models, explore alternative schema types like Python dataclasses and TypedDict, understand when to use each approach, and leverage provider-native structured output support for better performance.

In [1]:
from pydantic import BaseModel, Field
from typing import List, Optional, TypedDict, Union
from dataclasses import dataclass
from langchain.agents import create_agent
from langchain.agents.structured_output import ToolStrategy
from langchain_openai import ChatOpenAI
import os

## Nested Pydantic models for hierarchical data

Many real-world scenarios require extracting hierarchical data. For example, when extracting information about a company, we might need to capture not just the company name, but also nested information about employees, departments, and locations. Nested Pydantic models allow us to define these complex structures in a type-safe, validated way.

Let's start with a practical example: extracting information about a research paper, including its authors and their affiliations.

In [2]:
# Define nested models for hierarchical data
class Affiliation(BaseModel):
    """Represents an author's institutional affiliation"""
    institution: str = Field(description="Name of the institution")
    department: str = Field(description="Department or division")
    country: str = Field(description="Country where institution is located")

class Author(BaseModel):
    """Represents a research paper author with their affiliation"""
    name: str = Field(description="Full name of the author")
    email: str = Field(description="Contact email address")
    affiliation: Affiliation = Field(description="Author's institutional affiliation")

class ResearchPaper(BaseModel):
    """Complete research paper information with nested author data"""
    title: str = Field(description="Title of the research paper")
    abstract: str = Field(description="Paper abstract or summary")
    authors: List[Author] = Field(description="List of paper authors with their details")
    publication_year: int = Field(description="Year of publication")
    keywords: List[str] = Field(description="Research keywords or topics")


# Initialize the model
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    api_key=os.getenv("OPENAI_API_KEY", "").strip(),
    temperature=0
)

# Create an agent with nested Pydantic model schema
agent = create_agent(
    model=llm,
    tools=[],
    response_format=ToolStrategy(
        schema=ResearchPaper,
        tool_message_content="Research paper information extracted successfully"
    )
)

# Example: Extract structured information from unstructured text
text = """
The paper 'Deep Learning for Natural Language Processing' was published in 2023.
It was authored by Dr. Sarah Chen (sarah.chen@stanford.edu) from the Computer Science 
Department at Stanford University in the United States, and Prof. Michael Zhang 
(m.zhang@mit.edu) from the Artificial Intelligence Lab at MIT in the United States.
The abstract discusses novel approaches to transformer architectures for improved 
language understanding. Key topics include attention mechanisms, transfer learning, 
and model optimization.
"""

result = agent.invoke({
    "messages": [{"role": "user", "content": f"Extract research paper information from: {text}"}]
})

print("Extracted Information:")
print(result["structured_response"])

Extracted Information:
title='Deep Learning for Natural Language Processing' abstract='The abstract discusses novel approaches to transformer architectures for improved language understanding.' authors=[Author(name='Dr. Sarah Chen', email='sarah.chen@stanford.edu', affiliation=Affiliation(institution='Stanford University', department='Computer Science', country='United States')), Author(name='Prof. Michael Zhang', email='m.zhang@mit.edu', affiliation=Affiliation(institution='MIT', department='Artificial Intelligence Lab', country='United States'))] publication_year=2023 keywords=['attention mechanisms', 'transfer learning', 'model optimization']


In this example, we have created a three-level nested structure:
1. `ResearchPaper` (top level) contains a list of `Author` objects
2. Each `Author` contains an `Affiliation` object
3. `Affiliation` contains simple string fields

The `description` passed to `Field` is metadata that helps the LLM understand what each field represents. When using nested models, LangChain handles the serialization and validation of the entire hierarchy: the LLM receives the complete schema, including the nested `Author` and `Affiliation` definitions, and returns data that matches this nested format.
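To make the nesting concrete, the sketch below uses Pydantic alone, with no LLM call, to show what a nested schema looks like and how nested data is validated. It assumes Pydantic v2 (`model_json_schema`, `model_validate`), and the trimmed-down `Paper` model is a hypothetical stand-in for `ResearchPaper`, not the exact payload LangChain sends to the provider.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Affiliation(BaseModel):
    institution: str = Field(description="Name of the institution")
    country: str = Field(description="Country where institution is located")


class Author(BaseModel):
    name: str = Field(description="Full name of the author")
    affiliation: Affiliation = Field(description="Author's institutional affiliation")


class Paper(BaseModel):  # hypothetical, shortened version of ResearchPaper
    title: str = Field(description="Title of the research paper")
    authors: List[Author] = Field(description="List of paper authors")


# Nested models are listed under "$defs" and referenced by name.
schema = Paper.model_json_schema()
print(sorted(schema["$defs"]))
print(schema["properties"]["authors"]["items"])

# Plain nested dicts are validated recursively into model instances.
paper = Paper.model_validate(
    {
        "title": "Deep Learning for Natural Language Processing",
        "authors": [
            {
                "name": "Dr. Sarah Chen",
                "affiliation": {
                    "institution": "Stanford University",
                    "country": "United States",
                },
            }
        ],
    }
)
print(paper.authors[0].affiliation.institution)

# A missing nested field is reported with its full path in the hierarchy.
error_path = None
try:
    Paper.model_validate(
        {
            "title": "T",
            "authors": [{"name": "A", "affiliation": {"institution": "X"}}],
        }
    )
except ValidationError as exc:
    error_path = exc.errors()[0]["loc"]
print(error_path)
```

The error path points at the exact nested field that failed, which makes malformed extractions easier to debug in deep hierarchies.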


## Alternative schema types

### Dataclasses
While Pydantic models are powerful, Python dataclasses offer a lighter-weight alternative from the standard library. They are a good fit when type hints alone describe the data well enough and we do not need Pydantic's field-level validation or per-field descriptions.

In [3]:
@dataclass
class ProductReview:
    """Customer review for a product"""
    product_name: str
    rating: int
    reviewer_name: str
    review_text: str
    pros: List[str]
    cons: List[str]
    would_recommend: bool
    verified_purchase: Optional[bool] = None

# Create agent with dataclass schema
agent = create_agent(
    model=llm,
    tools=[],
    response_format=ToolStrategy(
        schema=ProductReview,
        tool_message_content="Product review extracted successfully"
    )
)

review_text = """
I recently bought the Sony WH-1000XM5 wireless headphones, and I’m seriously impressed.
The sound quality is fantastic, and the noise cancellation is the best I’ve experienced.
Battery life lasts easily over 30 hours, and they’re super comfortable for long flights.

The only downside is the high price and the touch controls can be finicky at times.
Still, I’d definitely recommend them to anyone looking for top-tier noise-canceling headphones.

Reviewer: Alex Johnson
Rating: 5 stars
Verified Purchase: Yes
"""

result = agent.invoke({
    "messages": [
        {"role": "user", "content": f"Extract a structured product review from the following text:\n{review_text}"}
    ]
})

print("Extracted Information:")
print(result["structured_response"])

Extracted Information:
ProductReview(product_name='Sony WH-1000XM5', rating=5, reviewer_name='Alex Johnson', review_text='I recently bought the Sony WH-1000XM5 wireless headphones, and I’m seriously impressed. The sound quality is fantastic, and the noise cancellation is the best I’ve experienced. Battery life lasts easily over 30 hours, and they’re super comfortable for long flights. The only downside is the high price and the touch controls can be finicky at times. Still, I’d definitely recommend them to anyone looking for top-tier noise-canceling headphones.', pros=['Fantastic sound quality', 'Best noise cancellation', 'Over 30 hours battery life', 'Super comfortable for long flights'], cons=['High price', 'Touch controls can be finicky'], would_recommend=True, verified_purchase=True)
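One trade-off to keep in mind: a standard-library dataclass does not check its type annotations when constructed directly. The sketch below illustrates this and shows one possible way to add manual checks in `__post_init__`. The class names and the 1 to 5 rating range are assumptions made for illustration, not part of the example above.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class UncheckedReview:
    product_name: str
    rating: int


# Annotations are not enforced: a string is accepted where an int is declared.
unchecked = UncheckedReview(product_name="Sony WH-1000XM5", rating="five")
print(type(unchecked.rating).__name__)


@dataclass
class CheckedReview:
    product_name: str
    rating: int
    pros: List[str] = field(default_factory=list)
    verified_purchase: Optional[bool] = None

    def __post_init__(self) -> None:
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(self.rating, bool) or not isinstance(self.rating, int):
            raise TypeError("rating must be an int")
        # Assumed business rule for this sketch: ratings run from 1 to 5.
        if not 1 <= self.rating <= 5:
            raise ValueError("rating must be between 1 and 5")


review = CheckedReview(product_name="Sony WH-1000XM5", rating=5, pros=["Sound quality"])
print(asdict(review))

try:
    CheckedReview(product_name="Sony WH-1000XM5", rating=9)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"Rejected: {exc}")
```

If many such checks accumulate, a Pydantic model is usually the simpler choice.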


### TypedDict
`TypedDict` is useful when we work with dictionary-based data and want type hints without creating class instances. With it:
- We can describe which keys a dictionary should have and what type each value should be (for example, "this dict must have `name: str` and `age: int`").
- The value stays a plain dictionary rather than becoming an object of a full class (such as a Pydantic model or dataclass).

In short, `TypedDict` adds structure and type hints to normal Python dictionaries.

In [4]:
# Define the schemas
class EventDetails(TypedDict):
    """Details about an event"""
    event_name: str
    date: str
    time: str
    location: str

class EventExtraction(TypedDict):
    """Extracted event information from text"""
    events: List[EventDetails]
    total_count: int
    source_type: str


# Initialize the model
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    api_key=os.getenv("OPENAI_API_KEY", "").strip(),
    temperature=0
)

# Create agent with TypedDict schema
agent = create_agent(
    model=llm,
    tools=[],
    response_format=ToolStrategy(
        schema=EventExtraction,
        tool_message_content="Event information extracted successfully"
    )
)

event_text = """
Upcoming Tech Events:
1. AI Innovations Summit on December 5, 2025 at 9:00 AM in San Francisco.
2. Python Developers Conference on January 20, 2026 at 10:30 AM in New York.
3. Data Science Expo on March 14, 2026 at 11:00 AM in London.
Source: techmeetups.com
"""

result = agent.invoke({
    "messages": [
        {"role": "user", "content": f"Extract structured event information from the following text:\n{event_text}"}
    ]
})

print("Extracted Information:")
print(result["structured_response"])

Extracted Information:
{'events': [{'event_name': 'AI Innovations Summit', 'date': 'December 5, 2025', 'time': '9:00 AM', 'location': 'San Francisco'}, {'event_name': 'Python Developers Conference', 'date': 'January 20, 2026', 'time': '10:30 AM', 'location': 'New York'}, {'event_name': 'Data Science Expo', 'date': 'March 14, 2026', 'time': '11:00 AM', 'location': 'London'}], 'total_count': 3, 'source_type': 'techmeetups.com'}
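The result above is a plain `dict`, which reflects how `TypedDict` behaves in general: the annotations guide static type checkers but are not enforced when the dictionary is built. The sketch below uses only the standard library to show this. The `missing_keys` helper is hypothetical and checks key presence only, not value types.

```python
from typing import List, TypedDict, get_type_hints


class EventDetails(TypedDict):
    event_name: str
    date: str


class EventExtraction(TypedDict):
    events: List[EventDetails]
    total_count: int


# At runtime a TypedDict value is an ordinary dict.
extraction: EventExtraction = {
    "events": [{"event_name": "AI Innovations Summit", "date": "December 5, 2025"}],
    "total_count": 1,
}
print(type(extraction) is dict)

# The declared types can still be inspected programmatically.
hints = get_type_hints(EventExtraction)
print(hints["total_count"])


def missing_keys(data: dict, schema: type) -> List[str]:
    """Return the required keys of a TypedDict class that are absent from data."""
    return sorted(set(schema.__required_keys__) - set(data))


print(missing_keys({"events": []}, EventExtraction))
```

A static checker such as mypy would flag a wrong value type here, but nothing in the standard library rejects it at runtime.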


## Union types for multiple output formats
Union types allow us to define multiple possible output schemas, and the LLM will choose the appropriate one based on the input.

In [5]:
# Define schema for a successful booking
class SuccessfulBooking(BaseModel):
    booking_type: str = Field(default="success")
    confirmation_number: str
    customer_name: str
    service: str
    date: str
    total_cost: float

# Define schema for a booking error
class BookingError(BaseModel):
    booking_type: str = Field(default="error")
    error_message: str
    requested_service: str
    alternative_dates: List[str]
    reason: str

# Define the Union type (multiple possible outputs)
BookingResponse = Union[SuccessfulBooking, BookingError]

# Initialize the language model
llm = ChatOpenAI(
    model="gpt-4o-mini-2024-07-18",
    api_key=os.getenv("OPENAI_API_KEY", "").strip(),
    temperature=0
)

# Create agent with the Union schema
agent = create_agent(
    model=llm,
    tools=[],
    response_format=ToolStrategy(
        schema=BookingResponse,
        tool_message_content="Booking response extracted successfully"
    )
)


### Test case 1 – Successful booking
success_text = """
Your hotel booking was confirmed! 
Confirmation number: CNF12345. 
Customer: Alice Johnson. 
Service: Ocean View Room. 
Date: December 10, 2025. 
Total cost: $350.
"""

result_success = agent.invoke({
    "messages": [
        {"role": "user", "content": f"Extract booking response details from this text:\n{success_text}"}
    ]
})

print("✅ Successful Booking Response:")
print(result_success["structured_response"])



### Test case 2 – Booking error
error_text = """
We're sorry, your flight booking could not be completed. 
The requested service: Business Class flight to Tokyo on March 2, 2026.
Reason: Seats are fully booked. 
Alternative available dates: March 3, March 5, and March 7.
"""

result_error = agent.invoke({
    "messages": [
        {"role": "user", "content": f"Extract booking response details from this text:\n{error_text}"}
    ]
})

print("\n❌ Booking Error Response:")
print(result_error["structured_response"])

✅ Successful Booking Response:
booking_type='success' confirmation_number='CNF12345' customer_name='Alice Johnson' service='Ocean View Room' date='December 10, 2025' total_cost=350.0

❌ Booking Error Response:
booking_type='error' error_message='Your flight booking could not be completed.' requested_service='Business Class flight to Tokyo on March 2, 2026.' alternative_dates=['March 3', 'March 5', 'March 7'] reason='Seats are fully booked.'
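Because the structured response can be either model, downstream code usually needs to branch on its type. The sketch below shows one way to do that with `isinstance`. It uses shortened, hypothetical versions of the two models, tags them with `Literal` values instead of a free-form `str`, and uses Pydantic v2's `TypeAdapter` as a stand-in for the agent so that it runs without an API call.

```python
from typing import List, Literal, Union

from pydantic import BaseModel, TypeAdapter


class SuccessfulBooking(BaseModel):
    # Literal restricts the tag to one value, unlike a plain str default.
    booking_type: Literal["success"] = "success"
    confirmation_number: str
    total_cost: float


class BookingError(BaseModel):
    booking_type: Literal["error"] = "error"
    error_message: str
    alternative_dates: List[str]


BookingResponse = Union[SuccessfulBooking, BookingError]


def summarize(response: BookingResponse) -> str:
    """Branch on the concrete type of the structured response."""
    if isinstance(response, SuccessfulBooking):
        return f"Confirmed {response.confirmation_number} for ${response.total_cost:.2f}"
    dates = ", ".join(response.alternative_dates)
    return f"Failed: {response.error_message} (try {dates})"


# Stand-in for the agent: validate raw dicts against the Union.
adapter = TypeAdapter(BookingResponse)
ok = adapter.validate_python(
    {"confirmation_number": "CNF12345", "total_cost": 350.0}
)
failed = adapter.validate_python(
    {"error_message": "Seats are fully booked.", "alternative_dates": ["March 3", "March 5"]}
)

print(summarize(ok))
print(summarize(failed))
```

With the agent, the same `summarize` function could be applied to `result["structured_response"]`.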


### Performance considerations
When working with structured output in production:
1. Use `ProviderStrategy` for best performance when the model provider supports native structured output
2. Minimize schema complexity (avoid deeply nested structures when possible)
3. Cap list lengths to prevent excessive processing (`max_length` on a `Field` in Pydantic v2; `max_items` is the Pydantic v1 name)
4. Choose the schema type (Pydantic model, dataclass, or `TypedDict`) based on validation needs
5. Implement batch processing for multiple items
6. Enable error handling by passing `handle_errors=True` to `ToolStrategy`
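As an illustration of capping list lengths, the sketch below assumes Pydantic v2, where the cap is written as `Field(max_length=...)` and appears as `maxItems` in the generated JSON Schema. The `KeywordList` model and the limit of three are hypothetical.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class KeywordList(BaseModel):
    # Hypothetical cap: at most three keywords per extraction.
    keywords: List[str] = Field(max_length=3, description="At most three keywords")


# The cap is visible in the schema, so the LLM can see the limit.
schema = KeywordList.model_json_schema()
print(schema["properties"]["keywords"]["maxItems"])

# Exceeding the cap is rejected at validation time.
error_type = None
try:
    KeywordList(keywords=["a", "b", "c", "d"])
except ValidationError as exc:
    error_type = exc.errors()[0]["type"]
print(error_type)
```

Whether a given provider enforces `maxItems` during generation varies, so the validation step remains the reliable safeguard.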