## <b><font color='darkblue'>Preface</font></b>
<font size='3ptx'> ([source](https://machinelearningmastery.com/the-complete-guide-to-using-pydantic-for-validating-llm-outputs/)) <b>Large language models generate text, not structured data. Even when you prompt them to return structured data, they’re still generating text that looks like valid JSON.</b> The output may have incorrect field names, missing required fields, wrong data types, or extra text wrapped around the actual data. Without validation, these inconsistencies cause runtime errors that are difficult to debug.</font>

<b>[Pydantic](https://docs.pydantic.dev/latest/) helps you validate data at runtime using Python type hints</b>. It checks that LLM outputs match your expected schema, converts types automatically where possible, and provides clear error messages when validation fails. This gives you a reliable contract between the LLM’s output and your application’s requirements.

### <b><font color='darkgreen'>Agenda</font></b>
Topics we will cover include:
* Designing robust Pydantic models (including custom validators and nested schemas).
* Parsing “messy” LLM outputs safely and surfacing precise validation errors.
* Integrating validation with LangChain (backed by Google Gemini), plus retry strategies.

![ui](https://machinelearningmastery.com/wp-content/uploads/2025/12/mlm-complete-guide-pydantic-featured-image.jpeg)

## <b><font color='darkblue'>Getting Started</font></b>
<font size='3ptx'><b>Let’s start with a simple example by building a tool that extracts contact information from text.</b> The LLM reads unstructured text and returns structured data that we validate with Pydantic:</font>

In [2]:
from pydantic import BaseModel, EmailStr, field_validator
from typing import Optional

class ContactInfo(BaseModel):
    name: str
    email: EmailStr
    phone: Optional[str] = None
    company: Optional[str] = None
    
    @field_validator('phone')
    @classmethod
    def validate_phone(cls, v):
        if v is None:
            return v
        cleaned = ''.join(filter(str.isdigit, v))
        if len(cleaned) < 10:
            raise ValueError('Phone number must have at least 10 digits')
        return cleaned

<b>All Pydantic models inherit from [**BaseModel**](https://docs.pydantic.dev/latest/api/base_model/#BaseModel), which provides automatic validation</b>. Type hints like `name: str` help Pydantic validate types at runtime. The `EmailStr` type validates email format without needing a custom regex. Fields marked with `Optional[str] = None` can be missing or null. The `@field_validator` decorator lets you add custom validation logic, like cleaning phone numbers and checking their length.

Here’s how to use the model to validate sample LLM output:

In [3]:
import json

llm_response = '''
{
    "name": "Sarah Johnson",
    "email": "sarah.johnson@techcorp.com",
    "phone": "(555) 123-4567",
    "company": "TechCorp Industries"
}
'''

data = json.loads(llm_response)
contact = ContactInfo(**data)

print(contact.name) 
print(contact.email)  
print(contact.model_dump())

Sarah Johnson
sarah.johnson@techcorp.com
{'name': 'Sarah Johnson', 'email': 'sarah.johnson@techcorp.com', 'phone': '5551234567', 'company': 'TechCorp Industries'}


When you create a <b><font color='blue'>ContactInfo</font></b> instance, <b>Pydantic validates everything automatically. If validation fails, you get a clear error message telling you exactly what went wrong</b>.
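
To see what such an error looks like, here is a minimal sketch of the failure case. It assumes Pydantic v2 and uses a hypothetical, trimmed-down `SimpleContact` model (no `EmailStr`, so it runs without the optional `email-validator` package). The structured list returned by `e.errors()` is often easier to work with than the formatted message:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class SimpleContact(BaseModel):
    # Hypothetical, simplified variant of ContactInfo for illustration only.
    name: str
    phone: Optional[str] = None

    @field_validator('phone')
    @classmethod
    def validate_phone(cls, v):
        if v is None:
            return v
        cleaned = ''.join(filter(str.isdigit, v))
        if len(cleaned) < 10:
            raise ValueError('Phone number must have at least 10 digits')
        return cleaned


# "name" is missing and "phone" has only 7 digits.
bad_data = {"phone": "555-1234"}

try:
    SimpleContact(**bad_data)
    failed_fields = []
except ValidationError as e:
    # Each entry in e.errors() is a dict with "loc", "type", "msg", and "input".
    failed_fields = [(err["loc"][0], err["type"]) for err in e.errors()]
    for field, err_type in failed_fields:
        print(f"{field}: {err_type}")
```

Pydantic collects every failing field in a single `ValidationError` rather than stopping at the first one, which is useful when feeding errors back to an LLM.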

## <b><font color='darkblue'>Parsing and Validating LLM Outputs</font></b>
<font size='3ptx'><b>LLMs don’t always return perfect JSON</b>. Sometimes they add markdown formatting, explanatory text, or mess up the structure. Here’s how to handle these cases:</font>

In [4]:
from pydantic import BaseModel, ValidationError, field_validator
import json
import re

class ProductReview(BaseModel):
    product_name: str
    rating: int
    review_text: str
    would_recommend: bool
    
    @field_validator('rating')
    @classmethod
    def validate_rating(cls, v):
        if not 1 <= v <= 5:
            raise ValueError('Rating must be an integer between 1 and 5')
        return v


def extract_json_from_llm_response(response: str) -> dict:
    """Extract JSON from LLM response that might contain extra text."""
    json_match = re.search(r'\{.*\}', response, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    raise ValueError("No JSON found in response")


def parse_review(llm_output: str) -> ProductReview:
    """Safely parse and validate LLM output."""
    try:
        data = extract_json_from_llm_response(llm_output)
        review = ProductReview(**data)
        return review
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        raise
    except ValidationError as e:
        print(f"Validation error: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

This approach uses regex to find JSON within response text, handling cases where the LLM adds explanatory text before or after the data. We catch different exception types separately:
- <b><font color='blue'>JSONDecodeError</font></b> for malformed JSON,
- <b><font color='blue'>ValidationError</font></b> for data that doesn’t match the schema, and
- General exceptions for unexpected issues.

The `extract_json_from_llm_response` function handles text cleanup while `parse_review` handles validation, keeping concerns separated. In production, you’d want to log these errors or retry the LLM call with an improved prompt.

This example shows an LLM response with extra text that our parser handles correctly:

In [5]:
messy_response = '''
Here's the review in JSON format:

{
    "product_name": "Wireless Headphones X100",
    "rating": 4,
    "review_text": "Great sound quality, comfortable for long use.",
    "would_recommend": true
}

Hope this helps!
'''

review = parse_review(messy_response)
print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")

Product: Wireless Headphones X100
Rating: 4/5


The parser extracts the JSON block from the surrounding text and validates it against the <b><font color='blue'>ProductReview</font></b> schema.
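
One caveat: the greedy pattern `\{.*\}` spans from the first `{` to the last `}` in the response, so a stray brace in the trailing text makes `json.loads` fail. A possible alternative, sketched below under the assumption that the first parseable JSON object is the one you want, uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The function name `extract_first_json_object` is made up for this example:

```python
import json


def extract_first_json_object(response: str) -> dict:
    """Return the first JSON object that parses cleanly, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses a single JSON value beginning at `start`
            # and returns it together with the index where parsing stopped.
            obj, _end = decoder.raw_decode(response, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # Not valid JSON at this brace; try the next one.
        start = response.find("{", start + 1)
    raise ValueError("No JSON object found in response")


tricky = 'Result: {"rating": 4, "would_recommend": true} (placeholders like {this} are ignored)'
print(extract_first_json_object(tricky))
```

The result can be passed to `ProductReview(**data)` exactly as before; only the extraction step changes.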

### <b><font color='darkgreen'>Advanced: Using Pydantic validator (v1/v2)</font></b>
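If you want the model itself to accept messy text, you can add a `root_validator` (v1) or `model_validator` (v2). The example below uses the v1-style `root_validator`, which still runs under Pydantic v2 but emits a deprecation warning: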

In [1]:
import json
from pydantic import BaseModel, root_validator

class UserModel(BaseModel):
    name: str
    age: int

    @root_validator(pre=True)
    def extract_json(cls, values):
        # values could be a dict or raw string
        if isinstance(values, str):
            import re, json
            match = re.search(r'```json(.*?)```', values, re.DOTALL)
            if match:
                values = json.loads(match.group(1).strip())
        print(f'Values: {values}')
        return values

/tmp/ipykernel_258636/952699920.py:8: PydanticDeprecatedSince20: Pydantic V1 style `@root_validator` validators are deprecated. You should migrate to Pydantic V2 style `@model_validator` validators, see the migration guide for more details. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
  @root_validator(pre=True)


In [2]:
user = UserModel(**{'name': 'John', 'age': 30})
print(user)

Values: {'name': 'John', 'age': 30}
name='John' age=30
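
For comparison, here is a sketch of the same idea in Pydantic v2 style, assuming Pydantic 2.x. It uses `model_validator(mode="before")` and passes a raw string through `model_validate`, so the string-handling branch is actually exercised. The class name `UserModelV2` and the sample reply are invented for illustration:

```python
import json
import re
from typing import Any

from pydantic import BaseModel, model_validator


class UserModelV2(BaseModel):
    name: str
    age: int

    @model_validator(mode="before")
    @classmethod
    def extract_json(cls, data: Any) -> Any:
        # The input may already be a dict, or it may be raw LLM text.
        if isinstance(data, str):
            match = re.search(r"```json(.*?)```", data, re.DOTALL)
            raw = match.group(1) if match else data
            data = json.loads(raw.strip())
        return data


raw_reply = 'Sure, here you go:\n```json\n{"name": "John", "age": 30}\n```'

# model_validate accepts arbitrary input, so the "before" validator sees the string.
user = UserModelV2.model_validate(raw_reply)
print(user)
```

Note that `UserModelV2(**data)` only ever receives keyword arguments, so raw text has to go through `model_validate` to reach the validator.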


## <b><font color='darkblue'>Working with Nested Models</font></b>
<font size='3ptx'>Real-world data is rarely flat. Here’s how to handle nested structures like a product with multiple reviews and specifications:</font>

In [3]:
from pydantic import BaseModel, Field, field_validator
from typing import List


class Specification(BaseModel):
    key: str
    value: str


class Review(BaseModel):
    reviewer_name: str
    rating: int = Field(..., ge=1, le=5)
    comment: str
    verified_purchase: bool = False


class Product(BaseModel):
    id: str
    name: str
    price: float = Field(..., gt=0)
    category: str
    specifications: List[Specification]
    reviews: List[Review]
    average_rating: float = Field(..., ge=1, le=5)
    
    @field_validator('average_rating')
    @classmethod
    def check_average_matches_reviews(cls, v, info):
        reviews = info.data.get('reviews', [])
        if reviews:
            calculated_avg = sum(r.rating for r in reviews) / len(reviews)
            if abs(calculated_avg - v) > 0.1:
                raise ValueError(
                    f'Average rating {v} does not match calculated average {calculated_avg:.2f}'
                )
        return v

The <b><font color='blue'>Product</font></b> model contains lists of <b><font color='blue'>Specification</font></b> and <b><font color='blue'>Review</font></b> objects, and each nested model is validated independently. Using `Field(..., ge=1, le=5)` adds constraints directly to the field definition, where `ge` means “greater than or equal to”, `le` means “less than or equal to”, and `gt` means “greater than”.


The `check_average_matches_reviews` validator accesses other fields using `info.data`, allowing you to validate relationships between fields. Because `info.data` only contains fields that have already been validated, `reviews` must be declared before `average_rating` in the model. When you pass nested dictionaries to `Product(**data)`, Pydantic automatically creates the nested <b><font color='blue'>Specification</font></b> and <b><font color='blue'>Review</font></b> objects.

This structure ensures data integrity at every level. If a single review is malformed, you’ll know exactly which one and why.
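
If relying on field declaration order feels fragile, one alternative is a model-level validator that runs after all fields are validated. The sketch below assumes Pydantic v2 and uses a hypothetical, simplified `ProductSummary` model in which `average_rating` is deliberately declared before the ratings it depends on:

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError, model_validator


class ProductSummary(BaseModel):
    # Hypothetical model: average_rating comes first on purpose.
    average_rating: float = Field(..., ge=1, le=5)
    ratings: List[int]

    @model_validator(mode="after")
    def check_average_matches_ratings(self):
        # In "after" mode every field is already validated and available on self.
        if self.ratings:
            calculated = sum(self.ratings) / len(self.ratings)
            if abs(calculated - self.average_rating) > 0.1:
                raise ValueError(
                    f"Average rating {self.average_rating} does not match "
                    f"calculated average {calculated:.2f}"
                )
        return self


ok = ProductSummary(average_rating=4.5, ratings=[5, 4])
print(ok.average_rating)

try:
    ProductSummary(average_rating=3.0, ratings=[5, 4])
    mismatch_detected = False
except ValidationError as e:
    mismatch_detected = True
    print(e.errors()[0]["msg"])
```

The trade-off is that the error is attached to the model as a whole rather than to the `average_rating` field.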

This example shows how nested validation works with a complete product structure:

In [4]:
llm_response = {
    "id": "PROD-2024-001",
    "name": "Smart Coffee Maker",
    "price": 129.99,
    "category": "Kitchen Appliances",
    "specifications": [
        {"key": "Capacity", "value": "12 cups"},
        {"key": "Power", "value": "1000W"},
        {"key": "Color", "value": "Stainless Steel"}
    ],
    "reviews": [
        {
            "reviewer_name": "Alex M.",
            "rating": 5,
            "comment": "Makes excellent coffee every time!",
            "verified_purchase": True
        },
        {
            "reviewer_name": "Jordan P.",
            "rating": 4,
            "comment": "Good but a bit noisy",
            "verified_purchase": True
        }
    ],
    "average_rating": 4.5
}

product = Product(**llm_response)
print(f"{product.name}: ${product.price}")
print(f"Average Rating: {product.average_rating}")
print(f"Number of reviews: {len(product.reviews)}")

Smart Coffee Maker: $129.99
Average Rating: 4.5
Number of reviews: 2


Pydantic validates the entire nested structure in one call, checking that specifications and reviews are properly formed and that the average rating matches the individual review ratings.
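
To illustrate how a malformed nested item is reported, here is a small sketch with trimmed-down versions of the models (assumed for brevity, not the full definitions above). The `loc` tuple of each error gives the path to the failing value, including the list index:

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Review(BaseModel):
    reviewer_name: str
    rating: int = Field(..., ge=1, le=5)


class Product(BaseModel):
    name: str
    reviews: List[Review]


bad_product = {
    "name": "Smart Coffee Maker",
    "reviews": [
        {"reviewer_name": "Alex M.", "rating": 5},
        {"reviewer_name": "Jordan P.", "rating": 9},  # out of the 1-5 range
    ],
}

try:
    Product(**bad_product)
    error_locations = []
except ValidationError as e:
    # Each "loc" is a path: field name, list index, nested field name.
    error_locations = [err["loc"] for err in e.errors()]
    print(error_locations)  # [('reviews', 1, 'rating')]
```

The path `('reviews', 1, 'rating')` points at the second review's rating, which is enough detail to log or to include in a retry prompt.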

## <b><font color='darkblue'>Using Pydantic with LLM APIs and Frameworks</font></b>
<font size='3ptx'>So far, we’ve learned that <b>we need a reliable way to convert free-form text into structured, validated data.</b> Now let’s see how to use Pydantic validation with an LLM framework: [**LangChain**](https://www.langchain.com/), backed here by Google’s Gemini models through `langchain-google-genai`. Be sure to install the required SDKs.</font>

### <b><font color='darkgreen'>Using LangChain with Pydantic</font></b>
<font size='3ptx'><b>[LangChain](https://www.langchain.com/) provides built-in support for structured output extraction with Pydantic models</b>. There are two main approaches that handle the complexity of prompt engineering and parsing for you.</font>

In [11]:
!pip freeze | grep -P "(langchain|langchain-core|langchain-community|langchain-google-genai)"

langchain==1.2.0
langchain-classic==1.0.1
langchain-community==0.4.1
langchain-core==1.2.5
langchain-google-genai==4.1.2
langchain-text-splitters==1.1.0


In [34]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableSerializable, RunnableConfig
from pydantic import BaseModel, Field
from typing import Any, List, Optional

In [58]:
import getpass
import os

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google AI API key: ")


restaurant_text = """
Mama's Italian Kitchen is a cozy family-owned restaurant serving authentic 
Italian cuisine. Rated 4.5 stars, it's known for its homemade pasta and 
wood-fired pizzas. Prices are moderate ($$), and their signature dishes 
include lasagna bolognese and tiramisu.
"""

restaurant_text2 = """
Sakura Garden Sushi is a serene, contemporary bistro offering traditional Japanese flavors with a modern twist.
Rated 4.7 stars, it’s celebrated for its pristine sashimi and creative specialty rolls.
Prices are upscale ($$$), and their standout offerings include the bluefin tuna flight and matcha lava cake.
"""

In [53]:
import re

JSON_CONTENT_EXTRACT_PTN = r'''```json(?P<json_raw_text>.*)```'''


class DebugRunnable(RunnableSerializable):
    """Print out all data generated by LLM."""
    
    def invoke(self, input: Any, config: Optional[RunnableConfig] = None) -> Any:
        print("\n--- [DEBUG: LLM Raw Output] ---")
        # LLM outputs are usually AIMessage objects; 
        # .content gets you the actual string for cleaner viewing.
        if hasattr(input, "content"):
            print(input.content)
            if '```json' in input.content:
                print('Unexpected "```json" detected! Clean it out...')
                mth = re.search(JSON_CONTENT_EXTRACT_PTN, input.content, flags=re.DOTALL)
                # Guard against an unterminated fence, where the pattern finds no match.
                if mth:
                    json_raw_text = mth.group('json_raw_text')
                    input.content = json_raw_text
                    print(f'Extracted Json Raw Text:\n{json_raw_text}\n')
        else:
            print(input)
        print("--- [END DEBUG] ---\n")
        
        # Crucial: Return the input so the next step in the chain receives it
        return input

In [54]:
class Restaurant(BaseModel):
    """Information about a restaurant."""
    name: str = Field(description="The name of the restaurant")
    cuisine: str = Field(description="Type of cuisine served")
    price_range: str = Field(description="Price range with unit as NT dollar.")
    rating: Optional[float] = Field(default=None, description="Rating out of 5.0")
    specialties: List[str] = Field(description="Signature dishes or specialties")

In [55]:
def extract_restaurant_with_parser(text: str) -> Restaurant:
    """Extract restaurant info using LangChain's PydanticOutputParser."""
    
    parser = PydanticOutputParser(pydantic_object=Restaurant)
    
    prompt = PromptTemplate(
        template="Extract restaurant information from the following text.\n{format_instructions}\n{text}\n",
        input_variables=["text"],
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )

    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature=1.0,  # Gemini 3.0+ defaults to 1.0
        max_tokens=None,
        timeout=None,
        max_retries=2,
        # other params...
    )
    debug = DebugRunnable()
    chain = prompt | llm | debug | parser

    result = chain.invoke({"text": text})
    return result

The <b><font color='blue'>PydanticOutputParser</font></b> automatically generates format instructions from your Pydantic model, including field descriptions and type information. It works with any LLM that can follow instructions and doesn’t require function calling support. The chain syntax makes it easy to compose complex workflows.
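
To get a feel for what the model contributes to those instructions, you can inspect the JSON schema that Pydantic itself produces. This sketch only calls Pydantic's `model_json_schema()` on a shortened copy of the `Restaurant` model; it assumes Pydantic v2 and does not reproduce the exact text LangChain sends to the LLM:

```python
import json
from typing import List, Optional

from pydantic import BaseModel, Field


class Restaurant(BaseModel):
    """Information about a restaurant."""
    name: str = Field(description="The name of the restaurant")
    rating: Optional[float] = Field(default=None, description="Rating out of 5.0")
    specialties: List[str] = Field(description="Signature dishes or specialties")


# Field descriptions, types, and required fields all end up in the schema.
schema = Restaurant.model_json_schema()
print(json.dumps(schema, indent=2))
```

Clear `description` values matter here: they are the main hint the LLM gets about what each field should contain.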

In [59]:
try:
    restaurant_info = extract_restaurant_with_parser(restaurant_text2)    
except Exception as e:
    print(f"Error: {e}")


--- [DEBUG: LLM Raw Output] ---
```json
{
  "name": "Sakura Garden Sushi",
  "cuisine": "Japanese",
  "price_range": "upscale ($$$)",
  "rating": 4.7,
  "specialties": [
    "pristine sashimi",
    "creative specialty rolls",
    "bluefin tuna flight",
    "matcha lava cake"
  ]
}
```
Unexpected "```json" detected! Clean it out...
Extracted Json Raw Text:

{
  "name": "Sakura Garden Sushi",
  "cuisine": "Japanese",
  "price_range": "upscale ($$$)",
  "rating": 4.7,
  "specialties": [
    "pristine sashimi",
    "creative specialty rolls",
    "bluefin tuna flight",
    "matcha lava cake"
  ]
}


--- [END DEBUG] ---



In [57]:
print(f"Restaurant: {restaurant_info.name}")
print(f"Cuisine: {restaurant_info.cuisine}")
print(f"Specialties: {', '.join(restaurant_info.specialties)}")
print(f"Rating: {restaurant_info.rating}")

Restaurant: Sakura Garden Sushi
Cuisine: Japanese
Specialties: pristine sashimi, creative specialty rolls, bluefin tuna flight, matcha lava cake
Rating: 4.7


The second method is to use the native function calling capabilities of modern LLMs through the [`with_structured_output()` method](https://api.python.langchain.com/en/latest/chains/langchain.chains.structured_output.base.create_structured_output_runnable.html):

In [27]:
def extract_restaurant_structured(text: str) -> Restaurant:
    """Extract restaurant info using with_structured_output."""
    llm = ChatGoogleGenerativeAI(
        model="gemini-2.5-flash",
        temperature=1.0,  # Gemini 3.0+ defaults to 1.0
        max_tokens=None,
        timeout=None,
        max_retries=2,
        # other params...
    )
    
    structured_llm = llm.with_structured_output(Restaurant)
    
    prompt = PromptTemplate.from_template(
        "Extract restaurant information from the following text:\n\n{text}"
    )
    
    chain = prompt | structured_llm
    result = chain.invoke({"text": text})
    return result

In [28]:
try:
    restaurant_info = extract_restaurant_structured(restaurant_text)    
except Exception as e:
    print(f"Error: {e}")

In [29]:
print(f"Restaurant: {restaurant_info.name}")
print(f"Cuisine: {restaurant_info.cuisine}")
print(f"Specialties: {', '.join(restaurant_info.specialties)}")
print(f"Rating: {restaurant_info.rating}")

Restaurant: Mama's Italian Kitchen
Cuisine: Italian
Specialties: homemade pasta, wood-fired pizzas, lasagna bolognese, tiramisu
Rating: 4.5


This method produces cleaner, more concise code and makes use of the model’s native function calling capabilities for more reliable extraction. You don’t need to manually create parsers or format instructions, and it’s generally more accurate than prompt-based approaches.

## <b><font color='darkblue'>Retrying LLM Calls with Better Prompts</font></b>
<font size='3ptx'>When the LLM returns invalid data, you can <b>retry with an improved prompt that includes the error message from the failed validation attempt</b></font>:

In [30]:
from pydantic import BaseModel, ValidationError
from typing import Optional
import json


class EventExtraction(BaseModel):
    event_name: str
    date: str
    location: str
    attendees: int
    event_type: str


def extract_with_retry(llm_call_function, max_retries: int = 3) -> Optional[EventExtraction]:
    """Try to extract valid data, retrying with error feedback if validation fails."""
    last_error = None
    
    for attempt in range(max_retries):
        try:
            response = llm_call_function(last_error)
            data = json.loads(response)
            return EventExtraction(**data)
            
        except ValidationError as e:
            last_error = str(e)
            print(f"Attempt {attempt + 1} failed: {last_error}")
            
            if attempt == max_retries - 1:
                print("Max retries reached, giving up")
                return None
                
        except json.JSONDecodeError:
            print(f"Attempt {attempt + 1}: Invalid JSON")
            last_error = "The response was not valid JSON. Please return only valid JSON."
            
            if attempt == max_retries - 1:
                return None
    
    return None

Each retry includes the previous error message, helping the LLM understand what went wrong. After `max_retries`, the function returns `None` instead of crashing, allowing the calling code to handle the failure gracefully. Printing each attempt’s error makes it easy to debug why extraction is failing.

In a real application, your `llm_call_function` would construct a new prompt including the Pydantic error message, like "`Previous attempt failed with error: {error}. Please fix and try again.`"
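
A minimal sketch of such a function follows. The names `build_prompt`, `make_llm_call_function`, and `send_to_llm` are hypothetical; `send_to_llm` stands in for whatever client call returns the model's text reply:

```python
from typing import Callable, Optional

BASE_PROMPT = (
    "Extract the event details from the text below and return only valid JSON "
    "with the keys event_name, date, location, attendees, event_type.\n\n{text}"
)


def build_prompt(text: str, previous_error: Optional[str] = None) -> str:
    """Build the extraction prompt, appending error feedback on retries."""
    prompt = BASE_PROMPT.format(text=text)
    if previous_error:
        prompt += (
            f"\n\nPrevious attempt failed with error: {previous_error}. "
            "Please fix and try again."
        )
    return prompt


def make_llm_call_function(
    text: str, send_to_llm: Callable[[str], str]
) -> Callable[[Optional[str]], str]:
    """Wrap a client call so it matches the signature extract_with_retry expects."""
    def llm_call_function(previous_error: Optional[str] = None) -> str:
        return send_to_llm(build_prompt(text, previous_error))
    return llm_call_function


# Stand-in client that echoes the prompt, just to show what would be sent.
echo_call = make_llm_call_function("Tech Conference on June 15", lambda prompt: prompt)
print(echo_call("attendees: Field required"))
```

Swapping the echoing lambda for a real client call is the only change needed to plug this into `extract_with_retry`.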

This example shows the retry pattern with a mock LLM function that progressively improves:

In [31]:
def mock_llm_call(previous_error: Optional[str] = None) -> str:
    """Simulate an LLM that improves based on error feedback."""
    
    if previous_error is None:
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco"}'
    elif "field required" in previous_error.lower():
        # Second attempt: the missing fields are added, but attendees has the wrong type.
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco", "attendees": "about 500", "event_type": "Conference"}'
    else:
        # Third attempt: the type error has been fed back, so the response is valid.
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco", "attendees": 500, "event_type": "Conference"}'


result = extract_with_retry(mock_llm_call)


if result:
    print(f"\nSuccess! Extracted event: {result.event_name}")
    print(f"Expected attendees: {result.attendees}")
else:
    print("Failed to extract valid data")

Attempt 1 failed: 2 validation errors for EventExtraction
attendees
  Field required [type=missing, input_value={'event_name': 'Tech Conf...ation': 'San Francisco'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
event_type
  Field required [type=missing, input_value={'event_name': 'Tech Conf...ation': 'San Francisco'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.11/v/missing
Attempt 2 failed: 1 validation error for EventExtraction
attendees
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='about 500', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/int_parsing

Success! Extracted event: Tech Conference 2024
Expected attendees: 500

The first attempt misses the required `attendees` and `event_type` fields, the second attempt includes them but gives `attendees` the wrong type, and the third attempt gets everything correct. The mock tells the two failures apart by looking for `Field required` in the error text, since both error messages mention `attendees`. The retry mechanism handles these progressive improvements.

## <b><font color='darkblue'>Conclusion</font></b>
<font size='3ptx'><b>[Pydantic](https://docs.pydantic.dev/latest/) helps you turn unreliable LLM outputs into validated, type-safe data structures.</b> By combining clear schemas with robust error handling, you can build AI-powered applications that are both powerful and reliable.</font>

Here are the key takeaways:
* Define clear schemas that match your needs
* Validate everything and handle errors gracefully with retries and fallbacks
* Use type hints and validators to enforce data integrity
* Include schemas in your prompts to guide the LLM

Start with simple models and add validation as you find edge cases in your LLM outputs. Happy exploring!