## <b><font color='darkblue'>Preface</font></b>
<font size='3ptx'><b>Large language models generate text, not structured data. Even when you prompt them to return structured data, they’re still generating text that looks like valid JSON.</b> The output may have incorrect field names, missing required fields, wrong data types, or extra text wrapped around the actual data. Without validation, these inconsistencies cause runtime errors that are difficult to debug.</font>

<b>[Pydantic](https://docs.pydantic.dev/latest/) helps you validate data at runtime using Python type hints</b>. It checks that LLM outputs match your expected schema, converts types automatically where possible, and provides clear error messages when validation fails. This gives you a reliable contract between the LLM’s output and your application’s requirements.

### <b><font color='darkgreen'>Agenda</font></b>
Topics we will cover include:
* Designing robust Pydantic models (including custom validators and nested schemas).
* Parsing “messy” LLM outputs safely and surfacing precise validation errors.
* Integrating validation with OpenAI, LangChain, and LlamaIndex, plus retry strategies.

![ui](https://machinelearningmastery.com/wp-content/uploads/2025/12/mlm-complete-guide-pydantic-featured-image.jpeg)

## <b><font color='darkblue'>Getting Started</font></b>
<font size='3ptx'><b>Let’s start with a simple example by building a tool that extracts contact information from text.</b> The LLM reads unstructured text and returns structured data that we validate with Pydantic:</font>

In [2]:
from pydantic import BaseModel, EmailStr, field_validator
from typing import Optional

class ContactInfo(BaseModel):
    name: str
    email: EmailStr
    phone: Optional[str] = None
    company: Optional[str] = None
    
    @field_validator('phone')
    @classmethod
    def validate_phone(cls, v):
        if v is None:
            return v
        cleaned = ''.join(filter(str.isdigit, v))
        if len(cleaned) < 10:
            raise ValueError('Phone number must have at least 10 digits')
        return cleaned

<b>All Pydantic models inherit from [**BaseModel**](https://docs.pydantic.dev/latest/api/base_model/#BaseModel), which provides automatic validation</b>. Type hints like `name: str` tell Pydantic which types to enforce at runtime. The `EmailStr` type validates email format without needing a custom regex; it relies on the optional `email-validator` package, which you can install with `pip install "pydantic[email]"`. Fields declared as `Optional[str] = None` can be missing or null. The `@field_validator` decorator lets you add custom validation logic, such as stripping a phone number down to its digits and checking its length.
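
Pydantic also coerces compatible values in its default (lax) mode, which matters because LLMs often emit numbers and booleans as strings. The following is a minimal sketch of that behavior; `RatingSketch` is a hypothetical model introduced only for illustration and is not part of the example above.

```python
from pydantic import BaseModel, ValidationError


class RatingSketch(BaseModel):
    # Hypothetical model, used only to illustrate type coercion.
    rating: int
    would_recommend: bool


# Strings that look like an int and a bool are converted in lax mode.
coerced = RatingSketch(rating="4", would_recommend="true")
print(coerced.model_dump())

# A string that cannot be read as an int is rejected instead of guessed at.
coercion_failed = False
try:
    RatingSketch(rating="four", would_recommend="true")
except ValidationError as e:
    coercion_failed = True
    print(e.errors()[0]["type"])
```

Coercion only applies where the conversion is unambiguous; anything else surfaces as a `ValidationError`.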

Here’s how to use the model to validate sample LLM output:

In [3]:
import json

llm_response = '''
{
    "name": "Sarah Johnson",
    "email": "sarah.johnson@techcorp.com",
    "phone": "(555) 123-4567",
    "company": "TechCorp Industries"
}
'''

data = json.loads(llm_response)
contact = ContactInfo(**data)

print(contact.name) 
print(contact.email)  
print(contact.model_dump())

Sarah Johnson
sarah.johnson@techcorp.com
{'name': 'Sarah Johnson', 'email': 'sarah.johnson@techcorp.com', 'phone': '5551234567', 'company': 'TechCorp Industries'}


When you create a <b><font color='blue'>ContactInfo</font></b> instance, <b>Pydantic validates everything automatically. If validation fails, you get a clear error message telling you exactly what went wrong</b>.
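
To see what such an error looks like, here is a small sketch of a failing validation. `ContactSketch` is a hypothetical stand-in for the contact model above; its `email` field is typed as a plain `str` so the snippet runs without the optional `email-validator` dependency. The input data is made up for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class ContactSketch(BaseModel):
    # Hypothetical stand-in for ContactInfo (email is a plain str here).
    name: str
    email: str
    phone: Optional[str] = None

    @field_validator('phone')
    @classmethod
    def validate_phone(cls, v):
        if v is None:
            return v
        cleaned = ''.join(filter(str.isdigit, v))
        if len(cleaned) < 10:
            raise ValueError('Phone number must have at least 10 digits')
        return cleaned


# Illustrative bad input: 'name' is missing and the phone number is too short.
bad_data = {"email": "sarah@example.com", "phone": "555-1234"}

failures = {}
try:
    ContactSketch(**bad_data)
except ValidationError as e:
    # e.errors() returns one dict per failing field.
    for err in e.errors():
        failures[err["loc"][0]] = err["type"]
        print(err["loc"], err["type"], err["msg"])
```

Pydantic collects every failing field into a single `ValidationError` rather than stopping at the first problem, which makes the error list a useful thing to feed back into a corrective prompt.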

## <b><font color='darkblue'>Parsing and Validating LLM Outputs</font></b>
<font size='3ptx'><b>LLMs don’t always return perfect JSON</b>. Sometimes they add markdown formatting, explanatory text, or mess up the structure. Here’s how to handle these cases:</font>

In [4]:
from pydantic import BaseModel, ValidationError, field_validator
import json
import re

class ProductReview(BaseModel):
    product_name: str
    rating: int
    review_text: str
    would_recommend: bool
    
    @field_validator('rating')
    @classmethod
    def validate_rating(cls, v):
        if not 1 <= v <= 5:
            raise ValueError('Rating must be an integer between 1 and 5')
        return v


def extract_json_from_llm_response(response: str) -> dict:
    """Extract JSON from LLM response that might contain extra text."""
    json_match = re.search(r'\{.*\}', response, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    raise ValueError("No JSON found in response")


def parse_review(llm_output: str) -> ProductReview:
    """Safely parse and validate LLM output."""
    try:
        data = extract_json_from_llm_response(llm_output)
        review = ProductReview(**data)
        return review
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        raise
    except ValidationError as e:
        print(f"Validation error: {e}")
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

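This approach uses a regex to find JSON within the response text, which handles cases where the LLM adds explanatory text before or after the data. The pattern is greedy: it matches from the first `{` to the last `}`, so it assumes the response contains a single JSON object and no stray braces in the surrounding text. We catch different exception types separately: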
- <b><font color='blue'>JSONDecodeError</font></b> for malformed JSON,
- <b><font color='blue'>ValidationError</font></b> for data that doesn’t match the schema, and
- General exceptions for unexpected issues.

The `extract_json_from_llm_response` function handles text cleanup while `parse_review` handles validation, keeping concerns separated. In production, you’d want to log these errors or retry the LLM call with an improved prompt.
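
The retry idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production implementation: `parse_with_retry`, `ReviewSketch`, and the `call_llm` callable are hypothetical names, and `fake_llm` stands in for a real API call.

```python
import json
import re
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class ReviewSketch(BaseModel):
    # Hypothetical, trimmed-down review schema for this sketch.
    product_name: str
    rating: int = Field(ge=1, le=5)


def parse_with_retry(call_llm: Callable[[str], str], prompt: str,
                     max_attempts: int = 3) -> ReviewSketch:
    """Call the LLM, validate the reply, and re-prompt with the error on failure."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            match = re.search(r'\{.*\}', raw, re.DOTALL)
            if match is None:
                raise ValueError("No JSON found in response")
            return ReviewSketch(**json.loads(match.group()))
        except (ValueError, ValidationError) as e:
            # JSONDecodeError is a ValueError, so malformed JSON lands here too.
            last_error = e
            prompt = (f"{prompt}\n\nYour previous reply was invalid: {e}\n"
                      "Return only valid JSON.")
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error


# Fake LLM for demonstration: the first reply has an out-of-range rating.
replies = iter([
    'Sorry, here you go: {"product_name": "X100", "rating": 9}',
    '{"product_name": "X100", "rating": 4}',
])


def fake_llm(prompt: str) -> str:
    return next(replies)


review = parse_with_retry(fake_llm, "Extract the review as JSON.")
print(review)
```

Appending the validation error to the prompt gives the model a concrete hint about what to fix, and capping the number of attempts keeps a persistently bad response from looping forever.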

This example shows an LLM response with extra text that our parser handles correctly:

In [5]:
messy_response = '''
Here's the review in JSON format:

{
    "product_name": "Wireless Headphones X100",
    "rating": 4,
    "review_text": "Great sound quality, comfortable for long use.",
    "would_recommend": true
}

Hope this helps!
'''

review = parse_review(messy_response)
print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")

Product: Wireless Headphones X100
Rating: 4/5


The parser extracts the JSON block from the surrounding text and validates it against the <b><font color='blue'>ProductReview</font></b> schema.

### <b><font color='darkgreen'>Advanced: Using Pydantic validator (v1/v2)</font></b>
If you want the model itself to accept messy text, you can add a `root_validator` (Pydantic v1, deprecated in v2) or a `model_validator` (Pydantic v2):

In [16]:
import json
import re
from typing import Any

from pydantic import BaseModel, model_validator

class UserModel(BaseModel):
    name: str
    age: int

    # Pydantic v2 replacement for the deprecated @root_validator(pre=True)
    @model_validator(mode='before')
    @classmethod
    def extract_json(cls, values: Any) -> Any:
        # values is a dict for UserModel(**data), or a raw string
        # when calling UserModel.model_validate(text)
        if isinstance(values, str):
            match = re.search(r'```json(.*?)```', values, re.DOTALL)
            if match:
                values = json.loads(match.group(1).strip())
        print(f'Values: {values}')
        return values


In [18]:
user = UserModel(**{'name': 'John', 'age': 30})
print(user)

Values: {'name': 'John', 'age': 30}
name='John' age=30
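
The call above only passes a plain dict, so the string branch never runs. As a sketch of the messy-text case, the snippet below feeds a raw string with a markdown-fenced JSON block through `model_validate`, using a Pydantic v2 `model_validator(mode="before")`. `FencedUser` and the sample text are hypothetical, and the fence marker is built at runtime purely so the example can be shown inside a fenced code block.

```python
import json
import re
from typing import Any

from pydantic import BaseModel, model_validator

FENCE = "`" * 3  # three backticks, built at runtime


class FencedUser(BaseModel):
    # Hypothetical model mirroring the name/age example.
    name: str
    age: int

    @model_validator(mode="before")
    @classmethod
    def extract_json(cls, data: Any) -> Any:
        # Raw strings are searched for a fenced json block; dicts pass through.
        if isinstance(data, str):
            match = re.search(FENCE + r"json(.*?)" + FENCE, data, re.DOTALL)
            if match:
                data = json.loads(match.group(1).strip())
        return data


raw = (
    f"Sure! Here is the user:\n{FENCE}json\n"
    f'{{"name": "John", "age": 30}}\n{FENCE}\nAnything else?'
)

# model_validate hands the raw input to the 'before' validator first.
user = FencedUser.model_validate(raw)
print(user)
```

If the string contains no fenced block, the validator returns it unchanged and Pydantic raises a `ValidationError`, so bad input still fails loudly.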


## <b><font color='darkblue'>Working with Nested Models</font></b>
<font size='3ptx'>Real-world data is rarely flat. Here’s how to handle nested structures like a product with multiple reviews and specifications:</font>

In [6]:
from pydantic import BaseModel, Field, field_validator
from typing import List

class Specification(BaseModel):
    key: str
    value: str

class Review(BaseModel):
    reviewer_name: str
    rating: int = Field(..., ge=1, le=5)
    comment: str
    verified_purchase: bool = False
    
class Product(BaseModel):
    id: str
    name: str
    price: float = Field(..., gt=0)
    category: str
    specifications: List[Specification]
    reviews: List[Review]
    average_rating: float = Field(..., ge=1, le=5)
    
    @field_validator('average_rating')
    @classmethod
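    def check_average_matches_reviews(cls, v, info):
        # info.data holds only fields validated before this one, so
        # 'reviews' must be declared above 'average_rating' in the class.
        reviews = info.data.get('reviews', [])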
        if reviews:
            calculated_avg = sum(r.rating for r in reviews) / len(reviews)
            if abs(calculated_avg - v) > 0.1:
                raise ValueError(
                    f'Average rating {v} does not match calculated average {calculated_avg:.2f}'
                )
        return v