# JSON Schema Factory + Pydantic Validation Workshop

## The Challenge: Taming Wild LLM Outputs 

**Problem**: LLMs are creative but unpredictable. Ask for structured data, get creative chaos.

**Solution**: **Factory Pattern** + **JSON Schema** + **Pydantic** = Structured AI Magic

### Function Call vs Custom Schema

**Modern LLM APIs** (OpenAI, Gemini, Anthropic) offer **Function Calling** for structured outputs:
- ✅ **Built-in**: Native JSON schema validation at API level
- ✅ **Convenient**: Direct integration with provider-specific tools  
- ❌ **Vendor Lock-in**: Tied to specific providers and their implementations
- ❌ **Limited Control**: Validation happens on their servers, not yours
- ❌ **Cost Impact**: Every validation requires an API call

**Our Custom Approach** offers deeper control and flexibility:
-  **Provider Independence**: Works with ANY LLM (OpenAI, Gemini, Anthropic, local models)
-  **Data Security**: Local validation, no sensitive business data sent to third parties
-  **Business Logic**: Custom validation rules beyond basic JSON schema
-  **Cost Control**: Validation happens locally, saving API costs
-  **Analytics**: Track validation patterns and data quality metrics locally

### The Magic Trio:
-  **JSON Schema Factory**: Templates that guide LLM output (like Function Call schemas)
-  **LLM Provider**: Generates data following our templates (any provider)
-  **Pydantic**: Validates and cleans results with business logic

### Real-World Impact:
Think of it like this: **Function Calls** are like ordering from a fixed restaurant menu, while **our approach** is like having your own kitchen with any chef and your own quality standards!

**Note**: This tutorial shows the foundational approach. While Gemini has native function calling, understanding this pattern gives you provider independence and deeper control.


## Step 1: Schema Templates - The Recipe Cards

**Setup Phase**: Import essential libraries for our schema factory pattern. We'll use Pydantic for validation, 
standard typing for type hints, and our custom AI client to demonstrate real LLM integration. This foundational 
setup enables us to build provider-independent structured data generation systems.

These are the "instruction manuals" we give to LLMs:

In [4]:
from typing import Dict, List, Optional, Type
from pydantic import BaseModel, Field, ValidationError
from enum import Enum
import json
import sys
import os

# Import our multi-provider AI client
sys.path.append('..')
from utils.client import AIClient, MockAIClient

print("🚀 Ready to build structured AI systems!")
print(f"📋 Available AI providers: {AIClient.get_available_providers()}")

🚀 Ready to build structured AI systems!
📋 Available AI providers: ['gemini', 'gemini-pro', 'openai', 'openai-gpt3', 'anthropic']


**Schema Templates - The Recipe Cards**: These JSON schemas serve as instruction manuals for LLMs, similar to 
OpenAI's function calling schemas but provider-independent. Each schema defines the exact structure we want, 
making LLM outputs predictable and consistent across different use cases.

In [5]:
# 🏪 Product Schema - for e-commerce applications
PRODUCT_SCHEMA = {
    "name": "string - product title",
    "price": "number - price in USD", 
    "tags": ["string", "string", "..."],
    "specs": {
        "weight": "string - with units",
        "color": "string - primary color"
    },
    "description": "string - brief product description"
}

# 👤 User Schema - for social/professional applications  
USER_SCHEMA = {
    "username": "string - 3-20 characters",
    "age": "integer - between 13-120",
    "email": "string - valid email format",
    "skills": ["string", "string", "..."],
    "location": "string - city, country format"
}

# 📝 Task Schema - for productivity applications
TASK_SCHEMA = {
    "title": "string - task description",
    "priority": "string - high|medium|low",
    "estimated_hours": "number - time estimate",
    "categories": ["string", "..."],
    "dependencies": ["string", "..."]
}

print("📋 Schema templates ready!")
print(f"📊 Schema types: Product, User, Task")
print(f"🎯 Each schema guides LLMs like Function Call schemas")

📋 Schema templates ready!
📊 Schema types: Product, User, Task
🎯 Each schema guides LLMs like Function Call schemas


## Step 2: Factory Pattern - The Smart Chef Selector

**Factory Pattern Implementation**: The Factory Pattern dynamically selects appropriate schemas based on data type,
similar to how function calling chooses the right function. This abstraction layer makes our system extensible and 
maintainable, allowing easy addition of new schema types without modifying existing code.

Different tasks need different templates. Factory Pattern chooses the right one:

In [6]:
class SchemaType(Enum):
    PRODUCT = "product"
    USER = "user"

class JSONSchemaFactory:
    """🏭 Factory Pattern: Smart schema selection"""
    
    _schemas = {
        SchemaType.PRODUCT: PRODUCT_SCHEMA,
        SchemaType.USER: USER_SCHEMA
    }
    
    @classmethod
    def get_llm_prompt(cls, schema_type: SchemaType, task: str) -> str:
        """🎯 Create LLM prompt with schema guidance"""
        schema = cls._schemas[schema_type]
        return f"""
Generate JSON data for: {task}

REQUIRED FORMAT:
{json.dumps(schema, indent=2)}

Return ONLY valid JSON, no extra text.
"""
    
    @classmethod
    def list_types(cls) -> List[str]:
        return [t.value for t in cls._schemas.keys()]

# Demo: Create prompts for different scenarios
factory = JSONSchemaFactory()

print("🏭 Factory created!")
print(f"Available types: {factory.list_types()}")

# Example: E-commerce product generation
prompt = factory.get_llm_prompt(SchemaType.PRODUCT, "Gaming laptop")
print(f"\n📝 Sample prompt preview: {prompt[:100]}...")

🏭 Factory created!
Available types: ['product', 'user']

📝 Sample prompt preview: 
Generate JSON data for: Gaming laptop

REQUIRED FORMAT:
{
  "name": "string - product title",
  "pr...


## Step 3: Pydantic Models - The Quality Inspector

**Pydantic Validation Models**: These models act as quality inspectors, ensuring LLM outputs meet our business 
requirements. Unlike basic JSON validation in function calls, Pydantic provides deep validation with custom rules, 
type coercion, and detailed error reporting for robust data processing.

These ensure LLM outputs meet our standards:

In [7]:
class ProductSpecs(BaseModel):
    weight: str
    color: str

class Product(BaseModel):
    name: str = Field(min_length=1, description="Product name")
    price: float = Field(gt=0, description="Must be positive")
    tags: List[str] = Field(min_items=1, description="At least one tag")
    specs: ProductSpecs

class User(BaseModel):
    username: str = Field(min_length=3, max_length=20)
    age: int = Field(ge=13, le=120, description="Realistic age range")
    email: str = Field(pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$')
    skills: List[str] = Field(default_factory=list)

print("✅ Validation models ready!")
print("🛡️  Built-in validation rules:")
print("   • Product price must be > 0")
print("   • User age: 13-120 years")
print("   • Email format checking")
print("   • Username length: 3-20 chars")

✅ Validation models ready!
🛡️  Built-in validation rules:
   • Product price must be > 0
   • User age: 13-120 years
   • Email format checking
   • Username length: 3-20 chars


## Step 4: Validation Factory - Automatic Quality Control

**Validation Factory - Quality Control Hub**: This factory maps schema types to their corresponding validators, 
creating a clean abstraction layer. It provides automatic validation routing and consistent error handling across 
all data types, making the system maintainable and extensible.

Connects schema types to their validators:

In [8]:
class ValidationFactory:
    """🎯 Maps schema types to validators"""
    
    _validators = {
        SchemaType.PRODUCT: Product,
        SchemaType.USER: User
    }
    
    @classmethod
    def validate(cls, schema_type: SchemaType, raw_data: Dict) -> BaseModel:
        """🔍 Validate LLM output"""
        validator = cls._validators[schema_type]
        return validator(**raw_data)  # Pydantic magic happens here!

print("🎯 Validation Factory ready!")
print("Now we can automatically validate any LLM output!")

🎯 Validation Factory ready!
Now we can automatically validate any LLM output!


## Step 5: The Complete Pipeline - Magic in Action ✨

Watch the full workflow: Schema → LLM Simulation → Validation

In [9]:
def demonstrate_pipeline(schema_type: SchemaType, task: str, mock_llm_output: Dict):
    """🎬 Full demonstration of the pipeline"""
    
    print(f"\n=== {task.upper()} PIPELINE ===")
    
    # Step 1: Generate LLM prompt
    prompt = JSONSchemaFactory.get_llm_prompt(schema_type, task)
    print(f"📝 Generated prompt ({len(prompt)} chars)")
    
    # Step 2: Simulate LLM response
    print(f"🤖 LLM Output: {json.dumps(mock_llm_output, indent=2)}")
    
    # Step 3: Validate with Pydantic
    try:
        validated = ValidationFactory.validate(schema_type, mock_llm_output)
        print(f"✅ VALIDATION SUCCESS!")
        print(f"📊 Result type: {type(validated).__name__}")
        print(f"🎯 Clean data ready for your app!")
        return validated
    except ValidationError as e:
        print(f"❌ VALIDATION FAILED:")
        for error in e.errors():
            print(f"   • {error['loc'][0]}: {error['msg']}")
        return None

# Demo 1: Perfect product data
good_product = {
    "name": "Gaming Laptop Pro",
    "price": 1299.99,
    "tags": ["gaming", "laptop", "high-performance"],
    "specs": {"weight": "2.5kg", "color": "black"}
}

demonstrate_pipeline(SchemaType.PRODUCT, "Gaming laptop", good_product)


=== GAMING LAPTOP PIPELINE ===
📝 Generated prompt (375 chars)
🤖 LLM Output: {
  "name": "Gaming Laptop Pro",
  "price": 1299.99,
  "tags": [
    "gaming",
    "laptop",
    "high-performance"
  ],
  "specs": {
    "weight": "2.5kg",
    "color": "black"
  }
}
✅ VALIDATION SUCCESS!
📊 Result type: Product
🎯 Clean data ready for your app!


Product(name='Gaming Laptop Pro', price=1299.99, tags=['gaming', 'laptop', 'high-performance'], specs=ProductSpecs(weight='2.5kg', color='black'))

**Real LLM Integration**: Now we integrate with actual LLM providers using our multi-provider client. This 
demonstrates how our schema factory works with real AI services like Gemini, while maintaining provider independence. 
The mock client provides a fallback for testing without API costs.

In [10]:
def demonstrate_real_llm_pipeline(schema_type: SchemaType, task: str, use_real_api: bool = False):
    """
    🎬 Complete pipeline with real LLM integration
    """
    
    print(f"\\n=== {task.upper()} - {'REAL API' if use_real_api else 'MOCK'} PIPELINE ===")
    
    # Step 1: Generate LLM prompt using our factory
    prompt = JSONSchemaFactory.get_llm_prompt(schema_type, task)
    print(f"📝 Generated structured prompt ({len(prompt)} chars)")
    
    # Step 2: Get LLM response (real or mock)
    try:
        if use_real_api:
            # Try real Gemini API (falls back to mock if no API key)
            try:
                client = AIClient("gemini")
                print(f"🤖 Using real Gemini API...")
                llm_response = client.simple_query(prompt)
            except Exception as e:
                print(f"⚠️  API unavailable ({str(e)[:50]}...), using mock")
                client = MockAIClient("gemini-mock")
                llm_response = client.simple_query(prompt)
        else:
            client = MockAIClient("mock")
            print(f"🧪 Using mock client for demo...")
            llm_response = client.simple_query(prompt)
        
        print(f"📤 LLM Response received ({len(llm_response)} chars)")
        
        # Step 3: Extract JSON from response (LLMs sometimes add extra text)
        try:
            # Try to find JSON in the response
            start_idx = llm_response.find('{')
            end_idx = llm_response.rfind('}') + 1
            if start_idx != -1 and end_idx != 0:
                json_str = llm_response[start_idx:end_idx]
                llm_data = json.loads(json_str)
            else:
                # Fallback: use mock data if JSON extraction fails
                print("⚠️  JSON extraction failed, using mock data")
                llm_data = get_mock_data(schema_type)
                
        except json.JSONDecodeError:
            print("⚠️  JSON parsing failed, using mock data")
            llm_data = get_mock_data(schema_type)
        
        print(f"🔍 Extracted data: {json.dumps(llm_data, indent=2)}")
        
        # Step 4: Validate with Pydantic
        try:
            validated = ValidationFactory.validate(schema_type, llm_data)
            print(f"✅ VALIDATION SUCCESS!")
            print(f"📊 Result type: {type(validated).__name__}")
            print(f"🎯 Clean, validated data ready for your application!")
            return validated
            
        except ValidationError:
            print(f"💡 Validation failed - this shows our quality control working!")
            return None
            
    except Exception as e:
        print(f"❌ Pipeline error: {str(e)}")
        return None

def get_mock_data(schema_type: SchemaType) -> Dict:
    """
    Generate realistic mock data for demos
    """
    mock_data = {
        SchemaType.PRODUCT: {
            "name": "Professional Gaming Laptop X1",
            "price": 1899.99,
            "tags": ["gaming", "laptop", "professional", "high-performance"],
            "specs": {"weight": "2.8kg", "color": "matte black"},
            "description": "High-performance gaming laptop designed for professional esports and content creation."
        },
        SchemaType.USER: {
            "username": "alex_dev",
            "age": 28,
            "email": "alex.developer@example.com",
            "skills": ["Python", "React", "Machine Learning", "Docker"],
            "location": "San Francisco, USA"
        },
        SchemaType.TASK: {
            "title": "Implement user authentication system",
            "priority": "high",
            "estimated_hours": 24.5,
            "categories": ["backend", "security"],
            "dependencies": ["database_setup", "user_model"]
        }
    }
    return mock_data[schema_type]

# Demo with different scenarios
print("🚀 Testing Real LLM Integration Pipeline:")

# Test 1: Product with mock
result1 = demonstrate_real_llm_pipeline(SchemaType.PRODUCT, "Gaming laptop for professionals", False)

# Test 2: User with real API (if available)
result2 = demonstrate_real_llm_pipeline(SchemaType.USER, "Software developer profile", True)

INFO:utils.client:🧪 MockAIClient initialized for testing (provider: mock)


🚀 Testing Real LLM Integration Pipeline:
\n=== GAMING LAPTOP FOR PROFESSIONALS - MOCK PIPELINE ===
📝 Generated structured prompt (393 chars)
🧪 Using mock client for demo...
📤 LLM Response received (105 chars)
⚠️  JSON extraction failed, using mock data
❌ Pipeline error: TASK
\n=== SOFTWARE DEVELOPER PROFILE - REAL API PIPELINE ===
📝 Generated structured prompt (338 chars)


INFO:utils.client:✅ gemini client initialized (model: gemini-2.5-flash)
INFO:utils.client:🗑️ Chat history cleared
INFO:utils.client:🗑️ Chat history cleared


🤖 Using real Gemini API...


INFO:httpx:HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/openai/chat/completions "HTTP/1.1 200 OK"


📤 LLM Response received (258 chars)
🔍 Extracted data: {
  "username": "coder_xyz_123",
  "age": 32,
  "email": "jane.doe@example.com",
  "skills": [
    "Python",
    "JavaScript",
    "React",
    "Node.js",
    "Docker",
    "AWS",
    "MongoDB",
    "Git"
  ],
  "location": "San Francisco, USA"
}
✅ VALIDATION SUCCESS!
📊 Result type: User
🎯 Clean, validated data ready for your application!


## Step 6: Error Handling - When LLMs Misbehave

**Error Handling & Quality Control**: This section demonstrates how our validation system catches various data 
quality issues that LLMs might produce. Unlike basic function calling validation, our system provides detailed 
error analysis and business rule enforcement, showing the value of local validation control.

See how Pydantic catches problems:

In [12]:
# Demo 2: Problematic user data (multiple errors)
bad_user = {
    "username": "jo",           # ❌ Too short (min 3 chars)
    "age": 150,                # ❌ Too old (max 120)
    "email": "not-an-email",   # ❌ Invalid format
    "skills": ["Python", "AI"] # ✅ This part is fine
}

print("🧪 TESTING ERROR HANDLING:")
demonstrate_pipeline(SchemaType.USER, "Social media user", bad_user)

print("\n💡 Key Insight: Pydantic catches ALL validation issues!")
print("   This prevents bad data from entering your system.")

🧪 TESTING ERROR HANDLING:

=== SOCIAL MEDIA USER PIPELINE ===
📝 Generated prompt (329 chars)
🤖 LLM Output: {
  "username": "jo",
  "age": 150,
  "email": "not-an-email",
  "skills": [
    "Python",
    "AI"
  ]
}
❌ VALIDATION FAILED:
   • username: String should have at least 3 characters
   • age: Input should be less than or equal to 120
   • email: String should match pattern '^[\w\.-]+@[\w\.-]+\.\w+$'

💡 Key Insight: Pydantic catches ALL validation issues!
   This prevents bad data from entering your system.


## Step 7: Real-World Application

**Production-Ready Service**: This final implementation showcases a complete, production-ready data generation 
service that combines all our patterns. It demonstrates error recovery, logging, and how to build reliable AI 
systems that can handle real-world complexity and edge cases gracefully.

Complete example: Building a data generation service

In [None]:
class AIDataService:
    """🏢 Production-ready service using our patterns"""
    
    def generate_structured_data(self, data_type: str, description: str) -> Dict:
        """🎯 Main service method"""
        
        # Convert string to enum
        schema_type = SchemaType(data_type)
        
        # Generate prompt
        prompt = JSONSchemaFactory.get_llm_prompt(schema_type, description)
        
        # In real app: send prompt to LLM API here
        # For demo: use mock data
        mock_responses = {
            SchemaType.PRODUCT: {
                "name": "Smart Watch X1",
                "price": 299.0,
                "tags": ["wearable", "fitness", "smart"],
                "specs": {"weight": "45g", "color": "silver"}
            },
            SchemaType.USER: {
                "username": "alice_dev",
                "age": 28,
                "email": "alice@example.com",
                "skills": ["Python", "React", "Machine Learning"]
            }
        }
        
        llm_output = mock_responses[schema_type]
        
        # Validate output
        try:
            validated_data = ValidationFactory.validate(schema_type, llm_output)
            return {
                "success": True,
                "data": validated_data.dict(),
                "message": "Data generated and validated successfully!"
            }
        except ValidationError as e:
            return {
                "success": False,
                "errors": [str(err) for err in e.errors()],
                "message": "Validation failed"
            }

# Test the service
service = AIDataService()

print("🏢 AI Data Service Demo:")
print("\n1. Product Generation:")
result1 = service.generate_structured_data("product", "Fitness smartwatch")
print(f"   Success: {result1['success']}")
print(f"   Product: {result1['data']['name']} - ${result1['data']['price']}")

print("\n2. User Generation:")
result2 = service.generate_structured_data("user", "Software developer profile")
print(f"   Success: {result2['success']}")
print(f"   User: {result2['data']['username']} ({result2['data']['age']} years old)")
print(f"   Skills: {', '.join(result2['data']['skills'][:2])}...")

You’ve learned how to use the Factory Pattern to dynamically select the right schema based on your data type. JSON Schemas help guide language models to produce consistent and structured output. With Pydantic, you can automatically validate and clean the data, catching errors early. This creates a smooth, reliable pipeline from prompt to validated result.

This approach is especially useful for:

* Managing and generating structured content
* Converting unstructured data into clean formats
* Ensuring consistent API responses
* Automatically creating valid test data

Some tips to keep in mind:

* Start with simple schemas and add complexity gradually
* Always validate LLM output before using it
* Keep track of schema versions over time
* Have fallback plans for handling errors

By mastering this pattern, you’re building AI systems that are predictable, safe, scalable, and easy to maintain!
