<a href="https://colab.research.google.com/github/mohitseventeens/Pydantic-for-LLM-Workflow/blob/main/Pydantic_for_LLM_Application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📓 Pydantic for LLM Workflows: Course Notes

**Author:** Mohit Sonkamble <br>
**Date:** August 27, 2025

## 📝 Course Structure & Table of Contents

This course is divided into several key modules. Click the links below to navigate to the relevant section.

*   **Part 1: [Welcome and Introduction](#part1)**
    *   Welcome to Pydantic for LLM Workflows (Video - 3 mins)
    *   Introduction to Pydantic for LLM Workflows (Video - 10 mins)

*   **Part 2: [Pydantic Fundamentals](#part2)**
    *   Pydantic model basics (Video with Code Example - 13 mins)

*   **Part 3: [LLM Response Handling](#part3)**
    *   Validating LLM responses (Video with Code Example - 15 mins)
    *   Passing a Pydantic model in your API call (Video with Code Example - 9 mins)

*   **Part 4: [Advanced Applications](#part4)**
    *   Tool calling (Video with Code Example - 19 mins)

*   **Part 5: [Conclusion](#part5)**
    *   Conclusion (Video - 1 min)

In [3]:
# @title Install Required Libraries
# It's good practice to run this cell first to ensure all dependencies are installed.

%pip install -qU 'pydantic[email]'

print("✅ All required libraries have been installed successfully!")

✅ All required libraries have been installed successfully!


---
## <a id="part1"></a>Part 1: Welcome and Introduction

### **Welcome to Pydantic for LLM Workflows**

#### Concept: LLMs in Software vs. Standalone Tasks

1.  **Standalone Use:** Tasks like summarizing a document, translating text, or generating an essay. The output is consumed directly by a human, and slight variations in format are generally acceptable.
2.  **Integrated Use (in a Software System):** Using an LLM as one component in a larger application. Here, the LLM's output must be predictable, structured, and reliable because it will be passed to another function, API, or database.

#### Example: The Customer Service Chatbot Workflow

The diagram below shows how different user queries are handled in an automated support system.

<img src="https://raw.githubusercontent.com/mohitseventeens/Pydantic-for-LLM-Workflow/main/lesson1-customer-support-example.png" width="600">


**Workflow Deconstruction:**

*   **Simple Queries:** Some user issues, like *"I forgot my password"*, can be handled by simple, deterministic tools. A bot can recognize this intent and use a function like `lookup_faq_answer()` to provide a direct response ("Here's a reset link!").
*   **Complex Queries:** More nuanced or emotional queries, like *"I'm not happy with this product!"*, are ambiguous and require deeper understanding. This is where the LLM is used.
*   **The LLM's Critical Role (The "Magic"):** The LLM doesn't just generate a conversational reply. Its primary job here is to **structure the unstructured user input**. It processes the complaint and transforms it into a clean, predictable JSON object with predefined fields:
    *   `name`, `email` (user info)
    *   `query` (the original message)
    *   `priority`, `category`, `is_complaint` (classifications)
    *   `tags` (for routing and analysis)
*   **System Integration:** This structured JSON data is then used to create a formal **Support Ticket**. This ticket can now be reliably processed by the rest of the software system—logged in a database, assigned to a human agent, and tracked.

---

### Key Takeaway & The Need for Pydantic

The core challenge is ensuring the LLM *always* produces the correctly formatted JSON. If the LLM returns a key with a typo, uses a string instead of a boolean, or misses a field entirely, the downstream system (`Support Ticket` creation) will break.
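
To make this failure mode concrete, here is a minimal sketch of a downstream step that trusts the LLM's JSON without validating it. The function `create_support_ticket` and both sample outputs are hypothetical, invented for illustration; they are not part of the course code.

```python
import json

def create_support_ticket(data: dict) -> str:
    """Hypothetical downstream step that trusts the LLM's JSON blindly."""
    # Direct key access raises KeyError if the LLM misspelled or omitted a field.
    label = "COMPLAINT" if data["is_complaint"] else "REQUEST"
    return f"[{label}] {data['name']} <{data['email']}>: {data['query']}"

# Simulated LLM output with a misspelled key ("is_complant").
bad_output = (
    '{"name": "Joe User", "email": "joe.user@example.com", '
    '"query": "Not happy!", "is_complant": true}'
)

try:
    create_support_ticket(json.loads(bad_output))
except KeyError as exc:
    print(f"Ticket creation failed, missing key: {exc}")

# A wrong type fails more quietly: the non-empty string "false" is truthy in Python.
wrong_type = {
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "Where is my order?",
    "is_complaint": "false",
}
print(create_support_ticket(wrong_type))  # labelled COMPLAINT despite "false"
```

The typo fails loudly, while the wrong type silently produces a mislabelled ticket, which is arguably the worse of the two outcomes.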

This is the problem Pydantic solves. It acts as a validator and parser at the boundary between the LLM and your system: data either **conforms to the structure and types you define** or is rejected with a detailed error. Note that Pydantic checks structure and types; it cannot tell you whether the content the LLM produced is factually right.

Pydantic isn't just a new, trendy library; it is a long-established and widely used data validation tool in the Python ecosystem, which makes it a reliable choice for building robust applications.

---
### **Introduction to Pydantic for LLM Workflows**

This lesson builds on the customer support example to introduce the common challenges of working with LLMs in software and how Pydantic provides elegant solutions.

#### The Core Challenge: Unreliable LLM Outputs

Even when you ask an LLM for a structured output like JSON, you can't guarantee its format. Common issues include:

*   **Extraneous Text:** The LLM often wraps the JSON in conversational text, like *"Here's the JSON you requested!"*
*   **Code Fences:** The JSON is frequently enclosed in Markdown triple backticks (```json ... ```).
*   **Inconsistent Formatting:** The LLM might make mistakes like adding a trailing comma, using single quotes instead of double quotes, or generating a string where a number is expected.

<img src="https://raw.githubusercontent.com/mohitseventeens/Pydantic-for-LLM-Workflow/main/lesson2-issues-LLM.png" width="600">

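Most of these inconsistencies cause a standard JSON parser to fail. The rest, such as a string where a number is expected, parse cleanly but break the code that consumes the data. Either way, your application breaks.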
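
As a quick illustration, the sketch below feeds a few simulated LLM responses (made up for this example) to Python's standard `json.loads`. It is only meant to show which of the issues listed above the parser catches and which one it lets through.

```python
import json

fence = "`" * 3  # a Markdown code fence, built here to avoid nesting fences

# Simulated raw LLM responses, one per issue described above.
raw_outputs = {
    "extraneous text": 'Here\'s the JSON you requested! {"priority": "high"}',
    "code fence": fence + 'json\n{"priority": "high"}\n' + fence,
    "trailing comma": '{"priority": "high",}',
    "single quotes": "{'priority': 'high'}",
    "wrong type": '{"order_id": "12345"}',
}

results = {}
for label, raw in raw_outputs.items():
    try:
        results[label] = json.loads(raw)
    except json.JSONDecodeError as exc:
        results[label] = f"JSONDecodeError: {exc.msg}"
    print(f"{label}: {results[label]}")
```

The first four cases raise `JSONDecodeError`. The last one parses without complaint even though `order_id` arrived as a string, so a JSON parser alone does not guard against type mistakes.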

#### Approach 1: Basic Prompting & Error Chaining

The simplest method is to engineer a detailed prompt that includes an example of the desired JSON format.

<img src="https://raw.githubusercontent.com/mohitseventeens/Pydantic-for-LLM-Workflow/main/lesson2-method1+pydantic.png" width="600">

This workflow often involves a loop:
1.  Send the detailed prompt to the LLM.
2.  Try to parse the output.
3.  If parsing fails, catch the error, include it in a new prompt, and ask the LLM to fix its own mistake.

While this can work, it's brittle: nothing checks that output which parses as JSON also has the right fields and types. A more robust approach is to formalize the validation process with Pydantic.
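
The loop in steps 1 to 3 above can be sketched roughly as follows. `call_llm` is a hypothetical stand-in that fakes a model so the example runs offline; in practice you would replace it with a call to your LLM client. The function name `get_json_with_retries` and the retry prompt wording are likewise assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call.

    Returns malformed output on the first prompt and clean JSON once the
    prompt contains error feedback, purely to exercise the loop below.
    """
    if "failed to parse" in prompt:
        return '{"priority": "high", "is_complaint": true}'
    return "Sure! Here is the JSON: {priority: high}"

def get_json_with_retries(prompt: str, max_attempts: int = 3) -> dict:
    """Ask for JSON and, on a parse failure, feed the error back to the model."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        raw = call_llm(current_prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            print(f"Attempt {attempt} failed: {exc.msg}")
            # Include the bad reply and the parser error in the next prompt.
            current_prompt = (
                f"{prompt}\n\nYour previous reply failed to parse as JSON.\n"
                f"Reply: {raw}\nError: {exc}\nReturn only corrected JSON."
            )
    raise ValueError(f"No valid JSON after {max_attempts} attempts")

ticket = get_json_with_retries("Classify this query as JSON: 'I am not happy with this product!'")
print(ticket)
```

Note that the loop only checks that the reply is parseable JSON. It says nothing about whether the expected fields are present or correctly typed.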

#### Approach 2: Robust Validation with Pydantic

Pydantic provides a definitive schema for your data. Instead of just showing the LLM an example in a prompt, you define a strict structure in code.

The Pydantic-powered workflow is as follows:
1.  **Define a Schema:** You create a class inheriting from `pydantic.BaseModel`. This class defines every expected field, its data type (`str`, `bool`, `EmailStr`), and even allowed values (e.g., `Literal['refund_request', 'information_request']`).
2.  **Prompt the LLM:** You can still use a similar prompt, but now your system has a ground truth for what the output *must* look like.
3.  **Validate the Output:** The raw JSON string from the LLM is passed to a Pydantic method like `CustomerQuery.model_validate_json()`.
4.  **Get Structured Data:**
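    *   **On Success:** Pydantic parses the string into a validated Python object (`valid_data`) with the declared structure and types, coercing compatible values where it safely can. This object is ready for the "Next step in the system!" Note that `model_validate_json()` expects JSON only; surrounding text or code fences must be stripped first, otherwise they surface as a validation error.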
    *   **On Failure:** Pydantic raises a detailed `ValidationError` that explains exactly what was wrong with the LLM's output.
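
A minimal sketch of this workflow is shown below, assuming Pydantic v2. The field names follow the support-ticket example from earlier, but the exact `Literal` values and both JSON strings are illustrative assumptions. The `email` field is a plain `str` here to keep the sketch free of the optional email dependency; the notes use `EmailStr`.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class CustomerQuery(BaseModel):
    name: str
    email: str  # EmailStr in the notes; plain str keeps this sketch dependency-free
    query: str
    priority: Literal["low", "medium", "high"]  # allowed values are assumed
    category: Literal["refund_request", "information_request", "other"]
    is_complaint: bool
    tags: list[str]

# Simulated LLM output that matches the schema.
good = """{"name": "Joe User", "email": "joe.user@example.com",
 "query": "I want my money back.", "priority": "high",
 "category": "refund_request", "is_complaint": true, "tags": ["refund"]}"""

# Simulated LLM output with an unknown category and a non-boolean flag.
bad = """{"name": "Joe User", "email": "joe.user@example.com",
 "query": "I want my money back.", "priority": "high",
 "category": "refund", "is_complaint": "maybe", "tags": ["refund"]}"""

valid_data = CustomerQuery.model_validate_json(good)
print(valid_data.category, valid_data.is_complaint)

failed_fields = []
try:
    CustomerQuery.model_validate_json(bad)
except ValidationError as exc:
    # Each error entry names the offending field in its "loc" tuple.
    failed_fields = [err["loc"][0] for err in exc.errors()]
    print("Rejected fields:", failed_fields)
```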

#### Advanced Use Case: Tool Calling

Pydantic's utility extends beyond simple validation. It is also a common way to define the argument schemas used in **Tool Calling** (also known as Function Calling).

In this paradigm, the LLM acts as an intelligent router. It decides which of your application's functions (tools) to call based on the user's query.

<img src="https://raw.githubusercontent.com/mohitseventeens/Pydantic-for-LLM-Workflow/main/lesson2-tool-calling.png" width="600">

Here's how Pydantic enables this:
1.  **Define Tool Arguments:** You use a Pydantic `BaseModel` (e.g., `FAQLookupArgs`) to define the exact arguments and their types that a function (e.g., `lookup_faq_answer`) requires.
2.  **Provide Tool Definitions:** You provide the LLM with a list of available tools, using the Pydantic model's schema to describe the parameters for each tool.
3.  **LLM Orchestration:** When a user asks a question like "I forgot my password," the LLM recognizes that the `lookup_faq_answer` tool is the best fit. It then uses the provided schema to construct a valid JSON object containing the required arguments and asks your system to execute that function.

This turns the LLM from a simple text generator into an orchestrator that can interact with your code in a structured and reliable way.
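
The three steps can be sketched without a real LLM, assuming Pydantic v2. The fields of `FAQLookupArgs`, the body of `lookup_faq_answer`, the layout of `tool_definition`, and the `simulated_call` dictionary are all assumptions made for illustration; each LLM provider defines its own tool-definition and tool-call formats.

```python
from pydantic import BaseModel, Field

class FAQLookupArgs(BaseModel):
    """Arguments for the lookup_faq_answer tool (field names are assumed)."""
    query: str = Field(description="The user's question")
    tags: list[str] = Field(default_factory=list, description="Optional keywords")

def lookup_faq_answer(args: FAQLookupArgs) -> str:
    """Toy FAQ lookup; a real system would search a knowledge base."""
    if "password" in args.query.lower():
        return "Here's a reset link!"
    return "No FAQ entry found."

# Step 2: describe the tool's parameters with the model's JSON Schema.
# The outer dict layout is generic, not any specific provider's format.
tool_definition = {
    "name": "lookup_faq_answer",
    "description": "Look up an answer in the FAQ",
    "parameters": FAQLookupArgs.model_json_schema(),
}

# Step 3: a simulated tool call, shaped as an LLM might return it.
simulated_call = {
    "name": "lookup_faq_answer",
    "arguments": '{"query": "I forgot my password"}',
}

# Map tool names to (argument model, function) and validate before executing.
tools = {"lookup_faq_answer": (FAQLookupArgs, lookup_faq_answer)}
args_model, func = tools[simulated_call["name"]]
args = args_model.model_validate_json(simulated_call["arguments"])
answer = func(args)
print(answer)
```

Validating the arguments before calling the function means a malformed tool call raises a `ValidationError` at the boundary instead of failing somewhere inside the tool.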

---
## <a id="part2"></a>Part 2: Pydantic Model Basics

In this lesson, we'll learn the fundamentals of Pydantic models for data validation. We'll see how to define data models, validate user input, and handle validation errors gracefully using the customer support example.

By the end of this lesson, we'll be able to:
- Create Pydantic models to validate user input data.
- Handle validation errors with proper `try...except` blocks.
- Use optional fields and field constraints in our models.
- Work with JSON data validation methods.

### 1. Setup

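First, let's import the necessary libraries. `BaseModel` is the core component we'll inherit from, and `ValidationError` is the exception Pydantic raises when validation fails. `EmailStr` relies on the `email-validator` package, which the `pydantic[email]` install above provides.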

In [12]:
# Import libraries needed for the lesson
from pydantic import BaseModel, ValidationError, EmailStr, Field
from typing import Optional
from datetime import date
import json

### 2. Defining a Basic Model

We start by defining a class that inherits from `BaseModel`. Each attribute is given a type hint (e.g., `name: str`). Pydantic uses these hints to enforce the data types. `EmailStr` is a special Pydantic type that validates that a string is a properly formatted email address.

In [13]:
# Create a Pydantic model for validating user input
class UserInput(BaseModel):
    name: str
    email: EmailStr
    query: str

# Create a model instance with valid data
user_input = UserInput(
    name="Joe User",
    email="joe.user@example.com",
    query="I forgot my password."
)
print("--- Successful Instantiation ---")
print(user_input)

--- Successful Instantiation ---
name='Joe User' email='joe.user@example.com' query='I forgot my password.'


If we provide data that doesn't satisfy a field's type or format, Pydantic will raise a `ValidationError`.

**Note:** The following cell is expected to produce a `ValidationError` because the email address is invalid.

In [14]:
# Attempt to create another model instance with an invalid email
try:
    user_input = UserInput(
        name="Joe User",
        email="not-an-email",
        query="I forgot my password."
    )
    print(user_input)
except ValidationError as e:
    print("--- Expected Validation Error ---")
    print(e)

--- Expected Validation Error ---
1 validation error for UserInput
email
  value is not a valid email address: An email address must have an @-sign. [type=value_error, input_value='not-an-email', input_type=str]


### 3. Graceful Error Handling

To prevent crashes, we should always wrap our validation logic in a `try...except` block. This function demonstrates how to catch a `ValidationError` and print a user-friendly summary of the errors.

In [16]:
# Define a function to handle user input validation safely
def validate_user_input(input_data):
    """Validates user input data against the UserInput model."""
    try:
        user_input = UserInput(**input_data)
        print("✅ Valid user input created:")
        print(f"{user_input.model_dump_json(indent=2)}")
        return user_input
    except ValidationError as e:
        print("❌ Validation error occurred:")
        for error in e.errors():
            print(f"  - Field '{error['loc'][0]}': {error['msg']}")
        return None

# --- Example 1: Valid Data ---
print("--- Testing with valid data ---")
valid_data = {
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "I forgot my password."
}
validate_user_input(valid_data)

print("\n" + "="*50 + "\n")

# --- Example 2: Missing Required Field ---
print("--- Testing with a missing field ---")
missing_field_data = {
    "name": "Joe User",
    "email": "joe.user@example.com"
}
validate_user_input(missing_field_data)

--- Testing with valid data ---
✅ Valid user input created:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "I forgot my password."
}


--- Testing with a missing field ---
❌ Validation error occurred:
  - Field 'query': Field required


### 4. Enhancing the Model: Optional Fields & Constraints

Pydantic models can be made more powerful using `Optional` types and `Field` for adding constraints.

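- **`Optional[type]`**: Allows the value to be `None`. In Pydantic v2, the field only becomes truly optional (may be omitted) when it also has a default, such as `= None`.
- **`Field(default, description, ...)`**: Sets a default value and a description, plus validation rules like `ge` (greater than or equal) and `le` (less than or equal).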

Pydantic also performs **type coercion**: if data can be safely converted (e.g., a string `"2025-12-31"` to a `date` object, or `"12345"` to an `int`), Pydantic handles it automatically. By default, Pydantic ignores extra fields that are not defined in the model.
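
One subtlety is worth checking by hand: in Pydantic v2, `Optional[...]` on its own does not make a field omittable. The sketch below uses a hypothetical `Order` model (not part of the course code) to compare a field with and without a default, and to show a `ge`/`le` constraint rejecting a value.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Order(BaseModel):
    # Optional[int] with no default: the key must be present, but may be None.
    order_id: Optional[int]
    # Optional[int] with a default of None: the key may be left out entirely.
    discount_code: Optional[int] = None
    # Constraints are checked only when an actual integer is provided.
    quantity: Optional[int] = Field(None, ge=1, le=10)

print(Order(order_id=None))  # accepted: the key is present, the value is None

missing = []
try:
    Order()  # order_id is missing altogether
except ValidationError as exc:
    missing = [(err["loc"], err["type"]) for err in exc.errors()]
    print(missing)

out_of_range = None
try:
    Order(order_id=1, quantity=0)  # violates ge=1
except ValidationError as exc:
    out_of_range = exc.errors()[0]["type"]
    print(out_of_range)
```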

In [17]:
# Define a new, more advanced UserInput model
class UserInput(BaseModel):
    name: str
    email: EmailStr
    query: str
    order_id: Optional[int] = Field(
        None,
        description="5-digit order number (cannot start with 0)",
        ge=10000,
        le=99999
    )
    purchase_date: Optional[date] = None

# --- Test 1: Only required fields ---
print("--- Test 1: Only required fields ---")
input_data_1 = {
    "name": "Joe User", "email": "joe.user@example.com", "query": "I forgot my password."
}
validate_user_input(input_data_1)
print("\n")

# --- Test 2: All fields + extra fields (which are ignored) ---
print("--- Test 2: All fields + extra fields ---")
input_data_2 = {
    "name": "Joe User", "email": "joe.user@example.com",
    "query": "I need to return an item.", "order_id": 12345,
    "purchase_date": date(2025, 12, 31),
    "system_message": "This will be ignored", "iteration": 1
}
validate_user_input(input_data_2)
print("\n")

# --- Test 3: Type coercion from string to date and int ---
print("--- Test 3: Type Coercion ---")
input_data_3 = {
    "name": "Joe User", "email": "joe.user@example.com",
    "query": "Overcharged for my order.", "order_id": "12345",
    "purchase_date": "2025-12-31"
}
validate_user_input(input_data_3)
print("\n")

# --- Test 4: Invalid type that cannot be coerced ---
print("--- Test 4: Invalid Type ---")
input_data_4 = {
    "name": 99999, "email": "joe.user@example.com",
    "query": "Wrong name type.", "order_id": 12345, "purchase_date": "2025-12-31"
}
validate_user_input(input_data_4)

--- Test 1: Only required fields ---
✅ Valid user input created:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "I forgot my password.",
  "order_id": null,
  "purchase_date": null
}


--- Test 2: All fields + extra fields ---
✅ Valid user input created:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "I need to return an item.",
  "order_id": 12345,
  "purchase_date": "2025-12-31"
}


--- Test 3: Type Coercion ---
✅ Valid user input created:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "Overcharged for my order.",
  "order_id": 12345,
  "purchase_date": "2025-12-31"
}


--- Test 4: Invalid Type ---
❌ Validation error occurred:
  - Field 'name': Input should be a valid string
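
To double-check what coercion and extra-field handling actually do, one can inspect the resulting object directly. The sketch below uses `OrderInfo`, a trimmed-down, hypothetical stand-in for the `UserInput` model, and assumes Pydantic v2 defaults.

```python
from datetime import date
from typing import Optional
from pydantic import BaseModel

class OrderInfo(BaseModel):
    # Hypothetical stand-in carrying only the two coerced fields.
    order_id: Optional[int] = None
    purchase_date: Optional[date] = None

info = OrderInfo.model_validate({
    "order_id": "12345",              # string, coerced to int
    "purchase_date": "2025-12-31",    # ISO string, coerced to date
    "system_message": "not part of the model",  # extra key, dropped by default
})

# The stored values are real int and date objects, not strings.
print(type(info.order_id).__name__, type(info.purchase_date).__name__)
print(info.model_dump())
# The extra key never becomes an attribute on the model.
print(hasattr(info, "system_message"))
```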


### 5. Working with JSON Data

A common use case is validating data that arrives as a JSON string. The standard way is to first parse the JSON into a Python dictionary using `json.loads()` and then pass the dictionary to our validation function.

However, Pydantic provides a more direct and efficient method: `model_validate_json()`. This single command parses and validates the raw JSON string in one step.

In [18]:
# Define user input as a valid JSON string
json_data_valid = '''
{
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "I bought a keyboard and mouse and was overcharged.",
    "order_id": 12345,
    "purchase_date": "2025-12-31"
}
'''

# --- Method 1: json.loads + validate_user_input ---
print("--- Method 1: json.loads() then validate ---")
input_data = json.loads(json_data_valid)
user_input = validate_user_input(input_data)
print("\n")


# --- Method 2: model_validate_json ---
print("--- Method 2: Using model_validate_json() ---")
try:
    user_input = UserInput.model_validate_json(json_data_valid)
    print("✅ Valid user input created directly from JSON:")
    print(user_input.model_dump_json(indent=2))
except ValidationError as e:
    print(f"❌ Validation error: {e}")

--- Method 1: json.loads() then validate ---
✅ Valid user input created:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "I bought a keyboard and mouse and was overcharged.",
  "order_id": 12345,
  "purchase_date": "2025-12-31"
}


--- Method 2: Using model_validate_json() ---
✅ Valid user input created directly from JSON:
{
  "name": "Joe User",
  "email": "joe.user@example.com",
  "query": "I bought a keyboard and mouse and was overcharged.",
  "order_id": 12345,
  "purchase_date": "2025-12-31"
}


Now, let's try `model_validate_json` with invalid data. The `order_id` is given as the string `"01234"`, which Pydantic coerces to the integer `1234`; that is below the minimum of `10000`, so validation fails.

**Note:** The following cell is expected to produce a `ValidationError`.

In [19]:
# Define a JSON string with invalid data (order_id is out of bounds)
json_data_invalid = '''
{
    "name": "Joe User",
    "email": "joe.user@example.com",
    "query": "My account has been locked for some reason.",
    "order_id": "01234",
    "purchase_date": "2025-12-31"
}
'''

# Parse and validate in one step, expecting an error
try:
    user_input = UserInput.model_validate_json(json_data_invalid)
    print(user_input.model_dump_json(indent=2))
except ValidationError as e:
    print("--- Expected Validation Error ---")
    print(e)

--- Expected Validation Error ---
1 validation error for UserInput
order_id
  Input should be greater than or equal to 10000 [type=greater_than_equal, input_value='01234', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/greater_than_equal
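
Beyond printing the exception, the structured entries returned by `e.errors()` can be read programmatically, for example to build a corrective prompt. The sketch below assumes Pydantic v2 and uses `OrderLookup`, a hypothetical minimal model that reuses the `order_id` constraint from `UserInput`.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class OrderLookup(BaseModel):
    # Hypothetical minimal model with the same bounds as UserInput.order_id.
    order_id: Optional[int] = Field(None, ge=10000, le=99999)

summary = None
try:
    OrderLookup.model_validate_json('{"order_id": "01234"}')
except ValidationError as exc:
    first = exc.errors()[0]
    # Each entry is a dict with the field location, error type, message, and input.
    summary = {
        "field": first["loc"][0],
        "type": first["type"],
        "message": first["msg"],
        "input": first["input"],
    }
    print(summary)
```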


---
## Conclusion

In this lesson, we learned how to use Pydantic models to validate user input for a customer support scenario. By defining clear data models with type hints, optional fields, and constraints, we can ensure our application only works with well-formed data. Handling validation errors gracefully makes our code more robust and reliable, setting the stage for integrating Pydantic into more advanced LLM workflows.