<a href="https://colab.research.google.com/github/jeffheaton/app_generative_ai/blob/main/t81_559_class_05_3_pydantic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-559: Applications of Generative Artificial Intelligence
**Module 5: LangChain: Data Extraction**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 5 Material

* Part 5.1: Structured Output Parser [[Video]](https://www.youtube.com/watch?v=62CSR141VRE) [[Notebook]](t81_559_class_05_1_langchain_data.ipynb)
* Part 5.2: Other Parsers (CSV, JSON, Pandas, Datetime) [[Video]](https://www.youtube.com/watch?v=VXm8gPzU3qc) [[Notebook]](t81_559_class_05_2_parsers.ipynb)
* **Part 5.3: Pydantic parser** [[Video]](https://www.youtube.com/watch?v=dc4fn-W60hg) [[Notebook]](t81_559_class_05_3_pydantic.ipynb)
* Part 5.4: Custom Output Parser [[Video]](https://www.youtube.com/watch?v=jBpkAblQC_U) [[Notebook]](t81_559_class_05_4_custom_parsers.ipynb)
* Part 5.5: Output-Fixing Parser [[Video]](https://www.youtube.com/watch?v=_txWiLjf4bo) [[Notebook]](t81_559_class_05_5_output_fixing_parsers.ipynb)

# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab. If it is, the code loads the OpenAI API key from CoLab's secrets and installs the required libraries.

In [5]:
import os

try:
    from google.colab import drive, userdata
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# OpenAI Secrets
if COLAB:
    os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

# Install needed libraries in CoLab
if COLAB:
    !pip install langchain langchain_openai

Note: using Google CoLab


# 5.3: Pydantic Parser

Pydantic is a data validation and settings management library in Python that uses Python type annotations. It's designed to allow for quick and easy data parsing and validation using Python's standard typing system.

Some of the key features of Pydantic:

* Data validation: It validates data to ensure it conforms to the expected format, converting types where necessary.
* Editor support: Pydantic models are classes that leverage Python's type hints, making them easy to use with modern editors that provide features like type checking and autocompletion.
* Error handling: It provides detailed and human-readable error reports to help identify where and why data failed to validate.
* Settings management: Pydantic is often used for managing settings/configurations, making it easy to load parameters from environment variables, JSON files, or other sources. In Pydantic v2, this feature is provided by the separate `pydantic-settings` package.
* Extensible: You can extend models with methods and properties, and use Pydantic's validation decorators to perform custom validation.
* Integration with other libraries: It works well with many other libraries, such as FastAPI for building APIs, enhancing their usability and functionality.

Overall, Pydantic provides a dependable way to ensure that input data conforms to a declared schema, which makes it a common choice in Python web development and data processing applications.
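
The validation and error-reporting features listed above can be illustrated with a minimal sketch. It assumes Pydantic v2 is installed; the `Film` model and its values are hypothetical and are not used elsewhere in this notebook.

```python
from pydantic import BaseModel, ValidationError

# Hypothetical model used only for this illustration.
class Film(BaseModel):
    title: str
    year: int

# In Pydantic's default (lax) mode, the numeric string is converted to an int.
film = Film(title="Forrest Gump", year="1994")
print(film.year, type(film.year).__name__)

# A value that cannot be converted raises a ValidationError naming the field.
try:
    Film(title="Big", year="nineteen eighty-eight")
except ValidationError as exc:
    errors = exc.errors()
    print(errors[0]["loc"], errors[0]["type"], errors[0]["msg"])
```

Each entry in `errors` is a dictionary that identifies the failing field and the reason, which is the same information Pydantic prints in its human-readable error report.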

LangChain, a library designed to facilitate the building of applications with language models, offers various tools to manage and enhance interactions with these models. One of these tools is the PydanticOutputParser, which integrates Pydantic's powerful validation capabilities with the output of large language models (LLMs) like GPT.

The primary goal of the PydanticOutputParser is to ensure that the outputs from a language model are structured and adhere to a predefined schema. This is especially important in applications where consistent and reliable data formats are crucial, such as in data extraction tasks, API responses, or any scenario requiring subsequent automated processing of the model's output.
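
To make the idea concrete, the sketch below approximates the two jobs such a parser has to do, using Pydantic alone: describe the schema so it can be placed in the prompt, and validate the text that comes back. This is an illustration under stated assumptions, not LangChain's actual implementation; the `City` model and the hard-coded `raw_reply` string are hypothetical, and no LLM is called.

```python
import json

from pydantic import BaseModel, Field

# Hypothetical schema for this illustration.
class City(BaseModel):
    name: str = Field(description="name of the city")
    population: int = Field(description="approximate number of residents")

# Step 1: a JSON schema that could be embedded in the prompt as format instructions.
schema = City.model_json_schema()
print(json.dumps(schema["properties"], indent=2))

# Step 2: a stand-in for the raw text returned by a model (illustrative values).
raw_reply = '{"name": "Springfield", "population": 12345}'

# Step 3: parse the JSON and validate it against the schema in a single call.
city = City.model_validate_json(raw_reply)
print(city)
```

The field descriptions travel with the schema, which is why the `Field(description=...)` arguments in the models below matter: they are the guidance the model sees about what each field should contain.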

In [6]:
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field, field_validator   # <-- v2 imports
from langchain_openai import ChatOpenAI

MODEL = "gpt-5-mini"
TEMPERATURE = 0

llm = ChatOpenAI(model=MODEL, temperature=TEMPERATURE)

The following code uses an LLM to tell a joke. A Pydantic validator on the `Joke` model checks that the joke's setup ends with a question mark, and the parser rejects any response that fails this check. Such checking ensures that the somewhat random LLM produces output that aligns with our expectations.

In [7]:
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    @field_validator("setup")
    @classmethod
    def question_ends_with_question_mark(cls, v: str) -> str:
        if not v.endswith("?"):
            raise ValueError("Badly formed question!")
        return v

joke_query = "Tell me a joke about cats."

parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | llm | parser
result = chain.invoke({"query": joke_query})
print(result)


setup='Why did the cat sit on the computer?' punchline='Because it wanted to keep an eye on the mouse and claim the desktop as its throne.'
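
The run above shows the success path. The sketch below shows what the validator does with a reply that breaks the rule. It validates a hard-coded, hypothetical reply directly with Pydantic, so no LLM call is needed; inside a chain, the parser would be expected to surface the same failure as a parsing error instead of returning a result.

```python
from pydantic import BaseModel, ValidationError, field_validator

# Same validation rule as the Joke model above, repeated so this cell runs alone.
class Joke(BaseModel):
    setup: str
    punchline: str

    @field_validator("setup")
    @classmethod
    def question_ends_with_question_mark(cls, v: str) -> str:
        if not v.endswith("?"):
            raise ValueError("Badly formed question!")
        return v

# Hypothetical model reply whose setup is a statement, not a question.
bad_reply = '{"setup": "A cat walks into a bar.", "punchline": "Ouch."}'

try:
    Joke.model_validate_json(bad_reply)
    rejected = False
except ValidationError as exc:
    rejected = True
    message = exc.errors()[0]["msg"]
    print(message)
```

Catching the error gives the application a chance to retry the request or repair the output, which is the topic of Part 5.5.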


The next example uses **LangChain’s `PydanticOutputParser` with a custom Pydantic v2 model** to enforce structured output from a language model. The `Actor` class defines two fields, an actor’s name and a list of films they have starred in, along with validators that require the name to contain at least two capitalized words and the film list to be non-empty. The chain combines a prompt template, an OpenAI chat model, and the parser; the parser checks the model’s response against the schema and raises an error if it does not conform, so downstream code receives only validated, structured data.


In [8]:
from typing import List
from pydantic import BaseModel, Field, field_validator
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate

class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")

    @field_validator("name")
    @classmethod
    def validate_name(cls, value: str) -> str:
        parts = value.strip().split()
        if len(parts) < 2:
            raise ValueError("Name must contain at least two words.")
        if not all(p and p[0].isupper() for p in parts):
            raise ValueError("Each word in the name must start with a capital letter.")
        return value

    @field_validator("film_names")
    @classmethod
    def validate_film_names(cls, v: List[str]) -> List[str]:
        if not v:
            raise ValueError("film_names must contain at least one title.")
        if not all(isinstance(x, str) and x.strip() for x in v):
            raise ValueError("All film names must be non-empty strings.")
        return v

actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# assumes `llm` is already defined (e.g., ChatOpenAI(...))
chain = prompt | llm | parser

result = chain.invoke({"query": actor_query})
print(result)


name='Tom Hanks' film_names=['Forrest Gump', 'Saving Private Ryan', 'Cast Away', 'Philadelphia', 'Big', 'Apollo 13', 'The Green Mile', 'Toy Story', 'Catch Me If You Can', 'A Beautiful Day in the Neighborhood']
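
Because the chain returns a Pydantic object and not raw text, downstream code can work with it directly. The following sketch assumes Pydantic v2 and uses a simplified `Actor` model (validators omitted) with a hand-built object standing in for the value returned by `chain.invoke(...)`.

```python
from typing import List

from pydantic import BaseModel

# Simplified stand-in for the Actor model above (validators omitted for brevity).
class Actor(BaseModel):
    name: str
    film_names: List[str]

# Stand-in for the object returned by chain.invoke(...); values are illustrative.
result = Actor(name="Tom Hanks", film_names=["Big", "Cast Away"])

# Fields are ordinary typed attributes.
print(result.name)
print(len(result.film_names))

# Convert to a plain dict (for example, a row in a table) or to a JSON string.
record = result.model_dump()
as_json = result.model_dump_json()
print(record)
print(as_json)

# The JSON string can be loaded back into an equal, validated object.
restored = Actor.model_validate_json(as_json)
print(restored == result)
```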
