# Pydantic parser
This output parser allows users to specify an arbitrary JSON schema and query LLMs for JSON or YAML outputs that conform to that schema.

Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed JSON. In the OpenAI family, DaVinci can do this reliably, but Curie's ability already drops off dramatically.

Use Pydantic to declare your data model. Pydantic's BaseModel is like a Python dataclass, but with actual type checking + coercion.
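To make the contrast concrete, here is a small sketch that uses only the standard library. It shows that a plain dataclass stores whatever value it is given, which is the gap that Pydantic's checking and coercion fill. `PlainJoke` is a hypothetical name used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class PlainJoke:
    # Type hints on a plain dataclass are not enforced at runtime.
    setup: str
    punchline: str


# The dataclass accepts an int where a str is annotated:
# no error is raised and no conversion happens.
plain = PlainJoke(setup=123, punchline="To get to the other side!")
setup_type = type(plain.setup).__name__
print(setup_type)  # prints "int"
```

A Pydantic `BaseModel` with the same fields would not silently keep the mismatched value. Depending on the Pydantic version and settings, it either coerces the value to the declared type or raises a validation error.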

In [3]:
from typing import List

from langchain.llms.openai import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field, validator

In [6]:
model_name = "gpt-3.5-turbo-instruct"
temperature = 0.0
model = OpenAI(model=model_name, temperature=temperature)

In [7]:
# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")

    # You can add custom validation logic easily with Pydantic.
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if not field.endswith("?"):
            raise ValueError("Badly formed question!")
        return field


# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(query=joke_query)

output = model.invoke(_input.to_string())

parser.parse(output)

Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')
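Conceptually, the parse step above decodes the model's text as JSON and then validates the fields. The following is a rough standard-library sketch of that idea, not LangChain's or Pydantic's actual implementation. `JokeRecord`, `parse_joke`, and the two hard-coded completions are hypothetical and exist only so the example runs offline.

```python
import json
from dataclasses import dataclass


@dataclass
class JokeRecord:
    setup: str
    punchline: str

    def __post_init__(self):
        # Mirror the custom validator above: the setup must be a question.
        if not self.setup.endswith("?"):
            raise ValueError("Badly formed question!")


def parse_joke(text: str) -> JokeRecord:
    """Decode a JSON object from model text and build a validated record."""
    data = json.loads(text)
    return JokeRecord(setup=data["setup"], punchline=data["punchline"])


# Hypothetical raw completions, hard-coded instead of calling a model.
good_output = (
    '{"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side!"}'
)
bad_output = (
    '{"setup": "A chicken crossed the road.", '
    '"punchline": "To get to the other side!"}'
)

joke = parse_joke(good_output)

try:
    parse_joke(bad_output)
    rejected = False
except ValueError:
    # The validation rule rejects a setup that is not a question.
    rejected = True

print(joke, rejected)
```

The point of the sketch is that well-formed JSON is not enough: output that decodes cleanly can still fail the model's own validation rules.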

In [8]:
# Here's another example, but with a compound typed field.
class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(query=actor_query)

output = model.invoke(_input.to_string())

parser.parse(output)

Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Cast Away', 'Toy Story'])
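It can help to see roughly what ends up in the prompt once the format instructions are injected. The sketch below is an approximation using only the standard library. The schema dictionary is written by hand to resemble a JSON schema for `Actor`, and `build_format_instructions` and its wording are hypothetical; the real text comes from `parser.get_format_instructions()`.

```python
import json

# Hand-written approximation of a JSON schema for the Actor model above.
actor_schema = {
    "properties": {
        "name": {"description": "name of an actor", "type": "string"},
        "film_names": {
            "description": "list of names of films they starred in",
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["name", "film_names"],
}


def build_format_instructions(schema: dict) -> str:
    """Hypothetical helper: wrap a schema in plain-language instructions."""
    return (
        "The output should be a JSON object that conforms to this JSON schema:\n"
        + json.dumps(schema)
    )


# Same template string as in the cell above, filled in with str.format.
template = "Answer the user query.\n{format_instructions}\n{query}\n"
prompt_text = template.format(
    format_instructions=build_format_instructions(actor_schema),
    query="Generate the filmography for a random actor.",
)
print(prompt_text)
```

The compound field shows up in the schema as an array of strings, which is how the model learns that `film_names` should be a list rather than a single value.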

## Pydantic (YAML) parser
The Pydantic YAML parser is an extension of the JSON parser that parses model output in YAML format instead. YAML is more token-efficient: it proves to be ~20-35% faster and uses roughly the same percentage fewer completion tokens. The parser follows the same principles as its JSON counterpart, letting you define a specific JSON schema to query large language models against.

The YAML parser is particularly useful for handling lists of objects, eliminating the need to specify a root key.
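As a rough illustration of why YAML tends to be more compact, the sketch below serializes the same data both ways and compares character counts. Character count is only a crude proxy for tokens, so this does not reproduce the figures quoted above. The YAML text is written by hand so that no third-party YAML library is needed.

```python
import json

actors = [
    {"name": "Tom Hanks", "film_names": ["Forrest Gump", "Cast Away"]},
]

# Pretty-printed JSON, similar to what a model often emits.
json_text = json.dumps(actors, indent=2)

# The same data written by hand as YAML: a top-level list, no root key,
# and no quotes, braces, or brackets.
yaml_text = (
    "- name: Tom Hanks\n"
    "  film_names:\n"
    "  - Forrest Gump\n"
    "  - Cast Away"
)

# Character counts for the two serializations of the same data.
print(len(json_text), len(yaml_text))
```

Most of the saving comes from dropping quotes, braces, and brackets, all of which the model would otherwise have to generate.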

In [31]:
from langchain.output_parsers import YamlOutputParser


class Actors(BaseModel):
    __root__: List[Actor]


actor_query = "Generate a list of 3 actors and their filmographies."

parser = YamlOutputParser(pydantic_object=Actors)

prompt = PromptTemplate.from_template(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | model | parser

chain.invoke({"query": actor_query})

Actors(__root__=[Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Cast Away', 'Saving Private Ryan']), Actor(name='Meryl Streep', film_names=['The Devil Wears Prada', 'Mamma Mia!', 'The Iron Lady']), Actor(name='Leonardo DiCaprio', film_names=['Titanic', 'The Wolf of Wall Street', 'Inception'])])

Since there is no guarantee that the LLM will produce output in the desired format, it is advisable to re-run the chain if an `OutputParserException` is raised.

In [32]:
from langchain_core.exceptions import OutputParserException

# In case the model output can't be parsed and the `OutputParserException` is raised, retry the chain.
retry_chain = chain.with_retry(retry_if_exception_type=(OutputParserException,))
retry_chain.invoke({"query": actor_query})

Actors(__root__=[Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Cast Away', 'Saving Private Ryan']), Actor(name='Meryl Streep', film_names=['The Devil Wears Prada', 'Mamma Mia!', 'The Iron Lady']), Actor(name='Leonardo DiCaprio', film_names=['Titanic', 'The Wolf of Wall Street', 'Inception'])])
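The retry behaviour can be pictured as a simple loop: generate, try to parse, and generate again if parsing fails. The sketch below is a standard-library approximation of that idea, not how `with_retry` is implemented. `ParseError`, `invoke_with_retry`, the `max_attempts` default, and the fake completions are all hypothetical.

```python
import json


class ParseError(Exception):
    """Hypothetical stand-in for OutputParserException."""


def parse(text: str) -> dict:
    """Parse model text as JSON, raising ParseError on malformed input."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        raise ParseError(str(exc)) from exc


def invoke_with_retry(generate, max_attempts: int = 3) -> dict:
    """Call generate() and parse its output, retrying on ParseError."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse(generate())
        except ParseError as exc:
            # Remember the failure and ask the "model" again.
            last_error = exc
    raise last_error


# Fake model: the first completion is truncated, the second is valid JSON.
completions = iter(['{"name": "Tom Hanks"', '{"name": "Tom Hanks"}'])
result = invoke_with_retry(lambda: next(completions))
print(result)  # prints {'name': 'Tom Hanks'}
```

Only parse failures trigger another attempt here; any other exception propagates immediately, which matches the intent of restricting the retry to `OutputParserException`.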