# Output Parsers

Language models output text, but you will often want more structured information back than plain text. This is where output parsers come in.

Output parsers are classes that help structure language model responses. There are two main methods an output parser must implement:

- `get_format_instructions() -> str`: A method which returns a string containing instructions for how the output of a language model should be formatted.
- `parse(str) -> Any`: A method which takes in a string (assumed to be the response from a language model) and parses it into some structure.
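
As a minimal sketch of that two-method interface, the class below implements both methods for a yes/no answer. The `YesNoOutputParser` name and its behavior are illustrative assumptions rather than part of LangChain, and the sketch deliberately avoids importing the library.

```python
from typing import Any


class YesNoOutputParser:
    """Hypothetical parser that turns a yes/no completion into a bool."""

    def get_format_instructions(self) -> str:
        # This text is meant to be appended to the prompt sent to the model.
        return "Answer with a single word: either `yes` or `no`."

    def parse(self, text: str) -> Any:
        # Normalize whitespace, case, and trailing punctuation before matching.
        cleaned = text.strip().lower().rstrip(".!")
        if cleaned == "yes":
            return True
        if cleaned == "no":
            return False
        raise ValueError(f"Could not parse a yes/no answer from: {text!r}")


parser = YesNoOutputParser()
prompt = "Is Paris the capital of France?\n" + parser.get_format_instructions()
# A real model call would go here; this string stands in for the completion.
completion = " Yes.\n"
result = parser.parse(completion)
```

The format instructions and the parsing logic live in the same class, so the prompt and the parser are less likely to drift apart.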

Below we go over some examples of output parsers.

In [1]:
from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI

## PydanticOutputParser
This output parser allows users to specify an arbitrary JSON schema and query LLMs for JSON outputs that conform to that schema.

Keep in mind that large language models are leaky abstractions! You'll have to use an LLM with sufficient capacity to generate well-formed JSON. In the OpenAI family, DaVinci can do this reliably, but Curie's ability already drops off dramatically.

Use Pydantic to declare your data model. Pydantic's BaseModel is like a Python dataclass, but with actual type checking + coercion.
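
To make the contrast concrete, here is a standard-library sketch: a plain dataclass stores whatever it is given, while the second class hand-rolls the kind of check-and-coerce step that Pydantic performs for you. `PlainPoint` and `CheckedPoint` are hypothetical names, and the `__post_init__` logic is only a stand-in for Pydantic's validation, not its implementation.

```python
from dataclasses import dataclass


@dataclass
class PlainPoint:
    x: int  # Annotation only; nothing is enforced at runtime.


@dataclass
class CheckedPoint:
    x: int

    def __post_init__(self) -> None:
        # Hand-written coercion: accept anything int() can convert, reject the rest.
        try:
            self.x = int(self.x)
        except (TypeError, ValueError) as exc:
            raise ValueError(f"x must be an integer, got {self.x!r}") from exc


plain = PlainPoint(x="3")      # Stored as the string "3".
checked = CheckedPoint(x="3")  # Coerced to the integer 3.
```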

In [2]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List

In [3]:
model_name = 'text-davinci-003'
temperature = 0.0
model = OpenAI(model_name=model_name, temperature=temperature)

In [4]:
# Define your desired data structure.
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")
    punchline: str = Field(description="answer to resolve the joke")
    
    # You can add custom validation logic easily with Pydantic.
    @validator('setup')
    def question_ends_with_question_mark(cls, field):
        if field[-1] != '?':
            raise ValueError("Badly formed question!")
        return field

# And a query intended to prompt a language model to populate the data structure.
joke_query = "Tell me a joke."

# Set up a parser + inject instructions into the prompt template.
parser = PydanticOutputParser(pydantic_object=Joke)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

_input = prompt.format_prompt(query=joke_query)

output = model(_input.to_string())

parser.parse(output)

Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')

In [5]:
# Here's another example, but with a compound typed field.
class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")
        
actor_query = "Generate the filmography for a random actor."

parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

_input = prompt.format_prompt(query=actor_query)

output = model(_input.to_string())

parser.parse(output)

Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Cast Away', 'Toy Story'])
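
Conceptually, the parsing step in the two examples above amounts to finding a JSON object in the completion, decoding it, and checking it against the declared fields. The sketch below illustrates that idea with the standard library only. `parse_json_object` is a hypothetical helper, and `PydanticOutputParser` may well work differently internally.

```python
import json
import re
from typing import Any, Dict, List


def parse_json_object(text: str, required: List[str]) -> Dict[str, Any]:
    """Extract the first {...} span from text and check the required keys."""
    # Greedy match from the first "{" to the last "}" so nested objects survive.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {text!r}")
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError as exc:
        raise ValueError(f"Malformed JSON in completion: {exc}") from exc
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"Missing required keys: {missing}")
    return data


# A completion with some chatter around the JSON payload.
completion = 'Sure! {"name": "Tom Hanks", "film_names": ["Forrest Gump", "Cast Away"]}'
actor = parse_json_object(completion, required=["name", "film_names"])
```

Raising a single error type for every failure mode keeps the caller's handling simple, which matters once retry logic is layered on top.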

### Aside: adding "guardrails" to your parsers.

"Guardrails" intuitively add validation logic + optionally retry logic to some black box output, like an LM generating structured output...

Below we'll showcase a "guarded" parser which can be dropped into an `LLMChain` as is. It will catch errors at parsing time and try to resolve them, initially by re-invoking an LLM. There are many approaches to guardrailing non-deterministic LLMs; here's a simple case.
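
The catch-and-retry idea can be sketched without any library code. In the example below, `RetryingParser`, `parse_floats`, and `fake_retry_llm` are all hypothetical names, and the fallback "LLM" is a plain function returning a canned string. It shows the control flow only and is not how `RetriableOutputParser` is implemented.

```python
from typing import Any, Callable, List


class RetryingParser:
    """Hypothetical wrapper: on a parse failure, ask a fallback LLM to repair the text."""

    def __init__(
        self,
        parse: Callable[[str], Any],
        retry_llm: Callable[[str], str],
        max_retries: int = 1,
    ) -> None:
        self.parse_fn = parse
        self.retry_llm = retry_llm
        self.max_retries = max_retries

    def parse(self, completion: str) -> Any:
        for attempt in range(self.max_retries + 1):
            try:
                return self.parse_fn(completion)
            except ValueError as error:
                if attempt == self.max_retries:
                    raise
                # Show the fallback model the bad output and the error, then try again.
                completion = self.retry_llm(
                    f"Fix this output so it parses.\nOutput: {completion}\nError: {error}"
                )
        raise ValueError("max_retries must be non-negative")


def parse_floats(text: str) -> List[float]:
    # float() raises ValueError on anything that is not a number.
    return [float(item) for item in text.split(",")]


def fake_retry_llm(prompt: str) -> str:
    # Stand-in for a larger model; a real one would be called with the prompt.
    return "0, 1, 1, 2, 3"


guarded = RetryingParser(parse_floats, fake_retry_llm)
values = guarded.parse("The Fibonacci sequence starts with zero and one")
```

Bounding the number of retries keeps a persistently failing model from looping forever; the last error is re-raised so the caller can still handle it.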

In [6]:
from langchain.guardrails.parsing import RetriableOutputParser
from langchain.output_parsers import OutputParserException

# Note: here we use an LLMChain which slightly abstracts calling an LLM with prompt templates + parsers.
from langchain.chains import LLMChain

#### 1st example: retry with a larger model.

In [7]:
# Pydantic data structure.
class FloatArray(BaseModel):
    values: List[float] = Field(description="list of floats")

# Query that will populate the data structure.
float_array_query = "Write out a few terms of fiboacci."

In [8]:
# Declare a parser and prompt.
parser = PydanticOutputParser(pydantic_object=FloatArray)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# Currently, the parser is set on the prompt template for use in an LLMChain.
prompt.output_parser = parser

In [9]:
# For demonstration's sake, we'll use a "small" model that probably won't generate JSON properly.
llm_chain = LLMChain(
    prompt=prompt,
    llm=OpenAI(model_name="text-curie-001"),
    verbose=True)

try:
    llm_chain.predict(query=float_array_query)
except OutputParserException as e:
    print("Dang!")
    print(e)



> Entering new LLMChain chain...
Prompt after formatting:
Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"values": {"title": "Values", "description": "list of floats", "type": "array", "items": {"type": "number"}}}, "required": ["values"]}
```
Write out a few terms of fiboacci.
Dang!
Failed to parse FloatArray from completion 
A fibonacci sequence is a sequence of numbers where each number in the sequence is the sum of the two previous numbers in the sequence. The first number in the sequence is 0 and the second number in 

In [10]:
# We can replace the parser with a guarded parser that tries to fix errors with a bigger model.
guarded_parser = RetriableOutputParser(
    parser=parser, retry_llm=OpenAI(model_name="text-davinci-003"))
prompt.output_parser = guarded_parser

llm_chain.predict(query=float_array_query)



> Entering new LLMChain chain...
Prompt after formatting:
Answer the user query.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"values": {"title": "Values", "description": "list of floats", "type": "array", "items": {"type": "number"}}}, "required": ["values"]}
```
Write out a few terms of fiboacci.

> Finished chain.


FloatArray(values=[0.0, 1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 34.0])

This example is only a demonstration, though. If your goal is to generate data structures, you'll probably want to start with a large enough model.

#### 2nd example: a more realistic example that a large model sometimes struggles with.

In [11]:
from enum import Enum

# These data structures will induce a classification & summarization task. Neat!
class Outcome(str, Enum):
    Purchase = "Purchase"
    Objection = "Objection"

class CustomerOutcome(BaseModel):
    outcome: Outcome = Field(description="did the customer purchase or object to the offer")
    reason_for_outcome: str = Field(description="why?")
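
An enum-typed field turns this into a classification task because only the declared values are accepted; any other label is rejected at parse time. The standard-library sketch below shows that lookup on its own, independent of Pydantic. `to_outcome` is a hypothetical helper, and the `Outcome` enum is repeated so the block is self-contained.

```python
from enum import Enum


class Outcome(str, Enum):
    Purchase = "Purchase"
    Objection = "Objection"


def to_outcome(raw: str) -> Outcome:
    # Looking an enum up by value raises ValueError for unknown labels.
    return Outcome(raw.strip())


label = to_outcome("Objection")
```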

In [12]:
parser = PydanticOutputParser(pydantic_object=CustomerOutcome)

prompt_template_str = """Answer the query below.
Customer Message: {customer_msg}
{format_instructions}
Say whether the Customer accepted or rejected the purchase and summarize why:"""

prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["customer_msg"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)
prompt.output_parser = parser

In [13]:
customer_msg = """Nope thats way over budget, can't do it won't do it."""

In [14]:
llm_chain = LLMChain(
    prompt=prompt,
    llm=OpenAI(model_name="text-davinci-001"),
    verbose=True)

try:
    completion = llm_chain.predict(customer_msg=customer_msg)
    print(completion)
except OutputParserException as e:
    print("Dang!")
    print(e)



> Entering new LLMChain chain...
Prompt after formatting:
Answer the query below.
Customer Message: Nope thats way over budget, can't do it won't do it.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"outcome": {"description": "did the customer purchase or object to the offer", "allOf": [{"$ref": "#/definitions/Outcome"}]}, "reason_for_outcome": {"title": "Reason For Outcome", "description": "why?", "type": "string"}}, "required": ["outcome", "reason_for_outcome"], "definitions": {"Outcome": {"title": "Outcome", "description": "An enumeration.", "enu

Nice! It worked. We can similarly wrap with a retry to be safe:

In [15]:
guarded_parser = RetriableOutputParser(
    parser=parser, retry_llm=OpenAI(model_name="text-davinci-003"))
prompt.output_parser = guarded_parser

llm_chain.predict(customer_msg=customer_msg)



> Entering new LLMChain chain...
Prompt after formatting:
Answer the query below.
Customer Message: Nope thats way over budget, can't do it won't do it.
The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"outcome": {"description": "did the customer purchase or object to the offer", "allOf": [{"$ref": "#/definitions/Outcome"}]}, "reason_for_outcome": {"title": "Reason For Outcome", "description": "why?", "type": "string"}}, "required": ["outcome", "reason_for_outcome"], "definitions": {"Outcome": {"title": "Outcome", "description": "An enumeration.", "enu

CustomerOutcome(outcome=<Outcome.Objection: 'Objection'>, reason_for_outcome='The customer said that the product was over budget.')


---

# Older, less powerful parsers

## Structured Output Parser

While the Pydantic/JSON parser is more powerful, we initially experimented with data structures that have text fields only.

In [16]:
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

Here we define the response schema we want to receive.

In [17]:
response_schemas = [
    ResponseSchema(name="answer", description="answer to the user's question"),
    ResponseSchema(name="source", description="source used to answer the user's question, should be a website.")
]
output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

We now get a string that contains instructions for how the response should be formatted, and we then insert that into our prompt.

In [18]:
format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="answer the users question as best as possible.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions}
)

We can now use this to format a prompt to send to the language model, and then parse the returned result.

In [19]:
model = OpenAI(temperature=0)

In [20]:
_input = prompt.format_prompt(question="what's the capital of france")
output = model(_input.to_string())

In [21]:
output_parser.parse(output)

{'answer': 'Paris', 'source': 'https://en.wikipedia.org/wiki/Paris'}
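
Roughly, a text-fields-only parser like this needs two pieces: a function that turns the declared fields into format instructions, and a function that decodes the completion and keeps only those fields. The sketch below is a simplified stand-in using the standard library. `build_format_instructions` and `parse_text_fields` are hypothetical names, and the instruction wording is an assumption rather than what `StructuredOutputParser` actually emits.

```python
import json
from typing import Dict, List, Tuple


def build_format_instructions(schemas: List[Tuple[str, str]]) -> str:
    """Describe the expected JSON keys, one line per (name, description) pair."""
    lines = [f'  "{name}": string  // {description}' for name, description in schemas]
    return "Return a JSON object with these keys:\n{\n" + "\n".join(lines) + "\n}"


def parse_text_fields(text: str, schemas: List[Tuple[str, str]]) -> Dict[str, str]:
    """Decode the completion and keep only the declared keys."""
    data = json.loads(text)
    missing = [name for name, _ in schemas if name not in data]
    if missing:
        raise ValueError(f"Missing keys: {missing}")
    return {name: str(data[name]) for name, _ in schemas}


schemas = [
    ("answer", "answer to the user's question"),
    ("source", "source used to answer the user's question, should be a website."),
]
instructions = build_format_instructions(schemas)
# Stand-in for a model completion that followed the instructions.
completion = '{"answer": "Paris", "source": "https://en.wikipedia.org/wiki/Paris"}'
parsed = parse_text_fields(completion, schemas)
```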

And here's an example of using this with a chat model:

In [22]:
chat_model = ChatOpenAI(temperature=0)

In [23]:
prompt = ChatPromptTemplate(
    messages=[
        HumanMessagePromptTemplate.from_template("answer the users question as best as possible.\n{format_instructions}\n{question}")  
    ],
    input_variables=["question"],
    partial_variables={"format_instructions": format_instructions}
)

In [24]:
_input = prompt.format_prompt(question="what's the capital of france")
output = chat_model(_input.to_messages())

In [25]:
output_parser.parse(output.content)

{'answer': 'Paris', 'source': 'https://en.wikipedia.org/wiki/Paris'}

## CommaSeparatedListOutputParser

Here's another parser that is strictly less powerful than Pydantic/JSON parsing.

In [26]:
from langchain.output_parsers import CommaSeparatedListOutputParser

In [27]:
output_parser = CommaSeparatedListOutputParser()

In [28]:
format_instructions = output_parser.get_format_instructions()
prompt = PromptTemplate(
    template="List five {subject}.\n{format_instructions}",
    input_variables=["subject"],
    partial_variables={"format_instructions": format_instructions}
)

In [29]:
model = OpenAI(temperature=0)

In [30]:
_input = prompt.format(subject="ice cream flavors")
output = model(_input)

In [31]:
output_parser.parse(output)

['Vanilla',
 'Chocolate',
 'Strawberry',
 'Mint Chocolate Chip',
 'Cookies and Cream']
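
The parsing step for a comma-separated list can be approximated in a few lines. This is a sketch under the assumption that items never contain commas themselves; `parse_comma_separated` is a hypothetical helper, not the library's implementation.

```python
from typing import List


def parse_comma_separated(text: str) -> List[str]:
    # Split on commas, trim whitespace, and drop empty entries.
    return [item.strip() for item in text.split(",") if item.strip()]


flavors = parse_comma_separated(
    "Vanilla, Chocolate, Strawberry, Mint Chocolate Chip, Cookies and Cream"
)
```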