## Output Parsing

Language models output text, but sometimes you want more structured information back than plain text.

Output parsers are classes that help structure language model responses. There are two main methods an output parser must implement:

- **Get format instructions**: A method which returns a string containing instructions for how the output of a language model should be formatted.
- **Parse**: A method which takes in a string (assumed to be the response from a language model) and parses it into some structure.
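
The two-method contract above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: the class name `SimpleJsonParser`, the wording of the instructions, and the hard-coded `fake_response` are all assumptions made for the example.

```python
import json
from typing import Any, Dict


class SimpleJsonParser:
    """Hypothetical parser illustrating the two-method contract (not a LangChain class)."""

    def get_format_instructions(self) -> str:
        # Text meant to be appended to a prompt so the model knows the expected shape.
        return 'Respond with a single JSON object with the keys "setup" and "punchline".'

    def parse(self, text: str) -> Dict[str, Any]:
        # Turn the raw model response into a Python dict.
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("Expected a JSON object at the top level")
        return data


parser_sketch = SimpleJsonParser()
print(parser_sketch.get_format_instructions())

# A hard-coded string stands in for a real model response.
fake_response = '{"setup": "Why did the cat sit on the laptop?", "punchline": "To watch the mouse."}'
print(parser_sketch.parse(fake_response)["punchline"])
```

The format instructions go into the prompt, and `parse` runs on whatever text the model sends back, which is the same division of labour the built-in parsers follow.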

- Output Parsing
    - StrOutputParser
    - JsonOutputParser
    - CSV Output Parser
    - Datetime Output Parser
    - Structured Output Parser (Pydantic or JSON)


In [1]:
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())


True

In [27]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# temperature=0 makes the output as deterministic as possible.
# It reduces randomness (creativity) but does not guarantee factual answers.

In [3]:
from langchain_core.prompts import (
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    ChatPromptTemplate,
    PromptTemplate,
)

# Pydantic in LangChain: Notes & Example

---

## 1. What is Pydantic?

- **Pydantic** is a Python library for **data validation** and **settings management** using Python type annotations.
- It provides a `BaseModel` class for declaring data models with types, constraints, and descriptions.
- It **automatically validates** and parses data, raising errors for missing or invalid fields.

**In LangChain, Pydantic models are often used for:**
- Defining structured outputs expected from an LLM (e.g., a Joke object, a product recommendation, etc.).
- Validating and parsing LLM outputs into safe, strongly-typed Python objects.
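
As a small illustration of the validation behaviour described above, the sketch below builds one valid object and then triggers an error by omitting a required field. The `Movie` model and its fields are hypothetical and exist only for this example; the code assumes `pydantic` is installed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Movie(BaseModel):
    """Hypothetical model used only for this illustration."""

    title: str = Field(description="The movie title")
    year: Optional[int] = Field(default=None, description="Release year")


# Valid data is parsed into a typed object.
movie = Movie(title="Alien", year=1979)
print(movie.title, movie.year)

# A missing required field raises ValidationError instead of passing silently.
error_count = 0
try:
    Movie(year=1979)
except ValidationError as exc:
    error_count = len(exc.errors())
    print(f"Validation failed with {error_count} error(s)")
```

This is the property that makes Pydantic useful for LLM output: a response missing a required key fails at parse time rather than deep inside later code.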

---

## 2. Declaring a Pydantic Model

```python
from typing import Optional
from pydantic import BaseModel, Field

class Joke(BaseModel):
    """Joke to tell user"""

    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline of the joke")
    rating: Optional[int] = Field(description="The rating of the joke is from 1 to 10", default=None)
```


In [4]:
from typing import Optional
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser


In [5]:
class Joke(BaseModel):
    """Joke to tell user"""

    setup: str = Field(description="The setup of the joke")
    punchline: str = Field(description="The punchline of the joke")
    rating: Optional[int] = Field(description="The rating of the joke is from 1 to 10", default=None)

In [6]:
parser = PydanticOutputParser(pydantic_object=Joke)

In [7]:
instruction = parser.get_format_instructions()

In [8]:
print(instruction)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Joke to tell user", "properties": {"setup": {"description": "The setup of the joke", "title": "Setup", "type": "string"}, "punchline": {"description": "The punchline of the joke", "title": "Punchline", "type": "string"}, "rating": {"anyOf": [{"type": "integer"}, {"type": "null"}], "default": null, "description": "The rating of the joke is from 1 to 10", "title": "Rating"}}, "required": ["setup", "punchline"]}
```


In [9]:


prompt = PromptTemplate(
    # The 'template' argument defines the format of the prompt to be sent to the LLM.
    template='''
    Answer the user query with a joke. Here is your formatting instruction.
    {format_instruction}

    Query: {query}
    Answer:''',
    
    # 'input_variables' specifies which keys the template expects at runtime (here: 'query')
    input_variables=['query'],
    
    # 'partial_variables' are values you want to "hard-code" or fill in at creation time.
    # Here, format_instruction is filled automatically with instructions from the parser
    partial_variables={'format_instruction': parser.get_format_instructions()}
)

# The result is a prompt template that, when given a 'query', will fill in both
# 'query' (user's actual question) and 'format_instruction' (instructions for output formatting).

# Create a chain by piping the prompt template into the LLM.
# The input dict first fills the prompt; the formatted prompt is then sent to the LLM.
chain = prompt | llm


In [10]:
output = chain.invoke({'query': 'Tell me a joke about the cat'})

In [11]:
print(output.content)

{
    "setup": "Why was the cat sitting on the computer?",
    "punchline": "To keep an eye on the mouse!",
    "rating": 8
}


In [12]:
chain = prompt | llm | parser
output = chain.invoke({'query': 'Tell me a joke about the dogs'})
print(output)

setup='Why do dogs run in circles before lying down?' punchline="Because it's too hard to run in squares!" rating=8


### Parsing with `.with_structured_output()` method
- This method takes a schema as input, which specifies the names, types, and descriptions of the desired output attributes.
- The schema can be specified as a TypedDict class, a JSON Schema, or a Pydantic class.
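
For comparison with the Pydantic class used in this notebook, here is a sketch of the same joke structure written as a JSON Schema dict. The variable name `joke_schema` is an assumption for this example. The call to the model is left commented out because it needs the `llm` object and an API key; according to LangChain's documentation, a dict schema produces a plain dict rather than a Pydantic object.

```python
# JSON Schema counterpart of the Joke model.
# "title" and "description" name and describe the schema as a whole.
joke_schema = {
    "title": "Joke",
    "description": "Joke to tell user",
    "type": "object",
    "properties": {
        "setup": {"type": "string", "description": "The setup of the joke"},
        "punchline": {"type": "string", "description": "The punchline of the joke"},
        "rating": {
            "type": "integer",
            "description": "The rating of the joke is from 1 to 10",
        },
    },
    # "rating" is optional because it is not listed here.
    "required": ["setup", "punchline"],
}

print(sorted(joke_schema["properties"]))

# Assumes the `llm` object created earlier and a valid API key, so it is left commented out:
# structured_llm = llm.with_structured_output(joke_schema)
# output = structured_llm.invoke("Tell me a joke about the cat")
```

A Pydantic class is usually the more convenient choice when you also want validation; the dict form is handy when the schema comes from configuration or another system.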


In [13]:
output = llm.invoke('Tell me a joke about the cat')
print(output.content)

Why was the cat sitting on the computer?

Because it wanted to keep an eye on the mouse!


In [14]:
structured_llm = llm.with_structured_output(Joke)



In [15]:
output = structured_llm.invoke('Tell me a joke about the cat')
print(output)

setup='Why was the cat sitting on the computer?' punchline='To keep an eye on the mouse!' rating=8


In [16]:
from langchain_openai import ChatOpenAI
llm2 = ChatOpenAI(temperature=0, model_name="gpt-4o")
structured_llm = llm2.with_structured_output(Joke)
output = structured_llm.invoke('Tell me a joke about the cat')
print(output)

setup='Why was the cat sitting on the computer?' punchline='Because it wanted to keep an eye on the mouse!' rating=7


### `JSON` Output Parser

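- Output parsers accept a string or a `BaseMessage` as input and can return an arbitrary type.
- `JsonOutputParser` returns a plain Python `dict`. Here the `Joke` model is passed as `pydantic_object` so the parser can build its format instructions from the model's schema.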


In [17]:
from langchain_core.output_parsers import JsonOutputParser

parser = JsonOutputParser(pydantic_object=Joke)
print(parser.get_format_instructions())



The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Joke to tell user", "properties": {"setup": {"description": "The setup of the joke", "title": "Setup", "type": "string"}, "punchline": {"description": "The punchline of the joke", "title": "Punchline", "type": "string"}, "rating": {"anyOf": [{"type": "integer"}, {"type": "null"}], "default": null, "description": "The rating of the joke is from 1 to 10", "title": "Rating"}}, "required": ["setup", "punchline"]}
```


In [18]:
prompt = PromptTemplate(
    template='''
    Answer the user query with a joke. Here is your formatting instruction.
    {format_instruction}

    Query: {query}
    Answer:''',
    input_variables=['query'],
    partial_variables={'format_instruction': parser.get_format_instructions()}
)

chain = prompt | llm
output = chain.invoke({'query': 'Tell me a joke about the cat'})
print(output.content)

{
    "setup": "Why was the cat sitting on the computer?",
    "punchline": "To keep an eye on the mouse!",
    "rating": 8
}


In [19]:
chain = prompt | llm | parser
output = chain.invoke({'query': 'Tell me a joke about the cat'})
print(output)

{'setup': 'Why was the cat sitting on the computer?', 'punchline': 'To keep an eye on the mouse!', 'rating': 8}


### CSV Output Parser

- This output parser can be used when you want to return a list of comma-separated items.
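
To make the idea concrete, the sketch below turns a comma-separated string into a Python list using only the standard library. It is an approximation of what such a parser does, not LangChain's code; the helper name `split_comma_separated` is hypothetical.

```python
import csv
from typing import List


def split_comma_separated(text: str) -> List[str]:
    """Hypothetical helper: split one line of comma-separated text into trimmed items."""
    # csv.reader keeps quoted items such as "a, b" together as a single value.
    reader = csv.reader([text.strip()], skipinitialspace=True)
    return [item.strip() for item in next(reader) if item.strip()]


print(split_comma_separated("NLP, LLM, SEO,keywords"))
# ['NLP', 'LLM', 'SEO', 'keywords']
```

Using `csv.reader` instead of `str.split(",")` is a design choice here: it tolerates both `foo, bar` and `foo,bar` spacing and does not break items that contain a quoted comma.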



In [20]:
# Expected output shape: value1, value2, value3, and so on

from langchain_core.output_parsers import CommaSeparatedListOutputParser

parser = CommaSeparatedListOutputParser()

print(parser.get_format_instructions())

Your response should be a list of comma separated values, eg: `foo, bar, baz` or `foo,bar,baz`


In [21]:
format_instruction = parser.get_format_instructions()

prompt = PromptTemplate(
    template='''
    Answer the user query with a list of values. Here is your formatting instruction.
    {format_instruction}

    Query: {query}
    Answer:''',
    input_variables=['query'],
    partial_variables={'format_instruction': format_instruction}
)

In [22]:
chain = prompt | llm | parser

output = chain.invoke({'query': 'generate my website seo keywords. I have content about the NLP and LLM.'})
print(output)

['NLP', 'LLM', 'website', 'SEO', 'keywords', 'content']


### Datetime Output Parser

- Parses the LLM output into a Python `datetime` object. It raises an error when the LLM output does not match the expected datetime pattern.
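
The failure mode is easy to reproduce with the standard library alone. The sketch below uses the pattern that appears in this parser's format instructions in this section; the helper name `parse_datetime_text` is hypothetical and the code only approximates what the parser does internally.

```python
from datetime import datetime

# Pattern taken from the parser's format instructions in this section.
PATTERN = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_datetime_text(text: str) -> datetime:
    """Hypothetical helper: parse model text with strptime, failing loudly on a mismatch."""
    try:
        return datetime.strptime(text.strip(), PATTERN)
    except ValueError as exc:
        raise ValueError(f"Could not parse datetime string: {text!r}") from exc


# A response in the requested pattern parses cleanly.
print(parse_datetime_text("1492-10-12T00:00:00.000000Z"))

# A response in any other form raises an error.
try:
    parse_datetime_text("October 12, 1492")
except ValueError as err:
    print(err)
```

Because the model may answer in free text despite the instructions, it is worth wrapping the chain call in error handling when you use this parser.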

In [23]:
from langchain.output_parsers import DatetimeOutputParser

parser = DatetimeOutputParser()

format_instruction = parser.get_format_instructions()
print(format_instruction)

Write a datetime string that matches the following pattern: '%Y-%m-%dT%H:%M:%S.%fZ'.

Examples: 0768-03-18T06:37:13.749146Z, 0740-10-31T18:24:53.376274Z, 0524-08-22T15:37:50.952225Z

Return ONLY this string, no other words!


In [24]:
prompt = PromptTemplate(
    template='''
    Answer the user query with a datetime. Here is your formatting instruction.
    {format_instruction}

    Query: {query}
    Answer:''',
    input_variables=['query'],
    partial_variables={'format_instruction': format_instruction}
)

In [25]:
chain = prompt | llm | parser

In [26]:
output = chain.invoke({'query': 'when the America got discovered?'})

print(output)

1492-10-12 00:00:00
