# Structured Output with Pydantic

In this notebook you will drastically upgrade your ability to generate structured output through a combination of Pydantic classes and LangChain's `JsonOutputParser`.

---

## Objectives

By the time you complete this notebook you will:

- Understand the limitations of our current approach to generating structured data.
- Learn to create class-driven schemas for structured data generation using Pydantic.

---

## Imports

In [1]:
!pip install groq langchain-groq

Collecting groq
  Downloading groq-0.31.1-py3-none-any.whl.metadata (16 kB)
Collecting langchain-groq
  Downloading langchain_groq-0.3.8-py3-none-any.whl.metadata (2.6 kB)
Downloading groq-0.31.1-py3-none-any.whl (134 kB)
Downloading langchain_groq-0.3.8-py3-none-any.whl (16 kB)
Installing collected packages: groq, langchain-groq
Successfully installed groq-0.31.1 langchain-groq-0.3.8


In [2]:
import os
import getpass

os.environ["GROQ_API_KEY"] = getpass.getpass("GROQ API Key:\n")

GROQ API Key:
··········


In [3]:
from langchain_groq import ChatGroq
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.runnables import RunnableLambda
from langchain_core.pydantic_v1 import BaseModel, Field


Note: importing from `langchain_core.pydantic_v1` triggers a LangChain deprecation warning. The warning recommends replacing imports like `from langchain_core.pydantic_v1 import BaseModel` with `from pydantic import BaseModel`, or with the v1 compatibility namespace, `from pydantic.v1 import BaseModel`, if you are working in a code base that has not been fully upgraded to Pydantic 2 yet.


---

## Create a Model Instance

In [4]:
llm = ChatGroq(model_name="llama-3.3-70b-versatile", temperature=0)

## Limitations of Our Current Structured Data Approach

Your implementation might have been slightly different, but in the solution to the previous notebook's exercise we had some success with the following prompt template to generate a JSON object containing book details.

In [5]:
book_template = ChatPromptTemplate.from_template('''\
Make a JSON object representing the details of the following book: {book_title}. \
It should have fields for:
- The title of the book.
- The author of the book.
- The year the book was originally published.

Only return the JSON. Never return non-JSON text including backtick wrappers around the JSON.''')

Using this template, our solution implementation generated the following list of book details:

```python
[{'title': 'Dune', 'author': 'Frank Herbert', 'year_of_publication': 1965},
 {'title': 'Neuromancer', 'author': 'William Gibson', 'year': 1984},
 {'title': 'Snow Crash', 'author': 'Neal Stephenson', 'yearPublished': '1992'},
 {'title': 'The Left Hand of Darkness',
  'author': None,
  'publication_year': None},
 {'title': 'Foundation', 'author': 'Isaac Asimov', 'year': '1951'}]
 ```

The result was well-formatted, but looking more carefully at it, we can see it has some issues:

- The key names are not consistent across objects: the year appears under `'year_of_publication'`, `'year'`, `'yearPublished'`, and `'publication_year'`.
- The value types are not consistent either: the year has been generated at times as a string (`'1992'`), at times as an int (`1984`), and at times as `None`.
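Both issues can be confirmed programmatically. The following is a minimal sketch that copies the list shown above into a hypothetical variable named `books` and collects every key name and every type used for the year value. The helper `year_value` is ours, for illustration only.

```python
# The list of book details shown above, copied verbatim.
books = [
    {'title': 'Dune', 'author': 'Frank Herbert', 'year_of_publication': 1965},
    {'title': 'Neuromancer', 'author': 'William Gibson', 'year': 1984},
    {'title': 'Snow Crash', 'author': 'Neal Stephenson', 'yearPublished': '1992'},
    {'title': 'The Left Hand of Darkness', 'author': None, 'publication_year': None},
    {'title': 'Foundation', 'author': 'Isaac Asimov', 'year': '1951'},
]


def year_value(book):
    """Return the value stored under whichever key is not 'title' or 'author'."""
    (key,) = [k for k in book if k not in ('title', 'author')]
    return book[key]


# Every key name that appears anywhere in the results.
all_keys = sorted({key for book in books for key in book})

# The name of every Python type used for the year value.
year_types = sorted({type(year_value(book)).__name__ for book in books})

print(all_keys)    # six distinct keys, four of them for the year
print(year_types)  # ['NoneType', 'int', 'str']
```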

At this point in the workshop, knowing what you already do, you're probably already full of ideas about how to address each of these. Perhaps the following ideas come to mind:

- Be more specific in our prompt about the names of the keys, the types of the values, and what to do when the LLM can't generate data for a field.
- Try including a system message to more strongly reinforce how we want the LLM to generate responses.
- Provide few-shot examples to help the model understand all the specifics of what it should and shouldn't do.

If you're thinking along these lines, that's really fantastic, and you're correct about approaching the problem this way.

But let's consider some of the ways that our task might get even more complicated:

- What if we wanted to templatize more of the prompt, for example, which fields to include?
- What if our data structure gets far more complicated?
- Since we are generating data, what if we wanted to capture a definition of our data type for use elsewhere?

Again, knowing what you already know, you can likely think of viable ways to accomplish each of these, though it's easy to imagine it getting rather complicated quickly. Luckily for us, LangChain ships with a variety of tools to help us generate structured data, and using them will greatly simplify our application code and allow us to perform more complicated structured data generation tasks more easily.

---

## Structured Data as a Class

Even before we get to LangChain-specific tools to help us generate structured data, let's take a step back and think about how we might articulate a data structure in Python if we weren't working in the context of LLMs. One very sensible approach would be to create a Python class.

Here we define a `Book` class that captures what we hoped to describe in our prompt template above.

In [None]:
class Book:
    """Information about a book."""

    def __init__(self, title, author, year_of_publication):
        self.title = title
        self.author = author
        self.year_of_publication = year_of_publication

However, there are some details we discussed above about our structured data that this class does not yet capture, like the type of the value for each field. Also, unlike our actual prompt template, there is no description, aside from its name, about what each field ought to contain.

Let's improve on this slightly by rewriting the class to include Python type hints, and some comments articulating the intended value of each field.

In [None]:
class Book:
    """Information about a book."""

    def __init__(self, title: str, author: str, year_of_publication: int):
        self.title: str = title  # The title of the book
        self.author: str = author  # The author of the book
        self.year_of_publication: int = year_of_publication  # The year the book was published

It's still missing aspects like default values and data validation, but for the most part, if we had a way to convey the information contained in the above class, including its comments, in a prompt, we might be in pretty good shape.
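To see why the missing data validation matters, recall that Python does not enforce type hints at runtime. Here is a minimal sketch, repeating the class above so the block runs on its own:

```python
class Book:
    """Information about a book."""

    def __init__(self, title: str, author: str, year_of_publication: int):
        self.title: str = title  # The title of the book
        self.author: str = author  # The author of the book
        self.year_of_publication: int = year_of_publication  # The year the book was published


# The hint says `int`, but nothing stops us from passing a string.
book = Book("Dune", "Frank Herbert", "1965")

print(type(book.year_of_publication).__name__)  # str
```

The class accepts the string without complaint, which is exactly the kind of inconsistency we saw in the generated book details.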

---

## Pydantic

In fact, LangChain provides us with exactly what we need to convey the information contained in classes to prompts. In doing so we have a powerful tool that enables us to articulate the structure of the data we want generated in a class, and then let LangChain do some of the more tedious work of conveying the information we capture in the class to a prompt.

In order to do this however, we need to use Pydantic classes instead of vanilla Python classes.

If you're unfamiliar, [Pydantic](https://docs.pydantic.dev/latest/) is "the most widely used data validation library for Python." If you're not using Pydantic in your object-oriented Python code, there's a good chance you'll enjoy learning how to use it.

For our purposes, we are only going to be using Pydantic to construct straightforward classes so that LangChain can then work with our class definitions to create prompts that will assist us in generating structured data.

The relevant Pydantic functionality has been integrated into LangChain, so to begin working with Pydantic classes, we need to import the following.

In [6]:
from langchain_core.pydantic_v1 import BaseModel, Field

Having imported `BaseModel` and `Field` we are now able to rewrite our `Book` class using Pydantic as follows.

In [7]:
class Book(BaseModel):
    """Information about a book."""

    title: str = Field(description="The title of the book")
    author: str = Field(description="The author of the book")
    year_of_publication: str = Field(description="The year the book was published")

As you can see, when we want to construct a Pydantic class, we create a class that inherits from `BaseModel` as we're doing above.

Rather than creating an `__init__` method, we can supply the class's fields at the top level of the class definition by defining them with `Field`, which, as a convenience, allows us to provide a `description` argument about the intended use of the field.
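Unlike the plain Python class, a Pydantic class validates data when an instance is created. The sketch below assumes the standalone `pydantic` package is installed and imports from it directly, rather than from `langchain_core.pydantic_v1`, so that it runs without LangChain. The variable names are illustrative.

```python
from pydantic import BaseModel, Field, ValidationError


class Book(BaseModel):
    """Information about a book."""

    title: str = Field(description="The title of the book")
    author: str = Field(description="The author of the book")
    year_of_publication: str = Field(description="The year the book was published")


# All required fields are supplied, so the instance is created normally.
book = Book(title="Dune", author="Frank Herbert", year_of_publication="1965")
print(book.title, book.year_of_publication)

# A required field is missing, so Pydantic raises a ValidationError.
try:
    Book(title="Dune")
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True

print(missing_field_rejected)
```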

---

## From Class to Formatting Instructions

In order to take the structure defined in our Pydantic `Book` class and generate a JSON object, we need a prompt to provide the model. LangChain's `JsonOutputParser` will provide us with just that.

First we'll import the `JsonOutputParser` class.

In [8]:
from langchain_core.output_parsers import JsonOutputParser

Just like with the `StrOutputParser` and `SimpleJsonOutputParser` parsers that we've used previously, we need to create an instance of the parser to use in our chain.

Unlike the parsers we've worked with earlier, however, `JsonOutputParser` accepts a `pydantic_object` argument: a Pydantic class describing the structure we want the JSON to have. Here we'll pass in our Pydantic `Book`.

In [9]:
parser = JsonOutputParser(pydantic_object=Book)

Instances of `JsonOutputParser` have a `get_format_instructions` method, which creates explicit instructions for formatting the JSON based on the provided Pydantic class.

In [10]:
format_instructions = parser.get_format_instructions()

In [11]:
print(format_instructions)

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"description": "Information about a book.", "properties": {"title": {"title": "Title", "description": "The title of the book", "type": "string"}, "author": {"title": "Author", "description": "The author of the book", "type": "string"}, "year_of_publication": {"title": "Year Of Publication", "description": "The year the book was published", "type": "string"}}, "required": ["title", "author", "year_of_publication"]}
```


Having the parser generate these detailed formatting instructions for us is a really fantastic convenience.
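To make the role of the schema concrete, here is a minimal sketch of checking a parsed object against it by hand. The `conforms` helper and the `PYTHON_TYPES` table are hypothetical and handle only the `required` and `type` keywords used in this schema. This is not a full JSON Schema validator.

```python
import json

# The output schema printed above (whitespace added for readability).
schema = json.loads('''
{"description": "Information about a book.",
 "properties": {
   "title": {"title": "Title", "description": "The title of the book", "type": "string"},
   "author": {"title": "Author", "description": "The author of the book", "type": "string"},
   "year_of_publication": {"title": "Year Of Publication",
                           "description": "The year the book was published",
                           "type": "string"}},
 "required": ["title", "author", "year_of_publication"]}
''')

# Map the JSON Schema type names used here to Python types.
PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool}


def conforms(obj, schema):
    """Check required keys and simple value types only."""
    for key in schema["required"]:
        if key not in obj:
            return False
    for key, spec in schema["properties"].items():
        if key in obj and not isinstance(obj[key], PYTHON_TYPES[spec["type"]]):
            return False
    return True


good = {"title": "Dune", "author": "Frank Herbert", "year_of_publication": "1965"}
bad = {"title": "Dune", "author": "Frank Herbert", "year": 1965}

print(conforms(good, schema))  # True
print(conforms(bad, schema))   # False
```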

---

## The Importance of Docstrings and Field Descriptions

In the `format_instructions` above you'll notice several `"description"` fields. The top-level `"description"` field states `"Information about a book."`, and the `"description"` field for `"title"` states `"The title of the book"`. If we look again at our Pydantic class definition...

In [None]:
class Book(BaseModel):
    """Information about a book."""

    title: str = Field(description="The title of the book")
    author: str = Field(description="The author of the book")
    year_of_publication: str = Field(description="The year the book was published")

...you'll see that these descriptions were created from the class's docstring (for the top-level description) and from each of the passed-in `description` values (for each of the fields).

These texts are critical for conveying our intent to the LLM. When creating Pydantic classes to be used as formatting tools with LLMs, always take care to provide a meaningful docstring for the entire class, as well as good descriptions for each of its fields.

---

## Using Formatting Instructions in Prompts

Let's leverage the formatting instructions created by `JsonOutputParser` based on the Pydantic `Book` class in a prompt. While we are at it, we might as well also supply a system message to support our intended goal.

In [12]:
template = ChatPromptTemplate.from_messages([
    ("system", "You are an AI that generates JSON and ONLY JSON according to the instructions provided to you."),
    ("human", (
        "Generate JSON about the user input according to the provided format instructions.\n" +
        "Input: {input}\n" +
        "Format instructions {format_instructions}")
    )
])

Next we'll create our chain.

In [13]:
chain = template | llm | parser # Created above with `parser = JsonOutputParser(pydantic_object=Book)`

When we invoke this chain, we'll need to provide an `input`, which in this case should be a book title, as well as `format_instructions`, which we have already obtained from `parser.get_format_instructions()`.

In [14]:
chain.invoke({
    "input": "East of Eden",
    "format_instructions": format_instructions
})

{'title': 'East of Eden',
 'author': 'John Steinbeck',
 'year_of_publication': '1952'}

Since we are going to want to provide different `input` values, but retain the same `format_instructions`, we can partially apply our existing `format_instructions` to the prompt template using the template's `.partial` method.

In [15]:
chain = template.partial(format_instructions=format_instructions) | llm | parser # Created above with `parser = JsonOutputParser(pydantic_object=Book)`
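Conceptually, partial application fixes some of a template's variables ahead of time and leaves the rest to be supplied later. The sketch below illustrates the idea with the standard library's `functools.partial` and a hypothetical `render` function. It is an analogy only, not how LangChain implements `.partial`.

```python
from functools import partial


def render(user_input, format_instructions):
    """Fill in a plain-string version of the human message used above."""
    return (
        "Generate JSON about the user input according to the provided format instructions.\n"
        f"Input: {user_input}\n"
        f"Format instructions {format_instructions}"
    )


# Fix `format_instructions` once; only the user input remains to be supplied.
render_book = partial(render, format_instructions="<format instructions go here>")

prompt_text = render_book("East of Eden")
print(prompt_text)
```

In the same way, once `format_instructions` has been partially applied to the prompt template, each call only needs to supply the book title.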

Let's try our new chain with a batch of books.

In [16]:
book_titles = ["Dune", "Neuromancer", "Snow Crash", "The Left Hand of Darkness", "Foundation"]

In [17]:
chain.batch(book_titles)

[{'title': 'Dune', 'author': 'Frank Herbert', 'year_of_publication': '1965'},
 {'title': 'Neuromancer',
  'author': 'William Gibson',
  'year_of_publication': '1984'},
 {'title': 'Snow Crash',
  'author': 'Neal Stephenson',
  'year_of_publication': '1992'},
 {'title': 'The Left Hand of Darkness',
  'author': 'Ursula K. Le Guin',
  'year_of_publication': '1969'},
 {'title': 'Foundation',
  'author': 'Isaac Asimov',
  'year_of_publication': '1951'}]

Comparing this to the output from the previous notebook (see immediately below), you can see our results are more consistent: every object uses the same key names, every year is a string, and no values are missing.

```python
[{'title': 'Dune', 'author': 'Frank Herbert', 'year_of_publication': 1965},
 {'title': 'Neuromancer', 'author': 'William Gibson', 'year': 1984},
 {'title': 'Snow Crash', 'author': 'Neal Stephenson', 'yearPublished': '1992'},
 {'title': 'The Left Hand of Darkness',
  'author': None,
  'publication_year': None},
 {'title': 'Foundation', 'author': 'Isaac Asimov', 'year': '1951'}]
 ```

---

## Using with_structured_output

As an alternative, improved way to generate structured output, many of LangChain's chat models now support the `with_structured_output` method, which allows us to replace the following...

```python
template = ChatPromptTemplate.from_messages([
    ("system", "You are an AI that generates JSON and only JSON according to the instructions provided to you."),
    ("human", (
        "Generate JSON about the user input according to the provided format instructions.\n" +
        "Input: {input}\n" +
        "Format instructions {format_instructions}")
    )
])

chain = template.partial(format_instructions=format_instructions) | llm | JsonOutputParser(pydantic_object=Book)
```

... with:

```python
llm_structured = llm.with_structured_output(Book)
```

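In the example just shown, `llm_structured` can be invoked, batched, or streamed just like `chain`, but the syntax is much more concise. One difference to keep in mind: when given a Pydantic class such as `Book`, `with_structured_output` returns instances of that class rather than plain dictionaries.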



---

## Exercise: Leverage Pydantic for Structured Data Generation

For this exercise you are going to generate a batch of structured data for the following cities.

In [18]:
city_names = ['Tokyo', 'Busan', 'Cairo', 'Perth']

For each of these cities you should create a JSON blob that contains information about the city, including:
- The name of the city.
- The country that the city is located within.
- Whether or not the city is the capital city of the country it is located in.
- The population of the city.

Feel free to check out the Solution below if you get stuck.

### Your Work Here

### Solution

In [21]:
class City(BaseModel):
    """Information about a city."""

    name: str = Field(description="The name of the city")
    country: str = Field(description="The country the city is located in")
    capital: bool = Field(description="Is the city the capital of the country it is located in")
    population: int = Field(description="The population of the city")

In [19]:
template = ChatPromptTemplate.from_messages([
    ("system", "You are an AI that generates JSON and only JSON according to the instructions provided to you."),
    ("human", (
        "Generate JSON about the user input according to the provided format instructions.\n" +
        "Input: {input}\n" +
        "Format instructions {format_instructions}")
    )
])

In [20]:
parser = JsonOutputParser(pydantic_object=City)

In [None]:
template_with_format_instructions = template.partial(format_instructions=parser.get_format_instructions())

In [None]:
chain = template_with_format_instructions | llm | parser

In [None]:
chain.batch(city_names)

[{'name': 'Tokyo',
  'country': 'Japan',
  'capital': True,
  'population': 13969100},
 {'name': 'Busan',
  'country': 'South Korea',
  'capital': False,
  'population': 3420000},
 {'name': 'Cairo',
  'country': 'Egypt',
  'capital': True,
  'population': 10230000},
 {'name': 'Perth',
  'country': 'Australia',
  'capital': False,
  'population': 2045553}]
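The results shown above are plain Python dictionaries, so it can be worth confirming that each one has the keys and value types declared in `City`. Here is a minimal sketch that copies the output above into a hypothetical `cities` variable and uses a hand-written `EXPECTED_TYPES` table that mirrors the class:

```python
# The batch output shown above, copied verbatim.
cities = [
    {'name': 'Tokyo', 'country': 'Japan', 'capital': True, 'population': 13969100},
    {'name': 'Busan', 'country': 'South Korea', 'capital': False, 'population': 3420000},
    {'name': 'Cairo', 'country': 'Egypt', 'capital': True, 'population': 10230000},
    {'name': 'Perth', 'country': 'Australia', 'capital': False, 'population': 2045553},
]

# Field names and types, mirroring the `City` class by hand.
EXPECTED_TYPES = {"name": str, "country": str, "capital": bool, "population": int}


def matches_city(record):
    """Return True if the record has exactly the expected keys and value types."""
    return record.keys() == EXPECTED_TYPES.keys() and all(
        isinstance(record[key], expected) for key, expected in EXPECTED_TYPES.items()
    )


all_valid = all(matches_city(city) for city in cities)
print(all_valid)  # True
```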

---