# Constrained generation to guarantee syntactic correctness

::: {.callout-note title="Motivation"}
If we want to generate output that is structured in a specific way, we can use various techniques to 

- make the extraction more efficient (by automatically adding the "obvious" tokens)
- make the generation guaranteed to be syntactically correct
- make the generation sometimes more semantically correct, too 
:::
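
To make the first two points concrete, here is a toy sketch of constrained decoding. It is not how any particular library implements it: the vocabulary, the step-wise "grammar" in `ALLOWED`, and the `fake_logits` function are all made up for illustration. Tokens that would break the structure are masked out, and steps with only one valid token are filled in without calling the model.

```python
import math

# Toy "grammar": the output must be exactly {"n": <digit>}.
# ALLOWED[i] lists the tokens that keep the output valid at step i.
ALLOWED = [
    ['{"n": '],                    # forced prefix: only one valid token
    [str(d) for d in range(10)],   # free choice among the digits
    ['}'],                         # forced suffix
]
VOCAB = ['{"n": ', '}', 'hello', *[str(d) for d in range(10)]]


def fake_logits(step: int) -> dict[str, float]:
    """Stand-in for a language model: returns a score for every token."""
    scores = {tok: 0.0 for tok in VOCAB}
    scores['hello'] = 5.0  # the "model" prefers an invalid token ...
    scores['7'] = 2.0      # ... and, among the valid ones, the digit 7
    return scores


def constrained_decode() -> tuple[str, int]:
    """Decode step by step and return the text and the number of model calls."""
    output, model_calls = [], 0
    for step, allowed in enumerate(ALLOWED):
        if len(allowed) == 1:
            # Only one valid continuation: append it without calling the model.
            output.append(allowed[0])
            continue
        logits = fake_logits(step)
        model_calls += 1
        # Mask: tokens outside the allowed set get a score of -inf.
        masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
        output.append(max(masked, key=masked.get))
    return "".join(output), model_calls


text, calls = constrained_decode()
print(text, calls)
```

Although the stand-in model scores the invalid token `hello` highest, the result is still valid JSON, and the model is only queried for the one step where there is a real choice.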

To enable constrained decoding, we will use one of the most popular packages for this task, [`instructor`](https://jxnl.github.io/instructor/).
It is built on [`pydantic`](https://docs.pydantic.dev/) and can leverage the function-calling and JSON modes of the OpenAI API as well as other constrained sampling approaches.

In [1]:
from pydantic import BaseModel, Field
from typing import List, Optional, Literal

import instructor
from openai import OpenAI

from dotenv import load_dotenv
load_dotenv('../.env', override=True)

True

## Defining a data schema

For most constrained generation tasks, we need to define a data schema in a programmatic way.
The most common way to do so is to use `pydantic` models, i.e., classes that derive from `BaseModel`.
Here is an example of a simple data schema for a recipe:

```python
from typing import List

from pydantic import BaseModel

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
```
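
As a small sketch of what such a schema buys us (assuming `pydantic` version 2, and using made-up JSON strings in place of real model output), we can parse a response into a typed object and see that a malformed response is rejected:

```python
from typing import List

from pydantic import BaseModel, ValidationError


class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]


# A well-formed (made-up) response parses into a typed object.
good = '{"title": "Pancakes", "ingredients": ["flour", "milk"], "instructions": ["mix", "fry"]}'
recipe = Recipe.model_validate_json(good)
print(recipe.ingredients)

# A response with a wrong type and a missing field is rejected.
bad = '{"title": "Pancakes", "ingredients": "flour"}'
try:
    Recipe.model_validate_json(bad)
    errors = []
except ValidationError as exc:
    errors = exc.errors()
    # Each error records which field failed validation.
    print(sorted(err["loc"][0] for err in errors))
```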

This schema can also be extended with field descriptions or with constraints on the allowed values. For example, we could add a field for the number of servings that only accepts positive integers, and a field that only accepts one of a fixed set of strings.

```python
from pydantic import BaseModel, Field
from typing import Literal, List

class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
    servings: int = Field(..., gt=0, description="The number of servings for this recipe")
    rating: Literal["easy", "medium", "hard"] = Field("easy", description="The difficulty level of this recipe")
```
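
The descriptions and constraints become part of the JSON schema that can be generated from the model, and the constraints are checked whenever an object is created. A minimal sketch, assuming `pydantic` version 2 (the recipe values are made up):

```python
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError


class Recipe(BaseModel):
    title: str
    ingredients: List[str]
    instructions: List[str]
    servings: int = Field(..., gt=0, description="The number of servings for this recipe")
    rating: Literal["easy", "medium", "hard"] = Field("easy", description="The difficulty level of this recipe")


# The JSON schema carries the descriptions and constraints.
schema = Recipe.model_json_schema()
print(schema["properties"]["servings"])
print(schema["properties"]["rating"]["enum"])

# Constraint violations are caught at validation time.
try:
    Recipe(title="Soup", ingredients=["water"], instructions=["boil"], servings=0)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```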

If we want to extract copolymerization reactions, a data schema could look like this:

In [2]:
class CopolymerizationReaction(BaseModel):
    temperature: float = Field(..., title="Temperature", description="Temperature at which the reaction is carried out")
    temperature_unit: Literal["C", "K"] = Field(..., title="Temperature unit", description="Unit of temperature")
    solvent: Optional[str] = Field(None, title="Solvent", description="Solvent used in the reaction. If bulk polymerization was performed, this field should be left empty")
    initiator: Optional[str] = Field(None, title="Initiator", description="Initiator used in the reaction")
    monomers: List[str] = Field(..., title="Monomers", description="Names of the monomers used in the reaction", min_length=2, max_length=2)
    reactivity_ratios: List[float] = Field(..., title="Reactivity ratios", description="Reactivity ratios of the monomers", min_length=2, max_length=2)
    reactivity_ratios_confidence_intervals: List[float] = Field(..., title="Reactivity ratios confidence intervals", description="Estimated error for the reactivity ratios of the monomers", min_length=2, max_length=2)
    polymerization_type: str = Field(..., title="Polymerization type", description="Type of polymerization (e.g. free radical, anionic, cationic)")
    determination_method: str = Field(..., title="Determination method", description="Method used to determine the reactivity ratios")
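
To illustrate the length constraints on the list fields, here is a sketch with a reduced, hypothetical model (`ReactivityPair` is not part of the schema above, and the values are placeholders). It assumes `pydantic` version 2, where list length limits are spelled `min_length` and `max_length`.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class ReactivityPair(BaseModel):
    """Hypothetical, reduced version of the schema above (illustration only)."""
    monomers: List[str] = Field(..., min_length=2, max_length=2)
    reactivity_ratios: List[float] = Field(..., min_length=2, max_length=2)


# Exactly two monomers and two ratios: accepted.
ok = ReactivityPair(monomers=["monomer A", "monomer B"], reactivity_ratios=[0.5, 1.5])
print(ok)

# Only one monomer: rejected, because a copolymerization needs two.
try:
    ReactivityPair(monomers=["monomer A"], reactivity_ratios=[0.5, 1.5])
    error_types = []
except ValidationError as exc:
    error_types = [err["type"] for err in exc.errors()]
print(error_types)
```

Note that such a schema only checks the structure: it cannot verify that the first ratio really belongs to the first monomer.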

With this schema, we can now use `instructor` to "patch" the OpenAI API client to ensure that our output fulfils the schema.

In [3]:
client = instructor.patch(
    OpenAI(), mode=instructor.Mode.MD_JSON
)
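
As far as we understand it, in `MD_JSON` mode the model is asked to return its JSON inside a Markdown code block, which is then extracted and validated against the schema. The following is a rough, simplified sketch of that idea and not `instructor`'s actual implementation: `extract_json_block`, the reduced `Reaction` model, and the response string are all hypothetical.

```python
import re

from pydantic import BaseModel

FENCE = "`" * 3  # three backticks, built this way to keep the example readable


class Reaction(BaseModel):
    """Hypothetical, reduced schema (illustration only)."""
    temperature: float
    temperature_unit: str


def extract_json_block(text: str) -> str:
    """Return the content of the first fenced block, or the raw text if there is none."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else text.strip()


# Made-up model response with the JSON wrapped in a Markdown code block.
response = f'Here is the result:\n{FENCE}json\n{{"temperature": 60, "temperature_unit": "C"}}\n{FENCE}'
reaction = Reaction.model_validate_json(extract_json_block(response))
print(reaction)
```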

In this case, we will use PDF files, converted to images, as input for the model. To perform this conversion, we import some utilities.

In [4]:
from pdf2image import convert_from_path
from utils import process_image, get_prompt_vision_model

In [6]:
filepath = 'example.pdf'
pdf_images = convert_from_path(filepath)

images_base64 = [process_image(image, 2048, 'images', filepath, j)[0] for j, image in enumerate(pdf_images)]
images = get_prompt_vision_model(images_base64=images_base64)
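
The helpers `process_image` and `get_prompt_vision_model` come from a local `utils` module that is not shown here. We assume that they base64-encode each page image and wrap it in the content format that the OpenAI chat API accepts for images. A minimal sketch of that assumed behaviour, with hypothetical function names and placeholder bytes instead of real page images:

```python
import base64


def encode_image_bytes(data: bytes) -> str:
    """Base64-encode raw image bytes (hypothetical stand-in for part of process_image)."""
    return base64.b64encode(data).decode("utf-8")


def build_image_content(images_base64: list[str], mime: str = "image/png") -> list[dict]:
    """Build one image_url content part per page, using a base64 data URL."""
    return [
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}
        for b64 in images_base64
    ]


# Placeholder bytes instead of real page images, to keep the sketch self-contained.
fake_pages = [b"page-1-bytes", b"page-2-bytes"]
content = build_image_content([encode_image_bytes(p) for p in fake_pages])
print(content[0]["type"], len(content))
```

A list of this form can be passed as the `content` of a user message.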


In [9]:
completion = client.chat.completions.create(
    model="gpt-4-vision-preview",
    response_model=List[CopolymerizationReaction],
    messages=[
        {
            "role": "system",
            "content": """You are a scientific assistant, extracting important information about polymerization conditions.
Extract only data which you are 100% confident about. If you are unsure about any information (e.g. because it is not legible), please leave it blank.
We must ensure that the data is accurate and reliable.""",
        },
        {"role": "user", "content": [*images]},
    ],
    temperature=0,
)
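
Because the result is a list of validated objects rather than free text, it can be post-processed directly. The sketch below assumes this and uses a reduced, hypothetical `Reaction` model with made-up records in place of a real API response. The helper `to_kelvin` is also only an illustration.

```python
from typing import List, Literal, Optional

from pydantic import BaseModel


class Reaction(BaseModel):
    """Hypothetical, reduced stand-in for CopolymerizationReaction."""
    temperature: float
    temperature_unit: Literal["C", "K"]
    solvent: Optional[str] = None


def to_kelvin(reaction: Reaction) -> float:
    """Return the reaction temperature in kelvin."""
    if reaction.temperature_unit == "C":
        return reaction.temperature + 273.15
    return reaction.temperature


# Made-up records standing in for the list returned by the patched client.
completion: List[Reaction] = [
    Reaction(temperature=60, temperature_unit="C", solvent="toluene"),
    Reaction(temperature=333.15, temperature_unit="K"),
]

# Convert each record to a plain dict and add a normalized temperature column.
rows = [{**r.model_dump(), "temperature_K": to_kelvin(r)} for r in completion]
for row in rows:
    print(row)
```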