<img src="https://raw.githubusercontent.com/instill-ai/cookbook/main/images/Logo.png" alt="Instill Logo" width="300"/>

# Generating Structured Output from LLMs

OpenAI recently announced that [**Structured Outputs in the API**](https://openai.com/index/introducing-structured-outputs-in-the-api/) are now generally available. The ability to distill and transform the creative and diverse unstructured outputs of Large Language Models (LLMs) into actionable and reliable structured data represents a huge milestone in the world of Unstructured Data ETL. However, there's more to the story than meets the eye!

Coincidentally, and perhaps ironically, a paper was published the day before this OpenAI announcement, titled [**Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models**](https://arxiv.org/abs/2408.02442v1). It shines a revealing light on the limitations of forcing structured outputs from LLMs. In particular, the authors found that LLMs tend to **struggle with reasoning tasks when they're placed under format restrictions**. Additionally, the _stricter_ these format restrictions are, the _more_ their reasoning performance drops.

### Contents

In this notebook, we will explore this very problem by implementing and benchmarking the current state-of-the-art tools for generating structured outputs from LLMs. The tools and libraries that we will consider are:

1. [**Instructor**](https://python.useinstructor.com) - a Python library, built on top of Pydantic, that lets you generate structured output from LLMs
2. [**Marvin**](https://www.askmarvin.ai) - a Python library for building reliable natural language interfaces
3. [**BAML**](https://www.boundaryml.com) - a domain-specific language to write and test LLM functions
4. [**TypeChat**](https://microsoft.github.io/TypeChat/) - a tool from Microsoft for getting well-typed responses from language models

[Outlines](https://outlines-dev.github.io/outlines/), [JSONformer](https://github.com/1rgs/jsonformer) and [Guidance](https://github.com/guidance-ai/guidance/tree/main) were also considered. However, they were left out of this experiment because they had limited support for remote API calls and failed when integrating with the OpenAI API.

We will then evaluate this problem with the latest OpenAI [Structured Outputs](https://openai.com/index/introducing-structured-outputs-in-the-api/) feature using Versatile Data Pipelines - **💧 Instill VDP** - hosted on **☁️ Instill Cloud**.

### Benchmark Task

The task we will use to test, compare and evaluate the performance of these methods is directly inspired by Figure 1 of the aforementioned [paper](https://arxiv.org/abs/2408.02442v1):

<img src="https://raw.githubusercontent.com/instill-ai/cookbook/main/images/Speak_freely.png" alt="Figure 1" width="390"/>

In our task, we ramp up the complexity by combining a reasoning problem analogous to the one shown above with text summarization. More precisely, we ask the LLM/output structuring tool to summarize the contents of a resume into the following structure:
```Python
name: str
email: str
cost: float
reasoning: str
experience: list[str]
skills: list[str]
```
where `cost` represents the answer to the question:
> John Doe is a freelance software engineer. He charges a base rate of $50 per hour for the first 29 hours of work each week. For any additional hours, he charges 1.7 times his base hourly rate. This week, John worked on a project for 38 hours. How much will John Doe charge his client for the project this week?

and `reasoning` contains the rationale and steps behind the calculated cost. See below for the example resume we will use, as well as the correct `cost` response.

In [1]:
resume = """
    John Doe
    1234 Elm Street 
    Springfield, IL 62701
    (123) 456-7890
    Email: john.doe@gmail.com

    Objective: To obtain a position as a software engineer.

    Education:
    Bachelor of Science in Computer Science
    University of Illinois at Urbana-Champaign
    May 2020 - May 2024

    Experience:
    Software Engineer Intern
    Google
    May 2022 - August 2022
    - Worked on the Google Search team
    - Developed new features for the search engine
    - Wrote code in Python and C++

    Software Engineer Intern
    Facebook
    May 2021 - August 2021
    - Worked on the Facebook Messenger team
    - Developed new features for the messenger app
    - Wrote code in Python and Java
    """

question = """
    Question:
    John Doe is a freelance software engineer. He charges a 
    base rate of $50 per hour for the first 29 hours of work 
    each week. For any additional hours, he charges 1.7 
    times his base hourly rate. This week, John worked on a 
    project for 38 hours. How much will John Doe charge his 
    client for the project this week?
    """

context = resume + question

In [2]:
true_answer = (50*29) + (1.7*50*9)
print(f'Correct Answer: ${true_answer}')

Correct Answer: $2215.0
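
The same calculation can be written as a small function that makes the two pricing tiers explicit. This is only a minimal sketch: the name `weekly_charge` and its parameters are illustrative and do not appear in the original notebook.

```python
def weekly_charge(hours, base_rate=50.0, threshold=29, multiplier=1.7):
    """Charge for one week: base rate up to `threshold` hours, a multiplied rate beyond it."""
    regular_hours = min(hours, threshold)
    extra_hours = max(hours - threshold, 0)
    return regular_hours * base_rate + extra_hours * base_rate * multiplier


# 29 hours at $50/hour plus 9 hours at 1.7 * $50/hour
print(f"Charge for 38 hours: ${weekly_charge(38)}")
```

For 38 hours the result should agree with `true_answer` above, and for any week of 29 hours or fewer the function reduces to `hours * base_rate`.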


### Setup

To execute all of the code in this notebook, you’ll need to create a free **☁️ Instill Cloud** account and set up an API Token. To create your account, please refer to our [quickstart guide](https://www.instill.tech/docs/quickstart). For generating your API Token, consult the [API Token Management](https://www.instill.tech/docs/core/token) page.

**This will give you access to 10,000 free credits per month that you can use to make API calls with third-party AI vendors. Please see our [documentation](https://www.instill.tech/docs/cloud/credit) for further details.**

Whilst you can run all **💧 Instill VDP** pipelines using your 10,000 free monthly credits, please note that you will need a valid OpenAI API key to run the structured LLM output evaluations (i.e. for Instructor, Marvin, BAML and TypeChat). Once you have created one via the OpenAI website, please set it as an environment variable by running the following line, replacing `**********` with your OpenAI API key.

In [None]:
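# `!export` runs in a separate subshell and would not reach the notebook kernel;
# the `%env` magic sets the variable for the current session instead.
%env OPENAI_API_KEY=**********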
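
Shell commands prefixed with `!` run in a separate subshell, so variables exported there do not reach the notebook kernel. Before calling any of the libraries below, it can be worth confirming that the key is visible to the Python process. The following is a minimal sketch using only the standard library; the helper name `require_env` is hypothetical.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set in this Python process")
    return value


# Setting the variable from Python makes it visible to the kernel.
# The value below is a placeholder: replace it with your own key.
os.environ.setdefault("OPENAI_API_KEY", "**********")
api_key = require_env("OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised deep inside one of the client libraries.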

We will now install the latest Instill Python SDK, import the required libraries, and configure the SDK with a valid API token.

In [3]:
!pip install instill-sdk --quiet

In [4]:
from instill.clients import InstillClient
from instill.configuration import global_config
from google.protobuf.struct_pb2 import Struct
from google.protobuf.json_format import MessageToDict
from IPython.display import IFrame
import os

global_config.set_default(
    url="api.instill.tech",
    token="YOUR_INSTILL_API_TOKEN", # <-- Insert your Instill API token here
    secure=True,
)

### 0. OpenAI Baseline Performance

In [5]:
# TODO: Add OpenAI Baseline test with VDP on Instill Cloud

### 1. Instructor

[Instructor](https://python.useinstructor.com) is a Python library, built on top of Pydantic, that lets you generate structured output from LLMs. Here is how you can easily get started with it, and test its performance on the benchmark task.

In [6]:
!pip install -U instructor --quiet

In [7]:
import instructor
from pydantic import BaseModel
from openai import OpenAI


class DataModel(BaseModel):
    name: str
    email: str
    cost: float
    reasoning: str
    experience: list[str]
    skills: list[str]


client = instructor.from_openai(OpenAI())


template = """
    Extract from this content:
    {resume}
    Answer the question, storing the result in cost, and the step-by-step reasoning in reasoning.
    """

prompt = template.format(resume=context)

In [8]:
instructor_response = client.chat.completions.create(
    model="gpt-4",
    response_model=DataModel,
    messages=[{"role": "user", "content": prompt}],
)

instructor_dict = instructor_response.model_dump()
instructor_dict

2024-08-13 16:29:13,269.269 INFO     HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


{'name': 'John Doe',
 'email': 'john.doe@gmail.com',
 'cost': 1985.0,
 'reasoning': 'John charges $50 per hour for the first 29 hours. This amounts to $50 * 29 = $1450. For the remaining 9 hours (38-29), he charges 1.7 times his base rate, which is $50 * 1.7 = $85 per hour. This amounts to $85 * 9 = $765. Therefore, the total cost for this week would be $1450 + $765 = $1985.',
 'experience': ['Software Engineer Intern at Google from May 2022 to August 2022. Worked on the Google Search team, developed new features for the search engine, wrote code in Python and C++.',
  'Software Engineer Intern at Facebook from May 2021 to August 2021. Worked on the Facebook Messenger team, developed new features for the messenger app, wrote code in Python and Java.'],
 'skills': ['Python', 'Java', 'C++']}

In [9]:
print(f'Instructor error: ${abs(instructor_dict["cost"]-true_answer)}')

Instructor error: $230.0
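
In the Instructor response above, every intermediate step of the `reasoning` is correct, but the final addition is not. One way to catch this automatically is to re-evaluate each arithmetic claim found in the text. The sketch below is an illustrative check, not part of Instructor: the pattern only handles simple `$a + b = $c` and `$a * b = $c` claims, and the names `CLAIM` and `check_claims` are hypothetical.

```python
import math
import re

# Matches claims such as "$50 * 29 = $1450" or "$1450 + $765 = $1985".
CLAIM = re.compile(
    r"\$(\d+(?:\.\d+)?)\s*([+*])\s*\$?(\d+(?:\.\d+)?)\s*=\s*\$(\d+(?:\.\d+)?)"
)


def check_claims(reasoning):
    """Return (claim_text, expected_value) for every arithmetic claim that does not hold."""
    failures = []
    for match in CLAIM.finditer(reasoning):
        left, op, right, stated = match.groups()
        if op == "+":
            expected = float(left) + float(right)
        else:
            expected = float(left) * float(right)
        if not math.isclose(expected, float(stated)):
            failures.append((match.group(0), expected))
    return failures


# Final sentences of the Instructor reasoning shown above.
reasoning = (
    "This amounts to $85 * 9 = $765. Therefore, the total cost for this week "
    "would be $1450 + $765 = $1985."
)
print(check_claims(reasoning))
```

On this excerpt only the final addition should be flagged, since 1450 + 765 is 2215 rather than 1985.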


### 2. Marvin

[Marvin](https://www.askmarvin.ai) is a Python library for building reliable natural language interfaces. Here is how you can easily get started with it, and test its performance on the benchmark task.

In [10]:
!pip install marvin --quiet

In [11]:
import marvin


@marvin.fn
def process(
    resume: str = resume,
    question: str = question,
) -> DataModel:
    """
    Extract content from `resume`.
    Answer the `question`, storing the result in cost, and the step-by-step reasoning in reasoning.
    """

In [12]:
marvin_response = process(resume, question)

marvin_dict = marvin_response.model_dump()
marvin_dict

2024-08-13 16:29:20,717.717 INFO     HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


{'name': 'John Doe',
 'email': 'john.doe@gmail.com',
 'cost': 2115.0,
 'reasoning': 'John charges $50 for the first 29 hours of work and 1.7 times his base rate for any additional hours. This week, he worked 38 hours. \n\nStep-by-step reasoning: \n1. Calculate the cost for the first 29 hours: \n29 hours * $50/hour = $1450.\n2. Calculate the additional hours worked: \n38 hours - 29 hours = 9 hours.\n3. Calculate the cost for the additional hours at 1.7 times the base rate: \n9 hours * ($50/hour * 1.7) = 9 hours * $85/hour = $765.\n4. Add the two amounts to get the total charge: \n$1450 + $765 = $2215.',
 'experience': ['Software Engineer Intern at Google from May 2022 to August 2022',
  'Software Engineer Intern at Facebook from May 2021 to August 2021'],
 'skills': ['Python', 'C++', 'Java']}

In [13]:
print(f'Marvin error: ${abs(marvin_dict["cost"]-true_answer)}')

Marvin error: $100.0
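
In the Marvin response above, the `reasoning` text arrives at $2215 while the `cost` field holds 2115.0. A simple consistency check between the two fields can surface this kind of mismatch. The following is a minimal sketch that assumes the last dollar amount in the reasoning is the final answer; the helper names `last_dollar_amount` and `is_consistent` are hypothetical and not part of Marvin.

```python
import re


def last_dollar_amount(text):
    """Return the last dollar amount mentioned in `text` as a float, or None if there is none."""
    amounts = re.findall(r"\$(\d[\d,]*(?:\.\d+)?)", text)
    return float(amounts[-1].replace(",", "")) if amounts else None


def is_consistent(response, tolerance=0.01):
    """True if the `cost` field matches the final amount stated in `reasoning`."""
    stated = last_dollar_amount(response["reasoning"])
    return stated is not None and abs(stated - float(response["cost"])) <= tolerance


# Values copied from the Marvin output above (reasoning shortened to its last step).
sample = {
    "cost": 2115.0,
    "reasoning": "Add the two amounts to get the total charge: $1450 + $765 = $2215.",
}
print(is_consistent(sample))
```

A failed check does not say which of the two fields is wrong, only that the structured value and the free-text rationale disagree.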


### 3. BAML

[BAML](https://www.boundaryml.com) is a domain-specific language, developed by Boundary, for writing and testing LLM functions. Here is how you can easily get started with it, and test its performance on the benchmark task.

In [14]:
!pip install baml-py
!baml-cli init

Traceback (most recent call last):
  File "/opt/miniconda3/envs/instill/bin/baml-cli", line 8, in <module>
    sys.exit(invoke_runtime_cli())
             ^^^^^^^^^^^^^^^^^^^^
baml_py.BamlError: Destination directory already contains a baml_src directory: ./baml_src


In [15]:
file_path = 'baml_src/resume.baml'

with open(file_path, 'r') as file:
    file_data = file.read()

updated_file_data = file_data.replace(
    '// Defining a data model.\nclass Resume {\n  name string\n  email string\n  experience string[]\n  skills string[]\n}',
    '// Defining a data model.\nclass Resume {\n  name string\n  email string\n  cost float\n  reasoning string\n  experience string[]\n  skills string[]\n}'
    ).replace(
    'Extract from this content:\n    {{ resume }}\n\n',
    'Extract from this content:\n    {{ resume }}\n    Answer the question, storing the result in cost, and the step-by-step reasoning in reasoning.\n\n'
    )

with open(file_path, 'w') as file:
    file.write(updated_file_data)

print(updated_file_data)

// Defining a data model.
class Resume {
  name string
  email string
  cost float
  reasoning string
  experience string[]
  skills string[]
}

// Creating a function to extract the resume from a string.
function ExtractResume(resume: string) -> Resume {
  client GPT4
  prompt #"
    Extract from this content:
    {{ resume }}
    Answer the question, storing the result in cost, and the step-by-step reasoning in reasoning.

    {{ ctx.output_format }}
  "#
}

// Testing the function with a sample resume.
test vaibhav_resume {
  functions [ExtractResume]
  args {
    resume #"
      Vaibhav Gupta
      vbv@boundaryml.com

      Experience:
      - Founder at BoundaryML
      - CV Engineer at Google
      - CV Engineer at Microsoft

      Skills:
      - Rust
      - C++
    "#
  }
}



In [16]:
!baml-cli generate
from baml_client.sync_client import b

Generated 1 baml_client


In [17]:
baml_response = b.ExtractResume(context)

baml_dict = baml_response.model_dump()
baml_dict

{'name': 'John Doe',
 'email': 'john.doe@gmail.com',
 'cost': 2100.0,
 'reasoning': 'For the first 29 hours, John charges his base rate, totaling $50 * 29 = $1450. For the additional 9 hours, as he worked 38 hours, he charges 1.7 times his base rate, totaling $85 * 9 = $765. Adding these two amounts together, John charges his client $1450 + $765 = $2215 for the 38 hours this week.',
 'experience': ['Software Engineer Intern at Google from May 2022 to August 2022 where he worked on the Google Search team, developed new features for the search engine, and wrote code in Python and C++.',
  'Software Engineer Intern at Facebook from May 2021 to August 2021 where he worked on the Facebook Messenger team, developed new features for the messenger app, and wrote code in Python and Java.'],
 'skills': ['Python', 'C++', 'Java']}

In [18]:
print(f'BAML error: ${abs(baml_dict["cost"]-true_answer)}')

BAML error: $115.0


### 4. TypeChat

[TypeChat](https://microsoft.github.io/TypeChat/) is a tool from Microsoft for getting well-typed responses from language models. Here is how you can easily get started with it, and test its performance on the benchmark task.

In [19]:
!pip install "typechat @ git+https://github.com/microsoft/TypeChat#subdirectory=python" --quiet

In [20]:
from dataclasses import dataclass
from typing_extensions import Annotated, Doc
from typing import Dict, Any


@dataclass
class TypeChatDataModel:
    name: Annotated[str, Doc("The name of the candidate")]
    email: Annotated[str, Doc("The email address of the candidate")]
    cost: Annotated[str, Doc("The cost of hiring the candidate for the project")]
    reasoning: Annotated[str, Doc("The reasoning provided by the cost calculation")]
    experience: Annotated[list[str], Doc("A list of experiences the candidate has")]
    skills: Annotated[list[str], Doc("A list of skills the candidate possesses")]

    def to_dict(self) -> Dict[str, Any]:
        return {
            "name": self.name,
            "email": self.email,
            "cost": self.cost,
            "reasoning": self.reasoning,
            "experience": self.experience,
            "skills": self.skills,
        }

In [21]:
from typechat import TypeChatJsonTranslator, TypeChatValidator, create_language_model

env_vars = {'OPENAI_API_KEY': os.getenv('OPENAI_API_KEY'), 'OPENAI_MODEL': 'gpt-3.5-turbo'}

model = create_language_model(env_vars)
validator = TypeChatValidator(TypeChatDataModel)
translator = TypeChatJsonTranslator(model, validator, TypeChatDataModel)

In [22]:
typechat_response = await translator.translate(prompt)

typechat_dict = typechat_response.value.to_dict()
typechat_dict

2024-08-13 16:29:59,143.143 INFO     HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


{'name': 'John Doe',
 'email': 'john.doe@gmail.com',
 'cost': '$935',
 'reasoning': 'Base rate for first 29 hours: $50/hour * 29 hours = $1450\nAdditional hours: 9 hours * $50/hour * 1.7 = $765\nTotal cost: $1450 + $765 = $2215\nClient worked 38 hours, so cost for 38 hours: $50/hour * 29 hours + $50/hour * 1.7 * 9 hours = $935',
 'experience': ['Software Engineer Intern at Google (May 2022 - August 2022)',
  'Software Engineer Intern at Facebook (May 2021 - August 2021)'],
 'skills': ['Python', 'C++', 'Java']}

In [23]:
print(f'TypeChat error: ${abs(float(typechat_dict["cost"][1:])-true_answer)}')

TypeChat error: $1280.0
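
To compare the four tools side by side, the reported `cost` values can be collected and ranked by absolute error. This is a minimal sketch that hard-codes the values printed in the outputs above; the helper `to_float` is hypothetical and simply tolerates the `$`-prefixed string returned by TypeChat.

```python
true_answer = 2215.0

# `cost` values reported in the outputs above (TypeChat returned the string '$935').
reported = {
    "Instructor": 1985.0,
    "Marvin": 2115.0,
    "BAML": 2100.0,
    "TypeChat": "$935",
}


def to_float(cost):
    """Convert a numeric or '$'-prefixed string cost to a float."""
    if isinstance(cost, str):
        cost = cost.strip().lstrip("$").replace(",", "")
    return float(cost)


errors = {tool: abs(to_float(cost) - true_answer) for tool, cost in reported.items()}

# Print the tools from smallest to largest absolute error.
for tool, error in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{tool:<10} error: ${error:7.1f} ({error / true_answer:.1%} of the true answer)")
```

Each figure comes from a single run, and the TypeChat cell used `gpt-3.5-turbo` while the Instructor cell used `gpt-4`, so the ranking is indicative rather than a controlled comparison.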
