# Structured Output Performance with LangChain

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://e7zy.short.gy/G1LklQ) [![View on GitHub](https://img.shields.io/badge/Open%20in%20GitHub-black?logo=github)](https://github.com/mattflo/structured-output-performance)

TLDR: Check out the results below.

My original objective was to identify open model alternatives to OpenAI that offered fast and reliable structured output. We are a LangChain shop. Despite its drawbacks, the ability to compose chains and the observability powered by LangSmith are extremely useful. So, calls to models in this analysis are wrapped in LangChain abstractions. If LangChain isn't your thing, I recommend you take a look at the excellent [Instructor](https://github.com/jxnl/instructor) project.

Most of the models below are wrapped using LangChain's beta [with_structured_output](https://python.langchain.com/docs/guides/structured_output).

## Consistency and Latency

This analysis measures two dimensions of performance: consistency and latency. Consistency is the rate at which a model produces output that can be parsed into an instance of the desired class. Treat the latency numbers as a guide rather than a rigorous benchmark. The graphs plot failure rate on the y axis and latency on the x axis, so the models that perform best on both metrics appear in the lower left corner.
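
As a rough sketch of how these two metrics can be derived from raw run records, the snippet below groups runs by model and computes a failure rate (in percent) and a median latency. The record layout, the model names, and the `summarize` helper are all made up for illustration; the notebook itself does this with pandas further down.

```python
from collections import defaultdict
from statistics import median


def summarize(records):
    """Return {model: (failure_rate_percent, median_latency_seconds)}."""
    by_model = defaultdict(list)
    for rec in records:
        by_model[rec["model"]].append(rec)

    summary = {}
    for model, runs in by_model.items():
        failures = sum(1 for r in runs if not r["ok"])
        failure_rate = 100.0 * failures / len(runs)
        summary[model] = (failure_rate, median(r["latency"] for r in runs))
    return summary


# Hypothetical records, invented for this example (not measured data).
records = [
    {"model": "model_a", "ok": True, "latency": 1.0},
    {"model": "model_a", "ok": True, "latency": 3.0},
    {"model": "model_a", "ok": False, "latency": 2.0},
    {"model": "model_a", "ok": True, "latency": 4.0},
    {"model": "model_b", "ok": True, "latency": 0.5},
    {"model": "model_b", "ok": True, "latency": 1.5},
]

summary = summarize(records)
print(summary)
```

The median is used rather than the mean so that a single slow request does not dominate a model's position on the x axis.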


## Baseline Findings

I planned to baseline against the OpenAI models we've used prior to this analysis. When I saw @swyx's [tweet](https://twitter.com/swyx/status/1773154343845257602), it was time to see what was going on with the OpenAI models. It turns out all of the GPT 3.5 function calling in our applications is on the old 0613 version. This analysis shows why. The newer versions, for whatever reason, aren't nearly as consistent with function calling. As you'll see in the results below, you should probably use JSON mode on the newer versions.


In [1]:
import sys

if 'google.colab' in sys.modules or 'kaggle_secrets' in sys.modules:
    !pip install set-env-colab-kaggle-dotenv langchain langchain-openai langchain-fireworks langchain-groq langchain-mistralai -qq

In [2]:
import os
from set_env import set_env

# to run all the examples, you will need keys for all these inference providers
# if you don't have some of these keys, you should get them!
# if you don't want to get keys, you can optionally omit some of the examples
# if local, set them up in `.env`
# on colab set up the secrets in the left bar
os.environ["OPENAI_API_KEY"] = set_env("OPENAI_API_KEY")
os.environ["GROQ_API_KEY"] = set_env("GROQ_API_KEY")
os.environ["FIREWORKS_API_KEY"] = set_env("FIREWORKS_API_KEY")
os.environ["TOGETHER_API_KEY"] = set_env("TOGETHER_API_KEY")
os.environ["MISTRAL_API_KEY"] = set_env("MISTRAL_API_KEY")

In [3]:
import time
from typing import List

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain_core.language_models.chat_models import BaseChatModel
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.runnables import Runnable, RunnableSequence
from langchain_fireworks import ChatFireworks
from langchain_groq import ChatGroq
from langchain_mistralai import ChatMistralAI
from langchain_openai import ChatOpenAI
from pandas import DataFrame
from tqdm import tqdm

In [4]:
class FunctionCallingMixin:
    def to_chain(
        self,
        pydantic_object: BaseModel,
        prompt: str,
    ) -> Runnable:
        # The Pydantic object and a single string prompt are passed in here.
        # If you want to do more complex things with the prompt, you may need to change this
        prompt_template = PromptTemplate.from_template(prompt)
        return prompt_template | self.with_structured_output(pydantic_object)


class JsonModeMixin:
    def to_chain(
        self,
        pydantic_object: BaseModel,
        prompt: str,
    ) -> Runnable:
        # The Pydantic object and a single string prompt are passed in here.
        # If you want to do more complex things with the prompt, you may need to change this
        parser = PydanticOutputParser(pydantic_object=pydantic_object)
        prompt_template = PromptTemplate.from_template(
            "{format_instructions}\n" + prompt,
            partial_variables={"format_instructions": parser.get_format_instructions()},
        )
        return prompt_template | self.with_structured_output(
            pydantic_object, method="json_mode"
        )


class MistralLargeFunc(FunctionCallingMixin, ChatMistralAI):
    def __init__(self):
        super().__init__(model="mistral-large-latest")


class MistralLargeJson(ChatMistralAI):
    def __init__(self):
        super().__init__(model="mistral-large-latest")

    def to_chain(
        self,
        pydantic_object: BaseModel,
        prompt: str,
    ) -> Runnable:
        parser = PydanticOutputParser(pydantic_object=pydantic_object)
        prompt_template = PromptTemplate.from_template(
            "{format_instructions}\n" + prompt,
            partial_variables={"format_instructions": parser.get_format_instructions()},
        )
        return prompt_template | self | parser


class MistralSmall(FunctionCallingMixin, ChatMistralAI):
    def __init__(self):
        super().__init__(model="mistral-small-latest")


class Together(ChatOpenAI):
    def __init__(self, model: str):
        super().__init__(
            model=model,
            base_url="https://api.together.xyz/v1",
            api_key=os.environ["TOGETHER_API_KEY"],
        )


# Together also offers JSON mode for these models. However, the latencies
# were all over the place, so only the function calling variants are included here.
class TogetherMistralFunc(FunctionCallingMixin, Together):
    def __init__(self):
        super().__init__(model="mistralai/Mistral-7B-Instruct-v0.1")


class TogetherMixtralFunc(FunctionCallingMixin, Together):
    def __init__(self):
        super().__init__(model="mistralai/Mixtral-8x7B-Instruct-v0.1")


class TogetherCodeLlama34BIFunc(FunctionCallingMixin, Together):
    def __init__(self):
        super().__init__(model="togethercomputer/CodeLlama-34b-Instruct")


class GPT4Functions(FunctionCallingMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-4-turbo-preview")


class GPT4Json(JsonModeMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-4-turbo-preview")


class GPT350613Functions(FunctionCallingMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-3.5-turbo-0613")


class GPT351106Functions(FunctionCallingMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-3.5-turbo-1106")


class GPT351106Json(JsonModeMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-3.5-turbo-1106")


class GPT350125Functions(FunctionCallingMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-3.5-turbo-0125")


class GPT350125Json(JsonModeMixin, ChatOpenAI):
    def __init__(self):
        super().__init__(model="gpt-3.5-turbo-0125")


class FireworksFunctions(FunctionCallingMixin, ChatFireworks):
    def __init__(self):
        super().__init__(model="accounts/fireworks/models/firefunction-v1")


class FireworksMixtral(JsonModeMixin, ChatFireworks):
    def __init__(self, **kwargs):
        super().__init__(
            model="accounts/fireworks/models/mixtral-8x7b-instruct",
            **kwargs,
        )


# Groq needs a slightly different chain construction: the model output is piped
# into a PydanticOutputParser instead of going through with_structured_output
class Groq(ChatGroq):
    def to_chain(
        self,
        pydantic_object: BaseModel,
        prompt: str,
    ) -> Runnable:
        # The Pydantic object and a single string prompt are passed in here.
        # If you want to do more complex things with the prompt, you may need to change this
        parser = PydanticOutputParser(pydantic_object=pydantic_object)
        prompt_template = PromptTemplate.from_template(
            "{format_instructions}\n" + prompt,
            partial_variables={"format_instructions": parser.get_format_instructions()},
        )
        return prompt_template | self | parser


class GroqMixtral(Groq):
    def __init__(self):
        super().__init__(model="mixtral-8x7b-32768")


class GroqGemma7b(Groq):
    def __init__(self):
        super().__init__(model="gemma-7b-it")


class GroqLLaMa70b(Groq):
    def __init__(self):
        super().__init__(model="llama2-70b-4096")


def model_groups(cls):
    cls.Groq = [cls.GroqMixtral, cls.GroqLLaMa70b, cls.GroqGemma7b]
    cls.Fireworks = [cls.FireworksMixtral, cls.FireworksFunctions]
    cls.TogetherFunc = [
        cls.TogetherMistralFunc,
        cls.TogetherMixtralFunc,
        cls.TogetherCodeLlama34BIFunc,
    ]
    cls.GPT4 = [cls.GPT4Functions, cls.GPT4Json]
    cls.GPT35 = [
        cls.GPT350613Functions,
        cls.GPT351106Functions,
        cls.GPT351106Json,
        cls.GPT350125Functions,
        cls.GPT350125Json,
    ]
    cls.Mistral = [cls.MistralLargeFunc]
    return cls


@model_groups
class Models:
    # https://docs.mistral.ai/platform/endpoints/#operation/listModels
    Mistral: List[BaseChatModel]
    MistralLargeFunc = MistralLargeFunc()

    # https://console.groq.com/docs/models
    Groq: List[BaseChatModel]
    GroqMixtral = GroqMixtral()
    GroqLLaMa70b = GroqLLaMa70b()
    GroqGemma7b = GroqGemma7b()

    Fireworks: List[BaseChatModel]
    # https://readme.fireworks.ai/docs/structured-response-formatting
    FireworksMixtral = FireworksMixtral(max_tokens=1000)
    # https://readme.fireworks.ai/docs/function-calling
    FireworksFunctions = FireworksFunctions()

    # https://docs.together.ai/docs/function-calling
    TogetherFunc: List[BaseChatModel]
    TogetherMistralFunc = TogetherMistralFunc()
    TogetherMixtralFunc = TogetherMixtralFunc()
    TogetherCodeLlama34BIFunc = TogetherCodeLlama34BIFunc()

    GPT4: List[BaseChatModel]
    GPT4Functions = GPT4Functions()
    GPT4Json = GPT4Json()

    GPT35: List[BaseChatModel]
    GPT350613Functions = GPT350613Functions()
    GPT351106Functions = GPT351106Functions()
    GPT351106Json = GPT351106Json()
    GPT350125Functions = GPT350125Functions()
    GPT350125Json = GPT350125Json()

### Choose which models to include in the comparison

In [5]:
model_filter: List[BaseChatModel] = [
    *Models.GPT35,
    *Models.GPT4,
    *Models.TogetherFunc,
    *Models.Fireworks,
    *Models.Groq,
    *Models.Mistral,
]

In [6]:
actors = [
    "Leonardo DiCaprio",
    "Denzel Washington",
    "Tom Hanks",
    "Brad Pitt",
    "Joaquin Phoenix",
    "Meryl Streep",
    "Jennifer Lawrence",
    "Viola Davis",
    "Cate Blanchett",
    "Scarlett Johansson",
]

### Try out other functions/prompts here

In [7]:
class Actor(BaseModel):
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")
    awards: List[str] = Field(description="list of awards they won")


prompt = "Generate the filmography for {actor}.\n"

In [8]:
# create all the model and actor combinations

models_actors = [(model, actor) for model in model_filter[:] for actor in actors[:]]

print(f"created {len(models_actors)} model actor pairs")

created 160 model actor pairs


In [9]:
# Takes about 7 minutes to run 160 samples
results = []

for model, actor in tqdm(models_actors):
    chain: RunnableSequence = model.to_chain(Actor, prompt)

    start = time.time()
    try:
        result = chain.invoke({"actor": actor})
    except Exception as e:
        result = e

    end = time.time()
    results.append(
        {
            "model": model.__class__.__name__,
            "result": result,
            "latency": end - start,
        }
    )

results[:2]

100%|██████████| 160/160 [06:33<00:00,  2.46s/it]


[{'model': 'GPT350613Functions',
  'result': Actor(name='Leonardo DiCaprio', film_names=['Titanic', 'The Revenant', 'Inception', 'The Departed', 'The Wolf of Wall Street', 'Django Unchained', 'Catch Me If You Can', 'The Great Gatsby', 'Shutter Island', 'Blood Diamond'], awards=['Academy Award for Best Actor', 'Golden Globe Award for Best Actor', 'BAFTA Award for Best Actor']),
  'latency': 1.800079107284546},
 {'model': 'GPT350613Functions',
  'result': Actor(name='Denzel Washington', film_names=['Training Day', 'Malcolm X', 'Glory', 'The Equalizer', 'Flight', 'Remember the Titans', 'American Gangster', 'Inside Man', 'The Book of Eli', 'Fences'], awards=['Academy Award for Best Actor', 'Golden Globe Award for Best Actor', 'Screen Actors Guild Award for Outstanding Performance by a Male Actor in a Leading Role']),
  'latency': 1.8030071258544922}]
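
The timing pattern in the loop above can be tried without any API keys. The sketch below wraps an arbitrary callable, records the elapsed time, and keeps any exception as the result, which is how the loop treats failures. The `timed_call` helper and the two `fake_*` functions are hypothetical stand-ins for `chain.invoke`, and the use of `time.perf_counter()` is my own choice here, not something the notebook does.

```python
import time


def timed_call(fn, payload):
    """Call fn(payload); return the result (or the exception) plus elapsed seconds."""
    start = time.perf_counter()
    try:
        result = fn(payload)
    except Exception as exc:
        # Keep the exception as the result so the failure can be counted later.
        result = exc
    elapsed = time.perf_counter() - start
    return {"result": result, "latency": elapsed}


# Hypothetical stand-ins for chain.invoke, for illustration only.
def fake_ok(payload):
    return {"name": payload["actor"]}


def fake_fail(payload):
    raise ValueError("could not parse model output")


ok = timed_call(fake_ok, {"actor": "Tom Hanks"})
bad = timed_call(fake_fail, {"actor": "Tom Hanks"})
print(type(ok["result"]).__name__, type(bad["result"]).__name__)
```

Storing the exception instead of re-raising it means one bad response does not abort the whole run, and a later `isinstance` check is enough to separate successes from failures.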

In [10]:
# put the results in a dataframe
df = pd.DataFrame(results)

In [11]:
# compute failure rate
df["failure"] = df["result"].apply(lambda x: not isinstance(x, Actor))
df.head()

| | model | result | latency | failure |
|---|---|---|---|---|
| 0 | GPT350613Functions | name='Leonardo DiCaprio' film_names=['Titanic'... | 1.800079 | False |
| 1 | GPT350613Functions | name='Denzel Washington' film_names=['Training... | 1.803007 | False |
| 2 | GPT350613Functions | name='Tom Hanks' film_names=['Forrest Gump', '... | 1.757051 | False |
| 3 | GPT350613Functions | name='Brad Pitt' film_names=['Fight Club', 'Se... | 1.638846 | False |
| 4 | GPT350613Functions | name='Joaquin Phoenix' film_names=['Joker', 'H... | 1.741278 | False |


In [12]:
def show_scatter_plot(
    title: str, df: DataFrame, model_filter: List[BaseChatModel] = None
):
    if model_filter is not None:
        df = df[df["model"].isin([model.__class__.__name__ for model in model_filter])]
    # Calculate the mean, median, and quartiles of latency for each model
    model_stats = (
        df.groupby("model")
        .agg(
            {
                "latency": [
                    "mean",
                    lambda x: np.percentile(x, 25),
                    lambda x: np.percentile(x, 50),
                    lambda x: np.percentile(x, 75),
                ],
                "failure": "mean",
            }
        )
        .reset_index()
    )

    # Convert failure mean to percentage
    model_stats["failure", "mean"] *= 100

    # Create a scatter plot
    fig = go.Figure()

    # Define a list of text positions to cycle through
    text_positions = [
        "top center",
        "bottom center",
        # "top center",
    ]
    position_index = 0

    for _, row in model_stats.iterrows():
        model_name = row[("model", "")]
        latency_mean = row[("latency", "mean")]
        quartile_25 = row[("latency", "<lambda_0>")]
        latency_median = row[("latency", "<lambda_1>")]
        quartile_75 = row[("latency", "<lambda_2>")]
        failure_rate = row[("failure", "mean")]

        # Custom hover text
        custom_hover_text = f"<b>{model_name}</b><br>Failure Rate: {failure_rate:.2f}%<br>1st Quartile: {quartile_25:.2f}s<br>Mean: {latency_mean:.2f}s<br>Median: {latency_median:.2f}s<br>3rd Quartile: {quartile_75:.2f}s"

        fig.add_trace(
            go.Scatter(
                x=[latency_median],
                y=[failure_rate],
                # optionally add quartile error bars
                # error_x=dict(
                #     type="data",
                #     array=[latency_median - quartile_25, quartile_75 - latency_median],
                #     visible=True,
                # ),
                name=model_name,
                text=[model_name],  # Model name as point label
                hovertemplate=custom_hover_text
                + "<extra></extra>",  # Custom hover text without the trace name
                mode="markers+text",
                textposition=text_positions[
                    position_index % len(text_positions)
                ],  # Cycle through positions
            )
        )
        position_index += 1  # Move to the next position for the next model

    fig.update_layout(
        title=title,
        xaxis_title="Median Latency (seconds)",
        yaxis_title="Failure Rate (%)",
        showlegend=False,  # Hide the legend to avoid redundancy
    )

    fig.show()

# Results

**Note**: To see the graphs, you will need to open the notebook in [Colab](https://e7zy.short.gy/G1LklQ) or run it yourself.

**Failure Rate** - the rate that the model produced output that _couldn't_ be parsed into an instance of the Actor class.

**Latency** - the median time in seconds it took for the model/provider to return.

The fastest, most consistent models will appear in the bottom left of the graphs.
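
To make the failure rate definition concrete, here is a minimal sketch of what "couldn't be parsed" means. It uses only the standard library rather than Pydantic and LangChain's parser, so it is a simplification of what the notebook actually runs. `ActorRecord`, `parse_actor`, and `is_failure` are hypothetical names, and the sample strings are invented.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ActorRecord:
    name: str
    film_names: List[str]
    awards: List[str]


def parse_actor(raw: str) -> ActorRecord:
    """Parse raw model text into an ActorRecord, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("name must be a string")
    for key in ("film_names", "awards"):
        value = data.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    return ActorRecord(data["name"], data["film_names"], data["awards"])


def is_failure(raw: str) -> bool:
    """True when the text cannot be turned into an ActorRecord."""
    try:
        parse_actor(raw)
    except ValueError:
        return True
    return False


# Invented sample outputs: one valid, one missing a field, one not JSON at all.
good = '{"name": "Tom Hanks", "film_names": ["Big"], "awards": []}'
missing_field = '{"name": "Tom Hanks", "film_names": ["Big"]}'
not_json = "Sure! Here is the filmography you asked for."

print([is_failure(s) for s in (good, missing_field, not_json)])
```

Both malformed JSON and well-formed JSON with a missing or mistyped field count as failures, since neither yields a usable instance.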

## Best Open Model Options

All of these model and inference provider pairs produce consistent structured output.

- **GroqGemma7b** recently moved into the top spot with the lowest latency.
- **GroqMixtral** is still blazing fast in second place.
- **FireworksFunctions** is trailed by **FireworksMixtral**, **TogetherMixtralFunc**, and **TogetherMistralFunc**.
- **TogetherCodeLlama34BIFunc** is quite a bit slower.

In [16]:
show_scatter_plot(
    "Open Models Structured Output Comparison",
    df,
    [*Models.Fireworks, *Models.TogetherFunc, Models.GroqMixtral, Models.GroqGemma7b],
)

## All Models

**GroqLLaMa70b** is far less consistent than the other Groq models.

In [14]:
show_scatter_plot("Structured Output LLMs Comparison", df)

## Proprietary Models

### GPT 4
As usual, GPT 4 offers great consistency and awful speed.

### GPT 3.5
Performance varies greatly across the GPT 3.5 models for both consistency and latency. The newer versions are much less consistent with function calling than the older 0613 version, which supports only function calling, not JSON mode. JSON mode works much better on the two newer versions, and 0125 is notably faster than 1106.

### Mistral Large
Mistral Large struggles to provide consistent structured output.

In [15]:
show_scatter_plot(
    "Proprietary Models Structured Output Comparison",
    df,
    [*Models.GPT35, *Models.GPT4, Models.MistralLargeFunc],
)