### Supporting JSON outputs in commercial LLMs
##### This notebook covers how Anthropic, Google, and OpenAI handle structured JSON outputs:

[Anthropic Claude Sonnet 3.5](#anthropic-claude-sonnet-3.5)

[Google Gemini Pro 1.5](#google-gemini-pro-1.5)

[OpenAI GPT-4o](#openai)

```
TL;DR 
As of 08/2024, it is possible to get relatively stable JSON outputs from all three leading commercial LLMs.

The clear winner is OpenAI's new "structured output mode", which gives clean JSON outputs based on Pydantic data models and can even follow some field constraints provided as Pydantic descriptions. However, contrary to the claims of the OpenAI documentation, the "structured output mode" is still somewhat prompt-dependent and can infrequently fail.

Second place goes to Anthropic Claude Sonnet 3.5, which sticks to a structured output only after jumping through the hoop of forcing the model into "function calling". Claude cannot accept Pydantic classes directly, but it still reads field descriptions from the JSON schema and tries to stick to them.

Google Gemini 1.5 Pro takes third place, as the only way to reliably stick to a structured output is to feed the API an instance of the genai.protos.Schema class. There is no direct way to specify field descriptions.
```

Our test methodology is very simple: we stick as close to each vendor's documentation as possible and check the validity of the output JSON as well as the integrity of the objects it was instructed to represent. We do not check or evaluate the actual object content. To keep the LLM from returning cached results, we use DataChain to run the evaluation over a dataset of 50 unique entries.
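
The two error columns in the tables that follow correspond to two separate checks: does the response parse as JSON, and does the parsed value have the requested shape. A minimal sketch of that split, using only the standard library; `classify_response` and `REQUIRED_KEYS` are hypothetical names for illustration (the notebook itself relies on `json.loads` and Pydantic models for these checks):

```python
import json

# Keys requested in the prompts used throughout this notebook
REQUIRED_KEYS = {"sentiment", "key_issues", "action_items"}


def classify_response(text: str) -> str:
    """Return 'json_error', 'object_error', or 'ok' for one raw LLM response."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # The response is not valid JSON at all (e.g. it has a text preamble)
        return "json_error"

    # Structural checks only: the content of the fields is not evaluated
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return "object_error"
    if not isinstance(data["key_issues"], list):
        return "object_error"
    if not isinstance(data["action_items"], list):
        return "object_error"
    for item in data["action_items"]:
        if not isinstance(item, dict) or not {"team", "task"} <= item.keys():
            return "object_error"
    return "ok"
```

A response counts toward "JSON validation errors" in the first case and toward "Object validation errors" in the second.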


##### Here are the cumulative error rates observed:

 "claude-3-5-sonnet-20240620":
 
| LLM  | mode  | JSON validation errors | Object validation errors |
|-----------|-----------|-----------|-----------------|
| Claude Sonnet 3.5 | prompt alone | 14% |  N/A |
| Claude Sonnet 3.5 | prompt & example | 2% |  N/A |
| Claude Sonnet 3.5 | "auto" tool config | 2% |  N/A |
| Claude Sonnet 3.5 | forced tool config | 0% in 1,000 calls| 0% in 1,000 calls |

 "gemini-1.5-pro-latest":
 
| LLM  | mode  | JSON validation errors  | Object validation errors |
|-----------|-----------|-----------|----------------|
| Gemini 1.5 Pro | prompt alone | 90% | N/A |
| Gemini 1.5 Pro | json mode / schema from a user class | 0% | 2% |
| Gemini 1.5 Pro | json mode / schema in the prompt  | 1% | N/A |
| Gemini 1.5 Pro | json mode / schema as `genai.protos.Schema` | 0% in 1,000 calls| 0% in 1,000 calls |

"gpt-4o-2024-08-06":

| LLM  | mode  | JSON validation errors  | Object validation errors |
|-----------|-----------|-----------|----------------|
| GPT 4o | structured output | 0.01% in 10,000 calls  | 0% in 10,000 calls  |
| GPT 4o | structured output + inconsistent prompt | 0.05% | N/A |
| GPT 4o | structured output + field description constraint  | 0% | 0.05% |
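
A note on reading the "0% in 1,000 calls" rows: zero observed failures does not prove a zero failure rate. As a rough sketch, assuming the calls are independent (only approximately true here, since the 1,000-call Claude run reuses the same 50 dialogs), the largest failure rate still compatible with a clean run can be estimated as follows. `zero_failure_upper_bound` is a hypothetical helper, not part of the notebook:

```python
def zero_failure_upper_bound(n_calls: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-call failure rate after n_calls with zero failures.

    Solves (1 - p) ** n_calls = 1 - confidence for p, i.e. the failure rate at
    which a completely clean run would still happen (1 - confidence) of the time.
    Assumes independent calls, which is an idealization.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_calls)


bound_1k = zero_failure_upper_bound(1_000)
bound_10k = zero_failure_upper_bound(10_000)
print(f"1,000 clean calls:  failure rate below {bound_1k:.2%} (95% confidence)")
print(f"10,000 clean calls: failure rate below {bound_10k:.3%} (95% confidence)")
```

Under these assumptions, 1,000 clean calls bound the failure rate at roughly 0.3%, and 10,000 clean calls at roughly 0.03%.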


### Anthropic Claude Sonnet

First try: just ask for JSON output using the prompt [recommended by Anthropic](https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency).

In [14]:
#
# requesting JSON output from Anthropic with the prompt here:
# https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency
#

import os
import anthropic

PROMPT = """
You’re a Customer Insights AI. 
Analyze this feedback and output in JSON format with keys: “sentiment” (positive/negative/neutral), 
“key_issues” (list), and “action_items” (list of dicts with “team” and “task”).
"""

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

completion = (
   client.messages.create(                       
        model="claude-3-5-sonnet-20240620", 
        max_tokens = 1024,       
        system=PROMPT,                           
        messages=[{"role": "user", "content": "User: Book me a ticket. Bot: I do not know."}]
   )
)
print(completion.content[0].text)

Here's the analysis of that feedback in JSON format:

{
  "sentiment": "negative",
  "key_issues": [
    "Bot unable to perform requested task",
    "Lack of functionality",
    "Poor user experience"
  ],
  "action_items": [
    {
      "team": "Development",
      "task": "Implement ticket booking functionality"
    },
    {
      "team": "Knowledge Base",
      "task": "Create and integrate a database of ticket booking information and procedures"
    },
    {
      "team": "UX/UI",
      "task": "Design a user-friendly interface for ticket booking process"
    },
    {
      "team": "Training",
      "task": "Improve bot's response to provide alternatives or direct users to appropriate resources when unable to perform a task"
    }
  ]
}


We can see a problem right away: the model produces JSON output but prepends it with an unwanted preamble.
To understand the prevalence of this behavior, let's try the same prompt but evaluate against 50 chat dialogs from this [public dataset](https://radar.kit.edu/radar/en/dataset/FdJmclKpjHzLfExE.ExpBot%2B-%2BA%2Bdataset%2Bof%2B79%2Bdialogs%2Bwith%2Ban%2Bexperimental%2Bcustomer%2Bservice%2Bchatbot) using the [DataChain library](https://github.com/iterative/datachain):

In [2]:
import os
import json
import anthropic
from datachain import File, DataChain, Column

from pydantic import BaseModel, Field
from typing import List, Optional

class TextBlock(BaseModel):
    text: str
    type: str

class Usage(BaseModel):
    input_tokens: int
    output_tokens: int

class ClaudeMessage(BaseModel):
    id: str
    content: List[TextBlock]
    model: str
    role: str
    stop_reason: str
    stop_sequence: Optional[str] = None
    type: str
    usage: Usage

PROMPT = """
You’re a Customer Insights AI. 
Analyze this dialog and output in JSON format with keys: “sentiment” (positive/negative/neutral), 
“key_issues” (list), and “action_items” (list of dicts with “team” and “task”).
"""

source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

def eval_dialogue(file: File) -> str:    
     completion = (
         client.messages.create(                       
                model="claude-3-5-sonnet-20240620", 
                max_tokens = 1024,       
                system=PROMPT,                           
                messages=[{"role": "user", "content": file.read()},]
         )
     )
     json_string = completion.content[0].text
     try:
         # Attempt to convert the string to JSON
         json_data = json.loads(json_string)
         return completion
     except json.JSONDecodeError as e:
         # Catch JSON decoding errors
         print(f"JSONDecodeError: {e}")
         print(json_string)
         return completion

simple_chain = DataChain.from_storage(source_files, type="text")       \
              .settings(cache=True)                                    \
              .filter(Column("file.path").glob("*.txt"))               \
              .map(claude = eval_dialogue, output=ClaudeMessage)       \
              .save()

Preparing: 50 rows [00:00, 22584.02 rows/s]
Processed: 50 rows [00:00, 9450.04 rows/s]
Preparing: 50 rows [00:00, 16987.87 rows/s]
Processed: 13 rows [00:52,  4.48s/ rows]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here's the analysis of the dialog in JSON format:

{
  "sentiment": "neutral",
  "key_issues": [
    "Chatbot's inability to understand context and recall previously provided information",
    "Rigid response structure requiring specific input formats",
    "Limited options presented to the user",
    "Lack of flexibility in understanding user's budget preferences"
  ],
  "action_items": [
    {
      "team": "AI Development",
      "task": "Improve context understanding and information recall capabilities"
    },
    {
      "team": "AI Development",
      "task": "Enhance natural language processing to handle various input formats"
    },
    {
      "team": "Product Management",
      "task": "Expand the database of mobile phone operators and plans"
    },
    {
      "team": "UX Design",
      "task": "Redesign the conversation flow to be more user-friendly and less repetitive"
    },
    {
      "team": "AI Development",
 

Processed: 20 rows [01:27,  5.84s/ rows]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here's the analysis of the dialog in JSON format:

{
  "sentiment": "neutral",
  "key_issues": [
    "User difficulty in providing clear budget information",
    "Bot's inability to understand non-numeric responses",
    "Limited flexibility in adjusting search parameters mid-conversation",
    "Lack of follow-up questions about international travel needs"
  ],
  "action_items": [
    {
      "team": "UX Design",
      "task": "Improve flexibility of input options for budget questions"
    },
    {
      "team": "NLP",
      "task": "Enhance natural language understanding for non-numeric responses"
    },
    {
      "team": "Product Management",
      "task": "Implement feature to adjust search parameters during conversation"
    },
    {
      "team": "Content",
      "task": "Develop more detailed questions about international usage needs"
    },
    {
      "team": "QA",
      "task": "Test bot's ability to handle unexpecte

Processed: 29 rows [02:13,  5.28s/ rows]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here's the analysis of the dialog in JSON format:

{
  "sentiment": "neutral",
  "key_issues": [
    "User's needs not fully met",
    "Chatbot's inability to handle flexible budget input",
    "Lack of continuity in conversation",
    "Limited response to specific requests"
  ],
  "action_items": [
    {
      "team": "Product",
      "task": "Improve plan recommendation algorithm to include free text messages when requested"
    },
    {
      "team": "Development",
      "task": "Enhance chatbot's ability to handle flexible budget inputs"
    },
    {
      "team": "UX",
      "task": "Improve conversation flow to maintain context between user requests"
    },
    {
      "team": "AI",
      "task": "Train the chatbot to better understand and respond to specific user requests"
    }
  ]
}


Processed: 30 rows [02:24,  7.26s/ rows]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here's the analysis of the dialog in JSON format:

{
  "sentiment": "positive",
  "key_issues": [
    "Customer initially provided incorrect plan information",
    "Customer wants a cheaper phone plan",
    "Customer travels outside of Europe",
    "Customer wants to spend less than 25€ per month"
  ],
  "action_items": [
    {
      "team": "Customer Service",
      "task": "Improve the process for customers to accurately identify their current plan"
    },
    {
      "team": "Product Development",
      "task": "Develop more affordable plans with international roaming options"
    },
    {
      "team": "AI Development",
      "task": "Enhance chatbot's ability to handle ambiguous responses like 'sometimes'"
    },
    {
      "team": "Marketing",
      "task": "Promote the BigPhone Green Xtra plan to budget-conscious customers"
    },
    {
      "team": "UX/UI",
      "task": "Improve chatbot's ability to understand and re

Processed: 50 rows [03:44,  4.50s/ rows]
Saving: 50 rows [00:00, 19150.32 rows/s]
Cleanup: 4 tables [00:00, 3559.02 tables/s]
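
Each failing response in the log above starts with a line of plain text before the JSON object. A minimal sketch of how such a response could be split client-side, using the standard library's `json.JSONDecoder.raw_decode`; the helper `split_preamble` is hypothetical and is not part of the test itself, which counts these responses as failures:

```python
import json


def split_preamble(text: str):
    """Return (preamble, parsed_object), or (text, None) if no JSON object parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _end = decoder.raw_decode(text, start)
            return text[:start].strip(), obj
        except json.JSONDecodeError:
            # This brace did not start a valid object; try the next one
            start = text.find("{", start + 1)
    return text, None


raw = 'Here\'s the analysis of the dialog in JSON format:\n\n{"sentiment": "neutral", "key_issues": []}'
preamble, obj = split_preamble(raw)
print(repr(preamble))
print(obj)
```

This only helps when the JSON object itself is complete; a response that is cut off mid-object still fails to parse.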


As we see, 4 out of 50 responses in this run come with an extra text line preceding the output.
To address that, [Anthropic recommends](https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/increase-consistency#prefill-claudes-response) two techniques:

1. Providing structure examples in the prompt
2. Pre-filling the assistant's answer (coercing the output start).

Note that while giving examples is straightforward, pre-filling is a bit inconvenient because one needs to prepend the coercive part back to the answer to complete the structured object. Nonetheless, let's give both a try:

In [4]:
import os
import json
import anthropic
from datachain import File, DataChain, Column

source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PROMPT = """
You’re a Customer Insights AI. 
Analyze this dialog and output in JSON format with keys: “sentiment” (positive/negative/neutral), 
“key_issues” (list), and “action_items” (list of dicts with “team” and “task”).

Example:
{
  "sentiment": "negative",
  "key_issues": [
    "Bot unable to perform requested task",
    "Poor user experience"
  ],
  "action_items": [
    {
      "team": "Development",
      "task": "Implement ticket booking functionality"
    },
    {
      "team": "UX/UI",
      "task": "Design a user-friendly interface for ticket booking process"
    }
  ]
}    
"""
prefill='{"sentiment":'

def eval_dialogue(file: File) -> str:    
     completion = (
         client.messages.create(                       
                model="claude-3-5-sonnet-20240620", 
                max_tokens = 1024,       
                system=PROMPT,                           
                messages=[{"role": "user", "content": file.read()},
                          {"role": "assistant", "content": f'{prefill}'},
                         ]
         )
     )
     json_string = prefill + completion.content[0].text
     try:
         # Attempt to convert the string to JSON
         json_data = json.loads(json_string)
         return json_string
     except json.JSONDecodeError as e:
         # Catch JSON decoding errors
         print(f"JSONDecodeError: {e}")
         print(json_string)
         return json_string

chain = DataChain.from_storage(source_files, type="text")       \
              .settings(cache=True)                             \
              .filter(Column("file.path").glob("*.txt"))        \
              .map(claude = eval_dialogue)                      \
              .exec()


Preparing: 50 rows [00:00, 19291.25 rows/s]
Processed: 50 rows [00:00, 6438.31 rows/s]
Preparing: 50 rows [00:00, 17773.98 rows/s]
Processed: 2 rows [00:06,  3.41s/ rows]

JSONDecodeError: Expecting ',' delimiter: line 1 column 18 (char 17)
{"sentiment":": "negative",
"key_issues": [
"Bot unable to handle unrealistic input values",
"Poor data validation and error handling",
"Inability to sign contracts or take direct action",
"Repetitive conversation flow",
"Lack of personalization in responses"
],
"action_items": [
{
"team": "Development",
"task": "Implement input validation for realistic values"
},
{
"team": "Development",
"task": "Improve error handling and user guidance"
},
{
"team": "Product Management",
"task": "Define clear user journey and add functionality for contract signing"
},
{
"team": "UX/UI",
"task": "Design more flexible conversation flows"
},
{
"team": "Data Science",
"task": "Implement more sophisticated recommendation engine"
},
{
"team": "Customer Support",
"task": "Create escalation path for complex queries"
}
]}


Processed: 9 rows [00:38,  4.43s/ rows]

JSONDecodeError: Expecting value: line 3 column 1 (char 15)
{"sentiment":

```json
{
  "sentiment": "negative",
  "key_issues": [
    "Bot unable to handle follow-up questions",
    "Bot not responding to user's specific queries",
    "Lack of flexibility in conversation",
    "Poor user experience"
  ],
  "action_items": [
    {
      "team": "Development",
      "task": "Improve bot's ability to handle follow-up questions and maintain context"
    },
    {
      "team": "Development",
      "task": "Enhance bot's natural language processing to understand varied user inputs"
    },
    {
      "team": "Content",
      "task": "Expand bot's knowledge base to include information about different providers"
    },
    {
      "team": "UX/UI",
      "task": "Design a more intuitive conversation flow that allows for user preferences and follow-up questions"
    },
    {
      "team": "Product Management",
      "task": "Review and revise the bot's response strategy when it doesn't understan

Processed: 50 rows [03:27,  4.15s/ rows]
Cleanup: 4 tables [00:00, 3421.83 tables/s]


As expected, coercion got rid of the starter text line because the output is forced to begin as valid JSON. This is clearly much better than just "asking" Claude to be nice and produce valid JSON.
Nonetheless, the LLM is still able to deviate from the structured output, although less frequently:

```
JSONDecodeError: Expecting ',' delimiter: line 1 column 18 (char 17)
{"sentiment":": "negative",

```
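
The failure above is easier to see when the prefill and the model's continuation are joined by hand. A small sketch, with the continuation strings written out as assumptions that mimic the logged output (no API call is made; `reassemble` is a hypothetical helper):

```python
import json

PREFILL = '{"sentiment":'


def reassemble(prefill: str, continuation: str):
    """Join the prefill with the model's continuation; return None if invalid JSON."""
    try:
        return json.loads(prefill + continuation)
    except json.JSONDecodeError as e:
        print(f"JSONDecodeError: {e}")
        return None


# The model continues right after the prefill: valid JSON once rejoined
good = reassemble(PREFILL, ' "negative", "key_issues": [], "action_items": []}')

# The model repeats the quote and colon, as in the logged failure:
# the rejoined string starts with {"sentiment":": "negative", ...
bad = reassemble(PREFILL, '": "negative", "key_issues": [], "action_items": []}')
```

In the second case the parser reads `": "` as the value of `sentiment` and then stumbles on the text that follows, which is the `Expecting ',' delimiter` error seen in the log.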


#### Anthropic "tool" hack

From the above, it is evident that the Anthropic LLM does not reliably stick to the requested schema right out of the box.
Nevertheless, the [LangChain](https://www.langchain.com) library offers a [`with_structured_output`](https://python.langchain.com/v0.1/docs/modules/model_io/chat/structured_output/) method for its Anthropic wrapper.

How does it work?

LangChain wraps the Claude completion request into a "tool call", prompting the LLM to use a pseudo-tool that requires structured input.
The line of thinking here is that the Anthropic model is trained well enough not to send unstructured output to an external function.

So let's try this idea out in the same testbench:

In [5]:
import os
import json
import anthropic
from datachain import File, DataChain, Column

from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional

class TextBlock(BaseModel):
    text: str
    type: str

class Usage(BaseModel):
    input_tokens: int
    output_tokens: int

class ClaudeMessage(BaseModel):
    id: str
    content: List[TextBlock]
    model: str
    role: str
    stop_reason: str
    stop_sequence: Optional[str] = None
    type: str
    usage: Usage

class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 3 problems discovered in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")


source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PROMPT = """
You’re assigned to evaluate this chatbot dialog and sending the results to the manager via send_to_manager tool.    
"""

def eval_dialogue(file: File) -> str:    
     completion = (
         client.messages.create(                       
                model="claude-3-5-sonnet-20240620", 
                max_tokens = 1024,       
                system=PROMPT, 
                tools=[
                    {
                        "name": "send_to_manager",
                        "description": "Send bot evaluation results to a manager",
                        "input_schema": EvalResponse.model_json_schema(),
                    }
                ],
                messages=[{"role": "user", "content": file.read()},
                         ]
         )
     )
     try:
         json_dict = completion.content[1].input
     except IndexError as e:
         # Catch cases where Claude refuses to use tools
         print(f"IndexError: {e}")
         print(completion)
         return completion
     try:
         # Attempt to convert the tool dict to EvalResponse object
         EvalResponse(**json_dict)
         return completion
     except ValidationError as e:
         # Catch Pydantic validation errors
         print(f"Pydantic error: {e}")
         print(completion)
         return completion

tool_chain = DataChain.from_storage(source_files, type="text")          \
              .settings(cache=True)                                     \
              .filter(Column("file.path").glob("*.txt"))                \
              .map(claude = eval_dialogue, output=ClaudeMessage)        \
              .exec()


Preparing: 50 rows [00:00, 16821.63 rows/s]
Processed: 50 rows [00:00, 8862.20 rows/s]
Preparing: 50 rows [00:00, 14740.65 rows/s]
Processed: 30 rows [02:47,  4.73s/ rows]

IndexError: list index out of range
Message(id='msg_0128s4bv2bKe575FJK6gNPee', content=[TextBlock(text="I apologize, but I don't have the ability to directly print this conversation. However, I can evaluate the chatbot dialog and send the results to a manager using the available tool. Would you like me to do that?", type='text')], model='claude-3-5-sonnet-20240620', role='assistant', stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(input_tokens=1681, output_tokens=49))


Processed: 50 rows [05:04,  6.10s/ rows]
Cleanup: 4 tables [00:00, 2142.41 tables/s]



Interestingly, the tool approach works well until the LLM declines to make a tool call:

```
IndexError: list index out of range
Message(id='msg_018V97rq6HZLdxeNRZyNWDGT', content=[TextBlock(text="I apologize, but I don't have the ability to directly print anything. I'm a chatbot designed to help evaluate conversations and provide analysis. Based on the conversation you've shared, it seems you were interacting with a different chatbot that helps find mobile phone plans. That chatbot doesn't appear to have printing capabilities either.\n\nHowever, I can analyze this conversation and send an evaluation to the manager. Would you like me to do that?", type='text')], model='claude-3-5-sonnet-20240620', role='assistant', stop_reason='end_turn', stop_sequence=None, type='message', usage=Usage(input_tokens=1676, output_tokens=95))
```
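
The `IndexError` comes from indexing `content[1]` when the response holds only a text block. A sketch of looking the block up by type instead of by position; the `SimpleNamespace` objects are stand-ins for the SDK's content blocks (which expose `type`, `name`, and `input` attributes), and `find_tool_input` is a hypothetical helper:

```python
from types import SimpleNamespace


def find_tool_input(content, tool_name: str):
    """Return the input dict of the first tool_use block for tool_name, else None."""
    for block in content:
        is_tool_use = getattr(block, "type", None) == "tool_use"
        if is_tool_use and getattr(block, "name", None) == tool_name:
            return block.input
    return None


# Stand-in for a response where the model answered with text only
text_only = [SimpleNamespace(type="text", text="I apologize, but ...")]

# Stand-in for a response with a text block followed by a tool call
with_tool = [
    SimpleNamespace(type="text", text="Sending the evaluation."),
    SimpleNamespace(
        type="tool_use",
        name="send_to_manager",
        input={"sentiment": "neutral", "key_issues": [], "action_items": []},
    ),
]

print(find_tool_input(text_only, "send_to_manager"))
print(find_tool_input(with_tool, "send_to_manager"))
```

A `None` result can then be counted as a refusal to use the tool, regardless of how many blocks the response contains.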

If the call is not made, no JSON input is available because the response object has no tool block.
Luckily, Anthropic offers an option to always force tool use and eliminate the text block from the response altogether.

This takes the form of the following argument: `tool_choice = {"type": "tool", "name": "send_to_manager"}`.
To better estimate the error rate, we raise the total number of LLM calls to 1,000. We also catch the exceptions the Claude API may throw at us:

In [6]:
import os
import json
import anthropic
from datachain import File, DataChain, Column

from pydantic import BaseModel, Field, ValidationError, field_validator
from typing import List, Optional


class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 3 problems discovered in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")

    @field_validator("key_issues")
    def count_issues(cls, value):
        if  len(value) != 3:
            raise ValueError(f"{len(value)} issues provided")
        return value


source_files = "gs://datachain-demo/chatbot-KiT/"
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

PROMPT = """
You’re assigned to evaluate this chatbot dialog and sending the results to the manager via send_to_manager tool.    
"""

def eval_dialogue(file: File) -> str:
     try:
        completion = (
             client.messages.create(                       
                    model="claude-3-5-sonnet-20240620", 
                    max_tokens = 1024,       
                    system=PROMPT, 
                    tools=[
                        {
                            "name": "send_to_manager",
                            "description": "Send bot evaluation results to a manager",
                            "input_schema": EvalResponse.model_json_schema(),
                        }
                    ],
                    tool_choice = {"type": "tool", "name": "send_to_manager"},
                    messages=[{"role": "user", "content": file.read()},
                             ]
             )
         )
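     except anthropic.RateLimitError as e:
        # Specific errors go first: both are subclasses of anthropic.APIError
        print(f"RateLimitError: {e}")
        return "error"
     except anthropic.AuthenticationError as e:
        print(f"AuthenticationError: {e}")
        return "error"
     except anthropic.APIError as e:
        # Catch-all for the remaining API errors (e.g. HTTP 500)
        print(f"APIError: {e}")
        return "error"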
     except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return "error"
         
     try:
         json_dict = completion.content[0].input
     except IndexError as e:
         # Catch cases where Claude refuses to use tools
         print(f"IndexError: {e}")
         print(completion)
         return str(completion)
     try:
         # Attempt to convert the tool dict to EvalResponse object
         EvalResponse(**json_dict)
         return str(completion)
     except ValidationError as e:
         # Catch Pydantic validation errors
         print(f"Pydantic error: {e}")
         print(completion)
         return str(completion)


for i in range(1,21): # 50x20 = 1,000 LLM calls
    forced_tool_chain = DataChain.from_storage(source_files, type="text")          \
                  .settings(cache=True)                                            \
                  .filter(Column("file.path").glob("*.txt"))                       \
                  .map(claude = eval_dialogue)                                     \
                  .exec()
    print(f"{i*50} calls")


Preparing: 50 rows [00:00, 23550.28 rows/s]
Processed: 50 rows [00:00, 8172.21 rows/s]
Preparing: 50 rows [00:00, 19492.07 rows/s]
Processed: 50 rows [03:41,  4.43s/ rows]
Cleanup: 4 tables [00:00, 2867.41 tables/s]


50 calls


Preparing: 50 rows [00:00, 40745.13 rows/s]
Processed: 50 rows [00:00, 10279.65 rows/s]
Preparing: 50 rows [00:00, 20990.41 rows/s]
Processed: 50 rows [03:26,  4.13s/ rows]
Cleanup: 4 tables [00:00, 2122.89 tables/s]


100 calls


Preparing: 50 rows [00:00, 29006.25 rows/s]
Processed: 50 rows [00:00, 12703.08 rows/s]
Preparing: 50 rows [00:00, 19157.32 rows/s]
Processed: 50 rows [03:20,  4.02s/ rows]
Cleanup: 4 tables [00:00, 1734.62 tables/s]


150 calls


Preparing: 50 rows [00:00, 21349.40 rows/s]
Processed: 50 rows [00:00, 9573.41 rows/s]
Preparing: 50 rows [00:00, 19208.21 rows/s]
Processed: 50 rows [03:16,  3.94s/ rows]
Cleanup: 4 tables [00:00, 2557.89 tables/s]


200 calls


Preparing: 50 rows [00:00, 23912.79 rows/s]
Processed: 50 rows [00:00, 8296.02 rows/s]
Preparing: 50 rows [00:00, 16745.07 rows/s]
Processed: 50 rows [03:42,  4.45s/ rows]
Cleanup: 4 tables [00:00, 2275.49 tables/s]


250 calls


Preparing: 50 rows [00:00, 24365.66 rows/s]
Processed: 50 rows [00:00, 9852.26 rows/s]
Preparing: 50 rows [00:00, 23923.71 rows/s]
Processed: 50 rows [03:41,  4.42s/ rows]
Cleanup: 4 tables [00:00, 2362.66 tables/s]


300 calls


Preparing: 50 rows [00:00, 20919.22 rows/s]
Processed: 50 rows [00:00, 10756.83 rows/s]
Preparing: 50 rows [00:00, 17029.25 rows/s]
Processed: 50 rows [03:41,  4.43s/ rows]
Cleanup: 4 tables [00:00, 2443.17 tables/s]


350 calls


Preparing: 50 rows [00:00, 20168.80 rows/s]
Processed: 50 rows [00:00, 10206.11 rows/s]
Preparing: 50 rows [00:00, 20858.88 rows/s]
Processed: 50 rows [03:21,  4.02s/ rows]
Cleanup: 4 tables [00:00, 2362.66 tables/s]


400 calls


Preparing: 50 rows [00:00, 28684.89 rows/s]
Processed: 50 rows [00:00, 10729.32 rows/s]
Preparing: 50 rows [00:00, 26952.22 rows/s]
Processed: 50 rows [03:45,  4.51s/ rows]
Cleanup: 4 tables [00:00, 2580.71 tables/s]


450 calls


Preparing: 50 rows [00:00, 18016.77 rows/s]
Processed: 50 rows [00:00, 7313.01 rows/s]
Preparing: 50 rows [00:00, 19827.47 rows/s]
Processed: 50 rows [03:31,  4.23s/ rows]
Cleanup: 4 tables [00:00, 2162.29 tables/s]


500 calls


Preparing: 50 rows [00:00, 23017.80 rows/s]
Processed: 50 rows [00:00, 9179.11 rows/s]
Preparing: 50 rows [00:00, 18647.98 rows/s]
Processed: 50 rows [03:53,  4.68s/ rows]
Cleanup: 4 tables [00:00, 2284.17 tables/s]


550 calls


Preparing: 50 rows [00:00, 23899.17 rows/s]
Processed: 50 rows [00:00, 9817.67 rows/s]
Preparing: 50 rows [00:00, 16313.90 rows/s]
Processed: 43 rows [02:52,  4.42s/ rows]

APIError: Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}


Processed: 50 rows [03:21,  4.02s/ rows]
Cleanup: 4 tables [00:00, 2509.31 tables/s]


600 calls


Preparing: 50 rows [00:00, 22484.74 rows/s]
Processed: 50 rows [00:00, 10940.90 rows/s]
Preparing: 50 rows [00:00, 17985.87 rows/s]
Processed: 29 rows [02:07,  4.77s/ rows]

APIError: Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}


Processed: 50 rows [03:42,  4.45s/ rows]
Cleanup: 4 tables [00:00, 2747.66 tables/s]


650 calls


Preparing: 50 rows [00:00, 26329.59 rows/s]
Processed: 50 rows [00:00, 9873.13 rows/s]
Preparing: 50 rows [00:00, 20162.98 rows/s]
Processed: 50 rows [03:20,  4.01s/ rows]
Cleanup: 4 tables [00:00, 3853.29 tables/s]


700 calls


Preparing: 50 rows [00:00, 23923.71 rows/s]
Processed: 50 rows [00:00, 9433.88 rows/s]
Preparing: 50 rows [00:00, 17611.29 rows/s]
Processed: 16 rows [00:54,  4.01s/ rows]

APIError: Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}


Processed: 50 rows [03:12,  3.86s/ rows]
Cleanup: 4 tables [00:00, 3007.75 tables/s]


750 calls


Preparing: 50 rows [00:00, 26455.81 rows/s]
Processed: 50 rows [00:00, 10586.33 rows/s]
Preparing: 50 rows [00:00, 20400.31 rows/s]
Processed: 50 rows [03:12,  3.85s/ rows]
Cleanup: 4 tables [00:00, 3744.91 tables/s]


800 calls


Preparing: 50 rows [00:00, 29171.68 rows/s]
Processed: 50 rows [00:00, 12048.44 rows/s]
Preparing: 50 rows [00:00, 17421.10 rows/s]
Processed: 50 rows [03:08,  3.76s/ rows]
Cleanup: 4 tables [00:00, 2594.28 tables/s]


850 calls


Preparing: 50 rows [00:00, 25291.27 rows/s]
Processed: 50 rows [00:00, 9382.81 rows/s]
Preparing: 50 rows [00:00, 16117.06 rows/s]
Processed: 50 rows [03:08,  3.78s/ rows]
Cleanup: 4 tables [00:00, 2820.65 tables/s]


900 calls


Preparing: 50 rows [00:00, 24903.84 rows/s]
Processed: 50 rows [00:00, 9540.31 rows/s]
Preparing: 50 rows [00:00, 17869.39 rows/s]
Processed: 50 rows [03:11,  3.83s/ rows]
Cleanup: 4 tables [00:00, 4306.27 tables/s]


950 calls


Preparing: 50 rows [00:00, 22269.85 rows/s]
Processed: 50 rows [00:00, 9435.16 rows/s]
Preparing: 50 rows [00:00, 17568.50 rows/s]
Processed: 9 rows [00:32,  4.14s/ rows]

APIError: Error code: 500 - {'type': 'error', 'error': {'type': 'api_error', 'message': 'Internal server error'}}


Processed: 50 rows [03:21,  4.03s/ rows]
Cleanup: 4 tables [00:00, 2794.81 tables/s]

1000 calls





#### Great!

The Anthropic API occasionally throws an "Internal server error" exception (four times in this run), but
Claude 3.5 Sonnet forced into tool mode went through the 1,000-call run without a single JSON failure. It also honored the field limit given in the Pydantic description: the `count_issues` validator, which requires exactly three key issues, never raised an error.
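
The occasional server errors reduce the number of responses actually evaluated. One way to keep the sample size intact would be to retry failed calls with exponential backoff. A generic sketch, not used in the run above; `call_with_retries` and its parameters are hypothetical, and which exception class to retry on (for instance a server-error class from the SDK) is left as an assumption for the caller:

```python
import time


def call_with_retries(fn, retries=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a retry_on exception wait base_delay * 2**attempt and retry.

    The last failure is re-raised once `retries` retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in function that fails twice before succeeding
state = {"calls": 0}


def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("Internal server error")
    return "ok"


result = call_with_retries(flaky_call, retry_on=(RuntimeError,), sleep=lambda s: None)
print(result, "after", state["calls"], "calls")
```

Passing `sleep` as a parameter keeps the helper testable without real delays.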


Next up is...

### Google Gemini Pro 1.5

[Google documentation on structured outputs](https://ai.google.dev/gemini-api/docs/json-mode?lang=python) says right away that prompt-based generation is unreliable, and "Google can't guarantee that it will produce JSON and nothing but JSON." This indeed appears to be true: with a prompt alone, the JSON output often arrives wrapped in a Markdown code fence:

In [3]:
import os
import json
from datachain.lib.dc import Column, DataChain
import google.generativeai as genai

source_files = "gs://datachain-demo/chatbot-KiT/"
google_api_key=os.getenv("GOOGLE_API_KEY")

PROMPT="""
You’re a Customer Insights AI. 
Analyze the following dialog and provide evaluation in JSON format with keys: “sentiment” (positive/negative/neutral), 
“key_issues” (list), and “action_items” (list of dicts with “team” and “task”).
"""

def gemini_setup():
    genai.configure(api_key=google_api_key)
    return genai.GenerativeModel(model_name='gemini-1.5-pro-latest', system_instruction=PROMPT)

def eval_dialogue (file, model):
    response = model.generate_content(file.read(), stream=False)
    response.resolve()
    try:
         json_dict = json.loads(response.text)
    except json.JSONDecodeError as e:
         # Catch cases where Gemini fails to produce JSON
         print(f"IndexError: {e}")
         print(response.text)
         return response.text
    return response.text

chain =  (
         DataChain.from_storage(source_files, type="text")
            .settings(cache=True)
            .limit(1)
            .setup(model = lambda: gemini_setup())
            .map(gemini = eval_dialogue)
            .exec()
)

Preparing: 50 rows [00:00, 28089.37 rows/s]
Processed: 50 rows [00:00, 8382.57 rows/s]
Preparing: 1 rows [00:00, 394.65 rows/s]
Download: 0.00B [00:00, ?B/s]

IndexError: Expecting value: line 1 column 1 (char 0)
```json
{
  "sentiment": "negative",
  "key_issues": [
    "Bot misunderstood user confirmation.",
    "Recommended plan doesn't meet user needs (more MB, less minutes, price limit)."
  ],
  "action_items": [
    {
      "team": "Engineering",
      "task": "Investigate why bot didn't understand 'correct' and 'yes it is' confirmations."
    },
    {
      "team": "Product",
      "task": "Review and improve plan matching logic to prioritize user needs and constraints."
    }
  ]
}
``` 




Processed: 1 rows [00:00, 165.37 rows/s]
Cleanup: 4 tables [00:00, 2552.44 tables/s]
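
The fenced payload above is still valid JSON once the Markdown fence is removed. The following is a minimal sketch of a lenient parser, assuming the response wraps a single JSON payload in one fence as in the output above; `parse_json_lenient` is a hypothetical helper and not part of any SDK.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

# Optional opening fence (with or without the "json" tag), payload, closing fence.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"(?:json)?\s*\n(.*?)\n\s*" + FENCE + r"\s*$",
    re.DOTALL,
)

def parse_json_lenient(text):
    """Parse JSON, tolerating a single Markdown code fence wrapped around it."""
    match = FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)

# Hypothetical response shaped like the failing output above.
raw = FENCE + 'json\n{"sentiment": "negative", "key_issues": []}\n' + FENCE + " \n"
parsed = parse_json_lenient(raw)
print(parsed["sentiment"])
```

This only papers over one failure pattern; any other preamble would still raise `json.JSONDecodeError`.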


According to [Google documentation](https://ai.google.dev/gemini-api/docs/json-mode?lang=python), there are three possible ways to get around this, and they all require switching the model config to the JSON output mode. 

In the most convenient form, the output configuration can point directly to a user class or to a data model subclassed from `typing.TypedDict`.
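
For reference, the `typing.TypedDict` variant of the data model could be declared as in the sketch below. It uses only the standard library; how the class is handed to the API is shown as a comment and assumes the same `response_schema` key that the next cell uses.

```python
from typing import TypedDict, get_type_hints

class ActionItem(TypedDict):
    team: str
    task: str

class EvalResponse(TypedDict):
    sentiment: str
    key_issues: list[str]
    action_items: list[ActionItem]

# Assumption: passed like the class in the cell below, i.e.
# generation_config={"response_mime_type": "application/json",
#                    "response_schema": EvalResponse}
fields = get_type_hints(EvalResponse)
print(sorted(fields))
```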

In both cases, the result will be similar to the following:

In [16]:
import os
import json
from datachain.lib.dc import Column, DataChain
import google.generativeai as genai

source_files = "gs://datachain-demo/chatbot-KiT/"
google_api_key=os.getenv("GOOGLE_API_KEY")

PROMPT="""
You’re a Customer Insights AI. 
Analyze the following dialog and provide evaluation in JSON format.
"""
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional

class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str 
    key_issues: list[str] 
    action_items: list[ActionItem]

def gemini_setup():
    genai.configure(api_key=google_api_key)
    return genai.GenerativeModel(model_name='gemini-1.5-pro-latest', 
                                 system_instruction=PROMPT,
                                 generation_config={"response_mime_type": "application/json",
                                                     "response_schema": EvalResponse
                                                   },
                                )

def eval_dialogue (file, model):
    response = model.generate_content(file.read(), stream=False)
    response.resolve()
    try:
         json_dict = json.loads(response.text)
    except json.JSONDecodeError as e:
         # Catch cases where Gemini fails to produce valid JSON
         print(f"JSONDecodeError: {e}")
         print(response.text)
         return response.text
    try:
         # Attempt to convert the response dict to EvalResponse object
         eval_object = EvalResponse(**json_dict)
         return str(response.text)
    except ValidationError as e:
         # Catch Pydantic validation errors
         print(f"Pydantic error: {e}")
         print(response.text)
         return str(response.text)
    
    return str(eval_object)

chain =  (
         DataChain.from_storage(source_files, type="text")
            .settings(cache=True)
            .setup(model = lambda: gemini_setup())
            .map(gemini = eval_dialogue)
            .exec()
)

Preparing: 50 rows [00:00, 30709.50 rows/s]
Processed: 50 rows [00:00, 9384.49 rows/s]
Preparing: 50 rows [00:00, 17737.90 rows/s]
Download: 0.00B [00:00, ?B/s]

Pydantic error: 1 validation error for EvalResponse
sentiment
  Field required [type=missing, input_value={'action_items': [{'task'...nd could be improved."]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
{"action_items": [{"task": "Investigate why the bot recommends a more expensive plan when the user specified a maximum budget.", "team": "Bot Development"}, {"task": "Improve the bot's understanding of user requests related to additional packages or modifications to existing plans.", "team": "NLU Training"}, {"task": "Train the bot on handling farewells like 'byebye' more gracefully.", "team": "Conversation Flow"}], "key_issues": ["The bot fails to respect the user's budget constraint.", "The bot struggles to understand requests about plan modifications.", "The bot's response to the user's farewell is abrupt and could be improved."]} 




Pydantic error: 1 validation error for EvalResponse
sentiment
  Field required [type=missing, input_value={'action_items': [{'task'...nfusion for the user.']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
{"action_items": [{"task": "Bot should clarify what \"inclusive international minutes\" means in the offered plan.", "team": "Bot team"}, {"task": "Bot should confirm if the user is looking for a plan with inclusive international minutes or just the option to add them.", "team": "Bot team"}], "key_issues": ["The bot recommended a plan without considering the user's need for international calls despite explicitly stating it in the summary.", "The bot didn't clarify what \"inclusive international minutes\" means, leading to potential confusion for the user."]} 




Pydantic error: 1 validation error for EvalResponse
sentiment
  Field required [type=missing, input_value={'action_items': [{'task'...r towards a solution."]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
{"action_items": [{"task": "Improve bot's ability to understand user intents and extract key information from free-form text.", "team": "Engineering"}, {"task": "Train the bot on a wider variety of expressions for common requests, such as 'cheapest offer', 'less than I pay now', etc.", "team": "Data Science"}, {"task": "Implement a more flexible confirmation flow, allowing users to correct or adjust their input more easily.", "team": "Design"}, {"task": "Explore the possibility of asking clarifying questions when user input is ambiguous or incomplete.", "team": "Engineering"}], "key_issues": ["The bot struggles to understand user intents expressed in natural language.", "The conversation flow is rigid, leading to frustration when the use



Pydantic error: 1 validation error for EvalResponse
sentiment
  Field required [type=missing, input_value={'action_items': [{'task'...n inappropriate plan."]}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
{"action_items": [{"task": "Bot didn't understand \"Never\" and \"Yes\" answers for the \"Do you often travel outside of Europe?\" question.  Need to investigate and fix.", "team": "Bot team"}], "key_issues": ["The bot failed to understand the user's answer to a yes/no question multiple times. This led to the bot misunderstanding the user's travel habits and potentially recommending an inappropriate plan."]} 




Pydantic error: 1 validation error for EvalResponse
sentiment
  Field required [type=missing, input_value={'action_items': [{'task'...tion-gathering phase.']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.8/v/missing
{"action_items": [{"task": "Bot should handle clarification better", "team": "Product"}, {"task": "Bot should not end dialog prematurely", "team": "Engineering"}, {"task": "Bot should handle interruptions or clarifications gracefully during information gathering", "team": "Engineering"}], "key_issues": ["The bot struggles with user clarifications and prematurely ends the dialogue. The user provides a clarification about the number of text messages, but instead of smoothly incorporating this information, the bot gets confused and offers to restart. This highlights a need for the bot to handle interruptions or clarifications gracefully during its information-gathering phase."]} 



Processed: 50 rows [02:35,  3.12s/ rows]
Cleanup: 4 tables [00:00, 2236.07 tables/s]


#### As we can see, Gemini 1.5 Pro produces valid JSONs but does not stick to the data model provided.

Various object validation issues appear sporadically, most often missing fields. The average frequency of issues is about 2 in 50 runs, which is enough to disqualify this mode of operation.
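
The two failure classes tracked in this notebook (invalid JSON versus valid JSON that does not match the data model) can be tallied separately. The sketch below uses only the standard library and a hand-rolled key check instead of Pydantic; `classify` and the three sample strings are hypothetical and merely imitate the shapes seen in the outputs.

```python
import json

# Top-level keys the data model requires, with the JSON type expected for each.
REQUIRED = {"sentiment": str, "key_issues": list, "action_items": list}

def classify(text):
    """Return 'json_error', 'object_error' or 'ok' for one model response."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "json_error"
    if not isinstance(data, dict):
        return "object_error"
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            return "object_error"
    return "ok"

# Hypothetical, abbreviated responses.
responses = [
    '{"sentiment": "negative", "key_issues": [], "action_items": []}',
    '{"key_issues": ["budget ignored"], "action_items": []}',  # missing "sentiment"
    '{"sentiment": "negative", "key_issues": [], "action_items" = []}',  # not JSON
]

counts = {}
for response in responses:
    label = classify(response)
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Dividing each count by the number of calls gives the per-mode error rates reported in the summary table.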

So let us try a different "recommended" JSON mode syntax – insert the schema right into the prompt:


In [3]:
import os
import json
from datachain.lib.dc import Column, DataChain
import google.generativeai as genai

source_files = "gs://datachain-demo/chatbot-KiT/"
google_api_key=os.getenv("GOOGLE_API_KEY")

PROMPT="""
You’re a Customer Insights AI. 
Analyze the following dialog and provide evaluation in JSON format with schema EvalResponse:

EvalResponse = {"sentiment": str, "key_issues": list[str], "action_items" = list[ActionItem]}
ActionItem = {"team: str, "task":str}
"""
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional


class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 3 problems discovered in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")

def gemini_setup():
    genai.configure(api_key=google_api_key)
    return genai.GenerativeModel(model_name='gemini-1.5-pro-latest', 
                                 system_instruction=PROMPT,
                                 generation_config={"response_mime_type": "application/json"}
                                )

def eval_dialogue (file, model):
    response = model.generate_content(file.read(), stream=False)
    response.resolve()
    try:
         json_dict = json.loads(response.text)
    except json.JSONDecodeError as e:
         # Catch cases where Gemini fails to produce valid JSON
         print(f"JSONDecodeError: {e}")
         print(response.text)
         return response.text
    try:
         # Attempt to convert the response dict to EvalResponse object
         eval_object = EvalResponse(**json_dict)
         return str(response.text)
    except ValidationError as e:
         # Catch Pydantic validation errors
         print(f"Pydantic error: {e}")
         print(response.text)
         return str(response.text)
    
    return str(eval_object)

for i in range(1,21):
    chain =  (
             DataChain.from_storage(source_files, type="text")
                .settings(cache=True)
                .setup(model = lambda: gemini_setup())
                .map(gemini = eval_dialogue)
                .exec()
    )
    print(f"{i*50} calls")

Listing gs://datachain-demo: 50 objects [00:00, 399.36 objects/s]
Preparing: 50 rows [00:00, 17199.64 rows/s]
Processed: 50 rows [00:00, 13683.62 rows/s]
Preparing: 50 rows [00:00, 24640.49 rows/s]
Processed: 50 rows [02:10,  2.61s/ rows]
Cleanup: 4 tables [00:00, 2173.78 tables/s]


50 calls


Preparing: 50 rows [00:00, 20260.38 rows/s]
Processed: 50 rows [00:00, 9844.40 rows/s]
Preparing: 50 rows [00:00, 17356.22 rows/s]
Processed: 50 rows [02:15,  2.72s/ rows]
Cleanup: 4 tables [00:00, 4157.92 tables/s]


100 calls


Preparing: 50 rows [00:00, 31291.44 rows/s]
Processed: 50 rows [00:00, 12468.20 rows/s]
Preparing: 50 rows [00:00, 21159.84 rows/s]
Processed: 50 rows [02:10,  2.60s/ rows]
Cleanup: 4 tables [00:00, 937.64 tables/s]


150 calls


Preparing: 50 rows [00:00, 23139.71 rows/s]
Processed: 50 rows [00:00, 9848.10 rows/s]
Preparing: 50 rows [00:00, 19508.39 rows/s]
Processed: 50 rows [02:13,  2.67s/ rows]
Cleanup: 4 tables [00:00, 2711.69 tables/s]


200 calls


Preparing: 50 rows [00:00, 21843.06 rows/s]
Processed: 50 rows [00:00, 9384.49 rows/s]
Preparing: 50 rows [00:00, 19312.57 rows/s]
Processed: 50 rows [02:13,  2.68s/ rows]
Cleanup: 4 tables [00:00, 254.97 tables/s]


250 calls


Preparing: 50 rows [00:00, 5231.24 rows/s]
Processed: 50 rows [00:00, 1323.88 rows/s]
Preparing: 50 rows [00:00, 2667.96 rows/s]
Processed: 50 rows [02:11,  2.63s/ rows]
Cleanup: 4 tables [00:00, 2258.65 tables/s]


300 calls


Preparing: 50 rows [00:00, 22574.29 rows/s]
Processed: 50 rows [00:00, 9414.40 rows/s]
Preparing: 50 rows [00:00, 20965.23 rows/s]
Processed: 45 rows [02:01,  2.18s/ rows]

JSONDecodeError: Expecting ':' delimiter: line 1 column 116 (char 115)
{"sentiment": "negative", "key_issues": ["bot doesn't understand user's price correction request"], "action_items" = [{"team": "bot_improvement", "task": "Improve natural language understanding for price correction"}]}



Processed: 50 rows [02:13,  2.67s/ rows]
Cleanup: 4 tables [00:00, 3046.53 tables/s]


350 calls


Preparing: 50 rows [00:00, 20209.62 rows/s]
Processed: 50 rows [00:00, 10445.02 rows/s]
Preparing: 50 rows [00:00, 20998.82 rows/s]
Processed: 16 rows [00:44,  2.92s/ rows]

JSONDecodeError: Expecting ':' delimiter: line 1 column 168 (char 167)
{"sentiment": "negative", "key_issues": ["User's need for international calls not met", "Bot's suggested plan doesn't match user's budget"], "action_items": [{"team: "bot_improvement", "task": "Improve understanding of user requirements, especially regarding international calls"}, {"team": "bot_improvement", "task": "Enhance filtering of plans to strictly adhere to user's budget constraints"}]}




Processed: 50 rows [02:19,  2.78s/ rows]
Cleanup: 4 tables [00:00, 2413.99 tables/s]


400 calls


Preparing: 50 rows [00:00, 19608.71 rows/s]
Processed: 50 rows [00:00, 5226.29 rows/s]
Preparing: 50 rows [00:00, 20500.02 rows/s]
Processed: 50 rows [02:16,  2.73s/ rows]
Cleanup: 4 tables [00:00, 2612.46 tables/s]


450 calls


Preparing: 50 rows [00:00, 25575.02 rows/s]
Processed: 50 rows [00:00, 7965.18 rows/s]
Preparing: 50 rows [00:00, 7579.43 rows/s]
Processed: 50 rows [02:11,  2.63s/ rows]
Cleanup: 4 tables [00:00, 2746.76 tables/s]


500 calls


Preparing: 50 rows [00:00, 20490.00 rows/s]
Processed: 50 rows [00:00, 8841.65 rows/s]
Preparing: 50 rows [00:00, 17792.08 rows/s]
Processed: 50 rows [02:14,  2.68s/ rows]
Cleanup: 4 tables [00:00, 3620.46 tables/s]


550 calls


Preparing: 50 rows [00:00, 20858.88 rows/s]
Processed: 50 rows [00:00, 8441.62 rows/s]
Preparing: 50 rows [00:00, 17789.06 rows/s]
Processed: 50 rows [02:15,  2.70s/ rows]
Cleanup: 4 tables [00:00, 2989.53 tables/s]


600 calls


Preparing: 50 rows [00:00, 19018.34 rows/s]
Processed: 50 rows [00:00, 10936.34 rows/s]
Preparing: 50 rows [00:00, 18044.67 rows/s]
Processed: 50 rows [02:10,  2.62s/ rows]
Cleanup: 4 tables [00:00, 2440.32 tables/s]


650 calls


Preparing: 50 rows [00:00, 22748.15 rows/s]
Processed: 50 rows [00:00, 9790.63 rows/s]
Preparing: 50 rows [00:00, 15384.04 rows/s]
Processed: 50 rows [02:07,  2.55s/ rows]
Cleanup: 4 tables [00:00, 2740.03 tables/s]


700 calls


Preparing: 50 rows [00:00, 19501.13 rows/s]
Processed: 50 rows [00:00, 8926.71 rows/s]
Preparing: 50 rows [00:00, 17624.61 rows/s]
Processed: 50 rows [02:11,  2.64s/ rows]
Cleanup: 4 tables [00:00, 1758.99 tables/s]


750 calls


Preparing: 50 rows [00:00, 16890.72 rows/s]
Processed: 50 rows [00:00, 9535.11 rows/s]
Preparing: 50 rows [00:00, 19246.99 rows/s]
Processed: 2 rows [00:03,  1.79s/ rows]

JSONDecodeError: Expecting ':' delimiter: line 1 column 248 (char 247)
{"sentiment": "negative", "key_issues": ["Bot doesn't handle user input variations effectively", "Bot fails to accurately capture and present summarized information", "Bot doesn't guide the user to successfully complete the task"], "action_items" = [{"team": "NLU", "task": "Improve model's ability to handle typos, abbreviations, and different phrasings for the same intent."}, {"team": "Dialogue", "task": "Implement confirmation mechanisms to ensure accurate data capture and present a summary for user verification before proceeding."}, {"team": "Integration", "task": "Connect the chatbot with a backend system to access real-time mobile phone plan data and offer personalized recommendations."}]}



Processed: 50 rows [02:17,  2.74s/ rows]
Cleanup: 4 tables [00:00, 3328.15 tables/s]


800 calls


Preparing: 50 rows [00:00, 22206.18 rows/s]
Processed: 50 rows [00:00, 9296.71 rows/s]
Preparing: 50 rows [00:00, 17933.57 rows/s]
Processed: 50 rows [02:08,  2.57s/ rows]
Cleanup: 4 tables [00:00, 3248.88 tables/s]


850 calls


Preparing: 50 rows [00:00, 34222.45 rows/s]
Processed: 50 rows [00:00, 6660.16 rows/s]
Preparing: 50 rows [00:00, 21700.66 rows/s]
Processed: 50 rows [02:13,  2.66s/ rows]
Cleanup: 4 tables [00:00, 4592.72 tables/s]


900 calls


Preparing: 50 rows [00:00, 26342.82 rows/s]
Processed: 50 rows [00:00, 12642.59 rows/s]
Preparing: 50 rows [00:00, 13224.57 rows/s]
Processed: 50 rows [02:15,  2.71s/ rows]
Cleanup: 4 tables [00:00, 2792.94 tables/s]


950 calls


Preparing: 50 rows [00:00, 18342.97 rows/s]
Processed: 50 rows [00:00, 9125.99 rows/s]
Preparing: 50 rows [00:00, 17005.77 rows/s]
Processed: 42 rows [01:54,  2.18s/ rows]

JSONDecodeError: Expecting ':' delimiter: line 1 column 71 (char 70)
{"sentiment": "positive", "key_issues": [], "action_items": [{"team: "N/A", "task": "N/A"}]}



Processed: 50 rows [02:12,  2.65s/ rows]
Cleanup: 4 tables [00:00, 2841.19 tables/s]

1000 calls





The results of this run are somewhat unexpected: when the output format is given in the prompt, Gemini Pro stops failing model validation but occasionally fails to produce valid JSON:
```
JSONDecodeError: Expecting ':' delimiter: line 1 column 135 (char 134)
{"sentiment": "negative", "key_issues": ["bot failed to recommend a plan", "user's budget is unrealistic"], "action_items": [{"team: "bot", "task": "improve plan recommendation logic to handle edge cases like extremely low budgets"}, {"team": "bot", "task": "prompt user for a more realistic budget if their initial input is too low"}]}
```

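This is obviously not acceptable. Notably, the malformed fragments (`"action_items" =` and `{"team: `) match two typos in the schema text of the prompt itself, so the model appears to be copying the notation it was shown. Mixing instructions with formats in one prompt is in any case too fragile.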
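
The hand-written schema in `PROMPT` above contains `"action_items" =` and an unclosed `"team` key, and the failing outputs repeat exactly those fragments. One way to rule out such typos is to generate the format description with `json.dumps` instead of typing it. This is a sketch only: the `EXAMPLE` object and the prompt wording are hypothetical, and whether it lowers the failure rate was not measured here.

```python
import json

# Hypothetical example object describing the expected shape of the reply.
EXAMPLE = {
    "sentiment": "positive | negative | neutral",
    "key_issues": ["string"],
    "action_items": [{"team": "string", "task": "string"}],
}

PROMPT = (
    "You're a Customer Insights AI.\n"
    "Analyze the following dialog and reply with JSON shaped like this example:\n"
    + json.dumps(EXAMPLE, indent=2)
)

# Round-trip check: the format text embedded in the prompt is itself valid JSON.
embedded = PROMPT.split("example:\n", 1)[1]
print(json.loads(embedded) == EXAMPLE)
```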

Finally, let us give Gemini a last chance by feeding a JSON output schema directly to the model configuration. 

Unfortunately, unlike the Anthropic Claude tool configuration, Gemini cannot read the output of the Pydantic `model_json_schema()` method, so we need to hardcode a `genai.protos.Schema` object that duplicates our data model. The configuration looks clunky but appears to work well:


In [4]:
import os
import json
from datachain.lib.dc import Column, DataChain
import google.generativeai as genai

source_files = "gs://datachain-demo/chatbot-KiT/"
google_api_key=os.getenv("GOOGLE_API_KEY")

PROMPT="""
You’re a Customer Insights AI. 
Analyze the following dialog and provide evaluation in JSON format.
"""
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional


class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str = Field(description="dialog sentiment (positive/negative/neutral)")
    key_issues: list[str] = Field(description="list of 3 problems discovered in the dialog")
    action_items: list[ActionItem] = Field(description="list of dicts with 'team' and 'task'")

g_str = genai.protos.Schema(type=genai.protos.Type.STRING)

g_action_item = genai.protos.Schema(
            type=genai.protos.Type.OBJECT,
            properties={
                'team':genai.protos.Schema(type=genai.protos.Type.STRING),
                'task':genai.protos.Schema(type=genai.protos.Type.STRING)
            },
            required=['team','task']
        )

g_evaluation=genai.protos.Schema(
            type=genai.protos.Type.OBJECT,
            properties={
                'sentiment':genai.protos.Schema(type=genai.protos.Type.STRING),
                'key_issues':genai.protos.Schema(type=genai.protos.Type.ARRAY, items=g_str),
                'action_items':genai.protos.Schema(type=genai.protos.Type.ARRAY, items=g_action_item)
            },
            required=['sentiment','key_issues', 'action_items']
        )

def gemini_setup():
    genai.configure(api_key=google_api_key)
    return genai.GenerativeModel(model_name='gemini-1.5-pro-latest', 
                                 system_instruction=PROMPT,
                                 generation_config={"response_mime_type": "application/json",
                                                     "response_schema": g_evaluation,
                                                   }
                                )

def eval_dialogue (file, model):
    response = model.generate_content(file.read(), stream=False)
    response.resolve()
    try:
         json_dict = json.loads(response.text)
    except json.JSONDecodeError as e:
         # Catch cases where Gemini fails to produce valid JSON
         print(f"JSONDecodeError: {e}")
         print(response.text)
         return response.text
    try:
         # Attempt to convert the response dict to EvalResponse object
         eval_object = EvalResponse(**json_dict)
         return str(response.text)
    except ValidationError as e:
         # Catch Pydantic validation errors
         print(f"Pydantic error: {e}")
         print(response.text)
         return str(response.text)
    
    return str(eval_object)
    
for i in range(1,21):
    chain =  (
             DataChain.from_storage(source_files, type="text")
                .settings(cache=True)
                .setup(model = lambda: gemini_setup())
                .map(gemini = eval_dialogue)
                .exec()
    )
    print(f"{i*50} calls")

Preparing: 50 rows [00:00, 26329.59 rows/s]
Processed: 50 rows [00:00, 7691.17 rows/s]
Preparing: 50 rows [00:00, 23201.15 rows/s]
Processed: 50 rows [03:18,  3.97s/ rows]
Cleanup: 4 tables [00:00, 3445.01 tables/s]


50 calls


Preparing: 50 rows [00:00, 24425.25 rows/s]
Processed: 50 rows [00:00, 11249.00 rows/s]
Preparing: 50 rows [00:00, 8873.08 rows/s]
Processed: 50 rows [03:21,  4.03s/ rows]
Cleanup: 4 tables [00:00, 2865.45 tables/s]


100 calls


Preparing: 50 rows [00:00, 17482.09 rows/s]
Processed: 50 rows [00:00, 9407.22 rows/s]
Preparing: 50 rows [00:00, 19522.92 rows/s]
Processed: 50 rows [03:12,  3.85s/ rows]
Cleanup: 4 tables [00:00, 3134.17 tables/s]


150 calls


Preparing: 50 rows [00:00, 24594.25 rows/s]
Processed: 50 rows [00:00, 10752.97 rows/s]
Preparing: 50 rows [00:00, 19517.47 rows/s]
Processed: 50 rows [03:05,  3.71s/ rows]
Cleanup: 4 tables [00:00, 2816.86 tables/s]


200 calls


Preparing: 50 rows [00:00, 18946.17 rows/s]
Processed: 50 rows [00:00, 10264.56 rows/s]
Preparing: 50 rows [00:00, 19025.24 rows/s]
Processed: 50 rows [03:10,  3.80s/ rows]
Cleanup: 4 tables [00:00, 2913.72 tables/s]


250 calls


Preparing: 50 rows [00:00, 26135.99 rows/s]
Processed: 50 rows [00:00, 9529.48 rows/s]
Preparing: 50 rows [00:00, 16727.70 rows/s]
Processed: 50 rows [03:09,  3.80s/ rows]
Cleanup: 4 tables [00:00, 4202.71 tables/s]


300 calls


Preparing: 50 rows [00:00, 26938.37 rows/s]
Processed: 50 rows [00:00, 12970.20 rows/s]
Preparing: 50 rows [00:00, 23994.87 rows/s]
Processed: 50 rows [03:11,  3.83s/ rows]
Cleanup: 4 tables [00:00, 2391.96 tables/s]


350 calls


Preparing: 50 rows [00:00, 19887.64 rows/s]
Processed: 50 rows [00:00, 11372.22 rows/s]
Preparing: 50 rows [00:00, 17667.67 rows/s]
Processed: 50 rows [03:08,  3.77s/ rows]
Cleanup: 4 tables [00:00, 2142.41 tables/s]


400 calls


Preparing: 50 rows [00:00, 19127.62 rows/s]
Processed: 50 rows [00:00, 9160.67 rows/s]
Preparing: 50 rows [00:00, 18699.53 rows/s]
Processed: 50 rows [03:05,  3.71s/ rows]
Cleanup: 4 tables [00:00, 3363.52 tables/s]


450 calls


Preparing: 50 rows [00:00, 23157.60 rows/s]
Processed: 50 rows [00:00, 8122.51 rows/s]
Preparing: 50 rows [00:00, 21911.52 rows/s]
Processed: 50 rows [03:09,  3.80s/ rows]
Cleanup: 4 tables [00:00, 3780.36 tables/s]


500 calls


Preparing: 50 rows [00:00, 22161.60 rows/s]
Processed: 50 rows [00:00, 6292.46 rows/s]
Preparing: 50 rows [00:00, 21533.55 rows/s]
Processed: 50 rows [03:04,  3.69s/ rows]
Cleanup: 4 tables [00:00, 2480.00 tables/s]


550 calls


Preparing: 50 rows [00:00, 24989.90 rows/s]
Processed: 50 rows [00:00, 10692.66 rows/s]
Preparing: 50 rows [00:00, 10817.87 rows/s]
Processed: 50 rows [03:08,  3.76s/ rows]
Cleanup: 4 tables [00:00, 2637.10 tables/s]


600 calls


Preparing: 50 rows [00:00, 18586.83 rows/s]
Processed: 50 rows [00:00, 8890.38 rows/s]
Preparing: 50 rows [00:00, 19715.63 rows/s]
Processed: 50 rows [03:16,  3.92s/ rows]
Cleanup: 4 tables [00:00, 2424.10 tables/s]


650 calls


Preparing: 50 rows [00:00, 27489.21 rows/s]
Processed: 50 rows [00:00, 13112.94 rows/s]
Preparing: 50 rows [00:00, 22990.05 rows/s]
Processed: 50 rows [03:05,  3.71s/ rows]
Cleanup: 4 tables [00:00, 2904.64 tables/s]


700 calls


Preparing: 50 rows [00:00, 31036.73 rows/s]
Processed: 50 rows [00:00, 12465.24 rows/s]
Preparing: 50 rows [00:00, 21196.20 rows/s]
Processed: 50 rows [03:07,  3.75s/ rows]
Cleanup: 4 tables [00:00, 4907.05 tables/s]


750 calls


Preparing: 50 rows [00:00, 11541.84 rows/s]
Processed: 50 rows [00:00, 10672.53 rows/s]
Preparing: 50 rows [00:00, 22419.84 rows/s]
Processed: 50 rows [03:04,  3.69s/ rows]
Cleanup: 4 tables [00:00, 3773.55 tables/s]


800 calls


Preparing: 50 rows [00:00, 27060.03 rows/s]
Processed: 50 rows [00:00, 11795.01 rows/s]
Preparing: 50 rows [00:00, 22696.45 rows/s]
Processed: 50 rows [03:08,  3.77s/ rows]
Cleanup: 4 tables [00:00, 3847.10 tables/s]


850 calls


Preparing: 50 rows [00:00, 31078.13 rows/s]
Processed: 50 rows [00:00, 13527.39 rows/s]
Preparing: 50 rows [00:00, 26633.88 rows/s]
Processed: 50 rows [03:11,  3.83s/ rows]
Cleanup: 4 tables [00:00, 3619.68 tables/s]


900 calls


Preparing: 50 rows [00:00, 27732.77 rows/s]
Processed: 50 rows [00:00, 11850.99 rows/s]
Preparing: 50 rows [00:00, 23777.23 rows/s]
Processed: 50 rows [03:09,  3.79s/ rows]
Cleanup: 4 tables [00:00, 2851.33 tables/s]


950 calls


Preparing: 50 rows [00:00, 24166.31 rows/s]
Processed: 50 rows [00:00, 12509.85 rows/s]
Preparing: 50 rows [00:00, 24777.32 rows/s]
Processed: 50 rows [03:15,  3.91s/ rows]
Cleanup: 4 tables [00:00, 4048.56 tables/s]

1000 calls
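
The hand-written `genai.protos.Schema` above duplicates the Pydantic model by hand. Pydantic emits nested models as `$defs` entries referenced through `$ref`, so a first step toward deriving the schema automatically would be to inline those references into one self-contained dict. The sketch below does only that, on a hand-written and abbreviated stand-in for the `model_json_schema()` output. It is an assumption that this indirection is part of what Gemini rejects, and the result would still need translating into `genai.protos.Schema` objects.

```python
import copy

def inline_refs(schema):
    """Return a copy of a JSON Schema dict with local '#/$defs/...' refs expanded.

    Note: recursive (self-referencing) models are not handled by this sketch.
    """
    defs = schema.get("$defs", {})

    def resolve(node):
        if isinstance(node, dict):
            if "$ref" in node:
                name = node["$ref"].split("/")[-1]
                return resolve(copy.deepcopy(defs[name]))
            return {k: resolve(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    return resolve(schema)

# Hand-written, abbreviated stand-in for EvalResponse.model_json_schema().
pydantic_style = {
    "$defs": {
        "ActionItem": {
            "type": "object",
            "properties": {"team": {"type": "string"}, "task": {"type": "string"}},
            "required": ["team", "task"],
        }
    },
    "type": "object",
    "properties": {
        "sentiment": {"type": "string"},
        "key_issues": {"type": "array", "items": {"type": "string"}},
        "action_items": {"type": "array", "items": {"$ref": "#/$defs/ActionItem"}},
    },
    "required": ["sentiment", "key_issues", "action_items"],
}

flat = inline_refs(pydantic_style)
print(flat["properties"]["action_items"]["items"]["required"])
```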





### OpenAI

OpenAI went through several iterations of API methods to coerce their models into structured outputs.
Their latest version is called ["Structured Outputs"](https://openai.com/index/introducing-structured-outputs-in-the-api/) and can take a Pydantic model as the schema. OpenAI explains that it works by constrained sampling (also called constrained decoding): at each generation step, tokens that would violate the schema are masked out, so in theory the output should always match the schema. Constrained sampling requires special machinery at inference time and depends only partially on training, so this mechanism would be difficult to replicate in open-source models.
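
The idea behind constrained sampling can be shown with a toy decoder: a grammar lists the tokens allowed in each state, and the decoder picks the best-scoring token among the allowed ones only. This is an illustration of the general technique, not OpenAI's implementation; the grammar, the token set, and `fake_model_scores` are all hypothetical.

```python
import json

# Toy grammar: state -> {allowed token: next state}. It accepts exactly
# {"sentiment": "<positive|negative|neutral>"}.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"sentiment"': "colon"},
    "colon": {":": "value"},
    "value": {'"positive"': "close", '"negative"': "close", '"neutral"': "close"},
    "close": {"}": "done"},
}

def fake_model_scores(prefix):
    """Stand-in for an LLM: fixed scores that favour chatty, invalid tokens."""
    return {
        "Sure!": 5.0, "json": 4.0, "{": 3.0, '"sentiment"': 2.5, ":": 2.0,
        '"negative"': 1.5, '"positive"': 1.0, '"neutral"': 0.5, "}": 0.1,
    }

def constrained_decode(score_fn, grammar, state="start"):
    tokens = []
    while state != "done":
        allowed = grammar[state]
        scores = score_fn(tokens)
        # Mask: only tokens the grammar permits in this state are candidates.
        best = max(allowed, key=lambda tok: scores.get(tok, float("-inf")))
        tokens.append(best)
        state = allowed[best]
    return "".join(tokens)

output = constrained_decode(fake_model_scores, GRAMMAR)
print(output)
print(json.loads(output))
```

Even though the stand-in model prefers "Sure!" at every step, the mask leaves it no way to emit anything outside the grammar, so the result always parses.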

And what is most impressive is... this just works out of the box, with a Pydantic model as guidance:

In [2]:
import os
import json
from datachain import File, DataChain, Column
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError

openai_api_key = os.getenv("OPENAI_API_KEY")
prompt = "You are assigned to evaluate the success of a dialog between user and a chatbot. Reply with a JSON."

class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str 
    key_issues: list[str]
    action_items: list[ActionItem] 

def eval_dialogue(client, file: File) -> str:
     completion = client.beta.chat.completions.parse(
         model="gpt-4o-2024-08-06",
         messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": file.read()},
         ],
         response_format=EvalResponse,
     )
     message = completion.choices[0].message

     try: 
         EvalResponse(**message.parsed.model_dump())
         return str(message.parsed)
     except ValidationError as e:
         print(e)
         return str(message.parsed)

# 1,000 entries to evaluate
for i in range(1,21):
    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", type="text")
       .setup(client=lambda : OpenAI(api_key=openai_api_key))
       .settings(cache=True)
       .map(evaluation=eval_dialogue)
       .exec()
    )
    print(f"{i*50} calls")


Preparing: 50 rows [00:00, 19816.23 rows/s]
Processed: 50 rows [00:00, 10354.26 rows/s]
Preparing: 50 rows [00:00, 17277.57 rows/s]
Processed: 50 rows [01:14,  1.50s/ rows]
Cleanup: 4 tables [00:00, 2799.00 tables/s]


50 calls


Preparing: 50 rows [00:00, 28934.22 rows/s]
Processed: 50 rows [00:00, 11660.56 rows/s]
Preparing: 50 rows [00:00, 20671.78 rows/s]
Processed: 50 rows [01:22,  1.64s/ rows]
Cleanup: 4 tables [00:00, 1837.39 tables/s]


100 calls


Preparing: 50 rows [00:00, 31726.96 rows/s]
Processed: 50 rows [00:00, 8047.71 rows/s]
Preparing: 50 rows [00:00, 18832.18 rows/s]
Processed: 50 rows [01:21,  1.63s/ rows]
Cleanup: 4 tables [00:00, 3625.94 tables/s]


150 calls


Preparing: 50 rows [00:00, 24462.29 rows/s]
Processed: 50 rows [00:00, 12821.13 rows/s]
Preparing: 50 rows [00:00, 22491.98 rows/s]
Processed: 50 rows [01:12,  1.45s/ rows]
Cleanup: 4 tables [00:00, 1809.06 tables/s]


200 calls


Preparing: 50 rows [00:00, 20592.62 rows/s]
Processed: 50 rows [00:00, 9336.86 rows/s]
Preparing: 50 rows [00:00, 22260.40 rows/s]
Processed: 50 rows [01:17,  1.54s/ rows]
Cleanup: 4 tables [00:00, 3150.06 tables/s]


250 calls


Preparing: 50 rows [00:00, 27623.18 rows/s]
Processed: 50 rows [00:00, 10676.88 rows/s]
Preparing: 50 rows [00:00, 21948.22 rows/s]
Processed: 50 rows [01:14,  1.50s/ rows]
Cleanup: 4 tables [00:00, 4841.91 tables/s]


300 calls


Preparing: 50 rows [00:00, 21496.02 rows/s]
Processed: 50 rows [00:00, 9720.29 rows/s]
Preparing: 50 rows [00:00, 20112.71 rows/s]
Processed: 50 rows [01:15,  1.50s/ rows]
Cleanup: 4 tables [00:00, 3231.98 tables/s]


350 calls


Preparing: 50 rows [00:00, 19215.25 rows/s]
Processed: 50 rows [00:00, 9007.61 rows/s]
Preparing: 50 rows [00:00, 18539.18 rows/s]
Processed: 50 rows [01:09,  1.39s/ rows]
Cleanup: 4 tables [00:00, 2536.24 tables/s]


400 calls


Preparing: 50 rows [00:00, 20526.10 rows/s]
Processed: 50 rows [00:00, 9273.28 rows/s]
Preparing: 50 rows [00:00, 21430.12 rows/s]
Processed: 50 rows [01:12,  1.46s/ rows]
Cleanup: 4 tables [00:00, 2525.55 tables/s]


450 calls


Preparing: 50 rows [00:00, 29363.65 rows/s]
Processed: 50 rows [00:00, 9675.00 rows/s]
Preparing: 50 rows [00:00, 27428.09 rows/s]
Processed: 50 rows [01:10,  1.41s/ rows]
Cleanup: 4 tables [00:00, 3967.18 tables/s]


500 calls


Preparing: 50 rows [00:00, 26924.53 rows/s]
Processed: 50 rows [00:00, 11676.79 rows/s]
Preparing: 50 rows [00:00, 14269.25 rows/s]
Processed: 50 rows [01:10,  1.41s/ rows]
Cleanup: 4 tables [00:00, 3887.21 tables/s]


550 calls


Preparing: 50 rows [00:00, 29070.58 rows/s]
Processed: 50 rows [00:00, 9538.15 rows/s]
Preparing: 50 rows [00:00, 21460.83 rows/s]
Processed: 50 rows [01:11,  1.43s/ rows]
Cleanup: 4 tables [00:00, 2255.00 tables/s]


600 calls


Preparing: 50 rows [00:00, 9439.83 rows/s]
Processed: 50 rows [00:00, 9410.60 rows/s]
Preparing: 50 rows [00:00, 21673.75 rows/s]
Processed: 50 rows [01:15,  1.50s/ rows]
Cleanup: 4 tables [00:00, 4499.12 tables/s]


650 calls


Preparing: 50 rows [00:00, 23550.28 rows/s]
Processed: 50 rows [00:00, 8741.05 rows/s]
Preparing: 50 rows [00:00, 20024.37 rows/s]
Processed: 50 rows [01:06,  1.34s/ rows]
Cleanup: 4 tables [00:00, 5038.20 tables/s]


700 calls


Preparing: 50 rows [00:00, 20502.02 rows/s]
Processed: 50 rows [00:00, 6633.20 rows/s]
Preparing: 50 rows [00:00, 14301.36 rows/s]
Processed: 50 rows [01:11,  1.44s/ rows]
Cleanup: 4 tables [00:00, 1886.78 tables/s]


750 calls


Preparing: 50 rows [00:00, 25392.32 rows/s]
Processed: 50 rows [00:00, 9243.85 rows/s]
Preparing: 50 rows [00:00, 14232.45 rows/s]
Processed: 50 rows [01:07,  1.36s/ rows]
Cleanup: 4 tables [00:00, 2440.68 tables/s]


800 calls


Preparing: 50 rows [00:00, 21079.02 rows/s]
Processed: 50 rows [00:00, 7375.51 rows/s]
Preparing: 50 rows [00:00, 16362.27 rows/s]
Processed: 50 rows [01:10,  1.40s/ rows]
Cleanup: 4 tables [00:00, 2679.21 tables/s]


850 calls


Preparing: 50 rows [00:00, 18498.30 rows/s]
Processed: 50 rows [00:00, 8778.00 rows/s]
Preparing: 50 rows [00:00, 16431.50 rows/s]
Processed: 50 rows [01:25,  1.70s/ rows]
Cleanup: 4 tables [00:00, 3146.51 tables/s]


900 calls


Preparing: 50 rows [00:00, 33005.23 rows/s]
Processed: 50 rows [00:00, 12281.28 rows/s]
Preparing: 50 rows [00:00, 24264.17 rows/s]
Processed: 50 rows [01:12,  1.44s/ rows]
Cleanup: 4 tables [00:00, 2077.93 tables/s]


950 calls


Preparing: 50 rows [00:00, 20055.01 rows/s]
Processed: 50 rows [00:00, 7933.84 rows/s]
Preparing: 50 rows [00:00, 16550.80 rows/s]
Processed: 50 rows [01:17,  1.55s/ rows]
Cleanup: 4 tables [00:00, 3142.39 tables/s]

1000 calls





There are several notable points about OpenAI's implementation.

* The OpenAI documentation claims that [structured output is achieved 100% of the time](https://openai.com/index/introducing-structured-outputs-in-the-api/) (see the graph attached here for convenience). This gives the optimistic impression that "strict" output mode is all that is needed. At the same time, the documentation recommends using Pydantic or Zod for the data models rather than feeding JSON schemas directly. In fact, [OpenAI's API examples](https://platform.openai.com/docs/guides/structured-outputs/examples_) do not even include instructions for the output format in the prompt, which suggests that any prompt will work in conjunction with structured output.

  <img src="https://github.com/iterative/datachain-examples/blob/main/assets/gpt-json-stats.jpg?raw=true" alt="Image description" style="width:500px;"/>

* Despite these bold claims, a request for structured output can still fail for GPT-4o. About once in 5,000 runs, GPT may still emit malformed JSON like this:

```
pydantic_core._pydantic_core.ValidationError: 1 validation error for Evaluation
  Invalid JSON: EOF while parsing a list at line 15579 column 967 [type=json_invalid, input_value='{"outcome":"Yes","explan...   \t   \t   \t   \t   ', input_type=str]
```

*  Moreover, an inconsistent prompt can provoke the model into producing a broken object more often. This is still rare, affecting about 0.05% of outputs, but it means the prompt must stay in sync with the requested output format. It is a small number, but it is much higher than the rosy "100% reliable" picture painted by OpenAI:


In [6]:
import os
import ast
import json
from datachain import File, DataChain, Column
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError, field_validator

openai_api_key = os.getenv("OPENAI_API_KEY")
prompt = "You are assigned to evaluate the success of a dialog between user and a chatbot. Do not reply with a JSON and use Markdown."

class ActionItem(BaseModel):
    team: str 
    task: str

class EvalResponse(BaseModel):
    sentiment: str 
    key_issues: list[str] 
    action_items: list[ActionItem] 

def eval_dialogue(client, file: File) -> EvalResponse:

     try:
         completion = client.beta.chat.completions.parse(
             model="gpt-4o-2024-08-06",
             messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": file.read()},
             ],
             response_format=EvalResponse,
         )
     except Exception as e:
         print(f"OpenAI API error: {e}")
         return EvalResponse(sentiment="", key_issues=[], action_items=[])
     message = completion.choices[0].message
     try:
         json_dict = ast.literal_eval(str(message.parsed.dict()))
     except Exception as e:
         # Catch cases where GPT fails to produce valid JSON
         print(f"Parsed message format error: {e}")
         print(message.parsed.dict())
         return EvalResponse(sentiment="", key_issues=[], action_items=[])
     try:
         EvalResponse(**message.parsed.dict())
     except ValidationError as e:
         print(e)
     return message.parsed

# 1,000 entries to evaluate
for i in range(1,21):
    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", type="text")
       .setup(client=lambda : OpenAI(api_key=openai_api_key))
       .settings(cache=True)
       .map(evaluation=eval_dialogue)
       .exec()
    )
    print(f"{i*50} calls")


Preparing: 50 rows [00:00, 19562.99 rows/s]
Processed: 50 rows [00:00, 6850.30 rows/s]
Preparing: 50 rows [00:00, 17811.72 rows/s]
Processed: 50 rows [01:11,  1.44s/ rows]
Cleanup: 4 tables [00:00, 3210.34 tables/s]


50 calls


Preparing: 50 rows [00:00, 22902.17 rows/s]
Processed: 50 rows [00:00, 9391.63 rows/s]
Preparing: 50 rows [00:00, 20432.11 rows/s]
Processed: 50 rows [01:12,  1.45s/ rows]
Cleanup: 4 tables [00:00, 2452.45 tables/s]


100 calls


Preparing: 50 rows [00:00, 23539.70 rows/s]
Processed: 50 rows [00:00, 9319.43 rows/s]
Preparing: 50 rows [00:00, 20848.51 rows/s]
Processed: 50 rows [01:20,  1.61s/ rows]
Cleanup: 4 tables [00:00, 3011.53 tables/s]


150 calls


Preparing: 50 rows [00:00, 26303.17 rows/s]
Processed: 50 rows [00:00, 9848.10 rows/s]
Preparing: 50 rows [00:00, 18314.14 rows/s]
Processed: 50 rows [01:26,  1.74s/ rows]
Cleanup: 4 tables [00:00, 3252.66 tables/s]


200 calls


Preparing: 50 rows [00:00, 20128.15 rows/s]
Processed: 50 rows [00:00, 9543.79 rows/s]
Preparing: 50 rows [00:00, 21051.52 rows/s]
Processed: 50 rows [01:23,  1.67s/ rows]
Cleanup: 4 tables [00:00, 3653.57 tables/s]


250 calls


Preparing: 50 rows [00:00, 39089.51 rows/s]
Processed: 50 rows [00:00, 11334.73 rows/s]
Preparing: 50 rows [00:00, 19262.90 rows/s]
Processed: 50 rows [01:09,  1.39s/ rows]
Cleanup: 4 tables [00:00, 2639.17 tables/s]


300 calls


Preparing: 50 rows [00:00, 41429.32 rows/s]
Processed: 50 rows [00:00, 13767.16 rows/s]
Preparing: 50 rows [00:00, 29658.49 rows/s]
Processed: 50 rows [01:12,  1.45s/ rows]
Cleanup: 4 tables [00:00, 2665.17 tables/s]


350 calls


Preparing: 50 rows [00:00, 41016.08 rows/s]
Processed: 50 rows [00:00, 9216.63 rows/s]
Preparing: 50 rows [00:00, 16030.82 rows/s]
Processed: 50 rows [01:18,  1.58s/ rows]
Cleanup: 4 tables [00:00, 3842.70 tables/s]


400 calls


Preparing: 50 rows [00:00, 29886.73 rows/s]
Processed: 50 rows [00:00, 11128.43 rows/s]
Preparing: 50 rows [00:00, 23063.37 rows/s]
Processed: 50 rows [01:09,  1.38s/ rows]
Cleanup: 4 tables [00:00, 5533.38 tables/s]


450 calls


Preparing: 50 rows [00:00, 25948.43 rows/s]
Processed: 50 rows [00:00, 12885.73 rows/s]
Preparing: 50 rows [00:00, 24686.90 rows/s]
Processed: 50 rows [01:08,  1.37s/ rows]
Cleanup: 4 tables [00:00, 5310.93 tables/s]


500 calls


Preparing: 50 rows [00:00, 25175.89 rows/s]
Processed: 50 rows [00:00, 12129.98 rows/s]
Preparing: 50 rows [00:00, 18464.10 rows/s]
Processed: 50 rows [01:07,  1.36s/ rows]
Cleanup: 4 tables [00:00, 3590.25 tables/s]


550 calls


Preparing: 50 rows [00:00, 25537.65 rows/s]
Processed: 50 rows [00:00, 7902.45 rows/s]
Preparing: 50 rows [00:00, 18953.02 rows/s]
Processed: 50 rows [01:08,  1.37s/ rows]
Cleanup: 4 tables [00:00, 2317.62 tables/s]


600 calls


Preparing: 50 rows [00:00, 26273.52 rows/s]
Processed: 50 rows [00:00, 10254.52 rows/s]
Preparing: 50 rows [00:00, 19865.04 rows/s]
Processed: 50 rows [01:13,  1.47s/ rows]
Cleanup: 4 tables [00:00, 2408.44 tables/s]


650 calls


Preparing: 50 rows [00:00, 23833.98 rows/s]
Processed: 50 rows [00:00, 12239.71 rows/s]
Preparing: 50 rows [00:00, 24753.92 rows/s]
Processed: 50 rows [01:05,  1.31s/ rows]
Cleanup: 4 tables [00:00, 2532.03 tables/s]


700 calls


Preparing: 50 rows [00:00, 25919.56 rows/s]
Processed: 50 rows [00:00, 13619.64 rows/s]
Preparing: 50 rows [00:00, 23785.32 rows/s]
Processed: 50 rows [01:06,  1.32s/ rows]
Cleanup: 4 tables [00:00, 3499.63 tables/s]


750 calls


Preparing: 50 rows [00:00, 35116.41 rows/s]
Processed: 50 rows [00:00, 11279.26 rows/s]
Preparing: 50 rows [00:00, 25694.09 rows/s]
Processed: 50 rows [01:17,  1.54s/ rows]
Cleanup: 4 tables [00:00, 2320.18 tables/s]


800 calls


Preparing: 50 rows [00:00, 27795.26 rows/s]
Processed: 50 rows [00:00, 11022.56 rows/s]
Preparing: 50 rows [00:00, 18301.35 rows/s]
Processed: 50 rows [01:09,  1.40s/ rows]
Cleanup: 4 tables [00:00, 2147.07 tables/s]


850 calls


Preparing: 50 rows [00:00, 30428.79 rows/s]
Processed: 50 rows [00:00, 8196.48 rows/s]
Preparing: 50 rows [00:00, 25215.25 rows/s]
Processed: 50 rows [01:12,  1.46s/ rows]
Cleanup: 4 tables [00:00, 2461.45 tables/s]


900 calls


Preparing: 50 rows [00:00, 25509.69 rows/s]
Processed: 50 rows [00:00, 8595.24 rows/s]
Preparing: 50 rows [00:00, 16431.50 rows/s]
Processed: 50 rows [01:14,  1.48s/ rows]
Cleanup: 4 tables [00:00, 3364.19 tables/s]


950 calls


Preparing: 50 rows [00:00, 46634.47 rows/s]
Processed: 50 rows [00:00, 10288.23 rows/s]
Preparing: 50 rows [00:00, 21258.51 rows/s]
Processed: 50 rows [01:06,  1.34s/ rows]
Cleanup: 4 tables [00:00, 2745.86 tables/s]

1000 calls





For the example above, approximately one run in 2,000 fails to produce valid JSON.
The failure is typically the OpenAI client giving up during parsing because the length limit was reached:

```
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/openai/lib/_parsing/_completions.py", line 72, in parse_chat_completion
    raise LengthFinishReasonError()
openai.LengthFinishReasonError: Could not parse response content as the length limit was reached
```

On rare occasions, the model can also produce a malformed JSON dictionary:

```
pydantic_core._pydantic_core.ValidationError: 1 validation error for Evaluation
  Invalid JSON: EOF while parsing a string at line 1 column 5641 [type=json_invalid, input_value='{"outcome":"No","explana...ure bots! ! \\n++++++++', input_type=str]
    For further information visit https://errors.pydantic.dev/2.8/v/json_invalid
Processed: 49 rows [02:38,  3.24s/ rows]
```

Either way, the failure rate of OpenAI's Structured Outputs still seems to be somewhat dependent on the prompt.
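
Because these failures are rare but not zero, a caller may want to retry when the output does not parse. The following is a minimal sketch of that idea using only the standard library. It is not part of the notebook's evaluation code: `parse_with_retries` and the `call_model` callable are hypothetical names, and `call_model` is assumed to wrap the API call and return the raw response text.

```python
import json
from typing import Callable


def parse_with_retries(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    """Call a (hypothetical) model wrapper until its output parses as a JSON object."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model()  # assumed to return the raw response text
        try:
            parsed = json.loads(raw)  # truncated output raises json.JSONDecodeError
        except json.JSONDecodeError as e:
            last_error = e
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = TypeError(f"expected a JSON object, got {type(parsed).__name__}")
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error


# Usage: the first reply is cut off mid-string, the second one is complete.
replies = iter(['{"outcome": "No", "explana', '{"outcome": "No"}'])
print(parse_with_retries(lambda: next(replies)))
```

Retrying only helps with sporadic truncation. If the length limit is hit consistently, raising the token limit or shortening the input is the more likely fix.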

* Second, the OpenAI API reads Pydantic field descriptions and *tries* to stick to them. In the example below, the field "suggestions" is validated for exactly six entries, the field "outcome" is validated against two literals, and each suggestion must start with "K". As we can see from the run, transgressions do happen, but they are infrequent. Moreover, simpler instructions appear to be more reliable:


In [1]:
import ast
import os
from datachain import File, DataChain, Column
from openai import OpenAI
from pydantic import BaseModel, Field, ValidationError, field_validator

openai_api_key = os.getenv("OPENAI_API_KEY")
prompt = "You are assigned to evaluate the success of a dialog between user and a chatbot. Reply with JSON."

class Suggestion(BaseModel):
    suggestion: str = Field(description="Suggestion to improve the bot, starting with letter K")

class Evaluation(BaseModel):
    outcome: str = Field(description="whether a dialog was successful, either Yes or No")
    explanation: str = Field(description="rationale behind the decision on outcome")
    suggestions: list[Suggestion] = Field(description="Six ways to improve a bot")
    
    @field_validator("outcome")
    def check_literal(cls, value):
        if not (value in ["Yes", "No"]):
            print(f"Literal Yes/No not followed: {value}") 
        return value
    
    @field_validator("suggestions")
    def count_suggestions(cls, value):
        if len(value) != 6:
            print(f"Array length of 6 not followed: {value}")
        count = sum(1 for item in value if item.suggestion.startswith('K'))
        if len(value) != count:
            print(f"{len(value)-count} suggestions don't start with K")
        return value

def eval_dialogue(client, file: File) -> Evaluation:
     completion = client.beta.chat.completions.parse(
         model="gpt-4o-2024-08-06",
         messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": file.read()},
         ],
         response_format=Evaluation,
     )
     message = completion.choices[0].message
     try:
         json_dict = ast.literal_eval(str(message.parsed.dict()))
     except Exception as e:
         # Catch cases where GPT fails to produce valid JSON
         print(f"Parsed message format error: {e}")
         print(message.parsed.dict())
         return Evaluation(outcome="", explanation="", suggestions=[])
     try:
         Evaluation(**message.parsed.dict())
     except ValidationError as e:
         print(e)
     return message.parsed

# 1,000 entries to evaluate
for i in range(1,21):
    chain = (
       DataChain.from_storage("gs://datachain-demo/chatbot-KiT/", object_name="file", type="text")
       .setup(client=lambda : OpenAI(api_key=openai_api_key))
       .settings(cache=True)
       .map(evaluation=eval_dialogue)
       .exec()
    )
    print(f"{i*50} calls")

Preparing: 50 rows [00:00, 21153.44 rows/s]
Processed: 50 rows [00:00, 9248.74 rows/s]
Preparing: 50 rows [00:00, 19004.55 rows/s]
Processed: 50 rows [01:59,  2.38s/ rows]
Cleanup: 4 tables [00:00, 4214.32 tables/s]


50 calls


Preparing: 50 rows [00:00, 37395.72 rows/s]
Processed: 50 rows [00:00, 12109.67 rows/s]
Preparing: 50 rows [00:00, 20097.29 rows/s]
Processed: 9 rows [00:18,  2.28s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:52,  2.25s/ rows]
Cleanup: 4 tables [00:00, 2262.60 tables/s]


100 calls


Preparing: 50 rows [00:00, 33372.88 rows/s]
Processed: 50 rows [00:00, 11641.14 rows/s]
Preparing: 50 rows [00:00, 24283.84 rows/s]
Processed: 7 rows [00:13,  2.20s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:47,  2.15s/ rows]
Cleanup: 4 tables [00:00, 2346.14 tables/s]


150 calls


Preparing: 50 rows [00:00, 38699.98 rows/s]
Processed: 50 rows [00:00, 13202.92 rows/s]
Preparing: 50 rows [00:00, 24836.00 rows/s]
Processed: 3 rows [00:04,  1.75s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 4 rows [00:07,  2.14s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 31 rows [01:04,  2.01s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:42,  2.06s/ rows]
Cleanup: 4 tables [00:00, 2281.68 tables/s]


200 calls


Preparing: 50 rows [00:00, 33251.18 rows/s]
Processed: 50 rows [00:00, 12165.87 rows/s]
Preparing: 50 rows [00:00, 21711.90 rows/s]
Processed: 48 rows [01:42,  1.93s/ rows]

Array length of 6 not followed: [Suggestion(suggestion='Keep improving the natural language understanding')]
Array length of 6 not followed: [Suggestion(suggestion='Keep improving the natural language understanding')]


Processed: 50 rows [01:46,  2.13s/ rows]
Cleanup: 4 tables [00:00, 2459.64 tables/s]


250 calls


Preparing: 50 rows [00:00, 26122.97 rows/s]
Processed: 50 rows [00:00, 9331.46 rows/s]
Preparing: 50 rows [00:00, 15660.91 rows/s]
Processed: 50 rows [01:59,  2.38s/ rows]
Cleanup: 4 tables [00:00, 3784.62 tables/s]


300 calls


Preparing: 50 rows [00:00, 26096.96 rows/s]
Processed: 50 rows [00:00, 9551.18 rows/s]
Preparing: 50 rows [00:00, 19252.29 rows/s]
Processed: 50 rows [02:09,  2.58s/ rows]
Cleanup: 4 tables [00:00, 2372.68 tables/s]


350 calls


Preparing: 50 rows [00:00, 38793.04 rows/s]
Processed: 50 rows [00:00, 10419.08 rows/s]
Preparing: 50 rows [00:00, 16562.57 rows/s]
Processed: 50 rows [02:41,  3.23s/ rows]
Cleanup: 4 tables [00:00, 3959.69 tables/s]


400 calls


Preparing: 50 rows [00:00, 33151.31 rows/s]
Processed: 50 rows [00:00, 10933.49 rows/s]
Preparing: 50 rows [00:00, 19799.40 rows/s]
Processed: 50 rows [01:56,  2.32s/ rows]
Cleanup: 4 tables [00:00, 2791.55 tables/s]


450 calls


Preparing: 50 rows [00:00, 31230.86 rows/s]
Processed: 50 rows [00:00, 11935.30 rows/s]
Preparing: 50 rows [00:00, 28324.58 rows/s]
Processed: 4 rows [00:06,  1.75s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 39 rows [01:33,  2.60s/ rows]

Array length of 6 not followed: [Suggestion(suggestion='Knowledge-base improvement to handle inappropriate language and flag concerns')]
Array length of 6 not followed: [Suggestion(suggestion='Knowledge-base improvement to handle inappropriate language and flag concerns')]


Processed: 50 rows [02:01,  2.43s/ rows]
Cleanup: 4 tables [00:00, 4140.48 tables/s]


500 calls


Preparing: 50 rows [00:00, 13901.31 rows/s]
Processed: 50 rows [00:00, 7024.46 rows/s]
Preparing: 50 rows [00:00, 14867.09 rows/s]
Processed: 50 rows [01:48,  2.17s/ rows]
Cleanup: 4 tables [00:00, 2375.37 tables/s]


550 calls


Preparing: 50 rows [00:00, 39272.51 rows/s]
Processed: 50 rows [00:00, 11539.30 rows/s]
Preparing: 50 rows [00:00, 20938.02 rows/s]
Processed: 38 rows [01:22,  2.24s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:47,  2.14s/ rows]
Cleanup: 4 tables [00:00, 3018.57 tables/s]


600 calls


Preparing: 50 rows [00:00, 21375.52 rows/s]
Processed: 50 rows [00:00, 9536.84 rows/s]
Preparing: 50 rows [00:00, 21204.77 rows/s]
Processed: 49 rows [01:42,  2.26s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:44,  2.10s/ rows]
Cleanup: 4 tables [00:00, 1755.12 tables/s]


650 calls


Preparing: 50 rows [00:00, 19127.62 rows/s]
Processed: 50 rows [00:00, 6852.99 rows/s]
Preparing: 50 rows [00:00, 17973.53 rows/s]
Processed: 50 rows [02:00,  2.40s/ rows]
Cleanup: 4 tables [00:00, 2730.22 tables/s]


700 calls


Preparing: 50 rows [00:00, 25906.76 rows/s]
Processed: 50 rows [00:00, 12297.13 rows/s]
Preparing: 50 rows [00:00, 19239.93 rows/s]
Processed: 50 rows [02:01,  2.43s/ rows]
Cleanup: 4 tables [00:00, 2094.27 tables/s]


750 calls


Preparing: 50 rows [00:00, 40556.02 rows/s]
Processed: 50 rows [00:00, 8819.72 rows/s]
Preparing: 50 rows [00:00, 27053.04 rows/s]
Processed: 50 rows [02:29,  2.99s/ rows]
Cleanup: 4 tables [00:00, 2350.41 tables/s]


800 calls


Preparing: 50 rows [00:00, 25460.14 rows/s]
Processed: 50 rows [00:00, 8252.28 rows/s]
Preparing: 50 rows [00:00, 20550.24 rows/s]
Processed: 50 rows [02:10,  2.61s/ rows]
Cleanup: 4 tables [00:00, 2557.89 tables/s]


850 calls


Preparing: 50 rows [00:00, 25148.72 rows/s]
Processed: 50 rows [00:00, 8676.31 rows/s]
Preparing: 50 rows [00:00, 15464.58 rows/s]
Processed: 50 rows [01:53,  2.28s/ rows]
Cleanup: 4 tables [00:00, 3062.65 tables/s]


900 calls


Preparing: 50 rows [00:00, 25472.51 rows/s]
Processed: 50 rows [00:00, 13896.71 rows/s]
Preparing: 50 rows [00:00, 20883.81 rows/s]
Processed: 48 rows [01:48,  1.84s/ rows]

1 suggestions don't start with K
1 suggestions don't start with K


Processed: 50 rows [01:53,  2.27s/ rows]
Cleanup: 4 tables [00:00, 2305.51 tables/s]


950 calls


Preparing: 50 rows [00:00, 23529.14 rows/s]
Processed: 50 rows [00:00, 9865.70 rows/s]
Preparing: 50 rows [00:00, 21221.94 rows/s]
Processed: 50 rows [02:54,  3.48s/ rows]
Cleanup: 4 tables [00:00, 2653.78 tables/s]

1000 calls






Unfortunately, OpenAI's documentation states that the built-in Pydantic field constraints (like "minItems") are not supported, which leaves it to the user to verify and enforce the content of the objects produced by the LLM.
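
As an illustration of what such user-side verification might look like, here is a minimal sketch that checks the three constraints from the example above on a plain dictionary. It assumes the dictionary has the shape of the `Evaluation` model dumped to a dict. The helper name `check_evaluation` is hypothetical, and the sketch deliberately uses plain Python rather than Pydantic validators.

```python
def check_evaluation(payload: dict) -> list[str]:
    """Return a list of constraint violations for one model output (empty if none)."""
    problems = []

    # "outcome" must be one of the two literals named in the field description.
    if payload.get("outcome") not in ("Yes", "No"):
        problems.append(f"outcome must be Yes or No, got {payload.get('outcome')!r}")

    # "suggestions" must hold exactly six entries.
    suggestions = payload.get("suggestions", [])
    if len(suggestions) != 6:
        problems.append(f"expected 6 suggestions, got {len(suggestions)}")

    # Every suggestion must start with the letter K.
    off_letter = [s for s in suggestions if not s.get("suggestion", "").startswith("K")]
    if off_letter:
        problems.append(f"{len(off_letter)} suggestions don't start with K")

    return problems


# Usage: an output that breaks all three constraints.
bad = {"outcome": "Maybe", "explanation": "", "suggestions": [{"suggestion": "Add logging"}]}
for problem in check_evaluation(bad):
    print(problem)
```

Returning the violations instead of printing them inside a validator lets the caller decide whether to retry, discard, or just log the offending output.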

Nonetheless, the direct support for Zod and Pydantic data models is very useful and puts OpenAI ahead of the competition.
