# Chat Extraction

This benchmark combines classification, summarization, and extraction in one task. The model is
expected to output JSON that conforms to the expected schema.

In [1]:
# %pip install -U --quiet langchain langchain_benchmarks
# %pip install -U openai rapidfuzz fireworks-ai anthropic

For this code to work, please configure LangSmith environment variables with your credentials,
in addition to your LLM providers' API keys.

In [2]:
import getpass
import os
import uuid

uid = uuid.uuid4().hex[:4]  # Avoid conflicts in project names

# Get your API key from https://smith.langchain.com/settings
api_keys = [
    "LANGCHAIN_API_KEY",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "FIREWORKS_API_KEY",
]
for key in api_keys:
    if key not in os.environ:
        os.environ[key] = getpass.getpass(f"Enter your {key}: ")

In [3]:
from langchain_benchmarks import clone_public_dataset, registry

task = registry["Chat Extraction"]

# Clone the dataset to your tenant
clone_public_dataset(task.dataset_id, dataset_name=task.name)


task

Dataset Chat Extraction already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6.


| Field | Value |
|---|---|
| Name | Chat Extraction |
| Type | ExtractionTask |
| Dataset ID | 54d6d8e4-b420-4b9e-862d-548b1b65a6fe |
| Description | A dataset meant to test the ability of an LLM to extract and infer structured information from a dialogue. The dialogue is between a user and a support engineer. Outputs should be structured as a JSON object and test both the ability of the LLM to correctly structure the information and its ability to perform simple classification tasks. |


#### Schema

Each extraction task has an expected output schema defined in a Pydantic BaseModel object, which we can use to
get a JSON schema object.
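
To give a feel for what checking an output against such a schema involves, here is a minimal sketch of a validator that covers only a small subset of JSON Schema (`type`, `enum`, `required`, `properties`). It does not resolve `$ref` or `definitions`, and the `validate` helper and the toy `example_schema` are hypothetical illustrations, not the evaluator the benchmark uses.

```python
from typing import Any, Dict, List

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(value: Any, schema: Dict[str, Any], path: str = "$") -> List[str]:
    """Return violations for a small JSON Schema subset; an empty list means valid."""
    errors: List[str] = []
    expected = schema.get("type")
    if expected in JSON_TYPES:
        ok = isinstance(value, JSON_TYPES[expected])
        # bool is a subclass of int in Python, but it is not a JSON integer/number
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            errors.append(f"{path}: expected {expected}")
            return errors
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in enum")
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: missing")
        for name, sub_schema in schema.get("properties", {}).items():
            if name in value:
                errors.extend(validate(value[name], sub_schema, f"{path}.{name}"))
    return errors


# Toy schema for illustration only; the real task schema is larger.
example_schema = {
    "type": "object",
    "required": ["programming_language", "toxicity"],
    "properties": {
        "programming_language": {
            "type": "string",
            "enum": ["python", "javascript", "typescript", "unknown", "other"],
        },
        "toxicity": {"type": "integer"},
    },
}

good = validate({"programming_language": "python", "toxicity": 0}, example_schema)
bad = validate({"programming_language": "rust", "toxicity": "0"}, example_schema)
print(good)
print(bad)
```

Note that a string such as `"0"` fails an `integer` check, which matters for any chain whose parser returns untyped text.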

In [4]:
import pprint

pprint.pprint(task.schema.schema())

{'definitions': {'ProgrammingLanguage': {'description': 'An enumeration.',
                                         'enum': ['python',
                                                  'javascript',
                                                  'typescript',
                                                  'unknown',
                                                  'other'],
                                         'title': 'ProgrammingLanguage',
                                         'type': 'string'},
                 'QuestionCategorization': {'properties': {'category_if_other': {'description': 'question '
                                                                                                'category '
                                                                                                'if '
                                                                                                'the '
                                                              

## Define an extraction chain

Let's build the extraction chain that we can use to get structured information from the dialogues.

In [5]:
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0).bind_functions(
    functions=[task.schema],
    function_call=task.schema.schema()["title"],
)


def format_run(dialogue_input: dict):
    question = dialogue_input["question"]
    answer = dialogue_input["answer"]
    return {
        "dialogue": f"<question>\n{question}\n</question>\n"
        f"<assistant-response>\n{answer}\n</assistant-response>"
    }


output_parser = JsonOutputFunctionsParser()
extraction_chain = (
    format_run
    | task.instructions
    | llm
    | output_parser
    # Wrap the result under 'output' so every chain returns the same shape to the evaluators
    | (lambda x: {"output": x})
)

In [6]:
extraction_chain.invoke(
    {"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)

{'output': {'issue_summary': 'Running Llama 2 locally',
  'question': {'question_category': 'Implementation Issues',
   'category_if_other': '',
   'is_off_topic': False,
   'toxicity': 0,
   'sentiment': 'Neutral',
   'programming_language': 'unknown'},
  'response': {'response_type': 'provide guidance',
   'response_type_if_other': '',
   'confidence_level': 0.9,
   'followup_actions': []}}}

Now it's time to measure our chain's effectiveness!

## Evaluate

We run the chain over every example in the dataset and score its outputs with the task's evaluation config.

In [7]:
from langsmith.client import Client

from langchain_benchmarks.extraction.tasks.chat_extraction import get_eval_config

In [8]:
client = Client()

eval_config = get_eval_config()

test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_metadata={
        "arch": "openai-functions",
    },
)

View the evaluation results for project 'elderly-paper-12' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=bf3a827d-b2fe-4d58-a1a6-17169e4c2f74

View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
 Eval quantiles:


| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 0.0 | 27.0 |
| unique | | | | | | | | | 0.0 | |
| top | | | | | | | | | | |
| freq | | | | | | | | | | |
| mean | 0.413472 | 1.0 | 0.991111 | 0.740741 | 0.236296 | 0.296296 | 0.888889 | 0.62963 | | 5.870635 |
| std | 0.140195 | 0.0 | 0.010127 | 0.254588 | 0.151 | 0.465322 | 0.320256 | 0.492103 | | 0.502469 |
| min | 0.219101 | 1.0 | 0.98 | 0.5 | 0.18 | 0.0 | 0.0 | 0.0 | | 4.975395 |
| 25% | 0.301781 | 1.0 | 0.98 | 0.5 | 0.18 | 0.0 | 1.0 | 0.0 | | 5.57553 |
| 50% | 0.372493 | 1.0 | 1.0 | 0.5 | 0.18 | 0.0 | 1.0 | 1.0 | | 5.831284 |
| 75% | 0.509838 | 1.0 | 1.0 | 1.0 | 0.18 | 1.0 | 1.0 | 1.0 | | 6.048197 |
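
To make the `json_edit_distance` column easier to interpret, the sketch below computes a normalized string edit distance between two JSON objects after canonical serialization. It assumes a plain Levenshtein distance divided by the longer string's length; the benchmark's evaluator may use a different variant (for example one backed by `rapidfuzz`), so treat `normalized_json_distance` as a hypothetical stand-in rather than the actual metric.

```python
import json


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_json_distance(prediction: dict, reference: dict) -> float:
    """Distance in [0, 1] between two JSON objects after canonical serialization."""
    pred = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    ref = json.dumps(reference, sort_keys=True, separators=(",", ":"))
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0


# Key order does not matter because both sides are serialized with sorted keys.
same = normalized_json_distance({"a": 1, "b": 2}, {"b": 2, "a": 1})
close = normalized_json_distance({"a": 1}, {"a": 2})
print(same, close)
```

Under this definition 0 means the two serialized objects are identical, and larger values mean more character-level edits are needed.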


## Compare to Claude-2

Let's compare to Anthropic's Claude-2. Since it has no native function-calling API here, we mimic that interface with a prompt and an XML output parser.

In [45]:
from typing import Any, Dict, List, Type

from langchain.chat_models import ChatAnthropic
from langchain.output_parsers.xml import XMLOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel

claude_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond "
            "only with XML based on the following JSON schema:\n{schema}",
        ),
        (
            "user",
            "Generate a ticket from the following question-response pair:\n"
            "<Dialogue>\n{dialogue}\n</Dialogue>\n"
            "Remember, respond directly with this format:\n"
            "<{function_call}>\n...\n</{function_call}>"
            "RESPOND ONLY IN XML THEN STOP.",
        ),
    ]
)
prompt = claude_prompt.partial(
    schema=task.schema.schema_json(), function_call=task.schema.schema()["title"]
)

claude = ChatAnthropic(model="claude-2", temperature=0, max_tokens_to_sample=2048)


class MergeSchema:
    """Merge the XML Output Parser schema into the output."""

    def __init__(self, schema: Type[BaseModel]):
        self.schema = schema

    @property
    def _func_name(self) -> str:
        return self.schema.__name__

    def _merge_schema(self, parsed_output: Any, schema: Type[BaseModel]):
        merged_output = {}
        if isinstance(parsed_output, dict):
            items = parsed_output.items()
        elif isinstance(parsed_output, list):
            items = [(k, v) for item in parsed_output for k, v in item.items()]
        else:
            return parsed_output

        for key, value in items:
            if key in schema.__fields__:
                field_info = schema.__fields__[key]
                if isinstance(value, list):
                    if issubclass(field_info.type_, (BaseModel, dict)):
                        result = self._merge_schema(value, field_info.type_)
                    elif all(
                        isinstance(item, dict) and item.keys() == {"item"}
                        for item in value
                    ):
                        result = [next(iter(item.values())) for item in value]
                    else:
                        result = value
                else:
                    result = value
            else:
                result = value
            if key in merged_output:
                if isinstance(merged_output[key], list):
                    merged_output[key].append(result)
                else:
                    merged_output[key] = [merged_output[key], result]
            else:
                merged_output[key] = result

        return merged_output

    def __call__(self, parsed_output: dict) -> Dict[str, Any]:
        if self._func_name not in parsed_output:
            return parsed_output
        return {
            self._func_name: self._merge_schema(
                parsed_output[self._func_name], self.schema
            )
        }


def try_parse(llm_output, config):
    try:
        output_chain = XMLOutputParser() | MergeSchema(task.schema)
        parsed = output_chain.invoke(llm_output, config)
        # Wrap the result under 'output' so every chain returns the same shape to the evaluators
        return {"output": parsed.get("GenerateTicket")}
    except Exception as e:
        return {"output": llm_output, "error": str(e)}


claude_extraction_chain = format_run | prompt | claude | try_parse
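
The `MergeSchema` step exists because an XML parser naturally yields each element's children as a list of single-key dicts rather than one flat dict. The sketch below reproduces that shape with the standard library's `xml.etree.ElementTree` and then collapses it. It is a simplified approximation under that assumption, not the `XMLOutputParser` implementation, and `element_to_dict` and `flatten` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def element_to_dict(element: ET.Element) -> Dict[str, Any]:
    """Convert an element to {tag: text} or {tag: [child dicts...]}."""
    children = list(element)
    if not children:
        return {element.tag: (element.text or "").strip()}
    return {element.tag: [element_to_dict(child) for child in children]}


def flatten(node: Any) -> Any:
    """Collapse a list of single-key dicts into one dict; repeated keys become lists."""
    if not isinstance(node, list):
        return node
    merged: Dict[str, Any] = {}
    for item in node:
        for key, value in item.items():
            value = flatten(value)
            if key in merged:
                if isinstance(merged[key], list):
                    merged[key].append(value)
                else:
                    merged[key] = [merged[key], value]
            else:
                merged[key] = value
    return merged


xml_text = (
    "<GenerateTicket>"
    "<issue_summary>Run Llama 2</issue_summary>"
    "<response>"
    "<followup_actions>Install</followup_actions>"
    "<followup_actions>Add to path</followup_actions>"
    "</response>"
    "</GenerateTicket>"
)

raw = element_to_dict(ET.fromstring(xml_text))
ticket = flatten(raw["GenerateTicket"])
print(raw)
print(ticket)
```

One limitation of this simplified version: a field that appears only once stays a scalar even if the schema declares it as a list, which is why a schema-aware merge is useful.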

In [46]:
result = claude_extraction_chain.invoke(
    {"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
result

{'output': {'issue_summary': 'How to run Llama locally',
  'question': {'question_category': 'Implementation Issues',
   'is_off_topic': 'false',
   'toxicity': '0',
   'sentiment': 'Neutral',
   'programming_language': 'unknown'},
  'response': {'response_type': 'provide guidance',
   'confidence_level': '3',
   'followup_actions': ['Ask clarifying questions about the specific issue',
    'Provide documentation or examples for running Llama locally']}}}
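
Note that `is_off_topic`, `toxicity`, and `confidence_level` come back as strings (`'false'`, `'0'`, `'3'`) because XML text is untyped. This may be one reason typed checks score poorly for this chain, though that is an assumption rather than something the run output confirms. A hypothetical post-processing step, not part of this notebook, could cast known fields before evaluation:

```python
from typing import Any, Callable, Dict


def to_bool(text: str) -> bool:
    """Interpret XML text such as 'true' or 'false' (any casing) as a boolean."""
    lowered = text.strip().lower()
    if lowered in ("true", "1"):
        return True
    if lowered in ("false", "0"):
        return False
    raise ValueError(f"not a boolean: {text!r}")


# Hypothetical mapping from field name to caster, based on the fields shown above.
CASTERS: Dict[str, Callable[[str], Any]] = {
    "is_off_topic": to_bool,
    "toxicity": int,
    "confidence_level": int,
}


def coerce_types(node: Any) -> Any:
    """Recursively cast known string fields; leave everything else unchanged."""
    if isinstance(node, dict):
        coerced = {}
        for key, value in node.items():
            if key in CASTERS and isinstance(value, str):
                try:
                    coerced[key] = CASTERS[key](value)
                except ValueError:
                    coerced[key] = value  # keep the raw text if it cannot be cast
            else:
                coerced[key] = coerce_types(value)
        return coerced
    if isinstance(node, list):
        return [coerce_types(item) for item in node]
    return node


typed = coerce_types(
    {
        "question": {"is_off_topic": "false", "toxicity": "0"},
        "response": {"confidence_level": "3", "followup_actions": ["Ask"]},
    }
)
print(typed)
```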

In [11]:
claude_test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=claude_extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_name=f"claude-2-json-schema-to-xml-{uid}",
    project_metadata={
        "arch": "claude-json-schema-xml-output",
    },
)

View the evaluation results for project 'claude-2-json-schema-to-xml-9f6b' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=1c897f8e-5ed7-4d67-a399-0253bf232538

View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
 Eval quantiles:


| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 0.0 | 27.0 |
| unique | | | | | | | | | 0.0 | |
| top | | | | | | | | | | |
| freq | | | | | | | | | | |
| mean | 0.383192 | 0.62963 | 1.0 | 0.944444 | 0.962963 | 0.481481 | 0.0 | 0.407407 | | 12.467178 |
| std | 0.117886 | 0.492103 | 0.0 | 0.160128 | 0.079169 | 0.509175 | 0.0 | 0.500712 | | 1.365667 |
| min | 0.055914 | 0.0 | 1.0 | 0.5 | 0.8 | 0.0 | 0.0 | 0.0 | | 9.971956 |
| 25% | 0.33273 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | 11.559693 |
| 50% | 0.381773 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | | 12.046261 |
| 75% | 0.46914 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | | 13.183503 |


The mean edit distance (0.38) is close to that of the function-calling chain (0.41), but only about 63% of the outputs pass schema validation, compared with 100% before.

We're defining the schema in JSON then requesting XML. Let's try keeping it unified.

## Try with XSD Schema Definition

In this variant, let's see if Claude performs better if we keep our structure consistent.

In [49]:
from typing import Any, Dict, List, Type

from langchain.chat_models import ChatAnthropic
from langchain.output_parsers.xml import XMLOutputParser
from langchain.prompts import ChatPromptTemplate
from langchain.pydantic_v1 import BaseModel

# This is the schema the model will populate
xsd = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:simpleType name="QuestionCategory">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Implementation Issues"/>
            <xs:enumeration value="Feature Requests"/>
            <xs:enumeration value="Concept Explanations"/>
            <xs:enumeration value="Code Optimization"/>
            <xs:enumeration value="Security and Privacy Concerns"/>
            <xs:enumeration value="Model Training and Fine-tuning"/>
            <xs:enumeration value="Data Handling and Manipulation"/>
            <xs:enumeration value="User Interaction Flow"/>
            <xs:enumeration value="Technical Integration"/>
            <xs:enumeration value="Error Handling and Logging"/>
            <xs:enumeration value="Customization and Configuration"/>
            <xs:enumeration value="External API and Data Source Integration"/>
            <xs:enumeration value="Language and Localization"/>
            <xs:enumeration value="Streaming and Real-time Processing"/>
            <xs:enumeration value="Tool Development"/>
            <xs:enumeration value="Function Calling"/>
            <xs:enumeration value="LLM Integrations"/>
            <xs:enumeration value="General Agent Questions"/>
            <xs:enumeration value="General Chit Chat"/>
            <xs:enumeration value="Memory"/>
            <xs:enumeration value="Debugging Help"/>
            <xs:enumeration value="Application Design"/>
            <xs:enumeration value="Prompt Templates"/>
            <xs:enumeration value="Cost Tracking"/>
            <xs:enumeration value="Other"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="Sentiment">
        <xs:restriction base="xs:string">
            <xs:enumeration value="Negative"/>
            <xs:enumeration value="Neutral"/>
            <xs:enumeration value="Positive"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:simpleType name="ProgrammingLanguage">
        <xs:restriction base="xs:string">
            <xs:enumeration value="python"/>
            <xs:enumeration value="javascript"/>
            <xs:enumeration value="typescript"/>
            <xs:enumeration value="unknown"/>
            <xs:enumeration value="other"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:complexType name="QuestionCategorization">
        <xs:sequence>
            <xs:element name="question_category" type="QuestionCategory"/>
            <xs:element name="category_if_other" type="xs:string" minOccurs="0"/>
            <xs:element name="is_off_topic" type="xs:boolean"/>
            <xs:element name="toxicity" type="xs:int">
                <xs:minInclusive value="0"/>
                <xs:maxInclusive value="5"/>
            </xs:element>
            <xs:element name="sentiment" type="Sentiment"/>
            <xs:element name="programming_language" type="ProgrammingLanguage"/>
        </xs:sequence>
    </xs:complexType>

    <xs:simpleType name="ResponseType">
        <xs:restriction base="xs:string">
            <xs:enumeration value="resolve issue"/>
            <xs:enumeration value="provide guidance"/>
            <xs:enumeration value="request information"/>
            <xs:enumeration value="give up"/>
            <xs:enumeration value="none"/>
            <xs:enumeration value="other"/>
        </xs:restriction>
    </xs:simpleType>

    <xs:complexType name="ResponseCategorization">
        <xs:sequence>
            <xs:element name="response_type" type="ResponseType"/>
            <xs:element name="response_type_if_other" type="xs:string" minOccurs="0"/>
            <xs:element name="confidence_level" type="xs:int">
                <xs:minInclusive value="0"/>
                <xs:maxInclusive value="5"/>
            </xs:element>
            <xs:element name="followup_actions" type="xs:string" minOccurs="0" maxOccurs="unbounded"/>
        </xs:sequence>
    </xs:complexType>

    <xs:complexType name="GenerateTicket">
        <xs:sequence>
            <xs:element name="issue_summary" type="xs:string"/>
            <xs:element name="question" type="QuestionCategorization"/>
            <xs:element name="response" type="ResponseCategorization"/>
        </xs:sequence>
    </xs:complexType>

</xs:schema>"""

prompt = claude_prompt.partial(schema=xsd, function_call=task.schema.schema()["title"])

claude_extraction_chain = format_run | prompt | claude | try_parse

In [50]:
result = claude_extraction_chain.invoke(
    {
        "question": "how do i run llama 2 locally?",
        "answer": "Llama.cpp of course. Afterwords remember to install it, then add it to your path!",
    }
)
result

{'output': {'issue_summary': 'How to run Llama locally',
  'question': {'question_category': 'LLM Integrations',
   'is_off_topic': 'false',
   'toxicity': '0',
   'sentiment': 'Neutral',
   'programming_language': 'unknown'},
  'response': {'response_type': 'provide guidance',
   'confidence_level': '3',
   'followup_actions': ['Install Llama locally', 'Add Llama to path']}}}

In [52]:
claude_xsd_test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=claude_extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_name=f"claude-2-xsd-to-xml-{uid}",
    project_metadata={
        "arch": "claude-xml",
    },
)

View the evaluation results for project 'claude-2-xsd-to-xml-9f6b-3' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=8322f5a4-2eae-4171-8b20-1af3b4cd1245

View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
 Eval quantiles:


| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 0.0 | 27.0 |
| unique | | | | | | | | | 0.0 | |
| top | | | | | | | | | | |
| freq | | | | | | | | | | |
| mean | 0.399774 | 0.592593 | 1.0 | 0.87037 | 0.962963 | 0.37037 | 0.0 | 0.555556 | | 10.895454 |
| std | 0.101523 | 0.500712 | 0.0 | 0.223288 | 0.079169 | 0.492103 | 0.0 | 0.50637 | | 1.715562 |
| min | 0.114068 | 0.0 | 1.0 | 0.5 | 0.8 | 0.0 | 0.0 | 0.0 | | 7.780824 |
| 25% | 0.349609 | 0.0 | 1.0 | 0.75 | 1.0 | 0.0 | 0.0 | 0.0 | | 9.730746 |
| 50% | 0.402204 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | | 10.449593 |
| 75% | 0.430347 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | | 11.746244 |


The JSON schema metric went down slightly (from about 0.63 to 0.59), meaning that, counter-intuitively, the output is less friendly to our parser than before.


Let's try with an open source model: `llama-v2-34b-code-instruct`.

## Try with Llama 2

`llama-v2-34b-code-instruct` is an open source model intended to handle both code generation and general tasks.
Here we prompt it for JSON directly and benchmark it on the same dataset.

In [15]:
import json

from langchain.chat_models import ChatFireworks
from langchain.output_parsers.json import parse_json_markdown
from langchain.schema.output_parser import StrOutputParser

# Prompt for the Llama model: same task as before, but asking for JSON instead of XML
llama_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a data extraction bot tasked with extracting and inferring information from dialogues and generating tickets. Always respond "
            "only with json based on the following JSON schema:\n{schema}",
        ),
        (
            "user",
            "Generate a ticket from the following question-response pair:\n"
            "<Dialogue>\n{dialogue}\n</Dialogue>\n"
            "Remember, respond directly with this format:\n"
            '{{"{function_call}": ...}}\n'
            "RESPOND ONLY IN JSON THEN STOP.",
        ),
    ]
)

prompt = llama_prompt.partial(
    schema=task.schema.schema_json(), function_call=task.schema.schema()["title"]
)

llm = ChatFireworks(
    model="accounts/fireworks/models/llama-v2-34b-code-instruct",
    temperature=0,
    model_kwargs={"max_tokens": 4000},
)


def parse_output(ai_message):
    content = ai_message.content
    parser = lambda x: json.loads(x, strict=False)
    try:
        parsed = parse_json_markdown(content, parser=parser)
        if "GenerateTicket" in parsed:
            return {"output": parsed["GenerateTicket"]}
        return {"output": parsed}
    except json.JSONDecodeError:
        return {"output": content}


fireworks_extraction_chain = format_run | prompt | llm | parse_output
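
Models often wrap their JSON in a Markdown code fence or add surrounding chatter, which is the situation `parse_json_markdown` is meant to handle. As a rough standard-library sketch of the same idea (not LangChain's implementation; `extract_json` is a hypothetical helper), one can pull out the fenced payload when present and parse it leniently:

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Any:
    """Parse JSON from a reply that may wrap it in a Markdown code fence."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    # strict=False tolerates raw control characters (e.g. newlines) inside strings
    return json.loads(payload.strip(), strict=False)


reply = (
    "Here is the ticket:\n"
    + FENCE
    + 'json\n{"GenerateTicket": {"issue_summary": "Run Llama 2 locally"}}\n'
    + FENCE
)
fenced = extract_json(reply)
plain = extract_json('{"toxicity": 0}')
multiline = extract_json('{"issue_summary": "line one\nline two"}')
print(fenced, plain, multiline)
```

Replies that contain no valid JSON still raise `json.JSONDecodeError`, so a caller needs a fallback like the `except` branch in `parse_output`.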

In [16]:
result = fireworks_extraction_chain.invoke(
    {"question": "how do i run llama 2 locally?", "answer": "Llama.cpp of course."}
)
result

{'output': {'issue_summary': 'How to run Llama 2 locally',
  'question': {'question_category': 'Implementation Issues',
   'is_off_topic': False,
   'toxicity': 0,
   'sentiment': 'Positive',
   'programming_language': 'unknown'},
  'response': {'response_type': 'Resolve Issue',
   'confidence_level': 5,
   'followup_actions': []}}}

In [17]:
llama_v2_test_run = client.run_on_dataset(
    dataset_name=task.name,
    llm_or_chain_factory=fireworks_extraction_chain,
    evaluation=eval_config,
    verbose=True,
    project_name=f"llama-v2-34b-code-instruct-{uid}",
    project_metadata={
        "arch": "llama-json-schema-json-output",
        "model": "llama-v2-34b-code-instruct",
    },
)

View the evaluation results for project 'llama-v2-34b-code-instruct-9f6b' at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6/compare?selectedSessions=3860ba49-a07a-46ca-a595-0d94ae59e49f

View all tests for Dataset Chat Extraction at:
https://smith.langchain.com/o/ebbaf2eb-769b-4505-aca2-d11de10372a4/datasets/08042749-504d-4509-9549-5f5c579115f6
[------------------------------------------------->] 27/27
 Eval quantiles:


| | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | feedback.off_topic_similarity | feedback.programming_language_similarity | error | execution_time |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 16.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 27.0 | 0.0 | 27.0 |
| unique | | | | | | | | | 0.0 | |
| top | | | | | | | | | | |
| freq | | | | | | | | | | |
| mean | 0.409389 | 0.222222 | 0.333333 | 0.37037 | 0.540741 | 0.074074 | 0.407407 | 0.259259 | | 7.054639 |
| std | 0.139532 | 0.423659 | 0.480384 | 0.382114 | 0.495564 | 0.26688 | 0.500712 | 0.446576 | | 4.961196 |
| min | 0.159737 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 4.277054 |
| 25% | 0.3192 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | | 4.758676 |
| 50% | 0.440756 | 0.0 | 0.0 | 0.5 | 0.8 | 0.0 | 0.0 | 0.0 | | 5.391378 |
| 75% | 0.510686 | 0.0 | 1.0 | 0.5 | 1.0 | 0.0 | 1.0 | 0.5 | | 6.008275 |


## Compare Results

Here, we look at the underlying results. Joining the per-run DataFrames lets you review relative performance both in aggregate and on a per-example basis.

In [18]:
import pandas as pd

df = (
    test_run.to_dataframe()
    .join(claude_test_run.to_dataframe(), rsuffix="_claude")
    .join(claude_xsd_test_run.to_dataframe(), rsuffix="_claude_xsd")
    .join(llama_v2_test_run.to_dataframe(), rsuffix="_llama_v2")
)

In [19]:
df.head(5)

| | inputs.answer | inputs.question | outputs.output | reference.output | feedback.json_edit_distance | feedback.json_schema | feedback.toxicity_similarity | feedback.sentiment_similarity | feedback.confidence_level_similarity | feedback.question_category | ... | feedback.json_edit_distance_llama_v2 | feedback.json_schema_llama_v2 | feedback.toxicity_similarity_llama_v2 | feedback.sentiment_similarity_llama_v2 | feedback.confidence_level_similarity_llama_v2 | feedback.question_category_llama_v2 | feedback.off_topic_similarity_llama_v2 | feedback.programming_language_similarity_llama_v2 | error_llama_v2 | execution_time_llama_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23a81130-2ad9-46cf-ad27-46589bcea94a | Pour joindre les deux outputs, vous pouvez uti... | je travail sur python. je souhaite joindre ces... | {'issue_summary': 'Combining two outputs in Py... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.361516 | 1 | 0.98 | 0.5 | 0.38 | 0 | ... | 0.628571 | 0 | 0.0 | 0.0 | 0.8 | 0 | 0 | 0 | | 5.391378 |
| 598316ec-f5e2-4b4d-83a8-36adb18e12fe | Hmm, I'm not sure. | example for dalle agent | {'issue_summary': 'Example for DALL-E agent', ... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.372024 | 1 | 1.0 | 1.0 | 0.9 | 0 | ... | 0.488673 | 0 | 0.0 | 0.5 | 1.0 | 0 | 1 | 1 | | 4.844093 |
| d1a1a2e8-6f4c-4325-8aaa-ea20e2449268 | To run Llama2 using pandas, you can follow the... | how do I run llama2 using pandas | {'issue_summary': 'Running Llama2 using pandas... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.587074 | 1 | 0.98 | 0.5 | 0.18 | 0 | ... | | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | | 6.037707 |
| 140a4819-0046-469d-b4df-8e747ddae112 | To clear the conversation in ConversationalRet... | if Im useing ConversationalRetrievalChain how ... | {'issue_summary': 'How to clear conversation i... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.303279 | 1 | 1.0 | 1.0 | 0.18 | 0 | ... | | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | | 6.272696 |
| 7b0a9dd9-68ce-41a1-9f9d-067d93175477 | To perform the task of creating an app that in... | I want to create an app which:\n- chats with u... | {'issue_summary': 'Creating an app to interact... | {'question': {'toxicity': 0, 'sentiment': 'Neu... | 0.75 | 1 | 0.98 | 0.5 | 0.18 | 0 | ... | 0.438889 | 1 | 1.0 | 0.5 | 1.0 | 0 | 1 | 0 | | 7.150804 |


#### Here, we compare the aggregate metrics side-by-side

In [20]:
df = (
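    test_run.get_aggregate_feedback()
    # This run used gpt-3.5-turbo-16k; ".gpt-4" is only a column label
    .add_suffix(".gpt-4")
    # rsuffix is applied only to clashing names, so the Claude-2 columns stay unsuffixed
    .join(claude_test_run.get_aggregate_feedback(), rsuffix=".claude")
    .join(claude_xsd_test_run.get_aggregate_feedback(), rsuffix=".claude_xsd")
    .join(llama_v2_test_run.get_aggregate_feedback(), rsuffix=".llama_v2")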
)
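
When reading the resulting column names, keep in mind that `DataFrame.join` applies `rsuffix` only to right-hand columns whose names clash with a left-hand column. The pure-Python sketch below mimics that naming rule on two metric names (`join_column_names` is a hypothetical helper, not a pandas function) to show which run ends up without a label:

```python
from typing import List


def join_column_names(left: List[str], right: List[str], rsuffix: str) -> List[str]:
    """Mimic how DataFrame.join names columns when lsuffix is left empty:
    only right-hand columns that collide with a left-hand name get the suffix."""
    taken = set(left)
    renamed = [col + rsuffix if col in taken else col for col in right]
    return left + renamed


metrics = ["feedback.json_schema", "execution_time"]

columns = [m + ".gpt-4" for m in metrics]  # add_suffix renames every column
columns = join_column_names(columns, metrics, ".claude")  # no clash, so no suffix
columns = join_column_names(columns, metrics, ".claude_xsd")
columns = join_column_names(columns, metrics, ".llama_v2")
print(columns)
```

The columns from the first Claude-2 run keep their bare names, so an unsuffixed column in the tables that follow refers to that run. Calling `add_suffix` on each frame before joining would be one way to label every run explicitly.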

In [21]:
from IPython.display import HTML, display

feedback_columns = sorted(
    {col.rsplit(".", 1)[0] for col in df.columns if col.startswith("feedback.")}
)


def render_metric(df, metric):
    sub_cols = [col for col in df.columns if col.startswith(metric)]
    display(HTML(f"<h3>{metric.split('.')[-1]}</h3>"))
    display(df[sub_cols][df.index.isin(["mean", "std"])])

In [22]:
feedback_columns

['feedback',
 'feedback.confidence_level_similarity',
 'feedback.json_edit_distance',
 'feedback.json_schema',
 'feedback.off_topic_similarity',
 'feedback.programming_language_similarity',
 'feedback.question_category',
 'feedback.sentiment_similarity',
 'feedback.toxicity_similarity']
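
The stray `'feedback'` entry appears because the Claude-2 columns carry no run suffix, so `rsplit(".", 1)` strips the metric name itself. Since every feedback column starts with `"feedback"`, rendering that entry produces one very wide table. A minimal sketch of a stricter grouping, assuming the run labels are known in advance (`KNOWN_SUFFIXES` and `base_metric` are hypothetical names):

```python
KNOWN_SUFFIXES = (".gpt-4", ".claude_xsd", ".llama_v2")  # run labels used in the joins


def base_metric(column: str) -> str:
    """Drop a trailing run label, if present, and return the metric name."""
    for suffix in KNOWN_SUFFIXES:
        if column.endswith(suffix):
            return column[: -len(suffix)]
    return column


columns = [
    "feedback.json_schema.gpt-4",
    "feedback.json_schema",  # Claude-2 run, which has no suffix
    "feedback.json_schema.claude_xsd",
    "feedback.json_schema.llama_v2",
]

naive = sorted({col.rsplit(".", 1)[0] for col in columns})
strict = sorted({base_metric(col) for col in columns})
print(naive)   # ['feedback', 'feedback.json_schema']
print(strict)  # ['feedback.json_schema']
```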

In [23]:
render_metric(df, "execution_time")

| | execution_time.gpt-4 | execution_time | execution_time.claude_xsd | execution_time.llama_v2 |
|---|---|---|---|---|
| mean | 5.870635 | 12.467178 | 12.217137 | 7.054639 |
| std | 0.502469 | 1.365667 | 2.050338 | 4.961196 |


In [24]:
for metric in feedback_columns:
    render_metric(df, metric)

| | feedback.json_edit_distance.gpt-4 | feedback.json_schema.gpt-4 | feedback.toxicity_similarity.gpt-4 | feedback.sentiment_similarity.gpt-4 | feedback.confidence_level_similarity.gpt-4 | feedback.question_category.gpt-4 | feedback.off_topic_similarity.gpt-4 | feedback.programming_language_similarity.gpt-4 | feedback.json_edit_distance | feedback.json_schema | ... | feedback.off_topic_similarity.claude_xsd | feedback.programming_language_similarity.claude_xsd | feedback.json_edit_distance.llama_v2 | feedback.json_schema.llama_v2 | feedback.toxicity_similarity.llama_v2 | feedback.sentiment_similarity.llama_v2 | feedback.confidence_level_similarity.llama_v2 | feedback.question_category.llama_v2 | feedback.off_topic_similarity.llama_v2 | feedback.programming_language_similarity.llama_v2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | 0.413472 | 1.0 | 0.991111 | 0.740741 | 0.236296 | 0.296296 | 0.888889 | 0.62963 | 0.383192 | 0.62963 | ... | 0.0 | 0.333333 | 0.409389 | 0.222222 | 0.333333 | 0.37037 | 0.540741 | 0.074074 | 0.407407 | 0.259259 |
| std | 0.140195 | 0.0 | 0.010127 | 0.254588 | 0.151 | 0.465322 | 0.320256 | 0.492103 | 0.117886 | 0.492103 | ... | 0.0 | 0.480384 | 0.139532 | 0.423659 | 0.480384 | 0.382114 | 0.495564 | 0.26688 | 0.500712 | 0.446576 |


| | feedback.confidence_level_similarity.gpt-4 | feedback.confidence_level_similarity | feedback.confidence_level_similarity.claude_xsd | feedback.confidence_level_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.236296 | 0.962963 | -14.896296 | 0.540741 |
| std | 0.151 | 0.079169 | 6.891549 | 0.495564 |


| | feedback.json_edit_distance.gpt-4 | feedback.json_edit_distance | feedback.json_edit_distance.claude_xsd | feedback.json_edit_distance.llama_v2 |
|---|---|---|---|---|
| mean | 0.413472 | 0.383192 | 0.462497 | 0.409389 |
| std | 0.140195 | 0.117886 | 0.081943 | 0.139532 |


| | feedback.json_schema.gpt-4 | feedback.json_schema | feedback.json_schema.claude_xsd | feedback.json_schema.llama_v2 |
|---|---|---|---|---|
| mean | 1.0 | 0.62963 | 0.037037 | 0.222222 |
| std | 0.0 | 0.492103 | 0.19245 | 0.423659 |


| | feedback.off_topic_similarity.gpt-4 | feedback.off_topic_similarity | feedback.off_topic_similarity.claude_xsd | feedback.off_topic_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.888889 | 0.0 | 0.0 | 0.407407 |
| std | 0.320256 | 0.0 | 0.0 | 0.500712 |


| | feedback.programming_language_similarity.gpt-4 | feedback.programming_language_similarity | feedback.programming_language_similarity.claude_xsd | feedback.programming_language_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.62963 | 0.407407 | 0.333333 | 0.259259 |
| std | 0.492103 | 0.500712 | 0.480384 | 0.446576 |


| | feedback.question_category.gpt-4 | feedback.question_category | feedback.question_category.claude_xsd | feedback.question_category.llama_v2 |
|---|---|---|---|---|
| mean | 0.296296 | 0.481481 | 0.074074 | 0.074074 |
| std | 0.465322 | 0.509175 | 0.26688 | 0.26688 |


| | feedback.sentiment_similarity.gpt-4 | feedback.sentiment_similarity | feedback.sentiment_similarity.claude_xsd | feedback.sentiment_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.740741 | 0.944444 | 0.981481 | 0.37037 |
| std | 0.254588 | 0.160128 | 0.096225 | 0.382114 |


| | feedback.toxicity_similarity.gpt-4 | feedback.toxicity_similarity | feedback.toxicity_similarity.claude_xsd | feedback.toxicity_similarity.llama_v2 |
|---|---|---|---|---|
| mean | 0.991111 | 1.0 | 1.0 | 0.333333 |
| std | 0.010127 | 0.0 | 0.0 | 0.480384 |
