# Parea Evaluation Deep Dive

# Set Up

In [None]:
! pip install parea-ai tiktoken openai langchain langchain-openai langchain_community html2text pinecone-client nest-asyncio datasets pandas requests beautifulsoup4

In [26]:
import json
import os
import random
import uuid

import nest_asyncio
from openai import OpenAI

from parea import Parea, trace
from parea.schemas import LLMInputs, EvaluationResult, Log, Completion, Message, Role, ModelParams, EvaluatedLog

nest_asyncio.apply()

Prerequisite: Make sure you have an OpenAI API key and a Parea API key.
Follow these [steps](https://docs.parea.ai/api-reference/authentication#parea-api-key) to create your free Parea API key.

In [3]:
os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"
os.environ["PAREA_API_KEY"] = "PAREA_API_KEY"
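
The cell above assigns literal placeholder strings, so a forgotten key only surfaces later as an authentication error. A minimal sketch of an early check, assuming you would rather fail before creating the clients; `require_env` is a hypothetical helper written for this notebook, not part of the Parea or OpenAI SDKs.

```python
import os
from typing import Optional


def require_env(name: str, placeholder: Optional[str] = None) -> str:
    """Return an environment variable, failing early if it is unset or still a placeholder.

    Hypothetical helper for this notebook; by default the placeholder is the
    variable's own name, matching the pattern used in the cell above.
    """
    value = os.environ.get(name, "").strip()
    if not value or value == (placeholder or name):
        raise RuntimeError(f"{name} is not set; assign a real key before creating the clients")
    return value
```

Calling `require_env("OPENAI_API_KEY")` before constructing `OpenAI(...)` would raise immediately if the placeholder was never replaced.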

In [6]:
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
p = Parea(api_key=os.environ["PAREA_API_KEY"], project_name="parea_eval_testing")
p.wrap_openai_client(client)  # auto trace all openai api calls

Created project parea_eval_testing


# Create Dataset

### Use a HuggingFace Dataset

https://huggingface.co/datasets/go_emotions

In [8]:
# Dataset Import

# Importing with huggingface datasets package
import pandas as pd
from datasets import load_dataset

df = load_dataset("go_emotions")

# creating an emotion index label dictionary
label_index = {
    "0": "admiration",
    "1": "amusement",
    "2": "anger",
    "3": "annoyance",
    "4": "approval",
    "5": "caring",
    "6": "confusion",
    "7": "curiosity",
    "8": "desire",
    "9": "disappointment",
    "10": "disapproval",
    "11": "disgust",
    "12": "embarrassment",
    "13": "excitement",
    "14": "fear",
    "15": "gratitude",
    "16": "grief",
    "17": "joy",
    "18": "love",
    "19": "nervousness",
    "20": "optimism",
    "21": "pride",
    "22": "realization",
    "23": "relief",
    "24": "remorse",
    "25": "sadness",
    "26": "surprise",
    "27": "neutral",
}


def make_sample():
    # Pull 21 consecutive comments (train indices 1001-1021) with their emotion labels
    data = []
    for i in range(1001, 1022):
        comment = df["train"][i]["text"]
        label_indices = df["train"][i]["labels"]

        # deal with labels
        if not isinstance(label_indices, list):
            label_indices = [label_indices]

        # label mapping
        emotions = ", ".join([label_index.get(str(label)) for label in label_indices])

        data.append((comment, emotions))

    return pd.DataFrame(data, columns=["comment", "emotion label"])


comments_df = make_sample()
comments_df.head()


| | comment | emotion label |
|---|---|---|
| 0 | Omg i hope this is about [NAME]. I would LOVE ... | optimism |
| 1 | Finale | neutral |
| 2 | Which suggests nothing in itself. The same mod... | anger, annoyance |
| 3 | I double dog dare him. | neutral |
| 4 | Believe you me. TLJ is much, much worse. | disappointment, disgust |


In [9]:
def to_simple_dictionary(comments_df):
    questions = [q for q in comments_df["comment"]]
    target = [a for a in comments_df["emotion label"]]
    # Parea uses the reserved keyword "target" when feeding a dataset to evaluation functions
    return [{"comment": q, "target": t} for q, t in zip(questions, target)]


dataset = to_simple_dictionary(comments_df)
dataset[0]  # show the first record

{'comment': 'Omg i hope this is about [NAME]. I would LOVE to see [NAME] and [NAME] go head to head',
 'target': 'optimism'}

---
# 1: Comparing Models with Custom Evaluation

Emotion classification with GPT-4o, GPT-4o-mini, and GPT-4 Turbo

### Setting up LLM call

Emotion classification task: take in a social media comment and apply one or more of the 27 emotion labels, or neutral, to it.

In [17]:
def call_llm(comment: str, model: str) -> str:
    return (
        client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "user",
                    "content": f"""
            You are a cutting edge emotion analysis classification assistant.
            You analyze a comment, and apply one or more emotion labels to it.
            The emotion labels are detailed here:

            ['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion',
            'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment',
            'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism',
            'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']

            Your output should simply be just the respective emotion, and if there are multiple, separated with commas.

            The comment is here: {comment}
            """,
                }
            ],
        )
        .choices[0]
        .message.content
    )
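
The prompt asks for comma-separated labels, but nothing forces the model to follow that format exactly. The following is a minimal sketch of normalising a raw response before scoring, assuming you want to drop anything outside the 28 labels listed in the prompt; `parse_labels` and `ALLOWED_LABELS` are hypothetical names, not part of the notebook or of Parea.

```python
# The 28 labels from the prompt above (27 emotions plus neutral).
ALLOWED_LABELS = {
    "admiration", "amusement", "anger", "annoyance", "approval", "caring", "confusion",
    "curiosity", "desire", "disappointment", "disapproval", "disgust", "embarrassment",
    "excitement", "fear", "gratitude", "grief", "joy", "love", "nervousness", "optimism",
    "pride", "realization", "relief", "remorse", "sadness", "surprise", "neutral",
}


def parse_labels(raw_output: str) -> list[str]:
    """Split a comma-separated model response into known emotion labels, in order."""
    labels = []
    for part in raw_output.split(","):
        # Tolerate stray whitespace, quotes, and capitalisation.
        label = part.strip().strip("'\"").lower()
        if label in ALLOWED_LABELS and label not in labels:
            labels.append(label)
    return labels


print(parse_labels("Anger,  annoyance"))  # ['anger', 'annoyance']
print(parse_labels("happiness"))  # []
```

An empty result signals that the model answered outside the label set, which may be worth tracking separately from a wrong label.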

### Defining a custom eval

We currently have two pieces of data:
1. The social media comment from the dataset
2. The emotion label(s) the dataset assigns to that comment

We want to evaluate how well each model's output for the comment (1) agrees with the assigned emotion label(s) (2).

The function below assigns a "matches_target" score of 1 for an exact match, 0.5 if the LLM output shares at least one label with the expected labels, and 0 if there is no overlap.

In [18]:
# Parea evaluation functions expect a log param. This is the log object created when we build the trace log.
# With the log object you can access any parameter set on your traced function like output and inputs and even configuration model parameters.
# Log also stores the reserved "target" if one is set on the dataset being evaluated
def matches_target(log: Log) -> float:
    # Getting the emotions and response as a set
    expected_answer = set(log.target.split(", "))
    response = set(log.output.split(", "))

    # Check if response matches the expected answer exactly
    if response == expected_answer:
        return 1
    # Check if there is any overlap (partial match)
    elif response & expected_answer:
        return 0.5
    # No overlap at all
    return 0
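
To see how the three score levels play out, here is a small offline sketch that applies the same exact / partial / none logic to plain strings. It is a variant under stated assumptions rather than the notebook's metric: `overlap_score` is a hypothetical name, and unlike `matches_target` it lower-cases labels and ignores whitespace around commas. The `SimpleNamespace` objects only stand in for the two `Log` attributes the metric reads.

```python
from types import SimpleNamespace


def overlap_score(target: str, output: str) -> float:
    """Three-level score on comma-separated labels, normalising case and whitespace."""
    expected = {label.strip().lower() for label in target.split(",") if label.strip()}
    predicted = {label.strip().lower() for label in output.split(",") if label.strip()}
    if predicted == expected:
        return 1.0  # exact match
    if predicted & expected:
        return 0.5  # at least one shared label
    return 0.0  # no overlap


# Stand-ins for Log objects: only `target` and `output` are needed here.
logs = [
    SimpleNamespace(target="anger, annoyance", output="anger, annoyance"),
    SimpleNamespace(target="anger, annoyance", output="Anger,annoyance"),
    SimpleNamespace(target="anger, annoyance", output="anger"),
    SimpleNamespace(target="neutral", output="joy"),
]
scores = [overlap_score(log.target, log.output) for log in logs]
print(scores)  # [1.0, 1.0, 0.5, 0.0]
```

The second case is where the two versions differ: splitting on the literal string `", "` would treat `"Anger,annoyance"` as a single unknown label and score it 0.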

### Using parea.experiment() to run your evaluations

`experiment()` needs a few arguments: a name, the function to evaluate, and the dataset whose fields will be fed into the function as input parameters.

### Evaluating matches_target score against go_emotions dataset and GPT models
We attach evaluation metrics to the function we want to evaluate using Parea's `trace` decorator. The decorator records the traced function's inputs and output in a log and passes that log to each evaluation metric.

In [29]:
MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]


@trace(eval_funcs=[matches_target])
def gpt4o_classify_emotion(comment) -> str:
    return call_llm(model="gpt-4o", comment=comment)


@trace(eval_funcs=[matches_target])
def gptmini_classify_emotion(comment) -> str:
    return call_llm(model="gpt-4o-mini", comment=comment)


@trace(eval_funcs=[matches_target])
def gpt4turbo_classify_emotion(comment) -> str:
    return call_llm(model="gpt-4-turbo", comment=comment)
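
The idea behind attaching metrics with a decorator can be illustrated with a toy version that runs offline. This is a conceptual sketch only and NOT Parea's implementation: `toy_trace`, `exact_match`, and `always_neutral` are hypothetical names, and the toy wrapper returns the scores alongside the output purely so they are visible here.

```python
import functools
from types import SimpleNamespace
from typing import Callable


def toy_trace(eval_funcs: list[Callable]):
    """Toy stand-in for the idea behind @trace(eval_funcs=...); not Parea's code."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(target=None, **inputs):
            output = func(**inputs)
            # Bundle inputs, output, and target into a log-like object for the metrics.
            log = SimpleNamespace(inputs=inputs, output=output, target=target)
            scores = {fn.__name__: fn(log) for fn in eval_funcs}
            return output, scores

        return wrapper

    return decorator


def exact_match(log) -> float:
    return 1.0 if log.output == log.target else 0.0


@toy_trace(eval_funcs=[exact_match])
def always_neutral(comment: str) -> str:
    return "neutral"  # stub in place of a real LLM call


print(always_neutral(comment="Finale", target="neutral"))  # ('neutral', {'exact_match': 1.0})
```

The point of the pattern is that the classification function stays unaware of scoring; metrics only ever see the log-like object.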

In [13]:
p.experiment(name="Emotions_Classifier", data=dataset, func=gpt4o_classify_emotion, metadata={"variant": MODELS[0]}).run()

Run name set to: ortho-hymn, since a name was not provided.


Running samples: 100%|██████████| 21/21 [00:02<00:00, 10.48sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run ortho-hymn stats:
{
  "latency": "0.52",
  "input_tokens": "203.43",
  "output_tokens": "3.05",
  "total_tokens": "206.48",
  "cost": "0.00106",
  "matches_target": "0.21"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/271b912f-e90c-4b8b-9d7e-ec9963ed958a



In [14]:
p.experiment(name="Emotions_Classifier", data=dataset, func=gptmini_classify_emotion, metadata={"variant": MODELS[1]}).run()

Run name set to: urban-kill, since a name was not provided.


Running samples: 100%|██████████| 21/21 [00:01<00:00, 15.47sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run urban-kill stats:
{
  "latency": "0.52",
  "input_tokens": "203.43",
  "output_tokens": "4.24",
  "total_tokens": "207.67",
  "cost": "0.00003",
  "matches_target": "0.26"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/e58a11f4-13e4-455f-9147-49a679c25a7d



In [20]:
p.experiment(name="Emotions_Classifier", data=dataset, func=gpt4turbo_classify_emotion, metadata={"variant": MODELS[2]}).run()

Run name set to: lamer-mays, since a name was not provided.


Running samples: 100%|██████████| 21/21 [00:02<00:00,  9.58sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run lamer-mays stats:
{
  "latency": "0.77",
  "input_tokens": "207.71",
  "output_tokens": "3.43",
  "total_tokens": "211.14",
  "cost": "0.00218",
  "matches_target": "0.29"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/e0f9f2d5-bc45-4c28-a3a2-9b9d5db23940



In [21]:
# alternatively, since our Dataset does not have a field for "model" we
# can create a factory method that wraps our function and accepts the model param.
def factory(model: str):
    @trace(eval_funcs=[matches_target])
    def classify_emotion(comment) -> str:
        return call_llm(model=model, comment=comment)

    return classify_emotion


# then we can call p.experiment in a loop
for model in MODELS:
    p.experiment(name="Emotions_Classifier", data=dataset, func=factory(model), metadata={"variant": model}).run(run_name=f"{model}-{str(uuid.uuid4())[:4]}")

Running samples: 100%|██████████| 21/21 [00:01<00:00, 15.37sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run gpt-4o-abcd stats:
{
  "latency": "0.50",
  "input_tokens": "203.43",
  "output_tokens": "2.76",
  "total_tokens": "206.19",
  "cost": "0.00106",
  "matches_target": "0.21"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/9c5d9cb0-9125-4db4-880f-a06898e836e2



Running samples: 100%|██████████| 21/21 [00:01<00:00, 12.52sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run gpt-4o-mini-175c stats:
{
  "latency": "0.60",
  "input_tokens": "203.43",
  "output_tokens": "3.86",
  "total_tokens": "207.29",
  "cost": "0.00003",
  "matches_target": "0.31"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/7128aaec-5061-4fe4-81a3-67ad5189ad36



Running samples: 100%|██████████| 21/21 [00:01<00:00, 12.19sample/s]
0it [00:04, ?it/s]


Experiment Emotions_Classifier Run gpt-4-turbo-b188 stats:
{
  "latency": "0.73",
  "input_tokens": "207.71",
  "output_tokens": "3.19",
  "total_tokens": "210.90",
  "cost": "0.00217",
  "matches_target": "0.33"
}


View experiment & traces at: https://app.parea.ai/experiments/Emotions_Classifier/bbc7a636-8fda-4371-946f-302f94781d7f
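
The factory pattern used in the loop above can be tried offline with a stubbed model call. This is a minimal sketch: `fake_call_llm` and `make_classifier` are hypothetical names, and the stub only echoes the requested model instead of contacting any API.

```python
from typing import Callable


def fake_call_llm(comment: str, model: str) -> str:
    """Offline stub standing in for call_llm; it only echoes which model was requested."""
    return f"{model}: neutral"


def make_classifier(model: str) -> Callable[[str], str]:
    # `model` is captured when make_classifier is called,
    # so each returned function keeps its own value.
    def classify_emotion(comment: str) -> str:
        return fake_call_llm(comment=comment, model=model)

    return classify_emotion


classifiers = {m: make_classifier(m) for m in ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo"]}
for name, classify in classifiers.items():
    print(classify("Finale"))
```

Each function accepts only `comment`, which matches the single input field in the dataset, while the model name travels in the closure.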



---
# 2: Custom LLM-As-Judge Evaluator

Using a custom evaluator that calls another LLM to assess model performance.

#### Creating a new dataset of question and answer pairs

Website of interest: https://lilianweng.github.io/posts/2023-06-23-agent/

In [22]:
def get_context():
    # Loading A Web Page
    import requests
    from bs4 import BeautifulSoup

    url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    text = [p.text for p in soup.find_all("p")]
    return "\n".join(text)


full_text = get_context()

In [24]:
# Example Questions
inputs = [
    "What is the primary function of LLM in autonomous agents?",
    "Can you describe the role of 'Planning' in LLM-powered autonomous agents?",
    "What types of memory are utilized by LLM-powered agents?",
    "How do autonomous agents use tool APIs?",
    "What are some challenges faced by LLM-powered autonomous agents in real-world applications?",
]

outputs = [
    "LLM functions as the core controller or 'brain' of autonomous agents, enabling them to handle complex tasks through planning, memory, and tool use.",
    "In LLM-powered agents, 'Planning' involves breaking down complex tasks into manageable sub goals, reflecting on past actions, and refining strategies for improved outcomes.",
    "LLM-powered agents utilize short-term memory for in-context learning and long-term memory for retaining and recalling information over extended periods, often leveraging external vector stores.",
    "Autonomous agents use tool APIs to extend their capabilities beyond the model's weights, allowing access to current information, code execution, and proprietary data.",
    "Challenges include managing the complexity of task dependencies, maintaining the stability of model outputs, and ensuring efficient interaction with external models and APIs.",
]

# create QA pair dictionary. Target is a reserved keyword and will be automatically passed to our eval function.
qa_pairs = [{"question": q, "target": a} for q, a in zip(inputs, outputs)]

### Create LLM Judge Evals

We test two LLM evals:
1. Chain of Thought
2. Helpfulness Criteria

In [45]:
MODELS = ["gpt-4o", "claude-3-haiku-20240307", "mistral-small-latest"]
JUDGE = random.choice(["gpt-4o", "mistral-small-latest"])


# To make it easy to use any model as an evaluator, we will use p.completion.
# p.completion is Parea's method for calling any LLM provider you have enabled on Parea through the same interface.
# You could easily replace p.completion with a direct call to the provider of your choice.
# Notice also that this time we return an EvaluationResult, which allows us to add a reason field.
def llm_judge(name: str, prompt: str) -> EvaluationResult:
    try:
        response = p.completion(
            data=Completion(
                llm_configuration=LLMInputs(
                    model=JUDGE,
                    messages=[Message(role=Role.system, content="Respond in JSON. JSON must have keys reason and score"), Message(role=Role.user, content=prompt)],
                    model_params=ModelParams(response_format={"type": "json_object"}),
                )
            )
        )
        response_dict = json.loads(response.content)
        return EvaluationResult(name=name, score=int(response_dict["score"]), reason=str(response_dict["reason"]))
    except Exception as e:
        print("Eval Exception:", e)
        return EvaluationResult(name=f"error-{name}", score=0, reason=f"Error in grading: {e}")


def cot_eval(log: Log) -> EvaluationResult:
    cot_template = f"""You are a teacher grading a quiz.
  You are given a question, the context the question is about, and the student's answer.
  You are asked to score the student's answer as either 1 for CORRECT or 0 for INCORRECT, based on the context.
  provide a reason that is a step by step reasoning about your conclusion.

  Grade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer.
  It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements.

  QUESTION: {log.inputs["question"]}
  CONTEXT: {full_text}
  STUDENT ANSWER: {log.output}

  RESPOND IN JSON, keys are:
    reason: step by step reasoning here
    score: 1 if CORRECT, 0 if INCORRECT
  """
    return llm_judge(name="cot_eval", prompt=cot_template)


def helpfulness_eval(log: Log) -> EvaluationResult:
    helpful_criteria_template = f"""You are assessing a submitted answer on a given task or input based on a set of criteria. Here is the data:
  [BEGIN DATA]
  ***
  [Input]: {log.inputs["question"]}
  ***
  [Submission]: {log.output}
  ***
  [Criteria]: "Is the submission helpful, insightful, and appropriate? If so, respond 1. If not, respond 0."
  ***
  [END DATA]
  Does the submission meet the Criteria?
  Respond in JSON with keys: reason and score.
    reason = step by step manner your reasoning about the criteria and the score given.
    score = A score of 1 or 0 corresponding to whether the submission meets all criteria or not.
  """
    return llm_judge(name="helpfulness_eval", prompt=helpful_criteria_template)


EVALS = {
    "cot_eval": cot_eval,
    "helpfulness_eval": helpfulness_eval,
}
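
`llm_judge` above catches every exception and records a score of 0 under an `error-` name, so a malformed judge reply and a failed API call look the same. A minimal sketch of isolating the parsing step, assuming the judge is asked for a JSON object with `reason` and `score` keys as in the prompts above; `parse_judge_response` is a hypothetical helper, not a Parea function.

```python
import json


def parse_judge_response(raw: str) -> tuple[int, str]:
    """Parse a judge reply into (score, reason); raise ValueError when it is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or "score" not in payload:
        raise ValueError("judge reply has no 'score' key")
    try:
        score = int(float(payload["score"]))  # accepts 1, "1", or 1.0
    except (TypeError, ValueError) as exc:
        raise ValueError(f"score is not numeric: {payload['score']!r}") from exc
    if score not in (0, 1):
        raise ValueError(f"score must be 0 or 1, got {score}")
    return score, str(payload.get("reason", ""))


print(parse_judge_response('{"reason": "matches context", "score": "1"}'))  # (1, 'matches context')
```

With parsing separated, the caller could report parse failures under a different evaluation name than transport failures.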

In [41]:
def qa_factory(model: str, eval_name: str):
    @trace(eval_funcs=[EVALS[eval_name]])
    def qa_llm_call(question: str) -> str:
        msgs = [
            {"role": "system", "content": f"Answer the user's question using this context: \n\n\n {full_text}"},
            {"role": "user", "content": f"Answer the question in 2-3 sentences {question}"},
        ]
        return p.completion(data=Completion(llm_configuration=LLMInputs(model=model, messages=[Message(**d) for d in msgs]))).content

    return qa_llm_call

In [42]:
for model in MODELS:
    p.experiment(
        name="QA_Evals_CoT",
        data=qa_pairs,
        func=qa_factory(model, "cot_eval"),
        metadata={"judge-model": JUDGE},
    ).run(run_name=f"{model}-{str(uuid.uuid4())[:4]}")

Running samples: 100%|██████████| 5/5 [00:00<00:00, 15.10sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:07<00:00,  1.51s/it]


Experiment QA_Evals_CoT Run gpt-4o-d48f stats:
{
  "latency": "0.29",
  "input_tokens": "9985.20",
  "output_tokens": "196.20",
  "total_tokens": "10181.40",
  "cost": "0.05288",
  "cot_eval": "1.00"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/24b61e6f-7258-4219-90cd-9bd0da94a5f2



Running samples: 100%|██████████| 5/5 [00:00<00:00, 14.60sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:09<00:00,  1.81s/it]


Experiment QA_Evals_CoT Run claude-3-haiku-20240307-0138 stats:
{
  "latency": "0.24",
  "input_tokens": "10647.80",
  "output_tokens": "250.00",
  "total_tokens": "10897.80",
  "cost": "0.02936",
  "cot_eval": "1.00"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/1088db2e-40ab-42b0-8793-3af662e1692d



Running samples: 100%|██████████| 5/5 [00:00<00:00,  8.93sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:08<00:00,  1.71s/it]


Experiment QA_Evals_CoT Run mistral-small-latest-1ea3 stats:
{
  "latency": "0.33",
  "input_tokens": "10727.60",
  "output_tokens": "185.60",
  "total_tokens": "10913.20",
  "cost": "0.03320",
  "cot_eval": "0.80"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/c0319184-0cc2-4af9-ab5d-d0e0ace3a823



In [43]:
for model in MODELS:
    p.experiment(
        name="QA_Evals_Helpfulness",
        data=qa_pairs,
        func=qa_factory(model, "helpfulness_eval"),
    ).run(run_name=f"{model}-{str(uuid.uuid4())[:4]}")

Running samples: 100%|██████████| 5/5 [00:00<00:00, 14.00sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:08<00:00,  1.71s/it]


Experiment QA_Evals_Helpfulness Run gpt-4o-2081 stats:
{
  "latency": "0.30",
  "input_tokens": "5115.20",
  "output_tokens": "175.20",
  "total_tokens": "5290.40",
  "cost": "0.02820",
  "helpfulness_eval": "1.00"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_Helpfulness/65324bef-1764-487a-94ea-811649c61042



Running samples: 100%|██████████| 5/5 [00:00<00:00,  9.58sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:07<00:00,  1.51s/it]


Experiment QA_Evals_Helpfulness Run claude-3-haiku-20240307-daa7 stats:
{
  "latency": "0.32",
  "input_tokens": "5777.80",
  "output_tokens": "205.20",
  "total_tokens": "5983.00",
  "cost": "0.00432",
  "helpfulness_eval": "1.00"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_Helpfulness/3285c661-28ca-452e-95ff-0e4e4f0abaa2



Running samples: 100%|██████████| 5/5 [00:00<00:00, 10.26sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:07<00:00,  1.41s/it]


Experiment QA_Evals_Helpfulness Run mistral-small-latest-4b20 stats:
{
  "latency": "0.31",
  "input_tokens": "5857.60",
  "output_tokens": "151.80",
  "total_tokens": "6009.40",
  "cost": "0.00834",
  "helpfulness_eval": "0.80"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_Helpfulness/094c7ca9-84d2-48e8-bba1-e7d86e065065



We can also attach multiple evaluation metrics at once.

In [44]:
@trace(eval_funcs=[cot_eval, helpfulness_eval])
def qa_llm_call(question: str) -> str:
    msgs = [
        {"role": "system", "content": f"Answer the user's question using this context: \n\n\n {full_text}"},
        {"role": "user", "content": f"Answer the question in 2-3 sentences {question}"},
    ]
    return p.completion(data=Completion(llm_configuration=LLMInputs(model="gpt-4o-mini", messages=[Message(**d) for d in msgs]))).content


p.experiment(
    name="QA_Evals",
    data=qa_pairs,
    func=qa_llm_call,
).run()

Run name set to: alpha-kibe, since a name was not provided.


Running samples: 100%|██████████| 5/5 [00:00<00:00, 14.19sample/s]
Waiting for evaluations to finish: 100%|██████████| 5/5 [00:10<00:00,  2.01s/it]


Experiment QA_Evals Run alpha-kibe stats:
{
  "latency": "0.31",
  "input_tokens": "10229.00",
  "output_tokens": "314.80",
  "total_tokens": "10543.80",
  "cost": "0.03126",
  "cot_eval": "0.80",
  "helpfulness_eval": "1.00"
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals/a4eeb170-5f54-4a7a-8d8f-87a05b3c82cd



### Dataset level evals: Set up a Summary Evaluator to evaluate all experiment evaluation results

We'll compute balanced accuracy over the chain-of-thought LLM judge scores; a run passes if its balanced accuracy is at least 80%.

In [46]:
from collections import defaultdict


# Dataset level evals accept a logs param which is a list of Logs with
# their score fields filled with the prior evaluation results
def balanced_acc_is_correct(logs: list[EvaluatedLog]):
    score_name = cot_eval.__name__

    correct = defaultdict(int)
    total = defaultdict(int)
    for log in logs:
        if eval_result := log.get_score(score_name):
            correct[log.target] += int(eval_result.score)
            total[log.target] += 1
    recalls = [correct[key] / total[key] for key in correct]

    # Guard against an empty list, e.g. when no log carries a cot_eval score
    acc = sum(recalls) / len(recalls) if recalls else 0.0
    if acc >= 0.8:
        return 1
    return 0
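
Balanced accuracy here is the mean of per-target recall, so it helps to work through the arithmetic on small inputs. The sketch below mirrors the grouping loop above on plain `(target, score)` pairs; `balanced_accuracy` is a hypothetical name and all inputs are made-up illustrations, not results from the experiments.

```python
from collections import defaultdict


def balanced_accuracy(pairs: list[tuple[str, int]]) -> float:
    """Mean per-target recall over (target, score) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for target, score in pairs:
        correct[target] += score
        total[target] += 1
    recalls = [correct[key] / total[key] for key in total]
    return sum(recalls) / len(recalls) if recalls else 0.0


# Five hypothetical QA results with distinct targets, four judged correct.
unique_targets = [("a1", 1), ("a2", 1), ("a3", 1), ("a4", 1), ("a5", 0)]
print(balanced_accuracy(unique_targets))  # 0.8

# Hypothetical two-class case: class "x" has recall 3/4, class "y" has recall 0/1.
skewed = [("x", 1), ("x", 1), ("x", 1), ("x", 0), ("y", 0)]
print(balanced_accuracy(skewed))  # 0.375
```

When every target is distinct, as with free-text reference answers, each recall is 0 or 1 and the metric reduces to plain accuracy. It only diverges from plain accuracy (0.6 in the second case) when several samples share a target.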

In [47]:
for model in MODELS:
    p.experiment(name="QA_Evals_CoT", data=qa_pairs, func=qa_factory(model, "cot_eval"), dataset_level_evals=[balanced_acc_is_correct]).run(
        run_name=f"{model}-{str(uuid.uuid4())[:4]}"
    )

Running samples: 100%|██████████| 5/5 [00:00<00:00,  9.66sample/s]
Waiting for evaluations to finish: 100%|██████████| 2/2 [00:04<00:00,  2.25s/it]


Experiment QA_Evals_CoT Run gpt-4o-3c4c stats:
{
  "latency": "0.35",
  "input_tokens": "9985.20",
  "output_tokens": "196.20",
  "total_tokens": "10181.40",
  "cost": "0.05288",
  "cot_eval": "1.00",
  "balanced_acc_is_correct": 1
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/d355e92b-50fa-4c11-a3c2-28e7e4383b4c



Running samples: 100%|██████████| 5/5 [00:00<00:00, 12.01sample/s]
Waiting for evaluations to finish: 100%|██████████| 1/1 [00:04<00:00,  4.51s/it]


Experiment QA_Evals_CoT Run claude-3-haiku-20240307-9743 stats:
{
  "latency": "0.27",
  "input_tokens": "10647.80",
  "output_tokens": "250.00",
  "total_tokens": "10897.80",
  "cost": "0.02936",
  "cot_eval": "1.00",
  "balanced_acc_is_correct": 1
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/50264d3c-1d0a-46b8-9e6a-810e5d5c5a09



Running samples: 100%|██████████| 5/5 [00:00<00:00, 13.42sample/s]
Waiting for evaluations to finish: 100%|██████████| 1/1 [00:04<00:00,  4.51s/it]


Experiment QA_Evals_CoT Run mistral-small-latest-5bec stats:
{
  "latency": "0.24",
  "input_tokens": "10727.60",
  "output_tokens": "185.60",
  "total_tokens": "10913.20",
  "cost": "0.03320",
  "cot_eval": "0.80",
  "balanced_acc_is_correct": 1
}


View experiment & traces at: https://app.parea.ai/experiments/QA_Evals_CoT/7b4fe0e2-be81-4dc1-acbf-2b699ce38f74

