# Promptfoo-style eval without promptfoo

Goal: Run test-suite-style eval (like Promptfoo) with completely custom components, i.e. without using Promptfoo.

In this case, you have 2 options:
1. Run with inputs. The library runs the AIConfig for you first.
2. Run with outputs only. You run the AIConfig yourself and save the outputs for eval.

Run the notebook in order for an example of each.

Assumptions:
* You have a parametrized AIConfig with a test input called "the_query", like this: 
`"input": "{{the_query}}"`
* You have some evaluation criteria in mind for the AIConfig's text output.
* The Promptfoo integration does not meet your needs, e.g.
  * You want to run the AIConfig yourself instead of handing control to Promptfoo
  * You need to scale beyond what Promptfoo can reasonably handle
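
To make the first assumption concrete, here is a minimal sketch of how a `{{the_query}}` placeholder could be filled in with a test input. The `render_template` helper is hypothetical and exists only for illustration; AIConfig resolves parameters itself, and its behavior may differ in the details.

```python
import re


def render_template(template: str, params: dict) -> str:
    """Replace each {{name}} placeholder with params[name] (illustrative helper only)."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched instead of raising.
        return params.get(key, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
rendered = render_template(template, {"the_query": "iconic midtown skyscrapers"})
print(rendered)
```

Each test input in the suites below plays the role of the `the_query` value here.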

In [1]:
# Package installs & environment setup
!pip3 install lastmile-utils --force
# If you see errors during installation, you can generally ignore them. Just make sure
# that the version printed below matches the one specified in
# aiconfig/python/requirements.txt (or is higher).
!pip3 list | grep lastmile-utils

import openai

# Create ~/.env file with this line: `export OPENAI_API_KEY=<your key here>`
# You can get your key from https://platform.openai.com/api-keys 
import dotenv
import os
dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

Collecting lastmile-utils
  Using cached lastmile_utils-0.0.21-py3-none-any.whl.metadata (901 bytes)
Collecting black==23.11.0 (from lastmile-utils)
  Using cached black-23.11.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (66 kB)
Collecting chardet==5.2.0 (from lastmile-utils)
  Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting flake8==6.1.0 (from lastmile-utils)
  Using cached flake8-6.1.0-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting isort==5.12.0 (from lastmile-utils)
  Using cached isort-5.12.0-py3-none-any.whl (91 kB)
Collecting jsoncomment==0.4.2 (from lastmile-utils)
  Using cached jsoncomment-0.4.2-py3-none-any.whl (6.8 kB)
Collecting pandas==2.1.2 (from lastmile-utils)
  Using cached pandas-2.1.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting pydantic==2.4.2 (from lastmile-utils)
  Using cached pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting pylint==3.0.2 (from lastmile-utils)
  Using cached pylint-3.0.2-py3-none-any.wh
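
As an alternative to reading the `pip3 list` output by eye, the version check can be sketched in Python with the standard library. The helper names below are hypothetical, the comparison assumes purely numeric dotted versions such as `0.0.21`, and the minimum version still has to be read from `aiconfig/python/requirements.txt`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> tuple:
    """Turn a purely numeric version such as '0.0.21' into (0, 0, 21)."""
    return tuple(int(part) for part in version.split("."))


def meets_minimum(package_name: str, minimum: str) -> bool:
    """True if the package is installed at `minimum` or a higher version."""
    found = installed_version(package_name)
    return found is not None and version_tuple(found) >= version_tuple(minimum)
```

Calling `meets_minimum("lastmile-utils", minimum)` with the minimum taken from the requirements file gives a single pass/fail answer.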

In [2]:
print("Imports and set log level")

import logging

import pandas as pd
import lastmile_utils.lib.jupyter as jupyter_utils

pd.set_option("display.max_colwidth", None)

from aiconfig.eval.api import (
    run_test_suite_with_inputs,
    TestSuiteWithInputsSettings,
)


jupyter_utils.set_log_level(logging.WARNING)


Imports and set log level





# Define a Metric

In [3]:
from typing import Literal
from aiconfig.eval.api import test_suite_common as common, test_suite_metrics as metrics
import lastmile_utils.lib.core.api as core_utils

print(
    """
    Before we define test suites, let's define a few metrics. Below, we will run these 
    on our data along with some off-the-shelf metrics.
    """
)

# 1. Helper function to construct a Metric that counts a specific letter.
def make_letter_count_metric(letter_to_count: str) -> metrics.TestSuiteMetric[str, int]:
    async def letter_count_metric(datum: str):
        return datum.count(letter_to_count)
    
    output_metric = metrics.TestSuiteMetric(
        evaluation_fn=letter_count_metric,
        metric_metadata=common.EvaluationMetricMetadata(
            name="letter_count",
            description="Counts the number of times the given letter appears in the text",
            extra_metadata={"letter_to_count": letter_to_count},
        )
    )
    return output_metric

# 2. Define a metric count_z using the helper function
count_z = make_letter_count_metric("z")


class EmotionalValenceRating(core_utils.Record):
    emotional_valence: Literal["happy", "sad", "neutral", "angry"]
    confidence_probability: float

# 3. Define a metric that asks GPT-3.5 to assess the emotional valence of the text.
gpt3_5_emotional_valence = metrics.make_openai_structured_llm_metric(
    eval_llm_name="gpt-3.5-turbo-0613",
    pydantic_basemodel_type=EmotionalValenceRating,
    metric_name="emotional_valence",
    metric_description="Emotional valence",
    field_descriptions=dict(
        emotional_valence=(
            "Exactly one of 'happy', 'sad', 'neutral', or 'angry', based on the emotional valence of the input text. "
            "Do not output anything else. Only output precisely one of those words, lowercase, with no punctuation or whitespace. "
            "Always output exactly one of the predefined words."
        ),
        confidence_probability="The probability that the emotional valence is correct.",
    ),
)

# await gpt3_5_emotional_valence("i am insane.")
# await count_z("i am insane.zzz")


    Before we define test suites, let's define a few metrics. Below, we will run these 
    on our data along with some off-the-shelf metrics.
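
The custom metric above boils down to an async function from text to a value, bundled with some metadata. A minimal, library-free sketch of that shape follows; `SketchMetric` is a hypothetical stand-in and not the library's `TestSuiteMetric`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable


@dataclass(frozen=True)
class SketchMetric:
    """Hypothetical stand-in for a metric: an async evaluation function plus metadata."""

    evaluation_fn: Callable[[str], Awaitable[Any]]
    name: str
    extra_metadata: dict = field(default_factory=dict)


def make_letter_count_sketch(letter_to_count: str) -> SketchMetric:
    async def letter_count(datum: str) -> int:
        # Same logic as the notebook's metric: count occurrences of one letter.
        return datum.count(letter_to_count)

    return SketchMetric(
        evaluation_fn=letter_count,
        name="letter_count",
        extra_metadata={"letter_to_count": letter_to_count},
    )


count_z_sketch = make_letter_count_sketch("z")
# In a plain script there is no running event loop, so asyncio.run drives the coroutine.
z_total = asyncio.run(count_z_sketch.evaluation_fn("i am insane.zzz"))
print(z_total)
```

Inside a notebook, where an event loop is already running, `await count_z_sketch.evaluation_fn(...)` would be used instead of `asyncio.run`.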
    


## Option 1: Provide inputs, library runs AIConfig for you.

In [4]:
print(
    """
    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
"""
)


ts_settings = TestSuiteWithInputsSettings(
    prompt_name="gen_itinerary",
    aiconfig_path="./travel_parametrized.aiconfig.json",
)

# Each of these pairs will be used to construct a test case just below.
# For each pair (input, expected_substring) we define a test case that says, 
# "When I run this input through this AIConfig, 
# I expect the output to contain this particular substring".

# For example, when we call `substring_match(substring, case_sensitive=False)` below
# with substring == "Empire State Building", we are telling the library to create a
# boolean metric (i.e. a pass/fail test case) that passes (value == True) if the substring
# "empire state building" appears in the AIConfig output
# when the AIConfig is given the input "iconic midtown skyscrapers".
#
# Each test input is substituted for "the_query" in the input prompt:
# "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
# See the AIConfig (python/src/aiconfig/eval/custom_eval/examples/travel/travel_parametrized.aiconfig.json).
test_inputs_with_substrings = [
    ("different kinds of cuisines", "Magnolia Bakery"),
    ("iconic midtown skyscrapers", "Empire State Building"),
]

test_suite_with_inputs = []
for test_input, substring in test_inputs_with_substrings:
    for metric in [
        metrics.brevity, 
        metrics.substring_match(substring, case_sensitive=False), 
        metrics.gpt3_5_text_ratings, 
        count_z, 
        gpt3_5_emotional_valence,
    ]:
        test_suite_with_inputs.append((test_input, metric))


    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
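
The pass/fail idea behind a case-insensitive substring test can be sketched without the library. The `contains_substring` helper below is hypothetical; the library's `substring_match` may be implemented differently.

```python
def contains_substring(text: str, substring: str, case_sensitive: bool = True) -> bool:
    """Return True if `substring` occurs in `text`, optionally ignoring case."""
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return substring in text


# Sample outputs copied from one run of this notebook; real LLM outputs vary between runs.
cuisine_output = (
    "1. Explore Chelsea Market's international food stalls. "
    "2. Guided Manhattan Chinatown food tour. "
    "3. Experience Italian heritage and cuisine in Little Italy."
)
skyscraper_output = (
    "Day 1: Empire State Building, Skyride. "
    "Day 2: Rockefeller Center, Top of the Rock. "
    "Day 3: One World Trade Center, 9/11 Memorial & Museum."
)

print(contains_substring(cuisine_output, "Magnolia Bakery", case_sensitive=False))
print(contains_substring(skyscraper_output, "empire state building", case_sensitive=False))
```

With `case_sensitive=False`, the lowercase query still matches the capitalized landmark name, which is why the skyscraper test case can pass regardless of capitalization.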



In [5]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_input, fn in test_suite_with_inputs:
    print("\nTest input:\n", test_input, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test input:
 different kinds of cuisines 
Function:
 TestSuiteMetric(evaluation_fn=<function metric.<locals>._construct.<locals>.evaluation_fn at 0x2bc1e8940>, metric_metadata=EvaluationMetricMetadata({
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1,
  "worst_value": 9223372036854775807,
  "extra_metadata": {
    "args": []
  },
  "id": "5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085"
}))

Test input:
 different kinds of cuisines 
Function:
 TestSuiteMetric(evaluation_fn=<function metric.<locals>._construct.<locals>.evaluation_fn at 0x2bc355360>, metric_metadata=EvaluationMetricMetadata({
  "name": "substring_match",
  "description": "True (pass) if contains given substring",
  "best_value": true,
  "worst_value": false,
  "extra_metadata": {
    "args": [
      "Magnolia Bakery"
    ],
    "case_sensitive": false
  },
  "id": "12b2b88421a53f87fa1

In [6]:
print("Run the eval interface (option 1, with inputs)")

df_result = await run_test_suite_with_inputs(
    test_suite=test_suite_with_inputs,
    settings=ts_settings,
)

print("Raw output")
df_result

Run the eval interface (option 1, with inputs)
Raw output


Unnamed: 0,input,aiconfig_output,value,metric_id,metric_name,metric_description,best_possible_value,worst_possible_value
0,different kinds of cuisines,1. Explore Chelsea Market's international food stalls. 2. Guided Manhattan Chinatown food tour. 3. Experience Italian heritage and cuisine in Little Italy.,155,5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085,brevity,Absolute text length,1,9223372036854775807
1,different kinds of cuisines,1. Explore Chelsea Market's international food stalls. 2. Guided Manhattan Chinatown food tour. 3. Experience Italian heritage and cuisine in Little Italy.,False,12b2b88421a53f87fa1502c48a3bfa8b84aa22af3528178f0ec8d699db041d8d,substring_match,True (pass) if contains given substring,True,False
2,different kinds of cuisines,1. Explore Chelsea Market's international food stalls. 2. Guided Manhattan Chinatown food tour. 3. Experience Italian heritage and cuisine in Little Italy.,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text is concise and provides clear information about three different food-related experiences in New York City.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,
3,different kinds of cuisines,1. Explore Chelsea Market's international food stalls. 2. Guided Manhattan Chinatown food tour. 3. Experience Italian heritage and cuisine in Little Italy.,0,855b84d49dadc258f82d949bf3d57a100c788e6e093e4615e8b4e03567f1ffc9,letter_count,Counts the number of times the given letter appears in the text,,
4,different kinds of cuisines,1. Explore Chelsea Market's international food stalls. 2. Guided Manhattan Chinatown food tour. 3. Experience Italian heritage and cuisine in Little Italy.,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",a351b0b7ab1639eb32695430b3e1bb65c96d11b528730c103d5879234a3bd8bb,emotional_valence,Emotional valence,,
5,iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride. Day 2: Rockefeller Center, Top of the Rock. Day 3: One World Trade Center, 9/11 Memorial & Museum.",137,5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085,brevity,Absolute text length,1,9223372036854775807
6,iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride. Day 2: Rockefeller Center, Top of the Rock. Day 3: One World Trade Center, 9/11 Memorial & Museum.",True,17bb1efe1fb306bce98240f3534f5d29c68564e4e7c1c0db17198247d19754e3,substring_match,True (pass) if contains given substring,True,False
7,iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride. Day 2: Rockefeller Center, Top of the Rock. Day 3: One World Trade Center, 9/11 Memorial & Museum.","CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text provides a clear and concise itinerary for three days in New York City, mentioning the main attractions to visit each day.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,
8,iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride. Day 2: Rockefeller Center, Top of the Rock. Day 3: One World Trade Center, 9/11 Memorial & Museum.",0,855b84d49dadc258f82d949bf3d57a100c788e6e093e4615e8b4e03567f1ffc9,letter_count,Counts the number of times the given letter appears in the text,,
9,iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride. Day 2: Rockefeller Center, Top of the Rock. Day 3: One World Trade Center, 9/11 Memorial & Museum.","CustomMetricPydanticObject(data={\n ""emotional_valence"": ""neutral"",\n ""confidence_probability"": 0.9\n})",a351b0b7ab1639eb32695430b3e1bb65c96d11b528730c103d5879234a3bd8bb,emotional_valence,Emotional valence,,


In [7]:
print("Unstack for nicer manual review")
df_result.set_index(["input", "aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


input,aiconfig_output,brevity,emotional_valence,letter_count,substring_match,text_ratings
different kinds of cuisines,1. Explore Chelsea Market's international food stalls.\n2. Guided Manhattan Chinatown food tour.\n3. Experience Italian heritage and cuisine in Little Italy.,155,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",0,False,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text is concise and provides clear information about three different food-related experiences in New York City.""\n})"
iconic midtown skyscrapers,"Day 1: Empire State Building, Skyride.\nDay 2: Rockefeller Center, Top of the Rock.\nDay 3: One World Trade Center, 9/11 Memorial & Museum.",137,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""neutral"",\n ""confidence_probability"": 0.9\n})",0,True,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text provides a clear and concise itinerary for three days in New York City, mentioning the main attractions to visit each day.""\n})"
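
To make explicit what the unstack step does, here is a dependency-free sketch of the same long-to-wide reshape: one entry per input, one key per metric. The `pivot_by_metric` helper is hypothetical, and the rows are a small subset of the values shown in the raw output above.

```python
from collections import defaultdict

# A few (input, metric_name, value) rows in "long" form, taken from the raw output.
long_rows = [
    {"input": "different kinds of cuisines", "metric_name": "brevity", "value": 155},
    {"input": "different kinds of cuisines", "metric_name": "substring_match", "value": False},
    {"input": "iconic midtown skyscrapers", "metric_name": "brevity", "value": 137},
    {"input": "iconic midtown skyscrapers", "metric_name": "substring_match", "value": True},
]


def pivot_by_metric(rows):
    """Group rows by input so that each metric becomes a key of that input's dict."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["input"]][row["metric_name"]] = row["value"]
    return dict(wide)


wide_table = pivot_by_metric(long_rows)
for test_input, metric_values in wide_table.items():
    print(test_input, metric_values)
```

The pandas version in the cell above does the same grouping, with `metric_name` values becoming columns.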


## Option 2: Run eval on already-computed AIConfig outputs.

In [8]:
print("Define outputs to test and criteria, similar to option 1.")


from aiconfig.eval.api import (
    run_test_suite_outputs_only,
)

from aiconfig.eval.api import test_suite_metrics as metrics


# This is similar to "test_inputs_with_substrings" above, but we have the AIConfig *outputs*
# in the test cases, rather than the inputs. The library will evaluate these strings directly
# because there is no need to run the AIConfig to generate the outputs.
test_outputs_with_substrings = [
    (
        "Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience",
        "Magnolia Bakery"
    ),
    (
        "1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit \"Top of the Rock\", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",
        "Empire State Building"
    )
]



test_suite_outputs_only = []
for test_output, substring in test_outputs_with_substrings:
    for metric in [
        metrics.brevity, 
        metrics.substring_match(substring, case_sensitive=False), 
        metrics.gpt3_5_text_ratings, 
        count_z, 
        gpt3_5_emotional_valence,
    ]:
        test_suite_outputs_only.append((test_output, metric))

Define outputs to test and criteria, similar to option 1.
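
Option 2 presumes that you ran the AIConfig yourself and kept the outputs somewhere. One possible way to do that is sketched below with a JSON file; the file name, the field names `aiconfig_output` and `expected_substring`, and both helper functions are assumptions made for illustration, not a format the library requires.

```python
import json
import tempfile
from pathlib import Path


def save_outputs(path: Path, records: list) -> None:
    """Write the saved AIConfig outputs and their expected substrings as JSON."""
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_output_pairs(path: Path) -> list:
    """Read the JSON file back as (output, expected_substring) tuples."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [(r["aiconfig_output"], r["expected_substring"]) for r in records]


records = [
    {
        "aiconfig_output": (
            "Begin at Chelsea Market for diverse food options. Continue to Queens "
            "for immersive food tours. Conclude at Smorgasburg for unique outdoor "
            "food market experience"
        ),
        "expected_substring": "Magnolia Bakery",
    },
]

# A temporary directory keeps the sketch self-contained; use a real path in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    outputs_path = Path(tmp_dir) / "saved_outputs.json"
    save_outputs(outputs_path, records)
    loaded_pairs = load_output_pairs(outputs_path)

print(loaded_pairs)
```

The loaded pairs have the same shape as `test_outputs_with_substrings`, so they could feed the same loop that builds `test_suite_outputs_only`.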


In [9]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_output, fn in test_suite_outputs_only:
    print("\nTest output:\n", test_output, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 TestSuiteMetric(evaluation_fn=<function metric.<locals>._construct.<locals>.evaluation_fn at 0x2bc1e8940>, metric_metadata=EvaluationMetricMetadata({
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1,
  "worst_value": 9223372036854775807,
  "extra_metadata": {
    "args": []
  },
  "id": "5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085"
}))

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 TestSuiteMetric(evaluation_fn=<function metric.<locals>._construct.<locals>.evaluation_fn at 0x2bc388790>, metric_metadata=EvaluationMetricMetada

In [10]:
print("Run the eval library")
df_result = await run_test_suite_outputs_only(
    test_suite=test_suite_outputs_only,
)
print("Raw output")
df_result

Run the eval library
Raw output


Unnamed: 0,input,aiconfig_output,value,metric_id,metric_name,metric_description,best_possible_value,worst_possible_value
0,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,160,5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085,brevity,Absolute text length,1,9223372036854775807
1,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,False,12b2b88421a53f87fa1502c48a3bfa8b84aa22af3528178f0ec8d699db041d8d,substring_match,True (pass) if contains given substring,True,False
2,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 4,\n ""conciseness_confidence"": 0.8,\n ""conciseness_reasoning"": ""The text provides a clear and concise description of the itinerary, mentioning the starting point, the main activity in Queens, and the final destination.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,
3,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,0,855b84d49dadc258f82d949bf3d57a100c788e6e093e4615e8b4e03567f1ffc9,letter_count,Counts the number of times the given letter appears in the text,,
4,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",a351b0b7ab1639eb32695430b3e1bb65c96d11b528730c103d5879234a3bd8bb,emotional_valence,Emotional valence,,
5,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",267,5b29b6ba68aeeadc42b7333015f4b158f7514f68c05fef79a702e98cf9983085,brevity,Absolute text length,1,9223372036854775807
6,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",True,17bb1efe1fb306bce98240f3534f5d29c68564e4e7c1c0db17198247d19754e3,substring_match,True (pass) if contains given substring,True,False
7,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.","CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text provides a concise description of the attractions and activities at each location.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,
8,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",0,855b84d49dadc258f82d949bf3d57a100c788e6e093e4615e8b4e03567f1ffc9,letter_count,Counts the number of times the given letter appears in the text,,
9,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.","CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",a351b0b7ab1639eb32695430b3e1bb65c96d11b528730c103d5879234a3bd8bb,emotional_valence,Emotional valence,,
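
For boolean metrics such as `substring_match`, a single pass rate is often easier to scan than the raw rows. The sketch below computes it from plain dictionaries; `pass_rate` is a hypothetical helper, and the rows reuse values from the raw output above.

```python
def pass_rate(rows, metric_name):
    """Fraction of rows for `metric_name` whose value is True, or None if there are none."""
    values = [row["value"] for row in rows if row["metric_name"] == metric_name]
    if not values:
        return None
    return sum(1 for value in values if value is True) / len(values)


# Values taken from the raw output above (two test outputs, two of the five metrics).
result_rows = [
    {"metric_name": "brevity", "value": 160},
    {"metric_name": "substring_match", "value": False},
    {"metric_name": "brevity", "value": 267},
    {"metric_name": "substring_match", "value": True},
]

substring_pass_rate = pass_rate(result_rows, "substring_match")
print(substring_pass_rate)
```

Returning `None` for a metric with no rows keeps "no data" distinct from a pass rate of zero.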


In [11]:
print("Unstack for nicer manual review")
df_result.set_index(["aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


aiconfig_output,brevity,emotional_valence,letter_count,substring_match,text_ratings
"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",267,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",0,True,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text provides a concise description of the attractions and activities at each location.""\n})"
Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,160,"CustomMetricPydanticObject(data={\n ""emotional_valence"": ""happy"",\n ""confidence_probability"": 0.9\n})",0,False,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 4,\n ""conciseness_confidence"": 0.8,\n ""conciseness_reasoning"": ""The text provides a clear and concise description of the itinerary, mentioning the starting point, the main activity in Queens, and the final destination.""\n})"
