# Promptfoo-style eval without promptfoo

Goal: Run a test-suite-style eval (like Promptfoo) with completely custom components, i.e. without using Promptfoo.

In this case, you have two options:
1. Run with inputs. The library runs the AIConfig for you first, then evaluates the outputs.
2. Run with outputs only. You run the AIConfig yourself and save the outputs for eval.

Run the notebook in order for an example of each.

Assumptions:
* You have a parametrized AIConfig with a test input called "the_query", like this: 
`"input": "{{the_query}}"`
* You have some evaluation criteria in mind for the AIConfig's text output.
* The Promptfoo integration does not meet your needs, e.g.
  * You want to run the AIConfig yourself instead of handing control to Promptfoo
  * You need to scale beyond what Promptfoo can reasonably handle
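
To make the first assumption concrete, here is a minimal sketch of how a `{{the_query}}` placeholder could be filled in. `fill_template` is a hypothetical helper written for illustration only; it is not how the AIConfig library resolves parameters internally. The prompt text is the one quoted in the Option 1 cell of this notebook.

```python
import re


def fill_template(template: str, params: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with params[name] (hypothetical helper)."""

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched so missing params are easy to spot.
        return params.get(name, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", _substitute, template)


prompt_template = "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
resolved = fill_template(prompt_template, {"the_query": "iconic midtown skyscrapers"})
print(resolved)
```

Each test input in the suite plays the role of the `the_query` value here.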

In [1]:
# Package installs & environment setup
!pip3 install lastmile-utils --force
# pip may print dependency errors here; these can generally be ignored.
# Just make sure that the version printed below matches the one specified in the
# aiconfig/python/requirements.txt file (or is a higher version)
!pip3 list | grep lastmile-utils

import openai

# Create ~/.env file with this line: `export OPENAI_API_KEY=<your key here>`
# You can get your key from https://platform.openai.com/api-keys 
import dotenv
import os
dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

Collecting lastmile-utils
  Using cached lastmile_utils-0.0.13-py3-none-any.whl.metadata (901 bytes)
Collecting black==23.11.0 (from lastmile-utils)
  Using cached black-23.11.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (66 kB)
Collecting chardet==5.2.0 (from lastmile-utils)
  Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting flake8==6.1.0 (from lastmile-utils)
  Using cached flake8-6.1.0-py2.py3-none-any.whl.metadata (3.8 kB)
Collecting isort==5.12.0 (from lastmile-utils)
  Using cached isort-5.12.0-py3-none-any.whl (91 kB)
Collecting jsoncomment==0.4.2 (from lastmile-utils)
  Using cached jsoncomment-0.4.2-py3-none-any.whl (6.8 kB)
Collecting pandas==2.1.2 (from lastmile-utils)
  Using cached pandas-2.1.2-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting pydantic==2.4.2 (from lastmile-utils)
  Using cached pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting pylint==3.0.2 (from lastmile-utils)
  Using cached pylint-3.0.2-py3-none-any.wh

In [2]:
print("Imports and set log level")

import logging

import pandas as pd
import lastmile_utils.lib.jupyter as jupyter_utils

pd.set_option("display.max_colwidth", None)

from aiconfig.eval.api import (
    run_test_suite_with_inputs,
    TestSuiteWithInputsSettings,
)

from aiconfig.eval.api import metrics

jupyter_utils.set_log_level(logging.WARNING)



Imports and set log level





## Option 1: provide inputs, library runs AIConfig for you
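
Before building the test suite, it may help to see what the two simplest metrics compute. The sketch below is a plain-Python stand-in, not the library's implementation: `brevity_sketch` and `substring_match_sketch` are hypothetical names, and the behavior (character count, case-insensitive containment) is assumed from the metric descriptions and values reported in this notebook.

```python
def brevity_sketch(text: str) -> int:
    """Absolute text length in characters; lower is better."""
    return len(text)


def substring_match_sketch(substring: str, case_sensitive: bool = True):
    """Build a pass/fail check that is True when `substring` occurs in the text."""

    def _fn(text: str) -> bool:
        if case_sensitive:
            return substring in text
        # Case-insensitive comparison: lowercase both sides first.
        return substring.lower() in text.lower()

    return _fn


output = "1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit"
contains_bakery = substring_match_sketch("Magnolia Bakery", case_sensitive=False)

print(brevity_sketch(output))   # 73
print(contains_bakery(output))  # False
```

Note that `substring_match_sketch` returns a function, which mirrors how `metrics.substring_match(...)` is called with arguments in this notebook, while `metrics.brevity` is used directly.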

In [3]:
print(
    """
    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
"""
)


ts_settings = TestSuiteWithInputsSettings(
    prompt_name="gen_itinerary",
    aiconfig_path="./travel_parametrized.aiconfig.json",
)

# Each of these pairs will be used to construct a test case just below.
# For each pair (input, expected_substring) we define a test case that says,
# "When I run this input through this AIConfig,
# I expect the output to contain this particular substring".

# Each test input gets substituted for "the_query" in the input prompt:
# "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
# See the aiconfig (python/src/aiconfig/eval/custom_eval/examples/travel/travel_parametrized.aiconfig.json).

# For example, when we call `substring_match(substring, case_sensitive=False)` below
# with substring == "Empire State Building", we are telling the library to create a
# boolean metric (i.e. a pass/fail test case) that passes (value is True) if the substring
# "empire state building" appears, ignoring case, in the AIConfig output
# when the AIConfig is given the input "iconic midtown skyscrapers".
test_inputs_with_substrings = [
    ("different kinds of cuisines", "Magnolia Bakery"),
    ("iconic midtown skyscrapers", "Empire State Building"),
]

test_suite_with_inputs = []
for test_input, substring in test_inputs_with_substrings:
    # Add the brevity metric
    test_fn1 = metrics.brevity
    test_suite_with_inputs.append((test_input, test_fn1))
    # Add substring check function
    test_fn2 = metrics.substring_match(substring, case_sensitive=False)
    test_suite_with_inputs.append((test_input, test_fn2))

    # Add a model-graded eval metric that uses GPT 3.5 to return a struct
    test_fn3 = metrics.gpt3_5_text_ratings
    test_suite_with_inputs.append((test_input, test_fn3))


    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.



In [4]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_input, fn in test_suite_with_inputs:
    print("\nTest input:\n", test_input, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test input:
 different kinds of cuisines 
Function:
 Metric(evaluation_fn=<function _calculate_brevity at 0x2a039d990>, metric_metadata=EvaluationMetricMetadata({
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1,
  "worst_value": 9223372036854775807,
  "extra_metadata": {},
  "id": "24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3"
}))

Test input:
 different kinds of cuisines 
Function:
 Metric(evaluation_fn=<function substring_match.<locals>._fn at 0x2a039eb90>, metric_metadata=EvaluationMetricMetadata({
  "name": "substring_match",
  "description": "True (pass) if contains given substring",
  "best_value": true,
  "worst_value": false,
  "extra_metadata": {
    "substring": "Magnolia Bakery",
    "case_sensitive": false
  },
  "id": "0c461362f44884023dda5537ce88263ba20d555562bac8abc05bcde0ce1aacf6"
}))

Test input:
 different kinds of cuisines 
Fun

In [5]:
print("Run the eval interface (option 1, with inputs)")

df_result = await run_test_suite_with_inputs(
    test_suite=test_suite_with_inputs,
    settings=ts_settings,
)

print("Raw output")
df_result

Run the eval interface (option 1, with inputs)
Raw output


index,input,aiconfig_output,value,metric_id,metric_name,metric_description,best_possible_value,worst_possible_value
0,different kinds of cuisines,1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit,73,24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3,brevity,Absolute text length,1,9223372036854775807
1,different kinds of cuisines,1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit,False,0c461362f44884023dda5537ce88263ba20d555562bac8abc05bcde0ce1aacf6,substring_match,True (pass) if contains given substring,True,False
2,different kinds of cuisines,1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 4,\n ""conciseness_confidence"": 0.8,\n ""conciseness_reasoning"": ""The text is concise and provides a clear list of three different food tours.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,
3,iconic midtown skyscrapers,"1. Start at Top of the Rock Observation Deck for panoramic city views. 2. Visit Empire State Building, experience exhibits and ascend to 86th-floor deck. 3. Finish day strolling through The High Line, enjoying art and scenery.",226,24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3,brevity,Absolute text length,1,9223372036854775807
4,iconic midtown skyscrapers,"1. Start at Top of the Rock Observation Deck for panoramic city views. 2. Visit Empire State Building, experience exhibits and ascend to 86th-floor deck. 3. Finish day strolling through The High Line, enjoying art and scenery.",True,53e4c7163f49fdc7727286e638ff07bcb570faaa334456775c616c2f4ad3eb3f,substring_match,True (pass) if contains given substring,True,False
5,iconic midtown skyscrapers,"1. Start at Top of the Rock Observation Deck for panoramic city views. 2. Visit Empire State Building, experience exhibits and ascend to 86th-floor deck. 3. Finish day strolling through The High Line, enjoying art and scenery.","CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text is concise and provides clear instructions for a day of sightseeing in New York City.""\n})",300b32bb8a01befd5e729eaf73506bdba01f910c0db0c8f70136dd2e48e298a7,text_ratings,Text ratings,,


In [6]:
print("Unstack for nicer manual review")
df_result.set_index(["input", "aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


input,aiconfig_output,brevity,substring_match,text_ratings
different kinds of cuisines,1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit,73,False,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 4,\n ""conciseness_confidence"": 0.8,\n ""conciseness_reasoning"": ""The text is concise and provides a clear list of three different food tours.""\n})"
iconic midtown skyscrapers,"1. Start at Top of the Rock Observation Deck for panoramic city views.\n2. Visit Empire State Building, experience exhibits and ascend to 86th-floor deck.\n3. Finish day strolling through The High Line, enjoying art and scenery.",226,True,"CustomMetricPydanticObject(data={\n ""conciseness_rating"": 5,\n ""conciseness_confidence"": 0.9,\n ""conciseness_reasoning"": ""The text is concise and provides clear instructions for a day of sightseeing in New York City.""\n})"
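
To show what the unstack step does, here is a small sketch that pivots long-format rows (one row per input and metric) into one record per input, using only the standard library. The row values are copied from the results above; `pivot_by_metric` and the pass-rate summary are illustrative additions, not part of the eval library.

```python
from collections import defaultdict

# Long format: one row per (input, metric) pair, values taken from the results above.
long_rows = [
    {"input": "different kinds of cuisines", "metric_name": "brevity", "value": 73},
    {"input": "different kinds of cuisines", "metric_name": "substring_match", "value": False},
    {"input": "iconic midtown skyscrapers", "metric_name": "brevity", "value": 226},
    {"input": "iconic midtown skyscrapers", "metric_name": "substring_match", "value": True},
]


def pivot_by_metric(rows):
    """Group rows by input so each metric becomes its own key (wide format)."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["input"]][row["metric_name"]] = row["value"]
    return dict(wide)


wide_rows = pivot_by_metric(long_rows)

# Fraction of test inputs whose output contained the expected substring.
pass_rate = sum(r["substring_match"] for r in wide_rows.values()) / len(wide_rows)
print(wide_rows)
print(pass_rate)
```

The wide layout makes it easier to scan all metrics for a single input side by side, which is the point of the pandas `unstack` call in the cell above.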


## Option 2: run eval on already-computed AIConfig outputs
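
With this option you are responsible for producing and storing the outputs. One possible workflow, sketched below, is to save each output next to its expected substring in a JSON file and load the pairs back when it is time to evaluate. The file name, the record keys, and the helpers `save_outputs` and `load_output_pairs` are all assumptions made for this example; the library does not prescribe a storage format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical record layout: one dict per saved AIConfig output.
saved_records = [
    {
        "aiconfig_output": "1. Chinatown Food Tour 2. Little Italy Pizza Tour 3. Chelsea Market Visit",
        "expected_substring": "Magnolia Bakery",
    },
    {
        "aiconfig_output": "1. Empire State Building: Observation deck visit.",
        "expected_substring": "Empire State Building",
    },
]


def save_outputs(records, path: Path) -> None:
    """Write the records to disk as JSON."""
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_output_pairs(path: Path):
    """Read the records back as (output, expected_substring) tuples."""
    records = json.loads(path.read_text(encoding="utf-8"))
    return [(r["aiconfig_output"], r["expected_substring"]) for r in records]


with tempfile.TemporaryDirectory() as tmp_dir:
    outputs_path = Path(tmp_dir) / "aiconfig_outputs.json"
    save_outputs(saved_records, outputs_path)
    loaded_pairs = load_output_pairs(outputs_path)

print(loaded_pairs)
```

The loaded list has the same shape as the hard-coded list of (output, substring) pairs used in the next cell.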

In [12]:
print("Define outputs to test and criteria, similar to option 1.")


from aiconfig.eval.api import (
    run_test_suite_outputs_only,
)

from aiconfig.eval.api import metrics


# This is similar to "test_inputs_with_substrings" above, but we have the AIConfig *outputs*
# in the test cases, rather than the inputs. The library will evaluate these strings directly
# because there is no need to run the AIConfig to generate the outputs.
test_outputs_with_substrings = [
    (
        "Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience",
        "Magnolia Bakery"
    ),
    (
        "1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit \"Top of the Rock\", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",
        "Empire State Building"
    )
]



test_suite_outputs_only = []
for test_output, substring in test_outputs_with_substrings:
    # Add the brevity metric
    test_fn1 = metrics.brevity
    test_suite_outputs_only.append((test_output, test_fn1))
    # Add substring check function
    test_fn2 = metrics.substring_match(substring, case_sensitive=False)
    test_suite_outputs_only.append(
        (test_output, test_fn2)
    )

    # Add a model-graded eval metric that uses GPT 3.5 to return a struct
    test_fn3 = metrics.gpt3_5_text_ratings
    # Use the current output here; `test_input` is a leftover variable from option 1
    test_suite_outputs_only.append((test_output, test_fn3))

Define outputs to test and criteria, similar to option 1.


In [13]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_output, fn in test_suite_outputs_only:
    print("\nTest output:\n", test_output, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 Metric(evaluation_fn=<function _calculate_brevity at 0x2a039d990>, metric_metadata=EvaluationMetricMetadata({
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1,
  "worst_value": 9223372036854775807,
  "extra_metadata": {},
  "id": "24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3"
}))

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 Metric(evaluation_fn=<function substring_match.<locals>._fn at 0x2a03cbd90>, metric_metadata=EvaluationMetricMetadata({
  "name": "substring_match",
  "description": "True (pass) if contains given substr

In [9]:
print("Run the eval library")
df_result = await run_test_suite_outputs_only(
    test_suite=test_suite_outputs_only,
)
print("Raw output")
df_result

Run the eval library
Raw output


index,input,aiconfig_output,value,metric_id,metric_name,metric_description,best_possible_value,worst_possible_value
0,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,160,24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3,brevity,Absolute text length,1,9223372036854775807
1,Missing,Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,False,0c461362f44884023dda5537ce88263ba20d555562bac8abc05bcde0ce1aacf6,substring_match,True (pass) if contains given substring,True,False
2,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",267,24952ce05ce6dcbd370ccc3b39d410edeab8e1cf420130a83cf9388df6bcfdc3,brevity,Absolute text length,1,9223372036854775807
3,Missing,"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",True,53e4c7163f49fdc7727286e638ff07bcb570faaa334456775c616c2f4ad3eb3f,substring_match,True (pass) if contains given substring,True,False


In [10]:
print("Unstack for nicer manual review")
df_result.set_index(["aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


aiconfig_output,brevity,substring_match
"1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit ""Top of the Rock"", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",267,True
Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience,160,False


In [11]:
import os
import openai

from aiconfig.eval.api import metrics

openai.api_key = os.getenv("OPENAI_API_KEY")

await metrics.gpt3_5_text_ratings("one two three")

CustomMetricPydanticObject(data={
  "conciseness_rating": 3,
  "conciseness_confidence": 0.8,
  "conciseness_reasoning": "The text is short and simple."
})
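
The model-graded metric returns a struct rather than a single number, so downstream code usually needs to pull out the fields it cares about. The sketch below works on a plain dict with the same keys as the output above; `ConcisenessRating`, `parse_rating`, and the `MIN_CONFIDENCE` threshold are hypothetical and only illustrate one way to consume such a result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConcisenessRating:
    rating: int
    confidence: float
    reasoning: str


def parse_rating(data: dict) -> ConcisenessRating:
    """Convert the raw rating fields into a typed record."""
    return ConcisenessRating(
        rating=int(data["conciseness_rating"]),
        confidence=float(data["conciseness_confidence"]),
        reasoning=str(data["conciseness_reasoning"]),
    )


# Same keys and values as the struct printed above.
raw = {
    "conciseness_rating": 3,
    "conciseness_confidence": 0.8,
    "conciseness_reasoning": "The text is short and simple.",
}

rating = parse_rating(raw)

# Hypothetical threshold: only trust ratings the grader is reasonably sure about.
MIN_CONFIDENCE = 0.5
is_trusted = rating.confidence >= MIN_CONFIDENCE
print(rating, is_trusted)
```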