# Promptfoo-style eval without promptfoo

Goal: Run test-suite-style eval (like Promptfoo) with completely custom components, i.e. without using Promptfoo.

In this case, you have two options:
1. Run with inputs. The library runs the AIConfig for you first, then evaluates the outputs.
2. Run with outputs only. You run the AIConfig yourself and save the outputs for eval.

Run the notebook in order for an example of each.

Assumptions:
* You have a parametrized AIConfig with a test input called "the_query", like this: 
`"input": "{{the_query}}"`
* You have some evaluation criteria in mind for the AIConfig's text output.
* Promptfoo integration does not meet your needs, e.g.
  * You want to run the AIConfig yourself instead of handing control to Promptfoo
  * You need to scale beyond what Promptfoo can reasonably handle
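
To make the first assumption concrete, here is a minimal sketch of how a `{{the_query}}` placeholder could be filled in with a test input. The `render_template` helper is hypothetical and written only for illustration; it is not AIConfig's own templating code, which the library applies for you when it resolves the prompt.

```python
import re


def render_template(template: str, params: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with params[name] (illustrative only)."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched rather than raising.
        return params.get(key, match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Prompt text taken from the example AIConfig used later in this notebook.
prompt_template = "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
rendered = render_template(prompt_template, {"the_query": "iconic midtown skyscrapers"})
print(rendered)
```

Each test input in the suite plays the role of the `the_query` value here.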

In [4]:
# Package installs & environment setup
!pip3 install lastmile-utils --force
# If pip reports errors, you can generally ignore them. Just make sure the
# version printed below matches the one specified in the
# aiconfig/python/requirements.txt file (or is a higher version).
!pip3 list | grep lastmile-utils

import openai

# Create ~/.env file with this line: `export OPENAI_API_KEY=<your key here>`
# You can get your key from https://platform.openai.com/api-keys 
import dotenv
import os
dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

lastmile-utils               0.0.9


In [2]:
print("Imports and set log level")

import logging

import pandas as pd
import lastmile_utils.lib.jupyter as jupyter_utils

pd.set_option("display.max_colwidth", None)

from aiconfig.eval.api import (
    brevity,
    substring_match,
    run_test_suite_with_inputs,
    TestSuiteWithInputsSettings,
)

jupyter_utils.set_log_level(logging.WARNING)



Imports and set log level





## Option 1: provide inputs, library runs AIConfig for you
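
The cell below pairs each input with two metrics. As a rough mental model, the plain-Python stand-ins here mimic what those metrics report, based on the descriptions the library prints ("Absolute text length" and "1.0 (pass) if contains given substring"). The names `brevity_sketch` and `substring_match_sketch` are hypothetical; the real `brevity` and `substring_match` produce `Metric` objects and may differ in detail.

```python
from typing import Callable


def brevity_sketch(text: str) -> float:
    """Assumed behavior: the metric value is the character length of the output."""
    return float(len(text))


def substring_match_sketch(
    substring: str, case_sensitive: bool = True
) -> Callable[[str], float]:
    """Return a pass/fail function: 1.0 if `substring` occurs in the text, else 0.0."""

    def _fn(text: str) -> float:
        if case_sensitive:
            return 1.0 if substring in text else 0.0
        # Case-insensitive comparison: lowercase both sides before checking.
        return 1.0 if substring.lower() in text.lower() else 0.0

    return _fn


# Output string copied from the results shown later in this notebook.
output = "1. Top of the Rock\n2. Empire State Building\n3. NBC Studio Tour"
length = brevity_sketch(output)
passed = substring_match_sketch("empire state building", case_sensitive=False)(output)
print(length, passed)
```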

In [3]:
print(
    """
    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
"""
)


ts_settings = TestSuiteWithInputsSettings(
    prompt_name="gen_itinerary",
    aiconfig_path="./travel_parametrized.aiconfig.json",
)

# Each of these pairs will be used to construct test cases just below.
# For each pair (input, expected_substring) we define a test case that says,
# "When I run this input through this AIConfig,
# I expect the output to contain this particular substring".

# Each test input is substituted for "the_query" in the input prompt:
# "Tell me 3 fun attractions related to {{the_query}} to do in NYC."
# See the aiconfig (python/src/aiconfig/eval/custom_eval/examples/travel/travel_parametrized.aiconfig.json).

# For example, when we call `substring_match(substring, case_sensitive=False)` below
# with substring == "Empire State Building", we are telling the library to create a
# boolean metric (i.e. a pass/fail test case) that passes (value == 1.0) if
# "empire state building" appears, ignoring case, in the AIConfig output
# when the AIConfig is given the input "iconic midtown skyscrapers".
test_inputs_with_substrings = [
    ("different kinds of cuisines", "Magnolia Bakery"),
    ("iconic midtown skyscrapers", "Empire State Building"),
]

test_suite_with_inputs = []
for test_input, substring in test_inputs_with_substrings:
    # Add the brevity metric
    test_fn1 = brevity
    test_suite_with_inputs.append((test_input, test_fn1))
    # Add substring check function
    test_fn2 = substring_match(substring, case_sensitive=False)
    test_suite_with_inputs.append((test_input, test_fn2))


    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.



In [9]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_input, fn in test_suite_with_inputs:
    print("\nTest input:\n", test_input, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test input:
 different kinds of cuisines 
Function:
 Metric(calculate=<function _calculate_brevity at 0x15133f240>, interpretation={
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1.0,
  "worst_value": Infinity,
  "extra_metadata": {}
})

Test input:
 different kinds of cuisines 
Function:
 Metric(calculate=<function substring_match.<locals>._fn at 0x16a140e00>, interpretation={
  "name": "substring_match",
  "description": "1.0 (pass) if contains given substring",
  "best_value": 1.0,
  "worst_value": 0.0,
  "extra_metadata": {
    "substring": "Magnolia Bakery",
    "case_sensitive": false
  }
})

Test input:
 iconic midtown skyscrapers 
Function:
 Metric(calculate=<function _calculate_brevity at 0x15133f240>, interpretation={
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1.0,
  "worst_value": Infinity,
  "extra_metadata": {}
})

Tes

In [13]:
!pip list | grep lastmile-utils

lastmile-utils               0.0.9


In [12]:
print("Run the eval interface (option 1, with inputs)")

df_result = await run_test_suite_with_inputs(
    test_suite=test_suite_with_inputs,
    settings=ts_settings,
)

print("Raw output")
df_result

Run the eval interface (option 1, with inputs)


AttributeError: module 'lastmile_utils.lib.core.api' has no attribute 'hash_id'

In [5]:
print("Unstack for nicer manual review")
df_result.set_index(["input", "aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


| input | aiconfig_output | brevity | substring_match |
|---|---|---|---|
| different kinds of cuisines | Visit Chelsea Market for diverse cuisine. Join Chinatown Food Tour for Chinese dishes. Explore Little Italy on a culinary walking tour. | 135.0 | 0.0 |
| iconic midtown skyscrapers | 1. Top of the Rock\n2. Empire State Building\n3. NBC Studio Tour | 62.0 | 1.0 |
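
The `set_index(...).value.unstack("metric_name")` call reshapes the long result (one row per test case and metric) into a wide table (one row per test case, one column per metric). A dependency-free sketch of that reshaping, using the values shown above and keying on `input` only for brevity; `pivot_by_metric` is a hypothetical helper, not part of pandas or AIConfig:

```python
from collections import defaultdict

# Long format: one row per (input, metric) pair, values copied from the table above.
long_rows = [
    {"input": "different kinds of cuisines", "metric_name": "brevity", "value": 135.0},
    {"input": "different kinds of cuisines", "metric_name": "substring_match", "value": 0.0},
    {"input": "iconic midtown skyscrapers", "metric_name": "brevity", "value": 62.0},
    {"input": "iconic midtown skyscrapers", "metric_name": "substring_match", "value": 1.0},
]


def pivot_by_metric(rows, key="input"):
    """Group rows by `key` and spread each metric_name into its own column."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row["metric_name"]] = row["value"]
    return dict(wide)


wide_rows = pivot_by_metric(long_rows)
for test_input, metrics in wide_rows.items():
    print(test_input, metrics)
```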


## Option 2: Run eval on already-computed AIConfig outputs.
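
Conceptually, evaluating pre-computed outputs means applying each metric function directly to each saved string, with no model call in between. A minimal sketch under that assumption, using plain functions in place of the library's `Metric` objects (`evaluate_outputs_only` and the lambda metrics are hypothetical, for illustration only):

```python
def evaluate_outputs_only(test_suite):
    """Apply each metric to its paired output and collect one result row per pair."""
    rows = []
    for output, (metric_name, metric_fn) in test_suite:
        rows.append(
            {
                "aiconfig_output": output,
                "metric_name": metric_name,
                "value": metric_fn(output),
            }
        )
    return rows


# A saved AIConfig output, copied from the test cases in this section.
saved_output = (
    "Begin at Chelsea Market for diverse food options. "
    "Continue to Queens for immersive food tours. "
    "Conclude at Smorgasburg for unique outdoor food market experience"
)

suite = [
    (saved_output, ("brevity", lambda text: float(len(text)))),
    (
        saved_output,
        ("substring_match", lambda text: float("magnolia bakery" in text.lower())),
    ),
]

results = evaluate_outputs_only(suite)
for row in results:
    print(row["metric_name"], row["value"])
```

Because nothing is sent to a model, this path is cheap to rerun whenever the criteria change.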

In [6]:
print("Define outputs to test and criteria, similar to option 1.")


from aiconfig.eval.api import (
    brevity,
    substring_match,
    run_test_suite_outputs_only,
)


# This is similar to "test_inputs_with_substrings" above, but we have the AIConfig *outputs*
# in the test cases, rather than the inputs. The library will evaluate these strings directly
# because there is no need to run the AIConfig to generate the outputs.
test_outputs_with_substrings = [
    (
        "Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience",
        "Magnolia Bakery"
    ),
    (
        "1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit \"Top of the Rock\", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities.",
        "Empire State Building"
    )
]



test_suite_outputs_only = []
for test_output, substring in test_outputs_with_substrings:
    # Add the brevity metric
    test_fn1 = brevity
    test_suite_outputs_only.append((test_output, test_fn1))
    # Add substring check function
    test_fn2 = substring_match(substring, case_sensitive=False)
    test_suite_outputs_only.append(
        (test_output, test_fn2)
    )

Define outputs to test and criteria, similar to option 1.


In [7]:
print("If you like, you can inspect the test suite before passing it to the evaluation library.")

for test_output, fn in test_suite_outputs_only:
    print("\nTest output:\n", test_output, "\nFunction:\n", fn)

If you like, you can inspect the test suite before passing it to the evaluation library.

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 Metric(calculate=<function _calculate_brevity at 0x142b18ca0>, interpretation={
  "name": "brevity",
  "description": "Absolute text length",
  "best_value": 1.0,
  "worst_value": Infinity,
  "extra_metadata": {}
})

Test output:
 Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience 
Function:
 Metric(calculate=<function substring_match.<locals>._fn at 0x142ba5ab0>, interpretation={
  "name": "substring_match",
  "description": "1.0 (pass) if contains given substring",
  "best_value": 1.0,
  "worst_value": 0.0,
  "extra_metadata": {
    "substring": "Magnolia Bakery",
    "case_sensitive": false
  }
})

Test

In [8]:
print("Run the eval library")
df_result = await run_test_suite_outputs_only(
    test_suite=test_suite_outputs_only,
)
print("Raw output")
df_result

Run the eval library
Raw output


| | input | aiconfig_output | value | metric_id | metric_name | metric_description | best_possible_value | worst_possible_value |
|---|---|---|---|---|---|---|---|---|
| 0 | Missing | Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience | 160.0 | 1370aa8bacbe156352bf5a0d2cbf2b8dd8d54362e3e86809c7a4ae91e52530b4 | brevity | Absolute text length | 1.0 | inf |
| 1 | Missing | Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience | 0.0 | 2b849774ca14c9e75f3492f100b190cd0ff0a5e4a5a7f21f318b1ba0ea239054 | substring_match | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 2 | Missing | 1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit "Top of the Rock", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities. | 267.0 | 1370aa8bacbe156352bf5a0d2cbf2b8dd8d54362e3e86809c7a4ae91e52530b4 | brevity | Absolute text length | 1.0 | inf |
| 3 | Missing | 1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit "Top of the Rock", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities. | 1.0 | e09e60224d3b9ca180ca934cbabf904a4e63843ee5f8bf060c062504dceab519 | substring_match | 1.0 (pass) if contains given substring | 1.0 | 0.0 |


In [9]:
print("Unstack for nicer manual review")
df_result.set_index(["aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


| aiconfig_output | brevity | substring_match |
|---|---|---|
| 1. Empire State Building: Observation deck visit, explore exhibits and historical displays. 2. Rockefeller Center: Visit "Top of the Rock", ice-skating, NBC Studio tour, shopping and dining. 3. Chrysler Building: Admire exterior and iconic spire, photo opportunities. | 267.0 | 1.0 |
| Begin at Chelsea Market for diverse food options. Continue to Queens for immersive food tours. Conclude at Smorgasburg for unique outdoor food market experience | 160.0 | 0.0 |
