# Promptfoo-style eval without promptfoo

Goal: run a test-suite-style eval (like Promptfoo's) with completely custom components, that is, without using Promptfoo.

In this case, you have two options:
1. Run with inputs. The library runs the AIConfig for you first, then evaluates the outputs.
2. Run with outputs only. You run the AIConfig yourself and save the outputs for eval.

Run the notebook in order for an example of each.

Assumptions:
* I have a parametrized AIConfig with a test input called "the_query", like this: `"input": "{{the_query}}"`
* I have some evaluation criteria in mind for the AIConfig's text output.
* The Promptfoo integration does not meet my needs, e.g.:
  * I want to run the AIConfig myself instead of handing control to Promptfoo
  * I need to scale beyond what Promptfoo can reasonably handle

In [1]:
!pip3 install lastmile-utils

print("Imports and set log level")

import itertools
import logging

import pandas as pd
import lastmile_utils.lib.jupyter as jupyter_utils

pd.set_option("display.max_colwidth", None)

from aiconfig.eval.lib import (
    TestSuiteWithInputsSettings,
    UserTestSuiteWithInputs,
    brevity,
    substring_match,
    run_test_suite_with_inputs,
)

jupyter_utils.set_log_level(logging.WARNING)



Imports and set log level





## Option 1: provide inputs, library runs AIConfig for you

In [2]:
print("""
    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
""")



tsconfig = TestSuiteWithInputsSettings(
    {
        "prompt_name": "get_activities",
        "aiconfig_path": "./travel_parametrized.aiconfig.json",
    }
)


test_inputs = [
    "Empire State Building is on fifth avenue. What is the cross street?",
    "What is the best borough?",
]
expected_substrings = [
    "34th street",
    "Brooklyn",
]


test_suite_with_inputs = UserTestSuiteWithInputs(
    [
        (test_input, substring_match(substring, case_sensitive=False))
        for test_input, substring in zip(test_inputs, expected_substrings)
    ]
    + list(itertools.product(test_inputs, [brevity])),
)



    Define test suite with inputs (option 1), 
      as opposed to using pre-computed AIConfig outputs (option 2)
      
      Define list of inputs and test criteria.
      In this case, we are checking brevity for each test case
      as well as checking that each output contains a specific expected substring.
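
The list passed to `UserTestSuiteWithInputs` is a list of (input, metric) pairs. The sketch below builds the same shape with plain strings standing in for the metric objects (an assumption, so that it runs without the library), which makes the zip-plus-product construction easier to see.

```python
import itertools

test_inputs = [
    "Empire State Building is on fifth avenue. What is the cross street?",
    "What is the best borough?",
]
expected_substrings = ["34th street", "Brooklyn"]

# Stand-in labels for metric objects; the real suite uses
# substring_match(...) and brevity from aiconfig.eval.lib.
pairs = [
    (test_input, f"substring_match({substring!r})")
    for test_input, substring in zip(test_inputs, expected_substrings)
] + list(itertools.product(test_inputs, ["brevity"]))

for test_input, metric_label in pairs:
    print(f"{metric_label:35} <- {test_input[:30]}...")
```

Two inputs and two kinds of check give four test cases: one substring check per input (paired by position) plus one brevity check per input.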



In [10]:
print("Run the eval interface (option 1, with inputs)")
df_result = await run_test_suite_with_inputs(
    test_suite=test_suite_with_inputs,
    settings=tsconfig,
)

Run the eval interface (option 1, with inputs)
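
The cell above relies on top-level `await`, which Jupyter supports. In a plain Python script you need to start an event loop yourself. A minimal sketch, using a hypothetical stand-in coroutine (`fake_run_test_suite`) rather than the real `run_test_suite_with_inputs`:

```python
import asyncio


async def fake_run_test_suite(test_suite: list) -> list:
    """Hypothetical stand-in for an async eval entry point."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [{"input": test_input, "metric_name": name} for test_input, name in test_suite]


def main() -> list:
    suite = [("What is the best borough?", "brevity")]
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(fake_run_test_suite(suite))


rows = main()
print(rows)
```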


In [4]:
print("Raw output")
df_result

Raw output


| | input | aiconfig_output | value | metric_name | metric_description | best_value | worst_value |
|---|---|---|---|---|---|---|---|
| 0 | Empire State Building is on fifth avenue. What is the cross street? | The cross street of the Empire State Building is West 34th Street. | 1.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 1 | Empire State Building is on fifth avenue. What is the cross street? | The cross street of the Empire State Building is West 34th Street. | 66.0 | brevity | Absolute text length | 1.0 | inf |
| 2 | What is the best borough? | This is subjective and can vary depending on personal preferences, as each borough in a city can offer different experiences and attractions. For example, in New York City, some people may prefer the vibrant and diverse atmosphere of Manhattan, while others may enjoy the artistic and trendy scene in Brooklyn. Ultimately, the best borough would depend on individual interests, lifestyle, and priorities. | 1.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 3 | What is the best borough? | This is subjective and can vary depending on personal preferences, as each borough in a city can offer different experiences and attractions. For example, in New York City, some people may prefer the vibrant and diverse atmosphere of Manhattan, while others may enjoy the artistic and trendy scene in Brooklyn. Ultimately, the best borough would depend on individual interests, lifestyle, and priorities. | 404.0 | brevity | Absolute text length | 1.0 | inf |
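
The `value` column can be reproduced by hand. The functions below sketch what the two metrics appear to compute, going by their descriptions in the table (text length, and 1.0 or 0.0 for a substring check). They are assumptions for illustration, not the library's implementations.

```python
def brevity_sketch(text: str) -> float:
    """Absolute text length; lower is better."""
    return float(len(text))


def contains_substring_sketch(text: str, substring: str, case_sensitive: bool = False) -> float:
    """1.0 (pass) if text contains substring, else 0.0."""
    if not case_sensitive:
        text, substring = text.lower(), substring.lower()
    return 1.0 if substring in text else 0.0


output = "The cross street of the Empire State Building is West 34th Street."
print(brevity_sketch(output))
print(contains_substring_sketch(output, "34th street"))
```

The case-insensitive match matters here: the expected substring is "34th street" while the output says "34th Street".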


In [5]:
print("Unstack for nicer manual review")
df_result.set_index(["input", "aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


| input | aiconfig_output | brevity | contains_substring |
|---|---|---|---|
| Empire State Building is on fifth avenue. What is the cross street? | The cross street of the Empire State Building is West 34th Street. | 66.0 | 1.0 |
| What is the best borough? | This is subjective and can vary depending on personal preferences, as each borough in a city can offer different experiences and attractions. For example, in New York City, some people may prefer the vibrant and diverse atmosphere of Manhattan, while others may enjoy the artistic and trendy scene in Brooklyn. Ultimately, the best borough would depend on individual interests, lifestyle, and priorities. | 404.0 | 1.0 |
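
Conceptually, the unstack turns one row per (output, metric) into one row per output with a column per metric. A pandas-free sketch of that reshaping, using shortened stand-in strings for the outputs:

```python
from collections import defaultdict

# Long format: one row per (output, metric), values taken from the table above.
long_rows = [
    ("cross-street answer", "contains_substring", 1.0),
    ("cross-street answer", "brevity", 66.0),
    ("best-borough answer", "contains_substring", 1.0),
    ("best-borough answer", "brevity", 404.0),
]

# Wide format: one dict of metric -> value per output.
wide = defaultdict(dict)
for output, metric_name, value in long_rows:
    wide[output][metric_name] = value

for output, metrics in wide.items():
    print(output, metrics["brevity"], metrics["contains_substring"])
```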


## Option 2: run eval on already-computed AIConfig outputs

In [6]:
print("Define outputs to test and criteria, similar to option 1.")
import itertools


from aiconfig.eval.lib import (
    UserTestSuiteOutputsOnly,
    brevity,
    substring_match,
    run_test_suite_outputs_only,
)



test_outputs = [
    "The cross street for the Empire State Building is West 34th Street.",
    'The answer to this question is subjective and depends on individual preferences. All the boroughs of a city have their own unique features, attractions, and advantages. Therefore, it is impossible to determine the "best" borough as it varies depending on personal interests, lifestyle, and priorities.',
]
# Note: expected_substrings is reused from the option 1 cell above.
test_suite_outputs_only = UserTestSuiteOutputsOnly(
    [
        (test_output, substring_match(substring, case_sensitive=False))
        for test_output, substring in zip(test_outputs, expected_substrings)
    ]
    + list(itertools.product(test_outputs, [brevity]))
)

Define outputs to test and criteria, similar to option 1.
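
For option 2 you are responsible for producing and storing the outputs. One way to do that, sketched here with a hypothetical file name and JSON layout (neither is required by the library), is to dump the outputs to disk after your own run and load them back before building the suite:

```python
import json
import tempfile
from pathlib import Path

# Outputs you collected from your own AIConfig run (shortened here).
collected_outputs = [
    "The cross street for the Empire State Building is West 34th Street.",
    "The answer to this question is subjective.",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "aiconfig_outputs.json"  # hypothetical file name
    path.write_text(json.dumps(collected_outputs, indent=2), encoding="utf-8")

    # Later, possibly in another process: load the outputs for eval.
    loaded_outputs = json.loads(path.read_text(encoding="utf-8"))

print(loaded_outputs == collected_outputs)
```

The loaded list can then be used wherever `test_outputs` appears in the cell above.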


In [7]:
print("Run the eval library")
df_result = await run_test_suite_outputs_only(
    test_suite=test_suite_outputs_only,
)

Run the eval library


In [8]:
print("Raw output")
df_result

Raw output


| | input | aiconfig_output | value | metric_name | metric_description | best_value | worst_value |
|---|---|---|---|---|---|---|---|
| 0 | Missing | The cross street for the Empire State Building is West 34th Street. | 1.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 1 | Missing | The answer to this question is subjective and depends on individual preferences. All the boroughs of a city have their own unique features, attractions, and advantages. Therefore, it is impossible to determine the "best" borough as it varies depending on personal interests, lifestyle, and priorities. | 0.0 | contains_substring | 1.0 (pass) if contains given substring | 1.0 | 0.0 |
| 2 | Missing | The cross street for the Empire State Building is West 34th Street. | 67.0 | brevity | Absolute text length | 1.0 | inf |
| 3 | Missing | The answer to this question is subjective and depends on individual preferences. All the boroughs of a city have their own unique features, attractions, and advantages. Therefore, it is impossible to determine the "best" borough as it varies depending on personal interests, lifestyle, and priorities. | 301.0 | brevity | Absolute text length | 1.0 | inf |
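
Row 1 scores 0.0 because the pre-computed borough answer never mentions Brooklyn, unlike the answer generated in option 1. A quick way to confirm this by hand, using a plain case-insensitive check as an assumed equivalent of the library metric:

```python
test_outputs = [
    "The cross street for the Empire State Building is West 34th Street.",
    'The answer to this question is subjective and depends on individual preferences. '
    'All the boroughs of a city have their own unique features, attractions, and advantages. '
    'Therefore, it is impossible to determine the "best" borough as it varies depending on '
    'personal interests, lifestyle, and priorities.',
]
expected_substrings = ["34th street", "Brooklyn"]

# Collect every (substring, output) pair where the substring is absent.
failures = [
    (substring, output)
    for output, substring in zip(test_outputs, expected_substrings)
    if substring.lower() not in output.lower()
]

for substring, output in failures:
    print(f"missing {substring!r} in: {output[:40]}...")
```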


In [9]:
print("Unstack for nicer manual review")
df_result.set_index(["aiconfig_output", "metric_name"])\
        .value.unstack("metric_name")

Unstack for nicer manual review


| aiconfig_output | brevity | contains_substring |
|---|---|---|
| The answer to this question is subjective and depends on individual preferences. All the boroughs of a city have their own unique features, attractions, and advantages. Therefore, it is impossible to determine the "best" borough as it varies depending on personal interests, lifestyle, and priorities. | 301.0 | 0.0 |
| The cross street for the Empire State Building is West 34th Street. | 67.0 | 1.0 |
