## Answer Relevance Feedback Requirements

In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return some result. Thinking of them this way, we can use TruLens to evaluate and track the quality of our feedback functions. We can even do this across different models (e.g. gpt-3.5 and gpt-4) or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation of a set of test cases. You are encouraged to run it yourself and to expand the test cases so that they cover your own scenario or domain.
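As a rough illustration of expanding the test cases, the sketch below appends domain-specific cases to a golden set. The `query` and `response` keys match how this notebook reads the set later on. The `expected_score` key name, the 0-to-1 score range, the `validate_case` helper, and the example entries are all assumptions made for illustration, so check `test_cases.py` for the actual schema before relying on them.

```python
# Hypothetical stand-in for the list imported from test_cases.py.
golden_set = [
    {
        "query": "What is the capital of France?",
        "response": "Paris is the capital of France.",
        "expected_score": 1.0,  # assumed key name and scale
    },
]

# Hypothetical domain-specific cases: one relevant answer, one irrelevant answer.
domain_cases = [
    {
        "query": "How do I reset my router?",
        "response": "Hold the reset button for ten seconds.",
        "expected_score": 1.0,
    },
    {
        "query": "How do I reset my router?",
        "response": "Our office is closed on Sundays.",
        "expected_score": 0.0,
    },
]

REQUIRED_KEYS = {"query", "response", "expected_score"}


def validate_case(case):
    """Reject cases with missing keys or a score outside the assumed 0-1 range."""
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        raise ValueError(f"test case is missing keys: {sorted(missing)}")
    if not 0.0 <= case["expected_score"] <= 1.0:
        raise ValueError("expected_score must be between 0 and 1")
    return case


# Build a new list rather than mutating the imported one.
extended_golden_set = golden_set + [validate_case(c) for c in domain_cases]
```

Validating cases up front is a design choice in this sketch: a malformed entry fails immediately instead of surfacing later as a failed lookup during evaluation.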

In [None]:
# Import relevance feedback function
from trulens_eval.feedback import GroundTruthAgreement, OpenAI
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import answer_relevance_golden_set

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "..."

In [3]:
# Wrap each relevance feedback function in a plain function so TruBasicApp can track it
turbo = OpenAI(model_engine="gpt-3.5-turbo")

# Relevance scored by gpt-3.5-turbo
def wrapped_relevance_turbo(input, output):
    return turbo.relevance(input, output)

# Relevance scored by gpt-3.5-turbo with chain-of-thought reasoning
def wrapped_relevance_with_cot_turbo(input, output):
    return turbo.relevance_with_cot_reasons(input, output)

gpt4 = OpenAI(model_engine="gpt-4")

# Relevance scored by gpt-4
def wrapped_relevance_gpt4(input, output):
    return gpt4.relevance(input, output)

# Relevance scored by gpt-4 with chain-of-thought reasoning
def wrapped_relevance_with_cot_gpt4(input, output):
    return gpt4.relevance_with_cot_reasons(input, output)

Here we set up our golden set, a collection of prompts, responses, and expected scores stored in `test_cases.py`. The `numeric_difference` method looks up the expected score for each prompt/response pair by **exact match**, then takes the L1 (absolute) difference between the actual score and the expected score.
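The lookup-then-compare step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names `lookup_expected_score` and `l1_difference`, the `expected_score` key, and the sample data are hypothetical, and the library may transform the difference before reporting it.

```python
# Hypothetical golden set; the "expected_score" key name is an assumption.
sample_golden_set = [
    {
        "query": "Who wrote Hamlet?",
        "response": "Hamlet was written by William Shakespeare.",
        "expected_score": 1.0,
    },
    {
        "query": "Who wrote Hamlet?",
        "response": "Bananas are rich in potassium.",
        "expected_score": 0.0,
    },
]


def lookup_expected_score(golden_set, prompt, response):
    """Return the expected score for an exactly matching prompt/response pair."""
    for case in golden_set:
        # Exact string match: any change in whitespace or casing means no match.
        if case["query"] == prompt and case["response"] == response:
            return case["expected_score"]
    raise KeyError("no golden-set entry matches this prompt/response pair")


def l1_difference(actual_score, expected_score):
    """Absolute (L1) difference between the actual and expected scores."""
    return abs(actual_score - expected_score)


expected = lookup_expected_score(
    sample_golden_set,
    "Who wrote Hamlet?",
    "Hamlet was written by William Shakespeare.",
)
# 0.75 stands in for a score returned by a relevance feedback function.
difference = l1_difference(0.75, expected)
```

Because the match is exact, the prompt and response passed to the wrapped feedback function must be the same strings stored in the golden set, which is why the notebook feeds the golden set entries straight into the apps.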

In [4]:
# Load the golden set as the ground truth
ground_truth = GroundTruthAgreement(answer_relevance_golden_set)
# Create a Feedback object from numeric_difference: the prompt and response come from the wrapped function's first two arguments, the score from its output
f_groundtruth = Feedback(ground_truth.numeric_difference, name = "Relevance Smoke Test").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()

✅ In Relevance Smoke Test, input prompt will be set to *.__record__.calls[0].args.args[0] .
✅ In Relevance Smoke Test, input response will be set to *.__record__.calls[0].args.args[1] .
✅ In Relevance Smoke Test, input score will be set to *.__record__.main_output or `Select.RecordOutput` .


In [5]:
tru_wrapped_relevance_turbo = TruBasicApp(wrapped_relevance_turbo, app_id = "answer relevance gpt-3.5-turbo", feedbacks=[f_groundtruth])
tru_wrapped_relevance_with_cot_turbo = TruBasicApp(wrapped_relevance_with_cot_turbo, app_id = "answer relevance with cot reasoning gpt-3.5-turbo", feedbacks=[f_groundtruth])
tru_wrapped_relevance_gpt4 = TruBasicApp(wrapped_relevance_gpt4, app_id = "answer relevance gpt-4", feedbacks=[f_groundtruth])
tru_wrapped_relevance_with_cot_gpt4 = TruBasicApp(wrapped_relevance_with_cot_gpt4, app_id = "answer relevance with cot reasoning gpt-4", feedbacks=[f_groundtruth])

In [6]:
for case in answer_relevance_golden_set:
    prompt = case["query"]
    response = case["response"]
    tru_wrapped_relevance_turbo.call_with_record(prompt, response)
    tru_wrapped_relevance_with_cot_turbo.call_with_record(prompt, response)
    tru_wrapped_relevance_gpt4.call_with_record(prompt, response)
    tru_wrapped_relevance_with_cot_gpt4.call_with_record(prompt, response)

Task queue full. Finishing existing tasks.
Task queue full. Finishing existing tasks.


In [7]:
Tru().get_leaderboard(app_ids=[])

| app_id | Relevance Smoke Test | latency | total_cost |
|---|---|---|---|
| answer relevance gpt-3.5-turbo | 0.794118 | 0.058824 | 0.000763 |
| answer relevance gpt-4 | 0.770588 | 0.058824 | 0.015277 |
| answer relevance with cot reasoning gpt-3.5-turbo | 0.770588 | 0.058824 | 0.000908 |
| answer relevance with cot reasoning gpt-4 | 0.758824 | 0.058824 | 0.019336 |
