## Context Relevance Evaluations

In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return a result. Seen this way, we can use TruLens to evaluate and track the quality of our feedback functions, and to compare different models (e.g. gpt-3.5-turbo and gpt-4) or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation on a set of test cases. You are encouraged to run it yourself and to expand the test cases so that they cover your own scenario or domain.
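As a rough sketch of what expanding the test cases could look like, the snippet below appends domain-specific entries to a stand-in golden set. The `query` and `response` keys match how the set is read later in this notebook. The `expected_score` key, the 0 to 1 score range, the sample entries, and the `validate_case` helper are all assumptions made for illustration, so check `test_cases.py` for the actual schema before adapting it.

```python
# Stand-in for context_relevance_golden_set (the real list lives in test_cases.py).
# The "expected_score" key and every entry below are hypothetical examples.
golden_set = [
    {
        "query": "What is the boiling point of water at sea level?",
        "response": "Water boils at 100 degrees Celsius at sea level.",
        "expected_score": 1.0,
    },
]


def validate_case(case):
    """Check that a test case has the assumed keys and a score in [0, 1]."""
    missing = {"query", "response", "expected_score"} - case.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not 0.0 <= case["expected_score"] <= 1.0:
        raise ValueError("expected_score must be between 0 and 1")
    return case


# Hypothetical domain-specific cases: one relevant context, one irrelevant.
domain_cases = [
    {
        "query": "How do I reset my router?",
        "response": "Hold the reset button for ten seconds until the lights blink.",
        "expected_score": 1.0,
    },
    {
        "query": "How do I reset my router?",
        "response": "Our office is closed on public holidays.",
        "expected_score": 0.0,
    },
]

# Validate each new case before adding it to the golden set.
golden_set.extend(validate_case(case) for case in domain_cases)
print(len(golden_set))
```

Including both clearly relevant and clearly irrelevant contexts for the same query gives the evaluation something to separate; a set with only high expected scores would not reveal a feedback function that rates everything as relevant.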

In [1]:
# Import relevance feedback function
from trulens_eval.feedback import GroundTruthAgreement, OpenAI as fOpenAI
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import context_relevance_golden_set

import openai

In [2]:
import os
os.environ["OPENAI_API_KEY"] = "..."

In [3]:
# Wrap each relevance feedback function in a plain Python function
# so that it can be tracked as an app of its own.
turbo = fOpenAI(model_engine="gpt-3.5-turbo")

# gpt-3.5-turbo: relevance score only
def wrapped_relevance_turbo(input, output):
    return turbo.qs_relevance(input, output)

# gpt-3.5-turbo: relevance score with chain-of-thought reasons
def wrapped_relevance_with_cot_turbo(input, output):
    return turbo.qs_relevance_with_cot_reasons(input, output)

gpt4 = fOpenAI(model_engine="gpt-4")

# gpt-4: relevance score only
def wrapped_relevance_gpt4(input, output):
    return gpt4.qs_relevance(input, output)

# gpt-4: relevance score with chain-of-thought reasons
def wrapped_relevance_with_cot_gpt4(input, output):
    return gpt4.qs_relevance_with_cot_reasons(input, output)

Our golden set is a collection of prompts, responses and expected scores stored in `test_cases.py`. The `numeric_difference` method looks up the expected score for each prompt/response pair by **exact match**, then takes the L1 (absolute) difference between the actual score and the expected score.
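To make the lookup-then-compare step concrete, here is a minimal sketch in plain Python. It is not the TruLens implementation: the `find_expected_score` and `l1_difference` helpers, the `expected_score` key, and the sample entry are hypothetical, and the library's own method may post-process the value differently.

```python
import math

# Hypothetical golden set entry; "expected_score" is an assumed key name.
golden_set = [
    {
        "query": "Who wrote Hamlet?",
        "response": "Hamlet is a tragedy written by William Shakespeare.",
        "expected_score": 1.0,
    },
]


def find_expected_score(golden_set, prompt, response):
    """Return the expected score for an exactly matching prompt/response pair."""
    for case in golden_set:
        if case["query"] == prompt and case["response"] == response:
            return case["expected_score"]
    return None  # no exact match found


def l1_difference(golden_set, prompt, response, score):
    """Absolute difference between the actual and expected score (NaN if no match)."""
    expected = find_expected_score(golden_set, prompt, response)
    if expected is None:
        return math.nan
    return abs(score - expected)


prompt = "Who wrote Hamlet?"
response = "Hamlet is a tragedy written by William Shakespeare."

# Exact match: the expected score is found and compared.
print(l1_difference(golden_set, prompt, response, 0.7))

# A trailing space breaks the exact match, so no expected score is found.
print(l1_difference(golden_set, prompt + " ", response, 0.7))
```

Because the lookup is by exact match, the prompt and response passed to the wrapped feedback function need to be the same strings that appear in the golden set, with no added whitespace or reformatting.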

In [4]:
# Build the ground-truth lookup from the golden set
ground_truth = GroundTruthAgreement(context_relevance_golden_set)
# Create a Feedback object from numeric_difference: the wrapped app's two positional
# arguments supply the prompt and response, and its output supplies the score
f_groundtruth = (
    Feedback(ground_truth.numeric_difference, name="Context Relevance Smoke Test")
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)

✅ In Context Relevance Smoke Test, input prompt will be set to *.__record__.calls[0].args.args[0] .
✅ In Context Relevance Smoke Test, input response will be set to *.__record__.calls[0].args.args[1] .
✅ In Context Relevance Smoke Test, input score will be set to *.__record__.main_output or `Select.RecordOutput` .


In [5]:
tru_wrapped_relevance_turbo = TruBasicApp(wrapped_relevance_turbo, app_id = "context relevance gpt-3.5-turbo", feedbacks=[f_groundtruth])
tru_wrapped_relevance_with_cot_turbo = TruBasicApp(wrapped_relevance_with_cot_turbo, app_id = "context relevance with cot reasoning gpt-3.5-turbo", feedbacks=[f_groundtruth])
tru_wrapped_relevance_gpt4 = TruBasicApp(wrapped_relevance_gpt4, app_id = "context relevance gpt-4", feedbacks=[f_groundtruth])
tru_wrapped_relevance_with_cot_gpt4 = TruBasicApp(wrapped_relevance_with_cot_gpt4, app_id = "context relevance with cot reasoning gpt-4", feedbacks=[f_groundtruth])

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [6]:
# Run every golden-set case through all four wrapped feedback functions
for case in context_relevance_golden_set:
    prompt = case["query"]
    response = case["response"]
    tru_wrapped_relevance_turbo.call_with_record(prompt, response)
    tru_wrapped_relevance_with_cot_turbo.call_with_record(prompt, response)
    tru_wrapped_relevance_gpt4.call_with_record(prompt, response)
    tru_wrapped_relevance_with_cot_gpt4.call_with_record(prompt, response)

Task queue full. Finishing existing tasks.


In [9]:
Tru().get_leaderboard(app_ids=[])

| app_id | Context Relevance Smoke Test | latency | total_cost |
| --- | --- | --- | --- |
| context relevance gpt-3.5-turbo | 0.8 | 0.066667 | 0.000762 |
| context relevance gpt-4 | 0.78 | 0.066667 | 0.015268 |
| context relevance with cot reasoning gpt-4 | 0.733333 | 0.066667 | 0.01956 |
| context relevance with cot reasoning gpt-3.5-turbo | 0.706667 | 0.066667 | 0.000918 |
