## Meta Evaluation - evaluating your LLM-as-judge with TruLens

Meta-evaluation is the process of evaluating evaluation methods themselves. Here we measure and benchmark the performance of LLM-based evaluators (also known as LLM-as-judge), where performance primarily means human alignment: how closely the generated scores agree with human judgments.

In TruLens, we implement this as a special case of GroundTruth evaluation, since we canonically regard human preferences as the ground truth in most LLM tasks.

For experiment tracking, we provide a suite of automatic metric computations via GroundTruthAggregator, such as the mean absolute error (MAE) and expected calibration error (ECE).

In [None]:
# Import the ground truth feedback and aggregator classes
from trulens_eval.feedback import GroundTruthAgreement, GroundTruthAggregator
from trulens_eval import Tru
import numpy as np

tru = Tru()
tru.reset_database()

golden_set = [
    {
        "query": "who are the Apple's competitors?",
        "response": "Apple competitors include Samsung, Google, and Microsoft.",
        "expected_score": 1.0,  # groundtruth score annotated by human
    },
    {
        "query": "what is the capital of France?",
        "response": "Paris is the capital of France.",
        "expected_score": 1.0,
    },
    {
        "query": "what is the capital of Spain?",
        "response": "I love going to Spain.",
        "expected_score": 0,
    },
]
# Create a GroundTruthAgreement object from the golden set
ground_truth = GroundTruthAgreement(golden_set)
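
Before running a benchmark, it can help to check that every golden set entry has the fields the rest of this notebook reads. The sketch below is not part of TruLens: `validate_golden_set` and `REQUIRED_KEYS` are hypothetical names, and the requirement that `expected_score` lies in [0, 1] is an assumption based on the labels used here.

```python
from typing import Any, Dict, List

# Keys read from each golden set entry elsewhere in this notebook.
REQUIRED_KEYS = ("query", "response", "expected_score")


def validate_golden_set(entries: List[Dict[str, Any]]) -> None:
    """Raise ValueError if an entry is missing a key or has an out-of-range score.

    Hypothetical helper; the [0, 1] score range is an assumption.
    """
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"entry {index} is missing keys: {missing}")
        score = float(entry["expected_score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"entry {index} has expected_score {score} outside [0, 1]")


# Small made-up example; passes silently when every entry is well formed.
validate_golden_set(
    [{"query": "what is the capital of France?",
      "response": "Paris is the capital of France.",
      "expected_score": 1.0}]
)
```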

In [None]:
from trulens_eval.feedback import Cortex, OpenAI

# provider = Cortex(model_engine="mistral-large")
provider = OpenAI(model_engine="gpt-4o")

In [None]:
from typing import Tuple


# output is feedback_score
def context_relevance_ff_to_score(input, output, benchmark_params) -> float:
    return provider.context_relevance(
        question=input,
        context=output,
        temperature=benchmark_params["temperature"],
    )


# output is (feedback_score, confidence_score)
def context_relevance_ff_to_score_with_confidence(
    input, output, benchmark_params
) -> Tuple[float, float]:
    return provider.context_relevance_verb_confidence(
        question=input,
        context=output,
        temperature=benchmark_params["temperature"],
    )

### Collect all prompts and responses from the golden set, and pass the expected scores to GroundTruthAggregator as ground truth labels

In [None]:
prompts = [entry["query"] for entry in golden_set]
responses = [entry["response"] for entry in golden_set]

true_labels = [entry["expected_score"] for entry in golden_set]

mae_agg_func = GroundTruthAggregator(true_labels=true_labels).mae
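
To make the metric concrete, here is a minimal sketch of mean absolute error in plain Python. It assumes the `mae` aggregator computes the standard definition (the average of |label - score| over all examples); `mean_absolute_error` is a hypothetical helper, not the TruLens implementation, and the judge scores are made up for illustration.

```python
from typing import Sequence


def mean_absolute_error(true_labels: Sequence[float], scores: Sequence[float]) -> float:
    """Average absolute gap between human labels and judge scores."""
    if len(true_labels) != len(scores):
        raise ValueError("true_labels and scores must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")
    return sum(abs(t - s) for t, s in zip(true_labels, scores)) / len(true_labels)


# Hypothetical judge scores, invented for this example only.
example_labels = [1.0, 1.0, 0.0]
example_scores = [1.0, 0.5, 0.0]
example_mae = mean_absolute_error(example_labels, example_scores)
print(example_mae)  # one gap of 0.5 averaged over three examples
```

A lower value means the judge's scores sit closer to the human annotations; 0 means they match exactly.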

In [None]:
from trulens_eval.feedback.benchmark_frameworks.tru_benchmark_experiment import (
    BenchmarkParams,
)

tru_benchmark_arctic = tru.BenchmarkExperiment(
    app_id="MAE",
    ground_truth=golden_set,
    trace_to_score_fn=context_relevance_ff_to_score,
    agg_funcs=[mae_agg_func],
    benchmark_params=BenchmarkParams(temperature=0.5),
)

In [None]:
with tru_benchmark_arctic as recording:
    feedback_res = tru_benchmark_arctic.app.collect_feedback_scores()

### Sanity check: compare the generated feedback scores with the passed-in ground truth labels [1, 1, 0]

In [None]:
feedback_res  # feedback scores generated by our context relevance feedback function
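
With more than a few examples, comparing by eye gets tedious. The sketch below lists the examples where the judge and the human label disagree by more than a tolerance. It assumes the scores are a flat sequence of floats in the same order as the labels; `find_disagreements`, the tolerance value, and the scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def find_disagreements(
    true_labels: Sequence[float],
    scores: Sequence[float],
    tolerance: float = 0.25,
) -> List[Tuple[int, float, float]]:
    """Return (index, label, score) for every example whose gap exceeds tolerance."""
    if len(true_labels) != len(scores):
        raise ValueError("true_labels and scores must have the same length")
    return [
        (index, label, score)
        for index, (label, score) in enumerate(zip(true_labels, scores))
        if abs(label - score) > tolerance
    ]


# Hypothetical judge scores, invented for this example only.
flagged = find_disagreements([1.0, 1.0, 0.0], [0.9, 1.0, 0.6])
print(flagged)  # only the third example exceeds the tolerance
```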

In [None]:
tru.get_leaderboard(app_ids=[])

In [None]:
custom_gt_aggr_fnc = GroundTruthAggregator(true_labels=true_labels)
custom_gt_aggr_fnc.custom_aggr

In [None]:
ece_agg_func = GroundTruthAggregator(true_labels=true_labels).ece
tru_benchmark_arctic_calibration = tru.BenchmarkExperiment(
    app_id="Expected Calibration Error (ECE)",
    ground_truth=golden_set,
    trace_to_score_fn=context_relevance_ff_to_score_with_confidence,
    agg_funcs=[ece_agg_func],
    benchmark_params=BenchmarkParams(temperature=0),
)
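
For intuition, expected calibration error groups predictions into confidence bins and takes the weighted average of |accuracy - mean confidence| across bins. The sketch below is one common formulation, not necessarily the one TruLens uses: the equal-width bins, the default bin count, and the rule that a score counts as correct when it falls on the same side of 0.5 as the label are all assumptions, and `expected_calibration_error` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    true_labels: Sequence[float],
    scored: Sequence[Tuple[float, float]],
    n_bins: int = 5,
) -> float:
    """ECE over (score, confidence) pairs using equal-width confidence bins."""
    if len(true_labels) != len(scored):
        raise ValueError("true_labels and scored must have the same length")
    if not scored:
        raise ValueError("at least one scored example is required")

    # Each bin collects (is_correct, confidence) pairs.
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for label, (score, confidence) in zip(true_labels, scored):
        # Assumption: correct means score and label agree after thresholding at 0.5.
        is_correct = 1.0 if (score >= 0.5) == (label >= 0.5) else 0.0
        # Clamp so a confidence of exactly 1.0 lands in the last bin.
        index = min(max(int(confidence * n_bins), 0), n_bins - 1)
        bins[index].append((is_correct, confidence))

    total = len(scored)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(correct for correct, _ in members) / len(members)
        mean_confidence = sum(conf for _, conf in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_confidence)
    return ece


# Hypothetical (score, confidence) pairs, invented for this example only.
example_ece = expected_calibration_error(
    [1.0, 1.0, 0.0],
    [(1.0, 0.9), (1.0, 0.7), (1.0, 0.9)],
)
print(example_ece)
```

A value near 0 means the judge's stated confidence tracks how often it is actually right; larger values indicate over- or under-confidence.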

In [None]:
with tru_benchmark_arctic_calibration as recording:
    feedback_results = (
        tru_benchmark_arctic_calibration.app.collect_feedback_scores()
    )

In [None]:
feedback_results  # a tuple of (generated_feedback_scores, confidence_scores) from our context relevance feedback function

In [None]:
tru.get_leaderboard(app_ids=[])