# 📓 Comprehensiveness Evaluations

In many ways, feedback functions can be thought of as LLM apps themselves: given
text, they return some result. Thinking of them this way, we can use TruLens to
evaluate and track feedback quality. We can even do this for different models
(e.g. gpt-4o) or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation on a set of test cases generated from
human-annotated datasets. In particular, we generate test cases from
[MeetingBank](https://arxiv.org/abs/2305.17529) to evaluate our
comprehensiveness feedback function.

MeetingBank is one of the datasets dedicated to automated evaluation of
summarization tasks, which are closely related to comprehensiveness evaluation
in RAG: the retrieved context plays the role of the source, and the response
plays the role of the summary. It contains human annotations in the form of
numerical scores (**1** to **5**).

For evaluating comprehensiveness feedback functions, we use the annotated
"informativeness" scores, a measure of how well the summaries capture all the
main points of the meeting segment. A good summary should contain all and only
the important information of the source. These scores are normalized to a
**0** to **1** range to match the output of feedback functions, and serve as
our **expected_score**.
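
As a rough illustration of the normalization step, here is a minimal sketch. It assumes a linear min-max rescaling of the 1 to 5 scale; the exact formula used by the benchmark generator is not stated in this notebook, and `normalize_likert` is a hypothetical helper name, not part of TruLens.

```python
def normalize_likert(score, low=1, high=5):
    """Linearly map a score on [low, high] onto [0, 1].

    Assumption: plain min-max rescaling; the benchmark generator may
    aggregate or rescale annotations differently.
    """
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# A mid-scale annotation of 3 lands in the middle of the unit interval.
print(normalize_likert(3))
```

Under this assumption, the lowest annotation maps to 0 and the highest to 1, so the expected scores share a range with the feedback function outputs.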

In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from trulens.core import Feedback
from trulens.core import Select
from trulens.core import TruSession
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI as fOpenAI

In [None]:
from test_cases import generate_meetingbank_comprehensiveness_benchmark

# Materialize the generated test cases once into a list.
comprehensiveness_golden_set = list(
    generate_meetingbank_comprehensiveness_benchmark(
        human_annotation_file_path="./datasets/meetingbank/human_scoring.json",
        meetingbank_file_path="YOUR_LOCAL_DOWNLOAD_PATH/MeetingBank/Metadata/MeetingBank.json",
    )
)

In [None]:
comprehensiveness_golden_set[:3]

In [None]:
os.environ["OPENAI_API_KEY"] = "sk-..."  # used by the OpenAI providers below

In [None]:
session = TruSession()

provider_gpt_4o = fOpenAI(model_engine="gpt-4o")

provider_gpt_4o_mini = fOpenAI(model_engine="gpt-4o-mini")

In [None]:
# comprehensiveness of summary with transcript as reference

f_comprehensiveness_openai_gpt_4o = Feedback(
    provider_gpt_4o.comprehensiveness_with_cot_reasons
).on_input_output()

f_comprehensiveness_openai_gpt_4o_mini = Feedback(
    provider_gpt_4o_mini.comprehensiveness_with_cot_reasons
).on_input_output()

In [None]:
# Build a GroundTruthAgreement object from the golden set of test cases.
ground_truth = GroundTruthAgreement(
    comprehensiveness_golden_set, provider=fOpenAI()
)

# Wrap its absolute_error method in a Feedback; aggregated over records, this
# gives the mean absolute error.
f_mae = (
    Feedback(ground_truth.absolute_error, name="Mean Absolute Error")
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)

In [None]:
scores_gpt_4o = []
scores_gpt_4o_mini = []
true_scores = []  # human preferences / scores

for i in range(190, len(comprehensiveness_golden_set)):
    source = comprehensiveness_golden_set[i]["query"]
    summary = comprehensiveness_golden_set[i]["response"]
    expected_score = comprehensiveness_golden_set[i]["expected_score"]

    feedback_score_gpt_4o = f_comprehensiveness_openai_gpt_4o(source, summary)[
        0
    ]
    feedback_score_gpt_4o_mini = f_comprehensiveness_openai_gpt_4o_mini(
        source, summary
    )[0]

    scores_gpt_4o.append(feedback_score_gpt_4o)
    scores_gpt_4o_mini.append(feedback_score_gpt_4o_mini)
    true_scores.append(expected_score)

    df_results = pd.DataFrame({
        "scores (gpt-4o)": scores_gpt_4o,
        "scores (gpt-4o-mini)": scores_gpt_4o_mini,
        "expected score": true_scores,
    })

    # Checkpoint the results collected so far to a CSV file on every iteration
    df_results.to_csv(
        "./results/results_comprehensiveness_benchmark_new_3.csv", index=False
    )

In [None]:
mae_gpt_4o = sum(
    abs(score - true_score)
    for score, true_score in zip(scores_gpt_4o, true_scores)
) / len(scores_gpt_4o)

mae_gpt_4o_mini = sum(
    abs(score - true_score)
    for score, true_score in zip(scores_gpt_4o_mini, true_scores)
) / len(scores_gpt_4o_mini)

In [None]:
print(f"MAE gpt-4o: {mae_gpt_4o}")
print(f"MAE gpt-4o-mini: {mae_gpt_4o_mini}")
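
Because the scoring loop writes its results to a CSV file, the same metric can also be recomputed later without calling the models again. The sketch below is one way to do that with the standard library only. It assumes the column layout written above (one column per model plus `expected score`); `mae_per_model` is a hypothetical helper, and the inline sample data is made up purely for illustration.

```python
import csv
import io


def mae_per_model(csv_file, expected_col="expected score"):
    """Return {score column: mean absolute error vs. expected_col}."""
    rows = list(csv.DictReader(csv_file))
    if not rows:
        raise ValueError("the CSV contains no data rows")
    # Every column other than the expected score is treated as a model.
    score_cols = [col for col in rows[0] if col != expected_col]
    return {
        col: sum(
            abs(float(row[col]) - float(row[expected_col])) for row in rows
        )
        / len(rows)
        for col in score_cols
    }


# Illustrative data only, not real benchmark results.
sample = io.StringIO(
    "scores (gpt-4o),scores (gpt-4o-mini),expected score\n"
    "0.5,0.0,0.75\n"
    "1.0,1.0,0.75\n"
)
result = mae_per_model(sample)
print(result)
```

To use it on the saved results, pass an open file handle for the CSV instead of the in-memory sample.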

## Visualizing LLM alignment with human scores using absolute errors

In [None]:
# Assumes scores_gpt_4o and true_scores are flat lists of model comprehensiveness
# scores and their corresponding human-annotated expected scores

# Calculate the absolute errors
errors = np.abs(np.array(scores_gpt_4o) - np.array(true_scores))

# Scatter plot of scores vs true_scores
plt.figure(figsize=(10, 5))

# First subplot: scatter plot with color-coded errors
plt.subplot(1, 2, 1)
scatter = plt.scatter(scores_gpt_4o, true_scores, c=errors, cmap="viridis")
plt.colorbar(scatter, label="Absolute Error")
plt.plot(
    [0, 1], [0, 1], "r--", label="Perfect Alignment"
)  # Line of perfect alignment
plt.xlabel("Model Scores")
plt.ylabel("True Scores")
plt.title("Model (GPT-4o) Scores vs. True Scores")
plt.legend()

# Second subplot: Error across score ranges
plt.subplot(1, 2, 2)
plt.scatter(scores_gpt_4o, errors, color="blue")
plt.xlabel("Model Scores")
plt.ylabel("Absolute Error")
plt.title("Error Across Score Ranges")

plt.tight_layout()
plt.show()