# 8. Evaluator Agent Benchmark (Comp-Analysis)
**Category:** AI Agent Core Capabilities

**Source:** [e0397123 / comp-analysis](https://github.com/e0397123/comp-analysis)

**Description:** Used to train Critic Agents that evaluate the quality of
dialogue generated by other agents.

**Data Content:** Multi-dimensional dialogue evaluation data that pairs
dialogue-quality scores from various LLMs with human ratings for comparison.

**Paper:** [A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators (AAAI 2024)](https://arxiv.org/abs/2312.15407)

---

**This notebook covers:**
1. Data loading: dialog-level GPT-4 annotations, FED human annotations, turn-level ratings
2. GPT-4 score distributions & inter-rater agreement
3. Human annotation analysis (FED dataset)
4. GPT-4 vs Human alignment comparison
5. Dialog model ranking & turn-level cross-dataset comparison
6. Turn-level dimension correlation heatmap

## 1. Setup

In [None]:
# Install dependencies (uncomment if needed)
# !pip install pandas matplotlib seaborn

In [None]:
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["figure.dpi"] = 100
plt.rcParams["axes.titlesize"] = 13
plt.rcParams["axes.labelsize"] = 11

## 2. Dataset Overview

This benchmark evaluates how well LLMs (GPT-4) can serve as automatic dialogue
evaluators compared to human judges. Data is organized at two levels:

**Dialog-level evaluation** (whole conversation quality):
- 5 evaluation dimensions: Coherence, Diversity, Engagement, Informativeness, Overall
- 5 independent GPT-4 raters per dimension
- Human annotations with multiple annotators
- Source datasets: FED, HEVAL, IEVAL, ConTurE, Reliable, PersonaSee

**Turn-level evaluation** (single response quality):
- 5 evaluation dimensions: Interesting, Relevance, Specificity, Understandability, Overall
- Source datasets: FED-turn, ConTurE-turn, PersonaUSR, PersonaZhao, DailyDialog, TopicalUSR

**Robustness tests:** Perturbed versions to test evaluator consistency.
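
One way to quantify evaluator consistency under perturbation is to compare the scores given to the original and the perturbed version of each item. The sketch below assumes two paired score lists in matching order and a perturbation that is meant to degrade quality. The layout of the repository's robustness files is not described here, so the function name `perturbation_sensitivity` and the toy scores are hypothetical.

```python
import numpy as np


def perturbation_sensitivity(original_scores, perturbed_scores):
    """Summarize how much scores drop after a degrading perturbation.

    Both inputs are assumed to be paired item by item (same order).
    """
    orig = np.asarray(original_scores, dtype=float)
    pert = np.asarray(perturbed_scores, dtype=float)
    if orig.shape != pert.shape:
        raise ValueError("score lists must be paired item by item")
    drop = orig - pert
    return {
        "mean_drop": float(drop.mean()),            # average score decrease
        "frac_lowered": float((drop > 0).mean()),   # share of items scored lower
    }


# Toy scores for illustration only (not taken from the benchmark)
toy_original = [4, 5, 3, 4]
toy_perturbed = [2, 5, 1, 3]
sensitivity = perturbation_sensitivity(toy_original, toy_perturbed)
print(sensitivity)
```

Under these assumptions, a higher `frac_lowered` means the evaluator notices the degradation more often.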

## 3. Data Loading

### 3.1 Clone Repository

In [None]:
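# Clone the repository (skip if already cloned)
import subprocess

REPO_DIR = Path("comp-analysis")
if not REPO_DIR.exists():
    # check=True raises CalledProcessError if the clone fails,
    # so "Repository cloned." is only printed on success
    subprocess.run(
        ["git", "clone", "https://github.com/e0397123/comp-analysis.git", str(REPO_DIR)],
        check=True,
    )
    print("Repository cloned.")
else:
    print(f"Repository already exists at {REPO_DIR}")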

DIALOG_DIR = REPO_DIR / "dialog_level_texts"
TURN_DIR = REPO_DIR / "turn_level_texts"
ROBUST_DIR = REPO_DIR / "robustness_data"

### 3.2 Load Dialog-Level Data

In [None]:
# Load dialog-level GPT-4 annotations
gpt4_files = sorted(DIALOG_DIR.glob("*_gpt4_annotations.json"))

gpt4_dialog = {}
for f in gpt4_files:
    name = f.stem.replace("_gpt4_annotations", "")
    with open(f, "r", encoding="utf-8") as fh:
        gpt4_dialog[name] = json.load(fh)
    n_dialogues = len(list(gpt4_dialog[name].values())[0])
    n_cols = len(gpt4_dialog[name])
    print(f"  {name}: {n_dialogues} dialogues, {n_cols} columns")

print(f"\nTotal dialog-level datasets: {len(gpt4_dialog)}")
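
The loader above reads the number of dialogues from the first column only. A minimal sketch of a sanity check that every column has the same length, assuming each annotation file is a JSON object mapping a column name to a list of scores. The helper names `column_lengths` and `has_uniform_length` are hypothetical, and the toy tables are made up.

```python
def column_lengths(table):
    """Return the number of entries in each column of a dict-of-lists table."""
    return {key: len(values) for key, values in table.items()}


def has_uniform_length(table):
    """True if all columns have the same length (or the table is empty)."""
    return len(set(column_lengths(table).values())) <= 1


# Toy tables for illustration only
good_table = {"coh_1": [1, 2, 3], "coh_2": [2, 2, 3]}
bad_table = {"coh_1": [1, 2, 3], "coh_2": [2, 2]}

print(has_uniform_length(good_table), column_lengths(good_table))
print(has_uniform_length(bad_table), column_lengths(bad_table))
```

In the notebook this could be called as `has_uniform_length(gpt4_dialog[name])` before any per-dialogue aggregation.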

In [None]:
# Load dialog-level human annotations (FED dataset)
human_file = DIALOG_DIR / "fed_human_annotations.json"
with open(human_file, "r", encoding="utf-8") as f:
    fed_human = json.load(f)

print(f"FED human annotations: {len(fed_human)} dialogues")
print(f"Fields per dialogue: {list(fed_human[0].keys())}")

### 3.3 Load Turn-Level Data

In [None]:
# Load turn-level ratings
turn_files = sorted(TURN_DIR.glob("turn_*_ratings.json"))

turn_ratings = {}
for f in turn_files:
    name = f.stem.replace("_ratings", "")
    with open(f, "r", encoding="utf-8") as fh:
        turn_ratings[name] = json.load(fh)
    # Source datasets can differ in size, so report the item count per dataset
    sizes = {ds: len(items) for ds, items in turn_ratings[name].items()}
    print(f"  {name}: {len(sizes)} source datasets, items per dataset: {sizes}")

print(f"\nTotal turn-level rating files: {len(turn_ratings)}")

## 4. Data Schema & Samples

### 4.1 Dialog-Level GPT-4 Annotations

In [None]:
fed_gpt4 = gpt4_dialog["fed"]

print("=== Dialog-Level GPT-4 Annotations (FED) ===")
print(f"Columns: {list(fed_gpt4.keys())}")
print(f"Number of dialogues: {len(list(fed_gpt4.values())[0])}")
print(f"\nDimension naming: <dim>_<rater_id>")
print("  coh = Coherence, div = Diversity, eng = Engagement,")
print("  inf = Informativeness, ovr = Overall")
print(f"\nSample scores (first 5 dialogues, rater 1):")
for dim in ["coh_1", "div_1", "eng_1", "inf_1", "ovr_1"]:
    print(f"  {dim}: {fed_gpt4[dim][:5]}")
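
Given the `<dim>_<rater_id>` naming shown above, a per-dialogue score for one dimension can be obtained by averaging its rater columns. The following is a small sketch under that assumption; `mean_over_raters` is a hypothetical helper, and the toy dictionary is illustrative rather than real benchmark data.

```python
import numpy as np


def mean_over_raters(annotations, dim_key, max_raters=5):
    """Average the <dim_key>_<rater_id> columns of one dimension, per dialogue."""
    cols = [f"{dim_key}_{r}" for r in range(1, max_raters + 1)]
    present = [c for c in cols if c in annotations]
    if not present:
        raise KeyError(f"no columns found for dimension {dim_key!r}")
    # Rows are raters, columns are dialogues
    matrix = np.asarray([annotations[c] for c in present], dtype=float)
    return matrix.mean(axis=0)


# Toy data in the assumed layout (not real scores)
toy_annotations = {"coh_1": [4, 2, 5], "coh_2": [5, 2, 4], "ovr_1": [3, 1, 5]}
coh_mean = mean_over_raters(toy_annotations, "coh")
ovr_mean = mean_over_raters(toy_annotations, "ovr")
print(coh_mean, ovr_mean)
```

Missing rater columns are skipped rather than treated as errors, which mirrors the `if col in fed_gpt4` guards used elsewhere in this notebook.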

### 4.2 Dialog-Level Human Annotations (FED)

In [None]:
sample = fed_human[0]
print("=== Dialog-Level Human Annotations (FED) ===")
print(f"Keys: {list(sample.keys())}")
print(f"Model: {sample['model']}")
print(f"Dialogue ID: {sample['dialogue_id']}")
print(f"Dialog turns: {len(sample['dialog'])}")
print(f"Annotation dimensions: {list(sample['annotations'].keys())}")

print(f"\nSample dialog:")
for turn in sample["dialog"][:4]:
    print(f"  [{turn['speaker']}]: {turn['text'][:80]}...")

print(f"\nSample annotations:")
for dim, scores in list(sample["annotations"].items())[:5]:
    print(f"  {dim}: {scores}")

### 4.3 Turn-Level Ratings

In [None]:
print("=== Turn-Level Ratings ===")
print(f"Rating files: {list(turn_ratings.keys())}")
for name, data in turn_ratings.items():
    first_ds = list(data.keys())[0]
    print(f"\n  {name}:")
    print(f"    Source datasets: {list(data.keys())}")
    print(f"    Sample ({first_ds}, first 5): {data[first_ds][:5]}")

## 5. Exploratory Data Analysis

### 5.1 GPT-4 Score Distribution by Dimension (Dialog-Level)

In [None]:
dimensions = {
    "coh": "Coherence", "div": "Diversity", "eng": "Engagement",
    "inf": "Informativeness", "ovr": "Overall",
}

rows = []
for dim_key, dim_name in dimensions.items():
    for rater in range(1, 6):
        col = f"{dim_key}_{rater}"
        if col in fed_gpt4:
            for score in fed_gpt4[col]:
                rows.append({"dimension": dim_name, "rater": rater, "score": score})

df_gpt4 = pd.DataFrame(rows)

plt.figure(figsize=(12, 6))
sns.boxplot(data=df_gpt4, x="dimension", y="score",
            hue="dimension", palette="Set2", legend=False)
plt.title("GPT-4 Score Distribution by Evaluation Dimension (FED Dialog-Level)")
plt.xlabel("Dimension")
plt.ylabel("Score")
plt.tight_layout()
plt.show()

print("GPT-4 score statistics per dimension:")
print(df_gpt4.groupby("dimension")["score"].describe().round(2).to_string())

### 5.2 Inter-Rater Agreement (GPT-4 Raters)

In [None]:
# For each dialogue, compute the std and the range across the 5 GPT-4 raters,
# then average both over all dialogues
rater_agreement = {}
for dim_key, dim_name in dimensions.items():
    rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                  if f"{dim_key}_{r}" in fed_gpt4]
    if rater_cols:
        rater_matrix = np.array([fed_gpt4[c] for c in rater_cols])  # (5, N)
        rater_agreement[dim_name] = {
            "mean_std": np.mean(np.std(rater_matrix, axis=0)),
            "mean_range": np.mean(np.max(rater_matrix, axis=0)
                                  - np.min(rater_matrix, axis=0)),
        }

agree_df = pd.DataFrame(rater_agreement).T
agree_df.columns = ["Mean Std (across raters)", "Mean Range (across raters)"]

agree_df.plot(kind="bar", figsize=(10, 5), color=["steelblue", "coral"])
plt.title("GPT-4 Inter-Rater Disagreement by Dimension")
plt.ylabel("Score Variation")
plt.xticks(rotation=0)
plt.legend(loc="upper right")
plt.tight_layout()
plt.show()

print(agree_df.round(3).to_string())
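
Standard deviation and range measure how far the raters' absolute scores are apart. A complementary view is whether the raters order the dialogues the same way, which can be approximated by the mean pairwise Pearson correlation between raters. This is a sketch, not the method used in the paper; `mean_pairwise_pearson` is a hypothetical helper and the toy matrix is made up.

```python
import itertools

import numpy as np


def mean_pairwise_pearson(rater_matrix):
    """Average Pearson r over all rater pairs.

    rater_matrix has shape (n_raters, n_items). A rater with constant
    scores yields NaN for every pair it is part of.
    """
    matrix = np.asarray(rater_matrix, dtype=float)
    pairs = itertools.combinations(range(matrix.shape[0]), 2)
    rs = [np.corrcoef(matrix[i], matrix[j])[0, 1] for i, j in pairs]
    return float(np.mean(rs))


# Toy matrix: raters 1 and 2 agree on the ordering, rater 3 reverses it
toy_raters = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
]
agreement = mean_pairwise_pearson(toy_raters)
print(round(agreement, 3))
```

In the notebook, the `rater_matrix` built per dimension in the cell above could be passed in directly.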

### 5.3 Human Annotation Distribution (FED)

In [None]:
# Extract human ratings per dimension
human_dims = {}
for item in fed_human:
    for dim, scores in item["annotations"].items():
        if dim not in human_dims:
            human_dims[dim] = []
        human_dims[dim].extend(scores)

dim_counts = {k: len(v) for k, v in human_dims.items()}
dim_df = pd.Series(dim_counts).sort_values(ascending=False)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

dim_df.head(11).plot(kind="barh", ax=axes[0], color="steelblue")
axes[0].set_title("Number of Human Ratings per Dimension")
axes[0].set_xlabel("Total Ratings")
axes[0].invert_yaxis()

for dim in ["Coherent", "Overall", "Informative", "Likeable"]:
    if dim in human_dims:
        axes[1].hist(human_dims[dim], bins=range(0, 7), alpha=0.5,
                     label=dim, edgecolor="white")
axes[1].set_title("Human Rating Distribution (Selected Dimensions)")
axes[1].set_xlabel("Rating")
axes[1].set_ylabel("Frequency")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.show()

### 5.4 GPT-4 vs Human: FED Dataset Comparison

In [None]:
# Compute per-dialogue mean GPT-4 scores for 'Overall'
gpt4_overall = []
for rater in range(1, 6):
    col = f"ovr_{rater}"
    if col in fed_gpt4:
        gpt4_overall.append(fed_gpt4[col])
gpt4_overall_mean = np.mean(gpt4_overall, axis=0)

# Compute per-dialogue mean human 'Overall' scores
human_overall = []
for item in fed_human:
    if "Overall" in item["annotations"]:
        human_overall.append(np.mean(item["annotations"]["Overall"]))
    else:
        human_overall.append(np.nan)

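# Truncate to a common length and drop dialogues without a human 'Overall' rating.
# NOTE: this assumes both sources list the dialogues in the same order.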
min_len = min(len(gpt4_overall_mean), len(human_overall))
gpt4_aligned = gpt4_overall_mean[:min_len]
human_aligned = np.array(human_overall[:min_len])
mask = ~np.isnan(human_aligned)
gpt4_clean = gpt4_aligned[mask]
human_clean = human_aligned[mask]

if len(gpt4_clean) > 0:
    plt.figure(figsize=(8, 6))
    plt.scatter(human_clean, gpt4_clean, alpha=0.5, s=40,
                color="steelblue", edgecolors="black")
    plt.xlabel("Human Mean Overall Rating")
    plt.ylabel("GPT-4 Mean Overall Rating")
    plt.title("GPT-4 vs Human: Per-Dialogue Overall Rating")

    lims = [min(plt.xlim()[0], plt.ylim()[0]),
            max(plt.xlim()[1], plt.ylim()[1])]
    plt.plot(lims, lims, "--", color="gray", alpha=0.5,
             label="Perfect agreement")
    plt.legend()
    plt.tight_layout()
    plt.show()

    correlation = np.corrcoef(human_clean, gpt4_clean)[0, 1]
    print(f"Pearson correlation (Human vs GPT-4 Overall): {correlation:.3f}")
    print(f"Number of dialogues compared: {len(gpt4_clean)}")
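
Pearson correlation is sensitive to scale and outliers, and rating scales are ordinal, so a rank-based coefficient is a common companion metric. The sketch below computes Spearman's rho as the Pearson correlation of average ranks, which avoids an extra dependency on SciPy. The function names and toy ratings are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    return float(np.corrcoef(np.asarray(x, dtype=float),
                             np.asarray(y, dtype=float))[0, 1])


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks (ties share a rank)."""
    rank_x = pd.Series(x, dtype=float).rank().to_numpy()
    rank_y = pd.Series(y, dtype=float).rank().to_numpy()
    return pearson_r(rank_x, rank_y)


# Toy ratings: perfectly monotonic, but one GPT-4 score is an outlier
human_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
gpt4_toy = [1.0, 1.5, 2.0, 2.5, 10.0]

rho = spearman_rho(human_toy, gpt4_toy)
r = pearson_r(human_toy, gpt4_toy)
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```

With the notebook's variables, the call would be `spearman_rho(human_clean, gpt4_clean)`.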

### 5.5 Dialog Models Compared (Human Annotations)

In [None]:
model_scores = {}
for item in fed_human:
    model = item.get("model", "unknown")
    if "Overall" in item["annotations"]:
        if model not in model_scores:
            model_scores[model] = []
        model_scores[model].append(np.mean(item["annotations"]["Overall"]))

model_df = pd.DataFrame([
    {"Model": m, "Mean Overall": np.mean(s), "Std": np.std(s), "Count": len(s)}
    for m, s in model_scores.items()
]).sort_values("Mean Overall", ascending=False)

plt.figure(figsize=(10, 5))
bars = plt.barh(model_df["Model"], model_df["Mean Overall"],
                color="mediumseagreen", edgecolor="white",
                xerr=model_df["Std"], capsize=3)
plt.title("Human Overall Rating by Dialog Model (FED)")
plt.xlabel("Mean Overall Rating")
plt.gca().invert_yaxis()
for bar, count in zip(bars, model_df["Count"]):
    plt.text(bar.get_width() + 0.1, bar.get_y() + bar.get_height() / 2,
             f"n={count}", va="center", fontsize=9)
plt.tight_layout()
plt.show()
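
A natural follow-up is to check whether GPT-4 would rank the dialog models in the same order as the human annotators. The sketch below aggregates per-dialogue scores by model and compares the two rankings. It assumes the GPT-4 scores are aligned with the human annotations dialogue by dialogue, which the data files may not guarantee; `model_level_means` and the toy values are hypothetical.

```python
import pandas as pd


def model_level_means(models, human_scores, gpt4_scores):
    """Mean human and GPT-4 score per dialog model (inputs aligned per dialogue)."""
    df = pd.DataFrame({"model": models, "human": human_scores, "gpt4": gpt4_scores})
    return df.groupby("model")[["human", "gpt4"]].mean()


# Toy data for illustration only
toy_models = ["A", "A", "B", "B", "C", "C"]
toy_human = [4.0, 5.0, 2.0, 3.0, 1.0, 2.0]
toy_gpt4 = [4.5, 4.5, 3.0, 3.5, 2.0, 1.0]

means = model_level_means(toy_models, toy_human, toy_gpt4)
human_rank = means["human"].rank(ascending=False)
gpt4_rank = means["gpt4"].rank(ascending=False)
same_order = bool((human_rank == gpt4_rank).all())

print(means)
print("Same model ranking:", same_order)
```

With real data, `models` could come from `item["model"]` and the two score lists from the per-dialogue means computed in Section 5.4.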

### 5.6 Turn-Level: Cross-Dataset Rating Comparison

In [None]:
if "turn_overall" in turn_ratings:
    overall = turn_ratings["turn_overall"]
    turn_stats = {}
    for ds_name, ratings in overall.items():
        ratings_clean = [r for r in ratings
                         if r is not None
                         and not (isinstance(r, float) and np.isnan(r))]
        if ratings_clean:
            turn_stats[ds_name] = {
                "mean": np.mean(ratings_clean),
                "std": np.std(ratings_clean),
                "count": len(ratings_clean),
            }

    turn_df = pd.DataFrame(turn_stats).T.sort_values("mean", ascending=False)

    plt.figure(figsize=(10, 5))
    plt.barh(turn_df.index, turn_df["mean"], color="orchid",
             edgecolor="white", xerr=turn_df["std"], capsize=3)
    plt.title("Turn-Level GPT-4 Overall Rating by Source Dataset")
    plt.xlabel("Mean Overall Rating")
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()

    print("Turn-level overall statistics:")
    print(turn_df.round(2).to_string())

### 5.7 Turn-Level: Dimension Correlation Heatmap

In [None]:
# Build a per-item DataFrame with all turn-level dimensions.
# Use a persona dataset as reference (exists in most rating files).
# Keys are sorted so that every file resolves to the same persona dataset
# when several are present; otherwise columns could mix different datasets.
turn_dim_data = {}
for name, data in turn_ratings.items():
    dim_name = name.replace("turn_", "").capitalize()
    for ds_key in sorted(data.keys()):
        if "persona" in ds_key.lower():
            turn_dim_data[dim_name] = data[ds_key]
            break

if len(turn_dim_data) >= 3:
    min_len = min(len(v) for v in turn_dim_data.values())
    turn_dim_df = pd.DataFrame({k: v[:min_len] for k, v in turn_dim_data.items()})
    turn_dim_df = turn_dim_df.apply(pd.to_numeric, errors="coerce").dropna()

    if len(turn_dim_df) > 10:
        plt.figure(figsize=(8, 6))
        corr = turn_dim_df.corr()
        sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                    vmin=-1, vmax=1, square=True, linewidths=0.5)
        plt.title("Turn-Level Dimension Correlation (Persona dataset)")
        plt.tight_layout()
        plt.show()

### 5.8 Summary

In [None]:
print("=== Dataset Summary ===\n")

print(f"Dialog-level GPT-4 annotation sets: {len(gpt4_dialog)}")
for name, data in gpt4_dialog.items():
    n = len(list(data.values())[0])
    print(f"  {name}: {n} dialogues, {len(data)} columns (5 dims x 5 raters)")

print(f"\nDialog-level human annotations (FED): {len(fed_human)} dialogues")
print(f"  Models: {sorted(set(d.get('model', '?') for d in fed_human))}")

print(f"\nTurn-level rating files: {len(turn_ratings)}")
for name, data in turn_ratings.items():
    print(f"  {name}: {list(data.keys())}")

## 6. Key Observations

1. **Multi-dimensional evaluation:** The benchmark covers five quality
   dimensions at each level: coherence, diversity, engagement, informativeness,
   and overall for dialogs; interesting, relevance, specificity,
   understandability, and overall for turns. This enables fine-grained
   evaluator assessment.

2. **GPT-4 inter-rater variability:** Even with the same model, 5 independent
   GPT-4 runs produce different scores, highlighting the stochastic nature
   of LLM-based evaluation.

3. **Human-LLM alignment gap:** Section 5.4 measures the correlation between
   GPT-4 and human ratings for the Overall dimension only. Alignment may differ
   by dimension, which would suggest LLMs are better evaluators on some quality
   aspects than others.

4. **Cross-dataset generalization:** Mean turn-level ratings differ across
   source datasets (Section 5.6), suggesting that evaluator behavior should be
   validated per dataset rather than assumed to transfer across tasks.

5. **Research relevance (IS/AI):**
   - **Critic agents:** Train LLMs to evaluate other agents' outputs reliably
   - **Evaluation automation:** Replace expensive human evaluation with calibrated LLM judges
   - **Bias detection:** Identify systematic differences between human and LLM ratings
   - **Multi-dimensional quality:** Move beyond single-score evaluation to nuanced assessment
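
To examine the alignment gap per dimension rather than for Overall alone, the comparison from Section 5.4 can be repeated for each dimension pair. The sketch below assumes the GPT-4 column prefixes map to human dimension names as in `DIM_MAP`; that mapping is a guess based on the names used in this notebook and should be checked against the data. The function name and toy inputs are hypothetical.

```python
import numpy as np

# Assumed pairing of GPT-4 column prefixes with human dimension names
DIM_MAP = {"coh": "Coherent", "inf": "Informative", "ovr": "Overall"}


def per_dimension_pearson(gpt4, human, dim_map=DIM_MAP, n_raters=5):
    """Pearson r between mean GPT-4 and mean human scores, per dimension.

    gpt4: dict of '<dim>_<rater_id>' -> list of scores
    human: list of items with item['annotations'][dimension] -> list of scores
    Both are assumed to list dialogues in the same order.
    """
    results = {}
    for key, human_name in dim_map.items():
        cols = [gpt4[f"{key}_{r}"] for r in range(1, n_raters + 1)
                if f"{key}_{r}" in gpt4]
        if not cols:
            continue  # no GPT-4 columns for this dimension
        gpt4_mean = np.mean(np.asarray(cols, dtype=float), axis=0)
        human_mean = np.array(
            [np.mean(item["annotations"][human_name])
             if item["annotations"].get(human_name) else np.nan
             for item in human],
            dtype=float,
        )
        if len(gpt4_mean) != len(human_mean):
            raise ValueError(f"length mismatch for dimension {human_name!r}")
        keep = ~np.isnan(human_mean)
        if keep.sum() >= 3:  # too few points give an unstable estimate
            results[human_name] = float(
                np.corrcoef(human_mean[keep], gpt4_mean[keep])[0, 1])
    return results


# Toy inputs for illustration only
toy_gpt4 = {"coh_1": [1, 2, 3, 4], "coh_2": [1, 2, 3, 4], "ovr_1": [4, 3, 2, 1]}
toy_human = [{"annotations": {"Coherent": [s, s], "Overall": [s]}}
             for s in (1, 2, 3, 4)]

dim_corr = per_dimension_pearson(toy_gpt4, toy_human)
print(dim_corr)
```

The function raises on a length mismatch instead of truncating, since silently trimmed lists can pair scores from different dialogues.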