# 9. Evaluator Agent Benchmark — Deep Analysis (Comp-Analysis)
**Category:** AI Agent Core Capabilities

**Source:** [e0397123 / comp-analysis](https://github.com/e0397123/comp-analysis)

**Description:** Used to train Critic Agents that evaluate the quality of
dialogue generated by other agents.

**Data Content:** Multi-dimensional dialogue evaluation data, comparing LLM scores
with human ratings across dialog-level and turn-level quality dimensions.

**Paper:** [A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators (AAAI 2024)](https://ojs.aaai.org/index.php/AAAI/article/view/29918)

---

**This notebook extends notebook 8** by performing:
1. Spearman & Pearson correlation analysis across all datasets and dimensions
2. GPT-4 evaluator calibration and bias analysis
3. Robustness testing (original vs perturbed dialogues)
4. Cross-dataset evaluator generalization
5. A critic agent evaluation scoring framework

## 1. Setup

In [None]:
# Install dependencies (uncomment if needed)
# !pip install pandas matplotlib seaborn scipy

In [None]:
import os
import json
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from scipy import stats
from itertools import combinations

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["figure.dpi"] = 100
plt.rcParams["axes.titlesize"] = 13
plt.rcParams["axes.labelsize"] = 11

# Helper: some GPT-4 JSON files contain an extra "indices" column that is NOT
# a score column.  This function returns only the actual score column names
# (pattern: <dim>_<rater>, e.g. "coh_1", "ovr_5").
_SCORE_COL_RE = re.compile(r"^[a-z]+_\d+$")  # `re` is already imported above

def score_columns(data: dict) -> list:
    """Return only the score column names from a GPT-4 annotation dict."""
    return [c for c in data if _SCORE_COL_RE.match(c)]

## 2. Dataset Overview

The **comp-analysis** benchmark evaluates LLMs as automatic dialogue evaluators.
It provides four data categories:

| Category | Level | Dimensions | Purpose |
|----------|-------|------------|---------|
| Dialog-level GPT-4 | Whole conversation | Coherence, Diversity, Engagement, Informativeness, Overall | Measure LLM evaluator scores |
| Dialog-level Human | Whole conversation | 11 fine-grained dimensions (0–2 scale) | Ground truth for calibration |
| Turn-level Ratings | Single response | Interesting, Relevance, Specificity, Understandability, Overall | Cross-dataset comparison |
| Robustness Data | Both levels | Perturbation types: order shuffle, repetition, contradiction, boring | Test evaluator consistency |

**Source datasets:** FED, HEVAL, IEVAL, ConTurE, Reliable, PersonaSee (dialog); FED-turn, ConTurE-turn, PersonaUSR, PersonaZhao, DailyDialog, TopicalUSR (turn).
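
The dialog-level GPT-4 files are loaded as a dictionary of columns, where score columns follow the `<dim>_<rater>` pattern (for example `coh_1`) and some files also carry a non-score `indices` column. A minimal sketch of separating the two on a toy dictionary; the function name `select_score_columns` and all values are hypothetical and are not taken from the benchmark:

```python
import re

# Lowercase dimension key, underscore, rater number (e.g. "coh_1", "ovr_5").
SCORE_COL_RE = re.compile(r"^[a-z]+_\d+$")

def select_score_columns(data):
    """Return only the keys that look like <dim>_<rater> score columns."""
    return [col for col in data if SCORE_COL_RE.match(col)]

# Toy annotation dict with made-up values: two dialogues, two coherence
# raters, one overall rater, plus a non-score "indices" column.
toy = {
    "indices": [0, 1],
    "coh_1": [4.0, 3.0],
    "coh_2": [5.0, 3.0],
    "ovr_1": [4.0, 2.0],
}

score_cols = select_score_columns(toy)
print(score_cols)
```

Filtering by name pattern rather than by position keeps the later pooling steps from mixing index values into the score distributions.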

## 3. Data Loading

In [None]:
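# Clone the repository (skip if already cloned)
import subprocess

REPO_DIR = Path("comp-analysis")
if not REPO_DIR.exists():
    # check=True raises CalledProcessError if git fails, so the success
    # message below is printed only after a successful clone.
    subprocess.run(
        ["git", "clone", "https://github.com/e0397123/comp-analysis.git"],
        check=True,
    )
    print("Repository cloned.")
else:
    print(f"Repository already exists at {REPO_DIR}")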

DIALOG_DIR = REPO_DIR / "dialog_level_texts"
TURN_DIR = REPO_DIR / "turn_level_texts"
ROBUST_DIR = REPO_DIR / "robustness_data"

In [None]:
# --- Dialog-level GPT-4 annotations ---
gpt4_files = sorted(DIALOG_DIR.glob("*_gpt4_annotations.json"))

gpt4_dialog = {}
for f in gpt4_files:
    name = f.stem.replace("_gpt4_annotations", "")
    with open(f, "r", encoding="utf-8") as fh:
        gpt4_dialog[name] = json.load(fh)
    n_dialogues = len(list(gpt4_dialog[name].values())[0])
    print(f"  {name}: {n_dialogues} dialogues, {len(gpt4_dialog[name])} columns")

# --- Dialog-level human annotations (FED) ---
human_file = DIALOG_DIR / "fed_human_annotations.json"
with open(human_file, "r", encoding="utf-8") as f:
    fed_human = json.load(f)
print(f"  FED human: {len(fed_human)} dialogues, "
      f"dims = {list(fed_human[0]['annotations'].keys())}")

# --- Turn-level ratings ---
turn_files = sorted(TURN_DIR.glob("turn_*_ratings.json"))

turn_ratings = {}
for f in turn_files:
    name = f.stem.replace("_ratings", "")
    with open(f, "r", encoding="utf-8") as fh:
        turn_ratings[name] = json.load(fh)
    datasets_in = list(turn_ratings[name].keys())
    print(f"  {name}: {len(datasets_in)} datasets")

# --- Dialog-level text files (for dialogue content) ---
dialog_texts = {}
for f in sorted(DIALOG_DIR.glob("*_text.txt")):
    name = f.stem.replace("-dial_text", "").replace("_text", "")
    with open(f, "r", encoding="utf-8") as fh:
        lines = [l.strip() for l in fh if l.strip()]
    dialog_texts[name] = lines
    print(f"  dialog text {name}: {len(lines)} dialogues")

print(f"\nLoaded {len(gpt4_dialog)} GPT-4 sets, {len(turn_ratings)} turn-level sets, "
      f"{len(dialog_texts)} dialog text files.")

In [None]:
# --- Robustness data ---
robust_dialog_dir = ROBUST_DIR / "dialog_level_robust_data"
robust_turn_dir = ROBUST_DIR / "turn_level_robust_data"

# Load dialog-level original coherent dialogues
fed_originals = []
orig_file = robust_dialog_dir / "fed_coherent_original.json"
if orig_file.exists():
    with open(orig_file, "r", encoding="utf-8") as f:
        fed_originals = json.load(f)
    print(f"FED original dialogues: {len(fed_originals)}")

# Load all robustness perturbation text files
robustness_data = {"dialog": {}, "turn": {}}

if robust_dialog_dir.exists():
    for dim_dir in sorted(robust_dialog_dir.iterdir()):
        if dim_dir.is_dir():
            dim_name = dim_dir.name
            robustness_data["dialog"][dim_name] = {}
            for txt_file in sorted(dim_dir.glob("*.txt")):
                with open(txt_file, "r", encoding="utf-8") as fh:
                    lines = [l.strip() for l in fh if l.strip()]
                robustness_data["dialog"][dim_name][txt_file.stem] = lines

if robust_turn_dir.exists():
    for dim_dir in sorted(robust_turn_dir.iterdir()):
        if dim_dir.is_dir():
            dim_name = dim_dir.name
            robustness_data["turn"][dim_name] = {}
            for txt_file in sorted(dim_dir.glob("*.txt")):
                with open(txt_file, "r", encoding="utf-8") as fh:
                    lines = [l.strip() for l in fh if l.strip()]
                robustness_data["turn"][dim_name][txt_file.stem] = lines

print(f"\nRobustness (dialog-level) dimensions: {list(robustness_data['dialog'].keys())}")
for dim, files in robustness_data["dialog"].items():
    print(f"  {dim}: {list(files.keys())} ({[len(v) for v in files.values()]} items)")

print(f"\nRobustness (turn-level) dimensions: {list(robustness_data['turn'].keys())}")
for dim, files in robustness_data["turn"].items():
    print(f"  {dim}: {list(files.keys())} ({[len(v) for v in files.values()]} items)")

## 4. Data Schema Deep Dive

In [None]:
# Build a summary of all GPT-4 annotation datasets
dimensions = {"coh": "Coherence", "div": "Diversity", "eng": "Engagement",
              "inf": "Informativeness", "ovr": "Overall"}

print("=== GPT-4 Dialog-Level Annotation Schema ===")
print(f"Dimensions: {list(dimensions.values())}")
print(f"Raters per dimension: 5 (independent GPT-4 runs)")
print(f"Score scale: 1-5 (float)\n")

summary_rows = []
for ds_name, data in gpt4_dialog.items():
    n = len(list(data.values())[0])
    dims_present = set()
    for col in score_columns(data):  # skip non-score columns such as "indices"
        dim_key = col.rsplit("_", 1)[0]
        dims_present.add(dim_key)
    summary_rows.append({
        "Dataset": ds_name,
        "Dialogues": n,
        "Columns": len(data),
        "Dimensions": ", ".join(sorted(dims_present)),
    })

pd.DataFrame(summary_rows)

In [None]:
# Human annotation schema (11 fine-grained dimensions)
print("=== Human Annotation Schema (FED) ===")
print(f"Scale: 0-2 (integer, higher = better)")
print(f"Annotators: multiple per dimension\n")

human_dim_stats = {}
for item in fed_human:
    for dim, scores in item["annotations"].items():
        if dim not in human_dim_stats:
            human_dim_stats[dim] = []
        human_dim_stats[dim].extend(scores)

human_summary = pd.DataFrame([
    {
        "Dimension": dim,
        "Total Ratings": len(scores),
        "Mean": np.mean(scores),
        "Std": np.std(scores),
        "Unique Values": sorted(set(scores)),
    }
    for dim, scores in human_dim_stats.items()
]).sort_values("Total Ratings", ascending=False)

print(human_summary.to_string(index=False))

In [None]:
# Dialog text structure
print("=== Dialog Text Format ===")
for ds_name, lines in list(dialog_texts.items())[:2]:
    print(f"\n--- {ds_name} (first dialogue) ---")
    parts = lines[0].split("\t")
    print(f"  Dialog ID: {parts[0]}")
    print(f"  Turns: {len(parts) - 1}")
    for turn in parts[1:4]:
        print(f"    {turn[:100]}")
    if len(parts) > 4:
        print(f"    ... ({len(parts) - 4} more turns)")

## 5. Multi-Dataset Correlation Analysis

We compute **Spearman** (rank-based) and **Pearson** (linear) correlations between
GPT-4 raters and between GPT-4 vs. human ratings. This is the core metric for
evaluating LLMs as critic agents.
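
To illustrate why both coefficients are reported, here is a small sketch on made-up numbers (not benchmark data): when one set of scores rises monotonically but non-linearly with the other, the rank-based Spearman coefficient is 1 while Pearson falls below 1.

```python
import numpy as np
from scipy import stats

# Hypothetical scores: the evaluator preserves the human ordering exactly,
# but the relationship is cubic rather than linear.
human = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
evaluator = human ** 3

rho, _ = stats.spearmanr(human, evaluator)   # compares ranks only
r, _ = stats.pearsonr(human, evaluator)      # measures linear association

print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

A large gap between the two on real data would suggest that the evaluator orders dialogues like the humans do but uses the scale differently.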

### 5.1 GPT-4 Intra-Model Agreement (All Datasets)

In [None]:
# Compute pairwise Spearman correlation between 5 GPT-4 raters for each dataset/dimension
intra_results = []
for ds_name, data in gpt4_dialog.items():
    for dim_key, dim_label in dimensions.items():
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6) if f"{dim_key}_{r}" in data]
        if len(rater_cols) < 2:
            continue
        pair_corrs = []
        for c1, c2 in combinations(rater_cols, 2):
            rho, pval = stats.spearmanr(data[c1], data[c2])
            pair_corrs.append(rho)
        intra_results.append({
            "Dataset": ds_name,
            "Dimension": dim_label,
            "Mean Spearman": np.mean(pair_corrs),
            "Std Spearman": np.std(pair_corrs),
            "Min Spearman": np.min(pair_corrs),
            "Max Spearman": np.max(pair_corrs),
            "N Pairs": len(pair_corrs),
        })

df_intra = pd.DataFrame(intra_results)
print("=== GPT-4 Intra-Model Agreement (Pairwise Spearman) ===")
print(df_intra.round(3).to_string(index=False))

In [None]:
# Heatmap: Mean Spearman correlation by Dataset x Dimension
pivot_intra = df_intra.pivot(index="Dataset", columns="Dimension", values="Mean Spearman")

plt.figure(figsize=(10, 5))
sns.heatmap(pivot_intra, annot=True, fmt=".3f", cmap="YlOrRd", vmin=0, vmax=1,
            linewidths=0.5, square=True)
plt.title("GPT-4 Intra-Rater Agreement (Mean Pairwise Spearman)")
plt.tight_layout()
plt.show()

print("\nHighest agreement:",
      df_intra.loc[df_intra["Mean Spearman"].idxmax()][
          ["Dataset", "Dimension", "Mean Spearman"]].to_dict())
print("Lowest agreement:",
      df_intra.loc[df_intra["Mean Spearman"].idxmin()][
          ["Dataset", "Dimension", "Mean Spearman"]].to_dict())

### 5.2 GPT-4 vs Human: Dimension-Level Correlation (FED)

In [None]:
# Map human annotation dimensions to GPT-4 dimensions
# Human: Coherent, Diverse, Likeable (~ Engagement), Informative, Overall
# GPT-4: coh, div, eng, inf, ovr
dim_mapping = {
    "coh": "Coherent",
    "div": "Diverse",
    "eng": "Likeable",       # closest proxy for engagement
    "inf": "Informative",
    "ovr": "Overall",
}

fed_gpt4 = gpt4_dialog.get("fed", {})

# Compute per-dialogue mean for GPT-4 (mean of 5 raters) and human (mean of annotators)
gpt4_vs_human = []
for gpt4_dim, human_dim in dim_mapping.items():
    # GPT-4 mean across 5 raters
    rater_scores = []
    for r in range(1, 6):
        col = f"{gpt4_dim}_{r}"
        if col in fed_gpt4:
            rater_scores.append(fed_gpt4[col])
    if not rater_scores:
        continue
    gpt4_mean = np.mean(rater_scores, axis=0)

    # Human mean per dialogue
    human_mean = []
    for item in fed_human:
        if human_dim in item["annotations"]:
            human_mean.append(np.mean(item["annotations"][human_dim]))
        else:
            human_mean.append(np.nan)

    # Align and clean
    min_len = min(len(gpt4_mean), len(human_mean))
    g = gpt4_mean[:min_len]
    h = np.array(human_mean[:min_len])
    mask = ~np.isnan(h)
    g_clean, h_clean = g[mask], h[mask]

    if len(g_clean) >= 5:
        spearman_r, spearman_p = stats.spearmanr(h_clean, g_clean)
        pearson_r, pearson_p = stats.pearsonr(h_clean, g_clean)
        gpt4_vs_human.append({
            "GPT-4 Dim": dimensions[gpt4_dim],
            "Human Dim": human_dim,
            "Spearman r": spearman_r,
            "Spearman p": spearman_p,
            "Pearson r": pearson_r,
            "Pearson p": pearson_p,
            "N": len(g_clean),
        })

df_corr = pd.DataFrame(gpt4_vs_human)
print("=== GPT-4 vs Human Correlation (FED Dataset) ===")
print(df_corr.round(3).to_string(index=False))

In [None]:
# Visualize GPT-4 vs Human correlations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x = range(len(df_corr))
width = 0.35

axes[0].bar([i - width/2 for i in x], df_corr["Spearman r"], width,
            label="Spearman", color="steelblue")
axes[0].bar([i + width/2 for i in x], df_corr["Pearson r"], width,
            label="Pearson", color="coral")
axes[0].set_xticks(list(x))
axes[0].set_xticklabels(df_corr["GPT-4 Dim"], rotation=15)
axes[0].set_ylabel("Correlation")
axes[0].set_title("GPT-4 vs Human: Correlation by Dimension")
axes[0].legend()
axes[0].axhline(y=0, color="gray", linestyle="--", alpha=0.5)

# Significance markers
for i, (_, row) in enumerate(df_corr.iterrows()):
    sig = ("***" if row["Spearman p"] < 0.001 else
           "**" if row["Spearman p"] < 0.01 else
           "*" if row["Spearman p"] < 0.05 else "ns")
    axes[0].text(i - width/2, row["Spearman r"] + 0.02, sig,
                 ha="center", fontsize=8)

# Scatter plots for each dimension
colors = ["steelblue", "coral", "mediumseagreen", "orchid", "goldenrod"]
for idx, (gpt4_dim, human_dim) in enumerate(dim_mapping.items()):
    rater_scores = [fed_gpt4[f"{gpt4_dim}_{r}"] for r in range(1, 6)
                    if f"{gpt4_dim}_{r}" in fed_gpt4]
    if not rater_scores:
        continue
    gpt4_mean = np.mean(rater_scores, axis=0)
    human_mean = [
        np.mean(item["annotations"][human_dim])
        if human_dim in item["annotations"] else np.nan
        for item in fed_human
    ]
    min_len = min(len(gpt4_mean), len(human_mean))
    g, h = gpt4_mean[:min_len], np.array(human_mean[:min_len])
    mask = ~np.isnan(h)
    axes[1].scatter(h[mask], g[mask], alpha=0.5, s=30,
                    label=dimensions[gpt4_dim], color=colors[idx])

axes[1].set_xlabel("Human Mean Rating")
axes[1].set_ylabel("GPT-4 Mean Rating")
axes[1].set_title("GPT-4 vs Human: All Dimensions")
axes[1].legend(fontsize=8)
lims = [min(axes[1].get_xlim()[0], axes[1].get_ylim()[0]),
        max(axes[1].get_xlim()[1], axes[1].get_ylim()[1])]
axes[1].plot(lims, lims, "--", color="gray", alpha=0.4)

plt.tight_layout()
plt.show()

### 5.3 Cross-Dataset GPT-4 Score Distributions

In [None]:
# Compare GPT-4 score distributions across all datasets for each dimension
cross_rows = []
for ds_name, data in gpt4_dialog.items():
    for dim_key, dim_label in dimensions.items():
        all_scores = []
        for r in range(1, 6):
            col = f"{dim_key}_{r}"
            if col in data:
                all_scores.extend(data[col])
        if all_scores:
            cross_rows.append({
                "Dataset": ds_name,
                "Dimension": dim_label,
                "Mean": np.mean(all_scores),
                "Std": np.std(all_scores),
                "Median": np.median(all_scores),
                "Skewness": stats.skew(all_scores),
                "N": len(all_scores),
            })

df_cross = pd.DataFrame(cross_rows)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Mean score comparison
pivot_mean = df_cross.pivot(index="Dataset", columns="Dimension", values="Mean")
pivot_mean.plot(kind="bar", ax=axes[0], colormap="Set2", edgecolor="white")
axes[0].set_title("Mean GPT-4 Scores by Dataset and Dimension")
axes[0].set_ylabel("Mean Score")
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=15)
axes[0].legend(fontsize=8, loc="lower right")

# Score variance (std) comparison
pivot_std = df_cross.pivot(index="Dataset", columns="Dimension", values="Std")
pivot_std.plot(kind="bar", ax=axes[1], colormap="Set2", edgecolor="white")
axes[1].set_title("GPT-4 Score Std Dev by Dataset and Dimension")
axes[1].set_ylabel("Standard Deviation")
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=15)
axes[1].legend(fontsize=8, loc="upper right")

plt.tight_layout()
plt.show()

In [None]:
# Skewness analysis: does GPT-4 tend to give high or low scores?
pivot_skew = df_cross.pivot(index="Dataset", columns="Dimension", values="Skewness")

plt.figure(figsize=(10, 5))
sns.heatmap(pivot_skew, annot=True, fmt=".2f", cmap="RdBu_r", center=0,
            linewidths=0.5, square=True)
plt.title("GPT-4 Score Skewness by Dataset and Dimension\n"
          "(Negative = tends high, Positive = tends low)")
plt.tight_layout()
plt.show()

## 6. GPT-4 Evaluator Calibration & Bias

### 6.1 Scale Usage Patterns
Does GPT-4 use the full 1–5 scale, or does it cluster around certain values?
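
As a minimal sketch of what "scale usage" means here, the helper below reports the fraction of ratings at each scale point, including points that are never used. The function name `scale_usage` and the ratings are hypothetical, chosen only for illustration:

```python
import numpy as np

def scale_usage(scores, scale=(1, 2, 3, 4, 5)):
    """Fraction of ratings at each scale point (unused points report 0.0)."""
    scores = np.asarray(scores)
    return {point: float(np.mean(scores == point)) for point in scale}

# Hypothetical ratings that cluster around 4 and never use 1 or 2.
toy_scores = [4, 4, 5, 3, 4, 5, 4, 4]
usage = scale_usage(toy_scores)

for point, frac in usage.items():
    print(f"score {point}: {frac:.1%}")
```

Note that exact equality tests like this only count integer-valued ratings; fractional scores would fall outside every bucket.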

In [None]:
# Analyze score frequency distribution per dataset
n_datasets = len(gpt4_dialog)
cols = min(3, n_datasets)
rows_fig = (n_datasets + cols - 1) // cols
fig, axes = plt.subplots(rows_fig, cols, figsize=(5 * cols, 4 * rows_fig))
# plt.subplots returns a single Axes for a 1x1 grid; normalize to a flat array
axes = np.atleast_1d(axes).flatten()

for idx, (ds_name, data) in enumerate(gpt4_dialog.items()):
    if idx >= len(axes):
        break
    all_scores = []
    for col in score_columns(data):          # <-- filter out non-score columns
        all_scores.extend(data[col])
    ax = axes[idx]
    ax.hist(all_scores, bins=np.arange(0.5, 6.5, 1), color="steelblue",
            edgecolor="white", density=True, alpha=0.8)
    ax.set_title(f"{ds_name} (n={len(all_scores)})")
    ax.set_xlabel("Score")
    ax.set_ylabel("Density")
    ax.set_xticks([1, 2, 3, 4, 5])

for i in range(len(gpt4_dialog), len(axes)):
    axes[i].set_visible(False)

plt.suptitle("GPT-4 Score Distribution by Dataset (All Dimensions Pooled)",
             fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

In [None]:
# Quantify GPT-4's tendency toward extreme vs moderate scores
calibration_rows = []
for ds_name, data in gpt4_dialog.items():
    all_scores = []
    for col in score_columns(data):          # <-- filter out non-score columns
        all_scores.extend(data[col])
    all_scores = np.array(all_scores)
    calibration_rows.append({
        "Dataset": ds_name,
        "% Score=1": np.mean(all_scores == 1) * 100,
        "% Score=2": np.mean(all_scores == 2) * 100,
        "% Score=3": np.mean(all_scores == 3) * 100,
        "% Score=4": np.mean(all_scores == 4) * 100,
        "% Score=5": np.mean(all_scores == 5) * 100,
        "% Extreme (1 or 5)": np.mean((all_scores == 1) | (all_scores == 5)) * 100,
        "Mean": np.mean(all_scores),
    })

df_calib = pd.DataFrame(calibration_rows)
print("=== GPT-4 Scale Usage Patterns ===")
print(df_calib.round(1).to_string(index=False))

### 6.2 Positivity Bias Detection
Does GPT-4 systematically rate dialogues higher than human annotators?
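
One way to compare the two raters on a common footing is to rescale each rating by its nominal scale bounds rather than by the observed minimum and maximum. A minimal sketch, assuming the 1-5 GPT-4 scale and the 0-2 human scale stated in this notebook; the helper name `rescale` and the toy values are hypothetical:

```python
import numpy as np

def rescale(scores, low, high):
    """Map scores from the nominal range [low, high] onto [0, 1]."""
    return (np.asarray(scores, dtype=float) - low) / (high - low)

# Hypothetical per-dialogue means, not benchmark results.
gpt4 = [4.0, 3.0, 5.0, 4.0]    # assumed 1-5 scale
human = [1.0, 1.0, 2.0, 1.5]   # assumed 0-2 scale

# Positive values mean GPT-4 sits higher on its scale than humans do on theirs.
bias = float(np.mean(rescale(gpt4, 1, 5) - rescale(human, 0, 2)))
print(f"mean rescaled difference: {bias:+.4f}")
```

Rescaling by fixed bounds keeps a uniformly generous rater visible, whereas min-max scaling over observed values maps both raters' lowest scores to 0 and can hide such an offset.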

In [None]:
# Compare GPT-4 vs human score distributions on overlapping FED dialogues
bias_results = []
for gpt4_dim, human_dim in dim_mapping.items():
    rater_scores = [fed_gpt4[f"{gpt4_dim}_{r}"] for r in range(1, 6)
                    if f"{gpt4_dim}_{r}" in fed_gpt4]
    if not rater_scores:
        continue
    gpt4_mean_per_dialog = np.mean(rater_scores, axis=0)

    human_mean_per_dialog = []
    for item in fed_human:
        if human_dim in item["annotations"]:
            human_mean_per_dialog.append(np.mean(item["annotations"][human_dim]))
        else:
            human_mean_per_dialog.append(np.nan)

    min_len = min(len(gpt4_mean_per_dialog), len(human_mean_per_dialog))
    g = gpt4_mean_per_dialog[:min_len]
    h = np.array(human_mean_per_dialog[:min_len])
    mask = ~np.isnan(h)
    g_clean, h_clean = g[mask], h[mask]

    if len(g_clean) >= 5:
        # Min-max normalize each series over its own observed range (not the
        # nominal rating scales), so both span 0-1 before differencing
        g_norm = (g_clean - g_clean.min()) / (g_clean.max() - g_clean.min() + 1e-8)
        h_norm = (h_clean - h_clean.min()) / (h_clean.max() - h_clean.min() + 1e-8)
        bias = np.mean(g_norm - h_norm)  # positive = GPT-4 rates higher

        bias_results.append({
            "Dimension": dimensions[gpt4_dim],
            "GPT-4 Mean (raw)": np.mean(g_clean),
            "Human Mean (raw)": np.mean(h_clean),
            "Normalized Bias": bias,
            "Direction": "GPT-4 higher" if bias > 0 else "Human higher",
        })

df_bias = pd.DataFrame(bias_results)
print("=== Positivity Bias Analysis ===")
print(df_bias.round(3).to_string(index=False))

plt.figure(figsize=(8, 5))
colors_bar = ["coral" if b > 0 else "steelblue" for b in df_bias["Normalized Bias"]]
plt.barh(df_bias["Dimension"], df_bias["Normalized Bias"],
         color=colors_bar, edgecolor="white")
plt.axvline(x=0, color="black", linewidth=0.8)
plt.xlabel("Normalized Bias (positive = GPT-4 rates higher)")
plt.title("GPT-4 Positivity Bias by Dimension (FED)")
plt.tight_layout()
plt.show()

### 6.3 Per-Model Evaluation Bias
Does GPT-4 systematically prefer certain dialogue models over others, compared to humans?
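
Agreement between the two model rankings can be summarized with a rank correlation. A minimal sketch using Kendall's tau and Spearman's rho from SciPy; the model names and mean ratings below are hypothetical placeholders, not benchmark results:

```python
from scipy import stats

# Hypothetical per-model mean Overall ratings.
human_means = {"model_a": 1.6, "model_b": 1.1, "model_c": 0.7, "model_d": 1.3}
gpt4_means = {"model_a": 4.2, "model_b": 3.9, "model_c": 2.5, "model_d": 3.1}

# Use one fixed model order so both lists are aligned.
models = sorted(human_means)
h = [human_means[m] for m in models]
g = [gpt4_means[m] for m in models]

tau, _ = stats.kendalltau(h, g)   # based on concordant vs discordant pairs
rho, _ = stats.spearmanr(h, g)    # Pearson correlation of the ranks

print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```

In this toy case only one pair of models (`model_b` and `model_d`) is ordered differently by the two raters, which is what pulls both coefficients below 1.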

In [None]:
# Group FED dialogues by model and compare GPT-4 vs Human Overall scores
model_comparison = {}
for i, item in enumerate(fed_human):
    model = item.get("model", "unknown")
    if model not in model_comparison:
        model_comparison[model] = {"human": [], "gpt4": []}

    # Human Overall
    if "Overall" in item["annotations"]:
        model_comparison[model]["human"].append(
            np.mean(item["annotations"]["Overall"]))
    else:
        model_comparison[model]["human"].append(np.nan)

    # GPT-4 Overall (mean of 5 raters)
    gpt4_scores_i = []
    for r in range(1, 6):
        col = f"ovr_{r}"
        if col in fed_gpt4 and i < len(fed_gpt4[col]):
            gpt4_scores_i.append(fed_gpt4[col][i])
    if gpt4_scores_i:
        model_comparison[model]["gpt4"].append(np.mean(gpt4_scores_i))
    else:
        model_comparison[model]["gpt4"].append(np.nan)

model_rows = []
for model, scores in model_comparison.items():
    h = [s for s in scores["human"] if not np.isnan(s)]
    g = [s for s in scores["gpt4"] if not np.isnan(s)]
    if h and g:
        model_rows.append({
            "Model": model,
            "Human Mean": np.mean(h),
            "GPT-4 Mean": np.mean(g),
            "Rank (Human)": 0,
            "Rank (GPT-4)": 0,
            "N": min(len(h), len(g)),
        })

df_model = pd.DataFrame(model_rows)
# Rank 1 = highest mean; method="min" gives tied models the same rank
df_model["Rank (Human)"] = (
    df_model["Human Mean"].rank(ascending=False, method="min").astype(int))
df_model["Rank (GPT-4)"] = (
    df_model["GPT-4 Mean"].rank(ascending=False, method="min").astype(int))
df_model = df_model.sort_values("Rank (Human)")

print("=== Per-Model Ranking Comparison ===")
print(df_model.round(3).to_string(index=False))

fig, ax = plt.subplots(figsize=(10, 5))
x = range(len(df_model))
width = 0.35
ax.bar([i - width/2 for i in x], df_model["Human Mean"], width,
       label="Human", color="steelblue")
ax.bar([i + width/2 for i in x], df_model["GPT-4 Mean"], width,
       label="GPT-4", color="coral")
ax.set_xticks(list(x))
ax.set_xticklabels(df_model["Model"], rotation=15)
ax.set_ylabel("Mean Overall Rating")
ax.set_title("Human vs GPT-4 Overall Rating by Dialog Model")
ax.legend()
plt.tight_layout()
plt.show()

## 7. Robustness Analysis

A reliable evaluator should detect quality degradation when dialogues are perturbed.
We analyze the robustness perturbation data to understand:
- How many perturbation types exist per dimension
- How dialogue structure changes under perturbation
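
The cells in this section load perturbation texts rather than evaluator scores. If scores for both versions of each dialogue were available, one simple consistency check would be the share of pairs in which the perturbed version scores lower. A sketch under that assumption; the function name and the scores are hypothetical:

```python
import numpy as np

def degradation_detection_rate(original_scores, perturbed_scores):
    """Fraction of pairs where the perturbed item scores strictly lower."""
    orig = np.asarray(original_scores, dtype=float)
    pert = np.asarray(perturbed_scores, dtype=float)
    if orig.shape != pert.shape:
        raise ValueError("score lists must be paired one-to-one")
    return float(np.mean(pert < orig))

# Hypothetical evaluator scores for five original/perturbed pairs.
orig_scores = [4.0, 5.0, 3.0, 4.0, 4.0]
pert_scores = [2.0, 5.0, 1.0, 3.0, 4.5]

rate = degradation_detection_rate(orig_scores, pert_scores)
print(f"detected degradation in {rate:.0%} of pairs")
```

Ties count as misses here, which is a deliberate but debatable choice: an evaluator that gives the same score to a corrupted dialogue has not detected the corruption.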

### 7.1 Robustness Data Overview

In [None]:
# Summarize robustness data structure
print("=== Dialog-Level Robustness Perturbations ===")
dialog_robust_rows = []
for dim, files in robustness_data["dialog"].items():
    dim_full = {"coh": "Coherence", "div": "Diversity",
                "eng": "Engagement", "inf": "Informativeness"}.get(dim, dim)
    for fname, lines in files.items():
        perturb_type = fname.replace("fed-", "").replace("decode-", "").replace("_text", "")
        dialog_robust_rows.append({
            "Dimension": dim_full,
            "Perturbation": perturb_type,
            "N Dialogues": len(lines),
            "Avg Turns": np.mean([len(l.split("\t")) - 1 for l in lines]) if lines else 0,
        })

df_robust_dialog = pd.DataFrame(dialog_robust_rows)
if len(df_robust_dialog) > 0:
    print(df_robust_dialog.round(1).to_string(index=False))

print("\n=== Turn-Level Robustness Perturbations ===")
turn_robust_rows = []
for dim, files in robustness_data["turn"].items():
    dim_full = {"int": "Interesting", "rel": "Relevance",
                "spe": "Specificity", "und": "Understandability"}.get(dim, dim)
    for fname, lines in files.items():
        is_original = "original" in fname
        turn_robust_rows.append({
            "Dimension": dim_full,
            "File": fname,
            "Type": "Original" if is_original else "Perturbed",
            "N Items": len(lines),
        })

df_robust_turn = pd.DataFrame(turn_robust_rows)
if len(df_robust_turn) > 0:
    print(df_robust_turn.to_string(index=False))

### 7.2 Dialog-Level Perturbation Structure Analysis

In [None]:
# Analyze how perturbations change dialogue structure
original_turns = [len(d) for d in fed_originals] if fed_originals else []

struct_rows = []
for dim, files in robustness_data["dialog"].items():
    for fname, lines in files.items():
        if not lines:
            continue  # np.min / np.max raise on an empty list
        turns_per_dialog = [len(l.split("\t")) - 1 for l in lines]
        struct_rows.append({
            "Perturbation": fname.replace("_text", ""),
            "Dimension": dim,
            "Mean Turns": np.mean(turns_per_dialog),
            "Std Turns": np.std(turns_per_dialog),
            "Min Turns": np.min(turns_per_dialog),
            "Max Turns": np.max(turns_per_dialog),
        })

if struct_rows:
    df_struct = pd.DataFrame(struct_rows)

    from matplotlib.patches import Patch

    plt.figure(figsize=(14, 6))
    colors_dim = {"coh": "steelblue", "div": "coral",
                  "eng": "mediumseagreen", "inf": "orchid"}
    bar_colors = [colors_dim.get(row["Dimension"], "gray")
                  for _, row in df_struct.iterrows()]
    plt.barh(range(len(df_struct)), df_struct["Mean Turns"], color=bar_colors,
             xerr=df_struct["Std Turns"], capsize=3, edgecolor="white")
    plt.yticks(range(len(df_struct)), df_struct["Perturbation"], fontsize=9)
    plt.xlabel("Mean Number of Turns")
    plt.title("Dialogue Structure Under Perturbation")

    if original_turns:
        plt.axvline(x=np.mean(original_turns), color="black", linestyle="--",
                    alpha=0.6, label=f"Original mean ({np.mean(original_turns):.1f})")

    dim_labels = {"coh": "Coherence", "div": "Diversity",
                  "eng": "Engagement", "inf": "Informativeness"}
    legend_elements = [Patch(facecolor=c, label=dim_labels[d])
                       for d, c in colors_dim.items() if d in dim_labels]
    plt.legend(handles=legend_elements, loc="lower right", fontsize=9)

    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("No dialog-level robustness data available.")

### 7.3 Turn-Level Perturbation: Original vs Perturbed Comparison

In [None]:
# Show examples of original vs perturbed turns
print("=== Turn-Level Perturbation Examples ===")
for dim, files in robustness_data["turn"].items():
    dim_full = {"int": "Interesting", "rel": "Relevance",
                "spe": "Specificity", "und": "Understandability"}.get(dim, dim)
    originals = {k: v for k, v in files.items() if "original" in k}
    perturbeds = {k: v for k, v in files.items() if "perturbed" in k}

    for orig_name, orig_lines in originals.items():
        perturb_name = orig_name.replace("original", "perturbed")
        if perturb_name in perturbeds:
            perturbed_lines = perturbeds[perturb_name]
            print(f"\n--- {dim_full}: {orig_name.replace('_text', '')} ---")
            if orig_lines and perturbed_lines:
                orig_parts = orig_lines[0].split("\t")
                pert_parts = perturbed_lines[0].split("\t")
                print(f"  Original ({len(orig_lines)} items):")
                print(f"    ID: {orig_parts[0]}")
                print(f"    Last turn: {orig_parts[-1][:100]}")
                print(f"  Perturbed ({len(perturbed_lines)} items):")
                print(f"    ID: {pert_parts[0]}")
                print(f"    Last turn: {pert_parts[-1][:100]}")

In [None]:
# Analyze text-level changes between original and perturbed
perturbation_stats = []
for dim, files in robustness_data["turn"].items():
    dim_full = {"int": "Interesting", "rel": "Relevance",
                "spe": "Specificity", "und": "Understandability"}.get(dim, dim)
    originals = {k: v for k, v in files.items() if "original" in k}
    perturbeds = {k: v for k, v in files.items() if "perturbed" in k}

    for orig_name, orig_lines in originals.items():
        perturb_name = orig_name.replace("original", "perturbed")
        if perturb_name in perturbeds:
            perturbed_lines = perturbeds[perturb_name]
            n = min(len(orig_lines), len(perturbed_lines))
            orig_lens = [len(l) for l in orig_lines[:n]]
            pert_lens = [len(l) for l in perturbed_lines[:n]]
            perturbation_stats.append({
                "Dimension": dim_full,
                "Type": orig_name.replace("-original_text", "").replace("_text", ""),
                "N Pairs": n,
                "Orig Mean Len": np.mean(orig_lens),
                "Pert Mean Len": np.mean(pert_lens),
                "Len Change %": ((np.mean(pert_lens) - np.mean(orig_lens)) /
                                 (np.mean(orig_lens) + 1e-8) * 100),
            })

if perturbation_stats:
    df_perturb = pd.DataFrame(perturbation_stats)
    print("=== Turn-Level Perturbation Statistics ===")
    print(df_perturb.round(1).to_string(index=False))

    plt.figure(figsize=(10, 5))
    colors_pct = ["coral" if x > 0 else "steelblue"
                  for x in df_perturb["Len Change %"]]
    plt.barh(df_perturb["Dimension"] + " / " + df_perturb["Type"],
             df_perturb["Len Change %"], color=colors_pct, edgecolor="white")
    plt.axvline(x=0, color="black", linewidth=0.8)
    plt.xlabel("Length Change (%)")
    plt.title("Text Length Change: Original vs Perturbed Turns")
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("No original/perturbed pairs found.")

## 8. Turn-Level Deep Analysis

### 8.1 Cross-Dataset Rating Comparison (All Dimensions)

In [None]:
# Build a comprehensive turn-level DataFrame
turn_summary_rows = []
for rating_name, data in turn_ratings.items():
    dim_name = rating_name.replace("turn_", "").capitalize()
    for ds_name, scores in data.items():
        clean = [s for s in scores
                 if s is not None and not (isinstance(s, float) and np.isnan(s))]
        if clean:
            turn_summary_rows.append({
                "Dimension": dim_name,
                "Dataset": ds_name,
                "N": len(clean),
                "Mean": np.mean(clean),
                "Std": np.std(clean),
                "Min": np.min(clean),
                "Max": np.max(clean),
            })

df_turn_all = pd.DataFrame(turn_summary_rows)

# Pivot: Mean rating by Dataset x Dimension
pivot_turn = df_turn_all.pivot_table(index="Dataset", columns="Dimension",
                                      values="Mean")

plt.figure(figsize=(12, 6))
sns.heatmap(pivot_turn, annot=True, fmt=".2f", cmap="YlGnBu", linewidths=0.5)
plt.title("Turn-Level Mean Rating by Dataset and Dimension")
plt.tight_layout()
plt.show()

if "Overall" in pivot_turn.columns:
    print("\nDatasets ranked by Overall mean (highest first):")
    print(pivot_turn["Overall"].sort_values(ascending=False).round(3).to_string())

### 8.2 Turn-Level Dimension Correlations (Per Dataset)

In [None]:
# For each dataset with multiple dimension ratings, compute correlation matrix
dataset_dims = {}
for rating_name, data in turn_ratings.items():
    dim_name = rating_name.replace("turn_", "").capitalize()
    if dim_name == "Ratings":  # skip the combined file
        continue
    for ds_name, scores in data.items():
        if ds_name not in dataset_dims:
            dataset_dims[ds_name] = {}
        dataset_dims[ds_name][dim_name] = scores

# Plot correlation heatmaps for datasets with 3+ dimensions
multi_dim_datasets = {k: v for k, v in dataset_dims.items() if len(v) >= 3}
n_plots = len(multi_dim_datasets)

if n_plots > 0:
    cols = min(3, n_plots)
    rows_fig = (n_plots + cols - 1) // cols
    fig, axes = plt.subplots(rows_fig, cols, figsize=(6 * cols, 5 * rows_fig))
    # Normalize to a flat array so axes[idx] works for any grid shape
    axes = np.atleast_1d(axes).ravel()

    for idx, (ds_name, dim_data) in enumerate(multi_dim_datasets.items()):
        min_len = min(len(v) for v in dim_data.values())
        df_ds = pd.DataFrame({k: v[:min_len] for k, v in dim_data.items()})
        df_ds = df_ds.apply(pd.to_numeric, errors="coerce").dropna()

        if len(df_ds) > 5 and idx < len(axes):
            corr = df_ds.corr(method="spearman")
            sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                        vmin=-1, vmax=1, square=True, linewidths=0.5,
                        ax=axes[idx])
            axes[idx].set_title(f"{ds_name} (n={len(df_ds)})")

    for i in range(n_plots, len(axes)):
        axes[i].set_visible(False)

    plt.suptitle("Turn-Level Dimension Correlations (Spearman) by Dataset",
                 fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()
else:
    print("Not enough multi-dimension datasets for correlation analysis.")

### 8.3 Rating Scale Heterogeneity Across Datasets

In [None]:
# Show how different datasets use different rating scales
scale_rows = []
for ds_name, dim_data in dataset_dims.items():
    all_scores = []
    for scores in dim_data.values():
        all_scores.extend([s for s in scores if s is not None])
    if all_scores:
        all_scores = np.array(all_scores, dtype=float)
        all_scores = all_scores[~np.isnan(all_scores)]
        if len(all_scores) > 0:
            scale_rows.append({
                "Dataset": ds_name,
                "Min": np.min(all_scores),
                "Max": np.max(all_scores),
                "Range": np.max(all_scores) - np.min(all_scores),
                "Unique Values": len(np.unique(all_scores)),
                "Mean": np.mean(all_scores),
                "N Dimensions": len(dim_data),
            })

df_scale = pd.DataFrame(scale_rows).sort_values("Range", ascending=False)
print("=== Rating Scale Summary by Source Dataset ===")
print(df_scale.round(2).to_string(index=False))

fig, ax = plt.subplots(figsize=(10, 5))
for _, row in df_scale.iterrows():
    ax.barh(row["Dataset"], row["Range"], left=row["Min"],
            color="steelblue", alpha=0.7, edgecolor="white")
    ax.plot(row["Mean"], row["Dataset"], "D", color="coral", markersize=8)
ax.set_xlabel("Score")
ax.set_title("Rating Scale Range by Dataset (diamond = mean)")
ax.invert_yaxis()
plt.tight_layout()
plt.show()

## 9. Critic Agent Evaluation Framework

Based on the benchmark data, we build a scoring framework that quantifies
how well an LLM evaluator ("critic agent") performs across key criteria.

### 9.1 Evaluation Criteria

| Criterion | Metric | Weight | Source |
|-----------|--------|--------|--------|
| **Alignment** | Spearman correlation with human ratings (negative values clipped to 0) | 0.30 | Section 5.2 |
| **Consistency** | Agreement across repeated GPT-4 runs (1 − mean std / 2) | 0.20 | Section 5.1 |
| **Discrimination** | Std of per-dialogue mean scores / 2 (ability to differentiate) | 0.15 | Section 5.3 |
| **Calibration** | Scale usage uniformity (normalized entropy over scores 1 to 5) | 0.15 | Section 6.1 |
| **Generalization** | Stability of mean scores across dimensions (1 − coefficient of variation) | 0.10 | Section 5.3 |
| **Low Bias** | 1 − 2 × mean absolute normalized bias | 0.10 | Section 6.2 |
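
To make the aggregation concrete, here is a minimal sketch of how the weights in the table combine into a single overall score. The weights are copied from the table (the sixth criterion is keyed as `Low Bias`, matching the scorecard code); the per-criterion scores and the helper `overall_critic_score` are hypothetical and only illustrate the arithmetic.

```python
# Weights copied from the criteria table; they sum to 1.0.
WEIGHTS = {
    "Alignment": 0.30,
    "Consistency": 0.20,
    "Discrimination": 0.15,
    "Calibration": 0.15,
    "Generalization": 0.10,
    "Low Bias": 0.10,
}


def overall_critic_score(criterion_scores, weights=WEIGHTS):
    """Weighted sum of per-criterion scores, each clipped to [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        score = criterion_scores.get(name, 0.0)  # a missing criterion counts as 0
        total += weight * min(max(score, 0.0), 1.0)
    return total


# Hypothetical scores for illustration only (not benchmark results).
example_scores = {
    "Alignment": 0.50, "Consistency": 0.80, "Discrimination": 0.40,
    "Calibration": 0.60, "Generalization": 0.90, "Low Bias": 0.70,
}
example_overall = overall_critic_score(example_scores)
print(f"Overall: {example_overall:.3f}")  # prints "Overall: 0.620"
```

Because Alignment carries the largest weight, an evaluator that is very consistent but poorly aligned with human ratings still ends up with a modest overall score under this scheme.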

In [None]:
# Compute the Critic Agent Evaluation Scorecard for GPT-4

def compute_critic_scores(gpt4_data, human_data, dim_map, dims, bias_df):
    """Compute multi-dimensional critic agent evaluation scores."""
    scores = {}

    # --- 1. Alignment (Spearman with human) ---
    alignment_scores = []
    for gpt4_dim, human_dim in dim_map.items():
        rater_scores = [gpt4_data[f"{gpt4_dim}_{r}"] for r in range(1, 6)
                        if f"{gpt4_dim}_{r}" in gpt4_data]
        if not rater_scores:
            continue
        gpt4_mean = np.mean(rater_scores, axis=0)
        human_mean = [
            np.mean(item["annotations"][human_dim])
            if human_dim in item["annotations"] else np.nan
            for item in human_data
        ]
        min_len = min(len(gpt4_mean), len(human_mean))
        g, h = gpt4_mean[:min_len], np.array(human_mean[:min_len])
        mask = ~np.isnan(h)
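        if np.sum(mask) >= 5:
            rho, _ = stats.spearmanr(h[mask], g[mask])
            # spearmanr returns NaN when either input is constant;
            # skip those dimensions so the mean below stays finite
            if not np.isnan(rho):
                alignment_scores.append(max(rho, 0))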
    scores["Alignment"] = np.mean(alignment_scores) if alignment_scores else 0

    # --- 2. Consistency (1 - normalized mean std across raters) ---
    consistency_scores = []
    for dim_key in dims:
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                      if f"{dim_key}_{r}" in gpt4_data]
        if len(rater_cols) < 2:
            continue
        rater_matrix = np.array([gpt4_data[c] for c in rater_cols])
        mean_std = np.mean(np.std(rater_matrix, axis=0))
        consistency_scores.append(1 - min(mean_std / 2.0, 1.0))
    scores["Consistency"] = (np.mean(consistency_scores)
                             if consistency_scores else 0)

    # --- 3. Discrimination (normalized std of mean scores) ---
    disc_scores = []
    for dim_key in dims:
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                      if f"{dim_key}_{r}" in gpt4_data]
        if not rater_cols:
            continue
        mean_per_dialog = np.mean([gpt4_data[c] for c in rater_cols], axis=0)
        disc_scores.append(min(np.std(mean_per_dialog) / 2.0, 1.0))
    scores["Discrimination"] = np.mean(disc_scores) if disc_scores else 0

    # --- 4. Calibration (scale usage uniformity via entropy) ---
    all_scores_flat = []
    for col in score_columns(gpt4_data):     # <-- filter out non-score columns
        all_scores_flat.extend(gpt4_data[col])
    all_scores_flat = np.array(all_scores_flat)
    counts = np.array([np.sum(all_scores_flat == s) for s in [1, 2, 3, 4, 5]])
    counts = counts / (counts.sum() + 1e-8)
    entropy = -np.sum(counts * np.log(counts + 1e-8))
    max_entropy = np.log(5)
    scores["Calibration"] = entropy / max_entropy

    # --- 5. Generalization (1 - CoV of means across dims) ---
    dim_means = []
    for dim_key in dims:
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                      if f"{dim_key}_{r}" in gpt4_data]
        if rater_cols:
            dim_means.append(
                np.mean([np.mean(gpt4_data[c]) for c in rater_cols]))
    if dim_means and np.mean(dim_means) > 0:
        cov = np.std(dim_means) / np.mean(dim_means)
        scores["Generalization"] = max(1 - cov, 0)
    else:
        scores["Generalization"] = 0

    # --- 6. Low Bias (1 - 2 * mean absolute normalized bias, floored at 0) ---
    if bias_df is not None and len(bias_df) > 0:
        mean_abs_bias = np.mean(np.abs(bias_df["Normalized Bias"]))
        scores["Low Bias"] = max(1 - mean_abs_bias * 2, 0)
    else:
        scores["Low Bias"] = 0.5

    return scores


critic_scores = compute_critic_scores(
    fed_gpt4, fed_human, dim_mapping, list(dimensions.keys()), df_bias
)

# Display scorecard
weights = {"Alignment": 0.30, "Consistency": 0.20, "Discrimination": 0.15,
           "Calibration": 0.15, "Generalization": 0.10, "Low Bias": 0.10}

scorecard_rows = []
for criterion, score in critic_scores.items():
    w = weights.get(criterion, 0)
    scorecard_rows.append({
        "Criterion": criterion,
        "Score (0-1)": score,
        "Weight": w,
        "Weighted": score * w,
    })

df_scorecard = pd.DataFrame(scorecard_rows)
overall_score = df_scorecard["Weighted"].sum()

print("=== GPT-4 Critic Agent Scorecard (FED Dataset) ===")
print(df_scorecard.round(3).to_string(index=False))
print(f"\nOverall Critic Score: {overall_score:.3f} / 1.000")

In [None]:
# Radar chart of critic agent scores
labels = list(critic_scores.keys())
values = list(critic_scores.values())
values += values[:1]  # close the polygon

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
ax.fill(angles, values, alpha=0.2, color="steelblue")
ax.plot(angles, values, "o-", linewidth=2, color="steelblue", markersize=8)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels, fontsize=11)
ax.set_ylim(0, 1)
ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_yticklabels(["0.2", "0.4", "0.6", "0.8", "1.0"], fontsize=8)
ax.set_title(f"GPT-4 Critic Agent Profile\n(Overall: {overall_score:.3f})",
             fontsize=14, pad=20)
plt.tight_layout()
plt.show()

### 9.2 Multi-Dataset Critic Scorecard

Compute a partial scorecard for GPT-4 on every available dataset. Only
Consistency, Discrimination, and Calibration are included, because these
three can be computed without human ratings.
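
As a rough illustration of why these metrics need no human labels, the sketch below computes Consistency and Discrimination on a tiny made-up matrix of repeated runs. The function names, the `half_range` parameter, and the toy data are assumptions for this example; the divisor 2.0 mirrors the normalization used in the scorecard code, and `statistics.pstdev` matches the population standard deviation that `np.std` returns by default.

```python
from statistics import mean, pstdev


def consistency(rater_matrix, half_range=2.0):
    """1 - (mean per-dialogue std across runs) / half_range, floored at 0."""
    per_item_std = [pstdev(item) for item in zip(*rater_matrix)]
    return 1.0 - min(mean(per_item_std) / half_range, 1.0)


def discrimination(rater_matrix, half_range=2.0):
    """Std of the per-dialogue mean scores / half_range, capped at 1."""
    per_item_mean = [mean(item) for item in zip(*rater_matrix)]
    return min(pstdev(per_item_mean) / half_range, 1.0)


# Hypothetical toy data: 3 runs (rows) scoring 4 dialogues (columns) on a 1-5 scale.
toy_runs = [
    [1, 3, 5, 4],
    [1, 3, 5, 4],
    [1, 3, 5, 2],  # the runs disagree only on the last dialogue
]
toy_consistency = consistency(toy_runs)
toy_discrimination = discrimination(toy_runs)
print(f"Consistency: {toy_consistency:.3f}, Discrimination: {toy_discrimination:.3f}")
```

The two metrics pull in different directions: an evaluator that always outputs the same score is perfectly consistent but has zero discrimination, which is why the scorecard reports both.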

In [None]:
# Compute partial scorecards for each GPT-4 dataset
multi_ds_scores = []
for ds_name, data in gpt4_dialog.items():
    ds_scores = {}

    # Consistency
    consistency_vals = []
    for dim_key in dimensions:
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                      if f"{dim_key}_{r}" in data]
        if len(rater_cols) >= 2:
            rater_matrix = np.array([data[c] for c in rater_cols])
            mean_std = np.mean(np.std(rater_matrix, axis=0))
            consistency_vals.append(1 - min(mean_std / 2.0, 1.0))
    ds_scores["Consistency"] = (np.mean(consistency_vals)
                                if consistency_vals else 0)

    # Discrimination
    disc_vals = []
    for dim_key in dimensions:
        rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                      if f"{dim_key}_{r}" in data]
        if rater_cols:
            mean_per_dialog = np.mean([data[c] for c in rater_cols], axis=0)
            disc_vals.append(min(np.std(mean_per_dialog) / 2.0, 1.0))
    ds_scores["Discrimination"] = np.mean(disc_vals) if disc_vals else 0

    # Calibration (use only score columns, not "indices")
    all_s = []
    for col in score_columns(data):          # <-- filter out non-score columns
        all_s.extend(data[col])
    all_s = np.array(all_s)
    counts = np.array([np.sum(all_s == s) for s in [1, 2, 3, 4, 5]])
    counts = counts / (counts.sum() + 1e-8)
    entropy = -np.sum(counts * np.log(counts + 1e-8))
    ds_scores["Calibration"] = entropy / np.log(5)

    multi_ds_scores.append({"Dataset": ds_name, **ds_scores})

df_multi = pd.DataFrame(multi_ds_scores)
print("=== GPT-4 Critic Scores Across Datasets ===")
print(df_multi.round(3).to_string(index=False))

df_multi_plot = df_multi.set_index("Dataset")
df_multi_plot.plot(kind="bar", figsize=(12, 5), colormap="Set2", edgecolor="white")
plt.title("GPT-4 Critic Agent Scores Across Datasets")
plt.ylabel("Score (0-1)")
plt.xticks(rotation=15)
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

## 10. Dimension-Level Deep Dive: What Makes a Good Evaluator?

### 10.1 Per-Dimension Rater Correlation Matrix (FED)

In [None]:
# Build rater-level correlation matrices for each dimension
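# Size the grid from `dimensions` instead of assuming exactly five entries
n_dims = len(dimensions)
fig, axes = plt.subplots(1, n_dims, figsize=(4.8 * n_dims, 4), squeeze=False)
axes = axes.ravel()

for idx, (dim_key, dim_label) in enumerate(dimensions.items()):
    rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                  if f"{dim_key}_{r}" in fed_gpt4]
    if len(rater_cols) < 2:
        axes[idx].set_visible(False)  # nothing to correlate for this dimension
        continue
    df_raters = pd.DataFrame({
        f"R{r}": fed_gpt4[f"{dim_key}_{r}"]
        for r in range(1, 6) if f"{dim_key}_{r}" in fed_gpt4
    })
    corr = df_raters.corr(method="spearman")
    # Draw the colorbar only on the last panel to save horizontal space
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="YlOrRd", vmin=0, vmax=1,
                square=True, linewidths=0.5, ax=axes[idx],
                cbar=(idx == n_dims - 1))
    axes[idx].set_title(dim_label)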

plt.suptitle("GPT-4 Pairwise Rater Correlation (Spearman) by Dimension",
             fontsize=14, y=1.05)
plt.tight_layout()
plt.show()

### 10.2 Score Stability: Coefficient of Variation per Dialogue

In [None]:
# For each dialogue, compute the Coefficient of Variation across 5 raters
cv_data = []
for dim_key, dim_label in dimensions.items():
    rater_cols = [f"{dim_key}_{r}" for r in range(1, 6)
                  if f"{dim_key}_{r}" in fed_gpt4]
    if len(rater_cols) < 2:
        continue
    rater_matrix = np.array([fed_gpt4[c] for c in rater_cols])  # (5, N)
    means = np.mean(rater_matrix, axis=0)
    stds = np.std(rater_matrix, axis=0)
    cvs = stds / (means + 1e-8)
    for cv_val, mean_val in zip(cvs, means):
        cv_data.append({"Dimension": dim_label, "CV": cv_val,
                        "Mean Score": mean_val})

df_cv = pd.DataFrame(cv_data)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# CV distribution by dimension
sns.boxplot(data=df_cv, x="Dimension", y="CV", hue="Dimension",
            palette="Set2", legend=False, ax=axes[0])
axes[0].set_title("Score Instability (CV) by Dimension")
axes[0].set_ylabel("Coefficient of Variation")

# CV vs Mean Score scatter
for dim_label in dimensions.values():
    subset = df_cv[df_cv["Dimension"] == dim_label]
    axes[1].scatter(subset["Mean Score"], subset["CV"],
                    alpha=0.3, s=20, label=dim_label)
axes[1].set_xlabel("Mean GPT-4 Score")
axes[1].set_ylabel("Coefficient of Variation")
axes[1].set_title("Score Instability vs Score Level")
axes[1].legend(fontsize=8)

plt.tight_layout()
plt.show()

print("Mean CV by dimension (lower = more consistent):")
print(df_cv.groupby("Dimension")["CV"].agg(
    ["mean", "median", "max"]).round(3).to_string())

## 11. Summary & Key Findings

In [None]:
print("=" * 70)
print("EVALUATOR AGENT BENCHMARK - SUMMARY")
print("=" * 70)

print("\n[Data Scope]")
print(f"  Dialog-level GPT-4 annotation sets: {len(gpt4_dialog)}")
for name, data in gpt4_dialog.items():
    n = len(list(data.values())[0])
    print(f"    {name}: {n} dialogues, {len(data)} columns")
print(f"  Dialog-level human annotations (FED): {len(fed_human)} dialogues")
print(f"  Turn-level rating files: {len(turn_ratings)}")
print(f"  Robustness perturbation types: "
      f"{sum(len(v) for v in robustness_data['dialog'].values())} dialog-level, "
      f"{sum(len(v) for v in robustness_data['turn'].values())} turn-level")

print("\n[GPT-4 vs Human Alignment (FED)]")
if len(df_corr) > 0:
    for _, row in df_corr.iterrows():
        sig = ("***" if row["Spearman p"] < 0.001 else
               "**" if row["Spearman p"] < 0.01 else
               "*" if row["Spearman p"] < 0.05 else "ns")
        print(f"  {row['GPT-4 Dim']:18s}: Spearman={row['Spearman r']:.3f}{sig}, "
              f"Pearson={row['Pearson r']:.3f}")

print(f"\n[Critic Agent Overall Score]: {overall_score:.3f} / 1.000")
for _, row in df_scorecard.iterrows():
    bar = '#' * int(row['Score (0-1)'] * 20)
    print(f"  {row['Criterion']:18s}: {row['Score (0-1)']:.3f} [{bar:<20s}]")

print("\n[Key Takeaways]")
print("  1. GPT-4 shows moderate consistency (mean CV varies by dimension)")
print("  2. Human-LLM alignment is dimension-dependent")
print("  3. Score distributions reveal calibration patterns (scale usage bias)")
print("  4. Robustness data enables perturbation-sensitivity testing")
print("  5. The critic scoring framework provides a multi-criteria evaluation")

## 12. Key Observations

1. **Dimension-dependent alignment:** GPT-4's correlation with human ratings
   varies significantly across quality dimensions. Some dimensions (e.g., Coherence)
   are easier for LLMs to evaluate than others (e.g., Engagement/Likeability).

2. **Systematic scoring biases:** GPT-4 exhibits measurable positivity or negativity
   bias depending on the dimension, with a tendency to compress the score range
   compared to human annotators.

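3. **Intra-model variance is non-trivial:** Running the same GPT-4 model
   5 times on the same dialogue produces different scores. The Coefficient of
   Variation is highest for low-scoring dialogues, which partly reflects the
   metric itself: CV divides the standard deviation by the mean, so a given
   spread weighs more at low scores.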

4. **Cross-dataset generalization gaps:** GPT-4's evaluator quality metrics
   (consistency, discrimination, calibration) shift substantially across
   source datasets, suggesting evaluator performance is task-dependent.

5. **Robustness as a critical dimension:** The robustness perturbation data
   (order shuffling, repetition, contradiction) provides a way to test whether
   evaluators detect quality degradation, which is essential for reliable
   critic agents.

6. **Multi-criteria scoring framework:** The critic agent scorecard (Alignment,
   Consistency, Discrimination, Calibration, Generalization, Bias) provides a
   holistic view of evaluator quality beyond simple correlation.

7. **Research relevance (IS/AI):**
   - **Critic agents:** Quantify LLM evaluator reliability for agent-based systems
   - **Evaluation automation:** Identify which quality dimensions can be reliably automated
   - **Calibration methods:** Design post-hoc calibration to reduce GPT-4 scoring bias
   - **Ensemble evaluation:** Combine multiple LLM runs to reduce variance
   - **Robustness testing:** Validate evaluator sensitivity to quality perturbations
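
The ensemble-evaluation idea in the observations above can be sketched with a small simulation. Everything here is an assumption made for illustration: the noise model (independent Gaussian noise around a fixed "true" score), the parameter values, and the helper `simulate_run`. Real repeated LLM runs may be correlated, and averaging does not remove a systematic bias, so treat this as a sketch of the variance argument only.

```python
import random
from statistics import mean, pstdev


def simulate_run(true_score, noise_sd, rng):
    """One noisy evaluator run, rounded and clipped to the 1-5 scale."""
    return min(5, max(1, round(rng.gauss(true_score, noise_sd))))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_score, noise_sd, n_trials, k = 3.0, 1.0, 2000, 5  # all hypothetical

# Spread of a single run vs. spread of the mean of k runs.
single_runs = [simulate_run(true_score, noise_sd, rng) for _ in range(n_trials)]
ensembles = [
    mean(simulate_run(true_score, noise_sd, rng) for _ in range(k))
    for _ in range(n_trials)
]

single_sd = pstdev(single_runs)
ensemble_sd = pstdev(ensembles)
print(f"single-run SD: {single_sd:.2f}, mean-of-{k} SD: {ensemble_sd:.2f}")
```

Under the independence assumption the spread of the averaged score shrinks roughly with the square root of the number of runs, which is the motivation for averaging the five GPT-4 runs before comparing against human ratings.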