# Judge Evaluation Review Notebook

This notebook inspects evaluation results to understand how the judge scored model responses. It analyzes judge scores and citation metrics, and visualizes both.

## Usage

**Option 1: Analyze specific results directory**
- Set `RESULTS_DIR` in the code cell below (e.g., `"rag_baseline_0p5b"`)
- The notebook will look for `metrics.json` in `evaluation/results/{RESULTS_DIR}/`

**Option 2: Analyze specific metrics file**
- Set `METRICS_PATH` directly to point to any `metrics.json` file

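**Default**: If neither is set, the notebook falls back to the legacy path `evaluation/results/subset20/rag_baseline.json`; update it for current experiments. If both are set, `RESULTS_DIR` takes precedence.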

## Setup
Load the summary JSON, unpack the per-example records, and derive helper columns that capture citation behaviour.

In [None]:
import json
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

plt.style.use("ggplot")

In [None]:
def find_project_root(markers=("pyproject.toml", ".git")):
    """Return the repository root by looking for a known anchor file."""
    current = Path.cwd().resolve()
    for candidate in (current, *current.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(
        "Unable to locate project root; run this notebook from inside the repository"
    )


PROJECT_ROOT = find_project_root()

# Configuration: Set the results directory or metrics file path
# Option 1: Specify a results directory (will look for metrics.json inside)
RESULTS_DIR = None  # e.g., "rag_baseline_0p5b", "lora_science_0p5b_ft_only", "hybrid_science_0p5b"

# Option 2: Specify full path to metrics.json file directly
METRICS_PATH = None  # e.g., Path("evaluation/results/rag_baseline_0p5b/metrics.json")

# Resolve the metrics file; RESULTS_DIR takes precedence if both options are set
if RESULTS_DIR:
    summary_path = PROJECT_ROOT / "evaluation/results" / RESULTS_DIR / "metrics.json"
elif METRICS_PATH:
    # Accepts str or Path; relative paths resolve against the project root,
    # absolute paths are used unchanged
    summary_path = PROJECT_ROOT / Path(METRICS_PATH)
else:
    # Legacy fallback - update this for your current experiments
    RESULTS_ROOT = PROJECT_ROOT / "evaluation/results/subset20"
    summary_path = RESULTS_ROOT / "rag_baseline.json"
    print(f"Warning: Using legacy path {summary_path}")
    print("Update RESULTS_DIR or METRICS_PATH in this cell to analyze current experiments")

if not summary_path.exists():
    raise FileNotFoundError(
        f"Metrics file not found at {summary_path}\n"
        "Update RESULTS_DIR or METRICS_PATH in this cell to point to your evaluation results."
    )

print(f"Analyzing results from: {summary_path}")

In [None]:
with summary_path.open(encoding="utf-8") as fh:
    payload = json.load(fh)

summary = payload["summary"]
examples = payload["examples"]
df = pd.DataFrame(examples)
df["mean_coverage"] = df["citation_metrics"].apply(
    lambda row: row.get("mean_coverage", float("nan"))
)
df["missing_citations"] = df["citation_metrics"].apply(lambda row: len(row.get("missing", [])))
df["extra_citations"] = df["citation_metrics"].apply(lambda row: len(row.get("extra", [])))
df["referenced_citations"] = df["citation_metrics"].apply(
    lambda row: len(row.get("referenced", []))
)


def extract_score(scores, target):
    if isinstance(scores, dict):
        value = scores.get(target)
        if value is not None:
            return float(value)
    return float("nan")


score_keys = ["factuality", "grounding", "completeness", "communication"]
for key in score_keys:
    column_name = f"score_{key}"
    df[column_name] = df["judge_scores"].apply(extract_score, target=key)
df["verdict"] = df["judge_verdict"].fillna("").str.lower()
df.head()
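
The loading cell reads `citation_metrics`, `judge_scores`, and `judge_verdict` from every record, and later cells also use `task_id`, `task_type`, and `model_answer`. The sketch below shows the payload shape those cells appear to assume, together with a small check that reports absent keys before the DataFrame is built. The helper name `missing_example_keys` and all sample values are hypothetical; the key list is inferred from the code in this notebook, not from a documented schema.

```python
# Keys the notebook cells read from each example record (inferred from the
# notebook code, not from a documented schema).
REQUIRED_EXAMPLE_KEYS = {
    "task_id",
    "task_type",
    "model_answer",
    "citation_metrics",
    "judge_scores",
    "judge_verdict",
}


def missing_example_keys(examples, required=REQUIRED_EXAMPLE_KEYS):
    """Map each record index to the sorted list of required keys it lacks."""
    problems = {}
    for index, record in enumerate(examples):
        absent = sorted(required - set(record))
        if absent:
            problems[index] = absent
    return problems


# Hypothetical payload illustrating the assumed structure.
sample_payload = {
    "summary": {},  # top-level run context; contents not assumed here
    "examples": [
        {
            "task_id": "t-001",
            "task_type": "qa",
            "model_answer": "Example answer [1].",
            "citation_metrics": {
                "mean_coverage": 0.5,
                "missing": [],
                "extra": [],
                "referenced": ["1"],
            },
            "judge_scores": {"factuality": 0.8, "grounding": 0.6},
            "judge_verdict": "pass",
        },
        # Incomplete record: would raise KeyError in the loading cell.
        {"task_id": "t-002", "task_type": "qa", "model_answer": "No citations."},
    ],
}

print(missing_example_keys(sample_payload["examples"]))
# {1: ['citation_metrics', 'judge_scores', 'judge_verdict']}
```

In the notebook itself, the same check could be run as `missing_example_keys(examples)` right after `json.load`.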

## Run Summary
The JSON summary provides the top-level evaluation context.

In [None]:
pd.DataFrame([summary])

## Judge Score Summary
Summary statistics for the structured judge rubric scores, plus the overall pass rate.

**Dimension Glossary**
- **Factuality**: penalises hallucinated or contradictory claims; rewards scientifically correct statements.
- **Grounding**: checks that answers rely on the provided sources rather than speculation.
- **Completeness**: captures how fully the response covers each question requirement.
- **Communication**: reflects clarity, structure, and adherence to the requested style.

In [None]:
score_columns = [col for col in df.columns if col.startswith("score_")]
score_overview = (
    df[score_columns]
    .agg(["mean", "median", "std", "min", "max"])
    .T.rename_axis("dimension")
    .reset_index()
)
score_overview["dimension"] = score_overview["dimension"].str.replace("score_", "").str.title()
display(score_overview.round(3))

pass_rate = df["verdict"].eq("pass").mean()
print(f"Pass rate: {pass_rate:.2%}")

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
df[score_columns].boxplot(ax=ax)
ax.set_title("Judge score distribution by dimension")
ax.set_ylabel("Score")
ax.set_xlabel("Dimension")
ax.set_ylim(0, 1)
fig.tight_layout()
plt.show()

## Citation Coverage by Task Type
Aggregate citation handling quality for each task class.

In [None]:
coverage_stats = (
    df.groupby("task_type")
    .agg(
        examples=("task_id", "count"),
        mean_coverage=("mean_coverage", "mean"),
        median_coverage=("mean_coverage", "median"),
        avg_missing=("missing_citations", "mean"),
        avg_extra=("extra_citations", "mean"),
    )
    .sort_values("mean_coverage", ascending=False)
)
coverage_stats

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
coverage_stats["mean_coverage"].plot(kind="bar", ax=ax, color="#4c72b0")
ax.set_ylabel("Mean citation coverage")
ax.set_xlabel("Task type")
ax.set_ylim(0, 1)
ax.set_title("Citation coverage by task type")
fig.tight_layout()
plt.show()

## Distribution of Missing Citations
Visualise how often the model failed to cite required references.

In [None]:
fig, ax = plt.subplots(figsize=(8, 4))
max_missing = df["missing_citations"].max()
# One bin per integer count; fall back to 5 bins when nothing is missing or the frame is empty
bins = range(0, int(max_missing) + 2) if max_missing > 0 else 5
df["missing_citations"].plot(kind="hist", bins=bins, ax=ax, color="#dd8452")
ax.set_xlabel("Missing citations per example")
ax.set_ylabel("Example count")
ax.set_title("Distribution of missing citations")
fig.tight_layout()
plt.show()

## Classification-Style Metrics
Translate the citation outcomes into binary labels so we can compute accuracy metrics. The reference label (`y_true`) marks an example as a success when it includes at least one citation and has no missing or extra references flagged. The predicted label (`y_pred`) is positive when the example's mean citation coverage meets the threshold below, so the metrics measure how well a coverage cutoff alone recovers the success label.

In [None]:
SUCCESS_THRESHOLD = 0.10  # raise for a stricter cutoff, lower for a more lenient one
df["citation_success"] = (
    (df["referenced_citations"] > 0) & (df["missing_citations"] == 0) & (df["extra_citations"] == 0)
)
df["citation_score"] = df["mean_coverage"].fillna(0.0)

y_true = df["citation_success"].astype(int)
y_score = df["citation_score"]
y_pred = (y_score >= SUCCESS_THRESHOLD).astype(int)

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, zero_division=0),
    "recall": recall_score(y_true, y_pred, zero_division=0),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}

try:
    metrics["roc_auc"] = roc_auc_score(y_true, y_score)
except ValueError:
    metrics["roc_auc"] = float("nan")

metrics_series = pd.Series(metrics, name="value")
non_zero_metrics = metrics_series[~np.isclose(metrics_series, 0.0, atol=1e-6)].dropna()

if non_zero_metrics.empty:
    print("No non-zero metrics to report at the current threshold.")
else:
    display(non_zero_metrics.round(3))

In [None]:
confusion = pd.crosstab(
    y_true, y_pred, rownames=["Actual success"], colnames=["Predicted success"], dropna=False
)
confusion
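
Because the predicted label depends entirely on `SUCCESS_THRESHOLD`, it can help to see how precision and recall move as the cutoff changes. The sketch below recomputes both by hand on toy data. The function name `precision_recall_at` and the sample values are illustrative and not part of the notebook; plain Python is used in place of scikit-learn so the arithmetic stays visible.

```python
def precision_recall_at(y_true, y_score, threshold):
    """Return (precision, recall) when scores >= threshold count as positive.

    Mirrors zero_division=0: an undefined ratio is reported as 0.0.
    """
    y_pred = [int(score >= threshold) for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and coverage scores, for illustration only.
toy_true = [1, 1, 0, 0, 1]
toy_score = [0.9, 0.4, 0.3, 0.05, 0.0]

for threshold in (0.10, 0.35, 0.50):
    precision, recall = precision_recall_at(toy_true, toy_score, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.10 precision=0.67 recall=0.67
# threshold=0.35 precision=1.00 recall=0.67
# threshold=0.50 precision=1.00 recall=0.33
```

With the notebook's own data, the call would be `precision_recall_at(y_true.tolist(), y_score.tolist(), SUCCESS_THRESHOLD)`.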

## Judge Score Signals
Inspect the raw judge score payloads to see which structured fields (if any) were populated.

In [None]:
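# Use a separate name so the rubric list in `score_keys` from the setup cell is not overwritten
observed_score_keys = sorted(
    {key for scores in df["judge_scores"] if isinstance(scores, dict) for key in scores}
)
observed_score_keys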
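
Listing the keys shows which fields exist, but not how often each one is filled. A possible follow-up, sketched here with a hypothetical helper `count_populated_scores` and made-up records, tallies the examples in which each field holds a non-null value.

```python
from collections import Counter


def count_populated_scores(score_payloads):
    """Count, per key, how many payloads hold a non-None value for it.

    Non-dict payloads (for example None when the judge returned nothing)
    are skipped, matching the isinstance check used in this notebook.
    """
    counts = Counter()
    for scores in score_payloads:
        if not isinstance(scores, dict):
            continue
        for key, value in scores.items():
            if value is not None:
                counts[key] += 1
    return dict(counts)


# Made-up payloads for illustration only.
toy_payloads = [
    {"factuality": 0.8, "grounding": 0.5},
    {"factuality": None, "grounding": 0.7},
    None,
    {"factuality": 0.9},
]

print(count_populated_scores(toy_payloads))
# {'factuality': 2, 'grounding': 2}
```

In the notebook, `count_populated_scores(df["judge_scores"])` would give the same tally for the loaded results.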

## Lowest-Coverage Examples
Review the examples with the weakest citation coverage to understand qualitative failure modes.

In [None]:
columns_to_show = [
    "task_id",
    "task_type",
    "mean_coverage",
    "missing_citations",
    "extra_citations",
    "model_answer",
]
df.sort_values("mean_coverage").head(5)[columns_to_show]