# LLM-as-Judge: Evaluation Pipeline from Scratch

## Background

**LLM-as-Judge** is a paradigm where a language model evaluates the outputs of other language models. This is central to companies like Vals AI that build LLM benchmarking infrastructure.

### Why LLM-as-Judge?

Traditional evaluation methods (BLEU, ROUGE, exact match) fail to capture semantic correctness for open-ended generation. Human evaluation is expensive and slow. LLM-as-Judge sits in between: cheaper than humans, more nuanced than string matching.

### Key Challenges

1. **Calibration**: Does the judge's score distribution match human expectations?
2. **Consistency**: Does the same input get the same score across runs / prompt variants?
3. **Bias**: Position bias, verbosity bias, self-preference bias
4. **Cost/Accuracy tradeoff**: Stronger judges tend to be more accurate, but they cost more per evaluation and run slower

### What We Build

This notebook implements a complete LLM-as-Judge evaluation pipeline:
1. Rubric-based scoring with structured prompts
2. Batch evaluation over synthetic datasets
3. Inter-rater agreement metrics (Cohen's kappa, confusion matrices, F1) from scratch
4. Judge sensitivity analysis (flip rate, Krippendorff's alpha)
5. Cost/latency/accuracy tradeoff analysis with Pareto frontiers

**Note**: All judge scoring uses deterministic mock judges (no API keys, no external services).

### References
- [Zheng et al. - Judging LLM-as-a-Judge (2023)](https://arxiv.org/abs/2306.05685)
- [Cohen's Kappa on Wikipedia](https://en.wikipedia.org/wiki/Cohen%27s_kappa)
- [Krippendorff's Alpha on Wikipedia](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha)

In [None]:
from dataclasses import dataclass, field
from typing import Optional, Callable
import numpy as np
import re
import torch
import matplotlib.pyplot as plt

np.random.seed(42)
torch.manual_seed(42)

---
## Part 1: Rubric-Based Scoring (Foundations)

The foundation of LLM-as-Judge is a well-structured rubric that tells the judge *how* to evaluate. A rubric consists of:
- A system prompt defining the judge's role
- Criteria to evaluate against
- A scoring scale with clear anchors
- A structured output format for reliable parsing

We define dataclasses for the rubric, the judge's result, and evaluation samples.
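
As a minimal sketch of what "a scoring scale with clear anchors" could look like, the snippet below attaches a short description to individual scores and renders them as prompt text. `AnchoredScale` and its anchor wording are hypothetical and are not used elsewhere in this notebook, which only passes a `(min_score, max_score)` tuple.

```python
from dataclasses import dataclass


@dataclass
class AnchoredScale:
    """Hypothetical helper: a scoring scale with a description per anchored score."""
    anchors: dict[int, str]  # score -> what that score is supposed to mean

    def render(self) -> str:
        # List anchors from lowest to highest so the judge sees an ordered scale.
        lines = [f"  {score}: {text}" for score, text in sorted(self.anchors.items())]
        lo, hi = min(self.anchors), max(self.anchors)
        return f"Rate from {lo} (worst) to {hi} (best):\n" + "\n".join(lines)


# Illustrative anchors only; real rubrics would be written per task.
scale = AnchoredScale(anchors={
    5: "Final answer matches the reference exactly.",
    3: "Final answer is close to the reference but not equal.",
    1: "Final answer is wrong or missing.",
})
print(scale.render())
```

Anchoring only some scores (here 1, 3, 5) is a design choice in this sketch; a rubric could equally describe every point on the scale.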

In [None]:
@dataclass
class JudgePrompt:
    """Configuration for an LLM judge's evaluation prompt."""
    system_prompt: str
    rubric_criteria: list[str]
    scoring_scale: tuple[int, int]  # (min_score, max_score)
    output_format: str  # e.g. "Score: {score}\nReasoning: {reasoning}"


@dataclass
class JudgeResult:
    """Output from a judge evaluation."""
    score: int
    reasoning: str
    raw_response: str


@dataclass
class EvalSample:
    """A single evaluation sample with ground truth."""
    question: str
    reference_answer: str
    model_output: str
    human_label: int  # 1 = correct, 0 = incorrect

In [None]:
def format_judge_prompt(
    question: str,
    reference: str,
    candidate: str,
    rubric: JudgePrompt,
) -> str:
    """Render a full grading prompt from the rubric template.

    Combines the system prompt, rubric criteria, scoring scale,
    and the actual question/reference/candidate into a single
    prompt string ready for the judge.
    """
    criteria_str = "\n".join(f"  - {c}" for c in rubric.rubric_criteria)
    min_s, max_s = rubric.scoring_scale

    prompt = (
        f"{rubric.system_prompt}\n\n"
        f"## Evaluation Criteria\n{criteria_str}\n\n"
        f"## Scoring Scale\n"
        f"Rate from {min_s} (worst) to {max_s} (best).\n\n"
        f"## Question\n{question}\n\n"
        f"## Reference Answer\n{reference}\n\n"
        f"## Candidate Answer\n{candidate}\n\n"
        f"## Output Format\n{rubric.output_format}"
    )
    return prompt


def parse_judge_response(response_text: str) -> JudgeResult:
    """Extract score and reasoning from structured judge output.

    Expects format like:
        Score: 4
        Reasoning: The answer is mostly correct but...

    Falls back to score=0 (a sentinel that lies below any scale starting
    at 1) with error reasoning if parsing fails.
    """
    score_match = re.search(r"Score:\s*(\d+)", response_text)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", response_text, re.DOTALL)

    if score_match is None:
        return JudgeResult(
            score=0,
            reasoning="Failed to parse score from response",
            raw_response=response_text,
        )

    score = int(score_match.group(1))
    reasoning = reasoning_match.group(1).strip() if reasoning_match else "No reasoning provided"

    return JudgeResult(score=score, reasoning=reasoning, raw_response=response_text)

In [None]:
# ----- Tests for Part 1 -----

# Test JudgePrompt construction
rubric = JudgePrompt(
    system_prompt="You are a math grading assistant.",
    rubric_criteria=["Correctness of final answer", "Clarity of reasoning"],
    scoring_scale=(1, 5),
    output_format="Score: {score}\nReasoning: {reasoning}",
)
assert rubric.scoring_scale == (1, 5)
assert len(rubric.rubric_criteria) == 2

# Test format_judge_prompt round-trip
prompt = format_judge_prompt(
    question="What is 2 + 2?",
    reference="4",
    candidate="4",
    rubric=rubric,
)
assert "What is 2 + 2?" in prompt
assert "Reference Answer" in prompt
assert "Candidate Answer" in prompt
assert "Correctness of final answer" in prompt
assert "1 (worst) to 5 (best)" in prompt
print("Format test passed.")

# Test parse_judge_response - well-formed
good_response = "Score: 4\nReasoning: The answer is correct and clearly stated."
result = parse_judge_response(good_response)
assert result.score == 4
assert "correct" in result.reasoning
assert result.raw_response == good_response
print("Parse (good) test passed.")

# Test parse_judge_response - malformed (no score)
bad_response = "I think this is pretty good overall."
result_bad = parse_judge_response(bad_response)
assert result_bad.score == 0
assert "Failed to parse" in result_bad.reasoning
print("Parse (malformed) test passed.")

# Test parse_judge_response - score but no reasoning
partial_response = "Score: 3"
result_partial = parse_judge_response(partial_response)
assert result_partial.score == 3
assert result_partial.reasoning == "No reasoning provided"
print("Parse (partial) test passed.")

print("\nAll Part 1 tests passed.")

---
## Part 2: Evaluation Dataset & Batch Scoring

We generate a synthetic math QA dataset with known correct/incorrect answers, then build a deterministic mock judge that scores based on string similarity.

This simulates the real pipeline: dataset of (question, reference, candidate) triples with human labels, scored by an automated judge.
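
For orientation, here is one way a real model call might be slotted in where the mock judge sits. This is a sketch under stated assumptions: `make_llm_judge`, `ParsedScore`, and `fake_model` are hypothetical names, and `call_model` stands in for whatever text-in/text-out client would be used. No real API is called.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ParsedScore:
    score: int
    reasoning: str


def make_llm_judge(call_model: Callable[[str], str]) -> Callable[[str, str, str], ParsedScore]:
    """Wrap any text-in/text-out callable as a judge (hypothetical adapter)."""
    def judge_fn(question: str, reference: str, candidate: str) -> ParsedScore:
        prompt = (
            f"Question: {question}\n"
            f"Reference: {reference}\n"
            f"Candidate: {candidate}\n"
            "Reply as 'Score: <1-5>' then 'Reasoning: <text>'."
        )
        reply = call_model(prompt)
        match = re.search(r"Score:\s*(\d+)", reply)
        if match is None:
            # Same fallback idea as the parser above: 0 signals a parse failure.
            return ParsedScore(score=0, reasoning="unparseable reply")
        reason = re.search(r"Reasoning:\s*(.+)", reply, re.DOTALL)
        return ParsedScore(int(match.group(1)), reason.group(1).strip() if reason else "")
    return judge_fn


def fake_model(prompt: str) -> str:
    # Stand-in for a network call: full marks only if candidate equals reference.
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    ok = fields["Reference"] == fields["Candidate"]
    return f"Score: {5 if ok else 1}\nReasoning: {'match' if ok else 'mismatch'}"


llm_judge = make_llm_judge(fake_model)
print(llm_judge("What is 2 * 3?", "6", "6"))
print(llm_judge("What is 2 * 3?", "6", "7"))
```

Injecting the model as a plain callable keeps the scoring loop testable offline, which is the same reason this notebook uses mock judges.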

In [None]:
def generate_math_dataset(n: int = 50, seed: int = 42) -> list[EvalSample]:
    """Generate synthetic math QA samples with known correctness.

    Creates multiplication problems. Model outputs are either:
    - Exact correct answer (human_label=1)
    - Close but wrong answer (human_label=0)
    - Completely wrong answer (human_label=0)
    - Correct answer with extra text (human_label=1)
    """
    rng = np.random.RandomState(seed)
    samples: list[EvalSample] = []

    for i in range(n):
        a = int(rng.randint(2, 50))
        b = int(rng.randint(2, 50))
        correct = a * b
        question = f"What is {a} * {b}?"
        reference = str(correct)

        case = rng.choice(["exact", "close_wrong", "far_wrong", "verbose_correct"], p=[0.35, 0.25, 0.2, 0.2])

        if case == "exact":
            model_output = str(correct)
            human_label = 1
        elif case == "close_wrong":
            offset = int(rng.choice([-2, -1, 1, 2]))
            model_output = str(correct + offset)
            human_label = 0
        elif case == "far_wrong":
            model_output = str(int(rng.randint(1, 100)))
            # Unlikely but possible that random number equals correct
            human_label = 1 if model_output == reference else 0
        else:  # verbose_correct
            model_output = f"The answer is {correct}."
            human_label = 1

        samples.append(EvalSample(
            question=question,
            reference_answer=reference,
            model_output=model_output,
            human_label=human_label,
        ))

    return samples


dataset = generate_math_dataset(n=50)
n_correct = sum(s.human_label for s in dataset)
print(f"Generated {len(dataset)} samples: {n_correct} correct, {len(dataset) - n_correct} incorrect")
print(f"\nExample samples:")
for s in dataset[:5]:
    print(f"  Q: {s.question}  Ref: {s.reference_answer}  Model: {s.model_output}  Label: {s.human_label}")

In [None]:
class MockJudge:
    """Deterministic judge that scores based on string similarity.

    Scoring logic:
      - Exact match between model output and reference -> 5
      - Reference string is contained in model output (verbose correct) -> 4
      - Model output is numerically close (within 5) to reference -> 3
      - Otherwise -> 1

    This simulates an LLM judge without any API calls.
    """

    def __init__(self, name: str = "default"):
        self.name = name

    def score(self, sample: EvalSample, rubric: JudgePrompt) -> JudgeResult:
        ref = sample.reference_answer.strip()
        out = sample.model_output.strip()

        if out == ref:
            s = 5
            reasoning = "Exact match with reference answer."
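        elif ref in out:
            # Plain substring test: a reference of "6" also matches "16" or "62",
            # so short references can yield false positives in this branch.
            s = 4
            reasoning = "Reference answer found within model output (verbose but correct)."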
        else:
            # Try numeric comparison
            try:
                # Extract first number from model output
                out_nums = re.findall(r"-?\d+", out)
                ref_num = int(ref)
                if out_nums and abs(int(out_nums[0]) - ref_num) <= 5:
                    s = 3
                    reasoning = f"Numerically close (off by {abs(int(out_nums[0]) - ref_num)})."
                else:
                    s = 1
                    reasoning = "Incorrect answer."
            except (ValueError, IndexError):
                s = 1
                reasoning = "Could not parse numeric answer."

        raw = f"Score: {s}\nReasoning: {reasoning}"
        return JudgeResult(score=s, reasoning=reasoning, raw_response=raw)


def run_judge(
    samples: list[EvalSample],
    judge_fn: Callable[[EvalSample, JudgePrompt], JudgeResult],
    rubric: JudgePrompt,
) -> list[JudgeResult]:
    """Batch-score a list of samples using the given judge function."""
    return [judge_fn(s, rubric) for s in samples]

In [None]:
# ----- Tests for Part 2 -----

rubric = JudgePrompt(
    system_prompt="You are a math grading assistant.",
    rubric_criteria=["Correctness of final answer"],
    scoring_scale=(1, 5),
    output_format="Score: {score}\nReasoning: {reasoning}",
)

judge = MockJudge()
results = run_judge(dataset, judge.score, rubric)

# Check output length
assert len(results) == len(dataset), f"Expected {len(dataset)} results, got {len(results)}"
print(f"Scored {len(results)} samples.")

# Check all scores in valid range
min_s, max_s = rubric.scoring_scale
for r in results:
    assert min_s <= r.score <= max_s, f"Score {r.score} out of range [{min_s}, {max_s}]"
print(f"All scores in range [{min_s}, {max_s}].")

# Check score distribution
scores = [r.score for r in results]
print(f"\nScore distribution:")
for s in sorted(set(scores)):
    count = scores.count(s)
    print(f"  Score {s}: {count} samples ({count/len(scores)*100:.0f}%)")

# Verify exact match samples get score 5
for sample, result in zip(dataset, results):
    if sample.model_output.strip() == sample.reference_answer.strip():
        assert result.score == 5, f"Exact match should get 5, got {result.score}"
print("\nExact-match scoring verified.")

print("\nAll Part 2 tests passed.")

---
## Part 3: Inter-Rater Agreement (Statistical Rigor)

The core question: **does the judge agree with humans?**

We implement agreement metrics from scratch, then validate against sklearn.

### Metrics

- **Confusion Matrix**: Counts of (human label, judge label) pairs
- **Accuracy**: Fraction of correct predictions
- **Precision / Recall / F1**: For binary classification
- **Cohen's Kappa**: Agreement adjusted for chance
  - $\kappa = \frac{p_o - p_e}{1 - p_e}$ where $p_o$ = observed agreement, $p_e$ = expected by chance
  - $\kappa = 1$: perfect agreement; $\kappa = 0$: chance agreement; $\kappa < 0$: worse than chance
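
To make the formulas concrete, the following worked example computes the metrics by hand for a made-up 2x2 table of 50 items. The counts are illustrative and do not come from this notebook's dataset; the `_toy` names are hypothetical.

```python
import numpy as np

# Made-up table: rows = human label, cols = judge label (index 0 = "no", 1 = "yes").
table = np.array([[15, 10],
                  [5, 20]])
n = table.sum()

p_o = np.trace(table) / n                 # observed agreement: (15 + 20) / 50
human_marg = table.sum(axis=1) / n        # how often the human says no / yes
judge_marg = table.sum(axis=0) / n        # how often the judge says no / yes
p_e = float(human_marg @ judge_marg)      # agreement expected by chance
kappa_toy = (p_o - p_e) / (1 - p_e)

# Treat "yes" as the positive class and the human label as ground truth.
tp, fp, fn = table[1, 1], table[0, 1], table[1, 0]
precision_toy = tp / (tp + fp)
recall_toy = tp / (tp + fn)
f1_toy = 2 * precision_toy * recall_toy / (precision_toy + recall_toy)

print(f"p_o={p_o:.2f}  p_e={p_e:.2f}  kappa={kappa_toy:.2f}")
print(f"precision={precision_toy:.3f}  recall={recall_toy:.3f}  f1={f1_toy:.3f}")
```

Here raw agreement is 0.70, but half of that is expected by chance given the marginals, so kappa comes out at 0.40. That gap is the reason to report kappa alongside accuracy.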

In [None]:
def confusion_matrix(labels_a: list[int], labels_b: list[int], num_classes: int) -> np.ndarray:
    """Build confusion matrix from two raters' labels.

    Entry cm[i][j] = number of items where rater A said i and rater B said j.
    """
    assert len(labels_a) == len(labels_b), "Label lists must have same length"
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for a, b in zip(labels_a, labels_b):
        cm[a][b] += 1
    return cm


def cohens_kappa(rater1: list[int], rater2: list[int]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    Measures inter-rater agreement adjusted for chance.
    """
    assert len(rater1) == len(rater2), "Rater lists must have same length"
    n = len(rater1)
    classes = sorted(set(rater1) | set(rater2))

    # p_o = observed agreement
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n

    # p_e = expected agreement by chance
    p_e = sum(
        (rater1.count(c) / n) * (rater2.count(c) / n)
        for c in classes
    )

    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def accuracy(predicted: list[int], actual: list[int]) -> float:
    """Fraction of predictions that match actual labels."""
    assert len(predicted) == len(actual), "Lists must have same length"
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def precision_recall_f1(
    predicted: list[int],
    actual: list[int],
    positive_label: int = 1,
) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary classification."""
    tp = sum(p == positive_label and a == positive_label for p, a in zip(predicted, actual))
    fp = sum(p == positive_label and a != positive_label for p, a in zip(predicted, actual))
    fn = sum(p != positive_label and a == positive_label for p, a in zip(predicted, actual))

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

In [None]:
# Convert judge scores to binary labels for comparison with human labels
# Threshold: score >= 4 -> 1 (correct), else -> 0 (incorrect)
SCORE_THRESHOLD = 4

judge_labels = [1 if r.score >= SCORE_THRESHOLD else 0 for r in results]
human_labels = [s.human_label for s in dataset]

# Compute all metrics
cm = confusion_matrix(human_labels, judge_labels, num_classes=2)
kappa = cohens_kappa(human_labels, judge_labels)
acc = accuracy(judge_labels, human_labels)
prec, rec, f1 = precision_recall_f1(judge_labels, human_labels, positive_label=1)

print("Agreement between Mock Judge and Human Labels")
print("=" * 50)
print(f"\nConfusion Matrix (rows=human, cols=judge):")
print(f"              Judge=0  Judge=1")
print(f"  Human=0     {cm[0,0]:5d}    {cm[0,1]:5d}")
print(f"  Human=1     {cm[1,0]:5d}    {cm[1,1]:5d}")
print(f"\nAccuracy:  {acc:.4f}")
print(f"Precision: {prec:.4f}")
print(f"Recall:    {rec:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"Cohen's Kappa: {kappa:.4f}")

In [None]:
# ----- Tests for Part 3: validate against sklearn -----
from sklearn.metrics import (
    cohen_kappa_score as sk_kappa,
    confusion_matrix as sk_cm,
    accuracy_score as sk_acc,
    precision_recall_fscore_support as sk_prfs,
)

# Test confusion matrix
# labels=[0, 1] keeps the matrix 2x2 even if one class happens to be absent
sk_cm_result = sk_cm(human_labels, judge_labels, labels=[0, 1])
assert np.array_equal(cm, sk_cm_result), (
    f"Confusion matrix mismatch:\nOurs:\n{cm}\nsklearn:\n{sk_cm_result}"
)
print("Confusion matrix matches sklearn.")

# Test Cohen's kappa
sk_kappa_val = sk_kappa(human_labels, judge_labels)
assert abs(kappa - sk_kappa_val) < 1e-10, (
    f"Kappa mismatch: ours={kappa:.6f}, sklearn={sk_kappa_val:.6f}"
)
print(f"Cohen's kappa matches sklearn: {kappa:.6f}")

# Test accuracy
sk_acc_val = sk_acc(human_labels, judge_labels)
assert abs(acc - sk_acc_val) < 1e-10, f"Accuracy mismatch: {acc} vs {sk_acc_val}"
print(f"Accuracy matches sklearn: {acc:.6f}")

# Test precision/recall/F1
sk_prec, sk_rec, sk_f1, _ = sk_prfs(human_labels, judge_labels, pos_label=1, average="binary")
assert abs(prec - sk_prec) < 1e-10, f"Precision mismatch: {prec} vs {sk_prec}"
assert abs(rec - sk_rec) < 1e-10, f"Recall mismatch: {rec} vs {sk_rec}"
assert abs(f1 - sk_f1) < 1e-10, f"F1 mismatch: {f1} vs {sk_f1}"
print(f"Precision/Recall/F1 matches sklearn.")

# Test perfect agreement edge case
perfect_a = [0, 0, 1, 1, 0, 1]
perfect_b = [0, 0, 1, 1, 0, 1]
assert cohens_kappa(perfect_a, perfect_b) == 1.0, "Perfect agreement should give kappa=1.0"
print("Perfect agreement kappa=1.0 verified.")

print("\nAll Part 3 tests passed.")

---
## Part 4: Judge Sensitivity Analysis

A reliable judge should give consistent scores regardless of minor prompt variations. In practice, LLM judges are sensitive to:
- Rubric wording
- System prompt tone
- Scoring scale presentation

We test sensitivity by running multiple prompt variants and measuring:
- **Flip rate**: fraction of samples that change binary label across variants
- **Score variance per sample**: how much each sample's score varies
- **Krippendorff's alpha**: inter-rater reliability across all variants
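
The three metrics above can be checked by hand on a tiny score matrix. The sketch below uses made-up scores (3 variants, 4 samples) and a threshold of 4. For alpha it assumes the interval metric, complete data, and the pairwise definition of expected disagreement, under which the mean squared difference over distinct pairs equals twice the sample variance (`ddof=1`).

```python
import numpy as np

# Made-up scores: rows = prompt variants, columns = samples.
toy = np.array([[5, 4, 3, 1],
                [5, 3, 3, 1],
                [4, 4, 2, 1]])
m, n = toy.shape

# Flip rate: a sample flips if its thresholded label is not unanimous.
binary = toy >= 4
toy_flip_rate = float(np.mean(binary.any(axis=0) & ~binary.all(axis=0)))

# Per-sample variance across variants (population variance, ddof=0).
toy_var = toy.var(axis=0)

# Interval alpha = 1 - D_o / D_e.
# D_o: mean squared difference over distinct rater pairs within each sample.
diffs = toy[:, None, :] - toy[None, :, :]      # shape (m, m, n), ordered pairs
d_o = (diffs ** 2).sum() / (n * m * (m - 1))
# D_e: mean squared difference over distinct pairs of all pooled ratings.
d_e = 2.0 * toy.var(ddof=1)
toy_alpha = 1.0 - d_o / d_e

print(f"flip rate = {toy_flip_rate:.2f}")      # only the second sample flips
print(f"variance  = {np.round(toy_var, 3)}")
print(f"alpha     = {toy_alpha:.3f}")
```

Only the second sample has variants on both sides of the threshold, so the flip rate is 0.25, while alpha stays high because the disagreements are small relative to the overall spread of scores.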

In [None]:
class NoisyMockJudge:
    """Mock judge with slight per-variant randomness to simulate sensitivity.

    Uses a base MockJudge score and adds small deterministic noise
    based on the variant_id and sample index. This simulates the
    prompt-sensitivity of real LLM judges. Only the rubric's
    scoring_scale is read; the rubric wording itself has no effect
    on this mock.
    """

    def __init__(self, variant_id: int, noise_level: float = 0.3):
        self.variant_id = variant_id
        self.noise_level = noise_level
        self.base_judge = MockJudge(name=f"variant_{variant_id}")
        self._rng = np.random.RandomState(seed=variant_id * 1000)

    def score(self, sample: EvalSample, rubric: JudgePrompt, sample_idx: int = 0) -> JudgeResult:
        base_result = self.base_judge.score(sample, rubric)
        # Add noise: deterministic per (variant_id, sample_idx)
        noise_rng = np.random.RandomState(seed=self.variant_id * 10000 + sample_idx)
        noise = int(np.round(noise_rng.normal(0, self.noise_level * 2)))
        min_s, max_s = rubric.scoring_scale
        noisy_score = int(np.clip(base_result.score + noise, min_s, max_s))

        raw = f"Score: {noisy_score}\nReasoning: {base_result.reasoning} [variant {self.variant_id}]"
        return JudgeResult(score=noisy_score, reasoning=base_result.reasoning, raw_response=raw)


# Define 5 prompt variants (different rubric wordings)
prompt_variants = [
    JudgePrompt(
        system_prompt="You are a strict math grader. Only exact answers get full marks.",
        rubric_criteria=["Exact correctness of the numerical answer"],
        scoring_scale=(1, 5),
        output_format="Score: {score}\nReasoning: {reasoning}",
    ),
    JudgePrompt(
        system_prompt="You are a lenient math tutor. Give credit for effort and partial answers.",
        rubric_criteria=["Approximate correctness", "Shows understanding of the problem"],
        scoring_scale=(1, 5),
        output_format="Score: {score}\nReasoning: {reasoning}",
    ),
    JudgePrompt(
        system_prompt="Evaluate the mathematical response objectively.",
        rubric_criteria=["Correctness", "Completeness"],
        scoring_scale=(1, 5),
        output_format="Score: {score}\nReasoning: {reasoning}",
    ),
    JudgePrompt(
        system_prompt="You are an exam proctor. Grade precisely.",
        rubric_criteria=["Final answer matches expected result exactly"],
        scoring_scale=(1, 5),
        output_format="Score: {score}\nReasoning: {reasoning}",
    ),
    JudgePrompt(
        system_prompt="Rate how well the student answered the math question.",
        rubric_criteria=["Answer quality", "Numerical accuracy"],
        scoring_scale=(1, 5),
        output_format="Score: {score}\nReasoning: {reasoning}",
    ),
]

# Run all variants
n_variants = len(prompt_variants)
n_samples = len(dataset)
# results_matrix[i, j] = score from variant i on sample j
results_matrix = np.zeros((n_variants, n_samples), dtype=int)

for v_idx, rubric_v in enumerate(prompt_variants):
    noisy_judge = NoisyMockJudge(variant_id=v_idx, noise_level=0.3)
    for s_idx, sample in enumerate(dataset):
        result = noisy_judge.score(sample, rubric_v, sample_idx=s_idx)
        results_matrix[v_idx, s_idx] = result.score

print(f"Results matrix shape: {results_matrix.shape} (variants x samples)")
print(f"Score range: [{results_matrix.min()}, {results_matrix.max()}]")

In [None]:
def flip_rate(results_matrix: np.ndarray, threshold: int = 4) -> float:
    """Fraction of samples that change binary label across variants.

    A sample 'flips' if at least one variant labels it differently
    (above/below threshold) from at least one other variant.
    """
    n_variants, n_samples = results_matrix.shape
    binary = (results_matrix >= threshold).astype(int)  # (n_variants, n_samples)
    # A sample flips if not all variants agree
    flips = 0
    for j in range(n_samples):
        labels_for_sample = binary[:, j]
        if labels_for_sample.min() != labels_for_sample.max():
            flips += 1
    return flips / n_samples


def score_variance_per_sample(results_matrix: np.ndarray) -> np.ndarray:
    """Variance of scores per sample across all variants.

    Returns array of shape (n_samples,).
    """
    # Variance along axis 0 (across variants) for each sample
    return np.var(results_matrix, axis=0)


def krippendorff_alpha(results_matrix: np.ndarray) -> float:
    """Krippendorff's alpha for interval data.

    alpha = 1 - (observed_disagreement / expected_disagreement)

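    For interval metric:
      - observed = mean squared difference between rater pairs within
        each unit (sample)
      - expected = mean squared difference between all pairs of ratings
        pooled across units (twice the variance of the pooled ratings)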

    Handles missing data: any NaN entries are excluded.
    """
    n_raters, n_units = results_matrix.shape

    # Observed disagreement: average pairwise squared difference within each unit
    observed_sum = 0.0
    n_pairs_total = 0

    for j in range(n_units):
        ratings = results_matrix[:, j].astype(float)
        # Remove NaN if any
        ratings = ratings[~np.isnan(ratings)]
        m = len(ratings)
        if m < 2:
            continue
        # Sum of squared pairwise differences
        for i in range(m):
            for k in range(i + 1, m):
                observed_sum += (ratings[i] - ratings[k]) ** 2
                n_pairs_total += 1

    if n_pairs_total == 0:
        return 1.0  # No data to disagree on

    D_o = observed_sum / n_pairs_total

    # Expected disagreement: mean squared difference over all distinct pairs
    # of pooled ratings, regardless of which unit they came from.
    all_ratings = results_matrix.flatten().astype(float)
    all_ratings = all_ratings[~np.isnan(all_ratings)]

    # The mean squared difference over distinct pairs equals twice the
    # sample variance (ddof=1), so no explicit double loop is needed.
    D_e = 2.0 * np.var(all_ratings, ddof=1)

    if D_e == 0.0:
        return 1.0  # All ratings identical

    alpha = 1.0 - (D_o / D_e)
    return alpha

In [None]:
# Compute sensitivity metrics
fr = flip_rate(results_matrix, threshold=SCORE_THRESHOLD)
sv = score_variance_per_sample(results_matrix)
alpha = krippendorff_alpha(results_matrix)

print("Judge Sensitivity Analysis")
print("=" * 50)
print(f"Flip rate (threshold={SCORE_THRESHOLD}): {fr:.4f} ({fr*100:.1f}% of samples change label)")
print(f"Mean score variance per sample: {sv.mean():.4f}")
print(f"Max score variance per sample:  {sv.max():.4f}")
print(f"Krippendorff's alpha: {alpha:.4f}")
print(f"\nInterpretation:")
if alpha > 0.8:
    print(f"  alpha={alpha:.2f} > 0.80: Good reliability.")
elif alpha > 0.667:
    print(f"  alpha={alpha:.2f} > 0.667: Acceptable for tentative conclusions.")
else:
    print(f"  alpha={alpha:.2f} <= 0.667: Low reliability, results should be treated cautiously.")

In [None]:
# Visualize: heatmap of per-sample scores across variants (Tufte style)
fig, ax = plt.subplots(figsize=(14, 3.5))

im = ax.imshow(results_matrix, aspect="auto", cmap="YlOrRd", vmin=1, vmax=5)

ax.set_xlabel("Sample Index", fontsize=10)
ax.set_ylabel("Prompt Variant", fontsize=10)
ax.set_yticks(range(n_variants))
ax.set_yticklabels([f"V{i}" for i in range(n_variants)], fontsize=9)
ax.set_title("Judge Scores by Prompt Variant", fontsize=12, pad=10)

# Minimal colorbar
cbar = fig.colorbar(im, ax=ax, shrink=0.8, pad=0.02)
cbar.set_label("Score", fontsize=9)
cbar.set_ticks([1, 2, 3, 4, 5])

# Remove chart junk (Tufte style)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.tick_params(length=3)

plt.tight_layout()
plt.show()

In [None]:
# ----- Tests for Part 4 -----

# Test flip_rate bounds
assert 0.0 <= fr <= 1.0, f"Flip rate {fr} out of [0, 1]"
print(f"Flip rate in valid range: {fr:.4f}")

# Test flip_rate with identical results -> 0.0
identical = np.full((3, 10), 5)
assert flip_rate(identical) == 0.0, "Identical results should have flip_rate=0"
print("Flip rate = 0.0 for identical results: verified.")

# Test alpha bounds
assert -1.0 <= alpha <= 1.0, f"Alpha {alpha} out of [-1, 1]"
print(f"Krippendorff's alpha in valid range: {alpha:.4f}")

# Test alpha = 1.0 for identical results
alpha_perfect = krippendorff_alpha(identical)
assert alpha_perfect == 1.0, f"Identical results should give alpha=1.0, got {alpha_perfect}"
print("Alpha = 1.0 for identical results: verified.")

# Test score_variance_per_sample shape
assert sv.shape == (n_samples,), f"Expected shape ({n_samples},), got {sv.shape}"
assert np.all(sv >= 0), "Variance must be non-negative"
print(f"Score variance shape and non-negativity: verified.")

# Test score_variance = 0 for identical results
sv_identical = score_variance_per_sample(identical)
assert np.all(sv_identical == 0.0), "Identical results should have zero variance"
print("Variance = 0 for identical results: verified.")

print("\nAll Part 4 tests passed.")

---
## Part 5: Cost / Latency / Accuracy Tradeoffs

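In production, you must trade judge quality against cost. The working hypothesis here is that **a cheap strategy can recover most of an expensive judge's accuracy at a fraction of its cost**, for example by majority voting over several runs of a cheap model. The simulation below tests this on mock data; it does not establish it for real judges.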
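
A back-of-the-envelope sketch of why majority voting might help: if each run errs independently with the same probability, the chance that the majority is wrong follows a binomial tail. Independence is an assumption of this sketch. Real judge errors are often correlated across runs, which shrinks the gain. `majority_error` is a hypothetical helper.

```python
from math import comb


def majority_error(error_rate: float, n_votes: int) -> float:
    """Probability that more than half of n independent votes are wrong."""
    need = n_votes // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_votes, k) * error_rate**k * (1 - error_rate) ** (n_votes - k)
        for k in range(need, n_votes + 1)
    )


# Odd vote counts only, so ties cannot occur.
for n_votes in (1, 3, 5):
    print(n_votes, round(majority_error(0.22, n_votes), 4))
```

With a per-run error rate of 0.22, three independent votes bring the error down to about 0.124, at three times the per-evaluation cost.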

We simulate three evaluation strategies and analyze their Pareto efficiency.
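
As an illustration of Pareto efficiency, the sketch below finds non-dominated (cost, accuracy) points with a single pass over cost-sorted points. The four strategies and their numbers are made up, and `pareto_sweep` is a hypothetical name. One caveat of this variant: exact duplicate points are reported only once.

```python
def pareto_sweep(costs: list[float], accuracies: list[float]) -> list[int]:
    """Indices of non-dominated points (minimize cost, maximize accuracy)."""
    # Sort by cost ascending; break cost ties by accuracy descending.
    order = sorted(range(len(costs)), key=lambda i: (costs[i], -accuracies[i]))
    frontier: list[int] = []
    best_acc = float("-inf")
    for i in order:
        # Every point seen so far costs no more than this one, so this point
        # survives only if it is strictly more accurate than all of them.
        if accuracies[i] > best_acc:
            frontier.append(i)
            best_acc = accuracies[i]
    return sorted(frontier)


# Made-up (cost, accuracy) pairs for four hypothetical strategies.
toy_costs = [0.03, 0.001, 0.003, 0.01]
toy_accs = [0.95, 0.80, 0.88, 0.85]
print(pareto_sweep(toy_costs, toy_accs))  # index 3 is dominated by index 2
```

The strategy at index 3 costs more than index 2 and is less accurate, so it drops off the frontier; the other three each offer a distinct cost/accuracy trade.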

In [None]:
@dataclass
class EvalReport:
    """Summary of an evaluation strategy's performance and cost."""
    model_name: str
    accuracy: float
    cost_per_eval: float       # dollars
    latency_per_eval: float    # seconds
    throughput: float           # evals per second
    accuracy_per_dollar: float  # accuracy / cost


class CheapMockJudge:
    """Lower-accuracy judge simulating a cheap/fast model.

    Has a fixed error rate: randomly flips some correct/incorrect decisions.
    """

    def __init__(self, error_rate: float = 0.22, seed: int = 0):
        self.error_rate = error_rate
        self._rng = np.random.RandomState(seed)
        self.base_judge = MockJudge(name="cheap")

    def score(self, sample: EvalSample, rubric: JudgePrompt) -> JudgeResult:
        base = self.base_judge.score(sample, rubric)
        # Randomly flip score between high and low
        if self._rng.random() < self.error_rate:
            flipped_score = 1 if base.score >= 4 else 5
            raw = f"Score: {flipped_score}\nReasoning: {base.reasoning} [flipped]"
            return JudgeResult(score=flipped_score, reasoning=base.reasoning, raw_response=raw)
        return base


class ExpensiveMockJudge:
    """High-accuracy judge simulating an expensive/slow model.

    Very low error rate.
    """

    def __init__(self, error_rate: float = 0.05, seed: int = 100):
        self.error_rate = error_rate
        self._rng = np.random.RandomState(seed)
        self.base_judge = MockJudge(name="expensive")

    def score(self, sample: EvalSample, rubric: JudgePrompt) -> JudgeResult:
        base = self.base_judge.score(sample, rubric)
        if self._rng.random() < self.error_rate:
            flipped_score = 1 if base.score >= 4 else 5
            raw = f"Score: {flipped_score}\nReasoning: {base.reasoning} [flipped]"
            return JudgeResult(score=flipped_score, reasoning=base.reasoning, raw_response=raw)
        return base


def majority_vote_labels(
    all_results: list[list[JudgeResult]],
    threshold: int = 4,
) -> list[int]:
    """Compute majority vote binary labels from multiple judge runs.

    Each inner list is one run's results. Returns binary labels (0/1)
    based on majority of runs agreeing above/below threshold.
    """
    n_samples = len(all_results[0])
    n_runs = len(all_results)
    labels = []
    for j in range(n_samples):
        votes = sum(1 for run in all_results if run[j].score >= threshold)
        labels.append(1 if votes > n_runs / 2 else 0)
    return labels

In [None]:
# Run three evaluation strategies
rubric_default = prompt_variants[0]  # Use first variant as default

# Strategy 1: Expensive-accurate judge
expensive_judge = ExpensiveMockJudge(error_rate=0.05, seed=100)
expensive_results = run_judge(dataset, expensive_judge.score, rubric_default)
expensive_labels = [1 if r.score >= SCORE_THRESHOLD else 0 for r in expensive_results]
expensive_acc = accuracy(expensive_labels, human_labels)

# Strategy 2: Cheap-fast judge
cheap_judge = CheapMockJudge(error_rate=0.22, seed=200)
cheap_results = run_judge(dataset, cheap_judge.score, rubric_default)
cheap_labels = [1 if r.score >= SCORE_THRESHOLD else 0 for r in cheap_results]
cheap_acc = accuracy(cheap_labels, human_labels)

# Strategy 3: Majority vote (3 cheap runs)
vote_runs = []
for run_seed in [300, 301, 302]:
    j = CheapMockJudge(error_rate=0.22, seed=run_seed)
    vote_runs.append(run_judge(dataset, j.score, rubric_default))
vote_labels = majority_vote_labels(vote_runs, threshold=SCORE_THRESHOLD)
vote_acc = accuracy(vote_labels, human_labels)

print(f"Strategy Accuracies vs Human Labels:")
print(f"  Expensive-accurate: {expensive_acc:.4f}")
print(f"  Cheap-fast:         {cheap_acc:.4f}")
print(f"  Majority-vote-3:    {vote_acc:.4f}")

In [None]:
def eval_economics(
    accuracies: list[float],
    model_names: list[str],
    costs: list[float],
    latencies: list[float],
) -> list[EvalReport]:
    """Build EvalReport objects for each evaluation strategy."""
    reports = []
    for name, acc, cost, lat in zip(model_names, accuracies, costs, latencies):
        throughput = 1.0 / lat if lat > 0 else float("inf")
        acc_per_dollar = acc / cost if cost > 0 else float("inf")
        reports.append(EvalReport(
            model_name=name,
            accuracy=acc,
            cost_per_eval=cost,
            latency_per_eval=lat,
            throughput=throughput,
            accuracy_per_dollar=acc_per_dollar,
        ))
    return reports


reports = eval_economics(
    accuracies=[expensive_acc, cheap_acc, vote_acc],
    model_names=["expensive-accurate", "cheap-fast", "majority-vote-3"],
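    costs=[0.03, 0.001, 0.003],   # illustrative $/eval; majority-vote-3 = 3 x cheap-fast
    latencies=[2.0, 0.1, 0.3],    # illustrative seconds; assumes the 3 votes run sequentially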
)

print("Evaluation Economics")
print("=" * 70)
print(f"{'Strategy':<20} {'Accuracy':>8} {'Cost ($)':>8} {'Latency(s)':>10} {'Throughput':>10} {'Acc/$':>10}")
print("-" * 70)
for r in reports:
    print(
        f"{r.model_name:<20} {r.accuracy:>8.4f} {r.cost_per_eval:>8.3f} "
        f"{r.latency_per_eval:>10.1f} {r.throughput:>10.1f} {r.accuracy_per_dollar:>10.1f}"
    )

In [None]:
def pareto_frontier(
    costs: list[float],
    accuracies: list[float],
) -> list[int]:
    """Find indices of Pareto-optimal points (minimize cost, maximize accuracy).

    A point is Pareto-optimal if no other point is at least as good on both
    cost and accuracy and strictly better on at least one of them.
    """
    n = len(costs)
    is_pareto = [True] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # j dominates i if j has lower or equal cost AND higher or equal accuracy
            # with at least one strict inequality
            if (costs[j] <= costs[i] and accuracies[j] >= accuracies[i]) and (
                costs[j] < costs[i] or accuracies[j] > accuracies[i]
            ):
                is_pareto[i] = False
                break
    return [i for i in range(n) if is_pareto[i]]


# Compute Pareto frontier
costs = [r.cost_per_eval for r in reports]
accs = [r.accuracy for r in reports]
names = [r.model_name for r in reports]
pareto_idx = pareto_frontier(costs, accs)

print(f"Pareto-optimal strategies: {[names[i] for i in pareto_idx]}")

In [None]:
# Visualize: Pareto frontier scatter (Tufte style)
fig, ax = plt.subplots(figsize=(7, 5))

# Plot all points
for i, (c, a, name) in enumerate(zip(costs, accs, names)):
    marker = "o" if i in pareto_idx else "x"
    color = "#333333" if i in pareto_idx else "#999999"
    size = 80 if i in pareto_idx else 50
    ax.scatter(c, a, s=size, marker=marker, c=color, zorder=3)
    ax.annotate(
        name,
        (c, a),
        textcoords="offset points",
        xytext=(8, 6),
        fontsize=9,
        color=color,
    )

# Draw Pareto frontier line
if len(pareto_idx) > 1:
    pareto_points = sorted([(costs[i], accs[i]) for i in pareto_idx])
    px, py = zip(*pareto_points)
    ax.plot(px, py, "--", color="#666666", linewidth=1, alpha=0.7, zorder=2)

ax.set_xlabel("Cost per Evaluation ($)", fontsize=10)
ax.set_ylabel("Accuracy", fontsize=10)
ax.set_title("Accuracy vs Cost: Pareto Frontier", fontsize=12, pad=10)

# Tufte style: remove top/right spines, shorten ticks, no grid
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.tick_params(length=3)

plt.tight_layout()
plt.show()

In [None]:
# ----- Tests for Part 5 -----

# Test EvalReport construction
for r in reports:
    assert 0.0 <= r.accuracy <= 1.0, f"Accuracy {r.accuracy} out of range"
    assert r.cost_per_eval > 0, f"Cost must be positive, got {r.cost_per_eval}"
    assert r.latency_per_eval > 0, f"Latency must be positive, got {r.latency_per_eval}"
    assert r.throughput > 0, f"Throughput must be positive, got {r.throughput}"
    assert r.accuracy_per_dollar > 0, f"Accuracy per dollar must be positive, got {r.accuracy_per_dollar}"
print("EvalReport fields validated.")

# Test Pareto ordering: expensive-accurate should be Pareto-optimal.
# This holds as long as it has the strictly highest accuracy; a cheaper
# strategy that ties it on accuracy would dominate it.
assert 0 in pareto_idx, "Expensive-accurate should be on Pareto frontier"
print("Expensive-accurate is Pareto-optimal: verified.")

# Test: majority vote should match or beat the single cheap judge,
# allowing a 0.01 tolerance for randomness
assert vote_acc >= cheap_acc - 0.01, (
    f"Majority vote ({vote_acc:.4f}) should be >= cheap judge ({cheap_acc:.4f}) "
    "(with small tolerance for randomness)"
)
print(f"Majority vote ({vote_acc:.4f}) vs cheap judge ({cheap_acc:.4f}): verified within tolerance.")

# Test Pareto frontier with trivially dominated point
test_costs = [0.01, 0.02, 0.03]
test_accs = [0.9, 0.85, 0.95]  # Point 1 (0.02, 0.85) is dominated by point 0 (0.01, 0.9)
test_pareto = pareto_frontier(test_costs, test_accs)
assert 1 not in test_pareto, "Point (0.02, 0.85) should be dominated"
assert 0 in test_pareto and 2 in test_pareto, "Points 0 and 2 should be Pareto-optimal"
print("Pareto frontier logic verified.")

print("\nAll Part 5 tests passed.")

---
## Interview Tips: LLM-as-Judge in Practice

### When to use LLM-as-Judge vs other approaches

| Method | Best for | Weaknesses |
|--------|----------|------------|
| **Exact match** | Factual QA, code correctness | Misses semantically correct paraphrases |
| **BLEU/ROUGE** | Translation, summarization | Poor correlation with human judgment on open-ended tasks |
| **LLM-as-Judge** | Open-ended generation, style, reasoning | Costlier than automatic metrics, prone to systematic biases, needs calibration against human labels |
| **Human eval** | Gold standard for subjective tasks | Slow, expensive, not scalable |

**Rule of thumb**: Use exact match when you can. Use LLM-as-Judge when the answer space is too large for exact match and human eval is too costly to run at scale. Always validate the judge against human labels on a held-out set.

### Known Failure Modes

1. **Position bias**: Judges tend to prefer the first (or last) response in pairwise comparisons. Mitigation: swap positions and average.
2. **Verbosity bias**: Longer responses get higher scores even when they are less accurate. Mitigation: include length-penalizing criteria in the rubric.
3. **Self-preference**: A judge may rate outputs from its own model family higher than equivalent outputs from other families (for example, GPT-4 favoring GPT-4 outputs over comparable Claude outputs). Mitigation: use a different model family for judging than for generation.
4. **Anchoring to rubric examples**: If the rubric shows a score-5 example, the judge calibrates around that specific style. Mitigation: diverse rubric examples.
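
The "swap positions and average" mitigation for position bias can be sketched as below. This is a minimal illustration under stated assumptions, not part of the notebook's pipeline: the pairwise judge signature (a callable returning the probability that the *first* response shown is better), the `debiased_preference` helper, the `QUALITY` table, and the toy `biased_mock_judge` with its fixed first-slot bonus are all hypothetical.

```python
from typing import Callable

# Hypothetical pairwise judge: returns P(first response shown is better), in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(judge: PairwiseJudge, prompt: str, resp_a: str, resp_b: str) -> float:
    """Preference for A over B, averaged over both presentation orders."""
    p_a_shown_first = judge(prompt, resp_a, resp_b)
    p_a_shown_second = 1.0 - judge(prompt, resp_b, resp_a)
    return 0.5 * (p_a_shown_first + p_a_shown_second)


# Made-up quality scores, used only to drive the toy judge below.
QUALITY = {"good answer": 0.8, "weak answer": 0.3}


def biased_mock_judge(prompt: str, first: str, second: str) -> float:
    """Toy judge: an unbiased preference plus a fixed +0.15 bonus for the first slot."""
    unbiased = 0.5 + 0.5 * (QUALITY[first] - QUALITY[second])
    return min(1.0, max(0.0, unbiased + 0.15))


raw_a_first = biased_mock_judge("q", "good answer", "weak answer")
raw_a_second = 1.0 - biased_mock_judge("q", "weak answer", "good answer")
debiased = debiased_preference(biased_mock_judge, "q", "good answer", "weak answer")
print(f"A shown first: {raw_a_first:.2f}, A shown second: {raw_a_second:.2f}, averaged: {debiased:.2f}")
```

In this toy setup the judge's preference for the good answer is 0.90 when it is shown first and 0.60 when it is shown second; averaging recovers the unbiased 0.75. The cancellation is exact only because the bias here is an additive constant that never hits the clipping bounds; with a real judge, averaging reduces position bias rather than guaranteeing its removal, and it doubles the number of judge calls.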

### Validating Calibration

To check if your judge is well-calibrated:
1. Collect N human-labeled samples (100+ is ideal)
2. Run the judge on those samples
3. Compute Cohen's kappa and F1 against human labels
4. **Bootstrap confidence intervals**: resample with replacement B=1000 times, compute metric each time, report 95% CI
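5. If kappa < 0.6 (a common rule-of-thumb cutoff, below the range usually described as substantial agreement), treat the judge as not yet reliable enough for production use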
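
Steps 2 to 4 can be sketched as follows. This is a minimal illustration rather than the notebook's own implementation: `cohens_kappa_binary`, `bootstrap_ci`, and the synthetic `human_demo` / `judge_demo` arrays are hypothetical names introduced here, labels are assumed to be binary (0/1), and the 15% disagreement rate is made up for the demo.

```python
import numpy as np


def cohens_kappa_binary(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's kappa for two binary (0/1) label arrays of equal length."""
    p_o = float(np.mean(a == b))                        # observed agreement
    p_a1, p_b1 = float(np.mean(a)), float(np.mean(b))   # marginal rate of label 1
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)         # agreement expected by chance
    # Degenerate case: both raters use a single identical label (convention: kappa = 1).
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)


def bootstrap_ci(judge, human, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (judge, human) pairs with replacement."""
    rng = np.random.default_rng(seed)
    judge, human = np.asarray(judge), np.asarray(human)
    n = len(human)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # same indices for both arrays keeps pairs intact
        stats[b] = metric(judge[idx], human[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic data for illustration only: the judge disagrees with the human ~15% of the time.
demo_rng = np.random.default_rng(42)
human_demo = demo_rng.integers(0, 2, size=200)
flip_mask = demo_rng.random(200) < 0.15
judge_demo = np.where(flip_mask, 1 - human_demo, human_demo)

kappa_point = cohens_kappa_binary(judge_demo, human_demo)
kappa_lo, kappa_hi = bootstrap_ci(judge_demo, human_demo, cohens_kappa_binary)
print(f"kappa = {kappa_point:.3f}, 95% CI = [{kappa_lo:.3f}, {kappa_hi:.3f}]")
```

Comparing the lower bound of the interval, rather than the point estimate, against the reliability cutoff is the more conservative reading, since with small N the interval can be wide.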

### Connection to Vals AI Methodology

Vals AI focuses on building reliable LLM benchmarks. Key principles:
- **Multi-judge ensembles**: Reduce variance by combining multiple judges (like our majority voting approach)
- **Rubric engineering**: The rubric is as important as the model choice. Precise criteria reduce judge variance.
- **Sensitivity testing**: Always measure how much your results change under prompt perturbations (for example, flip rate and Krippendorff's alpha across prompt variants).
- **Cost-aware evaluation**: Production benchmarking requires balancing quality against budget. The Pareto frontier analysis in Part 5 is one way to frame that tradeoff.
- **Continuous calibration**: Re-validate judge accuracy against fresh human labels regularly, especially after model updates.
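
The variance-reduction claim for multi-judge ensembles can be made concrete with a small calculation. The sketch below assumes each judge makes a wrong binary decision independently with the same probability `p`; both the independence assumption and the reading of an error rate as a per-item label flip probability are simplifications introduced here, and `majority_vote_error` is a hypothetical helper. Judges that share a model or prompt tend to make correlated errors, so real ensembles usually gain less than this.

```python
from math import comb


def majority_vote_error(p: float, k: int) -> float:
    """P(majority of k judges is wrong) when each is wrong independently with probability p."""
    if k % 2 == 0:
        raise ValueError("k must be odd so that ties cannot occur")
    wrong_needed = k // 2 + 1  # wrong votes required for a wrong majority
    return sum(
        comb(k, m) * p**m * (1 - p) ** (k - m)
        for m in range(wrong_needed, k + 1)
    )


for k in (1, 3, 5):
    print(f"k={k}: ensemble error = {majority_vote_error(0.22, k):.4f}")
```

For `p = 0.22` and three judges this gives 3p²(1 − p) + p³ ≈ 0.124, so under these assumptions a 3-way vote roughly halves the single-judge error at three times the cost.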