# Chapter 4: LLM-as-a-Judge

Hands-on implementation of LLM-based evaluation systems.

---

## Learning Objectives

By the end of this chapter, you will be able to:

1. **Understand why LLM judges are powerful** - Learn when and why to use LLMs for evaluation instead of traditional metrics
2. **Measure evaluator agreement** - Apply statistical tools (Spearman's rho, Kendall's tau, Cohen's kappa) to validate evaluation quality
3. **Implement different judging approaches** - Build pointwise, pairwise, and reference-guided evaluation systems
4. **Design effective judge prompts** - Create structured prompts with clear criteria and evaluation steps
5. **Mitigate known biases** - Identify and correct for position bias, verbosity bias, and self-enhancement bias
6. **Use probability-weighted scoring** - Extract richer signals from token probabilities instead of discrete scores

---

## Why Use LLMs as Judges?

Traditional evaluation metrics compare generated text against a reference: BLEU and ROUGE count overlapping n-grams, while BERTScore compares contextual embeddings. All of them depend on a reference, and they struggle with:

- **Open-ended tasks** where multiple valid responses exist
- **Subjective qualities** like helpfulness, tone, and engagement
- **Complex reasoning** that requires understanding, not just matching

LLM-as-a-Judge offers several advantages:

| Aspect | Traditional Metrics | LLM Judges |
|--------|---------------------|------------|
| Reference required | Yes | Optional |
| Subjective evaluation | Limited | Strong |
| Reasoning about content | Surface-level | Deep understanding |
| Cost per evaluation | Very low | Moderate |
| Consistency | Perfect | Variable (needs calibration) |

The key insight is that **LLMs can approximate human judgment** at a fraction of the cost, enabling evaluation at scale for tasks where human annotation would be prohibitively expensive.

---

## Part 1: Measuring Evaluator Agreement

Before trusting any evaluator (human or LLM), we need statistical tools to measure how well it agrees with ground truth or other evaluators. This section introduces three essential metrics:

1. **Spearman's rho** - For ranked/ordinal data
2. **Kendall's tau** - For pairwise ranking agreement  
3. **Cohen's kappa** - For categorical judgments

### Setup

We begin by importing the statistical libraries we need for measuring agreement.

In [None]:
import numpy as np
from scipy.stats import spearmanr, kendalltau
from sklearn.metrics import cohen_kappa_score

### Spearman's Rank Correlation

**Spearman's rho** measures monotonic relationships between two sets of rankings. Unlike Pearson correlation, it does not assume linearity - only that when one variable increases, the other tends to increase (or decrease) consistently.

**Key properties:**
- Range: -1 to +1 (1 = identical rankings, -1 = exactly reversed rankings, 0 = no monotonic association)
- Robust to outliers since it uses ranks, not raw values
- Ideal for ordinal scales like 1-5 quality ratings
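
To make the contrast with Pearson concrete, here is a minimal standard-library sketch. The helpers `pearson`, `average_ranks`, and `spearman` are illustrative names written for this example, not SciPy functions; they compute Spearman's rho as the Pearson correlation of the ranks. The data are made up: `y` grows as the cube of `x`, so the relationship is perfectly monotonic but far from linear.

```python
from math import sqrt

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values: list[float]) -> list[float]:
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but strongly nonlinear

print(f"Pearson:  {pearson(x, y):.3f}")   # below 1: the relationship is not linear
print(f"Spearman: {spearman(x, y):.3f}")  # 1.000: the ordering is identical
```

In practice `scipy.stats.spearmanr` does the same job; the hand-written version is only meant to show what "correlation of ranks" means.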

**Why bootstrap confidence intervals?** A single correlation value tells us the point estimate, but bootstrap resampling shows us the uncertainty around that estimate. This is crucial when working with small sample sizes typical in evaluation studies.

In [None]:
# Example from the book: comparing human scores with metric scores
human_scores = [4, 2, 5, 3, 1, 4, 3, 5, 2, 1]
metric_scores = [0.7, 0.3, 0.9, 0.5, 0.1, 0.8, 0.4, 0.85, 0.25, 0.15]

# Spearman's rank correlation
corr, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman's ρ: {corr:.3f}")
print(f"p-value: {p_value:.4f}")

# Bootstrap confidence intervals (seeded so the interval is reproducible)
rng = np.random.default_rng(42)
n_bootstraps = 1000
human_arr = np.array(human_scores)
metric_arr = np.array(metric_scores)
bootstrap_corrs = []
for _ in range(n_bootstraps):
    # Resample item indices with replacement, keeping human/metric pairs together
    indices = rng.integers(0, len(human_arr), size=len(human_arr))
    boot_corr, _ = spearmanr(human_arr[indices], metric_arr[indices])
    bootstrap_corrs.append(boot_corr)

# nanpercentile skips the rare resample where one variable is constant and rho is undefined
lower = np.nanpercentile(bootstrap_corrs, 2.5)
upper = np.nanpercentile(bootstrap_corrs, 97.5)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")

### Kendall's Tau: Pairwise Agreement

**Kendall's tau** takes a different approach: it counts concordant vs. discordant pairs. For any two items, if both evaluators agree on which is better, that is a concordant pair; otherwise, it is discordant.

**Formula:**
$$\tau = \frac{\text{concordant pairs} - \text{discordant pairs}}{\text{total pairs}}$$

**Why use Kendall's tau?**
- Directly interpretable: with no ties, tau = 2p - 1, where p is the fraction of concordant pairs, so a tau of 0.8 means 90% of pairs are ordered the same way by both evaluators
- More intuitive for pairwise comparison tasks (like ranking models)
- Better statistical properties for small samples than Spearman's rho

**When to use each:**
- Use Spearman's rho when you care about overall ranking correlation
- Use Kendall's tau when pairwise preferences matter (e.g., "Is model A better than model B?")
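
As a quick check on how tau relates to the share of agreeing pairs, the sketch below counts pairs for two short rankings and converts between the two quantities. The helper name `kendall_pair_counts` is an illustrative choice, and the conversion `(1 + tau) / 2` assumes there are no ties.

```python
from itertools import combinations

def kendall_pair_counts(ranks_a: list[int], ranks_b: list[int]) -> tuple[int, int]:
    """Return (concordant, discordant) pair counts; tied pairs are skipped."""
    concordant = discordant = 0
    for i, j in combinations(range(len(ranks_a)), 2):
        product = (ranks_a[i] - ranks_a[j]) * (ranks_b[i] - ranks_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return concordant, discordant

human_ranks = [1, 2, 3, 4, 5]
metric_ranks = [1, 3, 2, 5, 4]

concordant, discordant = kendall_pair_counts(human_ranks, metric_ranks)
total = concordant + discordant  # equals n * (n - 1) / 2 because there are no ties
tau = (concordant - discordant) / total
concordant_share = concordant / total

print(f"tau = {tau:.2f}")                                  # tau = 0.60
print(f"concordant share = {concordant_share:.0%}")        # concordant share = 80%
print(f"(1 + tau) / 2 = {(1 + tau) / 2:.2f}")              # (1 + tau) / 2 = 0.80
```

So a tau of 0.6 here corresponds to 8 of 10 pairs being ordered the same way, not 6 of 10.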

In [None]:
# Example from the book
human_ranks = [1, 2, 3, 4, 5]
metric_ranks = [1, 3, 2, 5, 4]

tau, p_value = kendalltau(human_ranks, metric_ranks)
print(f"Kendall's τ: {tau:.3f}")
print(f"p-value: {p_value:.4f}")

# Count concordant and discordant pairs manually
n = len(human_ranks)
concordant = discordant = 0
for i in range(n):
    for j in range(i + 1, n):
        h_diff = human_ranks[i] - human_ranks[j]
        m_diff = metric_ranks[i] - metric_ranks[j]
        if h_diff * m_diff > 0:
            concordant += 1
        elif h_diff * m_diff < 0:
            discordant += 1

print(f"\nConcordant pairs: {concordant}")
print(f"Discordant pairs: {discordant}")
# With no ties, concordant + discordant equals the total number of pairs, n * (n - 1) / 2
print(f"Manual τ: {(concordant - discordant) / (concordant + discordant):.3f}")

### Cohen's Kappa: Categorical Agreement

For **binary or categorical judgments** (pass/fail, safe/unsafe, acceptable/unacceptable), simple percent agreement is misleading because two random raters would agree some percentage of the time just by chance.

**Cohen's kappa** corrects for this:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

Where:
- $P_o$ = observed agreement
- $P_e$ = expected agreement by chance

**Interpretation guidelines (Landis & Koch, 1977):**

| Kappa | Interpretation |
|-------|---------------|
| 0.81-1.00 | Almost perfect |
| 0.61-0.80 | Substantial |
| 0.41-0.60 | Moderate |
| 0.21-0.40 | Fair |
| 0.00-0.20 | Slight |
| < 0.00 | Poor (less agreement than chance) |

**Why kappa matters for LLM judges:** When validating an LLM judge against human labels, kappa tells you how far the judge's agreement with the humans exceeds the agreement that chance alone would produce.
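
To see why raw percent agreement can mislead, consider imbalanced labels. The sketch below uses made-up data: human annotators mark 90 of 100 responses as acceptable, and a hypothetical lazy judge answers "acceptable" for everything. The `cohen_kappa` helper is a small standard-library implementation written for this example; it applies the formula above to arbitrary category labels.

```python
from collections import Counter

def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over categories
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single identical label")
    return (p_o - p_e) / (1 - p_e)

human = ["ok"] * 90 + ["bad"] * 10   # imbalanced ground truth (made-up data)
lazy_judge = ["ok"] * 100            # always answers "ok"

agreement = sum(h == j for h, j in zip(human, lazy_judge)) / len(human)
kappa = cohen_kappa(human, lazy_judge)

print(f"Percent agreement: {agreement:.0%}")  # Percent agreement: 90%
print(f"Cohen's kappa: {kappa:.2f}")          # Cohen's kappa: 0.00
```

The judge carries no information about quality, and kappa reports exactly that even though raw agreement looks strong.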

In [None]:
# Example: Two raters labeling responses as acceptable (1) or not (0)
rater1 = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
rater2 = [1, 0, 0, 1, 1, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(rater1, rater2)

# Compute manually to understand
observed_agreement = sum(r1 == r2 for r1, r2 in zip(rater1, rater2)) / len(rater1)

# Expected agreement by chance (this shortcut assumes binary 0/1 labels)
p1_yes = sum(rater1) / len(rater1)
p2_yes = sum(rater2) / len(rater2)
expected_agreement = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)

manual_kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)

print(f"Observed agreement: {observed_agreement:.0%}")
print(f"Expected (chance) agreement: {expected_agreement:.0%}")
print(f"Cohen's κ (sklearn): {kappa:.3f}")
print(f"Cohen's κ (manual):  {manual_kappa:.3f}")
print(f"\nInterpretation (Landis & Koch):")
print("  0.81-1.00: Almost perfect")
print("  0.61-0.80: Substantial")
print("  0.41-0.60: Moderate")
print("  0.21-0.40: Fair")
print("  0.00-0.20: Slight")

---

## Part 2: G-Eval - Systematic LLM-based Evaluation

[G-Eval](https://arxiv.org/abs/2303.16634) (Liu et al., 2023) is a framework for LLM-based evaluation that reported higher correlation with human judgments than earlier automatic metrics on summarization and dialogue benchmarks. It has three key components:

1. **Structured evaluation prompts** - Clear criteria definition and evaluation steps
2. **Auto-generated Chain-of-Thought** - The LLM expands the task description and criteria into detailed evaluation steps, which are then included in the judge prompt
3. **Probability-weighted scoring** - Uses token probabilities for finer-grained scores

### Designing Effective Judge Prompts

A good evaluation prompt should include:

| Component | Purpose | Example |
|-----------|---------|---------|
| **Role definition** | Set expectations | "You will be given one summary..." |
| **Metric definition** | Precise criteria | "Coherence (1-5) - the collective quality of all sentences..." |
| **Evaluation steps** | Guide reasoning | "1. Read the article... 2. Compare to summary..." |
| **Input format** | Structured data | "Source Text: {document}" |
| **Output format** | Parseable result | "Evaluation Form (scores ONLY):" |

The following prompt demonstrates these principles for coherence evaluation:

In [None]:
# G-Eval prompt template for coherence evaluation
GEVAL_COHERENCE_PROMPT = """You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Please make sure you read and understand these instructions carefully.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. The summary should 
be well-structured and well-organized. The summary should not just be a heap 
of related information, but should build from sentence to sentence into a 
coherent body of information about a topic.

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and compare it to the news article. Check if the summary 
   covers the main topic and key points, and if it presents them in a clear 
   and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest 
   and 5 is the highest, based on the Evaluation Criteria.

Source Text:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence: """

# Example usage
document = """The quarterly earnings report showed a 15% increase in revenue, 
driven primarily by strong performance in the cloud services division. 
However, operating costs rose by 8%, partially offsetting gains."""

summary = """The company's revenue grew substantially due to cloud services' 
success, though higher operating expenses moderated the overall financial 
improvement."""

print("G-Eval Coherence Prompt:")
print("-" * 50)
print(GEVAL_COHERENCE_PROMPT.format(document=document, summary=summary))

### Probability-Weighted Scoring

A key insight from G-Eval is that **discrete scores lose information**. When an LLM outputs "3", it might be 90% confident in 3 and 10% in 4, or it might be 50% for 3 and 40% for 2. These represent different underlying assessments.

**Probability-weighted scoring** extracts the LLM's full probability distribution over possible scores and computes the expected value:

$$\text{score} = \sum_{i=1}^{5} i \cdot P(i)$$

**Benefits:**
- **Finer granularity** - Can distinguish between items that would receive the same discrete score
- **Better correlation** - Often achieves higher correlation with human judgments
- **Uncertainty quantification** - High entropy in the distribution indicates uncertainty

The following function computes expected scores from probability distributions:

In [None]:
def probability_weighted_score(score_probs: dict[int, float]) -> float:
    """
    Compute expected score from probability distribution.
    
    score_probs: {score: probability} e.g., {1: 0.1, 2: 0.2, 3: 0.5, 4: 0.15, 5: 0.05}
    """
    return sum(score * prob for score, prob in score_probs.items())

# Example from the book: two summaries with same discrete score but different distributions
summary_a_probs = {1: 0.0, 2: 0.05, 3: 0.70, 4: 0.20, 5: 0.05}  # Confident 3
summary_b_probs = {1: 0.0, 2: 0.35, 3: 0.55, 4: 0.08, 5: 0.02}  # Uncertain 2-3

score_a = probability_weighted_score(summary_a_probs)
score_b = probability_weighted_score(summary_b_probs)

print("Both would receive discrete score of 3, but:")
print(f"  Summary A weighted score: {score_a:.2f}")
print(f"  Summary B weighted score: {score_b:.2f}")
print(f"\n^ Probability weighting correctly ranks A > B")
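
The uncertainty-quantification benefit can be sketched in the same way. The helper below, `score_entropy`, is an illustrative name; it computes the Shannon entropy of a score distribution, where higher entropy means the probability mass is spread over more scores. The two distributions repeat the example values used above.

```python
from math import log

def score_entropy(score_probs: dict[int, float]) -> float:
    """Shannon entropy (in nats) of a score distribution.

    Scores with zero probability contribute nothing to the sum.
    """
    return -sum(p * log(p) for p in score_probs.values() if p > 0)

summary_a_probs = {1: 0.0, 2: 0.05, 3: 0.70, 4: 0.20, 5: 0.05}
summary_b_probs = {1: 0.0, 2: 0.35, 3: 0.55, 4: 0.08, 5: 0.02}

entropy_a = score_entropy(summary_a_probs)
entropy_b = score_entropy(summary_b_probs)

print(f"Summary A entropy: {entropy_a:.3f} nats")
print(f"Summary B entropy: {entropy_b:.3f} nats")
print("B is the less certain judgment" if entropy_b > entropy_a else "A is the less certain judgment")
```

One possible use, offered as a suggestion rather than part of G-Eval itself, is to route high-entropy items to human review.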

### Implementing G-Eval with OpenAI

Now we put the pieces together: a complete G-Eval implementation that:

1. Sends the structured evaluation prompt to the LLM
2. Requests `logprobs` to access token probabilities
3. Extracts probabilities for score tokens (1-5)
4. Computes both discrete and probability-weighted scores

**Important API parameters:**
- `max_tokens=1` - We only need a single digit output
- `logprobs=True` - Enable log probability access
- `top_logprobs=5` - Get probabilities for the 5 most likely candidate tokens; any score that falls outside this set is treated as having probability 0

In [None]:
from openai import OpenAI

client = OpenAI()

def geval_score(document: str, summary: str, model: str = "gpt-4o-mini") -> dict:
    """
    G-Eval style scoring with probability weighting.
    Returns discrete score and weighted score.
    """
    prompt = GEVAL_COHERENCE_PROMPT.format(document=document, summary=summary)
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5
    )
    
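    # Extract the generated token and its candidate logprobs
    choice = response.choices[0]
    raw_token = (choice.message.content or "").strip()
    # The model can emit a non-digit token; report None instead of crashing
    discrete_score = int(raw_token) if raw_token in {"1", "2", "3", "4", "5"} else None
    
    # Build probability distribution over valid scores
    score_probs = {i: 0.0 for i in range(1, 6)}
    if choice.logprobs and choice.logprobs.content:
        for item in choice.logprobs.content[0].top_logprobs:
            token = item.token.strip()
            if token in ["1", "2", "3", "4", "5"]:
                # Accumulate: variants such as "3" and " 3" strip to the same score
                score_probs[int(token)] += float(np.exp(item.logprob))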
    
    # Normalize probabilities
    total = sum(score_probs.values())
    if total > 0:
        score_probs = {k: v / total for k, v in score_probs.items()}
    
    weighted_score = probability_weighted_score(score_probs)
    
    return {
        "discrete_score": discrete_score,
        "weighted_score": weighted_score,
        "probabilities": score_probs
    }
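
The distribution-building step can be tried without an API call. In this sketch, `score_distribution` is an illustrative helper and the `(token, logprob)` tuples are a hypothetical stand-in for the entries the API returns in `top_logprobs`; the values are invented. It shows two details worth handling: non-score tokens are ignored, and token variants that strip to the same digit are summed rather than overwritten.

```python
from math import exp, log

VALID_SCORES = ("1", "2", "3", "4", "5")

def score_distribution(top_logprobs: list[tuple[str, float]]) -> dict[int, float]:
    """Turn (token, logprob) candidates into a normalized distribution over scores 1-5."""
    probs = {int(s): 0.0 for s in VALID_SCORES}
    for token, logprob in top_logprobs:
        token = token.strip()
        if token in VALID_SCORES:
            probs[int(token)] += exp(logprob)  # sum variants like "3" and " 3"
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no score token among the candidates")
    # Renormalize so the probabilities over valid scores sum to 1
    return {score: p / total for score, p in probs.items()}

# Made-up candidates: two spellings of 3, one 4, and one non-score token
mock_top_logprobs = [("3", log(0.6)), ("4", log(0.2)), (" 3", log(0.1)), ("The", log(0.1))]

dist = score_distribution(mock_top_logprobs)
weighted = sum(score * p for score, p in dist.items())

print({score: round(p, 3) for score, p in dist.items()})
print(f"Weighted score: {weighted:.2f}")  # Weighted score: 3.22
```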

Let us test the G-Eval scorer on our example document and summary. Notice how the probability distribution provides insight into the model's confidence:

In [None]:
# Test G-Eval scoring
result = geval_score(document, summary)

print(f"Document: {document[:60]}...")
print(f"Summary: {summary[:60]}...")
print(f"\nDiscrete score: {result['discrete_score']}")
print(f"Weighted score: {result['weighted_score']:.2f}")
print(f"\nProbability distribution:")
for score, prob in result['probabilities'].items():
    bar = "█" * int(prob * 20)
    print(f"  {score}: {prob:.2%} {bar}")

---

## Part 3: Pairwise Comparison

While pointwise scoring (rating each response independently) is useful, **pairwise comparison** often achieves higher correlation with human preferences. This is because:

1. **Easier judgment** - "Which is better?" is often easier than "Rate this 1-5"
2. **Calibration-free** - No need to calibrate what a "3" means
3. **Natural for preference data** - Matches how RLHF training data is collected

### The Pairwise Judging Approach

In pairwise comparison, the LLM receives:
- The original question/prompt
- Two candidate responses (A and B)
- Instructions to compare and choose the better one

The prompt below follows the pairwise judge prompt from MT-Bench (Zheng et al., 2023), a widely used multi-turn benchmark in which an LLM judge grades chat assistants' responses:

In [None]:
PAIRWISE_PROMPT = """Please act as an impartial judge and evaluate the quality of the 
responses provided by two AI assistants to the user's question.

Your evaluation should consider correctness, helpfulness, and relevance.

Avoid any position biases and ensure that the order in which the responses 
were presented does not influence your decision.

[User Question]
{question}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}

After providing your explanation, output your final verdict by strictly 
following this format: "[[A]]" if assistant A is better, "[[B]]" if 
assistant B is better, and "[[C]]" for a tie."""

def pairwise_judge(
    question: str, 
    answer_a: str, 
    answer_b: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """Judge which response is better using pairwise comparison."""
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        answer_a=answer_a,
        answer_b=answer_b
    )
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    
    content = response.choices[0].message.content
    
    # Extract verdict
    if "[[A]]" in content:
        verdict = "A"
    elif "[[B]]" in content:
        verdict = "B"
    elif "[[C]]" in content:
        verdict = "tie"
    else:
        verdict = "unknown"
    
    return {"verdict": verdict, "reasoning": content}
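
One fragile spot in substring matching is a judge that quotes the verdict format in its explanation before giving the final answer. A possible alternative, sketched here with an illustrative helper named `extract_verdict`, is to take the last verdict marker in the text. The sample strings are invented.

```python
import re

VERDICT_PATTERN = re.compile(r"\[\[([ABC])\]\]")
VERDICT_LABELS = {"A": "A", "B": "B", "C": "tie"}

def extract_verdict(content: str) -> str:
    """Return the judge's final verdict, using the last [[X]] marker in the text."""
    matches = VERDICT_PATTERN.findall(content)
    return VERDICT_LABELS[matches[-1]] if matches else "unknown"

explanation = (
    'I considered answering "[[A]]" because A is concise, '
    "but B is more complete. Final verdict: [[B]]"
)

print(extract_verdict(explanation))            # B
print(extract_verdict("Both are fine. [[C]]"))  # tie
print(extract_verdict("No marker here."))       # unknown
```

A plain `"[[A]]" in content` check would have returned "A" for the first string.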

Let us test the pairwise judge with two responses of different quality:

In [None]:
# Test pairwise comparison
question = "What is the capital of France?"
answer_a = "The capital of France is Paris."
answer_b = "Paris is the capital city of France, known for the Eiffel Tower and rich cultural heritage."

result = pairwise_judge(question, answer_a, answer_b)
print(f"Question: {question}")
print(f"\nAssistant A: {answer_a}")
print(f"Assistant B: {answer_b}")
print(f"\nVerdict: {result['verdict']}")
print(f"\nReasoning:\n{result['reasoning']}")

### Mitigating Position Bias

**Position bias** is one of the most significant issues with LLM judges: they tend to prefer responses in certain positions (often the first one) regardless of quality.

**The debiasing strategy:**
1. Run the comparison with A first, B second
2. Run again with B first, A second  
3. Only trust the verdict if both runs agree

If the verdicts disagree, the comparison is marked "inconclusive" - this is better than returning a biased result.

**Research findings on position bias:**
- GPT-4 shows 10-15% position bias in some benchmarks
- Bias varies by task type and response length
- Swapping can reduce bias by 50% or more
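
Before relying on any published figure, it is worth measuring position bias for your own judge and task. The sketch below assumes you have already run each comparison in both orders and recorded the raw slot label the judge chose each time ("A" = first slot, "B" = second slot). The function name `position_bias_report` and the sample verdicts are invented for illustration.

```python
from collections import Counter

def position_bias_report(runs: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize (original_order_verdict, swapped_order_verdict) pairs.

    Verdicts are raw slot labels: "A" = first slot, "B" = second slot, or "tie".
    """
    counts = Counter()
    for original, swapped in runs:
        if original == swapped == "tie":
            counts["consistent"] += 1
        elif {original, swapped} == {"A", "B"}:
            # Different slots across the swap means the same response won both times
            counts["consistent"] += 1
        elif original == swapped == "A":
            counts["prefers_first"] += 1   # first slot won regardless of content
        elif original == swapped == "B":
            counts["prefers_second"] += 1  # second slot won regardless of content
        else:
            counts["other"] += 1           # one run tied or could not be parsed
    n = len(runs)
    keys = ("consistent", "prefers_first", "prefers_second", "other")
    return {key: counts[key] / n for key in keys}

# Made-up verdict pairs for eight comparisons
runs = [("A", "B"), ("A", "A"), ("B", "A"), ("A", "A"),
        ("tie", "tie"), ("B", "B"), ("A", "tie"), ("A", "B")]

report = position_bias_report(runs)
for key, rate in report.items():
    print(f"{key}: {rate:.1%}")
```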

In [None]:
def pairwise_judge_debiased(
    question: str,
    answer_a: str,
    answer_b: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    Run pairwise comparison twice with swapped positions.
    Only return a verdict if both runs agree.
    """
    # Original order: A first, B second
    result1 = pairwise_judge(question, answer_a, answer_b, model)
    
    # Swapped order: B first, A second
    result2 = pairwise_judge(question, answer_b, answer_a, model)
    
    # Map the swapped run's verdict back to the original A/B labels
    swapped_verdict = {"A": "B", "B": "A", "tie": "tie"}.get(result2["verdict"], "unknown")
    
    # Require agreement on a parseable verdict; two "unknown" results are not agreement
    if result1["verdict"] == swapped_verdict and result1["verdict"] != "unknown":
        confident = True
        final_verdict = result1["verdict"]
    else:
        confident = False
        final_verdict = "inconclusive"
    
    return {
        "verdict": final_verdict,
        "confident": confident,
        "original_order": result1["verdict"],
        "swapped_order": swapped_verdict  # already mapped back to the original labels
    }

# Test debiased comparison
result = pairwise_judge_debiased(question, answer_a, answer_b)
print(f"Original order verdict: {result['original_order']}")
print(f"Swapped order verdict: {result['swapped_order']} (mapped back)")
print(f"Final verdict: {result['verdict']}")
print(f"Confident: {result['confident']}")

---

## Part 4: Reference-Guided Grading

For **factual or reasoning tasks** (math, coding, knowledge questions), we can improve judge accuracy by providing a reference answer. However, there is a critical issue: **context contamination**.

### The Context Contamination Problem

If you ask the judge to generate a reference answer and then evaluate candidates in the same context, the judge may:
- Favor responses that match its own reasoning style
- Be biased toward its own errors
- Apply inconsistent standards

### The Two-Phase Solution

Reference-guided grading separates the process:

1. **Phase 1:** Generate reference answer in a clean context (no candidate answers visible)
2. **Phase 2:** Compare candidates against the reference in a separate API call

This ensures the reference is unbiased by the candidates being evaluated.

In [None]:
REFERENCE_GUIDED_PROMPT = """Please act as an impartial judge and evaluate the quality 
of the responses provided by two AI assistants to the user's question.

Your evaluation should consider correctness and helpfulness. You will be given 
a reference answer, assistant A's answer, and assistant B's answer.

Your job is to evaluate which assistant's answer is better.

Begin your evaluation by comparing both assistants' answers with the reference 
answer. Identify and correct any mistakes.

[User Question]
{question}

[Reference Answer]
{reference}

[Assistant A's Answer]
{answer_a}

[Assistant B's Answer]
{answer_b}

After providing your explanation, output your final verdict: "[[A]]" if 
assistant A is better, "[[B]]" if assistant B is better, "[[C]]" for a tie."""

def reference_guided_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    Two-phase evaluation:
    1. Generate reference answer in clean context
    2. Compare candidates against reference
    """
    # Phase 1: Generate reference answer
    ref_response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        max_tokens=500
    )
    reference = ref_response.choices[0].message.content
    
    # Phase 2: Compare with reference
    prompt = REFERENCE_GUIDED_PROMPT.format(
        question=question,
        reference=reference,
        answer_a=answer_a,
        answer_b=answer_b
    )
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    
    content = response.choices[0].message.content
    
    if "[[A]]" in content:
        verdict = "A"
    elif "[[B]]" in content:
        verdict = "B"
    elif "[[C]]" in content:
        verdict = "tie"
    else:
        verdict = "unknown"
    
    return {
        "verdict": verdict,
        "reference": reference,
        "reasoning": content
    }

Let us test reference-guided grading on a math problem where one answer is clearly wrong:

In [None]:
# Test reference-guided grading on a math problem
math_question = "What is 15% of 80?"
answer_correct = "15% of 80 is 12."
answer_wrong = "15% of 80 is 15."  # Common mistake

result = reference_guided_judge(math_question, answer_correct, answer_wrong)
print(f"Question: {math_question}")
print(f"\nAssistant A: {answer_correct}")
print(f"Assistant B: {answer_wrong}")
print(f"\nReference: {result['reference']}")
print(f"\nVerdict: {result['verdict']}")

---

## Part 5: Structured Output for Reliable Evaluation

When building production evaluation systems, you need **consistent, parseable outputs**. JSON structured outputs solve this by constraining the LLM to a predefined schema.

### Critical Design Principle: Reasoning Before Score

Due to the **autoregressive nature of LLMs**, the order of fields in your schema matters:

- **Wrong:** `{"score": 4, "reasoning": "..."}` - Score is generated before reasoning
- **Right:** `{"reasoning": "...", "score": 4}` - Reasoning informs the score

When reasoning comes first, the LLM "thinks through" the evaluation before committing to a score, leading to more thoughtful and consistent judgments.

The following uses Pydantic models with OpenAI's structured output feature:

In [None]:
from pydantic import BaseModel

class EvaluationResult(BaseModel):
    """Schema for structured evaluation output."""
    reasoning_steps: list[str]  # Must come BEFORE score (autoregressive ordering)
    score: int

STRUCTURED_PROMPT = """Evaluate this summary for coherence on a scale of 1-5.

Coherence: The summary should be well-structured, well-organized, and build 
from sentence to sentence into a coherent body of information.

Document: {document}
Summary: {summary}

Provide step-by-step reasoning, then assign a score."""

def structured_eval(document: str, summary: str, model: str = "gpt-4o-mini") -> dict:
    """Structured evaluation with reasoning before score."""
    response = client.beta.chat.completions.parse(
        model=model,
        messages=[{
            "role": "user",
            "content": STRUCTURED_PROMPT.format(document=document, summary=summary)
        }],
        response_format=EvaluationResult
    )
    
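    result = response.choices[0].message.parsed
    if result is None:  # e.g. the model refused, so nothing was parsed
        raise ValueError("Model returned no parsed evaluation")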
    return {
        "reasoning": result.reasoning_steps,
        "score": result.score
    }

# Test structured evaluation
result = structured_eval(document, summary)
print("Reasoning steps:")
for i, step in enumerate(result["reasoning"], 1):
    print(f"  {i}. {step}")
print(f"\nScore: {result['score']}")
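
When a judge model or provider does not offer native structured output, a similar guarantee can be approximated by validating the raw JSON yourself. The sketch below is one possible approach using only the standard library; `parse_evaluation` is a hypothetical helper, and the JSON strings are invented. It checks that both fields are present, that the reasoning was emitted before the score, and that the score is an integer in range.

```python
import json

def parse_evaluation(raw: str, min_score: int = 1, max_score: int = 5) -> dict:
    """Parse and validate a judge's JSON output; raise ValueError if it is unusable."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or "reasoning_steps" not in data or "score" not in data:
        raise ValueError("missing 'reasoning_steps' or 'score'")
    keys = list(data)  # json.loads preserves the order in which keys were generated
    if keys.index("reasoning_steps") > keys.index("score"):
        raise ValueError("score was generated before the reasoning")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("score must be an integer")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside {min_score}-{max_score}")
    return {"reasoning": data["reasoning_steps"], "score": score}

good = '{"reasoning_steps": ["Sentences follow a clear order."], "score": 4}'
print(parse_evaluation(good))

for bad in ('{"score": 4, "reasoning_steps": []}', '{"reasoning_steps": [], "score": 9}'):
    try:
        parse_evaluation(bad)
    except ValueError as err:
        print(f"Rejected: {err}")
```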

---

## Part 6: Multi-Model Evaluation

Different LLMs have different strengths and biases as judges. Using multiple judge models can:
- Reduce model-specific biases
- Increase evaluation robustness
- Provide uncertainty estimates (disagreement indicates hard cases)

### Using Claude as a Judge

Anthropic's Claude models work with the same prompting patterns. The main API differences are:
- Use `messages.create()` instead of `chat.completions.create()`, and always pass `max_tokens`, which the Messages API requires
- Response structure differs slightly (access via `response.content[0].text`)

In [None]:
from anthropic import Anthropic

anthropic_client = Anthropic()

def claude_judge(
    question: str,
    answer_a: str,
    answer_b: str,
    model: str = "claude-sonnet-4-20250514"
) -> dict:
    """Pairwise comparison using Claude."""
    prompt = PAIRWISE_PROMPT.format(
        question=question,
        answer_a=answer_a,
        answer_b=answer_b
    )
    
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    content = response.content[0].text
    
    if "[[A]]" in content:
        verdict = "A"
    elif "[[B]]" in content:
        verdict = "B"
    elif "[[C]]" in content:
        verdict = "tie"
    else:
        verdict = "unknown"
    
    return {"verdict": verdict, "reasoning": content}

# Test Claude as judge
result = claude_judge(question, answer_a, answer_b)
print(f"Claude's verdict: {result['verdict']}")
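
Once several judges have returned verdicts, they need to be combined. A simple option, sketched below, is a majority vote that also reports how strongly the judges agree. `combine_verdicts` and the judge names are hypothetical; the verdicts are hard-coded so the example runs without API calls.

```python
from collections import Counter

def combine_verdicts(verdicts: dict[str, str]) -> dict:
    """Majority vote over {judge_name: verdict}; unparseable verdicts are ignored."""
    valid = [v for v in verdicts.values() if v in ("A", "B", "tie")]
    if not valid:
        return {"verdict": "unknown", "agreement": 0.0, "unanimous": False}
    (top, top_count), *rest = Counter(valid).most_common()
    # If two verdicts share the top count there is no majority
    final = "inconclusive" if rest and rest[0][1] == top_count else top
    return {
        "verdict": final,
        "agreement": top_count / len(verdicts),  # share of all judges backing the top verdict
        "unanimous": top_count == len(verdicts),
    }

# Hypothetical verdicts from three judge models
panel = {"judge_1": "A", "judge_2": "A", "judge_3": "B"}
combined = combine_verdicts(panel)
print(combined["verdict"], f"{combined['agreement']:.0%}", combined["unanimous"])

split = combine_verdicts({"judge_1": "A", "judge_2": "B"})
print(split["verdict"])  # inconclusive
```

Low agreement is itself a useful signal: those comparisons are candidates for human review.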

---

## Biases and Limitations of LLM Judges

Before relying on LLM judges, understand their key limitations:

### Known Biases

| Bias | Description | Mitigation |
|------|-------------|------------|
| **Position bias** | Preference for first or second position | Swap and require agreement |
| **Verbosity bias** | Preference for longer responses | Normalize for length; test explicitly |
| **Self-enhancement** | LLMs prefer their own outputs | Use different model as judge |
| **Authority bias** | Swayed by confident-sounding text | Focus on factual accuracy criteria |
| **Style bias** | Preference for certain writing styles | Multi-dimensional rubrics |

### When NOT to Use LLM Judges

- **Safety-critical decisions** - Human review required
- **Legal/compliance** - Regulatory requirements may prohibit
- **Novel domains** - LLM may lack expertise
- **Adversarial settings** - Can be gamed by crafted responses

### Best Practices

1. **Always validate** against human annotations on a held-out set
2. **Report confidence intervals** for correlation metrics
3. **Test for known biases** before deployment
4. **Use multiple judges** for important decisions
5. **Monitor for drift** as LLMs update

---

## Exercises

Practice implementing and testing LLM judges with these exercises:

1. **Compute agreement metrics**: Calculate Spearman's rho and Kendall's tau between two human annotators' scores. On what kinds of data do the two metrics diverge?

2. **Test for verbosity bias**: Create two responses with identical content but different lengths. Does the judge prefer the longer one?

3. **Build a multi-dimensional rubric**: Evaluate responses on helpfulness, accuracy, and tone. Use geometric mean to combine scores.

**Exercise 2: Verbosity Bias Test**

Test whether your LLM judge exhibits verbosity bias by comparing a concise correct answer to a verbose correct answer. If the judge prefers the verbose answer despite equal correctness, that indicates verbosity bias.

In [None]:
# Exercise 2: Verbosity bias test
concise_answer = "The capital of France is Paris."
verbose_answer = """The capital of France is Paris. Paris is a beautiful city 
located in the north-central part of France. It is known for many famous 
landmarks including the Eiffel Tower, the Louvre Museum, and Notre-Dame 
Cathedral. The city has been the capital since the 10th century and remains 
the political, economic, and cultural center of France today."""

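# Use the debiased judge so that position bias is not mistaken for verbosity bias
result = pairwise_judge_debiased(
    "What is the capital of France?",
    concise_answer,
    verbose_answer
)
print(f"Concise: {concise_answer}")
print(f"Verbose: {verbose_answer[:50]}...")
print(f"\nVerdict: {result['verdict']}  (B = the verbose answer)")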
print("^ Does the judge exhibit verbosity bias?")
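
As a starting point for Exercise 3, the sketch below shows why the geometric mean is a reasonable way to combine rubric dimensions: a very low score on one dimension pulls the combined score down more than it would with an arithmetic mean. The `combined_score` helper and the rubric scores are invented for illustration; in a real run the scores would come from your judge.

```python
from statistics import geometric_mean, mean

def combined_score(dimension_scores: dict[str, float]) -> float:
    """Combine per-dimension rubric scores (all > 0) with the geometric mean."""
    return geometric_mean(list(dimension_scores.values()))

# Made-up rubric scores on a 1-5 scale
balanced = {"helpfulness": 4, "accuracy": 4, "tone": 3}
lopsided = {"helpfulness": 5, "accuracy": 1, "tone": 5}

for name, scores in (("balanced", balanced), ("lopsided", lopsided)):
    arithmetic = mean(scores.values())
    geometric = combined_score(scores)
    print(f"{name}: arithmetic={arithmetic:.2f}, geometric={geometric:.2f}")
```

Both responses have the same arithmetic mean, but the geometric mean ranks the inaccurate one clearly lower.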

---

## Summary and Key Takeaways

### What We Learned

1. **LLM judges are powerful** - They can evaluate subjective qualities that traditional metrics cannot, approximating human judgment at scale

2. **Measuring agreement is essential** - Use Spearman's rho for rankings, Kendall's tau for pairwise preferences, and Cohen's kappa for categorical judgments

3. **Three main judging approaches:**
   - **Pointwise (G-Eval)** - Rate each response independently; use probability weighting for finer granularity
   - **Pairwise** - Compare two responses directly; often higher correlation with human preferences
   - **Reference-guided** - Use a separate reference answer for factual tasks; avoid context contamination

4. **Prompt engineering matters:**
   - Define clear evaluation criteria
   - Include step-by-step evaluation instructions
   - Put reasoning before scores in structured outputs
   - Use consistent output formats

5. **Biases require active mitigation:**
   - Position bias: swap and require agreement
   - Verbosity bias: test explicitly; consider length normalization
   - Self-enhancement: use different models for generation and judging

### Next Steps

- Build an evaluation pipeline for your use case
- Collect human annotations to validate your LLM judge
- Experiment with different prompt designs and models
- Consider ensemble approaches for critical decisions

### Further Reading

- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://arxiv.org/abs/2303.16634)
- [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)
- [Large Language Models are not Fair Evaluators](https://arxiv.org/abs/2305.17926)