# Homework 3: LLM-as-Judge for Recipe Bot Evaluation

This notebook walks through building an LLM judge to evaluate dietary adherence.

**What you'll learn:**
- How to label data for judge development
- How to write an effective judge prompt
- How to measure judge performance (TPR/TNR)
- How to use `judgy` to correct for judge bias

Video walkthrough: https://youtu.be/1d5aNfslwHg

**Bonus**: [Using AI Assisted Coding to Tackle Homework Problems](https://link.courses.maven.com/c/eJw80M2upCAQBeCngZ0Gil8XLGbja5gCymkTbAyoyX37id2Tu6rUl7OoOqm-ajuXLQeQk554qlfr9OxST8rxHHQWMhpOQTrrrFIgNacdt7Kkgr2H2CrmhP38r-fPQYHerZZCmdP7Xr5-XVsOR6t5hKSzIXKDB2MHnQwNHiQMWhIJQ-C9Q_4K3mmbY1zRC_LZw-Scw9XHKCe_Kot8CyDACimMdFIpNRpjwGaX_JogSeuZFt9_-rjjTe8x1Z1vfVlb3ZePhBlLJ17C6zyPztQfBjOD-TfNYD6wFXwnGgrGzmCmG8szQYAZFIO5_5SC8Xpsr_kq9El5J4ziLWwdMY1rwfPFtPj7VPE54w7wLwAA__8a93gB)

![AI Assisted Coding Walkthrough Location](../imgs/AIHwWalkthrough.png)

In [1]:
import json
from pathlib import Path
from collections import Counter
import random

## 1. Look at Your Data First

Before writing any code, **look at your data**. We have labeled traces in `reference_files/labeled_traces.jsonl`.

> 💡 **What's an LLM judge?** Instead of manually reviewing every bot response, we prompt another LLM to do it for us. (No model fine-tuning required, just a well-crafted prompt with examples.) But first, we need human-labeled examples to teach the judge what "good" and "bad" look like.

Each trace has:
- A user query with a dietary restriction
- The bot's response (a recipe)
- A human label: PASS or FAIL

### Use the HTML Viewer

Open `reference_files/trace_viewer.html` in your browser and upload the JSONL file. This lets you:
- Navigate between traces with arrow keys
- See the query, response, and label for each trace
- Understand what PASS vs FAIL looks like

**Pro tip**: You can vibe-code your own viewer. Try this prompt:

> "Make a self-contained HTML file to view JSONL files. It should let me upload a file, navigate between records, and display all fields nicely."

This is a useful skill for quickly exploring any dataset.

In [2]:
BASE_PATH = Path('reference_files')

# Load labeled traces (JSONL format)
traces = []
with open(BASE_PATH / 'labeled_traces.jsonl', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            traces.append(json.loads(line))

print(f"Loaded {len(traces)} labeled traces")
print(f"Example keys: {list(traces[0].keys())}")

Loaded 101 labeled traces
Example keys: ['query', 'dietary_restriction', 'response', 'success', 'error', 'trace_id', 'query_id', 'label', 'reasoning', 'confidence', 'labeled']


In [3]:
# Look at one example
t = traces[0]
print(f"Query: {t['query']}")
print(f"Dietary restriction: {t['dietary_restriction']}")
print(f"\nResponse (first 400 chars):\n{t['response'][:400]}...")
print(f"\nLabel: {t['label']}")
print(f"Reasoning: {t['reasoning'][:200]}...")

Query: Gluten-light recipe - I'm not celiac just sensitive
Dietary restriction: gluten-free

Response (first 400 chars):
Absolutely! Here's a delicious and easy Gluten-Light **Lemon Herb Quinoa Salad** perfect for two people.

### Ingredients:
- 1 cup quinoa
- 2 cups water
- 1 medium cucumber, diced
- 1/4 cup fresh parsley, chopped
- 1/4 cup fresh mint, chopped
- 1 lemon (for juice and zest)
- 2 tablespoons olive oil
- Salt and pepper to taste
- Optional: 1/4 cup crumbled feta cheese or chopped cherry tomatoes

### ...

Label: FAIL
Reasoning: The recipe is labeled as 'Gluten-Light' and not explicitly 'Gluten-Free'. Quinoa is naturally gluten-free, making it suitable for a gluten-free diet. However, the recipe includes an optional ingredien...


In [4]:
# Check label distribution
labels = Counter(t['label'] for t in traces)
print(f"Label distribution: {dict(labels)}")

# Check dietary restriction distribution
restrictions = Counter(t['dietary_restriction'] for t in traces)
print(f"\nDietary restrictions: {dict(restrictions)}")

Label distribution: {'FAIL': 26, 'PASS': 75}

Dietary restrictions: {'gluten-free': 10, 'nut-free': 2, 'vegan': 11, 'kosher': 2, 'paleo': 10, 'dairy-free': 4, 'low-carb': 11, 'raw vegan': 4, 'keto': 7, 'pescatarian': 6, 'whole30': 2, 'sugar-free': 4, 'vegetarian': 18, 'halal': 3, 'diabetic-friendly': 6, 'low-sodium': 1}


## 2. Split the Data

> 💡 **Why split?** Train examples teach the judge what to look for. Dev lets us iterate and improve. Test is our final check that the solution generalizes; we hold it out until the end to ensure we haven't overfit to our training and dev sets.

We split into:
- **Train (10-20%)**: For few-shot examples in the judge prompt
- **Dev (40%)**: For iterating on the prompt
- **Test (40-50%)**: For final evaluation (don't peek until the end!)

In [5]:
# Shuffle and split
random.seed(42)
shuffled = traces.copy()
random.shuffle(shuffled)

n = len(shuffled)
train_end = int(n * 0.15)
dev_end = int(n * 0.55)

train_set = shuffled[:train_end]
dev_set = shuffled[train_end:dev_end]
test_set = shuffled[dev_end:]

print(f"Train: {len(train_set)} traces")
print(f"Dev: {len(dev_set)} traces")
print(f"Test: {len(test_set)} traces")

Train: 15 traces
Dev: 40 traces
Test: 46 traces


In [6]:
# Check label balance in each split
for name, split in [("Train", train_set), ("Dev", dev_set), ("Test", test_set)]:
    counts = Counter(t['label'] for t in split)
    print(f"{name}: {dict(counts)}")

Train: {'PASS': 11, 'FAIL': 4}
Dev: {'PASS': 33, 'FAIL': 7}
Test: {'PASS': 31, 'FAIL': 15}
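
The plain random split above leaves the FAIL share uneven (7 of 40 in dev versus 15 of 46 in test). If that matters for your iteration, one option is to shuffle and split each label group separately. The following is a minimal sketch, assuming each record is a dict with a `label` key; `stratified_split` and its fraction arguments are hypothetical names, not part of the homework code.

```python
import random
from collections import defaultdict

def stratified_split(records, train_frac=0.15, dev_frac=0.40, seed=42):
    """Split records into train/dev/test while keeping label proportions similar."""
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)

    train, dev, test = [], [], []
    for label in sorted(by_label):  # sorted so the result is deterministic
        group = by_label[label][:]
        rng.shuffle(group)
        train_end = int(len(group) * train_frac)
        dev_end = train_end + int(len(group) * dev_frac)
        train += group[:train_end]
        dev += group[train_end:dev_end]
        test += group[dev_end:]
    return train, dev, test

# Toy records mirroring the 75 PASS / 26 FAIL distribution seen earlier
toy = [{"label": "PASS"}] * 75 + [{"label": "FAIL"}] * 26
train, dev, test = stratified_split(toy)
for name, split in [("Train", train), ("Dev", dev), ("Test", test)]:
    n_fail = sum(r["label"] == "FAIL" for r in split)
    print(f"{name}: {len(split)} traces, {n_fail} FAIL")
```

Because each label group is rounded down separately, split sizes can differ by a trace or two from the unstratified version.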


## 3. Build the Judge Prompt

Now we write the instructions that tell our LLM judge how to evaluate responses. We need to explain the task clearly and show examples of what PASS and FAIL look like.

A good judge prompt includes:
1. The task and criterion
2. Clear PASS/FAIL definitions
3. Few-shot examples from the train set
4. Expected output format

In [7]:
# Select few-shot examples (one PASS, one FAIL)
pass_examples = [t for t in train_set if t['label'] == 'PASS']
fail_examples = [t for t in train_set if t['label'] == 'FAIL']

example_pass = pass_examples[0] if pass_examples else None
example_fail = fail_examples[0] if fail_examples else None

print("Selected examples for few-shot:")
if example_pass:
    print(f"\nPASS example: {example_pass['query'][:60]}...")
if example_fail:
    print(f"FAIL example: {example_fail['query'][:60]}...")

Selected examples for few-shot:

PASS example: Keto meal that doesn't make me feel gross...
FAIL example: I want something light but filling...


In [8]:
# Build the judge prompt
def build_judge_prompt(example_pass, example_fail):
    prompt = """
You are evaluating whether a recipe chatbot correctly followed a user's dietary restriction.

## Criterion: Dietary Adherence

The bot should provide recipes that actually meet the user's stated dietary restrictions.

- **PASS**: The recipe fully adheres to the dietary restriction. All ingredients are compliant.
- **FAIL**: The recipe violates the dietary restriction in any way (wrong ingredients, cross-contamination risk, etc.)

## Common Dietary Restrictions

- Vegan: No animal products (meat, dairy, eggs, honey)
- Vegetarian: No meat or fish, but dairy and eggs allowed
- Gluten-free: No wheat, barley, rye, or gluten-containing ingredients
- Keto: Very low carb (<20g net carbs), high fat
- Nut-free: No tree nuts or peanuts

## Examples

"""
    
    if example_pass:
        prompt += f"""
### Example 1 (PASS)

Query: {example_pass['query']}
Dietary Restriction: {example_pass['dietary_restriction']}
Response: {example_pass['response'][:500]}...

Judgment: PASS
Reasoning: {example_pass['reasoning'][:200]}...

"""
    
    if example_fail:
        prompt += f"""
### Example 2 (FAIL)

Query: {example_fail['query']}
Dietary Restriction: {example_fail['dietary_restriction']}
Response: {example_fail['response'][:500]}...

Judgment: FAIL
Reasoning: {example_fail['reasoning'][:200]}...

"""
    
    prompt += """
## Your Task

Evaluate the following trace. Respond with JSON:
{"judgment": "PASS" or "FAIL", "reasoning": "your reasoning here"}

Query: {query}
Dietary Restriction: {dietary_restriction}
Response: {response}

Judgment:
"""
    return prompt

judge_prompt = build_judge_prompt(example_pass, example_fail)
print("Judge prompt template (first 1500 chars):")
print(judge_prompt[:1500])

Judge prompt template (first 1500 chars):

You are evaluating whether a recipe chatbot correctly followed a user's dietary restriction.

## Criterion: Dietary Adherence

The bot should provide recipes that actually meet the user's stated dietary restrictions.

- **PASS**: The recipe fully adheres to the dietary restriction. All ingredients are compliant.
- **FAIL**: The recipe violates the dietary restriction in any way (wrong ingredients, cross-contamination risk, etc.)

## Common Dietary Restrictions

- Vegan: No animal products (meat, dairy, eggs, honey)
- Vegetarian: No meat or fish, but dairy and eggs allowed
- Gluten-free: No wheat, barley, rye, or gluten-containing ingredients
- Keto: Very low carb (<20g net carbs), high fat
- Nut-free: No tree nuts or peanuts

## Examples


### Example 1 (PASS)

Query: Keto meal that doesn't make me feel gross
Dietary Restriction: keto
Response: Absolutely! I recommend trying a **Garlic Butter Shrimp with Spinach**. It's flavorful, satisfying, 
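
The template leaves `{query}`, `{dietary_restriction}`, and `{response}` as placeholders to fill per trace. Because the template also contains a literal JSON example with braces, `str.format` would trip over it, so one option is to substitute only the three known placeholders. A minimal sketch, assuming traces carry those three keys; `fill_judge_prompt` is a hypothetical helper and the short template below is a stand-in for the full judge prompt.

```python
import re

# Matches only the three placeholders used by the judge template
PLACEHOLDER = re.compile(r"\{(query|dietary_restriction|response)\}")

def fill_judge_prompt(template, trace):
    """Fill placeholders in one pass, leaving all other braces untouched."""
    return PLACEHOLDER.sub(lambda m: str(trace[m.group(1)]), template)

# Shortened stand-in for the real template
template = (
    'Respond with JSON:\n'
    '{"judgment": "PASS" or "FAIL", "reasoning": "your reasoning here"}\n\n'
    'Query: {query}\n'
    'Dietary Restriction: {dietary_restriction}\n'
    'Response: {response}\n'
)
trace = {
    "query": "Quick vegan dinner",
    "dietary_restriction": "vegan",
    "response": "Lentil curry with rice.",
}
filled = fill_judge_prompt(template, trace)
print(filled)
```

A single regex pass means substituted text is never rescanned, so a recipe that happens to contain braces is inserted verbatim.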

## 4. Evaluate on Dev Set

To actually run the judge, you'd call an LLM API. Here we'll simulate the process.

> 💡 **Why measure both TPR and TNR?** A judge that says "PASS" to everything has perfect TPR but zero TNR: it catches no failures! We need both metrics to know the judge is actually useful.

Key metrics:
- **TPR (True Positive Rate)**: Of actual PASSes, how many did we correctly identify?
- **TNR (True Negative Rate)**: Of actual FAILs, how many did we correctly identify?
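
To make the always-PASS case concrete, here is a small sketch using the dev split's label counts from earlier (33 PASS, 7 FAIL). `rates` is a hypothetical helper written only for this illustration.

```python
def rates(pairs):
    """Return (TPR, TNR) for (actual, predicted) pairs, treating PASS as positive."""
    tp = sum(a == "PASS" and p == "PASS" for a, p in pairs)
    tn = sum(a == "FAIL" and p == "FAIL" for a, p in pairs)
    n_pass = sum(a == "PASS" for a, _ in pairs)
    n_fail = sum(a == "FAIL" for a, _ in pairs)
    tpr = tp / n_pass if n_pass else 0.0
    tnr = tn / n_fail if n_fail else 0.0
    return tpr, tnr

actual = ["PASS"] * 33 + ["FAIL"] * 7
always_pass = [(a, "PASS") for a in actual]  # judge that never says FAIL

tpr, tnr = rates(always_pass)
accuracy = sum(a == p for a, p in always_pass) / len(always_pass)
print(f"TPR={tpr:.1%}, TNR={tnr:.1%}, accuracy={accuracy:.1%}")
# TPR=100.0%, TNR=0.0%, accuracy=82.5%
```

Plain accuracy looks respectable here only because most traces are PASS, which is why the two rates are reported separately.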

In [9]:
# Simulated judge predictions (in practice, you'd call an LLM)
# For demonstration, we'll assume the judge matches ground truth 85% of the time
def simulate_judge(trace, accuracy=0.85):
    """Simulate a judge that's correct 85% of the time."""
    if random.random() < accuracy:
        return trace['label']  # Correct
    else:
        return 'FAIL' if trace['label'] == 'PASS' else 'PASS'  # Wrong

# Run on dev set
random.seed(123)
dev_predictions = [(t['label'], simulate_judge(t)) for t in dev_set]

print(f"Ran judge on {len(dev_predictions)} dev traces")

Ran judge on 40 dev traces
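
With a real judge, `simulate_judge` would be replaced by an API call followed by parsing the JSON reply the prompt asks for. The sketch below covers only the parsing side, assuming the model returns text containing a JSON object with a `judgment` key. `call_llm` is a hypothetical callable standing in for whichever client you use, and `parse_judgment` and `run_judge` are illustrative names.

```python
import json
import re

def parse_judgment(raw_text):
    """Extract 'PASS' or 'FAIL' from a model reply; return None if unparseable."""
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)  # first '{' to last '}'
    if not match:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    judgment = str(data.get("judgment", "")).strip().upper()
    return judgment if judgment in {"PASS", "FAIL"} else None

def run_judge(prompt, call_llm):
    """call_llm is hypothetical: it takes a prompt string and returns reply text."""
    return parse_judgment(call_llm(prompt))

def fake_llm(prompt):
    """Stub used in place of a real API call, for illustration only."""
    return 'Sure!\n{"judgment": "fail", "reasoning": "Contains feta."}'

print(run_judge("...", fake_llm))  # FAIL
```

Returning `None` for unparseable replies lets you count and inspect them separately instead of silently scoring them as PASS or FAIL.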


In [10]:
# Calculate TPR and TNR
def calculate_metrics(predictions):
    """Calculate TPR and TNR from (actual, predicted) pairs."""
    tp = sum(1 for actual, pred in predictions if actual == 'PASS' and pred == 'PASS')
    fn = sum(1 for actual, pred in predictions if actual == 'PASS' and pred == 'FAIL')
    tn = sum(1 for actual, pred in predictions if actual == 'FAIL' and pred == 'FAIL')
    fp = sum(1 for actual, pred in predictions if actual == 'FAIL' and pred == 'PASS')
    
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
    
    return {
        'TPR': tpr,
        'TNR': tnr,
        'TP': tp, 'FN': fn, 'TN': tn, 'FP': fp
    }

dev_metrics = calculate_metrics(dev_predictions)
print("Dev Set Metrics")
print("=" * 30)
print(f"TPR (True Positive Rate): {dev_metrics['TPR']:.1%}")
print(f"TNR (True Negative Rate): {dev_metrics['TNR']:.1%}")
print(f"\nConfusion matrix:")
print(f"  TP={dev_metrics['TP']}, FN={dev_metrics['FN']}")
print(f"  FP={dev_metrics['FP']}, TN={dev_metrics['TN']}")

Dev Set Metrics
TPR (True Positive Rate): 87.9%
TNR (True Negative Rate): 100.0%

Confusion matrix:
  TP=29, FN=4
  FP=0, TN=7
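
With only 7 FAIL traces in dev, a TNR of 100% carries a lot of uncertainty. One common way to gauge this is a Wilson score interval for each rate. The sketch below is an illustration only; `wilson_interval` is a hypothetical helper, not part of the notebook or of `judgy`.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

tpr_lo, tpr_hi = wilson_interval(29, 33)  # TP=29 out of 33 actual PASS
tnr_lo, tnr_hi = wilson_interval(7, 7)    # TN=7 out of 7 actual FAIL
print(f"TPR 95% interval: [{tpr_lo:.3f}, {tpr_hi:.3f}]")
print(f"TNR 95% interval: [{tnr_lo:.3f}, {tnr_hi:.3f}]")
```

For 7 out of 7, the lower end of the interval is about 0.65, so the dev set alone cannot rule out a judge that misses a meaningful share of failures.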


## 5. Use Judgy to Correct Bias

Since no judge is perfect, we use math to correct for its mistakes. If we know a judge has 90% TPR and 85% TNR, we can work backwards from its predictions to estimate the *true* success rate.

> 💡 **The key insight**: A biased judge gives biased results, but if we know *how* it's biased (from our dev set evaluation), we can mathematically adjust for it.

The `judgy` library does this correction for you.
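
The idea behind the point estimate can be sketched by hand. If the true pass rate is θ, the judge's observed pass rate is `p_obs = θ·TPR + (1 − θ)·(1 − TNR)`, and solving for θ gives `θ = (p_obs + TNR − 1) / (TPR + TNR − 1)`. The sketch below applies this to the 90% TPR / 85% TNR example; `correct_pass_rate` is a hypothetical helper, the observed rate of 0.80 is made up for illustration, and this is not `judgy`'s own code (the confidence interval is omitted here).

```python
def correct_pass_rate(p_obs, tpr, tnr):
    """Estimate the true pass rate from the judge's observed pass rate."""
    denom = tpr + tnr - 1
    if denom <= 0:
        raise ValueError("Judge is no better than chance (TPR + TNR <= 1)")
    theta = (p_obs + tnr - 1) / denom
    return min(1.0, max(0.0, theta))  # clip to a valid proportion

# Illustrative numbers: judge passes 80% of traces, with TPR=0.90 and TNR=0.85
theta = correct_pass_rate(p_obs=0.80, tpr=0.90, tnr=0.85)
print(f"Corrected pass rate: {theta:.1%}")  # Corrected pass rate: 86.7%
```

The denominator shrinks as the judge approaches chance level, so small errors in the measured TPR and TNR get amplified; a stronger judge makes the correction more stable.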

Install: `pip install judgy`

Documentation: https://github.com/ai-evals-course/judgy

In [11]:
# Example of using judgy (uncomment to run if installed)
# from judgy import estimate_success_rate
#
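# # Convert labels and predictions to binary (1=PASS, 0=FAIL).
# # Demonstration only: this reuses dev_predictions; for your final numbers,
# # use the judge's predictions on the held-out test set instead.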
# test_labels = [1 if actual == 'PASS' else 0 for actual, _ in dev_predictions]
# test_preds = [1 if pred == 'PASS' else 0 for _, pred in dev_predictions]
#
# # unlabeled_preds would be your judge's predictions on unlabeled data
# # For demonstration, we reuse test_preds
# unlabeled_preds = test_preds
#
# theta_hat, lower, upper = estimate_success_rate(test_labels, test_preds, unlabeled_preds)
# raw_pass_rate = sum(test_preds) / len(test_preds)
# print(f"Raw observed pass rate: {raw_pass_rate:.1%}")
# print(f"Corrected pass rate: {theta_hat:.1%}")
# print(f"95% Confidence Interval: [{lower:.3f}, {upper:.3f}]")

# Simulated example
print("Example judgy output (simulated):")
print("=" * 40)
print("Raw Observed Success Rate: 0.857 (85.7%)")
print("Corrected Success Rate: 0.926 (92.6%)")
print("95% Confidence Interval: [0.817, 1.000]")

Example judgy output (simulated):
Raw Observed Success Rate: 0.857 (85.7%)
Corrected Success Rate: 0.926 (92.6%)
95% Confidence Interval: [0.817, 1.000]


## 6. Final Evaluation on Test Set

Once you're happy with your judge, run it on the held-out test set.

In [12]:
# Run on test set
random.seed(456)
test_predictions = [(t['label'], simulate_judge(t)) for t in test_set]

test_metrics = calculate_metrics(test_predictions)
print("Test Set Metrics (Final)")
print("=" * 30)
print(f"TPR (True Positive Rate): {test_metrics['TPR']:.1%}")
print(f"TNR (True Negative Rate): {test_metrics['TNR']:.1%}")

Test Set Metrics (Final)
TPR (True Positive Rate): 77.4%
TNR (True Negative Rate): 86.7%


## Summary

**What we covered:**
1. Loading and understanding labeled traces
2. Splitting data into train/dev/test
3. Building a judge prompt with few-shot examples
4. Measuring TPR and TNR
5. Using judgy to correct for bias
6. Running the final evaluation on the held-out test set

**Your deliverables:**
1. Labeled dataset with train/dev/test splits
2. Final judge prompt with few-shot examples
3. Judge performance (TPR/TNR on test set)
4. Final evaluation using judgy (raw rate, corrected rate, CI)
5. Brief analysis (1-2 paragraphs)