# 📓 The GenAI Revolution Cookbook

**Title:** LLM System Evaluation Without Gold Answers: A Practical How-To

**Description:** Build a reference-free LLM grading system that scores open responses using rubrics, few-shots, and multi-judge consensus, plus validation against human labels and drift monitoring.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Why This Approach Works

Manual evaluation is slow, subjective, and doesn't scale. When you're iterating on prompts, testing new models, or running A/B experiments, you need fast, consistent feedback. Reference-free LLM evaluation solves this by using a frontier model as a judge. It scores outputs against a rubric without needing gold-standard references, making it practical for open-ended tasks like summarization, creative writing, or customer support responses.

This guide builds a production-ready evaluator with structured outputs, few-shot anchoring, validation against human labels, and optional multi-judge consensus. You'll have a Colab-ready notebook that you can later wrap in a minimal FastAPI service.

## How It Works (High-Level Overview)

The system follows this flow:

1. **Inputs**: Task instructions, a rubric, a question, and a candidate response.
2. **Judge Prompt**: Combine rubric criteria and few-shot examples into a strict prompt.
3. **Model Call**: Send the prompt to an LLM with low temperature and JSON mode enabled.
4. **Parse & Validate**: Use Pydantic to enforce schema and catch malformed outputs.
5. **Logging**: Write scores, rationales, and metadata to CSV for tracking.
6. **Consensus (Optional)**: Run multiple judges at different temperatures and average scores to reduce variance.
7. **Validation**: Compare AI scores to human labels using Spearman correlation and Cohen's kappa.
8. **Outputs**: Structured evaluation results, CSV logs, and validation metrics.

## Setup & Installation

Run this cell first to install all dependencies.

In [None]:
!pip install openai anthropic pydantic tenacity transformers scipy scikit-learn fastapi uvicorn pandas

Import all required libraries and verify they load correctly.

In [None]:
import os
import json
import time
import csv
from datetime import datetime
from typing import Dict, List, Optional
from statistics import mean

import pandas as pd
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

from pydantic import BaseModel, Field, ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import OpenAI
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from fastapi import FastAPI, HTTPException

print("All imports successful.")

Securely load API keys using Colab's built-in secrets manager. This cell checks for required keys and raises an error if any are missing.

In [None]:
from google.colab import userdata
from google.colab.userdata import SecretNotFoundError

keys = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
missing = []
for k in keys:
    value = None
    try:
        value = userdata.get(k)
    except SecretNotFoundError:
        pass

    os.environ[k] = value if value is not None else ""

    if not os.environ[k]:
        missing.append(k)

if missing:
    raise EnvironmentError(f"Missing keys: {', '.join(missing)}. Add them in Colab → Settings → Secrets.")

print("All keys loaded.")

## Step-by-Step Implementation

### Define the Rubric and Evaluation Schema

Use Pydantic to enforce structure and catch malformed outputs early. This prevents silent failures and makes debugging easier.

In [None]:
class CriterionScore(BaseModel):
    """
    Represents a single criterion's score and rationale.

    Args:
        score (int): Score for the criterion (1-5).
        rationale (str): Explanation for the score.

    Raises:
        ValidationError: If score is not in the range 1-5.
    """
    score: int = Field(ge=1, le=5)
    rationale: str

class EvaluationResult(BaseModel):
    """
    Represents the full evaluation result for a response.

    Args:
        criteria (Dict[str, CriterionScore]): Scores and rationales for each criterion.
        overall_score (float): Overall score (1-5).
        summary_feedback (str): Summary feedback for the response.

    Raises:
        ValidationError: If any field is missing or out of range.
    """
    criteria: Dict[str, CriterionScore]
    overall_score: float = Field(ge=1, le=5)
    summary_feedback: str

RUBRIC = {
    "relevance": "How directly does the response address the prompt. 1 off-topic. 3 partially relevant. 5 fully on-point.",
    "coherence": "How clear and logically structured is the response. 1 disorganized. 3 somewhat clear. 5 very clear.",
    "depth": "How thorough and insightful is the response. 1 superficial. 3 adequate detail. 5 deep and well-supported.",
    "creativity": "How original and valuable are the ideas. 1 generic. 3 somewhat original. 5 fresh and compelling."
}
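
To see what the schema buys you, the sketch below re-declares `CriterionScore` with the same fields as above (so the cell runs on its own) and feeds it one valid payload, one out-of-range score, and one payload with a missing field. `try_score` is a hypothetical helper name used only for this illustration.

```python
from pydantic import BaseModel, Field, ValidationError


class CriterionScore(BaseModel):
    """Same fields as the notebook's CriterionScore, repeated so this cell is standalone."""
    score: int = Field(ge=1, le=5)
    rationale: str


def try_score(payload: dict) -> str:
    """Hypothetical helper: report whether a payload passes schema validation."""
    try:
        item = CriterionScore(**payload)
        return f"ok: score={item.score}"
    except ValidationError as exc:
        # exc.errors() lists each failing field; 'loc' holds the field path.
        fields = sorted({str(err["loc"][0]) for err in exc.errors()})
        return "rejected: " + ", ".join(fields)


print(try_score({"score": 4, "rationale": "Clear and short."}))  # ok: score=4
print(try_score({"score": 7, "rationale": "Out of range."}))     # rejected: score
print(try_score({"score": 3}))                                   # rejected: rationale
```

Rejecting a 7 at parse time is what keeps an out-of-range judge output from silently skewing averages later in the pipeline.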

### Build the Judge Prompt

Construct a strict prompt that includes rubric criteria, few-shot examples, and the candidate response. This anchors the model's behavior and reduces variance.

In [None]:
def build_judge_prompt(task_instructions: str, rubric: dict, question: str, response: str, few_shots: List[dict]) -> str:
    """
    Constructs a strict prompt for the LLM judge, including rubric and few-shots.

    Args:
        task_instructions (str): Instructions for the evaluation task.
        rubric (dict): Rubric criteria and descriptions.
        question (str): The input prompt/question.
        response (str): The candidate response to evaluate.
        few_shots (List[dict]): Few-shot examples with input, response, and evaluation.

    Returns:
        str: The complete prompt for the LLM.
    """
    parts = []
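    parts.append("You are an impartial evaluator. Score the response using the rubric. Return strict JSON only.")
    # Include the task instructions so the judge knows what the response was meant to do.
    parts.append("Task: " + task_instructions)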
    parts.append("Rubric criteria. 1 to 5 integers. Provide a short rationale per criterion and a one-paragraph summary.")
    for name, desc in rubric.items():
        parts.append(f"- {name}: {desc}")
    parts.append("\nExamples:")
    for ex in few_shots:
        parts.append("Input: " + ex["input"])
        parts.append("Response: " + ex["response"])
        parts.append("Evaluation JSON: " + json.dumps(ex["evaluation"]))
    parts.append("\nNow evaluate the candidate.")
    parts.append("Input: " + question)
    parts.append("Response: " + response)
    parts.append("Return JSON with fields: criteria, overall_score, summary_feedback.")
    parts.append("criteria is an object with keys for each rubric criterion, each has fields score and rationale.")
    return "\n".join(parts)

### Call the OpenAI API with Retries

Use a low temperature to reduce run-to-run variance and enable JSON mode so the output parses as JSON. Retries with exponential backoff handle transient API errors.

In [None]:
OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
client = OpenAI()

class APIError(Exception):
    """Custom exception for API errors."""
    pass

# reraise=True surfaces the final APIError instead of tenacity's RetryError wrapper.
@retry(stop=stop_after_attempt(4), wait=wait_exponential(multiplier=0.5, min=0.5, max=8),
       retry=retry_if_exception_type(APIError), reraise=True)
def call_openai(prompt: str, temperature: float = 0.2, max_tokens: int = 400, model: str = OPENAI_MODEL) -> str:
    """
    Calls the OpenAI chat completion API with retries and error handling.

    Args:
        prompt (str): The prompt to send to the model.
        temperature (float): Sampling temperature for determinism.
        max_tokens (int): Maximum tokens to generate.
        model (str): Model name to use.

    Returns:
        str: The model's response content.

    Raises:
        APIError: If the API call fails or returns empty content.
    """
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
            response_format={"type": "json_object"}
        )
    except Exception as e:
        # Chain the original exception so the root cause stays visible in tracebacks.
        raise APIError(str(e)) from e
    content = resp.choices[0].message.content
    if not content:
        raise APIError("Empty response")
    return content

### Parse and Validate the LLM Output

Pydantic validation catches schema violations and missing fields. This prevents downstream errors and makes debugging faster.

In [None]:
def parse_evaluation(json_text: str) -> Optional[EvaluationResult]:
    """
    Parses and validates the LLM's JSON output against the EvaluationResult schema.

    Args:
        json_text (str): The JSON string output from the LLM.

    Returns:
        Optional[EvaluationResult]: Parsed and validated result, or None if invalid.
    """
    try:
        data = json.loads(json_text)
        if "criteria" not in data:
            raise ValueError("Missing criteria")
        for k in RUBRIC.keys():
            if k not in data["criteria"]:
                raise ValueError(f"Missing criterion {k}")
        return EvaluationResult(**data)
    except Exception as e:
        print(f"Validation error: {e}")
        return None

### Define Few-Shot Examples

Few-shots anchor the model's scoring behavior and reduce bias. Include examples that cover edge cases like verbosity without substance and concise but strong responses.

In [None]:
FEW_SHOTS = [
    {
        "input": "Write a product tagline for a privacy-focused email app.",
        "response": "Email, but better.",
        "evaluation": {
            "criteria": {
                "relevance": {"score": 3, "rationale": "Mentions email but not privacy."},
                "coherence": {"score": 4, "rationale": "Clear and short."},
                "depth": {"score": 1, "rationale": "No specifics."},
                "creativity": {"score": 2, "rationale": "Generic phrasing."}
            },
            "overall_score": 2.5,
            "summary_feedback": "Too generic. Bring privacy into the message and add a twist."
        }
    },
    {
        "input": "Write a product tagline for a privacy-focused email app.",
        "response": "Your inbox, only yours. Private email for real life.",
        "evaluation": {
            "criteria": {
                "relevance": {"score": 5, "rationale": "Directly emphasizes privacy."},
                "coherence": {"score": 5, "rationale": "Flows and reads well."},
                "depth": {"score": 3, "rationale": "Basic value but not features."},
                "creativity": {"score": 4, "rationale": "Memorable and on-brand."}
            },
            "overall_score": 4.25,
            "summary_feedback": "Clear privacy angle. Consider a more distinctive hook."
        }
    }
]
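
Both few-shot examples above set `overall_score` to the unweighted mean of the four criterion scores (2.5 and 4.25). If you add more examples, a small consistency check can catch anchors that contradict each other. The sketch below is one way to do that; `check_few_shot` is a hypothetical helper, and the mean-of-criteria rule is an assumption inferred from the two examples, not a requirement of the schema.

```python
from statistics import mean


def check_few_shot(example: dict, rubric_keys: set, tol: float = 1e-9) -> list:
    """Hypothetical helper: return a list of problems found in one few-shot example."""
    problems = []
    evaluation = example["evaluation"]
    criteria = evaluation["criteria"]
    # Every rubric criterion must be scored, with no extra keys.
    if set(criteria) != rubric_keys:
        problems.append("criteria keys do not match the rubric")
    # Scores must be integers in the 1-5 range the schema enforces.
    for name, item in criteria.items():
        if not (isinstance(item["score"], int) and 1 <= item["score"] <= 5):
            problems.append(f"{name}: score out of range")
    # Assumption: overall_score is the unweighted mean of the criterion scores.
    expected = mean(item["score"] for item in criteria.values())
    if abs(evaluation["overall_score"] - expected) > tol:
        problems.append(f"overall_score should be {expected}")
    return problems


example = {
    "input": "Write a product tagline for a privacy-focused email app.",
    "response": "Email, but better.",
    "evaluation": {
        "criteria": {
            "relevance": {"score": 3, "rationale": "Mentions email but not privacy."},
            "coherence": {"score": 4, "rationale": "Clear and short."},
            "depth": {"score": 1, "rationale": "No specifics."},
            "creativity": {"score": 2, "rationale": "Generic phrasing."},
        },
        "overall_score": 2.5,
        "summary_feedback": "Too generic.",
    },
}
rubric_keys = {"relevance", "coherence", "depth", "creativity"}
print(check_few_shot(example, rubric_keys))  # [] means no problems found
```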

### Evaluate a Single Response

This function combines prompt building, API call, parsing, and optional repair. It returns a validated result or None if the output is malformed.

In [None]:
def evaluate_response(task_instructions: str, question: str, response: str,
                      few_shots=FEW_SHOTS, temperature: float = 0.2, model: str = OPENAI_MODEL) -> Optional[EvaluationResult]:
    """
    Evaluates a single candidate response using the LLM judge.

    Args:
        task_instructions (str): Instructions for the evaluation task.
        question (str): The input prompt/question.
        response (str): The candidate response to evaluate.
        few_shots (list): Few-shot examples for the prompt.
        temperature (float): Sampling temperature for the LLM.
        model (str): Model name to use.

    Returns:
        Optional[EvaluationResult]: Validated evaluation result, or None if invalid.
    """
    prompt = build_judge_prompt(task_instructions, RUBRIC, question, response, few_shots)
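    try:
        raw = call_openai(prompt, temperature=temperature, model=model)
    except Exception as e:
        # An exhausted retry budget counts as a failed evaluation rather than aborting the caller.
        print(f"Judge call failed: {e}")
        return None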
    parsed = parse_evaluation(raw)
    if parsed is None:
        repair_prompt = f"Fix this JSON to match the schema. Return JSON only.\n{raw}"
        try:
            repaired = call_openai(repair_prompt, temperature=0.0, max_tokens=400, model=model)
            parsed = parse_evaluation(repaired)
        except Exception as e:
            print(f"Repair failed: {e}")
            parsed = None
    return parsed
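
Before spending a second API call on repair, it can be worth trying a local fix first. With JSON mode enabled the output is normally bare JSON, so this matters mostly for judges that lack JSON mode and wrap their answer in prose or a code fence. The sketch below uses only the standard library; `extract_json_object` is a hypothetical helper, not part of the evaluator above.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Hypothetical helper: pull the first parsable JSON object out of noisy text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores trailing text.
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        # Try the next opening brace if this one did not start a valid object.
        start = text.find("{", start + 1)
    return None


fence = "`" * 3  # built at runtime so this example contains no literal code fence
noisy = "Here is the evaluation:\n" + fence + 'json\n{"overall_score": 4.0}\n' + fence
print(extract_json_object(noisy))  # {'overall_score': 4.0}
```

The extracted dict would still need to go through `parse_evaluation`-style schema checks; this step only recovers syntax, not missing fields.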

### Batch Evaluation with Logging

Evaluate multiple items and write results to CSV. This logs every evaluation with timestamp, scores, and feedback for tracking over time.

In [None]:
RUBRIC_VERSION = "v1.0"
PROMPT_VERSION = "v1.0"

def evaluate_batch(items, task_instructions: str, outfile: str, model: str = OPENAI_MODEL) -> None:
    """
    Evaluates a batch of items and writes results to a CSV file.

    Args:
        items (list): List of dicts with 'question' and 'response' keys.
        task_instructions (str): Instructions for the evaluation task.
        outfile (str): Output CSV file path.
        model (str): Model name to use.

    Returns:
        None
    """
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "timestamp", "model", "rubric_version", "prompt_version",
            "question", "response",
            "relevance", "coherence", "depth", "creativity",
            "overall_score", "summary_feedback"
        ])
        writer.writeheader()
        for it in items:
            q = it["question"].strip()
            r = it["response"].strip()
            if not r:
                writer.writerow({
                    "timestamp": datetime.utcnow().isoformat(),
                    "model": model, "rubric_version": RUBRIC_VERSION, "prompt_version": PROMPT_VERSION,
                    "question": q, "response": r,
                    "relevance": "", "coherence": "", "depth": "", "creativity": "",
                    "overall_score": "", "summary_feedback": "Empty response"
                })
                continue
            result = evaluate_response(task_instructions, q, r, model=model)
            if result is None:
                writer.writerow({
                    "timestamp": datetime.utcnow().isoformat(),
                    "model": model, "rubric_version": RUBRIC_VERSION, "prompt_version": PROMPT_VERSION,
                    "question": q, "response": r,
                    "relevance": "", "coherence": "", "depth": "", "creativity": "",
                    "overall_score": "", "summary_feedback": "Invalid JSON"
                })
                continue
            row = {
                "timestamp": datetime.utcnow().isoformat(),
                "model": model,
                "rubric_version": RUBRIC_VERSION,
                "prompt_version": PROMPT_VERSION,
                "question": q,
                "response": r,
                "relevance": result.criteria["relevance"].score,
                "coherence": result.criteria["coherence"].score,
                "depth": result.criteria["depth"].score,
                "creativity": result.criteria["creativity"].score,
                "overall_score": result.overall_score,
                "summary_feedback": result.summary_feedback
            }
            writer.writerow(row)
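
Because failed rows are logged with an empty `overall_score` and a reason in `summary_feedback`, the log doubles as an error report. The sketch below counts scored and failed rows using the standard library; `summarize_log` is a hypothetical helper and the inline CSV is made-up sample data with only the two columns it needs.

```python
import csv
import io
from collections import Counter


def summarize_log(csv_file) -> dict:
    """Hypothetical helper: count scored vs. failed rows in an evaluation log."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["overall_score"]:
            counts["scored"] += 1
        else:
            # Failed rows carry the reason ("Invalid JSON", "Empty response") in summary_feedback.
            counts[row["summary_feedback"]] += 1
    total = sum(counts.values())
    failure_rate = (total - counts["scored"]) / total if total else 0.0
    return {"counts": dict(counts), "failure_rate": round(failure_rate, 3)}


sample_log = io.StringIO(
    "overall_score,summary_feedback\n"
    "4.25,Clear privacy angle.\n"
    ",Invalid JSON\n"
    ",Empty response\n"
    "2.5,Too generic.\n"
)
print(summarize_log(sample_log))
```

To run it on a real log, pass an open file handle, for example `open("results.csv", newline="", encoding="utf-8")`.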

### Multi-Judge Consensus (Optional)

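Run multiple judges with different temperatures and average their scores. This reduces run-to-run variance. Because the three default judges all use the same model, averaging does not remove biases that model carries into every run; add a judge backed by a different model if that matters for your task. For more on building robust multi-agent and consensus systems, see our guide to [multi-agent AI systems with CrewAI and YAML](/article/how-to-build-multi-agent-ai-systems-with-crewai-and-yaml-2).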

In [None]:
# Default judge configs, kept in a tuple so the default argument is not a shared mutable list.
DEFAULT_JUDGES = (
    {"model": "gpt-4o-mini", "temperature": 0.2},
    {"model": "gpt-4o-mini", "temperature": 0.0},
    {"model": "gpt-4o-mini", "temperature": 0.4},
)

def evaluate_with_consensus(task_instructions: str, question: str, response: str,
                            judges=DEFAULT_JUDGES,
                            few_shots=FEW_SHOTS) -> Optional[EvaluationResult]:
    """
    Runs multiple LLM judges and averages their scores for consensus.

    Args:
        task_instructions (str): Instructions for the evaluation task.
        question (str): The input prompt/question.
        response (str): The candidate response to evaluate.
        judges (list): List of judge configs (model, temperature).
        few_shots (list): Few-shot examples for the prompt.

    Returns:
        Optional[EvaluationResult]: Consensus evaluation result, or None if all fail.
    """
    results = []
    for j in judges:
        res = evaluate_response(task_instructions, question, response, few_shots, temperature=j["temperature"], model=j["model"])
        if res:
            results.append(res)
    if not results:
        return None
    merged = {
        "criteria": {},
        "overall_score": mean([r.overall_score for r in results]),
        "summary_feedback": max([r.summary_feedback for r in results], key=len)
    }
    for k in RUBRIC.keys():
        merged["criteria"][k] = {
            "score": int(round(mean([r.criteria[k].score for r in results]))),
            "rationale": max([r.criteria[k].rationale for r in results], key=len)
        }
    try:
        return EvaluationResult(**merged)
    except ValidationError:
        return None
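
Averaging hides how much the judges disagreed. A minimal sketch of a disagreement check follows, assuming you keep the per-judge scores for a criterion; `judge_spread` and the `max_spread` threshold of 1 point are hypothetical choices for illustration, not part of the consensus function above.

```python
from statistics import mean, pstdev


def judge_spread(scores: list, max_spread: float = 1.0) -> dict:
    """Hypothetical helper: summarize agreement between judges for one criterion."""
    spread = max(scores) - min(scores)
    return {
        "mean": mean(scores),
        "stdev": round(pstdev(scores), 3),  # population standard deviation across judges
        "spread": spread,
        # Flag items where judges differ by more than max_spread points.
        "needs_review": spread > max_spread,
    }


print(judge_spread([4, 4, 5]))  # judges within one point of each other
print(judge_spread([2, 4, 5]))  # three-point spread, flagged for review
```

Items flagged this way are good candidates for human labeling. Also note that Python's `round` rounds halves to the nearest even integer, so a criterion mean of exactly 2.5 becomes 2 in the consensus result.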

## Run and Validate

### Evaluate a Small Dataset

Run the evaluator on sample items and write results to CSV. This demonstrates the end-to-end flow.

In [None]:
sample_items = [
    {"question": "Write a headline for a privacy-first email app.", "response": "Email that minds its business."},
    {"question": "Write a headline for a privacy-first email app.", "response": "Advanced AI email assistant for sales teams."},
    {"question": "Write a headline for a privacy-first email app.", "response": "Your inbox, only yours."}
]

evaluate_batch(sample_items, "Score headline quality.", "results.csv")
print("Wrote results.csv")

Inspect the first few rows of the output to verify structure and scores.

In [None]:
df = pd.read_csv("results.csv")
print(df.head())

### Validate Against Human Labels

Compare AI scores to human labels using Spearman correlation and Cohen's kappa. This measures agreement and helps you tune the rubric or few-shots.

In [None]:
def validation_report(df: pd.DataFrame):
    """
    Computes Spearman correlation and Cohen's kappa between human and AI scores.

    Args:
        df (pd.DataFrame): DataFrame with columns: human_* and ai_* for each criterion.

    Returns:
        dict: Metrics for each criterion and overall.
    """
    metrics = {}
    for k in ["relevance", "coherence", "depth", "creativity"]:
        rho, p = spearmanr(df[f"human_{k}"], df[f"ai_{k}"], nan_policy="omit")
        metrics[f"{k}_spearman"] = round(rho, 3)
        human_bins = df[f"human_{k}"].apply(lambda s: 0 if s <= 2 else 1 if s == 3 else 2)
        ai_bins = df[f"ai_{k}"].apply(lambda s: 0 if s <= 2 else 1 if s == 3 else 2)
        metrics[f"{k}_kappa"] = round(cohen_kappa_score(human_bins, ai_bins), 3)
    rho_overall, _ = spearmanr(df["human_overall"], df["ai_overall"], nan_policy="omit")
    metrics["overall_spearman"] = round(rho_overall, 3)
    return metrics
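
To make the kappa number less of a black box, here is a standard-library sketch of unweighted Cohen's kappa, `(p_o - p_e) / (1 - p_e)`, applied to the same three-way binning the report uses. The score lists are made-up illustration data, and the handling of the degenerate single-label case is a convention chosen here; libraries may treat it differently.

```python
from collections import Counter


def to_bin(score: float) -> int:
    """Same 3-way binning as validation_report: low (<=2), mid (3), high (>=4)."""
    return 0 if score <= 2 else 1 if score == 3 else 2


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Degenerate case: both raters used one identical label everywhere (convention chosen here).
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


human = [to_bin(s) for s in [5, 2, 5, 3, 1, 4]]
ai = [to_bin(s) for s in [4, 2, 5, 4, 2, 3]]
print(round(cohen_kappa(human, ai), 3))  # 0.455
```

Here the raters agree on 4 of 6 items (p_o = 0.667), but chance alone would produce 0.389 agreement given how often each uses the three bins, so kappa lands at 0.455 rather than 0.667.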

Load AI results and human labels, then compute validation metrics. Replace the example human labels with real annotations.

In [None]:
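ai = pd.read_csv("results.csv")
# validation_report expects ai_* columns, so mirror the logged score columns under those names.
for k in RUBRIC:
    ai[f"ai_{k}"] = ai[k]
ai["ai_overall"] = ai["overall_score"]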
human = pd.DataFrame({
    "human_relevance": [5, 2, 5],
    "human_coherence": [4, 3, 5],
    "human_depth": [3, 2, 3],
    "human_creativity": [4, 2, 4],
    "human_overall": [4.0, 2.2, 4.2]
})
df = pd.concat([ai, human], axis=1)
print(validation_report(df))

### Detect Drift Over Time

Track score distributions and flag significant shifts. This helps you catch rubric drift, model updates, or data quality issues. To learn more about detecting and addressing model drift in production, check out our article on [context rot and LLM memory management](/article/context-rot-why-llms-forget-as-their-memory-grows-3).

In [None]:
def drift_flags(scores: list[float], window: int = 200, z_thresh: float = 3.0):
    """
    Detects drift in a rolling window of scores using z-score thresholding.

    Args:
        scores (list[float]): List of scores over time.
        window (int): Size of the rolling window.
        z_thresh (float): Z-score threshold for flagging drift.

    Returns:
        list: List of (index, z-score) tuples where drift is detected.
    """
    if len(scores) < window * 2:
        return []
    flags = []
    for i in range(window, len(scores)):
        prev = np.array(scores[i-window:i])
        mu, sigma = prev.mean(), prev.std() + 1e-6
        z = (scores[i] - mu) / sigma
        if abs(z) >= z_thresh:
            flags.append((i, float(z)))
    return flags
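
The function above flags individual scores that sit far from the preceding window, which catches outliers. A slow, sustained shift in the average can slip under a per-point threshold, so a complementary check is to compare the mean of the latest window against the window before it. The sketch below is a simple z-style comparison that assumes scores are roughly independent; `window_mean_shift` is a hypothetical helper, and the alternating score series is synthetic.

```python
from statistics import mean, pstdev


def window_mean_shift(scores: list, window: int = 50) -> float:
    """Hypothetical helper: standardized gap between the last window's mean and the one before it."""
    if len(scores) < 2 * window:
        raise ValueError("need at least 2 * window scores")
    baseline = scores[-2 * window:-window]
    recent = scores[-window:]
    # Standard error of a window mean, using the baseline spread (epsilon avoids division by zero).
    std_err = (pstdev(baseline) + 1e-6) / (window ** 0.5)
    return (mean(recent) - mean(baseline)) / std_err


# Synthetic data: the judge's average drops by half a point in the second window.
baseline = [4.0, 4.5] * 25
recent = [3.5, 4.0] * 25
z = window_mean_shift(baseline + recent, window=50)
print(f"z = {z:.1f}")  # a large negative value signals a downward shift
```

No single score in `recent` is unusual on its own, yet the window-level comparison makes the half-point drop obvious.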

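Run drift detection on recent scores and print any flagged indices. The three-row sample log is shorter than `2 * window`, so this call returns no flags; point it at a longer score history to see real output.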

In [None]:
recent_scores = df["overall_score"].dropna().tolist()
flags = drift_flags(recent_scores, window=2, z_thresh=2.0)
if flags:
    print("Drift detected at indices:", flags)
else:
    print("No drift detected.")

## Conclusion

You now have a reference-free LLM evaluator with structured outputs, few-shot anchoring, Pydantic validation, and optional multi-judge consensus. The system logs every evaluation with metadata for tracking, validates against human labels, and detects drift over time.

Key design decisions:

1. **Low temperature and JSON mode** keep outputs stable across runs and parsable as JSON.
2. **Pydantic validation** catches schema violations early and prevents silent failures.
3. **Few-shot examples** anchor scoring behavior and reduce variance.
4. **Multi-judge consensus** averages scores from multiple runs to reduce variance.
5. **Logging with versioning** enables drift detection and rubric iteration.

### Next Steps

1. **Deploy as an API**: Wrap the evaluator in a FastAPI service and deploy to a cloud provider.
2. **Add caching**: Use a content-hash cache to avoid redundant API calls and reduce costs.
3. **Weekly validation CI**: Automate validation against fresh human labels to catch rubric drift.
4. **Monitoring dashboards**: Instrument metrics for Prometheus and build Grafana dashboards to track score distributions, latency, and error rates.
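
As a starting point for the caching step, here is a minimal in-memory sketch. It hashes everything that can change a score (model, rubric version, prompt version, question, response) into one key. `cache_key`, `EvalCache`, and `fake_judge` are hypothetical names for illustration; a real deployment would likely persist the store somewhere durable.

```python
import hashlib
import json


def cache_key(model: str, rubric_version: str, prompt_version: str,
              question: str, response: str) -> str:
    """Hypothetical helper: stable SHA-256 key over everything that affects the score."""
    payload = json.dumps(
        [model, rubric_version, prompt_version, question, response],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EvalCache:
    """In-memory cache; a production version might back this with a file or key-value store."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, key: str, compute):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        value = compute()
        # Only cache successful evaluations so failures get retried next time.
        if value is not None:
            self._store[key] = value
        return value


calls = []


def fake_judge():
    """Stand-in for a real judge call; records that it was invoked."""
    calls.append(1)
    return {"overall_score": 4.0}


cache = EvalCache()
key = cache_key("gpt-4o-mini", "v1.0", "v1.0", "Write a headline.", "Your inbox, only yours.")
first = cache.get_or_compute(key, fake_judge)
second = cache.get_or_compute(key, fake_judge)
print(len(calls), cache.hits)  # 1 1
```

Including the rubric and prompt versions in the key means a rubric change invalidates old entries automatically. Keep in mind that caching freezes one sample of the judge's output, which is a reasonable trade-off at low temperature but hides variance at higher ones.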

For a deeper dive into building robust prompts and ensuring reliable outputs, see our guide on [prompt engineering with LLM APIs](/article/prompt-engineering-with-llm-apis-how-to-get-reliable-outputs-3). If you are unsure which LLM best fits your needs, our article on [how to pick an LLM for your app](/article/how-to-choose-an-ai-model-for-your-app-speed-cost-reliability) explores performance, efficiency, and pricing considerations.