# LOTUS Evaluations Demo

This notebook demonstrates LOTUS's AI evaluations subpackage (`lotus.evals`). The package provides two main evaluation methods:

1. **`llm_as_judge`** - Use an LLM to evaluate/score individual responses
2. **`pairwise_judge`** - Use an LLM to compare two responses and determine which is better

Both methods are available as pandas DataFrame accessors, making them easy to integrate into data processing pipelines.

## Setup

First, let's import the necessary libraries and configure the language model.

In [None]:
import pandas as pd
from pydantic import BaseModel, Field

import lotus
from lotus.models import LM
from lotus.types import ReasoningStrategy

In [None]:
# Configure the language model
# You can use any model supported by LiteLLM (OpenAI, Anthropic, etc.)
lm = LM(model="gpt-4o-mini")
lotus.settings.configure(lm=lm)

---
## 1. LLM as Judge - Basic Usage

The `llm_as_judge` accessor allows you to use an LLM to evaluate responses in your DataFrame. This is useful for:
- Grading student answers
- Evaluating model outputs
- Quality assessment of generated content

In [None]:
# Sample data: student responses to ML questions
data = {
    "student_id": [1, 2, 3, 4],
    "question": [
        "Explain the difference between supervised and unsupervised learning",
        "What is the purpose of cross-validation in machine learning?",
        "Describe how gradient descent works",
        "What are the advantages of ensemble methods?",
    ],
    "answer": [
        "Supervised learning uses labeled data to train models, while unsupervised learning finds patterns in unlabeled data. For example, classification is supervised, clustering is unsupervised.",
        "Gradient descent is an optimization algorithm that minimizes cost functions.",  # Wrong answer: describes gradient descent, not cross-validation
        "Cross-validation helps assess model performance by splitting data into training and validation sets multiple times to get a better estimate of how the model generalizes.",  # Wrong answer: describes cross-validation, not gradient descent
        "Ensemble methods combine multiple models to improve performance. They reduce overfitting and variance, often leading to better generalization than individual models.",
    ],
}

df = pd.DataFrame(data)
df

In [None]:
# Basic evaluation: score answers on a 1-10 scale
judge_instruction = (
    "Rate the accuracy and completeness of this {answer} to the {question} "
    "on a scale of 1-10, where 10 is excellent. Only output the score."
)

results = df.llm_as_judge(
    judge_instruction=judge_instruction,
)

results

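The `_judge_0` column contains the evaluation results. Students 2 and 3 will likely receive lower scores because their answers are swapped: student 2 describes gradient descent when asked about cross-validation, and student 3 describes cross-validation when asked about gradient descent.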
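
Even when the instruction says "Only output the score", a judge may occasionally wrap the number in extra text. The sketch below shows one way to normalize such outputs before doing arithmetic on them. The helper `parse_score` and the mock outputs are hypothetical and not part of LOTUS.

```python
import re
from typing import Optional

# Matches the first integer from 1 to 10 that is delimited by word boundaries
# (for example "7" or the "8" in "Score: 8/10").
_SCORE_PATTERN = re.compile(r"\b(10|[1-9])\b")


def parse_score(raw: str) -> Optional[int]:
    """Return the first 1-10 integer found in a judge output, or None."""
    match = _SCORE_PATTERN.search(raw)
    return int(match.group(1)) if match else None


# Hypothetical judge outputs, standing in for the values of a `_judge_0` column.
mock_outputs = ["9", "Score: 3/10", "I cannot rate this answer."]
parsed = [parse_score(text) for text in mock_outputs]
print(parsed)
```

On a real result this could be applied with something like `results["_judge_0"].map(parse_score)`. Note that a decimal such as `7.5` would be read as `7`, so tighten the pattern if the judge may emit fractional scores.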

---
## 2. Multiple Trials for Robustness

LLM evaluations can be noisy. Running multiple trials helps assess consistency and get more robust scores.

In [None]:
# Run 3 independent trials
results_multi = df.llm_as_judge(
    judge_instruction=judge_instruction,
    n_trials=3,
)

# View all trial results
results_multi[["student_id", "question", "_judge_0", "_judge_1", "_judge_2"]]

In [None]:
# Calculate the average score across trials.
# errors="coerce" turns any judge output that is not purely numeric into NaN instead of raising.
judge_cols = [col for col in results_multi.columns if col.startswith("_judge")]
results_multi["avg_score"] = results_multi[judge_cols].apply(pd.to_numeric, errors="coerce").mean(axis=1)
results_multi[["student_id", "question", "avg_score"]]
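
An average hides how much the trials disagree. As a rough sketch, the spread between the highest and lowest trial score can flag rows where the judge is inconsistent. The mock scores and the 2-point threshold below are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical trial outputs, standing in for the `_judge_*` columns.
mock_trials = pd.DataFrame(
    {
        "student_id": [1, 2, 3],
        "_judge_0": ["9", "2", "6"],
        "_judge_1": ["8", "3", "9"],
        "_judge_2": ["9", "2", "3"],
    }
)

judge_cols = [col for col in mock_trials.columns if col.startswith("_judge")]
# Convert string scores to numbers; non-numeric outputs become NaN.
scores = mock_trials[judge_cols].apply(pd.to_numeric, errors="coerce")

summary = pd.DataFrame(
    {
        "student_id": mock_trials["student_id"],
        "mean_score": scores.mean(axis=1),
        "score_range": scores.max(axis=1) - scores.min(axis=1),
    }
)
# Flag rows whose trials differ by more than the (arbitrary) 2-point threshold.
summary["unstable"] = summary["score_range"] > 2
print(summary)
```

Rows flagged as unstable are good candidates for more trials or a more specific judge instruction.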

---
## 3. Structured Output with Response Format

For more detailed evaluations, you can use Pydantic models to define a structured response format. This ensures consistent, parseable outputs.

In [None]:
# Define a structured evaluation schema
class EvaluationScore(BaseModel):
    score: int = Field(description="Score from 1-10")
    reasoning: str = Field(description="Detailed reasoning for the score")
    strengths: list[str] = Field(description="Key strengths of the answer")
    improvements: list[str] = Field(description="Areas for improvement")

In [None]:
# Evaluate with structured output
results_structured = df.llm_as_judge(
    judge_instruction="Evaluate the student {answer} for the {question}",
    response_format=EvaluationScore,
    suffix="_eval",
)

results_structured

In [None]:
# Access structured evaluation data
for idx, row in results_structured.iterrows():
    eval_result = row["_eval_0"]
    print(f"\nStudent {row['student_id']}:")
    print(f"  Score: {eval_result.score}/10")
    print(f"  Reasoning: {eval_result.reasoning}")
    print(f"  Strengths: {eval_result.strengths}")
    print(f"  Improvements: {eval_result.improvements}")
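
For further analysis it is often convenient to expand the structured objects into ordinary columns. The following is a minimal sketch, assuming Pydantic v2 (where models expose `model_dump()`) and using hand-built mock results in place of a real `llm_as_judge` call.

```python
import pandas as pd
from pydantic import BaseModel, Field


class EvaluationScore(BaseModel):
    score: int = Field(description="Score from 1-10")
    reasoning: str = Field(description="Detailed reasoning for the score")
    strengths: list[str] = Field(description="Key strengths of the answer")
    improvements: list[str] = Field(description="Areas for improvement")


# Hypothetical results, standing in for the output of `llm_as_judge`.
mock_results = pd.DataFrame(
    {
        "student_id": [1, 2],
        "_eval_0": [
            EvaluationScore(
                score=9,
                reasoning="Accurate and complete.",
                strengths=["Clear example"],
                improvements=[],
            ),
            EvaluationScore(
                score=2,
                reasoning="Answers a different question.",
                strengths=[],
                improvements=["Address cross-validation"],
            ),
        ],
    }
)

# model_dump() is the Pydantic v2 method; Pydantic v1 would use .dict() instead.
fields = pd.DataFrame(
    [evaluation.model_dump() for evaluation in mock_results["_eval_0"]],
    index=mock_results.index,
)
flat = pd.concat([mock_results[["student_id"]], fields], axis=1)
print(flat[["student_id", "score", "reasoning"]])
```

The flattened frame can then be sorted or filtered by `score` like any other numeric column.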

---
## 4. Custom System Prompts

You can customize the system prompt to set up the judge with specific expertise or evaluation criteria.

In [None]:
# Custom system prompt for a strict ML expert
custom_system_prompt = (
    "You are a senior machine learning professor grading PhD qualifying exams. "
    "You have extremely high standards and expect precise, technically accurate answers. "
    "Penalize heavily for any inaccuracies or missing key concepts."
)

results_strict = df.llm_as_judge(
    judge_instruction=judge_instruction,
    system_prompt=custom_system_prompt,
)

results_strict[["student_id", "question", "_judge_0"]]

---
## 5. Pairwise Comparison

The `pairwise_judge` accessor compares two responses side-by-side. This is especially useful for:
- A/B testing model outputs
- Comparing baseline vs. improved responses
- Preference ranking

In [None]:
# Sample data: comparing two model outputs
comparison_data = {
    "prompt": [
        "Write a one-sentence summary of the benefits of regular exercise.",
        "Explain the difference between supervised and unsupervised learning in one sentence.",
        "Suggest a polite email subject line to schedule a 1:1 meeting.",
    ],
    "model_a": [
        "Regular exercise improves physical health and mental well-being by boosting energy, mood, and resilience.",
        "Supervised learning uses labeled data to learn mappings, while unsupervised learning finds patterns without labels.",
        "Meeting request.",
    ],
    "model_b": [
        "Exercise is good.",
        "Supervised learning and unsupervised learning are both machine learning approaches.",
        "Requesting a 1:1: finding time to connect next week?",
    ],
}

df_compare = pd.DataFrame(comparison_data)
df_compare

In [None]:
# Compare model outputs
pairwise_instruction = (
    "Given the prompt {prompt}, compare the two responses.\n"
    "Output only 'A' or 'B', or 'Tie' if the responses are equally good."
)

pairwise_results = df_compare.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=pairwise_instruction,
)

pairwise_results
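
Pairwise verdicts are usually summarized as win rates. The sketch below tallies verdicts with the standard library. The helper `win_rates`, the `Unparsed` bucket, and the mock verdicts are hypothetical, and it assumes the judge followed the 'A' / 'B' / 'Tie' output format.

```python
from collections import Counter


def win_rates(verdicts):
    """Return the share of 'A', 'B', 'Tie', and unparseable verdicts."""
    labels = {"a": "A", "b": "B", "tie": "Tie"}
    # Normalize case and whitespace; anything unexpected is counted as "Unparsed".
    normalized = [labels.get(v.strip().lower(), "Unparsed") for v in verdicts]
    counts = Counter(normalized)
    total = len(normalized)
    if total == 0:
        return {label: 0.0 for label in ("A", "B", "Tie", "Unparsed")}
    return {label: counts[label] / total for label in ("A", "B", "Tie", "Unparsed")}


# Hypothetical verdicts, standing in for the values of a `_judge_0` column.
mock_verdicts = ["A", " a", "B", "Tie"]
rates = win_rates(mock_verdicts)
print(rates)
```

A non-zero `Unparsed` share is a signal to tighten the instruction or switch to a structured response format.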

---
## 6. Position Bias Mitigation with `permute_cols`

LLMs can exhibit position bias (e.g., preferring the first option). The `permute_cols` option runs evaluations with both orderings to mitigate this bias.

In [None]:
# Run pairwise evaluation with position permutation
# Note: n_trials must be even when permute_cols=True
pairwise_permuted = df_compare.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=pairwise_instruction,
    n_trials=4,  # 2 trials for each ordering
    permute_cols=True,
)

# View all trial results
judge_cols = [col for col in pairwise_permuted.columns if "_judge" in col]
pairwise_permuted[["prompt"] + judge_cols]
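
With several trials per row, one simple way to reach a final verdict is a majority vote. This sketch assumes that every trial's verdict is already expressed in terms of the original `col1` / `col2` labels; check how your LOTUS version reports verdicts for the swapped ordering before relying on it. The helper `majority_verdict` and the mock data are hypothetical.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the most common verdict, or 'Tie' when the top count is shared
    or there are no votes."""
    counts = Counter(votes).most_common()
    if not counts:
        return "Tie"
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "Tie"
    return counts[0][0]


# Hypothetical per-row trial verdicts (one inner list per DataFrame row).
mock_trials = [
    ["A", "A", "A", "A"],
    ["A", "B", "A", "B"],
    ["B", "B", "Tie", "B"],
]
final = [majority_verdict(row) for row in mock_trials]
print(final)
```

A row that flips between 'A' and 'B' across trials, like the second one here, is treated as a tie rather than credited to either response.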

---
## 7. Structured Pairwise Comparison

Combine pairwise comparison with structured outputs for detailed analysis.

In [None]:
# Define structured comparison schema
class ComparisonResult(BaseModel):
    winner: str = Field(description="'A', 'B', or 'Tie'")
    reasoning: str = Field(description="Why this response is better")
    score_a: int = Field(description="Score for response A (1-10)")
    score_b: int = Field(description="Score for response B (1-10)")

In [None]:
# Structured pairwise comparison
pairwise_structured = df_compare.pairwise_judge(
    col1="model_a",
    col2="model_b",
    judge_instruction=(
        "Given the prompt {prompt}, compare the two responses. "
        "Determine which is better and explain why."
    ),
    response_format=ComparisonResult,
)

pairwise_structured

In [None]:
# Display detailed comparison results
for idx, row in pairwise_structured.iterrows():
    result = row["_judge_0"]
    print(f"\nPrompt: {row['prompt'][:50]}...")
    print(f"  Winner: {result.winner}")
    print(f"  Scores: A={result.score_a}, B={result.score_b}")
    print(f"  Reasoning: {result.reasoning}")
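
Because the schema asks for both a winner and two scores, the judge can contradict itself. A small sanity check, sketched below with hand-built mock results, flags rows where the declared winner disagrees with the scores. The helper `is_consistent` is hypothetical and not part of LOTUS.

```python
from pydantic import BaseModel, Field


class ComparisonResult(BaseModel):
    winner: str = Field(description="'A', 'B', or 'Tie'")
    reasoning: str = Field(description="Why this response is better")
    score_a: int = Field(description="Score for response A (1-10)")
    score_b: int = Field(description="Score for response B (1-10)")


def is_consistent(result: ComparisonResult) -> bool:
    """Check that the declared winner agrees with the two numeric scores."""
    if result.score_a > result.score_b:
        expected = "A"
    elif result.score_b > result.score_a:
        expected = "B"
    else:
        expected = "Tie"
    return result.winner.strip().lower() == expected.lower()


# Hypothetical judge outputs, standing in for the values of a `_judge_0` column.
mock_results = [
    ComparisonResult(winner="A", reasoning="More specific.", score_a=9, score_b=3),
    ComparisonResult(winner="B", reasoning="Contradicts its scores.", score_a=8, score_b=4),
    ComparisonResult(winner="Tie", reasoning="Equally good.", score_a=5, score_b=5),
]
flags = [is_consistent(result) for result in mock_results]
print(flags)
```

Inconsistent rows are worth inspecting by hand or re-running, since either the winner or the scores must be wrong.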

---
## 8. Chain-of-Thought Reasoning

For complex evaluations, you can enable chain-of-thought (CoT) reasoning so that the model reasons step by step before giving its final answer.

In [None]:
# Enable zero-shot chain-of-thought reasoning
results_cot = df.llm_as_judge(
    judge_instruction=judge_instruction,
    strategy=ReasoningStrategy.ZS_COT,
    return_explanations=True,  # Capture the CoT reasoning
)

results_cot[["student_id", "_judge_0", "explanation_judge_0"]]

In [None]:
# View the chain-of-thought reasoning
for idx, row in results_cot.iterrows():
    print(f"\nStudent {row['student_id']}:")
    print(f"  Score: {row['_judge_0']}")
    print(f"  Reasoning: {row['explanation_judge_0']}")

---
## 9. Few-Shot Learning with Examples

Provide example evaluations to guide the judge's behavior and calibrate scoring.

In [None]:
# Create an examples DataFrame; the "Answer" column holds the expected judge output
examples = pd.DataFrame({
    "question": [
        "What is machine learning?",
        "Define overfitting."
    ],
    "answer": [
        "Machine learning is a subset of AI that enables systems to learn from data without being explicitly programmed.",
        "Overfitting is bad."
    ],
    "Answer": [  # Required column with expected judge output
        "9",  # Good answer
        "2",  # Poor answer
    ]
})

examples

In [None]:
# Evaluate with few-shot examples
results_fewshot = df.llm_as_judge(
    judge_instruction=judge_instruction,
    examples=examples,
)

results_fewshot[["student_id", "question", "_judge_0"]]
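
The examples DataFrame needs every column referenced by the `{...}` placeholders in the judge instruction, plus the `Answer` column. The sketch below checks this up front. The helper `missing_columns` is hypothetical, and Python's `string.Formatter` is used here only to find the placeholders; LOTUS may parse templates differently.

```python
from string import Formatter


def missing_columns(template, columns, required=("Answer",)):
    """List the columns that the template placeholders or `required` need
    but that are absent from `columns`."""
    # Formatter().parse yields (literal, field_name, format_spec, conversion) tuples.
    placeholders = [name for _, name, _, _ in Formatter().parse(template) if name]
    # dict.fromkeys removes duplicates while preserving order.
    needed = list(dict.fromkeys(placeholders + list(required)))
    return [name for name in needed if name not in columns]


template = (
    "Rate the accuracy and completeness of this {answer} to the {question} "
    "on a scale of 1-10, where 10 is excellent. Only output the score."
)

complete = missing_columns(template, ["question", "answer", "Answer"])
incomplete = missing_columns(template, ["question"])
print(complete, incomplete)
```

In the notebook this could be called as `missing_columns(judge_instruction, examples.columns)` before passing `examples` to the judge.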

---
## 10. Accessing Raw Outputs

For debugging or advanced analysis, you can access the raw model outputs.

In [None]:
# Get raw outputs for debugging
results_raw = df.llm_as_judge(
    judge_instruction=judge_instruction,
    return_raw_outputs=True,
)

results_raw[["student_id", "_judge_0", "raw_output_judge_0"]]

---
## Summary

The LOTUS evaluations subpackage provides:

| Feature | `llm_as_judge` | `pairwise_judge` |
|---------|----------------|------------------|
| **Use Case** | Score individual responses | Compare two responses |
| **Multiple Trials** | `n_trials` parameter | `n_trials` parameter |
| **Structured Output** | `response_format` (Pydantic) | `response_format` (Pydantic) |
| **Position Bias Mitigation** | N/A | `permute_cols=True` |
| **Chain-of-Thought** | `strategy=ReasoningStrategy.ZS_COT` | `strategy=ReasoningStrategy.ZS_COT` |
| **Few-Shot Examples** | `examples` DataFrame | `examples` DataFrame |
| **Custom System Prompt** | `system_prompt` | `system_prompt` |

Both accessors operate directly on pandas DataFrames, so they fit into existing data pipelines for evaluating large sets of model outputs or human responses.