# Chapter 1: Evaluations and Alignments for AI

Welcome to the hands-on companion for Chapter 1! This notebook introduces the fundamental concepts of AI evaluation and alignment through practical examples.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Distinguish between verifiable and open-ended tasks** - Understand why some AI outputs can be automatically checked while others require human judgment
2. **Implement basic evaluation patterns** - Write code to test AI-generated solutions against ground truth
3. **Extract structured answers from free-form text** - Handle the practical challenge of parsing AI responses
4. **Calculate faithfulness metrics** - Quantify hallucination rates in AI-generated content
5. **Recognize the tension between different alignment goals** - Understand why "aligned AI" isn't a simple concept

---

## Why Evaluation Matters

Before an AI system can be deployed, we need to answer a fundamental question: **Is it working correctly?**

This seems simple, but AI evaluation is surprisingly nuanced:
- For a calculator, "correct" means matching the mathematical answer
- For a chatbot, "correct" might mean helpful, harmless, and honest
- For a creative writing assistant, "correct" is subjective

This chapter explores the spectrum from **verifiable tasks** (clear right/wrong answers) to **open-ended tasks** (requiring human judgment), and introduces the concept of **faithfulness** as a way to measure hallucination.

## Setup

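First, let's import the libraries for this notebook. The examples below only need `re` from the standard library; NumPy, pandas, and Matplotlib are the usual tools for numerical computing, data manipulation, and visualization, and are available if you want to extend the examples.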

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

---

## Part 1: Verifiable Tasks

### What Makes a Task "Verifiable"?

A **verifiable task** is one where correctness can be determined automatically, without human judgment. Examples include:

- **Code generation**: Does the code pass unit tests?
- **Math problems**: Does the answer match the solution?
- **Factual questions**: Is the response correct according to a knowledge base?
- **Classification**: Does the label match the ground truth?

The key insight is that verifiable tasks have a **ground truth** that can be checked programmatically.
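
As a minimal sketch of what "checked programmatically" can look like for a classification task, the snippet below scores predictions by exact match against ground-truth labels. The function name `exact_match_accuracy`, the case-insensitive comparison, and the spam/ham labels are assumptions made up for illustration, not part of any benchmark.

```python
def exact_match_accuracy(predictions: list[str], ground_truth: list[str]) -> float:
    """Fraction of predictions that equal their ground-truth label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("cannot score an empty dataset")
    # Assumption: labels are compared case-insensitively, ignoring outer whitespace.
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return matches / len(ground_truth)

# Hypothetical labels, invented for this example.
predicted = ["spam", "ham", "Spam", "ham"]
expected = ["spam", "ham", "ham", "ham"]

accuracy = exact_match_accuracy(predicted, expected)
print(f"Accuracy: {accuracy:.0%}")
```

Three of the four predictions match, so no human needs to read the outputs to arrive at the score.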

### HumanEval: A Benchmark for Code Generation

[HumanEval](https://github.com/openai/human-eval) is a widely used benchmark for evaluating AI code generation. Each problem includes:
1. A function signature with a docstring
2. A set of unit tests that the model does not see while generating its solution

The AI's job is to complete the function. Success is binary: either all tests pass, or they don't.

Below is an example problem. Notice how we can definitively say whether the implementation is correct by running the tests.

In [None]:
def has_close_elements(numbers: list[float], threshold: float) -> bool:
    """Check if any two numbers are closer than threshold."""
    for i, num1 in enumerate(numbers):
        for j, num2 in enumerate(numbers):
            if i != j and abs(num1 - num2) < threshold:
                return True
    return False

# Verifiable: tests pass or fail
tests = [
    ([1.0, 2.0, 3.0], 0.5, False),
    ([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3, True),
]

for nums, thresh, expected in tests:
    result = has_close_elements(nums, thresh)
    status = "\u2713" if result == expected else "\u2717"
    print(f"{status} has_close_elements({nums}, {thresh}) = {result}")

### Key Takeaway: Verifiable Evaluation

Notice what just happened:
- We had a **clear specification** (the docstring)
- We had **test cases** with expected outputs
- We could **automatically determine** if the code was correct

This is the gold standard for AI evaluation: no ambiguity, no human judgment needed. However, most real-world AI applications don't fit this mold neatly.
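
The same pattern can be wrapped in a small reusable harness. The sketch below is one possible way to do it, not HumanEval's actual harness: `run_tests`, `correct_candidate`, and `buggy_candidate` are hypothetical names, and the two candidates stand in for model-generated completions. It treats any exception as a failure and reports a single pass/fail result per candidate.

```python
from typing import Any, Callable

def run_tests(candidate: Callable[..., Any], tests: list[tuple]) -> bool:
    """Return True only if the candidate passes every test case.

    Each test is a tuple (*args, expected). An exception counts as a failure.
    """
    for *args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Two hypothetical completions for the same problem.
def correct_candidate(numbers, threshold):
    # After sorting, the closest pair is always adjacent.
    ordered = sorted(numbers)
    return any(b - a < threshold for a, b in zip(ordered, ordered[1:]))

def buggy_candidate(numbers, threshold):
    # Bug: each number is also compared with itself, so the distance 0 always matches.
    return any(abs(a - b) < threshold for a in numbers for b in numbers)

close_element_tests = [
    ([1.0, 2.0, 3.0], 0.5, False),
    ([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3, True),
]

passed_correct = run_tests(correct_candidate, close_element_tests)
passed_buggy = run_tests(buggy_candidate, close_element_tests)
print(f"correct_candidate passes: {passed_correct}")
print(f"buggy_candidate passes: {passed_buggy}")
```

Running untrusted model output directly like this is only reasonable for a toy example; a real harness would typically isolate execution and enforce a timeout.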

---

## Part 2: The Answer Extraction Problem

### From Reasoning to Evaluation

Modern AI models often "think out loud" using **chain-of-thought (CoT)** reasoning. This tends to improve accuracy on multi-step problems, but it creates a practical challenge: **how do we extract the final answer for evaluation?**

Consider asking an AI: "What is 42 x 38?"

The AI might respond:
> "Let me work through this step by step. 42 x 38 can be computed as 42 x 40 - 42 x 2 = 1680 - 84 = 1596. Therefore, the answer is 1,596."

The reasoning is helpful for understanding, but for evaluation we need to extract just "1596" to compare against the ground truth.

### Why This Matters

Answer extraction seems trivial but causes real problems in evaluation:
- Models might format numbers differently (1596 vs 1,596 vs "one thousand five hundred ninety-six")
- The answer might appear multiple times in the reasoning
- Different models have different response styles

Robust answer extraction is essential for fair, reproducible evaluation.

In [None]:
responses = [
    "Let me work through this. 42 x 38 = 1680 - 84 = 1596. The answer is 1,596.",
    "1596",
    "The answer is 1,596",
]

def extract_number(text: str) -> int | None:
    """Extract the final number from a response."""
    text = text.replace(",", "")
    numbers = re.findall(r'\b\d+\b', text)
    return int(numbers[-1]) if numbers else None

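for r in responses:
    # Only add an ellipsis when the response was actually truncated
    preview = r if len(r) <= 40 else r[:40] + "..."
    print(f"{extract_number(r)} <- '{preview}'")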

### How the Extraction Works

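Our simple extractor uses a common heuristic: **take the last number in the response**. This works when the model states its final answer at the end, which is typical, but it picks the wrong value if the response closes with some other number (for example, "I verified this in 2 steps").

The regex pattern `\b\d+\b` matches runs of digits delimited by word boundaries, and we remove commas first to handle formatted numbers like "1,596". As written, it does not handle decimals, negative numbers, or numbers spelled out in words.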

> **Note**: Production systems often use more sophisticated extraction, including asking the model to format answers in a specific way (e.g., "Put your final answer in `\boxed{}`").

---

## Part 3: Measuring Faithfulness (Detecting Hallucination)

### The Hallucination Problem

One of the most significant challenges with large language models is **hallucination**: generating content that sounds plausible but is factually incorrect. This is especially dangerous when:

- Users trust the AI's confident tone
- The AI is used for research or decision-making
- Incorrect information could cause harm

### What is Faithfulness?

**Faithfulness** measures how well an AI's response is grounded in source material. It answers the question: "Of all the claims the AI made, how many are actually supported by the evidence?"

$$\text{Faithfulness} = \frac{\text{Supported Claims}}{\text{Total Claims}}$$

A faithfulness score of 100% means every claim is supported by the source material. A lower score means at least one claim is unsupported, which is how hallucination shows up in this metric.

### Practical Example

Imagine asking an AI to summarize an arXiv paper. The AI makes several claims about the paper. We can check each claim against the actual paper to compute faithfulness.

In [None]:
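def faithfulness(claims: list[dict]) -> float:
    """Faithfulness = Supported Claims / Total Claims"""
    # Convention: with no claims there is nothing unsupported, so score 1.0
    # (this also avoids dividing by zero).
    if not claims:
        return 1.0
    # True counts as 1 and False as 0, so sum() counts the supported claims.
    return sum(c['supported'] for c in claims) / len(claims)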

# Example: AI response about an arXiv paper
claims = [
    {"claim": "Year: 2020", "supported": True},
    {"claim": "Source: arXiv", "supported": True},
    {"claim": "Title: 'GLU Variants Improve Transformer'", "supported": True},
    {"claim": "arXiv: 2002.05202v1", "supported": True},
    {"claim": "Authors: Narang, Chung", "supported": False},  # Hallucinated!
]

score = faithfulness(claims)
print(f"Faithfulness: {score:.0%}")
print(f"Hallucination rate: {1-score:.0%}")

### Interpreting the Results

In this example, the AI correctly identified the year, source, title, and arXiv ID, but **hallucinated the authors**. This is a common failure mode: the AI might know the paper exists but confuse details with similar papers.

With 4 out of 5 claims supported, we get:
- **Faithfulness**: 80%
- **Hallucination rate**: 20%

> **Key Insight**: A single hallucinated fact can undermine trust in an entire response. This is why faithfulness evaluation is critical for applications like research assistance, medical information, or legal analysis.
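
Because one unsupported claim can matter more than the aggregate score suggests, it can help to report *which* claims failed alongside the number. The following is a small sketch under that assumption; `unsupported_claims` and `faithfulness_report` are hypothetical helpers, and the claim data repeats the example above so the block runs on its own.

```python
def unsupported_claims(claims: list[dict]) -> list[str]:
    """Return the text of every claim marked as unsupported."""
    return [c["claim"] for c in claims if not c["supported"]]

def faithfulness_report(claims: list[dict]) -> dict:
    """Summarize supported/total counts, the score, and the flagged claims."""
    total = len(claims)
    flagged = unsupported_claims(claims)
    supported = total - len(flagged)
    # Same convention as before: an empty claim list scores 1.0.
    score = supported / total if total else 1.0
    return {
        "supported": supported,
        "total": total,
        "faithfulness": score,
        "unsupported": flagged,
    }

# Same example claims as in the cell above.
example_claims = [
    {"claim": "Year: 2020", "supported": True},
    {"claim": "Source: arXiv", "supported": True},
    {"claim": "Title: 'GLU Variants Improve Transformer'", "supported": True},
    {"claim": "arXiv: 2002.05202v1", "supported": True},
    {"claim": "Authors: Narang, Chung", "supported": False},
]

report = faithfulness_report(example_claims)
print(f"Faithfulness: {report['faithfulness']:.0%} "
      f"({report['supported']}/{report['total']} claims supported)")
for claim in report["unsupported"]:
    print(f"Unsupported: {claim}")
```

A reviewer can then go straight to the flagged claims instead of re-checking the whole response.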

---

## Practice Exercises

Now it's your turn! These exercises reinforce the concepts from this chapter.

### Exercise 1: Task Classification

Classify each of the following tasks as **verifiable** or **open-ended**:

| Task | Verifiable or Open-ended? | Why? |
|------|---------------------------|------|
| Sentiment classification | ? | ? |
| Poetry generation | ? | ? |
| SQL query generation | ? | ? |
| Customer support response | ? | ? |

*Think about: Can you automatically check if the output is "correct"?*

### Exercise 2: Calculate Faithfulness

An AI response about a historical event contains 12 claims. After fact-checking, you find that 3 claims are not supported by the source material.

Calculate the faithfulness score using the code cell below.

### Exercise 3: Alignment Conflicts (Discussion)

Design a scenario where **policy alignment** (following the organization's rules) conflicts with **principled alignment** (doing what's ethically right).

*Hint: Consider cases involving user privacy, content moderation, or access to information.*

In [None]:
# Exercise 2
total, unsupported = 12, 3
print(f"Faithfulness: {(total - unsupported) / total:.0%}")

---

## Summary and Key Takeaways

### What We Learned

1. **Verifiable vs. Open-ended Tasks**
   - Verifiable tasks have ground truth that can be checked automatically
   - Open-ended tasks require human judgment or more sophisticated evaluation
   - The distinction matters for choosing evaluation strategies

2. **Answer Extraction is Non-trivial**
   - Chain-of-thought reasoning improves AI accuracy but complicates evaluation
   - Robust extraction handles formatting variations and locates final answers
   - Extraction bugs silently distort reported accuracy, so evaluation code needs the same care as the model being evaluated

3. **Faithfulness Quantifies Hallucination**
   - Faithfulness = Supported Claims / Total Claims
   - Any score below 100% means some claims are unsupported (at 80%, one claim in five)
   - Critical for trust in AI-generated content

### Looking Ahead

In the following chapters, we'll explore specific evaluation metrics:
- **Chapter 2**: BLEU and ROUGE for text generation quality
- **Chapter 3**: BERTScore and COMET for semantic similarity
- **Chapter 4**: LLM-as-a-Judge for automated open-ended evaluation

---

*This notebook is part of the AI Evaluation and Alignments book (Manning Seminal Papers Series).*