<a target="_blank" href="https://colab.research.google.com/github/wandb/eval-course/blob/main/notebooks/chapter_01_0.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{eval-course-01} -->

# Chapter 1: Introduction to LLM Evaluation

## Setup

In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !git clone --branch main https://github.com/wandb/eval-course
    %cd eval-course
    !pip install uv
    !uv pip install --system --quiet -r requirements.txt
    !uv pip install --system scipy==1.11.4
    %cd notebooks
else:
    print("Not running in Google Colab. Skipping git clone and pip install commands.")


In [2]:
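try:
    # Only patch the event loop when an IPython/Jupyter kernel is actually running.
    from IPython import get_ipython

    in_jupyter = get_ipython() is not None
except ImportError:
    in_jupyter = False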
if in_jupyter:
    import nest_asyncio

    nest_asyncio.apply()

In [3]:
import asyncio
import json

import weave
from set_env import set_env

In [4]:
set_env("GOOGLE_API_KEY")
set_env("WANDB_API_KEY")
print("Env set!")

        Unable to set GOOGLE_API_KEY=GOOGLE_API_KEY,
        not in colab or Secrets not set, not kaggle
        or Secrets not set, no .env/dotenv/env file
        in the current working dir or parent dirs.
        Unable to set WANDB_API_KEY=WANDB_API_KEY,
        not in colab or Secrets not set, not kaggle
        or Secrets not set, no .env/dotenv/env file
        in the current working dir or parent dirs.

Env set!



In [5]:
from utils.config import ENTITY, MODEL, MODEL_CLIENT, WEAVE_PROJECT
from utils.evals import calculate_kappa_scores, get_evaluation_predictions
from utils.llm_client import LLMClient
from utils.prompts import (
    MedicalPrivacyJudgement,
    MedicalTaskScoreJudgement,
    medical_privacy_judge_prompt,
    medical_privacy_system_prompt,
    medical_system_prompt,
    medical_task,
    medical_task_score_prompt,
    medical_task_score_system_prompt,
)
from utils.render import display_prompt, print_dialogue_data
from utils.deserialize import MainCriteria, deserialize_model

## Understanding Medical Data Extraction Evaluation

### The Task: What Are We Trying to Do?

#### Raw Data Format
Medical conversations are messy and unstructured. Our example data shows this:

- Back-and-forth conversation between doctor and patient
- Contains personal details, small talk, and medical information mixed together
- Informal language ("hey", "mm-hmm", "yeah")
- Important details scattered throughout

#### Extraction Goals
The LLM needs to:
1. Find relevant information
2. Ignore irrelevant details
3. Standardize the format
4. Protect patient privacy
5. Maintain medical accuracy

In [None]:
if ENTITY is not None:
    weave_client = weave.init(f"{ENTITY}/{WEAVE_PROJECT}")
else:
    weave_client = weave.init(f"{WEAVE_PROJECT}")


In [None]:
display_prompt(medical_system_prompt)
display_prompt(medical_task)

<div align="center">
    <img src="https://github.com/wandb/eval-course/blob/main/notebooks/media/medical_chatbot.png?raw=1" width="250"/>
</div>

In [None]:
# Make sure to update the ENTITY and WEAVE_PROJECT in config.py to the correct project!
# Uncomment the following lines to use your own annotated data after running chapter_01_generate_medical_data.ipynb
# annotated_medical_data = weave.ref(
#     f"weave:///{ENTITY}/{WEAVE_PROJECT}/object/medical_data_annotations:latest",
# ).get()
annotated_medical_data = weave.ref("weave:///a-sh0ts/eval_course_ch1_dev/object/medical_data_annotations:At9gri9UasftpPe5VNzT3EuIXQWAo5MYX8aMf2cuE8A").get()

In [None]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[0], max_chars=500)

### In fact, let's just generate an example now:

In [None]:
llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
llm.predict(
    user_prompt=medical_task.format(transcript=annotated_medical_data[0][0]["input"]),
    system_prompt=medical_system_prompt,
)

## Data Collection and Curation for Evaluation

### Our approach for medical extraction evaluation data:

1. Start with real medical transcripts from production systems
   - Actual doctor-patient conversations
   - Authentic medical terminology and flows
   - Real-world edge cases

2. Dataset Diversity Requirements
   - Various medical conditions
   - Different conversation styles
   - Mix of routine and complex cases
   - Remove duplicates for clean evaluation
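
The duplicate-removal step can be sketched as follows. This is a minimal illustration, not the course's own pipeline: the helper names `normalize_transcript` and `deduplicate_transcripts` are hypothetical, and it assumes that two transcripts count as duplicates when they match after lowercasing and collapsing whitespace.

```python
import hashlib
import re


def normalize_transcript(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate_transcripts(transcripts: list[str]) -> list[str]:
    """Keep the first occurrence of each transcript, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for transcript in transcripts:
        normalized = normalize_transcript(transcript)
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(transcript)
    return unique


# Made-up sample data: the second entry differs only in case and spacing.
sample_transcripts = [
    "Doctor: How are you feeling today?",
    "doctor:  how are you feeling today? ",
    "Doctor: Any chest pain?",
]
unique_transcripts = deduplicate_transcripts(sample_transcripts)
print(f"{len(sample_transcripts)} transcripts, {len(unique_transcripts)} unique")
```

Exact-match hashing only catches verbatim or near-verbatim copies. Paraphrased duplicates would need a fuzzier comparison, which is out of scope here.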

In [None]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[1], max_chars=2000)

## Why and How to Evaluate LLMs

### Core Principles of LLM Evaluation
Unlike traditional software testing, LLM evaluation requires special consideration:

1. **Non-Deterministic Outputs**
   - Models can give different valid answers
   - Responses vary between runs
   - Multiple correct solutions possible

2. **Quality is Multi-Dimensional**
   - Correctness isn't binary
   - Context matters heavily
   - Different stakeholders have different priorities

3. **Scale vs Accuracy Trade-offs**
   - Manual review is accurate but expensive
   - Automated checks are scalable but limited
   - Hybrid approaches often work best
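
One practical consequence of non-deterministic outputs is that a single run of a check can be misleading. The sketch below repeats a scorer several times and reports the fraction of passes. The `pass_rate` helper and the `flaky_judge` stand-in are hypothetical and are not part of the course utilities; a real judge would call an LLM instead of a random number generator.

```python
import random
from typing import Callable


def pass_rate(scorer: Callable[[str], int], output: str, n_runs: int = 5) -> float:
    """Run a possibly non-deterministic scorer n_runs times and return the fraction of passes."""
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    results = [scorer(output) for _ in range(n_runs)]
    return sum(results) / n_runs


# Hypothetical stand-in for an LLM judge whose verdict varies between runs.
_rng = random.Random(0)


def flaky_judge(output: str) -> int:
    """Return 1 (pass) roughly 80% of the time, regardless of the input."""
    return int(_rng.random() < 0.8)


rate = pass_rate(flaky_judge, "Chief complaint: headache", n_runs=20)
print(f"Pass rate over 20 runs: {rate:.2f}")
```

A pass rate well below 1.0 on an output that experts consider correct suggests the check itself is unstable and needs attention before its scores are trusted.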

### Practical Evaluation Recipe 🧑‍🍳

1. **Define Success Criteria**
   - List must-have requirements
   - Set acceptable thresholds
   - Identify critical failures

2. **Build Evaluation Suite**
   - Automated checks for clear rules
   - Expert review for nuanced cases
   - Version control evaluation code

3. **Create Scoring System**
   - Establish baselines

### Applying to Medical Data Extraction 🏥

For our medical extraction task, this means:
- **Success Criteria**: Required fields, privacy compliance, word limits
- **Evaluation Suite**: Automated checks + medical expert review
- **Scoring**: Combination of format, accuracy, and safety metrics
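
To make "combination of format, accuracy, and safety metrics" concrete, here is one possible way to combine them. It is a sketch under stated assumptions: the `CheckResults` fields, the weights, and the rule that a privacy violation zeroes the score are illustrative choices, not the scoring used later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class CheckResults:
    required_keys: int  # 1 if all required fields are present, else 0
    word_limit: int     # 1 if the summary respects the word limit, else 0
    privacy: int        # 1 if no PII was detected, else 0
    accuracy: float     # 0.0 to 1.0, e.g. from expert review or an LLM judge


# Illustrative weights only; a real project would set these with domain experts.
WEIGHTS = {"required_keys": 0.2, "word_limit": 0.1, "accuracy": 0.7}


def combined_score(results: CheckResults) -> float:
    """Weighted score in [0, 1]; a privacy violation is treated as a critical failure."""
    if not results.privacy:
        return 0.0
    return (
        WEIGHTS["required_keys"] * results.required_keys
        + WEIGHTS["word_limit"] * results.word_limit
        + WEIGHTS["accuracy"] * results.accuracy
    )


clean = CheckResults(required_keys=1, word_limit=0, privacy=1, accuracy=0.5)
leaky = CheckResults(required_keys=1, word_limit=1, privacy=0, accuracy=1.0)
print(f"clean: {combined_score(clean):.2f}, leaky: {combined_score(leaky):.2f}")
```

Gating on privacy rather than weighting it reflects the "identify critical failures" step: some failures should not be offset by good scores elsewhere.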

Let's see how to implement this...

![](https://github.com/wandb/eval-course/blob/main/notebooks/media/traditional_llm_eval.png?raw=1)

## Annotation: Building Quality Evaluation Data

### Why Annotate?
To evaluate LLMs effectively, we need expert-labeled data that:
1. Defines what "good" looks like
2. Shows us what to test for
3. Helps align our automated tests with human judgment

### Ideal Process
Experts review outputs and provide structured feedback. This creates a foundation for:
- Building automated evaluation tests
- Measuring how well those tests match expert judgment
- Refining our evaluation methods until they align with expert standards

### Our Annotation Process
NOTE: In a production system, this would be done by licensed medical professionals using a strict rubric.
For this example code, we'll use synthetic annotations to demonstrate the process:

1. Binary Pass/Fail Judgments
   - Pass: Correctly extracted key medical information
   - Fail: Missed critical details or made dangerous assumptions

2. Detailed Critiques Required
   - For Passes: Document accuracy while noting improvement areas
   - For Fails: Identify specific medical extraction errors and their potential impact

These annotated examples become our evaluation dataset, though in practice,
medical evaluations should always be validated by qualified healthcare professionals.

Think of annotations as our compass - they help ensure our later automated evaluation methods point in the same direction as human experts while assessing the quality of our LLM's outputs.
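
The annotation format described above (a binary judgment plus a mandatory critique) can be captured in a small record type. The `Annotation` class below is a hypothetical sketch for illustration; it is not the schema of `annotated_medical_data`, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    transcript: str   # raw doctor-patient conversation
    extraction: str   # the LLM's structured summary
    passed: bool      # binary pass/fail judgment
    critique: str     # required for both passes and fails

    def __post_init__(self) -> None:
        # Enforce the "detailed critiques required" rule at construction time.
        if not self.critique.strip():
            raise ValueError("Every annotation needs a critique, for passes and fails alike.")


example = Annotation(
    transcript="Doctor: What brings you in today? Patient: A headache for three days.",
    extraction="Chief complaint: headache (3 days)",
    passed=False,
    critique="Captures the chief complaint but omits the follow-up instructions.",
)
print(example.passed, "-", example.critique)
```

Rejecting empty critiques up front keeps the dataset useful later, since the critique text is what explains why a judgment was made.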

<div align="center">
    <img src="https://github.com/wandb/eval-course/blob/main/notebooks/media/annotation_ui.png?raw=1" width="450"/>
</div>

In [None]:
print_dialogue_data(annotated_medical_data, indexes_to_show=[2, 3, 4], max_chars=500)

## Evaluation: Measuring Performance

### Understanding LLM Evaluation
Unlike traditional software testing, LLM evaluation requires multiple approaches:

1. **Automated Checks**
   - Fast, programmatic tests
   - Clear pass/fail criteria
   - Example: format rules, required fields

2. **Model-Assisted Evaluation**
   - Using LLMs to evaluate outputs
   - Helpful for subjective criteria
   - Example: checking medical accuracy, privacy compliance

3. **Expert Review**
   - Human validation of complex cases
   - Ground truth for training evaluators
   - Example: annotated datasets

### Building Evaluation Systems

In this notebook, we'll implement this through:

1. **Basic Tests**
   ```python
   test_adheres_to_required_keys()
   test_adheres_to_word_limit()
   ```

2. **LLM Judges**
   ```python
   judge_adheres_to_privacy_guidelines()
   judge_overall_score()
   ```

3. **Key Questions**
   - How closely do automated evaluations match human judgment?
   - When do automated systems diverge from human experts?
   - What makes a good evaluation system?

These questions lead us to the concept of alignment - measuring how well our automated systems match human expectations and values. We'll explore practical ways to measure and improve this alignment after implementing our evaluation system.

![](https://github.com/wandb/eval-course/blob/main/notebooks/media/eval_task_flowchart.png?raw=1)

### Using Domain Knowledge to Build Evaluation Tests

We'll create four key tests to evaluate our medical extraction outputs:

1. **Required Fields Check**
   - Verifies presence of essential medical fields
   - E.g., "Chief complaint", "Symptoms", "Follow-up instructions"

2. **Word Limit Check**
   - Ensures output stays within 150-word limit
   - Promotes concise, focused summaries

3. **Privacy Guidelines Check**
   - Uses LLM to detect any PII leakage
   - Critical for medical data compliance

4. **Overall Quality Score**
   - LLM-based assessment of extraction quality
   - Considers accuracy, completeness, and format

These tests will be validated against our expert-annotated dataset to ensure they align with human judgment. This alignment process helps us understand how well our automated evaluation matches medical expert standards.

Let's implement each test:

#### Software Tests: Older and more rigid approach

In [None]:
test_output = annotated_medical_data[0][1]["output"]

In [None]:
@weave.op
def test_adheres_to_required_keys(model_output: str):
    # Required medical keys
    required_keys = [
        "Chief complaint",
        "History of present illness",
        "Physical examination",
        "Symptoms",
        "New medications with dosages",
        "Follow-up instructions",
    ]

    # Convert to lowercase for case-insensitive matching
    output_lower = model_output.lower()

    # Check if all required keys are present
    for key in required_keys:
        if key.lower() not in output_lower:
            return int(False)

    return int(True)

In [None]:
test_adheres_to_required_keys(test_output)

In [None]:
@weave.op
def test_adheres_to_word_limit(model_output: str):
    return int(len(model_output.split()) <= 150)

In [None]:
test_adheres_to_word_limit(test_output)

#### LLM Judges: Newer and more flexible approach

In [None]:
display_prompt(medical_privacy_system_prompt)
display_prompt(medical_privacy_judge_prompt)

In [None]:
@weave.op
def judge_adheres_to_privacy_guidelines(model_output: str):
    llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
    response = llm.predict(
        user_prompt=medical_privacy_judge_prompt.format(text=model_output),
        system_prompt=medical_privacy_system_prompt,
        schema=MedicalPrivacyJudgement,
    )
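    try:
        # The judge is expected to return a JSON list; read the first judgement.
        result = json.loads(response.text.strip("\n"))
        return int(not result[0]["contains_pii"])
    except Exception:
        # Unparseable judge output is scored as a pass (no PII detected).
        return int(True)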

In [None]:
judge_adheres_to_privacy_guidelines(test_output)

In [None]:
display_prompt(medical_task_score_system_prompt)
display_prompt(medical_task_score_prompt)

In [None]:
@weave.op
def judge_overall_score(model_output: str):
    llm = LLMClient(model_name=MODEL, client_type=MODEL_CLIENT)
    response = llm.predict(
        user_prompt=medical_task_score_prompt.format(text=model_output),
        system_prompt=medical_task_score_system_prompt,
        schema=MedicalTaskScoreJudgement,
    )
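    try:
        # The judge is expected to return a JSON list; read the first judgement.
        result = json.loads(response.text.strip("\n"))
        return int(result[0]["score"])
    except Exception:
        # Unparseable judge output is scored as 0.
        return int(False)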

In [None]:
judge_overall_score(test_output)

### We already have a dataset of annotated medical data. We can use our tests to evaluate the outputs of our LLM.

In [None]:
@weave.op
def annotated_data_passthrough(input, output):
    return output

In [None]:
evaluation_data = [
    {
        "input": annotated_row[0]["input"],
        "output": annotated_row[1]["output"],
        "scores": {
            "human_required_keys": deserialized_row.presence_of_keys,
            "human_word_limit": deserialized_row.word_count,
            "human_absence_of_PII": deserialized_row.absence_of_PII,
            "human_overall_score": annotated_row[2],
        },
    }
    for annotated_row in annotated_medical_data
    if (deserialized_row := deserialize_model(annotated_row[3], MainCriteria))
][0:5]

In [None]:
# Create evaluation
evaluation = weave.Evaluation(
    dataset=evaluation_data,
    scorers=[
        test_adheres_to_required_keys,
        test_adheres_to_word_limit,
        judge_adheres_to_privacy_guidelines,
        judge_overall_score,
    ],
)

# Run evaluation
evals = asyncio.run(evaluation.evaluate(annotated_data_passthrough))

### But do our test outputs adhere to the annotation expectations?

We need to measure how well our automated evaluations match human judgment. We'll:

1. **Measure Alignment**
   - Compare automated test results with expert annotations using kappa scores
   - Weight different aspects based on their importance (privacy, completeness, etc.)
   - Find where automated tests disagree with human experts

2. **Use These Results**
   - Chapter 2 will focus on improving the LLM judges that show poor alignment
   - We'll learn to refine prompts based on these alignment scores
   - Build better evaluation systems by focusing on the weakest areas first

These alignment measurements are crucial - they tell us which parts of our automated system need the most work, especially for critical aspects like privacy checks and medical accuracy.
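
For binary pass/fail labels, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: `kappa = (p_o - p_e) / (1 - p_e)`. The sketch below computes it from scratch and lists the rows where the two raters disagree. It illustrates the metric only and is not the implementation behind `calculate_kappa_scores`; the function names and the sample labels are made up.

```python
from collections import Counter


def cohen_kappa(human: list[int], automated: list[int]) -> float:
    """Cohen's kappa for two equally long label sequences."""
    if len(human) != len(automated) or not human:
        raise ValueError("Both label lists must be non-empty and the same length.")
    n = len(human)
    # Observed agreement: fraction of rows where both raters gave the same label.
    observed = sum(h == a for h, a in zip(human, automated)) / n
    # Chance agreement: probability both raters pick the same label independently.
    human_counts = Counter(human)
    auto_counts = Counter(automated)
    expected = sum(human_counts[label] * auto_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return float("nan")  # both raters used one identical label; kappa is undefined
    return (observed - expected) / (1 - expected)


def disagreements(human: list[int], automated: list[int]) -> list[int]:
    """Indexes of rows where the automated label differs from the human label."""
    return [i for i, (h, a) in enumerate(zip(human, automated)) if h != a]


# Made-up labels: 1 = pass, 0 = fail.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0]
automated_labels = [1, 1, 0, 0, 0, 1, 1, 1]

kappa = cohen_kappa(human_labels, automated_labels)
print(f"kappa: {kappa:.3f}")
print(f"rows to review: {disagreements(human_labels, automated_labels)}")
```

In this example the raters agree on 6 of 8 rows (0.75), chance agreement is 34/64 (about 0.531), and kappa is about 0.467. A kappa of 1 means perfect agreement and 0 means agreement no better than chance. The disagreement indexes point to the rows worth inspecting by hand.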

In [None]:
# Copy the evaluation call ID from the Weave URL printed by the evaluation run above.
# The ID below is from a different project, so replace it with your own or this lookup will fail.
eval_call_id = "01944203-0c3f-7c92-a0dc-69e2d2f2df26"

In [None]:
df = get_evaluation_predictions(weave_client, eval_call_id)
df

#### Software Tests: Minimal Alignment and hard to optimize

In [None]:
kappa_scores = calculate_kappa_scores(df, tuple_columns=["required_keys", "word_limit"])
for metric, score in kappa_scores.items():
    print(f"{metric}: {score:.3f}")

#### LLM Judges: Higher Alignment and easier to optimize

In [None]:
kappa_scores = calculate_kappa_scores(df, tuple_columns=["privacy", "overall"])
for metric, score in kappa_scores.items():
    print(f"{metric}: {score:.3f}")

## Resources

- [Hamel's LLM Judge](https://hamel.dev/blog/posts/llm-judge/)
- [Hamel's LLM Evaluation](https://hamel.dev/blog/posts/evals/)
- [Clef's LLM Evaluation](https://huggingface.co/blog/clefourrier/llm-evaluation)
- [Eugene Yan's LLM Evaluators](https://eugeneyan.com/writing/llm-evaluators/)
- [Shreya's AI Engineering Flywheel](https://www.sh-reya.com/blog/ai-engineering-flywheel/)
- [Who Validates the Validators?](https://arxiv.org/abs/2404.12272)