# Productionizing Evals Code

We'll refactor our evaluation code from notebooks into a proper Python project structure. This makes it easy to run evaluations regularly and integrate them into CI/CD pipelines if we want.

## Evaluation Architecture Overview

We want to make running evaluations frictionless. There are two main parts:

- Ground truth synthetic data generation (run once)
- Evaluations (run regularly)

The first part (data generation) is less critical because we run it infrequently. It could stay in a notebook, so we'll only do a light cleanup.

The second part (evaluation) is more important. We want to run it often, so it must be easy to execute.

The evaluations part consists of two steps:

1. Apply the agent to the ground truth data to get its output
2. Analyze the output to assess its quality

To make it easier to execute, we'll move it from a Jupyter notebook into a proper Python project.
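
The two steps above can be sketched as two plain functions. This is a minimal illustration, not the code from the repository: `apply_agent`, `analyze`, `fake_agent`, and `fake_judge` are hypothetical names, and the stand-in agent and judge replace what would be LLM calls in the real project.

```python
from typing import Callable


def apply_agent(questions: list[str], agent: Callable[[str], str]) -> list[dict]:
    """Step 1: run the agent on every ground truth question."""
    return [{"question": q, "answer": agent(q)} for q in questions]


def analyze(outputs: list[dict], judge: Callable[[dict], bool]) -> float:
    """Step 2: judge each output and return the share that passed."""
    if not outputs:
        return 0.0
    passed = sum(1 for row in outputs if judge(row))
    return passed / len(outputs)


# Stand-ins for illustration only; the real agent and judge call an LLM.
def fake_agent(question: str) -> str:
    return f"Answer to: {question}"


def fake_judge(row: dict) -> bool:
    return row["question"] in row["answer"]


outputs = apply_agent(["What is data drift?", "How do I install it?"], fake_agent)
pass_rate = analyze(outputs, fake_judge)
print(f"pass rate: {pass_rate:.2f}")
```

Keeping the two steps as separate functions means the (expensive) agent run can be saved and re-judged later without calling the agent again.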

## Ground Truth Data Generation

We won't spend a lot of time on this since we don't need to run this code frequently.

First, let's extract all the code from the notebook:

```bash
jupyter nbconvert --to=script ground-truth.ipynb
mv ground-truth.py ground_truth_generator.py
```

This converts our Jupyter notebook into a Python script for easier refactoring.

Now ask ChatGPT to refactor the code:

```text
Make sure all the code is organized into separate functions with no global variables.
```

[Here's my refactoring conversation.](https://chatgpt.com/share/68f7454c-a2b8-800a-a811-0af5eb045e5a)

In the video, I used GitHub Copilot for this refactoring process.

You can see the final result in [generate_data.py](https://github.com/alexeygrigorev/ai-bootcamp-codespace/blob/main/week3/code/evals/generate_data.py).

We can also create a simple script for sampling data (`ground_truth_sample.py`):

```python
import pandas as pd

df_ground_truth = pd.read_csv('ground_truth_evidently.csv')
df_sample = df_ground_truth.sample(n=50)

df_sample.to_csv('ground_truth_evidently_sample.csv', index=False)
```

This script creates smaller samples from our full dataset for faster testing and development.

## Agent Application Module

Now let's handle step 1 of the evaluation part: applying the model to our ground truth data.

We'll save the output as pickle files for persistence and debugging.
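
As a rough sketch of what saving and reloading the agent output with pickle could look like: the helper names `save_results` and `load_results`, the file layout, and the shape of each result row are assumptions for illustration, not the code from the repository.

```python
import pickle
import tempfile
from pathlib import Path


def save_results(results: list[dict], path: Path) -> None:
    """Write the agent outputs to a pickle file, creating folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(results, f)


def load_results(path: Path) -> list[dict]:
    """Read the agent outputs back, e.g. to re-run the judge or to debug."""
    # Only unpickle files you created yourself: loading a pickle can run code.
    with path.open("rb") as f:
        return pickle.load(f)


# Hypothetical result row; the real rows hold whatever the agent returns.
results = [{"question": "What is data drift?", "answer": "A change in the input data."}]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "runs" / "results.pkl"
    save_results(results, out_path)
    restored = load_results(out_path)

print(restored == results)
```

Pickle keeps arbitrary Python objects intact, which is convenient for debugging, at the price of files that only Python can read.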

First, convert the notebook to a Python script:

```bash
jupyter nbconvert --to=script 22-llm-ground-truth.ipynb
mv 22-llm-ground-truth.py eval_apply.py
```

Now we need to clean up the experimental code. We did a lot of experimenting in the notebook, so we can delete everything we don't need.

After refactoring, we end up with two organized files:

- [eval_common.py](https://github.com/alexeygrigorev/ai-bootcamp-codespace/blob/main/week3/code/evals/eval_common.py) - shared utilities
- [eval_agent_run.py](https://github.com/alexeygrigorev/ai-bootcamp-codespace/blob/main/week3/code/evals/eval_agent_run.py) - agent execution logic

[Here's my conversation with ChatGPT about this refactoring.](https://chatgpt.com/share/68f74adf-438c-800a-a59a-713aa863dcc8) In the video, I used Copilot for the same process.

## Judge Evaluation Module

The first step is now organized. Let's add the second step, the actual evaluation logic:

```bash
jupyter nbconvert --to=script 23-llm-judge-eval.ipynb
mv 23-llm-judge-eval.py eval_judge.py
```

This extracts our LLM judge evaluation code into a proper Python module.

My refactoring prompt for this step:

```text
Most of this code should go to eval_agent_judge.py with the evaluation logic.
It should save the report to the reports folder.

We already have the utils file, so use it for repeated functionality.
Finally, create a simple script eval_orchestrator.py which will be a CLI app.
This orchestrator puts everything together: first applying the model (eval_agent_run.py) 
then evaluating the model (eval_agent_judge.py).
```

The refactoring produces two files:

- [eval_agent_judge.py](https://github.com/alexeygrigorev/ai-bootcamp-codespace/blob/main/week3/code/evals/eval_agent_judge.py) - LLM judge evaluation logic
- [eval_orchestrator.py](https://github.com/alexeygrigorev/ai-bootcamp-codespace/blob/main/week3/code/evals/eval_orchestrator.py) - CLI orchestration

## GitHub Copilot Refactoring Approach

In the video, I used GitHub Copilot for refactoring the code.

Here's my prompt from the video:

```text
I have two files that were jupyter notebooks: evals/eval_agent_judge.py evals/eval_agent_run.py

Now I want to refactor these files: organize code into functions, make sure there are 
no global variables and that the code is modular.

I also want to create two new files:

First one with common utilities for both judge and run.
Second for orchestrating the evaluation - running both of them: first run and then judge, 
and showing a nice report (with price for both steps).

Note: the cost object returned from toyaikit is not a float, it's an object

@dataclass
class CostInfo:
    input_cost: float
    output_cost: float
    total_cost: float
```
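
Because the cost is an object rather than a float, the costs of the two steps have to be added field by field before they can appear in the report. A minimal sketch, assuming the dataclass has exactly the three fields shown in the prompt: `combine_costs` is a hypothetical helper, the dataclass is redefined locally instead of imported from toyaikit, and the dollar amounts are made up.

```python
from dataclasses import dataclass


@dataclass
class CostInfo:
    # Local copy of the shape described in the prompt above.
    input_cost: float
    output_cost: float
    total_cost: float


def combine_costs(*costs: CostInfo) -> CostInfo:
    """Add several CostInfo objects field by field."""
    return CostInfo(
        input_cost=sum(c.input_cost for c in costs),
        output_cost=sum(c.output_cost for c in costs),
        total_cost=sum(c.total_cost for c in costs),
    )


# Made-up amounts for the agent run and the judge run.
run_cost = CostInfo(input_cost=0.004, output_cost=0.006, total_cost=0.010)
judge_cost = CostInfo(input_cost=0.003, output_cost=0.007, total_cost=0.010)

total = combine_costs(run_cost, judge_cost)
print(f"Total Evaluation Cost: ${total.total_cost:.2f}")
```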

## Running the Complete Pipeline

Now we can run the entire evaluation pipeline with two simple commands.

Create a sample dataset:

```bash
uv run python -m evals.sample_ground_truth \
    --sample-size 25 \
    --extra-indices 150 \
    --input evals/ground_truth_evidently.csv \
    --output evals/gt-sample.csv
```

This creates a manageable sample from our full ground truth dataset.
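
For orientation, here is one way a sampling CLI with these flags could be written. It is a sketch, not the script from the repository: it uses only the standard library, it assumes `--extra-indices` lists row positions that must always be included, and the `--seed` flag is a hypothetical addition for reproducible samples.

```python
import argparse
import csv
import random


def sample_rows(rows, sample_size, extra_indices=(), seed=42):
    """Pick random rows and always include the rows at extra_indices."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(rows)), min(sample_size, len(rows))))
    # Assumption: extra indices are 0-based row positions; invalid ones are ignored.
    chosen.update(i for i in extra_indices if 0 <= i < len(rows))
    return [rows[i] for i in sorted(chosen)]


def build_parser():
    parser = argparse.ArgumentParser(description="Sample ground truth rows")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--sample-size", type=int, default=50)
    parser.add_argument("--extra-indices", type=int, nargs="*", default=[])
    parser.add_argument("--seed", type=int, default=42)  # hypothetical flag
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.input, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        rows = list(reader)
    sample = sample_rows(rows, args.sample_size, args.extra_indices, args.seed)
    with open(args.output, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sample)
    print(f"Wrote {len(sample)} rows to {args.output}")


# Small in-memory demo of the sampling logic (no files needed).
demo_rows = [{"id": i} for i in range(200)]
demo_sample = sample_rows(demo_rows, sample_size=25, extra_indices=[150])
print(len(demo_sample))
```

To use it as a script, call `main()` under the usual `if __name__ == "__main__":` guard. A fixed seed keeps the sample stable between runs, so score changes come from the agent and not from different questions.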

Run the complete evaluation:

```bash
uv run python -m evals.eval_orchestrator \
    --csv evals/gt-sample.csv
```

This command runs both the agent application and judge evaluation steps automatically.
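
The CLI wiring behind such a command might look roughly like this. It is a simplified sketch rather than the actual `eval_orchestrator.py`: `run_agent_step` and `run_judge_step` are hypothetical placeholders for the logic in `eval_agent_run.py` and `eval_agent_judge.py`, and only the `--csv` flag comes from the command above.

```python
import argparse


def run_agent_step(csv_path: str) -> list[dict]:
    """Placeholder for step 1; the real version runs the agent on each row."""
    return [{"question": "example", "answer": "example answer", "source": csv_path}]


def run_judge_step(results: list[dict]) -> dict[str, float]:
    """Placeholder for step 2; the real version asks an LLM judge."""
    return {"answer_relevant": 1.0 if results else 0.0}


def main(argv=None) -> dict[str, float]:
    parser = argparse.ArgumentParser(description="Run the agent, then the judge")
    parser.add_argument("--csv", required=True, help="Ground truth CSV to evaluate")
    args = parser.parse_args(argv)

    results = run_agent_step(args.csv)   # step 1: apply the agent
    scores = run_judge_step(results)     # step 2: evaluate the output

    print(f"Samples Evaluated: {len(results)}")
    for name, value in scores.items():
        print(f"{name:<22} {value:.1f}")
    return scores


# Passing the arguments explicitly makes the CLI easy to call from tests.
scores = main(["--csv", "evals/gt-sample.csv"])
```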

## Sample Evaluation Report

Here's an example report for a run on 5 samples:

```text
Evaluation Report
==================
Average scores:
instructions_follow    0.8
instructions_avoid     1.0
answer_relevant        1.0
answer_clear           1.0
answer_citations       0.6
completeness           0.2
tool_call_search       1.0

Total Evaluation Cost: $0.02
Samples Evaluated: 5
```

The report shows the average score for each criterion, the total cost of the run, and the number of samples evaluated.
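
Averages like these can be computed from per-sample verdicts. The sketch below assumes the judge returns one pass/fail boolean per criterion for each sample; the verdicts are invented for illustration and `average_scores` is a hypothetical helper.

```python
from statistics import mean

# Hypothetical judge verdicts for 5 samples (True = check passed).
judgments = [
    {"instructions_follow": True, "answer_citations": True, "completeness": True},
    {"instructions_follow": True, "answer_citations": True, "completeness": False},
    {"instructions_follow": True, "answer_citations": True, "completeness": False},
    {"instructions_follow": True, "answer_citations": False, "completeness": False},
    {"instructions_follow": False, "answer_citations": False, "completeness": False},
]


def average_scores(judgments: list[dict]) -> dict[str, float]:
    """Average each criterion over all samples (True counts as 1.0)."""
    criteria = judgments[0].keys()
    return {name: mean(float(j[name]) for j in judgments) for name in criteria}


averages = average_scores(judgments)
for name, value in averages.items():
    print(f"{name:<22} {value:.1f}")
```

With only 5 samples each score moves in steps of 0.2, so a single answer can shift a criterion noticeably.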

## Benefits

Now we can easily integrate this into CI/CD pipelines and regularly check whether any of these scores declines over time.
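
In a CI job, one simple approach is to compare each score with a minimum and fail the job when any score falls below it. A sketch under assumptions: `find_regressions` is a hypothetical helper and the thresholds are arbitrary example values, while the scores are copied from the sample report above.

```python
def find_regressions(
    scores: dict[str, float], thresholds: dict[str, float]
) -> dict[str, tuple[float, float]]:
    """Return {criterion: (score, minimum)} for every score below its minimum."""
    regressions = {}
    for name, minimum in thresholds.items():
        score = scores.get(name, 0.0)  # a missing criterion counts as a failure
        if score < minimum:
            regressions[name] = (score, minimum)
    return regressions


# Scores copied from the sample report above.
scores = {
    "instructions_follow": 0.8,
    "instructions_avoid": 1.0,
    "answer_relevant": 1.0,
    "answer_clear": 1.0,
    "answer_citations": 0.6,
    "completeness": 0.2,
    "tool_call_search": 1.0,
}

# Hypothetical minimums; choose values that fit your own baseline.
thresholds = {"answer_relevant": 0.9, "answer_citations": 0.5, "completeness": 0.5}

regressions = find_regressions(scores, thresholds)
for name, (score, minimum) in regressions.items():
    print(f"{name}: {score:.2f} is below the minimum {minimum:.2f}")

# In CI, pass this value to sys.exit() so a regression fails the job.
exit_code = 1 if regressions else 0
```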

But we can use this framework for much more than monitoring. We can compare different approaches and see which one works best.

For example, we could test different chunking strategies, prompt variations, or model configurations. Each approach can be evaluated systematically using the same framework.
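
Comparing two variants then comes down to running the pipeline once per variant on the same sample and looking at the score differences. A minimal sketch with made-up numbers; `compare_runs` is a hypothetical helper, not part of the repository.

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return candidate minus baseline for every criterion present in both runs."""
    shared = baseline.keys() & candidate.keys()
    return {name: round(candidate[name] - baseline[name], 4) for name in sorted(shared)}


# Made-up scores for illustration; real values come from two evaluation reports.
baseline = {"answer_citations": 0.6, "completeness": 0.2, "answer_clear": 1.0}
candidate = {"answer_citations": 0.8, "completeness": 0.4, "answer_clear": 0.8}

deltas = compare_runs(baseline, candidate)
for name, delta in deltas.items():
    print(f"{name:<20} {delta:+.2f}")
```

A positive delta means the candidate scored higher on that criterion. Both runs should use the same ground truth sample, otherwise the differences may reflect the questions rather than the change being tested.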