## Production-Ready Agent Engineering: From MCP to RL

### Lecture 3: Agent Evals + Optimization

**Instructor: Will Brown**

*Date: June 19, 2025*

#### Understanding Existing Evals
- Don't reinvent the wheel

#### Reward Functions as Evals
- Measuring Task Success
- Format Adherence
- Text Similarity + Golden Answers
- LLM Judges for Qualitative Attributes
- Benchmarking Tool Use

#### Synthetic Data and Finetuning
- Agent Environments as Data Engines
- SFT Basics

#### Reward Optimization via DSPy
- Modules
- Signatures
- Metrics
- Optimizers

### Understanding Existing Evals
- Popular evals: https://artificialanalysis.ai/
- TogetherAI dashboard: https://whichllm.together.ai/use-case
- Role of Model Spec: https://cdn.openai.com/spec/model-spec-2024-05-08.html 
- Agent benchmarks:
    - BFCL v3 https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html
    - Tau-Bench https://sierra.ai/blog/benchmarking-ai-agents 
    - MCBench https://mcbench.ai/about


The most important thing is **clearly specifying what you want your model, agent, or system to actually do!**

### Deterministic Evals
- Simple math problems
- Multiple choice
- Checkable text properties
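
The GSM8k cell below covers the math case. For the multiple-choice case, a deterministic check can be sketched in a few lines. This is a minimal sketch, not part of `verifiers`: the `Answer: X` output convention and the helper names `extract_choice` and `multiple_choice_reward` are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed convention: the model ends its response with "Answer: <letter>".
CHOICE_PATTERN = re.compile(r"answer\s*:\s*\(?([A-D])\b\)?", re.IGNORECASE)

def extract_choice(text: str) -> Optional[str]:
    """Return the last 'Answer: X' letter found in the text, uppercased."""
    matches = CHOICE_PATTERN.findall(text)
    return matches[-1].upper() if matches else None

def multiple_choice_reward(completion: str, answer: str, **kwargs) -> float:
    """1.0 if the extracted choice equals the reference letter, else 0.0."""
    return 1.0 if extract_choice(completion) == answer.strip().upper() else 0.0

print(multiple_choice_reward("B looks tempting, but...\nAnswer: C", "C"))
print(multiple_choice_reward("I am not sure.", "A"))
```

Taking the last match lets the model mention other options while reasoning without affecting the score.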

In [None]:
### GSM8k -- deterministic verification

import verifiers as vf
from verifiers.utils.data_utils import load_example_dataset, extract_boxed_answer


dataset = load_example_dataset("gsm8k")

print(dataset[0])
print("-"*100)
ans = "The answer is \\boxed{18}"
print("boxed answer: ", extract_boxed_answer(ans))

system_prompt = """
Think step-by-step inside <think>...</think> tags.

Then, give your final numerical answer inside \\boxed{{...}}.
"""

parser = vf.ThinkParser(extract_fn=extract_boxed_answer)
def correct_answer_reward_func(completion, answer, **kwargs):
    response = parser.parse_answer(completion) or ''
    return 1.0 if response == answer else 0.0

rubric = vf.Rubric(funcs=[
    correct_answer_reward_func,
    parser.get_format_reward_func()
], weights=[1.0, 0.2])

vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    system_prompt=system_prompt,
    parser=parser,
    rubric=rubric,
)

{'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", 'answer': '18'}
----------------------------------------------------------------------------------------------------
boxed answer:  18


In [26]:
from openai import OpenAI
import os

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
model = "gpt-4.1-mini"

results = vf_env.evaluate( # noqa: F821
    client=client,
    model=model,
    num_samples=20,
    max_concurrent=10,
)

2025-06-23 22:29:35 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset
Running 20 rollouts: 100%|██████████| 20/20 [00:14<00:00,  1.39it/s]
Evaluating 20 rollouts: 100%|██████████| 20/20 [00:00<00:00, 3535.47it/s]


In [29]:
# Results
import numpy as np
print("\nRewards:")
for k, v in results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

question = results["question"][0]
answer = results["answer"][0]
rollout = results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nAnswer: {answer}")
print("\nRollout:")
for i, msg in enumerate(rollout):
    print(f"{i+1}. {msg['role']}: {msg['content']}")


Rewards:
correct_answer_reward_func avg: 0.85
correct_answer_reward_func all: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
format_reward_func avg: 1.00
format_reward_func all: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
reward avg: 1.05
reward all: [1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 0.2, 1.2, 0.2, 1.2, 0.2, 1.2, 1.2, 1.2]

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?

Answer: 18

Rollout:
1. assistant: <think>
First, Janet's ducks lay 16 eggs per day.

She eats 3 eggs for breakfast, so subtract 3 from 16:
16 - 3 = 13 eggs left.

She uses 4 eggs for baking muffins for her friends, so subtract 4 from 13:
13 -
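
The recorded rewards above are consistent with the rubric combining its functions as a weighted sum: 1.0 × correctness + 0.2 × format. The sketch below reproduces the `reward avg: 1.05` figure from the recorded per-function scores (17 of 20 correct, all 20 well-formatted). The helper `weighted_reward` is hypothetical and only illustrates the arithmetic; it is not the `vf.Rubric` implementation.

```python
def weighted_reward(scores: dict, weights: dict) -> float:
    """Weighted sum of per-function scores (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)

weights = {"correct_answer": 1.0, "format": 0.2}

# A correct, well-formatted rollout and an incorrect, well-formatted rollout.
correct = weighted_reward({"correct_answer": 1.0, "format": 1.0}, weights)
wrong = weighted_reward({"correct_answer": 0.0, "format": 1.0}, weights)

# 17 correct and 3 incorrect rollouts, as in the recorded run.
average = (17 * correct + 3 * wrong) / 20
print(round(correct, 2), round(wrong, 2), round(average, 2))  # 1.2 0.2 1.05
```

Because the weights do not sum to 1, the combined reward can exceed 1.0, which is why the average sits above the 0.85 accuracy.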

In [None]:
### IFeval --- instruction-following evals

prompt = "Write a logic quiz for teenagers about a chesterfield. In your entire response, the letter t should appear at most once."

def reward_func(completion, **kwargs):
    """Check if the letter t appears at most once in the completion."""
    return 1.0 if completion.count("t") <= 1 else 0.0

models = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]

for model in models:
    print('-'*100)
    print(model + ':')
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    print(response.choices[0].message.content)
    print('-'*50)
    print(reward_func(response.choices[0].message.content))

----------------------------------------------------------------------------------------------------
gpt-4.1-nano:
**Logic Quiz: The Chesterfield Challenge**

Imagine you own a beautiful chesterfield sofa. Here are some clues about it:

1. If the sofa is grey, then it is comfortable.
2. The sofa is not red.
3. If the sofa is not comfortable, then it is green.

Based on these clues, answer the following questions:

**Questions:**

1. Is the sofa grey or green?
2. Can the sofa be both grey and comfortable?
3. If the sofa is red, what can you say about its comfort?

**Think carefully and choose your answers wisely!**
--------------------------------------------------
0.0
----------------------------------------------------------------------------------------------------
gpt-4.1-mini:
Sure! Here's a quiz about a chesterfield, with the letter "t" appearing only once:

---

**Logic Quiz: The Mystery of a Chesterfield**

1. A person owns a chesterfield. If the chesterfield is red, they say "R

In [33]:
prompt = "Write a riddle for kids about auspices but make sure you don't use any commas."

def reward_func(completion, **kwargs):
    return 1.0 if "," not in completion else 0.0


models = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]

for model in models:
    print('-'*100)
    print(model + ':')
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    print(response.choices[0].message.content)
    print('-'*50)
    print(reward_func(response.choices[0].message.content))

----------------------------------------------------------------------------------------------------
gpt-4.1-nano:
Here's a fun riddle for kids about auspices:

I help you see what’s coming next  
By watching birds or signs in the sky  
People trust me when they’re unsure  
Can you guess what I am mysterious and high?
--------------------------------------------------
1.0
----------------------------------------------------------------------------------------------------
gpt-4.1-mini:
I am a sign from birds above  
A little clue to guide with love  
Look to the sky to know what’s right  
What am I that shines your light?
--------------------------------------------------
1.0
----------------------------------------------------------------------------------------------------
gpt-4.1:
I am what birds in the sky can tell  
Long ago people watched me to know if things would go well  
I am a sign from above that can help you plan  
Guess what I am if you can
--------------------------------

In [34]:
prompt = "What happened when the Tang dynasty of China was in power? Make sure to use the word war at least 8 times, and the word peace at least 10 times."

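def reward_func(completion, **kwargs):
    """Check that "war" appears at least 8 times and "peace" at least 10 times.

    str.count is a case-sensitive substring match: "War" is not counted, "warfare" is.
    """
    war_count = completion.count("war")
    peace_count = completion.count("peace")
    return 1.0 if war_count >= 8 and peace_count >= 10 else 0.0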

models = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]

for model in models:
    print('-'*100)
    print(model + ':')
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    print(response.choices[0].message.content)
    print('-'*50)
    print(reward_func(response.choices[0].message.content))

----------------------------------------------------------------------------------------------------
gpt-4.1-nano:
During the Tang dynasty of China, which lasted from 618 to 907 AD, a remarkable era of cultural, political, and military development unfolded, marked by both periods of war and peace. The Tang dynasty, known for its flourishing arts and vibrant trade, experienced significant conflicts that shaped its history, yet also cherished long stretches of peace that contributed to its golden age.

The initial rise of the Tang dynasty was a time of war as the empire sought to unify China after the fall of the Sui dynasty. Through strategic military campaigns and skilled leadership, the Tang managed to assert control over extensive territories, but these conquests often led to ongoing war efforts. Despite the challenges of war, the Tang leadership emphasized the importance of peace within the empire, striving to maintain stability and prosperity. Peace was seen as essential for econom
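
The checks in the cells above use `str.count`, which is case-sensitive and matches inside longer words. Whether that is the intended reading of the instruction is a design decision. A stricter variant can be sketched as follows, assuming whole-word, case-insensitive matching is what you want; the helper names `count_word` and `count_letter` are hypothetical.

```python
import re

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word`."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a letter, ignoring case."""
    return text.lower().count(letter.lower())

sample = "War pushed onward toward the border."
print(sample.count("war"))        # 2: matches inside "onward" and "toward"
print(count_word(sample, "war"))  # 1: only the standalone "War"
print(count_letter("The cat sat", "t"))  # 3: includes the capital "T"
```

Either convention works as an eval as long as the prompt and the checker agree on it.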

In [43]:
from datasets import Dataset

animals = ["dogs", "cats", "birds", "fish", "horses", "rabbits"]
sports = ["soccer", "basketball", "tennis", "baseball", "hockey", "football"]
letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]
counts = [1, 2]

### product of all lists
rows = []
for animal in animals:
    for sport in sports:
        for letter in letters:
            for count in counts:
                if count == 1:
                    prompt = f"Write a short story about {animal} playing {sport} which includes the letter {letter} at most once."
                else:
                    prompt = f"Write a short story about {animal} playing {sport} which includes the letter {letter} at most twice."
                row = {
                    "question": prompt,
                    "answer": "",
                    "info": {"letter": letter, "count": count, "animal": animal, "sport": sport}
                }
                rows.append(row)

dataset = Dataset.from_list(rows).shuffle()

dataset[:10]

{'question': ['Write a short story about fish playing basketball which includes the letter q at most once.',
  'Write a short story about horses playing hockey which includes the letter a at most once.',
  'Write a short story about birds playing tennis which includes the letter q at most once.',
  'Write a short story about horses playing soccer which includes the letter g at most twice.',
  'Write a short story about rabbits playing soccer which includes the letter a at most twice.',
  'Write a short story about birds playing tennis which includes the letter x at most once.',
  'Write a short story about rabbits playing basketball which includes the letter g at most once.',
  'Write a short story about birds playing baseball which includes the letter x at most once.',
  'Write a short story about cats playing basketball which includes the letter z at most once.',
  'Write a short story about rabbits playing soccer which includes the letter t at most twice.'],
 'answer': ['', '', '', 
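
The nested loops in the cell above enumerate the Cartesian product of the four lists. As an alternative sketch, the same rows can be built with `itertools.product`; the `limits` mapping is a name introduced here for illustration, and the row layout mirrors the cell above.

```python
from itertools import product
import string

animals = ["dogs", "cats", "birds", "fish", "horses", "rabbits"]
sports = ["soccer", "basketball", "tennis", "baseball", "hockey", "football"]
letters = list(string.ascii_lowercase)
limits = {1: "once", 2: "twice"}  # count -> wording used in the prompt

rows = [
    {
        "question": (
            f"Write a short story about {animal} playing {sport} "
            f"which includes the letter {letter} at most {wording}."
        ),
        "answer": "",
        "info": {"letter": letter, "count": count, "animal": animal, "sport": sport},
    }
    for animal, sport, letter, (count, wording) in product(
        animals, sports, letters, limits.items()
    )
]

print(len(rows))  # 6 * 6 * 26 * 2 = 1872
```

The row count matches the 1872 examples reported by the `Map` progress bar in the evaluation cell.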

In [51]:

def reward_func(completion, info, **kwargs):
    """Return 1.0 if the letter occurs at most `count` times in the final message (case-sensitive)."""
    letter = info["letter"]
    count = info["count"]
    response = completion[-1]["content"]
    count_letter = response.count(letter)
    return 1.0 if count_letter <= count else 0.0

rubric = vf.Rubric(funcs=[reward_func], weights=[1.0])

vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=vf.Parser(),
    rubric=rubric,
)

results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",
    num_samples=20,
    max_concurrent=10,
)

for k, v in results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

question = results["question"][0]
completion = results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nCompletion: {completion}")

Map (num_proc=32):   0%|          | 0/1872 [00:00<?, ? examples/s]

2025-06-23 23:11:40 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset
Running 20 rollouts: 100%|██████████| 20/20 [00:06<00:00,  2.87it/s]
Evaluating 20 rollouts: 100%|██████████| 20/20 [00:00<00:00, 6795.70it/s]

reward_func avg: 0.25
reward_func all: [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
reward avg: 0.25
reward all: [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Question: Write a short story about fish playing basketball which includes the letter q at most once.

Completion: [{'role': 'assistant', 'content': 'In a shimmering coral reef, a team of colorful fish gathered around a round sea sponge that served as a basketball hoop. Finley the clownfish dribbled a tiny pearl, passing it skillfully to Tina the tang. The fish darted and weaved through swaying seaweed, their scales flashing in the sunlight. With a flick of her tail, Tina launched the pearl toward the sponge. It bounced off the rim but was caught by Oscar the angelfish, who made a swift move and scored the winning point. The reef echoed with bubbly cheers as the fish celebrated their thrilling game beneath the waves.'}]




In [46]:
question = results["question"][1]
completion = results["completion"][1]

print(f"\nQuestion: {question}")
print(f"\nCompletion: {completion}")


Question: Write a short story about horses playing hockey which includes the letter a at most once.

Completion: [{'role': 'assistant', 'content': 'In a quiet village, horses loved to have fun. Every evening, they gathered on the open field to enjoy a game. They used a small, round stone for the puck, and two wooden sticks for handling it.  \n\nOne swift mare, swift and bright like dawn, led the team with passion. Her friends, tall and strong, skated swiftly on the frozen ground. Wings of snow shimmered as they played under moonlight.  \n\nLaughter and neighs filled the cold night, uniting all in joyful cheer. When the game ended, they trotted home, dreams full of play, ready for another match soon.'}]


### Non-Verifiable Tasks

- Expert Models as Golden Answers
- Embedding Metrics
- Text-Based Metrics
- LLM Judges
- Best-Of-N
- Synthesis-Of-N
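
Best-of-N from the list above can be sketched as: sample several candidates, score each with any reward function, and keep the top one. The sketch below is a minimal illustration with a toy scorer (fewer commas is better, echoing the no-commas riddle check earlier). The names `best_of_n` and `comma_penalty` are hypothetical, and in practice the candidates would come from N model samples.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the highest-scoring candidate (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)

def comma_penalty(text: str) -> float:
    """Toy reward: 0 for no commas, more negative for each comma."""
    return -float(text.count(","))

candidates = [
    "Birds, signs, and sky",
    "Birds and signs in the sky",
    "Birds, up high",
]
print(best_of_n(candidates, comma_penalty))  # Birds and signs in the sky
```

The same shape works with an LLM judge or embedding similarity as `score_fn`; only the scorer changes.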


In [None]:

"""
- LLM judge for summary quality
- Variance issue
- GT (ground truth) = GPT-4.1 summary; measure mini vs. nano
"""



In [65]:
import random
import os
WIKI_DIR = "../lec1-agent-patterns/data/wiki"
files = os.listdir(WIKI_DIR)

articles = random.sample(files, 20)

print(articles)

['David.md', 'Nusrat Jahan Choudhury.md', 'Kubernetes.md', 'Shrek.md', 'Military organization.md', 'Doris family.md', 'Dr. Romantic.md', 'List of Riverdale episodes.md', 'Western Asia.md', 'Sam Walton.md', 'Billy Crystal.md', 'Mermaid.md', 'Shut Up and Dance _Black Mirror.md', 'Bernie Sanders.md', 'List of rifle cartridges.md', 'Ludwig Wittgenstein.md', 'Antifa _United States.md', 'Great Purge.md', 'Star Trek Into Darkness.md', 'Saint Petersburg.md']


In [None]:
import random
import asyncio
import nest_asyncio
from tqdm.notebook import tqdm
from openai import AsyncOpenAI

# Enable asyncio in Jupyter
nest_asyncio.apply()

# Initialize async OpenAI client
openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Semaphore to limit concurrent requests
semaphore = asyncio.Semaphore(3)

async def generate_summary_for_file(filepath: str, model: str = "gpt-4.1") -> dict:
    """
    Generate a summary for a given wiki file (gpt-4.1 by default).
    Returns a dict with the prompt ("question"), the summary ("answer"), and the filename ("info").
    """
    async with semaphore:  # Limit concurrent requests
        # Read file content directly
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()

        filename = os.path.basename(filepath)

        # Prompt for GPT-4.1
        prompt = f"""Given the following article, generate a 3 paragraph summary.

Article:
{content[:100000]}
"""
        # Call the model, retrying up to 5 times on errors or empty responses
        answer = None
        for _ in range(5):
            try:
                response = await openai_client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "user", "content": prompt}
                    ],
                    temperature=0.7
                )
                answer = response.choices[0].message.content
                if answer:
                    break
            except Exception as e:
                print(f"Error generating summary for {filename}: {e}")
        if not answer:
            answer = "Error generating summary"
        result = {"question": prompt, "answer": answer, "info": {"filename": filename}}
        return result

async def generate_summaries(articles: list[str], model: str = "gpt-4.1") -> list[dict]:
    """
    Generate summaries for wiki pages using parallel processing.
    Returns consolidated list of all summaries with metadata.
    """

    # Create tasks for parallel processing
    tasks = []
    for filename in articles:
        filepath = os.path.join(WIKI_DIR, filename)
        task = generate_summary_for_file(filepath, model=model)
        tasks.append(task)

    # Execute all tasks in parallel with progress bar
    all_results = []
    with tqdm(total=len(tasks), desc="Generating summaries") as pbar:
        for coro in asyncio.as_completed(tasks):
            try:
                result = await coro
                all_results.append(result)
                pbar.update(1)
            except Exception as e:
                pbar.update(1)
                continue

    # Return results directly (they're already individual dicts)
    return all_results

async def main():
    summaries = await generate_summaries(articles=articles, model="gpt-4.1")
    print(f"\nGenerated {len(summaries)} total summaries")
    return summaries

# Run the async function
summaries = await main()

In [76]:
summaries

[{'question': 'Given the following article, generate a 3 paragraph summary.\n\nArticle:\n# Sam Walton\n\n*Revision ID: 1155533706 | Timestamp: 2023-05-18T15:33:51Z*\n\n---\n\n| birth_place        = [Oklahoma](Kingfisher,)(Kingfisher, Oklahoma), U.S.\n\n| death_date         =\n\n| death_place        = [Rock, Arkansas](Little)(Little Rock, Arkansas), U.S.\n\n| occupation         = Founder of [Walmart](Walmart) and [Club](Sam\'s)(Sam\'s Club)\n\n| spouse             =\n\n| resting_place      = Bentonville Cemetery\n\n| children           =\n\n| relatives          =\n\n| alma_mater         = [of Missouri](University)(University of Missouri) ([BS](Bachelor of Science))\n\n| module             =\n\n| branch             =\n\n| rank               =  [Captain](Captain (United States O-3))\n\n| battles            = [War II](World)(World War II)\n\n| unit               = [Intelligence Corps](Military)(Military Intelligence Corps (United States Army))\n\n| serviceyears       = 1942–1945\n\n}}\n\n}

In [77]:
### gpt-4.1 dataset

from datasets import Dataset

dataset = Dataset.from_list(summaries)

dataset.push_to_hub("willcb/gpt-4.1-wiki-summaries")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/willcb/gpt-4.1-wiki-summaries/commit/b754a0277355dd1d2cf00ef4f2732d58b1094d3f', commit_message='Upload dataset', commit_description='', oid='b754a0277355dd1d2cf00ef4f2732d58b1094d3f', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/willcb/gpt-4.1-wiki-summaries', endpoint='https://huggingface.co', repo_type='dataset', repo_id='willcb/gpt-4.1-wiki-summaries'), pr_revision=None, pr_num=None)

In [86]:
import numpy as np
from openai import OpenAI

def lcs_reward_func(completion, answer, **kwargs) -> float:
    """Character-level similarity to the reference summary via difflib, in [0, 1]."""
    from difflib import SequenceMatcher
    response = completion[-1]["content"]
    # ratio() = 2 * (matched characters) / (total characters in both strings)
    similarity = SequenceMatcher(None, response, answer).ratio()
    return similarity

def embed_sim_reward_func(completion, answer, **kwargs) -> float:
    """Cosine similarity between embeddings of the response and the reference summary."""
    embed_client = OpenAI()
    response_embedding = embed_client.embeddings.create(
        input=completion[-1]["content"],
        model="text-embedding-3-small"
    )
    answer_embedding = embed_client.embeddings.create(
        input=answer,
        model="text-embedding-3-small"
    )
    # Normalize both vectors so that the dot product is the cosine similarity
    response_embedding = np.array(response_embedding.data[0].embedding)
    response_embedding = response_embedding / np.linalg.norm(response_embedding)
    answer_embedding = np.array(answer_embedding.data[0].embedding)
    answer_embedding = answer_embedding / np.linalg.norm(answer_embedding)
    similarity = np.dot(response_embedding, answer_embedding)
    return similarity

rubric = vf.Rubric(funcs=[
    lcs_reward_func,
    embed_sim_reward_func
], weights=[1.0, 1.0])

vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=vf.Parser(),
    rubric=rubric,
)

num_proc must be <= 20. Reducing num_proc to 20 for dataset of size 20.


Map (num_proc=20):   0%|          | 0/20 [00:00<?, ? examples/s]

In [87]:
self_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1",
    num_samples=20,
    max_concurrent=10,
)

for k, v in self_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

question = self_results["question"][0]
completion = self_results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nCompletion: {completion}")

2025-06-24 10:15:17 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset

Running 20 rollouts: 100%|██████████| 20/20 [00:36<00:00,  1.82s/it]

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:03<00:00,  5.54it/s]

lcs_reward_func avg: 0.13
lcs_reward_func all: [0.0803724577309483, 0.19175408636252744, 0.2544122544122544, 0.13941018766756033, 0.14767392232180965, 0.21756347542511065, 0.02366412213740458, 0.1269296740994854, 0.08983829107606309, 0.05341526825809501, 0.08894536213468869, 0.22346891070593736, 0.14889508689122505, 0.18609916232737153, 0.1878345498783455, 0.11586392724823173, 0.08729769494850417, 0.06044297246946802, 0.059788716565520345, 0.10574905581200168]
embed_sim_reward_func avg: 0.95
embed_sim_reward_func all: [np.float64(0.9571752864440783), np.float64(0.9278927582245645), np.float64(0.9873815749165633), np.float64(0.9485403118044977), np.float64(0.9554418600884089), np.float64(0.974147127164064), np.float64(0.9460962997222373), np.float64(0.948956839746449), np.float64(0.9732000494449582), np.float64(0.9630537241895378), np.float64(0.945533971970178), np.float64(0.9534816223694373), np.float64(0.9474304782890451), np.float64(0.9363046420188501), np.float64(0.9843900096987924)




In [88]:
mini_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",
    num_samples=20,
    max_concurrent=10,
)

for k, v in mini_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

question = mini_results["question"][0]
completion = mini_results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nCompletion: {completion}")

2025-06-24 10:15:57 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset

Running 20 rollouts: 100%|██████████| 20/20 [00:32<00:00,  1.63s/it]

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:02<00:00,  9.12it/s]

lcs_reward_func avg: 0.12
lcs_reward_func all: [0.0651748859439496, 0.13475352984889769, 0.23301654347060893, 0.074000791452315, 0.07343589743589743, 0.29660099588655553, 0.04380118072748048, 0.13986617742283616, 0.016787038844427095, 0.11644630016237532, 0.07294117647058823, 0.07416364369205965, 0.19101123595505617, 0.10680970149253731, 0.2156347119292374, 0.12130835185773262, 0.09391304347826086, 0.057233704292527825, 0.091557395905222, 0.15105619768832204]
embed_sim_reward_func avg: 0.95
embed_sim_reward_func all: [np.float64(0.9392544006983792), np.float64(0.9231813435202407), np.float64(0.9845638838964986), np.float64(0.9524833605156819), np.float64(0.9421727326586798), np.float64(0.9829247238931331), np.float64(0.9333645416641402), np.float64(0.930474387185977), np.float64(0.9404636395755938), np.float64(0.9636231670747043), np.float64(0.9035962246081117), np.float64(0.9573791412806698), np.float64(0.9728089701685086), np.float64(0.9665687680730746), np.float64(0.9751236508128751




In [89]:
nano_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-nano",
    num_samples=20,
    max_concurrent=10,
)

for k, v in nano_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

question = nano_results["question"][0]
completion = nano_results["completion"][0]

print(f"\nQuestion: {question}")
print(f"\nCompletion: {completion}")

2025-06-24 10:16:32 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset

Running 20 rollouts: 100%|██████████| 20/20 [00:11<00:00,  1.68it/s]

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:02<00:00,  9.66it/s]

lcs_reward_func avg: 0.09
lcs_reward_func all: [0.10372608257804633, 0.1204091120409112, 0.1292517006802721, 0.0411992657556598, 0.06040992448759439, 0.1031400412560165, 0.020050125313283207, 0.2433405327573794, 0.054498473103124265, 0.046335697399527184, 0.07372049854150092, 0.08114456544949819, 0.14866696791685494, 0.08749078397640699, 0.15583756345177666, 0.12325285895806862, 0.07341269841269842, 0.06189640035118525, 0.06565176022835395, 0.09668025626092021]
embed_sim_reward_func avg: 0.94
embed_sim_reward_func all: [np.float64(0.9113146464501998), np.float64(0.9238003206464502), np.float64(0.9621104324797234), np.float64(0.9437116051574328), np.float64(0.9514531549410946), np.float64(0.9587545611713953), np.float64(0.9013830337845667), np.float64(0.9065238847108323), np.float64(0.9528892691648425), np.float64(0.9227218704537914), np.float64(0.9554745335091913), np.float64(0.9616795033576861), np.float64(0.9656431554441138), np.float64(0.9270944994268459), np.float64(0.9773798160182
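
The two metrics in the runs above behave very differently: the `difflib` scores sit around 0.1 while the embedding scores sit around 0.95 for all three models. A toy sketch of what each one measures is given below, using only the standard library. The helpers `sequence_ratio` and `cosine_similarity` are illustrative names, and the two-dimensional vectors stand in for real embeddings.

```python
import math
from difflib import SequenceMatcher

def sequence_ratio(a: str, b: str) -> float:
    """difflib ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()

def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# "bcd" is the only matching block: 2 * 3 / (4 + 4) = 0.75
print(sequence_ratio("abcd", "bcde"))

# Same meaning, different wording: the character-level score drops sharply.
print(sequence_ratio("The cat sat on the mat.", "A feline rested upon the rug."))

# Vectors 45 degrees apart have cosine 1 / sqrt(2), roughly 0.7071.
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Character overlap penalizes any rewording, so two good summaries of the same article can still score low on it, while embedding similarity mostly reflects shared topic. Neither separates the three models well in the recorded numbers.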




In [105]:
import random
judge_model = "gpt-4.1-mini"
judge_parser = vf.ThinkParser()

def judge_reward_func(prompt, completion, answer, **kwargs) -> float:
    article = prompt[-1]["content"].split("Article:")[1]

    # flip coin for position of summary A and B
    completion_pos = 0
    summary_a = completion[-1]["content"]
    summary_b = answer
    if random.random() < 0.5:
        summary_a, summary_b = summary_b, summary_a
        completion_pos = 1

    judge_prompt = f"""You will be given an article and two summaries (A and B).

    Think step-by-step inside <think>...</think> tags, then return either "A" or "B" based on which summary is better.

    Article:
    {article}

    Summary A:
    {summary_a}

    Summary B:
    {summary_b}"""

    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "user", "content": judge_prompt}
        ]
    )
    result_str = response.choices[0].message.content or ""
    result = judge_parser.parse(result_str)
    if "A" in result:
        return 1.0 if completion_pos == 0 else 0.0
    else:
        return 1.0 if completion_pos == 1 else 0.0

rubric = vf.Rubric(funcs=[
    judge_reward_func
], weights=[1.0])

vf_env = vf.SingleTurnEnv(
    dataset=dataset,
    parser=vf.Parser(),
    rubric=rubric,
)


num_proc must be <= 20. Reducing num_proc to 20 for dataset of size 20.


Map (num_proc=20):   0%|          | 0/20 [00:00<?, ? examples/s]
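
The judge function above treats any response containing a capital "A" as a vote for A and everything else as a vote for B. A stricter way to read the verdict can be sketched as follows. This is an illustrative helper (`parse_verdict` is a hypothetical name, not part of `verifiers`): it assumes the judge reasons inside `<think>` tags as the prompt requests, and it returns `None` when no standalone A or B is found so the caller can decide how to score that case.

```python
import re
from typing import Optional

def parse_verdict(judge_output: str) -> Optional[str]:
    """Return "A" or "B" from the text outside <think> tags, or None."""
    # Drop the reasoning so letters mentioned while thinking are ignored.
    visible = re.sub(r"<think>.*?</think>", "", judge_output, flags=re.DOTALL)
    # Keep only standalone capital letters; the last one is the verdict.
    matches = re.findall(r"\b([AB])\b", visible)
    return matches[-1] if matches else None

print(parse_verdict("<think>A is longer, but B is more accurate.</think>\nB"))
print(parse_verdict("Summary A"))
print(parse_verdict("Both are fine."))
```

An unparseable verdict could then be scored as 0.5 or dropped, instead of silently counting as a vote for B.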

In [106]:
self_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",  # as recorded; the earlier self_results cell used "gpt-4.1"
    num_samples=20,
    max_concurrent=10,
)

for k, v in self_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

2025-06-24 10:27:48 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Running 20 rollouts: 100%|██████████| 20/20 [00:23<00:00,  1.18s/it]


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:27<00:00,  1.35s/it]

judge_reward_func avg: 0.50
judge_reward_func all: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
reward avg: 0.50
reward all: [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]





In [107]:
mini_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",
    num_samples=20,
    max_concurrent=10,
)

for k, v in mini_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

2025-06-24 10:32:47 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Running 20 rollouts: 100%|██████████| 20/20 [00:44<00:00,  2.23s/it]


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:39<00:00,  1.96s/it]

judge_reward_func avg: 0.60
judge_reward_func all: [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
reward avg: 0.60
reward all: [1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]





In [108]:
nano_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",  # as recorded; the variable name suggests "gpt-4.1-nano" was intended
    num_samples=20,
    max_concurrent=10,
)

for k, v in nano_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)


2025-06-24 10:34:11 - verifiers.envs.SingleTurnEnv - INFO - eval_dataset is not set, falling back to train dataset


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Running 20 rollouts: 100%|██████████| 20/20 [00:26<00:00,  1.32s/it]


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Evaluating 20 rollouts: 100%|██████████| 20/20 [00:21<00:00,  1.05s/it]

judge_reward_func avg: 0.55
judge_reward_func all: [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]
reward avg: 0.55
reward all: [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0]
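
All three judge cells above pass `model="gpt-4.1-mini"`, yet the win rates come out as 0.50, 0.60, and 0.55. A rough way to see whether that spread is meaningful is the binomial standard error of a win rate over 20 samples. The sketch below uses the recorded 0/1 rewards; `mean_and_stderr` is a hypothetical helper, and the binomial model treats the 20 judgments as independent, which is an assumption.

```python
import math

def mean_and_stderr(rewards):
    """Win rate and its binomial standard error, sqrt(p * (1 - p) / n)."""
    n = len(rewards)
    p = sum(rewards) / n
    return p, math.sqrt(p * (1 - p) / n)

# Recorded judge rewards from the three runs above.
runs = {
    "run 1": [1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1],
    "run 2": [1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1],
    "run 3": [0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1],
}

for name, rewards in runs.items():
    p, se = mean_and_stderr(rewards)
    print(f"{name}: win rate {p:.2f} +/- {se:.2f}")
```

With a standard error of about 0.11 per run, differences of 0.05 to 0.10 between runs are within noise. Separating models with this judge would need many more samples or a lower-variance metric.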





### Agent Evals

In [116]:
import verifiers as vf
from verifiers.utils import load_example_dataset
from verifiers.tools import python

TOOL_PROMPT = """
Think step-by-step inside <think>...</think> tags in each message, then either call a tool inside <tool>...</tool> tags, or give your final answer inside <answer>...</answer> tags.

You have access to the following tools to help solve problems:

{tool_descriptions}

Tools can be called by writing a JSON command inside <tool> tags with:
- "name": the name of the tool to use
- "args": the arguments for the tool

Example usage:
<tool>
{{"name": "python", "args": {{"code": "import sympy\nx = sympy.symbols('x')\nprint(sympy.solve(x**2 - 4, x))"}}}}
</tool>

After concluding your message with a tool call,
you will then see the tool's output inside <result> tags as a new message. \
You may call tools multiple times if needed. \
Tool state does not persist between calls. \
Always use tools to solve problems whenever possible, rather than using your own knowledge.

The <answer>...</answer> tags should contain only your final answer as a numeric expression.

Example:
<think>
Let's submit the answer.
</think>
<answer>
\\frac{{1}}{{2}}
</answer>
"""

dataset = load_example_dataset("math", split="train")
vf_env = vf.ToolEnv(
    eval_dataset=dataset,
    system_prompt=TOOL_PROMPT,
    llm_fields=["think", ("tool", "answer")],
    env_fields=["result"],
    few_shot=[],
    tools=[python],
    max_steps=5
)

vf_env.rubric = vf.RubricGroup(rubrics=[vf_env.rubric, vf.JudgeRubric()])

2025-06-24 11:12:51 - verifiers.rubrics.RubricGroup - INFO - Initialized RubricGroup with 2 rubrics


return_description: The output of the code (truncated to 1000 chars) or an error message (str)


In [117]:
math_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-nano",
    num_samples=10,
    max_concurrent=10,
)

for k, v in math_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)



[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Running 10 rollouts: 100%|██████████| 10/10 [00:32<00:00,  3.24s/it]


Evaluating 10 rollouts: 100%|██████████| 10/10 [00:00<00:00, 876.44it/s]


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Evaluating 10 rollouts: 100%|██████████| 10/10 [00:03<00:00,  2.92it/s]

correct_answer_reward_func avg: 0.50
correct_answer_reward_func all: [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
tool_execution_reward_func avg: 0.03
tool_execution_reward_func all: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333333]
format_reward_func avg: 0.90
format_reward_func all: [1.0, 0.6000000000000001, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.7333333333333334, 0.6666666666666666]
python_reward_func avg: 0.03
python_reward_func all: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333333]
python_count_reward_func avg: 0.10
python_count_reward_func all: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
python_attempt_reward_func avg: 0.50
python_attempt_reward_func all: [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0]
reward avg: 3.49
reward all: [0.2, 0.12000000000000002, 5.2, 5.2, 5.2, 5.2, 4.2, 5.2, 0.1466666666666667, 4.2]
judge_reward_func avg: 0.70
judge_reward_func all: [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]
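
The aggregate `reward` above is not on a 0 to 1 scale, so `reward avg: 3.49` should not be read as an accuracy. The sketch below checks one set of per-function weights that reproduces the printed `reward` list from the printed components. The weights are inferred from these numbers only, not taken from the `verifiers` source; `tool_execution_reward_func` and `python_reward_func` have identical values here, so the split between them cannot be determined from this output.

```python
import math

# Component scores copied from the gpt-4.1-nano output above.
components = {
    "correct_answer_reward_func": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    "tool_execution_reward_func": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                                   0.3333333333333333],
    "format_reward_func": [1.0, 0.6000000000000001, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
                           0.7333333333333334, 0.6666666666666666],
    "judge_reward_func": [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0],
}

# ASSUMPTION: weights inferred by matching the printed totals, not read
# from the library. The python_* functions are treated as weight 0.
inferred_weights = {
    "correct_answer_reward_func": 1.0,
    "tool_execution_reward_func": 0.2,
    "format_reward_func": 0.2,
    "judge_reward_func": 4.0,
}

printed_reward = [0.2, 0.12000000000000002, 5.2, 5.2, 5.2, 5.2, 4.2, 5.2,
                  0.1466666666666667, 4.2]

# Recompute each rollout's total as a weighted sum of its component scores.
recomputed = [
    sum(w * components[name][i] for name, w in inferred_weights.items())
    for i in range(len(printed_reward))
]

matches = all(
    math.isclose(r, p, abs_tol=1e-9) for r, p in zip(recomputed, printed_reward)
)
print("weighted sum reproduces printed reward:", matches)
```

Under these inferred weights the judge score dominates the total, which is worth keeping in mind if the aggregate is later used to filter or rank rollouts.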





In [118]:
math_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1-mini",
    num_samples=10,
    max_concurrent=10,
)

for k, v in math_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)



[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Running 10 rollouts: 100%|██████████| 10/10 [00:42<00:00,  4.24s/it]


Evaluating 10 rollouts: 100%|██████████| 10/10 [00:00<00:00, 1002.58it/s]


[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Evaluating 10 rollouts: 100%|██████████| 10/10 [00:01<00:00,  6.06it/s]

correct_answer_reward_func avg: 0.70
correct_answer_reward_func all: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
tool_execution_reward_func avg: 0.20
tool_execution_reward_func all: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
format_reward_func avg: 0.96
format_reward_func all: [1.0, 1.0, 1.0, 1.0, 0.8, 1.0, 1.0, 1.0, 1.0, 0.8]
python_reward_func avg: 0.20
python_reward_func all: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
python_count_reward_func avg: 0.20
python_count_reward_func all: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
python_attempt_reward_func avg: 0.20
python_attempt_reward_func all: [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]
reward avg: 4.53
reward all: [5.2, 5.2, 5.2, 5.2, 5.359999999999999, 5.2, 4.2, 5.2, 0.2, 4.36]
judge_reward_func avg: 0.90
judge_reward_func all: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0]





In [119]:
dataset_outputs = vf_env.make_dataset(results)
dataset_outputs

Dataset({
    features: ['prompt', 'completion', 'answer', 'reward', 'task'],
    num_rows: 20
})

In [None]:
math_results = vf_env.evaluate(
    client=client,
    model="gpt-4.1",
    num_samples=10,
    max_concurrent=10,
)

for k, v in math_results.items():
    if "reward" in k:
        print(k, f"avg: {np.mean(v):.2f}")
        print(k, 'all:', v)

### Synthetic Data + Finetuning
- Generating rollouts
- Filtering via reward scores
- Key hyperparameters
- LoRA vs full fine-tuning (FFT)
- Working with GPUs + setting up remote environments
- Multi-GPU training (DeepSpeed)
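
The first two bullets (generate rollouts, then filter by reward) can be sketched in plain Python. The field names mirror the `Dataset` columns shown earlier (`prompt`, `completion`, `answer`, `reward`); the example rows, the threshold, and the `filter_rollouts` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical scored rollouts; in practice these would come from an
# evaluation run over the environment.
rollouts = [
    {"prompt": "What is 2 + 2?", "completion": "<answer>4</answer>", "answer": "4", "reward": 1.2},
    {"prompt": "What is 2 + 2?", "completion": "<answer>5</answer>", "answer": "4", "reward": 0.2},
    {"prompt": "What is 3 * 3?", "completion": "<answer>9</answer>", "answer": "9", "reward": 1.0},
    {"prompt": "What is 3 * 3?", "completion": "<answer>9</answer> ok", "answer": "9", "reward": 1.1},
    {"prompt": "What is 7 - 5?", "completion": "no idea", "answer": "2", "reward": 0.0},
]


def filter_rollouts(rollouts, min_reward, max_per_prompt=1):
    """Keep the highest-reward rollouts per prompt that clear min_reward."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        if r["reward"] >= min_reward:
            by_prompt[r["prompt"]].append(r)

    kept = []
    for group in by_prompt.values():
        # Sort best-first so the cap keeps the strongest samples.
        group.sort(key=lambda r: r["reward"], reverse=True)
        kept.extend(group[:max_per_prompt])
    return kept


# Reduce to the prompt/completion pairs an SFT trainer would consume.
sft_examples = [
    {"prompt": r["prompt"], "completion": r["completion"]}
    for r in filter_rollouts(rollouts, min_reward=1.0)
]
print(len(sft_examples), "examples kept out of", len(rollouts))
```

Capping the number of kept rollouts per prompt keeps easy prompts from dominating the SFT set; prompts with no rollout above the threshold simply drop out.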

Examples:
- https://github.com/willccbb/verifiers/tree/main/verifiers/examples/sft
- https://huggingface.co/docs/trl/en/sft_trainer
- Other trainers:
    - Unsloth https://github.com/unslothai/unsloth
    - Axolotl https://docs.axolotl.ai/ 
    - TorchTune https://github.com/pytorch/torchtune
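
For the LoRA vs FFT item above, a rough parameter count shows why LoRA is cheaper to train. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors of rank `r` instead. The dimensions below are illustrative, not taken from any specific model.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative sizes: one square projection matrix and a small rank.
d_in, d_out, rank = 4096, 4096, 16

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}):    {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.4%}")
```

This counts a single matrix only; the total saving depends on which layers are adapted and on the chosen rank.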

### (Prompt) Optimization via DSPy: Crash Course
- Models
    - https://dspy.ai/api/models/LM/
- Signatures
    - e.g. "question -> answer"
    - comma-separated inputs / outputs
    - https://medium.com/@aabhi02/dspy-deep-dive-part-2-understanding-signatures-410755fa29fe
- Modules
    - https://dspy.ai/api/modules/ReAct/
    - https://dspy.ai/tutorials/customer_service_agent/ 
- Metrics
- Optimizers
    - https://dspy.ai/api/optimizers/MIPROv2/
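
To make the signature and metric bullets concrete without requiring DSPy to be installed, here is a plain-Python sketch. `parse_signature` is a hypothetical helper that only illustrates the `"inputs -> outputs"` string format; it is not how DSPy parses signatures and ignores type annotations. The metric follows the `(example, pred, trace=None)` calling convention that DSPy metrics use, with `SimpleNamespace` objects standing in for DSPy examples and predictions.

```python
from types import SimpleNamespace


def parse_signature(signature: str):
    """Split "a, b -> c" into (["a", "b"], ["c"]). Illustrative only."""
    inputs, outputs = signature.split("->")

    def fields(part):
        return [name.strip() for name in part.split(",") if name.strip()]

    return fields(inputs), fields(outputs)


def exact_match_metric(example, pred, trace=None):
    """Return True when the predicted answer matches the gold answer.

    Comparison is case-insensitive and ignores surrounding whitespace.
    """
    return example.answer.strip().lower() == pred.answer.strip().lower()


print(parse_signature("question -> answer"))
print(parse_signature("context, question -> answer"))

# Stand-ins for a labeled example and a module's prediction.
example = SimpleNamespace(question="What is the capital of France?", answer="Paris")
good_pred = SimpleNamespace(answer=" paris ")
bad_pred = SimpleNamespace(answer="Lyon")

print(exact_match_metric(example, good_pred))
print(exact_match_metric(example, bad_pred))
```

A metric of this shape plays the same role as the reward functions earlier in the lecture: it scores one output against one reference, and an optimizer searches for prompts that raise the average score.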