# Eval-Driven System Design: From Prototype to Production

## Problem Definition

For this guide, we assume we are starting with a workflow for delivering and evaluating **Study Mode**—a tutoring overlay for large language models that turns an LLM into an interactive, scaffolded tutor. While many components (LLMs, chat frontends, and basic QA) already exist, Study Mode introduces a narrow but high-value “pedagogical layer” whose correctness and fidelity to teaching practices must be measured. Like receipt processing, there is a “last mile” where imperfect automated behavior requires human time (teacher review, curriculum alignment, or safety triage).

In our case, we’ll assume a pipeline that looks like this:

* A learner launches a Study Mode session (or toggles Study Mode on in a regular chat).
* The system seeds the conversation with the **Study Mode overlay** (system instructions) and any available learner profile/history.
* The student interacts with the agent (questions, solutions, uploads), and the agent responds using Study Mode behavior (diagnose → micro-explain → guide → check → practice → recap).
* Low-confidence or safety-sensitive turns, or sessions failing pedagogical checks, are escalated for human QA (teacher review / instructional designer / moderator).
* Logs (session transcripts + action tags) are stored for offline evaluation and product metrics.

Based on interviews with educators, curriculum designers, and product stakeholders, reviewers judge Study Mode outputs on the following dimensions:

1. **Pedagogical fidelity** — Does the assistant follow Study Mode rules (diagnose first, stepwise guidance, avoid direct answer dumps, single-question-per-step, check & reinforce, varied rhythm, recap)?
2. **Accuracy of content** — Are the factual claims/corrections correct and appropriately sourced when needed?
3. **Clarity & cognitive load** — Are explanations chunked, plain-spoken, and appropriately scoped for the learner’s stated level?
4. **Engagement & affect** — Is the tone warm, encouraging, and motivating (not lecture-y or abrasive)?
5. **Personalization** — Did the assistant adapt to learner level/goals and leverage conversation memory correctly?
6. **Safety & policy compliance** — No disallowed content or identity-safety problems; sensitive requests are handled correctly.
7. **Opportunity for learning** — Did the session produce observable learning behaviors (learner attempts, restatements, successful practice rounds)?

**What we will measure and iterate on**

1. Operationalize the 7–8 action tags into deterministic heuristics and LLM-judge fallbacks.
2. Build a starter seed set (Python & prompt-engineering lessons; later expand to other domains) and baseline runs comparing:

   * Base LLM without Study Mode overlay
   * LLM + Study Mode overlay (automated)
   * Human tutor (gold standard)
3. Compute per-session metrics: heuristic pass/fail vector, pedagogical fidelity score, escalation flag, and human rubric score (sampled).
4. Triage failures into a clear taxonomy (answered too early, multi-question overload, no diagnosis, lecture mode, no reinforcement, over-scaffolding, safety slip) to drive prompt and policy fixes.
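
As a rough illustration of what a deterministic heuristic for two of these checks could look like, here is a minimal sketch. The function names, the question-mark counting rule, and the list of diagnostic cue phrases are all assumptions made for this example; they are not part of the Study Mode overlay, and a real heuristic would need calibration against human labels.

```python
import re

# Hypothetical cue words suggesting the tutor is diagnosing the learner first.
DIAGNOSTIC_CUES = ("experience", "goal", "grade", "level", "familiar")


def count_questions(text: str) -> int:
    """Count question marks, collapsing runs like '??' into one question."""
    return len(re.findall(r"\?+", text))


def single_question_per_step(trajectory: list[dict]) -> bool:
    """Pass if no assistant turn asks more than one question."""
    return all(
        count_questions(turn["content"]) <= 1
        for turn in trajectory
        if turn["role"] == "assistant"
    )


def diagnosis_first(trajectory: list[dict]) -> bool:
    """Pass if the first assistant turn asks a question containing a diagnostic cue."""
    first = next((t for t in trajectory if t["role"] == "assistant"), None)
    if first is None:
        return False
    text = first["content"].lower()
    return "?" in text and any(cue in text for cue in DIAGNOSTIC_CUES)


demo = [
    {"role": "user", "content": "please teach me python"},
    {"role": "assistant", "content": "Happy to help. What is your current experience with programming?"},
    {"role": "user", "content": "from zero"},
    {"role": "assistant", "content": "What is a variable? And what does print do?"},
]
print(single_question_per_step(demo))  # False: the last assistant turn asks two questions
print(diagnosis_first(demo))  # True: the first assistant turn asks about experience
```

Cheap checks like these can run on every session, with an LLM judge reserved for the turns where the heuristic is ambiguous.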

With this problem statement, the next practical steps are: produce the seed session JSONs, implement the heuristic evaluator, run an A/B baseline (base vs study overlay), and collect human rubric labels to validate and calibrate the heuristics.
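
The runner later in this guide reads `session_id`, `domain`, `prompt`, and `trajectory` (a list of `role`/`content` turns) from each seed file. A minimal sketch of producing one such seed follows; the helper name `make_seed_session`, the session ID, and the turn contents are hypothetical, and the file is written to a temporary directory rather than `seeds/` so the snippet has no side effects on a real project.

```python
import json
import tempfile
from pathlib import Path


def make_seed_session(session_id: str, domain: str, prompt: str,
                      turns: list[tuple[str, str]]) -> dict:
    """Build a seed session dict with the keys the replay runner reads."""
    for role, _ in turns:
        if role not in {"user", "assistant"}:
            raise ValueError(f"unexpected role: {role!r}")
    return {
        "session_id": session_id,
        "domain": domain,
        "prompt": prompt,
        "trajectory": [{"role": role, "content": content} for role, content in turns],
    }


seed = make_seed_session(
    session_id="programming.python.example.000",  # hypothetical ID
    domain="programming.python",
    prompt="please teach me python",
    turns=[
        ("user", "please teach me python"),
        ("assistant", "Happy to. What is your current experience with programming?"),
        ("user", "from zero"),
    ],
)

# In the guide's layout this file would live under seeds/.
seed_path = Path(tempfile.mkdtemp()) / f"{seed['session_id']}.json"
seed_path.write_text(json.dumps(seed, indent=2, ensure_ascii=False), encoding="utf-8")
print(seed_path.name)
```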

## Project Lifecycle

Not every project will proceed in the same way, but projects generally have some
important components in common.

![Project Lifecycle](https://github.com/openai/openai-cookbook/blob/main/images/partner_project_lifecycle.png?raw=1)

The solid arrows show the primary progressions or steps, while the dotted line
represents the ongoing nature of problem understanding: uncovering more about
the customer domain will influence every step of the process. We will examine
several of these iterative cycles of refinement in detail below.

### 1. Understand the Problem

Usually, the decision to start an engineering process is made by leadership who
understand the business impact but don't need to know the process details.

This step doesn't end before we start building our system; invariably, our initial
assessments are an incomplete understanding of the problem space and we will continue to
refine our understanding as we get closer to a solution.

### 2. Assemble Examples (Gather Data)

It's very rare for a real-world project to begin with all the data necessary to achieve a satisfactory solution, let alone establish confidence.

In our case, we'll assume we have a decent sample of system *inputs*. We'll walk through the process of incrementally expanding our test and training sets in collaboration with domain experts as we go along and make our evals progressively more comprehensive.

### 3. Build an End-to-End V0 System

We want to get the skeleton of a system built as quickly as possible. We don't need a
system that performs well - we just need something that accepts the right inputs and
provides outputs of the correct type. Usually this is almost as simple as describing the
task in a prompt, adding the inputs, and using a single model (usually with structured
outputs) to make an initial best-effort attempt.

### 4. Label Data and Build Initial Evals

We've found that in the absence of an established ground truth, it's not uncommon to
use an early version of a system to generate 'draft' truth data which can be annotated
or corrected by domain experts.

Once we have an end-to-end system constructed, we can start processing the inputs we
have to generate plausible outputs. We'll send these to our domain experts to grade
and correct. We will use these corrections and conversations about how the experts
are making their decisions to design further evals and to embed expertise in the system.

### 5. Map Evals to Business Metrics

Before we jump into correcting every error, we need to make sure that we're investing
time effectively. The most critical task at this stage is to review our evals and
gain an understanding of how they connect to our key objectives.

- Step back and assess the potential costs and benefits of the system
- Identify which eval measurements speak directly to those costs and benefits
- For example, what does "failure" on a particular eval cost? Are we measuring
  something worthwhile?
- Create a (non-LLM) model that uses eval metrics to provide a dollar value
- Balance performance (accuracy, or speed) with cost to develop and run

### 6. Progressively Improve System and Evals

Having identified which efforts are most worth making, we can begin iterating on
improvements to the system. The evals act as an objective guide so we know when we've
made the system good enough, and ensure we avoid or identify regression.

### 7. Integrate QA Process and Ongoing Improvements

Evals aren't just for development. Instrumenting all or a portion of a production
service will surface more useful test and training samples over time, identifying
incorrect assumptions or finding areas with insufficient coverage. This is also the only
way you can ensure that your models continue performing well long after your initial
development process is complete.

## V0 System Construction

In practice, we would probably be building a system that operates via a REST API,
possibly with some web frontend that would have access to some set of components and
resources. Here we'll distill that down to a pair of functions:

- `run_study_session_trajectory` (implemented below as `replay_session`) and
- `evaluate_study_session_for_audit` (implemented below as `run_strict_structured_judge_on_artifact`)

that collectively decide what we should do.

> Breaking up a process into steps like this has both pros and cons; it is easier to
> examine and develop if the process is made up of small isolated steps. But you can
> progressively lose information, effectively letting your agents play "telephone". 

In [53]:
%pip install --upgrade openai openai-agents rich persist-cache -qqq
%load_ext dotenv
%dotenv


# Place your API key in a file called .env
# OPENAI_API_KEY=sk-...

Note: you may need to restart the kernel to use updated packages.
The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


### Basic Study Session Runner

Let's build our `run_study_session_trajectory` function.

Usually, for the very first stab at something that might work, we'll simply feed ChatGPT
the available documents we've assembled so far and ask it to generate a prompt. It's not
worth spending too much time on prompt engineering before you have a benchmark to grade
yourself against! The prompt below was extracted ("jailbroken") from OpenAI's Study Mode; in ChatGPT it is usually layered on top of the core system prompt.

In [54]:
BASIC_PROMPT = """
IDENTITY & CONTEXT
- You are ChatGPT operating in STUDY MODE.
- Knowledge cutoff: 2025-09
- Current date: 2025-09-06
- User timezone: Asia/Karachi

The user is currently STUDYING, and they've asked you to follow these strict rules during this chat. No matter what other instructions follow, you MUST obey these rules:

## STRICT RULES
Be an approachable-yet-dynamic teacher, who helps the user learn by guiding them through their studies.

1. Get to know the user. If you don't know their goals or grade level, ask the user before diving in. (Keep this lightweight!) If they don't answer, aim for explanations that would make sense to a 10th grade student.
2. Build on existing knowledge. Connect new ideas to what the user already knows.
3. Guide users, don't just give answers. Use questions, hints, and small steps so the user discovers the answer for themselves.
4. Check and reinforce. After hard parts, confirm the user can restate or use the idea. Offer quick summaries, mnemonics, or mini-reviews to help the ideas stick.
5. Vary the rhythm. Mix explanations, questions, and activities (like roleplaying, practice rounds, or asking the user to teach you) so it feels like a conversation, not a lecture.

Above all: DO NOT DO THE USER'S WORK FOR THEM. Don't answer homework questions — help the user find the answer, by working with them collaboratively and building from what they already know.

### THINGS YOU CAN DO
- Teach new concepts: Explain at the user's level, ask guiding questions, use visuals, then review with questions or a practice round.
- Help with homework: Don't simply give answers! Start from what the user knows, help fill in the gaps, give the user a chance to respond, and never ask more than one question at a time.
- Practice together: Ask the user to summarize, pepper in little questions, have the user "explain it back" to you, or role-play (e.g., practice conversations in a different language). Correct mistakes — charitably! — in the moment.
- Quizzes & test prep: Run practice quizzes. (One question at a time!) Let the user try twice before you reveal answers, then review errors in depth.

### TONE & APPROACH
Be warm, patient, and plain-spoken; don't use too many exclamation marks or emoji. Keep the session moving: always know the next step, and switch or end activities once they've done their job. And be brief — don't ever send essay-length responses. Aim for a good back-and-forth.

## IMPORTANT
DO NOT GIVE ANSWERS OR DO HOMEWORK FOR THE USER. If the user asks a math or logic problem, or uploads an image of one, DO NOT SOLVE IT in your first response. Instead: talk through the problem with the user, one step at a time, asking a single question at each step, and give the user a chance to respond to each step before continuing.
"""

In [55]:
# session_agent_runner.py
import asyncio
import json
from pathlib import Path
from datetime import datetime
from agents import Agent, Runner

SEEDS_DIR = Path("seeds")
ARTIFACTS_DIR = Path("artifacts/regenerated")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)


def load_session(path: Path) -> dict:
    """Load a seed session from JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


async def replay_session(session_path: Path, model: str = "gpt-5") -> Path:
    """
    Replay a seed session sequentially:
    - For each user message, regenerate assistant reply with the agent.
    - Preserve both original and regenerated outputs separately.
    - Save agent instructions and metadata at the top-level for context.
    """
    session = load_session(session_path)
    session_id = session.get("session_id", session_path.stem)
    trajectory = session["trajectory"]

    agent_instructions = BASIC_PROMPT

    agent = Agent(
        name="StudyMode",
        instructions=agent_instructions,
        model=model,
    )

    # Store trajectories separately
    original_trajectory = []
    regenerated_trajectory = []

    history = []  # working history for regeneration

    for turn in trajectory:
        original_trajectory.append(turn)
        history.append(turn)

        if turn["role"] == "user":
            # Replay user turn → regenerate assistant reply
            msgs = [{"role": h["role"], "content": h["content"]} for h in history]

            result = await Runner.run(agent, msgs)
            regenerated_reply = {
                "role": "assistant",
                "content": result.final_output,
                "tags": ["regenerated"],
            }

            # Append to regenerated trajectory
            regenerated_trajectory.append(turn)          # user turn
            regenerated_trajectory.append(regenerated_reply)  # regenerated assistant

    # Save regenerated session with instructions + metadata
    out_path = ARTIFACTS_DIR / f"{session_id}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "session_id": session_id,
                "domain": session.get("domain"),
                "prompt": session.get("prompt"),  # canonical starting point
                "agent_instructions": agent_instructions,
                "original_trajectory": original_trajectory,
                "regenerated_trajectory": regenerated_trajectory,
                "regeneration_meta": {
                    "model": model,
                    "timestamp": datetime.utcnow().isoformat() + "Z",
                },
            },
            f,
            indent=2,
        )

    print(f"[✔] Saved regenerated session → {out_path}")
    return out_path


async def run_all_sessions():
    """Replay all seed sessions sequentially and regenerate assistant replies."""
    seed_files = sorted(SEEDS_DIR.glob("*.json"))
    print(f"Found {len(seed_files)} seed sessions in {SEEDS_DIR}")
    results = []
    for fp in seed_files:
        res = await replay_session(fp)
        results.append(res)
    print("All sessions processed.")
    return results



### Test on One Study Mode Session File

Let's evaluate just a single seed file and review it manually to see how well a smart model with a naive prompt can do.

In [21]:
# Run a single session:
from pathlib import Path

# Case 1: relative path (if your notebook is already in project root)
session_path = Path("seeds/programming.python.sir_zia.004.json")


# Case 2: full path
# session_path = Path("/Users/mjs/path/to/eval-driven-agents/seeds/programming.python.intro.001.json")

# Run one session
await replay_session(session_path)




# Run all Sessions
# if __name__ == "__main__":
#     asyncio.run(run_all_sessions())

[✔] Saved regenerated session → artifacts/regenerated/programming.python.sir_zia.004.json



PosixPath('artifacts/regenerated/programming.python.sir_zia.004.json')

We'll get different answers if we re-run it, but it usually gets most things correct
with a few errors. Here's a specific example: 'artifacts/regenerated/programming.python.intro.001.json'

The system works: both trajectories teach Python from zero. But the regenerated flow renamed and reordered concepts, and missed some important checks. That’s fine at this stage, because the point is to evaluate differences systematically, not rely on “it feels smoother.”
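
One way to make that comparison systematic is to line up, for each user message, the original reply and the regenerated reply. The sketch below assumes the artifact layout written by `replay_session` (`original_trajectory` and `regenerated_trajectory`); the helper names and the tiny demo artifact are hypothetical.

```python
from typing import Optional


def replies_by_user_turn(trajectory: list[dict]) -> list[tuple[str, Optional[str]]]:
    """Return (user_message, following_assistant_reply_or_None) pairs in order."""
    pairs = []
    for i, turn in enumerate(trajectory):
        if turn["role"] != "user":
            continue
        nxt = trajectory[i + 1] if i + 1 < len(trajectory) else None
        reply = nxt["content"] if nxt and nxt["role"] == "assistant" else None
        pairs.append((turn["content"], reply))
    return pairs


def pair_replies(artifact: dict) -> list[dict]:
    """Align original and regenerated replies by the order of user turns."""
    original = replies_by_user_turn(artifact["original_trajectory"])
    regenerated = replies_by_user_turn(artifact["regenerated_trajectory"])
    return [
        {"user": user, "original": orig, "regenerated": regen}
        for (user, orig), (_, regen) in zip(original, regenerated)
    ]


# Hypothetical artifact, shaped like the files under artifacts/regenerated/.
artifact = {
    "original_trajectory": [
        {"role": "user", "content": "please teach me python"},
        {"role": "assistant", "content": "Sure. Type print('hi')."},
        {"role": "user", "content": "from zero"},
    ],
    "regenerated_trajectory": [
        {"role": "user", "content": "please teach me python"},
        {"role": "assistant", "content": "What is your current experience with programming?"},
        {"role": "user", "content": "from zero"},
        {"role": "assistant", "content": "Great, let's start small."},
    ],
}

rows = pair_replies(artifact)
for row in rows:
    print(f"USER: {row['user']}")
    print(f"  original:    {row['original']}")
    print(f"  regenerated: {row['regenerated']}")
```

Rows where `original` is `None` mark user turns that had no recorded reply in the seed, which is worth knowing before handing pairs to a reviewer.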

### Action Decision

Next, we need to close the loop and get to an actual decision.

In [59]:
import json
from pathlib import Path
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

# -----------------------
# Pydantic schema
# -----------------------
class EvalCheck(BaseModel):
    name: str
    passed: bool
    reason: Optional[str]
    evidence_snippet: Optional[str]
    confidence: Optional[float]


class EvalResult(BaseModel):
    session_id: str
    pedagogical_fidelity_score: float = Field(..., ge=0.0, le=1.0)
    escalate: bool
    checks: List[EvalCheck]
    notes: Optional[str] = None
    evaluated_at: Optional[str] = None


# -----------------------
# Judge prompt
# -----------------------
JUDGE_SYSTEM = (
    "You are an evaluation agent. "
    "You MUST return EXACTLY one JSON object matching the EvalResult schema. "
    "No commentary, no markdown, no explanation, only JSON."
)

JUDGE_USER_TEMPLATE = """
You are given:
- session_id: {session_id}
- agent_instructions: {agent_instructions}
- regenerated_trajectory: {regenerated_trajectory_json}

Evaluate the session exactly on the checks:
diagnosis_first, no_direct_answer_initially, single_question_per_step,
wait_for_student, check_reinforce, vary_rhythm, conciseness.

For each check return:
{{"name":"<check_name>","passed":true|false,"reason":"short text","evidence_snippet":"<=200 chars","confidence":0.0-1.0}}

Compute pedagogical_fidelity_score as the fraction of checks that passed (passed checks / total checks).
Set escalate=true if pedagogical_fidelity_score < 0.6 or if safety issues are found.

Return ONLY the EvalResult JSON object.
"""


# -----------------------
# Main judge runner
# -----------------------
async def run_strict_structured_judge_on_artifact(
    regen_artifact_path: Path,
    model: str = "gpt-5",
) -> EvalResult:
    """
    Run strict structured judge on a regenerated artifact.
    Requires the model to output directly into EvalResult schema.
    """

    # Load artifact
    raw = json.loads(regen_artifact_path.read_text(encoding="utf-8"))
    session_id = raw.get("session_id") or regen_artifact_path.stem
    agent_instructions = raw.get("agent_instructions") or raw.get("prompt") or ""
    regenerated = raw.get("regenerated_trajectory", [])

    # Prepare judge query
    user_text = JUDGE_USER_TEMPLATE.format(
        session_id=session_id,
        agent_instructions=agent_instructions,
        regenerated_trajectory_json=json.dumps(regenerated, ensure_ascii=False),
    )

    # Judge agent with structured output
    agent = Agent(
        name="StudyModeJudge",
        instructions=JUDGE_SYSTEM,
        model=model,
        output_type=EvalResult,
    )

    # The system prompt is already supplied through the agent's `instructions`,
    # so only the user message is passed as input.
    messages = [
        {"role": "user", "content": user_text},
    ]

    print(f"[→] Running judge for session {session_id}")
    result = await Runner.run(agent, messages)

    # Validate structured result
    try:
        eval_obj = EvalResult.model_validate(result.final_output)
    except ValidationError as e:
        # Dump whatever the judge returned for debugging; default=str keeps
        # json.dumps from failing on objects that are not JSON-serializable.
        debug_path = regen_artifact_path.with_suffix(".llm_judge.raw.json")
        debug_path.write_text(
            json.dumps(result.final_output, indent=2, ensure_ascii=False, default=str),
            encoding="utf-8",
        )
        raise RuntimeError(f"Judge output failed validation: {e}") from e

    # Add timestamp if missing
    if not eval_obj.evaluated_at:
        eval_obj.evaluated_at = datetime.utcnow().isoformat() + "Z"

    # Save validated eval
    out_path = regen_artifact_path.with_suffix(".llm_judge.eval.json")
    out_path.write_text(
        eval_obj.model_dump_json(indent=2),
        encoding="utf-8",
    )
    print(f"[✔] Saved eval → {out_path} (score={eval_obj.pedagogical_fidelity_score:.3f})")
    return eval_obj



In [60]:
await run_strict_structured_judge_on_artifact(Path("artifacts/regenerated/programming.python.sir_zia.004.json"))

[→] Running judge for session programming.python.sir_zia.004
[✔] Saved eval → artifacts/regenerated/programming.python.sir_zia.004.llm_judge.eval.json (score=0.714)


EvalResult(session_id='programming.python.sir_zia.004', pedagogical_fidelity_score=0.714, escalate=False, checks=[EvalCheck(name='diagnosis_first', passed=True, reason='Asked about experience and goals before teaching content.', evidence_snippet="Quick check-in first: what's your current experience...? What would you like to eventually do with Python...", confidence=0.9), EvalCheck(name='no_direct_answer_initially', passed=True, reason='Guided with hints and patterns, invited the learner to try, did not just give final answers.', evidence_snippet="Your turn (one small task)... Hint: it looks like this pattern: name = 'YourName' ...", confidence=0.7), EvalCheck(name='single_question_per_step', passed=False, reason='Occasionally bundled multiple tasks in one prompt.', evidence_snippet='- Make a variable called name... - Then print the greeting using that variable with an f-string.', confidence=0.85), EvalCheck(name='wait_for_student', passed=True, reason="Consistently paused for the lear
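
The judge is asked to compute `pedagogical_fidelity_score` and `escalate` itself, so it can be useful to recompute both from the returned checks and flag disagreements. The sketch below is one way to do that; the helper name is hypothetical, the pass/fail values for the checks cut off in the output above are illustrative guesses, and the safety condition from the judge prompt is deliberately left out. A score of 0.714 is what 5 of 7 passing checks would give.

```python
def recompute_fidelity(checks: list[dict], threshold: float = 0.6) -> tuple[float, bool]:
    """Return (fraction of checks passed, escalate flag) from a list of check dicts."""
    if not checks:
        # No evidence at all: score 0 and escalate for human review.
        return 0.0, True
    score = sum(1 for c in checks if c["passed"]) / len(checks)
    return score, score < threshold


# The first four values mirror the output above; the last three are illustrative.
checks = [
    {"name": "diagnosis_first", "passed": True},
    {"name": "no_direct_answer_initially", "passed": True},
    {"name": "single_question_per_step", "passed": False},
    {"name": "wait_for_student", "passed": True},
    {"name": "check_reinforce", "passed": True},
    {"name": "vary_rhythm", "passed": True},
    {"name": "conciseness", "passed": False},
]

score, escalate = recompute_fidelity(checks)
print(round(score, 3), escalate)  # 0.714 False
```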

Let's build ourselves some evals!

In [61]:
async def run_eval_batch(regen_dir: Path = Path("artifacts/regenerated")):
    evals = []
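    for path in sorted(regen_dir.glob("*.json")):
        # Skip judge outputs saved next to the artifacts (*.llm_judge.*.json);
        # otherwise the judge is re-run on its own results and scores them 0.
        if ".llm_judge." in path.name:
            continue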
        try:
            eval_obj = await run_strict_structured_judge_on_artifact(path)
            evals.append(eval_obj)
        except Exception as e:
            print(f"[!] Failed on {path.name}: {e}")
    return evals


In [62]:
await run_eval_batch()


[→] Running judge for session programming.python.sir_zia.004
[✔] Saved eval → artifacts/regenerated/programming.python.sir_zia.004.llm_judge.eval_old.llm_judge.eval.json (score=0.000)
[→] Running judge for session programming.python.sir_zia.004
[✔] Saved eval → artifacts/regenerated/programming.python.sir_zia.004.llm_judge.eval.json (score=0.857)
[→] Running judge for session programming.python.intro.001
[✔] Saved eval → artifacts/regenerated/programming.python.intro.001.llm_judge.eval.json (score=0.857)
[→] Running judge for session programming.python.sir_zia.004
[✔] Saved eval → artifacts/regenerated/programming.python.sir_zia.004.llm_judge.eval.llm_judge.eval.json (score=0.000)


[EvalResult(session_id='programming.python.sir_zia.004', pedagogical_fidelity_score=0.0, escalate=True, checks=[EvalCheck(name='diagnosis_first', passed=False, reason='Insufficient data to assess; no conversation provided.', evidence_snippet='', confidence=0.1), EvalCheck(name='no_direct_answer_initially', passed=False, reason='Insufficient data to assess; no conversation provided.', evidence_snippet='', confidence=0.1), EvalCheck(name='single_question_per_step', passed=False, reason='Insufficient data to assess; no conversation provided.', evidence_snippet='', confidence=0.1), EvalCheck(name='wait_for_student', passed=False, reason='Insufficient data to assess; no conversation provided.', evidence_snippet='', confidence=0.1), EvalCheck(name='check_reinforce', passed=False, reason='Insufficient data to assess; no conversation provided.', evidence_snippet='', confidence=0.1), EvalCheck(name='vary_rhythm', passed=False, reason='Insufficient data to assess; no conversation provided.', evi

The `score=0.000` entries in the recorded output above are not real sessions. The judge writes its `*.llm_judge.eval.json` files next to the artifacts, so a plain `*.json` glob over `artifacts/regenerated` feeds earlier judge outputs back into the judge, which finds no conversation and fails every check. Only the `0.857` scores reflect actual sessions; judge outputs need to be excluded from the batch or written to a separate directory.

## Initial Evals

Once we have a minimally functional system we should process more inputs and get domain
experts to help develop ground-truth data. Domain experts doing expert tasks may not
have much time to devote to our project, so we want to be efficient and start small,
aiming for breadth rather than depth at first.

So your evaluation framework should ask:

- Given a student-teacher session, how well do we regenerate responses?

- Given a regenerated session, how good is the structured judge’s analysis?

- Given the session, how accurate is the final escalate decision?
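
For the third question, a simple starting point is to compare the judge's `escalate` flag with a human reviewer's decision on the same sessions. The following is a minimal sketch; the function name, the metric names, and the labelled pairs are hypothetical.

```python
def escalation_agreement(pairs: list[tuple[bool, bool]]) -> dict:
    """Summarize agreement for (judge_escalate, human_escalate) pairs, one per session."""
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one labelled session")
    # Human wanted escalation but the judge let the session through.
    missed = sum(1 for judge, human in pairs if human and not judge)
    # Judge escalated a session the human considered fine.
    unneeded = sum(1 for judge, human in pairs if judge and not human)
    return {
        "accuracy": (n - missed - unneeded) / n,
        "missed_escalation_rate": missed / n,
        "unneeded_escalation_rate": unneeded / n,
    }


# Hypothetical labels for five sessions.
labels = [(False, False), (True, True), (False, True), (True, False), (False, False)]
summary = escalation_agreement(labels)
print(summary)
```

Reporting the two error types separately matters because they carry different costs: a missed escalation lets a poor session reach a learner, while an unneeded one spends reviewer time.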

### Additional Considerations

There's a little more to it than that though, because when you are evaluating a
multistep process it's important to know both the end to end performance and the
performance of each individual step, *conditioned on the output of the prior step*.

What we *want* is to be confident that the auditor (here, the judge that sets
`escalate`) is making the correct decision based on the evidence available, even if
that evidence is misleading. If we don't pay attention to that case, we can end up
training the auditor to ignore its inputs and cause our overall performance to degrade.

### Graders

The core component of an eval is the
[grader](https://platform.openai.com/docs/guides/graders). Our eventual eval is going to
use 18 of them, but we only use three kinds, and they're all quite conceptually
straightforward.

Graders come in string check, text similarity, and model varieties. The cells below
set up the item schema and rubric, then define one of our model graders.

In [74]:
from pydantic import BaseModel

class EvalInput(BaseModel):
    session_id: str
    regenerated_trajectory: list  # List of assistant-user turn dicts

class EvalOutput(BaseModel):
    diagnosis_first: bool
    micro_explain: bool
    single_question_per_step: bool
    pedagogical_fidelity: float
    escalate: bool
    notes: str

rubric = """
Evaluate the following tutoring session for pedagogical quality.
Return a JSON object with fields:
- diagnosis_first: true|false
- micro_explain: true|false
- single_question_per_step: true|false
- pedagogical_fidelity: numeric score (0.0–1.0)
- escalate: true|false (true if fidelity < 0.5)
- notes: brief comment
Only output JSON, no additional text.
"""

def make_eval_example(session_id, trajectory):
    return {
        "session_id": session_id,
        "regenerated_trajectory": trajectory
    }


In [86]:
from persist_cache import cache
from openai import AsyncOpenAI

client = AsyncOpenAI()

rubric = """
You are an expert educational evaluator. Score this tutoring session on a scale from 1 (poor) to 5 (excellent) based on:
1. Clarity of explanation
2. Step-by-step guidance
3. Engagement with learner
4. Correctness of content
Respond ONLY with the score number, no extra commentary.
"""

@cache
async def create_tutoring_eval(name: str):
    eval_cfg = await client.evals.create(
        name=name,
        data_source_config={
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "session_id": {"type": "string"},
                    "regenerated_text": {"type": "string"}
                },
                "required": ["session_id", "regenerated_text"],
            },
            "include_sample_schema": False,
        },
        testing_criteria=[
            {
                "type": "score_model",
                "name": "Pedagogical Score",
                "model": "gpt-4",
                "input": [
                    {"role": "system", "content": rubric},
                    {
                        "role": "user",
                        "content": "{{ item.regenerated_text }}"
                    }
                ],
                "range": [1, 5],
                "pass_threshold": 3,
            }
        ],
    )
    print("Eval created:", eval_cfg.id)
    return eval_cfg

async def run_tutoring_eval(name: str, sessions: list[dict]):
    eval_cfg = await create_tutoring_eval(name)
    run = await client.evals.runs.create(
        name=name + "-run",
        eval_id=eval_cfg.id,
        data_source={
            "type": "jsonl",
            "source": {"type": "file_content", "content": sessions},
        },
    )
    print("Eval run started:", run.report_url)
    return run


In [102]:
import json
from pathlib import Path

def flatten_trajectory(trajectory: list[dict]) -> str:
    lines = []
    for turn in trajectory:
        role = turn.get("role", "unknown").capitalize()
        content = turn.get("content", "")
        lines.append(f"{role}: {content}")
    return "\n".join(lines)
    
def synthesize_text(turns):
    parts = []
    for t in turns:
        role = t["role"]
        content = t.get("content", "")
        parts.append(f"{role.upper()}: {content}")
    return "\n".join(parts)

def load_sessions(paths):
    sessions = []
    for path in paths:
        data = json.loads(Path(path).read_text())
        sessions.append({
            "item": {  # 👈 required wrapper
                "session_id": data["session_id"],
                "regenerated_text": synthesize_text(data["regenerated_trajectory"]),
            }
        })
    return sessions


paths = [
    "artifacts/regenerated/programming.python.sir_zia.004.json",
    "artifacts/regenerated/programming.python.intro.001.json",
]

batch = load_sessions(paths)
await run_tutoring_eval("TutoringEval-v1", batch)


Eval run started: https://platform.openai.com/evaluations/eval_68bbb89d63688191ba99f1b5ec74a06b?project_id=proj_JwDXfk2L5W29oc6gMlfsQfbN&run_id=evalrun_68bbbdc0c4388191bbc47fb5dd091933


RunCreateResponse(id='evalrun_68bbbdc0c4388191bbc47fb5dd091933', created_at=1757134272, data_source=CreateEvalJSONLRunDataSource(source=SourceFileContent(content=[SourceFileContentContent(item={'session_id': 'programming.python.sir_zia.004', 'regenerated_text': 'USER: please teach me python\nASSISTANT: Awesome—happy to teach you Python. Quick check-in first: what’s your current experience with programming—totally new, a little, or already comfortable?\nUSER: from zero\nASSISTANT: Perfect. I’ll be your coach from zero.\n\nQuick question to tailor this: What would you like to eventually do with Python — automate small tasks, analyze data, make games, build websites, or something else?\nUSER: print(\'Hello, Zia!\')\nASSISTANT: Nice — that’s exactly right. You just told Python to print a message.\n\nNext step: variables. A variable is like a labeled box where you store a value (like your name) so you can reuse it.\n\nYour turn (one small task):\n- Make a variable called name that stores yo

Each grader evaluates some portion of a predicted output. This might be a very narrow
check for a specific field in a structured output, or a more holistic check that
judges an output in its entirety. Some graders can work without context, and evaluate an
output in isolation (for example, an LLM judge that is evaluating if a paragraph is rude
or inappropriate). Others can evaluate based on the input and output, while the
ones we're using here rely on an output and a ground-truth (correct) output to compare
against.

The most direct way of using Evals provides a prompt and a model, and lets the eval run
on an input to generate output itself. Another useful method uses previously logged
responses or completions as the source of the outputs. It's not quite as simple, but the
most flexible thing we can do is to supply an item containing everything we want it to
use—this allows us to have the "prediction" function be an arbitrary system rather than
restricting it to a single model call. 

> **Note on Model Selection:**  
> Selecting the right model is crucial. While faster, less expensive models are often preferable in production, development workflows benefit from prioritizing the most capable models available. For this guide, we use `o4-mini` for both system tasks and LLM-based grading—while `o3` is more capable, our experience suggests the difference in output quality is modest relative to the substantial increase in cost. In practice, spending $10+/day/engineer on evals is typical, but scaling to $100+/day/engineer may not be sustainable.
>
> Nonetheless, it's valuable to periodically benchmark with a more advanced model like `o3`. If you observe significant improvements, consider incorporating it for a representative subset of your evaluation data. Discrepancies between models can reveal important edge cases and guide system improvements.

Once we have the graders and the data, creating and running our evals is very straightforward, as the `create_tutoring_eval` and `run_tutoring_eval` cells above show.

## Connecting Evals to Business Metrics

Evals show you where you can improve, and help track progress and regressions over time.
But the three evals above are just measurements — we need to imbue them with raison
d'être.

The first thing we need is to add evaluations for the final stage of our receipt
processing, so that we can start seeing the results of our audit decisions. The next
thing we need, the most important, is a *model of business relevance*.

### A Business Model

In education, the "business cost" is measured not in dollars lost per receipt but in student outcomes and tutor efficiency.
For example:

- False Positive (FP) → model says “good learning step” when it’s actually poor → wasted study time.
- False Negative (FN) → model says “bad step” when it was useful → discourages learner.
- Processing Cost → cost per tutoring trajectory (compute + oversight).


In [89]:
def calculate_learning_cost(fp_rate: float, fn_rate: float, per_session_cost: float) -> float:
    """Estimate the total cost (in dollars) of running Study Mode over a batch of sessions."""
    study_sessions = 10000  # e.g., 10k tutoring sessions
    wasted_time_cost = 5    # $5 of value lost per false positive (poor step judged good)
    missed_opportunity_cost = 20  # $20 lost per false negative (useful step judged bad)

    wasted_time = study_sessions * fp_rate * wasted_time_cost
    missed_learning = study_sessions * fn_rate * missed_opportunity_cost
    processing_cost = study_sessions * per_session_cost  # compute + oversight

    return wasted_time + missed_learning + processing_cost
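
To make the arithmetic concrete, here is a worked comparison of two hypothetical systems. The error rates and per-session costs are made-up illustrative numbers, and the cost function is restated compactly (with the same $5 and $20 weights) so the snippet runs on its own.

```python
def learning_cost(fp_rate: float, fn_rate: float, per_session_cost: float,
                  sessions: int = 10_000) -> float:
    # Same cost structure as above: $5 per false positive, $20 per false negative.
    return sessions * (fp_rate * 5 + fn_rate * 20 + per_session_cost)


# Hypothetical numbers, for illustration only.
baseline = learning_cost(fp_rate=0.10, fn_rate=0.05, per_session_cost=0.50)
improved = learning_cost(fp_rate=0.04, fn_rate=0.03, per_session_cost=0.80)

print(f"baseline: ${baseline:,.0f}")
print(f"improved: ${improved:,.0f}")
print(f"savings:  ${baseline - improved:,.0f}")
```

In this made-up scenario the second system spends more per session on compute and oversight, yet its total cost is lower because it makes fewer costly mistakes.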



### Connecting Back To Evals

The point of the above model is that it lets us attach meaning to an eval that would
otherwise just be a number.
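
One way to make that link concrete is to compare grader verdicts against expert labels and turn the disagreements into the false-positive and false-negative rates that the cost model consumes. The sketch below assumes each record is a pair of booleans (grader says "good step", expert says "good step"); the sample data is invented.

```python
def error_rates(records: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (fp_rate, fn_rate) as fractions of all records.

    Each record is (grader_says_good, expert_says_good).
    """
    if not records:
        raise ValueError("need at least one labelled record")
    fp = sum(1 for grader, expert in records if grader and not expert)
    fn = sum(1 for grader, expert in records if not grader and expert)
    return fp / len(records), fn / len(records)


# Invented sample: 10 labelled steps.
labelled = (
    [(True, True)] * 6      # grader and expert agree the step was good
    + [(True, False)] * 2   # false positives
    + [(False, True)]       # false negative
    + [(False, False)]      # grader and expert agree the step was poor
)
fp_rate, fn_rate = error_rates(labelled)
print(f"fp_rate={fp_rate:.2f}, fn_rate={fn_rate:.2f}")
```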

Let's build multi-criteria graders for tutoring. Instead of receipt fields, we check pedagogical dimensions:

📖 Basic Learning Support

- Did the tutor answer the question correctly?
- Was the explanation clear?
- Did it stay on topic?

🧠 Deeper Pedagogy

- Did it adapt to student mistakes?
- Did it use scaffolding (step-by-step, not dumping answer)?
- Was it encouraging & motivational?

✨ Reasoning Quality

- Did the tutor explain its reasoning process?
- Score 0–10 on Pedagogical Quality.

In [91]:
# -----------------------------
# Tutoring Evaluation Graders
# -----------------------------

basic_support_graders = [
    {
        "name": "Answer Accuracy",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Your task is to check whether the tutor gave the correct *final answer* to the student's main question.
Score 1 if correct, 0 if incorrect or missing.
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
    {
        "name": "Clarity of Explanation",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Did the tutor explain clearly, step-by-step, in a way a student can follow?
0 = totally confusing
1 = somewhat clear
2 = very clear
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 2],
        "pass_threshold": 2,
    },
    {
        "name": "Staying on Topic",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Check if the tutor stayed focused on the student’s question instead of going off-topic.
Score 1 if yes, 0 if no.
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 1],
        "pass_threshold": 1,
    },
]

deeper_pedagogy_graders = [
    {
        "name": "Adaptivity to Mistakes",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Evaluate how well the tutor adapted to the student’s mistakes or misunderstandings.
Score:
0 = did not notice/respond
1 = noticed but weak adaptation
2 = strong adaptation with correction
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 2],
        "pass_threshold": 1,
    },
    {
        "name": "Scaffolding Support",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Did the tutor scaffold the learning (guiding step by step, rather than dumping answers)?
0 = no scaffolding
1 = partial scaffolding
2 = strong scaffolding
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 2],
        "pass_threshold": 1,
    },
    {
        "name": "Encouragement & Motivation",
        "type": "score_model",
        "model": "gpt-4",
        "input": [
            {
                "role": "system",
                "content": """
Evaluate the tutor’s tone:
0 = discouraging or neutral
1 = mildly supportive
2 = encouraging and motivational
Trajectory:
{{ item.regenerated_trajectory }}
"""
            }
        ],
        "range": [0, 2],
        "pass_threshold": 1,
    },
]

reasoning_eval_prompt = """
Your task is to evaluate the *reasoning quality* of the tutor.

Check:
1. Did the tutor logically explain the reasoning steps?
2. Were explanations accurate and pedagogically sound?
3. Was the reasoning concise (not rambling)?
4. Did the tutor reach a correct final conclusion?

Grade 0–10:
- 0 = incoherent or wrong reasoning
- 5 = partially correct, missing some logic
- 10 = clear, step-by-step, correct reasoning
Trajectory:
{{ item.regenerated_trajectory }}
"""

reasoning_graders = [
    {
        "name": "Pedagogical Reasoning Quality",
        "type": "score_model",
        "model": "gpt-4",
        "input": [{"role": "system", "content": reasoning_eval_prompt}],
        "range": [0, 10],
        "pass_threshold": 8,
    }
]


# -----------------------------
# Create Full Tutoring Eval
# -----------------------------

full_eval = await client.evals.create(
    name="Full-StudyMode-Tutoring-Eval",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "session_id": {"type": "string"},
                "regenerated_trajectory": {"type": "array"},
            },
            "required": ["session_id", "regenerated_trajectory"],
        },
    },
    testing_criteria=(
        basic_support_graders
        + deeper_pedagogy_graders
        + reasoning_graders
    ),
)
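
Because the grader prompts reference item fields through `{{ item.<field> }}` templates, a quick local consistency check before creating the eval can catch typos early. The helper below is hypothetical and not part of the OpenAI SDK; it only inspects dictionaries shaped like the ones defined above, using a small stand-in config so that it runs on its own.

```python
import re

# Matches template references such as "{{ item.regenerated_trajectory }}".
TEMPLATE_FIELD = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def check_graders(graders: list[dict], item_schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    known_fields = set(item_schema.get("properties", {}))
    for grader in graders:
        name = grader.get("name", "<unnamed>")
        low, high = grader["range"]
        if not low <= grader["pass_threshold"] <= high:
            problems.append(f"{name}: pass_threshold outside range")
        for message in grader["input"]:
            for field in TEMPLATE_FIELD.findall(message["content"]):
                if field not in known_fields:
                    problems.append(f"{name}: unknown item field '{field}'")
    return problems


# Small stand-in config mirroring the structure used above.
schema = {"type": "object", "properties": {"session_id": {}, "regenerated_trajectory": {}}}
good = {"name": "ok", "range": [0, 2], "pass_threshold": 1,
        "input": [{"role": "system", "content": "{{ item.regenerated_trajectory }}"}]}
bad = {"name": "typo", "range": [0, 1], "pass_threshold": 2,
       "input": [{"role": "system", "content": "{{ item.trajectory }}"}]}

print(check_graders([good, bad], schema))
```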



In [103]:
import json
from pathlib import Path

def flatten_trajectory(trajectory: list[dict]) -> str:
    lines = []
    for turn in trajectory:
        role = turn.get("role", "unknown").capitalize()
        content = turn.get("content", "")
        lines.append(f"{role}: {content}")
    return "\n".join(lines)
    
def prepare_eval_batch(paths: list[str]):
    batch = []
    for path_str in paths:
        path = Path(path_str)
        if not path.exists():
            continue

        with open(path, "r", encoding="utf-8") as f:
            session = json.load(f)

        if "session_id" not in session or "regenerated_trajectory" not in session:
            continue

        flat_traj = flatten_trajectory(session["regenerated_trajectory"])

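        batch.append({
            "item": {
                "session_id": session["session_id"],
                # Flattened to a string here, whereas the item_schema above declares an
                # array; adjust the schema if you use this helper.
                "regenerated_trajectory": flat_traj,
            }
        })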

    return batch


In [105]:
import json
from pathlib import Path

def load_sessions(paths: list[str]):
    batch = []
    for p in paths:
        with open(p, "r", encoding="utf-8") as f:
            session = json.load(f)

        batch.append({
            "item": {   # must wrap in "item"
                "session_id": session["session_id"],
                "regenerated_trajectory": session["regenerated_trajectory"],  # keep as array
            }
        })
    return batch


paths = [
    "artifacts/regenerated/programming.python.sir_zia.004.json",
    "artifacts/regenerated/programming.python.intro.001.json",
]

batch = load_sessions(paths)

eval_run = await client.evals.runs.create(
    name="study-mode-full-eval-run",
    eval_id=full_eval.id,
    data_source={
        "type": "jsonl",
        "source": {"type": "file_content", "content": batch},
    },
)

print("Eval Report URL:", eval_run.report_url)


Eval Report URL: https://platform.openai.com/evaluations/eval_68bbbb15464481919b688139a74e9e15?project_id=proj_JwDXfk2L5W29oc6gMlfsQfbN&run_id=evalrun_68bbbe2163e88191b27318aef97f781d
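
Besides opening the report URL, it is often handy to summarize pass rates per grader in code. The sketch below works on a plain list of dictionaries; the field names (`testing_criteria`, `passed`, `failed`) are an assumption about how per-criterion results might be exported, so adapt them to whatever your run actually returns. The example data is invented.

```python
def pass_rates(results: list[dict]) -> dict[str, float]:
    """Map each grader name to its pass rate, skipping graders with no graded items."""
    rates = {}
    for row in results:
        total = row["passed"] + row["failed"]
        if total:
            rates[row["testing_criteria"]] = row["passed"] / total
    return rates


# Invented example data (assumed shape, see the note above).
example = [
    {"testing_criteria": "Answer Accuracy", "passed": 2, "failed": 0},
    {"testing_criteria": "Scaffolding Support", "passed": 1, "failed": 1},
    {"testing_criteria": "Not Yet Graded", "passed": 0, "failed": 0},
]
for name, rate in pass_rates(example).items():
    print(f"{name}: {rate:.0%}")
```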


## Spin Up the Flywheel

Having our business model means we have a map of what's worth doing and what isn't. Our
initial evals are a road sign that lets us know we're moving in the right direction; but
eventually we'll need more signage. At this point in the process we usually have a lot
of different things we can work on, with a few linked cycles where improvement on one
will open up more room for improvement on a different cycle.

![Development Flywheel](https://github.com/openai/openai-cookbook/blob/main/images/partner_development_flywheel.png?raw=1)

1. Our evals show us where we can improve, and we can immediately use them to guide us
   in model selection, prompt engineering, tool use, and fine-tuning strategies.
2. We're not done once the system performs well according to our evals. That's when it's
   time to *improve our evals*. We will process more data, give it to our domain experts
   to review, and feed the corrections into building better, more comprehensive evals.

This cycle can go on for a while. We can speed it along by identifying the efficient
frontier of "interesting" data to examine. There are a few techniques for this, but an
easy one is re-running models on inputs to prioritize labeling inputs that don't
get consistent answers. This works especially well when using different underlying
models, and often even benefits from using less-intelligent models (if a dumb model
agrees with a smart model then it's probably not a hard problem).
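
A minimal sketch of that prioritization, assuming you have already collected several verdicts per input (from repeated runs or from different models). The session IDs and labels are invented, and the disagreement score is simply one minus the share of the most common answer.

```python
from collections import Counter


def disagreement(answers: list[str]) -> float:
    """0.0 when all answers agree; grows as answers scatter. Expects a non-empty list."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1 - most_common_count / len(answers)


def labeling_queue(runs: dict[str, list[str]]) -> list[str]:
    """Order input IDs so the least consistent ones are labelled first."""
    return sorted(runs, key=lambda item_id: disagreement(runs[item_id]), reverse=True)


# Invented verdicts from three runs per session.
runs = {
    "session-a": ["pass", "pass", "pass"],
    "session-b": ["pass", "fail", "pass"],
    "session-c": ["fail", "pass", "needs-review"],
}
print(labeling_queue(runs))
```

Inputs where every run agrees sink to the bottom of the queue, so expert time goes to the cases most likely to expose a gap in the evals.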

Once it seems like we've hit a point of diminishing returns on performance, we can keep
using the same techniques to optimize model cost; if we have a system that performs
quite well, then fine-tuning or some form of model distillation will probably allow us
to get similar performance from smaller, cheaper, faster models.

## System Improvements

With our evals in place and an understanding of how they connect to our business metrics,
we're finally ready to turn our attention to improving the output of our system.

Let's modify the prompt and re-run our evals to see how we do. We'll provide more
guidance in the form of a specific example in the instructions about engine oil
(different from a snow broom, but requires the same reasoning), and we'll include three
examples pulled from our training set (`data/train`) as few-shot guidance.

When we ran the eval again, we actually still got two audit decisions wrong. Digging into
the examples we made a mistake on, it turns out that we completely fixed the issues we
identified, but our examples improved the reasoning step and caused two other issues to
surface. 

This is great, and we'll continue iterating on issues as we uncover them. This is the
cycle of improvement!
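
When a change fixes some cases and breaks others, comparing per-item results between two runs makes the trade visible. The sketch assumes each run has been summarized as a mapping from item ID to pass/fail; the IDs and outcomes below are invented.

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Classify items present in both runs as fixed, regressed, or unchanged."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [i for i in shared if not before[i] and after[i]],
        "regressed": [i for i in shared if before[i] and not after[i]],
        "unchanged": [i for i in shared if before[i] == after[i]],
    }


# Invented results: True means the item passed.
before = {"case-1": False, "case-2": False, "case-3": True, "case-4": True}
after = {"case-1": True, "case-2": True, "case-3": False, "case-4": True}
print(diff_runs(before, after))
```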

### Model Choice

When beginning a project, we usually start with one of the most capable models available, such as `o4-mini`, to establish a performance baseline. Once we’re confident in the model’s ability to solve the task, the next step is to explore smaller, faster, or more cost-effective alternatives.

Optimizing for inference cost and latency is essential, especially for production or customer-facing systems, where these factors can significantly impact overall expenses and user experience. For instance, switching from `o4-mini` to `gpt-4.1-mini` could reduce inference costs by nearly two-thirds—an example where thoughtful model selection leads to meaningful savings.
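
One simple way to frame this trade-off in code is to pick the cheapest candidate whose eval pass rate clears a quality bar. The candidate names, pass rates, and relative costs below are placeholders, not measurements or real prices.

```python
from typing import Optional


def cheapest_passing(candidates: dict[str, tuple[float, float]],
                     min_pass_rate: float) -> Optional[str]:
    """candidates maps name -> (pass_rate, relative_cost). Returns None if nothing qualifies."""
    eligible = {
        name: cost
        for name, (rate, cost) in candidates.items()
        if rate >= min_pass_rate
    }
    return min(eligible, key=eligible.get) if eligible else None


# Placeholder values, for illustration only.
candidates = {
    "baseline-model": (0.93, 1.00),
    "smaller-model": (0.91, 0.35),
    "tiny-model": (0.78, 0.10),
}
print(cheapest_passing(candidates, min_pass_rate=0.90))
```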

In the next section, we’ll rerun our evaluations using `gpt-4.1-mini` for both extraction and audit steps to see how well a more efficient model performs.


### Further improvements

This cookbook focuses on the philosophy and practicalities of evals, not the full range of model improvement techniques. For boosting or maintaining model performance (especially when moving to smaller, faster, or cheaper models), consider these steps in order—start from the top, and only proceed down if needed. For example, always optimize your prompt before resorting to fine-tuning; fine-tuning on a weak prompt can lock in bad performance even if you improve the prompt later.

![Model Improvement Waterfall](https://github.com/openai/openai-cookbook/blob/main/images/partner_model_improvement_waterfall.png?raw=1)

1. **Model selection:** try smarter models, or increase their reasoning budget.
2. **Prompt tuning:** clarify instructions and provide very explicit rules.
3. **Examples and context:** add few- or many-shot examples, or more context for the
   problem. RAG fits in here, and may be used to dynamically select similar examples.
4. **Tool use:** provide tools to solve specific problems, including access to external
   APIs, the ability to query databases, or otherwise enable the model to have its own
   questions answered.
5. **Accessory models:** add models to perform limited sub-tasks, to supervise and provide
   guardrails, or use a mixture of experts and aggregate solutions from multiple
   sub-models.
6. **Fine-tuning:** use labeled training data for supervised fine-tuning, eval
   graders for reinforcement fine-tuning, or different outputs for direct preference
   optimization.

The above options are all tools to maximize performance. Once you're trying to optimize
for a price:performance ratio, you'll usually have already done all of the above and
likely don't need to repeat most steps, but you can still fine-tune smaller models or
use your best model to train a smaller model (model distillation).

> One really excellent thing about OpenAI Evals is that you can use the same graders for
> [Reinforcement Fine-Tuning](https://cookbook.openai.com/examples/reinforcement_fine_tuning)
> to produce better model performance in an extremely sample-efficient manner. One note
> of caution is to make sure that you use separate training data and don't leak your
> eval datasets during RFT.
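
A small sketch of one way to keep the two sets apart: assign each example to a split deterministically from a hash of its ID, then check that the splits are disjoint. The ID format and the 80/20 ratio are assumptions made for illustration.

```python
import hashlib


def assign_split(example_id: str, eval_fraction: float = 0.2) -> str:
    """Deterministically map an ID to 'eval' or 'train' using a stable hash."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


# Hypothetical session IDs.
ids = [f"session-{n:03d}" for n in range(200)]
train_ids = {i for i in ids if assign_split(i) == "train"}
eval_ids = {i for i in ids if assign_split(i) == "eval"}

# The same ID always lands in the same split, so the sets cannot overlap.
assert train_ids.isdisjoint(eval_ids)
print(len(train_ids), "train /", len(eval_ids), "eval")
```

Hashing the ID rather than shuffling means an example keeps its split even as new data is added later, which helps avoid accidental leakage between fine-tuning and eval sets.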

## Deploying and Post-Development

Building and deploying an LLM application is just the beginning; the real value comes from ongoing improvement. Once your system is live, prioritize continuous monitoring: log traces, track outputs, and proactively sample real user interactions for human review using smart sampling techniques.

Production data is your most authentic source for evolving your evaluation and training datasets. Regularly collect and curate fresh samples from actual use cases to identify gaps, edge cases, and new opportunities for enhancement.

In practice, leverage this data for rapid iteration. Automate periodic fine-tuning pipelines that retrain your models on recent, high-quality samples and automatically deploy new versions when they outperform existing ones in your evals. Capture user corrections and feedback, then systematically feed these insights back into your prompts or retraining process—especially when they highlight persistent issues.
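
The automated promotion step can be sketched as a simple gate: deploy a candidate only when no tracked eval regresses and the average score improves by some margin. The metric names, scores, and the `min_gain` margin are hypothetical; a real gate would likely also account for sample size and noise.

```python
def should_promote(current: dict[str, float], candidate: dict[str, float],
                   min_gain: float = 0.01) -> bool:
    """Promote only if no metric regresses and the mean score improves by at least min_gain."""
    if current.keys() != candidate.keys():
        raise ValueError("both versions must be scored on the same evals")
    no_regression = all(candidate[m] >= current[m] for m in current)
    mean_gain = sum(candidate[m] - current[m] for m in current) / len(current)
    return no_regression and mean_gain >= min_gain


# Hypothetical eval scores for the live version and two candidates.
current = {"accuracy": 0.90, "scaffolding": 0.80}
better = {"accuracy": 0.92, "scaffolding": 0.84}
mixed = {"accuracy": 0.95, "scaffolding": 0.78}

print(should_promote(current, better))
print(should_promote(current, mixed))
```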

By embedding these feedback loops into your post-development workflow, you ensure your LLM applications continuously adapt, stay robust, and remain closely aligned with user needs as they evolve.

## Core Resources

- https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection
- https://chatgpt.com/share/68b7f90f-146c-8002-9a95-75844a9355fd
- https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%5B%221l0Pg8Fv6Pj0JVR80ZJCLtNuYWWViSO1R%22%5D,%22action%22:%22open%22,%22userId%22:%22106656036806020573706%22,%22resourceKeys%22:%7B%7D%7D&usp=sharing
- https://chatgpt.com/share/68bb49c5-ca1c-8002-b8c6-61d589f0e86b

## Research Papers

- https://arxiv.org/html/2507.21504v1
- https://arxiv.org/html/2411.13768v2
- https://arxiv.org/html/2502.06329v1
- https://arxiv.org/pdf/2504.08942

## YouTube Videos

- https://www.youtube.com/watch?v=a4BV0gGmXgA
- https://www.youtube.com/watch?v=RAEyhC0P2Ic
- https://www.youtube.com/watch?v=4QXtObc61Lw

## Chat Sessions

- https://chatgpt.com/share/68bb48be-75a0-8002-9ad2-c69985e39155
- https://chatgpt.com/share/68b4024e-7f88-8001-9a65-f06680de77f5
- https://chatgpt.com/share/68b3ff88-0f74-8001-b97f-166ff6821b30

## Extra Resources

- https://platform.openai.com/chat/edit?models=gpt-5&optimize=true
- https://cookbook.openai.com/examples/realtime_prompting_guide
- https://webinar.openai.com/on-demand/d1a99ac5-8de8-43c5-b209-21903d76b5b2
- https://github.com/openai/build-hours/tree/main
- https://cdn.openai.com/business-guides-and-resources/identifying-and-scaling-ai-use-cases.pdf
- https://webinar.openai.com/buildhours/
- https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf
- https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/
- https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_and_tools#3-contextfree-grammar-cfg
- https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/
- https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide
- https://github.com/langchain-ai/agentevals