# Building a Codex Evaluation Pipeline Before Rollout

When you want to roll Codex out into large-scale engineering workflows, the fastest way to build confidence is to **evaluate it on work you already know how to do well**. Large mechanical refactoring tasks, such as API or interface migrations, are a common example. They are time-consuming, repetitive, and well-understood, which makes them strong candidates for offloading to a coding agent like Codex.

That said, success here depends on getting several moving parts right. You need to validate that your prompts, documentation, and context are sufficient for the agent to operate effectively, while also understanding which Codex model is best suited for the task and how much reasoning depth it actually requires. Rigorous evaluation is what turns these questions from guesswork into confidence.

The goal is to measure how well Codex matches your real-world standards **before** handing it more autonomy.

In this cookbook, we walk through a simple Codex evaluation pipeline built on the Codex SDK. We’ll dive into code examples, highlight the strategy, and show how to turn historical human patches into a repeatable Codex benchmark.

The pipeline mirrors how a business might evaluate a model: define tasks, run Codex against the same code state the human engineers started from, and compare its output with theirs using a grader. The overall process breaks down into the following steps:

1. Define evaluation tasks from real historical refactors.
2. Recreate the exact code context with git worktrees.
3. Run Codex with a focused task prompt.
4. Grade Codex output against the original human patch.
5. Summarize and visualize results in a dashboard.



## High-level Architecture at a Glance

The pipeline follows the reference diagram below. You can think of it as five stages:

1. **Task ingestion**: Parse a YAML task file describing refactor work.
2. **Orchestration**: Check out the right base commit into a Git worktree.
3. **Codex solve**: Run the prompt against the workspace and capture a diff.
4. **Grading & scoring**: Compare generated vs. human diffs and store results.
5. **Data analysis**: Aggregate results across tasks and runs to analyze model performance, failure modes, and trade-offs (e.g. accuracy vs. reasoning depth), enabling informed decisions about prompts, models, and rollout readiness.


<img src="./images/eval_pipeline.jpeg" alt="Evaluation pipeline" width=800/>

The following sections walk through the implementation step by step.

You can find the full working example in the [OpenAI Cookbook GitHub repository](https://github.com/openai/openai-cookbook/tree/main/examples/codex).

## Step 1: Define evaluation tasks (YAML)

Tasks are stored as structured YAML in `evals/tasks.yaml`. Each task includes:

- The **Gerrit Change-Id** of a historical patch
- A prompt describing the refactoring task
- Metadata like patch id, path, and summary

Example:

```yaml
tasks:
- name: "192755_embed-fonts-decomposed-pdf"
  patch_id: 192755
  change_id: Ib6e354d5a4a9076b81e6a26fe78bdd4994024ec1
  author: "John Doe"
  patch_summary: >
    Add scrollbar width to desired size if enabled
  files_changed: 1
  code_scope_path: source/tbxctrls/linectrl.cxx
  task_prompt: >
    Add scrollbar width to desired size if enabled in tbxctrls
```
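Before spending tokens on a run, it can help to check that every task entry is well-formed. The sketch below assumes the YAML has already been parsed into plain objects; the `EvalTask` type and the `validateTask` / `parseTask` helpers are hypothetical and simply mirror the fields shown in the example above.

```typescript
// Hypothetical shape of one task entry, inferred from the YAML example above.
type EvalTask = {
  name: string;
  patch_id: number;
  change_id: string;
  author?: string;
  patch_summary: string;
  files_changed: number;
  code_scope_path: string;
  task_prompt: string;
};

const REQUIRED_STRINGS = ['name', 'change_id', 'patch_summary', 'code_scope_path', 'task_prompt'];
const REQUIRED_INTEGERS = ['patch_id', 'files_changed'];

// Returns a list of problems; an empty list means the entry looks usable.
function validateTask(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_STRINGS) {
    const value = raw[key];
    if (typeof value !== 'string' || value.trim() === '') {
      problems.push(`${key} must be a non-empty string`);
    }
  }
  for (const key of REQUIRED_INTEGERS) {
    const value = raw[key];
    if (typeof value !== 'number' || !Number.isInteger(value) || value < 0) {
      problems.push(`${key} must be a non-negative integer`);
    }
  }
  return problems;
}

// Throws on an invalid entry so a bad task file fails before any Codex run starts.
function parseTask(raw: Record<string, unknown>): EvalTask {
  const problems = validateTask(raw);
  if (problems.length > 0) {
    throw new Error(`Invalid task "${String(raw['name'])}": ${problems.join('; ')}`);
  }
  return raw as unknown as EvalTask;
}
```

Failing fast here keeps malformed entries from surfacing later as confusing "no diff" or grading errors.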

## Step 2 — Retrieve the ground‑truth diff

The pipeline resolves the commit hash into:

- **Merged commit** (the human patch you want to match)
- **Base commit** (its parent; the repo state before the change)

That logic lives in [`blog/code_example/src/git.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/git.ts) and [`blog/code_example/src/grader.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/grader.ts). 

**Key idea:** this lets you compare the model against the *actual* human patch for the same task.

```ts
export async function resolveBaseCommitFromCommitHash(repoDir: string, commitHash: string) {
  const mergedCommit = await runGitCapture(['rev-parse', commitHash], repoDir);
  const baseCommit = await runGitCapture(['rev-parse', `${mergedCommit}^`], repoDir);
  return { mergedCommit, baseCommit };
}

export async function getCommitDiff(commitHash: string, repoDir: string): Promise<string> {
  const { mergedCommit, baseCommit } = await resolveBaseCommitFromCommitHash(repoDir, commitHash);
  return runGitCapture(['diff', '--no-color', `${baseCommit}..${mergedCommit}`], repoDir);
}
```

In [`blog/code_example/src/main.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/main.ts), that diff is written to `evals_output/<run>/original_diffs/<task>.diff`, giving you a canonical “human patch” to grade against.
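One inexpensive sanity check on the ground-truth diff is to compare the number of files it touches with the task's `files_changed` metadata. The helpers below are a hypothetical sketch, not part of the repository; they assume standard `diff --git a/<path> b/<path>` headers as produced by `git diff`.

```typescript
// Lists the files touched by a unified git diff by reading its "diff --git" headers.
// Limitation: quoted paths (names with special characters) are not handled in this sketch.
function filesInDiff(diff: string): string[] {
  const files: string[] = [];
  for (const rawLine of diff.split('\n')) {
    const line = rawLine.replace(/\r$/, '');
    const match = /^diff --git a\/(.+) b\/(.+)$/.exec(line);
    // Use the post-change path so renames are reported under their new name.
    if (match && files.indexOf(match[2]) === -1) {
      files.push(match[2]);
    }
  }
  return files.sort();
}

// Compares the parsed file count with the task's `files_changed` metadata.
function matchesFilesChanged(diff: string, filesChanged: number): boolean {
  return filesInDiff(diff).length === filesChanged;
}
```

A mismatch usually means the task metadata is stale or the hash resolved to a different commit than intended, and either is worth catching before grading.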

## Step 3 — Run Codex in a clean worktree

This is the heart of the evaluation. In [`blog/code_example/src/solver.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/solver.ts), each task is executed like this:

1. Create a git worktree at the base commit.
2. Start a Codex thread that points at that worktree.
3. Run the prompt.
4. Capture the resulting diff.

**Core snippet (simplified):**


```ts
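const workTreeName = `wk-eval-${runId}-${row.commit_hash}`;
const { baseCommit } = await resolveBaseCommitFromCommitHash(workingDirectory, row.commit_hash);
// Check out the pre-change state of the repo into an isolated worktree.
await addWorktree(workTreeName, workingDirectory, baseCommit);
const fullWorkingDirectory = `${workingDirectory}/${workTreeName}`;

// Point Codex at the worktree and let it write inside that workspace.
const thread = codex.startThread({
  skipGitRepoCheck: true,
  workingDirectory: fullWorkingDirectory,
  sandboxMode: 'workspace-write',
});

// Append optional per-task context to the base task prompt.
const prompt = row.additional_context
  ? `${row.task}\n\nAdditional context:\n${row.additional_context}`
  : row.task;

await thread.run(prompt);
// Capture whatever Codex changed relative to the base commit.
const diff = await diffWorktreeStream(fullWorkingDirectory, ['--no-color']);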
```

**What you get:**

- A generated patch (`diffs/<task>.diff`)
- A Codex execution log (`codex_logs/<task>.codex.log`)
- Timing and metadata for benchmarking (`results.json`)

This corresponds to the **Codex solve** stage in the reference architecture.
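Worktrees pile up quickly across repeated runs, so it is worth removing each one once its diff has been captured. The sketch below shows one way to pair `git worktree add` with `git worktree remove` in a `try`/`finally`. The `GitRunner` type and `withWorktree` helper are hypothetical (the repository's `addWorktree` may work differently), and the git runner is injected so the logic can be exercised without a real repository.

```typescript
// Hypothetical signature for a function that runs git with the given args in `cwd`.
type GitRunner = (args: string[], cwd: string) => Promise<string>;

// Arguments for `git worktree add`, checking out baseCommit with a detached HEAD.
function worktreeAddArgs(worktreePath: string, baseCommit: string): string[] {
  return ['worktree', 'add', '--detach', worktreePath, baseCommit];
}

// Arguments for `git worktree remove`; --force also discards uncommitted edits.
function worktreeRemoveArgs(worktreePath: string): string[] {
  return ['worktree', 'remove', '--force', worktreePath];
}

// Creates the worktree, runs `body` inside it, and always cleans up afterwards.
async function withWorktree<T>(
  git: GitRunner,
  repoDir: string,
  workTreeName: string,
  baseCommit: string,
  body: (worktreePath: string) => Promise<T>,
): Promise<T> {
  const worktreePath = `${repoDir}/${workTreeName}`;
  await git(worktreeAddArgs(worktreePath, baseCommit), repoDir);
  try {
    return await body(worktreePath);
  } finally {
    await git(worktreeRemoveArgs(worktreePath), repoDir);
  }
}
```

Because removal uses `--force`, the diff has to be captured inside `body`, before the cleanup runs.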

## Step 4 — Grade model output

We want to score *semantic similarity*, not formatting. The grader in [`blog/code_example/src/grader.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/grader.ts) uses GPT‑5 to compare the generated diff and the original diff. It scores each task on a scale of 1 to 5 and returns a short rationale.

**Grading prompt (condensed from the actual code):**

```text
You are an expert evaluator of code changes.
You are given two git diffs and a task description.

Evaluate the semantic similarity of GeneratedDiff vs OriginalDiff.
Ignore formatting and style.
Return a score 1–5 with a short rationale.
```

The actual implementation uses OpenAI’s structured response parsing with a Zod schema:

```ts
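import { z } from 'zod';
import { zodTextFormat } from 'openai/helpers/zod';

// Schema for the grader's structured output: a 1 to 5 score plus a rationale.
export const GradeResult = z.object({
  score: z.number().min(1).max(5),
  explanation: z.string().min(1),
});

// `client` is an OpenAI SDK client instance; `instructions` holds the grading prompt.
const response = await client.responses.parse({
  model: 'gpt-5',
  instructions,
  input: originalDiff,
  text: { format: zodTextFormat(GradeResult, 'gradeResult') },
});
const grade = response.output_parsed; // { score, explanation }, or null if parsing failed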
```

**Why this matters:** you can compare *what matters* (task intent), not superficial patch shape.
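The condensed prompt refers to two diffs and a task description. One way to package all three for the grader is a single labeled string, sketched below. The `buildGraderInput` helper, the `TaskDescription` label, and the per-diff character limit are assumptions for illustration; they are not taken from `grader.ts`.

```typescript
// Truncates very large diffs so the grader input stays within a predictable size.
function clip(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  return `${text.slice(0, maxChars)}\n[diff truncated after ${maxChars} characters]`;
}

// Assembles the task description and both diffs under the labels the prompt refers to.
// The 60000-character default is an arbitrary placeholder, not a documented limit.
function buildGraderInput(
  taskPrompt: string,
  originalDiff: string,
  generatedDiff: string,
  maxCharsPerDiff = 60000,
): string {
  return [
    'TaskDescription:',
    taskPrompt.trim(),
    '',
    'OriginalDiff:',
    clip(originalDiff, maxCharsPerDiff),
    '',
    'GeneratedDiff:',
    clip(generatedDiff, maxCharsPerDiff),
  ].join('\n');
}
```

Explicit labels make it harder for the grader to confuse which diff is the human reference, and the truncation marker tells it when it is only seeing part of a patch.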

## Step 5 — Record metrics and analyze results

The evaluation run is logged in `evals_output/<experiment>-<run_id>/results.json` via [`blog/code_example/src/reporting.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/reporting.ts). This file includes:

- Task‑level scores, errors, and durations
- Run‑level success rate and average score
- Paths to diffs and Codex logs
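As a rough illustration of how the run-level numbers can be derived from task-level records, consider the sketch below. The `TaskResult` shape is hypothetical rather than the actual `results.json` schema, and "success rate" is assumed here to mean the fraction of tasks that produced a graded result.

```typescript
// Hypothetical task-level record; the real results.json schema may differ.
type TaskResult = {
  task: string;
  score: number | null; // null when grading did not produce a score
  error?: string;
  durationMs: number;
};

type RunSummary = {
  total: number;
  graded: number;
  successRate: number;
  averageScore: number | null;
  totalDurationMs: number;
};

// Aggregates task-level results into run-level metrics.
function summarizeRun(results: TaskResult[]): RunSummary {
  let graded = 0;
  let scoreSum = 0;
  let totalDurationMs = 0;
  for (const result of results) {
    totalDurationMs += result.durationMs;
    // Only tasks with a score and no recorded error count as graded.
    if (result.score !== null && !result.error) {
      graded += 1;
      scoreSum += result.score;
    }
  }
  return {
    total: results.length,
    graded,
    successRate: results.length === 0 ? 0 : graded / results.length,
    averageScore: graded === 0 ? null : scoreSum / graded,
    totalDurationMs,
  };
}
```

Leaving failed tasks out of the average stops errors from being read as low-quality patches; the success rate reports them separately.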

The `analytics/` folder contains a lightweight Flask dashboard that reads these results and renders:

- Experiment summaries
- Per‑task averages
- Score distributions

To launch it:

```bash
FLASK_APP=analytics.app flask run
# or
python analytics/app.py
```

Set `EVALS_OUTPUT_DIR` if you store outputs elsewhere.
The [`blog/code_example/analysis/analyze.py`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/analysis/analyze.py) script summarizes a run and prints quick metrics.
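
A score distribution is also simple to compute without the dashboard. The following is a minimal sketch in TypeScript, independent of the Flask app and `analyze.py`. It assumes scores fall on the 1 to 5 scale used by the grader; both helper names are hypothetical.

```typescript
// Counts how many scores land in each bucket; index 0 holds score 1, index 4 holds score 5.
function scoreDistribution(scores: number[]): number[] {
  const counts = [0, 0, 0, 0, 0];
  for (const score of scores) {
    if (!Number.isFinite(score)) continue; // skip missing or invalid scores
    // Round to the nearest integer and clamp into the 1..5 range.
    const bucket = Math.min(5, Math.max(1, Math.round(score)));
    counts[bucket - 1] += 1;
  }
  return counts;
}

// Renders the counts as a small text histogram, one line per score.
function renderHistogram(counts: number[]): string {
  return counts.map((count, i) => `${i + 1} | ${'#'.repeat(count)} ${count}`).join('\n');
}

console.log(renderHistogram(scoreDistribution([5, 4, 4, 1])));
```

A distribution that clusters at both ends often tells you more than the average, since it suggests some task types are handled well while others fail outright.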

## Putting it all together (CLI example)

The CLI wraps the full pipeline in one command ([`blog/code_example/src/main.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/main.ts)). Example usage:

```bash
npm run solve -- --tasks evals/tasks.yaml \
  -w /path/to/your/repo \
  -o evals_output \
  -e refactor-eval \
  -m gpt-5-codex \
  -r medium
```

This will:

1. Load tasks from YAML.
2. Resolve **git commit hashes** into base + merged commits.
3. Ask Codex to solve each task in a worktree.
4. Grade each solution.
5. Save results for analysis.
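
To compare models and reasoning depths, the same command can be repeated across a small grid. The sketch below only builds the argument lists, reusing the flags from the example above. The `buildSweepCommands` helper is hypothetical, and reasoning values other than `medium` are assumptions that you should check against what the CLI accepts.

```typescript
type SweepOptions = {
  tasksPath: string;
  repoDir: string;
  outputDir: string;
  experiment: string;
  models: string[];
  reasoningEfforts: string[]; // values other than 'medium' are assumptions
};

// Builds one argv array per (model, reasoning effort) pair.
function buildSweepCommands(options: SweepOptions): string[][] {
  const commands: string[][] = [];
  for (const model of options.models) {
    for (const effort of options.reasoningEfforts) {
      commands.push([
        'npm', 'run', 'solve', '--',
        '--tasks', options.tasksPath,
        '-w', options.repoDir,
        '-o', options.outputDir,
        // Encode the configuration in the experiment name so runs stay distinguishable.
        '-e', `${options.experiment}-${model}-${effort}`,
        '-m', model,
        '-r', effort,
      ]);
    }
  }
  return commands;
}

const commands = buildSweepCommands({
  tasksPath: 'evals/tasks.yaml',
  repoDir: '/path/to/your/repo',
  outputDir: 'evals_output',
  experiment: 'refactor-eval',
  models: ['gpt-5-codex'],
  reasoningEfforts: ['low', 'medium', 'high'],
});
console.log(commands.map((argv) => argv.join(' ')).join('\n'));
```

Each configuration gets its own experiment name, so its results can be compared side by side afterwards.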

## How to adapt this to your business use case

A few pragmatic tips when applying this to your own refactoring or bug‑fix workloads:

- **Start with high‑signal tasks.** Pull 20–50 historical patches that represent your “core” work.
- **Keep prompts stable.** Variation in prompt templates makes evals hard to compare.
- **Track costs.** [`blog/code_example/src/reporting.ts`](https://github.com/openai/openai-cookbook/tree/main/examples/codex/codex_evaluation/eval_pipeline/blog/code_example/src/reporting.ts) extracts usage from Codex logs; use this to estimate ROI.
- **Repeat runs.** Use the `--repeat` flag to evaluate variance across runs.
- **Instrument failures.** Log both “no diff” and grading errors to identify failure modes quickly.
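
To make the repeat-runs tip concrete, per-task variance can be estimated from the scores of repeated runs. The sketch below assumes a flat list of hypothetical `{ task, score }` rows gathered across runs and reports the population standard deviation; neither the row shape nor the helper name comes from the repository.

```typescript
// Hypothetical row: one graded score for one task in one run.
type RunScore = { task: string; score: number };
type Spread = { runs: number; mean: number; stdDev: number };

// Groups scores by task and reports the mean and population standard deviation.
function scoreSpread(rows: RunScore[]): Record<string, Spread> {
  const byTask: Record<string, number[]> = {};
  for (const row of rows) {
    (byTask[row.task] = byTask[row.task] || []).push(row.score);
  }
  const spread: Record<string, Spread> = {};
  for (const task of Object.keys(byTask)) {
    const scores = byTask[task];
    const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
    const variance = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / scores.length;
    spread[task] = { runs: scores.length, mean, stdDev: Math.sqrt(variance) };
  }
  return spread;
}
```

Tasks with a high standard deviation are good candidates for a clearer prompt or more context, since the agent is solving them inconsistently.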