Skip to content

feat(eval): add FRAMES benchmark evaluation pipeline#445

Merged
EnjoyBacon7 merged 7 commits into
refactor/hexagonalfrom
feat/frames-benchmark-pipeline-hexagonal
Jun 5, 2026
Merged

feat(eval): add FRAMES benchmark evaluation pipeline#445
EnjoyBacon7 merged 7 commits into
refactor/hexagonalfrom
feat/frames-benchmark-pipeline-hexagonal

Conversation

@EnjoyBacon7
Copy link
Copy Markdown
Collaborator

@EnjoyBacon7 EnjoyBacon7 commented Jun 4, 2026

Summary

Brings the FRAMES benchmark evaluation pipeline (originally proposed in #305, never merged on dev) into the hexagonal architecture, placed at tests/load/automatic-evaluation-pipeline/ where Phase 13 relocated the eval pipeline.

Two standalone HTTP-client scripts (no package imports — they drive the OpenRAG API):

  • setup_frames.py — loads the FRAMES dataset (HuggingFace, cached), extracts every unique Wikipedia URL, downloads each article (md/pdf/html), creates the partition, and uploads + polls each file to a terminal state.
  • eval_frames.py — runs the 824-question benchmark in four modes: default RAG, --no-rag, --oracle (gold articles fed directly), and --gold-workspaces (one workspace per question). Scores each answer with a normalized exact-match and an LLM judge, with a per-reasoning-type breakdown.

Refactor compatibility

Verified against the current source that all API contracts the scripts depend on still hold: /health_check, /v1/chat/completions (incl. metadata.workspace and extra as a JSON string with chunk_url sources), /v1/models, partition create/delete, workspace create/list/attach, multipart upload (task_status_url), and task polling (task_state).

Hardening over #305

  • Include CANCELLED in the upload terminal states and bound the status poll with MAX_POLL_SECONDS, so a cancelled or stuck indexing task can no longer spin the poll loop forever.
  • Parse Retry-After as either a number of seconds or an HTTP-date, falling back to backoff instead of raising and dropping the article.
  • Guard a missing task_status_url instead of crashing.
  • Derive the empty-title fallback filename from a stable md5 hash so both scripts resolve the same file.
  • Validate MODEL/BASE_URL/API_KEY (and judge variants) up front with a clear message.
  • Run the OpenRAG health check in default RAG mode too.
  • Tolerate a rejected workspace file attachment instead of aborting the whole run.
  • Unify wiki_links parsing across both scripts (glued multi-URL entries are split identically), so setup_frames downloads exactly the articles eval_frames expects in oracle/gold modes.

Tests

tests/load/automatic-evaluation-pipeline/test_frames.py: 68 pure-logic and respx-mocked HTTP tests covering every API-calling function and each hardening case, plus an opt-in read-only live contract test gated on OPENRAG_SMOKE_URL / OPENRAG_SMOKE_TOKEN.

Validated end-to-end against a running token-mode stack: health, /v1/models, chat response shape, and a full create → index → poll → delete cycle.

Summary by CodeRabbit

  • New Features

    • Added an end-to-end FRAMES evaluation pipeline with multiple evaluation modes (RAG, oracle, direct LLM), workspace handling, concurrency limits, health checks, and JSON results/summary output.
    • Added a dataset preparation/indexing tool to fetch and cache referenced Wikipedia articles, deterministically name files, handle retries/rate-limits, and upload/index content with task polling.
  • Tests

    • Added comprehensive tests for parsing, downloads, retry/health behaviors, indexing/upload flows, workspace management, scoring, and optional live integration.

Port the FRAMES benchmark eval scripts (PR #305) into the new layout at
tests/load/automatic-evaluation-pipeline/. Standalone HTTP clients against
the OpenRAG API:

- setup_frames.py: download the Wikipedia articles referenced by FRAMES
  (md/pdf/html) and index them into a partition.
- eval_frames.py: run the 824-question benchmark with LLM-judge scoring and
  fuzzy exact match, in RAG / no-rag / oracle / gold-workspaces modes.
Audit fixes for the ported FRAMES eval pipeline:

- setup_frames: include CANCELLED in TERMINAL_STATES and bound the upload
  status poll with MAX_POLL_SECONDS, so a cancelled or stuck indexing task can
  no longer spin the poll loop forever.
- setup_frames: parse Retry-After as either a number of seconds or an HTTP-date,
  falling back to backoff instead of raising ValueError and silently dropping
  the article.
- setup_frames: guard a missing task_status_url instead of crashing on None.
- both scripts: derive the empty-title fallback filename from a stable md5 hash
  rather than the per-process-salted built-in hash(), so eval_frames resolves
  the same name setup_frames wrote (oracle and gold-file matching).
- eval_frames: validate MODEL/BASE_URL/API_KEY (and judge variants) up front
  with a clear message instead of failing opaquely inside ChatOpenAI.
- eval_frames: run the OpenRAG health check in default RAG mode too.
- eval_frames: tolerate a rejected workspace file attachment (some gold
  articles may be un-indexed) instead of aborting the whole run.
Pure-logic unit tests (Wikipedia parsing, exact-match scoring, oracle context,
filename stability) plus respx-mocked HTTP tests covering every API-calling
function. The mocked tests pin the audit fixes: CANCELLED as a terminal upload
state, the bounded poll timeout, the missing task_status_url guard, tolerant
workspace file attachment, Retry-After date parsing, and LLM-env validation.

Includes an opt-in read-only live contract test gated on OPENRAG_SMOKE_URL /
OPENRAG_SMOKE_TOKEN.
setup_frames.extract_all_titles fed glued multi-URL strings straight to
extract_title_from_url, producing one garbage title and silently dropping the
remaining articles — while eval_frames.parse_wiki_links splits them. Rows with
several Wikipedia URLs in one wiki_links entry therefore went un-downloaded yet
were expected by oracle / gold-workspace modes (logged as missing).

Add the same parse_wiki_links (with the embedded-URL splitter) to setup_frames
and route extract_all_titles through it, so both scripts resolve an identical
article set. Tests assert the two parsers agree, including the glued-URL case.
Drop references to the change history (audit-fix labels, prior implementations,
cross-script wording) in favor of objective descriptions of what the code does.
Replace the ordinal 'first run' / 'first time only' wording with the actual
trigger (no local cache present).
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ba209010-10cd-4014-a40c-b6d3ac9289c7

📥 Commits

Reviewing files that changed from the base of the PR and between 9c878a8 and 17b88ac.

📒 Files selected for processing (2)
  • tests/load/automatic-evaluation-pipeline/eval_frames.py
  • tests/load/automatic-evaluation-pipeline/setup_frames.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/load/automatic-evaluation-pipeline/eval_frames.py
  • tests/load/automatic-evaluation-pipeline/setup_frames.py

📝 Walkthrough

Walkthrough

Adds two CLI scripts and tests to download/index FRAMES Wikipedia articles into OpenRAG and evaluate RAG/oracle/no-RAG modes with LLM-based structured accuracy judging; includes robust HTTP retrying, deterministic filenames, concurrency limits, and comprehensive unit and integration tests.

Changes

FRAMES Benchmark Evaluation and Indexing Pipeline

Layer / File(s) Summary
Shared parsing, utilities, and dataset loading
tests/load/automatic-evaluation-pipeline/eval_frames.py, tests/load/automatic-evaluation-pipeline/setup_frames.py, tests/load/automatic-evaluation-pipeline/test_frames.py
FRAMES dataset caching (local JSON or HuggingFace), wiki-link/title extraction robust to multiple encodings, deterministic safe filenames (sanitized + md5 fallback), HTTP retry wrapper with Retry-After handling and exponential backoff, and fail-fast environment validation for LLM/JUDGE settings.
Wikipedia article download and OpenRAG indexing
tests/load/automatic-evaluation-pipeline/setup_frames.py
Concurrent Wikipedia fetching (md/html/pdf) with retries and rate-limit handling, skip/cached file logic, partition health/creation checks, multipart uploads with metadata, upload task polling to terminal states (COMPLETED/FAILED/CANCELLED) or timeout, and per-partition summary reporting.
OpenRAG evaluation modes and LLM-based accuracy judging
tests/load/automatic-evaluation-pipeline/eval_frames.py
Implements RAG queries with source extraction, gold-workspace creation and file attachment (tolerant of per-file rejections), oracle mode (LLM with cleaned gold HTML), no-RAG mode (direct LLM), concurrent structured accuracy judging via LLM, exact-match scoring, result annotation, and CLI orchestration that saves JSON outputs.
Comprehensive test coverage
tests/load/automatic-evaluation-pipeline/test_frames.py
Unit tests for parsing, filename behavior, retry-after parsing, scoring helpers, and result construction; async mocked HTTP tests for retry logic, health/partition/workspace endpoints, query behavior, file uploads and polling edge cases (409 skip, CANCELLED terminal, timeout, missing task URL); env-validation tests and an optional live OpenRAG smoke test.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through frames of distant lore,
Gathering pages, bytes, and more,
OpenRAG hums and LLMs reply,
Judges tally truth beneath the sky—
Benchmarks bloom where rabbits explore.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(eval): add FRAMES benchmark evaluation pipeline' is clear, specific, and directly describes the main change: adding a new FRAMES benchmark evaluation pipeline.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/frames-benchmark-pipeline-hexagonal

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Nitpick comments (1)
tests/load/automatic-evaluation-pipeline/test_frames.py (1)

298-307: ⚡ Quick win

Remove unused monkeypatch parameter and consider splitting into two test functions.

  1. The monkeypatch fixture parameter is declared but never used in the function body.
  2. Defining two respx.get().mock() calls for the same URL sequentially within a single test may not behave as expected—the second mock definition might not properly override or coexist with the first, depending on respx's route-matching behavior.
♻️ Recommended refactor: split into two focused tests
-@pytest.mark.asyncio
-@respx.mock
-async def test_check_health(monkeypatch):
-    respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(return_value=httpx.Response(200, text="ok"))
-    async with httpx.AsyncClient() as client:
-        assert await su.check_health(client) is True
-    respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(side_effect=httpx.ConnectError("down"))
-    async with httpx.AsyncClient() as client:
-        assert await su.check_health(client) is False
+@pytest.mark.asyncio
+@respx.mock
+async def test_check_health_success():
+    respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(return_value=httpx.Response(200, text="ok"))
+    async with httpx.AsyncClient() as client:
+        assert await su.check_health(client) is True
+
+
+@pytest.mark.asyncio
+@respx.mock
+async def test_check_health_connection_error():
+    respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(side_effect=httpx.ConnectError("down"))
+    async with httpx.AsyncClient() as client:
+        assert await su.check_health(client) is False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/load/automatic-evaluation-pipeline/test_frames.py` around lines 298 -
307, Remove the unused monkeypatch fixture from test_check_health, and split the
single test into two focused async tests (e.g., test_check_health_success and
test_check_health_failure) that each call
respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(...) once; keep the
`@pytest.mark.asyncio` and `@respx.mock` decorators on each test and call
su.check_health with an httpx.AsyncClient inside each test so the success case
returns True and the ConnectError case returns False.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/load/automatic-evaluation-pipeline/eval_frames.py`:
- Around line 160-161: The truthy check "if limit:" incorrectly treats 0 as
unset and thus ignores a user-supplied --limit 0; change the condition to an
explicit None check ("if limit is not None:") around the slicing of dataset (the
block that currently does "dataset = dataset[:limit]") so that limit=0 is
honored while still skipping slicing when limit is omitted.
- Around line 744-757: The script currently calls dataset =
load_dataset_cached(...) and resolves docs_dir before checking
args.from_results, causing --from-results to depend on dataset/cache/network
availability and misreport totals; change the control flow so the
args.from_results branch runs before any dataset loading or docs_dir resolution:
detect args.from_results, resolve from_path, load JSON results, call
annotate_exact_match(results) and await run_accuracy_judging(...) and then
return/exit early (or skip dataset loading) so load_dataset_cached(...) and
docs_dir normalization (the dataset and docs_dir variables) are not executed
when --from-results is used; apply the same change to the other identical block
that handles from_results later in the file.
- Line 710: The CLI argument for concurrency
(parser.add_argument("--concurrency", type=int, default=4)) must be validated as
strictly positive to avoid creating an asyncio.Semaphore(0) and deadlocking
tasks; fix by enforcing a positive integer either by replacing the arg
definition with a validator (e.g., a custom positive_int type or choices) or by
adding a post-parse check that calls parser.error(...) or raises if
args.concurrency < 1; update the same validation for the other occurrence at the
second add_argument call around line 725 and ensure any subsequent use (where
concurrency is passed into asyncio.Semaphore) only receives a validated >0
value.
- Line 41: Replace direct imports/usage of loguru.logger and plain print calls
with the project's structured logger: import get_logger from
openrag.utils.logger and create a logger instance via get_logger(); then replace
all raw logger calls and print statements (including the import on top and
occurrences around the blocks that previously used print at the noted ranges)
with logger.bind(...) to attach contextual fields like file_id, partition, mode,
workspace before logging. Specifically, change the import "from loguru import
logger" to "from openrag.utils.logger import get_logger", call get_logger()
(e.g., logger = get_logger()) at module start or inside main, and update usages
(the top-level logger usage and the print sites referenced) to use
logger.bind(...).info()/warning()/error() with the appropriate bound context.

In `@tests/load/automatic-evaluation-pipeline/setup_frames.py`:
- Around line 90-91: The check treating a zero limit as falsy causes --limit 0
to load all rows; change the conditional from "if limit:" to "if limit is not
None:" so that a value of 0 is respected; update the branch around the dataset
slicing (the lines using "limit" and "dataset = dataset[:limit]") to use this
new check so explicit zero limits correctly slice the dataset.
- Line 39: Replace the direct "from loguru import logger" import with the
project logger: import get_logger from openrag.utils.logger and create a module
logger instance (e.g., logger = get_logger()). In places that currently use the
raw logger or prints (including the blocks referenced around the file where
prints/loguru logger calls appear), bind contextual fields like file_id and
partition via logger = logger.bind(file_id=file_id, partition=partition) and
replace print calls or direct loguru usage with logger.info/debug/error as
appropriate so all messages use the centralized, contextualized logger.
- Around line 445-448: The parser currently allows non-positive values for the
CLI flags added with parser.add_argument("--concurrency", ...) and
parser.add_argument("--wiki-concurrency", ...), which can deadlock semaphores;
after parsing (args.concurrency and args.wiki_concurrency) validate that both
are positive integers and raise a clear argparse.ArgumentTypeError or call
parser.error if not (or coerce with a custom type function passed to
parser.add_argument) so the program exits early with a helpful message instead
of allowing zero/negative values to proceed.

---

Nitpick comments:
In `@tests/load/automatic-evaluation-pipeline/test_frames.py`:
- Around line 298-307: Remove the unused monkeypatch fixture from
test_check_health, and split the single test into two focused async tests (e.g.,
test_check_health_success and test_check_health_failure) that each call
respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(...) once; keep the
`@pytest.mark.asyncio` and `@respx.mock` decorators on each test and call
su.check_health with an httpx.AsyncClient inside each test so the success case
returns True and the ConnectError case returns False.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6682d911-3748-4cc9-85b2-f88194d74c62

📥 Commits

Reviewing files that changed from the base of the PR and between e207d64 and a5e2948.

📒 Files selected for processing (3)
  • tests/load/automatic-evaluation-pipeline/eval_frames.py
  • tests/load/automatic-evaluation-pipeline/setup_frames.py
  • tests/load/automatic-evaluation-pipeline/test_frames.py

Comment thread tests/load/automatic-evaluation-pipeline/eval_frames.py
Comment thread tests/load/automatic-evaluation-pipeline/eval_frames.py Outdated
Comment thread tests/load/automatic-evaluation-pipeline/eval_frames.py Outdated
Comment thread tests/load/automatic-evaluation-pipeline/eval_frames.py Outdated
Comment thread tests/load/automatic-evaluation-pipeline/setup_frames.py
Comment thread tests/load/automatic-evaluation-pipeline/setup_frames.py Outdated
Comment thread tests/load/automatic-evaluation-pipeline/setup_frames.py Outdated
Structured logging instead of raw prints, boundary-correct limit checks, positive
concurrency validation, and --from-results control flow optimization:

eval_frames.py:
- Replace print calls with logger.info/error
- Fix 'if limit:' to 'if limit is not None:' (line 160)
- Add _positive_int validator for --concurrency and --judge-concurrency
- Restructure main to load dataset only when needed; --from-results now runs before
  dataset loading to avoid unnecessary cache/network operations

setup_frames.py:
- Replace print calls with logger.info/error
- Fix 'if limit:' to 'if limit is not None:' (line 90)
- Add _positive_int validator for --concurrency and --wiki-concurrency

Scripts remain standalone HTTP clients (loguru only, no openrag package imports).
@EnjoyBacon7 EnjoyBacon7 force-pushed the feat/frames-benchmark-pipeline-hexagonal branch from 9c878a8 to 17b88ac Compare June 5, 2026 11:50
@EnjoyBacon7 EnjoyBacon7 merged commit 9822eaa into refactor/hexagonal Jun 5, 2026
6 checks passed
@EnjoyBacon7 EnjoyBacon7 deleted the feat/frames-benchmark-pipeline-hexagonal branch June 5, 2026 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant