feat(eval): add FRAMES benchmark evaluation pipeline#445
Conversation
Port the FRAMES benchmark eval scripts (PR #305) into the new layout at tests/load/automatic-evaluation-pipeline/. Standalone HTTP clients against the OpenRAG API: - setup_frames.py: download the Wikipedia articles referenced by FRAMES (md/pdf/html) and index them into a partition. - eval_frames.py: run the 824-question benchmark with LLM-judge scoring and fuzzy exact match, in RAG / no-rag / oracle / gold-workspaces modes.
Audit fixes for the ported FRAMES eval pipeline: - setup_frames: include CANCELLED in TERMINAL_STATES and bound the upload status poll with MAX_POLL_SECONDS, so a cancelled or stuck indexing task can no longer spin the poll loop forever. - setup_frames: parse Retry-After as either a number of seconds or an HTTP-date, falling back to backoff instead of raising ValueError and silently dropping the article. - setup_frames: guard a missing task_status_url instead of crashing on None. - both scripts: derive the empty-title fallback filename from a stable md5 hash rather than the per-process-salted built-in hash(), so eval_frames resolves the same name setup_frames wrote (oracle and gold-file matching). - eval_frames: validate MODEL/BASE_URL/API_KEY (and judge variants) up front with a clear message instead of failing opaquely inside ChatOpenAI. - eval_frames: run the OpenRAG health check in default RAG mode too. - eval_frames: tolerate a rejected workspace file attachment (some gold articles may be un-indexed) instead of aborting the whole run.
Pure-logic unit tests (Wikipedia parsing, exact-match scoring, oracle context, filename stability) plus respx-mocked HTTP tests covering every API-calling function. The mocked tests pin the audit fixes: CANCELLED as a terminal upload state, the bounded poll timeout, the missing task_status_url guard, tolerant workspace file attachment, Retry-After date parsing, and LLM-env validation. Includes an opt-in read-only live contract test gated on OPENRAG_SMOKE_URL / OPENRAG_SMOKE_TOKEN.
setup_frames.extract_all_titles fed glued multi-URL strings straight to extract_title_from_url, producing one garbage title and silently dropping the remaining articles — while eval_frames.parse_wiki_links splits them. Rows with several Wikipedia URLs in one wiki_links entry therefore went un-downloaded yet were expected by oracle / gold-workspace modes (logged as missing). Add the same parse_wiki_links (with the embedded-URL splitter) to setup_frames and route extract_all_titles through it, so both scripts resolve an identical article set. Tests assert the two parsers agree, including the glued-URL case.
Drop references to the change history (audit-fix labels, prior implementations, cross-script wording) in favor of objective descriptions of what the code does.
Replace the ordinal 'first run' / 'first time only' wording with the actual trigger (no local cache present).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds two CLI scripts and tests to download/index FRAMES Wikipedia articles into OpenRAG and evaluate RAG/oracle/no-RAG modes with LLM-based structured accuracy judging; includes robust HTTP retrying, deterministic filenames, concurrency limits, and comprehensive unit and integration tests. ChangesFRAMES Benchmark Evaluation and Indexing Pipeline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
tests/load/automatic-evaluation-pipeline/test_frames.py (1)
298-307: ⚡ Quick winRemove unused
monkeypatchparameter and consider splitting into two test functions.
- The
monkeypatchfixture parameter is declared but never used in the function body.- Defining two
respx.get().mock()calls for the same URL sequentially within a single test may not behave as expected—the second mock definition might not properly override or coexist with the first, depending on respx's route-matching behavior.♻️ Recommended refactor: split into two focused tests
-@pytest.mark.asyncio -@respx.mock -async def test_check_health(monkeypatch): - respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(return_value=httpx.Response(200, text="ok")) - async with httpx.AsyncClient() as client: - assert await su.check_health(client) is True - respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(side_effect=httpx.ConnectError("down")) - async with httpx.AsyncClient() as client: - assert await su.check_health(client) is False +@pytest.mark.asyncio +@respx.mock +async def test_check_health_success(): + respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(return_value=httpx.Response(200, text="ok")) + async with httpx.AsyncClient() as client: + assert await su.check_health(client) is True + + +@pytest.mark.asyncio +@respx.mock +async def test_check_health_connection_error(): + respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(side_effect=httpx.ConnectError("down")) + async with httpx.AsyncClient() as client: + assert await su.check_health(client) is False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/load/automatic-evaluation-pipeline/test_frames.py` around lines 298 - 307, Remove the unused monkeypatch fixture from test_check_health, and split the single test into two focused async tests (e.g., test_check_health_success and test_check_health_failure) that each call respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(...) once; keep the `@pytest.mark.asyncio` and `@respx.mock` decorators on each test and call su.check_health with an httpx.AsyncClient inside each test so the success case returns True and the ConnectError case returns False.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/load/automatic-evaluation-pipeline/eval_frames.py`:
- Around line 160-161: The truthy check "if limit:" incorrectly treats 0 as
unset and thus ignores a user-supplied --limit 0; change the condition to an
explicit None check ("if limit is not None:") around the slicing of dataset (the
block that currently does "dataset = dataset[:limit]") so that limit=0 is
honored while still skipping slicing when limit is omitted.
- Around line 744-757: The script currently calls dataset =
load_dataset_cached(...) and resolves docs_dir before checking
args.from_results, causing --from-results to depend on dataset/cache/network
availability and misreport totals; change the control flow so the
args.from_results branch runs before any dataset loading or docs_dir resolution:
detect args.from_results, resolve from_path, load JSON results, call
annotate_exact_match(results) and await run_accuracy_judging(...) and then
return/exit early (or skip dataset loading) so load_dataset_cached(...) and
docs_dir normalization (the dataset and docs_dir variables) are not executed
when --from-results is used; apply the same change to the other identical block
that handles from_results later in the file.
- Line 710: The CLI argument for concurrency
(parser.add_argument("--concurrency", type=int, default=4)) must be validated as
strictly positive to avoid creating an asyncio.Semaphore(0) and deadlocking
tasks; fix by enforcing a positive integer either by replacing the arg
definition with a validator (e.g., a custom positive_int type or choices) or by
adding a post-parse check that calls parser.error(...) or raises if
args.concurrency < 1; update the same validation for the other occurrence at the
second add_argument call around line 725 and ensure any subsequent use (where
concurrency is passed into asyncio.Semaphore) only receives a validated >0
value.
- Line 41: Replace direct imports/usage of loguru.logger and plain print calls
with the project's structured logger: import get_logger from
openrag.utils.logger and create a logger instance via get_logger(); then replace
all raw logger calls and print statements (including the import on top and
occurrences around the blocks that previously used print at the noted ranges)
with logger.bind(...) to attach contextual fields like file_id, partition, mode,
workspace before logging. Specifically, change the import "from loguru import
logger" to "from openrag.utils.logger import get_logger", call get_logger()
(e.g., logger = get_logger()) at module start or inside main, and update usages
(the top-level logger usage and the print sites referenced) to use
logger.bind(...).info()/warning()/error() with the appropriate bound context.
In `@tests/load/automatic-evaluation-pipeline/setup_frames.py`:
- Around line 90-91: The check treating a zero limit as falsy causes --limit 0
to load all rows; change the conditional from "if limit:" to "if limit is not
None:" so that a value of 0 is respected; update the branch around the dataset
slicing (the lines using "limit" and "dataset = dataset[:limit]") to use this
new check so explicit zero limits correctly slice the dataset.
- Line 39: Replace the direct "from loguru import logger" import with the
project logger: import get_logger from openrag.utils.logger and create a module
logger instance (e.g., logger = get_logger()). In places that currently use the
raw logger or prints (including the blocks referenced around the file where
prints/loguru logger calls appear), bind contextual fields like file_id and
partition via logger = logger.bind(file_id=file_id, partition=partition) and
replace print calls or direct loguru usage with logger.info/debug/error as
appropriate so all messages use the centralized, contextualized logger.
- Around line 445-448: The parser currently allows non-positive values for the
CLI flags added with parser.add_argument("--concurrency", ...) and
parser.add_argument("--wiki-concurrency", ...), which can deadlock semaphores;
after parsing (args.concurrency and args.wiki_concurrency) validate that both
are positive integers and raise a clear argparse.ArgumentTypeError or call
parser.error if not (or coerce with a custom type function passed to
parser.add_argument) so the program exits early with a helpful message instead
of allowing zero/negative values to proceed.
---
Nitpick comments:
In `@tests/load/automatic-evaluation-pipeline/test_frames.py`:
- Around line 298-307: Remove the unused monkeypatch fixture from
test_check_health, and split the single test into two focused async tests (e.g.,
test_check_health_success and test_check_health_failure) that each call
respx.get(f"{su.OPENRAG_BASE_URL}/health_check").mock(...) once; keep the
`@pytest.mark.asyncio` and `@respx.mock` decorators on each test and call
su.check_health with an httpx.AsyncClient inside each test so the success case
returns True and the ConnectError case returns False.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6682d911-3748-4cc9-85b2-f88194d74c62
📒 Files selected for processing (3)
tests/load/automatic-evaluation-pipeline/eval_frames.pytests/load/automatic-evaluation-pipeline/setup_frames.pytests/load/automatic-evaluation-pipeline/test_frames.py
Structured logging instead of raw prints, boundary-correct limit checks, positive concurrency validation, and --from-results control flow optimization: eval_frames.py: - Replace print calls with logger.info/error - Fix 'if limit:' to 'if limit is not None:' (line 160) - Add _positive_int validator for --concurrency and --judge-concurrency - Restructure main to load dataset only when needed; --from-results now runs before dataset loading to avoid unnecessary cache/network operations setup_frames.py: - Replace print calls with logger.info/error - Fix 'if limit:' to 'if limit is not None:' (line 90) - Add _positive_int validator for --concurrency and --wiki-concurrency Scripts remain standalone HTTP clients (loguru only, no openrag package imports).
9c878a8 to
17b88ac
Compare
Summary
Brings the FRAMES benchmark evaluation pipeline (originally proposed in #305, never merged on
dev) into the hexagonal architecture, placed attests/load/automatic-evaluation-pipeline/where Phase 13 relocated the eval pipeline.Two standalone HTTP-client scripts (no package imports — they drive the OpenRAG API):
setup_frames.py— loads the FRAMES dataset (HuggingFace, cached), extracts every unique Wikipedia URL, downloads each article (md/pdf/html), creates the partition, and uploads + polls each file to a terminal state.eval_frames.py— runs the 824-question benchmark in four modes: default RAG,--no-rag,--oracle(gold articles fed directly), and--gold-workspaces(one workspace per question). Scores each answer with a normalized exact-match and an LLM judge, with a per-reasoning-type breakdown.Refactor compatibility
Verified against the current source that all API contracts the scripts depend on still hold:
/health_check,/v1/chat/completions(incl.metadata.workspaceandextraas a JSON string withchunk_urlsources),/v1/models, partition create/delete, workspace create/list/attach, multipart upload (task_status_url), and task polling (task_state).Hardening over #305
CANCELLEDin the upload terminal states and bound the status poll withMAX_POLL_SECONDS, so a cancelled or stuck indexing task can no longer spin the poll loop forever.Retry-Afteras either a number of seconds or an HTTP-date, falling back to backoff instead of raising and dropping the article.task_status_urlinstead of crashing.MODEL/BASE_URL/API_KEY(and judge variants) up front with a clear message.wiki_linksparsing across both scripts (glued multi-URL entries are split identically), sosetup_framesdownloads exactly the articleseval_framesexpects in oracle/gold modes.Tests
tests/load/automatic-evaluation-pipeline/test_frames.py: 68 pure-logic andrespx-mocked HTTP tests covering every API-calling function and each hardening case, plus an opt-in read-only live contract test gated onOPENRAG_SMOKE_URL/OPENRAG_SMOKE_TOKEN.Validated end-to-end against a running token-mode stack: health,
/v1/models, chat response shape, and a full create → index → poll → delete cycle.Summary by CodeRabbit
New Features
Tests