test: comprehensive E2E tests for eval framework (73 new, 211 total) by johnnichev · Pull Request #17 · johnnichev/selectools

johnnichev · 2026-03-22T11:49:46Z

Summary

73 new E2E tests covering every eval framework feature with real Agent instances (no mocks). Total eval tests: 211.

Coverage by module:

| Module | E2E Tests |
| --- | --- |
| EvalSuite (sync, async, concurrent, errors) | 10 |
| 12 deterministic evaluators | 19 |
| 10 LLM-as-judge evaluators | 12 |
| HTML report + SVG charts | 6 |
| JUnit XML structure | 2 |
| Dataset → Report pipeline | 1 |
| Regression detection | 2 |
| Pairwise A/B comparison | 3 |
| Snapshot testing | 2 |
| Badge generation | 3 |
| Synthetic test generation | 1 |
| CLI | 2 |
| Live dashboard | 2 |
| Report statistics | 6 |
| Edge cases (unicode, weighted, multi-assert) | 2 |

Bug fix: suite.py now reads model from agent.config.model instead of agent._model.

Test plan

All 211 eval tests pass
Full test suite passes
Pre-commit hooks pass

End-to-end tests using real Agent with SharedFakeProvider and SharedToolCallProvider (no mocks). Covers every gap from audit:

- EvalSuite: basic run, tool calls, mixed results, empty cases, concurrency, progress callback, error handling, async run, tags
- All 12 deterministic evaluators E2E with real Agent
- All 10 LLM evaluators E2E with SharedFakeProvider as judge
- HTML report: full render, donut SVG, histogram SVG, error cases
- JUnit XML: structure validation, failure/error elements
- Dataset → Suite → Report → Export pipeline
- Regression detection with baseline save/compare
- Pairwise A/B comparison with real agents
- Snapshot testing: create, compare, detect changes
- Badge generation from real eval runs
- Synthetic test case generation
- CLI help verification
- Live dashboard HTML validation
- Unicode content, weighted accuracy, multiple assertions per case, tag filtering, report statistics

Fix: suite.py reads model from agent.config.model (not agent._model)

Total eval tests: 211 (was 138)
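To illustrate the kind of JUnit XML structure check listed above (failure/error elements), here is a minimal sketch using only the standard library. The sample XML, the suite name, and the `summarize_junit` helper are assumptions made for illustration; they are not output or API taken from selectools.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style report; the real exporter's output may differ.
SAMPLE = """<testsuite name="eval" tests="3" failures="1" errors="1">
  <testcase name="case_pass" time="0.01"/>
  <testcase name="case_fail" time="0.02">
    <failure message="expected tool call">contains check failed</failure>
  </testcase>
  <testcase name="case_error" time="0.00">
    <error message="provider raised">RuntimeError</error>
  </testcase>
</testsuite>"""


def summarize_junit(xml_text: str) -> dict:
    """Parse a JUnit report and count test cases by outcome."""
    root = ET.fromstring(xml_text)
    cases = root.findall("testcase")
    failures = [c for c in cases if c.find("failure") is not None]
    errors = [c for c in cases if c.find("error") is not None]
    return {
        "declared_tests": int(root.get("tests", "0")),
        "cases": len(cases),
        "failures": len(failures),
        "errors": len(errors),
    }


summary = summarize_junit(SAMPLE)
# The counts declared on <testsuite> should agree with its child elements.
assert summary["declared_tests"] == summary["cases"]
```

A structural test of this shape checks the report the way a CI system would read it, rather than comparing raw strings.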

Agent core observers (6 fixes):

- astream() cancellation/budget paths now build proper results with trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):

- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)

RAG (6 fixes):

- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)

Memory (5 fixes):

- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)

Evals (4 fixes):

- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)

Security (5 fixes):

- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)

Full suite: 2013 passed.
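The "Bool rejected for int/float params" item above is worth a short illustration, because in Python `bool` is a subclass of `int` and a plain `isinstance` check lets `True` through. The sketch below is not the selectools validator; `validate_number` is a hypothetical helper, and accepting ints where a float is expected is an assumption of this example.

```python
from typing import Any


def validate_number(value: Any, expected: type) -> bool:
    """Return True if value is acceptable for an int or float parameter."""
    # bool is a subclass of int, so isinstance(True, int) is True.
    # Reject it explicitly before the numeric checks.
    if isinstance(value, bool):
        return False
    if expected is float:
        # Assumption of this sketch: ints are accepted where a float is expected.
        return isinstance(value, (int, float))
    return isinstance(value, expected)
```

Without the explicit `bool` guard, a model that emits `true` for a numeric argument would silently be treated as `1`.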

@tool

…+ fixed

Final thorough audit pass after the user asked "is there anything you feel even 1% not confident about?" with explicit instruction to verify AND fix everything. Nine residual concerns were addressed; two surfaced real shipping blockers that isolated testing had not caught.

Verified as not a regression (no code change needed):

- #12 RAGTool descriptor pickling: function-based @tool() also fails to serialize for the same reason (decorator replaces function in the module namespace). Pickling Tools/Agents has never been supported in selectools; only cache_redis.py uses pickle, and only for (Message, UsageStats) tuples. Documented the limitation in RAGTool's class docstring along with a thread-safety note.

Fixes landed:

Bug 9: Langfuse 3.x rewrite (real shipping blocker)
---------------------------------------------------

mypy caught ``"Langfuse" has no attribute "trace"`` in src/selectools/observe/langfuse.py:65. Langfuse 3.x removed the top-level Langfuse.trace() / trace.generation() / trace.span() / trace.update() API and replaced it with start_span() / start_generation() / update_current_trace() / update_current_span(). The existing selectools LangfuseObserver was written for 2.x and would crash at runtime on every call against Langfuse 3.x (which pyproject.toml's langfuse>=2.0.0 constraint does not exclude). The existing mock-based test_langfuse_observer.py never caught it because mocks accept any method call. The e2e test in tests/test_e2e_langfuse_observer.py skipped due to missing LANGFUSE_PUBLIC_KEY env var, so the real code path had never executed.
- Rewrote src/selectools/observe/langfuse.py for Langfuse 3.x API: on_run_start now creates a root span via client.start_span(); child generations and spans use root.start_generation() / root.start_span() (which attach to the same trace); usage info moved from usage= to usage_details=, with new cost_details= for dollar cost; every span now calls .end() explicitly since Langfuse 3.x is context-manager oriented; root span finalization uses update_trace() + update() + end().
- Updated 4 affected mock tests in tests/test_langfuse_observer.py to the v3 API (client.start_span, root.start_generation, root.start_span). 19 Langfuse mock tests now pass.

#13 image_url e2e regression coverage
-------------------------------------

Added TestMultimodalRealProvidersImageUrl in tests/test_e2e_multimodal.py with three new tests (one per provider) that send https://github.githubassets.com/favicons/favicon.png through the ContentPart(type="image_url") path. Verified that OpenAI, Anthropic, and Gemini all return "GitHub" in their reply. GitHub's CDN serves bot User-Agents unlike Wikipedia's CDN, which is documented separately in the MULTIMODAL.md URL-reachability warning.

#14 CHANGELOG clarification
---------------------------

Added a "Note on the three latent bugs below" block before the Fixed section explaining that bugs 6, 7, 8 (RAGTool @tool() on methods and both multimodal content_parts drops) were pre-existing in earlier releases but never surfaced because no test actually exercised them end-to-end. This pre-empts the reasonable reader question "why didn't earlier users report these?".

#15 Pre-existing broken mkdocs anchors
--------------------------------------

- QUICKSTART.md: #code-tools-2--v0210 (double dash) was wrong. mkdocs Material slugifies the em-dash in "Code Tools (2) — v0.21.0" to a single hyphen, producing code-tools-2-v0210. Fixed the link.
- PARSER.md: both #parsing-strategy and #json-extraction anchors were broken because a stray unbalanced 3-backtick fence at line 124 was greedy-pairing with line 128, shifting every downstream fence pair by one and accidentally wrapping ## Parsing Strategy and ## JSON Extraction inside a code block. Deleting line 124 plus converting one 4-backtick close on line 205 to a 3-backtick close rebalanced all the fences. Both headings now render as real h2 elements and the TOC anchors resolve. mkdocs build: zero broken-anchor warnings.

#16 README relative docs/ links
-------------------------------

README.md is outside docs/ and must use absolute GitHub URLs per docs/CLAUDE.md. Batch-converted all 37 ](docs/*.md) relative links to ](https://github.com/johnnichev/selectools/blob/main/docs/*.md).

#17 Pre-existing mypy errors: all 46 fixed, mypy src/ is now clean
------------------------------------------------------------------

Success: no issues found in 150 source files.

- 20 no-any-return errors across 13 files: added # type: ignore[no-any-return] with explanatory context. These were all external-library Any leaks (json.loads, dict.get on Any, psycopg2, ollama client, openai SDK returns, etc.) where the runtime type is correct but the type-stub exposure is Any.
- 14 no-untyped-def errors in observer.py SimpleStepObserver graph callbacks (lines 1634-1676): added full type annotations matching the AgentObserver base class signatures (str/int/float/Exception/List[str] per event). Fixed one Liskov substitution violation where my initial annotation used List[str] for new_plan but the base class uses str.
- 8 no-untyped-def errors in serve/app.py BaseHTTPRequestHandler methods (do_GET, do_POST, do_OPTIONS, _json_response, _html_response, log_message, handle_stream, _stream): added -> None returns and Any / str parameter types. Imported Iterator and AsyncIterator from typing.
- pipeline.py:439 astream: added -> AsyncIterator[Any].
- observe/trace_store.py:349 _iter_entries: added -> Iterator[Dict[str, Any]].
- agent/config.py:215 _unpack nested helper: added (Any, type) -> Any.
- trace.py:506: ``dataclasses.asdict`` was rejecting ``DataclassInstance | type[DataclassInstance]`` (too wide). Narrowed with ``not isinstance(obj, type)`` so mypy sees a non-type dataclass.
- providers/_openai_compat.py:560: expanded existing # type: ignore from [return-value] to [return-value,no-any-return] to cover the second error code.
- serve/_starlette_app.py:105: eval_dashboard was declared to return HTMLResponse but the unauth-redirect branch returns a RedirectResponse. Widened the return type to Response to match the neighbouring handlers (builder, provider_health).

#18 Landing page feature content for v0.21.0
--------------------------------------------

Three text-only bento card updates (no layout changes):

- RAG card: "4 store backends" → "7 store backends" with the full list enumerated plus CSV/JSON/HTML/URL loaders mentioned.
- Toolbox card: added explicit v0.21.0 additions (Python + shell execution, DuckDuckGo search, GitHub REST API, SQLite + Postgres).
- Audit card retitled to "Audit + observability" and expanded to mention OTelObserver (GenAI semantic conventions) and LangfuseObserver as the new v0.21.0 shipping surfaces for trace export to Datadog / Jaeger / Langfuse Cloud / any OTLP backend.

#19 FAISS variant of App 3 Knowledge Base Librarian
---------------------------------------------------

Added TestApp3b_KnowledgeBaseLibrarianFAISS in tests/test_e2e_v0_21_0_apps.py: the same CSV + JSON + HTML librarian persona but backed by FAISSVectorStore instead of Qdrant. Runnable without Docker, and with different anchor phrases (OSPREY-88, CRESCENT, AURORA-SOUTH) so it doesn't shadow the Qdrant variant when both run. Three tests, all passing against real OpenAI embeddings + real OpenAI gpt-4o-mini.
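The trace.py narrowing mentioned under #17 relies on a standard idiom: `dataclasses.is_dataclass()` is true for both a dataclass and its instances, while `dataclasses.asdict()` accepts only instances. A minimal sketch of that idiom follows; the `Step` class and `to_dict` helper are hypothetical and do not reproduce the actual code at trace.py:506.

```python
import dataclasses
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Step:  # hypothetical dataclass, for illustration only
    name: str
    tokens: int


def to_dict(obj: Any) -> Dict[str, Any]:
    """Convert a dataclass instance to a dict; reject classes and other objects."""
    # is_dataclass() is True for both Step and Step(...), but asdict()
    # only accepts instances, so classes are excluded explicitly.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"expected a dataclass instance, got {obj!r}")
```

With the `not isinstance(obj, type)` check in place, a type checker can treat `obj` as an instance inside the branch, which is the narrowing the commit message describes.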
#20 RAGTool docstring notes
---------------------------

Added a "Notes" block to RAGTool explaining:

- Thread safety: the vector store handles its own locking, but mutating top_k / score_threshold / include_scores after attaching to an Agent is not thread-safe.
- Cross-process serialization: not supported, same reason function-based @tool() tools aren't supported.

Verification
------------

- mypy src/: Success: no issues found in 150 source files
- Full non-e2e suite: 4961 passed, 3 skipped, 248 deselected (+9 from new image_url + async multimodal + FAISS librarian tests), 0 regressions
- Full e2e suite with Qdrant + Postgres running: 70 collected, 64 passed, 6 skipped (Azure x2 + Langfuse x1 credential-dependent + 3 Qdrant tests when the container isn't running), 0 failures
- mkdocs build: zero broken-anchor warnings (QUICKSTART + PARSER both clean now)
- diff CHANGELOG.md docs/CHANGELOG.md: byte-identical
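The Bug 9 observation that "mocks accept any method call" can be reproduced with the standard library alone. The sketch below uses a hypothetical `ClientV3` class as a stand-in for a client whose API changed; it is not the Langfuse SDK. It shows how a spec'd mock would have flagged a call to a removed method.

```python
from unittest.mock import Mock, create_autospec


class ClientV3:  # hypothetical stand-in for a client whose API changed
    def start_span(self, name: str) -> str:
        return f"span:{name}"


# A bare Mock accepts any attribute, so calling a removed method "works".
loose = Mock()
loose.trace(name="run")  # no error, even though ClientV3 has no trace()

# A spec'd mock only exposes attributes that exist on the real class.
strict = create_autospec(ClientV3, instance=True)
strict.start_span("run")  # fine: the method exists on ClientV3

try:
    strict.trace(name="run")
    removed_call_caught = False
except AttributeError:
    removed_call_caught = True
```

Spec'd mocks do not replace an e2e run against the real service, but they narrow the gap when credentials are unavailable and the e2e test is skipped.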

johnnichev merged commit 16a7200 into main Mar 22, 2026

johnnichev deleted the test/eval-comprehensive branch March 22, 2026 11:49

johnnichev merged 1 commit into main from test/eval-comprehensive
