feat: pairwise A/B eval, test generator, live dashboard, badges, snapshots by johnnichev · Pull Request #15 · johnnichev/selectools

johnnichev · 2026-03-22T03:57:47Z

Summary

5 advanced eval features that no other agent framework offers:

- PairwiseEval: Compare two agents head-to-head on the same test cases. Automatic winner determination by verdict, then latency, then cost. PairwiseReport with per-case breakdown, aggregate stats, and to_dict() serialization.
- generate_cases(): LLM-powered synthetic test case generator. Pass your agent's tools, get back diverse test cases across happy path, edge case, error handling, and adversarial categories. Parses markdown-fenced JSON, handles malformed output gracefully.
- serve_eval(): Live browser dashboard. Starts a local HTTP server, opens the browser, shows real-time progress bar, accuracy/cost/latency stats, and per-case results updating via polling. Dark theme matching selectools brand. Zero external dependencies (stdlib http.server).
- generate_badge() / generate_detailed_badge(): Shields.io-style SVG badges for README. Color-coded by accuracy (green/cyan/blue/yellow/orange/red). Detailed variant shows pass/total counts. Ready for CI to commit to repo.
- SnapshotStore: Jest-style snapshot testing for AI agents. Capture exact agent outputs on first run, diff against snapshot on subsequent runs. Per-field change detection (content, tool_calls, verdict, iterations). SnapshotResult with new/removed/changed/unchanged case lists.
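The "verdict, then latency, then cost" ordering described for PairwiseEval can be read as a three-level tie-break. The sketch below is only an illustration of that ordering: `CaseOutcome` and `pick_winner` are hypothetical names, not the selectools API, and the real implementation may treat ties or tolerances differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseOutcome:
    """Hypothetical per-case result for one agent (not the selectools type)."""

    passed: bool
    latency_ms: float
    cost_usd: float


def pick_winner(a: CaseOutcome, b: CaseOutcome) -> Optional[str]:
    """Return "a", "b", or None for a tie: verdict first, then latency, then cost."""
    # 1. Verdict: a passing run beats a failing one.
    if a.passed != b.passed:
        return "a" if a.passed else "b"
    # 2. Latency: the faster run wins when verdicts match.
    if a.latency_ms != b.latency_ms:
        return "a" if a.latency_ms < b.latency_ms else "b"
    # 3. Cost: the cheaper run wins when verdict and latency match.
    if a.cost_usd != b.cost_usd:
        return "a" if a.cost_usd < b.cost_usd else "b"
    return None
```

Under this reading, a slow and expensive run that passes still beats a fast and cheap run that fails, because latency and cost are only consulted once the verdicts agree.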

28 new tests (total eval: 138, total project: ~1700). Zero external dependencies.

Test plan

- All 138 eval tests pass
- All pre-commit hooks pass (black, isort, flake8, mypy, bandit)
- Package imports cleanly

…shots

5 advanced eval features:
1. PairwiseEval: compare two agents head-to-head on same test cases, automatic winner determination by verdict/latency/cost
2. generate_cases(): LLM-powered synthetic test case generator from tool definitions, generates edge cases and adversarial inputs
3. serve_eval(): live browser dashboard with real-time progress bar, accuracy/cost/latency stats, auto-polling, dark theme
4. generate_badge(): shields.io-style SVG badges for README (accuracy color-coded green/yellow/red), detailed variant with pass/total counts
5. SnapshotStore: Jest-style snapshot testing for agent outputs, compare/save/diff with per-field change detection

28 new tests (total eval tests: 138). Zero external dependencies.
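Per-field change detection for snapshots can be sketched as a comparison of two dictionaries keyed by case name. This is an assumption-laden illustration, not the SnapshotStore implementation: the `Snapshot` alias and `diff_snapshots` function are hypothetical, and only the four field names and the new/removed/changed/unchanged grouping come from the PR summary.

```python
from typing import Any, Dict, List

# Fields compared per case, as listed in the PR summary.
FIELDS = ("content", "tool_calls", "verdict", "iterations")

# Hypothetical storage shape: case name -> recorded fields.
Snapshot = Dict[str, Dict[str, Any]]


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, Any]:
    """Group cases into new/removed/changed/unchanged, naming the changed fields."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed: Dict[str, List[str]] = {}
    unchanged: List[str] = []
    for name in sorted(set(old) & set(new)):
        # Record which of the tracked fields differ for this case.
        fields = [f for f in FIELDS if old[name].get(f) != new[name].get(f)]
        if fields:
            changed[name] = fields
        else:
            unchanged.append(name)
    return {"new": added, "removed": removed, "changed": changed, "unchanged": unchanged}
```

Reporting the names of the changed fields, rather than a single changed flag, is what lets a reviewer tell a wording change in `content` apart from a behavioural change in `tool_calls`.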

Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)

Full suite: 2013 passed.
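The "Bool rejected for int/float params" fix likely exists because `bool` is a subclass of `int` in Python, so a plain `isinstance(value, int)` check lets `True` through. The validator below is a minimal sketch of that guard; `validate_number` is a hypothetical helper, not the selectools function, and the int-to-float widening is an assumption of this example.

```python
from typing import Any


def validate_number(value: Any, expected: type) -> Any:
    """Hypothetical validator: accept int/float arguments but reject bool."""
    if expected not in (int, float):
        raise ValueError("validate_number only handles int and float")
    # bool is a subclass of int, so isinstance(True, int) is True.
    # The explicit bool check therefore has to come first.
    if isinstance(value, bool):
        raise TypeError(f"expected {expected.__name__}, got bool")
    if expected is float and isinstance(value, int):
        return float(value)  # widening int -> float is an assumption of this sketch
    if not isinstance(value, expected):
        raise TypeError(f"expected {expected.__name__}, got {type(value).__name__}")
    return value
```

Without the early `bool` branch, a model that emits `true` for a numeric parameter would be silently coerced to `1` instead of producing a validation error.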


…+ fixed

Final thorough audit pass after the user asked "is there anything you feel even 1% not confident about?" with explicit instruction to verify AND fix everything. Nine residual concerns were addressed; two surfaced real shipping blockers that isolated testing had not caught.

Verified as not a regression (no code change needed):
- #12 RAGTool descriptor pickling: function-based @tool() also fails to serialize for the same reason (decorator replaces function in the module namespace). Pickling Tools/Agents has never been supported in selectools; only cache_redis.py uses pickle, and only for (Message, UsageStats) tuples. Documented the limitation in RAGTool's class docstring along with a thread-safety note.

Fixes landed:

Bug 9: Langfuse 3.x rewrite (real shipping blocker)
---------------------------------------------------
mypy caught ``"Langfuse" has no attribute "trace"`` in src/selectools/observe/langfuse.py:65. Langfuse 3.x removed the top-level Langfuse.trace() / trace.generation() / trace.span() / trace.update() API and replaced it with start_span() / start_generation() / update_current_trace() / update_current_span(). The existing selectools LangfuseObserver was written for 2.x and would crash at runtime on every call against Langfuse 3.x (which pyproject.toml's langfuse>=2.0.0 constraint does not exclude). The existing mock-based test_langfuse_observer.py never caught it because mocks accept any method call. The e2e test in tests/test_e2e_langfuse_observer.py skipped due to missing LANGFUSE_PUBLIC_KEY env var, so the real code path had never executed.
- Rewrote src/selectools/observe/langfuse.py for Langfuse 3.x API: on_run_start now creates a root span via client.start_span(); child generations and spans use root.start_generation() / root.start_span() (which attach to the same trace); usage info moved from usage= to usage_details=, with new cost_details= for dollar cost; every span now calls .end() explicitly since Langfuse 3.x is context-manager oriented; root span finalization uses update_trace() + update() + end().
- Updated 4 affected mock tests in tests/test_langfuse_observer.py to the v3 API (client.start_span, root.start_generation, root.start_span). 19 Langfuse mock tests now pass.

#13 image_url e2e regression coverage
-------------------------------------
Added TestMultimodalRealProvidersImageUrl in tests/test_e2e_multimodal.py with three new tests (one per provider) that send https://github.githubassets.com/favicons/favicon.png through the ContentPart(type="image_url") path. Verified that OpenAI, Anthropic, and Gemini all return "GitHub" in their reply. GitHub's CDN serves bot User-Agents unlike Wikipedia's CDN, which is documented separately in the MULTIMODAL.md URL-reachability warning.

#14 CHANGELOG clarification
---------------------------
Added a "Note on the three latent bugs below" block before the Fixed section explaining that bugs 6, 7, 8 (RAGTool @tool() on methods and both multimodal content_parts drops) were pre-existing in earlier releases but never surfaced because no test actually exercised them end-to-end. This pre-empts the reasonable reader question "why didn't earlier users report these?".

#15 Pre-existing broken mkdocs anchors
--------------------------------------
- QUICKSTART.md: #code-tools-2--v0210 (double dash) was wrong. mkdocs Material slugifies the em-dash in "Code Tools (2) — v0.21.0" to a single hyphen, producing code-tools-2-v0210. Fixed the link.
- PARSER.md: both #parsing-strategy and #json-extraction anchors were broken because a stray unbalanced 3-backtick fence at line 124 was greedy-pairing with line 128, shifting every downstream fence pair by one and accidentally wrapping ## Parsing Strategy and ## JSON Extraction inside a code block. Deleting line 124 plus converting one 4-backtick close on line 205 to a 3-backtick close rebalanced all the fences. Both headings now render as real h2 elements and the TOC anchors resolve. mkdocs build: zero broken-anchor warnings.

#16 README relative docs/ links
-------------------------------
README.md is outside docs/ and must use absolute GitHub URLs per docs/CLAUDE.md. Batch-converted all 37 ](docs/*.md) relative links to ](https://github.com/johnnichev/selectools/blob/main/docs/*.md).

#17 Pre-existing mypy errors: all 46 fixed, mypy src/ is now clean
------------------------------------------------------------------
Success: no issues found in 150 source files.
- 20 no-any-return errors across 13 files: added # type: ignore[no-any-return] with explanatory context. These were all external-library Any leaks (json.loads, dict.get on Any, psycopg2, ollama client, openai SDK returns, etc.) where the runtime type is correct but the type-stub exposure is Any.
- 14 no-untyped-def errors in observer.py SimpleStepObserver graph callbacks (lines 1634-1676): added full type annotations matching the AgentObserver base class signatures (str/int/float/Exception/List[str] per event). Fixed one Liskov substitution violation where my initial annotation used List[str] for new_plan but the base class uses str.
- 8 no-untyped-def errors in serve/app.py BaseHTTPRequestHandler methods (do_GET, do_POST, do_OPTIONS, _json_response, _html_response, log_message, handle_stream, _stream): added -> None returns and Any / str parameter types. Imported Iterator and AsyncIterator from typing.
- pipeline.py:439 astream: added -> AsyncIterator[Any].
- observe/trace_store.py:349 _iter_entries: added -> Iterator[Dict[str, Any]].
- agent/config.py:215 _unpack nested helper: added (Any, type) -> Any.
- trace.py:506: ``dataclasses.asdict`` was rejecting ``DataclassInstance | type[DataclassInstance]`` (too wide). Narrowed with ``not isinstance(obj, type)`` so mypy sees a non-type dataclass.
- providers/_openai_compat.py:560: expanded existing # type: ignore from [return-value] to [return-value,no-any-return] to cover the second error code.
- serve/_starlette_app.py:105: eval_dashboard was declared to return HTMLResponse but the unauth-redirect branch returns a RedirectResponse. Widened the return type to Response to match the neighbouring handlers (builder, provider_health).

#18 Landing page feature content for v0.21.0
--------------------------------------------
Three text-only bento card updates (no layout changes):
- RAG card: "4 store backends" → "7 store backends" with the full list enumerated plus CSV/JSON/HTML/URL loaders mentioned.
- Toolbox card: added explicit v0.21.0 additions (Python + shell execution, DuckDuckGo search, GitHub REST API, SQLite + Postgres).
- Audit card retitled to "Audit + observability" and expanded to mention OTelObserver (GenAI semantic conventions) and LangfuseObserver as the new v0.21.0 shipping surfaces for trace export to Datadog / Jaeger / Langfuse Cloud / any OTLP backend.

#19 FAISS variant of App 3 Knowledge Base Librarian
---------------------------------------------------
Added TestApp3b_KnowledgeBaseLibrarianFAISS in tests/test_e2e_v0_21_0_apps.py: the same CSV + JSON + HTML librarian persona but backed by FAISSVectorStore instead of Qdrant. Runnable without Docker, and with different anchor phrases (OSPREY-88, CRESCENT, AURORA-SOUTH) so it doesn't shadow the Qdrant variant when both run. Three tests, all passing against real OpenAI embeddings + real OpenAI gpt-4o-mini.
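The trace.py narrowing mentioned under #17 relies on the fact that `dataclasses.is_dataclass()` is true for both dataclass instances and dataclass classes, while `dataclasses.asdict()` only accepts instances. A minimal sketch of the pattern follows; `Step` and `to_dict` are hypothetical names chosen for illustration and do not reflect the actual code at trace.py:506.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class Step:
    """Hypothetical dataclass standing in for a trace step."""

    name: str
    duration_ms: float


def to_dict(obj: Any) -> Dict[str, Any]:
    # is_dataclass() accepts both instances and classes, but asdict()
    # only accepts instances. Excluding `type` narrows obj to an instance
    # for the type checker and guards the call at runtime.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"expected a dataclass instance, got {obj!r}")
```

Passing the class itself (`to_dict(Step)`) takes the `TypeError` branch instead of reaching `asdict()`.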
#20 RAGTool docstring notes
---------------------------
Added a "Notes" block to RAGTool explaining:
- Thread safety: the vector store handles its own locking, but mutating top_k / score_threshold / include_scores after attaching to an Agent is not thread-safe.
- Cross-process serialization: not supported, same reason function-based @tool() tools aren't supported.

Verification
------------
- mypy src/: Success: no issues found in 150 source files
- Full non-e2e suite: 4961 passed, 3 skipped, 248 deselected (+9 from new image_url + async multimodal + FAISS librarian tests), 0 regressions
- Full e2e suite with Qdrant + Postgres running: 70 collected, 64 passed, 6 skipped (Azure x2 + Langfuse x1 credential-dependent + 3 Qdrant tests when the container isn't running), 0 failures
- mkdocs build: zero broken-anchor warnings (QUICKSTART + PARSER both clean now)
- diff CHANGELOG.md docs/CHANGELOG.md: byte-identical
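The Bug 9 note above observes that mock-based tests missed the removed API because mocks accept any method call. One common way to close that gap, shown here as a sketch and not as what the updated tests actually do, is to build the mock from a spec so that unknown attributes raise. `ClientV3` is a hypothetical stand-in class, not the Langfuse client.

```python
from unittest.mock import MagicMock, create_autospec


class ClientV3:
    """Hypothetical stand-in for a client whose trace() method was removed."""

    def start_span(self, name: str) -> str:
        return f"span:{name}"


# A bare MagicMock fabricates any attribute on access, so a call to the
# removed method succeeds silently and the test stays green.
loose = MagicMock()
loose.trace(name="run")

# A mock built from the real class only exposes attributes that exist on it.
strict = create_autospec(ClientV3, instance=True)
strict.start_span("run")  # fine: the method exists on ClientV3
try:
    strict.trace(name="run")
    removed_call_caught = False
except AttributeError:
    removed_call_caught = True
```

A spec'd mock still cannot replace an e2e run against the real service, but it would have turned the 2.x-style `trace()` call into a test failure without needing credentials.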

johnnichev merged commit fa53d23 into main Mar 22, 2026

johnnichev deleted the feat/eval-advanced branch March 22, 2026 03:57

johnnichev merged 1 commit into main from feat/eval-advanced
