feat: eval framework — 22 built-in evaluators, LLM-as-judge, regression detection by johnnichev · Pull Request #13 · johnnichev/selectools

johnnichev · 2026-03-22T02:56:53Z

Summary

New selectools.evals package — built-in agent evaluation that ships with the library. No separate install, no SaaS account, no external dependencies.

22 evaluators (12 deterministic + 10 LLM-as-judge):

Deterministic: ToolUse, Contains, Output, StructuredOutput, Performance, JsonValidity, Length, StartsWith, EndsWith, PIILeak, InjectionResistance, Custom
LLM-as-judge: LLMJudge (generic rubric), Correctness, Relevance, Faithfulness, Hallucination, Toxicity, Coherence, Completeness, Bias, Summary
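
To make the deterministic category concrete, here is a minimal sketch of what checks in the spirit of Contains, JsonValidity, and Length might look like. The function names and the `(passed, message)` return shape are assumptions chosen for illustration; they are not the selectools API.

```python
import json
from typing import Optional, Tuple

# Hypothetical result shape: (passed, failure_message)
Result = Tuple[bool, str]


def check_contains(output: str, expected: str) -> Result:
    """Pass when `expected` appears in the output, ignoring case."""
    if expected.lower() in output.lower():
        return True, ""
    return False, f"missing substring {expected!r}"


def check_json_validity(output: str) -> Result:
    """Pass when the whole output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    return True, ""


def check_length(output: str, min_len: int = 0, max_len: Optional[int] = None) -> Result:
    """Pass when the character count falls inside [min_len, max_len]."""
    size = len(output)
    if size < min_len:
        return False, f"too short: {size} < {min_len}"
    if max_len is not None and size > max_len:
        return False, f"too long: {size} > {max_len}"
    return True, ""
```

Checks like these need no model call, so they stay cheap and give the same verdict on every run.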

Infrastructure:

EvalSuite + TestCase — declarative assertions, concurrent execution
EvalReport — accuracy, latency p50/p95/p99, cost, weighted scoring, tag filtering
DatasetLoader — load test cases from JSON/YAML files
BaselineStore + RegressionResult — save baselines, detect regressions across runs
HTML report — self-contained dark-theme dashboard
JUnit XML — CI integration for GitHub Actions, Jenkins, GitLab CI
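
As a rough illustration of the report metrics listed above, the sketch below computes accuracy and latency percentiles with the nearest-rank method. The helper names and the choice of nearest-rank are assumptions; the PR does not say how EvalReport computes its percentiles.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], passed: Sequence[bool]) -> Dict[str, float]:
    """Aggregate per-case results into a small report dictionary (hypothetical shape)."""
    return {
        "accuracy": sum(passed) / len(passed),
        "latency_p50": percentile(latencies_ms, 50),
        "latency_p95": percentile(latencies_ms, 95),
        "latency_p99": percentile(latencies_ms, 99),
    }


report = summarize(
    latencies_ms=[100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
    passed=[True] * 8 + [False] * 2,
)
```

With only ten samples, p95 and p99 both land on the slowest case, which is worth keeping in mind when reading tail latencies from small suites.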

110 new tests (total: 1676). Zero external dependencies.

Test plan

All 110 eval tests pass
Full test suite passes (1676 total)
All pre-commit hooks pass (black, isort, flake8, mypy, bandit)
Package imports cleanly (from selectools import EvalSuite, TestCase, EvalReport)

New selectools.evals package with:

- EvalSuite: orchestrates running test cases against an agent
- TestCase: declarative assertions (expect_tool, expect_contains, expect_output, expect_parsed, latency/cost/iteration thresholds, custom evaluators)
- 6 built-in evaluators: ToolUse, Contains, Output, StructuredOutput, Performance, Custom
- EvalReport: accuracy, latency p50/p95/p99, cost, weighted scoring, tag filtering, failure breakdown by evaluator
- DatasetLoader: load test cases from JSON/YAML files
- BaselineStore + RegressionResult: save baselines, detect regressions across runs
- HTML report: self-contained dark-theme report with summary dashboard
- JUnit XML: CI integration for GitHub Actions, Jenkins, GitLab CI

64 new tests (total: 1630). Zero external dependencies.
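
The baseline-and-regression idea in this commit can be sketched with a JSON file and a set comparison. Everything here (function names, the `name -> passed` mapping, the file layout) is a hypothetical stand-in, not the BaselineStore or RegressionResult interface.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def save_baseline(path: Path, results: Dict[str, bool]) -> None:
    """Persist per-case pass/fail results from a known-good run."""
    path.write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")


def find_regressions(path: Path, current: Dict[str, bool]) -> List[str]:
    """Return the names of cases that passed in the baseline but fail now."""
    baseline = json.loads(path.read_text(encoding="utf-8"))
    return sorted(name for name, ok in current.items() if baseline.get(name) and not ok)


with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_baseline(baseline_path, {"greeting": True, "refund_policy": True, "math": False})
    regressions = find_regressions(
        baseline_path,
        {"greeting": True, "refund_policy": False, "math": False, "new_case": False},
    )
```

In this sketch a case that was already failing, or that is new since the baseline, is not counted as a regression; only a pass-to-fail transition is.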

Deterministic evaluators (12):

- ToolUse, Contains, Output, StructuredOutput, Performance, Custom
- NEW: JsonValidity, Length, StartsWith, EndsWith, PIILeak, InjectionResistance

LLM-as-judge evaluators (10), usable with any Provider, zero external deps:

- LLMJudge (generic rubric), Correctness, Relevance, Faithfulness, Hallucination, Toxicity, Coherence, Completeness, Bias, Summary

TestCase gains new fields: reference, context, rubric (for LLM judges), expect_json, expect_starts/ends_with, expect_min/max_length, expect_no_pii, expect_no_injection

110 tests (was 64). All passing.
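
For readers unfamiliar with the LLM-as-judge pattern, a minimal sketch follows. The provider is modelled as a plain `Callable[[str], str]`, and the prompt wording, score scale, and threshold are all assumptions for illustration; the real evaluators and the selectools Provider interface may look quite different.

```python
import re
from typing import Callable, Tuple

# Hypothetical prompt template; the 1-5 scale is an assumption.
JUDGE_PROMPT = (
    "You are grading an answer against a rubric.\n"
    "Rubric: {rubric}\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with a line of the form 'SCORE: <1-5>' followed by a short reason."
)


def judge(
    complete: Callable[[str], str],
    answer: str,
    reference: str,
    rubric: str,
    threshold: int = 4,
) -> Tuple[bool, int]:
    """Ask the judge model for a score and compare it against a pass threshold."""
    reply = complete(JUDGE_PROMPT.format(rubric=rubric, reference=reference, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        return False, 0  # an unparseable verdict is treated as a failure
    score = int(match.group(1))
    return score >= threshold, score


def fake_provider(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "SCORE: 4\nThe answer matches the reference."


verdict = judge(fake_provider, answer="Paris", reference="Paris", rubric="Is the answer correct?")
```

Treating an unparseable reply as a failure rather than a pass keeps a misbehaving judge from silently inflating accuracy.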

Security:

- Path traversal in JsonFileSessionStore: validate session_id (#9)
- Unicode homoglyph bypass in injection screening: NFKD + zero-width strip + homoglyph map (#13)

Data integrity:

- FileKnowledgeStore._save_all() atomic write via tmp + os.replace (#10)
- JsonFileSessionStore.save() atomic write (#31)

Agent core:

- astream() uses self._effective_model (was self.config.model) (#1)
- Sync _check_policy rejects async confirm_action with clear error (#2)
- Sync _streaming_call isinstance(chunk, str) guard (#18)

Providers:

- FallbackProvider stream()/astream() record success after consumption, not before; circuit breaker now works for streaming (#3)
- Gemini response.text ValueError catch for tool-call-only responses (#4)

Tools:

- aexecute() uses run_in_executor(None) shared executor (#5)
- execute() awaits coroutines from async tools via asyncio.run (#6)

RAG:

- Hybrid search O(n²) → O(1) via text_to_key dict lookup (#7)
- SQLiteVectorStore thread safety + WAL mode (#8)

Evals:

- OutputEvaluator catches re.error on invalid regex (#11)
- JsonValidityEvaluator respects expect_json=False (#12)

16 new regression tests. Full suite: 2000 passed.

@tool

…+ fixed

Final thorough audit pass after the user asked "is there anything you feel even 1% not confident about?" with explicit instruction to verify AND fix everything. Nine residual concerns were addressed; two surfaced real shipping blockers that isolated testing had not caught.

Verified as not a regression (no code change needed):

- #12 RAGTool descriptor pickling: function-based @tool() also fails to serialize for the same reason (decorator replaces function in the module namespace). Pickling Tools/Agents has never been supported in selectools; only cache_redis.py uses pickle, and only for (Message, UsageStats) tuples. Documented the limitation in RAGTool's class docstring along with a thread-safety note.

Fixes landed:

Bug 9: Langfuse 3.x rewrite (real shipping blocker)
---------------------------------------------------
mypy caught ``"Langfuse" has no attribute "trace"`` in src/selectools/observe/langfuse.py:65. Langfuse 3.x removed the top-level Langfuse.trace() / trace.generation() / trace.span() / trace.update() API and replaced it with start_span() / start_generation() / update_current_trace() / update_current_span(). The existing selectools LangfuseObserver was written for 2.x and would crash at runtime on every call against Langfuse 3.x (which pyproject.toml's langfuse>=2.0.0 constraint does not exclude).

The existing mock-based test_langfuse_observer.py never caught it because mocks accept any method call. The e2e test in tests/test_e2e_langfuse_observer.py skipped due to missing LANGFUSE_PUBLIC_KEY env var, so the real code path had never executed.
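
The point that "mocks accept any method call" is easy to reproduce with the standard library alone. In the sketch below, `ClientV3` and `legacy_observer` are hypothetical stand-ins (not the Langfuse client or the selectools observer); the contrast is between an unconstrained `MagicMock` and one built with `spec=`.

```python
from unittest.mock import MagicMock


class ClientV3:
    """Hypothetical client whose old trace() method has been removed."""

    def start_span(self, name: str) -> str:
        return f"span:{name}"


def legacy_observer(client) -> None:
    """Code still written against the removed API."""
    client.trace(name="run")


# An unconstrained mock invents trace() on the fly, so the stale call goes unnoticed.
loose = MagicMock()
legacy_observer(loose)

# A spec'd mock only exposes attributes that exist on ClientV3.
strict = MagicMock(spec=ClientV3)
try:
    legacy_observer(strict)
    caught = False
except AttributeError:
    caught = True
```

Building mocks with `spec=` (or `create_autospec`) against the installed client class is one way such an API removal could surface in unit tests without credentials for an end-to-end run.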
- Rewrote src/selectools/observe/langfuse.py for Langfuse 3.x API:
  - on_run_start now creates a root span via client.start_span()
  - child generations and spans use root.start_generation() / root.start_span() (which attach to the same trace)
  - usage info moved from usage= to usage_details=, with new cost_details= for dollar cost
  - every span now calls .end() explicitly since Langfuse 3.x is context-manager oriented
  - root span finalization uses update_trace() + update() + end()
- Updated 4 affected mock tests in tests/test_langfuse_observer.py to the v3 API (client.start_span, root.start_generation, root.start_span). 19 Langfuse mock tests now pass.

#13 image_url e2e regression coverage
-------------------------------------
Added TestMultimodalRealProvidersImageUrl in tests/test_e2e_multimodal.py with three new tests (one per provider) that send https://github.githubassets.com/favicons/favicon.png through the ContentPart(type="image_url") path. Verified that OpenAI, Anthropic, and Gemini all return "GitHub" in their reply. GitHub's CDN serves bot User-Agents unlike Wikipedia's CDN, which is documented separately in the MULTIMODAL.md URL-reachability warning.

#14 CHANGELOG clarification
---------------------------
Added a "Note on the three latent bugs below" block before the Fixed section explaining that bugs 6, 7, 8 (RAGTool @tool() on methods and both multimodal content_parts drops) were pre-existing in earlier releases but never surfaced because no test actually exercised them end-to-end. This pre-empts the reasonable reader question "why didn't earlier users report these?".

#15 Pre-existing broken mkdocs anchors
--------------------------------------
- QUICKSTART.md: #code-tools-2--v0210 (double dash) was wrong. mkdocs Material slugifies the em-dash in "Code Tools (2) — v0.21.0" to a single hyphen, producing code-tools-2-v0210. Fixed the link.
- PARSER.md: both #parsing-strategy and #json-extraction anchors were broken because a stray unbalanced 3-backtick fence at line 124 was greedy-pairing with line 128, shifting every downstream fence pair by one and accidentally wrapping ## Parsing Strategy and ## JSON Extraction inside a code block. Deleting line 124 plus converting one 4-backtick close on line 205 to a 3-backtick close rebalanced all the fences. Both headings now render as real h2 elements and the TOC anchors resolve.

mkdocs build: zero broken-anchor warnings.

#16 README relative docs/ links
-------------------------------
README.md is outside docs/ and must use absolute GitHub URLs per docs/CLAUDE.md. Batch-converted all 37 ](docs/*.md) relative links to ](https://github.com/johnnichev/selectools/blob/main/docs/*.md).

#17 Pre-existing mypy errors: all 46 fixed, mypy src/ is now clean
------------------------------------------------------------------
Success: no issues found in 150 source files.

- 20 no-any-return errors across 13 files: added # type: ignore[no-any-return] with explanatory context. These were all external-library Any leaks (json.loads, dict.get on Any, psycopg2, ollama client, openai SDK returns, etc.) where the runtime type is correct but the type-stub exposure is Any.
- 14 no-untyped-def errors in observer.py SimpleStepObserver graph callbacks (lines 1634-1676): added full type annotations matching the AgentObserver base class signatures (str/int/float/Exception/List[str] per event). Fixed one Liskov substitution violation where my initial annotation used List[str] for new_plan but the base class uses str.
- 8 no-untyped-def errors in serve/app.py BaseHTTPRequestHandler methods (do_GET, do_POST, do_OPTIONS, _json_response, _html_response, log_message, handle_stream, _stream): added -> None returns and Any / str parameter types. Imported Iterator and AsyncIterator from typing.
- pipeline.py:439 astream: added -> AsyncIterator[Any].
- observe/trace_store.py:349 _iter_entries: added -> Iterator[Dict[str, Any]].
- agent/config.py:215 _unpack nested helper: added (Any, type) -> Any.
- trace.py:506: ``dataclasses.asdict`` was rejecting ``DataclassInstance | type[DataclassInstance]`` (too wide). Narrowed with ``not isinstance(obj, type)`` so mypy sees a non-type dataclass.
- providers/_openai_compat.py:560: expanded existing # type: ignore from [return-value] to [return-value,no-any-return] to cover the second error code.
- serve/_starlette_app.py:105: eval_dashboard was declared to return HTMLResponse but the unauth-redirect branch returns a RedirectResponse. Widened the return type to Response to match the neighbouring handlers (builder, provider_health).

#18 Landing page feature content for v0.21.0
---------------------------------------------
Three text-only bento card updates (no layout changes):

- RAG card: "4 store backends" → "7 store backends" with the full list enumerated plus CSV/JSON/HTML/URL loaders mentioned.
- Toolbox card: added explicit v0.21.0 additions (Python + shell execution, DuckDuckGo search, GitHub REST API, SQLite + Postgres).
- Audit card retitled to "Audit + observability" and expanded to mention OTelObserver (GenAI semantic conventions) and LangfuseObserver as the new v0.21.0 shipping surfaces for trace export to Datadog / Jaeger / Langfuse Cloud / any OTLP backend.

#19 FAISS variant of App 3 Knowledge Base Librarian
---------------------------------------------------
Added TestApp3b_KnowledgeBaseLibrarianFAISS in tests/test_e2e_v0_21_0_apps.py: the same CSV + JSON + HTML librarian persona but backed by FAISSVectorStore instead of Qdrant. Runnable without Docker, and with different anchor phrases (OSPREY-88, CRESCENT, AURORA-SOUTH) so it doesn't shadow the Qdrant variant when both run. Three tests, all passing against real OpenAI embeddings + real OpenAI gpt-4o-mini.
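
The trace.py narrowing described under #17 above relies on a stdlib detail: `dataclasses.is_dataclass()` is true for both a dataclass type and its instances, while `dataclasses.asdict()` accepts instances only. A minimal sketch, with `Step` and `to_dict` as hypothetical names rather than anything from selectools:

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class Step:
    name: str
    duration_ms: float


def to_dict(obj: Any) -> Dict[str, Any]:
    """Convert a dataclass instance to a dict, rejecting classes and other objects."""
    # The isinstance(obj, type) guard excludes the class itself, leaving only
    # instances, which is the case asdict() is defined for.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"expected a dataclass instance, got {type(obj).__name__}")


as_mapping = to_dict(Step(name="plan", duration_ms=12.5))
```

The same guard serves the type checker and the runtime: without it, passing the class `Step` itself would reach `asdict()` and raise there.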
#20 RAGTool docstring notes
---------------------------
Added a "Notes" block to RAGTool explaining:

- Thread safety: the vector store handles its own locking, but mutating top_k / score_threshold / include_scores after attaching to an Agent is not thread-safe.
- Cross-process serialization: not supported, same reason function-based @tool() tools aren't supported.

Verification
------------
- mypy src/: Success: no issues found in 150 source files
- Full non-e2e suite: 4961 passed, 3 skipped, 248 deselected (+9 from new image_url + async multimodal + FAISS librarian tests), 0 regressions
- Full e2e suite with Qdrant + Postgres running: 70 collected, 64 passed, 6 skipped (Azure x2 + Langfuse x1 credential-dependent + 3 Qdrant tests when the container isn't running), 0 failures
- mkdocs build: zero broken-anchor warnings (QUICKSTART + PARSER both clean now)
- diff CHANGELOG.md docs/CHANGELOG.md: byte-identical

johnnichev added 2 commits March 21, 2026 22:23

johnnichev merged commit 2ab1d28 into main Mar 22, 2026

johnnichev deleted the feat/eval-framework branch March 22, 2026 02:56

johnnichev merged 2 commits into main from feat/eval-framework
