Skip to content

feat: eval framework — 22 built-in evaluators, LLM-as-judge, regression detection#13

Merged
johnnichev merged 2 commits intomainfrom
feat/eval-framework
Mar 22, 2026
Merged

feat: eval framework — 22 built-in evaluators, LLM-as-judge, regression detection#13
johnnichev merged 2 commits intomainfrom
feat/eval-framework

Conversation

@johnnichev
Copy link
Copy Markdown
Owner

Summary

New selectools.evals package — built-in agent evaluation that ships with the library. No separate install, no SaaS account, no external dependencies.

22 evaluators (12 deterministic + 10 LLM-as-judge):

  • Deterministic: ToolUse, Contains, Output, StructuredOutput, Performance, JsonValidity, Length, StartsWith, EndsWith, PIILeak, InjectionResistance, Custom
  • LLM-as-judge: LLMJudge (generic rubric), Correctness, Relevance, Faithfulness, Hallucination, Toxicity, Coherence, Completeness, Bias, Summary

Infrastructure:

  • EvalSuite + TestCase — declarative assertions, concurrent execution
  • EvalReport — accuracy, latency p50/p95/p99, cost, weighted scoring, tag filtering
  • DatasetLoader — load test cases from JSON/YAML files
  • BaselineStore + RegressionResult — save baselines, detect regressions across runs
  • HTML report — self-contained dark-theme dashboard
  • JUnit XML — CI integration for GitHub Actions, Jenkins, GitLab CI

110 new tests (total: 1676). Zero external dependencies.

Test plan

  • All 110 eval tests pass
  • Full test suite passes (1676 total)
  • All pre-commit hooks pass (black, isort, flake8, mypy, bandit)
  • Package imports cleanly (from selectools import EvalSuite, TestCase, EvalReport)

New selectools.evals package with:

- EvalSuite: orchestrates running test cases against an agent
- TestCase: declarative assertions (expect_tool, expect_contains,
  expect_output, expect_parsed, latency/cost/iteration thresholds,
  custom evaluators)
- 6 built-in evaluators: ToolUse, Contains, Output, StructuredOutput,
  Performance, Custom
- EvalReport: accuracy, latency p50/p95/p99, cost, weighted scoring,
  tag filtering, failure breakdown by evaluator
- DatasetLoader: load test cases from JSON/YAML files
- BaselineStore + RegressionResult: save baselines, detect regressions
  across runs
- HTML report: self-contained dark-theme report with summary dashboard
- JUnit XML: CI integration for GitHub Actions, Jenkins, GitLab CI

64 new tests (total: 1630). Zero external dependencies.
Deterministic evaluators (12):
- ToolUse, Contains, Output, StructuredOutput, Performance, Custom
- NEW: JsonValidity, Length, StartsWith, EndsWith, PIILeak, InjectionResistance

LLM-as-judge evaluators (10) — use any Provider, zero external deps:
- LLMJudge (generic rubric), Correctness, Relevance, Faithfulness,
  Hallucination, Toxicity, Coherence, Completeness, Bias, Summary

TestCase gains new fields: reference, context, rubric (for LLM judges),
expect_json, expect_starts/ends_with, expect_min/max_length,
expect_no_pii, expect_no_injection

110 tests (was 64). All passing.
@johnnichev johnnichev merged commit 2ab1d28 into main Mar 22, 2026
@johnnichev johnnichev deleted the feat/eval-framework branch March 22, 2026 02:56
johnnichev added a commit that referenced this pull request Mar 24, 2026
Security:
- Path traversal in JsonFileSessionStore — validate session_id (#9)
- Unicode homoglyph bypass in injection screening — NFKD + zero-width
  strip + homoglyph map (#13)

Data integrity:
- FileKnowledgeStore._save_all() atomic write via tmp + os.replace (#10)
- JsonFileSessionStore.save() atomic write (#31)

Agent core:
- astream() uses self._effective_model (was self.config.model) (#1)
- Sync _check_policy rejects async confirm_action with clear error (#2)
- Sync _streaming_call isinstance(chunk, str) guard (#18)

Providers:
- FallbackProvider stream()/astream() record success after consumption,
  not before — circuit breaker now works for streaming (#3)
- Gemini response.text ValueError catch for tool-call-only responses (#4)

Tools:
- aexecute() uses run_in_executor(None) shared executor (#5)
- execute() awaits coroutines from async tools via asyncio.run (#6)

RAG:
- Hybrid search O(n²) → O(1) via text_to_key dict lookup (#7)
- SQLiteVectorStore thread safety + WAL mode (#8)

Evals:
- OutputEvaluator catches re.error on invalid regex (#11)
- JsonValidityEvaluator respects expect_json=False (#12)

16 new regression tests. Full suite: 2000 passed.
johnnichev added a commit that referenced this pull request Apr 8, 2026
…+ fixed

Final thorough audit pass after the user asked "is there anything you
feel even 1% not confident about?" with explicit instruction to verify
AND fix everything. Nine residual concerns were addressed; two surfaced
real shipping blockers that isolated testing had not caught.

Verified as not a regression (no code change needed):
- #12 RAGTool descriptor pickling: function-based @tool() also fails
  to serialize for the same reason (decorator replaces function in the
  module namespace). Pickling Tools/Agents has never been supported in
  selectools — only cache_redis.py uses pickle, and only for
  (Message, UsageStats) tuples. Documented the limitation in RAGTool's
  class docstring along with a thread-safety note.

Fixes landed:

Bug 9 — Langfuse 3.x rewrite (real shipping blocker)
----------------------------------------------------
mypy caught ``"Langfuse" has no attribute "trace"`` in
src/selectools/observe/langfuse.py:65. Langfuse 3.x removed the top-level
Langfuse.trace() / trace.generation() / trace.span() / trace.update()
API and replaced it with start_span() / start_generation() /
update_current_trace() / update_current_span(). The existing
selectools LangfuseObserver was written for 2.x and would crash at
runtime on every call against Langfuse 3.x (which pyproject.toml's
langfuse>=2.0.0 constraint does not exclude). The existing mock-based
test_langfuse_observer.py never caught it because mocks accept any
method call. The e2e test in tests/test_e2e_langfuse_observer.py
skipped due to missing LANGFUSE_PUBLIC_KEY env var, so the real code
path had never executed.

- Rewrote src/selectools/observe/langfuse.py for Langfuse 3.x API:
  on_run_start now creates a root span via client.start_span(); child
  generations and spans use root.start_generation() / root.start_span()
  (which attach to the same trace); usage info moved from usage= to
  usage_details=, with new cost_details= for dollar cost; every span
  now calls .end() explicitly since Langfuse 3.x is context-manager
  oriented; root span finalization uses update_trace() + update() + end().
- Updated 4 affected mock tests in tests/test_langfuse_observer.py to
  the v3 API (client.start_span, root.start_generation, root.start_span).
  19 Langfuse mock tests now pass.

#13 image_url e2e regression coverage
-------------------------------------
Added TestMultimodalRealProvidersImageUrl in
tests/test_e2e_multimodal.py with three new tests (one per provider)
that send https://github.githubassets.com/favicons/favicon.png through
the ContentPart(type="image_url") path. Verified that OpenAI, Anthropic,
and Gemini all return "GitHub" in their reply. GitHub's CDN serves bot
User-Agents unlike Wikipedia's CDN, which is documented separately in
the MULTIMODAL.md URL-reachability warning.

#14 CHANGELOG clarification
---------------------------
Added a "Note on the three latent bugs below" block before the Fixed
section explaining that bugs 6, 7, 8 (RAGTool @tool() on methods and
both multimodal content_parts drops) were pre-existing in earlier
releases but never surfaced because no test actually exercised them
end-to-end. This pre-empts the reasonable reader question "why didn't
earlier users report these?".

#15 Pre-existing broken mkdocs anchors
--------------------------------------
- QUICKSTART.md: #code-tools-2--v0210 (double dash) was wrong. mkdocs
  Material slugifies the em-dash in "Code Tools (2) — v0.21.0" to a
  single hyphen, producing code-tools-2-v0210. Fixed the link.
- PARSER.md: both #parsing-strategy and #json-extraction anchors were
  broken because a stray unbalanced 3-backtick fence at line 124 was
  greedy-pairing with line 128, shifting every downstream fence pair by
  one and accidentally wrapping ## Parsing Strategy and ## JSON
  Extraction inside a code block. Deleting line 124 plus converting one
  4-backtick close on line 205 to a 3-backtick close rebalanced all the
  fences. Both headings now render as real h2 elements and the
  TOC anchors resolve. mkdocs build: zero broken-anchor warnings.

#16 README relative docs/ links
-------------------------------
README.md is outside docs/ and must use absolute GitHub URLs per
docs/CLAUDE.md. Batch-converted all 37 ](docs/*.md) relative links to
](https://github.com/johnnichev/selectools/blob/main/docs/*.md).

#17 Pre-existing mypy errors — all 46 fixed, mypy src/ is now clean
------------------------------------------------------------------
Success: no issues found in 150 source files.

- 20 no-any-return errors across 13 files: added
  # type: ignore[no-any-return] with explanatory context. These were
  all external-library Any leaks (json.loads, dict.get on Any, psycopg2,
  ollama client, openai SDK returns, etc.) where the runtime type is
  correct but the type-stub exposure is Any.
- 14 no-untyped-def errors in observer.py SimpleStepObserver graph
  callbacks (lines 1634-1676): added full type annotations matching the
  AgentObserver base class signatures (str/int/float/Exception/List[str]
  per event). Fixed one Liskov substitution violation where my initial
  annotation used List[str] for new_plan but the base class uses str.
- 8 no-untyped-def errors in serve/app.py BaseHTTPRequestHandler methods
  (do_GET, do_POST, do_OPTIONS, _json_response, _html_response,
  log_message, handle_stream, _stream): added -> None returns and Any /
  str parameter types. Imported Iterator and AsyncIterator from typing.
- pipeline.py:439 astream: added -> AsyncIterator[Any].
- observe/trace_store.py:349 _iter_entries: added -> Iterator[Dict[str, Any]].
- agent/config.py:215 _unpack nested helper: added (Any, type) -> Any.
- trace.py:506: ``dataclasses.asdict`` was rejecting
  ``DataclassInstance | type[DataclassInstance]`` (too wide). Narrowed
  with ``not isinstance(obj, type)`` so mypy sees a non-type dataclass.
- providers/_openai_compat.py:560: expanded existing # type: ignore
  from [return-value] to [return-value,no-any-return] to cover the
  second error code.
- serve/_starlette_app.py:105: eval_dashboard was declared to return
  HTMLResponse but the unauth-redirect branch returns a RedirectResponse.
  Widened the return type to Response to match the neighbouring
  handlers (builder, provider_health).

#18 Landing page feature content for v0.21.0
---------------------------------------------
Three text-only bento card updates (no layout changes):

- RAG card: "4 store backends" → "7 store backends" with the full list
  enumerated plus CSV/JSON/HTML/URL loaders mentioned.
- Toolbox card: added explicit v0.21.0 additions (Python + shell
  execution, DuckDuckGo search, GitHub REST API, SQLite + Postgres).
- Audit card retitled to "Audit + observability" and expanded to
  mention OTelObserver (GenAI semantic conventions) and
  LangfuseObserver as the new v0.21.0 shipping surfaces for trace
  export to Datadog / Jaeger / Langfuse Cloud / any OTLP backend.

#19 FAISS variant of App 3 Knowledge Base Librarian
---------------------------------------------------
Added TestApp3b_KnowledgeBaseLibrarianFAISS in
tests/test_e2e_v0_21_0_apps.py — the same CSV + JSON + HTML librarian
persona but backed by FAISSVectorStore instead of Qdrant. Runnable
without Docker, and with different anchor phrases (OSPREY-88,
CRESCENT, AURORA-SOUTH) so it doesn't shadow the Qdrant variant when
both run. Three tests, all passing against real OpenAI embeddings +
real OpenAI gpt-4o-mini.

#20 RAGTool docstring notes
---------------------------
Added a "Notes" block to RAGTool explaining:
- Thread safety: the vector store handles its own locking, but mutating
  top_k / score_threshold / include_scores after attaching to an Agent
  is not thread-safe.
- Cross-process serialization: not supported, same reason function-based
  @tool() tools aren't supported.

Verification
------------
- mypy src/: Success: no issues found in 150 source files
- Full non-e2e suite: 4961 passed, 3 skipped, 248 deselected (+9 from
  new image_url + async multimodal + FAISS librarian tests), 0 regressions
- Full e2e suite with Qdrant + Postgres running: 70 collected, 64 passed,
  6 skipped (Azure x2 + Langfuse x1 credential-dependent + 3 Qdrant
  tests when the container isn't running), 0 failures
- mkdocs build: zero broken-anchor warnings (QUICKSTART + PARSER both
  clean now)
- diff CHANGELOG.md docs/CHANGELOG.md: byte-identical
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant