test: comprehensive E2E tests for eval framework (73 new, 211 total)#17

Merged: johnnichev merged 1 commit into main from test/eval-comprehensive on Mar 22, 2026

@johnnichev
Owner

Summary

73 new E2E tests covering every eval framework feature with real Agent instances (no mocks). Total eval tests: 211.

Coverage by module:

| Module | E2E Tests |
|---|---|
| EvalSuite (sync, async, concurrent, errors) | 10 |
| 12 deterministic evaluators | 19 |
| 10 LLM-as-judge evaluators | 12 |
| HTML report + SVG charts | 6 |
| JUnit XML structure | 2 |
| Dataset → Report pipeline | 1 |
| Regression detection | 2 |
| Pairwise A/B comparison | 3 |
| Snapshot testing | 2 |
| Badge generation | 3 |
| Synthetic test generation | 1 |
| CLI | 2 |
| Live dashboard | 2 |
| Report statistics | 6 |
| Edge cases (unicode, weighted, multi-assert) | 2 |

Bug fix: suite.py now reads model from agent.config.model instead of agent._model.
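
The fix can be pictured with a minimal sketch. `AgentConfig`, `Agent`, and `resolve_model` below are hypothetical stand-ins for illustration, not the selectools classes; the only point is that the model name is read from the public config object rather than from a private attribute that may not exist.

```python
from dataclasses import dataclass


@dataclass
class AgentConfig:
    # Hypothetical stand-in for the agent's public configuration object.
    model: str = "fake-model"


class Agent:
    # Hypothetical stand-in: exposes `config` and has no private `_model`.
    def __init__(self, config: AgentConfig) -> None:
        self.config = config


def resolve_model(agent: Agent) -> str:
    """Read the model name from the public config, not a private attribute."""
    return agent.config.model


agent = Agent(AgentConfig(model="gpt-4o-mini"))
print(resolve_model(agent))
print(hasattr(agent, "_model"))
```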

Test plan

  • All 211 eval tests pass
  • Full test suite passes
  • Pre-commit hooks pass

End-to-end tests using real Agent with SharedFakeProvider and
SharedToolCallProvider (no mocks). Covers every gap from audit:

- EvalSuite: basic run, tool calls, mixed results, empty cases,
  concurrency, progress callback, error handling, async run, tags
- All 12 deterministic evaluators E2E with real Agent
- All 10 LLM evaluators E2E with SharedFakeProvider as judge
- HTML report: full render, donut SVG, histogram SVG, error cases
- JUnit XML: structure validation, failure/error elements
- Dataset → Suite → Report → Export pipeline
- Regression detection with baseline save/compare
- Pairwise A/B comparison with real agents
- Snapshot testing: create, compare, detect changes
- Badge generation from real eval runs
- Synthetic test case generation
- CLI help verification
- Live dashboard HTML validation
- Unicode content, weighted accuracy, multiple assertions per case,
  tag filtering, report statistics

Fix: suite.py reads model from agent.config.model (not agent._model)

Total eval tests: 211 (was 138)
@johnnichev johnnichev merged commit 16a7200 into main Mar 22, 2026
@johnnichev johnnichev deleted the test/eval-comprehensive branch March 22, 2026 11:49
johnnichev added a commit that referenced this pull request Mar 24, 2026
Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with
  trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)
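
The bool rejection (#20) is needed because `bool` is a subclass of `int` in Python, so a plain `isinstance` check lets `True` through. A minimal sketch, assuming a hypothetical `validate_number` helper (not the selectools validator) and assuming ints are acceptable where a float is expected:

```python
from typing import Any


def validate_number(value: Any, expected: type) -> bool:
    """Return True if `value` is acceptable for an int or float parameter."""
    # bool is a subclass of int, so isinstance(True, int) is True.
    # Reject it explicitly before the general check.
    if isinstance(value, bool):
        return False
    if expected is float:
        # Assumption of this sketch: ints are accepted for float parameters.
        return isinstance(value, (int, float))
    return isinstance(value, expected)


print(validate_number(3, int))
print(validate_number(True, int))
```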

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)
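
Two of the RAG fixes above (#29 and #30) are simple enough to sketch. The helpers below are hypothetical illustrations, not the selectools code: `search` fetches `top_k * 4` candidates when a filter is present so that post-filtering can still return `top_k` results, and `batched` splits a list of texts into chunks of at most 2048 items.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Doc = Tuple[str, float, Dict[str, str]]  # (id, score, metadata)


def search(
    ranked: List[Doc],
    top_k: int,
    metadata_filter: Optional[Callable[[Dict[str, str]], bool]] = None,
) -> List[Doc]:
    """Return up to top_k docs from a list already sorted by score."""
    # Overfetch 4x when filtering, since the filter discards candidates.
    fetch_k = top_k * 4 if metadata_filter else top_k
    candidates = ranked[:fetch_k]
    if metadata_filter:
        candidates = [doc for doc in candidates if metadata_filter(doc[2])]
    return candidates[:top_k]


def batched(items: List[str], batch_size: int = 2048) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


ranked = [
    (f"d{i}", 1.0 - i * 0.1, {"lang": "en" if i % 2 == 0 else "fr"})
    for i in range(8)
]
french = search(ranked, top_k=2, metadata_filter=lambda m: m["lang"] == "fr")
print([doc[0] for doc in french])
print([len(b) for b in batched(["x"] * 5000)])
```

Without the overfetch, the filtered query in this example would see only `d0` and `d1` and return a single French document.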

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)
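
For the WAL change (#33), a minimal sketch with the standard-library `sqlite3` module is shown here. The `open_wal_connection` helper and the file name are assumptions for illustration; the actual SQLiteSessionStore code may differ. Note that WAL needs a file-backed database, since an in-memory database reports `memory` instead.

```python
import os
import sqlite3
import tempfile


def open_wal_connection(path: str) -> sqlite3.Connection:
    """Open a SQLite database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal_mode={mode}")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_wal_connection(os.path.join(tmp, "sessions.db"))
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    conn.close()

print(mode)
```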

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)
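
The donut fix (#39) reflects a general SVG constraint: an arc whose start and end points coincide is not rendered, so a full 360° ring cannot be drawn with a single `A` command. A minimal sketch of the two-semicircle approach, using a hypothetical `full_circle_path` helper rather than the report's actual code:

```python
def full_circle_path(cx: float, cy: float, r: float) -> str:
    """Build an SVG path for a full circle out of two 180-degree arcs."""
    top_x, top_y = cx, cy - r
    bottom_x, bottom_y = cx, cy + r
    return (
        f"M {top_x:g} {top_y:g} "
        # First semicircle: top to bottom, clockwise (sweep-flag = 1).
        f"A {r:g} {r:g} 0 0 1 {bottom_x:g} {bottom_y:g} "
        # Second semicircle: bottom back to top, same direction.
        f"A {r:g} {r:g} 0 0 1 {top_x:g} {top_y:g}"
    )


print(full_circle_path(50, 50, 40))
```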

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)
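
The "consistent separators" requirement (#41) can be expressed with a regex backreference. The pattern below is an illustrative assumption, not the guardrail's actual regex; in particular, whether the real pattern also accepts the separator-free form is not stated in the commit.

```python
import re

# Group 1 captures the first separator ("-", " ", or nothing);
# \1 forces the second separator to be identical.
SSN_PATTERN = re.compile(r"\b\d{3}([- ]?)\d{2}\1\d{4}\b")


def contains_ssn(text: str) -> bool:
    """Return True if text contains an SSN-like number with consistent separators."""
    return SSN_PATTERN.search(text) is not None


for sample in ["123-45-6789", "123 45 6789", "123-45 6789"]:
    print(sample, contains_ssn(sample))
```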

Full suite: 2013 passed.
johnnichev added a commit that referenced this pull request Apr 8, 2026
…+ fixed

Final thorough audit pass after the user asked "is there anything you
feel even 1% not confident about?" with explicit instruction to verify
AND fix everything. Nine residual concerns were addressed; two surfaced
real shipping blockers that isolated testing had not caught.

Verified as not a regression (no code change needed):
- #12 RAGTool descriptor pickling: function-based @tool() also fails
  to serialize for the same reason (decorator replaces function in the
  module namespace). Pickling Tools/Agents has never been supported in
  selectools — only cache_redis.py uses pickle, and only for
  (Message, UsageStats) tuples. Documented the limitation in RAGTool's
  class docstring along with a thread-safety note.

Fixes landed:

Bug 9 — Langfuse 3.x rewrite (real shipping blocker)
----------------------------------------------------
mypy caught ``"Langfuse" has no attribute "trace"`` in
src/selectools/observe/langfuse.py:65. Langfuse 3.x removed the top-level
Langfuse.trace() / trace.generation() / trace.span() / trace.update()
API and replaced it with start_span() / start_generation() /
update_current_trace() / update_current_span(). The existing
selectools LangfuseObserver was written for 2.x and would crash at
runtime on every call against Langfuse 3.x (which pyproject.toml's
langfuse>=2.0.0 constraint does not exclude). The existing mock-based
test_langfuse_observer.py never caught it because mocks accept any
method call. The e2e test in tests/test_e2e_langfuse_observer.py
skipped due to missing LANGFUSE_PUBLIC_KEY env var, so the real code
path had never executed.
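
The failure mode described above, where mocks accept any method call, is easy to reproduce with `unittest.mock`. In this sketch `ClientV3` is a hypothetical stand-in for a client whose old `trace()` method no longer exists; it is not the Langfuse SDK. A bare `MagicMock` silently accepts the removed call, while a spec'd mock raises `AttributeError`.

```python
from unittest.mock import MagicMock, create_autospec


class ClientV3:
    """Hypothetical client: `start_span` exists, the old `trace` does not."""

    def start_span(self, name: str) -> str:
        return f"span:{name}"


# A bare MagicMock fabricates attributes on access, so the removed
# method appears to work and the test passes.
loose = MagicMock()
loose.trace(name="run")

# A mock built from the real class only exposes attributes that exist.
strict = create_autospec(ClientV3, instance=True)
strict.start_span(name="run")

try:
    strict.trace(name="run")
    rejected = False
except AttributeError:
    rejected = True

print(rejected)
```

Building mocks from a spec is one way to let mock-based tests catch removed methods; it does not replace running the end-to-end test against the real client.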

- Rewrote src/selectools/observe/langfuse.py for Langfuse 3.x API:
  on_run_start now creates a root span via client.start_span(); child
  generations and spans use root.start_generation() / root.start_span()
  (which attach to the same trace); usage info moved from usage= to
  usage_details=, with new cost_details= for dollar cost; every span
  now calls .end() explicitly since Langfuse 3.x is context-manager
  oriented; root span finalization uses update_trace() + update() + end().
- Updated 4 affected mock tests in tests/test_langfuse_observer.py to
  the v3 API (client.start_span, root.start_generation, root.start_span).
  19 Langfuse mock tests now pass.

#13 image_url e2e regression coverage
-------------------------------------
Added TestMultimodalRealProvidersImageUrl in
tests/test_e2e_multimodal.py with three new tests (one per provider)
that send https://github.githubassets.com/favicons/favicon.png through
the ContentPart(type="image_url") path. Verified that OpenAI, Anthropic,
and Gemini all return "GitHub" in their reply. Unlike Wikipedia's CDN,
GitHub's CDN serves requests from bot User-Agents; the Wikipedia
limitation is documented separately in the MULTIMODAL.md
URL-reachability warning.

#14 CHANGELOG clarification
---------------------------
Added a "Note on the three latent bugs below" block before the Fixed
section explaining that bugs 6, 7, 8 (RAGTool @tool() on methods and
both multimodal content_parts drops) were pre-existing in earlier
releases but never surfaced because no test actually exercised them
end-to-end. This pre-empts the reasonable reader question "why didn't
earlier users report these?".

#15 Pre-existing broken mkdocs anchors
--------------------------------------
- QUICKSTART.md: #code-tools-2--v0210 (double dash) was wrong. mkdocs
  Material slugifies the em-dash in "Code Tools (2) — v0.21.0" to a
  single hyphen, producing code-tools-2-v0210. Fixed the link.
- PARSER.md: both #parsing-strategy and #json-extraction anchors were
  broken because a stray unbalanced 3-backtick fence at line 124 was
  greedy-pairing with line 128, shifting every downstream fence pair by
  one and accidentally wrapping ## Parsing Strategy and ## JSON
  Extraction inside a code block. Deleting line 124 plus converting one
  4-backtick close on line 205 to a 3-backtick close rebalanced all the
  fences. Both headings now render as real h2 elements and the
  TOC anchors resolve. mkdocs build: zero broken-anchor warnings.

#16 README relative docs/ links
-------------------------------
README.md is outside docs/ and must use absolute GitHub URLs per
docs/CLAUDE.md. Batch-converted all 37 ](docs/*.md) relative links to
](https://github.com/johnnichev/selectools/blob/main/docs/*.md).
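
A batch conversion like this can be done with a single regex substitution. The sketch below is an assumption about how it might be scripted, not the command actually used; `absolutize` is a hypothetical helper, and the base URL is the one quoted above.

```python
import re

BASE = "https://github.com/johnnichev/selectools/blob/main/"

# Match "](docs/<path>.md)" with an optional "#anchor"; links that are
# already absolute do not start with "](docs/" and are left alone.
RELATIVE_DOC_LINK = re.compile(r"\]\((docs/[^)\s]+\.md(?:#[^)\s]*)?)\)")


def absolutize(markdown: str) -> str:
    """Rewrite relative docs/*.md link targets to absolute GitHub URLs."""
    return RELATIVE_DOC_LINK.sub(lambda m: f"]({BASE}{m.group(1)})", markdown)


text = "See [Quickstart](docs/QUICKSTART.md) and [x](https://example.com/docs/a.md)."
print(absolutize(text))
```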

#17 Pre-existing mypy errors — all 46 fixed, mypy src/ is now clean
------------------------------------------------------------------
Success: no issues found in 150 source files.

- 20 no-any-return errors across 13 files: added
  # type: ignore[no-any-return] with explanatory context. These were
  all external-library Any leaks (json.loads, dict.get on Any, psycopg2,
  ollama client, openai SDK returns, etc.) where the runtime type is
  correct but the type-stub exposure is Any.
- 14 no-untyped-def errors in observer.py SimpleStepObserver graph
  callbacks (lines 1634-1676): added full type annotations matching the
  AgentObserver base class signatures (str/int/float/Exception/List[str]
  per event). Fixed one Liskov substitution violation where my initial
  annotation used List[str] for new_plan but the base class uses str.
- 8 no-untyped-def errors in serve/app.py BaseHTTPRequestHandler methods
  (do_GET, do_POST, do_OPTIONS, _json_response, _html_response,
  log_message, handle_stream, _stream): added -> None returns and Any /
  str parameter types. Imported Iterator and AsyncIterator from typing.
- pipeline.py:439 astream: added -> AsyncIterator[Any].
- observe/trace_store.py:349 _iter_entries: added -> Iterator[Dict[str, Any]].
- agent/config.py:215 _unpack nested helper: added (Any, type) -> Any.
- trace.py:506: ``dataclasses.asdict`` was rejecting
  ``DataclassInstance | type[DataclassInstance]`` (too wide). Narrowed
  with ``not isinstance(obj, type)`` so mypy sees a non-type dataclass.
- providers/_openai_compat.py:560: expanded existing # type: ignore
  from [return-value] to [return-value,no-any-return] to cover the
  second error code.
- serve/_starlette_app.py:105: eval_dashboard was declared to return
  HTMLResponse but the unauth-redirect branch returns a RedirectResponse.
  Widened the return type to Response to match the neighbouring
  handlers (builder, provider_health).
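
The `dataclasses.asdict` narrowing from the trace.py item can be illustrated in isolation. `dataclasses.is_dataclass` returns True for both a dataclass type and an instance, while `asdict` accepts only instances, so the extra `not isinstance(obj, type)` check narrows the value for the type checker and at runtime. `Step` and `to_dict` are hypothetical names for this sketch, not the selectools code.

```python
import dataclasses
from typing import Any, Dict


@dataclasses.dataclass
class Step:
    # Hypothetical dataclass used only for this example.
    name: str
    tokens: int


def to_dict(obj: Any) -> Dict[str, Any]:
    """Convert a dataclass instance to a dict, rejecting dataclass types."""
    # is_dataclass() alone is too wide: it is also True for the class itself.
    if dataclasses.is_dataclass(obj) and not isinstance(obj, type):
        return dataclasses.asdict(obj)
    raise TypeError(f"expected a dataclass instance, got {obj!r}")


print(to_dict(Step(name="llm_call", tokens=12)))
```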

#18 Landing page feature content for v0.21.0
---------------------------------------------
Three text-only bento card updates (no layout changes):

- RAG card: "4 store backends" → "7 store backends" with the full list
  enumerated plus CSV/JSON/HTML/URL loaders mentioned.
- Toolbox card: added explicit v0.21.0 additions (Python + shell
  execution, DuckDuckGo search, GitHub REST API, SQLite + Postgres).
- Audit card retitled to "Audit + observability" and expanded to
  mention OTelObserver (GenAI semantic conventions) and
  LangfuseObserver as the new v0.21.0 shipping surfaces for trace
  export to Datadog / Jaeger / Langfuse Cloud / any OTLP backend.

#19 FAISS variant of App 3 Knowledge Base Librarian
---------------------------------------------------
Added TestApp3b_KnowledgeBaseLibrarianFAISS in
tests/test_e2e_v0_21_0_apps.py — the same CSV + JSON + HTML librarian
persona but backed by FAISSVectorStore instead of Qdrant. Runnable
without Docker, and with different anchor phrases (OSPREY-88,
CRESCENT, AURORA-SOUTH) so it doesn't shadow the Qdrant variant when
both run. Three tests, all passing against real OpenAI embeddings +
real OpenAI gpt-4o-mini.

#20 RAGTool docstring notes
---------------------------
Added a "Notes" block to RAGTool explaining:
- Thread safety: the vector store handles its own locking, but mutating
  top_k / score_threshold / include_scores after attaching to an Agent
  is not thread-safe.
- Cross-process serialization: not supported, same reason function-based
  @tool() tools aren't supported.

Verification
------------
- mypy src/: Success: no issues found in 150 source files
- Full non-e2e suite: 4961 passed, 3 skipped, 248 deselected (+9 from
  new image_url + async multimodal + FAISS librarian tests), 0 regressions
- Full e2e suite with Qdrant + Postgres running: 70 collected, 64 passed,
  6 skipped (Azure x2 + Langfuse x1 credential-dependent + 3 Qdrant
  tests when the container isn't running), 0 failures
- mkdocs build: zero broken-anchor warnings (QUICKSTART + PARSER both
  clean now)
- diff CHANGELOG.md docs/CHANGELOG.md: byte-identical