test: eval hardening — 31 E2E tests with real Agent + edge cases by johnnichev · Pull Request #22 · johnnichev/selectools

johnnichev · 2026-03-22T19:00:59Z

Summary

31 hardening tests covering real tool execution, edge cases, feature chains, evaluator isolation, export roundtrips, concurrent stress, and observer integration.

All tests use real Agent instances with SharedToolCallProvider and SharedFakeProvider — no mocks.

Total eval tests: 340. Total project: 1906. All passing.

Test plan

1906 tests pass (full suite)
42/42 manual E2E checks pass
Pre-commit hooks pass

Tests using SharedToolCallProvider (real tool execution):
- Tool call assertions: pass, fail, multiple, args exact, args mismatch, order

Edge cases:
- Empty response, very long response (10k words), unicode/emoji, special chars
- Single case, zero weight, all cases error, identical latencies

Feature chains:
- Run → snapshot → baseline → regression detection pipeline
- History improvement tracking over 3 runs
- Pairwise with tool calls
- Badge from full run
- HTML report with mixed verdicts

Evaluator isolation:
- No assertions = always pass
- Only set fields checked (no false positives)
- Multiple failures on same case all recorded

Export roundtrips:
- JSON roundtrip with metadata verification
- Markdown with/without failures
- JUnit XML structure validation

Stress: 50 cases concurrent, progress callback verification

Observer: all events fire, broken observer doesn't crash eval

Total eval tests: 340. Total project: 1906.

Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)

Full suite: 2013 passed.
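The "Bool rejected for int/float params" item (#20) is easy to get wrong in Python, because `bool` is a subclass of `int` and so passes a plain `isinstance(value, int)` check. A minimal sketch of the idea follows. The helper name `validate_number` is hypothetical and is not taken from the selectools code base.

```python
def validate_number(value, expected_type):
    """Hypothetical validator: accept int/float values but reject bool."""
    # bool is a subclass of int, so isinstance(True, int) is True.
    # Check for bool first so True/False are never accepted as numbers.
    if isinstance(value, bool):
        raise TypeError(f"expected {expected_type.__name__}, got bool")
    # Assumption for this sketch: an int is acceptable where a float is expected.
    if expected_type is float and isinstance(value, int):
        return float(value)
    if not isinstance(value, expected_type):
        raise TypeError(
            f"expected {expected_type.__name__}, got {type(value).__name__}"
        )
    return value
```

The ordering matters: the `bool` check has to come before any `int` check, otherwise `True` is silently treated as `1`.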

HIGH fixes recovered:
- Bug #22: MultiMCPClient copy.copy(tool) before prefix mutation
- Bug #26: InMemoryVectorStore max_documents capacity warning
- Bug #27: Pinecone metadata.get not .pop (stops mutating response)
- Bug #28: ContextualChunker (content or "").strip() None guard
- Bug #29: InMemoryVectorStore overfetch top_k*4 when filter present
- Bug #36: Redis save() category srem moved into pipeline
- Bug #38: XSS fix, createElement/textContent replaces innerHTML
- Bug #41: SSN regex uses strict alternation (?:\d{3}-\d{2}-\d{4}|\d{9})

LOW fix:
- Bug #35: KnowledgeMemory.remember() uses UTC consistently
- Fixed naive/aware datetime comparison in prune_old_logs
- Updated tests to use UTC dates

Full suite: 2013 passed.
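To illustrate what "strict alternation" buys for Bug #41, the sketch below compares the pattern quoted above with a looser pattern that makes each separator optional. The loose pattern is an assumption made for contrast; the commit message does not show the regex that was replaced.

```python
import re

# Assumed "before" pattern for illustration only: each dash is optional
# on its own, so mixed forms such as 123-456789 are accepted.
LOOSE_SSN = re.compile(r"\d{3}-?\d{2}-?\d{4}")

# Pattern quoted in the commit message: either fully dashed or nine bare digits.
STRICT_SSN = re.compile(r"(?:\d{3}-\d{2}-\d{4}|\d{9})")


def is_ssn(pattern, text):
    """Return True when the whole string matches the given pattern."""
    return pattern.fullmatch(text) is not None
```

With these definitions, `is_ssn(STRICT_SSN, "123-45-6789")` and `is_ssn(STRICT_SSN, "123456789")` are both true, while the mixed form `"123-456789"` is accepted only by the loose pattern.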

Source: LlamaIndex #20880 (same class: `alpha = query.alpha or 0.5` swallowed `alpha=0.0`). CohereReranker.rerank used `top_n=top_k or len(results)` which silently promoted `top_k=0` (user asking for no results) to the full list. Round-1 pitfall #22 class, new instance in the rag/ module. Fix: `top_n=top_k if top_k is not None else len(results)`. Also adds docs/superpowers/plans/2026-04-11-round2-quickwins.md — the 4-bug round-2 plan derived from the LangChain/LlamaIndex competitive-mining research report.
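The difference between the two expressions can be sketched in isolation. The function names below are hypothetical stand-ins for the `top_n` computation described above, not the actual `CohereReranker.rerank` code.

```python
def top_n_buggy(top_k, results):
    # `or` falls through on every falsy value, so top_k=0 is treated
    # the same as top_k=None and the full list length is used.
    return top_k or len(results)


def top_n_fixed(top_k, results):
    # Only None means "not specified"; an explicit 0 is preserved.
    return top_k if top_k is not None else len(results)


results = ["doc-a", "doc-b", "doc-c"]
print(top_n_buggy(0, results))     # 3
print(top_n_fixed(0, results))     # 0
print(top_n_fixed(None, results))  # 3
```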

…ount fields Source: LangChain #36500. `token_usage.get("total_tokens") or fallback` silently replaces provider-reported 0 for cached completions. Round-1 pitfall #22 instance not yet swept in providers/. gemini_provider.py used `(usage.prompt_token_count or 0) if usage else 0` in both sync complete() (lines 158-159) and stream path (lines 505-506). If the Gemini API ever returns prompt_token_count=None alongside a real candidates_token_count, the `or 0` conflates "unknown" with "zero" and under-reports total_tokens. Fix: use `x if x is not None else 0` guard pattern on both paths. Grep confirmed the `or 0` pattern only appears on gemini_provider.py token fields — no other provider affected.
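A rough sketch of the guard pattern applied to usage fields is shown here. `SimpleNamespace` stands in for a provider usage object, and `read_count` and `total_or_fallback` are hypothetical helpers; only the attribute names `prompt_token_count` and `candidates_token_count` come from the commit message.

```python
from types import SimpleNamespace


def read_count(usage, field):
    """Return the reported count, treating only None/missing as unknown."""
    value = getattr(usage, field, None) if usage is not None else None
    return value if value is not None else 0


def total_or_fallback(reported_total, fallback):
    """Keep a provider-reported 0 instead of replacing it with the fallback."""
    return reported_total if reported_total is not None else fallback


# Stand-in usage object: prompt count unknown, completion count reported.
usage = SimpleNamespace(prompt_token_count=None, candidates_token_count=12)
prompt = read_count(usage, "prompt_token_count")
completion = read_count(usage, "candidates_token_count")

# A cached completion may legitimately report total_tokens == 0.
print(total_or_fallback(0, prompt + completion))     # 0
print(total_or_fallback(None, prompt + completion))  # 12
```

When the fallback is the literal `0`, `x or 0` and `x if x is not None else 0` return the same value for integers. The explicit guard changes the result only when the fallback is non-zero, as in the `total_tokens` case, and otherwise keeps the pattern uniform across fields.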

johnnichev merged commit 4700bcc into main Mar 22, 2026

johnnichev deleted the test/eval-hardening branch March 22, 2026 19:01

johnnichev added a commit that referenced this pull request Apr 5, 2026

docs: add 5 new common pitfalls from ralph hunt findings (#22-#26)

eab63ee
