test: eval hardening — 31 E2E tests with real Agent + edge cases#22
Merged
johnnichev merged 1 commit into main on Mar 22, 2026
Tests using SharedToolCallProvider (real tool execution):
- Tool call assertions: pass, fail, multiple, args exact, args mismatch, order

Edge cases:
- Empty response, very long response (10k words), unicode/emoji, special chars
- Single case, zero weight, all cases error, identical latencies

Feature chains:
- Run → snapshot → baseline → regression detection pipeline
- History improvement tracking over 3 runs
- Pairwise with tool calls
- Badge from full run
- HTML report with mixed verdicts

Evaluator isolation:
- No assertions = always pass
- Only set fields checked (no false positives)
- Multiple failures on same case all recorded

Export roundtrips:
- JSON roundtrip with metadata verification
- Markdown with/without failures
- JUnit XML structure validation

Stress: 50 cases concurrent, progress callback verification
Observer: all events fire, broken observer doesn't crash eval

Total eval tests: 340. Total project: 1906.
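The "broken observer doesn't crash eval" case can be pictured with a small sketch. This is not the project's observer API: `notify_observers`, `broken_observer`, and `recording_observer` are hypothetical names, and the only assumption is that observers are plain callables invoked once per event.

```python
from typing import Callable, List

Observer = Callable[[str], None]


def notify_observers(observers: List[Observer], event: str) -> List[Exception]:
    """Call every observer; collect failures instead of propagating them."""
    errors: List[Exception] = []
    for observer in observers:
        try:
            observer(event)
        except Exception as exc:  # one faulty observer must not stop the run
            errors.append(exc)
    return errors


seen: List[str] = []


def broken_observer(event: str) -> None:
    # Hypothetical observer that always fails.
    raise RuntimeError("observer bug")


def recording_observer(event: str) -> None:
    # Hypothetical observer that records what it receives.
    seen.append(event)


errors = notify_observers([broken_observer, recording_observer], "case_end")
```

A test in this style would assert that the second observer still received the event and that the failure was captured rather than raised.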
johnnichev added a commit that referenced this pull request on Mar 24, 2026
Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)

Full suite: 2013 passed.
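The "Bool rejected for int/float params" item (#20) reflects a general Python fact: `bool` is a subclass of `int`, so a plain `isinstance(value, int)` check accepts `True`. A minimal sketch of the guard follows; `check_numeric_param` is a hypothetical helper, not the project's validator, and accepting an `int` where a `float` is expected is an assumption made for the example.

```python
from typing import Union


def check_numeric_param(name: str, value: object, expected: type) -> Union[int, float]:
    """Validate an int/float parameter, rejecting bool explicitly."""
    # bool must be tested first: isinstance(True, int) is True in Python.
    if isinstance(value, bool):
        raise TypeError(f"{name}: expected {expected.__name__}, got bool")
    # Assumption for this sketch: an int is acceptable where a float is expected.
    allowed = (int, float) if expected is float else (int,)
    if not isinstance(value, allowed):
        raise TypeError(
            f"{name}: expected {expected.__name__}, got {type(value).__name__}"
        )
    return value
```

With this ordering, `check_numeric_param("count", 3, int)` returns `3`, while passing `True` raises `TypeError` instead of being treated as `1`.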
johnnichev added a commit that referenced this pull request on Mar 24, 2026
HIGH fixes recovered:
- Bug #22: MultiMCPClient copy.copy(tool) before prefix mutation
- Bug #26: InMemoryVectorStore max_documents capacity warning
- Bug #27: Pinecone metadata.get not .pop (stops mutating response)
- Bug #28: ContextualChunker (content or "").strip() None guard
- Bug #29: InMemoryVectorStore overfetch top_k*4 when filter present
- Bug #36: Redis save() category srem moved into pipeline
- Bug #38: XSS fix, createElement/textContent replaces innerHTML
- Bug #41: SSN regex uses strict alternation (?:\d{3}-\d{2}-\d{4}|\d{9})

LOW fix:
- Bug #35: KnowledgeMemory.remember() uses UTC consistently
- Fixed naive/aware datetime comparison in prune_old_logs
- Updated tests to use UTC dates

Full suite: 2013 passed.
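The effect of the strict alternation in Bug #41 can be checked in isolation. In the sketch below, `STRICT_SSN` is the pattern quoted in the commit message; `LOOSE_SSN` is only an assumed example of a pattern with independently optional separators, since the original regex is not shown here.

```python
import re

# Assumed "before" shape: each hyphen is optional on its own,
# so mixed forms such as "123-456789" slip through.
LOOSE_SSN = re.compile(r"\d{3}-?\d{2}-?\d{4}")

# Pattern quoted in the commit message: both hyphens or none.
STRICT_SSN = re.compile(r"(?:\d{3}-\d{2}-\d{4}|\d{9})")

samples = ["123-45-6789", "123456789", "123-456789", "12345-6789"]

loose_hits = [s for s in samples if LOOSE_SSN.fullmatch(s)]
strict_hits = [s for s in samples if STRICT_SSN.fullmatch(s)]
```

Under these assumptions the loose pattern accepts all four samples, while the strict one accepts only the two with consistent separators.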
johnnichev added a commit that referenced this pull request on Apr 5, 2026
johnnichev added a commit that referenced this pull request on Apr 11, 2026
Source: LlamaIndex #20880 (same class: `alpha = query.alpha or 0.5` swallowed `alpha=0.0`). CohereReranker.rerank used `top_n=top_k or len(results)` which silently promoted `top_k=0` (user asking for no results) to the full list. Round-1 pitfall #22 class, new instance in the rag/ module. Fix: `top_n=top_k if top_k is not None else len(results)`. Also adds docs/superpowers/plans/2026-04-11-round2-quickwins.md — the 4-bug round-2 plan derived from the LangChain/LlamaIndex competitive-mining research report.
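The pitfall is easy to reproduce outside the reranker. The two functions below are hypothetical stand-ins that contain only the expressions quoted in the commit message; they are not the `CohereReranker` code itself.

```python
from typing import Optional, Sequence


def top_n_buggy(top_k: Optional[int], results: Sequence[str]) -> int:
    # `or` falls through on any falsy value, so 0 is treated like None.
    return top_k or len(results)


def top_n_fixed(top_k: Optional[int], results: Sequence[str]) -> int:
    # Only a missing value (None) falls back to the full list length.
    return top_k if top_k is not None else len(results)


results = ["doc-a", "doc-b", "doc-c"]
```

With three results, `top_n_buggy(0, results)` yields `3`, whereas `top_n_fixed(0, results)` yields `0`; both return `3` when `top_k` is `None`.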
johnnichev added a commit that referenced this pull request on Apr 11, 2026
…ount fields
Source: LangChain #36500. `token_usage.get("total_tokens") or fallback`
silently replaces provider-reported 0 for cached completions. Round-1
pitfall #22 instance not yet swept in providers/.
gemini_provider.py used `(usage.prompt_token_count or 0) if usage else 0`
in both sync complete() (lines 158-159) and stream path (lines 505-506).
If the Gemini API ever returns prompt_token_count=None alongside a real
candidates_token_count, the `or 0` conflates "unknown" with "zero" and
under-reports total_tokens.
Fix: use `x if x is not None else 0` guard pattern on both paths.
Grep confirmed the `or 0` pattern only appears on gemini_provider.py
token fields — no other provider affected.
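A sketch of the guard pattern applied to token counts is shown below. `count_or_zero` and `usage_totals` are hypothetical helpers, and `SimpleNamespace` stands in for the provider's usage object; only the field names `prompt_token_count` and `candidates_token_count` come from the commit message.

```python
from types import SimpleNamespace
from typing import Optional, Tuple


def count_or_zero(value: Optional[int]) -> int:
    """Explicit None check, i.e. the `x if x is not None else 0` guard."""
    return value if value is not None else 0


def usage_totals(usage: Optional[SimpleNamespace]) -> Tuple[int, int, int]:
    """Return (prompt, completion, total) tokens from a usage-like object."""
    if usage is None:
        return 0, 0, 0
    prompt = count_or_zero(usage.prompt_token_count)
    completion = count_or_zero(usage.candidates_token_count)
    return prompt, completion, prompt + completion


# Stand-in for a response whose prompt count is missing.
partial = SimpleNamespace(prompt_token_count=None, candidates_token_count=7)
totals = usage_totals(partial)
```

Keeping the `None` check in one helper means the sync and stream paths can share the same handling instead of repeating the expression inline.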
Summary
31 hardening tests covering real tool execution, edge cases, feature chains, evaluator isolation, export roundtrips, concurrent stress, and observer integration.
All tests use real Agent instances with SharedToolCallProvider and SharedFakeProvider — no mocks.
Total eval tests: 340. Total project: 1906. All passing.
Test plan