test: eval hardening — 31 E2E tests with real Agent + edge cases#22

Merged
johnnichev merged 1 commit into main from test/eval-hardening on Mar 22, 2026

@johnnichev
Owner

Summary

31 hardening tests covering real tool execution, edge cases, feature chains, evaluator isolation, export roundtrips, concurrent stress, and observer integration.

All tests use real Agent instances with SharedToolCallProvider and SharedFakeProvider — no mocks.

Total eval tests: 340. Total project: 1906. All passing.

Test plan

  • 1906 tests pass (full suite)
  • 42/42 manual E2E checks pass
  • Pre-commit hooks pass

Tests using SharedToolCallProvider (real tool execution):
- Tool call assertions: pass, fail, multiple, args exact, args mismatch, order

Edge cases:
- Empty response, very long response (10k words), unicode/emoji, special chars
- Single case, zero weight, all cases error, identical latencies

Feature chains:
- Run → snapshot → baseline → regression detection pipeline
- History improvement tracking over 3 runs
- Pairwise with tool calls
- Badge from full run
- HTML report with mixed verdicts

Evaluator isolation:
- No assertions = always pass
- Only set fields checked (no false positives)
- Multiple failures on same case all recorded

Export roundtrips:
- JSON roundtrip with metadata verification
- Markdown with/without failures
- JUnit XML structure validation

Stress: 50 cases concurrent, progress callback verification
Observer: all events fire, broken observer doesn't crash eval
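
The "broken observer doesn't crash eval" check can be pictured with a minimal sketch. The `notify_observers` helper below is hypothetical and is not the project's observer API; it only assumes that observers are plain callables and that one failing observer should not stop delivery to the rest.

```python
from typing import Callable, List


def notify_observers(observers: List[Callable[[str], None]], event: str) -> List[Exception]:
    """Deliver `event` to every observer and collect, rather than raise, failures."""
    errors: List[Exception] = []
    for observer in observers:
        try:
            observer(event)
        except Exception as exc:  # a broken observer must not abort the run
            errors.append(exc)
    return errors
```

Collecting the exceptions instead of discarding them keeps the failure visible to the caller while the eval continues.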

Total eval tests: 340. Total project: 1906.
@johnnichev johnnichev merged commit 4700bcc into main Mar 22, 2026
@johnnichev johnnichev deleted the test/eval-hardening branch March 22, 2026 19:01
johnnichev added a commit that referenced this pull request Mar 24, 2026
Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with
  trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)
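
Fix #20 exists because Python's `bool` is a subclass of `int`, so a plain `isinstance(value, int)` check accepts `True`. A minimal sketch of the guard follows; `check_numeric_param` is a hypothetical helper, not the project's validator, and letting an `int` satisfy a `float` parameter is an assumption.

```python
def check_numeric_param(name: str, value: object, expected: type) -> None:
    """Raise TypeError unless `value` is a real number of the expected kind."""
    # Assumption: an int is acceptable where a float is expected.
    allowed = (int, float) if expected is float else (int,)
    # bool must be tested first because isinstance(True, int) is True.
    if isinstance(value, bool) or not isinstance(value, allowed):
        raise TypeError(
            f"{name}: expected {expected.__name__}, got {type(value).__name__}"
        )
```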

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)
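
The overfetch in #29 matters when the metadata filter runs after the similarity ranking: fetching only `top_k` candidates can leave fewer than `top_k` (or zero) results once the filter is applied. A rough sketch under that assumption is below; the `search` function and the dict-based document shape are illustrative only, and only the `top_k*4` factor comes from the commit.

```python
from typing import Any, Dict, List, Optional


def search(
    scored_docs: List[Dict[str, Any]],
    top_k: int,
    metadata_filter: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Rank by score, overfetch 4x when filtering, then filter and trim."""
    fetch_k = top_k * 4 if metadata_filter else top_k
    ranked = sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:fetch_k]
    if metadata_filter:
        ranked = [
            d for d in ranked
            if all(d["metadata"].get(k) == v for k, v in metadata_filter.items())
        ]
    return ranked[:top_k]
```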

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)
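
One reading of #35 is that the row limit has to be applied after expired entries are dropped, otherwise expired rows consume the limit and the caller gets back fewer live rows than requested. The sketch below shows that ordering with the standard-library `sqlite3` module; the `knowledge` table, its columns, and both function names are assumptions for illustration, not the project's schema.

```python
import sqlite3
from typing import List


def open_demo_store() -> sqlite3.Connection:
    """Build an in-memory table with one expired and two live rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE knowledge (id INTEGER PRIMARY KEY, content TEXT, expires_at REAL)"
    )
    conn.executemany(
        "INSERT INTO knowledge (content, expires_at) VALUES (?, ?)",
        [("expired", 5.0), ("live", 100.0), ("forever", None)],
    )
    return conn


def query_live(conn: sqlite3.Connection, now: float, limit: int) -> List[str]:
    """Filter out expired rows in WHERE so LIMIT only counts live rows."""
    rows = conn.execute(
        "SELECT content FROM knowledge "
        "WHERE expires_at IS NULL OR expires_at > ? "
        "ORDER BY id LIMIT ?",
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]
```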

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)
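
For #46, a shared `completed += 1` from several worker threads is a read-modify-write and can lose updates. A minimal sketch of the locked counter follows; `ProgressCounter` and `run_cases` are hypothetical names, not the suite's actual classes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressCounter:
    """Counter whose increment is serialized by a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._completed = 0

    def increment(self) -> int:
        with self._lock:
            self._completed += 1
            return self._completed

    @property
    def completed(self) -> int:
        with self._lock:
            return self._completed


def run_cases(num_cases: int, workers: int = 8) -> int:
    """Simulate concurrent case completion and return the final count."""
    counter = ProgressCounter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda _: counter.increment(), range(num_cases)))
    return counter.completed
```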

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)
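
The Unicode normalization in #42 closes a bypass where a blocked topic is written with look-alike code points such as fullwidth letters. The sketch below assumes NFKC plus case folding; the commit does not say which normalization form the guardrail uses, and `mentions_topic` is a hypothetical helper.

```python
import unicodedata
from typing import Iterable


def normalize_for_matching(text: str) -> str:
    """Fold compatibility characters and case before substring matching."""
    return unicodedata.normalize("NFKC", text).casefold()


def mentions_topic(text: str, blocked_topics: Iterable[str]) -> bool:
    """Return True if any blocked topic appears in the normalized text."""
    normalized = normalize_for_matching(text)
    return any(normalize_for_matching(topic) in normalized for topic in blocked_topics)
```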

Full suite: 2013 passed.
johnnichev added a commit that referenced this pull request Mar 24, 2026
HIGH fixes recovered:
- Bug #22: MultiMCPClient copy.copy(tool) before prefix mutation
- Bug #26: InMemoryVectorStore max_documents capacity warning
- Bug #27: Pinecone metadata.get not .pop (stops mutating response)
- Bug #28: ContextualChunker (content or "").strip() None guard
- Bug #29: InMemoryVectorStore overfetch top_k*4 when filter present
- Bug #36: Redis save() category srem moved into pipeline
- Bug #38: XSS fix — createElement/textContent replaces innerHTML
- Bug #41: SSN regex uses strict alternation (?:\d{3}-\d{2}-\d{4}|\d{9})
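
The strict alternation from Bug #41 can be checked directly: a value is accepted only when it is fully dashed or fully undashed, so mixed forms such as `123-456789` are rejected. The pattern below is the one quoted in the commit; wrapping it in `is_ssn` with `fullmatch` is an illustrative assumption about how it is applied.

```python
import re

# Pattern as quoted in the commit message.
SSN_PATTERN = re.compile(r"(?:\d{3}-\d{2}-\d{4}|\d{9})")


def is_ssn(text: str) -> bool:
    """True only for fully dashed or fully undashed nine-digit values."""
    return SSN_PATTERN.fullmatch(text) is not None
```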

LOW fix:
- Bug #35: KnowledgeMemory.remember() uses UTC consistently
- Fixed naive/aware datetime comparison in prune_old_logs
- Updated tests to use UTC dates
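
The naive/aware problem behind this fix is that Python raises `TypeError` when a timezone-naive `datetime` is compared with or subtracted from an aware one. A minimal sketch of a UTC-consistent age check is below; `is_older_than` is a hypothetical helper, and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_older_than(timestamp: datetime, days: int, now: Optional[datetime] = None) -> bool:
    """Compare in UTC; naive timestamps are assumed to already be UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    if timestamp.tzinfo is None:
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    return now - timestamp > timedelta(days=days)
```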

Full suite: 2013 passed.
johnnichev added a commit that referenced this pull request Apr 11, 2026
Source: LlamaIndex #20880 (same class: `alpha = query.alpha or 0.5`
swallowed `alpha=0.0`). CohereReranker.rerank used `top_n=top_k or len(results)`
which silently promoted `top_k=0` (user asking for no results) to the full
list. Round-1 pitfall #22 class, new instance in the rag/ module.

Fix: `top_n=top_k if top_k is not None else len(results)`.
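
A small sketch of the difference, using standalone functions rather than the real `CohereReranker` (both function names are hypothetical):

```python
from typing import Optional


def resolve_top_n_buggy(top_k: Optional[int], num_results: int) -> int:
    # `or` treats 0 as missing, so top_k=0 becomes the full list length.
    return top_k or num_results


def resolve_top_n(top_k: Optional[int], num_results: int) -> int:
    # Only None means "not specified"; an explicit 0 is preserved.
    return top_k if top_k is not None else num_results
```

With ten results, the buggy version turns `top_k=0` into 10, while the fixed version returns 0 and still falls back to 10 for `None`.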

Also adds docs/superpowers/plans/2026-04-11-round2-quickwins.md — the
4-bug round-2 plan derived from the LangChain/LlamaIndex competitive-mining
research report.
johnnichev added a commit that referenced this pull request Apr 11, 2026
…ount fields

Source: LangChain #36500. `token_usage.get("total_tokens") or fallback`
silently replaces provider-reported 0 for cached completions. Round-1
pitfall #22 instance not yet swept in providers/.

gemini_provider.py used `(usage.prompt_token_count or 0) if usage else 0`
in both sync complete() (lines 158-159) and stream path (lines 505-506).
If the Gemini API ever returns prompt_token_count=None alongside a real
candidates_token_count, the `or 0` conflates "unknown" with "zero" and
under-reports total_tokens.

Fix: use `x if x is not None else 0` guard pattern on both paths.
Grep confirmed the `or 0` pattern only appears on gemini_provider.py
token fields — no other provider affected.
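
The upstream `total_tokens` case can be sketched with a plain dict. This is an illustration of the guard pattern only, not the Gemini provider code; `total_tokens` and its parameters are hypothetical.

```python
from typing import Dict, Optional


def total_tokens(usage: Dict[str, Optional[int]], prompt: int, completion: int) -> int:
    """Trust a provider-reported total, including 0; compute it only when absent."""
    reported = usage.get("total_tokens")
    return reported if reported is not None else prompt + completion
```

With `usage.get("total_tokens") or (prompt + completion)`, a reported total of 0 would be replaced by the computed sum.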