
release: v0.17.0 — Built-in Eval Framework #23

Merged
johnnichev merged 1 commit into main from release/v0.17.0
Mar 22, 2026

Conversation

@johnnichev (Owner)

Selectools v0.17.0 — Built-in Eval Framework

The only AI agent framework with a comprehensive evaluation suite built in. No separate install, no SaaS account, no external dependencies.

Highlights

  • 39 evaluators — 21 deterministic + 18 LLM-as-judge
  • A/B testing — PairwiseEval compares agents head-to-head
  • Regression detection — BaselineStore tracks accuracy across runs
  • Snapshot testing — Jest-style output snapshots for AI agents
  • 4 pre-built templates — customer support, RAG, safety, code quality
  • Interactive HTML report — donut chart, histogram, trend sparkline, expandable rows
  • GitHub Action — automatic PR comments with eval results
  • CLI — python -m selectools.evals run cases.json
  • Cost estimation — suite.estimate_cost() before running
  • History tracking — accuracy/cost/latency trends over time
  • 3 new observer events — on_eval_start, on_eval_case_end, on_eval_end
  • pip install selectools[evals] — optional PyYAML dependency
  • report.to_markdown() — paste into GitHub issues/Slack/PRs
  • 340 eval tests across 7 test files
  • 1960 total tests, all passing
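
The "Jest-style output snapshots" highlight follows the usual snapshot-testing pattern: the first run records the agent's output to disk, and later runs compare against the recorded file. A minimal, framework-free sketch of the idea (the `check_snapshot` helper, file layout, and JSON format here are illustrative assumptions, not selectools' actual API):

```python
import json
from pathlib import Path

def check_snapshot(name: str, output: str, snapshot_dir: Path,
                   update: bool = False) -> bool:
    """First run (or update=True) records the output; later runs compare."""
    snap = snapshot_dir / f"{name}.snap.json"
    if update or not snap.exists():
        snap.write_text(json.dumps({"output": output}))
        return True  # nothing to compare against yet: record and pass
    recorded = json.loads(snap.read_text())["output"]
    return recorded == output  # False flags an output regression
```

A mismatch signals that the agent's output drifted since the snapshot was recorded; re-running with `update=True` accepts the new output as the baseline.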

Files changed

  • Version bump: __init__.py + pyproject.toml (0.16.7 → 0.17.0)
  • Count fixes across 14 files (tests, evaluators, examples, observer events)
  • ROADMAP: v0.17.0 marked complete
  • CHANGELOG: comprehensive release entry

Test plan

  • 1960 tests pass (full suite)
  • All pre-commit hooks pass
  • MkDocs build clean
  • Version consistent across __init__.py and pyproject.toml
  • No stale counts in non-historical files

Version bump 0.16.7 → 0.17.0 across __init__.py and pyproject.toml.

Count fixes across 14 files:
- Tests: 1758 → 1960 (README, CLAUDE.md, CONTRIBUTING, docs/index, landing)
- Evaluators: 22 → 39 (README, CLAUDE.md, EVALS.md)
- Examples: 39 → 40 (README, docs/index)
- Observer events: 25 → 28 (CLAUDE.md, README, docs/index, ARCHITECTURE, AGENT)
- Eval tests: 309 → 340 (README, CHANGELOG)
- CONTRIBUTING version: v0.16.7 → v0.17.0
- ROADMAP: v0.17.0 marked complete

CHANGELOG updated with comprehensive v0.17.0 entry.
MkDocs build clean. 1960 tests passing.
@johnnichev johnnichev merged commit cc9209d into main Mar 22, 2026
1 check passed
@johnnichev johnnichev deleted the release/v0.17.0 branch March 22, 2026 19:15
johnnichev added a commit that referenced this pull request Mar 24, 2026
Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with
  trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)
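
The "bool rejected for int/float params" fix (#20) guards against a Python quirk: `bool` is a subclass of `int`, so a naive `isinstance` check silently accepts `True`/`False` where a number is declared. A sketch of the check, under the assumption that parameter validation happens on raw Python values (the `validate_number` helper is illustrative, not selectools' actual validator):

```python
def validate_number(value, expected: type) -> bool:
    """Reject bool where int/float is declared: isinstance(True, int)
    is True in Python, so the bool check must come first."""
    if isinstance(value, bool):
        return False
    return isinstance(value, expected)
```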

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)
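
The filter-overfetch fix (#29) addresses a common vector-store pitfall: if you fetch exactly `top_k` candidates and then apply a metadata filter, most of them may be discarded and the caller gets fewer results than requested. Fetching `top_k*4` when a filter is present leaves headroom. A standalone sketch of the pattern (the `filtered_search` signature and dict shapes are illustrative assumptions):

```python
from typing import Callable, Optional

def filtered_search(raw_search: Callable[[int], list],
                    top_k: int,
                    metadata_filter: Optional[dict] = None) -> list:
    # Overfetch 4x when a filter is present, since post-filtering
    # can discard most raw hits.
    fetch_k = top_k * 4 if metadata_filter else top_k
    hits = raw_search(fetch_k)
    if metadata_filter:
        hits = [h for h in hits
                if all(h.get("meta", {}).get(k) == v
                       for k, v in metadata_filter.items())]
    return hits[:top_k]  # trim back down after filtering
```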

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)
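
The WAL fix (#33) matters because SQLite's default rollback journal blocks readers while a write is in flight; `PRAGMA journal_mode=WAL` lets readers proceed concurrently with a single writer. A minimal sketch of opening a session store in WAL mode (the table schema here is an illustrative assumption):

```python
import sqlite3

def open_session_db(path: str) -> sqlite3.Connection:
    """Open a session store with WAL journaling so readers are not
    blocked by a concurrent writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
    )
    return conn
```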

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)
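
The "fail on unparseable score" fix (#37) replaces silent defaults: when an LLM judge returns text that contains no usable score, defaulting to 0 (or 1) quietly corrupts eval results, whereas raising surfaces the problem. A sketch of the idea for a 0-1 score scale (the regex and function are illustrative, not selectools' actual parser):

```python
import re

def parse_judge_score(text: str) -> float:
    """Extract a 0-1 score from an LLM judge reply; raise instead of
    silently defaulting when nothing parseable is found."""
    m = re.search(r"\b([01](?:\.\d+)?)\b", text)
    if m is None:
        raise ValueError(f"unparseable judge score: {text!r}")
    score = float(m.group(1))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    return score
```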

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)
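
The SSN fix (#41) is a classic backreference trick: a pattern like `\d{3}[- ]\d{2}[- ]\d{4}` accepts mixed separators ("123-45 6789"), which inflates false positives. Capturing the first separator and requiring the second to match it via `\1` enforces consistency. A sketch (the exact pattern selectools ships may differ):

```python
import re

# \1 forces both separators to be the same character, so
# "123-45-6789" and "123 45 6789" match but "123-45 6789" does not.
SSN_RE = re.compile(r"\b\d{3}([- ])\d{2}\1\d{4}\b")

def looks_like_ssn(text: str) -> bool:
    return SSN_RE.search(text) is not None
```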

Full suite: 2013 passed.
johnnichev added a commit that referenced this pull request Apr 11, 2026
BUG-16: _build_cancelled_result called _session_save but was
missing _extract_entities and _extract_kg_triples. When a run
was cancelled via CancellationToken, any entities/KG triples
that had been collected during the turn were silently lost.
Now mirrors _build_max_iterations_result and
_build_budget_exceeded_result, which call all three persistence methods.

BUG-22: @tool() treated Optional[T] without a default value as
required. Some LLMs refuse to call a tool when an 'optional'
parameter has no way to represent None. Now detects Optional
types via Union[T, None] and marks them is_optional=True even
without a default value.
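
The BUG-22 detection can be done with `typing.get_origin`/`typing.get_args`: `Optional[T]` is just `Union[T, None]`, so a parameter is optional whenever `NoneType` appears in its union arguments, default or not. A standalone sketch of the approach (the helper names are illustrative, not selectools' internals):

```python
import types
import typing

def is_optional(annotation) -> bool:
    """True for Optional[T] / Union[T, None] / T | None,
    even when the parameter has no default value."""
    origin = typing.get_origin(annotation)
    # types.UnionType covers PEP 604 "T | None" on Python 3.10+
    if origin is typing.Union or origin is getattr(types, "UnionType", None):
        return type(None) in typing.get_args(annotation)
    return False

def optional_params(fn) -> set:
    """Names of parameters whose annotation is Optional."""
    hints = typing.get_type_hints(fn)
    return {n for n, a in hints.items() if n != "return" and is_optional(a)}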

Cross-referenced from CLAUDE.md pitfall #23 and Agno #7066.
johnnichev added a commit that referenced this pull request Apr 13, 2026
…amples

## Summary

Three rounds of competitive bug mining across 9 repositories (~325k combined stars) surfaced and shipped 34 confirmed-live bugs with TDD regression tests.

### Round 1 (Agno + PraisonAI): BUG-01 – BUG-22
22 bugs: streaming tool-call drops, typing.Literal crashes, asyncio.run re-entry, HITL interrupt propagation, ConversationMemory thread safety, think-tag stripping, RAG batch limits, MCP concurrent race, str→typed coercion, Union typing, multi-interrupt generators, GraphState fail-fast, session namespace, summary cap, cancelled-result persistence, AgentTrace lock, async observer logging, clone isolation, OTel/Langfuse locks, vector store dedup, Optional[T] handling.

### Round 2 (LangChain + LangGraph + CrewAI + n8n + LlamaIndex + AutoGen): BUG-23 – BUG-26
4 bugs: reranker top_k=0 falsy fallback, _dedup_search_results text-only keying, in-memory filter operator-dict silent-ignore, Gemini provider or-0 usage metadata.
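
The "top_k=0 falsy fallback" bug is a one-character trap: `top_k or DEFAULT` treats a legitimate `0` as "not provided" because `0` is falsy in Python, silently replacing it with the default. The fix is an explicit `None` check. A minimal sketch:

```python
DEFAULT_TOP_K = 5

def resolve_top_k_buggy(top_k):
    return top_k or DEFAULT_TOP_K          # bug: 0 is falsy, becomes 5

def resolve_top_k_fixed(top_k):
    return DEFAULT_TOP_K if top_k is None else top_k  # 0 stays 0
```

The same falsy-coalescing mistake underlies the Gemini or-0 usage-metadata bug in this round.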

### Round 3 (LiteLLM + Pydantic AI + Haystack): BUG-27 – BUG-34
8 bugs: FallbackProvider retry list incomplete (529/504/408/522/524), Azure deployment-name family detection bypass, bare list/dict tool schemas missing items/properties, pipeline.parallel shared input mutation, malformed tool-call JSON silent drop, run_in_executor drops contextvars at 5 sites, astream missing aclosing on provider generators, max_iterations consumed by structured-retry budget.
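
The contextvars bug deserves a sketch: a callable handed to `loop.run_in_executor` runs in the worker thread's own context, so `ContextVar` values set in the coroutine (request IDs, trace IDs) can be invisible there. The fix is to snapshot the current context with `contextvars.copy_context()` and run the callable inside it. This standalone example shows the general pattern, not the five actual call sites in the codebase:

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id", default="unset")

def blocking_work() -> str:
    # Without propagation, this can read "unset" in the worker thread.
    return request_id.get()

async def main() -> str:
    request_id.set("req-42")
    loop = asyncio.get_running_loop()
    # The fix: capture the coroutine's context and run the callable in it.
    ctx = contextvars.copy_context()
    return await loop.run_in_executor(None, lambda: ctx.run(blocking_work))
```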

### Documentation & Content
- Cookbook expanded from 7 to 30 recipes
- 6 new examples (89–94)
- Module docs updated: TOOLS, VECTOR_STORES, PROVIDERS, AGENT, PIPELINE
- CLAUDE.md pitfalls 27–30 added
- Stale counts swept across 13 files

### Stats
- **5,064 tests** (up from 5,015; +104 regression tests)
- **94 examples** (up from 88)
- **30 cookbook recipes** (up from 7)
- Cross-referenced: Agno (16), PraisonAI (5), LlamaIndex (3), LangChain (1), LiteLLM (2), Pydantic AI (4), Haystack (2), CLAUDE.md pitfall #23 (1)
- First cross-round compound validation: CrewAI round-2 candidate confirmed by Haystack round-3 grep

## Test plan
- [x] Full non-E2E suite: 5064 passed, 3 skipped, 0 failed
- [x] All 6 new examples verified with PYTHONPATH=src
- [x] Pre-commit hooks (black, isort, flake8, mypy, bandit) green on every commit
- [x] CI green on PR