
release: v0.17.0 — Built-in Eval Framework #23

Merged
johnnichev merged 1 commit into main from release/v0.17.0
Mar 22, 2026

Conversation

@johnnichev (Owner)

Selectools v0.17.0 — Built-in Eval Framework

The only AI agent framework with a comprehensive evaluation suite built in. No separate install, no SaaS account, no external dependencies.

Highlights

  • 39 evaluators — 21 deterministic + 18 LLM-as-judge
  • A/B testing — PairwiseEval compares agents head-to-head
  • Regression detection — BaselineStore tracks accuracy across runs
  • Snapshot testing — Jest-style output snapshots for AI agents
  • 4 pre-built templates — customer support, RAG, safety, code quality
  • Interactive HTML report — donut chart, histogram, trend sparkline, expandable rows
  • GitHub Action — automatic PR comments with eval results
  • CLI — python -m selectools.evals run cases.json
  • Cost estimation — suite.estimate_cost() before running
  • History tracking — accuracy/cost/latency trends over time
  • 3 new observer events — on_eval_start, on_eval_case_end, on_eval_end
  • pip install selectools[evals] — optional PyYAML dependency
  • report.to_markdown() — paste into GitHub issues/Slack/PRs
  • 340 eval tests across 7 test files
  • 1960 total tests, all passing
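
The "Jest-style output snapshots" highlight follows the usual snapshot-testing pattern: the first run records the agent's output to disk, and later runs compare against the recorded file. A minimal, framework-free sketch of the idea (the `check_snapshot` helper, file layout, and JSON format here are illustrative assumptions, not selectools' actual API):

```python
import json
from pathlib import Path

def check_snapshot(name: str, output: str, snapshot_dir: Path,
                   update: bool = False) -> bool:
    """First run (or update=True) records the output; later runs compare."""
    snap = snapshot_dir / f"{name}.snap.json"
    if update or not snap.exists():
        snap.write_text(json.dumps({"output": output}))
        return True  # nothing to compare against yet: record and pass
    recorded = json.loads(snap.read_text())["output"]
    return recorded == output  # False flags an output regression
```

A mismatch signals that the agent's output drifted since the snapshot was recorded; re-running with `update=True` accepts the new output as the baseline.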

Files changed

  • Version bump: __init__.py + pyproject.toml (0.16.7 → 0.17.0)
  • Count fixes across 14 files (tests, evaluators, examples, observer events)
  • ROADMAP: v0.17.0 marked complete
  • CHANGELOG: comprehensive release entry

Test plan

  • 1960 tests pass (full suite)
  • All pre-commit hooks pass
  • MkDocs build clean
  • Version consistent across __init__.py and pyproject.toml
  • No stale counts in non-historical files

Version bump 0.16.7 → 0.17.0 across __init__.py and pyproject.toml.

Count fixes across 14 files:
- Tests: 1758 → 1960 (README, CLAUDE.md, CONTRIBUTING, docs/index, landing)
- Evaluators: 22 → 39 (README, CLAUDE.md, EVALS.md)
- Examples: 39 → 40 (README, docs/index)
- Observer events: 25 → 28 (CLAUDE.md, README, docs/index, ARCHITECTURE, AGENT)
- Eval tests: 309 → 340 (README, CHANGELOG)
- CONTRIBUTING version: v0.16.7 → v0.17.0
- ROADMAP: v0.17.0 marked complete

CHANGELOG updated with comprehensive v0.17.0 entry.
MkDocs build clean. 1960 tests passing.
@johnnichev johnnichev merged commit cc9209d into main Mar 22, 2026
1 check passed
@johnnichev johnnichev deleted the release/v0.17.0 branch March 22, 2026 19:15
johnnichev added a commit that referenced this pull request Mar 24, 2026
Agent core observers (6 fixes):
- astream() cancellation/budget paths now build proper results with
  trace steps and async observer events (#14)
- arun() fires async observers for cancel/budget/max-iter (#15)
- _aexecute_tools_parallel fires async observer events (#16)
- _aexecute_tools_parallel tracks tool_usage/tool_tokens (#17)
- _acheck_policy fires async on_policy_decision observer (#10M)
- astream() max-iter path fires async on_run_end (#12M)

Tools + providers (7 fixes):
- Anthropic empty content list guard (#19)
- Bool rejected for int/float params (#20)
- ToolRegistry.tool() has screen_output/terminal/requires_approval (#21)
- MultiMCPClient list_all_tools() copies tools before prefixing (#22)
- Streamable-http 3-tuple unpacking robust handling (#23)
- _serialize_result returns "" for None (#24)
- StructuredOutputEvaluator handles __slots__ (#45)
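
The "bool rejected for int/float params" fix (#20) guards against a Python quirk: `bool` is a subclass of `int`, so a naive `isinstance` check silently accepts `True`/`False` where a number is declared. A sketch of the check, under the assumption that parameter validation happens on raw Python values (the `validate_number` helper is illustrative, not selectools' actual validator):

```python
def validate_number(value, expected: type) -> bool:
    """Reject bool where int/float is declared: isinstance(True, int)
    is True in Python, so the bool check must come first."""
    if isinstance(value, bool):
        return False
    return isinstance(value, expected)
```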

RAG (6 fixes):
- SQLiteVectorStore search documented limitation (#25)
- InMemoryVectorStore max_documents warning (#26)
- Pinecone metadata.get instead of .pop (#27)
- ContextualChunker None content guard (#28)
- Filter overfetch: top_k*4 when filter present (#29)
- OpenAI embed_texts batching at 2048 (#30)
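
The filter-overfetch fix (#29) addresses a common vector-store pitfall: if you fetch exactly `top_k` candidates and then apply a metadata filter, most of them may be discarded and the caller gets fewer results than requested. Fetching `top_k*4` when a filter is present leaves headroom. A standalone sketch of the pattern (the `filtered_search` signature and dict shapes are illustrative assumptions):

```python
from typing import Callable, Optional

def filtered_search(raw_search: Callable[[int], list],
                    top_k: int,
                    metadata_filter: Optional[dict] = None) -> list:
    # Overfetch 4x when a filter is present, since post-filtering
    # can discard most raw hits.
    fetch_k = top_k * 4 if metadata_filter else top_k
    hits = raw_search(fetch_k)
    if metadata_filter:
        hits = [h for h in hits
                if all(h.get("meta", {}).get(k) == v
                       for k, v in metadata_filter.items())]
    return hits[:top_k]  # trim back down after filtering
```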

Memory (5 fixes):
- FileKnowledgeStore reads under lock (#32)
- SQLiteSessionStore WAL mode (#33)
- SQLiteKnowledgeStore indexes on query columns (#34)
- query() LIMIT after TTL filter (#35)
- Redis save() category update in pipeline (#36)
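
The WAL fix (#33) matters because SQLite's default rollback journal blocks readers while a write is in flight; `PRAGMA journal_mode=WAL` lets readers proceed concurrently with a single writer. A minimal sketch of opening a session store in WAL mode (the table schema here is an illustrative assumption):

```python
import sqlite3

def open_session_db(path: str) -> sqlite3.Connection:
    """Open a session store with WAL journaling so readers are not
    blocked by a concurrent writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions (id TEXT PRIMARY KEY, data TEXT)"
    )
    return conn
```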

Evals (4 fixes):
- 16 LLM evaluators fail on unparseable score (#37)
- XSS fix: textContent instead of innerHTML (#38)
- Donut SVG 360° arc: two semicircles (#39)
- Suite completed counter under threading.Lock (#46)
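
The "fail on unparseable score" fix (#37) replaces silent defaults: when an LLM judge returns text that contains no usable score, defaulting to 0 (or 1) quietly corrupts eval results, whereas raising surfaces the problem. A sketch of the idea for a 0-1 score scale (the regex and function are illustrative, not selectools' actual parser):

```python
import re

def parse_judge_score(text: str) -> float:
    """Extract a 0-1 score from an LLM judge reply; raise instead of
    silently defaulting when nothing parseable is found."""
    m = re.search(r"\b([01](?:\.\d+)?)\b", text)
    if m is None:
        raise ValueError(f"unparseable judge score: {text!r}")
    score = float(m.group(1))
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    return score
```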

Security (5 fixes):
- REWRITE/WARN guardrails tracked in trace (#40)
- SSN regex requires consistent separators (#41)
- Topic guardrail Unicode normalization (#42)
- Coherence usage tracked in agent costs (#43)
- Coherence fail_closed option (#44)
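
The SSN fix (#41) is a classic backreference trick: a pattern like `\d{3}[- ]\d{2}[- ]\d{4}` accepts mixed separators ("123-45 6789"), which inflates false positives. Capturing the first separator and requiring the second to match it via `\1` enforces consistency. A sketch (the exact pattern selectools ships may differ):

```python
import re

# \1 forces both separators to be the same character, so
# "123-45-6789" and "123 45 6789" match but "123-45 6789" does not.
SSN_RE = re.compile(r"\b\d{3}([- ])\d{2}\1\d{4}\b")

def looks_like_ssn(text: str) -> bool:
    return SSN_RE.search(text) is not None
```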

Full suite: 2013 passed.
johnnichev added a commit that referenced this pull request Apr 11, 2026
BUG-16: _build_cancelled_result called _session_save but was
missing _extract_entities and _extract_kg_triples. When a run
was cancelled via CancellationToken, any entities/KG triples
that had been collected during the turn were silently lost.
Now mirrors _build_max_iterations_result and
_build_budget_exceeded_result, which call all three persistence methods.

BUG-22: @tool() treated Optional[T] without a default value as
required. Some LLMs refuse to call a tool when an 'optional'
parameter has no way to represent None. Now detects Optional
types via Union[T, None] and marks them is_optional=True even
without a default value.
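
The BUG-22 detection can be done with `typing.get_origin`/`typing.get_args`: `Optional[T]` is just `Union[T, None]`, so a parameter is optional whenever `NoneType` appears in its union arguments, default or not. A standalone sketch of the approach (the helper names are illustrative, not selectools' internals):

```python
import types
import typing

def is_optional(annotation) -> bool:
    """True for Optional[T] / Union[T, None] / T | None,
    even when the parameter has no default value."""
    origin = typing.get_origin(annotation)
    # types.UnionType covers PEP 604 "T | None" on Python 3.10+
    if origin is typing.Union or origin is getattr(types, "UnionType", None):
        return type(None) in typing.get_args(annotation)
    return False

def optional_params(fn) -> set:
    """Names of parameters whose annotation is Optional."""
    hints = typing.get_type_hints(fn)
    return {n for n, a in hints.items() if n != "return" and is_optional(a)}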

Cross-referenced from CLAUDE.md pitfall #23 and Agno #7066.
johnnichev added a commit that referenced this pull request Apr 13, 2026
…amples

## Summary

Three rounds of competitive bug mining across 9 repositories (~325k combined stars) surfaced and shipped 34 confirmed-live bugs with TDD regression tests.

### Round 1 (Agno + PraisonAI): BUG-01 – BUG-22
22 bugs: streaming tool-call drops, typing.Literal crashes, asyncio.run re-entry, HITL interrupt propagation, ConversationMemory thread safety, think-tag stripping, RAG batch limits, MCP concurrent race, str→typed coercion, Union typing, multi-interrupt generators, GraphState fail-fast, session namespace, summary cap, cancelled-result persistence, AgentTrace lock, async observer logging, clone isolation, OTel/Langfuse locks, vector store dedup, Optional[T] handling.

### Round 2 (LangChain + LangGraph + CrewAI + n8n + LlamaIndex + AutoGen): BUG-23 – BUG-26
4 bugs: reranker top_k=0 falsy fallback, _dedup_search_results text-only keying, in-memory filter operator-dict silent-ignore, Gemini provider or-0 usage metadata.
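
The "top_k=0 falsy fallback" bug is a one-character trap: `top_k or DEFAULT` treats a legitimate `0` as "not provided" because `0` is falsy in Python, silently replacing it with the default. The fix is an explicit `None` check. A minimal sketch:

```python
DEFAULT_TOP_K = 5

def resolve_top_k_buggy(top_k):
    return top_k or DEFAULT_TOP_K          # bug: 0 is falsy, becomes 5

def resolve_top_k_fixed(top_k):
    return DEFAULT_TOP_K if top_k is None else top_k  # 0 stays 0
```

The same falsy-coalescing mistake underlies the Gemini or-0 usage-metadata bug in this round.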

### Round 3 (LiteLLM + Pydantic AI + Haystack): BUG-27 – BUG-34
8 bugs: FallbackProvider retry list incomplete (529/504/408/522/524), Azure deployment-name family detection bypass, bare list/dict tool schemas missing items/properties, pipeline.parallel shared input mutation, malformed tool-call JSON silent drop, run_in_executor drops contextvars at 5 sites, astream missing aclosing on provider generators, max_iterations consumed by structured-retry budget.
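
The contextvars bug deserves a sketch: a callable handed to `loop.run_in_executor` runs in the worker thread's own context, so `ContextVar` values set in the coroutine (request IDs, trace IDs) can be invisible there. The fix is to snapshot the current context with `contextvars.copy_context()` and run the callable inside it. This standalone example shows the general pattern, not the five actual call sites in the codebase:

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id", default="unset")

def blocking_work() -> str:
    # Without propagation, this can read "unset" in the worker thread.
    return request_id.get()

async def main() -> str:
    request_id.set("req-42")
    loop = asyncio.get_running_loop()
    # The fix: capture the coroutine's context and run the callable in it.
    ctx = contextvars.copy_context()
    return await loop.run_in_executor(None, lambda: ctx.run(blocking_work))
```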

### Documentation & Content
- Cookbook expanded from 7 to 30 recipes
- 6 new examples (89–94)
- Module docs updated: TOOLS, VECTOR_STORES, PROVIDERS, AGENT, PIPELINE
- CLAUDE.md pitfalls 27–30 added
- Stale counts swept across 13 files

### Stats
- **5,064 tests** (up from 5,015; +104 regression tests)
- **94 examples** (up from 88)
- **30 cookbook recipes** (up from 7)
- Cross-referenced: Agno (16), PraisonAI (5), LlamaIndex (3), LangChain (1), LiteLLM (2), Pydantic AI (4), Haystack (2), CLAUDE.md pitfall #23 (1)
- First cross-round compound validation: CrewAI round-2 candidate confirmed by Haystack round-3 grep

## Test plan
- [x] Full non-E2E suite: 5064 passed, 3 skipped, 0 failed
- [x] All 6 new examples verified with PYTHONPATH=src
- [x] Pre-commit hooks (black, isort, flake8, mypy, bandit) green on every commit
- [x] CI green on PR