diff --git a/CHANGELOG.md b/CHANGELOG.md index f4d70de..07d417e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,6 +5,89 @@ All notable changes to selectools will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [0.17.0] - 2026-03-22 + +### Added + +**Built-in Eval Framework** — the only AI agent framework with a comprehensive evaluation suite built in. No separate install, no SaaS account, no external dependencies. + +#### Evaluators (39 total) + +**21 deterministic evaluators** (no API calls): +- `ToolUseEvaluator` — tool name, tool list, argument value assertions +- `ContainsEvaluator` — substring present/absent (case-insensitive) +- `OutputEvaluator` — exact match and regex matching +- `StructuredOutputEvaluator` — parsed field assertions (deep subset match) +- `PerformanceEvaluator` — iteration count, latency, and cost thresholds +- `JsonValidityEvaluator` — valid JSON output +- `LengthEvaluator` — min/max character count +- `WordCountEvaluator` — min/max word count +- `StartsWithEvaluator` / `EndsWithEvaluator` — prefix/suffix assertions +- `ToolOrderEvaluator` — tools called in expected sequence +- `UniqueToolsEvaluator` — no duplicate tool calls +- `PIILeakEvaluator` — SSN, email, phone, credit card, ZIP detection +- `InjectionResistanceEvaluator` — 10 prompt injection patterns +- `RefusalEvaluator` — detect appropriate refusal of harmful requests +- `SentimentEvaluator` — keyword-based positive/negative/neutral detection +- `PythonValidityEvaluator` — valid Python syntax (with code fence stripping) +- `SQLValidityEvaluator` — SQL statement validation +- `URLValidityEvaluator` — well-formed URL detection +- `MarkdownFormatEvaluator` — markdown formatting detection +- `CustomEvaluator` — any user-defined callable + +**18 LLM-as-judge evaluators** (use any Provider): +- `LLMJudgeEvaluator` — generic rubric 
scoring (0-10) +- `CorrectnessEvaluator` — correct vs reference answer +- `RelevanceEvaluator` — response relevant to query +- `FaithfulnessEvaluator` — grounded in provided context (RAG) +- `HallucinationEvaluator` — fabricated information detection +- `ToxicityEvaluator` — harmful/inappropriate content +- `CoherenceEvaluator` — well-structured and logical +- `CompletenessEvaluator` — fully addresses the query +- `BiasEvaluator` — gender, racial, political bias +- `SummaryEvaluator` — summary accuracy and coverage +- `ConcisenessEvaluator` — not overly verbose +- `InstructionFollowingEvaluator` — followed specific instructions +- `ToneEvaluator` — matches expected tone +- `ContextRecallEvaluator` — RAG: used all relevant context +- `ContextPrecisionEvaluator` — RAG: retrieved context was relevant +- `GrammarEvaluator` — grammatically correct and fluent +- `SafetyEvaluator` — comprehensive safety check + +#### Infrastructure + +- `EvalSuite` — orchestrates eval runs with sync/async/concurrent execution +- `EvalReport` — accuracy, latency p50/p95/p99, cost, weighted scoring, tag filtering, failure breakdown +- `DatasetLoader` — load test cases from JSON/YAML files +- `BaselineStore` + `RegressionResult` — save baselines, detect regressions across runs +- `PairwiseEval` — compare two agents head-to-head with automatic winner determination +- `SnapshotStore` — Jest-style snapshot testing for AI agent outputs +- `generate_cases()` — LLM-powered synthetic test case generator from tool definitions +- `generate_badge()` — shields.io-style SVG badges for README +- `serve_eval()` — live browser dashboard with real-time eval progress +- `HistoryStore` — track accuracy/cost/latency across runs with trend analysis +- Interactive HTML report with donut chart, latency histogram, trend sparkline, expandable rows, filtering +- JUnit XML for CI (GitHub Actions, Jenkins, GitLab CI) +- `report.to_markdown()` — markdown summary for GitHub issues and PRs +- CLI: `python -m 
selectools.evals run/compare` +- GitHub Action at `.github/actions/eval/` with automatic PR comments +- Cost estimation: `suite.estimate_cost()` before running +- 4 pre-built templates: `customer_support_suite()`, `rag_quality_suite()`, `safety_suite()`, `code_quality_suite()` +- `pip install selectools[evals]` for optional PyYAML dependency + +#### Observer Integration + +- 3 new observer events: `on_eval_start`, `on_eval_case_end`, `on_eval_end` +- Compatible with `LoggingObserver` for structured JSON eval logs + +#### Testing + +- **309 new eval tests** across 6 test files (unit, integration, E2E) +- 40 example scripts (2 eval-specific: `39_eval_framework.py`, `40_eval_advanced.py`) +- Full module documentation: `docs/modules/EVALS.md` + +--- + ## [0.16.7] - 2026-03-16 ### Removed diff --git a/README.md b/README.md index 45ba832..b62643e 100644 --- a/README.md +++ b/README.md @@ -4,17 +4,40 @@ [](https://johnnichev.github.io/selectools) [](https://www.gnu.org/licenses/lgpl-3.0) [](https://www.python.org/downloads/) +[](https://johnnichev.github.io/selectools/modules/EVALS/) An open-source project from **[NichevLabs](https://nichevlabs.com)**. **Production-ready AI agents with tool calling, RAG, and hybrid search.** Connect LLMs to your Python functions, embed and search your documents with vector + keyword fusion, stream responses in real time, and dynamically manage tools at runtime. Works with OpenAI, Anthropic, Gemini, and Ollama. Tracks costs automatically. -## What's New in v0.16.7 +## What's New in v0.17.0 -**Cleanup release** — Removed unused CLI module, completed README example table (28-38), fixed stale doc counts. +**Built-in Eval Framework** — 39 evaluators, A/B testing, regression detection, and more. No separate install needed. 
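The regression-detection workflow this release introduces can be sketched in a few lines of plain Python. This is a conceptual sketch only, not selectools' actual `BaselineStore` API — `save_baseline`, `check_regression`, and the 2% tolerance are hypothetical names and values chosen for illustration:

```python
import json
import os
import tempfile

# Sketch of the idea behind baseline/regression tracking: persist one
# run's metrics, then flag any later run whose accuracy drops beyond a
# tolerance. Hypothetical helpers, not selectools' implementation.

def save_baseline(path: str, metrics: dict) -> None:
    with open(path, "w") as f:
        json.dump(metrics, f)

def check_regression(path: str, metrics: dict, tolerance: float = 0.02) -> bool:
    with open(path) as f:
        baseline = json.load(f)
    # Regression = accuracy fell more than `tolerance` below the baseline.
    return metrics["accuracy"] < baseline["accuracy"] - tolerance

path = os.path.join(tempfile.mkdtemp(), "baseline.json")
save_baseline(path, {"accuracy": 0.95})
print(check_regression(path, {"accuracy": 0.90}))  # True: regressed beyond tolerance
print(check_regression(path, {"accuracy": 0.94}))  # False: within tolerance
```

Storing the baseline as local JSON (rather than in a SaaS backend) is what makes this kind of check diffable and CI-friendly.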
-- **CLI removed** — `selectools` console script entry point removed (unused, flagged by package safety scanners) -- **1758 tests** across unit, integration, regression, and E2E +```python +from selectools.evals import EvalSuite, TestCase + +suite = EvalSuite(agent=agent, cases=[ + TestCase(input="Cancel account", expect_tool="cancel_sub", expect_no_pii=True), + TestCase(input="Balance?", expect_contains="balance", expect_latency_ms_lte=500), +]) +report = suite.run() +print(report.accuracy) # 0.95 +print(report.latency_p50) # 142ms +report.to_html("report.html") +``` + +- **39 Evaluators** — 21 deterministic + 18 LLM-as-judge (tool use, correctness, safety, RAG, code, format) +- **A/B Testing** — `PairwiseEval` compares two agents head-to-head +- **Regression Detection** — `BaselineStore` tracks accuracy across runs +- **Snapshot Testing** — Jest-style output snapshots for AI agents +- **Pre-built Templates** — `customer_support_suite()`, `safety_suite()`, `rag_quality_suite()`, `code_quality_suite()` +- **Interactive HTML Report** — donut chart, histogram, trend line, expandable rows, filtering +- **GitHub Action** — automatic PR comments with eval results +- **CLI** — `python -m selectools.evals run cases.json --html report.html` +- **Cost Estimation** — `suite.estimate_cost()` before running +- **History Tracking** — `HistoryStore` with trend analysis +- **309 eval tests**, zero external dependencies > Full changelog: [CHANGELOG.md](https://github.com/johnnichev/selectools/blob/main/CHANGELOG.md) diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md index f4d70de..07d417e 100644 --- a/docs/CHANGELOG.md +++ b/docs/CHANGELOG.md @@ -5,6 +5,89 @@ All notable changes to selectools will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
+## [0.17.0] - 2026-03-22 + +### Added + +**Built-in Eval Framework** — the only AI agent framework with a comprehensive evaluation suite built in. No separate install, no SaaS account, no external dependencies. + +#### Evaluators (39 total) + +**21 deterministic evaluators** (no API calls): +- `ToolUseEvaluator` — tool name, tool list, argument value assertions +- `ContainsEvaluator` — substring present/absent (case-insensitive) +- `OutputEvaluator` — exact match and regex matching +- `StructuredOutputEvaluator` — parsed field assertions (deep subset match) +- `PerformanceEvaluator` — iteration count, latency, and cost thresholds +- `JsonValidityEvaluator` — valid JSON output +- `LengthEvaluator` — min/max character count +- `WordCountEvaluator` — min/max word count +- `StartsWithEvaluator` / `EndsWithEvaluator` — prefix/suffix assertions +- `ToolOrderEvaluator` — tools called in expected sequence +- `UniqueToolsEvaluator` — no duplicate tool calls +- `PIILeakEvaluator` — SSN, email, phone, credit card, ZIP detection +- `InjectionResistanceEvaluator` — 10 prompt injection patterns +- `RefusalEvaluator` — detect appropriate refusal of harmful requests +- `SentimentEvaluator` — keyword-based positive/negative/neutral detection +- `PythonValidityEvaluator` — valid Python syntax (with code fence stripping) +- `SQLValidityEvaluator` — SQL statement validation +- `URLValidityEvaluator` — well-formed URL detection +- `MarkdownFormatEvaluator` — markdown formatting detection +- `CustomEvaluator` — any user-defined callable + +**18 LLM-as-judge evaluators** (use any Provider): +- `LLMJudgeEvaluator` — generic rubric scoring (0-10) +- `CorrectnessEvaluator` — correct vs reference answer +- `RelevanceEvaluator` — response relevant to query +- `FaithfulnessEvaluator` — grounded in provided context (RAG) +- `HallucinationEvaluator` — fabricated information detection +- `ToxicityEvaluator` — harmful/inappropriate content +- `CoherenceEvaluator` — well-structured and logical +- 
`CompletenessEvaluator` — fully addresses the query +- `BiasEvaluator` — gender, racial, political bias +- `SummaryEvaluator` — summary accuracy and coverage +- `ConcisenessEvaluator` — not overly verbose +- `InstructionFollowingEvaluator` — followed specific instructions +- `ToneEvaluator` — matches expected tone +- `ContextRecallEvaluator` — RAG: used all relevant context +- `ContextPrecisionEvaluator` — RAG: retrieved context was relevant +- `GrammarEvaluator` — grammatically correct and fluent +- `SafetyEvaluator` — comprehensive safety check + +#### Infrastructure + +- `EvalSuite` — orchestrates eval runs with sync/async/concurrent execution +- `EvalReport` — accuracy, latency p50/p95/p99, cost, weighted scoring, tag filtering, failure breakdown +- `DatasetLoader` — load test cases from JSON/YAML files +- `BaselineStore` + `RegressionResult` — save baselines, detect regressions across runs +- `PairwiseEval` — compare two agents head-to-head with automatic winner determination +- `SnapshotStore` — Jest-style snapshot testing for AI agent outputs +- `generate_cases()` — LLM-powered synthetic test case generator from tool definitions +- `generate_badge()` — shields.io-style SVG badges for README +- `serve_eval()` — live browser dashboard with real-time eval progress +- `HistoryStore` — track accuracy/cost/latency across runs with trend analysis +- Interactive HTML report with donut chart, latency histogram, trend sparkline, expandable rows, filtering +- JUnit XML for CI (GitHub Actions, Jenkins, GitLab CI) +- `report.to_markdown()` — markdown summary for GitHub issues and PRs +- CLI: `python -m selectools.evals run/compare` +- GitHub Action at `.github/actions/eval/` with automatic PR comments +- Cost estimation: `suite.estimate_cost()` before running +- 4 pre-built templates: `customer_support_suite()`, `rag_quality_suite()`, `safety_suite()`, `code_quality_suite()` +- `pip install selectools[evals]` for optional PyYAML dependency + +#### Observer Integration + 
+- 3 new observer events: `on_eval_start`, `on_eval_case_end`, `on_eval_end` +- Compatible with `LoggingObserver` for structured JSON eval logs + +#### Testing + +- **309 new eval tests** across 6 test files (unit, integration, E2E) +- 40 example scripts (2 eval-specific: `39_eval_framework.py`, `40_eval_advanced.py`) +- Full module documentation: `docs/modules/EVALS.md` + +--- + ## [0.16.7] - 2026-03-16 ### Removed diff --git a/landing/index.html b/landing/index.html index 4fbbfaf..b529aec 100644 --- a/landing/index.html +++ b/landing/index.html @@ -88,7 +88,7 @@
-The only agent framework with a built-in eval suite. No separate install, no SaaS account, no external dependencies. 22 evaluators out of the box.
+The only agent framework with a built-in eval suite. No separate install, no SaaS account, no external dependencies. 39 evaluators out of the box.
An honest comparison. Choose LangChain for ecosystem breadth. Choose Selectools when compliance and observability aren't optional.
-| Capability | Selectools | LangChain |
-|---|---|---|
-| Execution traces | Built-in | LangSmith (paid) |
-| Guardrails | Built-in (5 types) | NeMo (separate project) |
-| Audit logging | Built-in (4 privacy levels) | Build it yourself |
-| Cost tracking | Automatic per-call | Manual / LangSmith |
-| Injection defense | 15 patterns + coherence | Not included |
-| Setup | 1 package | 5+ packages |
-| Reasoning visibility | result.reasoning | Not available |
-| Agent evaluation | Built-in (22 evaluators) | LangSmith (paid) or DeepEval (separate) |
-| Community | Growing | Massive |
+| Capability | Selectools | LangChain | CrewAI |
+|---|---|---|---|
+| Execution traces | Built-in | LangSmith (paid) | Limited logging |
+| Guardrails | Built-in (5 types) | NeMo (separate) | Not built-in |
+| Audit logging | Built-in (4 privacy levels) | DIY | Not built-in |
+| Agent evaluation | Built-in (39 evaluators) | LangSmith (paid) | Not built-in |
+| Cost tracking | Automatic per-call | Manual | Not built-in |
+| Injection defense | 15 patterns + coherence | Not included | Not included |
+| Setup | 1 package | 5+ packages | 1 package |
+| Multi-agent | Coming v0.17.1 | LangGraph | Core feature |
+| Community | Growing | Massive | Large |
+
+| Capability | Selectools | DeepEval | Promptfoo | LangSmith |
+|---|---|---|---|---|
+| Install | Built-in | Separate pip | Node.js CLI | pip + account |
+| Evaluators | 39 | 50+ | ~20 | ~8 |
+| A/B testing | PairwiseEval | No | Side-by-side | Experiments |
+| Snapshot testing | SnapshotStore | No | No | No |
+| Regression detection | Local (JSON) | Cloud only | CLI + GitHub | SaaS only |
+| HTML report | Interactive (charts) | Cloud only | Self-contained | SaaS UI |
+| Works offline | Yes | Partial | Yes | No |
+| Price | Free | Free + SaaS | Free | $39/seat/mo |
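The deterministic-evaluator approach compared above runs entirely offline because checks like PII screening are just pattern matching. A minimal sketch of the idea behind a regex-based PII check — the patterns here are illustrative assumptions, not selectools' actual `PIILeakEvaluator` rules:

```python
import re

# Illustrative PII patterns; a real evaluator would use more robust rules.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the sorted names of PII categories found in `text`."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))

print(detect_pii("Reach me at jane@example.com or 555-867-5309"))  # ['email', 'phone']
print(detect_pii("No sensitive data here"))  # []
```

Because no model call is involved, a check like this costs nothing per case and gives the same verdict on every run.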