diff --git a/CHANGELOG.md b/CHANGELOG.md
index f5e09e9..85562d3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,54 @@ All notable changes to selectools will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.17.5] - 2026-03-23
+
+### Fixed — Bug Hunt (91 validated fixes across 7 subsystems)
+
+#### Critical (13)
+- **Path traversal in `JsonFileSessionStore`** — session IDs now validated against directory escape
+- **Unicode homoglyph bypass** in prompt injection screening — NFKD normalization + zero-width stripping
+- **`FallbackProvider` stream** records success after consumption, not before — circuit breaker works for streaming
+- **Gemini `response.text` ValueError** on tool-call-only responses — caught and handled
+- **`astream()` model_selector** was using `self.config.model` — now uses `self._effective_model`
+- **Sync `_check_policy`** silently approved async `confirm_action` — now rejects with clear error
+- **`aexecute()` ThreadPoolExecutor per call** — replaced with shared executor via `run_in_executor(None)`
+- **`execute()` on async tools** returned coroutine string repr — now awaits via `asyncio.run`
+- **Hybrid search O(n²)** `_find_matching_key` — replaced with O(1) `text_to_key` dict lookup
+- **`SQLiteVectorStore`** no thread safety — added `threading.Lock` + WAL mode
+- **`FileKnowledgeStore._save_all()`** not crash-safe — atomic write via tmp + `os.replace`
+- **`OutputEvaluator`** crashed on invalid regex — wrapped in `try/except re.error`
+- **`JsonValidityEvaluator`** ignored `expect_json=False` — guard now checks falsy, not just None
+
+#### High (26)
+- **`astream()` cancellation/budget paths** now build proper trace steps + fire async observer events
+- **`arun()` early exits** now fire `_anotify_observers("on_run_end")` for cancel/budget/max-iter
+- **`_aexecute_tools_parallel`** fires async observer events + tracks `tool_usage`/`tool_tokens`
+- **Sync `_streaming_call`** no longer stringifies `ToolCall` objects (pitfall #2)
+- **16 LLM evaluators** silently passed on unparseable scores — now return `EvalFailure`
+- **XSS in eval dashboard** — `innerHTML` replaced with `createElement`/`textContent`
+- **Donut SVG 360° arc** renders nothing — now draws two semicircles for full annulus
+- **SSN regex** matched ZIP+4 codes — now requires consistent separators
+- **Coherence LLM costs** tracked in `CoherenceResult.usage` + merged into agent usage
+- **Coherence `fail_closed`** option added (default: fail-open for backward compat)
+- Plus 16 more HIGH fixes across tools, RAG, memory, and security subsystems
+
+#### Medium (30) and Low (22)
+- `datetime.utcnow()` → `datetime.now(timezone.utc)` throughout knowledge stores
+- `ConversationMemory.clear()` now resets `_summary`
+- SQLite WAL mode + indexes for knowledge and session stores
+- Non-deterministic `hash()` → `hashlib.sha256` for document IDs in 3 vector stores
+- OpenAI `embed_texts()` batching at 2048 per request
+- Tool result caching: `_serialize_result` returns `""` for None, not `"None"`
+- Bool values rejected for int/float tool parameters
+- `ToolRegistry.tool()` now forwards `screen_output`, `terminal`, `requires_approval`
+- Plus 40+ more fixes (see `.private/BUG_HUNT_VALIDATED.md` for complete list)
+
+### Added
+- **Async guardrails** — `Guardrail.acheck()` with `asyncio.to_thread` default, `GuardrailsPipeline.acheck_input()`/`acheck_output()`, `Agent._arun_input_guardrails()`. `arun()`/`astream()` no longer block the event loop during guardrail checks.
+- 40 new regression tests covering all critical and high-severity fixes
+- 5 new entries in CLAUDE.md Common Pitfalls (#14-#18)
+
 ## [0.17.4] - 2026-03-22
 
 ### Added
diff --git a/CLAUDE.md b/CLAUDE.md
index c37d9b9..050a085 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -94,7 +94,7 @@ src/selectools/
 ├── junit.py # JUnit XML for CI
 └── __main__.py # CLI: python -m selectools.evals
 
-tests/ # 2113 tests (unit, integration, regression, E2E)
+tests/ # 2183 tests (unit, integration, regression, E2E)
 ├── agent/ # Agent core tests
 ├── providers/ # Provider-specific tests
 ├── rag/ # RAG pipeline tests
@@ -304,6 +304,16 @@ Every `AgentTrace` contains `TraceStep` entries with one of these types:
 
 13. **Hooks are deprecated — use observers**: `AgentConfig.hooks` (a plain dict of callbacks) is deprecated. Passing `hooks` emits a `DeprecationWarning` and internally wraps the dict via `_HooksAdapter(AgentObserver)`. New code should always use `AgentObserver` or `AsyncAgentObserver` instead.
 
+14. **FallbackProvider `stream()` / `astream()` must record success AFTER consumption**: The generator must be fully consumed before calling `_record_success()`. Recording before consumption means the circuit breaker never trips on streaming errors. Fixed in v0.17.5.
+
+15. **`astream()` direct provider calls must use `self._effective_model`**: Unlike `run()`/`arun()` which go through `_call_provider`/`_acall_provider`, `astream()` calls providers directly. All model references in `astream()` must use `self._effective_model`, not `self.config.model`.
+
+16. **Async observer events must fire in all exit paths**: The shared `_build_cancelled_result`, `_build_budget_exceeded_result`, and `_build_max_iterations_result` only fire sync observers. In `arun()`/`astream()`, always add `await self._anotify_observers(...)` after calling these helpers.
+
+17. **`datetime.utcnow()` is deprecated — use `datetime.now(timezone.utc)`**: All datetime defaults in dataclasses must use `field(default_factory=lambda: datetime.now(timezone.utc))`, not `default_factory=datetime.utcnow`. The `is_expired` property and pruning code must also use aware datetimes.
+
+18. **Guardrails have async support**: `Guardrail.acheck()` runs sync `check()` via `asyncio.to_thread` by default. `GuardrailsPipeline` has `acheck_input()`/`acheck_output()`. `arun()`/`astream()` use `_arun_input_guardrails()` with `skip_guardrails=True` in `_prepare_run()` to avoid blocking the event loop.
+
 ## Current Roadmap
 
 - **v0.15.0** ✅ Enterprise Reliability (guardrails, audit, screening, coherence)
@@ -319,7 +329,7 @@ Every `AgentTrace` contains `TraceStep` entries with one of these types:
 - **v0.17.1** ✅ MCP Client/Server — MCPClient, mcp_tools(), MCPServer, MultiMCPClient, circuit breaker
 - **v0.17.3** ✅ Agent Runtime Controls — token budget, cancellation, cost attribution, structured results, approval gate, SimpleStepObserver
 - **v0.17.4** ✅ Agent Intelligence — token estimation, model switching, knowledge memory enhancement (4 store backends)
-- **v0.17.5** 🟡 Tech Debt & Quick Wins — bug fixes, ReAct/CoT strategies, tool result caching, Python 3.9–3.13 CI
+- **v0.17.5** ✅ Bug Hunt & Async Guardrails — 91 validated fixes, async guardrails, 40 regression tests
 - **v0.17.6** 🟡 Caching & Context — semantic caching, prompt compression, conversation branching
 - **v0.18.0** 🟡 Multi-Agent Orchestration — see `MULTI_AGENT_PLAN.md`
 - **v0.18.x** 🟡 Composability Layer — Pipeline with `@step` + `|` operator (LCEL alternative)
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 53ccc63..0cc8e3b 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -3,7 +3,7 @@
 Thank you for your interest in contributing to Selectools! We welcome contributions from the community.
 
 **Current Version:** v0.17.4
-**Test Status:** 2113 tests passing (100%)
+**Test Status:** 2183 tests passing (100%)
 **Python:** 3.13+
 
 ## Getting Started
@@ -74,7 +74,7 @@ Similar to `npm run` scripts, here are the common commands for this project:
 ### Testing
 
 ```bash
-# Run all tests (2113 tests)
+# Run all tests (2183 tests)
 pytest tests/ -v
 
 # Run tests quietly (summary only)
@@ -264,7 +264,7 @@ selectools/
 │ ├── embeddings/ # Embedding providers
 │ ├── rag/ # RAG: vector stores, chunking, loaders
 │ └── toolbox/ # 24 pre-built tools
-├── tests/ # Test suite (2113 tests)
+├── tests/ # Test suite (2183 tests)
 │ ├── agent/ # Agent tests
 │ ├── rag/ # RAG tests
 │ ├── tools/ # Tool tests
@@ -370,7 +370,7 @@ We especially welcome contributions in these areas:
 - Add comparison guides (vs LangChain, LlamaIndex)
 
 ### 🧪 **Testing**
-- Increase test coverage (currently 2113 tests passing!)
+- Increase test coverage (currently 2183 tests passing!)
 - Add performance benchmarks
 - Improve E2E test stability with retry/rate-limit handling
 
diff --git a/README.md b/README.md
index 48aa1e5..d97013b 100644
--- a/README.md
+++ b/README.md
@@ -171,7 +171,7 @@ report.to_html("report.html")
 - **49 Examples**: RAG, hybrid search, streaming, structured output, traces, batch, policy, observer, guardrails, audit, sessions, entity memory, knowledge graph, eval framework, and more
 - **Built-in Eval Framework**: 39 evaluators (21 deterministic + 18 LLM-as-judge), A/B testing, regression detection, HTML reports, JUnit XML, snapshot testing
 - **AgentObserver Protocol**: 31 lifecycle events with `run_id` correlation, `LoggingObserver`, `SimpleStepObserver`, OTel export
-- **2113 Tests**: Unit, integration, regression, and E2E with real API calls
+- **2183 Tests**: Unit, integration, regression, and E2E with real API calls
 
 ## Install
 
@@ -740,7 +740,7 @@ pytest tests/ -x -q # All tests
 pytest tests/ -k "not e2e" # Skip E2E (no API keys needed)
 ```
 
-2082 tests covering parsing, agent loop, providers, RAG pipeline, hybrid search, advanced chunking, dynamic tools, caching, streaming, guardrails, sessions, memory, eval framework, budget/cancellation, knowledge stores, and E2E integration.
+2183 tests covering parsing, agent loop, providers, RAG pipeline, hybrid search, advanced chunking, dynamic tools, caching, streaming, guardrails, sessions, memory, eval framework, budget/cancellation, knowledge stores, and E2E integration.
 
 ## License
 
diff --git a/ROADMAP.md b/ROADMAP.md
index 1d3595e..6566de4 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -208,9 +208,9 @@ v0.17.3 ✅ Agent Runtime Controls
 v0.17.4 ✅ Agent Intelligence
  Token estimation → Model switching → Knowledge memory enhancement (4 store backends)
 
-v0.17.5 🟡 Tech Debt & Quick Wins
- Stream fallback fix → abatch thread safety → async guardrails → ReAct/CoT strategies
- → Tool result caching → Python 3.9–3.13 CI matrix
+v0.17.5 ✅ Bug Hunt & Async Guardrails
+ 91 validated fixes (13 critical, 26 high, 52 medium+low) → Async guardrails
+ → 40 regression tests → 5 new Common Pitfalls
 
 v0.17.6 🟡 Caching & Context
  Semantic caching → Prompt compression → Conversation branching
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
index f5e09e9..85562d3 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -5,6 +5,54 @@ All notable changes to selectools will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.17.5] - 2026-03-23
+
+### Fixed — Bug Hunt (91 validated fixes across 7 subsystems)
+
+#### Critical (13)
+- **Path traversal in `JsonFileSessionStore`** — session IDs now validated against directory escape
+- **Unicode homoglyph bypass** in prompt injection screening — NFKD normalization + zero-width stripping
+- **`FallbackProvider` stream** records success after consumption, not before — circuit breaker works for streaming
+- **Gemini `response.text` ValueError** on tool-call-only responses — caught and handled
+- **`astream()` model_selector** was using `self.config.model` — now uses `self._effective_model`
+- **Sync `_check_policy`** silently approved async `confirm_action` — now rejects with clear error
+- **`aexecute()` ThreadPoolExecutor per call** — replaced with shared executor via `run_in_executor(None)`
+- **`execute()` on async tools** returned coroutine string repr — now awaits via `asyncio.run`
+- **Hybrid search O(n²)** `_find_matching_key` — replaced with O(1) `text_to_key` dict lookup
+- **`SQLiteVectorStore`** no thread safety — added `threading.Lock` + WAL mode
+- **`FileKnowledgeStore._save_all()`** not crash-safe — atomic write via tmp + `os.replace`
+- **`OutputEvaluator`** crashed on invalid regex — wrapped in `try/except re.error`
+- **`JsonValidityEvaluator`** ignored `expect_json=False` — guard now checks falsy, not just None
+
+#### High (26)
+- **`astream()` cancellation/budget paths** now build proper trace steps + fire async observer events
+- **`arun()` early exits** now fire `_anotify_observers("on_run_end")` for cancel/budget/max-iter
+- **`_aexecute_tools_parallel`** fires async observer events + tracks `tool_usage`/`tool_tokens`
+- **Sync `_streaming_call`** no longer stringifies `ToolCall` objects (pitfall #2)
+- **16 LLM evaluators** silently passed on unparseable scores — now return `EvalFailure`
+- **XSS in eval dashboard** — `innerHTML` replaced with `createElement`/`textContent`
+- **Donut SVG 360° arc** renders nothing — now draws two semicircles for full annulus
+- **SSN regex** matched ZIP+4 codes — now requires consistent separators
+- **Coherence LLM costs** tracked in `CoherenceResult.usage` + merged into agent usage
+- **Coherence `fail_closed`** option added (default: fail-open for backward compat)
+- Plus 16 more HIGH fixes across tools, RAG, memory, and security subsystems
+
+#### Medium (30) and Low (22)
+- `datetime.utcnow()` → `datetime.now(timezone.utc)` throughout knowledge stores
+- `ConversationMemory.clear()` now resets `_summary`
+- SQLite WAL mode + indexes for knowledge and session stores
+- Non-deterministic `hash()` → `hashlib.sha256` for document IDs in 3 vector stores
+- OpenAI `embed_texts()` batching at 2048 per request
+- Tool result caching: `_serialize_result` returns `""` for None, not `"None"`
+- Bool values rejected for int/float tool parameters
+- `ToolRegistry.tool()` now forwards `screen_output`, `terminal`, `requires_approval`
+- Plus 40+ more fixes (see `.private/BUG_HUNT_VALIDATED.md` for complete list)
+
+### Added
+- **Async guardrails** — `Guardrail.acheck()` with `asyncio.to_thread` default, `GuardrailsPipeline.acheck_input()`/`acheck_output()`, `Agent._arun_input_guardrails()`. `arun()`/`astream()` no longer block the event loop during guardrail checks.
+- 40 new regression tests covering all critical and high-severity fixes
+- 5 new entries in CLAUDE.md Common Pitfalls (#14-#18)
+
 ## [0.17.4] - 2026-03-22
 
 ### Added
diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md
index 53ccc63..0cc8e3b 100644
--- a/docs/CONTRIBUTING.md
+++ b/docs/CONTRIBUTING.md
@@ -3,7 +3,7 @@
 Thank you for your interest in contributing to Selectools! We welcome contributions from the community.
 
 **Current Version:** v0.17.4
-**Test Status:** 2113 tests passing (100%)
+**Test Status:** 2183 tests passing (100%)
 **Python:** 3.13+
 
 ## Getting Started
@@ -74,7 +74,7 @@ Similar to `npm run` scripts, here are the common commands for this project:
 ### Testing
 
 ```bash
-# Run all tests (2113 tests)
+# Run all tests (2183 tests)
 pytest tests/ -v
 
 # Run tests quietly (summary only)
@@ -264,7 +264,7 @@ selectools/
 │ ├── embeddings/ # Embedding providers
 │ ├── rag/ # RAG: vector stores, chunking, loaders
 │ └── toolbox/ # 24 pre-built tools
-├── tests/ # Test suite (2113 tests)
+├── tests/ # Test suite (2183 tests)
 │ ├── agent/ # Agent tests
 │ ├── rag/ # RAG tests
 │ ├── tools/ # Tool tests
@@ -370,7 +370,7 @@ We especially welcome contributions in these areas:
 - Add comparison guides (vs LangChain, LlamaIndex)
 
 ### 🧪 **Testing**
-- Increase test coverage (currently 2113 tests passing!)
+- Increase test coverage (currently 2183 tests passing!)
 - Add performance benchmarks
 - Improve E2E test stability with retry/rate-limit handling
 
diff --git a/docs/index.md b/docs/index.md
index cf3fc4c..25182f0 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -139,7 +139,7 @@ print(result.reasoning) # Why the agent chose get_weather
 | **AgentObserver Protocol** | 31-event lifecycle observer with run/call ID correlation, `SimpleStepObserver`, and OTel export |
 | **Runtime Controls** | Token/cost budget limits, cooperative cancellation, per-tool approval gates, model switching per iteration |
 | **Eval Framework** | 39 built-in evaluators, A/B testing, regression detection, HTML reports, JUnit XML |
-| **2113 Tests** | Unit, integration, regression, and E2E |
+| **2183 Tests** | Unit, integration, regression, and E2E |
 
 ---
 
diff --git a/landing/index.html b/landing/index.html
index 0ed73c8..4cef16f 100644
--- a/landing/index.html
+++ b/landing/index.html
@@ -89,7 +89,7 @@