What's New in v3.0.0
Added
Trace UI: conversation-first observability + live streaming
- Capture layer. Every turn writes a per-turn events.jsonl alongside
response.txt / messages.json. Captures tool invocations
(tool.before / tool.after), model invocations (model.before /
model.after), message additions (message.added), observation masking
(context.mask), and lesson injection (lesson.injected). Events carry a
monotonic seq and a schema.version header. ObserverHooks.bus and
ObservationMaskingManager.bus are public settable attributes (default
None); adapters use runner.bus_for_turn(observer, capture_path), which
fans out to every bus-aware component reachable from the observer.
- Per-turn capture for terminal, HTTP, and Telegram.
response.txt and messages.json land in
<workspace>/.blueclaw/conversations/<id>/turns/turn-NNN/. <id> is the
HTTP conversation_id, the Telegram chat ID, or a per-process
YYYYMMDD-HHMMSS-xxxx timestamp for terminal. Helpers:
blueclaw.runner.next_capture_path plus a pure validate_session_id that
rejects /, \, \x00, whitespace, control chars, . and .., empty values,
and values longer than 128 chars. RunTrace gains a capture_path field
(relative to workspace.root); None for pre-feature traces.
- Backend conversation API.
GET /api/conversations, /api/conversations/<cid>, and
/api/conversations/<cid>/turns/<n>/events expose per-cid aggregates and
per-turn event streams. Aggregates are computed at query time from
existing traces, with no new persistence files.
- Conversation-first dashboard. New
#/conversations and #/conversations/<cid> views. Per-turn transcript
renders user / tool / assistant inline (tool use + tool result fold into
a single bordered card with full args, full result, and show-more for
long output). Deep details panel shows a waterfall combining tool steps
with model invocation bars and a virtualized raw events stream. Flat
traces view preserved as a secondary tab. Each row also shows an inline
preview chip (first line of response.txt, ≤200 chars) with a "view full"
link; GET /api/turns/<cid>/<n>/response and
/api/turns/<cid>/<n>/messages serve the captured artifacts (404 with
expected path + hint when pruned).
- Live event streaming.
blueclaw trace ui --live starts a Unix-socket broker at
~/.blueclaw/live.sock. Any blueclaw process started afterward detects
the socket and forwards every event in real time. The dashboard
subscribes via SSE at /api/conversations/<cid>/turns/<n>/events/live,
polls /api/conversations/<cid> every 3 s to detect new turns, and
reopens the EventSource when turn_count increases. A backfill +
dedup-by-seq handshake (with reset on a fresh schema.version event)
ensures no events are missed across turns or reconnects. Off by default;
opt in with --live.
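
The backfill + dedup-by-seq handshake can be pictured with a short Python sketch. It is not blueclaw's dashboard code: the SeqDeduper class, the "type" and "turn" keys, and the event shapes are hypothetical. Only the seq field and the reset on a fresh schema.version event come from the notes above.

```python
from typing import Any, Dict, List, Optional

Event = Dict[str, Any]


class SeqDeduper:
    """Drop already-seen events by seq; restart counting on a fresh header."""

    def __init__(self) -> None:
        self.header: Optional[Event] = None
        self.last_seq = -1

    def accept(self, event: Event) -> bool:
        # Assumption: the header is marked with type == "schema.version".
        if event.get("type") == "schema.version":
            if event == self.header:
                return False  # same header replayed by backfill and live feed
            # A fresh header starts a new stream, so seq tracking resets.
            self.header = event
            self.last_seq = event["seq"]
            return True
        if event["seq"] <= self.last_seq:
            return False  # duplicate of an event that was already rendered
        self.last_seq = event["seq"]
        return True


def merge(backfill: List[Event], live: List[Event]) -> List[Event]:
    """Apply backfill first, then the live feed, keeping each event once."""
    dedup = SeqDeduper()
    return [event for event in [*backfill, *live] if dedup.accept(event)]
```

In this sketch a reconnect that replays the header and the last few events adds nothing new, while a header for the next turn resets the counter so that turn's seq values are accepted again.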
Unified agent runner
- blueclaw/runner.py exposes runner_session (context manager),
finalize, finalize_error, and run_turn. runner_session.__exit__ runs
cleanup_mcp_clients unconditionally, so adapters can no longer forget
it.
- tests/test_no_direct_create_agent.py durability guard: any module
outside blueclaw/runner.py / blueclaw/session.py matching
\bcreate_agent\b fails the test. With HTTP migrated,
ALLOWLIST_PENDING_MIGRATION is empty.
- tests/test_server.py::TestStreamingWorkspaceErrorCleanup: structural
regression test asserting that when workspace.write_trace raises
mid-stream, the SSE error event emits AND cleanup_mcp_clients runs via
runner_session.__exit__.
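
Why putting cleanup in __exit__ makes it impossible to forget can be shown with a minimal sketch. SessionSketch and run_failing_turn are hypothetical stand-ins, not the runner_session implementation. The point is only that __exit__ runs whether or not the body raises.

```python
from typing import Callable, List


class SessionSketch:
    """Hypothetical stand-in for runner_session: cleanup always runs on exit."""

    def __init__(self, cleanup: Callable[[], None]) -> None:
        self._cleanup = cleanup

    def __enter__(self) -> "SessionSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Runs on normal exit and on exceptions alike.
        self._cleanup()
        return False  # do not swallow the adapter's exception


def run_failing_turn() -> List[str]:
    """Simulate a turn whose trace write fails mid-stream."""
    calls: List[str] = []
    try:
        with SessionSketch(lambda: calls.append("cleanup_mcp_clients")):
            raise RuntimeError("write_trace failed")
    except RuntimeError:
        calls.append("error reported")
    return calls
```

The cleanup entry is recorded before the error is handled, which is the ordering the structural regression test above relies on.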
Tools
- http_request: Cloudflare-aware fetch + article extraction. Replaced
urllib.urlopen with curl_cffi using Chrome 124 TLS impersonation so
blueclaw can fetch pages behind Cloudflare's bot challenge (Medium,
Substack, many news sites) that previously returned 403. HTML responses
run through trafilatura to strip boilerplate and return article title +
main body; typical reduction is ~80k → 2–8k tokens, which lets smaller
local models (Ollama gemma/qwen) actually read the result.
SessionConfig.http_extract_main YAML flag (default true) toggles
extraction. New runtime deps: curl-cffi>=0.7, trafilatura>=1.12.
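
The fetch-then-extract flow described above might look roughly like the following. This is a sketch under assumptions, not the http_request tool itself: fetch_article, is_html, and the extract_main parameter are hypothetical names, and the tool's real error handling and return shape are not documented here. The curl_cffi and trafilatura calls are the libraries' public ones.

```python
from typing import Optional, Tuple


def is_html(content_type: str) -> bool:
    """True for HTML media types, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def fetch_article(url: str, extract_main: bool = True) -> Tuple[Optional[str], str]:
    """Return (title, body); body is the raw text when extraction is skipped."""
    # Third-party imports stay local so is_html works without them installed.
    from curl_cffi import requests  # pip install "curl-cffi>=0.7"
    import trafilatura  # pip install "trafilatura>=1.12"

    # impersonate="chrome124" asks curl_cffi to mimic Chrome 124's TLS fingerprint.
    response = requests.get(url, impersonate="chrome124", timeout=30)
    response.raise_for_status()
    raw = response.text

    if not (extract_main and is_html(response.headers.get("content-type", ""))):
        return None, raw

    body = trafilatura.extract(raw)  # None when no main content is found
    metadata = trafilatura.extract_metadata(raw)
    title = getattr(metadata, "title", None)
    # Fall back to the raw page rather than returning nothing.
    return title, (body if body is not None else raw)
```

Passing extract_main=False mirrors turning the http_extract_main flag off: the caller gets the unmodified response text.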
Eval / test infrastructure
- Eval response capture. blueclaw test persists response.txt,
messages.json, and invocation.json per run to
~/blueclaw/test-runs/<invocation-ts>/case-<N>/run-<N>/. Decoupled from
--keep-workspace: artifacts persist regardless of workspace cleanup.
TAP formatter appends an artifacts: breadcrumb to failure records and
prints a final Artifacts: <path> line to stderr. Capture is
best-effort: write failures log to stderr and are recorded in
invocation.json:capture_failures but never fail the eval. Override
the root via BLUECLAW_ARTIFACTS_ROOT or run_spec(..., artifacts_root=).
- forbidden_output_regex test assertion. Inverse of output_regex: fails
the test if the regex matches. Lets specs assert on reworded refusal
phrasings that a single substring would miss.
- tests/eval/multi_turn_constraints.yaml behavioral regression spec,
pinned to Sonnet 4.6 (~$1–2 per full run, ~10–15 min, manual run only).
Scope reduced after triage: single-turn proxies fabricating prior turns
were rejected by honesty-trained models; tests 2 and 4 rewritten with
instruction framing. Rule D (api-channel constraint carry-forward) is
no longer covered by automated tests here; real multi-turn fixtures are
tracked as a follow-up.
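
The relationship between output_regex and forbidden_output_regex can be sketched in a few lines. The check_output helper and its failure messages are hypothetical; only the two assertion names and their must-match / must-not-match semantics come from the notes above.

```python
import re
from typing import Dict, List


def check_output(output: str, assertions: Dict[str, str]) -> List[str]:
    """Return a list of failure messages; an empty list means the case passes."""
    failures: List[str] = []

    required = assertions.get("output_regex")
    if required is not None and not re.search(required, output):
        failures.append(f"output_regex did not match: {required}")

    forbidden = assertions.get("forbidden_output_regex")
    if forbidden is not None and re.search(forbidden, output):
        # Inverse check: a match is the failure.
        failures.append(f"forbidden_output_regex matched: {forbidden}")

    return failures


# One pattern covers several refusal phrasings that a single substring would miss.
REFUSAL = r"(?i)\bI (can't|cannot|am unable to)\b"
```

For example, check_output("I cannot help with that.", {"forbidden_output_regex": REFUSAL}) reports one failure, while a direct answer reports none.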
System prompt
- Behavioral rules (tool-knowledge, partial-refusal,
correction-acknowledgment, cosmetic-compensation). Four new rules in
the shared **Rules:** block of build_system_prompt (both terminal and
api channels) targeting failure modes from an external eval: declining
without trying available tools, silently dropping parts of a request,
ignoring user corrections, reaching for formatting to mask thin
substance.
- api-channel "constraint carry-forward" rule. Replaces "Answer ONLY
what the user just asked" in the api/Telegram tone block; preserves
anti-recap intent while requiring the model to carry forward earlier
turns' constraints, deliverables, and corrections.
Notes
- Orphan events on mid-turn crash. If a turn crashes after
events.jsonl is written but before RunTrace finalization completes,
the events file remains on disk while the trace is missing. No automatic
cleanup yet; inspect manually with
find ~/blueclaw -name events.jsonl -newermt <date>.
A future release will extend blueclaw trace purge --older-than N to
cover orphans.
Changed
- Terminal sessions now carry a conversation_id (timestamp-based
per-process session ID) on trace and history records. Previously None
for terminal-sourced runs; downstream tooling that grouped by
conversation_id should account for the new value.
- /api/traces summaries include capture_path, plus either
capture_preview (file exists) or captures_pruned: true (directory
deleted). Both fields are absent when no capture exists.
- runner.finalize / runner.finalize_error accept an optional
workspace_root: Path | None kwarg; combined with capture_path it
stores the relativized path on the trace.
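
How a summary could derive the new fields from a relative capture_path is sketched below. The capture_fields helper is hypothetical, and returning no fields at all for a trace without a capture is an assumption. The field names, the response.txt source, and the 200-character first-line preview come from these notes.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def capture_fields(workspace_root: Path, capture_path: Optional[str]) -> Dict[str, Any]:
    """Build the capture-related part of a trace summary."""
    if capture_path is None:
        return {}  # assumption: pre-feature traces carry none of the fields

    fields: Dict[str, Any] = {"capture_path": capture_path}
    turn_dir = workspace_root / capture_path  # capture_path is workspace-relative
    response = turn_dir / "response.txt"

    if response.is_file():
        # Preview is the first line of the response, capped at 200 characters.
        first_line = response.read_text(encoding="utf-8").split("\n", 1)[0]
        fields["capture_preview"] = first_line[:200]
    elif not turn_dir.exists():
        fields["captures_pruned"] = True
    return fields
```

Storing the path relative to the workspace root keeps summaries valid if the workspace directory is moved.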
Security
- HTTP POST /message and POST /message/stream validate the
client-supplied conversation_id against path-traversal characters and
unsafe values. Invalid IDs receive a generic
{"error": "invalid conversation_id"} 400 that does not echo the
rejected value (logged server-side only).
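
A validation pass of this kind, together with the non-echoing error body, could look like the sketch below. is_safe_session_id and handle_message are hypothetical names, not blueclaw's validate_session_id or its HTTP handlers, and treating 128 characters as the largest allowed length is an assumption.

```python
from typing import Any, Dict, Tuple

MAX_ID_LENGTH = 128  # assumption: IDs longer than this are rejected


def is_safe_session_id(value: str) -> bool:
    """Reject IDs that could escape the conversations directory or confuse logs."""
    if not value or len(value) > MAX_ID_LENGTH:
        return False
    if value in (".", ".."):
        return False
    for ch in value:
        # Path separators, whitespace, and control characters (including NUL).
        if ch in "/\\" or ch.isspace() or ord(ch) < 32 or ord(ch) == 127:
            return False
    return True


def handle_message(conversation_id: str) -> Tuple[int, Dict[str, Any]]:
    """Return (status, body); the rejected value never appears in the body."""
    if not is_safe_session_id(conversation_id):
        return 400, {"error": "invalid conversation_id"}
    return 200, {"conversation_id": conversation_id}
```

Keeping the error body constant means a client cannot use the response to reflect attacker-controlled text back to a caller.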
Fixed
- runner_session.__exit__ enforces cleanup_mcp_clients for any adapter
that uses the runner, closing a class of bug structurally. The
BridgeRouter.handle_message (Telegram) cleanup miss was the proof case;
the Telegram migration to the runner (also in this release) realizes the
fix for that call site.
- blueclaw run "..." exits non-zero with an error message when the agent
raises (previously regressed to silent exit 0 during the terminal
migration; restored before merge).
- EventBus.emit no longer lets callers shadow bus-controlled seq / ts
by including those keys in the event payload; spread order swapped so
bus fields always win.
- _drop_subscriber no longer recurses unbounded when multiple subscribers
overflow during the same emit. Drops are collected during fan-out and
the synthetic stream.dropped notices fire after the loop completes.
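
The two EventBus fixes can be illustrated together. BusSketch is a hypothetical miniature, not the EventBus class: subscribers are plain callables, overflow is signalled with OverflowError, and the "type" key is assumed. It shows bus fields merged last so they win, and drops collected during fan-out with notices sent afterwards.

```python
import itertools
import time
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Subscriber = Callable[[Event], None]


class BusSketch:
    def __init__(self) -> None:
        self._seq = itertools.count()
        self.subscribers: List[Subscriber] = []

    def _stamp(self, payload: Event) -> Event:
        # Bus fields come last in the merge, so a payload "seq" or "ts" cannot win.
        return {**payload, "seq": next(self._seq), "ts": time.time()}

    def _fan_out(self, event: Event) -> List[Subscriber]:
        """Deliver to every subscriber; return the ones that overflowed."""
        overflowed: List[Subscriber] = []
        for sub in list(self.subscribers):
            try:
                sub(event)
            except OverflowError:
                overflowed.append(sub)
        return overflowed

    def emit(self, payload: Event) -> Event:
        event = self._stamp(payload)
        dropped = self._fan_out(event)
        for sub in dropped:
            self.subscribers.remove(sub)
        # Notices fire only after the loop, one per drop, without re-entering emit.
        for _ in dropped:
            for sub in self._fan_out(self._stamp({"type": "stream.dropped"})):
                self.subscribers.remove(sub)
        return event
```

Because the notices are sent from a flat loop rather than from inside the drop handler, the call depth stays constant no matter how many subscribers overflow in one emit.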
Installation
pip install blueclaw==3.0.0