Release v3.0.0 · jztan/blueclaw

What's New in v3.0.0

Added

Trace UI: conversation-first observability + live streaming

Capture layer. Every turn writes a per-turn events.jsonl alongside
response.txt / messages.json. Captures tool invocations
(tool.before / tool.after), model invocations (model.before /
model.after), message additions (message.added), observation masking
(context.mask), and lesson injection (lesson.injected). Events carry
monotonic seq and a schema.version header. ObserverHooks.bus and
ObservationMaskingManager.bus are public settable attributes (default
None); adapters use runner.bus_for_turn(observer, capture_path) which
fans out to every bus-aware component reachable from the observer.

Per-turn capture for terminal, HTTP, and Telegram. response.txt and
messages.json land in
<workspace>/.blueclaw/conversations/<id>/turns/turn-NNN/. <id> is the
HTTP conversation_id, the Telegram chat ID, or
a per-process YYYYMMDD-HHMMSS-xxxx timestamp for terminal. Helpers:
blueclaw.runner.next_capture_path plus a pure validate_session_id that
rejects /, \, \x00, whitespace, control chars, ./.., empty strings, or
IDs longer than 128 chars. RunTrace gains a capture_path field (relative
to workspace.root); None for pre-feature traces.

Backend conversation API. GET /api/conversations,
/api/conversations/<cid>, /api/conversations/<cid>/turns/<n>/events
expose per-cid aggregates and per-turn event streams. Aggregates computed
at query time from existing traces, with no new persistence files.

Conversation-first dashboard. New #/conversations and
#/conversations/<cid> views. Per-turn transcript renders user / tool /
assistant inline (tool use + tool result fold into a single bordered card
with full args, full result, and show-more for long output). Deep details
panel shows a waterfall combining tool steps with model invocation bars
and a virtualized raw events stream. Flat traces view preserved as a
secondary tab. Each row also shows an inline preview chip (first line of
response.txt, ≤200 chars) with a "view full" link;
GET /api/turns/<cid>/<n>/response and /api/turns/<cid>/<n>/messages
serve the captured artifacts (404 with expected path + hint when pruned).

Live event streaming. blueclaw trace ui --live starts a Unix-socket
broker at ~/.blueclaw/live.sock. Any blueclaw process started afterward
detects the socket and forwards every event in real time. The dashboard
subscribes via SSE at /api/conversations/<cid>/turns/<n>/events/live,
polls /api/conversations/<cid> every 3 s to detect new turns, and
reopens the EventSource when turn_count increases. A backfill +
dedup-by-seq handshake (with reset on a fresh schema.version event)
ensures no events are missed across turns or reconnects. Off by default;
opt in with
--live.
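The backfill + dedup-by-seq handshake can be pictured with a small sketch. It assumes each event is a dict with a seq field, that the header event has type "schema.version", and that a header differing from the last one seen counts as "fresh". SeqDeduper, merge, and the field names are hypothetical illustrations, not the dashboard's actual code.

```python
from typing import Iterable, Iterator, Optional


class SeqDeduper:
    """Drops events already delivered; restarts on a fresh header event."""

    def __init__(self) -> None:
        self.header: Optional[dict] = None  # last schema.version event seen
        self.last_seq = -1                  # highest seq delivered so far

    def accept(self, event: dict) -> bool:
        if event.get("type") == "schema.version":
            if event == self.header:
                return False  # same header replayed by backfill or reconnect
            # Fresh header: a new turn's stream, so seq tracking restarts.
            self.header = event
            self.last_seq = event.get("seq", -1)
            return True
        if event["seq"] <= self.last_seq:
            return False  # duplicate of something already delivered
        self.last_seq = event["seq"]
        return True


def merge(backfill: Iterable[dict], live: Iterable[dict]) -> Iterator[dict]:
    """Yield backfilled events, then live ones, skipping duplicates."""
    dedup = SeqDeduper()
    for source in (backfill, live):
        for event in source:
            if dedup.accept(event):
                yield event
```

Tracking only the highest delivered seq is enough here because seq is monotonic within a turn; a set of seen values would only be needed if events could arrive out of order.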

Unified agent runner

blueclaw/runner.py exposes runner_session (context manager),
finalize, finalize_error, and run_turn. runner_session.__exit__
runs cleanup_mcp_clients unconditionally — adapters can no longer
forget it.

tests/test_no_direct_create_agent.py durability guard: any module
outside blueclaw/runner.py / blueclaw/session.py matching
\bcreate_agent\b fails the test. With HTTP migrated,
ALLOWLIST_PENDING_MIGRATION is empty.

tests/test_server.py::TestStreamingWorkspaceErrorCleanup: structural
regression test asserting that when workspace.write_trace raises
mid-stream, the SSE error event is emitted AND cleanup_mcp_clients runs via
runner_session.__exit__.
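The guarantee described above (cleanup runs even when the turn raises) is the standard context-manager pattern. A minimal sketch follows, assuming a class-based manager; RunnerSession and fake_cleanup are hypothetical stand-ins, not the real runner_session or cleanup_mcp_clients signatures.

```python
class RunnerSession:
    """Hypothetical stand-in for runner_session: cleanup always runs on exit."""

    def __init__(self, cleanup):
        self._cleanup = cleanup  # e.g. a cleanup_mcp_clients-style callable

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._cleanup()  # runs whether or not the body raised
        return False     # never swallow the adapter's exception


calls = []


def fake_cleanup():
    calls.append("cleanup")


try:
    with RunnerSession(fake_cleanup):
        # Simulates a failure such as write_trace raising mid-stream.
        raise RuntimeError("write_trace failed")
except RuntimeError:
    calls.append("error surfaced")
```

Because the cleanup lives in __exit__ rather than in each adapter, an adapter that uses the session cannot skip it, and the original exception still reaches the caller.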

Tools

http_request: Cloudflare-aware fetch + article extraction. Replaced
urllib.request.urlopen with curl_cffi using Chrome 124 TLS impersonation so
blueclaw can fetch pages behind Cloudflare's bot challenge (Medium,
Substack, many news sites) that previously returned 403. HTML responses
run through trafilatura to strip boilerplate and return article title +
main body — typical reduction ~80k → 2–8k tokens, which lets smaller
local models (Ollama gemma/qwen) actually read the result.
SessionConfig.http_extract_main YAML flag (default true) toggles
extraction. New runtime deps: curl-cffi>=0.7, trafilatura>=1.12.

Eval / test infrastructure

Eval response capture. blueclaw test persists response.txt,
messages.json, and invocation.json per run to
~/blueclaw/test-runs/<invocation-ts>/case-<N>/run-<N>/. Decoupled from
--keep-workspace — artifacts persist regardless of workspace cleanup.
TAP formatter appends an artifacts: breadcrumb to failure records and
prints a final Artifacts: <path> line to stderr. Capture is
best-effort: write failures are logged to stderr and recorded in
invocation.json:capture_failures but never fail the eval. Override
the root via BLUECLAW_ARTIFACTS_ROOT or run_spec(..., artifacts_root=).

forbidden_output_regex test assertion. Inverse of output_regex
— fails the test if the regex matches. Lets specs assert on reworded
refusal phrasings that a single substring would miss.

tests/eval/multi_turn_constraints.yaml behavioral regression spec,
pinned to Sonnet 4.6 (~$1–2 per full run, ~10–15 min, manual run only).
Scope reduced after triage: single-turn proxies fabricating prior turns
were rejected by honesty-trained models; tests 2 and 4 rewritten with
instruction framing. Rule D (api-channel constraint carry-forward) is
no longer covered by automated tests here — real multi-turn fixtures
tracked as a follow-up.
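To illustrate how forbidden_output_regex inverts output_regex, here is a minimal sketch. It assumes both assertions use re.search semantics; check_output and the refusal pattern are hypothetical and are not taken from blueclaw's test runner.

```python
import re
from typing import List, Optional


def check_output(
    output: str,
    output_regex: Optional[str] = None,
    forbidden_output_regex: Optional[str] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means the case passed."""
    failures = []
    if output_regex is not None and not re.search(output_regex, output):
        failures.append("output_regex did not match")
    if forbidden_output_regex is not None and re.search(
        forbidden_output_regex, output
    ):
        failures.append("forbidden_output_regex matched")
    return failures


# One pattern covers several refusal phrasings that a single substring would miss.
REFUSAL = r"(?i)\bI (can't|cannot|am unable to)\b"

refused = check_output("Sorry, I am unable to fetch that page.",
                       forbidden_output_regex=REFUSAL)
answered = check_output("Here is the summary of the article.",
                        forbidden_output_regex=REFUSAL)
```

Here refused contains one failure and answered is empty, so the same spec entry catches "I can't", "I cannot", and "I am unable to" alike.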

System prompt

Behavioral rules (tool-knowledge, partial-refusal,
correction-acknowledgment, cosmetic-compensation). Four new rules in the shared
**Rules:** block of build_system_prompt (both terminal and api
channels) targeting failure modes from an external eval: declining
without trying available tools, silently dropping parts of a request,
ignoring user corrections, reaching for formatting to mask thin
substance.

api-channel "constraint carry-forward" rule. Replaces "Answer ONLY
what the user just asked" in the api/Telegram tone block — preserves
anti-recap intent while requiring the model to carry forward earlier
turns' constraints, deliverables, and corrections.

Notes

Orphan events on mid-turn crash. If a turn crashes after
events.jsonl is written but before RunTrace finalization completes,
the events file remains on disk while the trace is missing. There is no
automatic cleanup yet; inspect manually with
find ~/blueclaw -name events.jsonl -newermt <date>
A future release will extend blueclaw trace purge --older-than N to
cover orphans.

Changed

Terminal sessions now carry a conversation_id (timestamp-based
per-process session ID) on trace and history records. Previously None for
terminal-sourced runs; downstream tooling that grouped by
conversation_id should account for the new value.

/api/traces summaries include capture_path, plus either
capture_preview (file exists) or captures_pruned: true (directory
deleted). Both fields are absent when no capture exists.

runner.finalize / runner.finalize_error accept an optional
workspace_root: Path | None kwarg; combined with capture_path it
stores the relativized path on the trace.
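How a capture path might be relativized against workspace_root can be sketched with pathlib. This is an illustration only: relativize_capture_path is a hypothetical helper, and the fallback of keeping the path unchanged when it lies outside the workspace (or when no root is given) is an assumption, not documented behavior.

```python
from pathlib import Path
from typing import Optional


def relativize_capture_path(
    capture_path: Optional[Path], workspace_root: Optional[Path]
) -> Optional[str]:
    """Return capture_path relative to workspace_root when possible."""
    if capture_path is None:
        return None  # mirrors pre-feature traces, which have no capture
    if workspace_root is None:
        return capture_path.as_posix()
    try:
        return capture_path.relative_to(workspace_root).as_posix()
    except ValueError:
        # Assumed fallback: path is outside the workspace, keep it as given.
        return capture_path.as_posix()
```

Storing the relative form keeps traces valid if the workspace directory is later moved or mounted elsewhere.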

Security

HTTP POST /message and POST /message/stream validate the
client-supplied conversation_id against path-traversal characters and
unsafe values. Invalid IDs receive a generic
{"error": "invalid conversation_id"} 400 that does not echo the
rejected value (logged
server-side only).
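A validation step of this kind might look like the sketch below. It is not blueclaw's validate_session_id: is_safe_session_id and reject_if_invalid are hypothetical names, the 128-character ceiling is assumed, and "control chars" is interpreted here as Unicode category Cc.

```python
import unicodedata
from typing import Optional, Tuple

MAX_ID_LENGTH = 128  # assumed upper bound on ID length


def is_safe_session_id(session_id: str) -> bool:
    """Pure check: True only if the ID is safe to use as a directory name."""
    if not session_id or len(session_id) > MAX_ID_LENGTH:
        return False
    if session_id in (".", ".."):
        return False
    for ch in session_id:
        if ch in "/\\":
            return False  # path separators enable traversal
        if ch.isspace() or unicodedata.category(ch) == "Cc":
            return False  # whitespace and control chars, including \x00
    return True


def reject_if_invalid(conversation_id: str) -> Optional[Tuple[int, dict]]:
    """Return a generic 400 payload for bad IDs, or None when the ID is fine."""
    if is_safe_session_id(conversation_id):
        return None
    # The rejected value is deliberately absent from the response body.
    return 400, {"error": "invalid conversation_id"}
```

Keeping the check pure (no filesystem access, no logging) makes it easy to unit-test; logging the rejected value would belong in the caller, on the server side only.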

Fixed

runner_session.__exit__ enforces cleanup_mcp_clients for any adapter
that uses the runner — closing a class of bug structurally. The
BridgeRouter.handle_message (Telegram) cleanup miss was the proof case;
the Telegram migration to the runner (also in this release) realizes the
fix for that call site.

blueclaw run "..." exits non-zero with an error message when the agent
raises (previously regressed to silent exit 0 during the terminal
migration; restored before merge).

EventBus.emit no longer lets callers shadow bus-controlled seq / ts
by including those keys in the event payload — spread order swapped so
bus fields always win.

_drop_subscriber no longer recurses unbounded when multiple subscribers
overflow during the same emit. Drops are collected during fan-out and
the synthetic stream.dropped notices fire after the loop completes.
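The two bus fixes above can be illustrated together. The sketch below is not blueclaw's EventBus: SketchBus, the use of OverflowError as the overflow signal, and treating type as a bus-controlled field are all assumptions made for the example.

```python
import itertools
import time


class SketchBus:
    """Hypothetical bus showing spread order and deferred subscriber drops."""

    def __init__(self):
        self._seq = itertools.count()
        self._subscribers = []  # callables that take one event dict

    def subscribe(self, fn):
        self._subscribers.append(fn)

    def _stamp(self, event_type, payload):
        # Payload is spread first, so the bus-controlled keys written after
        # it always win over caller-supplied seq / ts.
        return {**payload, "type": event_type,
                "seq": next(self._seq), "ts": time.time()}

    def emit(self, event_type, payload):
        first = self._stamp(event_type, payload)
        pending = [first]
        while pending:  # iterative, so drops never trigger recursion
            event = pending.pop(0)
            dropped = []
            for sub in list(self._subscribers):
                try:
                    sub(event)
                except OverflowError:  # assumed "subscriber is full" signal
                    dropped.append(sub)
            # Drops are handled only after the fan-out loop completes.
            for sub in dropped:
                self._subscribers.remove(sub)
                pending.append(self._stamp("stream.dropped", {}))
        return first


def overflowing(event):
    raise OverflowError("subscriber queue full")


received = []
bus = SketchBus()
bus.subscribe(received.append)
bus.subscribe(overflowing)
bus.subscribe(overflowing)
event = bus.emit("tool.before", {"seq": 999, "ts": 0, "tool": "http_request"})
```

The loop terminates because every drop removes a subscriber, so at most one stream.dropped notice is queued per subscriber regardless of how many overflow in the same emit.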

Installation

pip install blueclaw==3.0.0
