v3.0.0

@jztan jztan released this 19 May 12:25
· 1 commit to develop since this release

What's New in v3.0.0

Added

Trace UI: conversation-first observability + live streaming

  • Capture layer. Every turn writes a per-turn events.jsonl alongside
    response.txt / messages.json. Captures tool invocations
    (tool.before / tool.after), model invocations (model.before /
    model.after), message additions (message.added), observation masking
    (context.mask), and lesson injection (lesson.injected). Events carry
    monotonic seq and a schema.version header. ObserverHooks.bus and
    ObservationMaskingManager.bus are public settable attributes (default
    None); adapters use runner.bus_for_turn(observer, capture_path) which
    fans out to every bus-aware component reachable from the observer.
  • Per-turn capture for terminal, HTTP, and Telegram. response.txt and
    messages.json land in
    <workspace>/.blueclaw/conversations/<id>/turns/turn-NNN/. <id> is the
    HTTP conversation_id, the Telegram chat ID, or
    a per-process YYYYMMDD-HHMMSS-xxxx timestamp for terminal. Helpers:
    blueclaw.runner.next_capture_path plus a pure validate_session_id that
    rejects /, \, \x00, whitespace, control chars, ./.., empty, or IDs
    longer than 128 chars. RunTrace gains a capture_path field (relative to
    workspace.root); None for pre-feature traces.
  • Backend conversation API. GET /api/conversations,
    /api/conversations/<cid>, /api/conversations/<cid>/turns/<n>/events
    expose per-cid aggregates and per-turn event streams. Aggregates computed
    at query time from existing traces — no new persistence files.
  • Conversation-first dashboard. New #/conversations and
    #/conversations/<cid> views. Per-turn transcript renders user / tool /
    assistant inline (tool use + tool result fold into a single bordered card
    with full args, full result, and show-more for long output). Deep details
    panel shows a waterfall combining tool steps with model invocation bars
    and a virtualized raw events stream. Flat traces view preserved as a
    secondary tab. Each row also shows an inline preview chip (first line of
    response.txt, ≤200 chars) with a "view full" link;
    GET /api/turns/<cid>/<n>/response and /api/turns/<cid>/<n>/messages
    serve the captured artifacts (404 with expected path + hint when pruned).
  • Live event streaming. blueclaw trace ui --live starts a Unix-socket
    broker at ~/.blueclaw/live.sock. Any blueclaw process started afterward
    detects the socket and forwards every event in real time. The dashboard
    subscribes via SSE at /api/conversations/<cid>/turns/<n>/events/live,
    polls /api/conversations/<cid> every 3 s to detect new turns, and
    reopens the EventSource when turn_count increases. A backfill + dedup-
    by-seq handshake (with reset on a fresh schema.version event) ensures
    no events are missed across turns or reconnects. Off by default; opt in
    with --live.
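
The backfill + dedup-by-seq handshake described in the live streaming item can be sketched on the client side as follows. This is one possible reading of the release note, not blueclaw's dashboard code: the TurnStream class and the event field names type and seq are assumptions made for illustration.

```python
from typing import Any, Dict, List


class TurnStream:
    """Hypothetical client-side buffer that merges backfill and live events."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []
        self.last_seq = -1

    def apply(self, event: Dict[str, Any]) -> bool:
        """Store the event unless its seq was already seen. Returns True if stored."""
        if event.get("type") == "schema.version":
            # A header marks the start of a stream, so rebuild state from scratch.
            self.events.clear()
            self.last_seq = -1
        seq = event["seq"]
        if seq <= self.last_seq:
            return False  # duplicate delivered by both backfill and live feed
        self.events.append(event)
        self.last_seq = seq
        return True
```

Feeding the backfill first and the live feed second drops any overlap between the two, and a header from a new turn clears the buffer so sequence numbers can start over.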

Unified agent runner

  • blueclaw/runner.py exposes runner_session (context manager),
    finalize, finalize_error, and run_turn. runner_session.__exit__
    runs cleanup_mcp_clients unconditionally — adapters can no longer
    forget it.
  • tests/test_no_direct_create_agent.py durability guard: any module
    outside blueclaw/runner.py / blueclaw/session.py matching
    \bcreate_agent\b fails the test. With HTTP migrated,
    ALLOWLIST_PENDING_MIGRATION is empty.
  • tests/test_server.py::TestStreamingWorkspaceErrorCleanup: structural
    regression test asserting that when workspace.write_trace raises mid-
    stream, the SSE error event emits AND cleanup_mcp_clients runs via
    runner_session.__exit__.
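
The guarantee that cleanup runs unconditionally on exit is the standard try/finally pattern behind a context manager. The sketch below borrows the names runner_session and cleanup_mcp_clients from the notes above, but the bodies are stand-ins written for illustration and the real signatures may differ.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List

cleanup_calls: List[Any] = []  # records cleanup invocations for the demo


def cleanup_mcp_clients(agent: Any) -> None:
    """Stand-in for the real cleanup; here it only records that it ran."""
    cleanup_calls.append(agent)


@contextmanager
def runner_session(agent: Any) -> Iterator[Any]:
    """Yield the agent and always clean up, even if the body raises."""
    try:
        yield agent
    finally:
        # Runs on normal exit and on any exception raised inside the with-block.
        cleanup_mcp_clients(agent)
```

Because the cleanup lives in the finally clause, an adapter that uses the with-block cannot skip it, including when an error such as a failed trace write occurs mid-stream.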

Tools

  • http_request: Cloudflare-aware fetch + article extraction. Replaced
    urllib.urlopen with curl_cffi using Chrome 124 TLS impersonation so
    blueclaw can fetch pages behind Cloudflare's bot challenge (Medium,
    Substack, many news sites) that previously returned 403. HTML responses
    run through trafilatura to strip boilerplate and return article title +
    main body — typical reduction ~80k → 2–8k tokens, which lets smaller
    local models (Ollama gemma/qwen) actually read the result.
    SessionConfig.http_extract_main YAML flag (default true) toggles
    extraction. New runtime deps: curl-cffi>=0.7, trafilatura>=1.12.
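
As a rough illustration of the fetch-then-extract flow described above, the sketch below uses curl_cffi's impersonate option and trafilatura.extract. The function names fetch_html and extract_main_text, the timeout, and the fallback to raw HTML are assumptions; this is not the http_request tool's actual code. Imports are deferred so the module loads even where the two third-party packages are not installed.

```python
def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Fetch a page while presenting a Chrome 124 TLS fingerprint."""
    from curl_cffi import requests  # third-party: curl-cffi

    resp = requests.get(url, impersonate="chrome124", timeout=timeout)
    resp.raise_for_status()
    return resp.text


def extract_main_text(html: str, extract_main: bool = True) -> str:
    """Return the article body, or the raw HTML when extraction is disabled."""
    if not extract_main:
        return html
    import trafilatura  # third-party: trafilatura

    text = trafilatura.extract(html)
    # extract() returns None when it finds no main content; fall back to raw HTML.
    return text if text is not None else html
```

The extract_main parameter mirrors the role of the http_extract_main flag: turning it off returns the response untouched.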

Eval / test infrastructure

  • Eval response capture. blueclaw test persists response.txt,
    messages.json, and invocation.json per run to
    ~/blueclaw/test-runs/<invocation-ts>/case-<N>/run-<N>/. Decoupled from
    --keep-workspace — artifacts persist regardless of workspace cleanup.
    TAP formatter appends an artifacts: breadcrumb to failure records and
    prints a final Artifacts: <path> line to stderr. Capture is best-
    effort: write failures log to stderr and are recorded in
    invocation.json:capture_failures but never fail the eval. Override
    the root via BLUECLAW_ARTIFACTS_ROOT or run_spec(..., artifacts_root=).
  • forbidden_output_regex test assertion. Inverse of output_regex
    — fails the test if the regex matches. Lets specs assert on reworded
    refusal phrasings that a single substring would miss.
  • tests/eval/multi_turn_constraints.yaml behavioral regression spec,
    pinned to Sonnet 4.6 (~$1–2 per full run, ~10–15 min, manual run only).
    Scope reduced after triage: single-turn proxies fabricating prior turns
    were rejected by honesty-trained models; tests 2 and 4 rewritten with
    instruction framing. Rule D (api-channel constraint carry-forward) is
    no longer covered by automated tests here — real multi-turn fixtures
    tracked as a follow-up.
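
The relationship between output_regex and forbidden_output_regex can be sketched with the standard re module. The check_output helper and its return shape are hypothetical and only illustrate the inverse semantics; the eval runner's real assertion code is not shown in these notes.

```python
import re
from typing import List, Optional


def check_output(
    output: str,
    output_regex: Optional[str] = None,
    forbidden_output_regex: Optional[str] = None,
) -> List[str]:
    """Return a list of failure messages; an empty list means the case passes."""
    failures: List[str] = []
    if output_regex is not None and not re.search(output_regex, output):
        failures.append(f"output_regex did not match: {output_regex}")
    if forbidden_output_regex is not None and re.search(forbidden_output_regex, output):
        failures.append(f"forbidden_output_regex matched: {forbidden_output_regex}")
    return failures


# One pattern covers several refusal phrasings that a single substring would miss.
REFUSAL = r"(?i)\bI (can't|cannot|am unable to)\b"
```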

System prompt

  • Behavioral rules (tool-knowledge, partial-refusal, correction-
    acknowledgment, cosmetic-compensation). Four new rules in the shared
    **Rules:** block of build_system_prompt (both terminal and api
    channels) targeting failure modes from an external eval: declining
    without trying available tools, silently dropping parts of a request,
    ignoring user corrections, reaching for formatting to mask thin
    substance.
  • api-channel "constraint carry-forward" rule. Replaces "Answer ONLY
    what the user just asked" in the api/Telegram tone block — preserves
    anti-recap intent while requiring the model to carry forward earlier
    turns' constraints, deliverables, and corrections.

Notes

  • Orphan events on mid-turn crash. If a turn crashes after
    events.jsonl is written but before RunTrace finalization completes,
    the events file remains on disk while the trace is missing. No automatic
    cleanup yet; inspect manually with
    find ~/blueclaw -name events.jsonl -newermt <date>. A future release
    will extend blueclaw trace purge --older-than N to cover orphans.

Changed

  • Terminal sessions now carry a conversation_id (timestamp-based per-
    process session ID) on trace and history records. Previously None for
    terminal-sourced runs; downstream tooling that grouped by
    conversation_id should account for the new value.
  • /api/traces summaries include capture_path, plus either
    capture_preview (file exists) or captures_pruned: true (directory
    deleted). Both fields absent when no capture exists.
  • runner.finalize / runner.finalize_error accept an optional
    workspace_root: Path | None kwarg; combined with capture_path it
    stores the relativized path on the trace.
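
The three states of a /api/traces summary (preview present, captures pruned, no capture) can be sketched as below. The capture_fields helper is hypothetical, and it assumes that capture_path names the turn directory relative to the workspace root and that the preview is the first line of response.txt cut to 200 characters.

```python
from pathlib import Path
from typing import Any, Dict, Optional

PREVIEW_LIMIT = 200  # assumed from the "≤200 chars" preview chip


def capture_fields(workspace_root: Path, capture_path: Optional[str]) -> Dict[str, Any]:
    """Build the capture-related fields of a trace summary."""
    fields: Dict[str, Any] = {"capture_path": capture_path}
    if capture_path is None:
        return fields  # pre-feature trace: neither optional field is present
    turn_dir = workspace_root / capture_path
    if not turn_dir.is_dir():
        fields["captures_pruned"] = True
        return fields
    response = turn_dir / "response.txt"
    if response.is_file():
        text = response.read_text(encoding="utf-8")
        first_line = text.splitlines()[0] if text else ""
        fields["capture_preview"] = first_line[:PREVIEW_LIMIT]
    return fields
```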

Security

  • HTTP POST /message and POST /message/stream validate the client-
    supplied conversation_id against path-traversal characters and unsafe
    values. Invalid IDs receive a generic 400 response,
    {"error": "invalid conversation_id"}, that does not echo the rejected
    value (logged server-side only).
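
A pure validator in the spirit of validate_session_id, as used for the conversation_id check above, might look like the sketch below. It is a hypothetical reimplementation based only on the rules listed in these notes: the function name is_safe_session_id, the 128-character cap, and the use of Unicode category Cc for control characters are assumptions.

```python
import unicodedata

MAX_LEN = 128  # assumed cap, based on the "128 chars" rule in the notes


def is_safe_session_id(value: str) -> bool:
    """Return True if value is safe to use as a single directory name."""
    if not value or len(value) > MAX_LEN:
        return False
    if value in (".", ".."):
        return False
    for ch in value:
        # Path separators and any whitespace are rejected outright.
        if ch in "/\\" or ch.isspace():
            return False
        # Category "Cc" covers control characters, including \x00.
        if unicodedata.category(ch) == "Cc":
            return False
    return True
```

Keeping the check pure (no filesystem access, no logging) makes it easy to call from every adapter before a path is built, and the caller stays in charge of returning a generic error that does not echo the rejected value.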

Fixed

  • runner_session.__exit__ enforces cleanup_mcp_clients for any adapter
    that uses the runner — closing a class of bug structurally. The
    BridgeRouter.handle_message (Telegram) cleanup miss was the proof case;
    the Telegram migration to the runner (also in this release) realizes the
    fix for that call site.
  • blueclaw run "..." exits non-zero with an error message when the agent
    raises (previously regressed to silent exit 0 during the terminal
    migration; restored before merge).
  • EventBus.emit no longer lets callers shadow bus-controlled seq / ts
    by including those keys in the event payload — spread order swapped so
    bus fields always win.
  • _drop_subscriber no longer recurses unbounded when multiple subscribers
    overflow during the same emit. Drops are collected during fan-out and
    the synthetic stream.dropped notices fire after the loop completes.
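
The last two fixes can be illustrated together with a toy event bus. Everything here is a simplified stand-in rather than blueclaw's EventBus: the Subscriber class, the bounded queue, the type field, and the shape of the stream.dropped payload are assumptions.

```python
import itertools
import time
from collections import deque
from typing import Any, Deque, Dict, List


class Subscriber:
    """Bounded queue; offer() reports overflow instead of raising."""

    def __init__(self, name: str, maxsize: int) -> None:
        self.name = name
        self.maxsize = maxsize
        self.queue: Deque[Dict[str, Any]] = deque()

    def offer(self, event: Dict[str, Any]) -> bool:
        if len(self.queue) >= self.maxsize:
            return False
        self.queue.append(event)
        return True


class EventBus:
    def __init__(self) -> None:
        self._seq = itertools.count()
        self.subscribers: List[Subscriber] = []

    def emit(self, event_type: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        # Payload is spread first, so the bus-controlled keys written after it win.
        event = {**payload, "type": event_type, "seq": next(self._seq), "ts": time.time()}
        overflowed: List[Subscriber] = []
        for sub in self.subscribers:
            if not sub.offer(event):
                overflowed.append(sub)  # only collect during fan-out
        # Drop and notify after the loop, so the list is never mutated mid-iteration.
        for sub in overflowed:
            self.subscribers.remove(sub)
        if overflowed:
            # Each nested emit removes at least one subscriber, so depth is bounded.
            self.emit("stream.dropped", {"dropped": [s.name for s in overflowed]})
        return event
```

A caller that passes seq or ts in the payload no longer overrides the bus values, and several subscribers overflowing in the same emit produce a single stream.dropped notice once fan-out has finished.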

Installation

pip install blueclaw==3.0.0
