feat(cli): CLI feature parity phase 2 - usage tracking and streaming (AI-assisted)#2352

Draft
rmorse wants to merge 19 commits into openclaw:main from rmorse:feat/cli-feature-parity

Conversation

@rmorse
Contributor

rmorse commented Jan 26, 2026

Summary

This PR continues CLI backend improvements from #1921, adding accurate token usage tracking and real-time streaming support.

Concurrency Hardening

Fixes race conditions in CLI transcript writes:

  • Uses { flag: 'wx' } for atomic file creation (TOCTOU fix)
  • Adds session-level locking for concurrent-safe writes
  • New async API: appendMessageToTranscriptAsync, appendAssistantMessageToTranscriptAsync
  • Tests for partial failure (orphaned user message) and concurrent writes
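
Roughly, the creation path now looks like this - a minimal sketch assuming Node's fs/promises, with a hypothetical helper name (the real implementation lives in session-utils.fs.ts):

import { writeFile } from "node:fs/promises";

// Hypothetical helper illustrating the TOCTOU fix: rather than checking
// for existence and then writing (two racy steps), let the OS enforce
// atomicity - { flag: "wx" } fails with EEXIST if the file already exists.
async function createTranscriptExclusive(path: string, firstLine: string): Promise<boolean> {
  try {
    await writeFile(path, firstLine + "\n", { flag: "wx" });
    return true; // this process created the file
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") {
      return false; // another writer won the race; fall back to append
    }
    throw err;
  }
}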

Token Usage Tracking

⚠️ WIP: Token display is improved but not fully resolved - some edge cases remain

Fixes incorrect token display in UI (showed 558k/200k when actual was ~120k):

  • Adds usage parameter to appendMessageToTranscript functions
  • Passes result.meta.agentMeta?.usage when persisting assistant messages
  • Creates CliSessionManager class with SDK-aligned API for future use
  • Transcript entries now contain actual input/output/cache token counts
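
Roughly, the write path becomes the following - the function name is from this PR, but the wrapper and exact signature here are assumed:

// Sketch only: appendAssistantMessageToTranscriptAsync is this PR's new
// async API; the surrounding names and options shape are assumed.
async function persistAssistantMessage(sessionFilePath: string, message: unknown, result: any) {
  await appendAssistantMessageToTranscriptAsync(sessionFilePath, message, {
    usage: result.meta.agentMeta?.usage, // NormalizedUsage instead of hardcoded zeros
  });
}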

Configurable Usage Fields

Enables per-backend token field parsing:

  • Adds usageFields config to CliBackendConfig
  • Handles different API response formats (Anthropic vs OpenAI field names)
  • Correctly parses cache_creation_input_tokens (previously missed)
  • Maintains backwards compatibility via fallback defaults
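
As an illustration, a minimal sketch of the fallback behaviour - the type and helper names here are assumed, not the PR's actual identifiers:

type UsageFields = {
  input?: string[];
  output?: string[];
  cacheRead?: string[];
  cacheWrite?: string[];
  total?: string[];
};

// Hardcoded defaults used when a backend config omits usageFields,
// preserving behaviour for existing configs.
const DEFAULT_USAGE_FIELDS: Required<UsageFields> = {
  input: ["input_tokens"],
  output: ["output_tokens"],
  cacheRead: ["cache_read_input_tokens"],
  cacheWrite: ["cache_creation_input_tokens"],
  total: ["total_tokens"],
};

// Return the first numeric value found under any of the candidate keys.
function firstNumber(raw: Record<string, unknown>, keys: string[]): number | undefined {
  for (const key of keys) {
    const value = raw[key];
    if (typeof value === "number") return value;
  }
  return undefined;
}

function toUsage(raw: Record<string, unknown>, fields?: UsageFields) {
  return {
    input: firstNumber(raw, fields?.input ?? DEFAULT_USAGE_FIELDS.input),
    output: firstNumber(raw, fields?.output ?? DEFAULT_USAGE_FIELDS.output),
    cacheRead: firstNumber(raw, fields?.cacheRead ?? DEFAULT_USAGE_FIELDS.cacheRead),
    cacheWrite: firstNumber(raw, fields?.cacheWrite ?? DEFAULT_USAGE_FIELDS.cacheWrite),
  };
}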

Streaming NDJSON Support

⚠️ TODO: Non-streaming path (streaming: false) is currently broken - needs fix before merge

Adds real-time output for CLI backends:

  • New cli-runner/streaming.ts module with readline-based NDJSON parsing
  • Emits events as they arrive instead of waiting for full response
  • Config options: streaming?: boolean, streamingEventTypes?: string[]
  • Event mapping for Claude CLI (text, tool_use, result) and Codex CLI (item.*, turn.*)
  • Debug logging throughout pipeline for production diagnostics
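
The parsing loop is roughly this shape - a sketch with the handler signature assumed, not the module's actual API:

import { createInterface } from "node:readline";
import type { Readable } from "node:stream";

// Parse NDJSON from the CLI's stdout line by line, emitting each event
// as soon as its line arrives instead of buffering the whole response.
async function readNdjsonEvents(
  stdout: Readable,
  onEvent: (event: Record<string, unknown>) => void,
): Promise<void> {
  const rl = createInterface({ input: stdout, crlfDelay: Infinity });
  for await (const line of rl) {
    const trimmed = line.trim();
    if (!trimmed) continue; // skip blank lines
    try {
      onEvent(JSON.parse(trimmed) as Record<string, unknown>);
    } catch {
      // Non-JSON noise on stdout: ignore and keep streaming rather than abort.
    }
  }
}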

Reply Directives for Streaming

Matches embedded flow's text processing:

  • Applies parseReplyDirectives to streaming text events
  • Extracts media URLs, cleans directives, computes delta from cleaned text
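
A sketch of the delta computation; parseReplyDirectives is the project's existing function, but its return shape is assumed here:

// Stand-in declaration for the project's parseReplyDirectives
// (actual return shape may differ).
declare function parseReplyDirectives(raw: string): { text: string; mediaUrls: string[] };

// Track the previously emitted cleaned text so each streaming event
// only emits the newly appended portion.
let lastCleanedText = "";

function handleStreamingText(accumulatedRaw: string, emitDelta: (delta: string) => void): void {
  const { text: cleaned } = parseReplyDirectives(accumulatedRaw);
  const delta = cleaned.startsWith(lastCleanedText)
    ? cleaned.slice(lastCleanedText.length)
    : cleaned; // cleaning changed earlier text; re-emit in full
  if (delta) emitDelta(delta);
  lastCleanedText = cleaned;
}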

Test plan

  • Unit tests for CliSessionManager (17 tests)
  • Unit tests for session-utils.fs usage parameter
  • Unit tests for CLI streaming module
  • Unit tests for agent-runner-execution CLI persistence
  • Tested locally with claude-cli backend - tokens display improved
  • Verified streaming events emit in real-time
  • TODO: Fix and test non-streaming path
  • TODO: Verify token display edge cases resolved

AI-assisted

This PR was developed with AI assistance (Claude). The code has been tested locally with a live Clawdbot instance. I understand what all the code does.

rmorse and others added 9 commits January 25, 2026 20:58
- Add resumeArgs to DEFAULT_CLAUDE_BACKEND for proper --resume flag usage
- Fix gateway not preserving cliSessionIds/claudeCliSessionId in nextEntry
- Add test for CLI session ID preservation in gateway agent handler
- Update docs with new resumeArgs default
CLI backends (claude-cli etc) don't emit streaming assistant events,
causing TUI to show "(no output)" despite correct processing. Now emits
assistant event with final text before lifecycle end so server-chat
buffer gets populated for WebSocket clients.
- TOCTOU fix: use { flag: 'wx' } for atomic file creation
- Add session-level locking for concurrent-safe writes
- Add async API: appendMessageToTranscriptAsync, appendAssistantMessageToTranscriptAsync
- Add partial failure test (orphaned user message)
- Add concurrent write test
Adds usageFields config option to CliBackendConfig allowing per-backend
customization of token usage field names. This enables correct parsing
of cache_creation_input_tokens from Anthropic's API (previously missed)
while maintaining backwards compatibility through fallback defaults.

- Add usageFields type to CliBackendConfig
- Add Zod schema validation for usageFields
- Configure default fields for Claude CLI (with Anthropic's actual field names)
- Configure default fields for Codex CLI (OpenAI field names)
- Update toUsage() to use backend config with fallback to hardcoded defaults
Previously, CLI backend responses wrote hardcoded zeros for token usage
in session transcripts (input: 0, output: 0, totalTokens: 0). This caused
the UI to show incorrect token counts and status to fall back to stale
accumulated values.

Changes:
- Add usage parameter to appendMessageToTranscript and related functions
  in session-utils.fs.ts to accept NormalizedUsage from CLI backends
- Pass result.meta.agentMeta?.usage when persisting assistant messages
  in agent-runner-execution.ts
- Create CliSessionManager class with SDK-aligned API for future use:
  static factories (open/create), accessor methods, write locking
- Add comprehensive tests for both session-utils.fs usage parameter
  and the new CliSessionManager class (17 + 2 new tests)

Transcript entries now include actual input/output/cache token counts
from CLI backends like claude-cli and opus.
Adds real-time streaming output support to CLI backends, enabling
line-by-line parsing of NDJSON output instead of waiting for the
full response. This brings CLI backends closer to the embedded/API
flow by emitting events as they arrive.

Key changes:

- New streaming execution module (cli-runner/streaming.ts):
  - Uses readline to parse NDJSON lines as they arrive
  - Extracts session IDs, usage stats, and text from stream
  - Supports event type filtering with prefix matching
  - Maps CLI-specific events to Clawdbot agent events

- Config extension:
  - Added `streaming?: boolean` to enable streaming mode
  - Added `streamingEventTypes?: string[]` to filter events
  - Claude CLI defaults: stream-json format with --verbose flag
  - Codex CLI defaults: streaming enabled with item/turn events

- Event mapping for different CLI formats:
  - Claude CLI: tool_use, tool_result, text, result events
  - Codex CLI: item.*, turn.completed, thread.completed events

- Debug logging throughout the pipeline:
  - Logs raw JSON lines, parsed types, session/usage extraction
  - Logs event emission and mapping decisions
  - Helps diagnose streaming issues in production

The streaming path is enabled by default for Claude CLI and Codex CLI.
Users can disable it by setting `streaming: false` in their config.
Non-streaming path via runCommandWithTimeout remains available.
Match embedded flow's text processing: extract media URLs, clean
directives, compute delta from cleaned text.
openclaw-barnacle bot added the docs, app: web-ui, and gateway labels on Jan 26, 2026
Conflicts resolved (kept ours):
- cli-backends.ts: keep stream-json format for streaming support
- agent-runner-execution.ts: keep transcript persistence + usage tracking
- claude-cli-runner.test.ts: keep streaming mock expectations
Claude CLI loses its system message context when resuming a session
unless the system prompt is explicitly passed on every call. Previously,
we only sent it on the first call (`systemPromptWhen: "first"`), which
caused resumed sessions to lose their system prompt context.

Changes:
- Switch from `--append-system-prompt` to `--system-prompt`: the former
  only appends to an existing system prompt, while the latter completely
  replaces it (per Claude CLI docs). This ensures consistent behavior.
- Change `systemPromptWhen` from "first" to "always" so the system
  prompt is sent on every CLI invocation, including resumes.
- Remove the redundant `!params.useResume` guard in `buildCliArgs()` -
  the `resolveSystemPromptUsage()` function already handles the
  "when to include system prompt" logic via `systemPromptWhen`.
openclaw-barnacle bot added the agents label and removed the docs label on Jan 27, 2026
@rmorse
Contributor Author

rmorse commented Jan 27, 2026

Some things worth discussing / checking:

New backend args

systemPromptArg: "--system-prompt",
usageFields: {
  input: ["input_tokens", "inputTokens"],
  output: ["output_tokens", "outputTokens"],
  cacheRead: ["cache_read_input_tokens", "cached_input_tokens", "cacheRead"],
  cacheWrite: ["cache_creation_input_tokens", "cache_write_input_tokens", "cacheWrite"],
  total: ["total_tokens", "total"],
},
streaming: true,
streamingEventTypes: ["tool_use", "tool_result", "text", "result"],
streamingFormat: {
  text: {
    eventTypes: ["assistant"],
    contentPath: "message.content",
    matchType: "text",
    textField: "text",
  },
  toolUse: {
    eventTypes: ["assistant"],
    contentPath: "message.content",
    matchType: "tool_use",
    idField: "id",
    nameField: "name",
    inputField: "input",
  },
  toolResult: {
    eventTypes: ["user"],
    contentPath: "message.content",
    matchType: "tool_result",
    idField: "tool_use_id",
    outputField: "content",
    isErrorField: "is_error",
  },
},
  • usageFields - defines which fields are used to extract context usage from the response JSON
  • streaming + streamingEventTypes - enable streaming for the backend, and define which entries count as events we want to capture
  • streamingFormat - this might need some work; I tried to figure out a CLI-agnostic way to define how to parse the stream JSON result and extract the relevant data (see the sketch after this list)
  • systemPromptArg - changed Claude's system prompt arg to "--system-prompt", which should replace any existing prompts - but, on my Max sub at least, --append-system-prompt and --system-prompt behave the same: they only append.
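
As a sketch of how such a rule could be interpreted (the rule shape is from the config above; the walker itself is illustrative, not this PR's actual code):

// Walk a dotted path like "message.content" into a parsed event.
function getPath(obj: unknown, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (current, key) =>
      current && typeof current === "object"
        ? (current as Record<string, unknown>)[key]
        : undefined,
    obj,
  );
}

// Apply a text rule: match the event type, walk contentPath, keep
// blocks whose type equals matchType, and read textField from each.
function extractText(
  event: Record<string, unknown>,
  rule: { eventTypes: string[]; contentPath: string; matchType: string; textField: string },
): string[] {
  if (!rule.eventTypes.includes(String(event.type))) return [];
  const content = getPath(event, rule.contentPath);
  if (!Array.isArray(content)) return [];
  return content
    .filter((block) => block && typeof block === "object" && (block as any).type === rule.matchType)
    .map((block) => String((block as any)[rule.textField] ?? ""));
}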

Questions

  1. If the CLI backend wasn't fully implemented, do we need backwards compat?
  2. I made a CliSessionManager class (unused) to roughly match the API surface of the pi SessionManager class - but it's going to need updates across multiple files. Should we go with it, or leave things as they are with our modifications in src\gateway\session-utils.fs.ts?
  3. For streaming, we re-use parseReplyDirectives. The output is a bit inconsistent, but is it better to re-use what we already have?
  4. Token usage for display to the user - I'm having a hard time getting this right. Using the existing calculations, everything comes out way off; adding a custom calculation works (it matches Claude Code's reported context usage). Is this what we want?
    • Are the original calculations for embedded functioning/accurate? It seems they're not (or direct API usage works differently to the CC CLI with a sub, at least).

@rmorse
Contributor Author

rmorse commented Jan 27, 2026

Sorry for the tag @steipete, but I thought I'd better get your eyes on this before doing any more (see "Questions" above).

CLI providers (Claude CLI) report cache_read_input_tokens as the full
cached context for each turn, unlike the embedded/API flow. This change
adds CLI-aware token calculation that uses the correct formula:
cacheRead + cacheWrite + input = total context tokens.

- Add deriveCliContextTokens() for CLI-specific calculation
- Apply CLI detection via isCliProvider() before calculating
- Update session-usage.ts, status.ts, session-utils.fs.ts to use
  CLI-aware calculation
- Add verbose logging for token flow debugging
@rmorse
Contributor Author

rmorse commented Jan 27, 2026

Token Calculation Challenges

While implementing CLI token display, we ran into some semantic ambiguity around what "tokens" should mean in different contexts.

The Problem

Claude CLI returns different token fields with different semantics:

{
  "input_tokens": 2,
  "cache_creation_input_tokens": 3538,
  "cache_read_input_tokens": 52381,
  "output_tokens": 5
}

Meanwhile, the Anthropic API (embedded flow) returns:

{
  "input_tokens": 1200,
  "output_tokens": 340,
  "cache_creation_input_tokens": 200,
  "cache_read_input_tokens": 50,
  "total_tokens": 1790
}

Semantic Differences

Metric                   Formula                                    Purpose
API total_tokens         input + output + cacheRead + cacheWrite   Billing - all tokens consumed
"Context" for display    input + cacheRead + cacheWrite            What's in the context window (excludes output)

Current Implementation

For CLI providers, we now calculate context as cacheRead + cacheWrite + input (what's in the context window this turn).

For embedded/API, we use the same formula, but the API also provides total_tokens, which includes output.
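
In code, the CLI-side formula is essentially the following sketch - deriveCliContextTokens is named in the commits, but the field names here are assumed from NormalizedUsage:

interface NormalizedUsage {
  input?: number;
  output?: number;
  cacheRead?: number;
  cacheWrite?: number;
  totalTokens?: number;
}

// Context-window occupancy for this turn: everything the model read
// (cached or fresh input), excluding what it generated.
function deriveCliContextTokens(usage: NormalizedUsage): number {
  return (usage.cacheRead ?? 0) + (usage.cacheWrite ?? 0) + (usage.input ?? 0);
}

// With the Claude CLI sample above: 52381 + 3538 + 2 = 55921 context
// tokens, even though input_tokens alone reads as just 2.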

Questions

  1. What should the UI's "tokens" display represent?

    • Context tokens (what's in the window) = input + cacheRead + cacheWrite
    • Billing tokens (all consumed) = input + output + cacheRead + cacheWrite
  2. Should we add a separate contextTokens field to NormalizedUsage to distinguish from billing total?

  3. Are there other consumers of totalTokens that expect specific semantics?

The current fix makes CLI and embedded show similar context-based values, but I wanted to flag this architectural question for review.

The extra system prompt was unconditionally appending "Tools are
disabled in this session. Do not call tools." which prevented all
CLI agents from using tools they had available.
@sebslight
Member

Closing: This PR is marked as WIP/Draft and has been open without completion. Please reopen when the work is ready for review.

sebslight closed this Jan 28, 2026
# Conflicts:
#	src/agents/cli-runner.ts
#	src/auto-reply/status.ts
@steipete
Contributor

steipete commented Feb 2, 2026

@sebslight do not close good PRs!

steipete reopened this Feb 2, 2026
# Conflicts:
#	src/agents/usage.ts
#	src/auto-reply/reply/session-usage.ts
#	src/auto-reply/status.ts
#	src/gateway/session-utils.fs.test.ts
Resolve conflicts favoring main's lastCallUsage-based context tracking
and updated resolveSessionFilePath API, while preserving HEAD's
logVerbose instrumentation.