
Aggressive history clipping for long task runs #437

@lmorchard

Current state

Pilo's chat history (this.messages: ModelMessage[] in webAgent.ts) grows monotonically over a task run. Sources of appends:

| Source | What's appended | Frequency |
| --- | --- | --- |
| initializeSystemPromptAndTask (webAgent.ts:1656-1671) | System + task+plan user message | Once at start, once on browser reconnect |
| addPageSnapshot (webAgent.ts:779-868) | Snapshot user message (text + optional image) | Per iteration when needsPageSnapshot |
| generateAndProcessAction (webAgent.ts:985-989) | All messages from aiResponse.response.messages (assistant turn + tool calls + tool results) | Per iteration |
| checkAndHandleRepeatedAction (webAgent.ts:1149-1150) | Repetition warning user message | On warning threshold |
| addErrorFeedback (webAgent.ts:682-718) | Step error feedback user message | On non-recoverable-tool error |
| validateTaskCompletion (webAgent.ts:1281-1289) | Validation feedback user message | On validation rejection |

The only trimming is truncateOldExternalContent (webAgent.ts:743-774), which runs before each new snapshot push and clips the body of prior <EXTERNAL-CONTENT> blocks:

const clipExternalContent = (text: string): string =>
  text.replace(
    /(<EXTERNAL-CONTENT[\s\S]*?>)\n[\s\S]*?\n(<\/EXTERNAL-CONTENT>)/g,
    "$1\n> [clipped for brevity]\n$2",
  );

It also replaces prior image content parts with { type: "text", text: "[screenshot clipped for brevity]" }.
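As a quick illustration of what the regex does, here it is applied to a made-up snapshot string (the `source` attribute and the page text are invented for the example). It also checks that a second pass leaves the text unchanged, which matters because the clipping runs before every snapshot push:

```typescript
// Same regex as above, applied to an invented snapshot message.
const clipExternalContent = (text: string): string =>
  text.replace(
    /(<EXTERNAL-CONTENT[\s\S]*?>)\n[\s\S]*?\n(<\/EXTERNAL-CONTENT>)/g,
    "$1\n> [clipped for brevity]\n$2",
  );

const snapshot = [
  "Current page:",
  '<EXTERNAL-CONTENT source="page">',
  "heading: Checkout",
  "button [12]: Place order",
  "</EXTERNAL-CONTENT>",
].join("\n");

// The opening and closing tags survive; only the body is replaced.
const clippedOnce = clipExternalContent(snapshot);
// Clipping an already-clipped message is a no-op.
const clippedTwice = clipExternalContent(clippedOnce);

console.log(clippedOnce);
console.log(clippedOnce === clippedTwice);
```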

The gap

Tool-call assistant messages and tool-result messages are never trimmed. Validation feedback messages, repeat warnings, and error feedback messages are never trimmed.

On a 50-iteration run with multiple validation failures and a few errors, the conversation can accumulate:

  • 50 snapshot user messages (clipped down to small markers, fine)
  • 50 assistant tool-call messages (each ~50-200 tokens; not clipped)
  • 50 tool result messages (each ~20-100 tokens; not clipped)
  • Up to ~10 validation feedback user messages (each ~100-200 tokens; not clipped)
  • Up to ~10 error feedback user messages (each ~100-300 tokens; not clipped)
  • Up to ~5 repetition warning messages (~50 tokens; not clipped)

Rough math: a worst-case 50-iteration run can grow to 15-25k tokens of non-snapshot history that the model re-reads every iteration. This is fine for frontier-class context windows but:

  1. Pure inefficiency — most of those old tool-call args are not relevant to the current step.
  2. On smaller-context models, this can squeeze out room for the current snapshot.
  3. The validator (validateTaskCompletion) builds a messages.slice(-30) view that is partly dominated by these accumulated assistant turns.
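The rough math can be reproduced by summing the per-message estimates from the list above. This is only a sanity check of the arithmetic, using the upper bounds ("up to ~10") as counts. The clipped snapshot markers are left out because the list gives no size for them. The high end lands at about 20k tokens, consistent with the 15-25k worst-case figure:

```typescript
// Per-message token estimates copied from the list above.
const estimates: Array<[label: string, count: number, low: number, high: number]> = [
  ["assistant tool-call messages", 50, 50, 200],
  ["tool result messages", 50, 20, 100],
  ["validation feedback", 10, 100, 200],
  ["error feedback", 10, 100, 300],
  ["repetition warnings", 5, 50, 50],
];

// Sum count * per-message size for the optimistic and pessimistic ends.
const totalLow = estimates.reduce((sum, [, count, low]) => sum + count * low, 0);
const totalHigh = estimates.reduce((sum, [, count, , high]) => sum + count * high, 0);

console.log(totalLow, totalHigh); // 5750 20250
```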

When the AI SDK eventually fails with a context-window error, it surfaces as a regular generation error, and the run burns through maxConsecutiveErrors before the task fails. There is no early warning.

The gap (continued)

There's also a related smaller issue: stale validation feedback messages accumulate. If the agent gets rejected on attempt 1, then again on attempt 2, both feedback messages persist into attempt 3. The most-recent feedback is the most relevant; older feedback is mostly noise.

Proposed scope

Extend truncateOldExternalContent (or rename to trimOldHistory) to also handle:

A. Clip old assistant tool-call message content

For assistant messages older than the last K (default K=3):

  • Keep role: "assistant" and the tool call structure (so the AI SDK's tool-result pairing still works).
  • Strip any text content (the model's intermediate reasoning if surfaced).
  • Replace tool call args with a placeholder object that preserves the tool name but blanks the value: { toolName: "fill", args: { /* clipped */ } }.

Tradeoff: if a tool result references something specific (e.g., the model reading back what it filled), aggressive clipping breaks recall. K=3 (or K=5 conservative) preserves enough recent context that the model can self-reference recent actions.
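A minimal sketch of what (A) could look like. The message types are simplified local stand-ins, not the AI SDK's ModelMessage. The function name is hypothetical. The `args` field follows the wording above and may be named differently in the installed SDK version. The placeholder for text-only assistant messages is also an assumption, added so that no message ends up with empty content:

```typescript
// Simplified local stand-ins for the AI SDK message types (assumption: the
// real parts carry more fields, and the argument field may be named
// differently, e.g. `input`, depending on the SDK version).
type TextPart = { type: "text"; text: string };
type ToolCallPart = { type: "tool-call"; toolCallId: string; toolName: string; args: unknown };
type AssistantTurn = { role: "assistant"; content: Array<TextPart | ToolCallPart> };
type OtherEntry = { role: "system" | "user" | "tool"; content: unknown };
type HistoryEntry = AssistantTurn | OtherEntry;

// Hypothetical helper: clips every assistant message except the newest `keepLast`.
function clipOldAssistantMessages(messages: HistoryEntry[], keepLast = 3): HistoryEntry[] {
  // Never clip the most recent assistant message, even if keepLast is 0.
  const keep = Math.max(1, keepLast);
  const assistantIdx: number[] = [];
  messages.forEach((m, i) => {
    if (m.role === "assistant") assistantIdx.push(i);
  });
  const protectedIdx = new Set(assistantIdx.slice(-keep));

  return messages.map((m, i): HistoryEntry => {
    if (m.role !== "assistant" || protectedIdx.has(i)) return m;
    // Drop text parts; keep tool calls (and their toolCallId) with blank args.
    const calls: Array<TextPart | ToolCallPart> = m.content
      .filter((p): p is ToolCallPart => p.type === "tool-call")
      .map((p) => ({ ...p, args: {} }));
    // Assumption: an empty content array is best avoided, so leave a marker.
    const content =
      calls.length > 0
        ? calls
        : [{ type: "text" as const, text: "[assistant text clipped for brevity]" }];
    return { ...m, content };
  });
}
```

The function returns a new array and leaves protected messages untouched by reference, so it can be run before every LLM call without re-clipping cost beyond the map.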

B. Aggregate old error/feedback messages into a single summary placeholder

Walk back from the end of messages. Find consecutive runs of validation-feedback and error-feedback user messages. Replace each run with a single placeholder:

[3 earlier feedback messages clipped: 2 validation rejections, 1 step error]

Only the most recent feedback message of each kind stays full text.
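One way (B) might be implemented, as a sketch. How feedback messages are recognized is left open here: the `classify` callback is a hypothetical hook that the caller supplies (for example by matching a known prefix), and the message type is a simplified stand-in with string content only:

```typescript
type ChatMsg = { role: "system" | "user" | "assistant" | "tool"; content: string };
type FeedbackKind = "validation" | "error";
// Hypothetical hook: returns the feedback kind, or null for any other message.
type Classify = (m: ChatMsg) => FeedbackKind | null;

function collapseOldFeedback(messages: ChatMsg[], classify: Classify): ChatMsg[] {
  const kinds = messages.map(classify);

  // The newest message of each kind stays verbatim.
  const keep = new Set<number>();
  for (const kind of ["validation", "error"] as const) {
    const idx = kinds.lastIndexOf(kind);
    if (idx !== -1) keep.add(idx);
  }

  const out: ChatMsg[] = [];
  let run: FeedbackKind[] = [];

  // Emit one placeholder for the run collected so far, if any.
  const flush = (): void => {
    if (run.length === 0) return;
    const v = run.filter((k) => k === "validation").length;
    const e = run.length - v;
    const parts = [
      v > 0 ? `${v} validation rejection${v === 1 ? "" : "s"}` : "",
      e > 0 ? `${e} step error${e === 1 ? "" : "s"}` : "",
    ].filter((s) => s !== "");
    const plural = run.length === 1 ? "" : "s";
    out.push({
      role: "user",
      content: `[${run.length} earlier feedback message${plural} clipped: ${parts.join(", ")}]`,
    });
    run = [];
  };

  messages.forEach((m, i) => {
    const kind = kinds[i] ?? null;
    if (kind !== null && !keep.has(i)) {
      run.push(kind); // old feedback: fold into the current run
      return;
    }
    flush(); // any other message ends the run
    out.push(m);
  });
  flush();
  return out;
}
```

Feedback messages are user messages, so removing them does not affect tool-call pairing. If `classify` does not recognize the placeholder itself, placeholders from earlier passes are left alone rather than merged.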

C. Tool-result message clipping

For tool-result messages older than K, replace the content with [tool result clipped for brevity]. Keep the message role/structure intact so AI SDK pairing works.
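A sketch of (C) under the same caveats: the types are simplified local stand-ins, the helper name is hypothetical, and the payload field is called `result` here even though the real tool-result part may name or structure it differently depending on the SDK version. "Older than K" is interpreted as "not among the last K tool messages":

```typescript
// Simplified stand-ins; the real tool-result part may use a different
// payload field than `result` (assumption).
type ToolResultPart = { type: "tool-result"; toolCallId: string; toolName: string; result: unknown };
type ToolEntry = { role: "tool"; content: ToolResultPart[] };
type NonToolEntry = { role: "system" | "user" | "assistant"; content: unknown };
type LoggedMessage = ToolEntry | NonToolEntry;

const CLIPPED_RESULT = "[tool result clipped for brevity]";

function clipOldToolResults(messages: LoggedMessage[], keepLast = 3): LoggedMessage[] {
  let remaining = Math.max(0, keepLast);
  const out = [...messages];
  // Walk backwards so the newest `keepLast` tool messages are skipped first.
  for (let i = out.length - 1; i >= 0; i--) {
    const m = out[i];
    if (m === undefined || m.role !== "tool") continue;
    if (remaining > 0) {
      remaining--;
      continue;
    }
    // Keep role, toolCallId and toolName; replace only the payload.
    out[i] = { ...m, content: m.content.map((p) => ({ ...p, result: CLIPPED_RESULT })) };
  }
  return out;
}
```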

D. Surface a token-budget metric

Add an event:

SYSTEM_DEBUG_HISTORY_SIZE: {
  iterationId: string;
  estimatedTokens: number;  // rough sum of content text lengths / 4
  messageCount: number;
}

Emit once per iteration before the LLM call. Telemetry consumers (eval-judge, logs) can spot tasks approaching the context window.
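A rough sketch of how the payload could be computed. The helper names and the simplified message shape are assumptions. Text parts use the chars / 4 heuristic from the event comment. Counting tool-call and tool-result parts via their JSON length, and skipping image/file parts, are choices made for this sketch rather than part of the proposal:

```typescript
type HistorySizeEvent = { iterationId: string; estimatedTokens: number; messageCount: number };
// Simplified stand-in for a content part (assumption).
type SizedPart = { type: string; text?: string; [key: string]: unknown };
type SizedMessage = { role: string; content: string | SizedPart[] };

function partLength(p: SizedPart): number {
  if (typeof p.text === "string") return p.text.length;
  // Binary payloads are not text; they are ignored by this estimate.
  if (p.type === "image" || p.type === "file") return 0;
  // Tool calls and tool results: approximate by serialized size.
  return JSON.stringify(p).length;
}

function measureHistory(iterationId: string, messages: SizedMessage[]): HistorySizeEvent {
  const chars = messages.reduce((total, m) => {
    if (typeof m.content === "string") return total + m.content.length;
    return total + m.content.reduce((n, p) => n + partLength(p), 0);
  }, 0);
  return {
    iterationId,
    estimatedTokens: Math.ceil(chars / 4), // rough: about 4 characters per token
    messageCount: messages.length,
  };
}
```

The result would be passed to whatever emitter already carries the other events, once per iteration before the LLM call.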

E. Optional: hard cap on history age

If the agent runs for more than K iterations (a separate, larger K than the clipping window above; 10 in the list below), drop intermediate snapshots entirely (not just clip the body). Keep:

  • The original system + task+plan messages
  • A summary placeholder ("[N earlier iterations summarized]" — content TBD; could be just the count)
  • The last K=10 iterations in full

Defer this to a follow-up issue (the LLM-based history compaction one) if the simpler clipping handles most cases.
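For reference, the hard cap could be as small as the sketch below. It assumes the history is already available as a head (system + task+plan messages) plus one message group per iteration, which is not how this.messages is stored today. The function name and that grouping are hypothetical. Dropping whole iterations removes each tool call together with its result, so pairing is preserved as long as a call and its result always land in the same group:

```typescript
type CapMsg = { role: "system" | "user" | "assistant" | "tool"; content: unknown };

function capHistory(head: CapMsg[], iterations: CapMsg[][], keepLast = 10): CapMsg[] {
  const keep = Math.max(1, keepLast);
  const dropped = Math.max(0, iterations.length - keep);
  // Placeholder content is TBD in the proposal; here it is just the count.
  const summary: CapMsg[] =
    dropped > 0 ? [{ role: "user", content: `[${dropped} earlier iterations summarized]` }] : [];
  // Flatten the surviving iteration groups back into one message list.
  const recent = iterations
    .slice(dropped)
    .reduce<CapMsg[]>((acc, group) => acc.concat(group), []);
  return [...head, ...summary, ...recent];
}
```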

Implementation notes

  • AI SDK tool-call / tool-result pairing is by toolCallId. Don't break the pairing — clipping the content is fine; deleting the message is not.
  • Don't clip the most-recent assistant tool-call message — the model needs to see its own immediately-prior action.
  • Test with a long-running task (forced to 50 iterations) and verify that the trimming actually keeps the token count bounded. The existing test suite doesn't exercise long histories, so nothing else would catch a regression here.
  • The existing <EXTERNAL-CONTENT> regex clipping should stay — the new clipping covers a different category of messages.
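To guard the pairing constraint in tests, a small check along these lines could be run on the history before and after trimming. The helper name and the simplified part shape are assumptions; it only relies on parts having a `type` and a `toolCallId`:

```typescript
// Simplified stand-in: only the fields needed for the pairing check.
type PairPart = { type: string; toolCallId?: string };
type PairMsg = { role: string; content: string | PairPart[] };

// Returns every toolCallId that appears as a call without a result, or as a
// result without a call. An empty array means the pairing is intact.
function findUnpairedToolCallIds(messages: PairMsg[]): string[] {
  const calls = new Set<string>();
  const results = new Set<string>();
  for (const m of messages) {
    if (typeof m.content === "string") continue;
    for (const p of m.content) {
      if (p.toolCallId === undefined) continue;
      if (p.type === "tool-call") calls.add(p.toolCallId);
      if (p.type === "tool-result") results.add(p.toolCallId);
    }
  }
  const unpaired: string[] = [];
  calls.forEach((id) => {
    if (!results.has(id)) unpaired.push(id);
  });
  results.forEach((id) => {
    if (!calls.has(id)) unpaired.push(id);
  });
  return unpaired;
}
```

A trimming test can then assert that the function returns the same result for the trimmed history as for the original one.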

Acceptance criteria

  • Assistant messages older than K iterations have their tool args clipped.
  • Tool-result messages older than K iterations are placeholders.
  • Consecutive runs of old feedback/error messages collapse into single summary placeholders.
  • SYSTEM_DEBUG_HISTORY_SIZE event fires per iteration with estimated token count and message count.
  • A 50-iteration test scenario shows that the history token count plateaus rather than growing linearly.
  • Existing tests still pass; new tests cover the clipping behaviors.

Effort estimate

1-2 days. Most complexity is in preserving tool-call/tool-result pairing while clipping content.

Related issues

This is the "Tier 2" / simpler step. A "Tier 3" follow-up (separate issue) covers LLM-based summarization for tasks where even this aggressive clipping isn't enough.

Files likely affected

  • packages/core/src/webAgent.ts (truncateOldExternalContent and friends)
  • packages/core/src/events.ts (new event type)
  • packages/core/test/webAgent.test.ts (snapshot truncation describe block)

Metadata

Labels: enhancement
