Aggressive history clipping for long task runs #437

## Current state

Pilo's chat history (`this.messages: ModelMessage[]` in `webAgent.ts`) grows monotonically over a task run. Sources of appends:

| Source | What's appended | Frequency |
|---|---|---|
| `initializeSystemPromptAndTask` (`webAgent.ts:1656-1671`) | System + task+plan user message | Once at start, once on browser reconnect |
| `addPageSnapshot` (`webAgent.ts:779-868`) | Snapshot user message (text + optional image) | Per iteration when `needsPageSnapshot` |
| `generateAndProcessAction` (`webAgent.ts:985-989`) | All messages from `aiResponse.response.messages` (assistant turn + tool calls + tool results) | Per iteration |
| `checkAndHandleRepeatedAction` (`webAgent.ts:1149-1150`) | Repetition warning user message | On warning threshold |
| `addErrorFeedback` (`webAgent.ts:682-718`) | Step error feedback user message | On non-recoverable-tool error |
| `validateTaskCompletion` (`webAgent.ts:1281-1289`) | Validation feedback user message | On validation rejection |

The **only trimming** is `truncateOldExternalContent` (`webAgent.ts:743-774`), which runs before each new snapshot push and clips the body of *prior* `<EXTERNAL-CONTENT>` blocks:

```ts
const clipExternalContent = (text: string): string =>
  text.replace(
    /(<EXTERNAL-CONTENT[\s\S]*?>)\n[\s\S]*?\n(<\/EXTERNAL-CONTENT>)/g,
    "$1\n> [clipped for brevity]\n$2",
  );
```

It also replaces prior `image` content parts with `{ type: "text", text: "[screenshot clipped for brevity]" }`.

## The gap

**Tool-call assistant messages and tool-result messages are never trimmed.** Neither are validation feedback messages, repetition warnings, or error feedback messages.

On a 50-iteration run with multiple validation failures and a few errors, the conversation can accumulate:

- 50 snapshot user messages (clipped down to small markers, fine)
- 50 assistant tool-call messages (each ~50-200 tokens; **not clipped**)
- 50 tool result messages (each ~20-100 tokens; **not clipped**)
- Up to ~10 validation feedback user messages (each ~100-200 tokens; **not clipped**)
- Up to ~10 error feedback user messages (each ~100-300 tokens; **not clipped**)
- Up to ~5 repetition warning messages (~50 tokens; **not clipped**)

Rough math: summing the upper estimates above, a worst-case 50-iteration run can grow to about 20k tokens of *non-snapshot* history that the model re-reads every iteration. This is fine for frontier-class context windows but:

1. Pure inefficiency — most of those old tool-call args are not relevant to the current step.
2. On smaller-context models, this can squeeze out room for the current snapshot.
3. The validator (`validateTaskCompletion`) builds a `messages.slice(-30)` view, and these accumulated assistant turns can take up much of that window.
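The rough math can be reproduced from the per-message estimates in the list above. This is only a sketch: the `Estimate` shape and `totalTokens` helper are hypothetical names, the counts and token ranges are copied from that list, and per-message framing overhead is not counted.

```typescript
// Per-message estimates copied from the list above (snapshot messages excluded).
interface Estimate {
  label: string;
  count: number;
  minTokens: number;
  maxTokens: number;
}

const estimates: Estimate[] = [
  { label: "assistant tool-call messages", count: 50, minTokens: 50, maxTokens: 200 },
  { label: "tool result messages", count: 50, minTokens: 20, maxTokens: 100 },
  { label: "validation feedback messages", count: 10, minTokens: 100, maxTokens: 200 },
  { label: "error feedback messages", count: 10, minTokens: 100, maxTokens: 300 },
  { label: "repetition warnings", count: 5, minTokens: 50, maxTokens: 50 },
];

// Sum count * tokens for the low and high end of each range.
function totalTokens(rows: Estimate[]): { low: number; high: number } {
  return rows.reduce(
    (acc, r) => ({
      low: acc.low + r.count * r.minTokens,
      high: acc.high + r.count * r.maxTokens,
    }),
    { low: 0, high: 0 },
  );
}

const total = totalTokens(estimates);
console.log(total); // { low: 5750, high: 20250 }
```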

When the AI SDK eventually fails with a context-window error, it surfaces as a regular generation error, and the task only fails once those errors have cycled through `maxConsecutiveErrors`. There is no early warning.

## The gap (continued)

There's also a related smaller issue: stale validation feedback messages accumulate. If the agent gets rejected on attempt 1, then again on attempt 2, both feedback messages persist into attempt 3. The most-recent feedback is the most relevant; older feedback is mostly noise.

## Proposed scope

Extend `truncateOldExternalContent` (or rename to `trimOldHistory`) to also handle:

### A. Clip old assistant tool-call message content

For assistant messages older than the last K (default K=3):

- Keep `role: "assistant"` and the tool call structure (so the AI SDK's tool-result pairing still works).
- Strip any `text` content (the model's intermediate reasoning if surfaced).
- Replace tool call `args` with a placeholder object that preserves the tool name but blanks the value: `{ toolName: "fill", args: { /* clipped */ } }`.

Tradeoff: if a later step depends on something specific from an old call (e.g., the model reading back what it filled), aggressive clipping breaks recall. K=3 (or a more conservative K=5) preserves enough recent context that the model can still refer back to its recent actions.
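A minimal sketch of step A, under stated assumptions: the message and part types below are simplified local stand-ins for the AI SDK's `ModelMessage` shapes, `clipOldAssistantMessages` is a hypothetical helper name, and the `args` field follows the wording of this issue (depending on the installed AI SDK version the tool-call part may name this field differently, e.g. `input`, so check before adapting).

```typescript
// Simplified stand-ins for the AI SDK message shapes (assumption: the real
// ModelMessage type has more variants and fields than shown here).
type TextPart = { type: "text"; text: string };
type ToolCallPart = {
  type: "tool-call";
  toolCallId: string;
  toolName: string;
  args: Record<string, unknown>;
};
type AssistantMessage = {
  role: "assistant";
  content: string | Array<TextPart | ToolCallPart>;
};
type OtherMessage = { role: "system" | "user" | "tool"; content: unknown };
type Message = AssistantMessage | OtherMessage;

const CLIPPED_TEXT = "[assistant text clipped for brevity]";

// Hypothetical helper: clip every assistant message except the last `keep`.
// Returns a new array; the input messages are not mutated.
function clipOldAssistantMessages(messages: Message[], keep = 3): Message[] {
  // Walk back from the end to find the index of the keep-th most recent
  // assistant message. Everything at or after `cutoff` is left untouched.
  let remaining = keep;
  let cutoff = messages.length;
  for (let i = messages.length - 1; i >= 0 && remaining > 0; i--) {
    if (messages[i]?.role === "assistant") {
      remaining--;
      cutoff = i;
    }
  }

  return messages.map((m, i): Message => {
    if (m.role !== "assistant" || i >= cutoff) return m;
    if (typeof m.content === "string") {
      return { role: "assistant", content: CLIPPED_TEXT };
    }
    // Drop text parts; keep tool calls (id + name) so result pairing survives.
    const calls = m.content
      .filter((p): p is ToolCallPart => p.type === "tool-call")
      .map((p): ToolCallPart => ({ ...p, args: {} }));
    // Avoid emitting an assistant message with empty content.
    const content: Array<TextPart | ToolCallPart> =
      calls.length > 0 ? calls : [{ type: "text", text: CLIPPED_TEXT }];
    return { role: "assistant", content };
  });
}
```

Because `toolCallId` and `toolName` are copied through unchanged, a later tool-result message can still be matched to its call. Whether a provider accepts an empty `args` object for a tool with required parameters is not verified here and would need checking.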

### B. Aggregate old error/feedback messages into a single summary placeholder

Walk back from the end of `messages`. Find consecutive runs of validation-feedback and error-feedback user messages. Replace each run with a single placeholder:

```
[3 earlier feedback messages clipped: 2 validation rejections, 1 step error]
```

Only the most recent feedback message of each kind stays in full.
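One way step B could look, as a sketch only. How feedback messages are recognized is not specified in this issue, so the sketch takes a caller-supplied `kindOf` classifier; `collapseOldFeedback`, `FeedbackKind`, and the simplified `Message` type are all hypothetical names.

```typescript
// Simplified stand-in for the AI SDK message type (assumption).
type Message = { role: "system" | "user" | "assistant" | "tool"; content: unknown };
type FeedbackKind = "validation" | "error";

const plural = (n: number, noun: string): string => `${n} ${noun}${n === 1 ? "" : "s"}`;

// Hypothetical helper: collapse each consecutive run of old feedback messages
// into one placeholder, keeping the most recent message of each kind in full.
function collapseOldFeedback(
  messages: Message[],
  kindOf: (m: Message) => FeedbackKind | null,
): Message[] {
  const kinds = messages.map(kindOf);
  const keepValidation = kinds.lastIndexOf("validation");
  const keepError = kinds.lastIndexOf("error");

  const out: Message[] = [];
  let run = { validation: 0, error: 0 };

  // Emit a placeholder for the run collected so far, if any.
  const flush = (): void => {
    const total = run.validation + run.error;
    if (total === 0) return;
    const details: string[] = [];
    if (run.validation > 0) details.push(plural(run.validation, "validation rejection"));
    if (run.error > 0) details.push(plural(run.error, "step error"));
    out.push({
      role: "user",
      content: `[${plural(total, "earlier feedback message")} clipped: ${details.join(", ")}]`,
    });
    run = { validation: 0, error: 0 };
  };

  messages.forEach((m, i) => {
    const kind = kinds[i];
    const isOldFeedback =
      (kind === "validation" || kind === "error") && i !== keepValidation && i !== keepError;
    if (isOldFeedback && (kind === "validation" || kind === "error")) {
      run[kind] += 1; // extend the current run
    } else {
      flush(); // any other message ends the run
      out.push(m);
    }
  });
  flush();
  return out;
}
```

Unlike steps A and C, this removes messages outright, which is safe only because feedback messages are plain user messages with no `toolCallId` pairing.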

### C. Tool-result message clipping

For tool-result messages older than K, replace the content with `[tool result clipped for brevity]`. Keep the message role/structure intact so AI SDK pairing works.
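A sketch of step C under the same caveats as step A: the types are simplified local stand-ins for the AI SDK shapes, `clipOldToolResults` is a hypothetical helper name, and the `result` field name is an assumption that may differ between AI SDK versions.

```typescript
// Simplified stand-ins for the AI SDK message shapes (assumption).
type ToolResultPart = {
  type: "tool-result";
  toolCallId: string;
  toolName: string;
  result: unknown;
};
type ToolMessage = { role: "tool"; content: ToolResultPart[] };
type NonToolMessage = { role: "system" | "user" | "assistant"; content: unknown };
type Message = ToolMessage | NonToolMessage;

const TOOL_RESULT_PLACEHOLDER = "[tool result clipped for brevity]";

// Hypothetical helper: replace the payload of every tool-result message
// except the last `keep`, leaving role, toolCallId and toolName intact.
function clipOldToolResults(messages: Message[], keep = 3): Message[] {
  // Find the index of the keep-th most recent tool message.
  let remaining = keep;
  let cutoff = messages.length;
  for (let i = messages.length - 1; i >= 0 && remaining > 0; i--) {
    if (messages[i]?.role === "tool") {
      remaining--;
      cutoff = i;
    }
  }

  return messages.map((m, i): Message => {
    if (m.role !== "tool" || i >= cutoff) return m;
    return {
      role: "tool",
      content: m.content.map(
        (p): ToolResultPart => ({ ...p, result: TOOL_RESULT_PLACEHOLDER }),
      ),
    };
  });
}
```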

### D. Surface a token-budget metric

Add an event:

```ts
SYSTEM_DEBUG_HISTORY_SIZE: {
  iterationId: string;
  estimatedTokens: number;  // rough sum of content text lengths / 4
  messageCount: number;
}
```

Emit once per iteration before the LLM call. Telemetry consumers (eval-judge, logs) can spot tasks approaching the context window.
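A sketch of how the payload could be computed with the length/4 heuristic from the comment above. `HistorySizePayload`, `estimateHistorySize`, and the simplified message types are hypothetical; counting non-text parts by their JSON length and skipping image parts are assumptions of this sketch, not something the issue specifies. How the event is actually emitted is left to the existing event plumbing.

```typescript
// Simplified stand-ins for the AI SDK message shapes (assumption).
type Part = { type: string; text?: string; [key: string]: unknown };
type Message = { role: string; content: string | Part[] };

// Payload shape taken from the event definition above.
interface HistorySizePayload {
  iterationId: string;
  estimatedTokens: number;
  messageCount: number;
}

// Character count for one message's content.
function contentLength(content: string | Part[]): number {
  if (typeof content === "string") return content.length;
  return content.reduce((sum, part) => {
    if (part.type === "image") return sum; // assumption: image payloads are not counted
    if (typeof part.text === "string") return sum + part.text.length;
    // Tool calls and results have no text field; approximate by JSON length.
    return sum + JSON.stringify(part).length;
  }, 0);
}

// Hypothetical helper: rough token estimate at ~4 characters per token.
function estimateHistorySize(iterationId: string, messages: Message[]): HistorySizePayload {
  const chars = messages.reduce((sum, m) => sum + contentLength(m.content), 0);
  return {
    iterationId,
    estimatedTokens: Math.ceil(chars / 4),
    messageCount: messages.length,
  };
}
```

The estimate is deliberately crude. It is meant for spotting growth trends across iterations, not for enforcing a provider's exact context limit.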

### E. Optional: hard cap on history age

If the agent runs for more than a set number of iterations (a separate, larger threshold than the K used in A and C; e.g. 10), drop intermediate snapshots entirely (not just clip the body). Keep:

- The original system + task+plan messages
- A summary placeholder ("[N earlier iterations summarized]"; content TBD, could be just the count)
- The last 10 iterations in full

Defer this to a follow-up issue (the LLM-based history compaction one) if the simpler clipping handles most cases.

## Implementation notes

- AI SDK tool-call / tool-result pairing is by `toolCallId`. Don't break the pairing — clipping the content is fine; deleting the message is not.
- Don't clip the most-recent assistant tool-call message — the model needs to see its own immediately-prior action.
- Test with a long-running task (forced to 50 iterations) and verify that the trimming actually keeps the token count bounded. The existing test suite doesn't exercise long histories, so this needs a dedicated test.
- The existing `<EXTERNAL-CONTENT>` regex clipping should stay — the new clipping covers a different category of messages.
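
The pairing constraint in the notes above can be checked mechanically in tests. The following is a sketch with hypothetical names (`unpairedToolCallIds`, simplified `Message`/`Part` types); it only inspects `type` and `toolCallId` and assumes nothing else about the AI SDK shapes.

```typescript
// Simplified stand-ins for the AI SDK message shapes (assumption).
type Part = { type: string; toolCallId?: string };
type Message = { role: string; content: string | Part[] };

// Hypothetical test helper: returns the sorted ids that appear in a tool call
// without a matching tool result, or in a tool result without a matching call.
function unpairedToolCallIds(messages: Message[]): string[] {
  const calls: Record<string, true> = {};
  const results: Record<string, true> = {};

  for (const m of messages) {
    if (typeof m.content === "string") continue;
    for (const p of m.content) {
      if (p.toolCallId === undefined) continue;
      if (p.type === "tool-call") calls[p.toolCallId] = true;
      if (p.type === "tool-result") results[p.toolCallId] = true;
    }
  }

  const missingResult = Object.keys(calls).filter((id) => results[id] !== true);
  const missingCall = Object.keys(results).filter((id) => calls[id] !== true);
  return missingResult.concat(missingCall).sort();
}
```

A trimming test could assert that this list is the same before and after trimming, so clipping content never changes which calls and results are paired.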

## Acceptance criteria

- Assistant messages older than K iterations have their tool args clipped.
- Tool-result messages older than K iterations are placeholders.
- Consecutive runs of old feedback/error messages collapse into single summary placeholders.
- `SYSTEM_DEBUG_HISTORY_SIZE` event fires per iteration with estimated token count and message count.
- A 50-iteration test scenario shows the history token count plateauing rather than growing linearly.
- Existing tests still pass; new tests cover the clipping behaviors.

## Effort estimate

1-2 days. Most complexity is in preserving tool-call/tool-result pairing while clipping content.

## Related issues

This is the "Tier 2" / simpler step. A "Tier 3" follow-up (separate issue) covers LLM-based summarization for tasks where even this aggressive clipping isn't enough.

## Files likely affected

- `packages/core/src/webAgent.ts` (`truncateOldExternalContent` and friends)
- `packages/core/src/events.ts` (new event type)
- `packages/core/test/webAgent.test.ts` (`snapshot truncation` describe block)

