Current state
Pilo's main action-loop LLM call (webAgent.ts:874-989) invokes streamText from the Vercel AI SDK with no provider-specific cache markers:
```ts
const streamResult = streamText({
  ...this.providerConfig,
  messages: this.messages,
  tools: webActionTools,
  toolChoice: "required",
  maxOutputTokens: DEFAULT_GENERATION_MAX_TOKENS,
  abortSignal: this.abortSignal,
});
```
The messages array contains, in this order:
- The system prompt (built by buildActionLoopSystemPrompt) — ~3000-4000 tokens including tool examples and best practices
- The task+plan user message (built by buildTaskAndPlanPrompt) — ~500-1500 tokens
- Per-step snapshot user messages, assistant turns, tool results, error feedback, validation feedback (the conversation)
For a 50-iteration task on Claude with no caching, the system prompt and task+plan messages (positions 1 and 2) are billed at the full input rate 50 times.
Anthropic supports prompt caching via cache_control: { type: "ephemeral" } markers on individual content parts. The Vercel AI SDK surfaces this through providerOptions.anthropic on individual messages. Cached tokens are billed at ~10% of normal input cost on hit (default 5-minute TTL).
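For orientation, the SDK-level marker corresponds roughly to the following shape in the underlying Anthropic Messages API request. This is a simplified sketch — the SDK builds the actual request, and the model ID and prompt text here are placeholders:

```ts
// Roughly the request body produced when the system message carries the marker.
// Everything up to and including the marked block becomes the cacheable prefix.
const requestBody = {
  model: "claude-sonnet-4-5", // placeholder model ID
  system: [
    {
      type: "text",
      text: "...action-loop system prompt...",
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    /* task+plan user message (also markable), then the per-step conversation */
  ],
};
```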
The gap
For Claude-based runs, Pilo currently pays full input cost on tokens that are stable across the entire run. On a long task with a 4000-token system prompt and 50 iterations, that's 200,000 tokens billed that could mostly be cache hits.
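Back-of-envelope, assuming every iteration lands within the 5-minute TTL and using Anthropic's published multipliers (~1.25× the input rate for a cache write, ~0.1× for a cache read — verify against current pricing):

```ts
// Stable prefix of ~4,000 tokens, 50 iterations.
const uncached = 4_000 * 50;                    // 200,000 token-equivalents at the full rate
const cached = 4_000 * 1.25 + 4_000 * 49 * 0.1; // one write + 49 reads ≈ 24,600
const reduction = 1 - cached / uncached;        // ≈ 0.88 → roughly 88% off the stable-prefix cost
```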
OpenAI's prompt caching is automatic (no markers needed) and already applies. Gemini's caching is structurally different and not addressed here. The win is specifically for Anthropic — and via OpenRouter routing to Anthropic models.
Proposed scope
A. Detect Anthropic-routed models
In provider.ts, add a helper to determine whether the active provider is using an Anthropic model (direct or via OpenRouter):
```ts
function isAnthropicModel(providerConfig: ProviderConfig): boolean {
  const modelId = providerConfig.model?.modelId ?? "";
  // Direct Anthropic provider
  if (providerConfig.providerOptions?.anthropic) return true;
  // OpenRouter routing to Anthropic
  if (/^anthropic\//.test(modelId)) return true;
  // Heuristic: model name contains "claude"
  if (/claude/i.test(modelId)) return true;
  return false;
}
```
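Quick spot checks — the model IDs are illustrative and the partial configs are cast for brevity; adapt to whatever the configured providers actually report:

```ts
// Hypothetical spot checks, not real Pilo fixtures.
const asConfig = (modelId: string) => ({ model: { modelId } } as unknown as ProviderConfig);

isAnthropicModel(asConfig("anthropic/claude-sonnet-4")); // true — OpenRouter-style prefix
isAnthropicModel(asConfig("claude-3-5-haiku-latest"));   // true — "claude" heuristic
isAnthropicModel(asConfig("gpt-4o-mini"));               // false — no markers added
```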
B. Mark cacheable messages
In initializeSystemPromptAndTask (webAgent.ts:1641-1672), when Anthropic-routed, mark the system message and the task+plan user message as cacheable:
```ts
const cacheableMeta = isAnthropicModel(this.providerConfig)
  ? { providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } } }
  : {};

this.messages = [
  { role: "system", content: systemPrompt, ...cacheableMeta },
  { role: "user", content: taskAndPlan, ...cacheableMeta },
];
```
Verify the exact key name (providerOptions vs experimental_providerMetadata) against the installed @ai-sdk/anthropic version in package.json.
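If the pinned SDK predates providerOptions, the marker goes under the older key instead. A sketch, assuming the legacy shape still applies to the installed version:

```ts
// Legacy key name used by older AI SDK releases (assumption — confirm against package.json).
const legacyCacheableMeta = {
  experimental_providerMetadata: { anthropic: { cacheControl: { type: "ephemeral" } } },
};
```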
C. Optionally mark the latest snapshot as cacheable too
A more aggressive optimization: also mark the most recent snapshot message as cacheable, moving the marker forward as each new snapshot arrives. This makes the entire conversation prefix up to the latest snapshot eligible for cache hits. Tradeoff: cache writes are billed at a premium over normal input (roughly 1.25× at the default TTL), and snapshots churn every iteration, so each step pays a write for the newly extended prefix. Benchmark before enabling.
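A minimal sketch of what the marker movement could look like — the helper name, its placement, and the message indexing are assumptions, not existing Pilo code. Anthropic allows at most four cache_control breakpoints per request, so the marker is moved rather than accumulated:

```ts
import type { CoreMessage } from "ai";

// Hypothetical helper; call it right after pushing a new snapshot user message.
function markLatestSnapshotCacheable(messages: CoreMessage[], providerConfig: ProviderConfig): void {
  if (!isAnthropicModel(providerConfig)) return;

  // Drop the marker from whichever earlier conversation message held it, keeping
  // the markers on the system and task+plan messages (indices 0 and 1).
  for (const msg of messages.slice(2)) {
    delete (msg as { providerOptions?: unknown }).providerOptions;
  }

  // Mark the newest message so the whole prefix up to it can be served from cache.
  Object.assign(messages[messages.length - 1], {
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  });
}
```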
D. Surface cache metrics
streamText returns usage and providerMetadata. On Anthropic, cache activity is reported as cacheReadInputTokens and cacheCreationInputTokens (whether they appear on usage or under providerMetadata.anthropic depends on the SDK version — see the implementation notes). Surface these in the AI_GENERATION event:
```ts
this.eventEmitter.emit(WebAgentEventType.AI_GENERATION, {
  // ... existing fields ...
  cacheReadTokens: usage?.cacheReadInputTokens ?? 0,
  cacheWriteTokens: usage?.cacheCreationInputTokens ?? 0,
});
```
The eval-judge consumer (and any cost-tracking layer) can then compute per-task savings.
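For example, a cost layer could convert the read count into an estimated saving. A sketch with an illustrative per-million-token input price — the helper and the constant are not existing Pilo code or real pricing config:

```ts
// Hypothetical helper; the default price is illustrative only.
function estimateCacheSavingsUsd(cacheReadTokens: number, inputUsdPerMTok = 3): number {
  // A cache read costs ~10% of the normal input rate, so ~90% of it is avoided spend.
  return (cacheReadTokens / 1_000_000) * inputUsdPerMTok * 0.9;
}
```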
Implementation notes
- The exact AI SDK syntax for cache markers varies between SDK versions. Verify against the version pinned in packages/core/package.json before writing the code.
- Cache markers on the system message and the task+plan message should form a single contiguous cacheable prefix. But the SDK may treat them as two separate cache entries (one per message). Test by checking cacheReadInputTokens on the second iteration of a fresh task.
- The 5-minute TTL means tasks that pause for >5 minutes between steps lose the cache. For typical browser-automation tasks (each step takes 5-30 seconds), this isn't an issue.
- The cache is per-account and per-content-prefix. The system prompt's currentDate field changes daily — so the cache invalidates each midnight. Acceptable; if it ever becomes a real cost concern, the date can be moved out of the cached prefix into an uncached message placed right after it (see the sketch after this list).
- Don't add caching for non-Anthropic providers. OpenAI does its own automatic caching; Gemini doesn't support this style; Ollama/LM Studio don't either.
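A sketch of that date-extraction fallback, reusing cacheableMeta from section B; systemPromptWithoutDate and currentDate are hypothetical names:

```ts
// Keep the cached prefix date-free; the date rides in an uncached message after it,
// so it can change daily without invalidating the prefix.
this.messages = [
  { role: "system", content: systemPromptWithoutDate, ...cacheableMeta },
  { role: "user", content: taskAndPlan, ...cacheableMeta },
  { role: "user", content: `Current date: ${currentDate}` },
];
```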
Acceptance criteria
- For Anthropic-routed providers, system + task+plan messages carry cacheControl: { type: "ephemeral" }.
- For non-Anthropic providers, no cache markers are added (verify by message inspection in tests).
- The AI_GENERATION event includes cacheReadTokens / cacheWriteTokens (zero when no caching applies).
- A manual smoke run on a 5-step Claude task shows cacheReadInputTokens > 0 on step 2+.
- Tests in packages/core/test/ cover: cache-marker presence for Anthropic, absence for others, AI_GENERATION event field shape (see the sketch below).
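A vitest-style sketch of the marker-presence tests — makeAgent, anthropicProviderConfig, openAiProviderConfig, task, and plan are hypothetical fixtures, and the initializeSystemPromptAndTask call signature is an assumption; adapt to the existing webAgent test harness:

```ts
import { describe, expect, it } from "vitest";

describe("prompt cache markers", () => {
  it("marks system and task+plan messages for Anthropic-routed providers", () => {
    const agent = makeAgent({ providerConfig: anthropicProviderConfig });
    agent.initializeSystemPromptAndTask(task, plan);
    for (const msg of agent.messages.slice(0, 2)) {
      expect(msg.providerOptions?.anthropic?.cacheControl).toEqual({ type: "ephemeral" });
    }
  });

  it("adds no markers for non-Anthropic providers", () => {
    const agent = makeAgent({ providerConfig: openAiProviderConfig });
    agent.initializeSystemPromptAndTask(task, plan);
    for (const msg of agent.messages) {
      expect(msg.providerOptions).toBeUndefined();
    }
  });
});
```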
Effort estimate
1-2 days including verification against the SDK and the smoke test.
Related issues
Pairs with the per-model prompt variants issue — flash variants will have a shorter cacheable prefix, but the cache markers go on whichever variant is selected.
Files likely affected
- packages/core/src/provider.ts (provider detection helper)
- packages/core/src/webAgent.ts (initializeSystemPromptAndTask, AI_GENERATION event)
- packages/core/src/events.ts (event field additions)
- packages/core/test/webAgent.test.ts