OpenRouter BYOK: Claude models do not benefit from prompt caching, causing excessive token costs in agent mode #312939

@fishcharlie

Description


Summary

When using OpenRouter BYOK with Anthropic Claude models in GitHub Copilot, prompt caching is silently disabled. This leads to every agentic request re-sending the full conversation context at full price, which becomes extremely expensive for long agent sessions.

Root Cause

The OpenRouterLMProvider extends AbstractOpenAICompatibleLMProvider, which builds requests in the OpenAI-compatible chat completions format via createCapiRequestBody and rawMessageToCAPI. When a CacheBreakpoint content part is present in a message, rawMessageToCAPI converts it to a CAPI-specific copilot_cache_control: { type: 'ephemeral' } field placed at the message object level (see openai.ts):

if (message.content.find(part => part.type === ChatCompletionContentPartKind.CacheBreakpoint)) {
    out.copilot_cache_control = { type: 'ephemeral' };
}

This copilot_cache_control field is a GitHub Copilot API extension understood only by CAPI. OpenRouter does not recognise it and therefore applies no caching at all.
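For illustration, a system message on this code path reaches OpenRouter in roughly the following shape (the exact request wrapper varies; this sketch only shows where the field ends up):

// Illustrative only: the Copilot-specific field sits at the message level,
// where OpenRouter does not look for caching hints and silently drops it.
const messageAsSent = {
    role: 'system',
    content: 'You are a coding agent. <large, stable system prompt>',
    copilot_cache_control: { type: 'ephemeral' }  // ignored by OpenRouter
};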

OpenRouter's Anthropic Claude prompt caching requires one of:

  1. Automatic caching (top-level request field): "cache_control": { "type": "ephemeral" } at the root of the request body.
  2. Explicit caching (per-block): cache_control on individual content block objects within the messages array.

Neither format is emitted by the current implementation when routing through OpenRouter.
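For comparison, the explicit per-block form (option 2 above) looks roughly like this; the shape follows OpenRouter's documented Anthropic caching format, with illustrative values:

// Explicit per-block caching: cache_control lives inside a content block,
// not at the message level and not under a Copilot-specific key.
const explicitlyCachedMessage = {
    role: 'system',
    content: [
        {
            type: 'text',
            text: '<large, stable system prompt>',
            cache_control: { type: 'ephemeral' }  // marks the end of the cached prefix
        }
    ]
};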

By contrast, the native AnthropicLMProvider (direct BYOK Anthropic) correctly attaches cache_control to individual content blocks using the Anthropic SDK, so caching works there. The gap is specific to the OpenAI-compatible code path used by OpenRouter.
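For reference, a minimal sketch of the shape the Anthropic SDK accepts on that working path (not the actual AnthropicLMProvider code; the model id and prompt are placeholders):

import Anthropic from '@anthropic-ai/sdk';

// Minimal sketch of the direct-Anthropic shape: cache_control is attached to a
// content block, so Anthropic caches the prefix up to and including that block.
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 1024,
    system: [
        {
            type: 'text',
            text: '<large, stable system prompt>',
            cache_control: { type: 'ephemeral' }
        }
    ],
    messages: [{ role: 'user', content: 'Continue the task.' }]
});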

Relevant files

  • openai.ts (rawMessageToCAPI / createCapiRequestBody, which emit the copilot_cache_control field)
  • OpenRouterLMProvider / AbstractOpenAICompatibleLMProvider (the OpenAI-compatible BYOK request path)
  • AnthropicLMProvider (the direct Anthropic BYOK path, where caching works correctly)

Steps to Reproduce

  1. Configure OpenRouter BYOK in GitHub Copilot with an API key.
  2. Select a Claude model (e.g., anthropic/claude-sonnet-4-5 or anthropic/claude-opus-4-5).
  3. Start a multi-turn agent session with a large system prompt or long conversation context (>1024–4096 tokens depending on model).
  4. Observe via the OpenRouter Activity page or the /api/v1/generation API that cached_tokens in prompt_tokens_details is always 0 — no cache hits occur.
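As a sanity check that OpenRouter does cache when the standard field is present, a request like the following can be sent directly (a sketch; it assumes OpenRouter's usage accounting option and uses the cached_tokens field referenced in step 4):

// Send this twice within a few minutes: cached_tokens should be 0 on the first
// call and > 0 on the second. Copilot's requests never produce the second state.
const res = await fetch('https://openrouter.ai/api/v1/chat/completions', {
    method: 'POST',
    headers: {
        'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
        'Content-Type': 'application/json'
    },
    body: JSON.stringify({
        model: 'anthropic/claude-sonnet-4-5',
        usage: { include: true },  // ask OpenRouter to return detailed usage (assumed option)
        messages: [
            {
                role: 'system',
                content: [
                    {
                        type: 'text',
                        text: '<large, stable system prompt>',
                        cache_control: { type: 'ephemeral' }
                    }
                ]
            },
            { role: 'user', content: 'Hello' }
        ]
    })
});

const data = await res.json();
console.log(data.usage?.prompt_tokens_details?.cached_tokens);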

Expected Behavior

The system prompt and stable conversation history are cached on OpenRouter's Anthropic endpoint. Subsequent requests within the same conversation show cached_tokens > 0 and a reduced effective cost (~0.1× input price on cached portions).

Actual Behavior

cached_tokens is always 0. Every request is billed at full input token price. For a typical 10-turn agent session with a 10k-token system prompt, this is ~10× more expensive than it should be.

Cost Impact

Per OpenRouter's Anthropic pricing:

  • Cache write (5-min TTL): 1.25× base input price
  • Cache read: 0.1× base input price (90% savings)

For a long agent run with, e.g., 50k tokens of stable context repeated across 20 turns, the cost inflation approaches ~900% compared with what the same workload costs through the native Anthropic BYOK provider.
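A rough worked version of that estimate (prices expressed as multiples of the base input price, output tokens ignored):

// 50k tokens of stable context, repeated across 20 turns.
const stableTokens = 50_000;
const turns = 20;

// Without caching: the full context is billed at 1x on every turn.
const uncached = stableTokens * turns;  // 1,000,000 token-equivalents

// With caching: one 1.25x cache write, then 0.1x cache reads on later turns.
const cached = stableTokens * 1.25 + stableTokens * 0.1 * (turns - 1);  // 157,500

console.log(uncached / cached);  // ~6.3x at 20 turns, approaching 10x (~900%) as turns grow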

Suggested Fix

OpenRouterLMProvider (or the OpenAIEndpoint code path when used with OpenRouter) should translate CacheBreakpoint parts to standard cache_control objects that OpenRouter understands. The simplest approach is to add a top-level automatic caching field to the request body:

{
  "model": "anthropic/claude-sonnet-4-5",
  "cache_control": { "type": "ephemeral" },
  "messages": [ ... ]
}

This is the format documented by OpenRouter for automatic Anthropic prompt caching and is the recommended approach for multi-turn conversations.

Alternatively, cache_control could be injected onto individual content blocks in the messages array when the provider is OpenRouter and the selected model is an Anthropic Claude model.
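A rough sketch of what that per-block translation could look like (the type and function names here are hypothetical, not the actual provider code):

// Hypothetical: drop each CacheBreakpoint part and attach a standard
// cache_control to the preceding text block, which is the form OpenRouter
// forwards to Anthropic. Applied only when the target model is a Claude model.
type Part =
    | { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }
    | { type: 'cacheBreakpoint' };

function toOpenRouterContent(parts: Part[]): Part[] {
    const out: Part[] = [];
    for (const part of parts) {
        if (part.type === 'cacheBreakpoint') {
            const prev = out[out.length - 1];
            if (prev && prev.type === 'text') {
                prev.cache_control = { type: 'ephemeral' };
            }
        } else {
            out.push({ ...part });
        }
    }
    return out;
}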
