Summary
When using OpenRouter BYOK with Anthropic Claude models in GitHub Copilot, prompt caching is silently disabled. This leads to every agentic request re-sending the full conversation context at full price, which becomes extremely expensive for long agent sessions.
Root Cause
The OpenRouterLMProvider extends AbstractOpenAICompatibleLMProvider, which sends requests using the OpenAI-compatible chat completions format via createCapiRequestBody → rawMessageToCAPI. When a CacheBreakpoint content part is present in a message, rawMessageToCAPI converts it to a CAPI-specific copilot_cache_control: { type: 'ephemeral' } field placed at the message object level (see openai.ts):
if (message.content.find(part => part.type === ChatCompletionContentPartKind.CacheBreakpoint)) {
out.copilot_cache_control = { type: 'ephemeral' };
}
This copilot_cache_control field is a GitHub Copilot API extension understood only by CAPI. OpenRouter does not recognise it and therefore applies no caching at all.
OpenRouter's Anthropic Claude prompt caching requires one of:
- Automatic caching (top-level request field):
"cache_control": { "type": "ephemeral" } at the root of the request body.
- Explicit caching (per-block):
cache_control on individual content block objects within the messages array.
Neither format is emitted by the current implementation when routing through OpenRouter.
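To make the mismatch concrete, the following sketch contrasts the two shapes as plain objects (illustrative only; the field placement follows the description above, not the extension's actual serialization code):

```typescript
// What the current OpenAI-compatible path emits: a Copilot-specific
// extension field on the message object, which OpenRouter ignores.
const capiMessage = {
  role: "system",
  content: "You are a helpful coding agent...",
  copilot_cache_control: { type: "ephemeral" }, // CAPI-only; ignored by OpenRouter
};

// What OpenRouter's automatic Anthropic caching expects: a standard
// cache_control object at the top level of the request body.
const openRouterBody = {
  model: "anthropic/claude-sonnet-4-5",
  cache_control: { type: "ephemeral" },
  messages: [{ role: "system", content: "You are a helpful coding agent..." }],
};

console.log("copilot_cache_control" in capiMessage); // true — non-standard field
console.log("cache_control" in openRouterBody);      // true — standard field
```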
By contrast, the native AnthropicLMProvider (direct BYOK Anthropic) correctly attaches cache_control to individual content blocks using the Anthropic SDK, so caching works there. The gap is specific to the OpenAI-compatible code path used by OpenRouter.
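For comparison, the shape the native path produces on the wire (per Anthropic's Messages API; shown here as a plain object rather than the provider's actual SDK calls) attaches the marker to a content block, which is also the format OpenRouter accepts for explicit per-block caching:

```typescript
// Anthropic Messages API request fragment: cache_control sits on an
// individual content block rather than on the message or request body.
const anthropicStyleMessage = {
  role: "user",
  content: [
    {
      type: "text",
      text: "Large, stable context...",
      cache_control: { type: "ephemeral" },
    },
  ],
};
console.log(anthropicStyleMessage.content[0].cache_control.type); // "ephemeral"
```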
Relevant files
- extensions/copilot/src/extension/byok/vscode-node/openRouterProvider.ts — OpenRouter provider (extends AbstractOpenAICompatibleLMProvider, no caching override)
- extensions/copilot/src/platform/networking/common/openai.ts — rawMessageToCAPI emits copilot_cache_control instead of standard cache_control
- extensions/copilot/src/extension/byok/node/openAIEndpoint.ts — OpenAIEndpoint.createRequestBody calls createCapiRequestBody for the non-Responses-API path
- extensions/copilot/src/extension/byok/vscode-node/anthropicProvider.ts — native Anthropic provider that correctly applies caching (for comparison)
Steps to Reproduce
- Configure OpenRouter BYOK in GitHub Copilot with an API key.
- Select a Claude model (e.g., anthropic/claude-sonnet-4-5 or anthropic/claude-opus-4-5).
- Start a multi-turn agent session with a large system prompt or long conversation context (>1024–4096 tokens depending on model).
- Observe via the OpenRouter Activity page or the /api/v1/generation API that cached_tokens in prompt_tokens_details is always 0 — no cache hits occur.
Expected Behavior
The system prompt and stable conversation history are cached on OpenRouter's Anthropic endpoint. Subsequent requests within the same conversation show cached_tokens > 0 and a reduced effective cost (~0.1× input price on cached portions).
Actual Behavior
cached_tokens is always 0. Every request is billed at full input token price. For a typical 10-turn agent session with a 10k-token system prompt, this is ~10× more expensive than it should be.
Cost Impact
Per OpenRouter's Anthropic pricing:
- Cache write (5-min TTL): 1.25× base input price
- Cache read: 0.1× base input price (90% savings)
For a long agent run with e.g. 50k tokens of stable context repeated across 20 turns, this is a ~900% cost inflation vs. what the same workload costs with the native Anthropic BYOK provider.
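A back-of-envelope check of that scenario (the dollar figure of $3/M input tokens is an assumption; only the 1.25×/0.1× ratios matter). Note the realized multiplier sits somewhat below the 10× read-price ratio because of the one-time 1.25× cache-write surcharge, and approaches 10× as the number of turns grows:

```typescript
const basePerMTok = 3.0;      // $/M input tokens (assumed base price)
const stableTokens = 50_000;  // stable context repeated each turn
const turns = 20;

// Without caching: full input price on every turn.
const uncached = (turns * stableTokens * basePerMTok) / 1e6;

// With caching: one cache write (1.25x), then cache reads (0.1x).
const cached =
  (stableTokens * 1.25 * basePerMTok) / 1e6 +
  ((turns - 1) * stableTokens * 0.1 * basePerMTok) / 1e6;

console.log(uncached.toFixed(2));              // "3.00"
console.log((uncached / cached).toFixed(1));   // "6.3" — grows toward 10x with more turns
```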
Suggested Fix
OpenRouterLMProvider (or the OpenAIEndpoint code path when used with OpenRouter) should translate CacheBreakpoint parts to standard cache_control objects that OpenRouter understands. The simplest approach is to add a top-level automatic caching field to the request body:
{
"model": "anthropic/claude-sonnet-4-5",
"cache_control": { "type": "ephemeral" },
"messages": [ ... ]
}
This is the format documented by OpenRouter for automatic Anthropic prompt caching and is the recommended approach for multi-turn conversations.
Alternatively, cache_control could be injected onto individual content blocks in the messages array when the provider is OpenRouter and the selected model is an Anthropic Claude model.
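A sketch of that per-block translation (hypothetical helper; the part and field names mirror the description above, and where exactly this hook would live in the provider is an assumption, not the actual extension code):

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "cache_breakpoint" }; // stand-in for the CacheBreakpoint part kind

type OutPart = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };

// Translate CacheBreakpoint markers into Anthropic-style cache_control
// on the preceding content block — the format OpenRouter forwards to Anthropic.
function toOpenRouterContent(parts: ContentPart[]): OutPart[] {
  const out: OutPart[] = [];
  for (const part of parts) {
    if (part.type === "cache_breakpoint") {
      // Attach cache_control to the most recent block, if any.
      if (out.length > 0) {
        out[out.length - 1].cache_control = { type: "ephemeral" };
      }
    } else {
      out.push({ type: "text", text: part.text });
    }
  }
  return out;
}

const converted = toOpenRouterContent([
  { type: "text", text: "Large, stable system prompt..." },
  { type: "cache_breakpoint" },
  { type: "text", text: "Latest user turn" },
]);
console.log(JSON.stringify(converted[0].cache_control)); // {"type":"ephemeral"}
```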