Skip to content

v2.8.1 — prompt caching for the AiNative runtime

Latest

Choose a tag to compare

@mabou7agar mabou7agar released this 09 Jun 09:05
· 7 commits to master since this release
84463c2

Makes repeated AiNative planner steps faster/cheaper by caching the stable instruction prefix.

Caching

  • Anthropic — the planner sends its stable instruction prefix as a cache_control: ephemeral system block, so Claude orchestration caches the ~5 KB rules block across a turn's steps (faster first token, cheaper input) instead of re-reading it every call. Gated by ai-engine.engines.anthropic.prompt_caching (default true). (#60)
  • OpenAI — already auto-caches the longest common prompt prefix (no cache_control needed); the single-message prompt is left byte-identical, so the default path is unchanged. (Live-proven: 1408/1455 input tokens served from cache on a repeat call.)

Observability (#61)

AIResponse->getUsage() now surfaces cache token counts:

  • cached_tokens — OpenAI prompt-prefix hit / Anthropic cache_read
  • cache_creation_tokens — Anthropic cache write
  • normalized for Gemini and the openai-family drivers (deepseek, perplexity, nvidia) too

Note

Why no tool-search subagent: tool selection is already handled by the keyword/semantic selectors + find_tools (the ToolSearch-equivalent); caching is the orthogonal speed lever, now wired.

Full suite green: 1924 tests.