GitHub - learningCatHD/telos-sdk: TELOS SDK: a cache-aware prompt protocol and gateway for portable agent context.

Context is yours · Agents are hired

No rewrite. No compression. 90% token billing saving.

_{One canonical IR — tools, system, turns, and memory — runs unchanged across Anthropic · OpenAI · DeepSeek · vLLM · SGLang
Real 6-turn session −92.3% · Cost reported in absolute $/query-resolved — ratios can be gamed; dollars can't}

_{LEAP Lab @ Tsinghua University — a research group focused on machine learning, multimodal learning, and embodied intelligence · leaplab.ai}

Quickstart · Support Matrix · Why · Benchmark · Protocol · Roadmap · Citation

_{📖 English · Simplified Chinese}

⬢ 2 a.m. — where did all the money go?

2 a.m., agent still running. The counter in the bottom-right corner climbs to 2,847,103 — you convert it to dollars and your stomach drops. Worse: the line above reads cache_read: 0. All night long, every turn fed the same 4,000-token system prompt from scratch to the model, billed at full price.

Take the exact same 6-turn real conversation, drop it into openclaw, flip two switches:

Mode	raw input tokens	cache_read	Cost for 6 turns
passthrough (today's default)	24,151	0	$0.3623
with TELOS	0	18,701	$0.0281 (−92.3%)

Scale to 1,000 sessions: $362 → $26. In a controlled A/B/C/D run (showcase/dashboard.html, 2026-05-19) — 48 calls across 4 sessions, counterfactual bill $5.90, actual $3.74 — net saved $2.16 (−36.6%). One dev machine, one afternoon. Multiply by team scale, and that's a real server invoice every month.

Stop measuring in "X× fewer tokens." In 2026, the pricing gap between tiers of the same model family already spans 80×–150×. Anyone can inflate a ratio by stuffing the cheapest tier in the denominator — absolute dollars are the only number that doesn't lie.

⬢ 3-step to save 90%

❶ Install

pip install -U telos-sdk

❷ Connect

telos init

Auto-detects claude-code / codex / openclaw / hermes on this machine, injects config into each, and starts the local gateway in the background (state written to ~/.telos/gateway.json). No changes to your agent code.

❸ Observe

telos dashboard

Opens an offline HTML dashboard in your browser showing savings per call in absolute dollars. Every invocation is automatically appended to ~/.telos/usage.jsonl and aggregated in real time.

_{Every saving pinned to an absolute dollar figure · No cloud server required · Opens offline · ~/.telos/usage.jsonl fed directly into a single-file HTML page}

TELOS is open source. Run it on your own workflow — see whether that 92% is real, or just another "X× tokens" claim.

⬢ Support Matrix

Harness support

Harness	Typical usage	`telos init` auto-connect	Status
Claude Code	Anthropic-native coding agent workflow	✅	🟢 First-class
OpenClaw	Open-source agent runtime with TELOS parser integration	✅	🟢 First-class
Hermes	Multi-agent orchestration with independent sub-IR handling	✅	🟢 First-class
Codex	OpenAI-style coding workflow via local gateway injection	✅	🟢 Supported

Frontier model support

Model family	Provider	Through TELOS engine adapter	Notes
Claude (4.x / 4.6+)	Anthropic	✅	Explicit breakpoints and prewarm path
GPT (4+/5.x)	OpenAI	✅	Uses `prompt_cache_key` routing strategy
DeepSeek (V3+)	DeepSeek	✅	Deterministic byte-stable prefix behavior

Inference framework support

Framework	Deployment style	Through TELOS	Cache-aware capabilities
vLLM	Self-hosted OpenAI-compatible serving	✅	Explicit anchors, prewarm, cache probe/evict, partial fork-and-replace
SGLang	Self-hosted high-throughput serving	✅	Explicit anchors, prewarm, cache probe/evict, full fork-and-replace

_{Need another harness or model backend? TELOS is adapter-driven: keep the same IR and add an engine/harness adapter without rewriting your agent logic.}

⬢ TELOS solves exactly two things

① Push token efficiency to the limit. 6-turn real session −92.3%; controlled 48-call run −36.6% (net −$2.16). Every cent accounted for in absolute $/query-resolved — ratios can be faked; dollars can't.

② Return context sovereignty to you. TelosIR is an engine-agnostic, serializable, portable context representation. Your persona, your tools, your 20-turn mid-session thread — everything packed into the same stone tablet. Hand it to Claude today, move it to DeepSeek tomorrow, run it on a local vLLM tonight. Your context; agents are just hired help.

⬢ SWE-bench Verified — TELOS does not regress task correctness

Token savings are only useful if the agent still solves the problem. We ran a pre-registered A/B on SWE-bench Verified with the Hermes harness and deepseek/deepseek-v4-flash — 100 instances per arm, seeded sample across 8 repos (sphinx, matplotlib, xarray, pytest, requests, pylint, seaborn, flask). 99 instances per arm were graded under the official Docker harness (1 instance excluded due to a missing per-instance docker image upstream).

Resolved rate (docker-graded, n=99/arm, paired)

Arm	Resolved	Rate	95% Wilson CI
TELOS	45 / 99	45.5%	[36.0%, 55.2%]
Vanilla	42 / 99	42.4%	[33.2%, 52.3%]

Paired 2×2 on the same 99 instances: both resolved 33; TELOS-only 12; vanilla-only 9; neither 45. Exact McNemar two-sided p = 0.66 — the +3 pp absolute gap is not statistically significant, i.e. TELOS does not regress resolved rate at this sample size.

Token efficiency (agent-side, n=99/arm, same instances)

Per-task	TELOS	Vanilla	Δ
new_input (post-cache, billed)	93,712	198,706	−52.8%
prompt_tokens (raw + cache)	352,400	515,953	−31.7%
output_tokens	24,975	25,218	−1.0%
api_calls	32.6	32.1	+1.4%
cache_share	73.4%	61.5%	+11.9 pp
reported cost (USD)	$2.29	$3.85	−40.5%

Read this honestly. The 99-instance subset gives a Wilson CI of roughly ±10 pp on each arm. This run can rule out an absolute regression worse than ~6 pp at 95% confidence (the lower bound of the paired difference), but cannot pin Δ to ±2 pp. What it shows with high confidence is the input-token bill is roughly halved, and end-to-end cost drops ~40%, at the same correctness band. A larger run (n ≥ 400/arm) is on the roadmap to tighten the resolved-rate confidence interval further.

_{Raw outputs: agent runs in /tmp/telos-ab-n100/{with,without}/, docker-graded reports in /tmp/telos-ab-n100/docker-eval/. Reproduce: scripts/run_swebench_batch.py -n 100 --seed 7. Full technical report (pre-registered design, statistical detail, related work): docs/2026-05-26-swebench-ab.md.}

⬢ The protocol: not compression, but never breaking the prefix

Most agent frameworks treat KV-cache as a runtime gift the inference engine may or may not give you. TELOS inverts this:

Cache reuse is a structural property of the prompt itself, not a matter of runtime luck. If you never touch bytes already submitted, the cache cannot be invalidated.

That principle materializes in three interlocking ideas.

Three-color bands

Every content block declares its cache lifetime at birth — not post-hoc heuristics, not LLM guessing, but first-class structural annotation:

Band	Color	Semantics	Cache behavior
PIN	🟢	Tool defs · system prompt · current question	Permanent. Never evicted. The immutable base of every request's prefix hash
FOLD	🟡	Conversation history · tool results · large docs	Cacheable, compactable. Under pressure, replaced by a summary — PIN prefix bytes stay untouched
DROP	🔴	Timestamps · CWD · git status · PIDs	Ephemeral. Excluded entirely from the prefix hash. Must follow all BPs; never contaminates upstream bytes

The ordering invariant is absolute: PIN* → FOLD* → DROP* — within each message, across the full prompt, at every layer. This is the only structural rule that wins the cache — everything else is implementation detail.

Monotonic append

The prompt is an append-only stream. New turns only add blocks to the tail — no mutation of already-submitted bytes. A "modification" is expressed as a new block (summary, redaction), never an in-place rewrite.

Because earlier blocks are immutable and bytes are identical across turns, the inference engine's prefix-matching algorithm finds the longest common prefix on every request — not by luck, but by construction. Cache hit rate is therefore a monotonically non-decreasing function of session length: longer sessions, more reuse, never regression.

⬢ Roadmap

TELOS makes exactly one claim: context is yours, agents are hired. The current roadmap stays entirely within the cost-saving gateway narrative, with the seed of trajectory as a portable asset planted only in the last phase. Anything that can be checked off goes on the roadmap; anything that can't, doesn't.

Phase	Thesis
Phase 1 · Protocol correctness hardening	Turn "cache cannot be invalidated" from a slogan into a CI red/green light
Phase 2 · Production reliability & observability	Make the gateway safe to leave on someone else's prod traffic
Phase 3 · Take over the call chain	Go from prompt rewriter to the agent's traffic plane
Phase 4 · Context becomes an asset	Trajectories are no longer logs — they're forkable code

Citation

Core contributors: Zheng Wang, Shenzhi Wang, Yue Wu, Shiji Song, Gao Huang

@misc{wang2026telos-agent,
  title        = {Telos: A Cost-Aware Inference Infrastructure for AI Agent},
  author       = {Zheng Wang, Shenzhi Wang, HongTao Zhong, Shiji Song, Gao Huang},
  howpublished = {\url{https://github.com/learningCatHD/telos-sdk.git}},
  year         = {2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
assets		assets
docs		docs
engine		engine
gateway		gateway
harness		harness
init		init
output_filter		output_filter
proxy		proxy
replay		replay
scripts		scripts
showcase		showcase
site		site
tests		tests
.DS_Store		.DS_Store
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
README.zh-CN.md		README.zh-CN.md
__init__.py		__init__.py
bridge.py		bridge.py
cast.py		cast.py
cli.py		cli.py
cli_menu.py		cli_menu.py
config.py		config.py
corpus.py		corpus.py
demo.py		demo.py
example_ab_dashboard.html		example_ab_dashboard.html
example_developer_page.html		example_developer_page.html
harnesses.py		harnesses.py
ir.py		ir.py
pyproject.toml		pyproject.toml
refpool.py		refpool.py
registry.py		registry.py
uv.lock		uv.lock
vercel.json		vercel.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Context is yours · Agents are hired

⬢ 2 a.m. — where did all the money go?

⬢ 3-step to save 90%

❶ Install

❷ Connect

❸ Observe

⬢ Support Matrix

Harness support

Frontier model support

Inference framework support

⬢ TELOS solves exactly two things

⬢ SWE-bench Verified — TELOS does not regress task correctness

Resolved rate (docker-graded, n=99/arm, paired)

Token efficiency (agent-side, n=99/arm, same instances)

⬢ The protocol: not compression, but never breaking the prefix

Three-color bands

Monotonic append

⬢ Roadmap

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Context is yours · Agents are hired

⬢ 2 a.m. — where did all the money go?

⬢ 3-step to save 90%

❶ Install

❷ Connect

❸ Observe

⬢ Support Matrix

Harness support

Frontier model support

Inference framework support

⬢ TELOS solves exactly two things

⬢ SWE-bench Verified — TELOS does not regress task correctness

Resolved rate (docker-graded, n=99/arm, paired)

Token efficiency (agent-side, n=99/arm, same instances)

⬢ The protocol: not compression, but never breaking the prefix

Three-color bands

Monotonic append

⬢ Roadmap

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages