A Chinese version of this document is available at
README.zh.md.
An offline analysis tool that answers one question: "Given my deployment, how much can KV cache actually help?"
Feed it either a real traffic trace (JSONL) or just a model spec plus an Agent-scenario description, and it tells you:
- How much of this traffic is worth reusing — the ceiling of the hit rate.
- How much you can actually capture under your current GPU memory budget — and whether the gap is caused by capacity or by the cache policy (e.g. LRU).
- The minimum number of cards / machines required to hit a target TPS, after the hit rate is converted into throughput.
- A cold-start estimate when no trace exists yet — useful for early capacity planning when all you have is "a few Agents share one system prompt."
It is not a serving runtime. It does not participate in online scheduling. Pure offline number-crunching.
- Capacity sizing before a new product launches: no real traffic yet, but you know the rough number of Agents, the shared prompt length, and the average number of conversation turns; you want to estimate "is 1 machine × 8 H20 cards enough?"
- Planning with existing traffic: you have a JSONL trace and want to see how throughput changes if you bump GPU memory, add a host KV tier, or switch deployment shape.
- Comparing the cost of policy choices: how big is the gap between "theoretical optimum" and "actually running LRU"? Is it worth the engineering effort to optimize the policy?
- Calibrating heuristic parameters: you have a small trace, but ultimately want a generic formula to extrapolate to other scenarios. The tool can use that small trace to back-calibrate the key heuristic parameters (Zipf shape and LRU efficiency coefficient) so they better match reality.
Whether you run in trace mode or heuristic mode, all results revolve around the same set of metrics. Once you understand them, every output file makes sense:
- Content upper bound: assuming infinite capacity, the maximum reuse this traffic allows. The first gate for "is this workload even worth caching?".
- Relaxed upper bound (capacity oracle): under fixed GPU memory, the hit rate achievable by an "oracle scheduler" with full offline knowledge. This is the best you can possibly get after capacity bites.
- Exact strict-prefix optimum: on top of the relaxed bound, add the constraint "hits must follow strict prefix semantics." This is the truly achievable optimum. All TPS and machine-count planning is based on this number.
- LRU baseline: under the same capacity, what plain online LRU achieves. A simple, deployable lower bound on what real systems will produce.
- TPS / machine-count planning: multiply the hit rate by
alpha(the coefficient that converts hit savings into throughput gain), combine with "per-card baseline TPS without any hits" and "target total TPS," and back-solve the minimum number of cards / machines required.
The four hit-rate metrics always satisfy:
LRU baseline ≤ exact strict-prefix ≤ relaxed upper bound ≤ content upper bound
When reading the results, fix your eyes on two gaps:
- content upper bound − exact strict-prefix: a large gap → bottlenecked by capacity → adding memory helps.
- exact strict-prefix − LRU baseline: a large gap → bottlenecked by policy → switching to a smarter cache policy helps.
flowchart TD
A[Input] --> B{Have a trace?}
B -- Yes --> C[JSONL trace file\nor URL]
B -- No --> D[Model spec +\nAgent scenario]
B -- Raw dataset --> E[convert-conversation-dataset\nconvert-benchmark-results]
E --> C
C --> F[analyze-buckets /\naudit-buckets]
D --> G[estimate-multi-agent]
C --> H[calibrate-multi-agent]
F --> I[Oracle results\nhit_summary.csv\ntier_summary.csv]
G --> J[Heuristic estimate\nheuristic_summary.csv\nheuristic_report.md]
H --> K[Calibrated params\ncalibrated_config.json]
I --> L[Planning\nplanning_strict_prefix.csv\nplanning_lru.csv]
J --> L
K --> G
L --> M[Answer\nMin cards / machines\nto hit target TPS]
style A fill:#f0f4ff,stroke:#4a6cf7
style M fill:#f0fff4,stroke:#2ea043
style I fill:#fff8f0,stroke:#d08000
style J fill:#fff8f0,stroke:#d08000
Install first:
python3 -m pip install -e .There are 4 subcommands. If you are new, start with estimate-multi-agent — it requires no trace data and is the easiest to run.
Write a short Agent-scenario description (shared prompt length, number of concurrent Agents, new tokens per turn, etc.) and you get a hit-rate and machine-count estimate.
kvcache-upper-bound estimate-multi-agent \
--config configs/public_multi_agent_qwen3_5_27b.json \
--output-dir outputs/heuristic_qwen_1x8Best for: no real traffic yet, but you need a number for a PRD or capacity review.
Given a JSONL trace, the tool buckets requests by prompt token count (each bucket defined in the config via lower_tokens / upper_tokens, e.g. 0–32K, 32K–64K, 64K–128K, ...). Every bucket independently produces the content upper bound, relaxed upper bound, exact strict-prefix optimum, LRU baseline, and the corresponding TPS / machine-count plan.
kvcache-upper-bound analyze-buckets \
--trace https://media.githubusercontent.com/media/alibaba-edu/qwen-bailian-usagetraces-anon/main/qwen_traceA_blksz_16.jsonl \
--config configs/public_trace_qwen3_5_27b.json \
--output-dir outputs/run_traceA \
--max-records 5000Use --max-records (e.g. 5000) during debugging to validate the config, then drop it to run the full trace.
Best for: you already have a real trace and need to do capacity planning or compare deployment shapes.
audit-buckets does everything analyze-buckets does plus an extra correctness audit: it samples a small subset, recomputes it with a naive reference implementation, and cross-checks that the several strict-prefix solvers agree. The conclusion ("which results are formally proved, which are only supporting evidence") is written to correctness_report.zh.md / .en.md.
kvcache-upper-bound audit-buckets \
--trace https://media.githubusercontent.com/media/alibaba-edu/qwen-bailian-usagetraces-anon/main/qwen_traceA_blksz_16.jsonl \
--config configs/public_trace_qwen3_5_27b.json \
--output-dir outputs/run_traceA_audit--sample-request-limit controls the audit sample size (default 256). It only affects the coverage of the cross-check; the main analysis always runs over the full trace.
Best for: first time touching the project, or after modifying an oracle implementation, to confirm nothing has drifted.
If your goal is "use one generic formula to extrapolate across many scenarios" but you suspect the default heuristic parameters are off, take a small trace and back-calibrate the two key parameters: the Zipf shape zipf_s and the LRU efficiency coefficient lru_like.
kvcache-upper-bound calibrate-multi-agent \
--trace https://media.githubusercontent.com/media/alibaba-edu/qwen-bailian-usagetraces-anon/main/qwen_traceA_blksz_16.jsonl \
--bucket-config configs/public_trace_qwen3_5_27b.json \
--heuristic-config configs/public_multi_agent_qwen3_5_27b.json \
--output-dir outputs/heuristic_qwen_1x8_calibrated \
--max-records 5000This command produces two recommendations side by side:
- A calibrated heuristic config
calibrated_config.jsonwith the parameters fit to the observation. - A trace-derived "structure template"
recommended_heuristic_config.jsonthat tells you how to set the structural parameters — shared-prefix length, private window size, concurrent Agent count, etc. If the calibrated hit rate still differs significantly from observation (visible in the report as a largecontent_gap), tune the structure template instead of pushingzipf_s/lru_likefurther.
Best for: stabilizing a heuristic configuration so it can be reused for trace-less scenarios later.
configs/ ships with sample files you can copy and adapt. Pick the one closest to your scenario.
| Sample | Purpose |
|---|---|
public_trace_qwen3_5_27b.json |
1 machine × 8 H20 cards, trace analysis. Capacity is essentially unconstrained — good for studying the upper bound. |
public_trace_qwen3_5_27b_1x1_h20.json |
1 machine × 1 H20 card, trace analysis. Designed to amplify single-card memory pressure and show how HBM compresses the hit rate. |
public_trace_qwen3_5_27b_planning_norm.json |
1 × 8 with baseline_per_card_tps = 1 and planning_target_total_tps = 8, a normalized planning demo. |
public_trace_qwen3_5_27b_1x1_h20_planning_norm.json |
1 × 1 with the same normalized targets — pair it with the 1 × 8 version to compare deployment shapes head-on. |
public_multi_agent_qwen3_5_27b.json |
1 × 8, trace-free heuristic, based on the public Qwen3.5-27B model spec. |
public_multi_agent_qwen3_5_27b_1x1_h20.json |
Same as above but 1 × 1 — see how the cold-start estimate behaves when memory is tight. |
The
baseline_per_card_tps = 1/planning_target_total_tps = 8numbers in the normalized samples are illustrative only, not real production figures. Replace them with your own baseline and target TPS.
Model profile (model_profile) — describes the shape of the KV cache and the weight footprint:
n_layers/n_kv_heads/head_dim/dtype_bytes/block_size: used to compute KV bytes per token.kv_cache_layer_count: required for hybrid-attention models. Counts only the layers that actually go through a token-linear KV cache; pretending all layers do will overestimate the footprint. May be omitted for uniform-attention models, in which case the tool falls back ton_layers.tp_size/pp_size: tensor / pipeline parallel sizes, used to shard per-token KV bytes per rank. Default1.parameter_count/weight_dtype_bytes: total parameter count and weight dtype size in bytes. Only needed when you want to derive the KV budget fromgpu_memory_gb_per_card— the tool uses these to compute the weight shard footprint and subtract it from total GPU memory. Optional if you supplyhbm_kv_gb_per_carddirectly.
Deployment (bucket_deployments[] for trace mode, deployments[] for heuristic mode) — describes how many cards and how much memory:
accelerator_count: total card count (required).cards_per_machine: cards per machine (required). Machine count is derived asaccelerator_count / cards_per_machine.machine_spec: a pure spec label, e.g."h20". Only the spec name — never write"8*h20"or any form that bakes a count into the spec.hbm_kv_gb_per_card/gpu_memory_gb_per_card: single-card KV budget, choose one:- If you know how much GB each card reserves for KV cache, fill in
hbm_kv_gb_per_card. - If you only know the total GPU memory, fill in
gpu_memory_gb_per_card; the tool subtracts the weight shard and the runtime reserve to derive the KV budget.
- If you know how much GB each card reserves for KV cache, fill in
runtime_reserve_gb_per_card: per-card runtime reserve (optional).
All memory budget fields are per-card, with names ending in
_per_card. Do not use any*_per_machinenaming.
TPS and planning anchors — required to obtain "minimum cards / minimum machines":
total_tps+total_tps_unit: raw throughput input.total_tps_unitis one ofcluster_total/per_machine/per_card; the tool normalizes everything to the cluster-total value.baseline_per_card_tps: per-card TPS with zero cache hits. The starting point for planning; required if you want "minimum cards / minimum machines."planning_target_total_tps: the target total TPS for the cluster.extra_capacity_tiers: optional array describing extra host / SSD KV tiers per machine. Each entry has two fields:label(display name in the report, e.g."HBM+1T per machine") andkv_gb_per_machine(extra GB per machine, e.g.1024for 1 TiB).
Global parameters:
prefill_savings_alpha: how much of the prefill cost saved by a cache hit translates into throughput gain. Default0.8.include_output_kvcache: when prefill and decode are not separated (PD-not-disaggregated), whether to count output-stage KV cache toward occupancy and hit rate. Defaulttrue.
For trace-free mode you also fill in a scenario description under heuristic_multi_agent — these are the core inputs of the cold-start estimate:
concurrent_agents: number of concurrent Agents.shared_prefix_tokens: length (in tokens) of the prefix shared by all Agents (e.g. system prompt + tool descriptions).avg_new_tokens_per_turn: average new tokens added per conversation turn.avg_turns_per_session: average number of turns per session.private_window_tokens: per-Agent private context window size.curve_mode: shape of the hit-rate-vs-capacity curve. One oflinear/power_law_fit/zipf_harmonic. Defaultpower_law_fit.zipf_s: Zipf shape parameter, used by bothpower_law_fitandzipf_harmonic.policy_efficiency.lru_like: efficiency coefficient (≤ 1.0) approximating the loss of an online policy relative to the theoretical optimum.
These five scenario fields are exactly what
recommended_heuristic_config.jsonaims to refill — aftercalibrate-multi-agent, copy the recommended values back into yourheuristic_multi_agentblock.
Every run drops a set of files under --output-dir. You don't need to open all of them at first — find the table that answers your question and stop there.
hit_summary.csv: the core hit-rate table. One row per length bucket, listing the content upper bound, relaxed upper bound, exact strict-prefix optimum, and LRU baseline at the current HBM budget, plus a columnHBM Current Bottleneckwhose value isCapacity,Policy, orNone— telling you directly whether to add memory or change the policy. 90% of the time this is the only file you need.tier_summary.csv: capacity-tier comparison. Want to know "how much extra hit rate do I get by adding 1 TiB host KV, or 10 TiB SSD?" Look here. It expands HBM, HBM+1T, HBM+10T, ... into a long table. Each row gives the strict-prefix and LRU hit rates, the columnsStrict-Prefix Gain vs Previous Tier/LRU Gain vs Previous Tier, and aCurrent Bottleneckcolumn with the same values as above.
planning_strict_prefix.csv: planning based on the exact strict-prefix hit rate. Represents the theoretical upper bound — the minimum cards / machines required to hit the target TPS under the most ideal policy.planning_lru.csv: planning based on the LRU baseline hit rate. Represents reality — the minimum cards / machines if you actually run LRU. Comparing the two tells you "is policy optimization worth the engineering cost?"- The two planning files use the same conversion pipeline: hit rate → TPS gain → cluster TPS capacity → minimum cards / machines. They differ only in which hit rate is used.
- The "minimum cards / minimum machines" columns require both
baseline_per_card_tpsandplanning_target_total_tpsin the config; otherwise only the basic TPS-gain columns are emitted.
It joins the hit-rate columns from hit_summary.csv with the planning columns from planning_strict_prefix.csv, giving a quick one-table view of "current HBM hit rate + minimum machines needed." Use it for a quick glance; for serious comparison, prefer the three split tables above.
heuristic_summary.csv/heuristic_tier_summary.csv: same shape ashit_summary.csv/tier_summary.csv, but the numbers come from heuristic estimation instead of trace analysis.heuristic_report.zh.md/heuristic_report.en.md: bilingual explanatory reports describing the assumptions, formulas, parameters, and uncertainty bounds of this estimate. If you are unsure how much to trust the heuristic numbers, read this first.
calibration.json: calibration summary, including the trace-derived target values, the bestzipf_s/lru_likefound by grid search, and per-tier errors.calibration_trials.csv: every grid-search trial. Useful for visualizing the parameter landscape and locating the minimum-error point.calibrated_config.json: heuristic config with the best parameters filled in. Drop it straight intoestimate-multi-agent.recommended_heuristic_config.json: a "scenario description template" reverse-engineered from the trace structure. If the calibrated hit rate still differs heavily from observation, no amount ofzipf_s/lru_liketweaking will save you — the structural parameters (shared-prefix length, private window, concurrent Agents, ...) are off. Apply this template to yourheuristic_multi_agentblock and rerun.
metadata.json: this run's "receipt." Records which config was used, how many records were loaded, and the normalized deployment shape. Convenient when reviewing later: "what assumptions did this run actually use?"details.json: per-bucket intermediate details. Use it to look up specific numbers (e.g. the equivalent card count for a given bucket).correctness_report.zh.md/correctness_report.en.md: emitted only byaudit-buckets. Lists, per result, the proof path and the upper / lower bounds — i.e. what is formally proved versus what is supporting evidence.
When you need more depth:
docs/four_layer_model.md: the public-facing "Capacity → Hit rate → TPS → Machine count" four-layer framework. Recommended as the first deeper read if this is your first encounter with the tool.docs/design_guide.md: implementation invariants, algorithmic choices, and milestone planning. Read before changing code or extending features.docs/correctness_guide.md: precise definitions of every result, the proof scope, and how to read the result tables. Read when you have doubts about the numbers.