Description
Hi, during the first step in process_requests, the code has a 20% probability of "detecting a prefix" in the context (unless run with --disable-prefix-caching).
If we hit this 20% probability, it creates a prefix entry in the Python metadata structure (a PrefixCacheEntry, if one does not already exist). It then attempts to measure how long it takes to load this prefix from the cache, and subtracts the prefix tokens from the request's context.
The issue is that the key for this prefix is never actually written to the cache. So access_cache() loads nothing (it always returns None), yet we still measure the time of this operation, and we still subtract the tokens from the context.
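To make the failure mode concrete, here is a toy model of the behavior described above. ToyCache, its tier name, and its latency value are invented for illustration; only the contract taken from the report is assumed: access_cache() returns (location, latency) and yields (None, 0.0) for a key that was never written.

```python
# Minimal sketch (assumed semantics, not the real benchmark classes):
# a lookup for a key that was never allocated returns (None, 0.0),
# so the latency accumulator gains nothing.

class ToyCache:
    def __init__(self):
        self.cache_entries = {}  # key -> tier location

    def allocate_cache(self, key, tokens):
        # Write path. In the bug being reported, this is never
        # called for prefix keys.
        self.cache_entries[key] = "dram"

    def access_cache(self, key):
        if key not in self.cache_entries:
            return None, 0.0  # miss: nothing to load, zero latency
        return self.cache_entries[key], 0.05  # hit: location + simulated read latency

cache = ToyCache()
# Prefix metadata exists, but allocate_cache() was never called for the
# key, so every lookup misses:
location, read_lat = cache.access_cache("kv_system_2ed87f49657692a2")
print(location, read_lat)  # None 0.0
```

This matches the observation that storage_latency only ever accumulates 0.0 for prefix reads.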
I did a quick test to verify this. I added this code in benchmark.py:
```python
# 1. Check for a prefix cache hit.
if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        # ADDED CODE: Check if prefix was actually loaded from cache
        if location is not None:
            print(f"🎯 PREFIX CACHE HIT! Key: {prefix_entry.kv_cache_key}, Location: {location}")
            exit(1)
        else:
            print(f"❌ PREFIX CACHE MISS (as expected). Key: {prefix_entry.kv_cache_key}")
        ##### end of ADDED CODE
        storage_latency += read_lat
        request.context_tokens = remaining_tokens
```

The only results I obtained were misses:
```
Starting benchmark... Mode: standard (prefill+decode)
--------------------------------------------------------------------------------
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
================================================================================
BENCHMARK RESULTS - MLPerf KV Cache Storage Benchmark
```

If we don't want to modify the code, there is a workaround: simply disable the feature entirely with --disable-prefix-caching, since it currently provides no meaningful functionality.
Additionally, the following parameter is documented but never used in the code:

| Parameter | Default | Description |
|---|---|---|
| `max_prefix_entries` | 1000 | Prefix cache capacity. LRU eviction kicks in when this limit is reached. (...) |
Below is the AI-generated analysis:
Summary
The prefix caching feature is currently incomplete and does not match the design specification in docs/MLperf v3 KV cache proposal.md or the documented behavior in README.md. While the metadata tracking and token reduction work correctly, the actual KV cache storage, reads, and LRU eviction are not implemented.
Expected Behavior (per Design Document)
According to docs/MLperf v3 KV cache proposal.md (Section 3.10, lines 1199-1211):
```
t=0 User A: "You are helpful..." + "Hello"
    → Miss → Full prefill → Store as kv_system_a1b2c3d4
t=1 User B: "You are helpful..." + "Hi"
    → HIT → Read cached prefix → Only prefill "Hi"
t=2 [LRU eviction of kv_system_a1b2c3d4]
t=3 User C: "You are helpful..." + "Hey"
    → Miss → Full prefill → Re-store
```
Expected lifecycle:
- First request with system prompt → MISS → Write prefix to cache → Store as `kv_system_{hash}`
- Second request with same prompt → HIT → Read prefix from cache → Only process remaining tokens
- When cache is full → LRU eviction removes least recently used prefix
- After eviction → MISS → Re-write prefix to cache
Actual Behavior
The implementation only performs metadata tracking and token reduction:
- ✅ Creates `PrefixCacheEntry` metadata object
- ✅ Reduces `request.context_tokens` by prefix length
- ❌ Never writes prefix KV cache to storage
- ❌ Never reads prefix KV cache from storage (always returns `None`)
- ❌ Never evicts prefix entries (no LRU logic exists)
Code Evidence
Issue 1: No Cache Writes
File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 67-91)
```python
def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))
        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                entry = PrefixCacheEntry(
                    kv_cache_key=f"kv_system_{prefix_hash}",  # ← Key created
                    token_count=len(system_prompt.split()),
                    ...
                )
                self.prefix_index[prefix_hash] = entry  # ← Only metadata stored!
                return entry  # ← No cache.allocate_cache() call!
```

**Problem:** Creates metadata but never calls `self.cache.allocate_cache()` to write the actual KV cache data.
Issue 2: Cache Reads Always Fail
File: kv_cache_benchmark/kv_cache/benchmark.py (lines 508-515)
```python
if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        _, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        # ↑ This ALWAYS returns (None, 0.0) because the key was never written!
        storage_latency += read_lat  # ← Always adds 0.0
        request.context_tokens = remaining_tokens  # ← This works (token reduction)
```

**Problem:** `access_cache()` is called with the `kv_system_{hash}` key, but this key never exists in `cache_entries` because it was never written. The call always returns `(None, 0.0)`.
Issue 3: No LRU Eviction
File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 97-99)
```python
def __init__(self, cache, max_prefix_entries: int = None):
    self.cache = cache
    self.max_prefix_entries = max_prefix_entries if max_prefix_entries is not None else cfg('prefix_cache', 'max_prefix_entries', default=1000)
    # ↑ Parameter set but NEVER used anywhere in the code!
```

**Problem:** The `max_prefix_entries` parameter is configured but never checked. No eviction logic exists. The `prefix_index` dictionary grows unbounded.
Verification:
```shell
$ grep -rn "self.max_prefix_entries" kv_cache_benchmark/kv_cache/
kv_cache_benchmark/kv_cache/prefix_cache.py:99:        self.max_prefix_entries = ...
# ↑ Only appears once (assignment), never read!
```

Impact
| Feature | Status | Impact |
|---|---|---|
| Token reduction | ✅ Works | Correctly reduces user request size |
| Metadata tracking | ✅ Works | Tracks prefix usage statistics |
| Cache storage | ❌ Missing | No actual caching happens |
| Cache reads | ❌ Broken | No I/O measured for prefix reads |
| LRU eviction | ❌ Missing | max_prefix_entries is dead code |
| Storage I/O metrics | ❌ Inaccurate | Underestimates read I/O |
| Memory pressure | ❌ Incorrect | No cache pressure from prefixes |
Reproduction Steps
- Run benchmark with `--enable-prefix-caching`
- Check `prefix_cache_stats` in output JSON
- Observe `system_prompt_hits > 0` (metadata hits)
- Check cache directory - no `kv_system_*` files exist
- Check storage read metrics - no reads for prefix cache
Proposed Fix
Option 1: Implement Full Prefix Caching (Match Design Spec)
Changes needed in prefix_cache.py:
```python
def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    """Simulates the detection of a common system prompt at the start of a request."""
    system_prompt_hit_probability = cfg('prefix_cache', 'system_prompt_hit_probability', default=0.2)
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))
        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                # NEW: Check if we need to evict
                if len(self.prefix_index) >= self.max_prefix_entries:
                    self._evict_lru_prefix()
                entry = PrefixCacheEntry(
                    prefix_key=f"system_{prefix_hash}",
                    prefix_type=PrefixType.SYSTEM_PROMPT,
                    text_hash=prefix_hash,
                    token_count=len(system_prompt.split()),
                    kv_cache_key=f"kv_system_{prefix_hash}",
                    use_count=1
                )
                self.prefix_index[prefix_hash] = entry
                # NEW: Actually write the prefix to cache
                self.cache.allocate_cache(
                    entry.kv_cache_key,
                    entry.token_count,
                    InferencePhase.PREFILL
                )
                return entry
    return None

def _evict_lru_prefix(self):
    """Evict the least recently used prefix entry."""
    if not self.prefix_index:
        return
    # Find LRU entry
    lru_hash = min(self.prefix_index.keys(),
                   key=lambda h: self.prefix_index[h].last_used)
    lru_entry = self.prefix_index[lru_hash]
    # Delete from cache storage
    # Note: This requires adding a delete method or using cache internals
    if lru_entry.kv_cache_key in self.cache.cache_entries:
        # Trigger eviction through the normal cache mechanism
        # or implement direct deletion
        pass
    # Remove from metadata
    del self.prefix_index[lru_hash]
```

Changes needed in benchmark.py:
```python
# Line 513 - Handle cache miss for prefix
if prefix_entry:
    cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
    location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
    # NEW: If prefix was evicted from cache, re-write it
    if location is None:
        self.cache.allocate_cache(
            prefix_entry.kv_cache_key,
            prefix_entry.token_count,
            InferencePhase.PREFILL
        )
        # Try reading again
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
    storage_latency += read_lat
    request.context_tokens = remaining_tokens
```

Option 2: Document Current Behavior (Quick Fix)
If the current "virtual optimization" behavior is intentional, update documentation:
Update docs/MLperf v3 KV cache proposal.md Section 3.10:
### 3.10 Prefix Caching: System Prompt Optimization
**Note:** The current implementation uses a "virtual" prefix cache optimization.
Prefix metadata is tracked, and request token counts are reduced accordingly,
but the actual prefix KV cache data is not stored or read from the cache tiers.
This reduces the amount of user data written but does not add prefix read I/O
to the benchmark measurements.
**Lifecycle:**

```
t=0 User A: "You are helpful..." + "Hello"
    → Detect prefix → Reduce tokens → Write only "Hello" portion
t=1 User B: "You are helpful..." + "Hi"
    → Detect prefix → Reduce tokens → Write only "Hi" portion
```
Update README.md:
### Prefix Cache Settings
**Note:** Prefix caching currently operates as a "virtual optimization" - it reduces
the size of user requests by the prefix token count, but does not actually store or
read prefix KV cache data. The `max_prefix_entries` parameter is reserved for future
implementation of full prefix cache eviction.

Recommendation
Option 1 (full implementation) is recommended to match the design specification and provide accurate storage I/O measurements for workloads with shared system prompts.
Option 2 (documentation update) is acceptable if the current behavior is intentional and the design document needs to be updated to reflect the actual implementation.
Additional Notes
- The `bytes_saved` metric in `prefix_cache_stats` is misleading - it counts "virtual" savings from token reduction, not actual cache reuse
- The design document explicitly shows prefix cache entries being evicted (line 1207), which is not implemented
- The README states "LRU eviction kicks in when this limit is reached", which is factually incorrect
Environment
- Version: v3.0.0b1
- Files affected:
  - `kv_cache_benchmark/kv_cache/prefix_cache.py`
  - `kv_cache_benchmark/kv_cache/benchmark.py`
  - `kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md`
  - `kv_cache_benchmark/README.md`