
KVCache: Prefix Cache Implementation Does Not Match Design Specification #283

@LouisDDN

Description

Hi, during the first step in process_requests, the code has a 20% probability of "detecting a prefix" in the context (unless we run with --disable-prefix-caching).

If we hit this 20% probability, the code creates a prefix entry in the Python-side metadata structure (a PrefixCacheEntry, if one does not already exist). It then measures "how long it takes to load this prefix from the cache" and subtracts the prefix tokens from the request's context.

However, the issue is that the key for this prefix was never actually written to the cache. So access_cache() doesn't load anything (it always returns None), but we still measure the time of this operation, and we still subtract the tokens from the context.

I did a quick test to verify this. I added this code in benchmark.py:

# 1. Check for a prefix cache hit.
if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        
        # ADDED CODE: Check if prefix was actually loaded from cache
        if location is not None:
            print(f"🎯 PREFIX CACHE HIT! Key: {prefix_entry.kv_cache_key}, Location: {location}")
            exit(1)
        else:
            print(f"❌ PREFIX CACHE MISS (as expected). Key: {prefix_entry.kv_cache_key}")
        ##### end of ADDED CODE
        
        storage_latency += read_lat
        request.context_tokens = remaining_tokens

The only results I obtained were misses:

Starting benchmark... Mode: standard (prefill+decode)
--------------------------------------------------------------------------------
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b

================================================================================
BENCHMARK RESULTS - MLPerf KV Cache Storage Benchmark

If we don't want to modify the code, there is a workaround: simply disable the feature entirely using --disable-prefix-caching, as it doesn't currently provide meaningful functionality.

Additionally, the following parameter is documented but never used in the code:

| Parameter | Default | Description |
| --- | --- | --- |
| `max_prefix_entries` | 1000 | Prefix cache capacity. LRU eviction kicks in when this limit is reached. (...) |

Below is the AI-generated analysis:

Summary

The prefix caching feature is currently incomplete and does not match the design specification in docs/MLperf v3 KV cache proposal.md or the documented behavior in README.md. While the metadata tracking and token reduction work correctly, the actual KV cache storage, reads, and LRU eviction are not implemented.


Expected Behavior (per Design Document)

According to docs/MLperf v3 KV cache proposal.md (Section 3.10, lines 1199-1211):

t=0  User A: "You are helpful..." + "Hello"
     → Miss → Full prefill → Store as kv_system_a1b2c3d4

t=1  User B: "You are helpful..." + "Hi"
     → HIT → Read cached prefix → Only prefill "Hi"

t=2  [LRU eviction of kv_system_a1b2c3d4]

t=3  User C: "You are helpful..." + "Hey"
     → Miss → Full prefill → Re-store

Expected lifecycle:

  1. First request with system prompt → MISS → Write prefix to cache → Store as kv_system_{hash}
  2. Second request with same prompt → HIT → Read prefix from cache → Only process remaining tokens
  3. When cache is full → LRU eviction removes least recently used prefix
  4. After eviction → MISS → Re-write prefix to cache
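This expected lifecycle can be sketched with a tiny LRU map. A capacity of 1 stands in for max_prefix_entries, and all class and method names here are illustrative only, not the benchmark's API:

```python
# Sketch of the expected prefix-cache lifecycle (store / hit / evict /
# re-store) using an OrderedDict as the LRU structure.
from collections import OrderedDict

class LRUPrefixCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()       # key -> token_count, oldest first

    def lookup(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on a hit
            return True                    # HIT: read cached prefix
        return False                       # MISS: full prefill needed

    def store(self, key, token_count):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[key] = token_count

cache = LRUPrefixCache(capacity=1)
events = []
for key in ["kv_system_a1b2c3d4",   # t=0: first request, miss + store
            "kv_system_a1b2c3d4",   # t=1: same prompt, hit
            "kv_system_e5f6a7b8",   # t=2: new prefix forces LRU eviction
            "kv_system_a1b2c3d4"]:  # t=3: evicted prefix, miss + re-store
    if cache.lookup(key):
        events.append("HIT")
    else:
        cache.store(key, token_count=32)
        events.append("MISS+store")
print(events)  # ['MISS+store', 'HIT', 'MISS+store', 'MISS+store']
```

The four events line up with the t=0..t=3 trace from the design document.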

Actual Behavior

The implementation only performs metadata tracking and token reduction:

  1. ✅ Creates PrefixCacheEntry metadata object
  2. ✅ Reduces request.context_tokens by prefix length
  3. ❌ Never writes prefix KV cache to storage
  4. ❌ Never reads prefix KV cache from storage (always returns None)
  5. ❌ Never evicts prefix entries (no LRU logic exists)

Code Evidence

Issue 1: No Cache Writes

File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 67-91)

def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))
        
        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                entry = PrefixCacheEntry(
                    kv_cache_key=f"kv_system_{prefix_hash}",  # ← Key created
                    token_count=len(system_prompt.split()),
                    ...
                )
                self.prefix_index[prefix_hash] = entry  # ← Only metadata stored!
                return entry  # ← No cache.allocate_cache() call!

Problem: Creates metadata but never calls self.cache.allocate_cache() to write the actual KV cache data.

Issue 2: Cache Reads Always Fail

File: kv_cache_benchmark/kv_cache/benchmark.py (lines 508-515)

if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        _, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        # ↑ This ALWAYS returns (None, 0.0) because the key was never written!
        storage_latency += read_lat  # ← Always adds 0.0
        request.context_tokens = remaining_tokens  # ← This works (token reduction)

Problem: access_cache() is called with kv_system_{hash} key, but this key never exists in cache_entries because it was never written. The call always returns (None, 0.0).
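A toy model makes the failure mode obvious: a key that is only ever read, never written, can only miss. The class and tier names below are illustrative, not the benchmark's actual KVCache API.

```python
# Minimal model of Issue 2: reads against a write-less cache always miss.
class ToyCache:
    def __init__(self):
        self.cache_entries = {}            # key -> tier location

    def allocate_cache(self, key, location="dram"):
        # The write path -- the prefix code never calls this.
        self.cache_entries[key] = location

    def access_cache(self, key):
        # Mirrors the observed behavior: unknown key -> (None, 0.0).
        if key not in self.cache_entries:
            return None, 0.0
        return self.cache_entries[key], 0.05  # stand-in read latency

cache = ToyCache()
location, latency = cache.access_cache("kv_system_2ed87f49657692a2")
print(location, latency)  # None 0.0 -- metadata hit, storage miss
```

Once the write path is wired up (Issue 1), the same read returns a location and a nonzero latency.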

Issue 3: No LRU Eviction

File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 97-99)

def __init__(self, cache, max_prefix_entries: int = None):
    self.cache = cache
    self.max_prefix_entries = max_prefix_entries if max_prefix_entries is not None else cfg('prefix_cache', 'max_prefix_entries', default=1000)
    # ↑ Parameter set but NEVER used anywhere in the code!

Problem: The max_prefix_entries parameter is configured but never checked. No eviction logic exists. The prefix_index dictionary grows unbounded.

Verification:

$ grep -rn "self.max_prefix_entries" kv_cache_benchmark/kv_cache/
kv_cache_benchmark/kv_cache/prefix_cache.py:99:        self.max_prefix_entries = ...
# ↑ Only appears once (assignment), never read!

Impact

| Feature | Status | Impact |
| --- | --- | --- |
| Token reduction | ✅ Works | Correctly reduces user request size |
| Metadata tracking | ✅ Works | Tracks prefix usage statistics |
| Cache storage | ❌ Missing | No actual caching happens |
| Cache reads | ❌ Broken | No I/O measured for prefix reads |
| LRU eviction | ❌ Missing | max_prefix_entries is dead code |
| Storage I/O metrics | ❌ Inaccurate | Underestimates read I/O |
| Memory pressure | ❌ Incorrect | No cache pressure from prefixes |

Reproduction Steps

  1. Run benchmark with --enable-prefix-caching
  2. Check prefix_cache_stats in output JSON
  3. Observe system_prompt_hits > 0 (metadata hits)
  4. Check cache directory - no kv_system_* files exist
  5. Check storage read metrics - no reads for prefix cache

Proposed Fix

Option 1: Implement Full Prefix Caching (Match Design Spec)

Changes needed in prefix_cache.py:

def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    """Simulates the detection of a common system prompt at the start of a request."""
    system_prompt_hit_probability = cfg('prefix_cache', 'system_prompt_hit_probability', default=0.2)
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))

        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                # NEW: Check if we need to evict
                if len(self.prefix_index) >= self.max_prefix_entries:
                    self._evict_lru_prefix()

                entry = PrefixCacheEntry(
                    prefix_key=f"system_{prefix_hash}",
                    prefix_type=PrefixType.SYSTEM_PROMPT,
                    text_hash=prefix_hash,
                    token_count=len(system_prompt.split()),
                    kv_cache_key=f"kv_system_{prefix_hash}",
                    use_count=1
                )
                self.prefix_index[prefix_hash] = entry

                # NEW: Actually write the prefix to cache
                self.cache.allocate_cache(
                    entry.kv_cache_key,
                    entry.token_count,
                    InferencePhase.PREFILL
                )

                return entry
    return None

def _evict_lru_prefix(self):
    """Evict the least recently used prefix entry."""
    if not self.prefix_index:
        return

    # Find LRU entry
    lru_hash = min(self.prefix_index.keys(),
                   key=lambda h: self.prefix_index[h].last_used)
    lru_entry = self.prefix_index[lru_hash]

    # Delete from cache storage
    # Note: This requires adding a delete method or using cache internals
    if lru_entry.kv_cache_key in self.cache.cache_entries:
        # Trigger eviction through the normal cache mechanism
        # or implement direct deletion
        pass

    # Remove from metadata
    del self.prefix_index[lru_hash]
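As a possible refinement (an assumption on my part, not part of the proposal above): if prefix_index were a collections.OrderedDict that is refreshed on every hit, the min() scan over timestamps becomes an O(1) popitem. A free-standing sketch with illustrative names:

```python
# Hypothetical O(1) variant of _evict_lru_prefix: keep prefix_index in
# recency order and pop the oldest entry directly.
from collections import OrderedDict

prefix_index = OrderedDict()              # hash -> entry, oldest first

def touch(prefix_hash):
    """Refresh recency when a prefix gets a metadata hit."""
    prefix_index.move_to_end(prefix_hash)

def evict_lru_prefix():
    """Drop the least recently used entry in O(1)."""
    if prefix_index:
        return prefix_index.popitem(last=False)  # (hash, entry) of LRU
    return None, None

prefix_index["aaa"] = "entry_a"
prefix_index["bbb"] = "entry_b"
touch("aaa")                               # "bbb" is now the LRU entry
print(evict_lru_prefix()[0])               # bbb
```

The trade-off is that every hit must call move_to_end(); with a max of ~1000 entries the min() scan is also perfectly acceptable, so this is purely a style choice.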

Changes needed in benchmark.py:

# Line 513 - Handle cache miss for prefix
if prefix_entry:
    cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
    location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)

    # NEW: If prefix was evicted from cache, re-write it
    if location is None:
        self.cache.allocate_cache(
            prefix_entry.kv_cache_key,
            prefix_entry.token_count,
            InferencePhase.PREFILL
        )
        # Try reading again
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)

    storage_latency += read_lat
    request.context_tokens = remaining_tokens

Option 2: Document Current Behavior (Quick Fix)

If the current "virtual optimization" behavior is intentional, update documentation:

Update docs/MLperf v3 KV cache proposal.md Section 3.10:

### 3.10 Prefix Caching: System Prompt Optimization

**Note:** The current implementation uses a "virtual" prefix cache optimization.
Prefix metadata is tracked, and request token counts are reduced accordingly,
but the actual prefix KV cache data is not stored or read from the cache tiers.
This reduces the amount of user data written but does not add prefix read I/O
to the benchmark measurements.

**Lifecycle:**

t=0 User A: "You are helpful..." + "Hello"
→ Detect prefix → Reduce tokens → Write only "Hello" portion

t=1 User B: "You are helpful..." + "Hi"
→ Detect prefix → Reduce tokens → Write only "Hi" portion

Update README.md:

### Prefix Cache Settings

**Note:** Prefix caching currently operates as a "virtual optimization" - it reduces
the size of user requests by the prefix token count, but does not actually store or
read prefix KV cache data. The `max_prefix_entries` parameter is reserved for future
implementation of full prefix cache eviction.

Recommendation

Option 1 (full implementation) is recommended to match the design specification and provide accurate storage I/O measurements for workloads with shared system prompts.

Option 2 (documentation update) is acceptable if the current behavior is intentional and the design document needs to be updated to reflect the actual implementation.


Additional Notes

  • The bytes_saved metric in prefix_cache_stats is misleading - it counts "virtual" savings from token reduction, not actual cache reuse
  • The design document explicitly shows prefix cache entries being evicted (line 1207), which is not implemented
  • The README states "LRU eviction kicks in when this limit is reached" which is factually incorrect

Environment

  • Version: v3.0.0b1
  • Files affected:
    • kv_cache_benchmark/kv_cache/prefix_cache.py
    • kv_cache_benchmark/kv_cache/benchmark.py
    • kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md
    • kv_cache_benchmark/README.md
