Description
Hi, during the first step in process_requests, the code has a 20% probability of "detecting a prefix" in the context (unless run with --disable-prefix-caching).
If we hit this 20% probability, it creates a prefix entry in the Python metadata structure (a PrefixCacheEntry, if one does not already exist). It then attempts to measure how long it takes to load this prefix from the cache, and subtracts the prefix tokens from the request's context.
The issue is that the key for this prefix is never actually written to the cache. So access_cache() loads nothing (it always returns None), yet we still measure the time of this operation, and we still subtract the tokens from the context.
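To make the failure mode concrete, here is a toy model of the behavior described above. ToyCache, its tier name, and its latency value are invented for illustration; only the contract taken from the report is assumed: access_cache() returns (location, latency) and yields (None, 0.0) for a key that was never written.

```python
# Minimal sketch (assumed semantics, not the real benchmark classes):
# a lookup for a key that was never allocated returns (None, 0.0),
# so the latency accumulator gains nothing.

class ToyCache:
    def __init__(self):
        self.cache_entries = {}  # key -> tier location

    def allocate_cache(self, key, tokens):
        # Write path. In the bug being reported, this is never
        # called for prefix keys.
        self.cache_entries[key] = "dram"

    def access_cache(self, key):
        if key not in self.cache_entries:
            return None, 0.0  # miss: nothing to load, zero latency
        return self.cache_entries[key], 0.05  # hit: location + simulated read latency

cache = ToyCache()
# Prefix metadata exists, but allocate_cache() was never called for the
# key, so every lookup misses:
location, read_lat = cache.access_cache("kv_system_2ed87f49657692a2")
print(location, read_lat)  # None 0.0
```

This matches the observation that storage_latency only ever accumulates 0.0 for prefix reads.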
I did a quick test to verify this. I added this code in benchmark.py:
```python
# 1. Check for a prefix cache hit.
if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        # ADDED CODE: Check if prefix was actually loaded from cache
        if location is not None:
            print(f"🎯 PREFIX CACHE HIT! Key: {prefix_entry.kv_cache_key}, Location: {location}")
            exit(1)
        else:
            print(f"❌ PREFIX CACHE MISS (as expected). Key: {prefix_entry.kv_cache_key}")
        ##### end of ADDED CODE
        storage_latency += read_lat
        request.context_tokens = remaining_tokens
```

The only results I obtained were misses:
```
Starting benchmark... Mode: standard (prefill+decode)
--------------------------------------------------------------------------------
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_2ed87f49657692a2
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
❌ PREFIX CACHE MISS (as expected). Key: kv_system_ea8ffc489f80c165
❌ PREFIX CACHE MISS (as expected). Key: kv_system_0b7588a77cb9c73b
================================================================================
BENCHMARK RESULTS - MLPerf KV Cache Storage Benchmark
```

If we don't want to modify the code, there is a workaround: simply disable the feature entirely with --disable-prefix-caching, since it currently provides no meaningful functionality.
Additionally, the following parameter is documented but never used in the code:

| Parameter | Default | Description |
|---|---|---|
| `max_prefix_entries` | 1000 | Prefix cache capacity. LRU eviction kicks in when this limit is reached. (...) |
Below is the AI-generated analysis:
Summary
The prefix caching feature is currently incomplete and does not match the design specification in docs/MLperf v3 KV cache proposal.md or the documented behavior in README.md. While the metadata tracking and token reduction work correctly, the actual KV cache storage, reads, and LRU eviction are not implemented.
Expected Behavior (per Design Document)
According to docs/MLperf v3 KV cache proposal.md (Section 3.10, lines 1199-1211):
```
t=0 User A: "You are helpful..." + "Hello"
    → Miss → Full prefill → Store as kv_system_a1b2c3d4
t=1 User B: "You are helpful..." + "Hi"
    → HIT → Read cached prefix → Only prefill "Hi"
t=2 [LRU eviction of kv_system_a1b2c3d4]
t=3 User C: "You are helpful..." + "Hey"
    → Miss → Full prefill → Re-store
```
Expected lifecycle:
- First request with system prompt → MISS → Write prefix to cache → Store as `kv_system_{hash}`
- Second request with same prompt → HIT → Read prefix from cache → Only process remaining tokens
- When cache is full → LRU eviction removes least recently used prefix
- After eviction → MISS → Re-write prefix to cache
Actual Behavior
The implementation only performs metadata tracking and token reduction:
- ✅ Creates `PrefixCacheEntry` metadata object
- ✅ Reduces `request.context_tokens` by prefix length
- ❌ Never writes prefix KV cache to storage
- ❌ Never reads prefix KV cache from storage (always returns `None`)
- ❌ Never evicts prefix entries (no LRU logic exists)
Code Evidence
Issue 1: No Cache Writes
File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 67-91)
```python
def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))
        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                entry = PrefixCacheEntry(
                    kv_cache_key=f"kv_system_{prefix_hash}",  # ← Key created
                    token_count=len(system_prompt.split()),
                    ...
                )
                self.prefix_index[prefix_hash] = entry  # ← Only metadata stored!
                return entry  # ← No cache.allocate_cache() call!
```

**Problem:** Creates metadata but never calls `self.cache.allocate_cache()` to write the actual KV cache data.
Issue 2: Cache Reads Always Fail
File: kv_cache_benchmark/kv_cache/benchmark.py (lines 508-515)
```python
if self.prefix_cache_manager:
    prefix_entry, remaining_tokens = self.prefix_cache_manager.check_prefix_cache(request, self.model_config)
    if prefix_entry:
        cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
        _, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
        # ↑ This ALWAYS returns (None, 0.0) because the key was never written!
        storage_latency += read_lat  # ← Always adds 0.0
        request.context_tokens = remaining_tokens  # ← This works (token reduction)
```

**Problem:** `access_cache()` is called with the `kv_system_{hash}` key, but this key never exists in `cache_entries` because it was never written. The call always returns `(None, 0.0)`.
Issue 3: No LRU Eviction
File: kv_cache_benchmark/kv_cache/prefix_cache.py (lines 97-99)
```python
def __init__(self, cache, max_prefix_entries: int = None):
    self.cache = cache
    self.max_prefix_entries = max_prefix_entries if max_prefix_entries is not None else cfg('prefix_cache', 'max_prefix_entries', default=1000)
    # ↑ Parameter set but NEVER used anywhere in the code!
```

**Problem:** The `max_prefix_entries` parameter is configured but never checked. No eviction logic exists. The `prefix_index` dictionary grows unbounded.
Verification:
```shell
$ grep -rn "self.max_prefix_entries" kv_cache_benchmark/kv_cache/
kv_cache_benchmark/kv_cache/prefix_cache.py:99:        self.max_prefix_entries = ...
# ↑ Only appears once (assignment), never read!
```

Impact
| Feature | Status | Impact |
|---|---|---|
| Token reduction | ✅ Works | Correctly reduces user request size |
| Metadata tracking | ✅ Works | Tracks prefix usage statistics |
| Cache storage | ❌ Missing | No actual caching happens |
| Cache reads | ❌ Broken | No I/O measured for prefix reads |
| LRU eviction | ❌ Missing | max_prefix_entries is dead code |
| Storage I/O metrics | ❌ Inaccurate | Underestimates read I/O |
| Memory pressure | ❌ Incorrect | No cache pressure from prefixes |
Reproduction Steps
- Run benchmark with `--enable-prefix-caching`
- Check `prefix_cache_stats` in output JSON
- Observe `system_prompt_hits > 0` (metadata hits)
- Check cache directory - no `kv_system_*` files exist
- Check storage read metrics - no reads for prefix cache
Proposed Fix
Option 1: Implement Full Prefix Caching (Match Design Spec)
Changes needed in prefix_cache.py:
```python
def detect_system_prompt(self, context_tokens: int) -> Optional[PrefixCacheEntry]:
    """Simulates the detection of a common system prompt at the start of a request."""
    system_prompt_hit_probability = cfg('prefix_cache', 'system_prompt_hit_probability', default=0.2)
    if random.random() < system_prompt_hit_probability:
        system_prompt = random.choice(self.COMMON_SYSTEM_PROMPTS)
        prefix_hash = self.hash_prefix(system_prompt, len(system_prompt.split()))
        with self.lock:
            if prefix_hash in self.prefix_index:
                entry = self.prefix_index[prefix_hash]
                entry.use_count += 1
                entry.last_used = datetime.now()
                return entry
            else:
                # NEW: Check if we need to evict
                if len(self.prefix_index) >= self.max_prefix_entries:
                    self._evict_lru_prefix()
                entry = PrefixCacheEntry(
                    prefix_key=f"system_{prefix_hash}",
                    prefix_type=PrefixType.SYSTEM_PROMPT,
                    text_hash=prefix_hash,
                    token_count=len(system_prompt.split()),
                    kv_cache_key=f"kv_system_{prefix_hash}",
                    use_count=1
                )
                self.prefix_index[prefix_hash] = entry
                # NEW: Actually write the prefix to cache
                self.cache.allocate_cache(
                    entry.kv_cache_key,
                    entry.token_count,
                    InferencePhase.PREFILL
                )
                return entry
    return None

def _evict_lru_prefix(self):
    """Evict the least recently used prefix entry."""
    if not self.prefix_index:
        return
    # Find LRU entry
    lru_hash = min(self.prefix_index.keys(),
                   key=lambda h: self.prefix_index[h].last_used)
    lru_entry = self.prefix_index[lru_hash]
    # Delete from cache storage
    # Note: This requires adding a delete method or using cache internals
    if lru_entry.kv_cache_key in self.cache.cache_entries:
        # Trigger eviction through the normal cache mechanism
        # or implement direct deletion
        pass
    # Remove from metadata
    del self.prefix_index[lru_hash]
```

Changes needed in benchmark.py:
```python
# Line 513 - Handle cache miss for prefix
if prefix_entry:
    cache_type = 'system' if prefix_entry.prefix_type == PrefixType.SYSTEM_PROMPT else 'common'
    location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
    # NEW: If prefix was evicted from cache, re-write it
    if location is None:
        self.cache.allocate_cache(
            prefix_entry.kv_cache_key,
            prefix_entry.token_count,
            InferencePhase.PREFILL
        )
        # Try reading again
        location, read_lat = self.cache.access_cache(prefix_entry.kv_cache_key, request.phase, cache_type)
    storage_latency += read_lat
    request.context_tokens = remaining_tokens
```

Option 2: Document Current Behavior (Quick Fix)
If the current "virtual optimization" behavior is intentional, update documentation:
Update docs/MLperf v3 KV cache proposal.md Section 3.10:
### 3.10 Prefix Caching: System Prompt Optimization
**Note:** The current implementation uses a "virtual" prefix cache optimization.
Prefix metadata is tracked, and request token counts are reduced accordingly,
but the actual prefix KV cache data is not stored or read from the cache tiers.
This reduces the amount of user data written but does not add prefix read I/O
to the benchmark measurements.
**Lifecycle:**

```
t=0 User A: "You are helpful..." + "Hello"
    → Detect prefix → Reduce tokens → Write only "Hello" portion
t=1 User B: "You are helpful..." + "Hi"
    → Detect prefix → Reduce tokens → Write only "Hi" portion
```
Update README.md:
### Prefix Cache Settings
**Note:** Prefix caching currently operates as a "virtual optimization" - it reduces
the size of user requests by the prefix token count, but does not actually store or
read prefix KV cache data. The `max_prefix_entries` parameter is reserved for future
implementation of full prefix cache eviction.

Recommendation
Option 1 (full implementation) is recommended to match the design specification and provide accurate storage I/O measurements for workloads with shared system prompts.
Option 2 (documentation update) is acceptable if the current behavior is intentional and the design document needs to be updated to reflect the actual implementation.
Additional Notes
- The `bytes_saved` metric in `prefix_cache_stats` is misleading - it counts "virtual" savings from token reduction, not actual cache reuse
- The design document explicitly shows prefix cache entries being evicted (line 1207), which is not implemented
- The README states "LRU eviction kicks in when this limit is reached", which is factually incorrect
Environment
- Version: v3.0.0b1
- Files affected:
  - `kv_cache_benchmark/kv_cache/prefix_cache.py`
  - `kv_cache_benchmark/kv_cache/benchmark.py`
  - `kv_cache_benchmark/docs/MLperf v3 KV cache proposal.md`
  - `kv_cache_benchmark/README.md`