Prefix cache scorer should support hybrid memory allocator #1775

@liu-cong

What would you like to be added:

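Sliding window attention (SWA) doesn't take the full prefix into account: an SWA layer attends only to the most recent tokens within its window, so prefix caching for those layers does not depend on the whole prefix. See:
https://docs.vllm.ai/en/latest/design/hybrid_kv_cache_manager.html#prefix-caching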

While the existing full-prefix matching algorithm should still improve on routing with no prefix awareness, an SWA-optimized algorithm that aligns with vLLM's eviction behavior for SWA should work better.

A small design doc should be presented to discuss the implementation and how it works with the existing full-prefix matching algorithm.

The key here is that the indexer needs to capture:

  • the number of layers using full attention
  • the number of layers using SWA
  • the sliding window (SW) size

With the above info, the indexer can simulate the cache eviction process more closely, staying as close as possible to the inference engine's behavior.
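
To make the idea concrete, here is a minimal sketch of how an indexer might estimate a prefix cache hit for a hybrid model. It is not the scorer's or vLLM's actual implementation. `HybridModelInfo`, `longest_hybrid_hit`, the two cached-block sets, and the block size of 16 are all hypothetical names and values chosen for illustration. The sketch assumes that full attention needs every block from the start of the prompt to be cached, while SWA only needs the blocks covering the last window before the hit point, and it rounds the window up to whole blocks as an approximation.

```python
from dataclasses import dataclass
from math import ceil
from typing import Sequence, Set


@dataclass(frozen=True)
class HybridModelInfo:
    """Hypothetical per-model metadata the indexer would capture."""
    num_full_attention_layers: int
    num_swa_layers: int
    sliding_window_tokens: int
    block_size: int = 16  # assumed KV block size in tokens

    @property
    def window_blocks(self) -> int:
        # Approximation: number of blocks needed to cover one window.
        return ceil(self.sliding_window_tokens / self.block_size)


def longest_hybrid_hit(
    block_hashes: Sequence[str],
    full_cached: Set[str],
    swa_cached: Set[str],
    info: HybridModelInfo,
) -> int:
    """Return how many leading blocks count as a hit for both layer types."""
    # Full attention: longest contiguous run of cached blocks from the start.
    full_hit = 0
    for h in block_hashes:
        if h not in full_cached:
            break
        full_hit += 1

    # With no SWA layers, the full-prefix result is the answer.
    if info.num_swa_layers == 0:
        return full_hit

    # SWA: shrink the candidate hit until the trailing window is fully cached.
    for n in range(full_hit, 0, -1):
        start = max(0, n - info.window_blocks)
        if all(h in swa_cached for h in block_hashes[start:n]):
            return n
    return 0


info = HybridModelInfo(
    num_full_attention_layers=24,
    num_swa_layers=24,
    sliding_window_tokens=32,
)
blocks = ["a", "b", "c", "d", "e"]
print(longest_hybrid_hit(blocks, {"a", "b", "c", "d"}, {"b", "c"}, info))
```

In this example the full attention layers could reuse four blocks, but the SWA layers have evicted block `d`, so the usable hit shrinks to the three-block prefix whose trailing window (`b`, `c`) is still cached. A design doc would need to decide how the indexer learns which SWA blocks have been evicted, which this sketch takes as given.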

Why is this needed:
