
[RFC]: [Store] KVCache offloading to SSD in DFS #578

@SgtPepperr

Description


Changes proposed

As mentioned in the earlier issues #171 and #333, offloading the KV cache to SSDs to support Mooncake's multi-level caching mechanism further improves the KV cache reuse rate and addresses limited DRAM capacity in certain scenarios.

We have currently implemented Version 1 of KV cache offloading in #437, with the following mechanisms:

  • Client-side persistence: We offload and persist the KV cache on DFS (3FS) to facilitate unified file synchronization across nodes. All read/write/query operations on KV cache objects are performed entirely on the client side; the master node remains unaware of them. The mapping from keys to KV cache objects in the file system uses a fixed indexing scheme: each file holds one KV cache object, and the filename serves as the key.
  • POSIX read/write: All file I/O currently goes through POSIX interfaces. For put/batchput, a persistence request is submitted to the thread pool only after the in-memory write succeeds; the file write itself is not further verified, and if it fails the file is deleted automatically so that other instances cannot index it. For get, synchronous reads are used, while batchget issues asynchronous batch reads to improve throughput. (A minimal sketch of this client-side path follows the list.)
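
Below is a minimal sketch of this client-side path, assuming a plain DFS mount directory and an illustrative class name (DfsPersister); it is not the actual Mooncake implementation, only the shape of the key-to-filename mapping, the delete-on-failure put, and the synchronous get described above.

```cpp
// Sketch of client-side persistence over a DFS mount (names are assumptions).
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

class DfsPersister {
 public:
  explicit DfsPersister(std::string mount_dir) : mount_dir_(std::move(mount_dir)) {}

  // Fixed indexing: one file per KV cache object, the key is the filename.
  std::string PathForKey(const std::string& key) const {
    return mount_dir_ + "/" + key;
  }

  // Called from the persistence thread pool after the in-memory write has
  // already succeeded. On any I/O failure the partially written file is
  // removed so that other instances never index an incomplete object.
  bool Put(const std::string& key, const std::vector<char>& value) {
    const std::string path = PathForKey(key);
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    ssize_t written = ::write(fd, value.data(), value.size());
    ::close(fd);
    if (written != static_cast<ssize_t>(value.size())) {
      ::unlink(path.c_str());  // delete on failure to keep the index clean
      return false;
    }
    return true;
  }

  // get uses a plain synchronous POSIX read; batchget would issue these
  // reads asynchronously (e.g. via a thread pool) to improve throughput.
  bool Get(const std::string& key, std::vector<char>& value) {
    const std::string path = PathForKey(key);
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    off_t size = ::lseek(fd, 0, SEEK_END);
    ::lseek(fd, 0, SEEK_SET);
    value.resize(static_cast<size_t>(size));
    ssize_t n = ::read(fd, value.data(), value.size());
    ::close(fd);
    return n == static_cast<ssize_t>(size);
  }

 private:
  std::string mount_dir_;
};
```

A data node would construct something like `DfsPersister("/mnt/3fs/kvcache")` (path chosen here only for illustration) and hand `Put` tasks to its persistence thread pool after each successful in-memory write.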

Future To-Do List

  1. Native 3FS Interface (Merged)
    Since the ultimate goal is to support this persistence feature natively on 3FS, and the current POSIX implementation (via FUSE) still limits I/O performance, we plan to introduce a 3FS-native plugin interface to further optimize file read performance for get/batchget (a rough interface sketch appears after this list).

  2. Master-Managed KV Cache in SSD (Merged)
    The current implementation manages the SSD KV cache on the client side, with metadata synchronization handled by DFS (the master remains unaware). While this keeps components loosely coupled, the lack of centralized management introduces consistency and performance issues. We plan to migrate KV cache metadata to the master, extending the replica mechanism to cover both memory and disk modes (a metadata sketch appears after this list). Benefits include:

    • Reduced query latency: Currently, query/exist operations require filesystem access, incurring high overhead for large datasets. Moving metadata to the master enables single-RPC lookups for SSD/memory status.
    • Consistent behavior: Ensures alignment with memory semantics for operations like removeAll and tearDownAll.
    • Race condition mitigation: Resolves issues like "remove-before-write" through centralized coordination.
  3. File Eviction Mechanism (WIP)
    Currently, file deletion relies on manual user calls (remove/removeAll) or admin intervention. Without automatic eviction, long-running clusters risk storage bloat. Future versions will introduce monitoring and auto-eviction policies (one possible policy is sketched after this list).

  4. Master-Triggered Eviction & Persistence (WIP)
    Presently, every successful put triggers persistence, effectively backing up KV cache entries. We aim to shift persistence to the master’s eviction phase, where evicted data is written to SSDs. Challenges include:

    • The master currently handles only metadata, not data flow.
    • Data distribution across nodes complicates persistence during eviction.
      A well-designed solution will be explored in future iterations (a rough sketch of the eviction-time spill path appears after this list).
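
For item 1, a pluggable backend is one natural shape for a 3FS-native plugin interface. The sketch below is only an assumed illustration (the names StorageBackend and BatchRead are hypothetical), not the merged plugin's actual API.

```cpp
// Assumed sketch of a pluggable storage-backend interface.
#include <cstddef>
#include <string>
#include <vector>

struct ReadRequest {
  std::string key;         // object to read
  std::vector<char>* out;  // destination buffer
};

class StorageBackend {
 public:
  virtual ~StorageBackend() = default;
  virtual bool Write(const std::string& key, const std::vector<char>& value) = 0;
  virtual bool Read(const std::string& key, std::vector<char>& value) = 0;
  // Batch reads benefit most from a native interface: the backend can submit
  // all requests at once instead of paying one FUSE round trip per file.
  virtual size_t BatchRead(std::vector<ReadRequest>& requests) = 0;
};
// A POSIX backend would implement these calls with open/read/write as above,
// while a 3FS-native backend would map them onto 3FS's own I/O primitives.
```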
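For item 2, the following is a minimal sketch of master-side metadata extended with a storage medium, assuming an in-process map stands in for the master's metadata store; the type names (ReplicaMedium, ReplicaInfo, MasterMetadata) are illustrative only.

```cpp
// Sketch: replica metadata covering both memory and disk modes.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class ReplicaMedium : uint8_t { kMemory, kDisk };

struct ReplicaInfo {
  ReplicaMedium medium;
  std::string location;  // segment address for memory, file path for disk
};

struct ExistResult {
  bool in_memory = false;
  bool on_disk = false;
};

class MasterMetadata {
 public:
  void AddReplica(const std::string& key, ReplicaInfo info) {
    replicas_[key].push_back(std::move(info));
  }

  // A single lookup answers both "is it in DRAM?" and "is it on SSD?",
  // so exist/query no longer needs to touch the filesystem.
  ExistResult Exist(const std::string& key) const {
    ExistResult result;
    auto it = replicas_.find(key);
    if (it == replicas_.end()) return result;
    for (const auto& r : it->second) {
      if (r.medium == ReplicaMedium::kMemory) result.in_memory = true;
      else result.on_disk = true;
    }
    return result;
  }

  // Centralized removal keeps removeAll semantics consistent across memory
  // and disk replicas and avoids remove-before-write races.
  void RemoveAll(const std::string& key) { replicas_.erase(key); }

 private:
  std::unordered_map<std::string, std::vector<ReplicaInfo>> replicas_;
};
```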
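For item 3, one possible auto-eviction policy is a periodic scan that deletes the least recently accessed files until usage drops below a high-water mark. The policy, thresholds, and function name below are assumptions; the actual mechanism is still being designed.

```cpp
// Sketch: capacity-based LRU eviction over the KV cache directory.
#include <sys/stat.h>
#include <algorithm>
#include <cstdint>
#include <ctime>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

void EvictUntilBelow(const std::string& dir, uintmax_t high_water_bytes) {
  struct Entry { fs::path path; std::time_t atime; uintmax_t size; };
  std::vector<Entry> entries;
  uintmax_t total = 0;

  for (const auto& e : fs::directory_iterator(dir)) {
    if (!e.is_regular_file()) continue;
    struct stat st{};
    if (::stat(e.path().c_str(), &st) != 0) continue;
    entries.push_back({e.path(), st.st_atime, static_cast<uintmax_t>(st.st_size)});
    total += static_cast<uintmax_t>(st.st_size);
  }
  if (total <= high_water_bytes) return;

  // Evict the least recently accessed files first.
  std::sort(entries.begin(), entries.end(),
            [](const Entry& a, const Entry& b) { return a.atime < b.atime; });
  for (const auto& e : entries) {
    if (total <= high_water_bytes) break;
    if (fs::remove(e.path)) total -= e.size;
  }
}
```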
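For item 4, the rough idea is that persistence moves from "on every put" to "when the master evicts". Because the master holds only metadata, it would have to ask the node owning the data to spill it to SSD before the memory is freed. The handler below is a hedged sketch of that spill path; all names are hypothetical and the data-flow question remains open.

```cpp
// Sketch: evict-time spill to SSD on the data node, triggered by the master.
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

class DataNodeEvictionHandler {
 public:
  using PersistFn = std::function<bool(const std::string&, const std::vector<char>&)>;

  explicit DataNodeEvictionHandler(PersistFn persist) : persist_(std::move(persist)) {}

  void PutInMemory(const std::string& key, std::vector<char> value) {
    memory_[key] = std::move(value);
  }

  // Called (e.g. via RPC) when the master selects `key` for eviction.
  // The entry is freed only after the SSD write succeeds, so the master can
  // switch its metadata for the key from "memory" to "disk" atomically.
  bool OnEvict(const std::string& key) {
    auto it = memory_.find(key);
    if (it == memory_.end()) return false;         // data already gone
    if (!persist_(key, it->second)) return false;  // keep in DRAM on failure
    memory_.erase(it);
    return true;
  }

 private:
  PersistFn persist_;
  std::unordered_map<std::string, std::vector<char>> memory_;
};
```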

We welcome feedback and suggestions on this design and implementation.

