
PyPTO Serving Weekly Progress 2026‐06‐12

bumble0918 edited this page Jun 12, 2026 · 1 revision


Updated: 2026-06-12

1. Simple/Lingqu Multi-Level Structure Module

flowchart TD
    A["Server"] --> B["L3"]
    B --> C["L2 Prefill"]
    B --> D["L2 Decode"]

The L3 layer is now integrated. The serving layer owns scheduling, l3work dispatches tasks to devices, and model execution reuses l2work to run prefill and decode separately.
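
The layering above can be sketched roughly as follows. This is an illustration only: `Server`, `L3Worker`, `L2Worker`, `Task`, and `Phase` are hypothetical names standing in for the serving layer, l3work, and l2work, and do not reflect the actual PyPTO interfaces.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass
class Task:
    request_id: int
    phase: Phase


class L2Worker:
    """Hypothetical stand-in for l2work: runs one phase on one device."""

    def __init__(self, device_id: int, phase: Phase):
        self.device_id = device_id
        self.phase = phase

    def run(self, task: Task) -> str:
        return f"req {task.request_id}: {self.phase.value} on device {self.device_id}"


class L3Worker:
    """Hypothetical stand-in for l3work: routes each task to an L2 worker."""

    def __init__(self, workers: list[L2Worker]):
        # Assumption for the sketch: exactly one L2 worker per phase.
        self.by_phase = {w.phase: w for w in workers}

    def dispatch(self, task: Task) -> str:
        return self.by_phase[task.phase].run(task)


class Server:
    """Serving layer: decides what runs next, then hands the task to L3."""

    def __init__(self, l3: L3Worker):
        self.l3 = l3

    def serve(self, request_id: int) -> list[str]:
        # Scheduling is reduced to "prefill, then decode" for illustration.
        return [
            self.l3.dispatch(Task(request_id, Phase.PREFILL)),
            self.l3.dispatch(Task(request_id, Phase.DECODE)),
        ]
```

The point of the split is that scheduling decisions stay in the server while device placement stays in the L3 layer, so either side can change without touching the other.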

2. SSD KV Cache Offload Module

flowchart TD
      A[Request] --> B[Scheduler lookup]

      B -->|NPU hit| C[Run on NPU]
      B -->|CPU hit| D[Load KV to NPU]
      B -->|miss| E[Allocate NPU blocks]

      D --> C
      E --> C

      C --> F[Finish / update cache]
      F --> G[Store free cached blocks to CPU]
      G --> H[LRU evict if CPU full]

The current implementation follows the Mooncake approach as a starting point. When a request enters the system, the scheduler first looks up the prefix cache:

  • If the KV cache is already on the NPU, those blocks are used directly for execution.
  • If the KV cache is in the CPU offload area, the corresponding blocks are first loaded back to the NPU before execution.
  • If there is no cache hit, new NPU blocks are allocated and computation proceeds normally.
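
The three lookup outcomes above can be sketched as a small classification step. The names `Hit`, `lookup`, and `prepare` are hypothetical, blocks are identified by a plain string hash, and the two tiers are modeled as sets. Whether the CPU copy is kept after a load is not stated here, so the sketch simply leaves it in place.

```python
from enum import Enum


class Hit(Enum):
    NPU = "npu"
    CPU = "cpu"
    MISS = "miss"


def lookup(block_hash: str, npu_blocks: set, cpu_blocks: set) -> Hit:
    """Classify a prefix-cache lookup, checking the NPU tier first."""
    if block_hash in npu_blocks:
        return Hit.NPU
    if block_hash in cpu_blocks:
        return Hit.CPU
    return Hit.MISS


def prepare(block_hash: str, npu_blocks: set, cpu_blocks: set) -> Hit:
    """Make the block resident on the NPU and report which path was taken."""
    hit = lookup(block_hash, npu_blocks, cpu_blocks)
    if hit is Hit.CPU:
        # CPU hit: load the KV block back to the NPU before execution.
        npu_blocks.add(block_hash)
    elif hit is Hit.MISS:
        # Miss: allocate a new NPU block; its KV is computed during execution.
        npu_blocks.add(block_hash)
    return hit
```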

After NPU execution completes, the scheduler updates the request state and the prefix cache. Complete cached blocks that have finished and are no longer referenced by any request are then stored from the NPU to the CPU. If the CPU offload space is full, the least recently used CPU blocks are evicted.
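
A minimal sketch of the store and eviction policy described above, using `collections.OrderedDict` to track recency. `CpuOffloadPool` and `should_offload` are hypothetical names, capacity is counted in blocks, and KV data is represented as opaque bytes. The real implementation may track these differently.

```python
from collections import OrderedDict


def should_offload(is_complete: bool, ref_count: int) -> bool:
    """A block is stored to CPU only if it is complete and unreferenced."""
    return is_complete and ref_count == 0


class CpuOffloadPool:
    """Fixed-capacity CPU tier with least-recently-used eviction."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1 block")
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> KV bytes, oldest first

    def store(self, block_hash: str, kv: bytes) -> list:
        """Store a block and return the hashes evicted to make room."""
        evicted = []
        if block_hash in self.blocks:
            # Already present: refresh the data and mark as most recent.
            self.blocks[block_hash] = kv
            self.blocks.move_to_end(block_hash)
            return evicted
        while len(self.blocks) >= self.capacity:
            oldest, _ = self.blocks.popitem(last=False)
            evicted.append(oldest)
        self.blocks[block_hash] = kv
        return evicted

    def load(self, block_hash: str) -> bytes:
        """Read a block for a CPU hit and mark it as most recently used."""
        self.blocks.move_to_end(block_hash)
        return self.blocks[block_hash]
```

Counting a load as a use means blocks that keep serving CPU hits survive eviction longer than blocks that were stored once and never read again.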

SSD support is expected to be ready next week. After that, SSD will be integrated by replacing the corresponding interfaces, and performance will be evaluated.
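
One way to make the backend swappable is a small storage interface that both tiers implement. The sketch below is an assumption about how that could look, not the project's actual interface: `OffloadBackend`, `CpuBackend`, and `FileBackend` are hypothetical, and the file-backed class is only a stand-in for an SSD tier that writes one file per block.

```python
from pathlib import Path
from typing import Protocol


class OffloadBackend(Protocol):
    """Hypothetical interface shared by every offload tier."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def delete(self, key: str) -> None: ...


class CpuBackend:
    """Keeps offloaded blocks in host memory."""

    def __init__(self):
        self._mem = {}

    def put(self, key: str, data: bytes) -> None:
        self._mem[key] = data

    def get(self, key: str) -> bytes:
        return self._mem[key]

    def delete(self, key: str) -> None:
        del self._mem[key]


class FileBackend:
    """Stand-in for an SSD tier: one file per block under `root`.

    Keys are assumed to be filename-safe (for example, hex digests).
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        (self.root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def delete(self, key: str) -> None:
        (self.root / key).unlink()


def round_trip(backend: OffloadBackend, key: str, data: bytes) -> bytes:
    """Store, read back, and remove a block through any backend."""
    backend.put(key, data)
    out = backend.get(key)
    backend.delete(key)
    return out
```

With this shape, the scheduler logic only sees `OffloadBackend`, so comparing tiers for the performance evaluation comes down to constructing a different backend.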

3. Model Parallelism

Since l3work can distribute work across multiple devices, the overall plan is to perform multi-card task allocation according to the L3 hierarchy. The specific scheduling method will be discussed with the von Neumann team.

4. DeepSeek V4 Adaptation

Model-layer support for DeepSeek V4 is nearly complete. On the serving side, two integration tasks are in progress:

  1. DeepSeek V4 uses a special attention mechanism, which requires adaptation.
  2. DeepSeek V4 needs to be deployed with DP + EP, which is not yet supported by the serving system.
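
For orientation on the second item, the sketch below shows the general idea of combining data parallelism (DP) with expert parallelism (EP): requests are spread across DP replicas, while the experts of a mixture-of-experts layer are partitioned across EP ranks and tokens are routed to the rank that owns their expert. This is generic DP + EP, not the DeepSeek V4 deployment or the serving system's design. All function names are hypothetical, and round-robin request placement and contiguous expert partitioning are simplifying assumptions.

```python
def dp_rank_for_request(request_id: int, dp_size: int) -> int:
    """Assign a request to a data-parallel replica (round-robin by id)."""
    return request_id % dp_size


def ep_rank_for_expert(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Find the expert-parallel rank that owns an expert.

    Experts are split into contiguous, equally sized groups.
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    return expert_id // (num_experts // ep_size)


def route_tokens(expert_ids: list, num_experts: int, ep_size: int) -> dict:
    """Group token indices by the EP rank that must process them.

    expert_ids[i] is the expert chosen for token i.
    """
    by_rank = {}
    for token_idx, expert_id in enumerate(expert_ids):
        rank = ep_rank_for_expert(expert_id, num_experts, ep_size)
        by_rank.setdefault(rank, []).append(token_idx)
    return by_rank
```

Because tokens from any DP replica can select experts on any EP rank, the serving system needs a cross-device exchange step in each such layer, which is the part a single-replica scheduler does not cover.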