PyPTO Serving Weekly Progress 2026-06-12
Updated: 2026-06-12
flowchart TD
A["Server"] --> B["L3"]
B --> C["L2 Prefill"]
B --> D["L2 Decode"]
The L3 layer is now integrated. The serving layer is responsible for scheduling, l3work dispatches tasks to devices, and model execution reuses l2work to run prefill and decode.
flowchart TD
A[Request] --> B[Scheduler lookup]
B -->|NPU hit| C[Run on NPU]
B -->|CPU hit| D[Load KV to NPU]
B -->|miss| E[Allocate NPU blocks]
D --> C
E --> C
C --> F[Finish / update cache]
F --> G[Store free cached blocks to CPU]
G --> H[LRU evict if CPU full]
The current implementation follows the Mooncake approach as a starting point. When a request enters the system, the scheduler first looks up the prefix cache:
- If the KV cache is already on the NPU, those blocks are used directly for execution.
- If the KV cache is in the CPU offload area, the corresponding blocks are first loaded back to the NPU before execution.
- If there is no cache hit, new NPU blocks are allocated and computation proceeds normally.
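The three lookup outcomes above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: `PrefixCache`, `Tier`, and `prepare_block` are hypothetical names, not the actual PyPTO serving API, and the real scheduler works on block lists rather than single hashes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    NPU = "npu"
    CPU = "cpu"
    MISS = "miss"


@dataclass
class PrefixCache:
    # Hypothetical index: block hash -> tier where the KV block currently lives.
    location: dict = field(default_factory=dict)

    def lookup(self, block_hash: str) -> Tier:
        # Unknown hashes are treated as a cache miss.
        return self.location.get(block_hash, Tier.MISS)


def prepare_block(cache: PrefixCache, block_hash: str) -> str:
    """Return the action needed to make the block usable on the NPU."""
    tier = cache.lookup(block_hash)
    if tier is Tier.NPU:
        action = "reuse"          # NPU hit: use the block directly
    elif tier is Tier.CPU:
        action = "load_from_cpu"  # CPU hit: load KV back to the NPU first
    else:
        action = "allocate"       # miss: allocate new NPU blocks and compute
    # In every case the block ends up resident on the NPU (simplification:
    # this sketch does not track a copy that may remain on the CPU).
    cache.location[block_hash] = Tier.NPU
    return action
```

For example, a cache created with `PrefixCache({"a": Tier.NPU, "b": Tier.CPU})` yields `"reuse"` for `"a"`, `"load_from_cpu"` for `"b"`, and `"allocate"` for any other hash.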
After NPU execution completes, the scheduler updates the request state and the prefix cache. Complete cached blocks that have finished and are no longer referenced by any request are stored from NPU to CPU. If the CPU offload space is full, the least recently used CPU blocks are evicted.
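A minimal sketch of the CPU offload area with LRU eviction, assuming a fixed block capacity and a per-block reference count supplied by the caller. `CpuOffloadPool` and its methods are hypothetical names used for illustration only; they are not taken from the PyPTO code base.

```python
from collections import OrderedDict


class CpuOffloadPool:
    """Hypothetical CPU-side KV block pool with LRU eviction."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> payload, oldest first

    def store(self, block_hash, payload, ref_count):
        """Store a finished NPU block; return the hashes evicted to make room."""
        if ref_count > 0:
            return []  # still referenced by a request, so it is not offloaded
        evicted = []
        if block_hash in self.blocks:
            # Already offloaded: refresh its position instead of evicting.
            self.blocks.move_to_end(block_hash)
        else:
            # Evict least recently used blocks until there is a free slot.
            while len(self.blocks) >= self.capacity:
                old_hash, _ = self.blocks.popitem(last=False)
                evicted.append(old_hash)
        self.blocks[block_hash] = payload
        return evicted

    def load(self, block_hash):
        """Return the payload for a CPU hit and mark it most recently used."""
        payload = self.blocks[block_hash]
        self.blocks.move_to_end(block_hash)
        return payload
```

`OrderedDict` keeps insertion order, so moving a block to the end on every store or load makes the front of the dictionary the eviction candidate.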
SSD support is expected to be ready next week. After that, SSD will be integrated by replacing the corresponding interfaces, and performance will be evaluated.
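One way to picture "replacing the corresponding interfaces" is a shared offload interface with interchangeable backends. The sketch below is an assumption about the shape of such an interface, not the planned design: `OffloadBackend`, `CpuBackend`, `SsdBackend`, and `round_trip` are hypothetical, and the SSD tier is modeled simply as one file per block.

```python
from pathlib import Path
from typing import Optional, Protocol


class OffloadBackend(Protocol):
    """Hypothetical interface shared by all offload tiers."""

    def put(self, block_hash: str, payload: bytes) -> None: ...

    def get(self, block_hash: str) -> Optional[bytes]: ...


class CpuBackend:
    """Keeps offloaded blocks in host memory."""

    def __init__(self):
        self._blocks = {}

    def put(self, block_hash: str, payload: bytes) -> None:
        self._blocks[block_hash] = payload

    def get(self, block_hash: str) -> Optional[bytes]:
        return self._blocks.get(block_hash)


class SsdBackend:
    """Stand-in for an SSD tier: one file per block under a directory."""

    def __init__(self, root):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def put(self, block_hash: str, payload: bytes) -> None:
        (self._root / block_hash).write_bytes(payload)

    def get(self, block_hash: str) -> Optional[bytes]:
        path = self._root / block_hash
        return path.read_bytes() if path.exists() else None


def round_trip(backend: OffloadBackend, block_hash: str, payload: bytes):
    """Caller code depends only on the interface, so backends can be swapped."""
    backend.put(block_hash, payload)
    return backend.get(block_hash)
```

Because `round_trip` only depends on `put` and `get`, the same call works with either backend, which also makes it straightforward to compare their performance.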
Since l3work can distribute work across multiple devices, the overall plan is to perform multi-card task allocation according to the L3 hierarchy. The specific scheduling method will be discussed with the von Neumann team.
The model-layer support for DeepSeek V4 is nearly complete. On the serving side, two integration tasks are currently in progress:
- DeepSeek V4 uses a special attention mechanism, which requires adaptation.
- DeepSeek V4 needs to be deployed with DP + EP, which is not yet supported by the serving system.