PyPTO Serving — Design Philosophy

bumble0918 edited this page Jun 4, 2026 · 1 revision

Vision

PyPTO Serving is a storage-compute-integrated, minimalist inference reference implementation. It does not aim to build a feature-complete serving framework or compete with vLLM, TGI, or similar solutions. Instead, it establishes the shortest execution path from model weights to NPU operators, providing a runnable, readable, and modifiable reference design that demonstrates storage-compute integration.

Core values:

  • Minimal path — eliminate intermediate runtimes and adapter glue; weights reach operators directly.
  • Storage-compute integration — KV Cache management is pushed down to the operator level; operators own their storage.
  • Single runtime — only Simpler is used for NPU dispatch, avoiding resource contention across multiple runtimes.
  • Reference design — code is documentation; readability takes priority over feature coverage.

Core Principles

1. Non-competitive positioning

This repository is not a general-purpose inference framework. It provides no multi-format adaptation, multi-backend abstraction, or cross-platform compatibility layers. It focuses on a single path:

HuggingFace weights → PyPTO operators → Ascend NPU execution.

Any abstraction that increases generality should be questioned: does it serve the goal of the shortest path?

2. Eliminate intermediate layers

Typical call chain in a traditional inference framework:

API → Scheduler → Framework KV Cache → Executor → Adapter → Runtime → Kernel

Target call chain in PyPTO Serving:

Request → Scheduler → PyPTO Executor → Kernel (operator manages its own KV Cache)

Fewer layers mean easier debugging and a higher performance ceiling.

3. KV Cache management at the operator level

Traditional frameworks manage KV Cache allocation, addressing, and lifecycle at the scheduling layer; operators only consume pre-prepared tensors. This project aims to push those responsibilities into the operators themselves:

Responsibility                                | Traditional framework                            | This project's target
Block allocation                              | Scheduler                                        | Inside the operator
Address mapping (block_table / slot_mapping)  | Scheduler computes, passes in                    | Operator computes internally
Cache read/write                              | Framework provides views; operator reads/writes  | Operator manages directly
Prefix cache reuse                            | Framework-level hash + match                     | Operator-level awareness

This makes operators true storage-compute-integrated units — they are responsible for both computation and the storage management of their own data.
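
To make the address-mapping row concrete, here is a minimal sketch of the arithmetic that turns a logical token position into a physical cache slot. The helper name `compute_slot_mapping` and its signature are hypothetical (this is not the repository's `_compute_block_table_and_slot_mapping`), and the formula slot = physical_block * page_size + offset is an assumed convention borrowed from common paged-attention designs; the actual layout may differ.

```python
from typing import List


def compute_slot_mapping(block_table: List[int], start_pos: int,
                         num_tokens: int, page_size: int = 64) -> List[int]:
    """Map logical token positions to physical cache slots (hypothetical helper).

    block_table[i] is the physical block id that backs logical block i.
    Assumes slot = physical_block * page_size + offset_in_block.
    """
    slots = []
    for pos in range(start_pos, start_pos + num_tokens):
        logical_block, offset = divmod(pos, page_size)
        if logical_block >= len(block_table):
            raise IndexError(f"position {pos} has no allocated block")
        slots.append(block_table[logical_block] * page_size + offset)
    return slots


# A request whose two logical blocks live in physical blocks 7 and 2.
slots = compute_slot_mapping(block_table=[7, 2], start_pos=62, num_tokens=4)
print(slots)  # [510, 511, 128, 129]
```

Whichever layer owns this computation, the inputs are only the block table, the token positions, and the page size, which is what would make it a candidate for moving inside an operator.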

4. Single runtime

Only Simpler is used as the NPU runtime. No CANN GE, MindSpore, or other runtimes are introduced, avoiding:

  • Device resource contention across runtimes.
  • Memory isolation overhead between different runtimes.
  • Debugging difficulty across runtime boundaries.

5. Direct weight path

Model weights should be loaded from disk and reach operator-consumable format with minimal conversion. The goal is to eliminate multiple adaptation layers between HuggingFace format and operator format.

Architecture Overview

Target architecture

┌─────────────────────────────────────────────────┐
│                  Request Layer                   │
│         (HTTP / CLI / Interactive)               │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│                  Scheduler                       │
│    (continuous batching, chunked prefill)        │
└──────────────────────┬──────────────────────────┘
                       │
┌──────────────────────▼──────────────────────────┐
│              PyPTO Executor                      │
│    (compile once, dispatch to kernels)           │
│    ┌─────────────────────────────────────┐       │
│    │         Kernel / Program            │       │
│    │  ┌─────────────┐  ┌─────────────┐  │       │
│    │  │  Prefill    │  │  Decode     │  │       │
│    │  │  Operator   │  │  Operator   │  │       │
│    │  │  (owns KV   │  │  (owns KV   │  │       │
│    │  │   mgmt)     │  │   mgmt)     │  │       │
│    │  └─────────────┘  └─────────────┘  │       │
│    └─────────────────────────────────────┘       │
│                       │                          │
│              Simpler Runtime                     │
│        (single runtime, NPU dispatch)            │
└─────────────────────────────────────────────────┘

Directory structure

pypto-serving/
├── python/
│   ├── cli/                         CLI entry points
│   ├── core/
│   │   ├── scheduler.py             Continuous-batch scheduling
│   │   ├── executor.py              Executor base class
│   │   ├── kv_cache.py              KV Cache page management (target: gradually thin out)
│   │   ├── model_loader.py          Weight loading
│   │   └── ...                      tokenizer, serving, and other helpers
│   └── runtime/                     Simpler runtime wrapper
├── pypto-lib/                       PyPTO operator library (submodule)
├── examples/
│   └── model/qwen3_14b/
│       ├── runner/                  Model-specific executor and runner
│       └── src/                     PyPTO kernel/program builders
└── tests/                           Tests

Key Design Decisions

Scheduling–execution separation

The scheduler decides when and for whom to compute (continuous batching, chunked prefill, preemption). The executor decides how to compute. They communicate through ScheduledRequest lists and share no internal state.

This separation is intentional — scheduling policy is orthogonal to operator execution, and mixing them hurts readability.
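
The boundary described above can be sketched as a narrow interface. The fields of `ScheduledRequest` shown here, the `Executor` protocol, and `EchoExecutor` are all assumptions made for illustration; the real classes may carry different data.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class ScheduledRequest:
    """Hypothetical shape; the real ScheduledRequest may carry other fields."""
    request_id: str
    token_ids: List[int]   # tokens to process in this step
    is_prefill: bool


class Executor(Protocol):
    def execute(self, batch: List[ScheduledRequest]) -> Dict[str, int]:
        """Return one next-token id per request id."""
        ...


class EchoExecutor:
    """Stand-in executor: 'predicts' the last input token plus one."""

    def execute(self, batch: List[ScheduledRequest]) -> Dict[str, int]:
        return {req.request_id: req.token_ids[-1] + 1 for req in batch}


def run_step(executor: Executor, batch: List[ScheduledRequest]) -> Dict[str, int]:
    # The scheduler side only hands over the batch; it never reaches into
    # executor internals, and the executor never sees the scheduler queue.
    return executor.execute(batch)


batch = [
    ScheduledRequest("a", [1, 2, 3], is_prefill=True),
    ScheduledRequest("b", [9], is_prefill=False),
]
print(run_step(EchoExecutor(), batch))  # {'a': 4, 'b': 10}
```

Because the batch is the only thing that crosses the boundary, either side can be replaced (for example by a CPU executor) without touching the other.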

Paged KV Cache

Page-based management (configurable page size, default 64 tokens) is used for:

  • Prefix cache reuse — different requests share KV blocks for common prefixes.
  • Request preemption — KV blocks of inactive requests can be freed.
  • Memory efficiency — blocks are allocated on demand, not pre-reserved for full sequences.

Long-term target: page management responsibility moves from KvCacheManager into the operators.
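
As an illustration of on-demand allocation and of freeing blocks on preemption, here is a minimal sketch of a page pool. `BlockPool` and its methods are hypothetical and much simpler than the real KvCacheManager; prefix sharing is deliberately left out.

```python
from typing import Dict, List


class BlockPool:
    """Hypothetical on-demand page allocator; not the real KvCacheManager."""

    def __init__(self, num_blocks: int, page_size: int = 64) -> None:
        self.page_size = page_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # request id -> block table

    def ensure_capacity(self, request_id: str, num_tokens: int) -> List[int]:
        """Grow the request's block table until it covers num_tokens."""
        table = self.tables.setdefault(request_id, [])
        needed = -(-num_tokens // self.page_size)  # ceiling division
        if needed - len(table) > len(self.free_blocks):
            raise MemoryError("not enough free blocks; caller may preempt")
        while len(table) < needed:
            table.append(self.free_blocks.pop(0))
        return table

    def release(self, request_id: str) -> None:
        """Return all blocks of a finished or preempted request to the pool."""
        self.free_blocks.extend(self.tables.pop(request_id, []))


pool = BlockPool(num_blocks=4)
print(pool.ensure_capacity("r1", 65))   # [0, 1]: 65 tokens need 2 pages of 64
print(pool.ensure_capacity("r1", 128))  # [0, 1]: still 2 pages, nothing new
pool.release("r1")
print(len(pool.free_blocks))            # 4
```

A sequence only ever holds as many pages as its current length requires, which is the memory-efficiency point in the list above.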

Compile once, execute many times

PyPTO operators are compiled on first load, producing Simpler-dispatchable artifacts. All subsequent decode steps reuse the compiled artifacts with no recompilation overhead.
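
The compile-once pattern can be sketched as a small cache in front of the compiler. `ProgramCache` and `fake_compile` are hypothetical stand-ins, and keying artifacts by program name and batch size is an assumption, not a description of how PyPTO actually identifies compiled programs.

```python
from typing import Callable, Dict, Hashable, List

Program = Callable[[List[int]], int]


class ProgramCache:
    """Hypothetical compile-once cache keyed by program name and batch size."""

    def __init__(self, compile_fn: Callable[[str, int], Program]) -> None:
        self._compile_fn = compile_fn
        self._artifacts: Dict[Hashable, Program] = {}
        self.compile_count = 0

    def get(self, name: str, batch_size: int) -> Program:
        key = (name, batch_size)
        if key not in self._artifacts:
            # Slow path: runs only the first time this key is requested.
            self._artifacts[key] = self._compile_fn(name, batch_size)
            self.compile_count += 1
        return self._artifacts[key]


def fake_compile(name: str, batch_size: int) -> Program:
    # Stand-in for real operator compilation; returns a trivial callable.
    return lambda tokens: sum(tokens) % 1000


cache = ProgramCache(fake_compile)
for step in range(5):
    decode = cache.get("decode", batch_size=1)
    decode([step, step + 1])
print(cache.compile_count)  # 1
```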

CPU reference implementation

CpuModelExecutor provides a pure-PyTorch CPU inference path with no NPU dependency. Its role:

  • Functional correctness reference.
  • Development and debugging without NPU hardware.
  • Not a production path — no performance optimization required.

Current State vs. Target

The following analysis compares the current codebase (feature/2026-06-03 branch) against the design goals above, ordered by priority.

1. KV Cache is still managed at the framework layer

Target: KV Cache management at the operator level.

Current state: KvCacheManager (python/core/kv_cache.py) centrally manages block allocation, address mapping, and prefix caching. Operators obtain zero-copy views via materialize_single_layer_cache() / materialize_full_layer_cache(), but block_table and slot_mapping are still computed by the framework layer (_compute_block_table_and_slot_mapping in model_runner.py) and passed into operators.

Current: Scheduler → KvCacheManager → ModelRunner(computes block_table/slot_mapping) → Kernel
Target:  Scheduler → Kernel (operator internally computes block_table/slot_mapping and reads/writes KV)

Evolution direction:

  • Step 1: Move block_table / slot_mapping computation logic from ModelRunner into PyPTO operators.
  • Step 2: Expose block allocation interface from KvCacheManager to operators; operators allocate/free autonomously.
  • Step 3: KvCacheManager is reduced to a pure physical memory pool; all logical management lives inside operators.

2. Executor call chain is too deep

Target: Eliminate intermediate layers.

Current state: Four layers of executor abstraction sit on top of the Simpler Worker:

ModelExecutor (abstract base)
  └── PyptoExecutor (PyPTO-generic abstraction)
        └── Qwen314BPyptoExecutor (model-specific, contains compilation logic)
              └── Qwen314BModelRunner (runtime wrapper, contains execution logic)
                    └── Worker (Simpler runtime wrapper)

PyptoExecutor is an intermediate layer with only one implementation (Qwen314BPyptoExecutor); its abstraction provides limited value.

Evolution direction:

  • Evaluate whether PyptoExecutor can be merged directly into Qwen314BPyptoExecutor, removing one layer of indirection.
  • Evaluate whether Qwen314BModelRunner can be inlined into the executor, so the executor directly holds a Simpler Worker.
  • Target chain: ModelExecutor → NPU Executor (directly holds Worker).
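
The target chain could look roughly like the sketch below, in which one concrete executor owns both compilation and execution and holds the Worker directly. The class bodies and the `Worker.run` signature are hypothetical; only the names ModelExecutor and Worker come from the current codebase.

```python
from abc import ABC, abstractmethod
from typing import List


class Worker:
    """Stand-in for the Simpler runtime wrapper; run() is hypothetical."""

    def run(self, program: str, token_ids: List[int]) -> List[int]:
        return [t + 1 for t in token_ids]


class ModelExecutor(ABC):
    @abstractmethod
    def execute(self, token_ids: List[int]) -> List[int]:
        ...


class NpuExecutor(ModelExecutor):
    """Owns compilation and execution, and holds the Worker directly."""

    def __init__(self, worker: Worker) -> None:
        self._worker = worker
        self._program = self._build_program()  # compile once at construction

    def _build_program(self) -> str:
        return "decode_program"  # placeholder for a compiled artifact

    def execute(self, token_ids: List[int]) -> List[int]:
        # One hop from the executor to the runtime, with no runner in between.
        return self._worker.run(self._program, token_ids)


executor = NpuExecutor(Worker())
print(executor.execute([1, 2, 3]))  # [2, 3, 4]
```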

3. Weight conversion adapter layer

Target: Direct weight path; eliminate adapter glue code.

Current state: Multiple conversion layers exist between HuggingFace weights and operator-consumable format:

  • _StackedLayerView — reorganizes HF-format Q/K/V/O gate weights into the layout operators expect.
  • _KernelLayerWeights — wraps weights into operator-consumable structs.
  • The build_programs method in npu_executor.py contains extensive weight rearrangement logic.

HF weights → _StackedLayerView → _KernelLayerWeights → Worker Tensor → Operator

Evolution direction:

  • Short-term: Consolidate weight rearrangement logic into a single location (currently scattered across executor and runner).
  • Long-term: PyPTO operators natively support HF weight layout, or provide compile-time automatic conversion.
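
For the short-term step, consolidation might mean a single function that is the only place where HF tensor names are translated into the operator layout. The sketch below is an assumption throughout: `stack_qkv` is a hypothetical helper, row-wise Q/K/V concatenation is an illustrative layout rather than the one PyPTO operators actually expect, and plain nested lists stand in for tensors.

```python
from typing import Dict, List

Matrix = List[List[float]]


def stack_qkv(hf_weights: Dict[str, Matrix], layer: int) -> Matrix:
    """Concatenate q/k/v projection rows for one layer (hypothetical layout)."""
    prefix = f"model.layers.{layer}.self_attn"
    rows: Matrix = []
    for proj in ("q_proj", "k_proj", "v_proj"):
        rows.extend(hf_weights[f"{prefix}.{proj}.weight"])
    return rows


# Toy weights with one row per projection.
hf_weights = {
    "model.layers.0.self_attn.q_proj.weight": [[1.0, 2.0]],
    "model.layers.0.self_attn.k_proj.weight": [[3.0, 4.0]],
    "model.layers.0.self_attn.v_proj.weight": [[5.0, 6.0]],
}
print(stack_qkv(hf_weights, layer=0))  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
```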

4. Multi-process architecture necessity needs evaluation

Target: Minimal path.

Current state: AsyncLLMEngine + serving_worker.py implement a multi-process architecture — the main process runs scheduling and the API, while a child process exclusively owns the NPU device. Communication goes through multiprocessing.Queue.

For single-device scenarios, a multi-process design introduces unnecessary complexity (inter-process communication, state synchronization, debugging difficulty).

Evolution direction:

  • Evaluate whether single-process + async IO suffices for single-device scenarios.
  • If multi-process is only necessary for multi-device cases, position it as a "multi-device extension" rather than the default architecture.
  • The default path should be the simplest: single-process, direct dispatch.
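
One possible shape for the single-process alternative is sketched below: an asyncio event loop serves requests while a single worker thread serializes blocking device calls. `run_on_device` is a hypothetical stand-in, and the sketch assumes the runtime tolerates being driven from one dedicated non-main thread, which would need to be verified against Simpler.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List

# One worker thread serializes all device calls inside the same process.
_device_thread = ThreadPoolExecutor(max_workers=1)


def run_on_device(token_ids: List[int]) -> int:
    """Stand-in for a blocking NPU dispatch call."""
    return sum(token_ids)


async def handle_request(token_ids: List[int]) -> int:
    # The event loop stays responsive while the blocking call runs in the
    # dedicated device thread; no queues or child processes are involved.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_device_thread, run_on_device, token_ids)


async def main() -> List[int]:
    return await asyncio.gather(
        handle_request([1, 2, 3]),
        handle_request([10, 20]),
    )


results = asyncio.run(main())
print(results)  # [6, 30]
```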

5. Serving layer scope

Target: Non-competitive positioning; reference design.

Current state: server.py implements a full OpenAI-compatible API (completion, chat completion, streaming), and bench_serving.py provides performance benchmarks. These features make the project appear to be a lightweight vLLM alternative.

Evolution direction:

  • The serving layer should be explicitly positioned as a "runnable reference example", not production-grade serving.
  • Documentation should emphasize its reference nature and avoid giving users the impression it can directly replace vLLM.
  • If production-grade serving is needed in the future, it should live outside this repository.

6. Operator-level prefix cache awareness

Target: Operators are aware of prefix caching.

Current state: Prefix caching is handled entirely by KvCacheManager at the framework layer — block hashes are computed via hash_block_tokens, and existing blocks are matched during allocation. Operators are unaware of this and passively consume blocks assigned by the framework.

Current: KvCacheManager.hash_block_tokens → block match → assign to operator
Target:  Operator internally aware of cached blocks, autonomously decides reuse

Evolution direction: This is part of "KV Cache management at the operator level" and depends on progress on item 1.
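
For readers unfamiliar with block-hash prefix matching, the following minimal sketch shows the general technique. `hash_block` and `match_prefix` are hypothetical helpers and do not reproduce the repository's `hash_block_tokens`; chaining each block's hash to its parent is a common approach that is assumed here, not confirmed by the source.

```python
import hashlib
from typing import Dict, List, Optional


def hash_block(parent_hash: Optional[str], token_ids: List[int]) -> str:
    """Chain hash: a block's identity covers its tokens and its whole prefix."""
    h = hashlib.sha256()
    h.update((parent_hash or "").encode())
    h.update(",".join(map(str, token_ids)).encode())
    return h.hexdigest()


def match_prefix(token_ids: List[int], cached: Dict[str, int],
                 page_size: int = 64) -> List[int]:
    """Return physical block ids for the longest cached run of full blocks."""
    hits: List[int] = []
    parent: Optional[str] = None
    for start in range(0, len(token_ids) - page_size + 1, page_size):
        parent = hash_block(parent, token_ids[start:start + page_size])
        if parent not in cached:
            break
        hits.append(cached[parent])
    return hits


# Toy page size of 2 so the example stays readable.
first = hash_block(None, [1, 2])
second = hash_block(first, [3, 4])
cached = {first: 10, second: 11}
print(match_prefix([1, 2, 3, 4, 5], cached, page_size=2))  # [10, 11]
print(match_prefix([1, 2, 9, 9], cached, page_size=2))     # [10]
```

Only full blocks are matched, and a miss ends the search, so a request reuses exactly the blocks whose entire prefix is identical.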


Summary

This repository's core mission is to demonstrate the shortest storage-compute-integrated inference path, not to build a feature-complete inference framework. Every architectural decision should be measured against this standard: does this change make the path shorter? Is this abstraction truly necessary? Does this code demonstrate storage-compute integration rather than framework glue?

The current code is already heading in the right direction — single Simpler runtime, paged KV Cache, compile-once execute-many. The next priorities are to continue eliminating intermediate layers, push KV Cache management down into operators, and make "storage-compute integration" a code reality rather than just an architectural concept.