PyPTO Serving Architecture Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ User / Client │
│ curl, benchmark, OpenAI-compatible SDK │
└───────────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ Service Entry Layer │
│ │
│ python/cli/main.py parse args, build config, start uvicorn │
└───────────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ HTTP API Layer │
│ │
│ python/core/server.py /v1/completions, /v1/chat/completions │
│ /v1/models, /health │
└───────────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ Async Serving Control Plane, main process │
│ │
│ python/core/async_engine.py request lifecycle, engine loop, streaming│
│ │
│ ┌──────────────────────────────┐ ┌─────────────────────────────────────┐ │
│ │ python/core/scheduler.py │ │ python/core/kv_cache.py │ │
│ │ batching, prefill/decode │<->│ KV blocks, prefix cache, allocation │ │
│ │ token budget, finish states │ │ │ │
│ └──────────────────────────────┘ └─────────────────────────────────────┘ │
└───────────────────────────────────┬──────────────────────────────────────────┘
│ multiprocessing queues
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ Model Execution Layer, worker process │
│ │
│ python/core/serving_worker.py owns model execution loop and NPU device │
│ python/core/model_loader.py loads model config, weights, tokenizer │
│ python/core/sampler.py converts logits to next token │
└───────────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ Backend Adapter Layer │
│ │
│ python/core/executor.py backend-neutral prefill/decode interface │
│ python/core/pypto_executor.py common PyPTO executor base │
│ examples/model/qwen3_14b/runner/ Qwen3-14B NPU executor and runner glue │
└───────────────────────────────────┬──────────────────────────────────────────┘
│
┌───────────────────────────────────▼──────────────────────────────────────────┐
│ Kernel / Runtime Layer │
│ │
│ pypto-lib/models/qwen3/14b/ Qwen3-14B PyPTO kernel definitions │
│ Ascend / CANN / PyPTO runtime device execution environment │
└──────────────────────────────────────────────────────────────────────────────┘
The service entry point is python/cli/main.py. It converts command-line options such as model path, port, NPU device, maximum concurrency, maximum sequence length, KV block size, prefix cache, and chunked prefill into a serving configuration, then starts FastAPI/uvicorn.
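As a rough illustration of that conversion step, the sketch below maps command-line options onto a frozen config object. The flag names, defaults, and the `ServingConfig` class are assumptions made for this example; they are not taken from python/cli/main.py.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    # Hypothetical field names; the project's real config schema may differ.
    model_path: str
    port: int
    device_id: int
    max_concurrency: int
    max_seq_len: int
    kv_block_size: int
    enable_prefix_cache: bool
    enable_chunked_prefill: bool


def build_config(argv=None) -> ServingConfig:
    """Parse CLI options (illustrative flags) into a ServingConfig."""
    parser = argparse.ArgumentParser(description="serving entry point (sketch)")
    parser.add_argument("--model-path", required=True)
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--device-id", type=int, default=0)
    parser.add_argument("--max-concurrency", type=int, default=8)
    parser.add_argument("--max-seq-len", type=int, default=4096)
    parser.add_argument("--kv-block-size", type=int, default=128)
    parser.add_argument("--enable-prefix-cache", action="store_true")
    parser.add_argument("--enable-chunked-prefill", action="store_true")
    args = parser.parse_args(argv)
    # argparse turns "--model-path" into "model_path", matching the fields above.
    return ServingConfig(**vars(args))
```

A real entry point would then hand the config to the HTTP application and start the server, for example through `uvicorn.run`; that step is left out here so the sketch stays free of third-party dependencies.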
The HTTP API is implemented in python/core/server.py. It exposes /v1/completions, /v1/chat/completions, /v1/models, and /health, and its request and response shapes closely follow the OpenAI-compatible serving convention. This layer only parses requests and formats responses; it does not run the model directly.
The serving control plane lives in python/core/async_engine.py. It runs in the main process and is responsible for receiving requests, maintaining request contexts, driving the scheduling loop, communicating with the worker process, and sending generated output back to the HTTP layer. Conceptually, this is the central serving coordinator.
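The coordinator role can be pictured with a small asyncio sketch: one queue per request carries generated text from the engine loop to whoever streams the response. `MiniAsyncEngine`, `make_fake_step`, and the `step_fn` contract are hypothetical names for this illustration and do not mirror the actual classes in python/core/async_engine.py; the fake step function stands in for the scheduler and worker round trip.

```python
import asyncio
import itertools


class MiniAsyncEngine:
    """Toy coordinator: tracks requests and fans step results out to streams."""

    def __init__(self, step_fn):
        # step_fn(request_ids) -> {request_id: (text_piece, finished)}
        self._step_fn = step_fn
        self._streams = {}   # request_id -> asyncio.Queue of text pieces
        self._active = set() # requests that still need engine steps
        self._ids = itertools.count()

    def add_request(self, prompt):
        request_id = next(self._ids)
        self._streams[request_id] = asyncio.Queue()
        self._active.add(request_id)
        return request_id

    async def run_loop(self):
        while self._active:
            results = self._step_fn(sorted(self._active))
            for request_id, (piece, finished) in results.items():
                queue = self._streams[request_id]
                await queue.put(piece)
                if finished:
                    await queue.put(None)  # end-of-stream marker
                    self._active.discard(request_id)
            await asyncio.sleep(0)  # yield so stream consumers can run

    async def stream(self, request_id):
        queue = self._streams[request_id]
        while True:
            piece = await queue.get()
            if piece is None:
                del self._streams[request_id]
                return
            yield piece


def make_fake_step(max_tokens):
    """Stand-in for scheduler + worker: emits max_tokens pieces per request."""
    produced = {}

    def step(request_ids):
        out = {}
        for rid in request_ids:
            n = produced.get(rid, 0)
            produced[rid] = n + 1
            out[rid] = (f"t{n} ", n + 1 >= max_tokens)
        return out

    return step


async def demo():
    engine = MiniAsyncEngine(make_fake_step(max_tokens=3))
    first = engine.add_request("hello")
    second = engine.add_request("world")
    loop_task = asyncio.create_task(engine.run_loop())
    text_first = "".join([piece async for piece in engine.stream(first)])
    text_second = "".join([piece async for piece in engine.stream(second)])
    await loop_task
    return text_first, text_second
```

Running `asyncio.run(demo())` drives both requests through the same loop, which is the property the control plane needs: request bookkeeping stays in one place while each HTTP handler only consumes its own stream.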
Request scheduling is handled by python/core/scheduler.py. It performs continuous batching by grouping multiple requests into each prefill/decode step, while also managing the token budget, maximum concurrency, chunked prefill, prefix-cache reuse, and request completion state.
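A minimal sketch of continuous batching under a per-step token budget, with chunked prefill, might look like the following. `Request`, `ToyScheduler`, and their fields are invented for this example, prefix-cache reuse and KV checks are omitted, and the policy shown (running requests first, then admission) is an assumption rather than the behavior of python/core/scheduler.py.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_len: int
    max_new_tokens: int
    prefilled: int = 0   # prompt tokens already processed
    generated: int = 0   # output tokens produced so far

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens


class ToyScheduler:
    def __init__(self, token_budget, max_concurrency):
        self.token_budget = token_budget
        self.max_concurrency = max_concurrency
        self.waiting = deque()
        self.running = []

    def add(self, request):
        self.waiting.append(request)

    def schedule(self):
        """Return [(request, num_tokens)] to run in one engine step."""
        budget = self.token_budget
        batch = []
        # Running requests go first: one token for decode, or the next
        # prefill chunk if the prompt is not fully processed yet.
        for req in self.running:
            if budget == 0:
                break
            remaining = req.prompt_len - req.prefilled
            n = 1 if remaining == 0 else min(remaining, budget)
            batch.append((req, n))
            budget -= n
        # Admit waiting requests while concurrency and budget allow.
        while self.waiting and len(self.running) < self.max_concurrency and budget > 0:
            req = self.waiting.popleft()
            n = min(req.prompt_len, budget)
            self.running.append(req)
            batch.append((req, n))
            budget -= n
        return batch

    def update(self, batch):
        """Apply the results of one executed step and retire finished requests."""
        for req, n in batch:
            if req.prefilled < req.prompt_len:
                req.prefilled += n
                if req.prefilled == req.prompt_len:
                    req.generated += 1  # final prefill chunk yields the first token
            else:
                req.generated += 1
        self.running = [r for r in self.running if not r.finished]
```

With a budget of 8 tokens, a 10-token prompt is split across two steps, and a second request joins the batch as soon as budget is left over, without waiting for the first to finish.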
KV cache management is implemented in python/core/kv_cache.py. It handles block/page allocation, release, reference counting, and prefix-cache metadata. The scheduler depends on it to decide whether a request has enough KV resources to run.
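The interplay of block allocation, reference counting, and prefix-cache metadata can be sketched as below. `BlockAllocator` and its methods are hypothetical; keying full blocks by the token prefix they cover is one common approach and is assumed here, not confirmed for python/core/kv_cache.py. The sketch also registers a block in the prefix index at allocation time, whereas a real manager would likely wait until its KV data has been computed.

```python
from collections import deque


class BlockAllocator:
    """Toy KV block pool with reference counts and a prefix index."""

    def __init__(self, num_blocks):
        self.free = deque(range(num_blocks))
        self.ref_count = {}     # block id -> number of requests using it
        self.prefix_index = {}  # token prefix (tuple) -> block id
        self.block_key = {}     # block id -> prefix key, for cleanup

    def num_free(self):
        return len(self.free)

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV blocks")
        block = self.free.popleft()
        self.ref_count[block] = 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            key = self.block_key.pop(block, None)
            if key is not None:
                del self.prefix_index[key]
            self.free.append(block)

    def allocate_for_prompt(self, tokens, block_size):
        """Return (block ids, number of prompt tokens served from cache)."""
        blocks, reused = [], 0
        for start in range(0, len(tokens), block_size):
            end = start + block_size
            # Only full blocks are shareable; the key is the whole prefix.
            key = tuple(tokens[:end]) if end <= len(tokens) else None
            if key is not None and key in self.prefix_index:
                block = self.prefix_index[key]
                self.ref_count[block] += 1
                reused += block_size
            else:
                block = self.allocate()
                if key is not None:
                    self.prefix_index[key] = block
                    self.block_key[block] = key
            blocks.append(block)
        return blocks, reused
```

A scheduler built on this would call `num_free()` before admitting a request, which is the dependency described above: a request only runs when enough KV blocks are available or reusable.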
Model execution happens in a worker process, implemented by python/core/serving_worker.py. The worker owns the NPU device and model execution environment. It receives scheduled batches from the main process, calls the executor/runner to run prefill or decode, and returns newly generated tokens.
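The main-process/worker split over queues can be sketched as follows. The message format `(phase, request_ids)`, the `None` shutdown sentinel, and all function names are assumptions for illustration; the real protocol in python/core/serving_worker.py is not shown in this overview. `fake_run_step` stands in for the executor call.

```python
import multiprocessing as mp
import queue

SHUTDOWN = None  # sentinel telling the worker loop to exit


def fake_run_step(phase, request_id):
    """Stand-in for the executor: returns a dummy token id per request."""
    return 0 if phase == "prefill" else 1


def worker_loop(batch_queue, result_queue, run_step):
    """Pull scheduled batches, execute them, and push new tokens back."""
    while True:
        batch = batch_queue.get()
        if batch is SHUTDOWN:
            break
        phase, request_ids = batch
        result_queue.put({rid: run_step(phase, rid) for rid in request_ids})


def start_worker():
    """Launch worker_loop in a separate process (call under a __main__ guard)."""
    ctx = mp.get_context("spawn")
    batch_queue, result_queue = ctx.Queue(), ctx.Queue()
    proc = ctx.Process(
        target=worker_loop,
        args=(batch_queue, result_queue, fake_run_step),
        daemon=True,
    )
    proc.start()
    return proc, batch_queue, result_queue


def run_inline_demo():
    """Exercise the same loop in-process with thread-safe queues."""
    batch_queue, result_queue = queue.Queue(), queue.Queue()
    batch_queue.put(("prefill", ["a", "b"]))
    batch_queue.put(("decode", ["a"]))
    batch_queue.put(SHUTDOWN)
    worker_loop(batch_queue, result_queue, fake_run_step)
    return [result_queue.get_nowait(), result_queue.get_nowait()]
```

Because `worker_loop` only depends on `get` and `put`, the same function works with `queue.Queue` for tests and with multiprocessing queues in a real deployment, where the worker process would also initialize the NPU device before entering the loop.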
The executor and runner together form the backend adapter layer. The generic interfaces are defined in python/core/executor.py and python/core/pypto_executor.py. The actual Qwen3-14B NPU path lives under examples/model/qwen3_14b/runner/ and ultimately calls the Qwen3 kernels in pypto-lib.
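A backend-neutral prefill/decode interface could be shaped roughly like this. The method signatures, the `EchoExecutor` stand-in, and the greedy sampling helper are all assumptions for the sake of the example; the real interface in python/core/executor.py may take different arguments (for instance KV block tables) and return device tensors rather than Python lists.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class Executor(ABC):
    """Hypothetical backend-neutral interface: one logits row per request."""

    @abstractmethod
    def prefill(self, prompts: Sequence[Sequence[int]]) -> List[List[float]]:
        """Process full prompts and return logits for the next token."""

    @abstractmethod
    def decode(self, last_tokens: Sequence[int]) -> List[List[float]]:
        """Process one new token per request and return next-token logits."""


class EchoExecutor(Executor):
    """Stand-in backend whose logits favor (last token + 1) % vocab_size."""

    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size

    def _logits_for(self, last_token: int) -> List[float]:
        target = (last_token + 1) % self.vocab_size
        return [1.0 if i == target else 0.0 for i in range(self.vocab_size)]

    def prefill(self, prompts):
        return [self._logits_for(prompt[-1]) for prompt in prompts]

    def decode(self, last_tokens):
        return [self._logits_for(token) for token in last_tokens]


def greedy_sample(logits: Sequence[float]) -> int:
    """Pick the index of the largest logit (argmax)."""
    return max(range(len(logits)), key=logits.__getitem__)
```

The worker only needs the abstract `Executor`; a model-specific runner would subclass it and translate these calls into kernel launches, which keeps the serving code independent of any one model.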
A serving request roughly flows as follows:
HTTP request
-> API layer parses the request
-> Async engine creates an internal request
-> Scheduler places the request into a batch
-> KV manager allocates or reuses KV blocks
-> Worker runs prefill/decode
-> Runner calls PyPTO kernels
-> Worker returns the new token
-> Async engine updates request state and decodes text
-> API layer returns JSON or an SSE stream
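The last step of the flow, returning an SSE stream, can be sketched with plain string formatting. The chunk fields below follow the general shape of OpenAI-style completion streaming, including the closing `data: [DONE]` line; whether python/core/server.py emits exactly these fields is an assumption, and `stream_completion` is a hypothetical helper.

```python
import json


def sse_event(payload: dict) -> str:
    """Format one server-sent event: a data line followed by a blank line."""
    return f"data: {json.dumps(payload)}\n\n"


def stream_completion(request_id: str, model: str, pieces):
    """Yield SSE chunks for each text piece, then a final chunk and a terminator."""
    for piece in pieces:
        yield sse_event({
            "id": request_id,
            "object": "text_completion",
            "model": model,
            "choices": [{"index": 0, "text": piece, "finish_reason": None}],
        })
    # Final chunk carries the finish reason with no additional text.
    yield sse_event({
        "id": request_id,
        "object": "text_completion",
        "model": model,
        "choices": [{"index": 0, "text": "", "finish_reason": "stop"}],
    })
    yield "data: [DONE]\n\n"
```

In a FastAPI application, a generator like this would typically be wrapped in a streaming response with the `text/event-stream` media type, while the non-streaming path would collect all pieces and return a single JSON body.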
The main process focuses on scheduling, resource availability, request state, and output delivery. The worker process focuses on executing the scheduled batch on the NPU.
Future work will focus on larger-scale serving, longer context support, and a more complete hierarchical runtime:
- Support model parallelism. The current serving path is closer to a single-worker, single-device execution model. Future work should introduce tensor parallelism, pipeline parallelism, and related multi-NPU execution strategies, so prefill and decode can run cooperatively across devices. The scheduling layer will also need to represent parallel groups, communication, and worker topology.
- Integrate SSD KV cache offloading. We have built new SSD capability on the NPU side. The next step is to extend KV cache management from the current memory/device-side model into a tiered storage model that supports offload and recovery. When NPU memory or on-device KV capacity is insufficient, cold KV blocks should be moved to SSD and loaded back on demand for prefix hits or continued decode.
- Integrate with the Simple/Lingqu multi-level structure. Serving can evolve from the current single-machine control plane into an L3-L7 hierarchy: L3 Host for single-machine inference orchestration, L4 Pod for prefill/decode disaggregation and multi-host coordination, and L5-L7 for service pools, cluster coordination, and global routing. This direction requires the current scheduler, KV metadata, and worker topology to gradually become a hierarchical orchestrator/worker model.
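To make the SSD KV cache offloading item above more concrete, here is a small sketch of a two-tier block store that moves the least recently used blocks to a slower tier and loads them back on access. `TieredKVStore` is hypothetical, a plain dictionary stands in for the SSD, and the LRU policy is an assumption; the planned design may choose cold blocks differently.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier KV store: bounded device tier plus an unbounded SSD tier."""

    def __init__(self, device_capacity: int):
        self.device_capacity = device_capacity
        self.device = OrderedDict()  # block id -> data, least recent first
        self.ssd = {}                # stand-in for blocks offloaded to SSD

    def put(self, block_id, data):
        """Place a block on the device tier, offloading cold blocks if needed."""
        self.ssd.pop(block_id, None)
        self.device[block_id] = data
        self.device.move_to_end(block_id)
        self._offload_cold_blocks()

    def get(self, block_id):
        """Return block data, loading it back from SSD on a device miss."""
        if block_id in self.device:
            self.device.move_to_end(block_id)
            return self.device[block_id]
        data = self.ssd.pop(block_id)  # raises KeyError for unknown blocks
        self.put(block_id, data)
        return data

    def _offload_cold_blocks(self):
        while len(self.device) > self.device_capacity:
            cold_id, cold_data = self.device.popitem(last=False)
            self.ssd[cold_id] = cold_data
```

The scheduler-facing change is that a block being absent from the device no longer means it is lost: a prefix hit or a resumed decode can trigger a load from the lower tier instead of a recompute.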