A scalable NPU architecture for Transformer LLM inference on edge FPGAs
pccx is a hardware-software co-design framework that accelerates autoregressive decoding of Transformer-based LLMs on resource-constrained edge devices. The primary target is the Xilinx Kria KV260 SOM.
Rather than reusing a generic matrix accelerator, pccx is designed around the actual bottleneck of LLM decoding: memory bandwidth-bound GEMV, not compute-bound GEMM. The architecture separates matrix (GEMM) and vector (GEMV) datapaths, supplies weights through dedicated HP AXI ports, and uses a custom 64-bit VLIW ISA to eliminate dispatch stalls.
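
The bandwidth argument can be made concrete with a back-of-the-envelope sketch. It assumes every weight is fetched from DRAM exactly once per decoded token (no reuse), and the 19.2 GB/s bandwidth is an illustrative placeholder, not a measured KV260 figure. The function names are hypothetical.

```python
def gemv_macs_per_byte(weight_bits: int) -> float:
    """MACs performed per byte of weights streamed when no weight is reused."""
    return 8 / weight_bits


def gemm_macs_per_byte(weight_bits: int, tokens: int) -> float:
    """During prefill, `tokens` activation rows share each fetched weight."""
    return tokens * gemv_macs_per_byte(weight_bits)


def bandwidth_bound_gmacs(dram_gb_per_s: float, weight_bits: int) -> float:
    """Upper bound on GEMV throughput (GMAC/s) imposed by weight streaming."""
    return dram_gb_per_s * gemv_macs_per_byte(weight_bits)


# Assumption: placeholder DRAM bandwidth, chosen only for illustration.
ASSUMED_DRAM_GB_PER_S = 19.2

decode_bound = bandwidth_bound_gmacs(ASSUMED_DRAM_GB_PER_S, weight_bits=4)
prefill_intensity = gemm_macs_per_byte(weight_bits=4, tokens=128)

print(f"decode (GEMV) bound: {decode_bound:.1f} GMAC/s")
print(f"prefill (GEMM) intensity at 128 tokens: {prefill_intensity:.0f} MAC/byte")
```

Under these assumptions, adding MAC units does not raise decode throughput once the bound is reached. Only more weight bandwidth or narrower weights do, which is the motivation for dedicated weight-streaming ports.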
| Core | Configuration | Peak Throughput | Primary Use |
|---|---|---|---|
| GEMM (Matrix) | 32 × 32 systolic array (cascade split @ row 16) | 819 GMAC/s @ 400 MHz | Prefill, Q·Kᵀ, score·V |
| GEMV (Vector) | 4 cores × 32-MAC LUT pipeline + 5-stage reduction tree | Weight-streaming limited (~51.2 GMAC/s @ 400 MHz) | Autoregressive decoding |
| SFU / CVO | CORDIC + LUT hybrid | BF16 / FP32 promoted | Softmax, GELU, RMSNorm, RoPE |
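
The peak figures in the table can be reproduced with simple arithmetic. The sketch below assumes one DSP per systolic processing element delivering 2 MACs per cycle and one MAC per cycle for each LUT-based GEMV unit. Those per-unit rates are assumptions chosen to match the table, not values taken from the RTL.

```python
CORE_CLOCK_HZ = 400e6  # core compute domain clock from the table


def peak_gmacs(mac_units: int, macs_per_unit_per_cycle: int,
               clock_hz: float = CORE_CLOCK_HZ) -> float:
    """Peak throughput in GMAC/s for a fully utilised array."""
    return mac_units * macs_per_unit_per_cycle * clock_hz / 1e9


# Assumption: each of the 32 x 32 PEs is one DSP doing 2 MACs per cycle.
gemm_peak = peak_gmacs(mac_units=32 * 32, macs_per_unit_per_cycle=2)

# Assumption: each of the 4 x 32 LUT MAC units does 1 MAC per cycle.
gemv_peak = peak_gmacs(mac_units=4 * 32, macs_per_unit_per_cycle=1)

print(f"GEMM peak: {gemm_peak:.1f} GMAC/s")
print(f"GEMV peak: {gemv_peak:.1f} GMAC/s")
```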
Key design decisions:
- W4A8 precision: INT4 weights × INT8 activations via DSP48E2 dual-channel bit packing (1 DSP = 2 MACs)
- Precision promotion: non-linear ops (Softmax, GELU, RMSNorm, RoPE) automatically upcast to BF16/FP32 for numerical stability
- Custom 64-bit VLIW ISA: 5 opcodes (GEMV, GEMM, MEMCPY, MEMSET, CVO); decoupled decode/dispatch eliminates front-end stalls
- Shared L2 (URAM, 1.75 MB): all three cores share a central SRAM cache; GEMV and SFU are connected via a direct-connect FIFO, bypassing L2 round-trips
- Dual clock domains: 250 MHz AXI/control plane, 400 MHz core compute (×1.6 frequency gain over v001)
- 3.125× total throughput gain vs. v001 (frequency × dual-MAC DSP packing)
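
The idea behind "1 DSP = 2 MACs" can be illustrated in software. The sketch below packs two signed INT4 weights into one wide operand, performs a single multiplication by a shared INT8 activation, and recovers both products from the result. The 16-bit field offset (`SHIFT`) and the function names are hypothetical; the actual pccx bit layout is derived in the project documentation and may differ.

```python
SHIFT = 16  # hypothetical field offset; must exceed the width of one product


def pack_weights(w_hi: int, w_lo: int) -> int:
    """Combine two signed INT4 weights into one multiplier operand."""
    assert -8 <= w_hi <= 7 and -8 <= w_lo <= 7
    return (w_hi << SHIFT) + w_lo


def unpack_products(product: int) -> tuple[int, int]:
    """Split one wide product into (w_hi * a, w_lo * a)."""
    lo = product & ((1 << SHIFT) - 1)
    if lo >= 1 << (SHIFT - 1):   # sign-extend the low field
        lo -= 1 << SHIFT
    hi = (product - lo) >> SHIFT  # removing lo also undoes its borrow
    return hi, lo


def dual_mac(w_hi: int, w_lo: int, a: int) -> tuple[int, int]:
    """One multiplication yields two INT4 x INT8 products."""
    assert -128 <= a <= 127
    return unpack_products(pack_weights(w_hi, w_lo) * a)


print(dual_mac(-3, 7, -100))
```

The subtraction of the sign-extended low field before shifting is the step that corrects the borrow a negative low product takes from the high field. In hardware, accumulating many such products also requires guard bits between the fields, which this single-product sketch does not model.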
External AXI (250 MHz)            Core Domain (400 MHz)
──────────────────────            ─────────────────────────────────────────────────────
S_AXIL_CTRL (HPM)      ────────►  npu_controller_top
                                    └─ ctrl_npu_decoder (64-bit VLIW → opcode + body)
S_AXI_HP0/HP1          ────────►  GEMM_systolic_top (32×16×2, W-Stationary)
S_AXI_HP2/HP3          ────────►  GEMV_top (4 cores × 32-MAC LUT, 5-stage tree)
S_AXIS_ACP_FMAP        ────────►  ┌──────────────────────────────────┐
M_AXIS_ACP_RESULT      ◄────────  │  Shared L2 Cache (URAM 1.75 MB)  │
                                  │  GEMV ──FIFO──► CVO_top (SFU)    │
                                  └──────────────────────────────────┘
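
The decoder's split of a 64-bit instruction word into an opcode and a body can be sketched as follows. The field widths and the numeric opcode values here are hypothetical, chosen only for illustration; the real encoding is specified in the ISA reference and is not reproduced here.

```python
from enum import IntEnum


class Opcode(IntEnum):
    # The five opcode names come from the design notes;
    # the numeric values are hypothetical.
    GEMV = 0
    GEMM = 1
    MEMCPY = 2
    MEMSET = 3
    CVO = 4


OPCODE_BITS = 4                      # hypothetical opcode field width
BODY_BITS = 64 - OPCODE_BITS         # remainder of the 64-bit word
BODY_MASK = (1 << BODY_BITS) - 1


def encode(opcode: Opcode, body: int) -> int:
    """Place the opcode in the top bits and the body in the low bits."""
    if not 0 <= body <= BODY_MASK:
        raise ValueError("body does not fit in the body field")
    return (int(opcode) << BODY_BITS) | body


def decode(word: int) -> tuple[Opcode, int]:
    """Split a 64-bit word; raises ValueError for an undefined opcode."""
    if not 0 <= word < 1 << 64:
        raise ValueError("instruction word must fit in 64 bits")
    return Opcode(word >> BODY_BITS), word & BODY_MASK


word = encode(Opcode.MEMCPY, 0x1234)
print(hex(word), decode(word))
```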
| Level | Technology | Size | Access |
|---|---|---|---|
| L1 (Activation row buffer) | Block RAM | per-core | Systolic / GEMV lanes |
| L2 (Shared cache) | URAM | 1.75 MB (114,688 × 128-bit) | All cores + mem_dispatcher |
| Weight stream | HP AXI port × 4 | DDR4 bandwidth | HP0/1 → GEMM, HP2/3 → GEMV |
| KV Cache | DDR4 (off-chip) | Up to 10–12 GB | ACP coherent port |
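
As a quick check of the L2 row, the capacity follows directly from the word count and word width. The address-width line is a derived observation, not a statement about the RTL's actual port widths.

```python
L2_WORDS = 114_688      # depth from the table
L2_WORD_BITS = 128      # width from the table

l2_bytes = L2_WORDS * L2_WORD_BITS // 8
l2_mib = l2_bytes / (1024 * 1024)

# Minimum number of address bits needed to index every word.
l2_addr_bits = (L2_WORDS - 1).bit_length()

print(f"{l2_bytes} bytes = {l2_mib} MiB, {l2_addr_bits}-bit word address")
```

The result shows that the "1.75 MB" in the table is a binary quantity (1.75 × 2^20 bytes).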
KV cache bandwidth wall: At 32K context (Gemma 3N E4B), the accumulated KV cache reaches ~1.31 GB. Mitigation: KV quantization (FP16→INT8/INT4), attention sink eviction, and a driver-enforced KV_MAX_TOKENS hard cap.
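
How KV quantization and a token cap interact with a memory budget can be sketched with the standard KV cache size formula. The model shape below is a placeholder, not the Gemma 3N E4B configuration, so it does not reproduce the ~1.31 GB figure. The helper names and the budget-based cap are hypothetical and do not describe the driver's actual policy.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bits_per_elem: int) -> int:
    """Size of keys plus values for `tokens` cached positions."""
    return 2 * layers * kv_heads * head_dim * tokens * bits_per_elem // 8


def max_tokens_for_budget(budget_bytes: int, layers: int, kv_heads: int,
                          head_dim: int, bits_per_elem: int) -> int:
    """Largest token count whose KV cache fits in `budget_bytes`."""
    bits_per_token = 2 * layers * kv_heads * head_dim * bits_per_elem
    return budget_bytes * 8 // bits_per_token


# Placeholder shape for illustration only (not a real model config).
SHAPE = dict(layers=32, kv_heads=4, head_dim=128)
CONTEXT = 32 * 1024

for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    size = kv_cache_bytes(CONTEXT, bits_per_elem=bits, **SHAPE)
    cap = max_tokens_for_budget(1 << 30, bits_per_elem=bits, **SHAPE)
    print(f"{name}: {size / 2**30:.2f} GiB at 32K, cap for 1 GiB = {cap}")
```

Each halving of the element width halves the cache size and doubles the number of tokens that fit in a fixed budget, which is why quantization and a hard cap complement each other.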
pccx/
├── conf.py / index.rst          # English Sphinx config & root toctree
├── ko/                          # Korean Sphinx subsite (ko-first authoring)
│   ├── conf.py
│   └── docs/                    # Korean documentation source
├── docs/                        # English documentation source
│   ├── v002/                    # Active architecture docs
│   │   ├── Architecture/        # Core design, DSP48E2, KV cache, rationale
│   │   ├── ISA/                 # 64-bit VLIW instruction set reference
│   │   ├── Drivers/             # Host API & driver documentation
│   │   └── RTL/                 # Embedded RTL source reference
│   └── archive/experimental_v001/
├── assets/images/               # Architecture diagrams (PNG)
├── _static/                     # JS/CSS (language switcher, Mermaid theme)
└── codes/
    ├── v001/hw/rtl/             # v001 RTL (archived, reference only)
    └── v002/                    # v002 RTL (CI-cloned from pccx-FPGA-NPU-LLM-kv260)
Two sibling repositories round out the pccx project:
- hwkim-dev/pccx-FPGA-NPU-LLM-kv260: active v002 SystemVerilog sources (CI-cloned into codes/v002/).
- hwkim-dev/pccx-lab: performance simulator and AI-integrated profiler (mounted under /en/lab/ and /ko/lab/ on the docs site).
As of 2026-04-20, pccx is developed along two parallel tracks. The tracks share RTL assets (sparse weight fetcher, SSD dispatcher, tree mask generator, EAGLE training pipeline); work on a long-term auto-porting compiler begins once both tracks are stable.
| Track | Target model | Goal | Horizon | Key phases |
|---|---|---|---|---|
| v002 Extended | Gemma 3N E4B | 20 tok/s measured | Week 1–49 | A–F baseline → G sparsity → H/H+ EAGLE-3 → I SSD → J Tree → K benchmark |
| v003 | Gemma 4 E4B | 12–15 tok/s | Week 16–52 (parallel) | 1 foundation → 2 EAGLE linear → 3 Tree → 4 SSD → 5 P-EAGLE + LTD |
| Auto-Porting α | Arbitrary Transformer | config.json → pccx ISA codegen | Week 53+ (Year 2) | Parser → Resolver → Feature plugin → C-stub emitter |
Compute budget: $70–100 total for EAGLE head training ($40 if a TRC TPU grant lands). Both tracks run on the same KV260 bitstream harness; v003 branches off after the v002 freeze.
→ Full roadmap (EN) · 한국어
Performance simulator and AI-integrated profiler, purpose-built for the pccx NPU. Pre-RTL bottleneck detection, UVM co-simulation, and LLM-driven testbench generation in one workflow.
- Repository: https://github.com/hwkim-dev/pccx-lab
- Documentation: https://hwkim-dev.github.io/pccx/en/lab/ (Korean: https://hwkim-dev.github.io/pccx/ko/lab/)
- Status: Work in Progress
The full technical documentation (architecture deep-dives, ISA encoding tables, DSP48E2 bit-packing derivation, driver API, and embedded RTL source) is published on the docs site.
Available in English and 한국어 (Korean).
Highlights:
- Architecture Overview: block diagram, design rationale, 3.125× gain breakdown
- DSP48E2 W4A8 Derivation: dual-channel bit packing math
- Custom ISA Reference: 64-bit VLIW encoding, opcode table, dataflow
- RTL Source Reference: embedded SystemVerilog with live syntax highlighting
pip install -r requirements.txt
sudo apt-get install graphviz # for Graphviz diagrams
# Clone v002 RTL (required for literalinclude directives)
git clone --depth 1 \
https://github.com/hwkim-dev/pccx-FPGA-NPU-LLM-kv260 \
codes/v002
# Build English site
sphinx-build -b html . _build/html/en
# Build Korean site
sphinx-build -b html ko _build/html/ko
# Serve locally
python -m http.server --directory _build/html
# → open http://localhost:8000/en/ or /ko/

| Pain point (v001) | v002 solution |
|---|---|
| Core role ambiguity (Matrix/Vector/CVO blurred) | Strict separation: GEMM / GEMV / SFU |
| Excessive intermediate bus paths | Shared L2 + direct-connect FIFO for GEMV→SFU |
| L2 ↔ Global Cache responsibility overlap | Single unified L2 (URAM) |
| Single HP port → one systolic array bottleneck | HP0/HP1 for GEMM, HP2/HP3 for GEMV (distributed) |
| 1 DSP = 1 MAC (bit headroom wasted) | Dual-channel packing → 1 DSP = 2 MACs |
| 250 MHz ceiling (AXI clock) | Decoupled 400 MHz core domain |
Licensed under the Apache License 2.0.
See CLAUDE.md for the full build & contribution guide.
Built by @hwkim-dev · Documentation · Issues