hwkim-dev/pccx

pccx: Parallel Compute Core eXecutor

A scalable NPU architecture for Transformer LLM inference on edge FPGAs


📖 Full Documentation →


What is pccx?

pccx is a hardware-software co-design framework that accelerates autoregressive decoding of Transformer-based LLMs on resource-constrained edge devices. The primary target is the Xilinx Kria KV260 SOM.

Rather than reusing a generic matrix accelerator, pccx is designed around the actual bottleneck of LLM decoding: memory-bandwidth-bound GEMV, not compute-bound GEMM. The architecture separates matrix (GEMM) and vector (GEMV) datapaths, supplies weights through dedicated HP AXI ports, and uses a custom 64-bit VLIW ISA to eliminate dispatch stalls.
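The bandwidth claim can be checked with a back-of-the-envelope arithmetic-intensity calculation. The sketch below uses illustrative dimensions (4096-wide layers, a 512-token prefill batch), not measured pccx figures:

```python
# Arithmetic intensity (MACs per weight byte fetched) for GEMV vs GEMM.
# Dimensions are illustrative, not measured pccx numbers.

def gemv_intensity(n: int, weight_bytes: float) -> float:
    """y = W @ x with W of shape (n, n): n*n MACs, n*n weights read once."""
    macs = n * n
    bytes_moved = n * n * weight_bytes
    return macs / bytes_moved

def gemm_intensity(n: int, batch: int, weight_bytes: float) -> float:
    """Y = W @ X with X of shape (n, batch): each weight is reused batch times."""
    macs = n * n * batch
    bytes_moved = n * n * weight_bytes
    return macs / bytes_moved

# Decode (batch 1, INT4 weights = 0.5 byte): only 2 MACs per weight byte.
print(gemv_intensity(4096, 0.5))       # -> 2.0
# Prefill (512-token batch): 1024 MACs per weight byte.
print(gemm_intensity(4096, 512, 0.5))  # -> 1024.0
```

At batch 1 every weight byte fetched from DDR is used exactly once (twice with INT4 packing), so decode throughput tracks DDR bandwidth no matter how many MACs are on chip, which is why the GEMV datapath gets its own dedicated HP ports.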


Architecture (v002)

| Core | Configuration | Peak Throughput | Primary Use |
| --- | --- | --- | --- |
| GEMM (Matrix) | 32 × 32 systolic array (cascade split @ row 16) | 819 GMAC/s @ 400 MHz | Prefill, Q·Kᵀ, score·V |
| GEMV (Vector) | 4 cores × 32-MAC LUT pipeline + 5-stage reduction tree | Weight-streaming limited (~51.2 GMAC/s @ 400 MHz) | Autoregressive decoding |
| SFU / CVO | CORDIC + LUT hybrid | BF16 / FP32 promoted | Softmax, GELU, RMSNorm, RoPE |
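The peak figures follow directly from the stated configurations. A quick cross-check, assuming 2 MACs per systolic cell per cycle (the dual-MAC DSP packing listed under the key design decisions):

```python
# Cross-check of the table's peak-throughput figures: GMAC/s = MAC/cycle * GHz.
f_core_ghz = 0.4                      # 400 MHz core clock

gemm_peak = 32 * 32 * 2 * f_core_ghz  # 32x32 array, 2 MACs per cell (dual-MAC DSP)
gemv_peak = 4 * 32 * f_core_ghz       # 4 cores x 32 MAC lanes each

print(gemm_peak)  # -> 819.2  (the table's "819 GMAC/s")
print(gemv_peak)  # -> 51.2   (the table's "~51.2 GMAC/s", if weights arrive fast enough)
```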

Key design decisions:

  • W4A8 precision: INT4 weights × INT8 activations via DSP48E2 dual-channel bit packing (1 DSP = 2 MACs)
  • Precision promotion: non-linear ops (Softmax, GELU, RMSNorm, RoPE) automatically upcast to BF16/FP32 for numerical stability
  • Custom 64-bit VLIW ISA: 5 opcodes (GEMV, GEMM, MEMCPY, MEMSET, CVO); decoupled decode/dispatch eliminates front-end stalls
  • Shared L2 (URAM, 1.75 MB): all three cores share a central SRAM cache; GEMV↔SFU are connected via a direct FIFO, bypassing L2 round-trips
  • Dual clock domains: 250 MHz AXI/control plane, 400 MHz core compute (×1.6 frequency gain over v001)
  • 3.125× total throughput gain vs. v001 (frequency × dual-MAC DSP packing)
```
External AXI (250 MHz)          Core Domain (400 MHz)
─────────────────────           ──────────────────────────────────────────────────────
S_AXIL_CTRL (HPM)    ────────►  npu_controller_top
                                  ├─ ctrl_npu_decoder   (64-bit VLIW → opcode + body)
S_AXI_HP0/HP1        ────────►  GEMM_systolic_top      (32×16×2, W-Stationary)
S_AXI_HP2/HP3        ────────►  GEMV_top               (4 cores × 32-MAC LUT, 5-stage tree)
S_AXIS_ACP_FMAP      ────────►  ┌─────────────────────────────────┐
M_AXIS_ACP_RESULT    ◄────────  │ Shared L2 Cache (URAM 1.75 MB)  │
                                │ GEMV ──FIFO──► CVO_top (SFU)    │
                                └─────────────────────────────────┘
```
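The diagram's ctrl_npu_decoder splits every 64-bit VLIW word into an opcode and a body before dispatching it to a unit. The field layout is not specified in this README, so the decode sketch below assumes a hypothetical 4-bit opcode in the top bits purely to illustrate the decode/dispatch split; the real pccx encoding may differ:

```python
# Hypothetical decode of a 64-bit VLIW word into opcode + body.
# The 4-bit opcode field and its position are assumptions for illustration.

OPCODES = {0: "GEMV", 1: "GEMM", 2: "MEMCPY", 3: "MEMSET", 4: "CVO"}
OP_BITS = 4  # assumed opcode width

def decode(word: int) -> tuple:
    opcode = word >> (64 - OP_BITS)              # top bits select the target unit
    body = word & ((1 << (64 - OP_BITS)) - 1)    # remaining 60 bits carry operands
    return (OPCODES[opcode], body)

word = (1 << 60) | 0xABC   # opcode 1 (GEMM) with an arbitrary body
print(decode(word))        # -> ('GEMM', 2748)
```

Because decode (field extraction) is trivial and independent of dispatch (routing the body to a busy/free unit), the two stages can be pipelined separately, which is how the decoupled front end avoids dispatch stalls.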

Memory Hierarchy

| Level | Technology | Size | Access |
| --- | --- | --- | --- |
| L1 (activation row buffer) | Block RAM | Per-core | Systolic / GEMV lanes |
| L2 (shared cache) | URAM | 1.75 MB (114,688 × 128-bit) | All cores + mem_dispatcher |
| Weight stream | HP AXI port × 4 | DDR4 bandwidth | HP0/1 → GEMM, HP2/3 → GEMV |
| KV cache | DDR4 (off-chip) | Up to 10–12 GB | ACP coherent port |

KV cache bandwidth wall: at 32K context (Gemma 3N E4B), the accumulated KV cache reaches ~1.31 GB. Mitigations: KV quantization (FP16 → INT8/INT4), attention-sink eviction, and a driver-enforced KV_MAX_TOKENS hard cap.
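KV-cache growth is easy to estimate with the standard formula (2 tensors, K and V, per layer per KV head per token). The layer/head/dim values below are hypothetical round numbers for illustration, not Gemma 3N E4B's actual configuration:

```python
# Generic KV-cache size estimator (grouped-query attention).
# The example configuration is hypothetical, not Gemma 3N E4B's.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # 2x for K and V, accumulated per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(24, 4, 128, 32 * 1024, 2)    # FP16 = 2 bytes/elem
int4 = kv_cache_bytes(24, 4, 128, 32 * 1024, 0.5)  # INT4 = 0.5 byte/elem

print(fp16 / 2**30)  # -> 1.5    GiB at FP16 and 32K context
print(int4 / 2**30)  # -> 0.375  GiB after INT4 KV quantization
```

The 4× reduction from FP16 to INT4 applies equally to the per-token read bandwidth during decoding, which is why KV quantization is listed ahead of eviction as a mitigation.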


Repository Layout

```
pccx/
├── conf.py / index.rst          # English Sphinx config & root toctree
├── ko/                          # Korean Sphinx subsite (ko-first authoring)
│   ├── conf.py
│   └── docs/                    # Korean documentation source
├── docs/                        # English documentation source
│   ├── v002/                    # Active architecture docs
│   │   ├── Architecture/        # Core design, DSP48E2, KV cache, rationale
│   │   ├── ISA/                 # 64-bit VLIW instruction set reference
│   │   ├── Drivers/             # Host API & driver documentation
│   │   └── RTL/                 # Embedded RTL source reference
│   └── archive/experimental_v001/
├── assets/images/               # Architecture diagrams (PNG)
├── _static/                     # JS/CSS (language switcher, Mermaid theme)
└── codes/
    ├── v001/hw/rtl/             # v001 RTL (archived, reference only)
    └── v002/                    # v002 RTL (CI-cloned from pccx-FPGA-NPU-LLM-kv260)
```

Two sibling repositories round out the pccx project: pccx-FPGA-NPU-LLM-kv260 (the v002 RTL source, CI-cloned into codes/v002/) and pccx-lab (the simulator and AI profiler described under Ecosystem below).


Roadmap: Two-Track + Auto-Porting α

pccx is developed along two parallel tracks as of 2026-04-20. The tracks share RTL assets (sparse weight fetcher, SSD dispatcher, tree mask generator, EAGLE training pipeline); a long-term auto-porting compiler begins once both tracks are stable.

| Track | Target model | Goal | Horizon | Key phases |
| --- | --- | --- | --- | --- |
| v002 Extended | Gemma 3N E4B | 20 tok/s measured | Week 1–49 | A–F baseline → G sparsity → H/H+ EAGLE-3 → I SSD → J Tree → K benchmark |
| v003 | Gemma 4 E4B | 12–15 tok/s | Week 16–52 (parallel) | 1 foundation → 2 EAGLE linear → 3 Tree → 4 SSD → 5 P-EAGLE + LTD |
| Auto-Porting α | Arbitrary Transformer | config.json → pccx ISA codegen | Week 53+ (Year 2) | Parser → Resolver → Feature plugin → C-stub emitter |

Compute budget: $70–100 total for EAGLE head training ($40 if a TRC TPU grant lands). Both tracks run on the same KV260 bitstream harness; v003 branches off after the v002 freeze.

→ Full roadmap (EN) · 한국어


Ecosystem

pccx-lab: Simulator & AI Profiler

Performance simulator and AI-integrated profiler, purpose-built for the pccx NPU. Pre-RTL bottleneck detection, UVM co-simulation, and LLM-driven testbench generation in one workflow.


Documentation

The full technical documentation (architecture deep dives, ISA encoding tables, DSP48E2 bit-packing derivation, driver API, and embedded RTL source) is published at:

Available in English and 한국어 (Korean).


Building the Docs Locally

```shell
pip install -r requirements.txt
sudo apt-get install graphviz   # for Graphviz diagrams

# Clone v002 RTL (required for literalinclude directives)
git clone --depth 1 \
  https://github.com/hwkim-dev/pccx-FPGA-NPU-LLM-kv260 \
  codes/v002

# Build English site
sphinx-build -b html . _build/html/en

# Build Korean site
sphinx-build -b html ko _build/html/ko

# Serve locally
python -m http.server --directory _build/html
# → open http://localhost:8000/en/ or /ko/
```

v001 → v002 Migration

| Pain point (v001) | v002 solution |
| --- | --- |
| Core role ambiguity (Matrix/Vector/CVO blurred) | Strict separation: GEMM / GEMV / SFU |
| Excessive intermediate bus paths | Shared L2 + direct-connect FIFO for GEMV↔SFU |
| L2 ↔ Global Cache responsibility overlap | Single unified L2 (URAM) |
| Single HP port → one systolic-array bottleneck | HP0/HP1 for GEMM, HP2/HP3 for GEMV (distributed) |
| 1 DSP = 1 MAC (bit headroom wasted) | Dual-channel packing → 1 DSP = 2 MACs |
| 250 MHz ceiling (AXI clock) | Decoupled 400 MHz core domain |

License

Licensed under the Apache License 2.0.


See CLAUDE.md for the full build & contribution guide.


Built by @hwkim-dev · Documentation · Issues
