Educational GPU architecture exploration in SpinalHDL.
SpinalGPU implements a PTX subset ISA with a custom binary encoding executed by the hardware.
| Spec | Default | Config / derivation source |
|---|---|---|
| Physical SMs | 1 | GpuConfig.default.cluster.smCount |
| Max resident CTAs per SM | 1 | Current execution model in docs/architecture.md |
| Warp size | 32 | SmConfig.default.warpSize |
| Sub-SM partitions per SM | 4 | SmConfig.default.subSmCount |
| Resident warps per sub-SM | 2 | SmConfig.default.residentWarpsPerSubSm |
| Max resident warps per SM | 8 | residentWarpCount = subSmCount * residentWarpsPerSubSm |
| Max resident threads per SM | 256 | maxBlockThreads = residentWarpCount * warpSize |
| CUDA issue lanes per sub-SM | 32 | SmConfig.default.subSmIssueWidth |
| Active CUDA lanes per SM | 128 | subSmCount * subSmIssueWidth |
| Registers per thread | 32 | SmConfig.default.registerCount |
| Shared memory per SM | 4 KiB | SmConfig.default.sharedMemoryBytes |
| Shared memory banks | 32 | SmConfig.default.sharedMemoryBankCount |
| Tensor memory per warp | 512 B | SmConfig.default.tensorMemoryBytesPerWarp |
| Tensor memory per SM | 4 KiB | tensorMemoryBytes = tensorMemoryBytesPerWarp * residentWarpCount |
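The derived rows in the table follow directly from the defaults. As a cross-check, here is a small Python sketch; the constants mirror the Scala defaults, and the snake_case names are illustrative, not repo API:

```python
# Illustrative mirror of SmConfig.default; names are assumptions, not repo code.
WARP_SIZE = 32
SUB_SM_COUNT = 4
RESIDENT_WARPS_PER_SUB_SM = 2
SUB_SM_ISSUE_WIDTH = 32
TENSOR_MEMORY_BYTES_PER_WARP = 512

# Derived quantities, matching the table's derivation column.
resident_warp_count = SUB_SM_COUNT * RESIDENT_WARPS_PER_SUB_SM            # 8
max_block_threads = resident_warp_count * WARP_SIZE                       # 256
active_cuda_lanes = SUB_SM_COUNT * SUB_SM_ISSUE_WIDTH                     # 128
tensor_memory_bytes = TENSOR_MEMORY_BYTES_PER_WARP * resident_warp_count  # 4096 B = 4 KiB
```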
All rows below show two aggregation levels:
- Per sub-SM partition: one local `CudaCoreArray` issue slice
- Per physical SM: the sum across all 4 default sub-SM partitions
With SmConfig.default, each sub-SM partition has 32 CUDA issue lanes and covers one full 32-thread warp per issued instruction because warpSize == subSmIssueWidth == 32. If you want a sustained per-lane rate, divide the per-partition number by 32.
| Operation | Per sub-SM partition | Per physical SM (default) | Config / derivation source |
|---|---|---|---|
| FP32 add / mul | 8 FLOPs/cycle | 32 FLOPs/cycle | subSmIssueWidth(32) / fpAddLatency(4) per partition, then subSmCount(4) * ... per SM |
| FP32 FMA | 12.8 FLOPs/cycle | 51.2 FLOPs/cycle | subSmIssueWidth(32) * 2 / fpFmaLatency(5) per partition, then subSmCount(4) * ... per SM |
| FP16 scalar add / mul | 8 FLOPs/cycle | 32 FLOPs/cycle | subSmIssueWidth(32) / fp16ScalarLatency(4) per partition, then subSmCount(4) * ... per SM |
| FP16 scalar FMA | 16 FLOPs/cycle | 64 FLOPs/cycle | subSmIssueWidth(32) * 2 / fp16ScalarLatency(4) per partition, then subSmCount(4) * ... per SM |
| FP16 packed f16x2 add / mul | 16 FLOPs/cycle | 64 FLOPs/cycle | subSmIssueWidth(32) * 2 / fp16x2Latency(4) per partition, then subSmCount(4) * ... per SM |
| FP8 arithmetic | 0 FLOPs/cycle | 0 FLOPs/cycle | Current CUDA path implements FP8 conversion only |
These values describe GpuConfig.default; the source-of-truth configuration surfaces are GpuConfig.scala and SmConfig.scala. The terminology here stays architecture-accurate and the throughput rows reflect current implemented datapaths and sustained latency-gated behavior, not clock-based marketing peaks. FMA counts as 2 FLOPs.
Tensor note: legacy TensorCoreBlock and tcgen05 are implemented, but their current inner compute loops serialize one FP16 FMA per cycle per partition, so this README intentionally omits a headline tensor-throughput row.
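The throughput rows above can be reproduced with the latency-gated model stated in the derivation column. A minimal Python sketch; the function name is ours, and the latency values are taken from the table, not from repo code:

```python
# Sustained latency-gated throughput: one instruction issues across
# `issue_width` lanes every `latency` cycles; FMA counts as 2 FLOPs.
def sustained_flops_per_cycle(issue_width, flops_per_op, latency, sub_sm_count=1):
    return sub_sm_count * issue_width * flops_per_op / latency

# Per sub-SM partition (sub_sm_count = 1):
fp32_add = sustained_flops_per_cycle(32, 1, 4)       # 8.0, fpAddLatency = 4
fp32_fma = sustained_flops_per_cycle(32, 2, 5)       # 12.8, fpFmaLatency = 5
# Per physical SM (4 partitions):
fp16_fma_sm = sustained_flops_per_cycle(32, 2, 4, sub_sm_count=4)  # 64.0
# Sustained per-lane rate: divide the per-partition number by the 32 lanes.
fp32_add_per_lane = fp32_add / 32                    # 0.25
```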
Status labels below match the support meanings in docs/isa.md: Implemented, Partial, and Rejected.
| Area | Status | Current scope | Details |
|---|---|---|---|
| Execution and launch model | Partial | Classic SIMT v1 with real 3D CTA grids | One kernel globally in flight, one resident CTA per SM, and no divergent reconvergence |
| PTX module and entry contract | Partial | Repo PTX subset only | One .visible .entry per file, .param .u32 only, lowered into the custom SpinalGPU machine encoding |
| CUDA-core arithmetic | Implemented | INT32, FP32, FP16, FP16x2, and packed FP8 conversion | The documented CUDA-core subset executes on the current CudaCoreArray path |
| Special-function unit | Implemented | Unary .approx FP32 plus FP16 / FP16x2 ex2 and tanh | The current SFU surface is intentionally limited to the documented unary subset |
| Memory surface | Partial | .param, .global, and .shared only | Global traffic supports aligned 16-bit and 32-bit accesses; shared-memory PTX is still word-oriented only |
| Special registers and grid builtins | Implemented | Core SIMT, SM, and grid builtins are real | %gridid remains a narrow .u64 path even though %tid/%ntid/%ctaid/%nctaid/%smid/%nsmid are live |
| Tensor surface | Partial | Legacy FP16 tensor v1 plus narrow FP16 tcgen05 | Broad NVIDIA tensor families remain intentionally out of scope |
| Explicitly unsupported PTX families | Rejected | Major PTX families stay out of scope in v1 | .const, .local, atomics, barriers, calls, BF16/FP64, broad wmma.*, broad tcgen05, and generic compiler-generated PTX are rejected |
Partial here means “implemented for a narrower educational subset than full PTX,” not “broken.” See docs/isa.md and docs/architecture.md for the full compatibility and execution-model detail.
- JDK 17
- `sbt` on your `PATH`
- `verilator` on your `PATH`
If you install openjdk@17 via Homebrew on macOS, you may need:
```sh
export PATH="/opt/homebrew/opt/openjdk@17/bin:/opt/homebrew/bin:$PATH"
export JAVA_HOME="/opt/homebrew/opt/openjdk@17/libexec/openjdk.jdk/Contents/Home"
```

From the repository root:
```sh
sbt compile
sbt devTest
sbt refreshKernels
sbt smokeTest
sbt multiSmSmoke
sbt test
sbt run
```

- `sbt devTest`: fast structural, unit, and controller regression loop; does not rebuild the kernel corpus
- `sbt refreshKernels`: rebuilds `generated/kernels/**` from `kernels/**/*.ptx`
- `sbt smokeTest`: curated SM and `GpuTop` execution smoke suite; does not rebuild the kernel corpus
- `sbt multiSmSmoke`: curated multi-SM `GpuTop` (`smCount = 2`) smoke suite; does not rebuild the kernel corpus
- `sbt test`: full regression and automatic kernel-corpus rebuild
Use the small wrappers in scripts/ for the common inner loop:
```sh
./scripts/build-kernels.sh
./scripts/test-fast.sh
./scripts/test-watch.sh
./scripts/gen-verilog.sh
./scripts/check.sh
```

- `./scripts/build-kernels.sh`: compiles `kernels/**/*.ptx` into raw `.bin` files under `generated/kernels`
- `./scripts/test-fast.sh`: rebuilds kernels and runs the default `sbt smokeTest` workflow
- `./scripts/test-fast.sh spinalgpu.SomeOtherSpec`: runs one specific test spec
- `./scripts/test-watch.sh`: reruns the default `sbt smokeTest` workflow whenever sources change
- `./scripts/gen-verilog.sh`: regenerates Verilog into `generated/verilog`
- `./scripts/check.sh`: runs the full `compile -> test -> run` contract
- SM architecture note
- PTX subset ISA reference
- Machine encoding reference
- SM overview diagram
- Dispatch and dataflow diagram
- Memory hierarchy and AXI boundary diagram
- Launch and frontend execution diagram
- Repo agent guidelines
- `GpuConfig` is the chip-level configuration used by `GpuTop`, `GpuCluster`, host control, and the single-SM compatibility wrapper.
- `SmConfig` is the SM-local configuration used by execution-core modules such as sub-SMs, caches, shared memory, LSU, and CUDA cores.
- `GpuConfig.default` preserves the default single-SM repo contract by setting `cluster.smCount = 1` and `sm = SmConfig.default`.
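The chip-level / SM-local split can be pictured with a small Python analogue; the field names only approximate the Scala configs and the chip-total arithmetic is our illustration, not repo behavior:

```python
from dataclasses import dataclass

# Python analogue of the GpuConfig / SmConfig split; illustrative only.
@dataclass(frozen=True)
class SmConfigSketch:
    warp_size: int = 32
    sub_sm_count: int = 4
    resident_warps_per_sub_sm: int = 2

@dataclass(frozen=True)
class GpuConfigSketch:
    sm_count: int = 1                      # single-SM default contract
    sm: SmConfigSketch = SmConfigSketch()  # SM-local config is nested, not duplicated

default = GpuConfigSketch()
# Chip-level totals scale SM-local values by sm_count.
total_resident_warps = (default.sm_count
                        * default.sm.sub_sm_count
                        * default.sm.resident_warps_per_sub_sm)
```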
- Canonical PTX subset kernel source files live under `kernels/`.
- CUDA learning kernels live under `kernels/cuda/` as source-only study material.
- Generated raw machine-code binaries are written under `generated/kernels/`.
- The corpus is organized by primary feature: `kernels/arithmetic/`, `kernels/control/`, `kernels/global_memory/`, `kernels/shared_memory/`, `kernels/sfu/`, `kernels/special_registers/`
`kernels/cuda/` is intentionally separate from the executable PTX corpus:

- `.cu` files are not compiled by `sbt refreshKernels`
- `.cu` files are not loaded by `KernelCorpus` or the simulation harnesses
- `.cu` files are not covered by repo automation in this first milestone
- `src/main/scala/spinalgpu/toolchain/KernelCorpus.scala` is the single source of truth for PTX source paths, generated binary paths, launch config, preload image, expected outcome, and harness coverage.
- `src/test/scala/spinalgpu/KernelCorpusTestUtils.scala` is the shared runner used by the corpus-backed simulation specs.
- Success vs fault classification is stored in typed declarative expectations, not in directory names.
- The current harness split is:
  - `GpuTop`: AXI-Lite/AXI boundary smoke
  - `StreamingMultiprocessor`: full externalized kernel corpus
  - `ExecutionFrontendSimSpec`: grouped top-level kernel suites for full, smoke, and multi-SM execution coverage
- Recommended flow:
  - use `sbt devTest` for fast non-execution iteration
  - use `sbt refreshKernels` after PTX or PTX-toolchain changes
  - use `sbt smokeTest` and `sbt multiSmSmoke` for curated execution checks
  - use `sbt test` for the full regression contract
Each teaching kernel follows one strict file template:
- `// Purpose`, `// Primary feature`, and `// Expected outcome` header lines
- `.version`, `.target`, `.address_size`
- one `.visible .entry` signature with a `.param` list
- declarations ordered as `.reg`, then `.pred`, then `.shared`
- `// Setup`, `// Core`, and `// Exit` or `// Fault trigger` sections
“Layers” or “modules” in this repo mean directory structure and file sections for readability. They do not imply PTX includes, macros, imports, or multi-entry PTX modules.
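Because the template is strict, it can be checked mechanically. Here is a hypothetical lint sketch; the repo does not ship this checker, and it only illustrates the ordering rules for the header lines and module directives:

```python
# Hypothetical checker for the teaching-kernel template; illustrative only.
HEADER_PREFIXES = ["// Purpose", "// Primary feature", "// Expected outcome"]
DIRECTIVE_ORDER = [".version", ".target", ".address_size", ".visible .entry"]

def follows_template(ptx_text: str) -> bool:
    lines = [line.strip() for line in ptx_text.splitlines() if line.strip()]
    if len(lines) < len(HEADER_PREFIXES):
        return False
    # The three header comment lines come first, in order.
    for line, prefix in zip(lines, HEADER_PREFIXES):
        if not line.startswith(prefix):
            return False
    # Module directives appear after the header, in the required order.
    body = "\n".join(lines[len(HEADER_PREFIXES):])
    positions = [body.find(tok) for tok in DIRECTIVE_ORDER]
    return all(p >= 0 for p in positions) and positions == sorted(positions)

ok = follows_template("""\
// Purpose: demo
// Primary feature: arithmetic
// Expected outcome: success
.version 8.0
.target sm_90
.address_size 64
.visible .entry demo(.param .u32 p0) { ret; }
""")
```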
- `sbt compile` resolves dependencies and compiles the SpinalHDL sources plus the PTX toolchain.
- `sbt devTest` runs the fast structural, unit, and controller suites without rebuilding the kernel corpus.
- `sbt refreshKernels` rebuilds the PTX kernel corpus under `generated/kernels`.
- `sbt smokeTest` runs the curated SM and `GpuTop` execution smoke suites without rebuilding the kernel corpus.
- `sbt multiSmSmoke` runs the curated multi-SM `GpuTop` smoke suite without rebuilding the kernel corpus.
- `sbt test` rebuilds the PTX kernel corpus and then runs the full regression suite.
- `sbt run` elaborates `GpuTop` and emits Verilog into `generated/verilog`.
- `src/main/scala/spinalgpu/GpuTop.scala`: top-level wrapper exposing AXI4 memory and AXI-Lite control.
- `src/main/scala/spinalgpu/GenerateGpuTop.scala`: Verilog generator entrypoint used by `sbt run`.
- `src/main/scala/spinalgpu/StreamingMultiprocessor.scala`: launch, fetch, decode, issue, and writeback frontend for one SM.
- `src/main/scala/spinalgpu/toolchain/PtxAssembler.scala`: PTX subset compiler that lowers PTX source into SpinalGPU machine words.
- `src/main/scala/spinalgpu/toolchain/KernelBinaryIO.scala`: raw binary read/write helpers for generated machine code.
- `src/main/scala/spinalgpu/toolchain/BuildKernelCorpus.scala`: batch compiler for the kernel corpus.
- `src/main/scala/spinalgpu/toolchain/KernelCorpus.scala`: single declarative kernel corpus definition for source paths, launch config, preload image, and expectations.
- `src/test/scala/spinalgpu/GpuTopSimSpec.scala`: top-level SpinalSim smoke test with AXI memory and AXI-Lite control.
- `src/test/scala/spinalgpu/ExecutionFrontendSimSpec.scala`: grouped `GpuTop` full, smoke, and multi-SM execution suites.
- `src/test/scala/spinalgpu/StreamingMultiprocessorSimSpec.scala`: grouped SM wrapper integration suite for launch, trap, and delayed-drain behavior.
- `src/test/scala/spinalgpu/StreamingMultiprocessorKernelCorpusSpec.scala`: grouped full and smoke kernel-corpus SM execution suites.
- `src/test/scala/spinalgpu/KernelCorpusTestUtils.scala`: shared preload, launch, and assertion helpers for corpus-backed simulation specs.
- `src/test/scala/spinalgpu/KernelCorpusSpec.scala`: integrity test that keeps `kernels/` and the declarative corpus in sync.
- `src/test/scala/spinalgpu/IsaSpec.scala`: machine-encoding encoder, decoder, and disassembler tests.
- `src/test/scala/spinalgpu/PtxAssemblerSpec.scala`: PTX subset compiler and binary-emission tests.
- `src/test/scala/spinalgpu/ArchitectureSkeletonSpec.scala`: elaboration sweep and doc/diagram checks.
- `kernels/`: feature-organized PTX subset corpus for end-to-end execution tests.
- `kernels/cuda/`: source-only CUDA learning ladder for performance-study kernels, organized from foundations to LLM-style building blocks.
- `generated/kernels/`: generated raw binaries loaded by the simulation harnesses.
- `scripts/`: small local workflow helpers for fast test, watch mode, Verilog generation, and full checks.
- `docs/`: architecture notes, ISA reference, machine-encoding reference, and Mermaid diagrams for the SM/frontend design.
- `AGENTS.md`: repo-specific Codex guidance for SpinalHDL architecture and testing conventions.
- The public software-visible ISA is a PTX subset, not the internal machine encoding.
- The current hardware still executes fixed-width 32-bit SpinalGPU machine words loaded from raw `.bin` files.
- The PTX compiler is intentionally educational and correctness-first, not a full `nvcc`-compatible PTX implementation.
- The CUDA learning ladder is documentation-plus-source only; it becomes runnable later in a CUDA-capable environment, but it does not change the default repo contract today.