OpenGPU

A fully featured GPU implementation in SystemVerilog, designed from the ground up to teach modern GPU architecture.

Built with comprehensive documentation, a complete 60+ instruction ISA, working compute kernels, and full simulation support with execution traces.

Overview

GPUs power everything from gaming to AI training, yet learning how they actually work at a hardware level is surprisingly difficult.

While CPU architecture is well-documented with countless tutorials and open-source implementations, GPU internals remain largely proprietary. The few open-source GPU projects that exist prioritize feature-completeness over educational clarity, making them challenging to learn from.

OpenGPU changes this.

What is OpenGPU?

Important: OpenGPU is a complete GPU implementation optimized for understanding modern GPU architecture from first principles.

Rather than focusing on graphics-specific hardware, OpenGPU emphasizes the general-purpose compute architecture shared by GPUs, ML accelerators, and tensor processors.

This project explores three fundamental aspects of GPU design:

  1. Parallel Architecture — How do GPUs achieve massive parallelism with thousands of concurrent threads?
  2. SIMT Execution — How does Single-Instruction Multiple-Thread execution work in hardware?
  3. Memory Hierarchy — How do GPUs overcome the memory bandwidth wall?

Specifications

Component                   Specification
Streaming Multiprocessors   8
Cores per SM                4
Total Cores                 32
Threads per Warp            32
Max Concurrent Threads      4,096
Data Width                  32-bit
Instruction Width           32-bit
Pipeline Depth              5 stages
L1 I-Cache                  16 KB per SM
L1 D-Cache                  32 KB per SM
Shared Memory               64 KB per SM
L2 Cache                    512 KB (shared)
Register File               64 KB per SM
Instructions                60+
Floating Point              IEEE 754 FP32
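
These figures compose directly. As a quick sanity check, the Python sketch below reproduces the 4,096-thread ceiling; WARPS_PER_CORE = 4 is taken from the Phase 3 roadmap later in this README.

# Reproduce "Max Concurrent Threads" from the table above.
# WARPS_PER_CORE = 4 comes from the Phase 3 roadmap section.
NUM_SMS          = 8
CORES_PER_SM     = 4
WARPS_PER_CORE   = 4
THREADS_PER_WARP = 32

max_threads = NUM_SMS * CORES_PER_SM * WARPS_PER_CORE * THREADS_PER_WARP
assert max_threads == 4096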

Architecture

GPU Top Level

[Figure: GPU top-level architecture diagram]

OpenGPU is organized as a scalable array of Streaming Multiprocessors (SMs) connected through a crossbar to a shared L2 cache and memory controllers.

Components

Host Interface
Receives kernel launch commands and parameters from the host CPU via an AXI-Lite interface. Stores kernel metadata including grid dimensions, block dimensions, and the kernel entry point.

Global Scheduler
Manages the distribution of thread blocks across available SMs. Tracks block completion and signals kernel done when all blocks finish.

Streaming Multiprocessors (SMs)
The primary compute units. Each SM can execute multiple thread blocks concurrently, limited by available resources (registers, shared memory, warp slots).

L2 Cache
Shared cache between all SMs. Reduces memory bandwidth requirements by caching frequently accessed data. Split into multiple slices for parallel access.

Memory Controllers
Interface between the GPU and external GDDR memory. Handle request arbitration, address mapping, and data transfer scheduling.


Streaming Multiprocessor

[Figure: SM architecture diagram]

Each SM is a self-contained compute unit capable of executing hundreds of threads concurrently.

SM Controller

Manages thread block assignment to the SM. Tracks resource allocation (registers, shared memory, warp slots) and signals the global scheduler when blocks complete.

Warp Scheduler

Selects which warp to execute each cycle using the Greedy-Then-Oldest (GTO) scheduling policy:

  1. If the currently executing warp is ready, continue with it (greedy)
  2. Otherwise, select the oldest ready warp
  3. If no warps are ready, stall

The scheduler maintains a scoreboard tracking pending register writes to detect Read-After-Write (RAW) hazards.
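
A minimal Python model of such a scoreboard, assuming one pending-write bit per architectural register; class and method names are illustrative, not taken from the RTL.

# One pending-write bit per register; a set bit means a write is
# in flight and readers must wait (or take a forwarding path).
class Scoreboard:
    def __init__(self, num_regs=32):
        self.pending = [False] * num_regs

    def issue(self, rd):
        if rd != 0:                 # x0 is hardwired, never pending
            self.pending[rd] = True

    def writeback(self, rd):
        self.pending[rd] = False

    def raw_hazard(self, rs1, rs2):
        return self.pending[rs1] or self.pending[rs2]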

Compute Cores

Each SM contains 4 compute cores. Every core has:

  • SIMT Stack — 8-entry stack for handling branch divergence
  • Register File — 32 registers × 32 threads × 32 bits = 4 KB
  • 5-Stage Pipeline — IF → ID → EX → MEM → WB
  • Integer ALU — 32-wide for parallel thread execution
  • FP ALU — 32-wide IEEE 754 single precision

Shared Resources

Resource          Size    Details
L1 I-Cache        16 KB   4-way associative, 64 B lines
L1 D-Cache        32 KB   4-way associative, 128 B lines
Shared Memory     64 KB   32 banks, 4 bytes each
Coalescing Unit           Combines memory requests

Compute Core

Each core executes one warp (32 threads) at a time. The core architecture resembles a traditional CPU, but with 32 parallel datapaths sharing a single instruction stream.

                    ┌─────────────────────────────────────┐
                    │           WARP STATE                │
                    │  ┌────────┐ ┌────────┐ ┌─────────┐  │
                    │  │   PC   │ │  Mask  │ │  Stack  │  │
                    │  │ 32-bit │ │ 32-bit │ │ 8-entry │  │
                    │  └────────┘ └────────┘ └─────────┘  │
                    └─────────────────────────────────────┘
                                     │
                    ┌────────────────▼────────────────────┐
                    │         THREAD LANES (×32)          │
                    │  ┌──────┐┌──────┐┌──────┐   ┌──────┐│
                    │  │Lane 0││Lane 1││Lane 2│...│Lane31││
                    │  ├──────┤├──────┤├──────┤   ├──────┤│
                    │  │32 reg││32 reg││32 reg│   │32 reg││
                    │  │ ALU  ││ ALU  ││ ALU  │   │ ALU  ││
                    │  └──────┘└──────┘└──────┘   └──────┘│
                    └─────────────────────────────────────┘

Pipeline

[Figure: pipeline architecture diagram]

Stages

IF — Instruction Fetch

  • Access L1 instruction cache using warp program counter
  • Handle cache misses by stalling and fetching from L2
  • Predict branches (simple static prediction: not taken)

ID — Instruction Decode

  • Decode 32-bit instruction into control signals
  • Read source operands from register file (32 parallel reads)
  • Generate immediate values with sign/zero extension
  • Check scoreboard for data hazards

EX — Execute

  • Perform ALU operations on all 32 threads in parallel
  • Evaluate branch conditions
  • Calculate memory addresses for load/store
  • Handle multi-cycle operations (MUL, DIV, FP)

MEM — Memory Access

  • Access L1 data cache for loads and stores
  • Coalesce memory requests from 32 threads
  • Handle cache misses
  • Sign/zero extend loaded data

WB — Write Back

  • Write results to register file (32 parallel writes)
  • Update scoreboard to release pending registers
  • Signal instruction completion

Hazard Handling

Hazard Type              Detection           Resolution
RAW (Read-After-Write)   Scoreboard          Forwarding or stall
Control (Branch)         Branch unit         Flush + redirect
Structural               Resource tracking   Stall

Data Forwarding

Three forwarding paths minimize pipeline stalls:

EX → EX    ALU result available immediately for next instruction
MEM → EX   Load data available after 1 cycle delay
WB → ID    Through register file (if forwarding not possible)
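
One way to read these paths is as a priority mux on each source operand, newest value first. Below is a behavioral Python sketch under that assumption; all names are illustrative.

# Pick the newest available value for source register rs.
def select_operand(rs, regfile, ex_dst, ex_val, mem_dst, mem_val):
    if rs != 0 and rs == ex_dst:    # EX -> EX: freshest ALU result
        return ex_val
    if rs != 0 and rs == mem_dst:   # MEM -> EX: load data, one cycle later
        return mem_val
    return regfile[rs]              # WB -> ID: through the register file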

Memory System

Hierarchy

┌─────────────────────────────────────────────────────────────┐
│                     MEMORY HIERARCHY                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  REGISTERS         4 KB/core     1 cycle      Per-thread    │
│       │                                                     │
│       ▼                                                     │
│  L1 CACHE          48 KB/SM      ~10 cycles   Per-SM        │
│  (I$ + D$)                                                  │
│       │                                                     │
│       ▼                                                     │
│  SHARED MEM        64 KB/SM      ~20 cycles   Per-SM        │
│  (Software managed)                                         │
│       │                                                     │
│       ▼                                                     │
│  L2 CACHE          512 KB        ~100 cycles  Shared        │
│       │                                                     │
│       ▼                                                     │
│  GLOBAL MEM        4 GB          ~400 cycles  Off-chip      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Address Space

Range        Region     Size     Description
0x00000000   Global     256 MB   Main GPU memory
0x20000000   Constant   16 MB    Read-only, cached
0x30000000   Shared     1 MB     Per-SM local view
0x40000000   Local      256 MB   Per-thread stack
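
A small Python sketch of the decode this table implies, using the listed bases and sizes; the function name is illustrative.

# Map an address to its region using the table's bases and sizes.
REGIONS = [
    ("global",   0x0000_0000, 256 << 20),
    ("constant", 0x2000_0000,  16 << 20),
    ("shared",   0x3000_0000,   1 << 20),
    ("local",    0x4000_0000, 256 << 20),
]

def region_of(addr):
    for name, base, size in REGIONS:
        if base <= addr < base + size:
            return name
    return "unmapped"

assert region_of(0x2000_0040) == "constant"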

Shared Memory

Shared memory (Local Data Share) enables fast data sharing between threads in the same block.

Organization: 64 KB split into 32 banks of 2 KB each

Bank Assignment: bank = (address >> 2) & 0x1F

Access Patterns:

Pattern           Example                  Cycles
No conflict       Stride-1 access          1
Broadcast         All threads, same addr   1
2-way conflict    Stride-2 access          2
32-way conflict   Stride-32 access         32

(Strides are measured in 4-byte words.)
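
These conflict degrees fall directly out of the bank-assignment formula. A quick Python check (the stride helper, which generates one word address per lane, is illustrative):

# Conflict degree = most accesses landing in any one bank, after
# merging identical addresses (those broadcast in a single cycle).
from collections import Counter

def conflict_degree(addrs):
    banks = Counter((addr >> 2) & 0x1F for addr in set(addrs))
    return max(banks.values())

def stride(s):                       # s in 4-byte words, one addr per lane
    return [lane * s * 4 for lane in range(32)]

assert conflict_degree(stride(1))  == 1    # no conflict
assert conflict_degree([0] * 32)   == 1    # broadcast
assert conflict_degree(stride(2))  == 2    # 2-way conflict
assert conflict_degree(stride(32)) == 32   # 32-way conflict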

Coalescing

The coalescing unit combines memory requests from 32 threads into minimal cache line transactions.

PERFECT COALESCING
Thread 0:  0x1000 ─┐
Thread 1:  0x1004  │
Thread 2:  0x1008  ├──▶  1 cache line request (128B)
   ...             │
Thread 31: 0x107C ─┘

SCATTERED ACCESS
Thread 0:  0x1000 ──▶  Request 1
Thread 1:  0x2000 ──▶  Request 2
Thread 2:  0x3000 ──▶  Request 3
   ...                    ...
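
The behavior reduces to grouping the warp's addresses by 128-byte cache line. A Python model of that idea (function name illustrative):

# Collapse a warp's byte addresses into distinct 128 B line requests.
def coalesce(addrs, line_bytes=128):
    return sorted({addr - addr % line_bytes for addr in addrs})

perfect   = [0x1000 + 4 * i for i in range(32)]   # stride-1 words
scattered = [0x1000 * (i + 1) for i in range(32)] # one line per thread
assert len(coalesce(perfect))   == 1
assert len(coalesce(scattered)) == 32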

ISA

Instruction Formats

[Figure: instruction format field layouts]

All instructions are 32 bits with a 6-bit opcode:

Format   Fields                       Used By
R-Type   opcode, rd, rs1, rs2, func   Arithmetic, logical
I-Type   opcode, rd, rs1, imm16       Immediate ops, loads
S-Type   opcode, rs2, rs1, offset     Stores
B-Type   opcode, rs1, rs2, offset     Branches
U-Type   opcode, rd, imm21            LUI, JAL
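
The README fixes the field names and the 6-bit opcode, but not the exact bit positions. Purely for illustration, the Python encoders below assume opcode in bits [5:0], rd in [10:6], rs1 in [15:11], and rs2 in [20:16]; the real layout lives in the RTL and the assembler.

# R-type / I-type encoders under the assumed field positions above.
def encode_r(opcode, rd, rs1, rs2, func):
    return (func << 21) | (rs2 << 16) | (rs1 << 11) | (rd << 6) | opcode

def encode_i(opcode, rd, rs1, imm16):
    return ((imm16 & 0xFFFF) << 16) | (rs1 << 11) | (rd << 6) | opcode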

Instruction Set

Integer Arithmetic

Instruction         Operation        Description
ADD  rd, rs1, rs2   rd = rs1 + rs2   Add
ADDI rd, rs1, imm   rd = rs1 + imm   Add immediate
SUB  rd, rs1, rs2   rd = rs1 - rs2   Subtract
MUL  rd, rs1, rs2   rd = rs1 × rs2   Multiply (low 32 bits)
DIV  rd, rs1, rs2   rd = rs1 ÷ rs2   Divide (signed)
REM  rd, rs1, rs2   rd = rs1 % rs2   Remainder

Logical & Shift

Instruction        Operation          Description
AND rd, rs1, rs2   rd = rs1 & rs2     Bitwise AND
OR  rd, rs1, rs2   rd = rs1 | rs2     Bitwise OR
XOR rd, rs1, rs2   rd = rs1 ^ rs2     Bitwise XOR
SLL rd, rs1, rs2   rd = rs1 << rs2    Shift left logical
SRL rd, rs1, rs2   rd = rs1 >> rs2    Shift right logical
SRA rd, rs1, rs2   rd = rs1 >>> rs2   Shift right arithmetic

Memory

Instruction            Operation               Description
LW  rd, offset(rs1)    rd = mem[rs1+offset]    Load word
SW  rs2, offset(rs1)   mem[rs1+offset] = rs2   Store word
LB/LH                  Load byte/halfword      Sign extended
LBU/LHU                Load byte/halfword      Zero extended

Control Flow

Instruction            Operation                      Description
BEQ rs1, rs2, offset   if (rs1 == rs2) PC += offset   Branch if equal
BNE rs1, rs2, offset   if (rs1 != rs2) PC += offset   Branch if not equal
BLT rs1, rs2, offset   if (rs1 < rs2) PC += offset    Branch if less than
JAL rd, offset         rd = PC+4; PC += offset        Jump and link
RET                                                   End thread execution

Floating Point

Instruction              Operation            Description
FADD  rd, rs1, rs2       rd = rs1 + rs2       FP add
FMUL  rd, rs1, rs2       rd = rs1 × rs2       FP multiply
FMADD rd, rs1, rs2, rs3  rd = rs1×rs2 + rs3   Fused multiply-add

GPU Special

Instruction        Description
SYNC               Block-level barrier synchronization
VOTE.ALL/ANY/BAL   Warp vote operations
SHFL.*             Warp shuffle (exchange data between lanes)
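
A behavioral Python model of the vote and shuffle semantics, with a list of per-lane values standing in for a warp's registers (helper names illustrative):

WARP_SIZE = 32

def vote_all(preds):  return all(preds)
def vote_any(preds):  return any(preds)
def vote_ballot(preds):                       # one bit per lane
    return sum(1 << lane for lane, p in enumerate(preds) if p)

def shfl_xor(values, mask):                   # each lane reads lane ^ mask
    return [values[lane ^ mask] for lane in range(WARP_SIZE)]

vals = list(range(WARP_SIZE))
assert shfl_xor(vals, 1)[:4] == [1, 0, 3, 2]  # neighbor exchange
assert vote_ballot([True] * WARP_SIZE) == 0xFFFFFFFF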

Register File

Each thread has 32 registers. Special registers provide execution context:

Register   Name        Description
x0         zero        Hardwired to 0
x1         threadIdx   Thread index within block
x2         blockIdx    Block index within grid
x3         blockDim    Threads per block
x4         gridDim     Blocks per grid
x5         warpIdx     Warp index within block
x6         laneIdx     Lane index within warp (0-31)
x7-x31                 General purpose

SIMT Execution

Thread Hierarchy

                          GRID
            ┌──────────────┴──────────────┐
            │                             │
        ┌───┴───┐                     ┌───┴───┐
        │Block 0│                     │Block N│
        └───┬───┘                     └───────┘
            │
   ┌────────┼────────┬────────┐
   │        │        │        │
┌──┴──┐  ┌──┴──┐  ┌──┴──┐  ┌──┴──┐
│Warp0│  │Warp1│  │Warp2│  │Warp3│
└──┬──┘  └─────┘  └─────┘  └─────┘
   │
   ├── Thread 0
   ├── Thread 1
   ├── ...
   └── Thread 31

Warp Scheduling

The warp scheduler selects a ready warp each cycle. Warps become not-ready when:

  • Waiting for memory (cache miss)
  • Data hazard (RAW dependency)
  • Barrier synchronization
  • Long-latency operation (DIV, SQRT)

GTO Algorithm:

def select_warp(current_warp, warps_by_age):
    if current_warp.ready:
        return current_warp       # Greedy: stay on the same warp
    for warp in warps_by_age:
        if warp.ready:
            return warp           # Otherwise: oldest ready warp
    return STALL                  # No warp ready this cycle

Branch Divergence

When threads in a warp take different branch paths, the SIMT stack manages divergent execution:

                     ┌─────────────────────────────────┐
                     │         SIMT STACK              │
                     │  ┌─────────┬────────┬────────┐  │
                     │  │Reconv PC│  Mask  │Next PC │  │
                     │  ├─────────┼────────┼────────┤  │
                     │  │  0x120  │ 0xFFFF │ 0x108  │  │  Entry 0
                     │  │  0x140  │ 0x00FF │ 0x128  │  │  Entry 1
                     │  │   ...   │  ...   │  ...   │  │
                     │  └─────────┴────────┴────────┘  │
                     └─────────────────────────────────┘

Divergence Handling:

  1. On divergent branch, push reconvergence point to stack
  2. Execute taken-path with subset of threads (mask active threads)
  3. At reconvergence point, pop stack and execute not-taken path
  4. When stack empty, all threads reconverged
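
A compact Python model of steps 1-3; the PC and mask values are illustrative, and the real stack carries more state for step 4.

# Simplified SIMT stack: each entry is (reconv_pc, mask, next_pc).
stack = []

def branch(full_mask, taken_mask, taken_pc, fallthru_pc, reconv_pc):
    # Steps 1-2: defer the not-taken threads, run the taken path first.
    stack.append((reconv_pc, full_mask & ~taken_mask, fallthru_pc))
    return taken_pc, taken_mask

def reconverge():
    # Step 3: pop and run the deferred path with its mask.
    _, mask, next_pc = stack.pop()
    return next_pc, mask

pc, mask = branch(0xFFFFFFFF, 0x0000FFFF, 0x108, 0x128, 0x140)
assert (pc, mask) == (0x108, 0x0000FFFF)   # lower 16 lanes run first
pc, mask = reconverge()
assert (pc, mask) == (0x128, 0xFFFF0000)   # then the deferred lanes
# Step 4: once this path reaches 0x140 with the stack empty, the warp
# resumes with its full mask.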

Barrier Synchronization

GPUs provide barrier instructions for synchronizing threads within a block.

SYNC — Block-Level Barrier

All warps in a block must reach the barrier before any can proceed:

Warp 0: ... SYNC ...  ──┐
Warp 1: ... SYNC ...  ──┼──▶ All warps blocked until last arrives
Warp 2: ... SYNC ...  ──┤
Warp 3: ... SYNC ...  ──┘    Then all wake simultaneously

Implementation:

  • Barrier controller tracks arrivals per barrier ID (16 concurrent barriers)
  • Arriving warp status changes to WARP_BLOCKED
  • When all active warps arrive, controller sends batch wake signal
  • All blocked warps transition back to WARP_READY
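
A behavioral Python model of this protocol; names are illustrative, but the hardware tracks the same per-barrier-ID state.

# Count arrivals per barrier ID; release everyone on the last arrival.
class BarrierController:
    def __init__(self, num_barriers=16):
        self.arrived = [set() for _ in range(num_barriers)]

    def sync(self, barrier_id, warp_id, active_warps):
        self.arrived[barrier_id].add(warp_id)    # warp -> WARP_BLOCKED
        if self.arrived[barrier_id] == set(active_warps):
            self.arrived[barrier_id].clear()
            return set(active_warps)             # batch wake -> WARP_READY
        return set()                             # keep waiting

bc = BarrierController()
assert bc.sync(0, warp_id=0, active_warps={0, 1}) == set()
assert bc.sync(0, warp_id=1, active_warps={0, 1}) == {0, 1}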

WSYNC — Warp-Level Barrier

Synchronizes all threads within a single warp (useful after shuffle operations):

SHFL.XOR  x9, x8, 1       # Exchange with neighbor
WSYNC                      # Ensure all lanes complete shuffle
ADD       x10, x8, x9     # Use exchanged data

WSYNC completes immediately since all 32 threads in a warp execute in lockstep.


Kernels

Vector Addition

Adds two vectors element-wise: C[i] = A[i] + B[i]

# Vector Addition Kernel
# Grid: N/256 blocks, Block: 256 threads
# Base addresses of A, B, C are assumed preloaded in x10, x11, x12

MUL     x7, x2, x3          # global_id = blockIdx * blockDim
ADD     x7, x7, x1          # global_id += threadIdx

SLLI    x8, x7, 2           # byte_offset = global_id * 4

ADD     x9, x10, x8         # addr_A = base_A + offset
ADD     x10, x11, x8        # addr_B = base_B + offset
ADD     x11, x12, x8        # addr_C = base_C + offset

LW      x13, 0(x9)          # load A[i]
LW      x14, 0(x10)         # load B[i]

ADD     x15, x13, x14       # C[i] = A[i] + B[i]

SW      x15, 0(x11)         # store C[i]

RET

Sum Reduction

Computes the sum 1 + 2 + ... + N with a simple accumulation loop:

# Sum Reduction Kernel
# Computes: sum = 1 + 2 + 3 + ... + N

ADDI    x7, x0, 10          # N = 10
ADDI    x8, x0, 0           # sum = 0
ADDI    x9, x0, 1           # i = 1

loop:
    ADD     x8, x8, x9      # sum += i
    ADDI    x9, x9, 1       # i++
    BLT     x9, x7, loop    # continue while i < N
    BEQ     x9, x7, loop    # one more pass when i == N (i runs 1..N)

SW      x8, 0(x0)           # store result (55)
RET

Matrix Multiplication

Multiplies two N×N matrices using a tiled algorithm with shared memory:

# Matrix Multiplication with Shared Memory Tiling
# C = A × B, using 16×16 tiles

# Calculate thread position
ANDI    x7, x1, 0xF         # tx = threadIdx & 15
SRLI    x8, x1, 4           # ty = threadIdx >> 4

# Calculate global position
SLLI    x9, x2, 4
ANDI    x10, x9, 0xF0       # col_base = (blockIdx & 0xF) * 16
SRLI    x11, x2, 4
SLLI    x11, x11, 4         # row_base = (blockIdx >> 4) * 16

ADD     x12, x10, x7        # col = col_base + tx
ADD     x13, x11, x8        # row = row_base + ty

# Initialize accumulator
ADDI    x20, x0, 0          # acc = 0

# Tile loop
ADDI    x14, x0, 0          # k = 0
tile_loop:
    # Load A tile to shared memory
    # ... (tile loading)

    SYNC                    # Barrier

    # Compute partial products
    # ... (16 multiply-adds)

    SYNC                    # Barrier

    ADDI    x14, x14, 16    # k += TILE_SIZE
    BLT     x14, x4, tile_loop  # next tile while k < N

# Store result
# ... (store C[row][col])
RET

Simulation

Prerequisites

Install the required tools:

# macOS
brew install icarus-verilog

# Ubuntu/Debian
sudo apt install iverilog

# Python dependencies
pip3 install cocotb

Running Tests

# Compile and run all tests
make sim

# Run specific test
make test

# Assemble a kernel
python3 sw/assembler/opengpu_asm.py sw/kernels/test_arithmetic.asm

# View waveforms (requires GTKWave)
make wave

Execution Trace

The simulation produces detailed execution traces showing cycle-by-cycle state:

╔══════════════════════════════════════════════════════════════════╗
║ Cycle: 42  |  State: EXECUTE  |  PC: 0x0010                      ║
╠══════════════════════════════════════════════════════════════════╣
║ Instruction: ADD x9, x7, x8                                      ║
║ Opcode: 0x00  |  rd: 9  |  rs1: 7  |  rs2: 8                     ║
╠══════════════════════════════════════════════════════════════════╣
║ Registers:                                                       ║
║   x7 = 10  |  x8 = 20  |  x9 = 30 (written)                      ║
╠══════════════════════════════════════════════════════════════════╣
║ Memory: No access                                                ║
╚══════════════════════════════════════════════════════════════════╝

Implementation Roadmap

Completed

  • Phase 1: Minimal Core — Single-threaded FSM-based core

    • 60+ instruction ISA
    • Integer ALU with all operations
    • Register file with GPU context
    • Memory interface with byte/half/word access
    • Python assembler
    • Basic testbench
  • Phase 2: Pipelining

    • 5-stage pipeline implementation
    • Scoreboard for hazard detection
    • Data forwarding paths (EX→EX, MEM→EX, WB→ID)
    • Branch prediction with misprediction recovery
  • Phase 3: SIMT Execution

    • 32 threads per warp (WARP_SIZE = 32)
    • 4 warps per core (WARPS_PER_CORE = 4)
    • Warp scheduler with GTO (Greedy-Then-Oldest) policy
    • SIMT stack for branch divergence (8-entry depth)
    • Barrier synchronization (SYNC/WSYNC instructions)
    • Thread mask unit for divergence analysis
    • Vote operations (VOTE.ALL/ANY/BAL)
    • Shuffle operations (SHFL.IDX/UP/DN/XOR)
  • Phase 4: Memory Hierarchy

    • L1 instruction cache (16 KB, 4-way)
    • L1 data cache (32 KB, 4-way)
    • L2 shared cache (512 KB, 8-way)
    • Shared memory with 32 banks
    • Memory coalescing unit

In Progress

  • Phase 5: Full System
    • IEEE 754 FP32 unit (implemented, needs integration testing)
    • 8 SMs (32 total cores)
    • Global block scheduler
    • Host interface (AXI-Lite)

Contributing

Contributions are welcome! Areas of interest:

  • Pipeline hazard handling improvements
  • Cache implementation
  • SIMT stack and divergence handling
  • FPU implementation
  • Test kernels and verification

License

MIT License — See LICENSE for details.
