v0 roadmap: Metal runtime → comm benchmarks + kernels #1
We now have the initial gpucomm/core scaffold: a minimal SwiftPM Metal runtime (MetalContext + kernel loader), a CLI (gpucomm), and two first experiments (bandwidth + reduction).
Goal
Make this repo a low-level GPU compute runtime + benchmark lab for Apple Silicon that focuses on data movement, synchronization, and memory behavior (not abstractions).
Proposed next milestones
- Kernel suite v0.1
- Scan (prefix sum), matmul, and a proper memcpy-style bandwidth kernel
- Parameterized kernels (problem size, threadgroup size) + correctness checks
- Benchmark harness v0.2
- Standard runner: warmup, repetitions, percentile stats
- Output formats: human + `--json`
- Stable timing: GPU timestamps + optional CPU wall time
- CPU ↔ GPU transfer + residency v0.3
- Upload/download latency + throughput (shared vs private + blit strategies)
- Pinned/contiguous allocation experiments where applicable
- Synchronization + threadgroup experiments v0.4
- Barrier patterns, bank conflict probes, reduction/scan variants
- Threadgroup scaling sweeps and occupancy-ish heuristics
- CI + docs v0.5
- GitHub Actions on macOS: `swift build -c release`
- README: benchmark methodology + “how to interpret numbers”
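For the harness v0.2 runner (warmup, repetitions, percentile stats), the core loop could be as small as the sketch below. This is an illustrative sketch only; `BenchStats` and `runBench` are hypothetical names, not part of the existing gpucomm API, and the real runner would wrap GPU timestamps rather than host wall time.

```swift
import Foundation

/// Summary statistics for one benchmark run (all times in milliseconds).
struct BenchStats {
    let p50Ms: Double
    let p95Ms: Double
    let meanMs: Double
}

/// Runs `body` for `warmup` untimed iterations, then `reps` timed
/// repetitions, and reports p50/p95/mean over the timed samples.
func runBench(warmup: Int, reps: Int, _ body: () -> Void) -> BenchStats {
    // Warmup iterations are executed but never recorded.
    for _ in 0..<warmup { body() }

    var samplesMs: [Double] = []
    for _ in 0..<reps {
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        let end = DispatchTime.now().uptimeNanoseconds
        samplesMs.append(Double(end - start) / 1_000_000.0)
    }

    // Nearest-rank percentile over the sorted samples.
    let sorted = samplesMs.sorted()
    func percentile(_ p: Double) -> Double {
        let idx = min(sorted.count - 1, Int(p * Double(sorted.count)))
        return sorted[idx]
    }
    return BenchStats(
        p50Ms: percentile(0.50),
        p95Ms: percentile(0.95),
        meanMs: samplesMs.reduce(0, +) / Double(samplesMs.count)
    )
}
```

In the real harness, `body` would be something like a command-buffer commit + wait, and the stats would feed both the human and `--json` output paths.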
Decisions needed
- Do we prioritize transfer benchmarks (CPU↔GPU) before adding more kernels?
- Should the CLI be strict subcommands only, or do we also support a config-driven batch runner (YAML/JSON)?
- What’s the canonical set of benchmark outputs we want to standardize across gpucomm repos?
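On the config-driven batch runner question: a minimal config could look like the sketch below (JSON shown for concreteness; every key here is hypothetical, not an existing gpucomm schema), which would let CI run a fixed benchmark matrix without shell loops.

```json
{
  "warmup": 200,
  "reps": 5,
  "format": "json",
  "benchmarks": [
    { "name": "bandwidth", "sizes": [1048576, 16777216] },
    { "name": "reduction", "sizes": [1048576], "threadgroup": 256 }
  ]
}
```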
Acceptance criteria for next PR
- Add 1 new benchmark end-to-end (kernel + runner + CLI) with correctness validation
- Add `--json` output for that benchmark
- CI stays green (`swift build -c release`)
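If the `--json` output is meant to become the canonical format across gpucomm repos, one candidate record shape is sketched below. Field names and the numeric values are placeholders for illustration, not an existing format.

```json
{
  "benchmark": "latency",
  "kind": "empty",
  "iters": 2000,
  "warmup": 200,
  "reps": 5,
  "unit": "us",
  "p50": 12.3,
  "p95": 15.1,
  "mean": 12.9
}
```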
Pinned comment by gpucomm-hq:
Added a latency benchmark for command submission / kernel launch.
- New: `gpucomm bench latency --kind empty|kernel --iters N --warmup N --reps N --format human|json|csv`
  - `empty`: empty command buffer commit+wait (host-side queue/sync overhead)
  - `kernel`: 1-thread noop compute kernel (end-to-end kernel launch latency)
- Repro:
  `./.build/release/gpucomm bench latency --kind empty --iters 2000 --warmup 200 --reps 5 --format json`
  `./.build/release/gpucomm bench latency --kind kernel --iters 2000 --warmup 200 --reps 5 --format json`
Next: add blit latency kind (small copy) and optionally break down timings (submit vs wait).
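For reference, the `empty` kind boils down to timing `commit()` + `waitUntilCompleted()` on empty command buffers. A standalone sketch of that measurement is below; this is not the repo's actual runner (no warmup or percentile handling), and it requires a Metal-capable device to run.

```swift
import Metal
import Foundation

// Measure host-side submission + completion overhead of an empty
// command buffer: no encoders, so no GPU work beyond scheduling.
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue() else {
    fatalError("no Metal device available")
}

let iters = 2000
let start = DispatchTime.now().uptimeNanoseconds
for _ in 0..<iters {
    guard let cb = queue.makeCommandBuffer() else { continue }
    cb.commit()
    cb.waitUntilCompleted()
}
let end = DispatchTime.now().uptimeNanoseconds

let usPerIter = Double(end - start) / 1_000.0 / Double(iters)
print("empty commit+wait: \(usPerIter) µs/iter")
```

The `kernel` kind would additionally encode a 1-thread noop dispatch inside the loop, so the delta between the two isolates the compute-encoder and kernel-launch cost.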