Printf-style debugging for Metal compute shaders. No Xcode GPU debugger, no buffer dumps, no guessing.
Add #include "metal_debug.h" to your shader, drop in a debug buffer, see what every thread computed.
#include "metal_debug.h"
kernel void my_kernel(
device float *A [[buffer(0)]],
device float *B [[buffer(1)]],
device float *C [[buffer(2)]],
device uint *dbg_buf [[buffer(30)]],
uint id [[thread_position_in_grid]]
) {
float a = A[id], b = B[id];
dbg_printf(dbg_buf, id, 0, a); // log input A
dbg_printf(dbg_buf, id, 1, b); // log input B
float result = a * b;
dbg_watch_nan(dbg_buf, id, 2, result); // only logs if NaN/Inf
dbg_assert(dbg_buf, id, 3, result > 0); // GPU-side assertion
C[id] = result;
}Host side (ObjC):
MetalDebugSession *dbg = [[MetalDebugSession alloc] initWithDevice:device maxEntries:4096];
[encoder setBuffer:dbg.buffer offset:0 atIndex:30];
// ... dispatch ...
[dbg dump];Output:
[metal-debug] 24 entries
thread[0] 0: 3.5
thread[0] 1: 2.0
thread[1] 0: 1.2
thread[1] 1: -0.5
thread[1] 3: ASSERTION FAILED
| Feature | GPU API | Description |
|---|---|---|
| Printf | dbg_printf(buf, tid, tag, val) |
Log float/int/uint/half/vec values |
| Conditional | dbg_printf_if(buf, cond, tid, tag, val) |
Only log when condition is true |
| NaN watchpoint | dbg_watch_nan(buf, tid, tag, val) |
Log only NaN/Inf values |
| Range watchpoint | dbg_watch_range(buf, tid, tag, val, lo, hi) |
Log values outside range |
| Assertions | dbg_assert(buf, tid, tag, cond) |
Record assertion failures |
| Breakpoints | dbg_break(buf, tid, tag, cond) |
Set flag for host to detect |
| Stats | dbg_stats(buf, tag, val) |
Cross-thread min/max/mean/count |
| Histogram | dbg_histogram(buf, tag, val, lo, hi) |
Value distribution with bar chart |
| Named tags | Preprocessor or host-side | "loss" instead of tag=42 |
| 2D grid view | Host-side | Display values as threadgroup grid |
| Diff mode | Host-side | Compare two kernel runs |
| Zero-overhead disable | #define METAL_DEBUG_DISABLE |
Compiles out all debug calls |
-
GPU side:
metal_debug.his a single header. Debug calls write(thread_id, tag, type, value)entries into a device buffer using atomic counters. -
Host side:
MetalDebugSessionallocates the buffer, binds it at slot 30, and reads/formats entries after execution. -
No recompilation needed when changing buffer size —
max_entriesis stored in the buffer itself and read by the GPU at runtime.
git clone <this repo>
cd metal-debug
make test # compiles + runs 9 test kernels on your GPUCopy src/metal_debug.h into your project. Link runtime/MetalDebugSession.{h,m} into your app.
#import "MetalDebugSession.h"
MetalDebugSession *dbg = [[MetalDebugSession alloc]
initWithDevice:device maxEntries:4096];
[dbg setName:@"loss" forTag:0];
[encoder setBuffer:dbg.buffer offset:0 atIndex:30];
// dispatch kernel...
[dbg dump]; // all entries, sorted by thread
[dbg dumpTag:0]; // filter by tag
[dbg dumpGrid:0 width:8 height:8]; // 2D threadgroup view
[dbg dumpStats:0]; // min/max/mean
[dbg dumpHistogram:0 lo:0 hi:1]; // value distribution
if ([dbg breakpointHit])
[dbg dumpBreakpoint]; // what went wrong
[dbg reset]; // reuse for next dispatchimport Metal
let dbg = MetalDebugSession(device: device, maxEntries: 4096)
encoder.setBuffer(dbg.buffer, offset: 0, index: 30)
// dispatch kernel...
dbg.dump()See examples/SwiftDemo/ for a complete Swift example.
from metal_debug import MetalDebugSession
dbg = MetalDebugSession(max_entries=4096)
# pass dbg.tensor as buffer(30) to your Metal/Triton kernel
torch.mps.synchronize()
dbg.dump()Explore debug traces interactively — filter, navigate, see grid views and stats live:
pip install textual
# Launch with demo data
python python/tui.py --demo
# Launch with a debug buffer dump
python python/tui.py trace.binOr from Python after a kernel dispatch:
dbg.explore(grid_width=8, grid_height=8)Keyboard shortcuts:
| Key | Action |
|---|---|
↑/↓ |
Navigate entries |
g |
Show 2D grid for selected tag |
a |
Show assertions only |
b |
Jump to breakpoint thread |
c |
Clear filters |
m |
Toggle mouse (enable copy/paste) |
escape |
Focus table from filter input |
q |
Quit |
Auto-inject the debug buffer parameter into kernel signatures and use string tags:
python3 src/metal_debug_preprocess.py my_kernel.metal -o my_kernel_debug.metal
xcrun metal -I path/to/metal-debug/src -o out.metallib my_kernel_debug.metalBefore:
kernel void foo(device float *A [[buffer(0)]], uint id [[thread_position_in_grid]]) {
dbg(id, "value", A[id]);
}After preprocessing:
kernel void foo(device float *A [[buffer(0)]], uint id [[thread_position_in_grid]],
device uint *_dbg_buf [[buffer(30)]]) {
dbg(id, 47248/*value*/, A[id]);
}If you use DBG_PARAM in your kernel signature, the short macros work:
kernel void my_kernel(device float *A [[buffer(0)]], DBG_PARAM,
uint id [[thread_position_in_grid]]) {
dbg(id, 0, A[id]); // printf
dbg_if(id == 0, id, 1, A[id]); // conditional
dbg_nan(id, 2, A[id]); // NaN watchpoint
dbg_check(id, 3, A[id] > 0); // assertion
dbg_stat(0, A[id]); // stats accumulator
dbg_hist(0, A[id], 0, 100); // histogram
dbg_brk(id, 4, A[id] < 0); // breakpoint
}MIT