Artifact for reproducing Table 1 of the paper.

workloads/
    resnet/        ResNet-18 (CNN)
    qwen/          Qwen3-8B (LLM Decoder)
    sd3_mmdit/     SD3-MMDiT (Diffusion Transformer)
    mamba/         Mamba-2 SSD (State Space Model)
    ds_mhc_moe/    mHC-MoE (Mixture of Experts)

Each workload contains:
    model_ref.py    PyTorch eager baseline
    model_new.py    LEGO-optimized (Triton kernels + system-level optimizations)
    ab.py           A/B benchmark tool (eager + torch.compile modes)
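
Before benchmarking, it can help to confirm that the checkout has the layout listed above. The following is a minimal sketch, not part of the artifact: the helper name `missing_files` and the default root `workloads` are assumptions made for illustration. It checks only the two model files per workload, since those are the paths passed to the benchmark commands.

```python
from pathlib import Path

# Workload directory names as listed in this README.
WORKLOADS = ["resnet", "qwen", "sd3_mmdit", "mamba", "ds_mhc_moe"]
# Files passed to ab.py via --ref and --test.
MODEL_FILES = ["model_ref.py", "model_new.py"]


def missing_files(root="workloads"):
    """Return the expected model files that are absent under `root`.

    Hypothetical helper for illustration; the artifact does not ship it.
    """
    root = Path(root)
    return [
        root / name / filename
        for name in WORKLOADS
        for filename in MODEL_FILES
        if not (root / name / filename).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        for path in missing:
            print(f"missing: {path}")
    else:
        print("all workload files found")
```

Run it from the directory that contains `workloads/`; an empty result means every reference and optimized model file is in place.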

Requirements:
- GPU: NVIDIA RTX 6000 Ada (48 GB) or comparable
- CUDA 12.9+, PyTorch 2.9+, Triton (bundled with PyTorch)

Install einops:
    pip install einops
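
The Python-side requirements above can be checked before running anything. This is a minimal sketch under stated assumptions: the function names are hypothetical, the check reads installed package metadata only (it does not import PyTorch), and it does not verify the GPU, the CUDA version, or Triton.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading (major, minor) pair of a version such as '2.9.0+cu129'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(installed, minimum):
    """Compare versions numerically on major.minor, so 2.10 counts as newer than 2.9."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Report whether PyTorch 2.9+ and einops are installed (hypothetical helper)."""
    torch_version = installed_version("torch")
    return {
        "torch": torch_version is not None and meets_minimum(torch_version, "2.9"),
        "einops": installed_version("einops") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```

Comparing parsed integers rather than raw strings matters here because a plain string comparison would rank "2.10" below "2.9".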

Benchmark a single workload (Qwen shown here):
    python ab.py \
        --ref workloads/qwen/model_ref.py \
        --test workloads/qwen/model_new.py \
        --bench-runs 30 --compile-run-times 1

Benchmark all five workloads:
    for w in resnet qwen sd3_mmdit mamba ds_mhc_moe; do
        echo "=== $w ==="
        python ab.py \
            --ref workloads/$w/model_ref.py \
            --test workloads/$w/model_new.py \
            --bench-runs 30 --compile-run-times 1
    done

License: Apache 2.0