feat(sc): uSystolic stoc_len halving + per-row QK granularity by heroarmor · Pull Request #2 · heroarmor/scmp_diffusion

heroarmor · 2026-05-27T21:23:09Z

Adds the uSystolic/HUB sign-magnitude stoc_len halving trick and a per-row QK granularity option to the Q-DiT SC integration.
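One way to see why the halving can be lossless is to compare the values each encoding can represent. This is a sketch under assumptions the PR does not state: a bipolar stream of length N encodes (2k - N)/N for a ones-count k, and a sign-magnitude stream keeps the sign separately and encodes the magnitude as a unipolar count over N/2 bits. The helper names are hypothetical, and the check says nothing about how the kernel implements `halve_bipolar_stoc_len`.

```python
from fractions import Fraction


def bipolar_levels(stoc_len):
    """Values reachable by a bipolar stream: (2k - N) / N for k ones out of N bits."""
    return {Fraction(2 * k - stoc_len, stoc_len) for k in range(stoc_len + 1)}


def sign_magnitude_levels(stoc_len):
    """Values reachable by a separate sign plus a unipolar magnitude k / N."""
    magnitudes = [Fraction(k, stoc_len) for k in range(stoc_len + 1)]
    return {sign * m for sign in (-1, 1) for m in magnitudes}


N = 64  # assumed stream length, for illustration only
full_length_bipolar = bipolar_levels(N)
half_length_sign_magnitude = sign_magnitude_levels(N // 2)

# Under these assumptions both encodings share the same grid with step 2/N.
same_grid = full_length_bipolar == half_length_sign_magnitude
```

Exact rationals are used so the set comparison is not affected by floating-point rounding.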

Changes

SCController.halve flag, surfaced as --sc_halve (and SC_HALVE=1 env). When set, bipolar SC matmuls run at stoc_len/2 via halve_bipolar_stoc_len=True — no accuracy loss. No-op for non-bipolar modes and the noise surrogate.
SCAttention / SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True) when halve is enabled.
--sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path (default stays per_head).
tools/kernel_launch_counter.py: point the DiT ckpt at the turbo path.
Bump scmp_kernels submodule to a576b83 (already on upstream/main).
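The routing described above can be sketched as follows. The `sc_matmul` here is a hypothetical stand-in that only reports the stream length it would use; the real `scmp_kernels.sc.sc_matmul` signature is not shown in this PR, so the `stoc_len` and `sc_mode` parameter names and the `get_matmul_fn` helper are assumptions.

```python
from functools import partial


def sc_matmul(a, b, *, stoc_len, sc_mode="bipolar", halve_bipolar_stoc_len=False):
    """Hypothetical stand-in for scmp_kernels.sc.sc_matmul.

    Returns the effective stream length instead of launching a kernel.
    """
    if halve_bipolar_stoc_len and sc_mode == "bipolar":
        return stoc_len // 2
    return stoc_len  # non-bipolar modes ignore the flag (no-op)


def get_matmul_fn(halve):
    """Bind the halving kwarg once so existing call sites stay unchanged."""
    if halve:
        return partial(sc_matmul, halve_bipolar_stoc_len=True)
    return sc_matmul


matmul = get_matmul_fn(halve=True)
effective_bipolar = matmul(None, None, stoc_len=256)
effective_unipolar = matmul(None, None, stoc_len=256, sc_mode="unipolar")
effective_default = get_matmul_fn(halve=False)(None, None, stoc_len=256)
```

Binding the keyword with `partial` keeps the flag out of every call site, which fits the PR's description of SCAttention and SCMlp routing through a single wrapped function.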

Debug/scratch tools were intentionally left out of this PR.

🤖 Generated with Claude Code

Adds CrucibleComputingGroup/scmp_kernels as a submodule at ./scmp_kernels, pinned to the heroarmor:add-mp-module branch tip (commit fbd7009). Will be re-pinned to org main once PR #2 lands.

URL: https://github.com/heroarmor/scmp_kernels.git
Path: ./scmp_kernels
Initial pin: fbd7009 (add-mp-module branch: MP module + clipping removal + flat-API public surface for Q-DiT compatibility)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings the Diffusion application source code from scmp_llm/Q-DiT into this repo (93 files: qdit/, diffusion/, models/, scripts/, tests/, utils/, env files).

All SC/MP-related bare imports are rewired to use the scmp_kernels submodule:

  from sc_triton import sc_matmul, sc_matmul_grouped, ...
    → from scmp_kernels.sc import sc_matmul_per_tensor as sc_matmul, sc_matmul_grouped, ...
  from sng import RNGPool, SNGBank
    → (deleted; only used by qdit/sc_integration/sc_matmul.py, which is itself deleted)
  from config_helpers import ...
    → from scmp_kernels.sc.config_helpers import ...
  from mp_config import (...)
    → from scmp_kernels.mp import (...)

Other changes:
- Deleted qdit/sc_integration/sc_matmul.py (vestigial: sc_matmul_qk had zero callers repo-wide; only path that needed xnor_matmul / bin_to_stoc_packed).
- Removed corresponding line and __all__ entry in qdit/sc_integration/__init__.py.
- qdit/sc_integration/mp_config.py is now a thin re-export shim of scmp_kernels.mp, so local relative imports (from .mp_config import …) in sc_attention/sc_mlp/sc_controller still resolve.
- Removed every sys.path.insert(0, ".../SC") shim; package imports replace them.
- Excluded artifacts not migrated: __pycache__/, build/, Q_DiT.egg-info/, *.nsys-rep, *.png, logs_mp_sweep/, results/, and the 92MB Inception .pb checkpoint under models/evaluations/.

Verified:
- All rewired .py files parse (AST OK).
- All 8 sc_triton public names Q-DiT imports resolve through scmp_kernels/sc/__init__.py at the pinned submodule SHA.
- No remaining bare imports of sc_triton/sng/config_helpers/mp_config.
- No remaining sys.path.insert SC shims.

Not verified (no GPU on dev box):
- Actual kernel execution. Needs `pytest tests/` on a CUDA + Triton box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
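A check for leftover bare imports like the one reported as verified could look roughly like this. It is a minimal sketch using only the standard `ast` module; the `find_bare_imports` helper and the `MPConfig` name in the sample are hypothetical, and the commit does not say which tool was actually used.

```python
import ast

# Legacy top-level modules named in the commit message.
BARE_MODULES = {"sc_triton", "sng", "config_helpers", "mp_config"}


def find_bare_imports(source):
    """Return sorted (lineno, module) pairs for absolute imports of legacy modules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BARE_MODULES:
                    hits.append((node.lineno, alias.name))
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import; relative imports such as
            # `from .mp_config import X` go through the shim and are allowed.
            top = (node.module or "").split(".")[0]
            if node.level == 0 and top in BARE_MODULES:
                hits.append((node.lineno, node.module))
    return sorted(hits)


sample = (
    "from sc_triton import sc_matmul\n"
    "from .mp_config import MPConfig\n"  # MPConfig is a made-up name
    "from scmp_kernels.sc import sc_matmul\n"
    "import sng\n"
)
hits = find_bare_imports(sample)
```

Running the helper over every migrated file and asserting an empty result would reproduce the "no remaining bare imports" claim without needing a GPU.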

…patcher

scmp_kernels:
- submodule bumped to fa6b5cd, which removes the flat-API duplicates (sc_matmul_per_tensor, sc_matmul_mlp, sc_matmul_grouped, sc_matmul_enable_triton, sc_matmul_enable_triton_mlp, sc_matmul_grouped_enable_triton, sc_matmul_enable_batched_bipolar) and extends sc_matmul with group_a / group_b / rng_levels kwargs.

qdit/sc_integration/sc_attention.py:
- Collapses four _get_*_fn dispatchers (sc_matmul, mlp, av_grouped, batched_bipolar) into one _get_matmul_fn that returns either the real Triton sc_matmul or the noisy surrogate.
- All call sites updated to pass granularity= (per_tensor, per_row, or per_head) instead of relying on which specialised function was returned. ~140 lines of branching/positional-arg plumbing removed.
- Per-tensor max/min positional args are dropped; sc_matmul computes those internally.

qdit/sc_integration/sc_mlp.py:
- Same collapse: single _get_matmul_fn; all MLP linear paths now call sc_matmul(..., granularity="per_row", chunk_d=..., group_a=..., group_b=..., rng_levels=...).

qdit/sc_integration/noise_matmul.py:
- Four signature-specific adapters (noisy_sc_matmul, noisy_sc_matmul_mlp, noisy_sc_matmul_grouped, noisy_sc_matmul_enable_batched_bipolar) collapsed into one noisy_sc_matmul whose signature mirrors scmp_kernels.sc.sc_matmul. Per-row scaling is derived from (granularity, group_a, group_b).

scripts/{debug_fixed_level_sanity, owen_mode_sweep, sobol_scramble_seed_sweep, sobol_variant_sweep, calibrate_mp_thresholds}.py
tests/test_noise_matmul_adapters.py:
- All flat-API call sites rewritten to sc_matmul(..., granularity=...).
- Pre-computed q_maxs / q_mins arguments dropped where sc_matmul computes them internally.
- Test names follow the new API (test_sc_matmul_per_tensor_vs_noisy etc.).

Verified: every modified file parses (AST); no remaining import or call-site references to the deleted flat-API names anywhere outside the rewritten docstrings.
Not verified (no GPU on dev box): kernel execution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
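The collapsed dispatcher can be pictured as below. Both functions are hypothetical stand-ins that return a tag instead of doing any computation; the real `sc_matmul` and `noisy_sc_matmul` signatures are only partly visible in this commit message, and `get_matmul_fn` with its `use_noise_surrogate` argument is an assumed shape for `_get_matmul_fn`.

```python
def sc_matmul(a, b, *, granularity="per_tensor", **kwargs):
    """Hypothetical stand-in for the Triton-backed sc_matmul."""
    return ("triton", granularity)


def noisy_sc_matmul(a, b, *, granularity="per_tensor", **kwargs):
    """Hypothetical stand-in for the noise surrogate, same signature."""
    return ("noise", granularity)


def get_matmul_fn(use_noise_surrogate):
    """Single dispatcher: choosing the backend is the only branch left."""
    return noisy_sc_matmul if use_noise_surrogate else sc_matmul


# Call sites select behaviour with granularity= rather than with a
# specialised function returned by one of several dispatchers.
qk = get_matmul_fn(use_noise_surrogate=False)(None, None, granularity="per_head")
mlp = get_matmul_fn(use_noise_surrogate=True)(
    None, None, granularity="per_row", chunk_d=64
)
```

Because both backends accept the same keywords, a call site does not need to know which one it received.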

Team policy: scmp_kernels intentionally dropped the packed-XNOR / packed-AND algorithm in favor of the enable-signal table-lookup path. There is no longer an "off" mode for SC matmul; every SC call goes through the enable-signal kernels. The sc_enable flag's only remaining behavior was to gate a (now dead) algorithm switch, so it's removed.

qdit/sc_integration/sc_controller.py
- Drop the ``sc_enable`` constructor arg and the ``self.sc_enable`` attribute. Drop the repr field.

qdit/sc_integration/sc_attention.py
- 8 ``if self.sc_controller.sc_enable`` gates removed:
  • 4 inside ``_sc_linear_dynamic_mp`` / ``_sc_linear_combined_mp`` / ``_sc_linear``: was conditionally adding ``rng_levels`` to kwargs; now passed unconditionally (``_rng_levels`` returns None unless fixed-level mode is set, same behavior).
  • 2 ``sc_enable and sc_mode == "bipolar"`` → just ``sc_mode == "bipolar"`` (per-head bipolar fast path always picks the batched kernel when mode allows).
- ``_rng_levels`` body collapsed (both branches returned None already).
- Also catches up these three legacy methods that the earlier rewrite missed: dropped positional ``x.max().item(), x.min().item(), …`` arguments and switched to ``granularity=`` kwargs so calls go through the unified ``sc_matmul`` dispatcher (would have failed at runtime).

qdit/sc_integration/sc_mlp.py
- ``_rng_levels`` body collapsed; no other sc_enable usage.

qdit/sc_integration/sc_modelutils.py
- Drop ``sc_enable=getattr(args, 'sc_enable', False)`` from the SCController constructor call.

scripts/quant_sc_main.py
- Drop ``--sc_enable`` CLI flag and its mentions in the run-name builder + logging.

scripts/calibrate_mp_thresholds.py
- ``_resolve_level_rng_levels`` body collapsed (was returning None in both branches).

15 shell scripts (batch_mp_sweep, calib_*, run_*gpu*, bench_*, owen_*, unit_test, etc.)
- Drop ``--sc_enable`` argument from every script.

Verified: zero remaining references to ``sc_enable`` anywhere in scmp_diffusion (outside the submodule). All edited Python files parse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scmp_llm/SC/ legacy/bench files (3a/3b/3c/3g): dropped, not migrated. These include sc.py and sc_enable.py (NumPy/PyTorch-CPU SC reference impls, superseded by Triton kernels), the bench_* / compare_* / test_kernel_opt.py comparison scripts (most tied to the deprecated packed-XNOR/AND path resolved in Gap 1), dse.py (kernel DSE), and the matmul_sc_triton / test_all_configs / benchmark_comparison test helpers that lived inside sc_triton.py (also packed-XNOR-bound).

The following ARE application-side and migrate cleanly:

tools/
  calibrate_noise_model.py ← scmp_llm/SC/noise_model_calibration.py
  Calibrates the closed-form noise surrogate that qdit/sc_integration/noise_matmul.py consumes.

evaluation/
  kid.py ← scmp_llm/evaluation/kid.py
  build_full_mosaic.py ← scmp_llm/evaluation/build_full_mosaic.py
  build_sample_grids.py ← scmp_llm/evaluation/build_sample_grids.py
  compare_images.py ← scmp_llm/evaluation/compare_images.py
  FID/KID + sample-grid + side-by-side comparison helpers used in result reporting.

evaluation/imagenet_ref/
  extract.py ← scmp_llm/imagenet256_ref/extract.py
  parallel_npz.py ← scmp_llm/imagenet256_ref/parallel_npz.py
  compute_fid_kid.py ← scmp_llm/imagenet256_ref/compute_fid_kid.py
  compare_grid.py ← scmp_llm/imagenet256_ref/compare_grid.py
  run_openai_eval.sh ← scmp_llm/imagenet256_ref/run_openai_eval.sh
  run_openai_eval_v2.sh ← scmp_llm/imagenet256_ref/run_openai_eval_v2.sh
  ImageNet-256 reference-batch prep (images + FID statistics). Binary artifacts (1.9 GB VIRTUAL_imagenet256_labeled.npz, 1.2 GB images/) deliberately excluded; added to .gitignore.

No SC/MP bare imports in any migrated file (all clean stdlib + torch + numpy + PIL + cleanfid). AST parses for every .py.

After this commit, every Python file that lived in scmp_llm is either
- migrated to scmp_kernels (the 6 active kernel files), or
- migrated to scmp_diffusion (Q-DiT + evaluation + tools + ImageNet ref), or
- intentionally dropped per team policy (CPU refs, legacy benchmarks, packed-XNOR/AND-bound tests, DSE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follows scmp_kernels PR CrucibleComputingGroup#3 (consolidate Triton kernels, 18 → 7). The public surface this repo consumes from scmp_kernels is unchanged: sc_matmul, clear_rng_cache, det_kernel_tuning, scmp_kernels.mp.*, scmp_kernels.sc.config_helpers.make_sobol_simple_config. All Q-DiT integration code (qdit/sc_integration/{sc_attention,sc_mlp,...}.py), calibration scripts, sweep scripts, and tests work unchanged.

Submodule pin: fa6b5cd → b17fcf6 (+ 4 commits: dead-kernel cleanup, matmul-pair merges, dispatcher merges, quant-kernel unification; all bit-identical numerics)

tools/kernel_launch_counter.py: updated the kernel-name list to match the consolidated layout (4 separate fused_quant_* kernels → 1 unified fused_quant_kernel; 6 dead enable_matmul_* kernels removed; 2 merged into IS_BIPOLAR-parameterised sources). Diagnostic tool only; not on the application path.

Verified on RTX PRO 6000 Blackwell:
✓ smoke_test_e2e: all imports, all granularity paths produce sensible rel_err (no deprecated names re-emerged)
✓ compare_old_vs_new_thorough: 111/111 bit-identical to scmp_llm/SC/sc_triton
✓ test_noise_matmul_adapters: all 4 Q-DiT call patterns pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- SCController.halve flag (--sc_halve / SC_HALVE env): run bipolar SC matmuls at stoc_len/2 via halve_bipolar_stoc_len, no accuracy loss; no-op for non-bipolar modes and the noise surrogate
- SCAttention/SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True) when halve is set
- --sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path
- kernel_launch_counter: point DiT ckpt at the turbo path
- bump scmp_kernels submodule to a576b83 (on upstream/main)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
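To illustrate what the two `--sc_qk_granularity` choices change, here is a small pure-Python sketch. It assumes max-absolute-value scaling, which the PR does not confirm; the helper names and the sample matrix are made up for illustration.

```python
def per_head_scale(q_head):
    """One scale for the whole head: max |x| over every row."""
    return max(abs(x) for row in q_head for x in row)


def per_row_scales(q_head):
    """One scale per row: max |x| within each row."""
    return [max(abs(x) for x in row) for row in q_head]


# Two rows of a single head with very different magnitudes.
q_head = [
    [0.5, -0.25, 0.125],
    [4.0, -2.0, 1.0],
]

head_scale = per_head_scale(q_head)
row_scales = per_row_scales(q_head)

# Normalise into [-1, 1] before encoding as stochastic streams.
normalized_per_head = [[x / head_scale for x in row] for row in q_head]
normalized_per_row = [
    [x / s for x in row] for row, s in zip(q_head, row_scales)
]
```

With a per-head scale the small first row is squeezed into a narrow slice of the range, while per-row scaling lets each row use the full [-1, 1] interval, which is the usual motivation for finer granularity.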

heroarmor · 2026-05-27T21:26:57Z

Retargeting to upstream CrucibleComputingGroup/scmp_diffusion (correct base).

heroarmor and others added 7 commits May 11, 2026 22:02

heroarmor closed this May 27, 2026

heroarmor wants to merge 7 commits into main from feat/sc-halve
