feat(sc): uSystolic stoc_len halving + per-row QK granularity #2
Closed
heroarmor wants to merge 7 commits into
Adds CrucibleComputingGroup/scmp_kernels as a submodule at ./scmp_kernels,
pinned to the heroarmor:add-mp-module branch tip (commit fbd7009). Will be
re-pinned to org main once PR #2 lands.
URL: https://github.com/heroarmor/scmp_kernels.git
Path: ./scmp_kernels
Initial pin: fbd7009 (add-mp-module branch: MP module + clipping removal +
flat-API public surface for Q-DiT compatibility)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings the Diffusion application source code from scmp_llm/Q-DiT into
this repo (93 files: qdit/, diffusion/, models/, scripts/, tests/,
utils/, env files). All SC/MP-related bare imports are rewired to use
the scmp_kernels submodule:
  from sc_triton import sc_matmul, sc_matmul_grouped, ...
    → from scmp_kernels.sc import sc_matmul_per_tensor as sc_matmul, sc_matmul_grouped, ...
  from sng import RNGPool, SNGBank
    → (deleted; only used by qdit/sc_integration/sc_matmul.py, which is itself deleted)
  from config_helpers import ...
    → from scmp_kernels.sc.config_helpers import ...
  from mp_config import (...)
    → from scmp_kernels.mp import (...)
Other changes:
- Deleted qdit/sc_integration/sc_matmul.py (vestigial — sc_matmul_qk
had zero callers repo-wide; only path that needed xnor_matmul /
bin_to_stoc_packed).
- Removed corresponding line and __all__ entry in
qdit/sc_integration/__init__.py.
- qdit/sc_integration/mp_config.py is now a thin re-export shim of
scmp_kernels.mp, so local relative imports (from .mp_config import …)
in sc_attention/sc_mlp/sc_controller still resolve.
- Removed every sys.path.insert(0, ".../SC") shim; package imports
replace them.
- Excluded artifacts not migrated: __pycache__/, build/, Q_DiT.egg-info/,
*.nsys-rep, *.png, logs_mp_sweep/, results/, and the 92MB Inception
.pb checkpoint under models/evaluations/.
Verified:
- All rewired .py files parse (AST OK).
- All 8 sc_triton public names Q-DiT imports resolve through
scmp_kernels/sc/__init__.py at the pinned submodule SHA.
- No remaining bare imports of sc_triton/sng/config_helpers/mp_config.
- No remaining sys.path.insert SC shims.
Not verified (no GPU on dev box):
- Actual kernel execution. Needs `pytest tests/` on a CUDA + Triton box.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
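The two static checks listed under "Verified" in the migration commit above (every file parses, no bare imports of the old modules remain) could be scripted roughly as follows. This is a minimal sketch, not the check the author actually ran: the function names `find_bare_imports` and `scan_tree` are hypothetical, and only the four module names come from the commit message. Relative imports such as `from .mp_config import …` are deliberately allowed, since the commit keeps those working through the re-export shim.

```python
import ast
from pathlib import Path

# Module names taken from the commit message's list of bare imports.
BARE_MODULES = {"sc_triton", "sng", "config_helpers", "mp_config"}


def find_bare_imports(source, banned=BARE_MODULES):
    """Return sorted (lineno, module) pairs for absolute imports of banned modules.

    ast.parse raises SyntaxError on unparsable source, which doubles as
    the "AST OK" check. (Hypothetical helper, for illustration only.)
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in banned:
                    hits.append((node.lineno, alias.name))
        elif isinstance(node, ast.ImportFrom):
            # level == 0 is an absolute import; relative imports
            # (level >= 1) resolve inside the package and are allowed.
            if node.level == 0 and node.module:
                if node.module.split(".")[0] in banned:
                    hits.append((node.lineno, node.module))
    return sorted(hits)


def scan_tree(root):
    """Map each offending .py file under root to its list of hits."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        hits = find_bare_imports(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree(".")` from the repository root would return an empty dict when no bare imports remain. Like the checks in the commit, this says nothing about whether the kernels actually execute.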
…patcher
scmp_kernels:
- submodule bumped to fa6b5cd which removes the flat-API duplicates
(sc_matmul_per_tensor, sc_matmul_mlp, sc_matmul_grouped,
sc_matmul_enable_triton, sc_matmul_enable_triton_mlp,
sc_matmul_grouped_enable_triton, sc_matmul_enable_batched_bipolar)
and extends sc_matmul with group_a / group_b / rng_levels kwargs.
qdit/sc_integration/sc_attention.py:
- Collapses four _get_*_fn dispatchers (sc_matmul, mlp, av_grouped,
batched_bipolar) into one _get_matmul_fn that returns either the
real Triton sc_matmul or the noisy surrogate.
- All call sites updated to pass granularity= (per_tensor, per_row,
or per_head) instead of relying on which specialised function was
returned. ~140 lines of branching/positional-arg plumbing removed.
- Per-tensor max/min positional args are dropped — sc_matmul computes
those internally.
qdit/sc_integration/sc_mlp.py:
- Same collapse: single _get_matmul_fn; all MLP linear paths now call
sc_matmul(..., granularity="per_row", chunk_d=..., group_a=...,
group_b=..., rng_levels=...).
qdit/sc_integration/noise_matmul.py:
- Four signature-specific adapters (noisy_sc_matmul,
noisy_sc_matmul_mlp, noisy_sc_matmul_grouped,
noisy_sc_matmul_enable_batched_bipolar) collapsed into one
noisy_sc_matmul whose signature mirrors scmp_kernels.sc.sc_matmul.
Per-row scaling is derived from (granularity, group_a, group_b).
scripts/{debug_fixed_level_sanity, owen_mode_sweep, sobol_scramble_seed_sweep,
sobol_variant_sweep, calibrate_mp_thresholds}.py
tests/test_noise_matmul_adapters.py:
- All flat-API call sites rewritten to sc_matmul(..., granularity=...).
- Pre-computed q_maxs / q_mins arguments dropped where sc_matmul
computes them internally.
- Test names follow the new API (test_sc_matmul_per_tensor_vs_noisy etc.).
Verified: every modified file parses (AST); no remaining import or
call-site references to the deleted flat-API names anywhere outside
the rewritten docstrings.
Not verified (no GPU on dev box): kernel execution.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
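To make the `granularity=` argument from the commit above concrete, here is a small pure-Python sketch of the general difference between per-tensor and per-row scaling. It is an illustration under stated assumptions, not the scmp_kernels implementation: `max_abs_scales` and `scaled_matmul` are hypothetical names, the `levels` rounding is a deterministic stand-in for stochastic encoding, and `per_head`, `group_a`, `group_b`, `chunk_d`, and `rng_levels` are left out because the source does not describe their semantics.

```python
def max_abs_scales(matrix, granularity="per_tensor"):
    """Return one scale per row of `matrix` (a list of equal-length rows).

    per_tensor: every row shares the global max |x|.
    per_row:    each row uses its own max |x|.
    """
    if granularity == "per_tensor":
        g = max(abs(v) for row in matrix for v in row)
        return [g] * len(matrix)
    if granularity == "per_row":
        return [max(abs(v) for v in row) for row in matrix]
    raise ValueError(f"unsupported granularity: {granularity!r}")


def scaled_matmul(a, b, granularity="per_tensor", levels=16):
    """Hypothetical stand-in: normalise rows of `a`, round to a grid, multiply, rescale."""
    scales = max_abs_scales(a, granularity)
    cols = list(zip(*b))  # columns of b
    out = []
    for row, s in zip(a, scales):
        s = s or 1.0  # guard against all-zero rows
        # Round each normalised value onto a grid of `levels` steps in [-1, 1].
        unit = [round(v / s * levels) / levels for v in row]
        out.append([s * sum(x * y for x, y in zip(unit, col)) for col in cols])
    return out


a = [[8.0, -8.0], [0.1, 0.2]]
identity = [[1.0, 0.0], [0.0, 1.0]]
coarse = scaled_matmul(a, identity, "per_tensor")  # small row rounds away
fine = scaled_matmul(a, identity, "per_row")       # small row keeps resolution
```

With a shared scale the second row is rounded to zero, while a per-row scale preserves it. The actual kernels may derive and apply scales differently.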
Team policy: scmp_kernels intentionally dropped the packed-XNOR / packed-AND
algorithm in favor of the enable-signal table-lookup path. There is no
longer an "off" mode for SC matmul — every SC call goes through the
enable-signal kernels. The sc_enable flag's only remaining behavior was
to gate a (now dead) algorithm switch, so it's removed.
qdit/sc_integration/sc_controller.py
- Drop the ``sc_enable`` constructor arg and the ``self.sc_enable``
attribute. Drop the repr field.
qdit/sc_integration/sc_attention.py
- 8 ``if self.sc_controller.sc_enable`` gates removed:
• 4 inside ``_sc_linear_dynamic_mp`` / ``_sc_linear_combined_mp`` /
``_sc_linear`` — was conditionally adding ``rng_levels`` to kwargs;
now passed unconditionally (``_rng_levels`` returns None unless
fixed-level mode is set, same behavior).
• 2 ``sc_enable and sc_mode == "bipolar"`` → just
``sc_mode == "bipolar"`` (per-head bipolar fast path always picks
the batched kernel when mode allows).
- ``_rng_levels`` body collapsed (both branches returned None already).
- Also catches up these three legacy methods, which the earlier rewrite
missed: drops the positional ``x.max().item(), x.min().item(), …``
arguments and switches to ``granularity=`` kwargs so calls go through
the unified ``sc_matmul`` dispatcher (the old calls would have failed at runtime).
qdit/sc_integration/sc_mlp.py
- ``_rng_levels`` body collapsed; no other sc_enable usage.
qdit/sc_integration/sc_modelutils.py
- Drop ``sc_enable=getattr(args, 'sc_enable', False)`` from the
SCController constructor call.
scripts/quant_sc_main.py
- Drop ``--sc_enable`` CLI flag and its mentions in the run-name
builder + logging.
scripts/calibrate_mp_thresholds.py
- ``_resolve_level_rng_levels`` body collapsed (was returning None
in both branches).
15 shell scripts (batch_mp_sweep, calib_*, run_*gpu*, bench_*, owen_*,
unit_test, etc.)
- Drop ``--sc_enable`` argument from every script.
Verified: zero remaining references to ``sc_enable`` anywhere in
scmp_diffusion (outside the submodule). All edited Python files parse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scmp_llm/SC/ legacy/bench files (3a/3b/3c/3g): dropped — not migrated.
These include sc.py and sc_enable.py (NumPy/PyTorch-CPU SC reference
impls, superseded by Triton kernels), the bench_* / compare_* /
test_kernel_opt.py comparison scripts (most tied to the deprecated
packed-XNOR/AND path resolved in Gap 1), dse.py (kernel DSE), and the
matmul_sc_triton / test_all_configs / benchmark_comparison test
helpers that lived inside sc_triton.py (also packed-XNOR-bound).
The following ARE application-side and migrate cleanly:
tools/
calibrate_noise_model.py ← scmp_llm/SC/noise_model_calibration.py
Calibrates the closed-form noise surrogate that
qdit/sc_integration/noise_matmul.py consumes.
evaluation/
kid.py ← scmp_llm/evaluation/kid.py
build_full_mosaic.py ← scmp_llm/evaluation/build_full_mosaic.py
build_sample_grids.py ← scmp_llm/evaluation/build_sample_grids.py
compare_images.py ← scmp_llm/evaluation/compare_images.py
FID/KID + sample-grid + side-by-side comparison
helpers used in result reporting.
evaluation/imagenet_ref/
extract.py ← scmp_llm/imagenet256_ref/extract.py
parallel_npz.py ← scmp_llm/imagenet256_ref/parallel_npz.py
compute_fid_kid.py ← scmp_llm/imagenet256_ref/compute_fid_kid.py
compare_grid.py ← scmp_llm/imagenet256_ref/compare_grid.py
run_openai_eval.sh ← scmp_llm/imagenet256_ref/run_openai_eval.sh
run_openai_eval_v2.sh ← scmp_llm/imagenet256_ref/run_openai_eval_v2.sh
ImageNet-256 reference-batch prep (images +
FID statistics). Binary artifacts (1.9 GB
VIRTUAL_imagenet256_labeled.npz, 1.2 GB images/)
deliberately excluded — added to .gitignore.
No SC/MP bare imports in any migrated file (all clean stdlib + torch +
numpy + PIL + cleanfid). AST parses for every .py.
After this commit, every Python file that lived in scmp_llm is either
- migrated to scmp_kernels (the 6 active kernel files), or
- migrated to scmp_diffusion (Q-DiT + evaluation + tools + ImageNet ref), or
- intentionally dropped per team policy (CPU refs, legacy benchmarks,
packed-XNOR/AND-bound tests, DSE).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follows scmp_kernels PR CrucibleComputingGroup#3 (consolidate Triton kernels, 18 → 7).
The public surface this repo consumes from scmp_kernels is unchanged:
sc_matmul, clear_rng_cache, det_kernel_tuning, scmp_kernels.mp.*,
scmp_kernels.sc.config_helpers.make_sobol_simple_config.
All Q-DiT integration code (qdit/sc_integration/{sc_attention,sc_mlp,...}.py),
calibration scripts, sweep scripts, and tests work unchanged.
Submodule pin: fa6b5cd → b17fcf6
(+ 4 commits: dead-kernel cleanup, matmul-pair merges, dispatcher merges,
quant-kernel unification; all bit-identical numerics)
tools/kernel_launch_counter.py: updated the kernel-name list to match the
consolidated layout (4 separate fused_quant_* kernels → 1 unified
fused_quant_kernel; 6 dead enable_matmul_* kernels removed; 2 merged into
IS_BIPOLAR-parameterised sources). Diagnostic tool only; not on the
application path.
Verified on RTX PRO 6000 Blackwell:
✓ smoke_test_e2e: all imports, all granularity paths produce sensible
rel_err (no deprecated names re-emerged)
✓ compare_old_vs_new_thorough: 111/111 bit-identical to scmp_llm/SC/sc_triton
✓ test_noise_matmul_adapters: all 4 Q-DiT call patterns pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- SCController.halve flag (--sc_halve / SC_HALVE env): run bipolar SC matmuls
at stoc_len/2 via halve_bipolar_stoc_len, no accuracy loss; no-op for
non-bipolar modes and the noise surrogate
- SCAttention/SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True)
when halve is set
- --sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path
- kernel_launch_counter: point DiT ckpt at the turbo path
- bump scmp_kernels submodule to a576b83 (on upstream/main)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
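The flag plumbing described in the commit above could look roughly like the sketch below. The flag names, the `SC_HALVE` environment variable, the `halve_bipolar_stoc_len` kwarg, and the use of `partial` come from the commit message; everything else is an assumption for illustration: `sc_matmul` here is a stand-in that only records its kwargs, `build_parser` and `get_matmul_fn` are hypothetical names, and treating `SC_HALVE=1` as the flag's default is a guess at how the environment variable interacts with the CLI.

```python
import argparse
import os
from functools import partial


def sc_matmul(a, b, *, granularity="per_tensor", halve_bipolar_stoc_len=False):
    """Stand-in for scmp_kernels.sc.sc_matmul; only records the kwargs it received."""
    return {"granularity": granularity, "halve": halve_bipolar_stoc_len}


def build_parser():
    p = argparse.ArgumentParser()
    # Assumption: SC_HALVE=1 in the environment turns the flag on by default.
    p.add_argument("--sc_halve", action="store_true",
                   default=os.environ.get("SC_HALVE") == "1")
    p.add_argument("--sc_qk_granularity", choices=["per_head", "per_row"],
                   default="per_head")
    return p


def get_matmul_fn(halve):
    """Bind the halving kwarg once so call sites stay unchanged."""
    if halve:
        return partial(sc_matmul, halve_bipolar_stoc_len=True)
    return sc_matmul


args = build_parser().parse_args(["--sc_halve", "--sc_qk_granularity", "per_row"])
matmul = get_matmul_fn(args.sc_halve)
result = matmul(None, None, granularity=args.sc_qk_granularity)
```

Binding the kwarg with `partial` means the attention and MLP call sites do not need to know whether halving is active. Per the commit message, the real kernel treats the kwarg as a no-op for non-bipolar modes.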
Retargeting to upstream CrucibleComputingGroup/scmp_diffusion (correct base).
Adds the uSystolic/HUB sign-magnitude stoc_len halving trick and a per-row QK granularity option to the Q-DiT SC integration.
Changes
- SCController.halve flag, surfaced as --sc_halve (and SC_HALVE=1 env). When set, bipolar SC matmuls run at stoc_len/2 via halve_bipolar_stoc_len=True; no accuracy loss. No-op for non-bipolar modes and the noise surrogate.
- SCAttention / SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True) when halve is enabled.
- --sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path (default stays per_head).
- tools/kernel_launch_counter.py: point the DiT ckpt at the turbo path.
- Bump scmp_kernels submodule to a576b83 (already on upstream/main).
Debug/scratch tools were intentionally left out of this PR.
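As background for the halving claim, the general argument for why a sign-magnitude encoding needs half the stream length of a bipolar one can be shown with deterministic unary (thermometer-coded) streams. This is a toy illustration of that general idea only; it is not the uSystolic/HUB kernel in scmp_kernels, and all four function names are hypothetical. A bipolar stream spends its length covering [-n, n], whereas a sign-magnitude stream covers only [0, n] and carries the sign separately.

```python
def encode_bipolar(v, n):
    """Bipolar unary code for integer v in [-n, n]: length 2n, with v + n ones."""
    ones = v + n
    return [1] * ones + [0] * (2 * n - ones)


def decode_bipolar(stream):
    """Value is (#ones - #zeros) / 2, which is always an integer here."""
    ones = sum(stream)
    return (2 * ones - len(stream)) // 2


def encode_sign_magnitude(v, n):
    """Sign bit plus unary magnitude: length n, with |v| ones."""
    return (v < 0, [1] * abs(v) + [0] * (n - abs(v)))


def decode_sign_magnitude(sign, stream):
    magnitude = sum(stream)
    return -magnitude if sign else magnitude


n = 8
for v in range(-n, n + 1):
    bipolar = encode_bipolar(v, n)
    sign, magnitude = encode_sign_magnitude(v, n)
    # Both encodings recover v exactly; sign-magnitude uses half the length.
    assert decode_bipolar(bipolar) == v
    assert decode_sign_magnitude(sign, magnitude) == v
    assert len(bipolar) == 2 * len(magnitude)
```

Both encodings represent the same 2n + 1 values without loss, which is the sense in which halving the stream length need not cost accuracy in this toy setting.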
🤖 Generated with Claude Code