
feat(sc): uSystolic stoc_len halving + per-row QK granularity #2

Closed
heroarmor wants to merge 7 commits into main from feat/sc-halve


@heroarmor
Owner

Adds the uSystolic/HUB sign-magnitude stoc_len halving trick and a per-row QK granularity option to the Q-DiT SC integration.
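
One way to see why halving the stream length need not cost resolution: a bipolar stream of length N encodes the values 2k/N - 1, while a sign bit plus a unipolar magnitude stream of length N/2 encodes ±j/(N/2). The sketch below checks that the two sets of representable values coincide. The function names are hypothetical, and this only illustrates the quantization grid; it is not a description of how scmp_kernels implements halve_bipolar_stoc_len.

```python
from fractions import Fraction


def bipolar_levels(stoc_len):
    """Values encodable by a bipolar stream of length stoc_len: 2*k/N - 1."""
    return {Fraction(2 * k, stoc_len) - 1 for k in range(stoc_len + 1)}


def sign_magnitude_levels(stoc_len):
    """Values encodable by a sign bit plus a unipolar magnitude stream of length stoc_len."""
    magnitudes = [Fraction(j, stoc_len) for j in range(stoc_len + 1)]
    return {sign * m for sign in (1, -1) for m in magnitudes}


if __name__ == "__main__":
    n = 8
    # Compare a bipolar stream of length n with a sign-magnitude stream of length n/2.
    print(bipolar_levels(n) == sign_magnitude_levels(n // 2))
```

Exact fractions are used so the comparison is not affected by floating-point rounding. Stochastic sampling error is not modelled here.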

Changes

  • SCController.halve flag, surfaced as --sc_halve (or the SC_HALVE=1 environment variable). When set, bipolar SC matmuls run at stoc_len/2 via halve_bipolar_stoc_len=True, with no accuracy loss. It is a no-op for non-bipolar modes and for the noise surrogate.
  • SCAttention / SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True) when halve is enabled.
  • --sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path (default stays per_head).
  • tools/kernel_launch_counter.py: point the DiT checkpoint at the turbo path.
  • Bump scmp_kernels submodule to a576b83 (already on upstream/main).
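
The routing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: sc_matmul here is a stand-in that only reports the stream length it would use, the SCController shown models nothing but the halve flag, get_matmul_fn is a hypothetical helper name, and the way SC_HALVE is parsed is an assumption.

```python
import os
from functools import partial


def sc_matmul(a, b, *, stoc_len, sc_mode="bipolar", halve_bipolar_stoc_len=False):
    """Stand-in for the real kernel: returns the stream length it would run at."""
    # Assumption: the kernel itself ignores the flag for non-bipolar modes.
    if halve_bipolar_stoc_len and sc_mode == "bipolar":
        return stoc_len // 2
    return stoc_len


class SCController:
    """Hypothetical minimal controller; only the halve flag is modelled."""

    def __init__(self, halve=False):
        # Assumption: the environment variable enables the flag when set to "1".
        self.halve = halve or os.environ.get("SC_HALVE") == "1"


def get_matmul_fn(controller):
    """Return sc_matmul, pre-bound with the halving kwarg when halve is set."""
    if controller.halve:
        return partial(sc_matmul, halve_bipolar_stoc_len=True)
    return sc_matmul


if __name__ == "__main__":
    fn = get_matmul_fn(SCController(halve=True))
    print(fn(None, None, stoc_len=64))                      # bipolar path
    print(fn(None, None, stoc_len=64, sc_mode="unipolar"))  # flag has no effect
```

Binding the kwarg once with functools.partial keeps the call sites in the attention and MLP paths unchanged, which is presumably why the PR takes this route.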

Debug/scratch tools were intentionally left out of this PR.

🤖 Generated with Claude Code

heroarmor and others added 7 commits May 11, 2026 22:02
Adds CrucibleComputingGroup/scmp_kernels as a submodule at ./scmp_kernels,
pinned to heroarmor:add-mp-module branch tip (commit fbd7009). Will be
re-pinned to org main once PR #2 lands.

URL: https://github.com/heroarmor/scmp_kernels.git
Path: ./scmp_kernels
Initial pin: fbd7009 (add-mp-module branch — MP module + clipping removal
+ flat-API public surface for Q-DiT compatibility)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings the Diffusion application source code from scmp_llm/Q-DiT into
this repo (93 files: qdit/, diffusion/, models/, scripts/, tests/,
utils/, env files). All SC/MP-related bare imports are rewired to use
the scmp_kernels submodule:

  from sc_triton import sc_matmul, sc_matmul_grouped, ...
    → from scmp_kernels.sc import sc_matmul_per_tensor as sc_matmul,
                                  sc_matmul_grouped, ...
  from sng import RNGPool, SNGBank          → (deleted — only used by
                                                qdit/sc_integration/sc_matmul.py,
                                                which is itself deleted)
  from config_helpers import ...            → from scmp_kernels.sc.config_helpers import ...
  from mp_config import (...)               → from scmp_kernels.mp import (...)

Other changes:
  - Deleted qdit/sc_integration/sc_matmul.py (vestigial — sc_matmul_qk
    had zero callers repo-wide; only path that needed xnor_matmul /
    bin_to_stoc_packed).
  - Removed corresponding line and __all__ entry in
    qdit/sc_integration/__init__.py.
  - qdit/sc_integration/mp_config.py is now a thin re-export shim of
    scmp_kernels.mp, so local relative imports (from .mp_config import …)
    in sc_attention/sc_mlp/sc_controller still resolve.
  - Removed every sys.path.insert(0, ".../SC") shim; package imports
    replace them.
  - Excluded artifacts not migrated: __pycache__/, build/, Q_DiT.egg-info/,
    *.nsys-rep, *.png, logs_mp_sweep/, results/, and the 92MB Inception
    .pb checkpoint under models/evaluations/.

Verified:
  - All rewired .py files parse (AST OK).
  - All 8 sc_triton public names Q-DiT imports resolve through
    scmp_kernels/sc/__init__.py at the pinned submodule SHA.
  - No remaining bare imports of sc_triton/sng/config_helpers/mp_config.
  - No remaining sys.path.insert SC shims.
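
A check along the lines of the "Verified" items above could look like the sketch below. It is not the script used for this commit; the helper names are hypothetical. It parses each file with ast (which raises SyntaxError if a file does not parse) and reports absolute imports of the four legacy module names, while leaving relative imports such as `from .mp_config import …` alone.

```python
import ast
from pathlib import Path

# Legacy top-level modules named in the commit message.
BARE_MODULES = {"sc_triton", "sng", "config_helpers", "mp_config"}


def bare_imports(source):
    """Return the legacy top-level module names that `source` imports absolutely."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                if top in BARE_MODULES:
                    found.add(top)
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            top = node.module.split(".")[0]
            if top in BARE_MODULES:
                found.add(top)
    return found


def scan(root):
    """Map each offending .py file under `root` to the legacy modules it imports."""
    report = {}
    for path in Path(root).rglob("*.py"):
        hits = bare_imports(path.read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan("qdit")` from the repository root would return an empty dict when no bare imports remain. Only the top-level module name is matched, so `scmp_kernels.sc.config_helpers` is not flagged.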

Not verified (no GPU on dev box):
  - Actual kernel execution. Needs `pytest tests/` on a CUDA + Triton box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…patcher

scmp_kernels:
  - submodule bumped to fa6b5cd which removes the flat-API duplicates
    (sc_matmul_per_tensor, sc_matmul_mlp, sc_matmul_grouped,
    sc_matmul_enable_triton, sc_matmul_enable_triton_mlp,
    sc_matmul_grouped_enable_triton, sc_matmul_enable_batched_bipolar)
    and extends sc_matmul with group_a / group_b / rng_levels kwargs.

qdit/sc_integration/sc_attention.py:
  - Collapses four _get_*_fn dispatchers (sc_matmul, mlp, av_grouped,
    batched_bipolar) into one _get_matmul_fn that returns either the
    real Triton sc_matmul or the noisy surrogate.
  - All call sites updated to pass granularity= (per_tensor, per_row,
    or per_head) instead of relying on which specialised function was
    returned. ~140 lines of branching/positional-arg plumbing removed.
  - Per-tensor max/min positional args are dropped — sc_matmul computes
    those internally.

qdit/sc_integration/sc_mlp.py:
  - Same collapse: single _get_matmul_fn; all MLP linear paths now call
    sc_matmul(..., granularity="per_row", chunk_d=..., group_a=...,
    group_b=..., rng_levels=...).

qdit/sc_integration/noise_matmul.py:
  - Four signature-specific adapters (noisy_sc_matmul,
    noisy_sc_matmul_mlp, noisy_sc_matmul_grouped,
    noisy_sc_matmul_enable_batched_bipolar) collapsed into one
    noisy_sc_matmul whose signature mirrors scmp_kernels.sc.sc_matmul.
    Per-row scaling is derived from (granularity, group_a, group_b).

scripts/{debug_fixed_level_sanity, owen_mode_sweep, sobol_scramble_seed_sweep,
         sobol_variant_sweep, calibrate_mp_thresholds}.py
tests/test_noise_matmul_adapters.py:
  - All flat-API call sites rewritten to sc_matmul(..., granularity=...).
  - Pre-computed q_maxs / q_mins arguments dropped where sc_matmul
    computes them internally.
  - Test names follow the new API (test_sc_matmul_per_tensor_vs_noisy etc.).

Verified: every modified file parses (AST); no remaining import or
call-site references to the deleted flat-API names anywhere outside
the rewritten docstrings.

Not verified (no GPU on dev box): kernel execution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Team policy: scmp_kernels intentionally dropped the packed-XNOR / packed-AND
algorithm in favor of the enable-signal table-lookup path. There is no
longer an "off" mode for SC matmul — every SC call goes through the
enable-signal kernels. The sc_enable flag's only remaining behavior was
to gate a (now dead) algorithm switch, so it's removed.

qdit/sc_integration/sc_controller.py
  - Drop the ``sc_enable`` constructor arg and the ``self.sc_enable``
    attribute. Drop the repr field.

qdit/sc_integration/sc_attention.py
  - 8 ``if self.sc_controller.sc_enable`` gates removed:
      • 4 inside ``_sc_linear_dynamic_mp`` / ``_sc_linear_combined_mp`` /
        ``_sc_linear`` — was conditionally adding ``rng_levels`` to kwargs;
        now passed unconditionally (``_rng_levels`` returns None unless
        fixed-level mode is set, same behavior).
      • 2 ``sc_enable and sc_mode == "bipolar"`` → just
        ``sc_mode == "bipolar"`` (per-head bipolar fast path always picks
        the batched kernel when mode allows).
  - ``_rng_levels`` body collapsed (both branches returned None already).
  - Also catches up the three legacy methods that the earlier rewrite
    missed: drops the positional ``x.max().item(), x.min().item(), …``
    arguments and switches to ``granularity=`` kwargs so calls go through
    the unified ``sc_matmul`` dispatcher (the old calls would have failed
    at runtime).
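
The rng_levels change described above can be illustrated with a small sketch. The sc_matmul here is a stand-in that echoes the value it receives, and it assumes the real kernel defaults rng_levels to None; call_old and call_new are hypothetical names for the before and after call patterns.

```python
def sc_matmul(a, b, *, granularity, rng_levels=None):
    """Stand-in kernel: echoes the rng_levels it was given."""
    return rng_levels


def call_old(a, b, sc_enable, rng_levels):
    """Previous pattern: rng_levels is only added to kwargs behind the gate."""
    kwargs = {"granularity": "per_row"}
    if sc_enable:
        kwargs["rng_levels"] = rng_levels
    return sc_matmul(a, b, **kwargs)


def call_new(a, b, rng_levels):
    """New pattern: rng_levels is always passed; None means 'no fixed levels'."""
    return sc_matmul(a, b, granularity="per_row", rng_levels=rng_levels)


if __name__ == "__main__":
    # With rng_levels=None (no fixed-level mode) both patterns agree,
    # whether or not the old gate was open.
    print(call_old(0, 0, False, None) == call_new(0, 0, None))
    print(call_old(0, 0, True, [4, 8]) == call_new(0, 0, [4, 8]))
```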

qdit/sc_integration/sc_mlp.py
  - ``_rng_levels`` body collapsed; no other sc_enable usage.

qdit/sc_integration/sc_modelutils.py
  - Drop ``sc_enable=getattr(args, 'sc_enable', False)`` from the
    SCController constructor call.

scripts/quant_sc_main.py
  - Drop ``--sc_enable`` CLI flag and its mentions in the run-name
    builder + logging.

scripts/calibrate_mp_thresholds.py
  - ``_resolve_level_rng_levels`` body collapsed (was returning None
    in both branches).

15 shell scripts (batch_mp_sweep, calib_*, run_*gpu*, bench_*, owen_*,
unit_test, etc.)
  - Drop ``--sc_enable`` argument from every script.

Verified: zero remaining references to ``sc_enable`` anywhere in
scmp_diffusion (outside the submodule). All edited Python files parse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

scmp_llm/SC/ legacy/bench files (3a/3b/3c/3g): dropped — not migrated.
These include sc.py and sc_enable.py (NumPy/PyTorch-CPU SC reference
impls, superseded by Triton kernels), the bench_* / compare_* /
test_kernel_opt.py comparison scripts (most tied to the deprecated
packed-XNOR/AND path resolved in Gap 1), dse.py (kernel DSE), and the
matmul_sc_triton / test_all_configs / benchmark_comparison test
helpers that lived inside sc_triton.py (also packed-XNOR-bound).

The following ARE application-side and migrate cleanly:

tools/
  calibrate_noise_model.py            ← scmp_llm/SC/noise_model_calibration.py
                                        Calibrates the closed-form noise surrogate that
                                        qdit/sc_integration/noise_matmul.py consumes.

evaluation/
  kid.py                              ← scmp_llm/evaluation/kid.py
  build_full_mosaic.py                ← scmp_llm/evaluation/build_full_mosaic.py
  build_sample_grids.py               ← scmp_llm/evaluation/build_sample_grids.py
  compare_images.py                   ← scmp_llm/evaluation/compare_images.py
                                        FID/KID + sample-grid + side-by-side comparison
                                        helpers used in result reporting.

evaluation/imagenet_ref/
  extract.py                          ← scmp_llm/imagenet256_ref/extract.py
  parallel_npz.py                     ← scmp_llm/imagenet256_ref/parallel_npz.py
  compute_fid_kid.py                  ← scmp_llm/imagenet256_ref/compute_fid_kid.py
  compare_grid.py                     ← scmp_llm/imagenet256_ref/compare_grid.py
  run_openai_eval.sh                  ← scmp_llm/imagenet256_ref/run_openai_eval.sh
  run_openai_eval_v2.sh               ← scmp_llm/imagenet256_ref/run_openai_eval_v2.sh
                                        ImageNet-256 reference-batch prep (images +
                                        FID statistics). Binary artifacts (1.9 GB
                                        VIRTUAL_imagenet256_labeled.npz, 1.2 GB images/)
                                        deliberately excluded — added to .gitignore.

No SC/MP bare imports in any migrated file (all clean stdlib + torch +
numpy + PIL + cleanfid). AST parses for every .py.

After this commit, every Python file that lived in scmp_llm is either
- migrated to scmp_kernels (the 6 active kernel files), or
- migrated to scmp_diffusion (Q-DiT + evaluation + tools + ImageNet ref), or
- intentionally dropped per team policy (CPU refs, legacy benchmarks,
  packed-XNOR/AND-bound tests, DSE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follows scmp_kernels PR CrucibleComputingGroup#3 (consolidate Triton kernels, 18 → 7). The
public surface this repo consumes from scmp_kernels is unchanged:
sc_matmul, clear_rng_cache, det_kernel_tuning, scmp_kernels.mp.*,
scmp_kernels.sc.config_helpers.make_sobol_simple_config. All Q-DiT
integration code (qdit/sc_integration/{sc_attention,sc_mlp,...}.py),
calibration scripts, sweep scripts, and tests work unchanged.

Submodule pin:
  fa6b5cd  →  b17fcf6
  (+ 4 commits: dead-kernel cleanup, matmul-pair merges, dispatcher
   merges, quant-kernel unification — all bit-identical numerics)

tools/kernel_launch_counter.py: updated the kernel-name list to match
the consolidated layout (4 separate fused_quant_* kernels → 1 unified
fused_quant_kernel; 6 dead enable_matmul_* kernels removed; 2 merged
into IS_BIPOLAR-parameterised sources). Diagnostic tool only — not on
the application path.

Verified on RTX PRO 6000 Blackwell:
  ✓ smoke_test_e2e — all imports, all granularity paths produce
    sensible rel_err (no deprecated names re-emerged)
  ✓ compare_old_vs_new_thorough — 111/111 bit-identical to
    scmp_llm/SC/sc_triton
  ✓ test_noise_matmul_adapters — all 4 Q-DiT call patterns pass

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- SCController.halve flag (--sc_halve / SC_HALVE env): run bipolar SC matmuls
  at stoc_len/2 via halve_bipolar_stoc_len, no accuracy loss; no-op for
  non-bipolar modes and the noise surrogate
- SCAttention/SCMlp route through partial(sc_matmul, halve_bipolar_stoc_len=True)
  when halve is set
- --sc_qk_granularity {per_head,per_row}: per-row QK scaling to match the AV path
- kernel_launch_counter: point DiT ckpt at the turbo path
- bump scmp_kernels submodule to a576b83 (on upstream/main)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
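
To make the per_head versus per_row distinction in this commit concrete, the sketch below computes scaling factors at both granularities and shows how the two CLI flags might be declared. It assumes abs-max scaling, which the PR does not state; qk_scales and build_parser are hypothetical names, and declaring --sc_halve as a store_true flag is a guess. Only the flag names, the two choices, and the per_head default come from the PR text.

```python
import argparse


def abs_max(rows):
    """Largest absolute value across a list of rows."""
    return max(abs(v) for row in rows for v in row)


def qk_scales(heads, granularity):
    """Return one scale per row for every head.

    heads: list of heads, each a list of rows (lists of floats).
    """
    if granularity == "per_head":
        # One shared scale for all rows of a head.
        return [[abs_max(head)] * len(head) for head in heads]
    if granularity == "per_row":
        # Each row gets its own scale.
        return [[abs_max([row]) for row in head] for head in heads]
    raise ValueError(f"unknown granularity: {granularity!r}")


def build_parser():
    """Hypothetical declaration of the two flags named in the commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--sc_halve", action="store_true")
    parser.add_argument("--sc_qk_granularity",
                        choices=["per_head", "per_row"], default="per_head")
    return parser


if __name__ == "__main__":
    heads = [[[1.0, -2.0], [0.5, 0.25]]]
    print(qk_scales(heads, "per_head"))
    print(qk_scales(heads, "per_row"))
```

With per_row, a row of small values is no longer scaled by an outlier elsewhere in the same head.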
@heroarmor
Owner Author

Retargeting to upstream CrucibleComputingGroup/scmp_diffusion (correct base).

heroarmor closed this May 27, 2026