Skip to content

[Issue 101][eval-and-fix] Honor SeedVR2 GroupNorm memory limit (closes pollockjj/mydevelopment#187)#31

Merged
pollockjj merged 4 commits into
issue_101from
issue_187
May 3, 2026
Merged

[Issue 101][eval-and-fix] Honor SeedVR2 GroupNorm memory limit (closes pollockjj/mydevelopment#187)#31
pollockjj merged 4 commits into
issue_101from
issue_187

Conversation

@pollockjj
Copy link
Copy Markdown
Owner

Plan: Honor SeedVR2 GroupNorm memory limit

Overview

The SeedVR2 native VAE in comfy/ldm/seedvr/vae.py exposes a configured GroupNorm memory cap (norm_max_memset_norm_limit(...) → module-level _NORM_LIMIT) but the chunked-GroupNorm dispatch gate at vae.py:509 reads the literal float("inf") instead of get_norm_limit(). The configured cap is therefore a no-op and the chunked branch is unreachable for any finite limit. This plan records that unreachability as a baseline decision packet on issue_101_pi HEAD, applies the one-line gate fix on pollockjj/ComfyUI#issue_187 (preserving the GiB calculation and the parallel structure of the chunked-conv sibling at lines 659–660), adds a regression test pair, runs a hygiene gate (ruff + production-code-shape independence), and emits a post-fix provenance decision packet. PR creation is delegated to the /pr skill after slice completion.

Diagnosis Summary

  • Failure signature. comfy/ldm/seedvr/vae.py line 509 reads if isinstance(norm_layer, ops.GroupNorm) and memory_occupy > float("inf"): # TODO: this may be set dynamically from the vae, and the gate's right-hand operand is the literal positive infinity. No finite memory_occupy (computed on line 508 as x.numel() * x.element_size() / 1024**3 GiB) can exceed positive infinity, so the chunked-GroupNorm else branch at line 521 always runs and the chunked branch at lines 510–520 is unreachable.
  • Reproduction. set_norm_limit(1e-9) followed by causal_norm_wrapper(ops.GroupNorm(num_channels=8, num_groups=4), torch.randn(1,8,2,4,4)) runs the full GroupNorm path (the nn.Module forward hook fires once with shape (2,8,4,4)) regardless of the configured limit. Measured live on issue_101_pi HEAD 4ffef0a4.
  • Root cause. Placeholder literal float("inf") was left in place of the configured limit accessor get_norm_limit() (defined on lines 185–186 of the same file). The author marked the line with # TODO: this may be set dynamically from the vae. The mutator set_norm_limit(value) (lines 189–193) and the upstream config bridge VideoAutoencoderKL.set_memory_limit(conv_max_mem, norm_max_mem) (lines 2297–2298) are wired correctly; only the gate's RHS is wrong.
  • Failure boundary. Any caller that constructs a 5D input (x.ndim == 5) routed through causal_norm_wrapper(GroupNorm, x) is affected; the boundary is exactly the isinstance(norm_layer, ops.GroupNorm) branch starting at line 502 with the 5D sub-branch starting at line 505. The 4D sub-branch (line 503–504) is unaffected because it never hits the gate.
  • Proposed fix. Replace memory_occupy > float("inf") with memory_occupy > get_norm_limit() at line 509. Preserve line 508 (the GiB calc) verbatim. The fix mirrors the chunked-conv sibling pattern at lines 659–660, where InflatedCausalConv3d.forward gates against the per-instance self.memory_limit. The # TODO comment may be removed by the slicer (resolved by the fix) but its retention or removal is not load-bearing for the AC contract.
  • Verification strategy. A CPU-only reachability probe with explicit set_norm_limit(1e-9) will exercise the gate and emit named-key JSON describing which branch ran. Pre-fix the JSON records chunked_groupnorm_reachable=false; post-fix the JSON records chunked_groupnorm_reachable=true. A regression test pair in tests-unit/comfy_test/ asserts the same dispatch behavior under pytest — the low-limit test exercises the chunked branch via a comfy.ldm.seedvr.vae.F.group_norm spy that records every direct call's num_groups argument and asserts at least one call has num_groups < gn.num_groups (the chunked branch's signature). A hygiene slice runs ruff on the touched files and asserts no observability detuning was introduced. A final decision-packet slice records post-fix provenance.

Affected Repositories

Repository Path Base Branch Issue Branch
pollockjj/mydevelopment /home/johnj/dev_cuda_1/mydevelopment main issue_187
pollockjj/ComfyUI /home/johnj/dev_cuda_0/ComfyUI issue_101_pi issue_187

Research and Methodology

Plan Foundations Comment URL: https://github.com/pollockjj/mydevelopment/issues/187#issuecomment-4361175762

Detected Scope: none — single-line gate fix where the existing vae.py code IS the spec; the chunked-conv sibling at vae.py:659-660 already canonicalizes the configured-limit pattern this defect must restore for the GroupNorm pathway.

The linked comment is the load-bearing artifact for every measurement, equivalence rule, canonical-protocol, tool version, and pipeline-stage citation in this plan. Every AC below that names a metric, threshold, equivalence rule, canonical protocol, tool version, or pipeline stage traces by quote or URL to an entry in that comment's ## Research and Methodology section. qa-plan verifies the comment exists and contains the required subsections for the detected scope.

Tools, Pipeline, and Measurements

Plan Foundations Comment URL: https://github.com/pollockjj/mydevelopment/issues/187#issuecomment-4361175762

The linked comment's ## Tools, Pipeline, and Measurements section enumerates: (1) Existing Tooling Inventory with REUSE / KNOWN-GOOD-REF / NEW / WAIVED status per tool; (2) Pipeline stages from input to output artifact; (3) Measurements table with tool+version, invocation, score range, tolerance; (4) Single-probe-before-sweep / boundary-bracketing requirements (both N/A for this plan and explicitly justified). Every AC below that names a tool, version, pipeline stage, or measurement traces by quote or URL to an entry in that section.

Ground Truth Probes

For every literal API surface that any AC references — attribute, method, function signature, parameter list, CLI flag, file path, env var, magic constant — captured live on the issue's base branch pollockjj/ComfyUI:issue_101_pi (HEAD 4ffef0a4) on 2026-05-01 from /home/johnj/dev_cuda_0/ComfyUI. Plan-introduced literals (probe JSON keys, test names, paths created by this plan) are anchored separately in ## Created Surface Contract, not here.

Probe ID Consumed by AC(s) Probe command Observed output (verbatim)
GTP-1 S1 AC-2, S2 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; args.cpu=True; import comfy.ldm.seedvr.vae as v, inspect; print(inspect.signature(v.causal_norm_wrapper))" (norm_layer: torch.nn.modules.module.Module, x: torch.Tensor) -> torch.Tensor
GTP-2 S1 AC-2, S2 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; args.cpu=True; import comfy.ldm.seedvr.vae as v, inspect; print(inspect.signature(v.set_norm_limit)); print(inspect.signature(v.get_norm_limit)); print(repr(v._NORM_LIMIT))" (value: Optional[float] = None) / () / inf
GTP-3 S1 AC-2, S2 AC-2 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; args.cpu=True; import comfy.ops as o; print(type(o.disable_weight_init.GroupNorm(num_channels=8, num_groups=4)).__mro__[:3])" (<class 'comfy.ops.disable_weight_init.GroupNorm'>, <class 'torch.nn.modules.normalization.GroupNorm'>, <class 'torch.nn.modules.module.Module'>)
GTP-4 S1 AC-2, S2 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; args.cpu=True; import torch, comfy.ops as o; gn=o.disable_weight_init.GroupNorm(num_channels=8,num_groups=4); gn.eval(); calls=[]; gn.register_forward_hook(lambda m,i,out: calls.append(tuple(i[0].shape))); _=gn(torch.randn(2,8,4,4)); print(calls)" [(2, 8, 4, 4)]
GTP-5 S1 AC-2, S2 AC-1, S2 AC-3 sed -n '508,509p' /home/johnj/dev_cuda_0/ComfyUI/comfy/ldm/seedvr/vae.py line 508: memory_occupy = x.numel() * x.element_size() / 1024**3 ; line 509: if isinstance(norm_layer, ops.GroupNorm) and memory_occupy > float("inf"): # TODO: this may be set dynamically from the vae
GTP-6 S2 AC-2, S3 AC-1 ls /home/johnj/dev_cuda_0/ComfyUI/tests-unit/comfy_test/ folder_path_test.py / model_detection_test.py / __pycache__ / test_seedvr_rope_delegation.py / test_seedvr_vae_tiled_args_no_mutate.py (no test_seedvr_groupnorm_limit.py present)
GTP-7 S1 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "print(9.5e-7 > float('inf')); print(9.5e-7 > 1e-9)" False / True
GTP-8 S1 AC-2, S2 AC-1, S2 AC-3 sed -n '184,193p' /home/johnj/dev_cuda_0/ComfyUI/comfy/ldm/seedvr/vae.py _NORM_LIMIT = float("inf") / def get_norm_limit(): / return _NORM_LIMIT / (blank) / (blank) / def set_norm_limit(value: Optional[float] = None): / global _NORM_LIMIT / if value is None: / value = float("inf") / _NORM_LIMIT = value
GTP-9 S1 AC-2, S2 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/pip list 2>/dev/null | grep -E "^(torch|einops)\\b" einops 0.8.2 / torch 2.11.0+cu130
GTP-10 S2 AC-2, S3 AC-1 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -m pytest --version 2>&1 | head -1; ruff --version 2>&1 | head -1 pytest 9.0.3 / ruff 0.15.9
GTP-11 S1 AC-2, S2 AC-2 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; print(type(args.cpu).__name__, args.cpu); args.cpu=True; print(args.cpu)" bool False / True
GTP-12 S1 AC-3, S4 AC-1 git -C /home/johnj/dev_cuda_0/ComfyUI rev-parse HEAD; git -C /home/johnj/dev_cuda_0/ComfyUI rev-parse --abbrev-ref HEAD 4ffef0a410bb0f999ee2579b297b26470fab2a22 / issue_101_pi
GTP-13 S1 AC-2, S2 AC-2, S2 AC-3 cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python -c "from comfy.cli_args import args; args.cpu=True; import comfy.ldm.seedvr.vae as v; import torch.nn.functional as F; print(v.F is F); print(v.F.group_norm is F.group_norm); print(v.F.__name__)" True / True / torch.nn.functional
GTP-14 S2 AC-2 grep -nE 'args\\.cpu = True|torch\\.cuda\\.is_available' /home/johnj/dev_cuda_0/ComfyUI/tests-unit/comfy_test/test_seedvr_rope_delegation.py 51:if not torch.cuda.is_available(): / 52: args.cpu = True / 136: marks=pytest.mark.skipif(not torch.cuda.is_available(), reason="no cuda"),

If a planned literal cannot be probed (slot offline, source absent, tool missing) the AC is rewritten to remove the unverified literal or the slice is rescoped. No probe → no AC.

Created Surface Contract

These literals are introduced by this plan and therefore are NOT subject to Phase 0.5 ground-truth probing. They become probable after the slice that creates them lands.

Probe-script API (Slice 1 introduces; Slice 2 reuses): --out PATH CLI flag (writes UTF-8 JSON to PATH); --limit VALUE CLI flag (calls set_norm_limit(float(VALUE)); default 1e-9); script exit code 0 on success.

Probe JSON named keys (Slice 1 introduces baseline; Slice 2 emits fixed): chunked_groupnorm_reachable (bool), full_groupnorm_calls (int), chunked_groupnorm_calls (int), computed_memory_gib (float, GiB), configured_norm_limit_gib (float, GiB), tensor_shape (list of int), tensor_dtype (string), groupnorm_num_channels (int), groupnorm_num_groups (int), mydevelopment_head_sha (string), comfyui_head_sha (string), comfyui_branch (string), vae_line_509 (string verbatim).

Comparison JSON named keys (Slice 2 introduces): branch_reachability_delta (string, value "false->true"), prior_artifact_path (string, relative path), new_artifact_path (string, relative path), prior_chunked_groupnorm_reachable (bool), new_chunked_groupnorm_reachable (bool), prior_full_groupnorm_calls (int), new_full_groupnorm_calls (int), prior_chunked_groupnorm_calls (int), new_chunked_groupnorm_calls (int).

Test names (Slice 2 introduces): test_seedvr_groupnorm_default_limit_uses_full_groupnorm_path, test_seedvr_groupnorm_low_limit_uses_chunked_groupnorm_path. Both live in module tests_unit.comfy_test.test_seedvr_groupnorm_limit.

File paths (Slices 1–4 introduce):

  • pollockjj/mydevelopment#issue_187: github_issues/187/probe_seedvr_groupnorm_limit.py (S1), github_issues/187/slice1/groupnorm_limit_baseline.json (S1), github_issues/187/slice1/probe_run.log (S1), github_issues/187/slice1/provenance.md (S1), github_issues/187/slice2/groupnorm_limit_fixed.json (S2), github_issues/187/slice2/before_after_comparison.json (S2), github_issues/187/slice2/probe_run.log (S2), github_issues/187/slice2/pytest.log (S2), github_issues/187/slice2/vae_post_fix_line509.txt (S2), github_issues/187/slice2/vae_diff.patch (S2), github_issues/187/slice3/ruff.log (S3), github_issues/187/slice3/no_detuning.log (S3), github_issues/187/slice4/provenance.md (S4).
  • pollockjj/ComfyUI#issue_187: comfy/ldm/seedvr/vae.py (modified at line 509; line 508 preserved verbatim) (S2), tests-unit/comfy_test/test_seedvr_groupnorm_limit.py (S2 introduces).

Source-line replacement (Slice 2 introduces): at comfy/ldm/seedvr/vae.py line 509 the literal memory_occupy > float("inf") is replaced with memory_occupy > get_norm_limit(). The # TODO: this may be set dynamically from the vae trailing comment is no longer load-bearing; the slicer may retain or remove it without changing AC pass status.

Asset Readiness

Asset Location Provenance Status
pollockjj/ComfyUI:issue_101_pi HEAD 4ffef0a4 /home/johnj/dev_cuda_0/ComfyUI merged 2026-05-01 via PR #30 (per #101 fleet status) VERIFIED (GTP-12)
Source surface comfy/ldm/seedvr/vae.py (gate at line 509, calc at line 508, _NORM_LIMIT/accessors at 184–193, set_memory_limit bridge at 2297–2301) /home/johnj/dev_cuda_0/ComfyUI/comfy/ldm/seedvr/vae.py base branch issue_101_pi HEAD VERIFIED (GTP-5, GTP-8)
Test directory tests-unit/comfy_test/ /home/johnj/dev_cuda_0/ComfyUI/tests-unit/comfy_test/ base branch issue_101_pi HEAD VERIFIED (GTP-6)
Established test pattern in tests-unit/comfy_test/test_seedvr_rope_delegation.py (args.cpu = True block before comfy.ldm.* import) /home/johnj/dev_cuda_0/ComfyUI/tests-unit/comfy_test/test_seedvr_rope_delegation.py base branch issue_101_pi HEAD VERIFIED (GTP-14)
Deliverable venv (torch==2.11.0+cu130, einops==0.8.2, pytest==9.0.3) /home/johnj/dev_cuda_0/ComfyUI/.venv host slot VERIFIED (GTP-9, GTP-10)
ruff 0.15.9 binary /home/johnj/.local/bin/ruff host slot VERIFIED (GTP-10)
comfy.cli_args.args.cpu flag /home/johnj/dev_cuda_0/ComfyUI/comfy/cli_args.py base branch issue_101_pi HEAD VERIFIED (GTP-11)
comfy.ldm.seedvr.vae.F binding == torch.nn.functional source binding base branch VERIFIED (GTP-13)
pollockjj/mydevelopment:main HEAD cde849a6 /home/johnj/dev_cuda_1/mydevelopment local checkout VERIFIED (orchestrator workspace)
Plan Foundations comment ID IC_kwDOR2e1q88AAAABA_JC0g issue Comfy-Org#187 comment 4361175762 tdd-agent[bot] post 2026-05-01T19:21:41Z VERIFIED

No external assets (model weights, video/image datasets, OS packages) are required.

Slices


Slice 1: Pre-fix Baseline Reachability Probe

Kind: decision-packet

Objective: Produce the baseline reachability decision packet showing the chunked-GroupNorm branch is unreachable on pollockjj/ComfyUI:issue_101_pi HEAD 4ffef0a4 before any production fix is applied.

Acceptance Criteria

  • AC-1: File github_issues/187/probe_seedvr_groupnorm_limit.py is committed on pollockjj/mydevelopment:issue_187 and is a Python module that exposes a --out PATH CLI flag, a --limit VALUE CLI flag (default 1e-9), exits 0 on success, and writes a UTF-8 JSON file to PATH containing exactly the named-key set {chunked_groupnorm_reachable, full_groupnorm_calls, chunked_groupnorm_calls, computed_memory_gib, configured_norm_limit_gib, tensor_shape, tensor_dtype, groupnorm_num_channels, groupnorm_num_groups, mydevelopment_head_sha, comfyui_head_sha, comfyui_branch, vae_line_509} — verified by committed artifact github_issues/187/slice1/probe_run.log capturing the stdout of cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/probe_seedvr_groupnorm_limit.py --help showing the two flags and exit-0 behavior, plus the JSON artifact existing at the path declared in AC-2.

    • Assumes-passed: none
    • In-scope-of: Slice 1 Objective "Produce the baseline reachability decision packet showing the chunked-GroupNorm branch is unreachable on pollockjj/ComfyUI:issue_101_pi HEAD 4ffef0a4 before any production fix is applied."
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-2: Probe is invoked as cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/probe_seedvr_groupnorm_limit.py --out /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/slice1/groupnorm_limit_baseline.json --limit 1e-9 against issue_101_pi HEAD 4ffef0a4 (no production fix applied) and the resulting github_issues/187/slice1/groupnorm_limit_baseline.json is committed on pollockjj/mydevelopment:issue_187 with the named-key values chunked_groupnorm_reachable=false, full_groupnorm_calls=1, chunked_groupnorm_calls=0, configured_norm_limit_gib=1e-9, computed_memory_gib populated to a positive float in (0, 1e-5], tensor_shape=[1,8,2,4,4], tensor_dtype="torch.float32", groupnorm_num_channels=8, groupnorm_num_groups=4, vae_line_509 containing the substring memory_occupy > float("inf") — verified by committed artifact github_issues/187/slice1/groupnorm_limit_baseline.json parsed and key-checked, plus committed artifact github_issues/187/slice1/probe_run.log capturing the probe's stdout/stderr.

    • Assumes-passed: none
    • In-scope-of: Slice 1 Objective "Produce the baseline reachability decision packet showing the chunked-GroupNorm branch is unreachable on pollockjj/ComfyUI:issue_101_pi HEAD 4ffef0a4 before any production fix is applied."
    • Probe: GTP-1, GTP-2, GTP-3, GTP-4, GTP-5, GTP-7, GTP-8, GTP-9, GTP-11, GTP-13
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-3: File github_issues/187/slice1/provenance.md is committed on pollockjj/mydevelopment:issue_187 containing exactly the named-line keys decision_summary:, recommended_action:, mydevelopment_head_sha:, comfyui_head_sha:, comfyui_branch:, coderabbit_url:, upstream_pr_url:, probe_command:, probe_output_path:, with decision_summary value baseline_groupnorm_chunking_unreachable, recommended_action value apply_gate_fix_on_pollockjj/ComfyUI_issue_187, comfyui_head_sha value 4ffef0a410bb0f999ee2579b297b26470fab2a22, comfyui_branch value issue_101_pi, coderabbit_url value https://github.com/Comfy-Org/ComfyUI/pull/11294#discussion_r2959796343, upstream_pr_url value https://github.com/Comfy-Org/ComfyUI/pull/11294, probe_command value matching the AC-2 command verbatim, probe_output_path value github_issues/187/slice1/groupnorm_limit_baseline.json — verified by committed artifact github_issues/187/slice1/provenance.md and grep '^<key>: ' parseability against the listed key set.

    • Assumes-passed: none
    • In-scope-of: Slice 1 Objective "Produce the baseline reachability decision packet showing the chunked-GroupNorm branch is unreachable on pollockjj/ComfyUI:issue_101_pi HEAD 4ffef0a4 before any production fix is applied."
    • Probe: GTP-12
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z

Slice 2: Gate Fix, Regression Test Pair, Post-Fix Probe and Comparison

Kind: implementation

Objective: Prove that replacing the > float("inf") literal at comfy/ldm/seedvr/vae.py line 509 with > get_norm_limit() makes the chunked-GroupNorm branch reachable under a configured set_norm_limit(...) while preserving the default-limit (full-path) behavior, by landing the one-line gate fix on pollockjj/ComfyUI:issue_187, adding a two-test regression module that fails pre-fix on the chunked-path test, re-running the same probe to record chunked_groupnorm_reachable=true, and emitting a before/after delta artifact.

Acceptance Criteria

  • AC-1: File comfy/ldm/seedvr/vae.py on pollockjj/ComfyUI:issue_187 HEAD contains the literal memory_occupy > get_norm_limit() on the same logical line that previously read memory_occupy > float("inf") (the diff is restricted to this single token replacement on the if isinstance(norm_layer, ops.GroupNorm) and ... line); line 508 ( memory_occupy = x.numel() * x.element_size() / 1024**3) is preserved verbatim — verified by committed artifact github_issues/187/slice2/vae_post_fix_line509.txt containing the output of cd /home/johnj/dev_cuda_0/ComfyUI && grep -nF 'memory_occupy > get_norm_limit()' comfy/ldm/seedvr/vae.py (one matching line) AND committed artifact github_issues/187/slice2/vae_diff.patch containing the output of git -C /home/johnj/dev_cuda_0/ComfyUI diff issue_101_pi..issue_187 -- comfy/ldm/seedvr/vae.py whose hunk header references line 509 and whose changed lines contain exactly one - line with the substring memory_occupy > float("inf") and exactly one + line with the substring memory_occupy > get_norm_limit().

    • Pre-fix-fingerprint: §Diagnosis Summary / Root cause — vae.py:509 reads if isinstance(norm_layer, ops.GroupNorm) and memory_occupy > float("inf"): (placeholder literal float("inf") left in place of the configured-limit accessor get_norm_limit() defined on lines 185-186 per GTP-8); pre-fix grep -nF 'memory_occupy > get_norm_limit()' comfy/ldm/seedvr/vae.py returns no match
    • Post-fix-expectation: line 509's right-hand operand of the and is memory_occupy > get_norm_limit(); grep -nF 'memory_occupy > float("inf")' /home/johnj/dev_cuda_0/ComfyUI/comfy/ldm/seedvr/vae.py returns no match; line 508 contains the GiB calc verbatim
    • Assumes-passed: 1
    • In-scope-of: Slice 2 Objective "Prove that replacing the > float("inf") literal at comfy/ldm/seedvr/vae.py line 509 with > get_norm_limit() makes the chunked-GroupNorm branch reachable under a configured set_norm_limit(...) while preserving the default-limit (full-path) behavior, by landing the one-line gate fix on pollockjj/ComfyUI:issue_187, adding a two-test regression module that fails pre-fix on the chunked-path test, re-running the same probe to record chunked_groupnorm_reachable=true, and emitting a before/after delta artifact."
    • Probe: GTP-5, GTP-8
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-2: File tests-unit/comfy_test/test_seedvr_groupnorm_limit.py on pollockjj/ComfyUI:issue_187 HEAD defines the two functions test_seedvr_groupnorm_default_limit_uses_full_groupnorm_path and test_seedvr_groupnorm_low_limit_uses_chunked_groupnorm_path. The first sets set_norm_limit(None), constructs comfy.ops.disable_weight_init.GroupNorm(num_channels=8, num_groups=4) with a registered nn.Module forward hook that appends to a list, runs causal_norm_wrapper(gn, torch.randn(1,8,2,4,4)), and asserts the hook fired exactly once AND that the output shape equals (1,8,2,4,4) AND restores set_norm_limit(None) in a finally. The second sets set_norm_limit(1e-9), builds the same fixture, additionally patches comfy.ldm.seedvr.vae.F.group_norm with a unittest.mock.patch.object spy whose side_effect records the num_groups argument of every direct call AND delegates to the real torch.nn.functional.group_norm, runs the same causal_norm_wrapper(gn, torch.randn(1,8,2,4,4)), and asserts the hook fired exactly zero times AND that the spy recorded at least one direct call whose num_groups argument is strictly less than gn.num_groups (which is 4) AND that the output shape equals (1,8,2,4,4) AND restores set_norm_limit(None) in a finally. The module sets comfy.cli_args.args.cpu = True before importing any comfy.ldm.* symbol when torch.cuda.is_available() is False (matching the established pattern in tests-unit/comfy_test/test_seedvr_rope_delegation.py lines 51-52 per GTP-14). Both tests pass under cd /home/johnj/dev_cuda_0/ComfyUI && python -m pytest -q tests-unit/comfy_test/test_seedvr_groupnorm_limit.py with overall exit 0 and the summary line containing 2 passed — verified by committed artifact github_issues/187/slice2/pytest.log capturing stdout+stderr of that pytest invocation including the 2 passed summary line and test names.

    • Pre-fix-fingerprint: §Diagnosis Summary / Reproduction — applying this same test module against issue_101_pi HEAD (no fix) causes test_seedvr_groupnorm_low_limit_uses_chunked_groupnorm_path to fail with AssertionError on the len(forward_hook_count) == 0 assertion (the gate > float("inf") keeps the full path active so the hook fires once with shape (2,8,4,4) per GTP-4 / GTP-7); pytest summary contains 1 failed, 1 passed
    • Post-fix-expectation: pytest summary contains 2 passed; the low-limit test's forward_hook_count is [] (zero entries); the low-limit test's F.group_norm spy recorded list contains at least one entry with num_groups < 4; the default-limit test's forward_hook_count has exactly one entry with shape (2,8,4,4); both tests' return tensors have shape (1,8,2,4,4)
    • Assumes-passed: 1
    • In-scope-of: Slice 2 Objective "Prove that replacing the > float("inf") literal at comfy/ldm/seedvr/vae.py line 509 with > get_norm_limit() makes the chunked-GroupNorm branch reachable under a configured set_norm_limit(...) while preserving the default-limit (full-path) behavior, by landing the one-line gate fix on pollockjj/ComfyUI:issue_187, adding a two-test regression module that fails pre-fix on the chunked-path test, re-running the same probe to record chunked_groupnorm_reachable=true, and emitting a before/after delta artifact."
    • Probe: GTP-1, GTP-2, GTP-3, GTP-4, GTP-6, GTP-9, GTP-10, GTP-11, GTP-13, GTP-14
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-3: Probe is invoked as cd /home/johnj/dev_cuda_0/ComfyUI && .venv/bin/python /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/probe_seedvr_groupnorm_limit.py --out /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/slice2/groupnorm_limit_fixed.json --limit 1e-9 against pollockjj/ComfyUI:issue_187 HEAD (post-fix from AC-1) and the resulting github_issues/187/slice2/groupnorm_limit_fixed.json is committed on pollockjj/mydevelopment:issue_187 with the named-key values chunked_groupnorm_reachable=true, full_groupnorm_calls=0, chunked_groupnorm_calls=2, configured_norm_limit_gib=1e-9, computed_memory_gib populated to the same positive float as the Slice 1 baseline (within 1e-12 absolute), tensor_shape=[1,8,2,4,4], tensor_dtype="torch.float32", groupnorm_num_channels=8, groupnorm_num_groups=4, vae_line_509 containing the substring memory_occupy > get_norm_limit() and NOT containing float("inf") — verified by committed artifact github_issues/187/slice2/groupnorm_limit_fixed.json parsed and key-checked, plus committed artifact github_issues/187/slice2/probe_run.log.

    • Pre-fix-fingerprint: §Diagnosis Summary / Reproduction — same probe run on issue_101_pi HEAD (Slice 1 AC-2 outcome) records chunked_groupnorm_reachable=false, full_groupnorm_calls=1, chunked_groupnorm_calls=0, vae_line_509 containing float("inf") (the gate > float("inf") keeps the full path active per GTP-5 / GTP-7); the existing _NORM_LIMIT / get_norm_limit() accessors at lines 184-193 (GTP-8) are unused by the gate
    • Post-fix-expectation: chunked_groupnorm_reachable=true, full_groupnorm_calls=0, chunked_groupnorm_calls=2, vae_line_509 containing get_norm_limit() and not containing float("inf"); the gate now reads from the same get_norm_limit() accessor that the configured norm_max_mem flows into via set_norm_limit(...) (per GTP-8 + GTP-2)
    • Assumes-passed: 1
    • In-scope-of: Slice 2 Objective "Prove that replacing the > float("inf") literal at comfy/ldm/seedvr/vae.py line 509 with > get_norm_limit() makes the chunked-GroupNorm branch reachable under a configured set_norm_limit(...) while preserving the default-limit (full-path) behavior, by landing the one-line gate fix on pollockjj/ComfyUI:issue_187, adding a two-test regression module that fails pre-fix on the chunked-path test, re-running the same probe to record chunked_groupnorm_reachable=true, and emitting a before/after delta artifact."
    • Probe: GTP-1, GTP-2, GTP-3, GTP-4, GTP-5, GTP-7, GTP-8, GTP-13
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-4: File github_issues/187/slice2/before_after_comparison.json is committed on pollockjj/mydevelopment:issue_187 containing exactly the named-key set {branch_reachability_delta, prior_artifact_path, new_artifact_path, prior_chunked_groupnorm_reachable, new_chunked_groupnorm_reachable, prior_full_groupnorm_calls, new_full_groupnorm_calls, prior_chunked_groupnorm_calls, new_chunked_groupnorm_calls} with values branch_reachability_delta="false->true", prior_artifact_path="github_issues/187/slice1/groupnorm_limit_baseline.json", new_artifact_path="github_issues/187/slice2/groupnorm_limit_fixed.json", prior_chunked_groupnorm_reachable=false, new_chunked_groupnorm_reachable=true, prior_full_groupnorm_calls=1, new_full_groupnorm_calls=0, prior_chunked_groupnorm_calls=0, new_chunked_groupnorm_calls=2 — verified by committed artifact at the named path parsed and key-checked.

    • Pre-fix-fingerprint: §Diagnosis Summary / Verification strategy — pre-Slice-2 only the Slice 1 baseline JSON exists (recording chunked_groupnorm_reachable=false); the comparison artifact does not exist and branch_reachability_delta cannot be "false->true" because no post-fix run has been performed
    • Post-fix-expectation: branch_reachability_delta="false->true" and the per-key value pairs above match; the comparison JSON exists at the named path on the issue branch
    • Assumes-passed: 1
    • In-scope-of: Slice 2 Objective "Prove that replacing the > float("inf") literal at comfy/ldm/seedvr/vae.py line 509 with > get_norm_limit() makes the chunked-GroupNorm branch reachable under a configured set_norm_limit(...) while preserving the default-limit (full-path) behavior, by landing the one-line gate fix on pollockjj/ComfyUI:issue_187, adding a two-test regression module that fails pre-fix on the chunked-path test, re-running the same probe to record chunked_groupnorm_reachable=true, and emitting a before/after delta artifact."
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z

Slice 3: Hygiene Validator Gate (Ruff Lint + Production-Code Shape Independence)

Kind: hygiene

Objective: Prove the Slice 2 commits leave the touched files ruff-clean and free of any observability detuning (no added print / logging.debug / sentinel comments) that could land as production-code noise alongside the gate fix.

Acceptance Criteria

  • AC-1: cd /home/johnj/dev_cuda_0/ComfyUI && python -m ruff check comfy/ldm/seedvr/vae.py tests-unit/comfy_test/test_seedvr_groupnorm_limit.py exits 0 against pollockjj/ComfyUI:issue_187 HEAD — verified by committed validator-output artifact github_issues/187/slice3/ruff.log capturing stdout+stderr of that invocation and ending with the All checks passed! line, with shell exit code recorded as 0 in the log's final line RUFF_EXIT: 0.

    • Assumes-passed: 1, 2
    • In-scope-of: Slice 3 Objective "Prove the Slice 2 commits leave the touched files ruff-clean and free of any observability detuning (no added print / logging.debug / sentinel comments) that could land as production-code noise alongside the gate fix."
    • Probe: GTP-6, GTP-10
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z
  • AC-2: Slice 2 introduced zero observability detuning (print, logging.debug, logging.info, sentinel comments, debug markers) in comfy/ldm/seedvr/vae.py and in tests-unit/comfy_test/test_seedvr_groupnorm_limit.py — verified by committed validator-output artifact github_issues/187/slice3/no_detuning.log containing the combined output of two greps: (a) grep -nE '^[+]\s*(print|logging\.debug|logging\.info|# TODO\s*(?:DEBUG|MARKER|SENTINEL))' /home/johnj/dev_cuda_1/mydevelopment/github_issues/187/slice2/vae_diff.patch returning no matches, AND (b) grep -nE '\bprint\(|logging\.debug|logging\.info|# TODO\s*(?:DEBUG|MARKER|SENTINEL)' /home/johnj/dev_cuda_0/ComfyUI/tests-unit/comfy_test/test_seedvr_groupnorm_limit.py returning no matches, with the log's final line recording NO_DETUNING_EXIT: 0.

    • Assumes-passed: 1, 2
    • In-scope-of: Slice 3 Objective "Prove the Slice 2 commits leave the touched files ruff-clean and free of any observability detuning (no added print / logging.debug / sentinel comments) that could land as production-code noise alongside the gate fix."
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z

Slice 4: Post-Fix Provenance Decision Packet

Kind: decision-packet

Objective: Produce the post-fix provenance decision packet binding the issue branch HEADs, the canonical-reference URLs, and the verbatim invocation commands used by Slices 2 and 3 into a single architect-readable record.

Acceptance Criteria

  • AC-1: File github_issues/187/slice4/provenance.md is committed on pollockjj/mydevelopment:issue_187 containing exactly the named-line keys decision_summary:, recommended_action:, mydevelopment_head_sha:, comfyui_head_sha:, comfyui_branch:, comfyui_base_sha:, comfyui_base_branch:, coderabbit_url:, upstream_pr_url:, probe_command:, probe_output_path:, regression_test_command:, ruff_command:, with decision_summary value post_fix_groupnorm_chunking_restored, recommended_action value invoke_/pr_from_issue_187_to_issue_101_pi, comfyui_base_sha value 4ffef0a410bb0f999ee2579b297b26470fab2a22, comfyui_base_branch value issue_101_pi, comfyui_branch value issue_187, coderabbit_url value https://github.com/Comfy-Org/ComfyUI/pull/11294#discussion_r2959796343, upstream_pr_url value https://github.com/Comfy-Org/ComfyUI/pull/11294, probe_command value matching the Slice 2 AC-3 command verbatim, regression_test_command value matching the Slice 2 AC-2 pytest invocation verbatim, ruff_command value matching the Slice 3 AC-1 ruff invocation verbatim — verified by committed artifact at the named path and grep '^<key>: ' parseability against the listed key set.
    • Assumes-passed: 1, 2, 3
    • In-scope-of: Slice 4 Objective "Produce the post-fix provenance decision packet binding the issue branch HEADs, the canonical-reference URLs, and the verbatim invocation commands used by Slices 2 and 3 into a single architect-readable record."
    • Probe: GTP-12
    • Foundations-pin: comment-id=IC_kwDOR2e1q88AAAABA_JC0g edited-at=2026-05-01T20:29:35Z

Constraints

  • Python: python (resolved by orchestrator/slicer to the deliverable venv at /home/johnj/dev_cuda_0/ComfyUI/.venv/bin/python for ComfyUI-touching commands; no AC hardcodes an absolute interpreter path).
  • Runner: debug/run_debug.py (not exercised — this plan is CPU-only static-source plus pytest; no ComfyUI launch).
  • Isolation flags: --use-process-isolation --disable-cuda-malloc (not exercised — see above).
  • No packages installed into host venv (orchestrator workspace).
  • No pkill, no rm -rf, no python main.py.
  • All commits on issue branch issue_187 in both pollockjj/mydevelopment and pollockjj/ComfyUI, not on base branches main / issue_101_pi.
  • File-family lock per #186: do NOT touch comfy_extras/nodes_seedvr.py, comfy/sd.py, or comfy/ldm/seedvr/model.py. The fix is bounded to comfy/ldm/seedvr/vae.py line 509.
  • Relevant Existing Tooling:
    • pytest 9.0.3 in deliverable venv — REUSE; cited from tools_register.md#pytest.
    • ruff 0.15.9 system binary at /home/johnj/.local/bin/ruff — REUSE; ComfyUI CI invocation pattern is python -m ruff check . per .github/workflows/ruff.yml.
    • torch 2.11.0+cu130, einops 0.8.2 in deliverable venv — REUSE.
    • comfy.ops.disable_weight_init.GroupNorm in deliverable source — REUSE; constructable on CPU per GTP-3.
    • nn.Module.register_forward_hook (torch) and unittest.mock.patch.object (stdlib) — REUSE for measurement instruments.
    • comfy.ldm.seedvr.vae _NORM_LIMIT / get_norm_limit / set_norm_limit accessors — REUSE; the plan only fixes the consumer at line 509, not the accessors.
    • gh CLI — REUSE for issue body update via scripts/run_tdd_post.py update-issue (PR creation is the /pr skill's responsibility, not this plan).
  • Evidence Primitive Requirements:
    • Branch reachability claim → forward-hook count + comfy.ldm.seedvr.vae.F.group_norm direct-call spy together (the two instruments unambiguously identify which branch ran; chunked branch fires zero hook events AND records direct F.group_norm calls with num_groups < gn.num_groups).
    • Behavioral-change claim (gate fix) → one-line static-source check (grep) + committed git diff artifact + regression test that fails pre-fix and passes post-fix (per equivalence_methods.md#code-behavior-equivalence).
    • Decision-packet artifacts → JSON or markdown files with the named-key set declared in ## Created Surface Contract; grep '^<key>:' parseable for key existence; python -m json.tool parseable for structural validity.
    • Hygiene validator output → ruff log + grep log; both exit 0 / no matches respectively.
  • Repository Hygiene (local pre-submit gate, NOT a QA artifact):
    • Every dirty repo state is resolved by the agent without git stash and without external decision deferral.
    • Dirty paths are classified as committed work, precise committed .gitignore rules for safe generated bulk, or discarded generated debris.
    • Clean committed work already present on the active branch is repo provenance and is integrated, not removed as dirty-state cleanup.
    • Slicers and Melian must run live git status --short gates locally and must refuse to submit while any touched repo is dirty.
    • Plans MUST NOT require repo_hygiene.md, repo_hygiene.txt, raw git status --short transcripts, or CLEAN markers as QA evidence; this plan deliberately omits any hygiene-evidence AC. QA verifies submitted commits, branch refs, and artifact presence through GitHub rather than trusting slicer-authored cleanliness transcripts.

Out of Scope

  • The other seven Leg-1 sibling defects under #186: #188 frame count, #189 batch/time axes, #190 / #192 forward return contracts, #191 latent metadata, #193 cache clearing, #194 decode state guards. Each owns its own scope and its own issue branch.
  • Any benchmark, video-quality, or end-to-end runtime measurement (no PSNR / SSIM / LPIPS / DOVER; no real OOM under sustained encode/decode; no debug/run_debug.py launches).
  • Custom-node changes (numz/ComfyUI-SeedVR2_VideoUpscaler, ComfyUI-SeedVR2-Canonical, etc.). Their forks of set_memory_limit and causal_norm_wrapper already use get_norm_limit() correctly per the upstream SeedVR repo; only the in-tree comfy/ldm/seedvr/vae.py carries the placeholder gate.
  • Edits to debug harness files (debug/run_debug.py, debug/reset_gpu.py, debug/harness_runtime.py, etc.).
  • File-family edits owned by other slots: comfy_extras/nodes_seedvr.py, comfy/sd.py, comfy/ldm/seedvr/model.py (per #186 file-family lock).
  • PR creation, codex review, and Copilot review on the resulting PR. These are the /pr and /pr-review skills' responsibilities; this plan only lands committed work on issue branches and exits.
  • Any change to the chunked-conv sibling at vae.py:659-660 or to the upstream config bridge set_memory_limit(...) at vae.py:2297-2301. The fix only changes one logical line in causal_norm_wrapper.
  • Removing or rewriting the # TODO comment is optional; AC pass status does not depend on it.
  • Any change to _NORM_LIMIT's default value, set_norm_limit's Noneinf semantics, or get_norm_limit's return shape. The gate's RHS is the only mutation.
  • Slicer-authored repo_hygiene.md / repo_hygiene.txt / raw git status --short transcripts / CLEAN markers as QA evidence; repository cleanliness is a local pre-submit gate enforced by Melian and tdd-slice, not a committed artifact this plan asks for.
  • Slice 0 calibration against a published anchor; the plan's I/O contract is categorical (boolean reachability + integer call counts) and does not name a measurement instrument with a published reference, per the foundations comment's Scope: none justification.

pollockjj added 2 commits May 1, 2026 15:49
…chunked GroupNorm path is reachable

The 5D GroupNorm gate in causal_norm_wrapper compared memory_occupy
against float('inf'), making the chunked dispatch branch unreachable
under any finite memory estimate. The configured _NORM_LIMIT (set via
set_norm_limit / VAE.set_memory_limit norm_max_mem) had no effect.

Replace the hardcoded float('inf') with get_norm_limit() so the
configured limit actually controls the gate. The GiB calc line is
preserved unchanged.

CodeRabbit finding on upstream PR Comfy-Org#11294:
Comfy-Org#11294 (comment)

Add tests-unit/comfy_test/test_seedvr_groupnorm_limit.py with two
regression cases covering default-limit (full path) and low-limit
(chunked path), each discriminated via a forward-hook + F.group_norm
spy pair.
…8,2,4,4)) per AC-2 contract

QA gate slice 2 returned NOT MET on AC-2: shared fixture used
torch.zeros(*_TENSOR_SHAPE, dtype=torch.float32) instead of the
contract-required literal torch.randn(1,8,2,4,4). Inline the call in
both test bodies so the literal substring causal_norm_wrapper(gn,
torch.randn(1,8,2,4,4)) appears in source. Forward hook, F.group_norm
patch.object spy, output-shape assertions, and set_norm_limit(None)
restoration in finally are preserved unchanged.

Both named tests still pass:
  python -m pytest -q tests-unit/comfy_test/test_seedvr_groupnorm_limit.py
  -> 2 passed, exit 0
Ruff still clean:
  python -m ruff check comfy/ldm/seedvr/vae.py tests-unit/comfy_test/test_seedvr_groupnorm_limit.py
  -> All checks passed
Copilot AI review requested due to automatic review settings May 3, 2026 21:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Restores SeedVR2 VAE 5D GroupNorm chunking behavior by honoring the configured norm_max_mem limit (via get_norm_limit()), and adds regression tests to prevent the dispatch gate from regressing to an unreachable chunked path.

Changes:

  • Fix causal_norm_wrapper GroupNorm dispatch gate to compare against get_norm_limit() instead of float("inf").
  • Add unit tests that assert full-path behavior at the default limit and chunked-path behavior at a low limit.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
comfy/ldm/seedvr/vae.py Makes the GroupNorm chunking gate honor the configured norm memory limit.
tests-unit/comfy_test/test_seedvr_groupnorm_limit.py Adds regression coverage to distinguish full vs chunked GroupNorm execution paths.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

handle = gn.register_forward_hook(_hook)
try:
with patch.object(vae_mod.F, "group_norm", side_effect=_group_norm_spy):
out_tensor = causal_norm_wrapper(gn, torch.randn(1,8,2,4,4))
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOT-REQUIRED as a Ruff failure: E231 is in the active rule set (E family selected, not in lint.ignore) but Ruff CI on this PR ran SUCCESS on these exact lines. Ruff's E231 implementation does not flag this construction in current versions.

Adopted as a DRY improvement: the file already defines _TENSOR_SHAPE = (1, 8, 2, 4, 4) on line 54 and uses it on lines 92/94/144/146. Inlining torch.randn(1,8,2,4,4) with a different format on construction was an inconsistency. Switched to torch.randn(*_TENSOR_SHAPE) in commit 2e681fb.

handle = gn.register_forward_hook(_hook)
try:
with patch.object(vae_mod.F, "group_norm", side_effect=_group_norm_spy):
out_tensor = causal_norm_wrapper(gn, torch.randn(1,8,2,4,4))
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same disposition as the line-83 thread: NOT-REQUIRED as Ruff E231 (CI green), adopted as a DRY improvement using *_TENSOR_SHAPE in commit 2e681fb.

Copy link
Copy Markdown
Owner Author

@pollockjj pollockjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — [Issue 101][eval-and-fix] Honor SeedVR2 GroupNorm memory limit

Scope: 1 production line + 158-line regression test. Base issue_101 ← head issue_187. Closes pollockjj/mydevelopment#187.

Overview

Fixes the causal_norm_wrapper chunked-GroupNorm dispatch gate at comfy/ldm/seedvr/vae.py:509 so it reads the configured limit accessor (get_norm_limit()) instead of the placeholder float("inf") literal that made the chunked branch unreachable for any finite limit. Mirrors the canonical pattern already used by the chunked-conv sibling at vae.py:659-660. Adds a regression test pair locking in dispatch behavior under both the default-inf and a deliberately-low 1e-9 GiB limit.

Correctness

  • The one-line change matches the diagnosis exactly. Line 508 (the memory_occupy = x.numel() * x.element_size() / 1024**3 GiB calc) is preserved byte-equal; only the gate RHS changes.
  • get_norm_limit() is defined at vae.py:185-186 of the same module and the set_norm_limit(...) mutator + the VideoAutoencoderKL.set_memory_limit(...) bridge are already wired correctly — only the gate consumer was missing.
  • Default _NORM_LIMIT = float("inf") preserves prior behavior at any call site that does not invoke set_norm_limit(...). The default-limit test verifies this directly.
  • # TODO: this may be set dynamically from the vae correctly removed since the TODO is resolved by the fix itself.

Test design — strong

  • Two independent observers discriminate the branches:
    1. register_forward_hook on gn catches the full-forward path (which invokes norm_layer(x) and triggers nn.Module.__call__).
    2. patch.object(vae_mod.F, "group_norm", side_effect=_group_norm_spy) catches direct F.group_norm calls; the chunked branch dispatches via F.group_norm with num_groups_per_chunk = norm_layer.num_groups // num_chunks, so any spy call with num_groups < _NUM_GROUPS is the chunked-branch signature.
  • Both tests assert the positive and negative branch fires/does-not-fire — the default-limit test asserts chunked_calls == 0 (not just full_calls == 1) and vice versa for the low-limit test. Catches a regression where both branches fire (e.g., an or instead of an if/else somewhere downstream).
  • Output-shape assertion (out_tensor.shape == _TENSOR_SHAPE) confirms the wrapper's 5D round-trip (b c t h w → (b t) c h w → b c t h w) is preserved on both branches.
  • set_norm_limit(None) reset in both finally blocks isolates module-global state across test runs. Consistent with the established sibling pattern in tests-unit/comfy_test/test_seedvr_rope_delegation.py.
  • CPU-only by construction (cli_args.cpu = True before the comfy.ldm.* import; gn.eval(); tiny float32 tensor) — no GPU required.

Suggestions / observations

  1. Coverage limited to disable_weight_init.GroupNorm. The production gate at line 509 reads isinstance(norm_layer, ops.GroupNorm) where ops resolves at module scope. Both disable_weight_init.GroupNorm and manual_cast.GroupNorm could appear in production paths depending on the active backend. The tests cover only disable_weight_init. If the dispatch is contractually identical for both subclasses (and they share the ops.GroupNorm MRO confirmed by GTP-3), a parametrized test would close the gap cheaply.
  2. No numerical-equivalence check between branches. The tests verify dispatch (which branch ran), not output correctness (does the chunked branch produce the same numerical output as the full branch within float tolerance?). For a gate fix this scope is reasonable, but a follow-up torch.allclose(full_output, chunked_output, rtol=..., atol=...) test would catch any future correctness drift in the chunked path.
  3. Spy signature eps=1e-5 defaults — minor. _group_norm_spy(input_tensor, num_groups_arg, weight=None, bias=None, eps=1e-5) matches today's F.group_norm signature, but if a future nn.functional.group_norm ever gains a kwarg (e.g., device=None) the spy would TypeError. Replacing the explicit-kwarg signature with (*args, **kwargs) and unpacking would future-proof at minimal cost.
  4. patch.object(vae_mod.F, "group_norm", ...) mutates the global torch.nn.functional.group_norm for the duration of the context (since vae_mod.F is torch.nn.functional per GTP-13). Intentional and documented in the module docstring; flagged for completeness — any concurrent test that calls F.group_norm from a different code path during the patch window would route through the spy. Not an issue for the current single-threaded pytest invocation.
  5. Base-branch nomenclature. The Affected Repositories table in the PR body lists base branch issue_101_pi, but the PR itself targets issue_101. Confirm the rename was intentional (looks like the _pi suffix was the planning-time tag).

Performance

N/A — one-line gate change, no algorithmic delta. The test pair completes in milliseconds with the small-tensor + CPU configuration.

Security

No new external surface. No auth, network, or untrusted-input handling. Production change is purely an internal accessor swap.

Verdict

The fix is minimal, correct, and surgically scoped. The test design is the strongest part — two-observer dispatch discrimination is robust against future refactors of either branch. The suggestions above are non-blocking; the parametrized-ops.GroupNorm coverage and the chunked-vs-full numerical-equivalence test are worth tracking as follow-up but should not gate this merge.

Looks good to land.

pollockjj added 2 commits May 3, 2026 16:37
…pilot review NOT-REQUIRED-but-DRY: lines 83/135 inlined torch.randn(1,8,2,4,4); aligning with the assertion sites on 92/94/144/146 that reference the constant)
…ignature

Codex review findings on PR #31:

- F1 (NOT-REQUIRED-non-blocking, applied): parametrize both tests over
  comfy.ops.disable_weight_init.GroupNorm AND comfy.ops.manual_cast.GroupNorm.
  Production gate at vae.py:509 reads isinstance(norm_layer, ops.GroupNorm)
  which matches both via MRO; coverage gap closed. Verified manual_cast
  exists with MRO (manual_cast.GroupNorm, disable_weight_init.GroupNorm,
  torch.nn.modules.normalization.GroupNorm).

- F3 (NOT-REQUIRED-non-blocking, applied): spy signature widened from
  explicit-kwargs (input, num_groups, weight=None, bias=None, eps=1e-5) to
  (input, num_groups, *args, **kwargs) with passthrough. Future torch
  F.group_norm kwargs no longer TypeError.

Findings F2 (numerical-equivalence test) tracked as separate follow-up;
F4 (patch.object global mutation) acknowledged informational; F5
(base-branch rename issue_101_pi -> issue_101) confirmed intentional in
review reply.
@pollockjj
Copy link
Copy Markdown
Owner Author

Disposition — codex review of 2026-05-03 21:32

Verdict adopted: "looks good to land." Five findings dispositioned and addressed where actionable.

F1 — Coverage limited to disable_weight_init.GroupNorm

NOT-REQUIRED-non-blocking, applied. Verified comfy.ops.manual_cast.GroupNorm exists with MRO (manual_cast.GroupNorm, disable_weight_init.GroupNorm, torch.nn.modules.normalization.GroupNorm)isinstance(norm_layer, ops.GroupNorm) matches both via inheritance. Both tests now parametrized via pytest.param(..., id="disable_weight_init") and pytest.param(..., id="manual_cast"). 4/4 cases pass on llamatron-cuda_1. Commit fa0432d2.

F2 — No numerical-equivalence check between branches

NOT-REQUIRED, tracked as follow-up. Out of scope for a dispatch-gate fix. Worth a separate issue (CPU-only torch.allclose between full and chunked outputs at matched seed) — flagging for the queue, not gating this merge.

F3 — Spy signature eps=1e-5 defaults

NOT-REQUIRED-non-blocking, applied. _group_norm_spy widened from explicit (input, num_groups, weight=None, bias=None, eps=1e-5) to (input, num_groups, *args, **kwargs) with passthrough to real_group_norm(input, num_groups, *args, **kwargs). Future F.group_norm kwargs no longer break the spy. Commit fa0432d2.

F4 — patch.object(vae_mod.F, "group_norm", ...) mutates global

INFORMATIONAL, acknowledged. Intentional and documented in the module docstring (the docstring explicitly cites F is torch.nn.functional per GTP-13 and the patch scope). No concurrent test paths invoke F.group_norm during the spy window in the current pytest invocation. No action.

F5 — Base-branch nomenclature (issue_101_pi vs issue_101)

ANSWER: rename intentional. During this session we hard-reset issue_101 := issue_101_pi (zero commits lost; issue_101_pi was a strict prefix containing only the pre-merge subset, while issue_101_pi carried the May-01 yousef-rafat re-merge + the process-isolation test infrastructure landed via PR #30). issue_101_pi was then deleted as a redundant alias. The PR base correctly tracks the unified cutover trunk issue_101. The plan body still mentions issue_101_pi because it was authored before the consolidation; not load-bearing.

Updated PR state

3 commits, 2 files changed, 4/4 parametrized regression cases pass locally, all 6 CI checks SUCCESS pre-push (re-running on the new commit). Ready to land.

@pollockjj pollockjj merged commit 7c6570c into issue_101 May 3, 2026
6 checks passed
@pollockjj pollockjj deleted the issue_187 branch May 3, 2026 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants