
ltx2 model/pipeline review #13601

@hlky

Description

Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

Review performed against the repository review rules.

Duplicate search: checked GitHub Issues/PRs for ltx2, affected class/function names, and the specific failure modes below. I found related LTX2 work, including PRs/issues such as #12926, #13058, #13187, #13217, #13564, and #13572, but no duplicate for these specific findings.

Files/categories reviewed: public imports and lazy exports, model configs/serialization assumptions, dtype/device behavior, pipeline runtime behavior, audio/video consistency, offload-adjacent pipeline paths, and test coverage under tests/.

Issue 1: Video VAE compression ratios ignore downsample axes

Affected code:

self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)

Problem:
AutoencoderKLLTX2Video derives both spatial and temporal compression ratios from sum(spatio_temporal_scaling), but which axis each stage actually scales depends on downsample_type. A temporal-only stage incorrectly inflates the spatial ratio, and a spatial-only stage incorrectly inflates the temporal ratio.

Impact:
Pipelines use these ratios for latent sizing and validation. Custom configs or configs without explicit ratio overrides can report wrong latent geometry and allocate/validate the wrong shapes.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Video

vae = AutoencoderKLLTX2Video(
    in_channels=3,
    out_channels=3,
    latent_channels=4,
    block_out_channels=(8,),
    decoder_block_out_channels=(8,),
    layers_per_block=(1,),
    decoder_layers_per_block=(1, 1),
    spatio_temporal_scaling=(True,),
    decoder_spatio_temporal_scaling=(True,),
    decoder_inject_noise=(False, False),
    downsample_type=("temporal",),
    upsample_residual=(False,),
    upsample_factor=(1,),
    patch_size=1,
    patch_size_t=1,
    encoder_spatial_padding_mode="zeros",
    decoder_spatial_padding_mode="zeros",
)

x = torch.randn(1, 3, 5, 16, 16)
z = vae.encode(x).latent_dist.mode()

actual_spatial = x.shape[-1] // z.shape[-1]
actual_temporal = (x.shape[2] - 1) // (z.shape[2] - 1)

assert vae.spatial_compression_ratio == actual_spatial, (vae.spatial_compression_ratio, actual_spatial)
assert vae.temporal_compression_ratio == actual_temporal, (vae.temporal_compression_ratio, actual_temporal)

Relevant precedent:
The LTX2 pipelines already rely on these VAE ratios when deriving latent geometry:

self.vae_spatial_compression_ratio = (
    self.vae.spatial_compression_ratio if getattr(self, "vae", None) is not None else 32
)
self.vae_temporal_compression_ratio = (
    self.vae.temporal_compression_ratio if getattr(self, "vae", None) is not None else 8
)
# TODO: check whether the MEL compression ratio logic here is corrct
self.audio_vae_mel_compression_ratio = (
    self.audio_vae.mel_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
)
self.audio_vae_temporal_compression_ratio = (
    self.audio_vae.temporal_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
)
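
To make the downstream effect concrete, here is a minimal sketch of how such ratios typically turn into latent geometry; the helper and its sizing formulas are illustrative assumptions, not the pipelines' verbatim code:

def latent_shape(num_frames, height, width, spatial_ratio, temporal_ratio):
    # Common video-VAE sizing pattern: causal temporal axis, integer spatial division.
    latent_num_frames = (num_frames - 1) // temporal_ratio + 1
    return latent_num_frames, height // spatial_ratio, width // spatial_ratio

# The reproduction's config misreports the spatial ratio as 2 (actual: 1), so a 5x16x16
# clip would be allocated as 3x8x8 latents instead of 3x16x16.
print(latent_shape(5, 16, 16, spatial_ratio=2, temporal_ratio=2))  # (3, 8, 8)
print(latent_shape(5, 16, 16, spatial_ratio=1, temporal_ratio=2))  # (3, 16, 16)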

Suggested fix:

spatial_downsamples = sum(
    scale and mode in {"spatial", "spatiotemporal"}
    for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)
temporal_downsamples = sum(
    scale and mode in {"temporal", "spatiotemporal"}
    for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)

self.spatial_compression_ratio = (
    patch_size * 2**spatial_downsamples if spatial_compression_ratio is None else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2**temporal_downsamples if temporal_compression_ratio is None else temporal_compression_ratio
)

Issue 2: use_framewise_encoding is ignored by video VAE encode

Affected code:

if self.use_framewise_decoding and num_frames > self.tile_sample_min_num_frames:
    return self._temporal_tiled_encode(x, causal=causal)

Problem:
encode() checks self.use_framewise_decoding instead of self.use_framewise_encoding. As a result, enabling framewise encoding has no effect, while enabling framewise decoding changes encode behavior.

Impact:
Users cannot independently control framewise/tiled encoding. This is especially risky for memory-sensitive video VAE encoding, where the encoding and decoding paths are expected to be separately configurable.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Video

class Probe(AutoencoderKLLTX2Video):
    def _temporal_tiled_encode(self, x, causal=None):
        raise RuntimeError("temporal encode used")

model = Probe(
    in_channels=3,
    out_channels=3,
    latent_channels=4,
    block_out_channels=(8,),
    decoder_block_out_channels=(8,),
    layers_per_block=(1,),
    decoder_layers_per_block=(1, 1),
    spatio_temporal_scaling=(True,),
    decoder_spatio_temporal_scaling=(True,),
    decoder_inject_noise=(False, False),
    downsample_type=("spatial",),
    upsample_residual=(False,),
    upsample_factor=(1,),
    patch_size=1,
    patch_size_t=1,
    encoder_spatial_padding_mode="zeros",
    decoder_spatial_padding_mode="zeros",
)

model.tile_sample_min_num_frames = 1
model.use_framewise_encoding = True
model.use_framewise_decoding = False

try:
    model.encode(torch.randn(1, 3, 5, 16, 16))
except RuntimeError:
    pass
else:
    raise AssertionError("use_framewise_encoding=True did not enable temporal tiled encode")

Relevant precedent:
The decode path uses the matching decode flag correctly:

if self.use_framewise_decoding and num_frames > tile_latent_min_num_frames:
    return self._temporal_tiled_decode(z, temb, causal=causal, return_dict=return_dict)

Suggested fix:

if self.use_framewise_encoding and num_frames > self.tile_sample_min_num_frames:
    return self._temporal_tiled_encode(x, causal=causal)

Issue 3: Audio VAE compression ratios are hardcoded

Affected code:

# TODO: calculate programmatically instead of hardcoding
self.temporal_compression_ratio = LATENT_DOWNSAMPLE_FACTOR # 4
# TODO: confirm whether the mel compression ratio below is correct
self.mel_compression_ratio = LATENT_DOWNSAMPLE_FACTOR

Problem:
AutoencoderKLLTX2Audio hardcodes both temporal and mel compression ratios to 4, even though the actual ratio depends on the number of downsampling levels implied by ch_mult.

Impact:
Small/custom audio VAE configs report incorrect latent geometry. The LTX2 pipelines use these ratios to prepare audio latents, so wrong config values can produce shape mismatches or incorrectly sized generated audio latents.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Audio

for ch_mult in [(1,), (1, 2), (1, 2, 4)]:
    vae = AutoencoderKLLTX2Audio(
        base_channels=4,
        output_channels=2,
        ch_mult=ch_mult,
        num_res_blocks=1,
        attn_resolutions=None,
        in_channels=2,
        resolution=32,
        latent_channels=2,
        norm_type="pixel",
        causality_axis="height",
        dropout=0.0,
        mid_block_add_attention=False,
        sample_rate=16000,
        mel_hop_length=160,
        is_causal=True,
        mel_bins=8,
    )

    x = torch.randn(1, 2, 8, 8)
    z = vae.encode(x).latent_dist.mode()
    actual = (x.shape[2] // z.shape[2], x.shape[3] // z.shape[3])
    reported = (vae.temporal_compression_ratio, vae.mel_compression_ratio)

    assert reported == actual, (ch_mult, reported, actual)

Relevant precedent:
The video VAE already derives its compression ratios from the config at construction time instead of hardcoding them:

self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)

Suggested fix:

compression_ratio = 2 ** (len(ch_mult) - 1)
self.temporal_compression_ratio = compression_ratio
self.mel_compression_ratio = compression_ratio
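
For illustration, this maps each downsampling level implied by ch_mult to one factor of 2, which is what the reproduction above measures:

for ch_mult in [(1,), (1, 2), (1, 2, 4)]:
    print(ch_mult, 2 ** (len(ch_mult) - 1))  # -> 1, 2, and 4 respectively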

Issue 4: Explicit zero audio guidance values are overwritten

Affected code:

audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale

(The same block is repeated verbatim in each of the affected LTX2 pipelines.)

Problem:
The pipelines use audio_* = audio_* or video_value fallback logic. This treats valid explicit values like 0.0 as missing. For example, audio_stg_scale=0.0 is replaced by stg_scale.

Impact:
Users cannot disable audio STG/rescale while keeping the corresponding video guidance enabled. This makes the public audio guidance API behave differently from its documented None default semantics.

Reproduction:

import inspect
from diffusers import LTX2Pipeline

source = inspect.getsource(LTX2Pipeline.__call__)
assert "audio_stg_scale = audio_stg_scale or stg_scale" in source

stg_scale = 0.5
audio_stg_scale = 0.0

audio_stg_scale = audio_stg_scale or stg_scale

assert audio_stg_scale == 0.0, audio_stg_scale

Relevant precedent:
Diffusers pipelines generally distinguish None from valid falsy numeric values when applying optional argument defaults.

Suggested fix:

audio_guidance_scale = guidance_scale if audio_guidance_scale is None else audio_guidance_scale
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
audio_modality_scale = modality_scale if audio_modality_scale is None else audio_modality_scale
audio_guidance_rescale = guidance_rescale if audio_guidance_rescale is None else audio_guidance_rescale
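
A quick check of the replacement pattern with the reproduction's values: an explicit 0.0 is preserved, while None still falls back to the video-side value.

stg_scale = 0.5

audio_stg_scale = 0.0
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
assert audio_stg_scale == 0.0  # explicit 0.0 disables audio STG as intended

audio_stg_scale = None
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
assert audio_stg_scale == 0.5  # None keeps the documented fallback behavior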

Issue 5: Vocoder config validation raises the wrong exception

Affected code:

if self.resnets_per_upsample != len(resnet_dilations):
    raise ValueError(
        f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
        f" {len(self.resnets_per_upsample)} and {len(resnet_dilations)}, respectively."
    )

Problem:
The validation branch intends to raise ValueError when resnet_kernel_sizes and resnet_dilations lengths differ, but the error message calls len(self.resnets_per_upsample). self.resnets_per_upsample is an integer, so the branch raises TypeError before the intended validation error.

Impact:
Invalid vocoder configs fail with a misleading implementation error instead of an actionable configuration error.

Reproduction:

from diffusers.pipelines.ltx2.vocoder import LTX2Vocoder

try:
    LTX2Vocoder(resnet_kernel_sizes=[3, 7], resnet_dilations=[[1, 3, 5]])
except Exception as error:
    assert isinstance(error, ValueError), type(error).__name__
else:
    raise AssertionError("expected a ValueError for mismatched config lengths")

Relevant precedent:
The preceding validation branches in the same constructor raise direct ValueErrors for malformed config shapes:

self.out_channels = out_channels
self.total_upsample_factor = math.prod(upsample_factors)
self.act_fn = act_fn
self.negative_slope = leaky_relu_negative_slope
self.final_act_fn = final_act_fn
if self.num_upsample_layers != len(upsample_factors):
    raise ValueError(
        f"`upsample_kernel_sizes` and `upsample_factors` should be lists of the same length but are length"
        f" {self.num_upsample_layers} and {len(upsample_factors)}, respectively."
    )

Suggested fix:

raise ValueError(
    f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
    f" {self.resnets_per_upsample} and {len(resnet_dilations)}, respectively."
)
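
As a sanity check that the corrected message formats cleanly, with plain values standing in for the constructor's attributes (the class itself is not needed):

resnets_per_upsample = 2        # an int, standing in for the stored attribute
resnet_dilations = [[1, 3, 5]]  # mismatched length, as in the reproduction

message = (
    f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
    f" {resnets_per_upsample} and {len(resnet_dilations)}, respectively."
)
assert "2 and 1" in message  # the int is formatted directly, no len() TypeError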

Issue 6: LTX2 is missing slow tests and dedicated coverage for condition/latent-upsample pipelines

Affected code:

from ..test_pipelines_common import PipelineTesterMixin

Problem:
The LTX2 family has fast tests for text-to-video and image-to-video, plus model tests, but no dedicated fast test files for LTX2ConditionPipeline or LTX2LatentUpsamplePipeline. The current LTX2 test set also has no @slow tests.

Impact:
Real checkpoint integration, condition workflows, latent upsample behavior, audio generation/export, two-stage generation, offload paths, and release checkpoint compatibility are not covered by slow tests. Missing condition and latent-upsample fast tests also leave public pipelines exposed to regressions that would not be caught by the existing fast suite.

Reproduction:

from pathlib import Path

root = Path("tests")
ltx2_tests = sorted(root.glob("**/*ltx2*.py"))
slow_hits = [path for path in ltx2_tests if "@slow" in path.read_text(encoding="utf-8")]

assert Path("tests/pipelines/ltx2/test_ltx2_condition.py").exists()
assert Path("tests/pipelines/ltx2/test_ltx2_latent_upsample.py").exists()
assert slow_hits, [str(path) for path in ltx2_tests]

Relevant precedent:
LTX already has dedicated condition and latent upsample fast tests that can serve as templates for the LTX2 equivalents.

Suggested fix:
Add dedicated fast tests for LTX2ConditionPipeline and LTX2LatentUpsamplePipeline, modeled on the existing LTX tests but using LTX2 tiny components. Add slow tests for the current LTX2 checkpoint family covering at least LTX2Pipeline, LTX2ImageToVideoPipeline, LTX2ConditionPipeline, latent upsample/two-stage generation, audio output, and one CPU/GPU offload path.
