ltx2 model/pipeline review
Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423
Review performed against the repository review rules.
Duplicate search: checked GitHub Issues/PRs for ltx2, affected class/function names, and the specific failure modes below. I found related LTX2 work, including PRs/issues such as #12926, #13058, #13187, #13217, #13564, and #13572, but no duplicate for these specific findings.
Files/categories reviewed: public imports and lazy exports, model configs/serialization assumptions, dtype/device behavior, pipeline runtime behavior, audio/video consistency, offload-adjacent pipeline paths, and test coverage under tests/.
Issue 1: Video VAE compression ratios ignore downsample axes
Affected code (src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py, lines 1151-1160):
self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)
Problem:
AutoencoderKLLTX2Video derives both spatial and temporal compression ratios from sum(spatio_temporal_scaling), but the axis actually depends on downsample_type. A temporal-only stage incorrectly increases the spatial ratio, and a spatial-only stage incorrectly increases the temporal ratio.
Impact:
Pipelines use these ratios for latent sizing and validation. Custom configs or configs without explicit ratio overrides can report wrong latent geometry and allocate/validate the wrong shapes.
Reproduction:
import torch
from diffusers import AutoencoderKLLTX2Video
vae = AutoencoderKLLTX2Video(
in_channels=3,
out_channels=3,
latent_channels=4,
block_out_channels=(8,),
decoder_block_out_channels=(8,),
layers_per_block=(1,),
decoder_layers_per_block=(1, 1),
spatio_temporal_scaling=(True,),
decoder_spatio_temporal_scaling=(True,),
decoder_inject_noise=(False, False),
downsample_type=("temporal",),
upsample_residual=(False,),
upsample_factor=(1,),
patch_size=1,
patch_size_t=1,
encoder_spatial_padding_mode="zeros",
decoder_spatial_padding_mode="zeros",
)
x = torch.randn(1, 3, 5, 16, 16)
z = vae.encode(x).latent_dist.mode()
actual_spatial = x.shape[-1] // z.shape[-1]
actual_temporal = (x.shape[2] - 1) // (z.shape[2] - 1)
assert vae.spatial_compression_ratio == actual_spatial, (vae.spatial_compression_ratio, actual_spatial)
assert vae.temporal_compression_ratio == actual_temporal, (vae.temporal_compression_ratio, actual_temporal)
Relevant precedent:
The LTX2 pipelines already rely on these VAE ratios when deriving latent geometry (src/diffusers/pipelines/ltx2/pipeline_ltx2.py, lines 241-252):
self.vae_spatial_compression_ratio = (
    self.vae.spatial_compression_ratio if getattr(self, "vae", None) is not None else 32
)
self.vae_temporal_compression_ratio = (
    self.vae.temporal_compression_ratio if getattr(self, "vae", None) is not None else 8
)
# TODO: check whether the MEL compression ratio logic here is corrct
self.audio_vae_mel_compression_ratio = (
    self.audio_vae.mel_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
)
self.audio_vae_temporal_compression_ratio = (
    self.audio_vae.temporal_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
Suggested fix:
spatial_downsamples = sum(
scale and mode in {"spatial", "spatiotemporal"}
for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)
temporal_downsamples = sum(
scale and mode in {"temporal", "spatiotemporal"}
for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)
self.spatial_compression_ratio = (
patch_size * 2**spatial_downsamples if spatial_compression_ratio is None else spatial_compression_ratio
)
self.temporal_compression_ratio = (
patch_size_t * 2**temporal_downsamples if temporal_compression_ratio is None else temporal_compression_ratio
)
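The corrected counting can be sanity-checked in plain Python, independent of diffusers. The helper below is a hypothetical stand-in that mirrors the suggested fix, not the actual model code:

```python
def compression_ratios(spatio_temporal_scaling, downsample_type, patch_size=1, patch_size_t=1):
    # Count factor-of-2 downsamples per axis, gated by each stage's downsample_type.
    spatial = sum(
        scale and mode in {"spatial", "spatiotemporal"}
        for scale, mode in zip(spatio_temporal_scaling, downsample_type)
    )
    temporal = sum(
        scale and mode in {"temporal", "spatiotemporal"}
        for scale, mode in zip(spatio_temporal_scaling, downsample_type)
    )
    return patch_size * 2**spatial, patch_size_t * 2**temporal

# A temporal-only stage no longer inflates the spatial ratio:
print(compression_ratios((True,), ("temporal",)))  # (1, 2)
# whereas the current sum(spatio_temporal_scaling) logic would report (2, 2) here.
print(compression_ratios((True, True), ("spatial", "spatiotemporal")))  # (4, 2)
```

A "spatiotemporal" stage counts toward both axes, which is why the membership check uses a two-element set per axis.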
Issue 2: use_framewise_encoding is ignored by video VAE encode
Affected code (src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py, lines 1229-1230):
if self.use_framewise_decoding and num_frames > self.tile_sample_min_num_frames:
    return self._temporal_tiled_encode(x, causal=causal)
Problem:
encode() checks self.use_framewise_decoding instead of self.use_framewise_encoding. As a result, enabling framewise encoding has no effect, while enabling framewise decoding changes encode behavior.
Impact:
Users cannot independently control framewise/tiled encoding. This is especially risky for memory-sensitive video VAE encoding, where the encoding and decoding paths are expected to be separately configurable.
Reproduction:
import torch
from diffusers import AutoencoderKLLTX2Video
class Probe(AutoencoderKLLTX2Video):
def _temporal_tiled_encode(self, x, causal=None):
raise RuntimeError("temporal encode used")
model = Probe(
in_channels=3,
out_channels=3,
latent_channels=4,
block_out_channels=(8,),
decoder_block_out_channels=(8,),
layers_per_block=(1,),
decoder_layers_per_block=(1, 1),
spatio_temporal_scaling=(True,),
decoder_spatio_temporal_scaling=(True,),
decoder_inject_noise=(False, False),
downsample_type=("spatial",),
upsample_residual=(False,),
upsample_factor=(1,),
patch_size=1,
patch_size_t=1,
encoder_spatial_padding_mode="zeros",
decoder_spatial_padding_mode="zeros",
)
model.tile_sample_min_num_frames = 1
model.use_framewise_encoding = True
model.use_framewise_decoding = False
try:
model.encode(torch.randn(1, 3, 5, 16, 16))
except RuntimeError:
pass
else:
raise AssertionError("use_framewise_encoding=True did not enable temporal tiled encode")
Relevant precedent:
The decode path uses the matching decode flag correctly (src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py, lines 1278-1279):
if self.use_framewise_decoding and num_frames > tile_latent_min_num_frames:
    return self._temporal_tiled_decode(z, temb, causal=causal, return_dict=return_dict)
Suggested fix:
if self.use_framewise_encoding and num_frames > self.tile_sample_min_num_frames:
return self._temporal_tiled_encode(x, causal=causal)
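The expected independence of the two flags can be sketched without diffusers. The class below is a hypothetical stand-in for the VAE's flag handling, showing the gating the fix restores:

```python
class TilingSketch:
    # Hypothetical stand-in for the VAE's tiling flags; not the diffusers class.
    def __init__(self):
        self.use_framewise_encoding = False
        self.use_framewise_decoding = False
        self.tile_sample_min_num_frames = 1

    def encode_path(self, num_frames):
        # Correct gating: encode consults the *encoding* flag only.
        if self.use_framewise_encoding and num_frames > self.tile_sample_min_num_frames:
            return "temporal_tiled"
        return "full"

vae = TilingSketch()
vae.use_framewise_decoding = True  # toggling the decode flag must not affect encode
print(vae.encode_path(5))          # full
vae.use_framewise_encoding = True
print(vae.encode_path(5))          # temporal_tiled
```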
Issue 3: Audio VAE compression ratios are hardcoded
Affected code (src/diffusers/models/autoencoders/autoencoder_kl_ltx2_audio.py, lines 750-753):
# TODO: calculate programmatically instead of hardcoding
self.temporal_compression_ratio = LATENT_DOWNSAMPLE_FACTOR  # 4
# TODO: confirm whether the mel compression ratio below is correct
self.mel_compression_ratio = LATENT_DOWNSAMPLE_FACTOR
Problem:
AutoencoderKLLTX2Audio hardcodes both temporal and mel compression ratios to 4, even though the actual ratio depends on the number of downsampling levels implied by ch_mult.
Impact:
Small/custom audio VAE configs report incorrect latent geometry. The LTX2 pipelines use these ratios to prepare audio latents, so wrong config values can produce shape mismatches or incorrectly sized generated audio latents.
Reproduction:
import torch
from diffusers import AutoencoderKLLTX2Audio
for ch_mult in [(1,), (1, 2), (1, 2, 4)]:
vae = AutoencoderKLLTX2Audio(
base_channels=4,
output_channels=2,
ch_mult=ch_mult,
num_res_blocks=1,
attn_resolutions=None,
in_channels=2,
resolution=32,
latent_channels=2,
norm_type="pixel",
causality_axis="height",
dropout=0.0,
mid_block_add_attention=False,
sample_rate=16000,
mel_hop_length=160,
is_causal=True,
mel_bins=8,
)
x = torch.randn(1, 2, 8, 8)
z = vae.encode(x).latent_dist.mode()
actual = (x.shape[2] // z.shape[2], x.shape[3] // z.shape[3])
reported = (vae.temporal_compression_ratio, vae.mel_compression_ratio)
assert reported == actual, (ch_mult, reported, actual)
Relevant precedent:
The video VAE already exposes compression ratios as runtime config-derived values instead of constants (src/diffusers/models/autoencoders/autoencoder_kl_ltx2.py, lines 1151-1160):
self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)
Suggested fix:
compression_ratio = 2 ** (len(ch_mult) - 1)
self.temporal_compression_ratio = compression_ratio
self.mel_compression_ratio = compression_ratio
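Assuming the usual VAE encoder convention of one factor-of-2 downsample between consecutive resolution levels, the derived ratio tracks ch_mult. A plain-Python sketch of the suggested formula (the helper name is hypothetical):

```python
def audio_compression_ratio(ch_mult):
    # One downsample between consecutive resolution levels implies
    # len(ch_mult) - 1 halvings along each compressed axis.
    return 2 ** (len(ch_mult) - 1)

# Matches the expectations in the reproduction above:
for ch_mult, expected in [((1,), 1), ((1, 2), 2), ((1, 2, 4), 4)]:
    assert audio_compression_ratio(ch_mult) == expected
print([audio_compression_ratio(m) for m in [(1,), (1, 2), (1, 2, 4)]])  # [1, 2, 4]
```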
Issue 4: Explicit zero audio guidance values are overwritten
Affected code (the same four lines appear in src/diffusers/pipelines/ltx2/pipeline_ltx2.py lines 1005-1008, pipeline_ltx2_image2video.py lines 1068-1071, and pipeline_ltx2_condition.py lines 1180-1183):
audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale
Problem:
The pipelines use audio_* = audio_* or video_value fallback logic. This treats valid explicit values like 0.0 as missing. For example, audio_stg_scale=0.0 is replaced by stg_scale.
Impact:
Users cannot disable audio STG/rescale while keeping the corresponding video guidance enabled. This makes the public audio guidance API behave differently from its documented None default semantics.
Reproduction:
import inspect
from diffusers import LTX2Pipeline
source = inspect.getsource(LTX2Pipeline.__call__)
assert "audio_stg_scale = audio_stg_scale or stg_scale" in source
stg_scale = 0.5
audio_stg_scale = 0.0
audio_stg_scale = audio_stg_scale or stg_scale
assert audio_stg_scale == 0.0, audio_stg_scale
Relevant precedent:
Diffusers pipelines generally distinguish None from valid falsy numeric values when applying optional argument defaults.
Suggested fix:
audio_guidance_scale = guidance_scale if audio_guidance_scale is None else audio_guidance_scale
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
audio_modality_scale = modality_scale if audio_modality_scale is None else audio_modality_scale
audio_guidance_rescale = guidance_rescale if audio_guidance_rescale is None else audio_guidance_rescale
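The difference between the two fallback styles is easy to isolate in plain Python (the helper names below are hypothetical):

```python
def resolve_is_none(audio_value, video_value):
    # Suggested semantics: only an unset (None) audio value inherits the video value.
    return video_value if audio_value is None else audio_value

def resolve_or(audio_value, video_value):
    # Current semantics: any falsy value, including a deliberate 0.0, is replaced.
    return audio_value or video_value

assert resolve_is_none(0.0, 0.5) == 0.0   # explicit 0.0 survives
assert resolve_is_none(None, 0.5) == 0.5  # unset value falls back
assert resolve_or(0.0, 0.5) == 0.5        # bug: explicit 0.0 is overwritten
```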
Issue 5: Vocoder config validation raises the wrong exception
Affected code (src/diffusers/pipelines/ltx2/vocoder.py, lines 318-321):
if self.resnets_per_upsample != len(resnet_dilations):
    raise ValueError(
        f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
        f" {len(self.resnets_per_upsample)} and {len(resnet_dilations)}, respectively."
Problem:
The validation branch intends to raise ValueError when resnet_kernel_sizes and resnet_dilations lengths differ, but the error message calls len(self.resnets_per_upsample). self.resnets_per_upsample is an integer, so the branch raises TypeError before the intended validation error.
Impact:
Invalid vocoder configs fail with a misleading implementation error instead of an actionable configuration error.
Reproduction:
from diffusers.pipelines.ltx2.vocoder import LTX2Vocoder
try:
LTX2Vocoder(resnet_kernel_sizes=[3, 7], resnet_dilations=[[1, 3, 5]])
except Exception as error:
assert isinstance(error, ValueError), type(error).__name__
Relevant precedent:
The preceding validation branches in the same constructor raise direct ValueErrors for malformed config shapes (src/diffusers/pipelines/ltx2/vocoder.py, lines 306-316):
self.out_channels = out_channels
self.total_upsample_factor = math.prod(upsample_factors)
self.act_fn = act_fn
self.negative_slope = leaky_relu_negative_slope
self.final_act_fn = final_act_fn

if self.num_upsample_layers != len(upsample_factors):
    raise ValueError(
        f"`upsample_kernel_sizes` and `upsample_factors` should be lists of the same length but are length"
        f" {self.num_upsample_layers} and {len(upsample_factors)}, respectively."
    )
Suggested fix:
raise ValueError(
f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
f" {self.resnets_per_upsample} and {len(resnet_dilations)}, respectively."
)
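The masking behavior reproduces with any int: the f-string arguments are evaluated before the ValueError is ever constructed, so len() fails first. A standalone sketch:

```python
resnets_per_upsample = 3        # an int, as in the vocoder constructor
resnet_dilations = [[1, 3, 5]]

try:
    # Mimics the current message formatting: len() on an int raises TypeError
    # while building the message, before ValueError is raised.
    raise ValueError(
        f"lists of the same length but are length"
        f" {len(resnets_per_upsample)} and {len(resnet_dilations)}, respectively."
    )
except TypeError as error:
    print(type(error).__name__)  # TypeError
```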
Issue 6: LTX2 is missing slow tests and dedicated coverage for condition/latent-upsample pipelines
Affected code (tests/pipelines/ltx2/test_ltx2.py line 32; tests/pipelines/ltx2/test_ltx2_image2video.py line 35):
from ..test_pipelines_common import PipelineTesterMixin
Problem:
The LTX2 family has fast tests for text-to-video and image-to-video, plus model tests, but no dedicated fast test file for LTX2ConditionPipeline and no dedicated fast test file for LTX2LatentUpsamplePipeline. The current LTX2 test set also has no @slow tests.
Impact:
Real checkpoint integration, condition workflows, latent upsample behavior, audio generation/export, two-stage generation, offload paths, and release checkpoint compatibility are not covered by slow tests. Missing condition and latent-upsample fast tests also leave public pipelines exposed to regressions that would not be caught by the existing fast suite.
Reproduction:
from pathlib import Path
root = Path("tests")
ltx2_tests = sorted(root.glob("**/*ltx2*.py"))
slow_hits = [path for path in ltx2_tests if "@slow" in path.read_text(encoding="utf-8")]
assert Path("tests/pipelines/ltx2/test_ltx2_condition.py").exists()
assert Path("tests/pipelines/ltx2/test_ltx2_latent_upsample.py").exists()
assert slow_hits, [str(path) for path in ltx2_tests]
Relevant precedent:
LTX already has dedicated condition and latent upsample fast tests:
tests/pipelines/ltx/test_ltx_condition.py
tests/pipelines/ltx/test_ltx_latent_upsample.py
Suggested fix:
Add dedicated fast tests for LTX2ConditionPipeline and LTX2LatentUpsamplePipeline, modeled on the existing LTX tests but using LTX2 tiny components. Add slow tests for the current LTX2 checkpoint family covering at least LTX2Pipeline, LTX2ImageToVideoPipeline, LTX2ConditionPipeline, latent upsample/two-stage generation, audio output, and one CPU/GPU offload path.