
ltx2 model/pipeline review #13601

@hlky

Description

Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

Review performed against the repository review rules.

Duplicate search: checked GitHub Issues/PRs for ltx2, affected class/function names, and the specific failure modes below. I found related LTX2 work, including PRs/issues such as #12926, #13058, #13187, #13217, #13564, and #13572, but no duplicate for these specific findings.

Files/categories reviewed: public imports and lazy exports, model configs/serialization assumptions, dtype/device behavior, pipeline runtime behavior, audio/video consistency, offload-adjacent pipeline paths, and test coverage under tests/.

Issue 1: Video VAE compression ratios ignore downsample axes

Affected code:

self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)

Problem:
AutoencoderKLLTX2Video derives both spatial and temporal compression ratios from sum(spatio_temporal_scaling), but which axis each stage actually scales depends on downsample_type. A temporal-only stage incorrectly inflates the spatial ratio, and a spatial-only stage incorrectly inflates the temporal ratio.

Impact:
Pipelines use these ratios for latent sizing and validation. Custom configs or configs without explicit ratio overrides can report wrong latent geometry and allocate/validate the wrong shapes.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Video

vae = AutoencoderKLLTX2Video(
    in_channels=3,
    out_channels=3,
    latent_channels=4,
    block_out_channels=(8,),
    decoder_block_out_channels=(8,),
    layers_per_block=(1,),
    decoder_layers_per_block=(1, 1),
    spatio_temporal_scaling=(True,),
    decoder_spatio_temporal_scaling=(True,),
    decoder_inject_noise=(False, False),
    downsample_type=("temporal",),
    upsample_residual=(False,),
    upsample_factor=(1,),
    patch_size=1,
    patch_size_t=1,
    encoder_spatial_padding_mode="zeros",
    decoder_spatial_padding_mode="zeros",
)

x = torch.randn(1, 3, 5, 16, 16)
z = vae.encode(x).latent_dist.mode()

actual_spatial = x.shape[-1] // z.shape[-1]
actual_temporal = (x.shape[2] - 1) // (z.shape[2] - 1)

assert vae.spatial_compression_ratio == actual_spatial, (vae.spatial_compression_ratio, actual_spatial)
assert vae.temporal_compression_ratio == actual_temporal, (vae.temporal_compression_ratio, actual_temporal)

Relevant precedent:
The LTX2 pipelines already rely on these VAE ratios when deriving latent geometry:

self.vae_spatial_compression_ratio = (
    self.vae.spatial_compression_ratio if getattr(self, "vae", None) is not None else 32
)
self.vae_temporal_compression_ratio = (
    self.vae.temporal_compression_ratio if getattr(self, "vae", None) is not None else 8
)
# TODO: check whether the MEL compression ratio logic here is corrct
self.audio_vae_mel_compression_ratio = (
    self.audio_vae.mel_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
)
self.audio_vae_temporal_compression_ratio = (
    self.audio_vae.temporal_compression_ratio if getattr(self, "audio_vae", None) is not None else 4
)
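
To make the downstream effect concrete, here is a minimal sketch of how such ratios typically turn into latent geometry; the helper and its sizing formulas are illustrative assumptions, not the pipelines' verbatim code:

def latent_shape(num_frames, height, width, spatial_ratio, temporal_ratio):
    # Common video-VAE sizing pattern: causal temporal axis, integer spatial division.
    latent_num_frames = (num_frames - 1) // temporal_ratio + 1
    return latent_num_frames, height // spatial_ratio, width // spatial_ratio

# The reproduction's config misreports the spatial ratio as 2 (actual: 1), so a 5x16x16
# clip would be allocated as 3x8x8 latents instead of 3x16x16.
print(latent_shape(5, 16, 16, spatial_ratio=2, temporal_ratio=2))  # (3, 8, 8)
print(latent_shape(5, 16, 16, spatial_ratio=1, temporal_ratio=2))  # (3, 16, 16)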

Suggested fix:

spatial_downsamples = sum(
    scale and mode in {"spatial", "spatiotemporal"}
    for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)
temporal_downsamples = sum(
    scale and mode in {"temporal", "spatiotemporal"}
    for scale, mode in zip(spatio_temporal_scaling, downsample_type)
)

self.spatial_compression_ratio = (
    patch_size * 2**spatial_downsamples if spatial_compression_ratio is None else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2**temporal_downsamples if temporal_compression_ratio is None else temporal_compression_ratio
)

Issue 2: use_framewise_encoding is ignored by video VAE encode

Affected code:

if self.use_framewise_decoding and num_frames > self.tile_sample_min_num_frames:
    return self._temporal_tiled_encode(x, causal=causal)

Problem:
encode() checks self.use_framewise_decoding instead of self.use_framewise_encoding. As a result, enabling framewise encoding has no effect, while enabling framewise decoding changes encode behavior.

Impact:
Users cannot independently control framewise/tiled encoding. This is especially risky for memory-sensitive video VAE encoding, where the encoding and decoding paths are expected to be separately configurable.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Video

class Probe(AutoencoderKLLTX2Video):
    def _temporal_tiled_encode(self, x, causal=None):
        raise RuntimeError("temporal encode used")

model = Probe(
    in_channels=3,
    out_channels=3,
    latent_channels=4,
    block_out_channels=(8,),
    decoder_block_out_channels=(8,),
    layers_per_block=(1,),
    decoder_layers_per_block=(1, 1),
    spatio_temporal_scaling=(True,),
    decoder_spatio_temporal_scaling=(True,),
    decoder_inject_noise=(False, False),
    downsample_type=("spatial",),
    upsample_residual=(False,),
    upsample_factor=(1,),
    patch_size=1,
    patch_size_t=1,
    encoder_spatial_padding_mode="zeros",
    decoder_spatial_padding_mode="zeros",
)

model.tile_sample_min_num_frames = 1
model.use_framewise_encoding = True
model.use_framewise_decoding = False

try:
    model.encode(torch.randn(1, 3, 5, 16, 16))
except RuntimeError:
    pass
else:
    raise AssertionError("use_framewise_encoding=True did not enable temporal tiled encode")

Relevant precedent:
The decode path uses the matching decode flag correctly:

if self.use_framewise_decoding and num_frames > tile_latent_min_num_frames:
    return self._temporal_tiled_decode(z, temb, causal=causal, return_dict=return_dict)

Suggested fix:

if self.use_framewise_encoding and num_frames > self.tile_sample_min_num_frames:
    return self._temporal_tiled_encode(x, causal=causal)

Issue 3: Audio VAE compression ratios are hardcoded

Affected code:

# TODO: calculate programmatically instead of hardcoding
self.temporal_compression_ratio = LATENT_DOWNSAMPLE_FACTOR # 4
# TODO: confirm whether the mel compression ratio below is correct
self.mel_compression_ratio = LATENT_DOWNSAMPLE_FACTOR

Problem:
AutoencoderKLLTX2Audio hardcodes both temporal and mel compression ratios to 4, even though the actual ratio depends on the number of downsampling levels implied by ch_mult.

Impact:
Small/custom audio VAE configs report incorrect latent geometry. The LTX2 pipelines use these ratios to prepare audio latents, so wrong config values can produce shape mismatches or incorrectly sized generated audio latents.

Reproduction:

import torch
from diffusers import AutoencoderKLLTX2Audio

for ch_mult in [(1,), (1, 2), (1, 2, 4)]:
    vae = AutoencoderKLLTX2Audio(
        base_channels=4,
        output_channels=2,
        ch_mult=ch_mult,
        num_res_blocks=1,
        attn_resolutions=None,
        in_channels=2,
        resolution=32,
        latent_channels=2,
        norm_type="pixel",
        causality_axis="height",
        dropout=0.0,
        mid_block_add_attention=False,
        sample_rate=16000,
        mel_hop_length=160,
        is_causal=True,
        mel_bins=8,
    )

    x = torch.randn(1, 2, 8, 8)
    z = vae.encode(x).latent_dist.mode()
    actual = (x.shape[2] // z.shape[2], x.shape[3] // z.shape[3])
    reported = (vae.temporal_compression_ratio, vae.mel_compression_ratio)

    assert reported == actual, (ch_mult, reported, actual)

Relevant precedent:
The video VAE already derives its compression ratios from the config at construction time instead of hardcoding them:

self.spatial_compression_ratio = (
    patch_size * 2 ** sum(spatio_temporal_scaling)
    if spatial_compression_ratio is None
    else spatial_compression_ratio
)
self.temporal_compression_ratio = (
    patch_size_t * 2 ** sum(spatio_temporal_scaling)
    if temporal_compression_ratio is None
    else temporal_compression_ratio
)

Suggested fix:

compression_ratio = 2 ** (len(ch_mult) - 1)
self.temporal_compression_ratio = compression_ratio
self.mel_compression_ratio = compression_ratio
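
For illustration, this maps each downsampling level implied by ch_mult to one factor of 2, which is what the reproduction above measures:

for ch_mult in [(1,), (1, 2), (1, 2, 4)]:
    print(ch_mult, 2 ** (len(ch_mult) - 1))  # -> 1, 2, and 4 respectively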

Issue 4: Explicit zero audio guidance values are overwritten

Affected code:

audio_guidance_scale = audio_guidance_scale or guidance_scale
audio_stg_scale = audio_stg_scale or stg_scale
audio_modality_scale = audio_modality_scale or modality_scale
audio_guidance_rescale = audio_guidance_rescale or guidance_rescale

(The same block is repeated verbatim in each of the affected LTX2 pipelines.)

Problem:
The pipelines use audio_* = audio_* or video_value fallback logic. This treats valid explicit values like 0.0 as missing. For example, audio_stg_scale=0.0 is replaced by stg_scale.

Impact:
Users cannot disable audio STG/rescale while keeping the corresponding video guidance enabled. This makes the public audio guidance API behave differently from its documented None default semantics.

Reproduction:

import inspect
from diffusers import LTX2Pipeline

source = inspect.getsource(LTX2Pipeline.__call__)
assert "audio_stg_scale = audio_stg_scale or stg_scale" in source

stg_scale = 0.5
audio_stg_scale = 0.0

audio_stg_scale = audio_stg_scale or stg_scale

assert audio_stg_scale == 0.0, audio_stg_scale

Relevant precedent:
Diffusers pipelines generally distinguish None from valid falsy numeric values when applying optional argument defaults.

Suggested fix:

audio_guidance_scale = guidance_scale if audio_guidance_scale is None else audio_guidance_scale
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
audio_modality_scale = modality_scale if audio_modality_scale is None else audio_modality_scale
audio_guidance_rescale = guidance_rescale if audio_guidance_rescale is None else audio_guidance_rescale
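
A quick check of the replacement pattern with the reproduction's values: an explicit 0.0 is preserved, while None still falls back to the video-side value.

stg_scale = 0.5

audio_stg_scale = 0.0
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
assert audio_stg_scale == 0.0  # explicit 0.0 disables audio STG as intended

audio_stg_scale = None
audio_stg_scale = stg_scale if audio_stg_scale is None else audio_stg_scale
assert audio_stg_scale == 0.5  # None keeps the documented fallback behavior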

Issue 5: Vocoder config validation raises the wrong exception

Affected code:

if self.resnets_per_upsample != len(resnet_dilations):
    raise ValueError(
        f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
        f" {len(self.resnets_per_upsample)} and {len(resnet_dilations)}, respectively."
    )

Problem:
The validation branch intends to raise ValueError when resnet_kernel_sizes and resnet_dilations lengths differ, but the error message calls len(self.resnets_per_upsample). self.resnets_per_upsample is an integer, so the branch raises TypeError before the intended validation error.

Impact:
Invalid vocoder configs fail with a misleading implementation error instead of an actionable configuration error.

Reproduction:

from diffusers.pipelines.ltx2.vocoder import LTX2Vocoder

try:
    LTX2Vocoder(resnet_kernel_sizes=[3, 7], resnet_dilations=[[1, 3, 5]])
except Exception as error:
    assert isinstance(error, ValueError), type(error).__name__
else:
    raise AssertionError("expected a ValueError for mismatched config lengths")

Relevant precedent:
The preceding validation branches in the same constructor raise direct ValueErrors for malformed config shapes:

self.out_channels = out_channels
self.total_upsample_factor = math.prod(upsample_factors)
self.act_fn = act_fn
self.negative_slope = leaky_relu_negative_slope
self.final_act_fn = final_act_fn
if self.num_upsample_layers != len(upsample_factors):
    raise ValueError(
        f"`upsample_kernel_sizes` and `upsample_factors` should be lists of the same length but are length"
        f" {self.num_upsample_layers} and {len(upsample_factors)}, respectively."
    )

Suggested fix:

raise ValueError(
    f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
    f" {self.resnets_per_upsample} and {len(resnet_dilations)}, respectively."
)
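
As a sanity check that the corrected message formats cleanly, with plain values standing in for the constructor's attributes (the class itself is not needed):

resnets_per_upsample = 2        # an int, standing in for the stored attribute
resnet_dilations = [[1, 3, 5]]  # mismatched length, as in the reproduction

message = (
    f"`resnet_kernel_sizes` and `resnet_dilations` should be lists of the same length but are length"
    f" {resnets_per_upsample} and {len(resnet_dilations)}, respectively."
)
assert "2 and 1" in message  # the int is formatted directly, no len() TypeError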

Issue 6: LTX2 is missing slow tests and dedicated coverage for condition/latent-upsample pipelines

Affected code:

from ..test_pipelines_common import PipelineTesterMixin

Problem:
The LTX2 family has fast tests for text-to-video and image-to-video, plus model tests, but no dedicated fast test files for LTX2ConditionPipeline or LTX2LatentUpsamplePipeline. The current LTX2 test set also has no @slow tests.

Impact:
Real checkpoint integration, condition workflows, latent upsample behavior, audio generation/export, two-stage generation, offload paths, and release checkpoint compatibility are not covered by slow tests. Missing condition and latent-upsample fast tests also leave public pipelines exposed to regressions that would not be caught by the existing fast suite.

Reproduction:

from pathlib import Path

root = Path("tests")
ltx2_tests = sorted(root.glob("**/*ltx2*.py"))
slow_hits = [path for path in ltx2_tests if "@slow" in path.read_text(encoding="utf-8")]

assert Path("tests/pipelines/ltx2/test_ltx2_condition.py").exists()
assert Path("tests/pipelines/ltx2/test_ltx2_latent_upsample.py").exists()
assert slow_hits, [str(path) for path in ltx2_tests]

Relevant precedent:
LTX already has dedicated condition and latent upsample fast tests that can serve as templates for the LTX2 equivalents.

Suggested fix:
Add dedicated fast tests for LTX2ConditionPipeline and LTX2LatentUpsamplePipeline, modeled on the existing LTX tests but using LTX2 tiny components. Add slow tests for the current LTX2 checkpoint family covering at least LTX2Pipeline, LTX2ImageToVideoPipeline, LTX2ConditionPipeline, latent upsample/two-stage generation, audio output, and one CPU/GPU offload path.
