
support LongCat-Video #1018

Merged
Artiprocher merged 1 commit into main from longcat
Oct 30, 2025
Conversation

@Artiprocher
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Artiprocher, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds full support for the LongCat-Video model to the video generation framework, enabling text-to-video, image-to-video, and video continuation. The changes add the model's core architecture, integrate it into the existing pipeline, and provide detailed examples for both inference and training.

Highlights

  • New Model Support: Introduced comprehensive support for the meituan-longcat/LongCat-Video model, expanding the framework's capabilities to include text-to-video, image-to-video, and video continuation.
  • Core Model Implementation: Added the LongCatVideoTransformer3DModel with its associated components, such as RMSNorm_FP32, RotaryPositionalEmbedding, and custom attention mechanisms, forming the backbone of the new video generation capabilities.
  • Pipeline Integration: Integrated the LongCat-Video model into the video pipeline, including a new processing unit (WanVideoUnit_LongCatVideo) and a dedicated model function (model_fn_longcat_video) to handle its specific inference logic.
  • VRAM Management Enhancements: Updated the VRAM management system to properly handle the new LayerNorm_FP32 and RMSNorm_FP32 layers introduced by the LongCat-Video model, ensuring efficient memory usage.
  • Documentation and Examples: Updated README.md and README_zh.md to reflect the new model support and provided new example scripts for LongCat-Video inference, full training, LoRA training, and validation.
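The PR excerpt above does not show the RMSNorm_FP32 implementation itself. As a rough sketch of the usual "upcast-to-FP32" normalization pattern (the class body, eps default, and dtype handling below are assumptions, not the actual LongCat-Video code), it typically looks like this:

```python
import torch
import torch.nn as nn

class RMSNormFP32(nn.Module):
    """Sketch of an FP32 RMSNorm: the statistics are accumulated in
    float32 for numerical stability, then cast back to the input dtype
    (e.g. bfloat16). Names and defaults are illustrative."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        x32 = x.float()  # upcast so mean/rsqrt run in fp32
        rms = torch.rsqrt(x32.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        out = x32 * rms * self.weight.float()
        return out.to(orig_dtype)  # restore the caller's dtype
```

The point of the upcast is that squaring and averaging bfloat16 activations directly can lose precision; doing the reduction in fp32 keeps the norm stable while the rest of the model stays in low precision.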
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Artiprocher merged commit a30ed90 into main on Oct 30, 2025
@gemini-code-assist (Contributor, Bot) left a comment


Code Review

This pull request adds support for the LongCat-Video model, including its architecture, pipeline integration, and example scripts for inference and training. The changes are extensive and well-structured. I've identified a few critical issues in the pipeline integration that could lead to incorrect behavior, particularly with batching and the model's output interpretation. There's also a medium-severity issue regarding unsafe tensor initialization in the model's implementation. After addressing these points, the PR should be in good shape.

    num_cond_latents = longcat_latents.shape[2]
else:
    num_cond_latents = 0
context = context.unsqueeze(0)

critical

The use of context.unsqueeze(0) appears to be incorrect and will likely cause issues with batch sizes greater than 1 (e.g., when using classifier-free guidance with cfg_merge=True). It reshapes the context tensor to [1, B, N, C], which will cause a dimension mismatch later in the model's forward pass. It should be context.unsqueeze(1) to correctly shape it to [B, 1, N, C] for batch processing.

Suggested change:
- context = context.unsqueeze(0)
+ context = context.unsqueeze(1)
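To make the shape mismatch concrete, here is a quick check (the batch, token, and channel sizes below are made up for illustration):

```python
import torch

# Hypothetical shapes: B=2 prompts, N=77 tokens, C=16 channels.
context = torch.randn(2, 77, 16)

as_batch_first = context.unsqueeze(1)  # [B, 1, N, C] -- the proposed fix
as_leading_one = context.unsqueeze(0)  # [1, B, N, C] -- the original code

print(as_batch_first.shape)  # torch.Size([2, 1, 77, 16])
print(as_leading_one.shape)  # torch.Size([1, 2, 77, 16])
```

With B=1 the two results happen to coincide in total layout, which is why the bug would only surface once the batch dimension exceeds 1 (e.g. under cfg_merge=True).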

    use_gradient_checkpointing=use_gradient_checkpointing,
    use_gradient_checkpointing_offload=use_gradient_checkpointing_offload,
)
output = -output

critical

Negating the model's output with output = -output is highly unusual and potentially a critical bug. Standard diffusion models are trained to predict noise or velocity, and negating the output would reverse the diffusion process, leading to incorrect results. If this model was specifically trained to predict the negative of the target, this should be clearly documented with a comment. Otherwise, this line should be removed.

Suggested change:
- output = -output
+ # output = -output
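A toy illustration of why an unexplained sign flip matters: in a velocity-prediction sampler, negating the model output moves the sample in the opposite direction at every step. (The Euler update and values below are illustrative, not LongCat-Video's actual scheduler.)

```python
import torch

# Toy Euler step for a velocity model: x_next = x - dt * v.
x = torch.ones(4)
v = torch.full((4,), 0.5)  # hypothetical predicted velocity
dt = 0.1

step_with_v = x - dt * v         # tensor of 0.95 -- denoising direction
step_with_neg_v = x - dt * (-v)  # tensor of 1.05 -- walks the sampler backwards
```

So the negation is only correct if the network was deliberately trained to predict the negative of the target, which is exactly the documentation the reviewer is asking for.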

if num_cond_latents is not None and num_cond_latents > 0:
    k_full = torch.cat([k_cache, k], dim=2).contiguous()
    v_full = torch.cat([v_cache, v], dim=2).contiguous()
    q_padding = torch.cat([torch.empty_like(k_cache), q], dim=2).contiguous()

medium

Using torch.empty_like is unsafe as it creates a tensor with uninitialized memory, which can lead to non-deterministic behavior or NaNs. Please use torch.zeros_like instead for safe and deterministic initialization.

Suggested change:
- q_padding = torch.cat([torch.empty_like(k_cache), q], dim=2).contiguous()
+ q_padding = torch.cat([torch.zeros_like(k_cache), q], dim=2).contiguous()
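A small demonstration of the difference (the tensor shape is hypothetical):

```python
import torch

k_cache = torch.randn(1, 8, 16, 64)  # hypothetical cached key tensor

uninit = torch.empty_like(k_cache)   # allocated but NOT initialized:
                                     # contents are whatever was in memory
zeroed = torch.zeros_like(k_cache)   # deterministic: all zeros

assert torch.all(zeroed == 0)
# `uninit` may hold arbitrary leftover values (possibly NaN/Inf), so any
# computation that reads it -- even padding intended to be masked out
# later -- can behave non-deterministically across runs and devices.
```

zeros_like costs one extra memset compared to empty_like, which is negligible next to the attention itself, so the deterministic version is the safe default here.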

@Artiprocher Artiprocher deleted the longcat branch November 4, 2025 03:01
LePao1 pushed a commit to LePao1/DiffSynth-Studio that referenced this pull request Feb 22, 2026