New txt2vid project:
Step-Video-T2V is a state-of-the-art (SoTA) pre-trained text-to-video model with 30 billion parameters that can generate videos of up to 204 frames. To improve both training and inference efficiency, the authors propose a deep-compression VAE for videos that achieves 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further improve the visual quality of the generated videos. The model is evaluated on a new video generation benchmark, Step-Video-T2V-Eval, where it demonstrates SoTA text-to-video quality compared to both open-source and commercial engines.
https://github.com/stepfun-ai/Step-Video-T2V
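To get a feel for what the quoted 16x16 spatial and 8x temporal ratios mean, here is a rough back-of-the-envelope sketch. The helper names are hypothetical, and the ceil-division is only an assumption: the actual VAE may pad, chunk, or treat boundary frames differently, so the real latent shape could differ.

```python
import math

# Compression ratios quoted in the description above.
SPATIAL_RATIO = 16
TEMPORAL_RATIO = 8


def nominal_latent_grid(height, width, frames):
    """Hypothetical helper: ceil-divide each axis by its compression ratio.

    Assumption: plain ceil-division. The real VAE may handle padding
    and boundary frames differently.
    """
    return (
        math.ceil(frames / TEMPORAL_RATIO),
        math.ceil(height / SPATIAL_RATIO),
        math.ceil(width / SPATIAL_RATIO),
    )


def position_reduction():
    """Nominal reduction in the number of spatiotemporal positions."""
    return SPATIAL_RATIO * SPATIAL_RATIO * TEMPORAL_RATIO


print(nominal_latent_grid(544, 992, 204))  # (26, 34, 62)
print(position_reduction())                # 2048
```

So, nominally, the latent grid has about 2048x fewer spatiotemporal positions than the pixel grid, which is where the efficiency gain would come from.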
| Model | Height × width × frames | Peak GPU memory | 50 steps with flash-attn | 50 steps without flash-attn |
|---|---|---|---|---|
| Step-Video-T2V | 544px × 992px × 204f | 77.64 GB | 743 s | 1232 s |
| Step-Video-T2V | 544px × 992px × 136f | 72.48 GB | 408 s | 605 s |
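
For anyone comparing the two configurations, the timings in the table can be turned into a speedup factor and a per-step cost. This is just arithmetic on the numbers above; `summarize` is an illustrative helper, not part of the project.

```python
# (frames, seconds with flash-attn, seconds without), copied from the table.
RUNS = [(204, 743, 1232), (136, 408, 605)]
STEPS = 50  # number of inference steps reported in the table


def summarize(frames, with_fa, without_fa):
    """Illustrative helper: derive speedup and per-step time for one row."""
    return {
        "frames": frames,
        "speedup": without_fa / with_fa,
        "s_per_step_with": with_fa / STEPS,
        "s_per_step_without": without_fa / STEPS,
    }


for run in RUNS:
    s = summarize(*run)
    print(
        f"{s['frames']}f: {s['speedup']:.2f}x faster, "
        f"{s['s_per_step_with']:.2f} s/step with flash-attn"
    )
# 204f: 1.66x faster, 14.86 s/step with flash-attn
# 136f: 1.48x faster, 8.16 s/step with flash-attn
```

On these numbers, flash-attn helps more at the longer 204-frame setting, while peak memory differs by only about 5 GB between the two.
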

| Model | 🤗 Hugging Face | 🤖 ModelScope |
|---|---|---|
| Step-Video-T2V | download | download |
| Step-Video-T2V-Turbo (Inference Step Distillation) | download | download |