From d4f6ad7ea3cdc6e955df28a77e1d82828d0f9cfd Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Thu, 23 Mar 2023 08:37:16 +0530
Subject: [PATCH 1/3] small fixes to the text to video doc.

---
 docs/source/en/api/pipelines/text_to_video.mdx | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/docs/source/en/api/pipelines/text_to_video.mdx b/docs/source/en/api/pipelines/text_to_video.mdx
index f1fe794e1537..aa34583783b8 100644
--- a/docs/source/en/api/pipelines/text_to_video.mdx
+++ b/docs/source/en/api/pipelines/text_to_video.mdx
@@ -12,23 +12,25 @@ specific language governing permissions and limitations under the License.
 
 # Text-to-video synthesis
 
-Text-to-video synthesis from [ModelScope](https://modelscope.cn/) can be considered the same as Stable Diffusion structure-wise but it is extended to videos instead of static images. More specifically, this system allows us to generate videos from a natural language text prompt.
+## Overview
 
-From the [model summary](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis):
+[VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation](https://arxiv.org/abs/2303.08320) by Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan.
 
-*This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.*
+The abstract of the paper is the following:
+
+*A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distribution. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging due to high-dimensional data spaces. Previous methods usually adopt a standard diffusion process, where frames in the same video clip are destroyed with independent noises, ignoring the content redundancy and temporal correlation. This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly-learned networks to match the noise decomposition accordingly. Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and well-support text-conditioned video creation.*
 
 Resources:
 
 * [Website](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
 * [GitHub repository](https://github.com/modelscope/modelscope/)
-* [Spaces] (TODO)
+* [Colab Notebook](https://colab.research.google.com/drive/13SmCwNyl2Fjjmmit6pX7IDUlYP9WIstL)
 
 ## Available Pipelines:
 
 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [DiffusionPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [Spaces] (TODO)
+| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [Colab Notebook](https://colab.research.google.com/drive/13SmCwNyl2Fjjmmit6pX7IDUlYP9WIstL)
 
 ## Usage example
 
@@ -116,7 +118,7 @@ Here are some sample outputs:
 * [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
 * [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
 
-## DiffusionPipeline
-[[autodoc]] DiffusionPipeline
+## TextToVideoSDPipeline
+[[autodoc]] TextToVideoSDPipeline
   - all
   - __call__

From c939f1e8238a37c7c815029f19f973098d7c2920 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Thu, 23 Mar 2023 12:54:00 +0530
Subject: [PATCH 2/3] add: Spaces link.

---
 docs/source/en/api/pipelines/text_to_video.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/api/pipelines/text_to_video.mdx b/docs/source/en/api/pipelines/text_to_video.mdx
index aa34583783b8..4db74a0c2748 100644
--- a/docs/source/en/api/pipelines/text_to_video.mdx
+++ b/docs/source/en/api/pipelines/text_to_video.mdx
@@ -24,13 +24,13 @@ Resources:
 
 * [Website](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
 * [GitHub repository](https://github.com/modelscope/modelscope/)
-* [Colab Notebook](https://colab.research.google.com/drive/13SmCwNyl2Fjjmmit6pX7IDUlYP9WIstL)
+* [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis)
 
 ## Available Pipelines:
 
 | Pipeline | Tasks | Demo
 |---|---|:---:|
-| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [Colab Notebook](https://colab.research.google.com/drive/13SmCwNyl2Fjjmmit6pX7IDUlYP9WIstL)
+| [TextToVideoSDPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis)
 
 ## Usage example
 

From 470629b4728483bb5091d5bde16c888e0516b9f7 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Thu, 23 Mar 2023 14:08:25 +0530
Subject: [PATCH 3/3] add: warning on research-only model.

---
 docs/source/en/api/pipelines/text_to_video.mdx | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/docs/source/en/api/pipelines/text_to_video.mdx b/docs/source/en/api/pipelines/text_to_video.mdx
index 4db74a0c2748..82b2f19ce1b2 100644
--- a/docs/source/en/api/pipelines/text_to_video.mdx
+++ b/docs/source/en/api/pipelines/text_to_video.mdx
@@ -10,6 +10,12 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->
 
+
+
+This pipeline is for research purposes only.
+
+
+
 # Text-to-video synthesis
 
 ## Overview