InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions

arXiv · Website · Discord · HuggingFace · YouTube Video

🔆 Introduction

InteractiveVideo is a user-centric framework for interactive video generation. It enables comprehensive editing through users' intuitive manipulations and achieves high-quality regional content control as well as precise motion control. Its main features are as follows:

1. Personalize A Video

"Purple Flowers." "Purple Flowers, bee" "the purple flowers are shaking, a bee is flying"
"1 Cat." "1 Cat, butterfly" "the small yellow butterfly is flying to the cat's face"

2. Fine-grained Video Editing

"flowers." "flowers." "windy, the flowers are shaking in the wind"
"1 Man." "1 Man, rose." "1 Man, smiling."

3. Powerful Motion Control

InteractiveVideo can perform precise motion control.

"1 man, dark light " "the man is turning his body" "the man is turning his body"
"1 beautiful girl with long black hair, and a flower on her head, clouds" " the girl is turning gradually" " the girl is turning gradually"

4. Characters Dressing up

InteractiveVideo cooperates smoothly with LoRAs and DreamBooth checkpoints, so many potential uses of this framework remain under-explored; a minimal sketch of loading a LoRA with diffusers follows the example below.

"Yae Miko" (Genshin Impact) "Dressing Up " "Dressing Up"

⚙️ Quick Start

1. Install Environment via Anaconda

# create a conda environment
conda create -n ivideo python=3.10
conda activate ivideo

# install requirements
pip install -r requirements.txt

2. Prepare Checkpoints

You can simply use the following script to download checkpoints

python scripts/download_models.py

Downloading everything takes a long time; you can also selectively download checkpoints by modifying "scripts/download_models.py" and "scripts/*.json". Please make sure that at least one checkpoint remains for each JSON file. All checkpoints are listed as follows, and an example of fetching a single checkpoint by hand is given after the tables.

  1. Checkpoints for enjoying image-to-image generation

| Models | Types | Version | Checkpoints |
| --- | --- | --- | --- |
| StableDiffusion | - | v1.5 | Huggingface |
| StableDiffusion | - | turbo | Huggingface |
| KoHaKu | Animation | v2.1 | Huggingface |
| LCM-LoRA-StableDiffusion | - | v1.5 | Huggingface |
| LCM-LoRA-StableDiffusion | - | xl | Huggingface |

  2. Checkpoints for enjoying image-to-video generation

| Models | Types | Version | Checkpoints |
| --- | --- | --- | --- |
| StableDiffusion | - | v1.5 | Huggingface |
| PIA (UNet) | - | - | Huggingface |
| Dreambooth | MagicMixRealistic | v5 | Civitai |
| Dreambooth | RCNZCartoon3d | v10 | Civitai |
| Dreambooth | RealisticVision | - | Huggingface |

  3. Checkpoints for enjoying dragging images

| Models | Types | Resolution | Checkpoints |
| --- | --- | --- | --- |
| StyleGAN-2 | Lions | 512 x 512 | Google Storage |
| StyleGAN-2 | Dogs | 1024 x 1024 | Google Storage |
| StyleGAN-2 | Horses | 256 x 256 | Google Storage |
| StyleGAN-2 | Elephants | 512 x 512 | Google Storage |
| StyleGAN-2 | Face (FFHQ) | 512 x 512 | NGC |
| StyleGAN-2 | Cat Face (AFHQ) | 512 x 512 | NGC |
| StyleGAN-2 | Car | 512 x 512 | CloudFront |
| StyleGAN-2 | Cat | 512 x 512 | CloudFront |
| StyleGAN-2 | Landmark (LHQ) | 256 x 256 | Google Drive |
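
If you only need one of the Huggingface checkpoints above, the following is a minimal sketch of fetching it directly with the huggingface_hub package instead of the bundled script; the repo id and target directory are assumptions and should be adjusted to the checkpoint you actually want.

# Hypothetical selective download; repo_id and local_dir are assumptions.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="runwayml/stable-diffusion-v1-5",
    local_dir="checkpoints/diffusion_body/stable-diffusion-v1-5",
)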

You can also train and use your own customized models. Put your model into the "checkpoints" folder, which is organized as follows

InteractiveVideo  # project
|----checkpoints
|----|----drag  # Drag
|----|----|----stylegan2_elephants_512_pytorch.pkl
|----|----i2i  # Image-2-Image
|----|----|----lora
|----|----|----|----lcm-lora-sdv1-5.safetensors
|----|----i2v  # Image-to-Video
|----|----|----unet
|----|----|----|----pia.ckpt
|----|----|----dreambooth
|----|----|----|----realisticVisionV51_v51VAE.safetensors
|----|----diffusion_body
|----|----|----stable-diffusion-v1-5
|----|----|----kohahu-v2-1
|----|----|----sd-turbo
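
As a sanity check before launching the demo, you can verify that the folder matches this layout; the sketch below is hypothetical (not part of the repo), and the subfolder names are simply the examples shown in the tree.

# Hypothetical layout check for the "checkpoints" folder; run from the project root.
from pathlib import Path

expected = [
    "checkpoints/drag",                                  # StyleGAN-2 .pkl files for dragging
    "checkpoints/i2i/lora",                              # LCM-LoRA weights for image-to-image
    "checkpoints/i2v/unet",                              # PIA UNet for image-to-video
    "checkpoints/i2v/dreambooth",                        # DreamBooth .safetensors files
    "checkpoints/diffusion_body/stable-diffusion-v1-5",  # diffusion backbone
]

for rel in expected:
    status = "ok" if Path(rel).exists() else "MISSING"
    print(f"{status:>7}  {rel}")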

💫 Usage

1. Local demo

To run a local demo, use the following command (recommended)

  python demo/main.py

You can also run our web demo locally with

  python demo/main_gradio.py

In the following, we provide some instructions for a quick start.

2. Image-to-Image Generation

Input image-to-image text prompts and click the "Confirm Text" button. Generation runs in real time.

3. Image-to-Video Generation

Input image-to-video text prompts and click the "Confirm Text" button. Then click the "Generate Video" button and wait a few seconds.

If the generated video is not satisfactory, you can refine it with multimodal instructions. For example, draw butterflies on the image to tell the model where they should appear.

4. Drag Image

You can also drag images. First, choose a suitable checkpoint in the "Drag Image" tab and click the "Drag Mode On" button; preparation takes a few minutes. Then draw masks, add points, and click the "start" button. Once the result is satisfactory, click the "stop" button.

😉 Citation

If the code and paper help your research, please cite:

@article{zhang2024interactivevideo,
      title={InteractiveVideo: User-Centric Controllable Video Generation with Synergistic Multimodal Instructions}, 
      author={Zhang, Yiyuan and Kang, Yuhao and Zhang, Zhixin and Ding, Xiaohan and Zhao, Sanyuan and Yue, Xiangyu},
      year={2024},
      eprint={2402.03040},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🤗 Acknowledgements

Our codebase builds on Stable Diffusion, StreamDiffusion, DragGAN, PTI, and PIA. Thanks to the authors for sharing their awesome codebases!

📢 Disclaimer

We developed this repository for RESEARCH purposes, so it may only be used for personal, research, and other non-commercial purposes.