Before running this notebook, change the device each model is loaded on (for example `"cuda:7"`) and the arguments of each function call to match your own hardware and inputs.
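The device strings used in the cells below (`"cuda:7"`, `"cuda:5"`, and so on) reflect the original author's machine. As a minimal sketch, a helper like the following could pick a device that actually exists. The names `pick_device` and `count_gpus` are hypothetical and not part of this repository, and whether every module accepts `"cpu"` is an assumption that should be checked per tool.

```python
def pick_device(preferred_index: int, available_gpus: int) -> str:
    """Return "cuda:<index>" if that GPU exists, otherwise fall back to "cpu"."""
    if 0 <= preferred_index < available_gpus:
        return f"cuda:{preferred_index}"
    return "cpu"


def count_gpus() -> int:
    """Count visible CUDA devices; report 0 if torch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


# Example: ask for GPU 7 on a machine that only has 2 GPUs.
print(pick_device(7, 2))
print(pick_device(1, 2))
```

With this in place, a call such as `VideoCaptioning(pick_device(7, count_gpus()))` would use GPU 7 only when it is present.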

## Video Foundation Model

### Video Understanding

In [1]:
from modules.mplug import VideoCaptioning

tool_vc = VideoCaptioning("cuda:7")
video_captioning = tool_vc.inference # input: f"{video_path}", output: caption (str)



Initializing mPLUG for VideoCaptioning




In [2]:
from modules.blip import ImageCaptioning

tool_blip2 = ImageCaptioning("cuda:0")
frames_captioning = tool_blip2.inference # input: f"{video_path}", output: per-second captions (str)

Initializing BLIP2 for ImageCaptioning


Loading checkpoint shards: 100%|██████████| 2/2 [00:39<00:00, 19.51s/it]


### Video Processing (MoviePy)

In [3]:
from modules.video_moviepy import MoviepyInterface

movie_interface = MoviepyInterface()
video_subclip = movie_interface.intercept_fragments # input: f"{video_path}, {begin_second}, {end_second}", output: new_video_path
add_subtitles = movie_interface.add_subtitles # input: f"{video_path}, {start_time}, {duration}, {instruct_text}", output: new_video_path
concat_videos = movie_interface.concat_videos # input: f"{video_path1}, {video_path2}...", output: new_video_path
extract_audio = movie_interface.extract_audio # input: f"{video_path}", output: audio_path
add_audio_to_video = movie_interface.add_audio_to_video # input: f"{video_path}, {audio_path}", output: new_video_path

Initializing MoviepyInterface


### Video Generation

In [4]:
from modules.modelscope_t2v import ModelscopeT2V

mst2v = ModelscopeT2V("cuda:1")
text2video = mst2v.inference # input: f"{description}", output: new_video_path

In [5]:
from modules.annotator import Video2Canny, Video2Pose, Video2Depth

tool_v2c = Video2Canny()
tool_v2p = Video2Pose("cuda:1")
tool_v2d = Video2Depth("cuda:1")

video2canny = tool_v2c.inference # input: f"{video_path}", output: new_video_path
video2pose = tool_v2p.inference # input: f"{video_path}", output: new_video_path
video2depth = tool_v2d.inference # input: f"{video_path}", output: new_video_path

Initializing Video2Canny
Initializing Video2Pose
Initializing Video2Depth


In [6]:
from modules.text2video_zero import CannyText2Video, PoseText2Video, DepthText2Video, VideoPix2Pix

pose_text2video = PoseText2Video("cuda:5").inference  # input: f"{video_path}, {prompt}", output: new_video_path
canny_text2video = CannyText2Video("cuda:5").inference  # input: f"{video_path}, {prompt}", output: new_video_path
depth_text2video = DepthText2Video("cuda:5").inference  # input: f"{video_path}, {prompt}", output: new_video_path
video_pix2pix = VideoPix2Pix("cuda:5").inference # input: f"{video_path}, {prompt}", output: new_video_path

You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_controlnet.StableDiffusionControlNetPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .

### Audio Generation

In [7]:
from modules.bark import Text2Audio

tool_t2a = Text2Audio()
text2audio = tool_t2a.text2audio   # input: f"{text}", output: audio_path
text2music = tool_t2a.text2music   # input: f"{text}", output: audio_path

torch version does not support flash attention. You will get faster inference speed by upgrade torch to newest nightly version.


Initializing Bark for Text2Audio
Loading bark models for text2audio...


## Experiments

The cells below call each tool with sample inputs. Replace the arguments with your own file paths and prompts.

Each tool takes a single string; for the expected format, see the `# input:` comments in the cells above.
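Because multi-argument tools receive their arguments as one comma-separated string, a small helper can make the calls less error-prone. The sketch below is an illustration only: `build_tool_input` is a hypothetical name, and the assumption that the tools split on commas (so earlier fields must not contain one) is inferred from the `# input:` comments, not confirmed from the module source.

```python
def build_tool_input(*parts: object) -> str:
    """Join arguments into a single ", "-separated string."""
    fields = [str(part).strip() for part in parts]
    if not fields:
        raise ValueError("at least one field is required")
    # Assumption: the tools split on commas, so a comma inside any field
    # except the last (usually the free-text prompt) would shift the arguments.
    for field in fields[:-1]:
        if "," in field:
            raise ValueError(f"field {field!r} contains a comma")
    return ", ".join(fields)


# Example: arguments for a sub-clip from second 2 to second 5.
print(build_tool_input("video/v0006.mp4", 2, 5))
```

For instance, `video_subclip(build_tool_input("video/v0006.mp4", 2, 5))` would pass `"video/v0006.mp4, 2, 5"`.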

In [8]:
print(video_captioning("video/v0006.mp4"))

a video of a girl in a blue sweater in a mirror .


In [9]:
print(frames_captioning("video/v0006.mp4"))

Second 0: the woman is wearing a blue dress. Second 1: the girl is wearing a blue dress. Second 2: the woman is wearing a blue dress. Second 3: the girl is wearing a blue dress. Second 4: the girl is wearing a blue dress. Second 5: the girl is dancing. Second 6: the girl is wearing a blue dress. Second 7: the girl is wearing a blue dress. Second 8: the girl is wearing a blue dress. Second 9: the woman is dancing. Second 10: the girl is wearing a blue dress.
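The frame captions come back as one long string. If per-second access is needed, the string can be split on its `Second N:` markers. This is a sketch that assumes the output always follows the `Second N: caption` pattern shown above; `parse_frame_captions` is a hypothetical helper, not part of the repository.

```python
import re
from typing import Dict


def parse_frame_captions(text: str) -> Dict[int, str]:
    """Split 'Second N: caption ...' output into {second: caption}."""
    # Each caption runs until the next " Second N: " marker or the end of the text.
    pattern = re.compile(r"Second (\d+): (.*?)(?= Second \d+: |$)", re.S)
    return {int(sec): caption.strip() for sec, caption in pattern.findall(text)}


sample = (
    "Second 0: the woman is wearing a blue dress. "
    "Second 1: the girl is wearing a blue dress. "
    "Second 5: the girl is dancing."
)
captions = parse_frame_captions(sample)
print(captions[5])
```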


In [10]:
print("./" + text2video("a goldendoodle playing in a park by a lake"))

100%|██████████| 25/25 [00:09<00:00,  2.76it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/ee58c835.mp4
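The `huggingface/tokenizers` warning above appears whenever a tool forks a subprocess after a tokenizer has been used. As the warning itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly silences it. A minimal sketch follows; placing it in the first cell of the notebook, before any model is loaded, is the assumption here.

```python
import os

# Set this before any tokenizer is used, ideally in the first notebook cell.
# "false" turns off tokenizer parallelism, which the warning identifies as
# the source of potential deadlocks after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```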


In [11]:
print("./" + video2canny("video/v0006.mp4"))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/0ef7_edge_v0006_v0006.mp4


In [12]:
print("./" + video2pose("video/v0006.mp4"))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/2748_pose_v0006_v0006.mp4


In [13]:
print("./" + video2depth("video/v0006.mp4"))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/2c51_depth_v0006_v0006.mp4


In [14]:
print("./" + pose_text2video("video/0d50_human-pose_1759_v0006.mp4, an astronaut dancing on the moon"))

Processing chunk 1 / 9


100%|██████████| 20/20 [00:05<00:00,  3.51it/s]


Processing chunk 2 / 9


100%|██████████| 20/20 [00:04<00:00,  4.09it/s]


Processing chunk 3 / 9


100%|██████████| 20/20 [00:04<00:00,  4.09it/s]


Processing chunk 4 / 9


100%|██████████| 20/20 [00:04<00:00,  4.08it/s]


Processing chunk 5 / 9


100%|██████████| 20/20 [00:04<00:00,  4.07it/s]


Processing chunk 6 / 9


100%|██████████| 20/20 [00:04<00:00,  4.07it/s]


Processing chunk 7 / 9


100%|██████████| 20/20 [00:04<00:00,  4.05it/s]


Processing chunk 8 / 9


100%|██████████| 20/20 [00:04<00:00,  4.04it/s]


Processing chunk 9 / 9


100%|██████████| 20/20 [00:03<00:00,  5.62it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/5d78_pose2video_0d50_v0006.mp4
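`pose_text2video` expects a pose video as input, which is the kind of file `video2pose` produces. The two steps could be chained as sketched below. Note that the cell above uses a pose file from an earlier run rather than the output of cell 12, so this chaining is an assumption about intended usage; `pose_restyle` is a hypothetical wrapper that takes the two tool functions as parameters so it can be tried without loading any model.

```python
from typing import Callable


def pose_restyle(
    video_path: str,
    prompt: str,
    to_pose: Callable[[str], str],
    pose_to_video: Callable[[str], str],
) -> str:
    """Extract a pose video, then generate a new video from it and a prompt."""
    pose_path = to_pose(video_path)
    # The generation tool takes one string: "<pose video path>, <prompt>".
    return pose_to_video(f"{pose_path}, {prompt}")


# Stand-in functions so the sketch runs without GPUs or model weights.
def fake_to_pose(path: str) -> str:
    return path.replace(".mp4", "_pose.mp4")


def fake_pose_to_video(tool_input: str) -> str:
    return "generated from: " + tool_input


result = pose_restyle(
    "video/v0006.mp4", "an astronaut dancing on the moon",
    fake_to_pose, fake_pose_to_video,
)
print(result)
```

In the notebook, the real call would be `pose_restyle(path, prompt, video2pose, pose_text2video)`.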


In [15]:
print("./" + depth_text2video("video/7574_depth_235ee6b2_235ee6b2.mp4, a tiger"))

Processing chunk 1 / 3


100%|██████████| 20/20 [00:11<00:00,  1.73it/s]


Processing chunk 2 / 3


100%|██████████| 20/20 [00:11<00:00,  1.72it/s]


Processing chunk 3 / 3


100%|██████████| 20/20 [00:04<00:00,  4.12it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/227e_depth2video_7574_235ee6b2.mp4


In [16]:
print("./" + video_pix2pix("video/235ee6b2.mp4, make it snowy"))

Processing chunk 1 / 3


100%|██████████| 50/50 [00:31<00:00,  1.58it/s]


Processing chunk 2 / 3


100%|██████████| 50/50 [00:31<00:00,  1.60it/s]


Processing chunk 3 / 3


100%|██████████| 50/50 [00:12<00:00,  3.94it/s]


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
./video/1bc4_pix2pix_235ee6b2_235ee6b2.mp4


In [17]:
print("./" + text2audio("a goldendoodle playing in a park by a lake"))

100%|██████████| 100/100 [00:01<00:00, 66.13it/s]
100%|██████████| 7/7 [00:04<00:00,  1.56it/s]


./video/edafb26c.wav


In [18]:
print("./" + text2music("a goldendoodle playing in a park by a lake"))

100%|██████████| 100/100 [00:02<00:00, 33.85it/s]
100%|██████████| 13/13 [00:09<00:00,  1.42it/s]


./video/0f17d609.wav
