# CogVideoX

**CogVideoX** is a text-to-video generation model focused on creating more coherent videos aligned with a prompt. It achieves this using
* a 3D variational autoencoder that compresses videos spatially and temporally, improving compression rate and video accuracy.
* an expert transformer block to help align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.

## Data preparation

The training script accepts data in two formats. The first is suited to small-scale training; the second uses a CSV file and is more appropriate for streaming data in large-scale training.

### Small format

This format uses two files: one contains line-separated prompts and the other contains line-separated paths to video files. Each video path must be relative to the directory passed as `--instance_data_root`.

Assume we have specified `--instance_data_root` as `/dataset`, and that the directory contains the files `prompts.txt` and `videos.txt`.

The `prompts.txt` file should contain line-separated prompts:
```
A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.
...
```

The `videos.txt` file should contain line-separated paths to video files. The path should be *relative* to the `--instance_data_root` directory:
```
videos/00000.mp4
videos/00001.mp4
...
```
Therefore, the dataset root directory looks like this:
```
/dataset
├── prompts.txt
├── videos.txt
└── videos
    ├── 00000.mp4
    ├── 00001.mp4
    └── ...
```

When using this format, the `--caption_column` must be `prompts.txt` and the `--video_column` must be `videos.txt`.
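
Before launching a run, it can help to check that the two files line up. The following is a minimal sketch, not part of the training script; the function name `check_small_format` and its defaults are assumptions made for illustration. It reads both files, verifies that they have the same number of entries, and confirms that every video path resolves under the data root.

```python
from pathlib import Path


def check_small_format(data_root, prompts_file="prompts.txt", videos_file="videos.txt"):
    """Return (prompt, video_path) pairs, raising if the layout is inconsistent.

    Hypothetical helper: the file names mirror the example layout above.
    """
    root = Path(data_root)

    def read_lines(name):
        text = (root / name).read_text(encoding="utf-8")
        # Ignore blank lines so a trailing newline does not count as an entry.
        return [line.strip() for line in text.splitlines() if line.strip()]

    prompts = read_lines(prompts_file)
    videos = read_lines(videos_file)

    if len(prompts) != len(videos):
        raise ValueError(
            f"{prompts_file} has {len(prompts)} entries but "
            f"{videos_file} has {len(videos)}"
        )

    # Video paths are relative to the data root.
    missing = [v for v in videos if not (root / v).is_file()]
    if missing:
        raise FileNotFoundError(f"Video files not found under {root}: {missing}")

    return [(prompt, root / video) for prompt, video in zip(prompts, videos)]
```

Calling `check_small_format("/dataset")` on the example layout would return one pair per line of `prompts.txt`.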

### Stream format

Alternatively, we can use a single CSV file. Assume we have a `metadata.csv` file in the following format:
```
<CAPTION_COLUMN>,<PATH_TO_VIDEO_COLUMN>
"""A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.""","""00000.mp4"""
"""A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.""","""00001.mp4"""
...
```

Here, `--instance_data_root` should be the location where the videos are stored, and `--dataset_name` should be either a path to a local folder or a `load_dataset`-compatible dataset hosted on the Hub.

[Hub dataset] Assume we have videos of Minecraft gameplay at `https://huggingface.co/datasets/my-awesome-username/minecraft-videos`, then we would have to specify `--dataset_name my-awesome-username/minecraft-videos`.

Using this format, the `--caption_column` must be `<CAPTION_COLUMN>` and `--video_column` must be `<PATH_TO_VIDEO_COLUMN>`.
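
A file of this shape can be produced with Python's standard `csv` module, which quotes captions that contain commas. This is a rough sketch under assumptions: the function name `write_metadata_csv` and the column names `caption` and `video` are placeholders chosen for illustration, and whatever names are used must match the values passed to `--caption_column` and `--video_column`.

```python
import csv


def write_metadata_csv(rows, csv_path, caption_column="caption", video_column="video"):
    """Write (caption, video_path) pairs to a CSV file with a header row.

    Hypothetical helper: the column names are placeholders.
    """
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([caption_column, video_column])
        for caption, video_path in rows:
            writer.writerow([caption, video_path])
```

For example, `write_metadata_csv([("A rabbit and a goat, playing music.", "00000.mp4")], "metadata.csv")` writes a header line followed by one data row, with the caption quoted because it contains a comma.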

## Training

We must install necessary requirements:
* PyTorch 2.0 or above for quantized/DeepSpeed training
* `pip install diffusers transformers accelerate peft huggingface_hub` for all things modeling and training related
* `pip install datasets decord` for loading video training data
* `pip install bitsandbytes` for using 8-bit Adam and AdamW optimizers for memory-optimized training
* `pip install wandb` optionally for monitoring training logs
* `pip install deepspeed` optionally for **DeepSpeed** training
* `pip install prodigyopt` optionally if we would like to use the Prodigy optimizer for training



As always, make sure to install the `diffusers` library from source:
```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .
```

We need to navigate to the following folder and install the corresponding dependencies:
```bash
cd examples/cogvideo
pip install -r requirements.txt
```

Now we can initialize a Hugging Face Accelerate environment:
```bash
accelerate config
```
To set up a default Accelerate environment without choosing any configurations:
```bash
accelerate config default
```
Or, if our environment does not support an interactive shell (for example, a notebook), we can use:
```python
from accelerate.utils import write_basic_config
write_basic_config()
```

If our data is prepared as suggested in the Data preparation section, we can start training. When training on 50 videos of a similar concept, we have found 1500-2000 steps to work well. The official recommendation is 100 videos with a total of 4000 steps. Assuming we train on a single GPU with `--train_batch_size 1`:
* 1500 steps on 50 videos correspond to 30 training epochs
* 4000 steps on 100 videos correspond to 40 training epochs
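
The conversion between steps and epochs above can be sketched as a small calculation. This is an illustrative helper only; the function name `steps_to_epochs` is hypothetical, and it assumes that one step is one optimizer update and that each epoch visits every video once.

```python
import math


def steps_to_epochs(num_steps, num_videos, train_batch_size=1,
                    gradient_accumulation_steps=1, num_gpus=1):
    """Estimate the number of epochs needed to reach num_steps optimizer updates.

    Hypothetical helper for picking a --num_train_epochs value.
    """
    # Batches seen per epoch across all GPUs.
    batches_per_epoch = math.ceil(num_videos / (train_batch_size * num_gpus))
    # One optimizer update happens every gradient_accumulation_steps batches.
    steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return math.ceil(num_steps / steps_per_epoch)


print(steps_to_epochs(1500, 50))   # 30
print(steps_to_epochs(4000, 100))  # 40
```

With a larger batch size or more GPUs, each epoch contains fewer steps, so the same step budget needs more epochs.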

We can specify the following in the bash file to run:
```bash
GPU_IDS="0"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir <CACHE_DIR> \
  --instance_data_root <PATH_TO_WHERE_VIDEO_FILES_ARE_STORED> \
  --dataset_name my-awesome-name/my-awesome-dataset \
  --caption_column <CAPTION_COLUMN> \
  --video_column <PATH_TO_VIDEO_COLUMN> \
  --id_token <ID_TOKEN> \
  --validation_prompt "<ID_TOKEN> Spiderman swinging over buildings:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 30 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb
```
Note:
* `--report_to wandb` ensures that training runs are tracked on Weights and Biases. To use it, be sure to install wandb with `pip install wandb`.
* `--validation_prompt` and `--validation_epochs` allow the script to do a few validation inference runs, so that we can qualitatively check whether training is progressing as expected.
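
In the command above, `--validation_prompt` packs two prompts into one string, separated by the value of `--validation_prompt_separator` (`:::`). The sketch below shows one way such a string can be split back into individual prompts. It is an illustration under assumptions, not the training script's own code, and the function name `split_validation_prompts` is hypothetical.

```python
def split_validation_prompts(validation_prompt, separator=":::"):
    """Split a combined validation prompt string into a list of prompts.

    Hypothetical helper: surrounding whitespace and empty entries are dropped.
    """
    parts = validation_prompt.split(separator)
    return [part.strip() for part in parts if part.strip()]


prompts = split_validation_prompts(
    "Spiderman swinging over buildings:::A panda playing a guitar in a bamboo forest"
)
print(len(prompts))  # 2
```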

## Inference

Once we have trained our LoRA model, inference can be done by simply loading the LoRA weights into the `CogVideoXPipeline`.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16
)
# pipe.load_lora_weights("/path/to/lora/weights", adapter_name="cogvideox-lora") # Or,
pipe.load_lora_weights(
    "my-awesome-hf-username/my-awesome-lora-name",
    adapter_name="cogvideox-lora"
) # If loading from the HF Hub
pipe.to("cuda")

# Assuming lora_alpha=32 and rank=64 for training. If different, set accordingly
pipe.set_adapters(["cogvideox-lora"], [32 / 64])
```

```python
prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```
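
The adapter weight passed to `set_adapters` in the inference code is `lora_alpha / rank`, so it should be recomputed whenever the training values differ from the ones assumed in the comment. As a minimal sketch (the helper name `lora_adapter_scale` is hypothetical), the training command shown earlier uses `--rank 64` and `--lora_alpha 64`, which gives a weight of 1.0 rather than 0.5.

```python
def lora_adapter_scale(lora_alpha, rank):
    """Return the adapter weight lora_alpha / rank used at inference time.

    Hypothetical helper: pass the values that were used for training.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return lora_alpha / rank


print(lora_adapter_scale(32, 64))  # 0.5
print(lora_adapter_scale(64, 64))  # 1.0
```

The result can then be passed as `pipe.set_adapters(["cogvideox-lora"], [lora_adapter_scale(64, 64)])`.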

## Reduce memory usage

During testing, all memory optimizations available in the diffusers library were enabled. Actual memory usage has not been measured on devices other than **NVIDIA A100 / H100**. In general, this setup should work on all devices with the **NVIDIA Ampere architecture** or newer. If the optimizations are disabled, peak memory usage rises to about 3 times the value in the table, but speed increases by about 3-4 times.

The following calls enable the optimizations; to selectively disable one, omit the corresponding call:
```python
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
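
One way to make these optimizations easy to toggle is to wrap the three calls in a small function with one flag per optimization. This is a sketch under assumptions: the function name `apply_memory_optimizations` and its flags are hypothetical, and `pipe` is expected to be a pipeline such as the `CogVideoXPipeline` from the Inference section.

```python
def apply_memory_optimizations(pipe, cpu_offload=True, vae_slicing=True, vae_tiling=True):
    """Enable the selected memory optimizations on a pipeline and return it.

    Hypothetical helper: each flag maps to one of the calls shown above.
    """
    if cpu_offload:
        # Keeps submodules on the CPU and moves them to the GPU only when needed.
        pipe.enable_sequential_cpu_offload()
    if vae_slicing:
        # Decodes a batch one sample at a time.
        pipe.vae.enable_slicing()
    if vae_tiling:
        # Processes each frame in tiles instead of all at once.
        pipe.vae.enable_tiling()
    return pipe
```

For example, `apply_memory_optimizations(pipe, cpu_offload=False)` keeps VAE slicing and tiling while skipping CPU offloading, trading higher memory usage for speed.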

