[ICML 2026] CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping
Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
Fudan University, Tencent.
- 2026-06-01: We release the Image-to-Video inference pipeline and CameraNoise weights.
- 2026-05-29: The CameraNoise paper is available on arXiv.
- 2026-05-01: CameraNoise is accepted to ICML 2026.
CameraNoise is a camera-motion control framework for video diffusion models, designed to improve both camera trajectory controllability and geometric consistency in video generation. Instead of directly injecting numerical camera parameters into the diffusion backbone, CameraNoise converts camera poses into a temporally coherent stochastic representation in the diffusion noise space. This design decouples camera motion from scene appearance and enables stable transfer of camera trajectories from a reference video to a new reference image.
This repository includes two main components:
- CameraNoise warping: the standalone condition-construction stage that converts camera motion into GRFlow and then into CameraNoise. It does not require QwenVL, Wan2.1, or LoRA weights, and is useful for analyzing camera trajectories, debugging noise warping, reusing CameraNoise conditions, or integrating the condition into other video diffusion frameworks.
- CameraNoise I2V inference: the end-to-end image-to-video pipeline. Given a reference image and a reference video, it automatically performs camera estimation, CameraNoise construction, image captioning, and final video generation.
At its core, CameraNoise warping first builds an appearance-agnostic geometric motion field, namely Geometry-guided Reprojection Flow (GRFlow), from camera intrinsics and extrinsics. This flow is then applied to Gaussian noise to synthesize CameraNoise that follows the target camera trajectory while preserving the diffusion noise prior as much as possible. With this noise-space conditioning strategy, CameraNoise serves as a lightweight and reusable camera-motion condition for Wan2.1-I2V, Wan2.1-T2V, and potentially other video diffusion models, enabling videos with stable structure, high visual quality, and faithful camera motion.
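To make the idea of a reprojection flow concrete, here is a minimal NumPy sketch. It is not the repository's GRFlow implementation: it assumes a pinhole camera, a single constant depth for every pixel, and the convention that a point moves from the source to the target camera frame as `X_target = R @ X_source + t`. The function name `reprojection_flow` and all of its parameters are hypothetical.

```python
import numpy as np


def reprojection_flow(K, R, t, height, width, depth=1.0):
    """Per-pixel flow [H, W, 2] (dx, dy) induced by a relative camera motion.

    Assumption: every pixel lies at the same depth. This is a simplification
    for illustration, not the method used by CameraNoise.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Homogeneous pixel coordinates (u, v, 1) for the whole grid.
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    # Back-project pixels to 3D points in the source camera frame.
    points = (pix @ np.linalg.inv(K).T) * depth
    # Move the points into the target camera frame.
    points_target = points @ R.T + t
    # Project back to the image plane of the target camera.
    proj = points_target @ K.T
    uv = proj[..., :2] / proj[..., 2:3]
    return uv - pix[..., :2]


K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
flow = reprojection_flow(K, np.eye(3), np.array([0.1, 0.0, 0.0]), 48, 64)
print(flow.shape)  # (48, 64, 2)
```

Because the flow depends only on the intrinsics, the relative pose, and the assumed depth, it contains no information about image texture, which is the sense in which such a field is appearance-agnostic.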
- Reference-video camera control: estimate camera motion from an arbitrary reference video and transfer it to a new reference image.
- Standalone CameraNoise warping: run the pipeline from camera pose estimation to GRFlow construction and CameraNoise synthesis without launching the full video generation process.
- Geometry-guided Reprojection Flow: construct an appearance-agnostic geometric motion representation from camera intrinsics and extrinsics, avoiding direct dependence on image textures or semantic content.
- Noise-space conditioning: encode camera motion into CameraNoise as a temporally coherent noise condition for video diffusion models.
- End-to-end I2V pipeline: run camera estimation, CameraNoise synthesis, image captioning, and final video generation with a single command.
- Manifest logging: automatically record inputs, conditions, outputs, and key parameters for reproducible experiments.
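The noise-space conditioning idea can be illustrated with a deliberately simple warp. The sketch below is an assumption-laden stand-in, not the CameraNoise algorithm: it moves a Gaussian noise frame along a flow field with nearest-neighbor lookup and fills pixels that fall outside the frame with fresh Gaussian samples. The name `warp_noise_nearest` is hypothetical, and the flow is treated as if it were defined on the target grid.

```python
import numpy as np


def warp_noise_nearest(noise, flow, rng):
    """Warp a noise frame [H, W, C] along a flow field [H, W, 2] (dx, dy).

    Each output pixel is either a copy of one input sample or a fresh
    N(0, 1) sample, so the per-pixel marginal stays Gaussian. Non-uniform
    flows can copy the same sample twice, which adds spatial correlation;
    a real method has to handle that case more carefully.
    """
    h, w, _ = noise.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Source location for every target pixel, rounded to the nearest pixel.
    src_x = np.rint(xs - flow[..., 0]).astype(int)
    src_y = np.rint(ys - flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    # Fresh noise for regions that enter the frame.
    fresh = rng.standard_normal(noise.shape)
    gathered = noise[np.clip(src_y, 0, h - 1), np.clip(src_x, 0, w - 1)]
    return np.where(valid[..., None], gathered, fresh)


rng = np.random.default_rng(0)
frame = rng.standard_normal((8, 8, 4))
shift_right = np.zeros((8, 8, 2))
shift_right[..., 0] = 2.0  # move content two pixels to the right
warped = warp_noise_nearest(frame, shift_right, rng)
print(np.array_equal(warped[:, 2:], frame[:, :-2]))  # True
```

Applying such a warp frame by frame yields noise whose pattern drifts with the camera motion, which is the temporal coherence the condition is meant to carry.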
The overall CameraNoise workflow consists of two relatively independent stages. The first stage is CameraNoise warping, which converts camera motion from the reference video into a noise-space condition. The second stage is video generation, which takes the reference image, caption, and CameraNoise as inputs and generates the final camera-controlled video.
flowchart LR
    subgraph W[Standalone CameraNoise Warping]
        A[Reference Video] --> B["Camera Pose Estimation, e.g., VGGT"]
        B --> C[GRFlow Construction]
        C --> D[CameraNoise Warping]
        D --> N[CameraNoise .npy]
        D --> V[Noise Visualization .mp4]
    end
    subgraph G[Camera-controlled Video Generation]
        E[Reference Image] --> F["Image Captioning, e.g., Qwen2-VL"]
        E --> M[Wan2.1-I2V]
        F --> M
        N --> M
        M --> H[Camera-controlled Video]
    end
If you only need the camera-motion condition, you can run CameraNoise Warping alone. If you want to generate the final video, run the full I2V pipeline.
In addition to end-to-end I2V inference, this repository supports standalone CameraNoise condition generation. This corresponds to the core step of our method: converting the estimated camera motion from a reference video into a noise-space condition that can be used by a diffusion model.
Standalone CameraNoise warping is useful when you want to:
- extract and save CameraNoise from a reference video without immediately generating a video;
- inspect whether camera pose estimation, GRFlow, or noise warping is temporally stable;
- plug CameraNoise into your own video diffusion model or inference framework;
- preprocess reference videos in batches and cache .npy conditions for later experiments;
- visualize the noise warping process induced by camera motion.
The basic inputs and outputs are:
Input:
    reference video
    camera pose estimation model, e.g., VGGT
    CameraNoise warping config
Intermediate:
    camera poses
    GRFlow
Output:
    CameraNoise .npy                  # [T,H,W,C]
    CameraNoise visualization .mp4    # for checking temporal noise propagation
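Once a CameraNoise file has been written, a quick sanity check can be sketched as follows. This helper is illustrative and not part of the repository: the name `summarize_cameranoise` is hypothetical, and it relies only on the documented `[T,H,W,C]` layout.

```python
import numpy as np


def summarize_cameranoise(path):
    """Load a CameraNoise .npy file and report its shape and basic statistics."""
    noise = np.load(path)
    if noise.ndim != 4:
        raise ValueError(f"expected a [T,H,W,C] array, got shape {noise.shape}")
    t, h, w, c = noise.shape
    # A per-frame standard deviation that drifts over time would suggest the
    # warp is not preserving the noise statistics.
    per_frame_std = noise.reshape(t, -1).std(axis=1)
    return {
        "frames": t,
        "height": h,
        "width": w,
        "channels": c,
        "mean": float(noise.mean()),
        "std_min": float(per_frame_std.min()),
        "std_max": float(per_frame_std.max()),
    }
```

For a file such as `outputs/demo1/conditions/noises/example_noises.npy` generated for 576x1024 output with the default settings, the reported height and width should be 72 and 128.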
For detailed usage, configuration options, and visualization examples, please refer to cameranoise_warping/README.md.
If you only want to reproduce or debug CameraNoise warping, we recommend starting from the standalone documentation above.
CameraNoise/
    cameranoise_i2v.py                    # entry point for end-to-end I2V inference
    inference.sh                          # example command
    requirements.txt
    scripts/
        build_cameranoise.py              # CameraNoise construction interface; can be used standalone
        caption_image_qwenvl.py           # QwenVL caption generation interface
        generate_camera_control_video.py
    cameranoise_warping/                  # standalone CameraNoise warping module
        configs/
            default.yaml                  # default CameraNoise warping config
        README.md                         # standalone guide for CameraNoise warping
        README_zh.md                      # Chinese version of the standalone guide
    diffsynth/                            # Wan/DiffSynth inference code
    assets/
        cameranoise_icon.png
        teaser.png
    outputs/
        demo1/
            inputs/                       # reference image / reference video
            conditions/
                noises/                   # CameraNoise .npy and visualization results
                grflows/                  # intermediate GRFlow results
                camerapose/               # camera pose estimation results
            samples/                      # final generated videos
            manifest.json
The cameranoise_warping/ directory is the core condition-construction module of this project. If you only want to generate CameraNoise conditions, please refer to:
cameranoise_warping/README.md
We recommend using uv to manage the Python environment. uv provides fast dependency installation and convenient management of Python versions, virtual environments, and packages.
macOS / Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
If curl is not available, use:
wget -qO- https://astral.sh/uv/install.sh | sh
Windows PowerShell:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
After installation, restart your terminal and check:
uv --version
We recommend Python 3.10:
cd CameraNoise
uv venv --python 3.10.14
Activate the environment:
source .venv/bin/activate
For Windows:
.venv\Scripts\activate
Please choose the PyTorch installation command according to your CUDA version. Below is an example for CUDA 12.4:
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
For CUDA 12.1, use:
uv pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Then install the remaining dependencies:
uv pip install -r requirements.txt
CameraNoise inference requires the following checkpoints:
| Model | Usage | Hugging Face |
|---|---|---|
| VGGT | estimate camera motion from the reference video | facebook/VGGT-1B |
| Qwen2-VL | generate captions for the reference image | Qwen/Qwen2-VL-7B-Instruct |
| Wan2.1-I2V | base image-to-video generation model | Wan-AI/Wan2.1-I2V-14B-720P |
| CameraNoise LoRA | our CameraNoise camera-control weights | gulucaptain/CameraNoise-I2V |
uv pip install -U "huggingface_hub[cli]"
If the model requires authentication or you want to use your own Hugging Face token, run:
huggingface-cli login
We recommend placing all checkpoints under checkpoints/:
mkdir -p checkpoints
Download VGGT:
huggingface-cli download facebook/VGGT-1B \
  --local-dir checkpoints/VGGT-1B
Download Qwen2-VL:
huggingface-cli download Qwen/Qwen2-VL-7B-Instruct \
  --local-dir checkpoints/Qwen2-VL-7B-Instruct
Download Wan2.1-I2V:
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P \
  --local-dir checkpoints/Wan2.1-I2V-14B-720P
Download the CameraNoise LoRA:
huggingface-cli download gulucaptain/CameraNoise-I2V \
  --local-dir checkpoints/CameraNoise-I2V
After downloading, you can check the LoRA file name with:
find checkpoints/CameraNoise-I2V -name "*.safetensors"
During inference, set --lora-path to the corresponding .safetensors file, for example:
--lora-path checkpoints/CameraNoise-I2V/cameranoise_lora.safetensors
Each demo corresponds to a folder under outputs/. Place the reference image and reference video under inputs/:
outputs/demo1/
    inputs/
        example.jpg    # reference image
        example.mp4    # reference video that provides camera motion
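If you prepare several demos, the layout above can be created with a few lines of Python. This is a convenience sketch rather than a script shipped with the repository; the helper name `prepare_demo` and its arguments are hypothetical.

```python
import shutil
from pathlib import Path


def prepare_demo(demo_dir, image_path, video_path):
    """Create <demo_dir>/inputs/ and copy the reference image and video into it."""
    inputs = Path(demo_dir) / "inputs"
    inputs.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in (image_path, video_path):
        dst = inputs / Path(src).name
        shutil.copy2(src, dst)  # keep the original file name
        copied.append(dst)
    return copied
```

For example, `prepare_demo("outputs/demo2", "my_image.jpg", "my_motion.mp4")` would produce `outputs/demo2/inputs/my_image.jpg` and `outputs/demo2/inputs/my_motion.mp4`.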
The script will automatically generate the following files:
outputs/demo1/
    conditions/
        caption.txt
        noises/
            example_noises.npy
            example_visualization.mp4
        camerapose/
        grflows/
    samples/
        demo1.mp4
    manifest.json
If you already have a CameraNoise .npy file, you can also place it under:
outputs/demo1/conditions/noises/
We also provide the inference results for the demos under outputs/ on Hugging Face:
Hugging Face: gulucaptain/CameraNoise-I2V/i2v_demo_results
Below is an example command for generating a 576x1024 video:
python cameranoise_i2v.py \
--demo-dir outputs/demo1 \
--vggt-ckpt checkpoints/VGGT-1B \
--cameranoise-config cameranoise_warping/configs/default.yaml \
--qwenvl-model-path checkpoints/Qwen2-VL-7B-Instruct \
--model-root checkpoints/Wan2.1-I2V-14B-720P \
--lora-path checkpoints/CameraNoise-I2V/1024x576/cameranoise_i2v_wan2.1_1024x576_lora.safetensors \
--height 576 \
--width 1024 \
--frames 49 \
--sample-mode front \
--degradation-value 0.2 \
--cfg 3.5 \
--device cuda \
--output-type single
Batch inference for demos in outputs:
for i in {1..10}; do
  DEMO_DIR="outputs/demo${i}"
  if [ ! -d "$DEMO_DIR" ]; then
    echo "Skip ${DEMO_DIR}: directory not found."
    continue
  fi
  echo "========================================"
  echo "Running ${DEMO_DIR}"
  echo "========================================"
  python cameranoise_i2v.py \
    --demo-dir "$DEMO_DIR" \
    --vggt-ckpt checkpoints/VGGT-1B \
    --cameranoise-config cameranoise_warping/configs/default.yaml \
    --qwenvl-model-path checkpoints/Qwen2-VL-7B-Instruct \
    --model-root checkpoints/Wan2.1-I2V-14B-720P \
    --lora-path checkpoints/CameraNoise-I2V/1024x576/cameranoise_i2v_wan2.1_1024x576_lora.safetensors \
    --height 576 \
    --width 1024 \
    --frames 49 \
    --sample-mode front \
    --degradation-value 0.2 \
    --cfg 3.5 \
    --device cuda \
    --output-type single
done
The end-to-end CameraNoise pipeline automatically runs the following steps:
reference video
    -> camera pose estimation
    -> GRFlow construction
    -> CameraNoise warping
    -> CameraNoise .npy / visualization
reference image
    -> QwenVL caption generation
reference image + caption + CameraNoise
    -> Wan2.1-I2V generation
    -> final video
For debugging, we recommend splitting the full pipeline into three stages:
- Caption stage: generate caption.txt for the reference image.
- CameraNoise warping stage: generate CameraNoise .npy and visualization videos from the reference video.
- Video generation stage: reuse the existing caption and CameraNoise to run Wan2.1-I2V generation.
This staged workflow avoids repeatedly loading VGGT, QwenVL, and Wan2.1-I2V, and also makes it easier to cache CameraNoise conditions for batch experiments. In particular, when testing different prompts, LoRA weights, or generation settings, fixing the same CameraNoise condition helps isolate how the video diffusion model responds to the camera motion.
cameranoise_i2v.py automatically infers the spatial size of CameraNoise from the target video resolution:
cameranoise_downscale_size = [height // 8, width // 8]
For example:
576x1024 -> [72, 128]
768x768 -> [96, 96]
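The size rule can be written as a small function. This sketch only restates the formula above; the function name `infer_cameranoise_size` and the explicit divisibility check are illustrative additions and are not taken from `cameranoise_i2v.py`.

```python
def infer_cameranoise_size(height, width, factor=8):
    """Return [H, W] of the CameraNoise condition for a target video resolution."""
    if height % factor or width % factor:
        raise ValueError(
            f"height and width should be divisible by {factor}, got {height}x{width}"
        )
    return [height // factor, width // factor]


print(infer_cameranoise_size(576, 1024))  # [72, 128]
print(infer_cameranoise_size(768, 768))   # [96, 96]
```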
The default reference size for amplitude/std scaling is:
--cameranoise-std-reference-size 96
To manually specify the saved CameraNoise size, use:
--cameranoise-downscale-size 72,128

| Argument | Description |
|---|---|
| --demo-dir | Demo directory. It should contain an inputs/ folder. |
| --vggt-ckpt | Path to VGGT weights. Required when generating CameraNoise from a reference video. |
| --cameranoise-config | Optional YAML config. It is merged with cameranoise_warping/configs/default.yaml. |
| --cameranoise-std-reference-size | Reference size for CameraNoise amplitude/std scaling. Default: 96. |
| --cameranoise-downscale-size | Saved CameraNoise size in H,W format. Default: [height/8, width/8]. |
| --cameranoise-overwrite | Regenerate CameraNoise even if an existing file is found. |
| --qwenvl-model-path | Path to QwenVL weights. Required when caption.txt does not exist. |
| --overwrite-caption | Regenerate caption even if caption.txt already exists. |
| --model-root | Directory of the Wan2.1-I2V-14B-720P model. |
| --lora-path | Path to the CameraNoise LoRA weights. |
| --height, --width | Resolution of the generated video. |
| --frames | Number of frames to generate. |
| --cfg | Classifier-free guidance scale. |
| --degradation-value | CameraNoise degradation value. If not provided, it is randomly sampled from [0, 0.6]. |
| --sample-mode | Frame sampling mode for CameraNoise: front or even. |
| --output-type | Output mode: single, concat, or ct1. |
| --device | Inference device, e.g., cuda. |

After a successful run, you will get:
conditions/caption.txt # QwenVL image caption
conditions/noises/*_noises.npy # CameraNoise, [T,H,W,C]
conditions/noises/*_visualization.mp4 # CameraNoise visualization
samples/*.mp4 # final generated video
manifest.json # record of inputs, conditions, outputs, and parameters
The manifest.json file records the input files, generated condition files, final video paths, and key runtime parameters, which helps ensure reproducible experiments.
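A manifest can be inspected programmatically without knowing its exact schema. The sketch below assumes only that `manifest.json` holds a JSON object at the top level; the key names are not documented here, so the helpers (both hypothetical) list whatever keys are present instead of expecting specific ones.

```python
import json
from pathlib import Path


def load_manifest(demo_dir):
    """Read <demo_dir>/manifest.json and return its parsed content."""
    path = Path(demo_dir) / "manifest.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def describe_manifest(manifest):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in manifest.items()}
```

Comparing the output of `describe_manifest` across two runs is a quick way to confirm that both recorded the same kinds of information before comparing the values themselves.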
If conditions/caption.txt already exists, the script will reuse it automatically. To regenerate the caption, pass:
--overwrite-caption
If a CameraNoise .npy file already exists under conditions/noises/ or inputs/, the script will reuse it automatically. To regenerate CameraNoise, pass:
--cameranoise-overwrite
This allows you to debug VGGT, QwenVL, and final video generation separately, while avoiding redundant computation.
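The reuse rules described here can be mirrored in a small pre-flight check that tells you which stages a run would need. This is a sketch of the documented behavior, not the logic inside `cameranoise_i2v.py`; the function name `plan_stages` is hypothetical, and matching cached conditions with `*.npy` is an assumption.

```python
from pathlib import Path


def plan_stages(demo_dir, overwrite_caption=False, overwrite_noise=False):
    """Report which stages would have to run for a demo directory."""
    demo = Path(demo_dir)
    caption = demo / "conditions" / "caption.txt"
    # Cached CameraNoise files may live under conditions/noises/ or inputs/.
    search_dirs = (demo / "conditions" / "noises", demo / "inputs")
    cached = [p for d in search_dirs if d.is_dir() for p in d.glob("*.npy")]
    return {
        "run_caption": overwrite_caption or not caption.is_file(),
        "run_warping": overwrite_noise or not cached,
    }
```

For a fresh demo directory both values are `True`; after one complete run both become `False` unless the corresponding overwrite option is set.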
- When the CameraNoise size is inferred automatically, --height and --width should be divisible by 8.
- CameraNoise .npy files use the [T,H,W,C] layout.
- For 576x1024 inference, we recommend using a CameraNoise size of [72,128].
- For 768x768 inference, we recommend using a CameraNoise size of [96,96].
- If you run out of GPU memory, try reducing the resolution, number of frames, or sampling steps.
- If you only want to debug camera conditions, we recommend generating and checking *_visualization.mp4 first.
- The base model, LoRA weights, and CameraNoise resolution should be compatible with each other; otherwise, generation quality may degrade.
To run inference with an existing caption, make sure the following file exists:
outputs/demo1/conditions/caption.txt
Then run the main script as usual. The script will automatically reuse the existing caption.
To run only CameraNoise warping, please read the standalone guide:
cameranoise_warping/README.md
This document focuses on the inputs, outputs, configuration, and running procedure of CameraNoise warping. The end-to-end command in the main README is mainly intended for full I2V inference.
To reuse an existing CameraNoise file, place the .npy file in either of the following locations:
outputs/demo1/conditions/noises/
outputs/demo1/inputs/
Do not pass --cameranoise-overwrite. The script will automatically reuse the existing CameraNoise file.
The reference video provides the target camera motion. CameraNoise estimates camera poses from this video and converts the camera motion into a temporally coherent noise condition.
The reference image and the reference video do not need to show the same content. The reference image provides the generated content and appearance, while the reference video provides the camera motion. CameraNoise is designed to transfer the camera trajectory from the reference video to a new reference image.
If the generated camera motion is unstable or does not follow the reference video, you can try the following:
- check whether the reference video contains hard cuts, fast transitions, or severe jitter;
- check whether the CameraNoise visualization is temporally smooth;
- adjust --degradation-value;
- try a different random seed;
- reduce the camera motion amplitude or use a smoother reference video;
- make sure the CameraNoise size corresponds to the output video resolution (by default, [height // 8, width // 8]).
- Release CameraNoise I2V inference code.
- Release CameraNoise LoRA weights.
- Support automatic CameraNoise construction from reference video.
- Support manifest logging for reproducible inference.
- Release CameraNoise T2V inference code.
- Release CameraNoise training code.
- Add more camera-motion visualization tools.
- Release additional checkpoints and ablation configs.
- Release training data.
If you find CameraNoise useful for your research, please cite our paper:
@inproceedings{zhao2026cameranoise,
title = {CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping},
author = {Zhao, Haoyu and Gu, Jiaxi and Chen, Haoran and Zheng, Qingping and Jin, Yeying and Yang, Hongyi and Cheng, Junqi and Zhang, Yuang and Lu, Zenghui and Yu, Huan and Jiang, Jie and Shu, Peng and Wu, Zuxuan and Jiang, Yu-Gang},
booktitle = {Proceedings of the Forty-third International Conference on Machine Learning},
year = {2026}
}
This project builds upon or refers to the following excellent open-source projects and models:
We sincerely thank these works for their contributions to the open-source community.
The code and CameraNoise weights in this repository are released under the Apache-2.0 License by default.
Please note that third-party base models may have their own license requirements, including VGGT, Qwen2-VL, and Wan2.1. Please read and follow the licenses and terms of the corresponding model repositories before use.
