gulucaptain/Camera-Transformer-1

CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-controllable Video Generation

Project Page | ArXiv | Code Coming Soon

Haoyu Zhao, Zihao Zhang, Jiaxi Gu, Haoran Chen, Qingping Zheng, Pin Tang, Yeyin Jin,
Yuang Zhang, Junqi Cheng, Zenghui Lu, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
Fudan University; Tencent.

📋 Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios.

To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions.

To facilitate the training of CT-1, we design a dedicated data curation pipeline and construct CT-200K, a large-scale dataset containing over 47M frames. Experimental results demonstrate that our framework successfully bridges the gap between spatial reasoning and video synthesis, yielding faithful and high-quality camera-controllable videos and improving camera control accuracy by 25.7% over prior methods.


🔥 News

Date | Event
🟡 2026-04-10 | Project page released. Code coming soon.

🧠 Framework Overview

CT-1 follows a "Camera-Decision-First, Generation-Next" two-stage paradigm:

Vision-Language Input (Image + Text)
          │
          ▼
  ┌───────────────────┐
  │  CT-1 (VLC Model) │  ← Diffusion Transformer + Wavelet Regularization Loss
  └───────────────────┘
          │
          ▼
   Camera Trajectories
          │
          ▼
  ┌─────────────────────────┐
  │  Video Diffusion Model  │  ← Camera-controllable video generation
  └─────────────────────────┘
          │
          ▼
   Generated Video

The framework consists of three main components:

  • (a) Vision-Language Module: embeds the image and text inputs semantically
  • (b) Diffusion Transformer Module: models camera trajectory distributions, trained with the Wavelet-based Regularization Loss
  • (c) Controllable Video Generation Models: synthesize videos conditioned on the predicted trajectories
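Since the official code is not yet released, the two-stage flow above can only be sketched under assumptions. In the minimal Python sketch below, every name (`Request`, `run_two_stage`, the 6-value `Pose` layout, and both stand-in stages) is hypothetical and is not the CT-1 API; it only illustrates that the camera path is decided first and the generator is then conditioned on it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical pose layout (an assumption): translation (tx, ty, tz) and rotation (rx, ry, rz) per frame.
Pose = Tuple[float, float, float, float, float, float]


@dataclass
class Request:
    image_path: str   # reference image
    prompt: str       # text describing the intended camera motion
    num_frames: int   # length of the trajectory and of the video


def run_two_stage(
    request: Request,
    predict_trajectory: Callable[[Request], List[Pose]],
    generate_video: Callable[[Request, List[Pose]], List[str]],
) -> List[str]:
    """Stage 1 decides the camera path; stage 2 generates frames conditioned on it."""
    trajectory = predict_trajectory(request)        # camera decision first
    if len(trajectory) != request.num_frames:
        raise ValueError("expected one pose per frame")
    return generate_video(request, trajectory)      # generation next


# Stand-in stages so the sketch runs without any model weights.
def dummy_forward_dolly(request: Request) -> List[Pose]:
    # Move 0.1 units along +z per frame, with no rotation.
    return [(0.0, 0.0, 0.1 * i, 0.0, 0.0, 0.0) for i in range(request.num_frames)]


def dummy_renderer(request: Request, trajectory: List[Pose]) -> List[str]:
    # Returns text labels instead of images; a real generator would return frames.
    return [f"frame {i}: tz={pose[2]:.1f}" for i, pose in enumerate(trajectory)]


frames = run_two_stage(
    Request("scene.png", "move forward slowly", 4),
    dummy_forward_dolly,
    dummy_renderer,
)
```

Keeping the trajectory as an explicit intermediate value is what would let a predicted path be passed to different video generators, provided each accepts the chosen pose format.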

🎬 Video Generation with CT-1

Challenging Scenarios: forward motion and rotational motion across diverse scenes.

Animated previews below (GIF).

💡 For full video demos including camera trajectory visualizations, cross-model comparisons, and driving scenarios, please visit our Project Page.


✨ Highlights

  • 🎯 VLC Model: First to formulate camera trajectory estimation as a vision-language understanding task
  • 🌊 Wavelet-based Regularization Loss: Novel frequency-domain loss for learning complex camera trajectory distributions
  • 📦 CT-200K Dataset: Large-scale dataset with 47M+ frames and a dedicated curation pipeline
  • 🔌 Cross-Model Compatibility: Trajectories predicted by CT-1 are compatible with existing models (CameraCtrl, MotionCtrl, etc.)
  • 🚗 Cross-Domain Generalization: Validated on general scenes and driving scenarios
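To make the idea of a frequency-domain trajectory loss concrete, here is a minimal sketch. It assumes a single-level Haar transform and a loss of the form "time-domain MSE plus a weighted L1 gap between wavelet coefficients". The wavelet family, the loss form, the `weight` value, and the function names are all illustrative assumptions; the paper's actual Wavelet-based Regularization Loss may differ.

```python
import math
from typing import List, Tuple


def haar_dwt(signal: List[float]) -> Tuple[List[float], List[float]]:
    """Single-level orthonormal Haar transform of an even-length signal."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_regularized_loss(pred: List[float], target: List[float], weight: float = 0.5) -> float:
    """Time-domain MSE plus a weighted mean L1 gap between Haar coefficients.

    Assumed form for illustration only, not the loss defined in the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    pred_approx, pred_detail = haar_dwt(pred)
    tgt_approx, tgt_detail = haar_dwt(target)
    coeff_gap = sum(
        abs(a - b)
        for a, b in zip(pred_approx + pred_detail, tgt_approx + tgt_detail)
    ) / len(pred)
    return mse + weight * coeff_gap


# One coordinate of a camera path over 8 frames: a smooth target and a jittery prediction.
target = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
jittery = [t + (0.05 if i % 2 == 0 else -0.05) for i, t in enumerate(target)]

loss_same = wavelet_regularized_loss(target, target)
loss_jitter = wavelet_regularized_loss(jittery, target)
```

In this toy case the frame-to-frame jitter shows up almost entirely in the detail (high-frequency) coefficients, so a coefficient-space term penalizes it separately from the overall drift captured by the approximation band.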

💻 Code

🚧 Coming Soon: code and model weights will be released.

The release will include:

  • CT-1 model code & weights
  • CT-200K dataset
  • Training pipeline
  • Inference demo
  • Evaluation scripts

📌 The trajectory visualization code is available in our separate repository:
Camera Trajectories Visualization


📎 Citation

If you find this work useful, please consider citing:

@article{zhao2026ct1,
  title     = {CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-controllable Video Generation},
  author    = {Haoyu Zhao and Zihao Zhang and Jiaxi Gu and Haoran Chen and Qingping Zheng and Pin Tang and Yeyin Jin and Yuang Zhang and Junqi Cheng and Zenghui Lu and Peng Shu and Zuxuan Wu and Yu-Gang Jiang},
  journal   = {arXiv preprint arXiv:2604.09201},
  year      = {2026}
}

Built with ❤️ | Project Page

About

🎬 Tell the camera where to go: CT-1 understands your intent and generates videos with precise, spatially aware camera control.
