microsoft/VisionAsAdaptations

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

arXiv | Project Page

This is the official code repository for the paper *Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models*.

University of Cambridge, Microsoft Research Asia

TL;DR

A single vector (1×131072) is enough to represent a 480p, 81-frame video!

Visual Results

Video Compression

480p video compression at 11.8 KB/s (30 fps).

Ground Truth | VOV (Ours)
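A back-of-envelope check connects the quoted numbers, under the assumption (ours, not stated in the repository) that the 1×131072 code vector is entropy-coded at roughly 2 bits per entry:

```python
# Sanity-check the quoted bitrate, assuming ~2 bits per entry after
# entropy coding (an assumption for illustration; the actual coding
# rate is not stated in this README).
entries = 131072
bits_per_entry = 2          # assumed
frames, fps = 81, 30

size_kb = entries * bits_per_entry / 8 / 1024   # total payload in KB
duration_s = frames / fps                        # clip length in seconds
rate_kb_s = size_kb / duration_s

print(f"{size_kb:.0f} KB over {duration_s:.1f} s -> {rate_kb_s:.2f} KB/s")
```

Under that assumption the payload is 32 KB over a 2.7 s clip, which lands close to the quoted 11.8 KB/s.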

Beyond Compression: Editing

Native video editing with a T2V (text-to-video) model.

Original | Edited

For more visual results and interactive demos, visit our project page.

Abstract

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.
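The core mechanism — attaching trainable low-rank adaptations to a frozen generative model so that the small adapter factors carry the signal — can be sketched minimally in PyTorch. This is an illustrative LoRA layer under standard conventions, not the repository's actual UniLoRA implementation; class and attribute names are our own:

```python
import torch
import torch.nn as nn

# Minimal sketch of the idea: a frozen linear layer augmented with a
# trainable low-rank adaptation. The visual signal is "stored" in the
# small A/B factors while the base (foundation-model) weights stay fixed.
# Illustrative only -- not the repo's UniLoRA code.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the base model
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        # y = W x + (B A) x, with only A and B trainable
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```

With rank 4 and 64-dimensional features, only 512 parameters are trainable, versus 4160 in the frozen base layer — the asymmetry that makes the adaptation itself a compact representation.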

Code Structure

VisionAsAdaptations/
├── image/                          # Image pipeline (Qwen-Image)
│   ├── train_lora_multi.py         # Multi-GPU LoRA training
│   ├── reconstruct_lora_multi.py   # Reconstruction from LoRA weights
│   ├── config_train.py             # Training argument parser
│   ├── data_utils.py               # Data loading & preprocessing
│   ├── unilora_utils.py            # LoRA injection helpers
│   ├── pipelines/                  # Qwen-Image diffusion pipeline
│   ├── unilora/                    # UniLoRA layer & model implementation
│   └── descriptions/               # Image caption files
│
└── video/                          # Video pipeline (Wan2.1)
    ├── train/                      # Two-stage training
    │   ├── stage1_vm/              #   Stage 1: Visual-Memory style overfitting
    │   └── stage2_vs/              #   Stage 2: Rate-distortion with entropy term
    ├── scaling/                    # Inference-time scaling reconstruction
    ├── edit/                       # Caption editing & LoRA merging
    └── eval/                       # Evaluation (PSNR, DISTS, LPIPS, FVD)

Each subdirectory contains its own README.md with detailed environment setup and usage instructions.
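The eval/ folder reports PSNR alongside DISTS, LPIPS, and FVD. As a quick reference, PSNR for 8-bit frames follows the standard formula (the eval/ scripts use their own implementation; this is only the textbook definition):

```python
import numpy as np

# Standard PSNR for 8-bit imagery: 10 * log10(peak^2 / MSE).
def psnr(ref: np.ndarray, rec: np.ndarray, peak: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a uniform error of 16 gray levels.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 16, dtype=np.uint8)
value = psnr(a, b)
```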

Getting Started

Image Pipeline

cd image
pip install -r requirements.txt

# Training (multi-GPU)
DATA_ROOT=/path/to/images \
CACHE_DIR=/path/to/pretrained_models \
bash train_multi.sh

# Reconstruction
INIT_LORA_PATH=/path/to/lora_checkpoint.safetensors \
DATA_ROOT=/path/to/images \
CACHE_DIR=/path/to/pretrained_models \
bash reconstruct_multi.sh

See image/README.md for full details on environment variables (NUM_GPUS, OUT_DIR, DESCRIPTION_FILE, WIDTH, HEIGHT, etc.).

Video Pipeline

cd video
pip install -r requirements.txt

Training (two-stage):

cd train

# Stage 1 — Visual-Memory style single-video overfitting
FRAMES_DIR=/path/to/frames \
CAPTION_FILE=/path/to/caption.txt \
CACHE_DIR=/path/to/pretrained_models \
bash stage1_train.sh

# Stage 2 — Rate-distortion with entropy coding
INIT_LORA_PATH=/path/to/stage1_checkpoint.safetensors \
FRAMES_DIR=/path/to/frames \
CAPTION_FILE=/path/to/caption.txt \
CACHE_DIR=/path/to/pretrained_models \
bash stage2_train.sh

# Or run both stages sequentially
bash run_two_stage.sh
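Stage 2 adds an entropy (rate) term on top of the stage-1 reconstruction objective. A generic rate-distortion loss of this shape — distortion plus a weighted rate proxy over the LoRA parameters — can be sketched as follows; the repository's exact stage-2 loss lives in train/stage2_vs, and the L1 rate surrogate here is our simplification:

```python
import torch

# Generic rate-distortion objective (illustrative, not the exact repo
# loss): MSE distortion plus an L1 surrogate for the coding rate of the
# LoRA parameters, weighted by a Lagrange multiplier lam.
def rd_loss(recon, target, lora_params, lam=0.01):
    distortion = torch.mean((recon - target) ** 2)
    rate = sum(p.abs().sum() for p in lora_params)  # crude rate proxy
    return distortion + lam * rate

recon = torch.zeros(3)
target = torch.ones(3)
params = [torch.full((2,), 0.5)]
loss = rd_loss(recon, target, params)
```

Sweeping `lam` trades reconstruction quality against the size of the entropy-coded adapter, which is how a rate-distortion curve is traced out.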

Inference-time scaling:

cd scaling
bash sample.sh

Editing (caption change or LoRA merge):

cd edit
# Same LoRA, new caption
WEIGHTS=/path/to/lora.safetensors CAPTION="new prompt" bash edit_caption.sh

# Merge two LoRAs
WEIGHTS1=/path/to/lora1.safetensors WEIGHTS2=/path/to/lora2.safetensors \
CAPTION="new prompt" bash edit_merge.sh
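One simple way to merge two LoRA checkpoints is to linearly interpolate matching tensors; the sketch below assumes that rule for illustration (edit/edit_merge.sh may combine weights differently):

```python
import torch

# Illustrative LoRA merge: linear interpolation of matching tensors in
# two state dicts. Assumes both checkpoints share the same keys/shapes.
def merge_loras(sd1, sd2, alpha=0.5):
    assert sd1.keys() == sd2.keys(), "checkpoints must share keys"
    return {k: alpha * sd1[k] + (1 - alpha) * sd2[k] for k in sd1}

# Toy state dicts standing in for loaded .safetensors checkpoints.
sd1 = {"layer.A": torch.zeros(2, 2)}
sd2 = {"layer.A": torch.ones(2, 2)}
merged = merge_loras(sd1, sd2)
```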

Evaluation:

cd eval
bash eval.sh

See video/README.md and each subfolder's README for full configuration options.

Citation

If you find this work useful, please cite:

@misc{he2026compressionadaptationimplicitvisual,
      title={Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models},
      author={Jiajun He and Zongyu Guo and Zhaoyang Jia and Xiaoyi Zhang and Jiahao Li and Xiao Li and Bin Li and José Miguel Hernández-Lobato and Yan Lu},
      year={2026},
      eprint={2603.07615},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.07615},
}
