
Group Editing: Edit Multiple Images in One Go

Yue Ma, Xinyu Wang, Qianli Ma, Qinghe Wang, Mingzhe Zheng, Xiangpeng Yang, Hao Li, Chongbo Zhao, Jixuan Ying, Harry Yang, Hongyu Liu, Qifeng Chen

Accepted by CVPR 2026


🎏 Abstract

TL;DR: Group Editing enables consistent editing across multiple images in one go by combining pseudo-video modeling with explicit geometry cues.

CLICK for the full abstract

Editing a set of related images with consistent subject identity, style, and structure is challenging due to viewpoint and pose variation. Group Editing reformulates multiple images as a pseudo-temporal sequence and leverages a video-generation prior to improve global consistency. To further enhance cross-view alignment, we integrate VGGT-based geometric correspondence and flow cues into the generation process. Our implementation uses a practical 5-stage pipeline, including mask extraction, input conversion, VGGT token extraction, flow estimation, and final Wan-VACE based generation. This repository provides the research code and engineering pipeline for reproducible group editing experiments.

πŸ“€ Demo Video

Please view the demo videos, visual comparisons, and qualitative results on the Project Page.

πŸ“‹ Changelog

🚧 Todo

  • Release more demo videos and cases
  • Add one-command pipeline launcher
  • Add config-driven path management (YAML/JSON)
  • Add cleaner benchmark/evaluation scripts
  • Release training details and model cards

✨ Features

  • Group-level consistent editing across multiple input images
  • Pseudo-video reformulation for improved temporal-like coherence
  • VGGT-based geometry guidance for better correspondence alignment
  • Mask-aware subject editing using GroundingDINO + SAM
  • Flow-guided generation with multi-stage preprocessing
  • Wan-VACE + LoRA integration for controllable generation

πŸ›‘ Setup Environment

# Create conda environment
conda create -n group-edit python=3.10
conda activate group-edit

git clone https://github.com/mayuelala/GroupEditing
# Install dependencies
cd GroupEditing
pip install -r requirements.txt

Requirements

  • Python 3.10+
  • PyTorch 2.0+
  • CUDA-compatible GPU
  • Recommended: 24GB+ VRAM (higher VRAM provides smoother inference)

πŸ“₯ Model Download

This project requires several checkpoints from Hugging Face / ModelScope, plus the project LoRA.

| Component | Model ID / Source | Local Target Directory | Used In |
| --- | --- | --- | --- |
| Grounding DINO | IDEA-Research/grounding-dino-base | ./models/IDEA-Research/grounding-dino-base | utils/process-origin2mask.py |
| SAM | facebook/sam-vit-huge | ./models/facebook/sam-vit-huge | utils/process-origin2mask.py |
| VGGT | facebook/VGGT-1B | ./models/facebook/models--facebook--VGGT-1B | vggt/infer-out-from-video-4frame.py |
| Wan VACE 14B shards | Wan-AI/Wan2.1-VACE-14B | ./models/Wan-AI/Wan2.1-VACE-14B | infer-test.py |
| Wan converted T5/VAE | DiffSynth-Studio/Wan-Series-Converted-Safetensors | ./models/DiffSynth-Studio/Wan-Series-Converted-Safetensors | infer-test.py |
| Group-Editing LoRA | Heey731/group-editing (epoch-9.safetensors) | ./models/epoch-9.safetensors | infer-test.py |
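Since the table maps each component to a fixed local directory, a small preflight check can confirm everything is in place before running the pipeline. This is a sketch; the dictionary below simply restates the table's paths, and `missing_checkpoints` is an illustrative helper, not part of the repository:

```python
from pathlib import Path

# Local target directories from the table above (the LoRA entry is a single file)
REQUIRED_CHECKPOINTS = {
    "Grounding DINO": "./models/IDEA-Research/grounding-dino-base",
    "SAM": "./models/facebook/sam-vit-huge",
    "VGGT": "./models/facebook/models--facebook--VGGT-1B",
    "Wan VACE 14B": "./models/Wan-AI/Wan2.1-VACE-14B",
    "Wan converted T5/VAE": "./models/DiffSynth-Studio/Wan-Series-Converted-Safetensors",
    "Group-Editing LoRA": "./models/epoch-9.safetensors",
}

def missing_checkpoints(root="."):
    """Return the component names whose local path does not exist yet."""
    return [name for name, rel in REQUIRED_CHECKPOINTS.items()
            if not (Path(root) / rel).exists()]
```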

Download core checkpoints (GroundingDINO / SAM / VGGT)

We use pretrained models from GroundingDINO, SAM, and VGGT.
Our method additionally requires a LoRA checkpoint, provided separately.

from huggingface_hub import snapshot_download

# Download into the same local target directories listed in the table above
snapshot_download("IDEA-Research/grounding-dino-base",
                  local_dir="./models/IDEA-Research/grounding-dino-base")
snapshot_download("facebook/sam-vit-huge",
                  local_dir="./models/facebook/sam-vit-huge")
snapshot_download("facebook/VGGT-1B",
                  local_dir="./models/facebook/models--facebook--VGGT-1B")

Download Wan checkpoints (example with ModelScope)

# Wan VACE 14B
modelscope download --model Wan-AI/Wan2.1-VACE-14B \
  --local_dir ./models/Wan-AI/Wan2.1-VACE-14B

# Wan converted safetensors (T5 + VAE)
modelscope download --model DiffSynth-Studio/Wan-Series-Converted-Safetensors \
  --local_dir ./models/DiffSynth-Studio/Wan-Series-Converted-Safetensors

Download Group-Editing LoRA checkpoint

from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="Heey731/group-editing",
    filename="epoch-9.safetensors",
    local_dir="./models"
)

Note: infer-test.py currently hard-codes ckpt_path (line 45). Set it to your local LoRA checkpoint path before running; the released LoRA file is ./models/epoch-9.safetensors (source: https://huggingface.co/Heey731/group-editing).

βš”οΈ Group Editing Inference

Quick Start (5-stage pipeline)

# 1) Extract object masks from origin videos
python utils/process-origin2mask.py

# 2) Convert mask/origin videos to pipeline input format
python utils/process-mask2input.py

# 3) Extract VGGT tokens
# Optional: export VGGT_MODEL_ROOT=./models/facebook/models--facebook--VGGT-1B
python vggt/infer-out-from-video-4frame.py

# 4) Compute flow tensors from masks
python utils/2delta-batch-gpu-multi-frame.py

# 5) Run final generation
python infer-test.py
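Until the one-command launcher from the Todo list lands, the five stages can be chained with a small driver script. This is a sketch built from the commands above; `run_pipeline` and `dry_run` are illustrative names, not part of the repository:

```python
import subprocess
import sys

# The five stage scripts from the Quick Start, in execution order
PIPELINE_STAGES = [
    "utils/process-origin2mask.py",           # 1) mask extraction
    "utils/process-mask2input.py",            # 2) input conversion
    "vggt/infer-out-from-video-4frame.py",    # 3) VGGT token extraction
    "utils/2delta-batch-gpu-multi-frame.py",  # 4) flow tensors
    "infer-test.py",                          # 5) final generation
]

def run_pipeline(dry_run=False):
    """Run every stage in order, stopping on the first failure.

    With dry_run=True, only the command list is returned (nothing executes).
    """
    commands = [[sys.executable, script] for script in PIPELINE_STAGES]
    if dry_run:
        return commands
    for cmd in commands:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands
```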

Input Data Format

  • Origin video: ./test-data/Gemini-out/<id>-origin.mp4
  • Object description JSON: ./test-data/gemini-test.json
  • Generated intermediate folders:
    • ./test-data/Gemini-out-expand-5
    • ./test-data/Gemini-out-expand-5-vggt
    • ./test-data/Gemini-out-expand-5-map

πŸ“ Project Structure

Click for directory structure
Group-Editing/
├── diffsynth/                         # Core diffusion framework
│   ├── models/                        # Model definitions (Wan DiT/VACE, encoders, etc.)
│   └── pipelines/                     # Pipeline implementations (wan_video_new.py)
├── utils/
│   ├── process-origin2mask.py         # Stage-1 mask extraction (GroundingDINO + SAM)
│   ├── process-mask2input.py          # Stage-2 input conversion
│   └── 2delta-batch-gpu-multi-frame.py # Stage-4 flow tensor extraction
├── vggt/
│   └── infer-out-from-video-4frame.py # Stage-3 VGGT token extraction
├── infer-test.py                      # Stage-5 final inference
├── models/                            # Local checkpoints (ignored by git)
├── test-data/                         # Local data and intermediate files (optional)
├── requirements.txt
└── README.md

πŸ”§ Key Modifications

This repository is built on top of DiffSynth-Studio and includes project-specific edits for Group Editing:

1. infer-test.py

  • Integrates LoRA loading for Wan-VACE pipeline
  • Loads and injects VGGT tokens (vggt_tensor) and flow tensors (flow_tensor)
  • Implements practical task loop for grouped editing generation

2. vggt/infer-out-from-video-4frame.py

  • Adds masked-frame token extraction for video-style group inputs
  • Supports Hugging Face cache-style model root resolution (snapshots/<revision>)

3. utils/process-origin2mask.py + utils/2delta-batch-gpu-multi-frame.py

  • Stage-1 object mask extraction with GroundingDINO and SAM
  • Stage-4 contour/TPS-based flow map generation for guidance

4. diffsynth/models/stepvideo_text_encoder.py

  • Added import fallback for transformers API compatibility across versions:
    • from transformers import PretrainedConfig, PreTrainedModel
    • fallback to configuration_utils / modeling_utils

πŸ“ Citation

If you use this code, please cite:

@article{groupediting2026,
  title={Group Editing: Edit Multiple Images in One Go},
  author={Ma, Yue and Wang, Xinyu and Ma, Qianli and Wang, Qinghe and Zheng, Mingzhe and Yang, Xiangpeng and Li, Hao and Zhao, Chongbo and Ying, Jixuan and Liu, Hongyu and Chen, Qifeng},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

πŸ“œ License

This project is released under the Apache-2.0 License. See LICENSE for details.

πŸ’— Acknowledgements

This repository builds upon DiffSynth-Studio, Wan-VACE, VGGT, GroundingDINO, and SAM.

Thanks to the original authors and communities for open-sourcing their work.

🧿 Maintenance

This repository is maintained for research and reproducibility. If you find issues or have suggestions, please open an issue or discussion thread.
