Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe
Computer Vision and Pattern Recognition Conference (CVPR), 2025, Nashville, USA
You can directly download our test results from Google Drive (including 256×176 and 512×352 results on DeepFashion) for further comparison.
- Download `img_highres.zip` of the DeepFashion dataset from the In-shop Clothes Retrieval Benchmark.
- Unzip `img_highres.zip`. You will need to request the password from the dataset maintainers. Then unzip it and put it under the `./dataset/fashion` directory.
- Preprocess the dataset by running `prepare_dataset.py`. This splits the dataset and prepares the required conditions, such as poses, texture maps, and face embeddings. DensePose requires detectron2, which is not on PyPI; install it from source, e.g. `pip install 'git+https://github.com/facebookresearch/detectron2.git'`. The whole preprocessing takes ~8 hours. You can also download our processed conditions from Google Drive and unzip them.
- After the preprocessing, your dataset folder should be organized as follows (a sanity-check sketch is given after the layout):
```
./dataset/fashion/
|-- train
|-- train_densepose
|-- train_texture
|-- train_face
|-- test
|-- test_densepose
|-- test_texture
|-- test_face
|-- MEN
`-- WOMEN
```
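As referenced above, here is a minimal sanity check that the preprocessing produced the expected layout. This is a hypothetical helper, not part of the repo:

```python
# Hypothetical layout check -- not part of this repo.
# Verifies the folders that prepare_dataset.py is expected to produce.
from pathlib import Path

ROOT = Path("./dataset/fashion")
EXPECTED = [
    "train", "train_densepose", "train_texture", "train_face",
    "test", "test_densepose", "test_texture", "test_face",
    "MEN", "WOMEN",
]

missing = [name for name in EXPECTED if not (ROOT / name).is_dir()]
if missing:
    raise SystemExit(f"Preprocessing incomplete; missing folders: {missing}")
print("Dataset layout looks complete.")
```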
Install the Python dependencies:

```
pip install -r requirements.txt
```
- Download the pretrained weights of the base models and other components and put them under `./pretrained_weights` (a scripted download sketch is given below the layouts).
- Download our trained checkpoints from Google Drive / the HF hub and put them in the `./checkpoints` folder.

Finally, your pretrained weights should be organized as follows:
```
./pretrained_weights/
|-- model_final_844d15.pkl
|-- control_v11p_sd15_seg
|   |-- config.json
|   |-- diffusion_pytorch_model.bin
|   `-- diffusion_pytorch_model.safetensors
|-- image_encoder
|   |-- config.json
|   `-- pytorch_model.bin
|-- sd-vae-ft-mse
|   |-- config.json
|   |-- diffusion_pytorch_model.bin
|   `-- diffusion_pytorch_model.safetensors
`-- stable-diffusion-v1-5
    |-- feature_extractor
    |   `-- preprocessor_config.json
    |-- model_index.json
    |-- unet
    |   |-- config.json
    |   `-- diffusion_pytorch_model.bin
    `-- v1-inference.yaml
```
and your checkpoints folder as:

```
./checkpoints/
|-- denoising_unet.pth
|-- image_projector.pth
|-- pose_guider.pth
`-- reference_unet.pth
```
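If you prefer scripting the base-model downloads, here is a minimal sketch using `huggingface_hub`. The repo IDs below are assumptions about the usual upstream sources of these components, so verify them before use; `model_final_844d15.pkl` is a DensePose checkpoint that must be fetched separately from the detectron2 model zoo.

```python
# Download sketch with assumed repo IDs -- verify against the official sources.
from huggingface_hub import snapshot_download

# Stable Diffusion v1.5 (assumed mirror; the original runwayml repo was removed).
snapshot_download(repo_id="stable-diffusion-v1-5/stable-diffusion-v1-5",
                  local_dir="./pretrained_weights/stable-diffusion-v1-5")

# Fine-tuned VAE (assumed source).
snapshot_download(repo_id="stabilityai/sd-vae-ft-mse",
                  local_dir="./pretrained_weights/sd-vae-ft-mse")

# ControlNet segmentation model (assumed source).
snapshot_download(repo_id="lllyasviel/control_v11p_sd15_seg",
                  local_dir="./pretrained_weights/control_v11p_sd15_seg")
```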
The overall pipeline of our proposed Multi-focal Conditioned Diffusion Model: (a) face regions and appearance regions are first extracted from the source person image; (b) the multi-focal condition aggregation module combines these conditions to guide the diffusion-based generation.
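For intuition only (this is not the authors' implementation), the aggregation step can be pictured as projecting per-focus features into one shared conditioning sequence that the denoising UNet attends to. All module names and dimensions below are hypothetical:

```python
# Conceptual sketch of multi-focal condition aggregation -- hypothetical, for intuition only.
import torch
import torch.nn as nn

class MultiFocalAggregator(nn.Module):
    def __init__(self, face_dim=512, app_dim=768, cond_dim=768):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, cond_dim)  # face-embedding branch
        self.app_proj = nn.Linear(app_dim, cond_dim)    # appearance-token branch

    def forward(self, face_emb, app_tokens):
        # face_emb: (B, face_dim); app_tokens: (B, N, app_dim)
        face_tok = self.face_proj(face_emb).unsqueeze(1)  # (B, 1, cond_dim)
        app_tok = self.app_proj(app_tokens)               # (B, N, cond_dim)
        # Concatenate into one sequence for cross-attention conditioning.
        return torch.cat([face_tok, app_tok], dim=1)      # (B, N+1, cond_dim)
```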
This code supports multi-GPU training with accelerate. Full training takes ~26 hours on 2 A100-80G GPUs with a batch size of 12 on the DeepFashion dataset.

```
accelerate launch --main_process_port 12148 train.py --config ./configs/train/train.yaml
```

To test our method on the whole DeepFashion dataset, run:

```
python test.py --save_path FOLDER_TO_SAVE --ckpt_dir ./checkpoints/ --config_path ./configs/train/train.yaml
```

Then the results can be evaluated by:

```
python evaluate.py --save_path FOLDER_TO_SAVE --gt_folder FOLDER_FOR_GT --training_path ./dataset/fashion/train/
```

MCLD allows flexible editing since it decomposes human appearance and identity. We will release the editing code as soon as it is ready.
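If you want a quick, independent sanity check of generated images before running the full evaluation, here is a minimal sketch using `torchmetrics`. This is an assumption on our part; the metrics and protocol in `evaluate.py` may differ:

```python
# Independent sanity check with torchmetrics (FID needs the torch-fidelity package).
# Assumed metrics: FID and SSIM; the official evaluate.py protocol may differ.
import torch
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)

# Replace with uint8 (B, 3, H, W) batches loaded from FOLDER_TO_SAVE and FOLDER_FOR_GT.
fake = torch.randint(0, 256, (8, 3, 256, 176), dtype=torch.uint8)
real = torch.randint(0, 256, (8, 3, 256, 176), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print("FID :", fid.compute().item())
print("SSIM:", ssim(fake.float() / 255, real.float() / 255).item())
```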
If you find our work useful, please consider citing:

```
@misc{liu2025multifocalconditionedlatentdiffusion,
      title={Multi-focal Conditioned Latent Diffusion for Person Image Synthesis},
      author={Jiaqi Liu and Jichao Zhang and Paolo Rota and Nicu Sebe},
      year={2025},
      eprint={2503.15686},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.15686},
}
```
