Distribution-Aware Data Expansion with Diffusion Models

This is the official repository of Distribution-Aware Data Expansion with Diffusion Models (DistDiff).

1. Highlights

Abstract: The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion methods encompass image transformation-based and synthesis-based methods. The transformation-based methods introduce only local variations, resulting in poor diversity, while image synthesis-based methods can create entirely new content, thus significantly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, an effective data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its ability to generate distribution-consistent samples, achieving substantial improvements in data expansion tasks. Specifically, without additional training, DistDiff achieves a 30.7% improvement in accuracy across six image datasets compared to the model trained on original datasets and a 9.8% improvement compared to the state-of-the-art diffusion-based method.

2. Main Contributions

We introduce DistDiff, a novel diffusion-based data expansion algorithm that produces distribution-consistent augmented data without retraining the diffusion model.
We use hierarchical prototypes to approximate the data distribution and propose distribution-aware energy guidance, applied at both the class level and the group level during diffusion sampling.
Experimental results show that DistDiff generates high-quality samples and significantly outperforms existing image transformation and synthesis methods.
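
To make the idea of class-level and group-level prototypes more concrete, the following is a minimal sketch and not the repository's implementation. It assumes image features are plain lists of floats, uses the feature mean as the class-level prototype, uses a basic k-means as a stand-in for whatever grouping the actual code performs, and treats squared Euclidean distance to the prototypes as the energy. All function names here are hypothetical.

```python
import random


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def group_prototypes(vectors, num_groups, iterations=10, seed=0):
    """Group-level prototypes via plain k-means (an assumed stand-in)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, num_groups)
    for _ in range(iterations):
        buckets = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda j: squared_distance(v, centers[j]))
            buckets[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [mean_vector(b) if b else c
                   for b, c in zip(buckets, centers)]
    return centers


def hierarchical_energy(feature, class_proto, group_protos):
    """Assumed energy: distance to the class prototype plus distance
    to the closest group prototype. Lower means closer to the data."""
    class_term = squared_distance(feature, class_proto)
    group_term = min(squared_distance(feature, g) for g in group_protos)
    return class_term + group_term


if __name__ == "__main__":
    # Toy 2-D features for a single class with two visible sub-groups.
    features = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    class_proto = mean_vector(features)
    groups = group_prototypes(features, num_groups=2)
    print(hierarchical_energy([0.0, 0.5], class_proto, groups))
    print(hierarchical_energy([20.0, 20.0], class_proto, groups))
```

In this toy setting, a feature that sits near one of the sub-groups receives a lower energy than one far from all prototypes, which is the kind of signal energy guidance can use to steer sampling toward the real data distribution. Please refer to the paper for the actual formulation.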

3. Method Pipeline

Please refer to the paper for more technical details.

4. How to Run Experiments?

4.1 Setup & Preparation

Please follow the instructions detailed in INSTALL.md to install necessary libraries and prepare all datasets.

4.2 Run Experiments

1) Hierarchical Prototypes Construction

First, construct the hierarchical prototypes:

python extract_prototypes.py -a CLIP-VIT-B32 -d ${DATASET}

The generated results will be saved in data/prototypes/.

2) Data Expansion

We run data expansion on 4 GPUs in parallel, one process per GPU. Each process is assigned its GPU through CUDA_VISIBLE_DEVICES=N and its share of the work through --split N, where N ranges from 0 to 3. The last two arguments of each command below set these two values.

cd src/
sh scripts/exps/expand_diff.sh ${DATASET} ${START} ${END} 0.8 ${K} 0 0 &
sh scripts/exps/expand_diff.sh ${DATASET} ${START} ${END} 0.8 ${K} 1 1 &
sh scripts/exps/expand_diff.sh ${DATASET} ${START} ${END} 0.8 ${K} 2 2 &
sh scripts/exps/expand_diff.sh ${DATASET} ${START} ${END} 0.8 ${K} 3 3

If your system does not have 4 GPUs, set the total_split parameter in scripts/exps/expand_diff.sh to the number of GPUs you actually have, and adjust the SPLIT and GPU arguments of each command accordingly.
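
As an illustration of adapting the launch commands to a different GPU count, here is a small hedged sketch that only builds the command strings. It assumes the last two positional arguments are the split index and the GPU index, as in the commands above, and copies the fourth argument (0.8) verbatim. The helper name build_expand_commands and the example values are hypothetical, and total_split still has to be changed by hand in scripts/exps/expand_diff.sh.

```python
import shlex


def build_expand_commands(dataset, start, end, k, num_gpus):
    """Return one expand_diff.sh command line per GPU.

    Assumption: the positional arguments follow the order used in this
    README (DATASET START END 0.8 K SPLIT GPU), with split and GPU
    index both equal to the process index.
    """
    commands = []
    for index in range(num_gpus):
        args = [
            "sh", "scripts/exps/expand_diff.sh",
            dataset, str(start), str(end),
            "0.8",       # copied verbatim from the README commands
            str(k),
            str(index),  # SPLIT
            str(index),  # GPU
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # Placeholder values; substitute your own DATASET, START, END and K.
    for line in build_expand_commands("my_dataset", 0, 5, 3, num_gpus=2):
        print(line)
```

The printed lines can be run from src/, appending & to all but the last one so that the processes run in parallel.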

3) Training and Evaluation

After data expansion, use the following script to train and evaluate the backbone on the expanded datasets.

sh scripts/exps/train_expand.sh ${DATASET} ${BACKBONE} 0.1 ${EXP} False ${GPU}

You can also run all of the previous steps together with:

sh single_exp.sh

Citation

If you find our work inspiring in your research, please cite it:

@article{zhu2024distributionaware,
  title={Distribution-Aware Data Expansion with Diffusion Models},
  author={Haowei Zhu and Ling Yang and Jun-Hai Yong and Wentao Zhang and Bin Wang},
  journal={arXiv preprint arXiv:2403.06741},
  year={2024}
}

Acknowledgments

Our code is based on the GIF-SD repository. We thank the authors for releasing their code. If you use our model and code, please consider citing their work as well.
