Multisize Dataset Condensation

[Paper] | [BibTeX]

Official PyTorch implementation of "Multisize Dataset Condensation", published at ICLR'24 (Oral)

Abstract While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.

Code

Key files:

condense_reg.py: main file of the condensation process.
reg_ipcx.py: helper class (class Regularizer) and functions to maintain and update the most learnable subset (MLS).

Key functions (reg_ipcx.py):

Paper Function	Function Name
Feature Distance Calculation	def feat_loss_for_ipc_reg():
Feature Distance Comparison	def select_reg_ipc():
MLS Freezing Judgement	def get_freeze_ipc():
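
The three steps in the table can be pictured with a small, framework-free sketch. It is an illustration under stated assumptions, not the code in reg_ipcx.py: the names `select_mls` and `update_frozen` are hypothetical, feature distances are given as plain floats per subset size, the MLS is assumed to be the subset whose feature distance dropped at the highest rate, and the previous MLS is assumed to be frozen only when the newly selected MLS is larger.

```python
def select_mls(prev_dist, curr_dist):
    """Pick the subset size with the largest relative drop in feature distance.

    prev_dist / curr_dist map a subset size (e.g. 1..N-1) to a feature
    distance measured at two consecutive checkpoints (assumed inputs).
    """
    rates = {}
    for size, prev in prev_dist.items():
        # Relative reduction; guard against a zero previous distance.
        rates[size] = (prev - curr_dist[size]) / prev if prev > 0 else 0.0
    return max(rates, key=rates.get)


def update_frozen(frozen, prev_mls, new_mls):
    """Return how many leading images per class stay frozen.

    Assumption: when the MLS grows, the images of the previous MLS are
    frozen; otherwise the frozen count is unchanged.
    """
    if prev_mls is not None and new_mls > prev_mls:
        return max(frozen, prev_mls)
    return frozen


if __name__ == "__main__":
    prev = {1: 10.0, 2: 8.0, 3: 6.0}
    curr = {1: 9.0, 2: 6.0, 3: 5.7}
    mls = select_mls(prev, curr)
    print("selected MLS size:", mls)
    print("frozen images per class:", update_frozen(0, 1, mls))
```

For the actual selection and freezing rules, refer to `select_reg_ipc()` and `get_freeze_ipc()` in reg_ipcx.py.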

Basic Usage

Installation

Download repo:

git clone https://github.com/he-y/Multisize-Dataset-Condensation MDC
cd MDC

Create pytorch environment:

conda env create -f environment.yaml
conda activate mdc

Condensing

MDC Condense:

python condense_reg.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True

# Example on CIFAR-10, IPC10
python condense_reg.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True

Condensation can also be run in parallel across different classes. (See Appendix B.6, which shows that accuracy remains stable under this parallel running.)

python condense_reg_mp.py  --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True --nclass_sub [NUM_SUB_CLASS] --phase [PHASE_ID]

# Example on CIFAR-10, IPC10: two jobs separately condense classes 1-5 and 6-10
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 0 &
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 1 &
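
The relation between `--nclass_sub`, `--phase` and the classes each job handles can be sketched as follows. This is a minimal illustration that assumes phase `p` covers the contiguous block of `nclass_sub` classes starting at `p * nclass_sub` (0-indexed), which matches the CIFAR-10 example above; the helpers `class_indices` and `phase_commands` are hypothetical and are not part of the repository. Only flags that appear in the commands above are used.

```python
def class_indices(phase, nclass_sub, nclass):
    """0-indexed classes handled by one job (assumed contiguous blocks)."""
    start = phase * nclass_sub
    if start >= nclass:
        raise ValueError("phase is out of range for this dataset")
    return list(range(start, min(start + nclass_sub, nclass)))


def phase_commands(dataset, factor, ipc, nclass_sub, nclass):
    """Build one argv list per phase so that all classes are covered."""
    n_phases = -(-nclass // nclass_sub)  # ceiling division
    return [
        ["python", "condense_reg_mp.py", "--reproduce",
         "-d", dataset, "-f", str(factor), "--ipc", str(ipc),
         "--adaptive_reg", "True",
         "--nclass_sub", str(nclass_sub), "--phase", str(p)]
        for p in range(n_phases)
    ]


if __name__ == "__main__":
    for cmd in phase_commands("cifar10", 2, 10, nclass_sub=5, nclass=10):
        phase = int(cmd[-1])
        print(" ".join(cmd), "# classes", class_indices(phase, 5, 10))
```

Each printed command could then be launched as a separate job, for example with `subprocess.Popen(cmd)` followed by `wait()` on every process.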

Testing

To evaluate a condensed dataset, run:

python test.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --test_type [CHOICES] --test_data_dir [PATH_TO_CONDENSED_DATA_DIR] --ipcy [IPCY]

# Example of evaluating the performance of IPC5 from CIFAR-10, IPC10 (repeating 3 times).
python test.py --reproduce -d cifar10 -f 2 --ipc 10 --test_type cx_cy --test_data_dir ./path_to_ipc10_data --ipcy 5 --repeat 3

Test Types	Explanation
`other`	(default) evaluate the condensed dataset
`cx_cy`	choose IPC[Y] images from the total IPC images, e.g., choose IPC5 from IPC10
`baseline_b`	concatenate all IPC[1] images to form an IPC[N] dataset
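
As a rough picture of what the `cx_cy` and `baseline_b` test types describe, the sketch below works on plain Python lists. It is an assumption-laden illustration, not the logic in test.py: the helper names are hypothetical, condensed data is modeled as a dict from class label to a list of images, and `cx_cy` is assumed to take the first IPC[Y] images of each class.

```python
def take_ipcy(images_by_class, ipcy):
    """cx_cy (assumed): keep the first `ipcy` images of every class."""
    subset = {}
    for label, images in images_by_class.items():
        if ipcy > len(images):
            raise ValueError(f"class {label} has fewer than {ipcy} images")
        subset[label] = images[:ipcy]
    return subset


def concat_ipc1(ipc1_datasets):
    """baseline_b (assumed): join N separate IPC1 datasets into one IPC[N] set."""
    merged = {}
    for dataset in ipc1_datasets:
        for label, images in dataset.items():
            merged.setdefault(label, []).extend(images)
    return merged


if __name__ == "__main__":
    # Placeholder strings stand in for condensed image tensors.
    ipc10 = {c: [f"img_c{c}_{i}" for i in range(10)] for c in range(2)}
    print(take_ipcy(ipc10, 5)[0])
    runs = [{0: [f"run{r}_c0"], 1: [f"run{r}_c1"]} for r in range(3)]
    print(concat_ipc1(runs))
```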

Table Results (Google Drive)

The condensed data used in our experiments can be downloaded from Google Drive, including:

No.	Content	Datasets	Methods
Table 1	Baseline Comparison	SVHN CIFAR-10 CIFAR-100 ImageNet-10	Baseline A Baseline B Baseline C MDC
Table 2	SOTA Comparison	CIFAR-10 CIFAR-100	DC (ICLR'21) DSA (ICML'21) MTT (CVPR'22) IDC (ICML'22) DREAM (ICCV'23) MDC (Ours)
Table 3	Ablation study on three components: Calculate, Compare, Freeze	CIFAR-10	MDC
Table 4	Cross Architecture Performance	CIFAR-10	Baseline A Baseline B Baseline C MDC (ResNet, DenseNet)
Table 5	Evaluation Metric Comparison on three metrics: Gradient Distance, Feature Distance, Accuracy Difference	CIFAR-10	MDC
Table 6	Effects of different condensation runs	CIFAR-10	MDC
Appendix
Table 7	Feature Distance	(Skipping)
Table 8	MDC on DREAM	CIFAR-10	IDC (ICML'22) DREAM (ICCV'23)
Table 9	Primary Result with Std.	See Table 1 (logs `xxx.txt`)
Table 10	Details of condensation run (58.37), e.g., per 100 step performance	CIFAR-10	MDC
Table 11	Details of condensation run (59.55), e.g., per 100 step performance	CIFAR-10	MDC
Table 12	Class-wise MDC	CIFAR-10	MDC

Related Repos

Our code builds mainly on the following papers and repos:

Dataset Condensation via Efficient Synthetic-Data Parameterization: [Paper], [Code]
DREAM: Efficient Dataset Distillation by Representative Matching: [Paper], [Code]
Dataset Condensation with Gradient Matching: [Paper], [Code]

Citation

@inproceedings{he2024multisize,
  title={Multisize Dataset Condensation},
  author={He, Yang and Xiao, Lingao and Zhou, Joey Tianyi and Tsang, Ivor},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
