Multisize Dataset Condensation

[Paper] | [BibTeX]

Official PyTorch implementation of "Multisize Dataset Condensation", published at ICLR'24 (Oral)

Abstract: While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.
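
The subset degradation problem and the idea of adding a subset loss on top of the basic condensation loss can be illustrated with a toy sketch. It assumes a squared distance between mean feature vectors as a stand-in for the condensation loss, which may differ from the loss used in this repo, and the helper names `mean_feature_distance` and `total_loss` are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_feature_distance(synthetic, real):
    """Toy loss (assumption): squared L2 distance between mean features."""
    ms, mr = mean_vector(synthetic), mean_vector(real)
    return sum((a - b) ** 2 for a, b in zip(ms, mr))


def total_loss(synthetic, real, subset_size):
    """Base loss on all synthetic samples plus a loss on the first `subset_size`."""
    base = mean_feature_distance(synthetic, real)
    subset = mean_feature_distance(synthetic[:subset_size], real)
    return base + subset


real = [[0.0, 0.0], [2.0, 2.0]]          # mean feature is (1, 1)
synthetic = [[3.0, 3.0], [-1.0, -1.0]]   # full-set mean is also (1, 1)

# The full synthetic set matches the real data, so both terms vanish.
print(total_loss(synthetic, real, subset_size=2))  # 0.0
# The one-image subset does not match: the subset term exposes the degradation.
print(total_loss(synthetic, real, subset_size=1))  # 8.0
```

In this toy case the full set looks perfect under the base loss alone, while the subset term penalizes the unrepresentative first image.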

Code

Key files:

  • condense_reg.py: main file of the condensation process.
  • reg_ipcx.py: helper class (class Regularizer) and functions to maintain and update the most learnable subset (MLS).

Key functions (reg_ipcx.py):

Paper Function | Function Name
Feature Distance Calculation | def feat_loss_for_ipc_reg():
Feature Distance Comparison | def select_reg_ipc():
MLS Freezing Judgement | def get_freeze_ipc():
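
How the three steps might fit together can be sketched as follows. This is not the code in reg_ipcx.py: it assumes, as a simplification, that the MLS is the subset size whose feature distance dropped the most since the previous check, and that the images of a previously selected smaller subset are frozen once a larger subset is selected. The names `select_mls` and `frozen_count` are hypothetical.

```python
def distance_drops(prev_dist, curr_dist):
    """Decrease in feature distance per subset size between two checks."""
    return {k: prev_dist[k] - curr_dist[k] for k in prev_dist}


def select_mls(prev_dist, curr_dist):
    """Pick the subset size with the largest drop (assumed selection rule)."""
    drops = distance_drops(prev_dist, curr_dist)
    return max(drops, key=drops.get)


def frozen_count(prev_mls, curr_mls, already_frozen):
    """Number of leading images to stop updating (assumed freezing rule)."""
    if curr_mls > prev_mls:
        # Moving to a larger subset: protect the smaller one that was learned.
        return max(already_frozen, prev_mls)
    return already_frozen


# Feature distances keyed by subset size (images per class), made-up numbers.
prev = {1: 10.0, 2: 8.0, 3: 6.0}
curr = {1: 9.0, 2: 4.0, 3: 5.5}

mls = select_mls(prev, curr)
print(mls)  # 2
print(frozen_count(prev_mls=1, curr_mls=mls, already_frozen=0))  # 1
```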

Basic Usage

Installation

Download repo:

git clone https://github.com/he-y/Multisize-Dataset-Condensation MDC
cd MDC

Create the PyTorch environment:

conda env create -f environment.yaml
conda activate mdc

Condensing

MDC Condense:

python condense_reg.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True

# Example on CIFAR-10, IPC10
python condense_reg.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True

Condensation can also be run in parallel over disjoint groups of classes. (Appendix B.6 of the paper shows that accuracy remains stable under this parallel setup.)

python condense_reg_mp.py  --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True --nclass_sub [NUM_SUB_CLASS] --phase [PHASE_ID]

# Example on CIFAR-10, IPC10: two jobs separately condense classes 1-5 and 6-10
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 0 &
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 1 &
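
The relation between `--nclass_sub`, `--phase` and the classes each job handles can be sketched as below. This assumes contiguous, zero-based class blocks, which is consistent with the example above (phase 0 covering the first five classes and phase 1 the last five) but is not taken from condense_reg_mp.py. The helper `classes_for_phase` is hypothetical.

```python
def classes_for_phase(phase, nclass_sub, nclass):
    """Zero-based class indices assumed to be condensed by one job."""
    start = phase * nclass_sub
    if phase < 0 or start >= nclass:
        raise ValueError("phase is out of range for this dataset")
    return list(range(start, min(start + nclass_sub, nclass)))


# CIFAR-10 split into two jobs of five classes each.
print(classes_for_phase(phase=0, nclass_sub=5, nclass=10))  # [0, 1, 2, 3, 4]
print(classes_for_phase(phase=1, nclass_sub=5, nclass=10))  # [5, 6, 7, 8, 9]
```

Under this assumption, covering a dataset takes ceil(nclass / nclass_sub) phases.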

Testing

To evaluate a condensed dataset, run:

python test.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --test_type [CHOICES] --test_data_dir [PATH_TO_CONDENSED_DATA_DIR] --ipcy [IPCY]

# Example of evaluating the performance of IPC5 from CIFAR-10, IPC10 (repeating 3 times).
python test.py --reproduce -d cifar10 -f 2 --ipc 10 --test_type cx_cy --test_data_dir ./path_to_ipc10_data --ipcy 5 --repeat 3

Test Type | Explanation
other (default) | evaluate the condensed dataset
cx_cy | choose IPC[Y] images from the total IPC images, e.g., choose IPC5 from IPC10
baseline_b | concatenate all IPC[1] images to form an IPC[N] dataset
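
The two non-default test types can be pictured with a small sketch. It assumes the condensed data is held as a mapping from class label to a list of images and that `cx_cy` keeps the first IPC[Y] images of each class; the actual storage format and selection order in test.py may differ. The function names are hypothetical.

```python
def select_cx_cy(condensed, ipcy):
    """Keep the first `ipcy` images of every class (assumed selection order)."""
    return {label: images[:ipcy] for label, images in condensed.items()}


def build_baseline_b(ipc1_sets):
    """Concatenate N separately condensed IPC1 sets into one IPC[N] set."""
    merged = {}
    for ipc1 in ipc1_sets:
        for label, images in ipc1.items():
            merged.setdefault(label, []).extend(images)
    return merged


# Strings stand in for image tensors.
ipc3 = {"cat": ["c0", "c1", "c2"], "dog": ["d0", "d1", "d2"]}
print(select_cx_cy(ipc3, ipcy=2))
# {'cat': ['c0', 'c1'], 'dog': ['d0', 'd1']}

runs = [{"cat": ["a"], "dog": ["x"]}, {"cat": ["b"], "dog": ["y"]}]
print(build_baseline_b(runs))
# {'cat': ['a', 'b'], 'dog': ['x', 'y']}
```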

Table Results (Google Drive)

The condensed data used in our experiments can be downloaded from Google Drive, including:

No. | Content | Datasets | Methods
Table 1 | Baseline Comparison | SVHN, CIFAR-10, CIFAR-100, ImageNet-10 | Baseline A, Baseline B, Baseline C, MDC
Table 2 | SOTA Comparison | CIFAR-10, CIFAR-100 | DC (ICLR'21), DSA (ICML'21), MTT (CVPR'22), IDC (ICML'22), DREAM (ICCV'23), MDC (Ours)
Table 3 | Ablation study on three components: Calculate, Compare, Freeze | CIFAR-10 | MDC
Table 4 | Cross Architecture Performance | CIFAR-10 | Baseline A, Baseline B, Baseline C, MDC (ResNet, DenseNet)
Table 5 | Evaluation Metric Comparison on three metrics: Gradient Distance, Feature Distance, Accuracy Difference | CIFAR-10 | MDC
Table 6 | Effects of different condensation runs | CIFAR-10 | MDC

Appendix
Table 7 | Feature Distance (Skipping)
Table 8 | MDC on DREAM | CIFAR-10 | IDC (ICML'22), DREAM (ICCV'23)
Table 9 | Primary Result with Std. | See Table 1 (logs xxx.txt)
Table 10 | Details of condensation run (58.37), e.g., per 100 step performance | CIFAR-10 | MDC
Table 11 | Details of condensation run (59.55), e.g., per 100 step performance | CIFAR-10 | MDC
Table 12 | Class-wise MDC | CIFAR-10 | MDC

Related Repos

Our code is mainly built on the following papers and repos:

  • Dataset Condensation via Efficient Synthetic-Data Parameterization: [Paper], [Code]
  • DREAM: Efficient Dataset Distillation by Representative Matching: [Paper], [Code]
  • Dataset Condensation with Gradient Matching: [Paper], [Code]

Citation

@inproceedings{he2024multisize,
  title={Multisize Dataset Condensation},
  author={He, Yang and Xiao, Lingao and Zhou, Joey Tianyi and Tsang, Ivor},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
