DISC: Learning from Noisy Labels via Dynamic Instance-Specific Selection and Correction

[paper], [sup], [video], [poster]

This repository is the official implementation of the CVPR2023 paper "DISC: Learning from Noisy Labels via Dynamic Instance-Specific Selection and Correction". This repository includes several baseline methods and supports almost all commonly used benchmarks and label noise types. It can serve as a library for LNL models.

Title: DISC: Learning from Noisy Labels via Dynamic Instance-Specific Selection and Correction
Authors: Yifan Li, Hu Han, Shiguang Shan, Xilin Chen
Institute: Institute of Computing Technology, Chinese Academy of Sciences

Citing DISC

If you find this repo useful, please cite the following BibTeX entry. Thank you very much!

@InProceedings{Li_2023_DISC,
    author    = {Li, Yifan and Han, Hu and Shan, Shiguang and Chen, Xilin},
    title     = {DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {24070-24079}
}


The framework of DISC

1. Abstract

Existing studies indicate that deep neural networks (DNNs) can eventually memorize the label noise. We observe that the memorization strength of DNNs towards each instance is different and can be represented by the confidence value, which becomes larger and larger during the training process. Based on this, we propose a Dynamic Instance-specific Selection and Correction method (DISC) for learning from noisy labels (LNL). We first use a two-view-based backbone for image classification, obtaining confidences for each image from two views. Then we propose a dynamic threshold strategy for each instance, based on the momentum of each instance's memorization strength in previous epochs to select and correct noisy labeled data. Benefiting from the dynamic threshold strategy and two-view learning, we can effectively group each instance into one of the three subsets (i.e., clean, hard, and purified) based on the prediction consistency and discrepancy by two views at each epoch. Finally, we employ different regularization strategies to conquer subsets with different degrees of label noise, improving the whole network's robustness. Comprehensive evaluations on three controllable and four real-world LNL benchmarks show that our method outperforms the state-of-the-art (SOTA) methods to leverage useful information in noisy data while alleviating the pollution of label noise.
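As a rough illustration of the idea in the abstract (this is a sketch, not the authors' implementation; the function names, the EMA momentum constant, and the exact grouping rules are assumptions), the per-instance dynamic threshold can be maintained as an exponential moving average of each sample's confidence, and the two views' predictions can then be compared against it to split the data into clean, hard, and purified subsets:

```python
import numpy as np

def update_thresholds(thresholds, confidences, momentum=0.9):
    """EMA update of per-instance thresholds from current confidences.

    thresholds, confidences: arrays of shape (num_samples,).
    The threshold tracks each instance's memorization strength over epochs.
    """
    return momentum * thresholds + (1.0 - momentum) * confidences

def partition(conf_weak, conf_strong, labels, thresholds):
    """Group samples into clean / hard / purified index sets.

    conf_weak, conf_strong: (num_samples, num_classes) softmax outputs
    from the two views. A sample is 'clean' if both views confidently
    agree with its given label, 'hard' if only one view does, and
    'purified' (a candidate for label correction) if both views
    confidently agree with each other but disagree with the given label.
    """
    pred_w, pred_s = conf_weak.argmax(1), conf_strong.argmax(1)
    above_w = conf_weak.max(1) > thresholds
    above_s = conf_strong.max(1) > thresholds
    agree_label_w = (pred_w == labels) & above_w
    agree_label_s = (pred_s == labels) & above_s
    clean = np.where(agree_label_w & agree_label_s)[0]
    hard = np.where(agree_label_w ^ agree_label_s)[0]
    purified = np.where((pred_w == pred_s) & above_w & above_s
                        & (pred_w != labels))[0]
    return clean, hard, purified
```

Because the threshold is instance-specific and grows with the instance's own confidence history, easy samples are admitted early while heavily memorized noisy samples must clear an ever-higher bar.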

2. Requirements

The code requires python>=3.7 and the following packages.

torch==1.8.0
torchvision==0.9.0
numpy==1.19.4
scipy==1.6.0
addict==2.4.0
tqdm==4.64.0
nni==2.5

These packages can be installed directly by running the following command:

pip install -r requirements.txt

Note that all experiments were conducted on a single RTX 3090, so the results may differ slightly from the original paper if you use a different GPU.

3. Datasets

This code supports seven datasets: CIFAR-10, CIFAR-100, Tiny-ImageNet, Animals-10N, Food-101, Mini-WebVision (the top-50 classes from WebVision V1.0 (training and validation sets) and the ILSVRC-2012 validation set), and Clothing1M.

| Dataset | Download link |
| --- | --- |
| CIFAR-10 | link |
| CIFAR-100 | link |
| Tiny-ImageNet | link |
| Animals-10N | link |
| Food-101 | link |
| WebVision V1.0 | link |
| ILSVRC-2012 | link |
| Clothing1M | link |

To run on one of these datasets, download it into your data directory and change the dataset path in the corresponding bash script (see the next section).

4. Reproduce the results of DISC

To reproduce the results of DISC, adjust the hyper-parameters in the bash scripts under shs/ for the different datasets.

4.1 Synthetic dataset (CIFAR-10, CIFAR-100, Tiny-ImageNet)

For instance, for the synthetic label-noise dataset CIFAR-10, an example script looks like this:

model_name='DISC' # the extra model name of the algorithm
noise_type='ins' # the label noise type, which could be: 'ins', 'sym', 'asym'
gpuid='1' # the gpu to assign
seed='1' # the random seed
save_path='./logs/' # the directory for saving logs
data_path='/data/yfli/CIFAR10' # the directory of dataset
config_path='./configs/DISC_CIFAR.py' # the configuration file of algorithm for different datasets
dataset='cifar-10' # the name of dataset
num_classes=10 # the class number of dataset
noise_rates=(0.2 0.4 0.6) # the noise rate for synthetic label noise datasets

for noise_rate in "${noise_rates[@]}"
do
    python main.py -c=$config_path  --save_path=$save_path --noise_type=$noise_type --seed=$seed --gpu=$gpuid --percent=$noise_rate --dataset=$dataset --num_classes=$num_classes  --root=$data_path --model_name=disc
done

To run DISC on CIFAR-10 with instance-dependent noise, change the data directory to yours and run:

bash shs/DISC_cifar10.sh

Furthermore, there are three types of label noise to choose from: symmetric noise ('sym'), asymmetric noise ('asym'), and instance noise ('ins'). You can also adjust the noise ratio by changing the 'noise_rates' hyper-parameter.
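For reference, symmetric label noise (the 'sym' option) is typically simulated by flipping a chosen fraction of labels uniformly at random to a different class. A minimal sketch of that convention (the function name and seeding are assumptions, not the repository's exact implementation):

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Flip `noise_rate` of the labels uniformly at random to a
    *different* class, as in standard symmetric-noise benchmarks."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n = len(noisy)
    # Pick exactly noise_rate * n samples to corrupt, without replacement.
    idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in idx:
        # Sample a replacement class that differs from the current label.
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

Asymmetric noise instead flips labels only between semantically similar class pairs, and instance-dependent noise makes the flip probability depend on each image's features.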

4.2 Real-world noise datasets (Animals-10N, Food-101, Mini-WebVision, Clothing1M)

To run DISC on Animals-10N, the bash script looks like this:

model_name='DISC' # the extra model name of the algorithm
gpuid='0' # the gpu to assign
save_path='./logs/' # the directory for saving logs
data_path='/data/yfli/Animal10N/' # the directory of the dataset; change this to your own path
config_path='./configs/DISC_animal10N.py'
dataset='animal10N'
num_classes=10

python main.py -c=$config_path --save_path=$save_path \
               --gpu=$gpuid --model_name=$model_name \
               --root=$data_path \
               --dataset=$dataset --num_classes=$num_classes

All you need to do is change the data path to yours, then run:

bash shs/DISC_animal10N.sh

5. Other methods

Currently, this codebase includes many other baselines (under algorithms/): Co-learning, Co-teaching, Co-teaching+, Decoupling, ELR, GJS, JoCoR, JointOptim, MetaLearning, Mixup, NL, and PENCIL.

However, these methods currently only support the CIFAR-10/100 datasets. You can adapt the DISC code to run them on other benchmarks as needed.

We hope this repository will serve as a codebase for LNL in the future. Anyone who wishes to contribute can do so by submitting a pull request or forking it to their own repository.

6. Reference

This codebase refers to Co-learning [link], DivideMix [link], and ELR [link]. Thank you all!

7. Contact

If you have any other questions, please contact liyifan20g@ict.ac.cn.

License

This repo is licensed under MIT License.
