
Fix: data split issue #404

Merged
merged 1 commit into from Jul 14, 2022

Conversation

jeongHwarr
Contributor

Description

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

@ashwinvaidya17 ashwinvaidya17 self-requested a review July 5, 2022 17:22
@ashwinvaidya17
Collaborator

What's wrong with setting the seed to a value other than 0? If we make this change, we will have to ensure that lines like

if config.project.seed != 0:
    seed_everything(config.project.seed)

are changed in other files (train.py, etc.) as well.

@jeongHwarr
Contributor Author

@ashwinvaidya17

If the seed is non-zero, there is no problem.
However, if the seed is 0, the test data is also included in the training data, so the performance is abnormally high.

Does seed 0 mean that the seed is not fixed in this project?
If so, the code seems to need a lot of modification.

It looks like the intent was for the samples to be shared across calls, but instead each call re-creates them using the same seed:

if not normal_test_dir:
    samples = split_normal_images_in_train_set(
        samples=samples, split_ratio=split_ratio, seed=seed, normal_label="normal"
    )

Now the PR code solves the data contamination problem, but with seed 0 the training and test data are created identically on every run.

@ashwinvaidya17
Collaborator

@jeongHwarr I'm not sure I quite understand the contamination problem. Currently we treat seed 0 as having no seed. However, even with seed 0 I get no overlap between the train split and the test split. Here are the sample lines of code I am using to test this; they do not raise an AssertionError across multiple runs.

from pathlib import Path

from anomalib.data.folder import make_dataset

dataset = make_dataset(
    normal_dir=Path("../datasets/bottle/good/"),
    abnormal_dir=Path("../datasets/bottle/broken_large"),
    split_ratio=0.33,
    seed=0,
)

train_split = dataset[dataset.split == "train"]
test_split = dataset[dataset.split == "test"]
assert train_split["image_path"].values.all() != test_split["image_path"].values.all()

Here is another sample using the Folder DataModule:

folder_dataset = Folder(
    root=Path("../datasets/bottle"),
    normal_dir="good",
    abnormal_dir="broken_large",
    split_ratio=0.1,
    seed=0,
    image_size=256,
)
folder_dataset.setup()
assert (
    folder_dataset.test_data.samples["image_path"].values.all()
    != folder_dataset.train_data.samples["image_path"].values.all()
)
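As an aside, `.values.all()` reduces each column to a single value, so comparing the two results is not a real overlap test. A stricter check (an illustrative sketch with synthetic paths, not code from the PR) intersects the path sets directly:

```python
import pandas as pd

# Illustrative sketch: build a synthetic sample frame and verify that no
# image path appears in both the train and the test split.
samples = pd.DataFrame(
    {
        "image_path": [f"img_{i}.png" for i in range(10)],
        "split": ["train"] * 7 + ["test"] * 3,
    }
)
train_paths = set(samples.loc[samples.split == "train", "image_path"])
test_paths = set(samples.loc[samples.split == "test", "image_path"])
assert not (train_paths & test_paths), "data contamination: splits overlap"
```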

@jeongHwarr
Contributor Author

@ashwinvaidya17

There is no overlap in your test code because you called make_dataset only once.

However, in this project's code, make_dataset is called multiple times. If the seed is not fixed, each call produces a different split, so the samples will not come out the same.

if not normal_test_dir:
    samples = split_normal_images_in_train_set(
        samples=samples, split_ratio=split_ratio, seed=seed, normal_label="normal"
    )

To reproduce this situation, set normal_test_dir to None in the yaml so that the branch above is executed, then compare the training data and the test data. You will find the overlap there.

make_dataset is called multiple times through the code below.

if stage in (None, "fit"):
    self.train_data = FolderDataset(
        normal_dir=self.normal_dir,
        abnormal_dir=self.abnormal_dir,
        normal_test_dir=self.normal_test,
        split="train",
        split_ratio=self.split_ratio,
        mask_dir=self.mask_dir,
        pre_process=self.pre_process_train,
        extensions=self.extensions,
        task=self.task,
        seed=self.seed,
        create_validation_set=self.create_validation_set,
    )

if self.create_validation_set:
    self.val_data = FolderDataset(
        normal_dir=self.normal_dir,
        abnormal_dir=self.abnormal_dir,
        normal_test_dir=self.normal_test,
        split="val",
        split_ratio=self.split_ratio,
        mask_dir=self.mask_dir,
        pre_process=self.pre_process_val,
        extensions=self.extensions,
        task=self.task,
        seed=self.seed,
        create_validation_set=self.create_validation_set,
    )

self.test_data = FolderDataset(
    normal_dir=self.normal_dir,
    abnormal_dir=self.abnormal_dir,
    split="test",
    normal_test_dir=self.normal_test,
    split_ratio=self.split_ratio,
    mask_dir=self.mask_dir,
    pre_process=self.pre_process_val,
    extensions=self.extensions,
    task=self.task,
    seed=self.seed,
    create_validation_set=self.create_validation_set,
)

if stage == "predict":
    self.inference_data = InferenceDataset(
        path=self.root, image_size=self.image_size, transform_config=self.transform_config_val
    )

Each FolderDataset then calls make_dataset internally:

self.samples = make_dataset(
    normal_dir=normal_dir,
    abnormal_dir=abnormal_dir,
    normal_test_dir=normal_test_dir,
    mask_dir=mask_dir,
    split=split,
    split_ratio=split_ratio,
    seed=seed,
    create_validation_set=create_validation_set,
    extensions=extensions,
)

To solve this problem, make_dataset should be called only once, and the part below must be handled outside of make_dataset:

# Get the data frame for the split.
if split is not None and split in ["train", "val", "test"]:
    samples = samples[samples.split == split]
    samples = samples.reset_index(drop=True)
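The proposed restructuring can be sketched as follows (a minimal illustration with a synthetic frame; the helper name is hypothetical): the random split happens exactly once, and the per-stage filtering becomes a pure, deterministic selection.

```python
import pandas as pd

# Hedged sketch of the proposed fix: build the full sample frame once,
# then hand each Dataset a filtered view of it. No randomness here.
def select_split(samples: pd.DataFrame, split: str) -> pd.DataFrame:
    """Filter a pre-split sample frame; the split assignment already happened."""
    subset = samples[samples.split == split]
    return subset.reset_index(drop=True)

all_samples = pd.DataFrame(
    {
        "image_path": [f"img_{i}.png" for i in range(5)],
        "split": ["train", "train", "train", "test", "test"],
    }
)
train_df = select_split(all_samples, "train")
test_df = select_split(all_samples, "test")
# Both views come from the same single assignment, so they cannot overlap:
assert set(train_df.image_path).isdisjoint(test_df.image_path)
```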

@ashwinvaidya17
Collaborator

@jeongHwarr I see what you mean. Thanks for spotting this; I'll look into the issue. But does it make sense to keep seed 0 as "no seed", or should we refactor the code so that seed: null means no seed?
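The seed: null convention could look like this minimal sketch (maybe_seed_everything is a hypothetical name; real code would call Lightning's seed_everything): None means "no seed", which frees 0 to be a valid seed value.

```python
import random

# Hypothetical sketch: treat a missing seed as YAML null / Python None
# instead of overloading the valid seed value 0.
def maybe_seed_everything(seed):
    """Seed only when an explicit seed was given; None means 'no seed'."""
    if seed is not None:
        random.seed(seed)  # real code would call seed_everything(seed)
        return True
    return False
```

With this convention, seed 0 now produces a reproducible run, while omitting the key in the yaml leaves the run unseeded.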

@ashwinvaidya17
Collaborator

@jeongHwarr Thanks again for spotting this issue and creating a PR to address it. Is it possible for you to make a few changes to this PR (https://github.com/openvinotoolkit/anomalib/pull/437/files) so that we can merge it and you can be a contributor to this repo 😀. The other option is to merge this as-is, and I can create a new PR to address it.

@jeongHwarr
Contributor Author

jeongHwarr commented Jul 14, 2022

@ashwinvaidya17

I think None or null is clearer than seed 0.
Until I saw the code of this repo, I assumed the seed was fixed at 0.

I've reviewed your PR and nothing seems to need changing!
Thank you 😀

Successfully merging this pull request may close these issues.

When test_split is applied to a custom dataset, test data contamination occurs