
Fix: data split issue #404

Merged
merged 1 commit into from Jul 14, 2022

Conversation

jeongHwarr
Contributor

Description

Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist

  • My code follows the pre-commit style and check guidelines of this project.
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing tests pass locally with my changes

@ashwinvaidya17 ashwinvaidya17 self-requested a review July 5, 2022 17:22
@ashwinvaidya17
Collaborator

What's wrong with setting the seed to a value other than 0? If we make this change, we will have to ensure that lines like

if config.project.seed != 0:
    seed_everything(config.project.seed)

are changed in other files (train.py, etc.) as well.

@jeongHwarr
Contributor Author

@ashwinvaidya17

If the seed is non-zero, there is no problem.
However, if the seed is 0, the test data is also included in the training data, so the performance is abnormally high.

Does seed 0 mean that the seed is not fixed in this project?
If so, the code seems to need a lot of modification.

It looks like the intent was for the samples to be shared across calls, but instead each call re-creates them using the same seed:

if not normal_test_dir:
    samples = split_normal_images_in_train_set(
        samples=samples, split_ratio=split_ratio, seed=seed, normal_label="normal"
    )

Now the PR code solves the data contamination problem, but with seed 0 the training and test data are created identically on every run.

@ashwinvaidya17
Collaborator

@jeongHwarr I'm not sure I quite understand the contamination problem. Currently we treat seed 0 as having no seed. However, even with seed 0 I get no overlap between the train split and the test split. Here are the sample lines of code I am using to test this; they do not raise an AssertionError across multiple runs.

from pathlib import Path

from anomalib.data.folder import make_dataset

dataset = make_dataset(
    normal_dir=Path("../datasets/bottle/good/"),
    abnormal_dir=Path("../datasets/bottle/broken_large"),
    split_ratio=0.33,
    seed=0,
)

train_split = dataset[dataset.split == "train"]
test_split = dataset[dataset.split == "test"]
assert train_split["image_path"].values.all() != test_split["image_path"].values.all()

Here is another sample using the Folder DataModule:

folder_dataset = Folder(
    root=Path("../datasets/bottle"),
    normal_dir="good",
    abnormal_dir="broken_large",
    split_ratio=0.1,
    seed=0,
    image_size=256,
)
folder_dataset.setup()
assert (
    folder_dataset.test_data.samples["image_path"].values.all()
    != folder_dataset.train_data.samples["image_path"].values.all()
)
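As an aside, `.values.all()` reduces each column to a single value, so comparing the two results is not a real overlap test. A stricter check (an illustrative sketch with synthetic paths, not code from the PR) intersects the path sets directly:

```python
import pandas as pd

# Illustrative sketch: build a synthetic sample frame and verify that no
# image path appears in both the train and the test split.
samples = pd.DataFrame(
    {
        "image_path": [f"img_{i}.png" for i in range(10)],
        "split": ["train"] * 7 + ["test"] * 3,
    }
)
train_paths = set(samples.loc[samples.split == "train", "image_path"])
test_paths = set(samples.loc[samples.split == "test", "image_path"])
assert not (train_paths & test_paths), "data contamination: splits overlap"
```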

@jeongHwarr
Contributor Author

@ashwinvaidya17

There is no overlap in your test code because you called make_dataset only once.

However, in this project's code, make_dataset is called multiple times. If the seed is not fixed, each call produces a different split, so the samples will not come out the same.

if not normal_test_dir:
    samples = split_normal_images_in_train_set(
        samples=samples, split_ratio=split_ratio, seed=seed, normal_label="normal"
    )

To reproduce this situation, set normal_test_dir to None in the yaml so that the branch above is executed, then compare the training data and the test data. You will find the overlap there.

make_dataset is called multiple times through the code below.

if stage in (None, "fit"):
    self.train_data = FolderDataset(
        normal_dir=self.normal_dir,
        abnormal_dir=self.abnormal_dir,
        normal_test_dir=self.normal_test,
        split="train",
        split_ratio=self.split_ratio,
        mask_dir=self.mask_dir,
        pre_process=self.pre_process_train,
        extensions=self.extensions,
        task=self.task,
        seed=self.seed,
        create_validation_set=self.create_validation_set,
    )

if self.create_validation_set:
    self.val_data = FolderDataset(
        normal_dir=self.normal_dir,
        abnormal_dir=self.abnormal_dir,
        normal_test_dir=self.normal_test,
        split="val",
        split_ratio=self.split_ratio,
        mask_dir=self.mask_dir,
        pre_process=self.pre_process_val,
        extensions=self.extensions,
        task=self.task,
        seed=self.seed,
        create_validation_set=self.create_validation_set,
    )

self.test_data = FolderDataset(
    normal_dir=self.normal_dir,
    abnormal_dir=self.abnormal_dir,
    split="test",
    normal_test_dir=self.normal_test,
    split_ratio=self.split_ratio,
    mask_dir=self.mask_dir,
    pre_process=self.pre_process_val,
    extensions=self.extensions,
    task=self.task,
    seed=self.seed,
    create_validation_set=self.create_validation_set,
)

if stage == "predict":
    self.inference_data = InferenceDataset(
        path=self.root, image_size=self.image_size, transform_config=self.transform_config_val
    )

Each FolderDataset then calls make_dataset internally:

self.samples = make_dataset(
    normal_dir=normal_dir,
    abnormal_dir=abnormal_dir,
    normal_test_dir=normal_test_dir,
    mask_dir=mask_dir,
    split=split,
    split_ratio=split_ratio,
    seed=seed,
    create_validation_set=create_validation_set,
    extensions=extensions,
)

To solve this problem, make_dataset should be called only once, and the part below must be handled outside of make_dataset:

# Get the data frame for the split.
if split is not None and split in ["train", "val", "test"]:
    samples = samples[samples.split == split]
    samples = samples.reset_index(drop=True)
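The proposed restructuring can be sketched as follows (a minimal illustration with a synthetic frame; the helper name is hypothetical): the random split happens exactly once, and the per-stage filtering becomes a pure, deterministic selection.

```python
import pandas as pd

# Hedged sketch of the proposed fix: build the full sample frame once,
# then hand each Dataset a filtered view of it. No randomness here.
def select_split(samples: pd.DataFrame, split: str) -> pd.DataFrame:
    """Filter a pre-split sample frame; the split assignment already happened."""
    subset = samples[samples.split == split]
    return subset.reset_index(drop=True)

all_samples = pd.DataFrame(
    {
        "image_path": [f"img_{i}.png" for i in range(5)],
        "split": ["train", "train", "train", "test", "test"],
    }
)
train_df = select_split(all_samples, "train")
test_df = select_split(all_samples, "test")
# Both views come from the same single assignment, so they cannot overlap:
assert set(train_df.image_path).isdisjoint(test_df.image_path)
```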

@ashwinvaidya17
Collaborator

@jeongHwarr I see what you mean. Thanks for spotting this; I'll look into the issue. But does it make sense to keep seed 0 as "no seed", or should we refactor the code so that seed: null means no seed?
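The seed: null convention could look like this minimal sketch (maybe_seed_everything is a hypothetical name; real code would call Lightning's seed_everything): None means "no seed", which frees 0 to be a valid seed value.

```python
import random

# Hypothetical sketch: treat a missing seed as YAML null / Python None
# instead of overloading the valid seed value 0.
def maybe_seed_everything(seed):
    """Seed only when an explicit seed was given; None means 'no seed'."""
    if seed is not None:
        random.seed(seed)  # real code would call seed_everything(seed)
        return True
    return False
```

With this convention, seed 0 now produces a reproducible run, while omitting the key in the yaml leaves the run unseeded.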

@ashwinvaidya17
Collaborator

@jeongHwarr Thanks again for spotting this issue and creating a PR to address it. Is it possible for you to make a few changes to this PR (https://github.com/openvinotoolkit/anomalib/pull/437/files) so that we can merge it and you can be a contributor to this repo 😀. The other option is to merge this as-is, and I can create a new PR to address it.

@jeongHwarr
Contributor Author

jeongHwarr commented Jul 14, 2022

@ashwinvaidya17

I think None or null is clearer than seed 0.
Until I saw the code of this repo, I assumed the seed was fixed at 0.

I've reviewed your PR and nothing seems to need changing!
Thank you 😀

Successfully merging this pull request may close these issues.

When test_split is applied to a custom dataset, test data contamination occurs