This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

How do I ensure no data leakage in the validation split #143

Closed
aribornstein opened this issue Feb 23, 2021 · 0 comments
Labels
question Further information is requested

Comments

@aribornstein
Contributor

What is your question?

I'm working on a dataset that contains identifiable information, which would lead to data leakage if the data is not split properly. Currently the validation split is hard-coded in each respective DataModule.

        if valid_split:
            # Convert the validation fraction into absolute sample counts
            full_length = len(train_ds)
            train_split = int((1.0 - valid_split) * full_length)
            valid_split = full_length - train_split
            # Randomly partition the dataset, seeding the generator
            # so the split is reproducible across runs
            train_ds, valid_ds = torch.utils.data.random_split(
                train_ds,
                [train_split, valid_split],
                generator=torch.Generator().manual_seed(seed)
            )

Ideally I'd like a flag that ensures there is no overlap between the train and validation data on these fields, rebalancing any overlap if necessary. Because dataset initialization is hard-coded, the validation dataset is effectively immutable once it has been created.

What is the best way to handle such a check in Flash?
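One common way to prevent this kind of leakage is a group-aware split: assign whole groups (e.g. all samples sharing the same identifying field) to either train or validation, never both. Below is a minimal sketch of that idea using plain torch; `group_split` and its `groups` argument are hypothetical names, not part of Flash's API.

```python
import torch
from torch.utils.data import Subset


def group_split(dataset, groups, valid_split=0.2, seed=42):
    """Split `dataset` so that no group appears in both train and validation.

    `groups` is a hypothetical per-sample list of group ids (e.g. a patient
    or user id) aligned with `dataset`; the split is done over unique group
    ids rather than over individual samples.
    """
    unique_groups = sorted(set(groups))
    # Shuffle the group ids with a seeded generator for reproducibility
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(len(unique_groups), generator=g).tolist()
    # Reserve roughly `valid_split` of the groups for validation
    n_valid = max(1, int(valid_split * len(unique_groups)))
    valid_groups = {unique_groups[i] for i in perm[:n_valid]}
    # Route every sample to train or validation based on its group
    train_idx = [i for i, grp in enumerate(groups) if grp not in valid_groups]
    valid_idx = [i for i, grp in enumerate(groups) if grp in valid_groups]
    return Subset(dataset, train_idx), Subset(dataset, valid_idx)
```

Note that the resulting sample-level split ratio only approximates `valid_split` when groups have unequal sizes; scikit-learn's `GroupShuffleSplit` implements the same idea if adding that dependency is acceptable.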

@aribornstein aribornstein added the question Further information is requested label Feb 23, 2021
@Borda Borda closed this as completed Mar 15, 2021
@Lightning-Universe Lightning-Universe locked and limited conversation to collaborators Mar 15, 2021

