Unexpected behavior doing Split + Filter #3450

Closed
jbrachat opened this issue Dec 17, 2021 · 1 comment
Labels
bug Something isn't working

Comments

jbrachat commented Dec 17, 2021

Describe the bug

I observed unexpected behavior when applying 'train_test_split' followed by 'filter' on a dataset. After applying 'filter' to the test split, elements of the training split end up in the filtered result.

Steps to reproduce the bug

from datasets import Dataset
import pandas as pd

# Build a small 9-row dataset from a pandas DataFrame
dic = {'x': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'y': ['q', 'w', 'e', 'r', 't', 'y', 'u', 'i', 'o']}
df = pd.DataFrame.from_dict(dic)
dataset = Dataset.from_pandas(df)

# Split without shuffling, then keep only even values of 'x' in the test split
split_dataset = dataset.train_test_split(test_size=0.5, shuffle=False, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]
eval_dataset_2 = eval_dataset.filter(lambda example: example['x'] % 2 == 0)

# Every value printed for eval_dataset_2 should also appear in eval_dataset
print(eval_dataset['x'])
print(eval_dataset_2['x'])

The elements in eval_dataset_2 actually come from the training dataset.

Expected results

The expected results would be that the filtered eval dataset would only contain elements from the original eval dataset.
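
The expectation above can be turned into a quick check. The following is a minimal sketch, not part of the original report: `leaked_values` is a hypothetical helper name, and the lists in the usage lines are made-up illustrative values, not output captured from the reproduction script.

```python
def leaked_values(filtered_values, source_values):
    """Return the values found in the filtered split but absent from its source split.

    An empty list means the filtered split only contains elements of the
    split it was derived from, which is the expected behavior.
    """
    source = set(source_values)
    return [value for value in filtered_values if value not in source]


# Illustrative (made-up) values: a well-behaved filter leaks nothing.
print(leaked_values([6, 8], [5, 6, 7, 8, 9]))
# Illustrative (made-up) values: values that never were in the source are reported.
print(leaked_values([2, 4], [5, 6, 7, 8, 9]))
```

With the reproduction script, the call would be `leaked_values(eval_dataset_2['x'], eval_dataset['x'])`, which should return an empty list once the bug is fixed.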

Actual results

Specify the actual results or traceback.

Environment info

  • datasets version: 1.12.1
  • Platform: Windows 10
  • Python version: 3.7
  • PyArrow version: 5.0.0

jbrachat added the bug label Dec 17, 2021

lhoestq (Member) commented Dec 20, 2021

Hi! This is a bug in datasets 1.12, sorry for the inconvenience. Can you update to >=1.13?
See #3190.

Maybe we should also backport the bug fix to 1.12 (in a new version 1.12.2).
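
For anyone checking their own environment against the version advice above, here is a rough sketch. It assumes only what this thread states (1.12 is affected, >=1.13 has the fix); `version_tuple` and `has_fix` are hypothetical helper names, and the parsing is deliberately simplistic (it stops at the first non-numeric component).

```python
def version_tuple(version):
    """Turn a release string such as '1.12.1' into (1, 12, 1).

    Parsing stops at the first non-numeric component, so a suffix such as
    '.dev0' is ignored.
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(version):
    # Assumption taken from this thread: releases >= 1.13 contain the fix.
    return version_tuple(version) >= (1, 13)


# Usage with the installed library (requires datasets to be installed):
#   import datasets
#   print(has_fix(datasets.__version__))
print(has_fix("1.12.1"))
print(has_fix("1.13.0"))
```

If a 1.12.2 backport is released as suggested, the threshold above would no longer be accurate for that version and would need adjusting.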
