Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Jigsaw unintended Bias #2935

Merged
merged 9 commits into from
Sep 24, 2021

Conversation

Iwontbecreative
Copy link
Contributor

Hi,

Here's a first attempt at this dataset. Would be great if it could be merged relatively quickly as it is needed for Bigscience-related stuff.

This requires manual download, and I had some trouble generating dummy_data in this setting, so welcoming feedback there.

@Iwontbecreative
Copy link
Contributor Author

Note that the tests seem to fail because of a bug in an Exception at the moment, see: #2936 for the fix

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :) thanks a lot for adding this dataset !

a few comments:

datasets/jigsaw_unintended_bias/README.md Show resolved Hide resolved
datasets/jigsaw_unintended_bias/README.md Outdated Show resolved Hide resolved
datasets/jigsaw_unintended_bias/README.md Outdated Show resolved Hide resolved
@@ -0,0 +1,156 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
# Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had replaced this went back seeing as templates/new_dataset_script,py includes 2020. Might be worth updating this on your end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed ! thanks

@Iwontbecreative
Copy link
Contributor Author

@lhoestq implemented your changes, I think this might be ready for another look.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you !

I added my final comments (just some nitpicking)

# It is in charge of opening the given file and yielding (key, example) tuples from the dataset
# The key is not important, it's more here for legacy reason (legacy from tfds)

data = pd.read_csv(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will load all the data in memory (1GB), could it be possible to iterate on the csv file line by line maybe ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, chunks of 50k now which should be small enough.

datasets/jigsaw_unintended_bias/README.md Outdated Show resolved Hide resolved
@Iwontbecreative
Copy link
Contributor Author

Thanks @lhoestq, implemented the changes, let me know if anything else pops up.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thank you ! LGTM :)

@lhoestq lhoestq merged commit d3649fc into huggingface:master Sep 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants