-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Jigsaw unintended Bias #2935
Add Jigsaw unintended Bias #2935
Conversation
Note that the tests seem to fail because of a bug in an Exception at the moment, see: #2936 for the fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice :) thanks a lot for adding this dataset !
a few comments:
@@ -0,0 +1,156 @@ | |||
# coding=utf-8 | |||
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor. | |
# Copyright 2021 The HuggingFace Datasets Authors and the current dataset script contributor. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had replaced this went back seeing as templates/new_dataset_script,py
includes 2020. Might be worth updating this on your end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed ! thanks
@lhoestq implemented your changes, I think this might be ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thank you !
I added my final comments (just some nitpicking)
# It is in charge of opening the given file and yielding (key, example) tuples from the dataset | ||
# The key is not important, it's more here for legacy reason (legacy from tfds) | ||
|
||
data = pd.read_csv(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will load all the data in memory (1GB), could it be possible to iterate on the csv file line by line maybe ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, chunks of 50k now which should be small enough.
Thanks @lhoestq, implemented the changes, let me know if anything else pops up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice thank you ! LGTM :)
Hi,
Here's a first attempt at this dataset. Would be great if it could be merged relatively quickly as it is needed for Bigscience-related stuff.
This requires manual download, and I had some trouble generating dummy_data in this setting, so welcoming feedback there.