-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use standard open-domain validation split in nq_open #3029
Conversation
The nq_open dataset originally drew the validation set from this file: https://github.com/google-research-datasets/natural-questions/blob/master/nq_open/NQ-open.efficientqa.dev.1.1.sample.jsonl However, that's the dev set used specifically and only for the efficientqa competition, and it's not the same dev set as is used in every open-domain question answering paper (including the Lee et al paper that introduced the open-domain variant of NQ, cited at the top of the dataset file). This PR changes nq_open to use the standard validation split and bumps the version to 2.0.0 since this is a breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix.
I guess the dataset_infos.json
should be updated as well:
datasets-cli test datasets/nq_open --save_infos --all_configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also dummy data should be moved to 2.0.0
subdirectory:
- From: dummy/nq_open/1.0.0/dummy_data.zip
- To: dummy/nq_open/2.0.0/dummy_data.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And a minor change: the tag pretty_name
should be added to the header of the README.md
file:
pretty_name: NQ-Open
I had to run datasets-cli with --ignore_verifications the first time since it was complaining about a missing file, but now it runs without that flag fine. I moved dummy_data.zip to the new folder, but also had to modify the filename of the test file in the zip (should I not have done that?). Finally, I added the pretty name tag. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @craffel you did right! The renaming of the dev datafile was also required. Sorry I forgot to tell you.
Once all tests pass, I can merge to master.
Great, thanks for the help. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @craffel for the fix!
The nq_open dataset originally drew the validation set from this file:
https://github.com/google-research-datasets/natural-questions/blob/master/nq_open/NQ-open.efficientqa.dev.1.1.sample.jsonl
However, that's the dev set used specifically and only for the efficientqa competition, and it's not the same dev set as is used in every open-domain question answering paper (including the Lee et al paper that introduced the open-domain variant of NQ, cited at the top of the dataset file). This PR changes nq_open to use the standard validation split and bumps the version to 2.0.0 since this is a breaking change.