Introduce web and wiki config in triviaqa dataset #2949
The TriviaQA paper suggests that the two subsets (Wikipedia and Web) should be treated differently. There are also different leaderboards for the two sets on CodaLab. For that reason, introduce additional builder configs in the trivia_qa dataset.
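For illustration, selecting one of the new configs might look roughly like the sketch below. The config names `rc.web` and `rc.wikipedia` are the ones mentioned in this thread; the helper names `rc_config_name` and `load_subset`, the dataset id `"trivia_qa"` and the default split are assumptions, not something this PR defines.

```python
# Config names as mentioned in this thread; the mapping itself is a hypothetical helper.
RC_CONFIGS = {"web": "rc.web", "wikipedia": "rc.wikipedia"}


def rc_config_name(subset: str) -> str:
    """Map a TriviaQA subset name ("web" or "wikipedia") to its config name."""
    key = subset.lower()
    if key not in RC_CONFIGS:
        raise ValueError(
            f"unknown subset {subset!r}; expected one of {sorted(RC_CONFIGS)}"
        )
    return RC_CONFIGS[key]


def load_subset(subset: str, split: str = "validation"):
    """Load one subset. Requires the `datasets` package and network access.

    The dataset id "trivia_qa" and the default split are assumptions.
    """
    # Imported lazily so the name mapping above stays usable offline.
    from datasets import load_dataset

    return load_dataset("trivia_qa", rc_config_name(subset), split=split)
```

Keeping the two subsets behind separate config names makes it harder to mix them up when reporting results for the separate leaderboards.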
Nice, thanks! It all looks good to me :)
I only have one comment:
We try to keep the dummy data as small as possible. The rc, rc.web, rc.wikipedia and unfiltered ones are a bit big because of the evidence files. Feel free to only keep a few sentences in them in order to make the dummy data even smaller.
I just made the dummy data smaller :)
Thank you so much for reviewing and accepting my pull request!! :) I created these rather large dummy datasets to cover all the different cases for the row structure. E.g. in the web configuration, a row can have evidence from both Wikipedia ("EntityPages") and the web ("SearchResults"), but either EntityPages or SearchResults might also be empty. I will probably add this note to the dataset description in the future.
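A minimal sketch of the cases described above, assuming each row is a dict in which "EntityPages" and "SearchResults" are lists that may be empty or missing. The field names come from this thread; the row layout and the helper `evidence_sources` are assumptions for illustration, not the dataset script's actual code.

```python
from typing import Any, Dict, List

# Evidence fields named in this thread.
EVIDENCE_FIELDS = ("EntityPages", "SearchResults")


def evidence_sources(row: Dict[str, Any]) -> List[str]:
    """Return the evidence fields that are present and non-empty in a row.

    Assumes each evidence field, when present, holds a list of documents.
    """
    return [field for field in EVIDENCE_FIELDS if row.get(field)]


# The three cases: both sources, only Wikipedia, only web.
both = {"EntityPages": [{"Title": "A"}], "SearchResults": [{"Title": "B"}]}
wiki_only = {"EntityPages": [{"Title": "A"}], "SearchResults": []}
web_only = {"SearchResults": [{"Title": "B"}]}

print(evidence_sources(both))       # ['EntityPages', 'SearchResults']
print(evidence_sources(wiki_only))  # ['EntityPages']
print(evidence_sources(web_only))   # ['SearchResults']
```

Treating an empty list and a missing key the same way keeps downstream code from having to branch on both.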
OK, I see! Yes, feel free to mention it in the dataset card, this can be useful. For the dummy data though, we can keep the small ones, as the tests are mainly about checking the parsing done by the dataset script rather than the actual content of the dataset.