Introduce web and wiki config in triviaqa dataset #2949

Merged
merged 10 commits into from
Oct 1, 2021

Conversation

Contributor

@shirte shirte commented Sep 20, 2021

The TriviaQA paper suggests that the two subsets (Wikipedia and Web)
should be treated differently. There are also different leaderboards
for the two sets on CodaLab. For that reason, introduce additional
builder configs in the trivia_qa dataset.

Member

@lhoestq lhoestq left a comment

Nice, thanks! It all looks good to me :)

I only have one comment:

We try to keep the dummy data as small as possible. The rc, rc.web, rc.wikipedia and unfiltered ones are a bit big because of the evidence files. Feel free to only keep a few sentences in them in order to make the dummy data even smaller.
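As a rough illustration of trimming an evidence file down to a few sentences, here is a minimal sketch. The helper name `keep_first_sentences` and the naive sentence splitting are assumptions made for this example; they are not part of the dataset script or of the `datasets` test tooling.

```python
import re


def keep_first_sentences(text: str, max_sentences: int = 3) -> str:
    """Return only the first few sentences of an evidence document.

    Hypothetical helper: sentences are split naively on '.', '!' or '?'
    followed by whitespace, which is good enough for shrinking dummy data.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    sample = "First sentence. Second sentence! Third sentence? Fourth sentence."
    print(keep_first_sentences(sample, max_sentences=2))
```

Applied to each evidence file in the dummy archive, something like this would keep the file layout intact while cutting the size considerably.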

@lhoestq lhoestq mentioned this pull request Oct 1, 2021
Member

lhoestq commented Oct 1, 2021

I just made the dummy data smaller :)
Once GitHub refreshes the change, I think we can merge!

@lhoestq lhoestq merged commit f9ee6d4 into huggingface:master Oct 1, 2021
Contributor Author

shirte commented Oct 2, 2021

Thank you so much for reviewing and accepting my pull request!! :)

I created these rather large dummy data sets to cover all the different cases of the row structure. For example, in the web configuration, a row can have evidence from both Wikipedia ("EntityPages") and the web ("SearchResults"), but either EntityPages or SearchResults may also be empty. I will probably add this note to the dataset description in the future.
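The cases described above can be sketched as a small classifier over a row. This is only an illustration: rows are modelled as plain dictionaries keyed by the "EntityPages" and "SearchResults" names mentioned in this comment, and the `evidence_sources` helper and the `Title` entries are hypothetical. The features exposed by the actual dataset script may be named or nested differently.

```python
from typing import Any, Dict


def evidence_sources(row: Dict[str, Any]) -> str:
    """Classify a row by which evidence lists are non-empty.

    Assumes (for illustration only) that a row carries two list-valued
    fields, "EntityPages" and "SearchResults", either of which may be
    empty or missing.
    """
    has_wiki = bool(row.get("EntityPages"))
    has_web = bool(row.get("SearchResults"))
    if has_wiki and has_web:
        return "both"
    if has_wiki:
        return "wikipedia"
    if has_web:
        return "web"
    return "none"


if __name__ == "__main__":
    # Hypothetical rows covering the cases a test fixture might need.
    rows = [
        {"EntityPages": [{"Title": "A"}], "SearchResults": [{"Title": "B"}]},
        {"EntityPages": [{"Title": "A"}], "SearchResults": []},
        {"EntityPages": [], "SearchResults": [{"Title": "B"}]},
    ]
    for row in rows:
        print(evidence_sources(row))
```

A check like this makes it easy to confirm that a trimmed set of dummy rows still covers each combination.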

Member

lhoestq commented Oct 5, 2021

OK, I see! Yes, feel free to mention it in the dataset card; this can be useful.

For the dummy data, though, we can keep the small ones, since the tests mainly check the parsing done by the dataset script rather than the actual content of the dataset.
