Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Task casting for text classification & question answering #2255
Task casting for text classification & question answering #2255
Changes from 18 commits
9a19cf2
7ab394a
e5cef8d
4564eb9
29650f8
1516ae6
2157bf5
9302893
02bcfa2
531cf94
d3bccca
6dfd7f2
b0899b1
3cf039d
b2a02c5
28e6ab1
7f0f683
3a30b62
2ec62a2
5df3b65
35e6110
e21d728
8d5427d
c344557
6220e7f
388c5d4
d759f65
6b851aa
95d395e
6d74541
0f6513a
d3cdf78
34aa06e
fe61d2c
46beea9
c83e501
db52ce7
3ad17f3
14e6e5b
316cdb0
8e2299a
40de679
b67ed1f
944b22f
1cb8c93
6858125
6c6b3c9
4d40558
7076902
81674cc
3e39872
7abd2f0
441a36e
894a509
56635e7
1274aa5
5e48403
5a6021a
a868d35
99200c7
f895389
290b583
3578dbd
7a0de24
6bf7970
2dc8d26
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we keep this here in this pr?
more generally, am i correct in understanding that we'll need to add the relevant tasks to each of the ~600 canonical datasets at some point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was merely a demonstrator for the syntax, to validate the API "feel" from the user perspective when adding a dataset.
I don't think it makes much sense to keep it in the PR unless we test something specific on SQuAD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes... To make our lives easier, we need to find a way to infer supported tasks automatically for simple datasets, without changing the datasets scripts. For example, we could check if the dataset is already formatted as expected (no need for column renaming / casting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point re task detection! i wonder if we could do this with a simple check on whether the schema features are a subset of the dataset features, e.g.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @SBrandeis could you clarify what you mean by "unnesting columns" here?
do you mean that
Datasets.rename_columns
should supported columns with nested features likeanswers.text
in SQuAD:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lewtun! Yes that's correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SBrandeis can you clarify what is meant by this TODO? in particular, what does "members" refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course 😄
For context, this method is called when concatenating datasets with the
concatenate_datasets
method. TheDatasetInfo
of the concatenatedDataset
object is constructed usingDatasetInfo.from_merge
: https://github.com/huggingface/datasets/blob/master/src/datasets/arrow_dataset.py#L3098What I meant here is that it would be nice to propagate the
task_templates
from the componentsDataset
objects to the resulting concatenatedDataset
object (I'm happy to reformulate if it's not clear 😅 ). "members" refer to the componentsDataset
objects (the ones being concatenated).If all the component
Dataset
s can be prepared for thequestion_answering
task / pipeline, then the resulting concatenatedDataset
can also be prepared for thequestion_answering
task.Let me know if this helps 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks a lot - this is now perfectly clear! i'll have a stab at it in this pr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add
TaskTempplate
here too? 👀There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function supposed to be used by the user? If so it should have a doc string and doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no this is not supposed to be used by the user (at least for now). still no harm in having a docstring, so i've added:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QuestionAnsweringExtractive
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the suggestion! i'll keep this unchanged for now since it will soon be overhauled by #2371
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you want to check with a few QA datasets that this schema make sense. Typically NaturalQuestions, TriviaQA and can be good second datasets to compare to and be sure of the generality of the schema.
A good recent list of QA datasets to compare the schemas among, is for instance in the UnitedQA paper: https://arxiv.org/abs/2101.00178
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the tip! added to #2371