Task casting for text classification & question answering #2255
Conversation
cc @abhi1thakur

Looks really nice so far, thanks!
This decorator is needed to make TaskTemplate (and inherited classes) JSON-serializable
hey @SBrandeis @lhoestq, i now have a better idea about what you guys are trying to achieve with the task templates and have a few follow-up questions:
```python
from datasets.tasks import TextClassification
from transformers import AutoModelForSequenceClassification, AutoConfig

tc = TextClassification.from_json("path/to/dataset_infos.json")

# load a model with the desired config
model_ckpt = ...
config = AutoConfig.from_pretrained(model_ckpt, label2id=tc.label2id, id2label=tc.id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config)
# run eval ... perhaps this is what @SBrandeis had in mind with the
```
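For context, the `label2id` / `id2label` mappings used in the snippet above can be derived directly from a template's class names. A minimal plain-Python sketch (the helper name `build_label_mappings` is hypothetical, not part of the library):

```python
# Build the id/label mappings a TextClassification template would expose,
# given the class names stored in a dataset's ClassLabel feature.
def build_label_mappings(class_names):
    label2id = {name: idx for idx, name in enumerate(class_names)}
    id2label = {idx: name for idx, name in enumerate(class_names)}
    return label2id, id2label

label2id, id2label = build_label_mappings(["negative", "positive"])
print(label2id)  # {'negative': 0, 'positive': 1}
print(id2label)  # {0: 'negative', 1: 'positive'}
```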
The initial idea was to be able to do something like this:

```python
from datasets import load_dataset

dset = load_dataset("name", task="binary_classification")
# OR
dset = load_dataset("name")
dset = dset.prepare_for_task("binary_classification")
```

[Edit 1] Typo
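To make the idea concrete, here is a hedged sketch of what such a `prepare_for_task` call could do under the hood for a single example: rename columns to the template's standard names and drop the rest. This is a hypothetical simplification — the real implementation works on Arrow tables and also casts feature types:

```python
# Hypothetical sketch: apply a template's column mapping to one example,
# keeping only the mapped (standardized) columns.
def apply_column_mapping(example, column_mapping):
    # column_mapping: {original_column_name: standard_column_name}
    return {new: example[old] for old, new in column_mapping.items()}

example = {"tweet": "i love this!", "emotion": 3, "tweet_id": 42}
mapping = {"tweet": "text", "emotion": "label"}
print(apply_column_mapping(example, mapping))
# {'text': 'i love this!', 'label': 3}
```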
ah that's very elegant! just so i've completely understood, the result would be that the relevant column names of
That's correct! Also, the features need to be appropriately casted:

```python
datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.ClassLabel(names=[...]),
})
```
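For illustration, the cast of a label column from raw strings to class ids mirrors what `datasets.ClassLabel` provides via `str2int` / `int2str`. A minimal stand-in (hypothetical class, not the library implementation):

```python
# Sketch of the string <-> integer mapping a ClassLabel feature performs,
# mimicking datasets.ClassLabel.str2int / int2str with a plain dict.
class SimpleClassLabel:
    def __init__(self, names):
        self.names = list(names)
        self._str2int = {name: i for i, name in enumerate(self.names)}

    def str2int(self, value):
        return self._str2int[value]

    def int2str(self, value):
        return self.names[value]

label = SimpleClassLabel(names=["sadness", "joy", "anger"])
print(label.str2int("joy"))  # 1
print(label.int2str(2))      # anger
```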
great, thanks a lot for your answers! now it's much clearer what i need to do next 😃
This makes sense for two reasons: 1) to handle both Dataset and DatasetDict objects, and 2) to be closer to the post-processing logic per split
src/datasets/arrow_dataset.py (Outdated)

```python
column_mapping = template.column_mapping
columns_to_drop = [column for column in self.column_names if column not in column_mapping]
dataset = self.remove_columns(columns_to_drop)
# TODO(sbrandeis): Add support for unnesting columns too
```
hey @SBrandeis could you clarify what you mean by "unnesting columns" here? do you mean that `Datasets.rename_columns` should support columns with nested features like `answers.text` in SQuAD:
```python
features=datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "answers": datasets.features.Sequence(
            {
                "text": datasets.Value("string"),
                "answer_start": datasets.Value("int32"),
            }
        ),
    }
)
```
Hi @lewtun! Yes that's correct
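For illustration, "unnesting" in this sense is analogous to `Dataset.flatten()`, which promotes nested features like SQuAD's `answers.text` to top-level dotted column names. A pure-Python sketch of the idea (hypothetical helper, not the Arrow-backed implementation):

```python
# Flatten nested example fields into top-level keys joined with dots,
# e.g. {"answers": {"text": ...}} -> {"answers.text": ...}.
def flatten_example(example, parent_key=""):
    flat = {}
    for key, value in example.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_example(value, full_key))
        else:
            flat[full_key] = value
    return flat

example = {
    "id": "abc",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}
print(flatten_example(example))
# {'id': 'abc', 'answers.text': ['Denver Broncos'], 'answers.answer_start': [177]}
```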
hey @lhoestq @SBrandeis, i've made some small tweaks to @SBrandeis's code so that:

```python
# DatasetDict with default columns
ds = load_dataset("./datasets/emotion/")
# DatasetDict({
#     train: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
# })

# DatasetDict with remapped columns
ds = load_dataset("./datasets/emotion/", task="text_classification")
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
# })

# Dataset with default columns
ds = load_dataset("./datasets/emotion/", split="train")
# Map/cast features
ds = ds.prepare_for_task("text_classification")
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 16000
# })
```

i have a few follow-up questions / remarks:
```python
task_templates=[
    TextClassification(labels=class_names, text_column="tweet", label_column="emotion"),
    TextClassification(labels=class_names, text_column="some_other_column", label_column="some_other_column"),
]
```

since we use
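Since a dataset can declare several task templates (as in the `task_templates` list above), resolving a template by task name has to cope with multiple matches. A hedged sketch of such a lookup (hypothetical `Template` class and helper, not the library's API):

```python
from dataclasses import dataclass

# Minimal stand-in for a task template carrying a task name and column mapping.
@dataclass(frozen=True)
class Template:
    task: str
    text_column: str
    label_column: str

templates = [
    Template("text-classification", "tweet", "emotion"),
    Template("text-classification", "some_other_column", "some_other_column"),
]

# Return all templates registered for a task; raise if none exist.
def templates_for_task(templates, task_name):
    matches = [t for t in templates if t.task == task_name]
    if not matches:
        raise ValueError(f"No template found for task {task_name!r}")
    return matches

print(len(templates_for_task(templates, "text-classification")))  # 2
```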
i'm not sure why the benchmarks are getting cancelled - is this expected?

Hmm, I don't know. It's certainly unrelated to this PR though. Maybe GitHub has some issues.

Something is happening with actions: https://www.githubstatus.com/
hey @lhoestq and @SBrandeis, i've:
i think this is ready for another review :)
Looks all good, thank you :)
Can you also add `prepare_for_task` in the `main_classes.rst` file of the documentation?
Done! I also remembered that I needed to do the same for
Awesome job @lewtun, thanks a lot!
src/datasets/load.py (Outdated)

```diff
@@ -635,6 +635,7 @@ def load_dataset(
     save_infos: bool = False,
     script_version: Optional[Union[str, Version]] = None,
     use_auth_token: Optional[Union[bool, str]] = None,
+    task: Optional[str] = None,
```
Should we add `TaskTemplate` here too? 👀
good catch!
```python
class TextClassification(TaskTemplate):
    task = "text-classification"
    input_schema = Features({"text": Value("string")})
    label_schema = Features({"labels": ClassLabel})
```
Nit: since we update this in `__post_init__`, do we need to declare a default?
good catch! i'll fix it :)
ah actually we need to declare a default because `label_schema` is a required argument for `__init__`. i've reverted the change and added a TODO so we can think about a more elegant approach as we iterate on the other tasks
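For reference, this is the usual frozen-dataclass pattern: a field that gets rebuilt in `__post_init__` still needs a placeholder default so `__init__` doesn't require it, and the rebuild must go through `object.__setattr__` because direct assignment is forbidden on frozen instances. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass
from typing import Dict, Optional

# label_schema carries a placeholder default (None) and is rebuilt in
# __post_init__ via object.__setattr__, since the dataclass is frozen.
@dataclass(frozen=True)
class MiniTextClassification:
    labels: tuple
    label_schema: Optional[Dict[str, object]] = None

    def __post_init__(self):
        object.__setattr__(
            self, "label_schema", {"labels": sorted(self.labels)}
        )

tpl = MiniTextClassification(labels=("pos", "neg"))
print(tpl.label_schema)  # {'labels': ['neg', 'pos']}
```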
Added a few high-level comments.
I like the general idea!
src/datasets/arrow_dataset.py (Outdated)

```diff
@@ -1384,6 +1385,44 @@ def with_transform(
         dataset.set_transform(transform=transform, columns=columns, output_all_columns=output_all_columns)
         return dataset

+    def prepare_for_task(self, task: Union[str, TaskTemplate]) -> "Dataset":
+        """Prepare a dataset for the given task.
```
I think it's good to explain a bit what operations the expression "prepare a dataset" covers
good idea! i opted for the following:

> Prepare a dataset for the given task by casting the dataset's :class:`Features` to standardized column names and types as detailed in `~tasks`.
```rst
task (:obj:`Union[str, TaskTemplate]`): The task to prepare the dataset for during training and evaluation. If :obj:`str`, supported tasks include:

    - :obj:`"text-classification"`
    - :obj:`"question-answering"`
```
`question-answering` exists in two forms: abstractive and extractive question answering.

we can keep a generic `question-answering`, but then it will probably mean different schemas of input/output for the two forms (abstractive will have text for both, while extractive can use span indications as well as text).

Or we can also propose to use `abstractive-question-answering` and `extractive-question-answering`, for instance.
Maybe we could have `question-answering-abstractive` and `question-answering-extractive`, in case we add a variant for completion or search in the future (a detail).
Actually I see that people are organizing more in terms of general tasks and sub-tasks, for instance on paperswithcode: https://paperswithcode.com/area/natural-language-processing and on nlpprogress: https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md#squad

Probably the best is to align with one of these in terms of naming. PapersWithCode is probably the most active and maintained, and we work with them as well.
our idea was to start by following the pipeline taxonomy in `transformers`, where `question-answering` is indeed restricted to the extractive case. but you're 100% correct that we should already start considering the abstractive case, or grouping in terms of the sub-domains that PwC adopts.

since our focus right now is on getting `text-classification` integrated and tested with `autonlp`, i've opened an issue here to tackle this next: #2371
PWC uses "Question Answering" (meaning extractive) vs "Generative Question Answering" (includes abstractive), which encapsulates most of the QA tasks except Open/Closed Domain QA and Knowledge Base QA (they require no context since the knowledge is not part of the query).

I'm fine with these names, or simply `extractive` and `abstractive`.
src/datasets/load.py (Outdated)

```diff
@@ -694,6 +696,7 @@ def load_dataset(
     You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
     use_auth_token (``str`` or ``bool``, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
         If True, will get token from `"~/.huggingface"`.
+    task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
```
```diff
-task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
+task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to standardized column names and types as detailed in `~tasks`.
```
```python
NAME2TEMPLATE = {QuestionAnswering.task: QuestionAnswering, TextClassification.task: TextClassification}


def task_template_from_dict(task_template_dict: dict) -> Optional[TaskTemplate]:
```
Is this function supposed to be used by the user? If so, it should have a docstring and documentation
no this is not supposed to be used by the user (at least for now). still, no harm in having a docstring, so i've added:

```python
"""Create one of the supported task templates in :obj:`datasets.tasks` from a dictionary."""
```
```python
@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
```
`QuestionAnsweringExtractive`?
thanks for the suggestion! i'll keep this unchanged for now since it will soon be overhauled by #2371
```python
@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
    task = "question-answering"
    input_schema = Features({"question": Value("string"), "context": Value("string")})
```
Maybe you want to check with a few QA datasets that this schema makes sense. Typically, NaturalQuestions and TriviaQA can be good second datasets to compare to, to be sure of the generality of the schema.

A good recent list of QA datasets to compare schemas among is for instance in the UnitedQA paper: https://arxiv.org/abs/2101.00178
thanks for the tip! added to #2371
This PR implements task preparation for a given task, in the continuation of #2143
Task taxonomy follows 🤗 Transformers's pipelines taxonomy: https://github.com/huggingface/transformers/tree/master/src/transformers/pipelines
Edit by @lewtun:
This PR implements support for the following tasks:

- text-classification
- question-answering

The intended usage is as follows:

Note that the current implementation assumes that DatasetInfo.task_templates has been pre-defined by the user / contributor when overriding the MyDataset(GeneratorBasedBuilder)._info function.

As pointed out by @SBrandeis, for evaluation we'll need a way to detect which datasets already have a compatible schema so we don't have to edit hundreds of dataset scripts. One possibility is to check if the schema features are a subset of the dataset ones, e.g.
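A hedged sketch of that subset check, with feature types modeled as plain strings for illustration (the helper name and representation are hypothetical, not the library's):

```python
# A dataset already matches a task schema if every feature the template
# requires exists in the dataset with the same type.
def is_schema_compatible(template_features, dataset_features):
    return all(
        name in dataset_features and dataset_features[name] == dtype
        for name, dtype in template_features.items()
    )

template = {"text": "string", "label": "class_label"}
dataset = {"text": "string", "label": "class_label", "idx": "int32"}
print(is_schema_compatible(template, dataset))              # True
print(is_schema_compatible({"question": "string"}, dataset))  # False
```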