
Task casting for text classification & question answering #2255

Merged
merged 66 commits into master from sbrandeis/task_casting on May 18, 2021

Conversation

@SBrandeis (Contributor) commented Apr 23, 2021

This PR implements dataset preparation for a given task, in continuation of #2143

The task taxonomy follows 🤗 Transformers' pipelines taxonomy: https://github.com/huggingface/transformers/tree/master/src/transformers/pipelines

Edit by @lewtun:

This PR implements support for the following tasks:

  • text-classification
  • question-answering

The intended usage is as follows:

from datasets import load_dataset, concatenate_datasets

# Load a dataset with default column names / features
ds = load_dataset("dataset_name")
# Cast column names / features to the task schema. Casting is defined in the dataset's `DatasetInfo`
ds = ds.prepare_for_task(task="text-classification")
# Casting can also be done during load
ds = load_dataset("dataset_name", task="text-classification")
# We can also concatenate datasets that share a task
ds1 = load_dataset("dataset_name_1", task="text-classification")
ds2 = load_dataset("dataset_name_2", task="text-classification")
# If the tasks have the same schema, so will `ds_concat`
ds_concat = concatenate_datasets([ds1, ds2])

Note that the current implementation assumes that DatasetInfo.task_templates has been pre-defined by the user / contributor when overriding the MyDataset(GeneratorBasedBuilder)._info function.
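
For illustration, here is a minimal sketch of how a dataset script might pre-define such a template in _info. The column names, label names, and the TextClassification arguments follow the usage shown later in this thread and are assumptions, not the exact merged API:

import datasets
from datasets.tasks import TextClassification

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        class_names = ["negative", "positive"]  # hypothetical label names
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "tweet": datasets.Value("string"),
                    "emotion": datasets.ClassLabel(names=class_names),
                }
            ),
            # Declares that this dataset can be cast to the text-classification schema
            task_templates=[
                TextClassification(labels=class_names, text_column="tweet", label_column="emotion")
            ],
        )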

As pointed out by @SBrandeis, for evaluation we'll need a way to detect which datasets already have a compatible schema so we don't have to edit hundreds of dataset scripts. One possibility is to check if the schema features are a subset of the dataset ones, e.g.

from datasets import load_dataset, Features
from datasets.tasks import QuestionAnswering

squad = load_dataset("./datasets/squad", split="train")
qa = QuestionAnswering()
schema = Features({**qa.input_schema, **qa.label_schema})
assert all(item in squad.features.items() for item in schema.items())
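
For reuse, the same check could be wrapped in a small hypothetical helper (names here are illustrative):

from datasets import Dataset, Features
from datasets.tasks import TaskTemplate

def has_compatible_schema(dataset: Dataset, template: TaskTemplate) -> bool:
    # True if the dataset's features already contain the template's input and label schema
    schema = Features({**template.input_schema, **template.label_schema})
    return all(item in dataset.features.items() for item in schema.items())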

@SBrandeis requested a review from @lhoestq on April 23, 2021, 16:01
@SBrandeis (Contributor Author):

cc @abhi1thakur

@lhoestq (Member) commented Apr 23, 2021

Looks really nice so far, thanks!
Maybe if a dataset doesn't have a template for a specific task we could try the default template of this task?

This decorator is needed to make TaskTemplate (and inherited classes) JSON-serializable
@lewtun (Member) commented May 5, 2021

hey @SBrandeis @lhoestq,

i now have a better idea about what you guys are trying to achieve with the task templates and have a few follow-up questions:

  1. how did you envision using DatasetInfo for running evaluation? my understanding is that all dataset_infos.json files are stored in the datasets repo (unlike transformers where each model's weights etc are stored in a dedicated repo).
    this suggests the following workflow:
- git clone datasets
- load target dataset to evaluate
- load `dataset_infos.json` for target dataset
- run eval for each task template in `task_templates`
- store metrics as evaluation cards (similar to what is done in `autonlp`)
  2. assuming the above workflow, i see that the current TaskTemplate attributes of task, input_schema, and label_schema still require some wrangling from dataset_infos.json to reproduce additional mappings like label2id that we'd need for e.g. text classification. an alternative would be to instantiate the task template class directly from the JSON with something like
from datasets.tasks import TextClassification
from transformers import AutoModelForSequenceClassification, AutoConfig

tc = TextClassification.from_json("path/to/dataset_infos.json")
# load a model with the desired config
model_ckpt = ...
config = AutoConfig.from_pretrained(model_ckpt, label2id=tc.label2id, id2label=tc.id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config)
# run eval ...

perhaps this is what @SBrandeis had in mind with the TaskTemplate.from_dict method?

  3. i personally prefer using task_templates over supervised_keys because it encourages the contributor to think in terms of 1 or more tasks. my question here is do we currently use supervised_keys for anything important in the datasets library?

@SBrandeis (Contributor Author) commented May 5, 2021

  1. How do you envision using DatasetInfo for running evaluation?

The initial idea was to be able to do something like this:

from datasets import load_dataset
dset = load_dataset("name", task="binary_classification")
# OR
dset = load_dataset("name")
dset = dset.prepare_for_task("binary_classification")
  2. I don't think that's needed if we proceed as mentioned above

  3. supervised_keys are mostly a legacy compatibility thing with TF datasets, not sure it's used for anything right now. I'll let @lhoestq give more details on that

[Edit 1] Typo

@lewtun (Member) commented May 5, 2021

The initial idea was to be able to do something like this:

from datasets import load_dataset
dset = load_dataset("name", task="binary_classification")
# OR
dset = load_dataset("name")
dset = dset.prepare_for_task("binary_classification")

ah that's very elegant! just so i've completely understood, the result would be that the relevant column names of dset would be mapped to e.g. text and label and thus we'd have a uniform schema for the evaluation of all binary_classification tasks?

@SBrandeis (Contributor Author):

That's correct! Also, the features need to be appropriately cast.
For a classification task, for example, we would need to cast the dataset's features to something like this:

datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.ClassLabel(names=[...]),
})
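
As a rough sketch of what that cast amounts to with existing public Dataset methods (assuming raw columns named "tweet" / "emotion" as in the emotion test case discussed below; the actual prepare_for_task implementation may differ):

import datasets

# Hypothetical local copy of the emotion script with raw columns "tweet" / "emotion"
ds = datasets.load_dataset("./datasets/emotion/", split="train")
features = datasets.Features(
    {
        "text": datasets.Value("string"),
        # label names assumed for illustration
        "label": datasets.ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
    }
)
# Rename the raw columns to the standard schema, then cast the features
ds = ds.rename_column("tweet", "text")
ds = ds.rename_column("emotion", "label")
ds = ds.cast(features)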

@lhoestq (Member) commented May 5, 2021

  3. We can ignore supervised_keys (it came from TFDS and we're not using it) and use task_templates

@lewtun (Member) commented May 5, 2021

great, thanks a lot for your answers! now it's much clearer what i need to do next 😃

column_mapping = template.column_mapping
columns_to_drop = [column for column in self.column_names if column not in column_mapping]
dataset = self.remove_columns(columns_to_drop)
# TODO(sbrandeis): Add support for unnesting columns too
Member:

hey @SBrandeis could you clarify what you mean by "unnesting columns" here?

do you mean that Datasets.rename_columns should support columns with nested features like answers.text in SQuAD:

features=datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "answers": datasets.features.Sequence(
            {
                "text": datasets.Value("string"),
                "answer_start": datasets.Value("int32"),
            }
        ),
    }
)

Contributor Author:

Hi @lewtun! Yes that's correct
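
For reference, the library's existing Dataset.flatten() already performs this kind of unnesting by promoting nested fields to top-level columns; a small sketch, assuming the public squad script:

from datasets import load_dataset

squad = load_dataset("squad", split="train[:100]")
flat = squad.flatten()
print(flat.column_names)
# ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start']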

@lewtun (Member) commented May 7, 2021

hey @lhoestq @SBrandeis,

i've made some small tweaks to @SBrandeis's code so that Dataset.prepare_for_task is called in DatasetBuilder. using the emotion dataset as a test case, the following now works:

from datasets import load_dataset

# DatasetDict with default columns
ds = load_dataset("./datasets/emotion/")
# DatasetDict({
#     train: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
# })

# DatasetDict with remapped columns
ds = load_dataset("./datasets/emotion/", task="text_classification")
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
# })

# Dataset with default columns
ds = load_dataset("./datasets/emotion/", split="train")
# Map/cast features
ds = ds.prepare_for_task("text_classification")
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 16000
# })

i have a few follow-up questions / remarks:

  1. i'm working under the assumption that contributors / users only provide a unique set of task types. in particular, the current implementation does not support something like:
task_templates=[TextClassification(labels=class_names, text_column="tweet", label_column="emotion"), TextClassification(labels=class_names, text_column="some_other_column", label_column="some_other_column")]

since we use TaskTemplate.task to filter for compatible templates in Dataset.prepare_for_task. should we support these scenarios? my hunch is that this is rare in practice, but please correct me if i'm wrong.

  2. when we eventually run evaluation for transformers models, i expect we'll be using the Trainer, for which we can pass the standard label names to TrainingArguments.label_names (see the sketch after this list). if that's the case, it might be prudent to heed the warning from the docs and use labels instead of label in the schema:

your model can accept multiple label arguments (use the label_names in your TrainingArguments to indicate their name to the Trainer) but none of them should be named "label".

  3. i plan to forge ahead on the rest of the pipeline taxonomy. please let me know if you'd prefer smaller, self-contained pull requests (e.g. one per task)
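
A minimal sketch of the Trainer configuration referenced in point 2, assuming transformers is installed (the output directory and column name are illustrative):

from transformers import TrainingArguments

# Tell the Trainer which input key(s) hold the labels, here the standardized "labels" column
args = TrainingArguments(output_dir="eval_output", label_names=["labels"])
# `args` would then be passed to Trainer(model=..., args=args, ...)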

@lewtun (Member) commented May 14, 2021

i'm not sure why the benchmarks are getting cancelled - is this expected?

@lhoestq (Member) commented May 14, 2021

i'm not sure why the benchmarks are getting cancelled - is this expected?

Hmm, I don't know. It's certainly unrelated to this PR though. Maybe GitHub is having some issues.

@SBrandeis (Contributor Author):

Something is happening with actions: https://www.githubstatus.com/

@lewtun (Member) commented May 17, 2021

hey @lhoestq and @SBrandeis, i've:

  • extended the prepare_for_task API along the lines that @lhoestq suggested. i wasn't entirely sure what the datasets convention is for docstrings with mixed types, so please see if my proposal makes sense
  • added a few new tests to check that we trigger the value errors on incorrect input

i think this is ready for another review :)

@lhoestq (Member) left a comment:

Looks all good thank you :)

Can you also add prepare_for_task in the main_classes.rst file of the documentation?

@lewtun (Member) commented May 17, 2021

Looks all good thank you :)

Can you also add prepare_for_task in the main_classes.rst file of the documentation?

Done! I also remembered that I needed to do the same for DatasetDict, so included this as well :)

@SBrandeis (Contributor Author) left a comment:

Awesome job @lewtun, thanks a lot!

@@ -635,6 +635,7 @@ def load_dataset(
    save_infos: bool = False,
    script_version: Optional[Union[str, Version]] = None,
    use_auth_token: Optional[Union[bool, str]] = None,
    task: Optional[str] = None,
Contributor Author:

Should we add TaskTemplate here too? 👀

Member:

good catch!

class TextClassification(TaskTemplate):
    task = "text-classification"
    input_schema = Features({"text": Value("string")})
    label_schema = Features({"labels": ClassLabel})
Contributor Author:

Nit: Since we update this in __post_init__ do we need to declare a default?

Member:

good catch! i'll fix it :)

Member:

ah actually we need to declare a default because label_schema is a required argument for __init__. i've reverted the change and added a TODO so we can think about a more elegant approach as we iterate on the other tasks

@thomwolf (Member) left a comment:

Added a few high-level comments.

I like the general idea!

@@ -1384,6 +1385,44 @@ def with_transform(
        dataset.set_transform(transform=transform, columns=columns, output_all_columns=output_all_columns)
        return dataset

    def prepare_for_task(self, task: Union[str, TaskTemplate]) -> "Dataset":
        """Prepare a dataset for the given task.
Member:

I think it's good to explain a bit what operations are contained in the expression "prepare a dataset".

Member:

good idea! i opted for the following:

Prepare a dataset for the given task by casting the dataset's :class:`Features` to standardized column names and types as detailed in `~tasks`.

task (:obj:`Union[str, TaskTemplate]`): The task to prepare the dataset for during training and evaluation. If :obj:`str`, supported tasks include:

- :obj:`"text-classification"`
- :obj:`"question-answering"`
Member:

question-answering exists in two forms: abstractive and extractive question answering.

we can keep a generic question-answering but then it will probably mean different schemas of input/output for both (abstractive will have text for both, while extractive can use span indications as well as text).

Or we can also propose to use abstractive-question-answering and extractive-question-answering for instance.

Member:

Maybe we could have question-answering-abstractive and question-answering-extractive, if somehow we can use the shared prefix for completion or search in the future (a detail).

Member:

Actually I see that people are organizing more in terms of general tasks and sub-tasks, for instance on Papers with Code: https://paperswithcode.com/area/natural-language-processing and on NLP-progress: https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md#squad

Probably the best is to align with one of these in terms of naming; Papers with Code is probably the most active and maintained, and we work with them as well.

Member:

our idea was to start by following the pipeline taxonomy in transformers, where question-answering is indeed restricted to the extractive case.

but you're 100% correct that we should already start considering the abstractive case or grouping in terms of the sub-domains that PwC adopts.

since our focus right now is on getting text-classification integrated and tested with autonlp, i've opened an issue here to tackle this next: #2371

Member:

PWC uses "Question Answering" (meaning extractive) vs "Generative Question Answering" (includes abstractive) which encapsulates most of the QA tasks except Open/Closed Domain QA and Knowledge base QA (they require no context since the knowledge is not part of the query).

I'm fine with these names, or simply extractive and abstractive

@@ -694,6 +696,7 @@ def load_dataset(
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
use_auth_token (``str`` or ``bool``, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
Member:

Suggested change:
- task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
+ task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to standardized column names and types as detailed in `~tasks`.

NAME2TEMPLATE = {QuestionAnswering.task: QuestionAnswering, TextClassification.task: TextClassification}


def task_template_from_dict(task_template_dict: dict) -> Optional[TaskTemplate]:
Member:

Is this function supposed to be used by the user? If so, it should have a docstring and documentation.

Member:

no this is not supposed to be used by the user (at least for now). still no harm in having a docstring, so i've added:

"""Create one of the supported task templates in :obj:`datasets.tasks` from a dictionary."""



@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
Member:

QuestionAnsweringExtractive?

Member:

thanks for the suggestion! i'll keep this unchanged for now since it will soon be overhauled by #2371

@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
    task = "question-answering"
    input_schema = Features({"question": Value("string"), "context": Value("string")})
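
For reference, a plausible label schema to pair with this input schema, mirroring the SQuAD answers feature quoted earlier in this thread (an assumption, not necessarily the exact definition merged here):

from datasets import Features, Sequence, Value

label_schema = Features(
    {
        "answers": Sequence(
            {
                "text": Value("string"),
                "answer_start": Value("int32"),
            }
        )
    }
)
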
Member:

Maybe you want to check with a few QA datasets that this schema makes sense. Typically, NaturalQuestions and TriviaQA can be good second datasets to compare to, to be sure of the generality of the schema.

A good recent list of QA datasets to compare the schemas among, is for instance in the UnitedQA paper: https://arxiv.org/abs/2101.00178

Member:

thanks for the tip! added to #2371

@lewtun merged commit bc61954 into master on May 18, 2021
@lewtun deleted the sbrandeis/task_casting branch on May 18, 2021, 13:31