
Task casting for text classification & question answering #2255

Merged
merged 66 commits into master from sbrandeis/task_casting on May 18, 2021

Conversation

@SBrandeis (Contributor) commented Apr 23, 2021

This PR implements dataset preparation for a given task, in continuation of #2143

The task taxonomy follows 🤗 Transformers' pipelines taxonomy: https://github.com/huggingface/transformers/tree/master/src/transformers/pipelines

Edit by @lewtun:

This PR implements support for the following tasks:

  • text-classification
  • question-answering

The intended usage is as follows:

from datasets import load_dataset, concatenate_datasets

# Load a dataset with default column names / features
ds = load_dataset("dataset_name")
# Cast column names / features to the task schema. Casting is defined in the dataset's `DatasetInfo`
ds = ds.prepare_for_task(task="text-classification")
# Casting can also be done during load
ds = load_dataset("dataset_name", task="text-classification")
# We can also concatenate datasets that share a task
ds1 = load_dataset("dataset_name_1", task="text-classification")
ds2 = load_dataset("dataset_name_2", task="text-classification")
# If the tasks have the same schema, so will `ds_concat`
ds_concat = concatenate_datasets([ds1, ds2])

Note that the current implementation assumes that DatasetInfo.task_templates has been pre-defined by the user / contributor when overriding the MyDataset(GeneratorBasedBuilder)._info function.
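
For illustration, here is a minimal sketch of how a dataset script might pre-define such a template in _info. The column names, label names, and the TextClassification arguments follow the usage shown later in this thread and are assumptions, not the exact merged API:

import datasets
from datasets.tasks import TextClassification

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        class_names = ["negative", "positive"]  # hypothetical label names
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "tweet": datasets.Value("string"),
                    "emotion": datasets.ClassLabel(names=class_names),
                }
            ),
            # Declares that this dataset can be cast to the text-classification schema
            task_templates=[
                TextClassification(labels=class_names, text_column="tweet", label_column="emotion")
            ],
        )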

As pointed out by @SBrandeis, for evaluation we'll need a way to detect which datasets already have a compatible schema so we don't have to edit hundreds of dataset scripts. One possibility is to check if the schema features are a subset of the dataset ones, e.g.

from datasets import load_dataset, Features
from datasets.tasks import QuestionAnswering

squad = load_dataset("./datasets/squad", split="train")
qa = QuestionAnswering()
schema = Features({**qa.input_schema, **qa.label_schema})
assert all(item in squad.features.items() for item in schema.items())
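
For reuse, the same check could be wrapped in a small hypothetical helper (names here are illustrative):

from datasets import Dataset, Features
from datasets.tasks import TaskTemplate

def has_compatible_schema(dataset: Dataset, template: TaskTemplate) -> bool:
    # True if the dataset's features already contain the template's input and label schema
    schema = Features({**template.input_schema, **template.label_schema})
    return all(item in dataset.features.items() for item in schema.items())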

@SBrandeis requested a review from @lhoestq on April 23, 2021, 16:01
@SBrandeis (Contributor Author):

cc @abhi1thakur

@lhoestq (Member) commented Apr 23, 2021

Looks really nice so far, thanks!
Maybe if a dataset doesn't have a template for a specific task we could try the default template of this task?

This decorator is needed to make TaskTemplate (and inherited classes) JSON-serializable
@lewtun (Member) commented May 5, 2021

hey @SBrandeis @lhoestq,

i now have a better idea about what you guys are trying to achieve with the task templates and have a few follow-up questions:

  1. how did you envision using DatasetInfo for running evaluation? my understanding is that all dataset_infos.json files are stored in the datasets repo (unlike transformers where each model's weights etc are stored in a dedicated repo).
    this suggests the following workflow:
- git clone datasets
- load target dataset to evaluate
- load `dataset_infos.json` for target dataset
- run eval for each task template in `task_templates`
- store metrics as evaluation cards (similar to what is done in `autonlp`)
  2. assuming the above workflow, i see that the current TaskTemplate attributes of task, input_schema, and label_schema still require some wrangling from dataset_infos.json to reproduce additional mappings like label2id that we'd need for e.g. text classification. an alternative would be to instantiate the task template class directly from the JSON with something like
from datasets.tasks import TextClassification
from transformers import AutoModelForSequenceClassification, AutoConfig

tc = TextClassification.from_json("path/to/dataset_infos.json")
# load a model with the desired config
model_ckpt = ...
config = AutoConfig.from_pretrained(model_ckpt, label2id=tc.label2id, id2label=tc.id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config)
# run eval ...

perhaps this is what @SBrandeis had in mind with the TaskTemplate.from_dict method?

  3. i personally prefer using task_templates over supervised_keys because it encourages the contributor to think in terms of 1 or more tasks. my question here is do we currently use supervised_keys for anything important in the datasets library?

@SBrandeis (Contributor Author) commented May 5, 2021

  1. How do you envision using DatasetInfo for running evaluation?

The initial idea was to be able to do something like this:

from datasets import load_dataset
dset = load_dataset("name", task="binary_classification")
# OR
dset = load_dataset("name")
dset = dset.prepare_for_task("binary_classification")
  2. I don't think that's needed if we proceed as mentioned above

  3. supervised_keys are mostly a legacy compatibility thing with TF datasets, not sure it's used for anything right now. I'll let @lhoestq give more details on that

[Edit 1] Typo

@lewtun (Member) commented May 5, 2021

The initial idea was to be able to do something like this:

from datasets import load_dataset
dset = load_dataset("name", task="binary_classification")
# OR
dset = load_dataset("name")
dset = dset.prepare_for_task("binary_classification")

ah that's very elegant! just so i've completely understood, the result would be that the relevant column names of dset would be mapped to e.g. text and label and thus we'd have a uniform schema for the evaluation of all binary_classification tasks?

@SBrandeis (Contributor Author):

That's correct! Also, the features need to be appropriately cast.
For a classification task, for example, we would need to cast the dataset's features to something like this:

datasets.Features({
    "text": datasets.Value("string"),
    "label": datasets.ClassLabel(names=[...]),
})
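
As a rough sketch of what that cast amounts to with existing public Dataset methods (assuming raw columns named "tweet" / "emotion" as in the emotion test case discussed below; the actual prepare_for_task implementation may differ):

import datasets

# Hypothetical local copy of the emotion script with raw columns "tweet" / "emotion"
ds = datasets.load_dataset("./datasets/emotion/", split="train")
features = datasets.Features(
    {
        "text": datasets.Value("string"),
        # label names assumed for illustration
        "label": datasets.ClassLabel(names=["sadness", "joy", "love", "anger", "fear", "surprise"]),
    }
)
# Rename the raw columns to the standard schema, then cast the features
ds = ds.rename_column("tweet", "text")
ds = ds.rename_column("emotion", "label")
ds = ds.cast(features)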

@lhoestq (Member) commented May 5, 2021

  3. We can ignore supervised_keys (it came from TFDS and we're not using it) and use task_templates

@lewtun (Member) commented May 5, 2021

great, thanks a lot for your answers! now it's much clearer what i need to do next 😃

column_mapping = template.column_mapping
columns_to_drop = [column for column in self.column_names if column not in column_mapping]
dataset = self.remove_columns(columns_to_drop)
# TODO(sbrandeis): Add support for unnesting columns too
Member:

hey @SBrandeis could you clarify what you mean by "unnesting columns" here?

do you mean that Datasets.rename_columns should support columns with nested features like answers.text in SQuAD:

features=datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "answers": datasets.features.Sequence(
            {
                "text": datasets.Value("string"),
                "answer_start": datasets.Value("int32"),
            }
        ),
    }
)

Contributor Author:

Hi @lewtun! Yes that's correct
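
For reference, the library's existing Dataset.flatten() already performs this kind of unnesting by promoting nested fields to top-level columns; a small sketch, assuming the public squad script:

from datasets import load_dataset

squad = load_dataset("squad", split="train[:100]")
flat = squad.flatten()
print(flat.column_names)
# ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start']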

@lewtun (Member) commented May 7, 2021

hey @lhoestq @SBrandeis,

i've made some small tweaks to @SBrandeis's code so that Dataset.prepare_for_task is called in DatasetBuilder. using the emotion dataset as a test case, the following now works:

from datasets import load_dataset

# DatasetDict with default columns
ds = load_dataset("./datasets/emotion/")
# DatasetDict({
#     train: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['tweet', 'emotion'],
#         num_rows: 2000
#     })
# })

# DatasetDict with remapped columns
ds = load_dataset("./datasets/emotion/", task="text_classification")
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 16000
#     })
#     validation: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 2000
#     })
# })

# Dataset with default columns
ds = load_dataset("./datasets/emotion/", split="train")
# Map/cast features
ds = ds.prepare_for_task("text_classification")
# Dataset({
#     features: ['text', 'label'],
#     num_rows: 16000
# })

i have a few follow-up questions / remarks:

  1. i'm working under the assumption that contributors / users only provide a unique set of task types. in particular, the current implementation does not support something like:
task_templates=[TextClassification(labels=class_names, text_column="tweet", label_column="emotion"), TextClassification(labels=class_names, text_column="some_other_column", label_column="some_other_column")]

since we use TaskTemplate.task to filter for compatible templates in Dataset.prepare_for_task. should we support these scenarios? my hunch is that this is rare in practice, but please correct me if i'm wrong.

  2. when we eventually run evaluation for transformers models, i expect we'll be using the Trainer, for which we can pass the standard label names to TrainingArguments.label_names (see the sketch after this list). if that's the case, it might be prudent to heed the warning from the docs and use labels instead of label in the schema:

your model can accept multiple label arguments (use the label_names in your TrainingArguments to indicate their name to the Trainer) but none of them should be named "label".

  3. i plan to forge ahead on the rest of the pipeline taxonomy. please let me know if you'd prefer smaller, self-contained pull requests (e.g. one per task)
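
A minimal sketch of the Trainer configuration referenced in point 2, assuming transformers is installed (the output directory and column name are illustrative):

from transformers import TrainingArguments

# Tell the Trainer which input key(s) hold the labels, here the standardized "labels" column
args = TrainingArguments(output_dir="eval_output", label_names=["labels"])
# `args` would then be passed to Trainer(model=..., args=args, ...)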

@lewtun (Member) commented May 14, 2021

i'm not sure why the benchmarks are getting cancelled - is this expected?

@lhoestq (Member) commented May 14, 2021

i'm not sure why the benchmarks are getting cancelled - is this expected?

Hmm, I don't know. It's certainly unrelated to this PR though. Maybe GitHub is having some issues.

@SBrandeis (Contributor Author):

Something is happening with actions: https://www.githubstatus.com/

@lewtun (Member) commented May 17, 2021

hey @lhoestq and @SBrandeis, i've:

  • extended the prepare_for_task API along the lines that @lhoestq suggested. i wasn't entirely sure what the datasets convention is for docstrings with mixed types, so please see if my proposal makes sense
  • added a few new tests to check that we trigger the value errors on incorrect input

i think this is ready for another review :)

@lhoestq (Member) left a comment:

Looks all good thank you :)

Can you also add prepare_for_task in the main_classes.rst file of the documentation?

@lewtun (Member) commented May 17, 2021

Looks all good thank you :)

Can you also add prepare_for_task in the main_classes.rst file of the documentation?

Done! I also remembered that I needed to do the same for DatasetDict, so included this as well :)

@SBrandeis (Contributor Author) left a comment:

Awesome job @lewtun, thanks a lot!

@@ -635,6 +635,7 @@ def load_dataset(
    save_infos: bool = False,
    script_version: Optional[Union[str, Version]] = None,
    use_auth_token: Optional[Union[bool, str]] = None,
    task: Optional[str] = None,
Contributor Author:

Should we add TaskTemplate here too? 👀

Member:

good catch!

class TextClassification(TaskTemplate):
    task = "text-classification"
    input_schema = Features({"text": Value("string")})
    label_schema = Features({"labels": ClassLabel})
Contributor Author:

Nit: Since we update this in __post_init__ do we need to declare a default?

Member:

good catch! i'll fix it :)

Member:

ah actually we need to declare a default because label_schema is a required argument for __init__. i've reverted the change and added a TODO so we can think about a more elegant approach as we iterate on the other tasks

@thomwolf (Member) left a comment:

Added a few high-level comments.

I like the general idea!

@@ -1384,6 +1385,44 @@ def with_transform(
        dataset.set_transform(transform=transform, columns=columns, output_all_columns=output_all_columns)
        return dataset

    def prepare_for_task(self, task: Union[str, TaskTemplate]) -> "Dataset":
        """Prepare a dataset for the given task.
Member:

I think it's good to explain a bit what operations are contained in the expression "prepare a dataset".

Member:

good idea! i opted for the following:

Prepare a dataset for the given task by casting the dataset's :class:`Features` to standardized column names and types as detailed in `~tasks`.

task (:obj:`Union[str, TaskTemplate]`): The task to prepare the dataset for during training and evaluation. If :obj:`str`, supported tasks include:

- :obj:`"text-classification"`
- :obj:`"question-answering"`
Member:

question-answering exists in two forms: abstractive and extractive question answering.

we can keep a generic question-answering but then it will probably mean different schemas of input/output for both (abstractive will have text for both, while extractive can use span indications as well as text).

Or we can also propose to use abstractive-question-answering and extractive-question-answering for instance.

Member:

Maybe we could have question-answering-abstractive and question-answering-extractive, if somehow we can use the shared prefix for completion or search in the future (a detail).

Member:

Actually I see that people are organizing more in terms of general tasks and sub-tasks, for instance on Papers with Code: https://paperswithcode.com/area/natural-language-processing and on NLP-progress: https://github.com/sebastianruder/NLP-progress/blob/master/english/question_answering.md#squad

Probably the best is to align with one of these in terms of naming; Papers with Code is probably the most active and maintained, and we work with them as well.

Member:

our idea was to start by following the pipeline taxonomy in transformers, where question-answering is indeed restricted to the extractive case.

but you're 100% correct that we should already start considering the abstractive case or grouping in terms of the sub-domains that PwC adopts.

since our focus right now is on getting text-classification integrated and tested with autonlp, i've opened an issue here to tackle this next: #2371

Member:

PWC uses "Question Answering" (meaning extractive) vs "Generative Question Answering" (includes abstractive) which encapsulates most of the QA tasks except Open/Closed Domain QA and Knowledge base QA (they require no context since the knowledge is not part of the query).

I'm fine with these names, or simply extractive and abstractive

@@ -694,6 +696,7 @@ def load_dataset(
You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository.
use_auth_token (``str`` or ``bool``, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub.
If True, will get token from `"~/.huggingface"`.
task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
Member:

Suggested change:
- task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to one of the schemas in `~tasks`.
+ task (``str``): The task to prepare the dataset for during training and evaluation. Casts the dataset's :class:`Features` according to standardized column names and types as detailed in `~tasks`.

NAME2TEMPLATE = {QuestionAnswering.task: QuestionAnswering, TextClassification.task: TextClassification}


def task_template_from_dict(task_template_dict: dict) -> Optional[TaskTemplate]:
Member:

Is this function supposed to be used by the user? If so, it should have a docstring and documentation.

Member:

no this is not supposed to be used by the user (at least for now). still no harm in having a docstring, so i've added:

"""Create one of the supported task templates in :obj:`datasets.tasks` from a dictionary."""



@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
Member:

QuestionAnsweringExtractive?

Member:

thanks for the suggestion! i'll keep this unchanged for now since it will soon be overhauled by #2371

@dataclass(frozen=True)
class QuestionAnswering(TaskTemplate):
    task = "question-answering"
    input_schema = Features({"question": Value("string"), "context": Value("string")})
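
For reference, a plausible label schema to pair with this input schema, mirroring the SQuAD answers feature quoted earlier in this thread (an assumption, not necessarily the exact definition merged here):

from datasets import Features, Sequence, Value

label_schema = Features(
    {
        "answers": Sequence(
            {
                "text": Value("string"),
                "answer_start": Value("int32"),
            }
        )
    }
)
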
Member:

Maybe you want to check with a few QA datasets that this schema makes sense. Typically, NaturalQuestions and TriviaQA can be good second datasets to compare to, to be sure of the generality of the schema.

A good recent list of QA datasets to compare the schemas among, is for instance in the UnitedQA paper: https://arxiv.org/abs/2101.00178

Member:

thanks for the tip! added to #2371

@lewtun merged commit bc61954 into master on May 18, 2021
@lewtun deleted the sbrandeis/task_casting branch on May 18, 2021, 13:31