Task casting for text classification & question answering #2255

Merged on May 18, 2021 · 66 commits · changes below shown from the first 18 commits

Commits
9a19cf2
WIP: task templates in datasets info
SBrandeis Apr 23, 2021
7ab394a
Add rename_columnS method
SBrandeis Apr 23, 2021
e5cef8d
WIP: Add prepare_for-task method
SBrandeis Apr 23, 2021
4564eb9
WIP: usage example
SBrandeis Apr 23, 2021
29650f8
Code quality
SBrandeis Apr 23, 2021
1516ae6
Merge branch 'master' into sbrandeis/task_casting
SBrandeis Apr 30, 2021
2157bf5
wip: Add text_classification pipeline
SBrandeis Apr 30, 2021
9302893
Decorate TaskTemplate with dataclass
lewtun May 4, 2021
02bcfa2
Add text classification template to emotion dataset_infos.json
lewtun May 6, 2021
531cf94
Fix key name in task_template_from_dict
lewtun May 6, 2021
d3bccca
Refactor TextClassification init and fix from_dict keys
lewtun May 6, 2021
6dfd7f2
Remove unused import
lewtun May 7, 2021
b0899b1
Add task specification to load_dataset
lewtun May 7, 2021
3cf039d
Rename emotion columns for task template testing
lewtun May 7, 2021
b2a02c5
Move task casting to builder
lewtun May 7, 2021
28e6ab1
Refactor task loading with prepare_for_task method
lewtun May 7, 2021
7f0f683
Add docstring and error message to prepare_for_task
lewtun May 7, 2021
3a30b62
Merge branch 'master' into sbrandeis/task_casting
lewtun May 7, 2021
2ec62a2
Add TODO in docstring
lewtun May 7, 2021
5df3b65
Update datas_infos.json for emotion dataset
lewtun May 7, 2021
35e6110
Call task preparation in load_dataset instead of builder
lewtun May 10, 2021
e21d728
Extend prepare_for_task to handle nested columns
lewtun May 10, 2021
8d5427d
Fix dataclass inheritance for QuestionAnswering task template
lewtun May 10, 2021
c344557
Unflatten column renaming
lewtun May 10, 2021
6220e7f
Replace nested question-answering columns with outer answers column
lewtun May 10, 2021
388c5d4
Revert emotion dataset
lewtun May 10, 2021
d759f65
Merge branch 'master' into sbrandeis/task_casting
lewtun May 10, 2021
6b851aa
Fix imports
lewtun May 10, 2021
95d395e
Revert emotion dataset for real!
lewtun May 10, 2021
6d74541
Remove label_mapping from TextClassification template
lewtun May 10, 2021
0f6513a
Rename TextClassification task labels to "labels" for Trainer compati…
lewtun May 10, 2021
d3cdf78
Add unit tests for text classification & question answering tasks
lewtun May 10, 2021
34aa06e
Fix style
lewtun May 10, 2021
fe61d2c
Remove setup from question answering test
lewtun May 10, 2021
46beea9
Clean up dataset memory after test
lewtun May 10, 2021
c83e501
Rename task names to use hyphen instead of underscore
lewtun May 11, 2021
db52ce7
Integrate review comments
lewtun May 12, 2021
3ad17f3
Remove task template from SQuAD
lewtun May 12, 2021
14e6e5b
Add hashing of task templates to enable set intersections
lewtun May 12, 2021
316cdb0
Filter Nones from task templates intersection
lewtun May 12, 2021
8e2299a
Fix style & quality
lewtun May 12, 2021
40de679
Clean up docstring
lewtun May 12, 2021
b67ed1f
Use context manager for in_memory dataset test
lewtun May 12, 2021
944b22f
Handle cases where task templates are none in merge
lewtun May 12, 2021
1cb8c93
Add more context managers!
lewtun May 12, 2021
6858125
Add tests for task template concatenation
lewtun May 14, 2021
6c6b3c9
Mark task name and schemas a class variables
lewtun May 14, 2021
4d40558
Replace custom hash functions with frozen dataclass decorators
lewtun May 14, 2021
7076902
Replace init with post_init in task templates
lewtun May 14, 2021
81674cc
Add supported tasks to docstring
lewtun May 14, 2021
3e39872
Add supported tasks to prepare_for_task docstring
lewtun May 14, 2021
7abd2f0
Update dataset_dict doctstring
lewtun May 14, 2021
441a36e
Add TODO and tweak docstrings
lewtun May 14, 2021
894a509
Merge branch 'master' into sbrandeis/task_casting
lewtun May 14, 2021
56635e7
Allow TaskTemplate objects to be passed to prepare_for_task
lewtun May 17, 2021
1274aa5
Add tests for invalid templates
lewtun May 17, 2021
5e48403
Update docstring for prepare_for_task
lewtun May 17, 2021
5a6021a
Merge branch 'master' into sbrandeis/task_casting
lewtun May 17, 2021
a868d35
Fix docstrings and add to HTML build
lewtun May 17, 2021
99200c7
Remove redundant context manager in tests
lewtun May 17, 2021
f895389
Add TaskTemplate to typing of load_dataset
lewtun May 18, 2021
290b583
Remove default value for label_schema in TextClassificationTemplate
lewtun May 18, 2021
3578dbd
Revert default value for label_schema and add TODO
lewtun May 18, 2021
7a0de24
Add and improve docstrings for task templates
lewtun May 18, 2021
6bf7970
Add list of task templates to docs
lewtun May 18, 2021
2dc8d26
Migrate task templates to dedicated section of docs
lewtun May 18, 2021
2 changes: 1 addition & 1 deletion datasets/emotion/dataset_infos.json
@@ -1 +1 @@
{"emotion":{"description":"Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the\npaper.\n","citation":"@inproceedings{saravia-etal-2018-carer,\n title = \"{CARER}: Contextualized Affect Representations for Emotion Recognition\",\n author = \"Saravia, Elvis and\n Liu, Hsien-Chi Toby and\n Huang, Yen-Hao and\n Wu, Junlin and\n Chen, Yi-Shin\",\n booktitle = \"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing\",\n month = oct # \"-\" # nov,\n year = \"2018\",\n address = \"Brussels, Belgium\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/D18-1404\",\n doi = \"10.18653/v1/D18-1404\",\n pages = \"3687--3697\",\n abstract = \"Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce rich structural descriptors which serve as the building blocks for constructing contextualized affect representations from text. The pattern-based representations are further enriched with word embeddings and evaluated through several emotion recognition tasks. Our experimental results demonstrate that the proposed method outperforms state-of-the-art techniques on emotion recognition tasks.\",\n}\n","homepage":"https://github.com/dair-ai/emotion_dataset","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"},"label":{"dtype":"string","id":null,"_type":"Value"}},"post_processed":null,"supervised_keys":null,"builder_name":"emotion","config_name":"emotion","version":{"version_str":"0.1.0","description":"First Emotion release","major":0,"minor":1,"patch":0},"splits":{"train":{"name":"train","num_bytes":1754632,"num_examples":16000,"dataset_name":"emotion"},"validation":{"name":"validation","num_bytes":216248,"num_examples":2000,"dataset_name":"emotion"},"test":{"name":"test","num_bytes":218768,"num_examples":2000,"dataset_name":"emotion"}},"download_checksums":{"https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1":{"num_bytes":1658616,"checksum":"3ab03d945a6cb783d818ccd06dafd52d2ed8b4f62f0f85a09d7d11870865b190"},"https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1":{"num_bytes":204240,"checksum":"34faaa31962fe63cdf5dbf6c132ef8ab166c640254ab991af78f3aea375e79ef"},"https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1":{"num_bytes":206760,"checksum":"60f531690d20127339e7f054edc299a82c627b5ec0dd5d552d53d544e0cfcc17"}},"download_size":2069616,"post_processing_size":null,"dataset_size":2189648,"size_in_bytes":4259264},"default":{"description":"Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. 
For more detailed information please refer to the paper.\n","citation":"@inproceedings{saravia-etal-2018-carer,\n title = \"{CARER}: Contextualized Affect Representations for Emotion Recognition\",\n author = \"Saravia, Elvis and\n Liu, Hsien-Chi Toby and\n Huang, Yen-Hao and\n Wu, Junlin and\n Chen, Yi-Shin\",\n booktitle = \"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing\",\n month = oct # \"-\" # nov,\n year = \"2018\",\n address = \"Brussels, Belgium\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/D18-1404\",\n doi = \"10.18653/v1/D18-1404\",\n pages = \"3687--3697\",\n abstract = \"Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce rich structural descriptors which serve as the building blocks for constructing contextualized affect representations from text. The pattern-based representations are further enriched with word embeddings and evaluated through several emotion recognition tasks. Our experimental results demonstrate that the proposed method outperforms state-of-the-art techniques on emotion recognition tasks.\",\n}\n","homepage":"https://github.com/dair-ai/emotion_dataset","license":"","features":{"text":{"dtype":"string","id":null,"_type":"Value"},"label":{"num_classes":6,"names":["sadness","joy","love","anger","fear","surprise"],"names_file":null,"id":null,"_type":"ClassLabel"}},"post_processed":null,"supervised_keys":{"input":"text","output":"label"},"builder_name":"emotion","config_name":"default","version":{"version_str":"0.0.0","description":null,"major":0,"minor":0,"patch":0},"splits":{"train":{"name":"train","num_bytes":1741541,"num_examples":16000,"dataset_name":"emotion"},"validation":{"name":"validation","num_bytes":214699,"num_examples":2000,"dataset_name":"emotion"},"test":{"name":"test","num_bytes":217177,"num_examples":2000,"dataset_name":"emotion"}},"download_checksums":{"https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1":{"num_bytes":1658616,"checksum":"3ab03d945a6cb783d818ccd06dafd52d2ed8b4f62f0f85a09d7d11870865b190"},"https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1":{"num_bytes":204240,"checksum":"34faaa31962fe63cdf5dbf6c132ef8ab166c640254ab991af78f3aea375e79ef"},"https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1":{"num_bytes":206760,"checksum":"60f531690d20127339e7f054edc299a82c627b5ec0dd5d552d53d544e0cfcc17"}},"download_size":2069616,"post_processing_size":null,"dataset_size":2173417,"size_in_bytes":4243033}}
{"emotion": {"description": "Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.\n", "citation": "@inproceedings{saravia-etal-2018-carer,\n title = \"{CARER}: Contextualized Affect Representations for Emotion Recognition\",\n author = \"Saravia, Elvis and\n Liu, Hsien-Chi Toby and\n Huang, Yen-Hao and\n Wu, Junlin and\n Chen, Yi-Shin\",\n booktitle = \"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing\",\n month = oct # \"-\" # nov,\n year = \"2018\",\n address = \"Brussels, Belgium\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/D18-1404\",\n doi = \"10.18653/v1/D18-1404\",\n pages = \"3687--3697\",\n abstract = \"Emotions are expressed in nuanced ways, which varies by collective or individual experiences, knowledge, and beliefs. Therefore, to understand emotion, as conveyed through text, a robust mechanism capable of capturing and modeling different linguistic nuances and phenomena is needed. We propose a semi-supervised, graph-based algorithm to produce rich structural descriptors which serve as the building blocks for constructing contextualized affect representations from text. The pattern-based representations are further enriched with word embeddings and evaluated through several emotion recognition tasks. Our experimental results demonstrate that the proposed method outperforms state-of-the-art techniques on emotion recognition tasks.\",\n}\n", "homepage": "https://github.com/dair-ai/emotion_dataset", "license": "", "features": {"text": {"dtype": "string", "id": null, "_type": "Value"}, "label": {"num_classes": 6, "names": ["sadness", "joy", "love", "anger", "fear", "surprise"], "names_file": null, "id": null, "_type": "ClassLabel"}}, "post_processed": null, "supervised_keys": {"input": "text", "output": "label"}, "builder_name": "emotion", "config_name": "emotion", "version": {"version_str": "1.0.0", "description": null, "major": 1, "minor": 0, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 1741541, "num_examples": 16000, "dataset_name": "emotion"}, "validation": {"name": "validation", "num_bytes": 214699, "num_examples": 2000, "dataset_name": "emotion"}, "test": {"name": "test", "num_bytes": 217177, "num_examples": 2000, "dataset_name": "emotion"}}, "download_checksums": {"https://www.dropbox.com/s/1pzkadrvffbqw6o/train.txt?dl=1": {"num_bytes": 1658616, "checksum": "3ab03d945a6cb783d818ccd06dafd52d2ed8b4f62f0f85a09d7d11870865b190"}, "https://www.dropbox.com/s/2mzialpsgf9k5l3/val.txt?dl=1": {"num_bytes": 204240, "checksum": "34faaa31962fe63cdf5dbf6c132ef8ab166c640254ab991af78f3aea375e79ef"}, "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1": {"num_bytes": 206760, "checksum": "60f531690d20127339e7f054edc299a82c627b5ec0dd5d552d53d544e0cfcc17"}}, "download_size": 2069616, "post_processing_size": null, "dataset_size": 2173417, "size_in_bytes": 4243033, "task_templates": [{"task": "text_classification", "input_schema": {"text": {"dtype": "string", "id": null, "_type": "Value"}}, "label_schema": {"label": {"num_classes": 6, "names": ["anger", "fear", "joy", "love", "sadness", "surprise"], "names_file": null, "id": null, "_type": "ClassLabel"}}}]}}
21 changes: 19 additions & 2 deletions datasets/emotion/emotion.py
@@ -1,6 +1,7 @@
import csv

import datasets
from datasets.tasks import TextClassification


_CITATION = """\
@@ -33,17 +34,33 @@
_TEST_DOWNLOAD_URL = "https://www.dropbox.com/s/ikkqxfdbdec3fuj/test.txt?dl=1"


class EmotionConfig(datasets.BuilderConfig):
def __init__(self, **kwargs):
super(EmotionConfig, self).__init__(**kwargs)


class Emotion(datasets.GeneratorBasedBuilder):

VERSION = datasets.Version("0.1.0")
BUILDER_CONFIGS = [
EmotionConfig(
name="emotion",
version=datasets.Version("1.0.0"),
description="Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise.",
)
]

def _info(self):
class_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=datasets.Features(
{"text": datasets.Value("string"), "label": datasets.ClassLabel(names=class_names)}
{"tweet": datasets.Value("string"), "emotion": datasets.ClassLabel(names=class_names)}
),
supervised_keys=("text", "label"),
homepage=_URL,
citation=_CITATION,
task_templates=[TextClassification(labels=class_names, text_column="tweet", label_column="emotion")],
)

def _split_generators(self, dl_manager):
@@ -63,4 +80,4 @@ def _generate_examples(self, filepath):
csv_reader = csv.reader(csv_file, delimiter=";")
for id_, row in enumerate(csv_reader):
text, label = row
yield id_, {"text": text, "label": label}
yield id_, {"tweet": text, "emotion": label}
4 changes: 4 additions & 0 deletions datasets/squad/squad.py
@@ -20,6 +20,7 @@
import json

import datasets
from datasets.tasks import QuestionAnswering


logger = datasets.logging.get_logger(__name__)
@@ -98,6 +99,9 @@ def _info(self):
supervised_keys=None,
homepage="https://rajpurkar.github.io/SQuAD-explorer/",
citation=_CITATION,
task_templates=[
QuestionAnswering(),
lewtun (Member) commented on May 10, 2021:

should we keep this here in this pr?

more generally, am i correct in understanding that we'll need to add the relevant tasks to each of the ~600 canonical datasets at some point?

SBrandeis (Contributor, PR author) replied:

This was merely a demonstrator for the syntax, to validate the API "feel" from the user perspective when adding a dataset.
I don't think it makes much sense to keep it in the PR unless we test something specific on SQuAD.

SBrandeis (Contributor, PR author) replied:

> more generally, am i correct in understanding that we'll need to add the relevant tasks to each of the ~600 canonical datasets at some point?

Yes... To make our lives easier, we need to find a way to infer supported tasks automatically for simple datasets, without changing the datasets scripts. For example, we could check if the dataset is already formatted as expected (no need for column renaming / casting).

lewtun (Member) replied on May 11, 2021:

good point re task detection! i wonder if we could do this with a simple check on whether the schema features are a subset of the dataset features, e.g.

from datasets import Features, load_dataset
from datasets.tasks import QuestionAnswering

squad = load_dataset("./datasets/squad", split="train")
qa = QuestionAnswering()
schema = Features({**qa.input_schema, **qa.label_schema})
assert all(item in squad.features.items() for item in schema.items())

],
)

def _split_generators(self, dl_manager):
27 changes: 27 additions & 0 deletions src/datasets/arrow_dataset.py
@@ -1372,6 +1372,33 @@ def with_transform(
dataset.set_transform(transform=transform, columns=columns, output_all_columns=output_all_columns)
return dataset

def prepare_for_task(self, task: str) -> "Dataset":
"""Prepares a dataset for the given task.

Casts :attr:`datasets.DatasetInfo.features` according to a task-specific schema.

Args:
task (``str``): One of the compatible tasks in ['text_classification', 'question_answering']
"""
tasks = [template.task for template in (self.info.task_templates or [])]
compatible_templates = [template for template in (self.info.task_templates or []) if template.task == task]
if not compatible_templates:
raise ValueError(f"Task {task} is not compatible with this dataset! Available tasks: {tasks}")

if len(compatible_templates) > 1:
raise ValueError(
f"""Expected 1 task template but found {len(compatible_templates)}! Please ensure that \
:attr:`datasets.DatasetInfo.task_templates` contains a unique set of task types."""
)
template = compatible_templates[0]
column_mapping = template.column_mapping
columns_to_drop = [column for column in self.column_names if column not in column_mapping]
dataset = self.remove_columns(columns_to_drop)
# TODO(sbrandeis): Add support for unnesting columns too
lewtun (Member) commented:

hey @SBrandeis could you clarify what you mean by "unnesting columns" here?

do you mean that Datasets.rename_columns should support columns with nested features like answers.text in SQuAD:

features=datasets.Features(
    {
        "id": datasets.Value("string"),
        "title": datasets.Value("string"),
        "context": datasets.Value("string"),
        "question": datasets.Value("string"),
        "answers": datasets.features.Sequence(
            {
                "text": datasets.Value("string"),
                "answer_start": datasets.Value("int32"),
            }
        ),
    }
)

SBrandeis (Contributor, PR author) replied:

Hi @lewtun! Yes that's correct

dataset = dataset.rename_columns(column_mapping)
dataset = dataset.cast(features=template.features)
return dataset

def _getitem(
self,
key: Union[int, slice, str],
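As a self-contained illustration of prepare_for_task, the following sketch attaches a template to an in-memory dataset and prepares it. It assumes the TextClassification signature used in the emotion diff earlier; the toy column names ("review", "sentiment", "extra") are invented for the example:

from datasets import Dataset, DatasetInfo
from datasets.tasks import TextClassification

# Toy dataset with non-canonical column names plus an extra column.
info = DatasetInfo(
    task_templates=[TextClassification(labels=["neg", "pos"], text_column="review", label_column="sentiment")]
)
ds = Dataset.from_dict(
    {"review": ["great film", "terrible film"], "sentiment": [1, 0], "extra": ["a", "b"]},
    info=info,
)
prepared = ds.prepare_for_task("text_classification")
# "extra" is dropped, the remaining columns are renamed to the template's
# canonical names, and the features are cast to the template's schema.
print(prepared.column_names)  # expected: ['text', 'label']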
5 changes: 5 additions & 0 deletions src/datasets/builder.py
@@ -204,6 +204,7 @@
name: Optional[str] = None,
hash: Optional[str] = None,
features: Optional[Features] = None,
task=None,
**config_kwargs,
):
"""Constructs a DatasetBuilder.
@@ -226,6 +227,7 @@
# DatasetBuilder name
self.name: str = camelcase_to_snakecase(self.__class__.__name__)
self.hash: Optional[str] = hash
self.task = task

# Prepare config: DatasetConfig contains name, version and description but can be extended by each dataset
config_kwargs = {key: value for key, value in config_kwargs.items() if value is not None}
@@ -813,6 +815,9 @@ def _build_single_dataset(
)
else:
ds.info.features = self.info.post_processed.features
# Rename and cast features to match task schema
if self.task is not None:
ds = ds.prepare_for_task(self.task)

return ds

21 changes: 21 additions & 0 deletions src/datasets/info.py
@@ -39,6 +39,7 @@
from . import config
from .features import Features, Value
from .splits import SplitDict
from .tasks import TaskTemplate, task_template_from_dict
from .utils import Version
from .utils.logging import get_logger

@@ -108,6 +109,7 @@ class DatasetInfo:
post_processing_size (int, optional):
dataset_size (int, optional):
size_in_bytes (int, optional):
task_templates (List[TaskTemplate], optional): TODO(sbrandeis) - document this
"""

# Set in the dataset scripts
Expand All @@ -131,6 +133,8 @@ class DatasetInfo:
dataset_size: Optional[int] = None
size_in_bytes: Optional[int] = None

task_templates: Optional[List[TaskTemplate]] = None

def __post_init__(self):
# Convert back to the correct classes when we reload from dict
if self.features is not None and not isinstance(self.features, Features):
@@ -150,6 +154,19 @@ def __post_init__(self):
else:
self.supervised_keys = SupervisedKeysData(**self.supervised_keys)

if self.task_templates is not None:
if isinstance(self.task_templates, (list, tuple)):
templates = [
template if isinstance(template, TaskTemplate) else task_template_from_dict(template)
for template in self.task_templates
]
self.task_templates = [template for template in templates if template is not None]
elif isinstance(self.task_templates, TaskTemplate):
self.task_templates = [self.task_templates]
else:
template = task_template_from_dict(self.task_templates)
self.task_templates = [template] if template is not None else []

def _license_path(self, dataset_info_dir):
return os.path.join(dataset_info_dir, config.LICENSE_FILENAME)

@@ -189,13 +206,17 @@ def unique(values):
features = None
supervised_keys = None

# Can be prepared for a pipeline if all members can also be prepared for this specific pipeline
task_templates = None # TODO(sbrandeis)
lewtun (Member) commented:

@SBrandeis can you clarify what is meant by this TODO? in particular, what does "members" refer to?

SBrandeis (Contributor, PR author) replied:

Of course 😄

For context, this method is called when concatenating datasets with the concatenate_datasets method. The DatasetInfo of the concatenated Dataset object is constructed using DatasetInfo.from_merge: https://github.com/huggingface/datasets/blob/master/src/datasets/arrow_dataset.py#L3098

What I meant here is that it would be nice to propagate the task_templates from the component Dataset objects to the resulting concatenated Dataset object (I'm happy to reformulate if it's not clear 😅). "Members" refers to the component Dataset objects (the ones being concatenated).

If all the component Datasets can be prepared for the question_answering task / pipeline, then the resulting concatenated Dataset can also be prepared for the question_answering task.

Let me know if this helps 😄

lewtun (Member) replied:

thanks a lot - this is now perfectly clear! i'll have a stab at it in this pr


return cls(
description=description,
citation=citation,
homepage=homepage,
license=license,
features=features,
supervised_keys=supervised_keys,
task_templates=task_templates,
)

@classmethod
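To see the __post_init__ normalization above in action, here is a small sketch: templates serialized as plain dicts are converted back to TaskTemplate instances, and unrecognized entries are filtered out. The dict keys mirror QuestionAnswering.from_dict as shown later in this diff:

from datasets import DatasetInfo
from datasets.tasks import QuestionAnswering

info = DatasetInfo(
    task_templates=[
        {
            "task": "question_answering",
            "question_column": "question",
            "context_column": "context",
            "answer_start_column": "answer_start",
            "answer_end_column": "answer_end",
        },
        {"task": "not-a-known-task"},  # unknown task name -> dropped by the None filter
    ]
)
assert isinstance(info.task_templates[0], QuestionAnswering)
assert len(info.task_templates) == 1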
2 changes: 2 additions & 0 deletions src/datasets/load.py
@@ -635,6 +635,7 @@ def load_dataset(
save_infos: bool = False,
script_version: Optional[Union[str, Version]] = None,
use_auth_token: Optional[Union[bool, str]] = None,
task: Optional[str] = None,
SBrandeis (Contributor, PR author) commented:

Should we add TaskTemplate here too? 👀

lewtun (Member) replied:

good catch!

**config_kwargs,
) -> Union[DatasetDict, Dataset]:
"""Load a dataset.
@@ -730,6 +731,7 @@
data_files=data_files,
hash=hash,
features=features,
task=task,
**config_kwargs,
)

21 changes: 21 additions & 0 deletions src/datasets/tasks/__init__.py
@@ -0,0 +1,21 @@
from typing import Optional

from .base import TaskTemplate
from .question_answering import QuestionAnswering
from .text_classification import TextClassification


__all__ = ["TaskTemplate", "QuestionAnswering", "TextClassification"]


NAME2TEMPLATE = {QuestionAnswering.task: QuestionAnswering, TextClassification.task: TextClassification}


def task_template_from_dict(task_template_dict: dict) -> Optional[TaskTemplate]:
A reviewer (Member) commented:

Is this function supposed to be used by the user? If so, it should have a docstring and documentation.

lewtun (Member) replied:

no this is not supposed to be used by the user (at least for now). still no harm in having a docstring, so i've added:

"""Create one of the supported task templates in :obj:`datasets.tasks` from a dictionary."""

task_name = task_template_dict.get("task")
if task_name is None:
return None
template = NAME2TEMPLATE.get(task_name)
if template is None:
return None
return template.from_dict(task_template_dict)
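A quick sketch of the dispatch behavior above; the dict keys follow the from_dict implementations in this PR:

from datasets.tasks import QuestionAnswering, task_template_from_dict

template = task_template_from_dict(
    {
        "task": "question_answering",
        "question_column": "question",
        "context_column": "context",
        "answer_start_column": "answer_start",
        "answer_end_column": "answer_end",
    }
)
assert isinstance(template, QuestionAnswering)
# A missing or unknown "task" key falls through to None rather than raising.
assert task_template_from_dict({"task": "tagging"}) is None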
26 changes: 26 additions & 0 deletions src/datasets/tasks/base.py
@@ -0,0 +1,26 @@
import abc
from dataclasses import dataclass
from typing import Dict

from ..features import Features


@dataclass
class TaskTemplate(abc.ABC):
task: str
input_schema: Features
label_schema: Features

@property
def features(self) -> Features:
return Features(**self.input_schema, **self.label_schema)

@property
@abc.abstractmethod
def column_mapping(self) -> Dict[str, str]:
return NotImplemented

@classmethod
@abc.abstractmethod
def from_dict(cls, template_dict: dict) -> "TaskTemplate":
return NotImplemented
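For contributors adding new tasks, a hypothetical third template would follow the same pattern as the two concrete templates in this PR; the Summarization name and its columns are invented for illustration:

from typing import Dict

from datasets import Features, Value
from datasets.tasks import TaskTemplate


class Summarization(TaskTemplate):
    # Class-level task name and schemas, mirroring QuestionAnswering below.
    task = "summarization"
    input_schema = Features({"document": Value("string")})
    label_schema = Features({"summary": Value("string")})

    def __init__(self, document_column: str = "document", summary_column: str = "summary"):
        self.document_column = document_column
        self.summary_column = summary_column

    @property
    def column_mapping(self) -> Dict[str, str]:
        return {self.document_column: "document", self.summary_column: "summary"}

    @classmethod
    def from_dict(cls, template_dict: dict) -> "Summarization":
        return cls(
            document_column=template_dict["document_column"],
            summary_column=template_dict["summary_column"],
        )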
40 changes: 40 additions & 0 deletions src/datasets/tasks/question_answering.py
@@ -0,0 +1,40 @@
from typing import Dict

from ..features import Features, Value
from .base import TaskTemplate


class QuestionAnswering(TaskTemplate):
A reviewer (Member) commented:

QuestionAnsweringExtractive ?

lewtun (Member) replied:

thanks for the suggestion! i'll keep this unchanged for now since it will soon be overhauled by #2371

task = "question_answering"
input_schema = Features({"question": Value("string"), "context": Value("string")})
A reviewer (Member) commented:

Maybe you want to check with a few QA datasets that this schema makes sense. NaturalQuestions and TriviaQA, for example, can be good second datasets to compare against, to be sure of the generality of the schema.

A good recent list of QA datasets to compare schemas across is, for instance, in the UnitedQA paper: https://arxiv.org/abs/2101.00178

lewtun (Member) replied:

thanks for the tip! added to #2371

label_schema = Features({"answer_start": Value("int32"), "answer_end": Value("int32")})

def __init__(
self,
question_column: str = "question",
context_column: str = "context",
answer_start_column: str = "answer_start",
answer_end_column: str = "answer_end",
):
self.question_column = question_column
self.context_column = context_column
self.answer_start_column = answer_start_column
self.answer_end_column = answer_end_column

@property
def column_mapping(self) -> Dict[str, str]:
return {
self.question_column: "question",
self.context_column: "context",
self.answer_start_column: "answer_start",
self.answer_end_column: "answer_end",
}

@classmethod
def from_dict(cls, template_dict: dict) -> "QuestionAnswering":
return cls(
question_column=template_dict["question_column"],
context_column=template_dict["answer_column"],
answer_start_column=template_dict["answer_start_column"],
answer_end_column=template_dict["answer_end_column"],
)
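Finally, a short usage sketch of the template above, mapping non-canonical column names (invented here) onto the canonical question-answering schema:

from datasets.tasks import QuestionAnswering

qa = QuestionAnswering(
    question_column="query",
    context_column="passage",
    answer_start_column="span_start",
    answer_end_column="span_end",
)
# column_mapping tells prepare_for_task how to rename the dataset's columns.
print(qa.column_mapping)
# {'query': 'question', 'passage': 'context',
#  'span_start': 'answer_start', 'span_end': 'answer_end'}
# features merges input_schema and label_schema into a single Features object.
print(qa.features)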