Dataset infos in yaml #4926
Conversation
datasets/conll2000/README.md
@@ -3,6 +3,98 @@ language:
- en
paperswithcode_id: conll-2000-1
pretty_name: CoNLL-2000
dataset_infos:
- config_name: conll2000
do we need the (non-default) config_name for backward compatibility?
The dataset explicitly defines one configuration with this name, but since there's only one there's no ambiguity. Let me tweak the YAML loading a bit and we can remove this line :)
I removed the config name for conll2000 in the YAML part, and for everything to connect smoothly I had to remove the definition of the config in the dataset script itself. I'll see if it's something I can do for all the datasets that have one configuration.
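For reference, the block this PR adds to the README metadata looks roughly like the following. This is a hedged sketch: the field names (features, splits, download_size, dataset_size) and all values are illustrative assumptions, not taken from the PR itself.

```yaml
dataset_infos:
- config_name: conll2000   # dropped later in this thread, since there is only one config
  features:                # illustrative field names and values
  - name: tokens
    sequence: string
  splits:
  - name: train
    num_examples: 8936     # placeholder value
  download_size: 3500000   # placeholder value
  dataset_size: 7000000    # placeholder value
```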
I modified the class_label YAML dump structure to show the label ids, which is more practical this way; see conll2000.
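The class_label structure mentioned here might look something like this, with the label ids exposed as YAML mapping keys. An illustrative sketch only: the feature name and tag names are assumptions, not copied from the PR.

```yaml
features:
- name: chunk_tags        # hypothetical feature name
  sequence:
    class_label:
      names:              # label ids as mapping keys
        '0': O
        '1': B-ADJP
        '2': I-ADJP
```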
This is ready for review! :)
cc @albertvillanova @polinaeterna @mariosasko WDYT ?
I can generate the YAML for all the other datasets in a subsequent PR
if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
    result[f.name] = value
To simplify the JSON and YAML dumps, I stop dumping the dataclass attributes that are equal to their default value.
e.g. decode=True for Image, or length=-1 for Sequence
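The condition above can be exercised in isolation. A minimal sketch, assuming a stand-in Value dataclass (not the library's actual class):

```python
from dataclasses import dataclass, fields

@dataclass
class Value:
    # stand-in feature type: `id` defaults to None, like the `id: null`
    # entries that used to clutter the dumps
    dtype: str
    id: str = None

def asdict_without_defaults(obj):
    # keep a field only when it was set to a non-default value
    # (or explicitly flagged via field metadata to always be included)
    result = {}
    for f in fields(obj):
        value = getattr(obj, f.name)
        if not f.init or value != f.default or f.metadata.get("include_in_asdict_even_if_is_default", False):
            result[f.name] = value
    return result

print(asdict_without_defaults(Value("string")))         # {'dtype': 'string'}
print(asdict_without_defaults(Value("string", "col")))  # {'dtype': 'string', 'id': 'col'}
```

Fields without a default (like dtype) always survive, because their default is the dataclasses MISSING sentinel, which never equals the stored value.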
Created #5018 where I added the YAML; see other dataset cards: imagenet-1k, glue, flores, gem
Great job!
One nit: the metadata generation in push_to_hub also needs to be updated, no?
PS: We should also probably use DatasetInfo from huggingface_hub instead of having our own implementation in info.py, but this can be addressed later.
love this change! i added a few suggestions/fixes for documentation :)
src/datasets/commands/test.py
@@ -50,7 +50,9 @@ def register_subcommand(parser: ArgumentParser):
        help="Can be used to specify a manual directory to get the files from.",
    )
    test_parser.add_argument("--all_configs", action="store_true", help="Test all dataset configurations")
    test_parser.add_argument("--save_infos", action="store_true", help="Save the dataset infos file")
    test_parser.add_argument(
        "--save_infos", action="store_true", help="Save the dataset infos in the dataset card (README.md)"
maybe change --save_infos to --save_info to be consistent with dataset_info instead of dataset_infos for users? should then be changed in documentation and docstrings and the code below too, if you agree with this change
Good idea! I changed it to --save_info and kept --save_infos as an alias.
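Keeping the old flag as an alias can be done directly with argparse, which accepts several option strings for one argument and derives the attribute name from the first long option. A minimal sketch, not the exact code in test.py:

```python
from argparse import ArgumentParser

parser = ArgumentParser(prog="datasets-cli test")
# --save_info is the canonical flag; --save_infos stays as a backward-compatible alias.
# argparse derives the attribute name (dest) from the first long option string.
parser.add_argument(
    "--save_info",
    "--save_infos",
    action="store_true",
    help="Save the dataset infos in the dataset card (README.md)",
)

print(parser.parse_args(["--save_infos"]).save_info)  # True
print(parser.parse_args([]).save_info)                # False
```

Both spellings set the same save_info attribute, so the rest of the command only needs to check one name.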
@@ -33,8 +33,9 @@
import json
import os
import posixpath
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union
from dataclasses import dataclass
just curious: what's the reason for not importing field explicitly and using just field instead of dataclasses.field below, like it was before?
field is used as a variable name in several places - this is just to avoid collisions
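A toy illustration of the collision being avoided (the dataclass and the loop variable here are hypothetical, not code from the PR):

```python
import dataclasses

@dataclasses.dataclass
class Example:
    # using dataclasses.field via the module path keeps it safe to reuse
    # the bare name `field` as an ordinary variable elsewhere in the file
    splits: dict = dataclasses.field(default_factory=dict)

# a local variable named `field` would shadow a top-level
# `from dataclasses import field` import; with the module path it cannot
for field in dataclasses.fields(Example):
    print(field.name)  # splits
```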
Co-authored-by: Polina Kazakova <polina@huggingface.co>
Took your comments into account and updated
Looks all good now!
To simplify the addition of new datasets, we'd like to have the dataset infos in the YAML and deprecate the dataset_infos.json file. YAML is readable and easy to edit, and the YAML metadata of the README already contains dataset metadata, so we would have everything in one place.
To be more specific, I moved these fields from DatasetInfo to the YAML:
Here is what I ended up with for squad, and it can be a list if there are several configs.
I already did the change for conll2000 and crime_and_punish as an example.

Implementation details
Load/Read

This is done via DatasetInfosDict.write_to_directory/from_directory.
I had to implement custom YAML export logic for SplitDict, Version and Features. The first two are trivial, but the logic for Features is more complicated, because I added a simplification step (or the YAML would be too long and less readable): it's just a formatting step to remove unnecessary nesting of YAML data.

Other changes
I had to update the DatasetModule factories to also download the README.md alongside the dataset scripts/data files, and not just the dataset_infos.json.

YAML validation

I removed the old validation code that was in metadata.py; now we can just use the Hub YAML validation.
Datasets-cli

The datasets-cli test --save_infos command now creates a README.md file with the dataset_infos in it, instead of a dataset_infos.json file.

Backward compatibility
dataset_infos.json files are still supported and loaded if they exist, to have full backward compatibility. Though I removed the unnecessary keys when the value is the default (like all the id: null from the Value feature types) to make them easier to read.

TODO
datasets
EDITS
Fix #4876 and fix #2773