Metadata validation #2107

Merged
merged 32 commits into from Apr 26, 2021

theo-m (Contributor) commented Mar 24, 2021

  • pydantic metadata schema with dedicated validators against our taxonomy
  • CI script to validate new changes against this schema and start a virtuous loop
  • soft validation on task ids since we expect the taxonomy to undergo some changes in the near future

for reference with the current validation we have 365 378 datasets with invalid metadata! full error report here.
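
A minimal sketch of the strict versus soft validation described in the list above. It is not the PR's pydantic schema: it uses only the standard library, and the taxonomy values, the metadata keys (`multilinguality`, `task_ids`), and the names `MetadataReport` and `validate_metadata` are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, hard-coded taxonomy. The PR reads its taxonomy from JSON
# resource files instead; these values are placeholders.
KNOWN_MULTILINGUALITIES = {"monolingual", "multilingual", "translation"}
KNOWN_TASK_IDS = {"extractive-qa", "sentiment-classification"}


@dataclass
class MetadataReport:
    """Outcome of one validation run (hypothetical helper type)."""

    errors: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # Only errors make the metadata invalid; warnings never do.
        return not self.errors


def validate_metadata(metadata: Dict[str, List[str]]) -> MetadataReport:
    """Check tag values against the taxonomy.

    Unknown multilinguality values are errors (strict validation).
    Unknown task ids are only warnings (soft validation), because the
    task taxonomy is expected to change.
    """
    report = MetadataReport()
    for value in metadata.get("multilinguality", []):
        if value not in KNOWN_MULTILINGUALITIES:
            report.errors.append(f"unknown multilinguality: {value!r}")
    for value in metadata.get("task_ids", []):
        if value not in KNOWN_TASK_IDS:
            report.warnings.append(f"unknown task id: {value!r}")
    return report
```

With this split, a CI job could fail only when `report.ok` is false and still print the warnings, so that unknown task ids stay visible without blocking a change.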

@theo-m theo-m requested a review from lhoestq March 24, 2021 08:58
lhoestq (Member) left a comment

This is really cool thanks :)

I left a few comments.

Also, I was wondering: is it really necessary to have utils.metadata as a submodule of datasets? This is only used by the CI, so I'm not sure we should have this in the actual datasets package.

theo-m (Contributor, Author) commented Mar 24, 2021

> Also, I was wondering: is it really necessary to have utils.metadata as a submodule of datasets? This is only used by the CI, so I'm not sure we should have this in the actual datasets package.

I'm unclear on the suggestion: would you rather have a root-level ./metadata.py file? I think it's fine where it is. If anything, we could move it out of utils and into datasets, as it could be used by e.g. DatasetDict so that users can pull the metadata easily rather than having to reparse the README.

lhoestq (Member) commented Mar 25, 2021

Ok that makes sense if we want to have functions that parse the metadata for users

@theo-m theo-m requested a review from lhoestq March 25, 2021 16:06
gchhablani (Contributor) commented

Hi @theo-m @lhoestq

This seems very interesting. Should I add the descriptions to the PR on datasets-tagging? Alternatively, I can also create a google-sheet/markdown table :)

Sorry for the delay in responding.

Thanks,
Gunjan

theo-m (Contributor, Author) commented Mar 26, 2021

> Hi @theo-m @lhoestq
>
> This seems very interesting. Should I add the descriptions to the PR on datasets-tagging? Alternatively, I can also create a google-sheet/markdown table :)
>
> Sorry for the delay in responding.
>
> Thanks,
> Gunjan

Hi @gchhablani, yes I think at the moment the best solution is for you to write in datasets-tagging, as the PR will allow us to discuss and review, even though the work will be ported to this repo in the end.
Or we wait for this to be merged and you reopen the PR here, your call :)

lhoestq (Member) left a comment

Looks good :)
Can you just add docstrings before we merge ?
Can you add tests as well ?

known_multilingualities, known_multilingualities_url = load_json_resource("multilingualities.json")


def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]:
Member

Suggested change
- def dict_from_readme(f: Path) -> Optional[Dict[str, List[str]]]:
+ def dict_from_readme(path: Path) -> Optional[Dict[str, List[str]]]:

Use explicit argument names.
Can you also add a docstring ?
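
To illustrate the review request, here is a hedged sketch of what the function could look like with an explicit argument name and a docstring. Only the signature comes from the suggestion above. The body is an assumption, not the PR's implementation: it hand-parses a simplified metadata block (a `---` delimited header containing `key:` lines followed by `- value` items), where a real implementation would more likely use a YAML parser.

```python
from pathlib import Path
from typing import Dict, List, Optional


def dict_from_readme(path: Path) -> Optional[Dict[str, List[str]]]:
    """Read the metadata block at the top of a README file.

    Assumption for this sketch: the block is delimited by two `---` lines
    and contains only `key:` lines, each followed by `- value` items.

    Args:
        path: Location of the README file to read.

    Returns:
        A dict mapping each key to its list of values, or None if the file
        does not start with a complete `---` delimited block.
    """
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no metadata block at the top of the file
    metadata: Dict[str, List[str]] = {}
    current_key: Optional[str] = None
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return metadata  # closing delimiter reached
        if stripped.startswith("- ") and current_key is not None:
            metadata[current_key].append(stripped[2:].strip())
        elif stripped.endswith(":"):
            current_key = stripped[:-1]
            metadata[current_key] = []
    return None  # opening delimiter without a closing one
```

Returning None for a missing or unterminated block keeps the `Optional` in the signature meaningful, so a caller can distinguish "no metadata" from "empty metadata".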

@SBrandeis SBrandeis requested a review from lhoestq April 23, 2021 11:23
SBrandeis (Contributor) commented

cc @abhi1thakur

@SBrandeis SBrandeis self-assigned this Apr 23, 2021
lhoestq (Member) left a comment

Thank you @theo-m @SBrandeis !

@SBrandeis SBrandeis merged commit bb42d5c into master Apr 26, 2021
@SBrandeis SBrandeis deleted the theo/config-validator branch April 26, 2021 08:27
4 participants