Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix opus_gnome dataset card #4806

Merged
merged 12 commits into from
Aug 9, 2022

Conversation

gojiteji
Copy link
Contributor

@gojiteji gojiteji commented Aug 9, 2022

I fixed a issue #4805.

I changed "gnome" to "opus_gnome" in README.md.

Fix #4805

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Aug 9, 2022

The documentation is not available anymore as the PR was closed or merged.

@gojiteji gojiteji closed this Aug 9, 2022
@albertvillanova
Copy link
Member

@gojiteji why have you closed this PR and created an identical one?

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

@albertvillanova
I forgot to follow "How to create a Pull" in CONTRIBUTING.md in this branch.

@albertvillanova
Copy link
Member

Both are identical. And you can push additional commits to this branch.

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

I see. Thank you for your comment.

@albertvillanova
Copy link
Member

albertvillanova commented Aug 9, 2022

Anyway, @gojiteji thanks for your contribution and this fix.

@albertvillanova
Copy link
Member

Once you have modified the opus_gnome dataset card, our Continuous Integration test suite performs some tests on it that make some additional requirements: the errors that appear have nothing to do with your contribution, but with these additional quality requirements.

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

the errors that appear have nothing to do with your contribution, but with these additional quality requirements.

Is there anything I should do?

@albertvillanova
Copy link
Member

If you would like to address them as well in this PR, it would be awesome: https://github.com/huggingface/datasets/runs/7741104780?check_suite_focus=true

@albertvillanova
Copy link
Member

albertvillanova commented Aug 9, 2022

These are the 2 error messages:

E           ValueError: The following issues have been found in the dataset cards:
E           README Validation:
E           The following issues were found for the README at `/home/runner/work/datasets/datasets/datasets/opus_gnome/README.md`:
E           -	No first-level heading starting with `Dataset Card for` found in README. Skipping further validation for this README.

E           The following issues have been found in the dataset cards:
E           YAML tags:
E           Could not validate the metadata, found the following errors:
E           * field 'language':
E           	['ara', 'cat', 'foo', 'gr', 'nqo', 'tmp'] are not registered tags for 'language', reference at https://github.com/huggingface/datasets/tree/main/src/datasets/utils/resources/languages.json

@albertvillanova
Copy link
Member

In principle there are 2 errors:

The first one says, the title of the README does not start with Dataset Card for:

  • The README title is: # Dataset Card Creation Guide
  • According to the template here, it should be: # Dataset Card for [Dataset Name]

@albertvillanova
Copy link
Member

In relation with the languages:

  • you should check whether the language codes are properly spelled
  • and if so, adding them to our languages.json file, so that they are properly validated

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

Thank you for the detailed information. I'm checking it now.

@albertvillanova
Copy link
Member

E           ValueError: The following issues have been found in the dataset cards:
E           README Validation:
E           The following issues were found for the README at `/home/runner/work/datasets/datasets/datasets/opus_gnome/README.md`:
E           -	Expected some content in section `Data Instances` but it is empty.
E           -	Expected some content in section `Data Fields` but it is empty.
E           -	Expected some content in section `Data Splits` but it is empty.

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

I added ara, cat, gr, and nqo to languages.json and removed foo and tmp from README.md.
I also write Data Instances, Data Fields, and Data Splits in README.md.

@albertvillanova
Copy link
Member

Thanks for your investigation and fixes to the dataset card structure! I'm just making some suggestions before merging this PR: see below.

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

Should I create PR for config.json to add ara cat gr nqo first?
I think I can pass this failing after that.

Or removing ara, cat, gr, nqo, foo, tmp from README.md.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestions for the language codes.

datasets/opus_gnome/README.md Show resolved Hide resolved
@@ -130,7 +129,6 @@ language:
- th
- tk
- tl
- tmp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a deprecated language tag: it should be replaced by tyj.

However, tyj should be added to our languages.json:

"tyj": "Tai Do; Tai Yo",

See reference here: https://iso639-3.sil.org/request/2015-019

@@ -58,6 +58,7 @@
"ar-TD": "Arabic (Chad)",
"ar-TN": "Arabic (Tunisia)",
"ar-YE": "Arabic (Yemen)",
"ara":"Arabic",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there exist several codes (ISO 639 1, 2, 3) for the same language, we just use the lower one. For Arabic, we use ar.

As ar already appears in the README, I would remove ara from the README and would not add it to languages.json

Suggested change
"ara":"Arabic",

@@ -139,6 +140,7 @@
"byn": "Blin; Bilin",
"bzd": "Bribri",
"ca": "Catalan",
"cat": "Catalan; Valencian",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as above: for "Catalan; Valencian" we use ca instead. I would remove cat from README and from here.

Suggested change
"cat": "Catalan; Valencian",

@@ -523,6 +525,7 @@
"gom": "Goan Konkani",
"gor": "Gorontalo",
"got": "Gothic",
"GR": "Greek",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Than language code for Greek is el. I would remove gr from README and from here.

Suggested change
"GR": "Greek",

src/datasets/utils/resources/languages.json Show resolved Hide resolved
@albertvillanova
Copy link
Member

Once you address these issues, all the CI tests will pass.

gojiteji and others added 3 commits August 9, 2022 18:43
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
Copy link
Contributor Author

@gojiteji gojiteji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your request, and sorry for the many CI tests runs.

@albertvillanova albertvillanova changed the title document fix in opus_gnome dataset Fix opus_gnome dataset card Aug 9, 2022
@albertvillanova
Copy link
Member

Once the remaining changes are addressed (see unresolved above), we will be able to merge this:

  • Remove "ara" from README
  • Remove "cat" from README
  • Remove "gr" from README
  • Replace "tmp" with "tyj" in README
  • Add "tyj" to languages.json:
    "tyj": "Tai Do; Tai Yo",
    

@gojiteji
Copy link
Contributor Author

gojiteji commented Aug 9, 2022

I did the five changes.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @gojiteji for the original fix and all the subsequent improvements to the dataset documentation card.

@albertvillanova albertvillanova merged commit cdf268b into huggingface:main Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong example in opus_gnome dataset card
3 participants