Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve save and load repocard metadata #355

Merged
merged 19 commits into from Sep 27, 2021
Merged

Conversation

elishowk
Copy link
Contributor

@elishowk elishowk commented Sep 20, 2021

Problem

Loading and Saving repocard shouldn't modify implicitely the YAML.

  • loading and saving override the current newlines format.
  • also pyyaml changes "no" to False and encode UTF-8 so that round-trip is impossible.

Summary

Newlines

  • Python3 by default read files and normalize new lines. This PR takes care of that.

Replacing pyyaml by ruamel.yaml

  • Added a test case RepocardTest.test_metadata_roundtrip that works fine with ruamel.yaml and not with pyyaml.

How tests fail with pyyaml ?

Bug 1 : 'no' string parsed to False

________________________________________ RepocardTest.test_metadata_roundtrip ________________________________________

self = <tests.test_repocard.RepocardTest testMethod=test_metadata_roundtrip>

    def test_metadata_roundtrip(self):
        filename = "dummy_target.md"
        filepath = Path(REPOCARD_DIR) / filename
        filepath.write_text(ROUND_TRIP_MODELCARD_CASE)
        metadata = metadata_load(filepath)
>       self.assertDictEqual({
            "language": "no",
            "datasets": "CLUECorpusSmall",
            "widget": [{ "text": "北京是[MASK]国的首都。" }],
        }, metadata)
E       AssertionError: {'language': 'no', 'datasets': 'CLUECorpusSmall', 'wid[30 chars]。'}]} != {'language': False, 'datasets': 'CLUECorpusSmall', 'wi[31 chars]。'}]}
E         {'datasets': 'CLUECorpusSmall',
E       -  'language': 'no',
E       ?              ^^^^
E
E       +  'language': False,
E       ?              ^^^^^
E
E          'widget': [{'text': '北京是[MASK]国的首都。'}]}

tests/test_repocard.py:132: AssertionError

Bug 2 : UTF-8 string broken.

________________________________________ RepocardTest.test_metadata_roundtrip ________________________________________

self = <tests.test_repocard.RepocardTest testMethod=test_metadata_roundtrip>

    def test_metadata_roundtrip(self):
        filename = "dummy_target.md"
        filepath = Path(REPOCARD_DIR) / filename
        filepath.write_text(ROUND_TRIP_MODELCARD_CASE)
        metadata = metadata_load(filepath)
        metadata_save(filepath, metadata)
        content = filepath.read_text()
>       self.assertEqual(content, ROUND_TRIP_MODELCARD_CASE)
E       AssertionError: '\n---\nlanguage: false\ndatasets: CLUECorpusSmall\nwidget:[88 chars]le\n' != '\n---\nlanguage: no\ndatasets: CLUECorpusSmall\nwidget:\n-[35 chars]le\n'
E
E         ---
E       - language: false
E       ?           ^^^^^
E       + language: no
E       ?           ^^
E         datasets: CLUECorpusSmall
E         widget:
E       - - text: "\u5317\u4EAC\u662F[MASK]\u56FD\u7684\u9996\u90FD\u3002"
E       + - text: 北京是[MASK]国的首都。
E         ---
E
E         # Title

tests/test_repocard.py:139: AssertionError

@elishowk elishowk self-assigned this Sep 20, 2021
Copy link
Member

@julien-c julien-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the behavior when local_path does not exist, where we still want to create a file and output the metadata block

(@LysandreJik was right that we needed to add a test for this 😆 )

Other than that, looks good!

src/huggingface_hub/repocard.py Outdated Show resolved Hide resolved
@elishowk
Copy link
Contributor Author

This changes the behavior when local_path does not exist, where we still want to create a file and output the metadata block

(@LysandreJik was right that we needed to add a test for this laughing )

Other than that, looks good!

Fixed that and added a test

@elishowk
Copy link
Contributor Author

I'm proposing to replace pyyaml by ruamel.yaml, which is better maintained and optimized for safe loading with preservation of the existing formatting. WDYT @LysandreJik ?

@elishowk elishowk changed the title fix save and load keeping newline Improve save and load metadta, keeping newline Sep 21, 2021
@LysandreJik
Copy link
Member

I'd much rather we stay with the pyyaml which is an official package from the YAML organization. I'd rather we stay as lightweight as possible in terms of dependencies.

@julien-c
Copy link
Member

it's also what we use in our other codebases

@elishowk elishowk changed the title Improve save and load metadta, keeping newline Improve save and load metadata Sep 22, 2021
@elishowk elishowk added the enhancement New feature or request label Sep 22, 2021
elias@showk.me added 2 commits September 22, 2021 15:13
This reverts commit 0da1d94.
This reverts commit 9f0eb62.
@elishowk
Copy link
Contributor Author

elishowk commented Sep 22, 2021

The reason I choose ruamel.yaml instead of pyyaml is because the latter have really bad loading/dumping side effects unacceptable when we need a safe & conservative "round trip" loading/dumping. PyYAML's official indeed but full of not-fixed bugs.
Also PyYAML does not support YAML 1.2 yet whereas ruamel supports it by default (fixing the yes/no to boolean bug)

Real bugs found in the hub while using PyYAML

  • Example for UTF-8 bug :
-- text: "Fui a la librería a comprar un <mask>."
+- text: "Fui a la librer\xEDa a comprar un <mask>."
-language: no
+language: false

I pushed the rollback to PyYAML, but it's your call. It's not usable for my case anyway, I'd have to patch a local copy of hf_hub for scripting README.md metadata bulk-corrections

@LysandreJik
Copy link
Member

Thank you for the explanation @elishowk - this explanation, along with a PR description would be very useful to understand the core issue and the solution you provide, as it's impossible to review without having the full context.

Now that you have explained, I understand why you chose ruamel.yaml. Could you provide a sample use-case that you use in your setup that works with ruamel.yaml but doesn't with pyyaml, so that I may take a closer look? Thank you.

@elishowk
Copy link
Contributor Author

elishowk commented Sep 23, 2021

Hi @LysandreJik,

Sorry for the lack of context. I started this PR and discovered bugs on the road while working on real-world metadata, and didn't always remembered to explain all the context...

So, now I pushed back ruamel.yaml for your consideration, and I added 1 test case in test_repocard.py that shows and ensures the problem I encountered with pyyaml is solved with ruamel.yaml.

I completed the top-description of this PR #355 (comment) in order to show what happens with the new test case when keeping pyyaml.

I hope this give you all the elements to review 😅

Cheers

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation. Looks good, only left a nit.

src/huggingface_hub/repocard.py Outdated Show resolved Hide resolved
@LysandreJik
Copy link
Member

Thank you @elishowk! Feel free to merge when green.

@elishowk elishowk changed the title Improve save and load metadata Improve save and load repocard metadata Sep 27, 2021
@elishowk elishowk merged commit b26da2a into main Sep 27, 2021
@elishowk elishowk deleted the fix-repocard-save-and-load branch September 27, 2021 12:36
@elishowk elishowk mentioned this pull request Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants