
[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)#17457

Merged

patrickvonplaten merged 8 commits into huggingface:main from patrickvonplaten:correct_saving_format_json on May 31, 2022

Conversation

@patrickvonplaten (Contributor) commented May 27, 2022

What does this PR do?

As an example, see: https://huggingface.co/facebook/wav2vec2-base-100h/commit/9c1fef36b62a428a658e5b022ef9f21b38f47e0b

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

patrickvonplaten changed the title from "[Json dump] Make json prettier for all processors" to "[Json dump] Make json prettier for all saved tokenizeir filese" on May 27, 2022
patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizeir filese" to "[Json dump] Make json prettier for all saved tokenizer files" on May 27, 2022
@HuggingFaceDocBuilderDev commented May 27, 2022

The documentation is not available anymore as the PR was closed or merged.

patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizer files" to "[Json dump] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" on May 27, 2022
patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" to "[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" on May 27, 2022
tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2)

# make sure that all ".json" files are saved in the correct format
for file_path in tokenizer_r_files + tokenizer_p_files:
Contributor (author):

Pretty aggressive addition to the test, but it should make sure that, from now on, all tokenizer JSON files have the correct format.

Member:

Smart!

with tempfile.TemporaryDirectory() as tmpdirname:
-    feat_extract_first.save_pretrained(tmpdirname)
+    saved_file = feat_extract_first.save_pretrained(tmpdirname)
+    check_json_file_has_correct_format(saved_file)
Contributor (author):

verify that feature extractor configs also have correct structure
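Both this test and the tokenizer test above rely on a check_json_file_has_correct_format helper, whose body is not shown in the snippets here. As a rough, hedged sketch only (the actual implementation in transformers may differ), it could look something like this:

import json


def check_json_file_has_correct_format(file_path):
    # Hypothetical sketch of the format check used by the tests above.
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    if len(lines) == 1:
        # Trivial payloads like "{}" may legitimately stay on a single line.
        assert lines[0].strip() in ("{}", "[]"), f"{file_path} is not pretty-printed"
    else:
        # Pretty-printed output of json.dumps(..., indent=2) starts with an
        # opening brace, indents nested keys by two spaces, and ends with a
        # closing brace followed by a newline.
        assert lines[0].strip() == "{"
        assert lines[1].startswith("  ")
        assert lines[-1].strip() == "}"

    # Independently of layout, the file must still parse as valid JSON.
    with open(file_path, "r", encoding="utf-8") as f:
        json.load(f)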

url = self._push_to_hub(repo, commit_message=commit_message)
logger.info(f"Feature extractor pushed to the hub in this commit: {url}")

return [output_feature_extractor_file]
Contributor (author):

@sgugger - analogous to the tokenizers, let's also return a list of saved files for the feature extractors, no? I don't think this can break anything.

Collaborator:

Works for me!
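For illustration, with this change a feature extractor's save_pretrained returns the list of files it wrote, mirroring the tokenizer API. A minimal usage sketch (checkpoint name and target directory are just examples, taken from the PR description):

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-100h")

# Previously this call returned nothing; after the PR it returns the saved
# file paths, e.g. ["./tmp_feat_extract/preprocessor_config.json"].
saved_files = feature_extractor.save_pretrained("./tmp_feat_extract")
print(saved_files)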

@patrickvonplaten (Contributor, author) commented May 27, 2022

@julien-c @sgugger do you think it could make sense to open a huge automated batch of PRs to correct all tokenizer configs? Or is that maybe too much, given that we have 80,000 checkpoints?

I don't think it could break anything, but I'm still not sure it makes sense.

@julien-c (Member):
would be a good stress test i guess =)


with open(vocab_file, "w", encoding="utf-8") as f:
-    f.write(json.dumps(self.encoder, ensure_ascii=False))
+    f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
Member:

are we sure we want to sort_keys?

Contributor (author):

Think it's nice - any downside to doing this?

Member:

consistency with already published models

Contributor (author):

The model configs are already sorted: https://huggingface.co/facebook/opt-350m/blob/main/config.json. But it's true that not sorting would be more consistent with existing tokenizer configs. I think I'd still be in favor of sorting here, though.

Member:

ok, I didn't remember configs were already sorted. Sounds good to me then!
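To make the effect of the new serialization arguments concrete, here is a small illustrative snippet (not taken from the PR) comparing the old single-line dump with the new indented, key-sorted output:

import json

encoder = {"world": 2, "hello": 1, "ä": 3}

# Old behaviour: one long, unsorted line.
old = json.dumps(encoder, ensure_ascii=False)
# {"world": 2, "hello": 1, "ä": 3}

# New behaviour: two-space indentation, sorted keys, non-ASCII kept as-is,
# plus a trailing newline so the file ends cleanly.
new = json.dumps(encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
# {
#   "hello": 1,
#   "world": 2,
#   "ä": 3
# }

print(old)
print(new)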

@sgugger (Collaborator) left a comment:

Nice new tests! Thanks for taking care of it!


@LysandreJik (Member) left a comment:

LGTM, thanks for taking care of that task @patrickvonplaten!


patrickvonplaten merged commit f394a2a into huggingface:main on May 31, 2022
patrickvonplaten deleted the correct_saving_format_json branch on May 31, 2022 at 15:07
Narsil pushed a commit to Narsil/transformers that referenced this pull request Jun 7, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)

* [Json dump] Make json prettier

* correct more tokenizeirs

* more patterns

* add aggressive test

* the aggressive test was actually useful :-)

* more tests

* Apply suggestions from code review
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)
