
[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)#17457

Merged

patrickvonplaten merged 8 commits into huggingface:main from patrickvonplaten:correct_saving_format_json on May 31, 2022

Conversation

@patrickvonplaten (Contributor) commented May 27, 2022

What does this PR do?

As an example, see: https://huggingface.co/facebook/wav2vec2-base-100h/commit/9c1fef36b62a428a658e5b022ef9f21b38f47e0b

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

patrickvonplaten changed the title from "[Json dump] Make json prettier for all processors" to "[Json dump] Make json prettier for all saved tokenizeir filese" on May 27, 2022
patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizeir filese" to "[Json dump] Make json prettier for all saved tokenizer files" on May 27, 2022
@HuggingFaceDocBuilderDev commented May 27, 2022

The documentation is not available anymore as the PR was closed or merged.

patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizer files" to "[Json dump] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" on May 27, 2022
patrickvonplaten changed the title from "[Json dump] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" to "[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)" on May 27, 2022
tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2)

# make sure that all ".json" files are saved in the correct format
for file_path in tokenizer_r_files + tokenizer_p_files:
Contributor (author):

Pretty aggressive addition to the test, but it should make sure that, from now on, all tokenizer JSON files have the correct format.

Member:

Smart!

with tempfile.TemporaryDirectory() as tmpdirname:
-    feat_extract_first.save_pretrained(tmpdirname)
+    saved_file = feat_extract_first.save_pretrained(tmpdirname)
+    check_json_file_has_correct_format(saved_file)
Contributor (author):

verify that feature extractor configs also have correct structure
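Both this test and the tokenizer test above rely on a check_json_file_has_correct_format helper, whose body is not shown in the snippets here. As a rough, hedged sketch only (the actual implementation in transformers may differ), it could look something like this:

import json


def check_json_file_has_correct_format(file_path):
    # Hypothetical sketch of the format check used by the tests above.
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    if len(lines) == 1:
        # Trivial payloads like "{}" may legitimately stay on a single line.
        assert lines[0].strip() in ("{}", "[]"), f"{file_path} is not pretty-printed"
    else:
        # Pretty-printed output of json.dumps(..., indent=2) starts with an
        # opening brace, indents nested keys by two spaces, and ends with a
        # closing brace followed by a newline.
        assert lines[0].strip() == "{"
        assert lines[1].startswith("  ")
        assert lines[-1].strip() == "}"

    # Independently of layout, the file must still parse as valid JSON.
    with open(file_path, "r", encoding="utf-8") as f:
        json.load(f)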

url = self._push_to_hub(repo, commit_message=commit_message)
logger.info(f"Feature extractor pushed to the hub in this commit: {url}")

return [output_feature_extractor_file]
Contributor (author):

@sgugger - analogous to the tokenizers, let's also return a list of saved files for the feature extractors, no? I don't think this can break anything.

Collaborator:

Works for me!
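For illustration, with this change a feature extractor's save_pretrained returns the list of files it wrote, mirroring the tokenizer API. A minimal usage sketch (checkpoint name and target directory are just examples, taken from the PR description):

from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-100h")

# Previously this call returned nothing; after the PR it returns the saved
# file paths, e.g. ["./tmp_feat_extract/preprocessor_config.json"].
saved_files = feature_extractor.save_pretrained("./tmp_feat_extract")
print(saved_files)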

@patrickvonplaten (Contributor, author) commented May 27, 2022

@julien-c @sgugger do you think it could make sense to open a huge automated batch of PRs to correct all tokenizer configs? Or is that maybe too much, given that we have 80,000 checkpoints?

I don't think it could break anything, but I'm still not sure it makes sense.

@julien-c (Member):
would be a good stress test i guess =)


with open(vocab_file, "w", encoding="utf-8") as f:
-    f.write(json.dumps(self.encoder, ensure_ascii=False))
+    f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")
Member:

are we sure we want to sort_keys?

Contributor (author):

Think it's nice - any downside to doing this?

Member:

consistency with already published models

Contributor (author):

The model configs are already sorted: https://huggingface.co/facebook/opt-350m/blob/main/config.json. But it's true that not sorting would be more consistent with existing tokenizer configs. I think I'd still be in favor of sorting here, though.

Member:

ok, I didn't remember configs were already sorted. Sounds good to me then!
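To make the effect of the new serialization arguments concrete, here is a small illustrative snippet (not taken from the PR) comparing the old single-line dump with the new indented, key-sorted output:

import json

encoder = {"world": 2, "hello": 1, "ä": 3}

# Old behaviour: one long, unsorted line.
old = json.dumps(encoder, ensure_ascii=False)
# {"world": 2, "hello": 1, "ä": 3}

# New behaviour: two-space indentation, sorted keys, non-ASCII kept as-is,
# plus a trailing newline so the file ends cleanly.
new = json.dumps(encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
# {
#   "hello": 1,
#   "world": 2,
#   "ä": 3
# }

print(old)
print(new)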

@sgugger (Collaborator) left a comment:

Nice new tests! Thanks for taking care of it!


@LysandreJik (Member) left a comment:

LGTM, thanks for taking care of that task @patrickvonplaten!


patrickvonplaten merged commit f394a2a into huggingface:main on May 31, 2022
patrickvonplaten deleted the correct_saving_format_json branch on May 31, 2022 at 15:07
Narsil pushed a commit to Narsil/transformers that referenced this pull request Jun 7, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)

* [Json dump] Make json prettier

* correct more tokenizeirs

* more patterns

* add aggressive test

* the aggressive test was actually useful :-)

* more tests

* Apply suggestions from code review
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Jun 16, 2022
…ure same json format for all processors (tok + feat_extract) (huggingface#17457)
