[Json configs] Make json prettier for all saved tokenizer files & ensure same json format for all processors (tok + feat_extract)#17457
…to correct_saving_format_json
| tokenizer_p_files = tokenizer_p.save_pretrained(tmpdirname2) | ||
|
|
||
| # make sure that all ".json" files are saved in the correct format | ||
| for file_path in tokenizer_r_files + tokenizer_p_files: |
This is a pretty aggressive addition to the test; it should make sure that all tokenizer JSON files have the correct format from now on.
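For readers who want to see what such a format check could look like, here is a minimal sketch. It is not the `check_json_file_has_correct_format` helper used in this PR; `is_pretty_json` is a hypothetical name, and the rule it applies (the file must equal its own re-serialization with `indent=2` plus a trailing newline) is an assumption chosen for illustration.

```python
import json
import os
import tempfile


def is_pretty_json(file_path, indent=2, sort_keys=False):
    """Hypothetical check: True if the file is already pretty-printed.

    The file passes when its text is identical to what json.dumps would
    produce for the same data with the given indent, followed by "\n".
    """
    with open(file_path, encoding="utf-8") as f:
        text = f.read()
    expected = json.dumps(
        json.loads(text), indent=indent, sort_keys=sort_keys, ensure_ascii=False
    ) + "\n"
    return text == expected


# Usage: a compact dump fails the check, an indented dump passes it.
with tempfile.TemporaryDirectory() as tmpdirname:
    compact_path = os.path.join(tmpdirname, "compact.json")
    pretty_path = os.path.join(tmpdirname, "pretty.json")
    with open(compact_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"a": 1, "b": 2}))
    with open(pretty_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"a": 1, "b": 2}, indent=2) + "\n")
    print(is_pretty_json(compact_path), is_pretty_json(pretty_path))
```

Passing `sort_keys=True` makes the check stricter, since it then also requires the keys to appear in sorted order.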
| with tempfile.TemporaryDirectory() as tmpdirname: | ||
| feat_extract_first.save_pretrained(tmpdirname) | ||
| saved_file = feat_extract_first.save_pretrained(tmpdirname) | ||
| check_json_file_has_correct_format(saved_file) |
Verify that feature extractor configs also have the correct structure.
| url = self._push_to_hub(repo, commit_message=commit_message) | ||
| logger.info(f"Feature extractor pushed to the hub in this commit: {url}") | ||
|
|
||
| return [output_feature_extractor_file] |
@sgugger - analogous to tokenizers, let's also return a list of saved files for the feature extractors, no? I don't think this can break anything.
Would be a good stress test I guess =)
|
|
||
| with open(vocab_file, "w", encoding="utf-8") as f: | ||
| f.write(json.dumps(self.encoder, ensure_ascii=False)) | ||
| f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n") |
Are we sure we want `sort_keys`?
I think it's nice - any downside to doing this?
Consistency with already published models.
The model configs are already sorted: https://huggingface.co/facebook/opt-350m/blob/main/config.json, but it's true that not sorting would be more consistent with existing tokenizer configs. I think I'd still be in favor of sorting here, though.
OK, I didn't remember that configs were already sorted. Sounds good to me then!
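To make the effect of the changed `json.dumps` call concrete, here is a small standalone illustration. The `encoder` dictionary is made-up sample data, not a real vocabulary; only the `json.dumps` arguments are taken from the diff above.

```python
import json

# Made-up sample vocabulary; a real tokenizer's encoder is much larger.
encoder = {"world": 1, "hello": 0, "é": 2}

# Old behaviour: everything on a single line, keys in insertion order.
compact = json.dumps(encoder, ensure_ascii=False)

# New behaviour: one key per line, keys sorted, trailing newline.
pretty = json.dumps(encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

print(compact)
print(pretty, end="")

# Sorting only changes the order of keys in the file, not the mapping itself.
assert json.loads(pretty) == encoder
```

Note that `sort_keys=True` orders entries by token string rather than by id, so the on-disk order of a vocabulary file can differ from already published files even though the loaded mapping is identical.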
sgugger
left a comment
Nice new tests! Thanks for taking care of it!
LysandreJik
left a comment
LGTM, thanks for taking care of that task @patrickvonplaten!
…ure same json format for all processors (tok + feat_extract) (huggingface#17457) * [Json dump] Make json prettier * correct more tokenizeirs * more patterns * add aggressive test * the aggressive test was actually useful :-) * more tests * Apply suggestions from code review
What does this PR do?
As an example, see: https://huggingface.co/facebook/wav2vec2-base-100h/commit/9c1fef36b62a428a658e5b022ef9f21b38f47e0b