Copy tokenizer files in each of their repo #10624

Merged
sgugger merged 5 commits into master from tokenizer_file_map on Mar 10, 2021

Conversation

@sgugger (Collaborator) commented Mar 10, 2021

What does this PR do?

This PR cleans up the maps in the tokenizer files to make sure each checkpoint has its own tokenization files. This will allow us to remove custom code that mapped some checkpoints to special files (like BART using the RoBERTa vocab files) and to take full advantage of the versioning system for those checkpoints. The tokenizer files for all affected checkpoints have been copied to the corresponding model repos in parallel.

For instance, to accommodate the move for the fast BART tokenizers, the following commits have been made on the model hub:

In this PR, I've also made the structure of the maps uniform across models, to make it easier to alter (and ultimately remove) them in the future via automated scripts.
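To make the change concrete, here is a minimal sketch of the before/after layout of such a map, assuming the `PRETRAINED_VOCAB_FILES_MAP` convention used in the tokenizer modules; the checkpoint list and URLs below are abbreviated illustrations, not copied from the actual diff.

```python
# Illustrative sketch only, not the actual diff.
# Before: BART checkpoints reused the RoBERTa vocabulary files through shared URLs.
ROBERTA_VOCAB_URL = "https://huggingface.co/roberta-large/resolve/main/vocab.json"
ROBERTA_MERGES_URL = "https://huggingface.co/roberta-large/resolve/main/merges.txt"

_OLD_BART_CHECKPOINTS = ["facebook/bart-base", "facebook/bart-large"]  # abbreviated

OLD_PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {ckpt: ROBERTA_VOCAB_URL for ckpt in _OLD_BART_CHECKPOINTS},
    "merges_file": {ckpt: ROBERTA_MERGES_URL for ckpt in _OLD_BART_CHECKPOINTS},
}

# After: every checkpoint points to files hosted in its own model repo,
# and the same flat layout is used across tokenizer modules.
NEW_PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/vocab.json",
        "facebook/bart-large": "https://huggingface.co/facebook/bart-large/resolve/main/vocab.json",
    },
    "merges_file": {
        "facebook/bart-base": "https://huggingface.co/facebook/bart-base/resolve/main/merges.txt",
        "facebook/bart-large": "https://huggingface.co/facebook/bart-large/resolve/main/merges.txt",
    },
}
```

With every entry pointing at files hosted in the checkpoint's own repo, the hub's per-repo versioning applies directly to the tokenizer files, which is what removing the special-case mappings buys us.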

@LysandreJik (Member) left a comment

Fantastic work! Pinging @thomwolf and @julien-c for review, as this is the first step of removing archive maps from the repository.

@julien-c (Member)

Love it!

Maybe a good practice would be to link to a sample of the related commits on hf.co: for instance, https://huggingface.co/facebook/bart-base/commit/c2469fb7e666a5c5629a161f17c9ef23c85217f7

@sgugger (Collaborator, Author) commented Mar 10, 2021

I think I did around 50 of them in various repos to move all the tokenizer files, so it's a bit hard to keep track of all of them.

@julien-c (Member) commented Mar 10, 2021

Yep, just link one, or a small sample.

It makes it easier to see what this PR entails on the hf-hub side.

sgugger merged commit 2295d78 into master on Mar 10, 2021
sgugger deleted the tokenizer_file_map branch on March 10, 2021 at 16:26
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* Move tokenizer files in each repo

* Fix mBART50 tests

* Fix mBART tests

* Fix Marian tests

* Update templates