Add versioning system to fast tokenizer files #12713
Conversation
might be cleaner if this worked in the other direction, i.e.
Makes more sense on the Hub side as well. What do you think?
@julien-c this would break repositories that rely on Were we to update the
I think your assertion depends on what kind of changes are made to the JSON files. If it's only new attributes, for example, I wouldn't expect older versions to break, but from what I understand you're actually talking about modifying existing attributes?
Yes, the attributes actually need to be modified. For example, see this issue: #9633. There was an offset mappings bug, which needed to be patched. However, the issue lived in the I believe there are other issues, and there will be more as the libraries continue to evolve. Implementing this here allows us to ensure that the previous versions remain completely unaffected, while offering a way to patch models for future use.
Ok cool! Played with it and it seems to be working well. LGTM, thanks a lot for working on it, this should really help with tokenizer files going forward!
```python
new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
self.assertEqual(len(new_tokenizer), len(tokenizer) + 1)
json_tokenizer = json.loads(new_tokenizer._tokenizer.to_str())
self.assertIn("huggingface", json_tokenizer["model"]["vocab"])
```
Nice test!
```python
# Will need to be adjusted if we reach v42 and this test is still here.
# Should pick the old tokenizer file as the version of Transformers is < 4.0.0
```
Let's think about it next century :)
when we near AGI this library will publish itself and who knows which version number it will pick
going on a whim here, but what about using git branches to do this?
The problem with a new branch is that we then can't have a new version of the model in a new git branch that has to be used with one tokenizer file if the Transformers version is old, and another one if it is more recent. It also wouldn't be compatible with users selecting their own branch (though in that case they should make sure to have the right tokenizer file for their version).

The key here (for more context) is that we have tokenizers with a "wrong" tokenizer file for more recent versions of Tokenizers (controlled by the version of Transformers), because there was a bug in the conversion from slow to fast tokenizer script. We can't touch the main branch and the `tokenizer.json` file, otherwise all code in production using those models would suddenly break (the changes are significant, sadly).
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
What does this PR do?
Some changes cannot be made to the fast tokenizer files without breaking backward compatibility. This PR introduces a versioning system by allowing a model repo to contain multiple tokenizer files: `tokenizer.json` is the default one, and if one (or several) `tokenizer.x.y.z.json` files exist, those files are used for version x.y.z (of Transformers) and above.

cc @n1t0 as it should be helpful to solve that longstanding bug.
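The selection rule described above can be sketched as a small standalone function. This is a hypothetical illustration of the scheme, not the code this PR adds to `transformers`; the function name, signature, and version parsing are invented for the example:

```python
import re

def pick_tokenizer_file(repo_files, transformers_version):
    """Pick the tokenizer file for a given Transformers version.

    A ``tokenizer.x.y.z.json`` file applies to Transformers version x.y.z
    and above; among all applicable versioned files, the newest wins.
    If none applies, fall back to the default ``tokenizer.json``.
    """
    def parse(v):
        # Turn "4.9.0" into (4, 9, 0) so tuples compare numerically.
        return tuple(int(part) for part in v.split("."))

    pattern = re.compile(r"^tokenizer\.(\d+\.\d+\.\d+)\.json$")
    current = parse(transformers_version)
    best_name, best_version = "tokenizer.json", None
    for name in repo_files:
        match = pattern.match(name)
        if match is None:
            continue
        file_version = parse(match.group(1))
        # Keep the highest versioned file that is still <= our version.
        if file_version <= current and (best_version is None or file_version > best_version):
            best_name, best_version = name, file_version
    return best_name
```

For example, with `["tokenizer.json", "tokenizer.4.0.0.json"]` in the repo, a Transformers 4.9.0 install picks `tokenizer.4.0.0.json`, while a 3.5.1 install keeps the old `tokenizer.json`, which is exactly how existing deployments stay unaffected.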