
Add versioning system to fast tokenizer files #12713

Merged: 7 commits into master on Jul 21, 2021
Conversation

@sgugger (Collaborator) commented Jul 14, 2021

What does this PR do?

Some changes cannot be made to the fast tokenizer files without breaking backward compatibility. This PR introduces a versioning system by allowing a model repo to contain multiple tokenizer files: tokenizer.json is the default, and if one (or several) tokenizer.x.y.z.json files exist, each is used for Transformers version x.y.z and above.
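
For illustration, here is a minimal sketch of the selection rule described above. This is not the exact code merged in this PR; the function name and the use of the packaging library are assumptions.

```python
# Hypothetical sketch of the file-selection rule, not the merged implementation.
# Assumes the `packaging` library is available for version parsing.
from packaging import version


def select_tokenizer_file(available_files, transformers_version):
    """Pick tokenizer.json by default; prefer the tokenizer.x.y.z.json
    with the highest x.y.z that is still <= the installed Transformers version."""
    current = version.parse(transformers_version)
    best_name, best_threshold = "tokenizer.json", None
    for name in available_files:
        if not (name.startswith("tokenizer.") and name.endswith(".json")):
            continue
        middle = name[len("tokenizer."):-len(".json")]
        if not middle:  # plain tokenizer.json, already the default
            continue
        try:
            threshold = version.parse(middle)
        except version.InvalidVersion:
            continue  # not a tokenizer.x.y.z.json file
        # A versioned file applies to version x.y.z and above, so its threshold
        # must be <= the current version; among applicable files, take the highest.
        if threshold <= current and (best_threshold is None or threshold > best_threshold):
            best_name, best_threshold = name, threshold
    return best_name


# Example: a repo with a default file and one fixed for 4.0.0 and above.
files = ["tokenizer.json", "tokenizer.4.0.0.json"]
print(select_tokenizer_file(files, "3.5.1"))   # -> tokenizer.json
print(select_tokenizer_file(files, "4.10.0"))  # -> tokenizer.4.0.0.json
```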

cc @n1t0 as it should be helpful to solve that longstanding bug.

@sgugger requested a review from LysandreJik on July 14, 2021 21:06
@julien-c (Member)
Might be cleaner if this worked in the other direction, i.e.:

> multiple tokenizer files: tokenizer.json is the default one, used in the most recent version of Transformers. If one or more tokenizer-x.y.z.json exist, those files are used for the version x.y.z (of Transformers) and below.

Makes more sense on the Hub side as well. What do you think?

@LysandreJik (Member)
@julien-c this would break repositories that rely on transformers versions earlier than the first one that ships this feature.

Were we to update the tokenizer.json file to the new, "fixed" one, and add a new tokenizer-x.x.x.json file to be used by earlier versions of transformers, we would have no way of telling versions < 4.10.0 to use that file rather than the standard tokenizer.json.

@julien-c (Member)
I think your assertion depends on what kind of changes are made to the JSON files. If it's only new attributes, for example, I wouldn't expect older versions to break; but from what I understand, you're actually talking about modifying existing attributes?

@LysandreJik (Member)
Yes, the attributes actually need to be modified. For example, see this issue: #9633

There was an offset-mappings bug that needed to be patched. However, the issue lived in the tokenizer.json file itself, so the recommended fix was for users to recompile that file: load from the "slow" tokenizer files and let the newer tokenizers version generate the updated file.

I believe there are other issues, and there will be other issues as the libraries continue to evolve. Implementing this here allows us to ensure that the previous versions remain completely unaffected - while offering a way to patch models for future use.
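
For context, that recompilation workflow looks roughly like this; a sketch, assuming a transformers version that supports the from_slow flag, and the model identifier below is a placeholder.

```python
# Illustrative patch workflow: force a rebuild of the fast tokenizer from the
# slow tokenizer files, then save so an updated tokenizer.json is written out.
# "some-org/affected-model" is a placeholder, not a real repo.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/affected-model", from_slow=True)
tokenizer.save_pretrained("./patched-tokenizer")  # writes the regenerated tokenizer.json
```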

@LysandreJik left a comment:

Ok cool! Played with it and it seems to be working well. LGTM, thanks a lot for working on it, this should really help with tokenizer files going forward!

Review thread on src/transformers/file_utils.py (outdated, resolved).
# Reload the tokenizer saved in tmp_dir; the added token should survive the round trip.
new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
self.assertEqual(len(new_tokenizer), len(tokenizer) + 1)
# Inspect the serialized fast tokenizer and check the added token is in the vocab.
json_tokenizer = json.loads(new_tokenizer._tokenizer.to_str())
self.assertIn("huggingface", json_tokenizer["model"]["vocab"])
Nice test!

Comment on lines +115 to +116
# Will need to be adjusted if we reach v42 and this test is still here.
# Should pick the old tokenizer file as the version of Transformers is < 4.0.0

Let's think about it next century :)


When we near AGI, this library will publish itself, and who knows which version number it will pick.

@julien-c (Member)

> Yes, the attributes actually need to be modified. For example, see this issue: #9633

Going on a whim here, but what about using git branches to do this?

@sgugger (Collaborator, Author) commented Jul 17, 2021

The problem with a new branch is that we then can't ship a new version of the model in a new git branch that needs one tokenizer file for old versions of Transformers and another one for more recent versions. It also wouldn't be compatible with users selecting their own branch (though in that case they should make sure that branch has the right version of the tokenizer file).

The key here (for more context) is that some models have a "wrong" tokenizer file for more recent versions of tokenizers (whose version is controlled by the version of Transformers) because there was a bug in the slow-to-fast conversion script. We can't touch the tokenizer.json file on the main branch, otherwise all code in production using those models would suddenly break (the changes are significant, sadly).
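
To make the constraint concrete, a hypothetical repo layout under the merged scheme could look like this (the version number is illustrative): older releases keep reading the unchanged default file, while releases at or above the threshold pick up the fixed one.

```
model-repo (main branch)
├── tokenizer.json          # original (buggy) file, still read by older Transformers
└── tokenizer.4.0.0.json    # fixed file, read by Transformers >= 4.0.0
```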

sgugger and others added 3 commits July 17, 2021 16:18
@sgugger merged commit 786ced3 into master on Jul 21, 2021
@sgugger deleted the tokenizer_file_version branch on July 21, 2021 12:24