Add versioning system to fast tokenizer files #12713
Conversation
might be cleaner if this worked in the other direction, i.e.
Makes more sense on the Hub side as well. What do you think?
@julien-c this would break repositories that rely on Were we to update the
I think your assertion depends on what kind of changes are made to the JSON files. If it's only new attributes, for example, I wouldn't expect older versions to break, but from what I understand you're actually talking about modifying existing attributes?
Yes, the attributes actually need to be modified. For example, see this issue: #9633. There was an offset mappings bug, which needed to be patched. However, the issue lived in the I believe there are other issues, and there will be more as the libraries continue to evolve. Implementing this here allows us to ensure that the previous versions remain completely unaffected, while offering a way to patch models for future use.
Ok cool! Played with it and it seems to be working well. LGTM, thanks a lot for working on it, this should really help with tokenizer files going forward!
```python
new_tokenizer = AutoTokenizer.from_pretrained(tmp_dir)
self.assertEqual(len(new_tokenizer), len(tokenizer) + 1)
json_tokenizer = json.loads(new_tokenizer._tokenizer.to_str())
self.assertIn("huggingface", json_tokenizer["model"]["vocab"])
```
Nice test!
```python
# Will need to be adjusted if we reach v42 and this test is still here.
# Should pick the old tokenizer file as the version of Transformers is < 4.0.0
```
Let's think about it next century :)
when we near AGI this library will publish itself and who knows which version number it will pick
going on a whim here, but what about using git branches to do this?
The problem with a new branch is that we then can't have a new version of the model in a new git branch that has to be used with one tokenizer file if the Transformers version is old, and another one if it is more recent. It also wouldn't be compatible with users selecting their own branch (though in that case they should make sure to have the right tokenizer file for their version).

The key here (for more context) is that we have tokenizers with a "wrong" tokenizer file for more recent versions of Tokenizers (controlled by the version of Transformers), because there was a bug in the conversion from slow to fast tokenizer script. We can't touch the main branch and the `tokenizer.json` file, otherwise all code in production using those models would suddenly break (the changes are significant, sadly).
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
What does this PR do?
Some changes cannot be made to the fast tokenizer files without breaking backward compatibility. This PR introduces a versioning system by allowing a model repo to contain multiple tokenizer files: `tokenizer.json` is the default one, and if one (or several) `tokenizer.x.y.z.json` files exist, those files are used for version x.y.z (of Transformers) and above.

cc @n1t0 as it should be helpful to solve that longstanding bug.
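The selection rule described above can be sketched as a small standalone function. This is a hypothetical illustration of the scheme, not the code this PR adds to `transformers`; the function name, signature, and version parsing are invented for the example:

```python
import re

def pick_tokenizer_file(repo_files, transformers_version):
    """Pick the tokenizer file for a given Transformers version.

    A ``tokenizer.x.y.z.json`` file applies to Transformers version x.y.z
    and above; among all applicable versioned files, the newest wins.
    If none applies, fall back to the default ``tokenizer.json``.
    """
    def parse(v):
        # Turn "4.9.0" into (4, 9, 0) so tuples compare numerically.
        return tuple(int(part) for part in v.split("."))

    pattern = re.compile(r"^tokenizer\.(\d+\.\d+\.\d+)\.json$")
    current = parse(transformers_version)
    best_name, best_version = "tokenizer.json", None
    for name in repo_files:
        match = pattern.match(name)
        if match is None:
            continue
        file_version = parse(match.group(1))
        # Keep the highest versioned file that is still <= our version.
        if file_version <= current and (best_version is None or file_version > best_version):
            best_name, best_version = name, file_version
    return best_name
```

For example, with `["tokenizer.json", "tokenizer.4.0.0.json"]` in the repo, a Transformers 4.9.0 install picks `tokenizer.4.0.0.json`, while a 3.5.1 install keeps the old `tokenizer.json`, which is exactly how existing deployments stay unaffected.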