
Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load #6935

Merged: 10 commits into huggingface:master on Oct 8, 2020

Conversation

@w4nderlust (Contributor) commented on Sep 4, 2020:

The TransformerXL tokenizer requires torch because it uses torch.load to load the vocabulary. This means that if I'm using the TF2 implementation, I have to add torch as a dependency just for that. So I replaced the torch.load call with a call to pickle.load (which is what torch.load uses internally) to solve the issue.

I ran all the TransformerXL-related tests (including the slow ones) and they all passed.
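
A minimal sketch of the idea (the helper name here is illustrative, not the actual method in tokenization_transfo_xl.py, and it assumes the vocabulary file is a plain pickled dict):

```python
import pickle


def load_pretrained_vocab(pretrained_vocab_file):
    # Previously something like: vocab_dict = torch.load(pretrained_vocab_file),
    # which drags in torch as a dependency just to unpickle a dict.
    with open(pretrained_vocab_file, "rb") as f:
        vocab_dict = pickle.load(f)
    return vocab_dict
```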

@thomwolf (Member) commented on Sep 4, 2020:

Hi @w4nderlust that's a good idea!

The CI doesn't seem happy, though, and the failures look related to your changes.

Do you think you could take a look?

@w4nderlust (Contributor, Author) replied:
@thomwolf I'm trying to see the details of the failing tests, but CircleCI wants me to log in with GitHub and grant access to all my repos and orgs, which I'd prefer to avoid.

If you can point out the failing tests, I'm happy to take a look.

@w4nderlust (Contributor, Author) commented on Sep 4, 2020:

@thomwolf I inspected further and this is what I discovered:

  1. I was running the modeling_transfo_xl.py tests, but I should have been running the tokenization_transfo_xl.py tests. I imagine the errors in CI are coming from there.
  2. Upon further inspection, I noticed that inside tokenization_transfo_xl.py torch is used everywhere. There was probably a design decision behind that which I'm not aware of, but as a user I would ask you to reconsider, because if the tokenizer uses torch, the TF version of TransformerXL can never be used without installing torch. The extent to which torch is used goes beyond my current familiarity with the library, so I will refrain from proposing modifications to it, apart from what I propose in the next point.
  3. In my commit I replaced the loading of the vocab, but upon inspection I realized that, yes, torch uses pickle, but it does so in a peculiar way, with magic numbers, protocol versions, and custom logic that would take some time to reverse engineer (https://github.com/pytorch/pytorch/blob/0c01f136f3c8d16f221d641befcb5a74142bbeb1/torch/serialization.py#L764-L774). It doesn't seem possible to load the vocab dictionary directly without re-implementing quite a bit of torch's load code, and that wouldn't be a sound approach anyway, because PyTorch itself could switch to a new load mechanism in the future. So in my last commit I ALSO replaced the torch.save usages in the vocabulary-saving code with pickle.dump (lines 260-262). The effect is that all the test_tokenization_transfo_xl.py tests now pass and the vocab can be saved and loaded, but, because the vocab that ships with the pretrained models was originally saved with torch, loading from a pretrained model doesn't work (what gets loaded is just the torch magic number). So here I guess you have to make a call: if you want to use pickle to load and save the vocab, this PR does that for you, but you would also have to update the shipped TransformerXL pretrained model, replacing the vocab file saved with PyTorch with one saved with pickle. Converting the current vocab file is straightforward, essentially pickle.dump(torch.load(vocab_file), open(vocab_file, "wb")), as sketched below.

As I realized the issue is bigger than I originally thought, it would be great if someone could look at it in more detail from the HF side.
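
For reference, the one-off conversion mentioned in point 3 could look like this (a hedged sketch; vocab_path is a placeholder, and it assumes torch.load can still read the shipped file):

```python
import pickle

import torch

vocab_path = "vocab.bin"  # hypothetical path to the shipped torch-saved vocab file

# Load the torch-serialized vocab dict, then rewrite it as a plain pickle
# so the tokenizer can load it without a torch dependency.
vocab_dict = torch.load(vocab_path)
with open(vocab_path, "wb") as f:
    pickle.dump(vocab_dict, f)
```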

@thomwolf (Member) commented:
Hi @w4nderlust, ok, I'm getting to this PR now.

The original tokenizer for Transformer-XL was copied from the original research code so that we could import the trained checkpoints. The reliance on PyTorch is thus not really a design decision of ours but rather the original author's.

We can definitely reconsider it and if you don't mind, I'll try to build upon your PR to relax this reliance on PyTorch while keeping backward compatibility if possible.

codecov bot commented on Sep 25, 2020:

Codecov Report

Merging #6935 into master will increase coverage by 1.96%.
The diff coverage is 79.16%.


@@            Coverage Diff             @@
##           master    #6935      +/-   ##
==========================================
+ Coverage   74.71%   76.67%   +1.96%     
==========================================
  Files         194      181      -13     
  Lines       39407    35738    -3669     
==========================================
- Hits        29441    27401    -2040     
+ Misses       9966     8337    -1629     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/transformers/file_utils.py | 83.09% <60.00%> (+0.17%) ⬆️ |
| src/transformers/tokenization_transfo_xl.py | 42.48% <84.21%> (+0.74%) ⬆️ |
| src/transformers/configuration_reformer.py | 21.62% <0.00%> (-78.38%) ⬇️ |
| src/transformers/modeling_reformer.py | 16.71% <0.00%> (-77.89%) ⬇️ |
| src/transformers/modeling_tf_xlm.py | 18.73% <0.00%> (-74.53%) ⬇️ |
| src/transformers/modeling_tf_flaubert.py | 21.53% <0.00%> (-68.15%) ⬇️ |
| src/transformers/tokenization_marian.py | 32.75% <0.00%> (-66.38%) ⬇️ |
| src/transformers/trainer_utils.py | 60.63% <0.00%> (-20.14%) ⬇️ |
| src/transformers/trainer.py | 55.70% <0.00%> (-15.11%) ⬇️ |
| src/transformers/integrations.py | 29.00% <0.00%> (-5.66%) ⬇️ |

... and 71 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@LysandreJik (Member) left a comment:

Other than the nitpick, LGTM.

```diff
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def require_torch(fn):
```
@LysandreJik (Member) commented on the diff:

Nitpick: is require_torch the best name for that method, given that it already exists in testing_utils but works slightly differently?

I fear this would make autocompletion more ambiguous.

@sgugger (Collaborator) left a comment:

Looks good to me, thanks for solving this issue! However, these changes are also needed in the fast tokenizer, no?

While we're in this file, could you check the full docstrings of the two tokenizers (fast and slow) and add documentation for the arguments I didn't know how to explain? That would be super awesome.

```diff
-        torch.save(self.__dict__, vocab_file)
+        with open(vocab_file, "wb") as f:
+            pickle.dump(self.__dict__, f)
+        # torch.save(self.__dict__, vocab_file)
```
@sgugger (Collaborator) commented on the diff:

This line doesn't need to be there anymore now, does it?

@w4nderlust (Contributor, Author) replied:

Yes, this shouldn't be needed anymore, it was there temporarily to make the point ;)

```diff
@@ -165,7 +169,8 @@ def __init__(
         lower_case=False,
         delimiter=None,
         vocab_file=None,
-        pretrained_vocab_file=None,
+        pretrained_vocab_file: str = None,
```
@patrickvonplaten (Contributor) commented on the diff:

This breaks backwards compatibility for people who have saved a torch tokenizer file locally and keep passing it as pretrained_vocab_file, no? Is it maybe possible to keep pretrained_vocab_file as before and check whether it is a torch or a pickle file?
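
A minimal sketch of that fallback idea (the function name is illustrative; it relies on the observation above that unpickling a torch-saved file yields just torch's magic number, an integer, rather than the dict):

```python
import pickle


def load_vocab_dict(pretrained_vocab_file):
    # Try the new plain-pickle format first.
    with open(pretrained_vocab_file, "rb") as f:
        vocab_dict = pickle.load(f)

    # A torch-serialized file unpickles to torch's magic number (an int)
    # instead of the vocab dict, so fall back to torch.load in that case.
    if isinstance(vocab_dict, int):
        import torch  # only needed for legacy torch-saved vocab files

        vocab_dict = torch.load(pretrained_vocab_file)
    return vocab_dict
```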

@w4nderlust (Contributor, Author) replied:

Minor suggestion: in Ludwig I adopted the convention that a variable ending in _file is an opened file object, while one ending in _fp is a string containing a file path. That may help resolve the ambiguity.

But Patrick is right, changing from one type to another right now may break backwards compatibility.

@LysandreJik (Member) replied:

Doesn't break compatibility anymore!

@LysandreJik (Member) left a comment:

@thomwolf @patrickvonplaten @sgugger updated the code to reflect the review comments. Didn't do the fast tokenizer as it is getting removed in #7141.

```diff
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def torch_only_method(fn):
```
@LysandreJik (Member) commented on the diff:

Changed the name to torch_only_method, no strong opinions.
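
For context, a sketch of what such a guard decorator might look like (illustrative, not necessarily the exact body in file_utils.py):

```python
import functools


def torch_only_method(fn):
    # Raise a clear error when torch is missing, instead of failing
    # on an import deep inside the wrapped call.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            import torch  # noqa: F401
        except ImportError as e:
            raise ImportError(f"You need to install PyTorch to use {fn.__name__}.") from e
        return fn(*args, **kwargs)

    return wrapper
```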


```diff
         raise ValueError(
             "Unable to parse file {}. Unknown format. "
             "If you tried to load a model saved through TransfoXLTokenizerFast, "
             "please note they are not compatible.".format(pretrained_vocab_file)
-        )
+        ) from e
```
@LysandreJik (Member) commented on the diff:

This ensures that we keep the ImportError message in the error.
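
A standalone illustration of exception chaining with raise ... from e (not the tokenizer code itself): the original ImportError stays attached to the raised ValueError as its explicit cause.

```python
def parse_vocab(path):
    try:
        import torch  # may be absent in a TF-only environment
    except ImportError as e:
        # Chaining with "from e" keeps the ImportError attached to the
        # ValueError, so the traceback shows the real root cause.
        raise ValueError(f"Unable to parse file {path}. Unknown format.") from e
```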

@thomwolf (Member) left a comment:

Awesome @LysandreJik

@thomwolf merged commit 4d04120 into huggingface:master on Oct 8, 2020.
@w4nderlust (Contributor, Author) commented:

Thank you for the work on this! Much appreciated! :)

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020:
Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load (huggingface#6935)

* Replaced torch.load for loading the pretrained vocab of TransformerXL to pickle.load

* Replaced torch.save with pickle.dump when saving the vocabulary

* updating transformer-xl

* uploaded on S3 - compatibility

* fix tests

* style

* Address review comments

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020:
Revert "Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load (huggingface#6935)"

This reverts commit a81970f.