
Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load #6935

Merged: 10 commits into huggingface:master on Oct 8, 2020

Conversation

@w4nderlust (Contributor) commented on Sep 4, 2020:

The TransformerXL tokenizer requires torch because it uses torch.load to load the vocabulary. This means that if I'm using the TF2 implementation, I have to add torch as a dependency just for that. So I replaced the torch.load call with a call to pickle.load (which is what torch.load uses internally) to solve the issue.

I ran all the TransformerXL-related tests (including the slow ones) and they all passed.
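
A minimal sketch of the idea (the helper name here is illustrative, not the actual method in tokenization_transfo_xl.py, and it assumes the vocabulary file is a plain pickled dict):

```python
import pickle


def load_pretrained_vocab(pretrained_vocab_file):
    # Previously something like: vocab_dict = torch.load(pretrained_vocab_file),
    # which drags in torch as a dependency just to unpickle a dict.
    with open(pretrained_vocab_file, "rb") as f:
        vocab_dict = pickle.load(f)
    return vocab_dict
```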

@thomwolf (Member) commented on Sep 4, 2020:

Hi @w4nderlust that's a good idea!

The CI doesn't seem happy, though, and the failures look related to your changes.

Do you think you could take a look?

@w4nderlust (Contributor, Author) replied:
@thomwolf I'm trying to see the details of the failing tests, but CircleCI wants me to log in with GitHub and grant access to all my repos and orgs, which I'd prefer to avoid.

If you can point out the failing tests, I'm happy to take a look.

@w4nderlust (Contributor, Author) commented on Sep 4, 2020:

@thomwolf I inspected further and this is what I discovered:

  1. I was running the modeling_transfo_xl.py tests, but I should have been running the tokenization_transfo_xl.py tests. I imagine the errors in CI are coming from there.
  2. Upon further inspection, I noticed that inside tokenization_transfo_xl.py torch is used everywhere. There was probably a design decision behind that which I'm not aware of, but as a user I would ask you to reconsider, because if the tokenizer uses torch, the TF version of TransformerXL can never be used without installing torch. The extent to which torch is used goes beyond my current familiarity with the library, so I will refrain from proposing modifications to it, apart from what I propose in the next point.
  3. In my commit I replaced the loading of the vocab, but upon inspection I realized that, yes, torch uses pickle, but it does so in a peculiar way, with magic numbers, protocol versions, and custom logic that would take some time to reverse engineer (https://github.com/pytorch/pytorch/blob/0c01f136f3c8d16f221d641befcb5a74142bbeb1/torch/serialization.py#L764-L774). It doesn't seem possible to load the vocab dictionary directly without re-implementing quite a bit of torch's load code, and that wouldn't be a sound approach anyway, because PyTorch itself could switch to a new load mechanism in the future. So in my last commit I ALSO replaced the torch.save usages in the vocabulary-saving code with pickle.dump (lines 260-262). The effect is that all the test_tokenization_transfo_xl.py tests now pass and the vocab can be saved and loaded, but, because the vocab that ships with the pretrained models was originally saved with torch, loading from a pretrained model doesn't work (what gets loaded is just the torch magic number). So here I guess you have to make a call: if you want to use pickle to load and save the vocab, this PR does that for you, but you would also have to update the shipped TransformerXL pretrained model, replacing the vocab file saved with PyTorch with one saved with pickle. Converting the current vocab file is straightforward, essentially pickle.dump(torch.load(vocab_file), open(vocab_file, "wb")), as sketched below.

As I realized the issue is bigger than I originally thought, it would be great if someone could look at it in more detail from the HF side.
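
For reference, the one-off conversion mentioned in point 3 could look like this (a hedged sketch; vocab_path is a placeholder, and it assumes torch.load can still read the shipped file):

```python
import pickle

import torch

vocab_path = "vocab.bin"  # hypothetical path to the shipped torch-saved vocab file

# Load the torch-serialized vocab dict, then rewrite it as a plain pickle
# so the tokenizer can load it without a torch dependency.
vocab_dict = torch.load(vocab_path)
with open(vocab_path, "wb") as f:
    pickle.dump(vocab_dict, f)
```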

@thomwolf (Member) commented:
Hi @w4nderlust, ok, I'm getting to this PR now.

The original tokenizer for Transformer-XL was copied from the original research code so that we could import the trained checkpoints. The reliance on PyTorch is thus not really a design decision of ours but rather the original author's.

We can definitely reconsider it and if you don't mind, I'll try to build upon your PR to relax this reliance on PyTorch while keeping backward compatibility if possible.

codecov bot commented on Sep 25, 2020:

Codecov Report

Merging #6935 into master will increase coverage by 1.96%.
The diff coverage is 79.16%.


@@            Coverage Diff             @@
##           master    #6935      +/-   ##
==========================================
+ Coverage   74.71%   76.67%   +1.96%     
==========================================
  Files         194      181      -13     
  Lines       39407    35738    -3669     
==========================================
- Hits        29441    27401    -2040     
+ Misses       9966     8337    -1629     
| Impacted Files | Coverage Δ |
| --- | --- |
| src/transformers/file_utils.py | 83.09% <60.00%> (+0.17%) ⬆️ |
| src/transformers/tokenization_transfo_xl.py | 42.48% <84.21%> (+0.74%) ⬆️ |
| src/transformers/configuration_reformer.py | 21.62% <0.00%> (-78.38%) ⬇️ |
| src/transformers/modeling_reformer.py | 16.71% <0.00%> (-77.89%) ⬇️ |
| src/transformers/modeling_tf_xlm.py | 18.73% <0.00%> (-74.53%) ⬇️ |
| src/transformers/modeling_tf_flaubert.py | 21.53% <0.00%> (-68.15%) ⬇️ |
| src/transformers/tokenization_marian.py | 32.75% <0.00%> (-66.38%) ⬇️ |
| src/transformers/trainer_utils.py | 60.63% <0.00%> (-20.14%) ⬇️ |
| src/transformers/trainer.py | 55.70% <0.00%> (-15.11%) ⬇️ |
| src/transformers/integrations.py | 29.00% <0.00%> (-5.66%) ⬇️ |

... and 71 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@LysandreJik (Member) left a comment:

Other than the nitpick, LGTM.

```diff
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def require_torch(fn):
```
@LysandreJik (Member) commented on the diff:

Nitpick: is require_torch the best name for that method, given that it already exists in testing_utils but works slightly differently?

I fear this would make autocompletion more ambiguous.

@sgugger (Collaborator) left a comment:

Looks good to me, thanks for solving this issue! However, these changes are also needed in the fast tokenizer, no?

While we're in this file, could you check the full docstrings of the two tokenizers (fast and slow) and add documentation for the arguments I didn't know how to explain? That would be super awesome.

```diff
-        torch.save(self.__dict__, vocab_file)
+        with open(vocab_file, "wb") as f:
+            pickle.dump(self.__dict__, f)
+        # torch.save(self.__dict__, vocab_file)
```
@sgugger (Collaborator) commented on the diff:

This line doesn't need to be there anymore now, does it?

@w4nderlust (Contributor, Author) replied:

Yes, this shouldn't be needed anymore, it was there temporarily to make the point ;)

```diff
@@ -165,7 +169,8 @@ def __init__(
         lower_case=False,
         delimiter=None,
         vocab_file=None,
-        pretrained_vocab_file=None,
+        pretrained_vocab_file: str = None,
```
@patrickvonplaten (Contributor) commented on the diff:

This breaks backwards compatibility for people who have saved a torch tokenizer file locally and keep passing it as pretrained_vocab_file, no? Is it maybe possible to keep pretrained_vocab_file as before and check whether it is a torch or a pickle file?
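
A minimal sketch of that fallback idea (the function name is illustrative; it relies on the observation above that unpickling a torch-saved file yields just torch's magic number, an integer, rather than the dict):

```python
import pickle


def load_vocab_dict(pretrained_vocab_file):
    # Try the new plain-pickle format first.
    with open(pretrained_vocab_file, "rb") as f:
        vocab_dict = pickle.load(f)

    # A torch-serialized file unpickles to torch's magic number (an int)
    # instead of the vocab dict, so fall back to torch.load in that case.
    if isinstance(vocab_dict, int):
        import torch  # only needed for legacy torch-saved vocab files

        vocab_dict = torch.load(pretrained_vocab_file)
    return vocab_dict
```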

@w4nderlust (Contributor, Author) replied:

Minor suggestion: in Ludwig I adopted the convention that a variable ending in _file is an opened file object, while one ending in _fp is a string containing a file path. That may help resolve the ambiguity.

But Patrick is right, changing from one type to another right now may break backwards compatibility.

@LysandreJik (Member) replied:

Doesn't break compatibility anymore!

@LysandreJik (Member) left a comment:

@thomwolf @patrickvonplaten @sgugger updated the code to reflect the review comments. Didn't do the fast tokenizer as it is getting removed in #7141.

```diff
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def torch_only_method(fn):
```
@LysandreJik (Member) commented on the diff:

Changed the name to torch_only_method, no strong opinions.
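
For context, a sketch of what such a guard decorator might look like (illustrative, not necessarily the exact body in file_utils.py):

```python
import functools


def torch_only_method(fn):
    # Raise a clear error when torch is missing, instead of failing
    # on an import deep inside the wrapped call.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            import torch  # noqa: F401
        except ImportError as e:
            raise ImportError(f"You need to install PyTorch to use {fn.__name__}.") from e
        return fn(*args, **kwargs)

    return wrapper
```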


```diff
         raise ValueError(
             "Unable to parse file {}. Unknown format. "
             "If you tried to load a model saved through TransfoXLTokenizerFast, "
             "please note they are not compatible.".format(pretrained_vocab_file)
-        )
+        ) from e
```
@LysandreJik (Member) commented on the diff:

This ensures that we keep the ImportError message in the error.
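
A standalone illustration of exception chaining with raise ... from e (not the tokenizer code itself): the original ImportError stays attached to the raised ValueError as its explicit cause.

```python
def parse_vocab(path):
    try:
        import torch  # may be absent in a TF-only environment
    except ImportError as e:
        # Chaining with "from e" keeps the ImportError attached to the
        # ValueError, so the traceback shows the real root cause.
        raise ValueError(f"Unable to parse file {path}. Unknown format.") from e
```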

@thomwolf (Member) left a comment:

Awesome @LysandreJik

@thomwolf merged commit 4d04120 into huggingface:master on Oct 8, 2020.
@w4nderlust (Contributor, Author) commented:

Thank you for the work on this! Much appreciated! :)

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020:
Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load (huggingface#6935)

* Replaced torch.load for loading the pretrained vocab of TransformerXL to pickle.load

* Replaced torch.save with pickle.dump when saving the vocabulary

* updating transformer-xl

* uploaded on S3 - compatibility

* fix tests

* style

* Address review comments

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request on Nov 15, 2020:
Revert "Replaced torch.load for loading the pretrained vocab of TransformerXL tokenizer to pickle.load (huggingface#6935)"

This reverts commit a81970f.