Replaced torch.load with pickle.load for loading the pretrained vocab of the TransformerXL tokenizer #6935
Conversation
Hi @w4nderlust, that's a good idea! The CI doesn't seem happy with the change, though, and the failures look related to your changes. Do you think you could take a look?
@thomwolf I'm trying to see the details of the failing tests, but CircleCI wants me to log in with GitHub and grant access to all my repos and orgs, which I prefer to avoid. If you can point out the failing tests, I'm happy to take a look at them.
@thomwolf I inspected further and this is what I discovered:
As I realized the issue is bigger than I originally thought, it would be great if someone from the HF side could look at it in more detail.
Hi @w4nderlust, ok, I'm reaching this PR now. The original tokenizer for Transformer-XL was copied from the original research work to be able to import the trained checkpoints. The reliance on PyTorch is thus not really a design decision of ours but more of the original author. We can definitely reconsider it, and if you don't mind, I'll try to build upon your PR to relax this reliance on PyTorch while keeping backward compatibility if possible.
Codecov Report
@@ Coverage Diff @@
## master #6935 +/- ##
==========================================
+ Coverage 74.71% 76.67% +1.96%
==========================================
Files 194 181 -13
Lines 39407 35738 -3669
==========================================
- Hits 29441 27401 -2040
+ Misses 9966 8337 -1629
Continue to review full report at Codecov.
Other than the nitpick, LGTM.
src/transformers/file_utils.py (outdated)
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def require_torch(fn):
Nitpick: is `require_torch` the best name for this method? It already exists in `testing_utils` but works slightly differently, and I fear this would make autocompletion more ambiguous.
Looks good to me, thanks for solving this issue! However, these changes are also needed in the fast tokenizer, no?
While we're in this file, could you check the full docstrings of the two tokenizers (fast and slow) and add documentation for the arguments I didn't know how to explain? That would be super awesome.
-        torch.save(self.__dict__, vocab_file)
+        with open(vocab_file, "wb") as f:
+            pickle.dump(self.__dict__, f)
+        # torch.save(self.__dict__, vocab_file)
This line doesn't need to be there anymore now, does it?
Yes, this shouldn't be needed anymore, it was there temporarily to make the point ;)
@@ -165,7 +169,8 @@ def __init__(
     lower_case=False,
     delimiter=None,
     vocab_file=None,
-    pretrained_vocab_file=None,
+    pretrained_vocab_file: str = None,
This breaks backwards compatibility for people that have saved a torch tokenizer file locally and continue passing it as `pretrained_vocab_file`, no? Is it maybe possible to keep `pretrained_vocab_file` as before and check whether it is a torch or a pickle file?
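A minimal sketch of the fallback the reviewer suggests: try `pickle.load` first, and only fall back to `torch.load` for legacy torch-saved files. This is a hypothetical helper (`load_vocab` is not a name from the PR), not the code that was merged.

```python
import pickle


def load_vocab(pretrained_vocab_file):
    """Load a vocabulary dict, trying pickle first and falling back to
    torch.load for legacy torch-saved files.

    Illustrative sketch only; the actual PR code may differ.
    """
    try:
        with open(pretrained_vocab_file, "rb") as f:
            return pickle.load(f)
    except (pickle.UnpicklingError, EOFError):
        # Legacy path: torch is only needed for files written by torch.save.
        import torch

        return torch.load(pretrained_vocab_file)
```

With this shape, users who pass an old torch-saved file keep working, while the common pickle path never imports torch.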
Minor suggestion: in Ludwig I adopted the convention that a variable ending in `_file` is an opened File, while one ending in `_fp` is a string containing a file path. That may help resolve the ambiguity.
But Patrick is right, changing from one type to another right now may break backwards compatibility.
Doesn't break compatibility anymore!
@thomwolf @patrickvonplaten @sgugger updated the code to reflect the review comments. Didn't do the fast tokenizers as they are getting removed in #7141.
@@ -190,6 +190,19 @@ def is_faiss_available():
     return _faiss_available


+def torch_only_method(fn):
Changed the name to `torch_only_method`, no strong opinions.
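A plausible sketch of what a `torch_only_method` decorator might look like: wrap the function and raise a helpful `ImportError` when torch is missing. The body below is an assumption about the idea, not the actual implementation in `src/transformers/file_utils.py`.

```python
from functools import wraps


def torch_only_method(fn):
    """Decorate a method so it fails with a clear message if PyTorch
    is not installed.

    Sketch of the idea; the real implementation may differ.
    """

    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            import torch  # noqa: F401
        except ImportError as e:
            raise ImportError(
                f"Method `{fn.__name__}` requires PyTorch to be installed."
            ) from e
        return fn(*args, **kwargs)

    return wrapper
```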
@@ -165,7 +169,8 @@ def __init__(
     lower_case=False,
     delimiter=None,
     vocab_file=None,
-    pretrained_vocab_file=None,
+    pretrained_vocab_file: str = None,
Doesn't break compatibility anymore!
         raise ValueError(
             "Unable to parse file {}. Unknown format. "
             "If you tried to load a model saved through TransfoXLTokenizerFast,"
             "please note they are not compatible.".format(pretrained_vocab_file)
-        )
+        ) from e
This ensures that we keep the `ImportError` message in the error.
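For context, a small self-contained illustration (not the PR's code; `load_or_fail` and the messages are made up) of how `raise ... from e` preserves the original exception in the chain:

```python
def load_or_fail(path):
    # Illustrative only: demonstrates exception chaining with `from e`.
    try:
        raise ImportError("No module named 'torch'")  # simulated failure
    except ImportError as e:
        raise ValueError(
            f"Unable to parse file {path}. Unknown format."
        ) from e


try:
    load_or_fail("vocab.bin")
except ValueError as err:
    # The original ImportError survives as err.__cause__, so its
    # message still appears in the traceback.
    print(type(err.__cause__).__name__)  # → ImportError
```

Without `from e`, Python would still show the original traceback as "During handling of the above exception...", but `__cause__` would be unset and the chaining would be implicit rather than explicit.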
Awesome @LysandreJik, thank you for the work on this! Much appreciated! :)
… tokenizer to pickle.load (huggingface#6935)

* Replaced torch.load for loading the pretrained vocab of TransformerXL to pickle.load
* Replaced torch.save with pickle.dump when saving the vocabulary
* updating transformer-xl
* uploaded on S3 - compatibility
* fix tests
* style
* Address review comments

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
…formerXL tokenizer to pickle.load (huggingface#6935)" This reverts commit a81970f.
The TransformerXL tokenizer requires torch to work because it uses torch.load to load the vocabulary. This means that if I'm using the TF2 implementation, I have to add torch as a dependency just for that. So I replaced the torch.load call with a call to pickle.load (which is what torch.load uses internally) to solve the issue.
Ran all the TransformerXL-related tests (also the slow ones) and they all passed.
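The core change described above can be sketched as follows. This is an illustration of the idea, assuming the saved vocabulary is a plain pickled dict; the function name is made up and the merged code in the tokenizer module may differ in details.

```python
import pickle


def load_pretrained_vocab(pretrained_vocab_file):
    """Load a pickled vocabulary dict without a torch dependency.

    Before this change the equivalent call was roughly:
        vocab_dict = torch.load(pretrained_vocab_file)
    which pulled in torch even for the TF2 code path.
    """
    with open(pretrained_vocab_file, "rb") as f:
        return pickle.load(f)
```

The saved attributes (the tokenizer's `__dict__`) round-trip through `pickle.dump`/`pickle.load` with only the standard library.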