Serialization error when tokenizer_config key matches function name in PreTrainedTokenizerBase #30796

Open
avnermay opened this issue May 14, 2024 · 2 comments
Labels
Core: Tokenization Internals of the library; Tokenization. Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

@avnermay

for k in target_keys:
    if hasattr(self, k):
        tokenizer_config[k] = getattr(self, k)

When one of the keys in self.init_kwargs matches the name of a function in PreTrainedTokenizerBase (e.g., add_special_tokens), this for loop replaces the value for that key in tokenizer_config with the function object. A function object is not serializable, so save_pretrained fails with an error.
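
The failure can be reproduced without the library. Below is a minimal sketch, assuming a toy class that only mimics the relevant loop; `ToyTokenizer` and `build_config` are hypothetical names for illustration and are not part of transformers.

```python
import json


class ToyTokenizer:
    """Hypothetical stand-in for PreTrainedTokenizerBase (not the real class)."""

    def __init__(self, **kwargs):
        self.init_kwargs = dict(kwargs)

    def add_special_tokens(self, tokens):
        # Any method works here; only its name matters for the clash.
        return len(tokens)

    def build_config(self):
        # Mirrors the loop quoted above: start from the init kwargs, then
        # overwrite each key with the attribute of the same name if one exists.
        tokenizer_config = dict(self.init_kwargs)
        target_keys = list(self.init_kwargs.keys())
        for k in target_keys:
            if hasattr(self, k):
                tokenizer_config[k] = getattr(self, k)
        return tokenizer_config


tok = ToyTokenizer(add_special_tokens=True, model_max_length=512)
config = tok.build_config()

try:
    json.dumps(config)
    serialization_error = None
except TypeError as err:
    # The value under "add_special_tokens" is now a bound method.
    serialization_error = str(err)
```

Here `config["add_special_tokens"]` ends up holding the bound method instead of `True`, and `json.dumps` raises a `TypeError`, which matches the error in the Stack Overflow question linked below.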

To solve this issue, one option is to add an assert in the __init__ function that throws an error if one of the keys matches an existing attribute/function on the PreTrainedTokenizerBase:

self.init_kwargs = copy.deepcopy(kwargs)
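
A rough sketch of what such a check could look like, again on a toy class rather than the real one. `ToyTokenizerBase` is a hypothetical name, and two choices here are my assumptions rather than anything decided in this thread: it raises a `ValueError` instead of using a bare `assert`, and it only flags keys that resolve to callables on the class, so keys that match properties or plain data attributes would still be accepted.

```python
import copy


class ToyTokenizerBase:
    """Hypothetical stand-in for PreTrainedTokenizerBase (not the real class)."""

    def __init__(self, **kwargs):
        # Look the name up on the class, not the instance, so that only
        # methods defined on the class hierarchy count as clashes.
        clashes = sorted(
            k for k in kwargs if callable(getattr(type(self), k, None))
        )
        if clashes:
            raise ValueError(
                f"init kwargs clash with method names on {type(self).__name__}: {clashes}"
            )
        self.init_kwargs = copy.deepcopy(kwargs)

    def add_special_tokens(self, tokens):
        return len(tokens)


# A non-clashing key is accepted as before.
ok = ToyTokenizerBase(model_max_length=512)

# A key that shadows a method is rejected at construction time.
try:
    ToyTokenizerBase(add_special_tokens=True)
    init_error = None
except ValueError as err:
    init_error = str(err)
```

Failing in `__init__` moves the error to the point where the bad key is introduced, instead of surfacing later during save_pretrained.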

This error was also reported in the Stack Overflow question below:
https://stackoverflow.com/questions/78062739/huggingface-transformers-error-when-saving-model-typeerror-object-of-type-meth

@amyeroberts
Collaborator

cc @ArthurZucker

@amyeroberts added the "Core: Tokenization" label (Internals of the library; Tokenization.) on May 14, 2024

@ArthurZucker
Collaborator

Yep, this is known. I remember saying that I'd rather have a failure than duplicate attribute / functions.
Do you want to open a PR to add some kind of check?
I am fine with doing this in the init as long as it does not slow it down too much
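
On the cost concern, one possible approach (an assumption on my side, not something proposed in this thread) is to compute the set of callable names once per class and cache it, so that each `__init__` call only pays for a set intersection against its kwargs. `callable_names`, `find_clashes`, and `Demo` are hypothetical names for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def callable_names(cls):
    """Collect the names of callables visible on a class, once per class."""
    return frozenset(
        name for name in dir(cls) if callable(getattr(cls, name, None))
    )


def find_clashes(cls, kwargs):
    """Return the kwargs keys that shadow a callable on cls, sorted."""
    return sorted(callable_names(cls).intersection(kwargs))


class Demo:
    """Toy class used only to exercise the helper."""

    def add_special_tokens(self, tokens):
        return len(tokens)


first = find_clashes(Demo, {"add_special_tokens": True, "model_max_length": 512})
second = find_clashes(Demo, {"model_max_length": 512})
```

The second call reuses the cached name set for `Demo`, so the `dir()` scan runs only on the first instantiation of each class.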

@ArthurZucker added the "Good Second Issue" label (Issues that are more difficult to do than "Good First" issues - give it a try if you want!) on May 15, 2024
3 participants