XLMR tokenizer is fully picklable #13577

Merged

ben-davidson-6
Contributor

What does this PR do?

This addresses issue #13200. To summarize:

  • Unpickling was dependent on what was on disk.
    The tokenizer is now unpickled using only the serialized proto.

This is needed if you want to write a PySpark UDF that tokenizes a column, because the tokenizer has to be pickled and sent to other nodes.
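The general pattern can be sketched as follows. This is a minimal, self-contained illustration rather than the code merged in this PR: `NativeModel` and `SketchTokenizer` are hypothetical stand-ins (the real tokenizer wraps a SentencePiece processor), and the attribute name `model_proto` is an assumption. The idea is that `__getstate__` drops the unpicklable handle and stores its serialized bytes, and `__setstate__` rebuilds the handle from those bytes alone, so nothing has to be read from disk on the receiving node.

```python
import pickle


class NativeModel:
    """Hypothetical stand-in for a native model handle that cannot be pickled."""

    def __init__(self, proto: bytes):
        self._proto = proto

    def serialized_proto(self) -> bytes:
        # Return the in-memory serialized form of the model.
        return self._proto

    def encode(self, text: str) -> list:
        # Toy tokenization so the sketch stays runnable without extra packages.
        return text.split()

    def __reduce__(self):
        # Mimic a native object that refuses to be pickled directly.
        raise TypeError("native handle cannot be pickled")


class SketchTokenizer:
    """Hypothetical tokenizer whose pickled state carries the serialized model."""

    def __init__(self, proto: bytes):
        self.model = NativeModel(proto)

    def tokenize(self, text: str) -> list:
        return self.model.encode(text)

    def __getstate__(self):
        state = self.__dict__.copy()
        # Replace the unpicklable handle with its serialized bytes.
        state["model"] = None
        state["model_proto"] = self.model.serialized_proto()
        return state

    def __setstate__(self, state):
        self.__dict__ = state
        # Rebuild the handle from the bytes in the pickle, not from a file path.
        self.model = NativeModel(self.model_proto)


def roundtrip(tokenizer: SketchTokenizer) -> SketchTokenizer:
    # Simulates shipping the tokenizer to another process or node.
    return pickle.loads(pickle.dumps(tokenizer))
```

With this shape, `roundtrip(SketchTokenizer(b"..."))` yields a working tokenizer even if the original model file is absent on the machine that unpickles it, which is the property a Spark worker relies on.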

Who can help

@LysandreJik

Member

@LysandreJik LysandreJik left a comment

This looks good to me. Do you think you could implement a test in tests/test_tokenization_xlm_roberta.py?

@ben-davidson-6
Contributor Author

This looks good to me - do you think you could implement a test in tests/test_tokenization_xlm_roberta.py?

Done.

Member

@LysandreJik LysandreJik left a comment

Wonderful! Thank you, @ben-davidson-6!

@LysandreJik LysandreJik merged commit e02ed0e into huggingface:master Sep 16, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 13, 2022
* made tokenizer fully picklable

* remove whitespace

* added testcase
@icyblade icyblade mentioned this pull request Jul 6, 2023
2 participants