minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf" #13891

Merged

Conversation

dwyatte (Contributor) commented on Oct 5, 2021

What does this PR do?

This PR addresses #13890 with the minimal fixes needed to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf".
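
As a quick illustration of the behavior this enables, here is a minimal usage sketch (not taken from the PR or its tests; the checkpoint name and example sentences are placeholders):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

# Placeholder checkpoint; any BERT-style WordPiece tokenizer works for whole word masking
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

features = [
    tokenizer("whole word masking groups subword pieces together"),
    tokenizer("so every piece of a chosen word gets masked"),
]

# The same collator class, just asking for NumPy or TensorFlow outputs
np_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors="np")
tf_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors="tf")

np_batch = np_collator(features)  # "input_ids"/"labels" as numpy.ndarray
tf_batch = tf_collator(features)  # "input_ids"/"labels" as tf.Tensor

print(type(np_batch["input_ids"]), type(tf_batch["input_ids"]))
```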

Specific problems addressed (the TensorFlow-side changes are illustrated in the sketch after this list):

  • Renamed np_call -> numpy_call
  • Call _numpy_collate_batch instead of _tf_collate_batch when returning NumPy tensors
  • Fix the size of random_words in numpy_mask_tokens
  • Clone TensorFlow tensors with tf.identity(tensor)
  • Use TensorFlow tensors’ built-in iteration instead of attempting to convert them to a list
  • Change calls to tf.convert_to_tensor with dtype=tf.bool to tf.cast with dtype=tf.bool
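
A rough sketch of the TensorFlow-side points above (illustrative only, not the actual diff; the tensors are toy values):

```python
import tensorflow as tf

input_ids = tf.constant([[101, 2054, 2003, 102], [101, 7592, 2088, 102]])  # toy BERT-style ids

# Clone with tf.identity (tf.Tensor has no PyTorch-style .clone())
labels = tf.identity(input_ids)

# tf.Tensor iterates over its first axis directly; no list conversion is needed
rows = [row for row in input_ids]

# Cast an existing tensor to bool with tf.cast;
# tf.convert_to_tensor(tensor, dtype=tf.bool) fails because it does not change dtypes
special_tokens_mask = tf.constant([[1, 0, 0, 1], [1, 0, 0, 1]])
mask_bool = tf.cast(special_tokens_mask, dtype=tf.bool)
```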

I've only added a simple test to check for regressions, rather than covering all of the padded/unpadded cases as is done for DataCollatorForLanguageModeling.
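
For reference, a sketch of what such a simple shape-regression check could look like (hypothetical, not the test code actually added; the checkpoint and feature values are placeholders):

```python
import numpy as np
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]

for return_tensors in ("np", "tf"):
    collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors=return_tensors)
    batch = collator(features)
    # Both backends should return (batch_size, seq_len) input_ids and labels
    assert np.asarray(batch["input_ids"]).shape == (2, 10)
    assert np.asarray(batch["labels"]).shape == (2, 10)
```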

Fixes #13890

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

CC @Rocketknight1 (looks like you did the initial Numpy/TF implementation of these data collators)

dwyatte (Contributor, Author) commented on Nov 2, 2021

Bump @Rocketknight1 @LysandreJik @sgugger since transformers releases frequently and it would be nice to have this functionality.

This gives a vetted source for masked language modeling, in a package many people are already using for modeling, that is not https://github.com/google-research/bert/blob/master/create_pretraining_data.py or https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py (e.g., I could delete my code that is a copy/paste of the former and integrate against this collator instead for creating a large offline dataset).
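
For context, a sketch of the kind of offline dataset creation meant here (the corpus, max_length, and output file name are all placeholders, not from the PR):

```python
import numpy as np
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)

texts = ["first pretraining sentence", "second pretraining sentence"]  # placeholder corpus
features = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# One statically masked shard written to disk, instead of masking on the fly during training
batch = collator(features)
np.savez("wwm_shard_0.npz", input_ids=batch["input_ids"], labels=batch["labels"])
```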

sgugger (Collaborator) left a comment

This looks good to me, thanks for the fixes!

Rocketknight1 (Member) commented

Sorry for the delay, and thank you for the ping! This looks good to me. I've got one question about the convert_to_tensor versus cast call, but other than that this is a solid and necessary PR.

sgugger merged commit 27b1516 into huggingface:master on Nov 3, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request on Jan 27, 2022:

minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf" (huggingface#13891)

* minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf"

* more consistent implementation for numpy_mask_tokens