minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf" #13891

Merged

Conversation

dwyatte (Contributor) commented on Oct 5, 2021

What does this PR do?

This PR addresses #13890 with the minimal fixes needed to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf".
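
As a quick illustration of the behavior this enables, here is a minimal usage sketch (not taken from the PR or its tests; the checkpoint name and example sentences are placeholders):

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

# Placeholder checkpoint; any BERT-style WordPiece tokenizer works for whole word masking
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

features = [
    tokenizer("whole word masking groups subword pieces together"),
    tokenizer("so every piece of a chosen word gets masked"),
]

# The same collator class, just asking for NumPy or TensorFlow outputs
np_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors="np")
tf_collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors="tf")

np_batch = np_collator(features)  # "input_ids"/"labels" as numpy.ndarray
tf_batch = tf_collator(features)  # "input_ids"/"labels" as tf.Tensor

print(type(np_batch["input_ids"]), type(tf_batch["input_ids"]))
```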

Specific problems addressed (the TensorFlow-side changes are illustrated in the sketch after this list):

  • Renamed np_call -> numpy_call
  • Call _numpy_collate_batch instead of _tf_collate_batch when returning NumPy tensors
  • Fix the size of random_words in numpy_mask_tokens
  • Clone TensorFlow tensors with tf.identity(tensor)
  • Use TensorFlow tensors’ built-in iteration instead of attempting to convert them to a list
  • Change calls to tf.convert_to_tensor with dtype=tf.bool to tf.cast with dtype=tf.bool
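
A rough sketch of the TensorFlow-side points above (illustrative only, not the actual diff; the tensors are toy values):

```python
import tensorflow as tf

input_ids = tf.constant([[101, 2054, 2003, 102], [101, 7592, 2088, 102]])  # toy BERT-style ids

# Clone with tf.identity (tf.Tensor has no PyTorch-style .clone())
labels = tf.identity(input_ids)

# tf.Tensor iterates over its first axis directly; no list conversion is needed
rows = [row for row in input_ids]

# Cast an existing tensor to bool with tf.cast;
# tf.convert_to_tensor(tensor, dtype=tf.bool) fails because it does not change dtypes
special_tokens_mask = tf.constant([[1, 0, 0, 1], [1, 0, 0, 1]])
mask_bool = tf.cast(special_tokens_mask, dtype=tf.bool)
```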

I've only added a simple test to check for regressions, rather than covering all of the padded/unpadded cases as is done for DataCollatorForLanguageModeling.
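
For reference, a sketch of what such a simple shape-regression check could look like (hypothetical, not the test code actually added; the checkpoint and feature values are placeholders):

```python
import numpy as np
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
features = [{"input_ids": list(range(10))}, {"input_ids": list(range(10))}]

for return_tensors in ("np", "tf"):
    collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, return_tensors=return_tensors)
    batch = collator(features)
    # Both backends should return (batch_size, seq_len) input_ids and labels
    assert np.asarray(batch["input_ids"]).shape == (2, 10)
    assert np.asarray(batch["labels"]).shape == (2, 10)
```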

Fixes #13890

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

CC @Rocketknight1 (looks like you did the initial Numpy/TF implementation of these data collators)

dwyatte (Contributor, Author) commented on Nov 2, 2021

Bump @Rocketknight1 @LysandreJik @sgugger since transformers releases frequently and it would be nice to have this functionality.

This gives a vetted source for masked language modeling, in a package many people are already using for modeling, that is not https://github.com/google-research/bert/blob/master/create_pretraining_data.py or https://github.com/tensorflow/models/blob/master/official/nlp/data/create_pretraining_data.py (e.g., I could delete my code that is a copy/paste of the former and integrate against this collator instead for creating a large offline dataset).
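
For context, a sketch of the kind of offline dataset creation meant here (the corpus, max_length, and output file name are all placeholders, not from the PR):

```python
import numpy as np
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="np"
)

texts = ["first pretraining sentence", "second pretraining sentence"]  # placeholder corpus
features = [tokenizer(t, truncation=True, max_length=128) for t in texts]

# One statically masked shard written to disk, instead of masking on the fly during training
batch = collator(features)
np.savez("wwm_shard_0.npz", input_ids=batch["input_ids"], labels=batch["labels"])
```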

sgugger (Collaborator) left a comment

This looks good to me, thanks for the fixes!

Rocketknight1 (Member) commented

Sorry for the delay, and thank you for the ping! This looks good to me. I've got one question about the convert_to_tensor versus cast call, but other than that this is a solid and necessary PR.

sgugger merged commit 27b1516 into huggingface:master on Nov 3, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request on Jan 27, 2022:

minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf" (huggingface#13891)

* minimal fixes to run DataCollatorForWholeWordMask with return_tensors="np" and return_tensors="tf"

* more consistent implementation for numpy_mask_tokens