Fix sentinel token IDs in data collator for Flax T5 pretraining script #14477

rahuln · 2021-11-21T19:27:24Z

What does this PR do?

Modifies the sentinel token IDs used in the data collator for the Flax T5 pretraining script so that they go in decreasing order starting at len(tokenizer) - 1, which matches the original T5 code.

Fixes #14282
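
To make the intended ordering concrete, here is a minimal sketch. It assumes a tokenizer length of 32100 purely as an illustrative value (not taken from this PR), and `sentinel_ids_for_spans` is a hypothetical helper name, not a function in the script. It only shows the mapping described above: the first masked span gets `len(tokenizer) - 1`, the next `len(tokenizer) - 2`, and so on.

```python
import numpy as np


def sentinel_ids_for_spans(num_spans, vocab_len):
    """Return one sentinel ID per masked span, in decreasing order.

    The k-th span (0-based) is assigned vocab_len - 1 - k.
    `vocab_len` stands in for len(tokenizer); it is an assumption of this sketch.
    """
    return vocab_len - 1 - np.arange(num_spans)


# Illustrative value only: assume len(tokenizer) == 32100.
ids = sentinel_ids_for_spans(3, 32100)
print(ids.tolist())
```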

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

LysandreJik · 2021-11-22T07:00:01Z

Hey @rahuln! I'm pinging @patrickvonplaten for review as you have worked with him until now, but please note that he's off until next week so he'll review your PR when he's back! Thanks for your understanding.

patrickvonplaten · 2021-11-29T16:30:13Z

Great! Thanks a lot for digging into this issue and fixing it.

thomasw21 · 2023-01-20T10:56:52Z

examples/flax/language-modeling/run_t5_mlm_flax.py

@@ -290,7 +290,7 @@ def create_sentinel_ids(self, mask_indices):
        start_indices[:, 0] = mask_indices[:, 0]

        sentinel_ids = np.where(start_indices != 0, np.cumsum(start_indices, axis=-1), start_indices)


Isn't that line completely unnecessary, since we have the line right after it?

Well, the next line makes use of the just-changed sentinel_ids parameter, no?

Sorry, what I meant is that those two lines could be collapsed into:

sentinel_ids = np.where(start_indices != 0, (len(self.tokenizer) - np.cumsum(start_indices, axis=-1)), 0)

I.e., whatever is 0 stays 0 and whatever is nonzero gets a nonzero value, which means the next where operation uses the same segmentation and thus overwrites the values.

cc @patil-suraj
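
The equivalence suggested above can be checked with a small sketch. It assumes `start_indices` is a 0/1 array marking the first position of each masked span, and it uses a hypothetical `vocab_len` argument in place of `len(self.tokenizer)`; the function names are illustrative and not part of the script.

```python
import numpy as np


def sentinel_ids_two_step(start_indices, vocab_len):
    # Step 1: number the span starts 1, 2, 3, ... and leave other positions at 0.
    ids = np.where(start_indices != 0, np.cumsum(start_indices, axis=-1), start_indices)
    # Step 2: map each nonzero span number k to vocab_len - k.
    return np.where(ids != 0, vocab_len - ids, 0)


def sentinel_ids_one_step(start_indices, vocab_len):
    # Single np.where, as proposed in the review comment.
    return np.where(start_indices != 0, vocab_len - np.cumsum(start_indices, axis=-1), 0)


# Toy input: 1 marks the first token of a masked span (assumed layout).
start_indices = np.array(
    [
        [0, 1, 0, 0, 1, 0],
        [1, 0, 0, 1, 0, 1],
    ]
)

two_step = sentinel_ids_two_step(start_indices, vocab_len=32100)
one_step = sentinel_ids_one_step(start_indices, vocab_len=32100)
print(np.array_equal(two_step, one_step))
```

The two results agree because `np.cumsum` is at least 1 wherever `start_indices` is nonzero, so both `np.where` calls select exactly the same positions.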

Fix sentinel token IDs in data collator for Flax T5 pretraining script

9fc7fbb

rahuln mentioned this pull request Nov 21, 2021

Mismatch between sentinel token IDs from T5 data collator and T5 tokenizer #14282

Closed

LysandreJik requested a review from patrickvonplaten November 22, 2021 07:00

patrickvonplaten merged commit 8332327 into huggingface:master Nov 29, 2021

thomasw21 reviewed Jan 20, 2023

thomasw21 Jan 20, 2023

patrickvonplaten Jan 22, 2023

thomasw21 Jan 23, 2023
