Skip to content

Conversation

@Guitaricet
Copy link

@Guitaricet Guitaricet commented Jan 31, 2022

If the input text contains "<pad>" (true for C4/en dataset) T5 tokenizer assigns it index 0.
Collator code (line 371) assumes that only tokens with id > 0 are the text tokens and the rest are sentinel ids (-1), which turns out not to be true for this particular case.
This causes the line to throw an error similar to

input_ids = input_ids_full[input_ids_full > 0].reshape((batch_size, -1))
*** ValueError: cannot reshape array of size 1040383 into shape (8192,newaxis)

This modification should also allow to use padding in T5 pre-training.

@patil-suraj (note edited by @stas00 to tag the correct maintainer for flax)

If the input text contains "<pad>" (true for C4/en dataset) T5 tokenizer assigns it index 0.
Collator code (line 371) assumes that only tokens with `id > 0` are the text tokens and the rest are sentinel ids (`-1`), which turns out not to be true for this particular case.
This causes the line to throw an error similar to

```
input_ids = input_ids_full[input_ids_full >= 0].reshape((batch_size, -1))
*** ValueError: cannot reshape array of size 1040383 into shape (8192,newaxis)
```

This modification should also allow to use padding in T5 pre-training.
@HuggingFaceDocBuilder
Copy link

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Guitaricet
Copy link
Author

@patil-suraj , cold you please review this PR? It's very short

@stefan-it
Copy link
Collaborator

Oh we have a similar PR here from @patrickvonplaten

#15835

@Guitaricet
Copy link
Author

Great!

@Guitaricet Guitaricet closed this Mar 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants