
Fix list index out of range when padding nested empty lists #13876

Merged
merged 4 commits into huggingface:master on Nov 10, 2021

Conversation

@qqaatw qqaatw (Contributor) commented Oct 5, 2021

What does this PR do?

This PR fixes a "list index out of range" error that is raised when nested empty lists are supplied to tokenizer.pad:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
padded = tokenizer.pad({"input_ids": [[], [], []] })
print(padded)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    padded = tokenizer.pad({"input_ids": [[], [], []] })
  File "/src/transformers/tokenization_utils_base.py", line 2730, in pad
    while len(required_input[index]) == 0:
IndexError: list index out of range
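
For context, a minimal sketch of the kind of bounds-safe lookup that avoids this error when scanning a batch for its first non-empty example (illustrative only; the function name and structure below are simplified assumptions, not the exact diff in this PR):

from typing import List, Optional

def first_non_empty_element(required_input: List[List[int]]) -> Optional[int]:
    # Walk the batch and return the first element of the first non-empty
    # example. The old loop assumed such an example always exists, so an
    # all-empty batch like [[], [], []] indexed past the end of the list.
    for example in required_input:
        if len(example) != 0:
            return example[0]
    return None  # every example in the batch is empty

print(first_non_empty_element([[], [], []]))      # None, no IndexError
print(first_non_empty_element([[], [101, 102]]))  # 101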

@LysandreJik

@LysandreJik (Member):

What is the use-case of passing nested empty lists like this?

@qqaatw qqaatw (Contributor, Author) commented Oct 9, 2021

What is the use-case of passing nested empty lists like this?

I'm using tokenizer.pad to pad question-answering labels, as in the snippet below (an illustrative padding call follows the lists).

# Labels like these should be padded to (batch_size, max_answers).
# Sometimes the entire batch has no answers at all, i.e. [[], [], []], which is why I proposed this PR.
start_position = [
    [3, 5, 10], # 3 answers
    [4], # 1 answer
    [], # no answer
]
end_position = [
    [4, 7, 14],
    [8],
    [],
]
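
As an illustration of this use-case (not code from the PR; the padding argument and expected output below are assumptions), the labels above could be padded by passing them to tokenizer.pad under the input_ids key:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

start_position = [
    [3, 5, 10],  # 3 answers
    [4],         # 1 answer
    [],          # no answer
]

# Pads every example to the longest one in the batch (max_answers = 3),
# using the tokenizer's pad token id (0 for bert-base-uncased).
padded = tokenizer.pad({"input_ids": start_position}, padding="longest")
print(padded["input_ids"])  # [[3, 5, 10], [4, 0, 0], [0, 0, 0]]

# With this fix, an all-empty batch also goes through without an IndexError:
tokenizer.pad({"input_ids": [[], [], []]})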

Also, could you check out issue #13879 and see if it makes sense to you? (It covers a similar use-case to this PR.)

Thanks a lot!

@LysandreJik (Member):

I would appreciate it greatly if you could take a look at this when you're back @SaulLu!

@huggingface huggingface deleted a comment from github-actions bot Nov 6, 2021
@SaulLu SaulLu self-requested a review November 9, 2021 17:51
@SaulLu SaulLu (Contributor) left a comment

Thank you very much for your contribution @qqaatw !

What you propose makes sense to me! I just have a small request: could you add a test (probably in test_padding in the test_tokenization_common.py file)?
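
A rough sketch of what such a test could check (illustrative only, written as a standalone unittest rather than the actual addition to test_tokenization_common.py; the class name, test name, and assertions are assumptions):

import unittest

from transformers import BertTokenizer

class PadNestedEmptyListsTest(unittest.TestCase):
    def test_pad_nested_empty_lists(self):
        tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        # Before this fix, this call raised "IndexError: list index out of range".
        padded = tokenizer.pad({"input_ids": [[], [], []]})
        self.assertIn("input_ids", padded)
        self.assertEqual(len(padded["input_ids"]), 3)

if __name__ == "__main__":
    unittest.main()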

Comment on lines +2980 to +2981
input_p = tokenizer_p.encode_plus("This is a input 1")
input_p = tokenizer_p.pad(input_p)
@qqaatw (Contributor, Author):

I think this was an oversight when these test cases were copied.

@SaulLu (Contributor):

oh! Great catch! Thanks! 👍

@qqaatw qqaatw requested a review from SaulLu November 10, 2021 06:57
@SaulLu SaulLu (Contributor) left a comment

That is perfect! Thanks for the addition! 🙌


@SaulLu SaulLu merged commit 9e37c5c into huggingface:master Nov 10, 2021
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
Fix list index out of range when padding nested empty lists (huggingface#13876)

* Fix index out of range when padding

* Apply suggestions from code review

* Style