Token offsets computation fails when input is truncated #82

Open
Valahaar opened this issue May 13, 2022 · 0 comments

Describe the bug
As the title says. In classy/data/dataset/hf/classification.py#L89 we invoke self.tokenize (#L109), which correctly truncates the input.

The issue arises in tuple(tok_encoding.word_to_tokens(wi)) for wi in range(len(tokens)). When a word is dropped from the input by truncation, word_to_tokens returns None, and tuple(None) raises a TypeError. That error is caught by the except clause, which makes the function return None. The caller then cannot unpack None in input_ids, token_offsets = self.tokenize(token_sample.tokens), which raises a second, unhandled exception that crashes classy.
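The failure chain can be reproduced in isolation. The following is a minimal sketch, not the actual classy code: FakeEncoding, tokenize and reproduce are hypothetical stand-ins, and the broad except clause is an assumption about what the real function catches.

```python
from typing import List, Optional, Tuple


class FakeEncoding:
    """Hypothetical stand-in for a tokenizer encoding truncated to `kept_words` words."""

    def __init__(self, kept_words: int) -> None:
        self.kept_words = kept_words

    def word_to_tokens(self, word_index: int) -> Optional[Tuple[int, int]]:
        # One sub-token per word for simplicity; None mimics a truncated word.
        if word_index < self.kept_words:
            return (word_index, word_index + 1)
        return None


def tokenize(tokens: List[str], kept_words: int):
    """Simplified version of the failing pattern (assumed shape, not classy's code)."""
    tok_encoding = FakeEncoding(kept_words)
    try:
        # tuple(None) raises TypeError for every word lost to truncation.
        offsets = [tuple(tok_encoding.word_to_tokens(wi)) for wi in range(len(tokens))]
        input_ids = list(range(kept_words))
        return input_ids, offsets
    except Exception:
        # The error is swallowed, so the caller receives None instead of a pair.
        return None


def reproduce(tokens: List[str], kept_words: int) -> str:
    try:
        input_ids, token_offsets = tokenize(tokens, kept_words)
        return "ok"
    except TypeError as error:
        # Unpacking None fails with a second TypeError.
        return f"crash: {error}"
```

Calling reproduce(["a", "b", "c"], 3) succeeds, while reproduce(["a", "b", "c"], 2) hits the second TypeError because the last word has no tokens.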

To Reproduce
In the token classification setting, input a sentence that has too many tokens (or reduce truncation to obtain the same effect).

Expected behaviour
I think there is a way to know how many of the original tokens were kept, in which case we can iterate over those instead of range(len(tokens)). Otherwise, we can just iterate until word_to_tokens(wi) returns None. Comments?
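The second option could look roughly like the sketch below. compute_token_offsets and the fake_word_to_tokens helper are hypothetical names used only for illustration, and stopping at the first None assumes that None is returned only for words removed by truncation.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]


def compute_token_offsets(
    word_to_tokens: Callable[[int], Optional[Span]], num_words: int
) -> List[Span]:
    """Collect per-word token spans, stopping at the first truncated word."""
    offsets: List[Span] = []
    for wi in range(num_words):
        span = word_to_tokens(wi)
        if span is None:
            # Assumption: None means this word and all later ones were truncated.
            break
        offsets.append(tuple(span))
    return offsets


def fake_word_to_tokens(kept_words: int) -> Callable[[int], Optional[Span]]:
    """Hypothetical stand-in for an encoding's word_to_tokens after truncation."""

    def lookup(word_index: int) -> Optional[Span]:
        if word_index < kept_words:
            return (word_index, word_index + 1)
        return None

    return lookup


# Five input words, only three survive truncation.
offsets = compute_token_offsets(fake_word_to_tokens(3), num_words=5)
```

With this approach the returned list is shorter than the original token list, so any per-token labels would presumably need to be trimmed to len(offsets) as well.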

Valahaar added the bug label on May 13, 2022