[BUG] Unexpected overflowing_tokens in tokenizer.encode_plus #8028

Closed
wangxinyu0922 opened this issue Oct 25, 2020 · 4 comments

wangxinyu0922 commented Oct 25, 2020

Environment info

  • transformers version: 3.4.0
  • Platform: Linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.3.1
  • Tensorflow version (GPU?):
  • Using GPU in script?: True
  • Using distributed or parallel set-up in script?:

Who can help

tokenizers: @mfuntowicz

Information

When I use the BERT tokenizer, I get unexpected overflowing_tokens. Here is example code to reproduce:

To reproduce

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')

# 50 dummy subtoken ids: 1000, 1001, ..., 1049
subtoken_ids_sentence = [x for x in range(1000, 1050)]

encoded_inputs = tokenizer.encode_plus(subtoken_ids_sentence,
                                       max_length=40,
                                       stride=20,
                                       return_overflowing_tokens=True,
                                       truncation=True,
                                       )

print(encoded_inputs['overflowing_tokens'])

The output is: [1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049, 1048, 1047, 1046, 1045, 1044, 1043, 1042, 1041, 1040, 1039, 1038]

Expected behavior

The expected behavior I want is:
[1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049]
The current output contains [1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1046, 1047, 1048, 1049] followed by an additional reversed run of ids, [1048, 1047, 1046, 1045, 1044, 1043, 1042, 1041, 1040, 1039, 1038], which I think is wrong.
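
For clarity, the expected list is simply the last num_tokens_to_remove + stride ids of the input, i.e. one contiguous window. A minimal sketch of that arithmetic (that two added special tokens, [CLS] and [SEP], account for the 12 removed ids is my assumption):

# Minimal sketch: derive the expected contiguous overflow window by hand.
# Assumes encode_plus adds two special tokens ([CLS], [SEP]) to the single sequence.
ids = list(range(1000, 1050))                         # the 50 input ids
max_length, stride, num_special = 40, 20, 2
num_to_remove = len(ids) + num_special - max_length   # 50 + 2 - 40 = 12

expected_overflow = ids[-(num_to_remove + stride):]   # the last 32 ids, contiguous
print(expected_overflow)                              # [1018, 1019, ..., 1049]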

When I dig into the code, I find that:

if truncation_strategy == TruncationStrategy.LONGEST_FIRST:
    for _ in range(num_tokens_to_remove):
        if pair_ids is None or len(ids) > len(pair_ids):
            if not overflowing_tokens:
                window_len = min(len(ids), stride + 1)
            else:
                window_len = 1
            overflowing_tokens.extend(ids[-window_len:])
            ids = ids[:-1]
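
Running this loop by hand on the ids from the script above reproduces the reported output and shows where the reversed tail comes from. A standalone simulation (again assuming two added special tokens, so 12 ids are removed):

# Standalone simulation of the LONGEST_FIRST loop above for a single sequence
# (pair_ids is None), with the same ids, max_length and stride as the repro script.
ids = list(range(1000, 1050))
stride = 20
num_tokens_to_remove = len(ids) + 2 - 40              # 12 (two special tokens assumed)

overflowing_tokens = []
for _ in range(num_tokens_to_remove):
    if not overflowing_tokens:
        window_len = min(len(ids), stride + 1)        # 21 ids on the first pass
    else:
        window_len = 1                                # a single id on every later pass
    overflowing_tokens.extend(ids[-window_len:])
    ids = ids[:-1]                                    # but only one id is dropped per pass

print(overflowing_tokens)
# [1029, ..., 1049, 1048, 1047, ..., 1038] -- the 21-id window plus 11 repeated,
# reversed ids, exactly the unexpected output reported above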

I wonder why there is a for loop in it, and I think I need truncation_strategy = TruncationStrategy.ONLY_FIRST. However, I failed to switch the truncation strategy to only_first because the code here turns the truncation strategy into longest_first:

if max_length is not None and padding is False and truncation is False:
    if verbose:
        logger.warning(
            "Truncation was not explicitely activated but `max_length` is provided a specific value, "
            "please use `truncation=True` to explicitely truncate examples to max length. "
            "Defaulting to 'longest_first' truncation strategy. "
            "If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy "
            "more precisely by providing a specific strategy to `truncation`."
        )
    truncation = "longest_first"
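
For reference, here is a sketch of naming the strategy explicitly instead of relying on the implicit default; that encode_plus in 3.4.0 accepts the string "only_first" for truncation and then returns the contiguous window is my assumption, not something I have verified:

# Sketch (assumption): pass an explicit string strategy instead of truncation=True,
# so the longest_first default above is never applied to this single sequence.
encoded_inputs = tokenizer.encode_plus(subtoken_ids_sentence,
                                       max_length=40,
                                       stride=20,
                                       return_overflowing_tokens=True,
                                       truncation="only_first",
                                       )
print(encoded_inputs['overflowing_tokens'])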

Can you give me any help?

@djstrong

I confirm the issue. It worked correctly with transformers 3.0.0, but the behavior changed starting with 3.1.0.

@djstrong

And the code:

if pair_ids is None or len(ids) > len(pair_ids):
    if not overflowing_tokens:
        window_len = min(len(ids), stride + 1)
    else:
        window_len = 1
    overflowing_tokens.extend(ids[-window_len:])
    ids = ids[:-1]
else:
    if not overflowing_tokens:
        window_len = min(len(pair_ids), stride + 1)
    else:
        window_len = 1
    overflowing_tokens.extend(pair_ids[-window_len:])
    pair_ids = pair_ids[:-1]

looks buggy; in addition to the above, ids = ids[:-1] should be ids = ids[:-window_len].

djstrong referenced this issue Oct 25, 2020
* Exposing prepare_for_model for both slow & fast tokenizers

* Update method signature

* The traditional style commit

* Hide the warnings behind the verbose flag

* update default truncation strategy and prepare_for_model

* fix tests and prepare_for_models methods

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
@LysandreJik
Member

Pinging @thomwolf


stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
