The current behavior of the `OnlyFirst` and `OnlySecond` truncation strategies is not what I would expect, and it diverges from the current behavior in `transformers`:
It currently takes only the first encoding (`OnlyFirst`) or the second one (`OnlySecond`), and truncates it so that its own length is below the desired `max_length`.
But this does not guarantee that the combined encodings have a length below `max_length`, which is the behavior I was expecting: these strategies should take the combined length of both encodings into account when truncating only the first or second one.
(See tokenizers/tokenizers/src/utils.rs, lines 87 to 99 in 88391dd.)

What do you think @n1t0 @mfuntowicz?
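To make the expected semantics concrete, here is a minimal Rust sketch (not the actual `tokenizers` API; `truncate_only_first` and its parameters are hypothetical) of an `OnlyFirst`-style strategy that compares the *combined* length against `max_length` and removes the overflow from the first encoding only:

```rust
// Hypothetical helper illustrating the expected OnlyFirst behavior:
// the pair (first, second) must fit within max_length, and any
// overflow is taken out of the first encoding alone.
fn truncate_only_first(len_first: usize, len_second: usize, max_length: usize) -> usize {
    let total = len_first + len_second;
    if total <= max_length {
        // The combined encodings already fit: leave the first one untouched.
        len_first
    } else {
        // Remove exactly the combined overflow from the first encoding,
        // saturating at zero if the second encoding alone exceeds the budget.
        let overflow = total - max_length;
        len_first.saturating_sub(overflow)
    }
}

fn main() {
    // first = 8, second = 6, budget = 10: the overflow of 4 comes
    // out of the first encoding, leaving it at length 4.
    assert_eq!(truncate_only_first(8, 6, 10), 4);
    // Already within budget: unchanged.
    assert_eq!(truncate_only_first(3, 4, 10), 3);
    // Second encoding alone fills the budget: first is emptied.
    assert_eq!(truncate_only_first(5, 12, 10), 0);
    println!("ok");
}
```

The point of contrast: the current implementation compares only `len_first` against `max_length`, so a pair like (8, 6) with `max_length = 10` would pass through untruncated even though the combined length is 14.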