The current behavior of the `OnlyFirst` and `OnlySecond` truncation strategies is not what I would expect, and it diverges from the current behavior in `transformers`:
It currently takes only the first encoding (`OnlyFirst`) or the second one (`OnlySecond`), and truncates it so that its own length is below the desired `max_length`.
But this does not guarantee that the combined encodings have a length below `max_length`, which is the behavior I was expecting: these strategies should take the combined length of both encodings into account when truncating only the first or second one.
(See tokenizers/tokenizers/src/utils.rs, lines 87 to 99 in 88391dd.)

What do you think @n1t0 @mfuntowicz?
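To make the expected semantics concrete, here is a minimal Rust sketch (not the actual `tokenizers` API; `truncate_only_first` and its parameters are hypothetical) of an `OnlyFirst`-style strategy that compares the *combined* length against `max_length` and removes the overflow from the first encoding only:

```rust
// Hypothetical helper illustrating the expected OnlyFirst behavior:
// the pair (first, second) must fit within max_length, and any
// overflow is taken out of the first encoding alone.
fn truncate_only_first(len_first: usize, len_second: usize, max_length: usize) -> usize {
    let total = len_first + len_second;
    if total <= max_length {
        // The combined encodings already fit: leave the first one untouched.
        len_first
    } else {
        // Remove exactly the combined overflow from the first encoding,
        // saturating at zero if the second encoding alone exceeds the budget.
        let overflow = total - max_length;
        len_first.saturating_sub(overflow)
    }
}

fn main() {
    // first = 8, second = 6, budget = 10: the overflow of 4 comes
    // out of the first encoding, leaving it at length 4.
    assert_eq!(truncate_only_first(8, 6, 10), 4);
    // Already within budget: unchanged.
    assert_eq!(truncate_only_first(3, 4, 10), 3);
    // Second encoding alone fills the budget: first is emptied.
    assert_eq!(truncate_only_first(5, 12, 10), 0);
    println!("ok");
}
```

The point of contrast: the current implementation compares only `len_first` against `max_length`, so a pair like (8, 6) with `max_length = 10` would pass through untruncated even though the combined length is 14.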