When tokenizing with the llama-3 tokenizer with return_offsets_mapping=True, the resulting offset_mapping does not match the behavior outlined in the docs.
Hey! This seems to be expected, no? The documentation might be wrong, but there are no offsets here (trim_offsets is set to False, I think): ['Sample', 'Ġinput'] are the two tokens
@ArthurZucker I don't think this would be expected -- return_offsets_mapping for other tokenizers gives a mapping from character indices to tokens. See the Mistral output below.
Example:
will yield:
offset_mapping should contain a (char_start, char_end) tuple for each token.
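To make the expected semantics concrete, here is a minimal pure-Python sketch of what the docs describe: each token maps to the (char_start, char_end) span it covers in the original string. This uses a simple whitespace split rather than the real llama-3 BPE tokenizer, so it only illustrates the documented contract, not the actual tokenizer's behavior:

```python
def offsets_for_whitespace_tokens(text):
    """Compute (char_start, char_end) spans for whitespace-split tokens,
    mirroring what return_offsets_mapping is documented to return:
    text[start:end] recovers each token."""
    offsets = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # locate token at or after pos
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets

print(offsets_for_whitespace_tokens("Sample input"))
# → [(0, 6), (7, 12)]
```

Under this contract, slicing the input with each tuple round-trips the token (e.g. "Sample input"[7:12] == "input"), which is what other fast tokenizers such as Mistral's return and what the llama-3 output reportedly violates.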