When tokenizing with the llama-3 tokenizer with return_offsets_mapping=True, the resulting offset_mapping does not match the behavior outlined in the docs.
Hey! This seems to be expected, no? The documentation might be wrong, but there are no offsets here (trim_offsets is set to False, I think): ['Sample', 'Ġinput'] are the two tokens
@ArthurZucker I don't think this would be expected -- return_offsets_mapping for other tokenizers gives a mapping from character indices to tokens. See the Mistral output below.
Example:
will yield:
offset_mapping should contain a (char_start, char_end) tuple for each token.
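To make the expected semantics concrete, here is a minimal pure-Python sketch of what the docs describe: each token maps to the (char_start, char_end) span it covers in the original string. This uses a simple whitespace split rather than the real llama-3 BPE tokenizer, so it only illustrates the documented contract, not the actual tokenizer's behavior:

```python
def offsets_for_whitespace_tokens(text):
    """Compute (char_start, char_end) spans for whitespace-split tokens,
    mirroring what return_offsets_mapping is documented to return:
    text[start:end] recovers each token."""
    offsets = []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # locate token at or after pos
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets

print(offsets_for_whitespace_tokens("Sample input"))
# → [(0, 6), (7, 12)]
```

Under this contract, slicing the input with each tuple round-trips the token (e.g. "Sample input"[7:12] == "input"), which is what other fast tokenizers such as Mistral's return and what the llama-3 output reportedly violates.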