
Predictions for pre-tokenized tokens with Roberta have strange offset_mapping #14305

Closed
jcklie opened this issue Nov 6, 2021 · 9 comments

@jcklie
jcklie commented Nov 6, 2021

Environment info

  • transformers version: 4.12.3
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.9.2
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

The error/issue is in the fast Roberta tokenizers.

@LysandreJik

Information

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: POS tagging with Roberta based models

I am trying to do POS tagging with a Roberta-based transformer; my code is based on this. The issue arises when I want to map the subword-tokenized predictions back to my original tokens.

I followed this guide and it works for BERT-based models, but I do not know how to check whether something is a subword token when add_prefix_space is used, because both tokens start at offset 1 when a token of length 1 is followed by a subword token:

(0, 1)	I
(1, 3)	##KE
(3, 4)	##A
(1, 1)	ĠI
(1, 3)	KE
(3, 4)	A

I do not know whether this is intended, but it makes it hard to align the predictions back to the original tokens, because the rule that the end index of a token equals the start index of the next token for subword continuations is broken in fast Roberta tokenizers.

In the WNUT example, it says: "That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100, which means that we do not keep it." If we check against 1 instead, since a space is added in front of every token, the rule still breaks: in the example above, the word-initial token ĠI and the continuation KE both start at 1.
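
For illustration, here is a minimal sketch of that alignment rule as I read it from the guide (the tag ids are made up, and the snippet assumes BERT-style offsets, where every word-initial subword starts at 0):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased", use_fast=True)
words = ["I", "love", "IKEA"]
word_labels = [5, 7, 8]  # hypothetical POS tag ids, one per pre-tokenized word

encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)

aligned_labels = []
word_index = -1
for offset in encoding.offset_mapping:
    if offset == (0, 0):
        aligned_labels.append(-100)  # special token ([CLS]/[SEP]): ignore
    elif offset[0] == 0:
        word_index += 1
        aligned_labels.append(word_labels[word_index])  # first subword of a word: keep its label
    else:
        aligned_labels.append(-100)  # subword continuation: ignore

print(aligned_labels)  # [-100, 5, 7, 8, -100, -100, -100]

With the Roberta offsets above, this rule falls apart, because word-initial subwords then also start at 1.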

To reproduce

Steps to reproduce the behavior:

  1. Tokenize pre-tokenized sequences (e.g. for POS tagging) with a fast Roberta tokenizer, using add_prefix_space together with is_split_into_words
  2. See that the offset_mapping looks strange
from collections import defaultdict

from transformers import AutoTokenizer

s = ['I', 'love', 'IKEA', 'very', 'much', '.']


keeps = defaultdict(list)

names = ["distilbert-base-cased", "distilroberta-base"]

for name in names:
    is_roberta = "roberta" in name
    tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True, add_prefix_space=is_roberta)

    encoding = tokenizer(
        s, truncation=True, padding=True, is_split_into_words=True, return_offsets_mapping=True
    )

    offsets = encoding.offset_mapping
    input_ids = encoding.input_ids

    decoded_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    print(name)
    for idx in range(len(input_ids)):
        offset = offsets[idx]
        token_id = input_ids[idx]

        if is_roberta:
            # Roberta fast tokenizer: rely on the "Ġ" space marker to find word-initial tokens
            keep = decoded_tokens[idx][0] == "Ġ"
        else:
            # BERT-style tokenizer: word-initial tokens start at offset 0; (0, 0) marks special tokens
            keep = offset != (0, 0) and offset[0] == 0

        print(f"{offset}\t{decoded_tokens[idx]}")

        keeps[name].append(keep)

    print()

for name in names:
    print(f"{name:25}\t{keeps[name]}")

Output

distilbert-base-cased
(0, 0)	[CLS]
(0, 1)	I
(0, 4)	love
(0, 1)	I
(1, 3)	##KE
(3, 4)	##A
(0, 4)	very
(0, 4)	much
(0, 1)	.
(0, 0)	[SEP]

distilroberta-base
(0, 0)	<s>
(1, 1)	ĠI
(1, 4)	Ġlove
(1, 1)	ĠI
(1, 3)	KE
(3, 4)	A
(1, 4)	Ġvery
(1, 4)	Ġmuch
(1, 1)	Ġ.
(0, 0)	</s>

distilbert-base-cased    	[False, True, True, True, False, False, True, True, True, False]
distilroberta-base       	[False, True, True, True, False, False, True, True, True, False]

Expected behavior

I would expect the offsets to behave similarly to the case without add_prefix_space, i.e. the automatically added space should not influence the offsets. Is there a better way to align tokens and predictions for Roberta tokenizers than checking whether a token's first character is the space marker (Ġ)?
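
As a side note (not from the original report), fast tokenizers also expose word_ids(), which maps each subword token back to the index of the pre-tokenized word it came from (None for special tokens). A minimal sketch of aligning with it instead of the offsets:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", use_fast=True, add_prefix_space=True)
s = ["I", "love", "IKEA", "very", "much", "."]

encoding = tokenizer(s, is_split_into_words=True)

keeps = []
previous_word_id = None
for word_id in encoding.word_ids():
    # Keep only the first subword of each word; special tokens have word_id None
    keeps.append(word_id is not None and word_id != previous_word_id)
    previous_word_id = word_id

print(keeps)  # [False, True, True, True, False, False, True, True, True, False]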

@jcklie jcklie changed the title Mapping predictions for pre-tokenized tokens with Roberta has strange offsets Predictions for pre-tokenized tokens with Roberta have strange offset_mapping Nov 6, 2021
@github-actions

github-actions bot commented Dec 7, 2021

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@jcklie
Author

jcklie commented Dec 7, 2021

This is still relevant and dear to me.

@LysandreJik
Member

Pinging @SaulLu for advice

@SaulLu
Contributor

SaulLu commented Dec 13, 2021

First of all, thank you very much for the detailed issue, which makes it very easy to understand your problem. 🤗

To put it in context, the offsets feature comes from the (Rust) Tokenizers library, and I must unfortunately admit that I need a little more information about the behavior of this library before I can provide you with a solution to your problem (see the question I asked here).

That being said, I strongly suspect that there was also an oversight on our part in adapting the tokenizer stored in the backend_tokenizer of the transformers library (see this PR). I propose to wait a little longer for additional information on the behavior of the Rust library (which would confirm the necessity of this PR).

@github-actions

github-actions bot commented Jan 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@SaulLu
Contributor

SaulLu commented Jan 10, 2022

@jcklie some news about your issue: we merged some corrections into the main branch of transformers (this PR) and into the new version of tokenizers (this PR). So, using the main branch of transformers and the latest version of tokenizers, here are the outputs you will get for your example:

distilbert-base-cased
(0, 0)	[CLS]
(0, 1)	I
(0, 4)	love
(0, 1)	I
(1, 3)	##KE
(3, 4)	##A
(0, 4)	very
(0, 4)	much
(0, 1)	.
(0, 0)	[SEP]

distilroberta-base
(0, 0)	<s>
(0, 1)	ĠI
(0, 4)	Ġlove
(0, 1)	ĠI
(1, 3)	KE
(3, 4)	A
(0, 4)	Ġvery
(0, 4)	Ġmuch
(0, 1)	Ġ.
(0, 0)	</s>
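
With these corrected offsets, the offset-based check from the reproduction script (offset != (0, 0) and offset[0] == 0) picks out the same tokens for both models. A small illustrative check, using the distilroberta-base offsets copied from the output above (not part of the original comment):

# Corrected distilroberta-base offsets copied from the output above
roberta_offsets = [(0, 0), (0, 1), (0, 4), (0, 1), (1, 3), (3, 4), (0, 4), (0, 4), (0, 1), (0, 0)]

keeps = [offset != (0, 0) and offset[0] == 0 for offset in roberta_offsets]
print(keeps)  # [False, True, True, True, False, False, True, True, True, False]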

There is one more case where the returned offsets can be a bit confusing, but we hesitate to fix it in the tokenizers library because the fix would be quite heavy to implement. Don't hesitate to share your opinion in the issue that explains and discusses this case here.

I'll close this issue, but don't hesitate to react to it if you think your problem is not solved.

@ohmeow
Contributor

ohmeow commented Mar 16, 2022

This is still an issue with roberta-large ...

inputs = hf_tokenizer("17 yo with High blood pressure", return_offsets_mapping=True)
inputs["offset_mapping"]

# [(0, 0), (1, 2), (3, 5), (6, 10), (11, 15), (16, 21), (22, 30), (0, 0)]

@SaulLu
Contributor

SaulLu commented Mar 17, 2022

@ohmeow, I just tested the code snippet below with tokenizers==0.11.6 and transformers==4.17.0:

from transformers import AutoTokenizer

name = "roberta-large"
text = "17 yo with High blood pressure"

hf_tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)

inputs = hf_tokenizer(text, return_offsets_mapping=True)

# Print result offset mapping
title = f"{'token':10} | {'offset':10} | corresponding text"
print(title)
print("-"*len(title))
for (start_idx, end_idx), token in zip(inputs["offset_mapping"], hf_tokenizer.convert_ids_to_tokens(inputs["input_ids"])):
    print(f"{token:10} | {f'({start_idx}, {end_idx})':10} | {repr(text[start_idx:end_idx])}")

and the result looks good to me:

token      | offset     | corresponding text
--------------------------------------------
<s>        | (0, 0)     | ''
17         | (0, 2)     | '17'
Ġyo        | (3, 5)     | 'yo'
Ġwith      | (6, 10)    | 'with'
ĠHigh      | (11, 15)   | 'High'
Ġblood     | (16, 21)   | 'blood'
Ġpressure  | (22, 30)   | 'pressure'
</s>       | (0, 0)     | ''

Do you agree?

To understand why my output is different from yours, can you run the command transformers-cli env and copy-and-paste its output? 😊 Also, it would be super helpful if you could share your entire code, in particular how you initialized hf_tokenizer.

@ohmeow
Contributor

ohmeow commented Mar 23, 2022

Yup ... my version of tokenizers was outdated! Sorry to bother you :)

Thanks for the follow-up.
