Merged

src/transformers/pipelines/base.py (2 changes: 1 addition & 1 deletion)

@@ -149,7 +149,7 @@ def inner(items):
                 _padding_value = t_padding_value
             elif key in {"input_values", "pixel_values", "input_features"}:
                 _padding_value = f_padding_value
-            elif key in {"p_mask"}:
+            elif key in {"p_mask", "special_tokens_mask"}:

Contributor Author commented:

In random models, special_tokens_mask would be extended in the batch with 0 instead of 1, so we could still predict the PAD token in the pipeline.

I think having PAD always be treated as part of special_tokens_mask is fine.

                 _padding_value = 1
             elif key in {"attention_mask", "token_type_ids"}:
                 _padding_value = 0
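
A minimal sketch (not the pipeline's actual code; keep_real_tokens is a made-up helper) of why the padding value matters here: token-classification postprocessing drops positions whose special_tokens_mask entry is 1, so if batch padding extended the mask with 0, PAD positions would look like real tokens and could end up with predictions.

    import numpy as np

    def keep_real_tokens(logits, special_tokens_mask):
        # Keep only positions that are neither special tokens nor padding.
        return logits[special_tokens_mask == 0]

    # A 5-token sequence padded to length 7; PAD entries padded with 1 are masked out.
    logits = np.zeros((7, 3))
    mask = np.array([1, 0, 0, 0, 1, 1, 1])  # [CLS], 3 word tokens, [SEP], 2x PAD
    assert keep_real_tokens(logits, mask).shape == (3, 3)
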
src/transformers/pipelines/token_classification.py (1 change: 0 additions & 1 deletion)

@@ -192,7 +192,6 @@ def preprocess(self, sentence, offset_mapping=None):
         truncation = True if self.tokenizer.model_max_length and self.tokenizer.model_max_length > 0 else False
         model_inputs = self.tokenizer(
             sentence,
-            return_attention_mask=False,

Contributor Author commented:

return_attention_mask=True is also incorrect because FNet doesn't expect an attention mask.

Contributor Author commented:

Maybe FNet will continue to exhibit the flaw that pad tokens modify the output; I don't know enough about it, though.

Collaborator commented:

You will thus get the attention mask since you don't remove it afterward, but I'm guessing that's the whole point?

Contributor Author commented:

Actually, it seems the FNet tokenizer doesn't return an attention mask if we don't ask for it (which is fair, since the model doesn't seem to accept one).

             return_tensors=self.framework,
             truncation=truncation,
             return_special_tokens_mask=True,
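
As a side note (not part of the diff), a minimal sketch of what dropping return_attention_mask=False changes; bert-base-cased is just an example of a tokenizer that emits an attention mask by default, in contrast to FNet's tokenizer:

    from transformers import AutoTokenizer

    # BERT-style tokenizers include attention_mask by default, so the pipeline's
    # model_inputs now carry it and batch padding can extend it with zeros.
    tok = AutoTokenizer.from_pretrained("bert-base-cased")
    enc = tok("This is a test !", truncation=True, return_special_tokens_mask=True)
    print(sorted(enc.keys()))
    # ['attention_mask', 'input_ids', 'special_tokens_mask', 'token_type_ids']

    # Tokenizers for models that take no attention mask (e.g. FNet) simply don't
    # emit one unless explicitly asked, so nothing extra is passed to those models.
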
tests/pipelines/test_pipelines_token_classification.py (17 changes: 17 additions & 0 deletions)

@@ -649,6 +649,23 @@ def test_small_model_pt(self):
             ],
         )
 
+        # Batch size does not affect outputs (attention_mask is required)
+        sentences = ["This is a test !", "Another test this is with longer sentence"]
+        outputs = token_classifier(sentences)
+        outputs_batched = token_classifier(sentences, batch_size=2)
+        # Batching does not make a difference in predictions
+        self.assertEqual(nested_simplify(outputs_batched), nested_simplify(outputs))
+        self.assertEqual(
+            nested_simplify(outputs_batched),
+            [
+                [
+                    {"entity": "I-MISC", "score": 0.115, "index": 1, "word": "this", "start": 0, "end": 4},
+                    {"entity": "I-MISC", "score": 0.115, "index": 2, "word": "is", "start": 5, "end": 7},
+                ],
+                [],
+            ],
+        )
+
     @require_torch
     def test_pt_ignore_subwords_slow_tokenizer_raises(self):
         model_name = "sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english"
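
A hypothetical standalone version of what the new test checks (the checkpoint name is an assumption; nested_simplify lives in transformers.testing_utils and rounds floats so the comparison ignores tiny numeric noise):

    from transformers import pipeline
    from transformers.testing_utils import nested_simplify

    # Assumed tiny checkpoint for illustration; any PyTorch token-classification
    # model should behave the same way.
    token_classifier = pipeline(
        "token-classification",
        model="hf-internal-testing/tiny-bert-for-token-classification",
        framework="pt",
    )
    sentences = ["This is a test !", "Another test this is with longer sentence"]
    unbatched = token_classifier(sentences)
    batched = token_classifier(sentences, batch_size=2)
    # With attention_mask and special_tokens_mask padded correctly, batching
    # should not change the predictions.
    assert nested_simplify(batched) == nested_simplify(unbatched)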