NER pipeline doesn't work for a list of sequences #10168

Closed
elk-cloner opened this issue Feb 13, 2021 · 2 comments · Fixed by #10184

Comments

@elk-cloner
Contributor

Environment info

  • transformers version: transformers==4.3.2
  • Platform: Linux Ubuntu 20.04
  • Python version: 3.6
  • PyTorch version (GPU?): torch==1.7.0+cu101
  • Tensorflow version (GPU?): tensorflow==2.4.1
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Documentation: @sgugger

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. I used the steps here to run the pipeline for the NER task, with a small change; my script is as follows:
from transformers import pipeline
nlp = pipeline("ner")
sequence = [
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
]
print(nlp(sequence))

Expected behavior

I expected to get a list like this:

[
    [
        {'word': 'Hu', 'score': 0.999578595161438, 'entity': 'I-ORG', 'index': 1, 'start': 0, 'end': 2},
        {'word': '##gging', 'score': 0.9909763932228088, 'entity': 'I-ORG', 'index': 2, 'start': 2, 'end': 7},
        {'word': 'Face', 'score': 0.9982224702835083, 'entity': 'I-ORG', 'index': 3, 'start': 8, 'end': 12},
        {'word': 'Inc', 'score': 0.9994880557060242, 'entity': 'I-ORG', 'index': 4, 'start': 13, 'end': 16},
        {'word': 'New', 'score': 0.9994344711303711, 'entity': 'I-LOC', 'index': 11, 'start': 40, 'end': 43},
        {'word': 'York', 'score': 0.9993196129798889, 'entity': 'I-LOC', 'index': 12, 'start': 44, 'end': 48},
        {'word': 'City', 'score': 0.9993793964385986, 'entity': 'I-LOC', 'index': 13, 'start': 49, 'end': 53}
    ],
    [
        {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
        {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
        {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
        {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
        {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
        {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
        {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
        {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
        {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
        {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
        {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
        {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
    ]
]

but instead I got this error:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    770                 if not is_tensor(value):
--> 771                     tensor = as_tensor(value)
    772 

ValueError: expected sequence of length 16 at dim 1 (got 38)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    786                     )
    787                 raise ValueError(
--> 788                     "Unable to create tensor, you should probably activate truncation and/or padding "
    789                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    790                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

I know the problem comes from the tokenizer and that I should call it with arguments like this:

tokenizer(
    sequence,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512,
)

but it's not clear from the documentation how we can pass these arguments (truncation=True, padding=True, max_length=512) when using the pipeline for the NER task.
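
As a stopgap, here is a minimal workaround sketch (not from the original report, assuming single-string calls keep working in this version): call the pipeline on each sequence separately, so no padding across sequences is needed.

from transformers import pipeline

nlp = pipeline("ner")
sequence = [
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window.",
]

# Run the pipeline once per sequence: each call builds a single tensor,
# so no cross-sequence padding or truncation is required.
results = [nlp(seq) for seq in sequence]
print(results)  # one list of entity dicts per input sequence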

@LysandreJik
Member

@Narsil, do you want to take a look at this?

Narsil added a commit to Narsil/transformers that referenced this issue Feb 15, 2021
@Narsil
Contributor

Narsil commented Feb 15, 2021

Took a look, it seems the issue was not padding, but argument handling.
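
For illustration, a rough sketch of the intended usage once list inputs are handled (assuming the fix referenced in #10184 makes a single call return one entity list per input, matching the expected output above):

from transformers import pipeline

nlp = pipeline("ner")

# With list inputs handled by the pipeline's argument parsing, one call
# should return a list with one entry (a list of entity dicts) per sequence.
outputs = nlp([
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window.",
])
print(len(outputs))  # expected: 2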
