NER pipeline doesn't work for a list of sequences #10168

Closed
elk-cloner opened this issue Feb 13, 2021 · 2 comments · Fixed by #10184

Comments

@elk-cloner
Contributor

Environment info

  • transformers version: transformers==4.3.2
  • Platform: Linux Ubuntu 20.04
  • Python version: 3.6
  • PyTorch version (GPU?): torch==1.7.0+cu101
  • Tensorflow version (GPU?): tensorflow==2.4.1
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Documentation: @sgugger

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. I used the steps here to run the pipeline for the NER task, with a small change; my script is as follows:
from transformers import pipeline
nlp = pipeline("ner")
sequence = [
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window."
]
print(nlp(sequence))

Expected behavior

I expected to get a list like this:

[
    [
        {'word': 'Hu', 'score': 0.999578595161438, 'entity': 'I-ORG', 'index': 1, 'start': 0, 'end': 2},
        {'word': '##gging', 'score': 0.9909763932228088, 'entity': 'I-ORG', 'index': 2, 'start': 2, 'end': 7},
        {'word': 'Face', 'score': 0.9982224702835083, 'entity': 'I-ORG', 'index': 3, 'start': 8, 'end': 12},
        {'word': 'Inc', 'score': 0.9994880557060242, 'entity': 'I-ORG', 'index': 4, 'start': 13, 'end': 16},
        {'word': 'New', 'score': 0.9994344711303711, 'entity': 'I-LOC', 'index': 11, 'start': 40, 'end': 43},
        {'word': 'York', 'score': 0.9993196129798889, 'entity': 'I-LOC', 'index': 12, 'start': 44, 'end': 48},
        {'word': 'City', 'score': 0.9993793964385986, 'entity': 'I-LOC', 'index': 13, 'start': 49, 'end': 53}
    ],
    [
        {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
        {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
        {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
        {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
        {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
        {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
        {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
        {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
        {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
        {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
        {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
        {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
    ]
]

but instead I got this error:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    770                 if not is_tensor(value):
--> 771                     tensor = as_tensor(value)
    772 

ValueError: expected sequence of length 16 at dim 1 (got 38)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
6 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    786                     )
    787                 raise ValueError(
--> 788                     "Unable to create tensor, you should probably activate truncation and/or padding "
    789                     "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
    790                 )

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

I know the problem comes from the tokenizer and that I should call it with arguments like this:

tokenizer(
    sequence,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=512,
)

but it's not clear from the documentation how we can pass these arguments (truncation=True, padding=True, max_length=512) when using the pipeline for the NER task.
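
As a stopgap, here is a minimal workaround sketch (not from the original report, assuming single-string calls keep working in this version): call the pipeline on each sequence separately, so no padding across sequences is needed.

from transformers import pipeline

nlp = pipeline("ner")
sequence = [
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window.",
]

# Run the pipeline once per sequence: each call builds a single tensor,
# so no cross-sequence padding or truncation is required.
results = [nlp(seq) for seq in sequence]
print(results)  # one list of entity dicts per input sequence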

@LysandreJik
Member

@Narsil, do you want to take a look at this?

Narsil added a commit to Narsil/transformers that referenced this issue Feb 15, 2021
@Narsil
Contributor

Narsil commented Feb 15, 2021

Took a look, it seems the issue was not padding, but argument handling.
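
For illustration, a rough sketch of the intended usage once list inputs are handled (assuming the fix referenced in #10184 makes a single call return one entity list per input, matching the expected output above):

from transformers import pipeline

nlp = pipeline("ner")

# With list inputs handled by the pipeline's argument parsing, one call
# should return a list with one entry (a list of entity dicts) per sequence.
outputs = nlp([
    "Hugging Face Inc. is a company based in New York City.",
    "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge which is visible from the window.",
])
print(len(outputs))  # expected: 2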
