
Hosted Inference API for Token Classification doesn't Highlight Tokens correctly #7716

Closed
agemagician opened this issue Oct 11, 2020 · 8 comments

@agemagician
Contributor

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): 2.3.0 (False)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

albert, bert, GPT2, XLM: @LysandreJik
Model Cards: @julien-c
examples/token-classification: @stefan-it

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S
https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=T+G+N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S+A+T+G
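
The same behavior can be checked locally in a few lines (a sketch, assuming the hosted widget mirrors the standard `ner` pipeline with entity grouping enabled):

```python
from transformers import pipeline

# A minimal sketch, assuming the hosted widget is backed by the standard
# "ner" (token classification) pipeline with entity grouping enabled.
ner = pipeline("ner", model="Rostlab/prot_bert_bfd_ss3", grouped_entities=True)

sequence = "N L Y I Q W L K D G G P S S G R P P P S"
for entity in ner(sequence):
    print(entity)
```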

Expected behavior

When the Hosted Inference API assigns a tag to special tokens like "[CLS]" and "[SEP]", and the same tag also applies to the next or previous token, that neighboring token is not highlighted or tagged properly.

Example:
[Screenshot 2020-10-11 at 23 31 35: widget output with the token "N" next to "[CLS]" left un-highlighted]
Because the token "N" fell into the same token group as the preceding special token "[CLS]", it was not highlighted, even though it was detected correctly.
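
A minimal, self-contained sketch of what seems to be happening (a hypothetical simplification of the widget's grouping logic, not the actual implementation):

```python
# Hypothetical simplification of the widget's grouping logic, for illustration.
entities = [
    {"word": "[CLS]", "entity": "C", "is_special": True},
    {"word": "N", "entity": "C", "is_special": False},
    {"word": "L", "entity": "C", "is_special": False},
]

def group_entities(ents):
    """Merge consecutive tokens that share the same label into one span."""
    groups = []
    for ent in ents:
        if groups and groups[-1]["entity"] == ent["entity"]:
            groups[-1]["word"] += " " + ent["word"]
        else:
            groups.append(dict(ent))
    return groups

# Buggy order: grouping first absorbs "N" and "L" into the "[CLS]" span,
# so dropping special-token spans afterwards discards their highlights too.
buggy = [g for g in group_entities(entities) if not g["is_special"]]

# Fixed order: drop special tokens *before* grouping, so "N L" keeps its span.
fixed = group_entities([e for e in entities if not e["is_special"]])

print(buggy)  # [] -- "N" and "L" disappeared along with "[CLS]"
print(fixed)  # [{'word': 'N L', 'entity': 'C', 'is_special': False}]
```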

@agemagician
Contributor Author

@julien-c @mfuntowicz any insights or updates on this issue?

@julien-c
Member

I see the issue, but I'm not sure how best to fix it, to be honest, as it seems like a very specific problem (token classification models that classify the special tokens as non-O).

Once we run fast tokenizers by default, we'll get token alignment offsets into the original inputs, so that might solve this issue elegantly.
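
For example, a fast tokenizer already exposes the needed information; this sketch uses `bert-base-cased` purely for illustration, since the mechanism is tokenizer-agnostic:

```python
from transformers import AutoTokenizer

# Sketch: fast tokenizers return character offsets into the original input.
# bert-base-cased is used only for illustration; the mechanism is generic.
tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

enc = tok("N L Y I Q", return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])

for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    # Special tokens like [CLS] and [SEP] map to the empty span (0, 0),
    # so they are easy to filter out before grouping and highlighting.
    print(token, (start, end))
```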

May I ask what your use case is here, and whether you need it supported by the inference widget (and on what timeline)?

@agemagician
Contributor Author

agemagician commented Oct 14, 2020

Our main project is ProtTrans, which trains various language models for protein sequences at large scale.

This specific use case predicts the secondary structure of protein sequences. It is one step short of predicting the 3D structure of proteins (as Google AlphaFold does), which could allow companies to find a drug or a cure for a virus like Covid-19.

We want to use the inference widget to show a live example of the predictive power of our fine-tuned models on different tasks. Later, companies or researchers might need to use your APIs to make these predictions at large scale.

Hopefully this answers your question 😄

References:
https://blogs.nvidia.com/blog/2020/07/16/ai-reads-proteins-covid/
https://www.youtube.com/watch?v=04E3EjsQLYo&t=89s

@julien-c
Member

👍 Oh yes I know (and love) your project and general goal/use case. I was referring to the specific use of the inference widget.

I'll see what we can do. Out of curiosity, any specific reason you trained with special tokens (vs. just the raw sequence)? To be able to also do document-level classification from the same pretrained model?

@agemagician
Contributor Author

agemagician commented Oct 14, 2020

The original pretrained model, ProtBert-BFD, was trained using the Google Bert script on TPUs, which automatically adds these special tokens.

This also allows us to perform document-level classification, as you mentioned, like the fine-tuned ProtBert-BFD-MS model.

We also found that keeping the special tokens while fine-tuning the ProtBert-BFD-SS3 model performs better than dropping them. I would assume this is because: 1) the positional encoding, 2) it matches the original Bert training method, and 3) you recommend using them in your token classification example :)
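
For reference, the difference at the tokenizer level (a sketch; the exact ids depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

# Sketch of what "with vs. without special tokens" means at the tokenizer
# level; the exact ids depend on the model's vocabulary.
tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")

print(tok("N L Y", add_special_tokens=True)["input_ids"])   # [CLS] N L Y [SEP]
print(tok("N L Y", add_special_tokens=False)["input_ids"])  # N L Y only
```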

Thanks in advance for looking into this issue.

@cceyda
Contributor

cceyda commented Oct 16, 2020

Not to keep pushing my own PR #5970, but it solves some existing problems related to NER pipelines. The current hold-up is whether or not it provides a general enough solution across various models/languages.
If fast tokenizers were supported by all models, I could switch to a better implementation in the pipeline too, but in the current state I don't have an alternative (suggestions are welcome).

@agemagician
Contributor Author

@julien-c Any progress on this issue?

@stale

stale bot commented Jan 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 3, 2021
@stale stale bot closed this as completed Jan 10, 2021