
Hosted Inference API for Token Classification doesn't Highlight Tokens correctly #7716

Closed
agemagician opened this issue Oct 11, 2020 · 8 comments

@agemagician
Contributor

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): 2.3.0 (False)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help

albert, bert, GPT2, XLM: @LysandreJik
Model Cards: @julien-c
examples/token-classification: @stefan-it

Information

Model I am using (Bert, XLNet ...): Bert

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S
https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=T+G+N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S+A+T+G
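
The same behavior can be checked locally in a few lines (a sketch, assuming the hosted widget mirrors the standard `ner` pipeline with entity grouping enabled):

```python
from transformers import pipeline

# A minimal sketch, assuming the hosted widget is backed by the standard
# "ner" (token classification) pipeline with entity grouping enabled.
ner = pipeline("ner", model="Rostlab/prot_bert_bfd_ss3", grouped_entities=True)

sequence = "N L Y I Q W L K D G G P S S G R P P P S"
for entity in ner(sequence):
    print(entity)
```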

Expected behavior

When the Hosted Inference API assigns a tag to special tokens like "[CLS]" and "[SEP]", and the same tag also applies to the next or previous token, that neighboring token is not highlighted or tagged properly.

Example:
[Screenshot 2020-10-11 at 23 31 35: widget output with the token "N" next to "[CLS]" left un-highlighted]
Because the token "N" fell into the same token group as the preceding special token "[CLS]", it was not highlighted, even though it was detected correctly.
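
A minimal, self-contained sketch of what seems to be happening (a hypothetical simplification of the widget's grouping logic, not the actual implementation):

```python
# Hypothetical simplification of the widget's grouping logic, for illustration.
entities = [
    {"word": "[CLS]", "entity": "C", "is_special": True},
    {"word": "N", "entity": "C", "is_special": False},
    {"word": "L", "entity": "C", "is_special": False},
]

def group_entities(ents):
    """Merge consecutive tokens that share the same label into one span."""
    groups = []
    for ent in ents:
        if groups and groups[-1]["entity"] == ent["entity"]:
            groups[-1]["word"] += " " + ent["word"]
        else:
            groups.append(dict(ent))
    return groups

# Buggy order: grouping first absorbs "N" and "L" into the "[CLS]" span,
# so dropping special-token spans afterwards discards their highlights too.
buggy = [g for g in group_entities(entities) if not g["is_special"]]

# Fixed order: drop special tokens *before* grouping, so "N L" keeps its span.
fixed = group_entities([e for e in entities if not e["is_special"]])

print(buggy)  # [] -- "N" and "L" disappeared along with "[CLS]"
print(fixed)  # [{'word': 'N L', 'entity': 'C', 'is_special': False}]
```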

@agemagician
Contributor Author

@julien-c @mfuntowicz any insights or updates on this issue?

@julien-c
Member

I see the issue, but I'm not sure how best to fix it, to be honest, as it seems like a very specific problem (token classification models that classify the special tokens as non-O).

Once we run fast tokenizers by default, we'll get token alignment offsets into the original inputs, so that might solve this issue elegantly.
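
For example, a fast tokenizer already exposes the needed information; this sketch uses `bert-base-cased` purely for illustration, since the mechanism is tokenizer-agnostic:

```python
from transformers import AutoTokenizer

# Sketch: fast tokenizers return character offsets into the original input.
# bert-base-cased is used only for illustration; the mechanism is generic.
tok = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

enc = tok("N L Y I Q", return_offsets_mapping=True)
tokens = tok.convert_ids_to_tokens(enc["input_ids"])

for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    # Special tokens like [CLS] and [SEP] map to the empty span (0, 0),
    # so they are easy to filter out before grouping and highlighting.
    print(token, (start, end))
```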

May I ask what your use case is here, and whether you need it supported by the inference widget (and on what timeline)?

@agemagician
Contributor Author

agemagician commented Oct 14, 2020

Our main project is ProtTrans, which trains various language models for protein sequences at large scale.

This specific use case predicts the secondary structure of protein sequences. It is one step short of predicting the 3D structure of proteins (as Google AlphaFold does), which could allow companies to find a drug or a cure for a virus like Covid-19.

We want to use the inference widget to show a live example of the predictive power of our fine-tuned models on different tasks. Later, companies or researchers might need to use your APIs to make these predictions at large scale.

Hopefully this answers your question 😄

References:
https://blogs.nvidia.com/blog/2020/07/16/ai-reads-proteins-covid/
https://www.youtube.com/watch?v=04E3EjsQLYo&t=89s

@julien-c
Member

👍 Oh yes I know (and love) your project and general goal/use case. I was referring to the specific use of the inference widget.

I'll see what we can do. Out of curiosity, any specific reason you trained with special tokens (vs. just the raw sequence)? To be able to also do document-level classification from the same pretrained model?

@agemagician
Contributor Author

agemagician commented Oct 14, 2020

The original pretrained model, ProtBert-BFD, was trained using the Google Bert script on TPUs, which automatically adds these special tokens.

This also allows us to perform document-level classification, as you mentioned, like the fine-tuned ProtBert-BFD-MS model.

We also found that keeping the special tokens while fine-tuning the ProtBert-BFD-SS3 model performs better than dropping them. I would assume this is because: 1) the positional encoding, 2) it matches the original Bert training method, and 3) you recommend using them in your token classification example :)
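
For reference, the difference at the tokenizer level (a sketch; the exact ids depend on the model's vocabulary):

```python
from transformers import AutoTokenizer

# Sketch of what "with vs. without special tokens" means at the tokenizer
# level; the exact ids depend on the model's vocabulary.
tok = AutoTokenizer.from_pretrained("Rostlab/prot_bert_bfd")

print(tok("N L Y", add_special_tokens=True)["input_ids"])   # [CLS] N L Y [SEP]
print(tok("N L Y", add_special_tokens=False)["input_ids"])  # N L Y only
```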

Thanks in advance for looking into this issue.

@cceyda
Contributor

cceyda commented Oct 16, 2020

Not to keep pushing my own PR #5970, but it solves some existing problems related to NER pipelines. The current hold-up is whether or not it provides a general enough solution across various models/languages.
If fast tokenizers were supported by all models, I could switch to a better implementation in the pipeline too, but in the current state I don't have an alternative (suggestions are welcome).

@agemagician
Contributor Author

@julien-c Any progress on this issue?

@stale

stale bot commented Jan 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jan 3, 2021
@stale stale bot closed this as completed Jan 10, 2021