Hosted Inference API for Token Classification doesn't Highlight Tokens correctly #7716
Comments
@julien-c @mfuntowicz any insights or updates on this issue?
I see the issue, but to be honest I'm not sure how best to fix it, as it seems a very specific problem (token classification models that classify the special tokens as non-). When we run fast tokenizers by default, we'll get token alignment offsets into the original inputs, so that might solve this issue elegantly. May I ask what your use case is here, and do you need this use case supported by the inference widget (and on what horizon)?
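The offset-based idea mentioned above can be sketched in plain Python (no transformers dependency). Fast tokenizers report character offsets into the original input, with `(0, 0)` conventionally marking special tokens such as `[CLS]`/`[SEP]`, so those tokens can be dropped before entity grouping. The token, offset, and tag values below are illustrative, not from the real model.

```python
# Sketch: use character offsets to filter out special tokens before grouping.
# Assumption: special tokens carry the (0, 0) offset convention used by
# fast tokenizers; real tokens map back into the original input string.

def filter_special_tokens(tokens, offsets, tags):
    """Keep only tokens whose offsets map back into the original input."""
    kept = []
    for token, (start, end), tag in zip(tokens, offsets, tags):
        if start == 0 and end == 0:  # special token by convention
            continue
        kept.append((token, tag, (start, end)))
    return kept

tokens = ["[CLS]", "N", "L", "[SEP]"]
offsets = [(0, 0), (0, 1), (2, 3), (0, 0)]
tags = ["C", "C", "H", "C"]

print(filter_special_tokens(tokens, offsets, tags))
# [('N', 'C', (0, 1)), ('L', 'H', (2, 3))]
```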
Our main project is ProtTrans, which trains various language modelling models for protein sequences at large scale. This specific use case predicts the secondary structure of protein sequences. It is one step behind predicting the 3D structure of protein sequences (like Google AlphaFold), which allows companies to find a drug or a cure for a virus like Covid-19. We want to use the inference widget to show a live example of the prediction power of our fine-tuned models on different tasks. Later, companies or researchers might need to use it at large scale to make these predictions using your APIs. Hopefully, this answers your question 😄 References:
👍 Oh yes, I know (and love) your project and general goal/use case. I was referring to the specific use of the inference widget. I'll see what we can do. Out of curiosity, any specific reason you trained with special tokens (vs. just the raw sequence)? To be able to also do document-level classification from the same pretrained model?
The original pretrained model, ProtBert-BFD, was trained using the Google Bert script on TPU, which automatically adds these special tokens. This also allows us to perform document-level classification, as you mentioned, like the ProtBert-BFD-MS fine-tuned model. We found that also using the special tokens during fine-tuning of the ProtBert-BFD-SS3 model performs better than not using them. I would assume because: 1) the positional encoding; 2) it matches the original Bert training method; 3) you recommended using it in your token classification example :) Thanks in advance for looking into this issue.
Not to keep pushing my own PR #5970, but this solves some existing problems related to NER pipelines. The current hold-up is whether or not this provides a general enough solution for various models/langs *.
@julien-c Any progress on this issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Environment info
- transformers version: 3.3.1

Who can help
- albert, bert, GPT2, XLM: @LysandreJik
- Model Cards: @julien-c
- examples/token-classification: @stefan-it
Information
Model I am using (Bert, XLNet ...): Bert
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S
https://huggingface.co/Rostlab/prot_bert_bfd_ss3?text=T+G+N+L+Y+I+Q+W+L+K+D+G+G+P+S+S+G+R+P+P+P+S+A+T+G
Expected behavior
When the hosted inference API finds that a special token like [CLS] or [SEP] carries the same tag as the next or previous token, it doesn't highlight and tag that neighboring token properly.
Example:
Because token "N" had the same token group as the previous special token "[CLS]", it was not highlighted. However, it was detected correctly.
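The failure mode described above can be sketched in plain Python. If the widget merges consecutive tokens that share a tag, a special token "absorbs" its neighbor's group, so the real token ("N" here) loses its individual highlight; dropping special tokens before grouping avoids this. The tokens, tags, and grouping function are illustrative assumptions, not the widget's actual code.

```python
# Sketch of the suspected grouping behavior: [CLS] shares tag "C" with the
# following token "N", so naive grouping merges them and "N" is no longer
# rendered as its own highlighted span. Filtering special tokens fixes it.

SPECIAL = {"[CLS]", "[SEP]"}

def group_by_tag(pairs):
    """Merge consecutive (token, tag) pairs that share the same tag."""
    groups = []
    for token, tag in pairs:
        if groups and groups[-1][1] == tag:
            groups[-1][0].append(token)
        else:
            groups.append(([token], tag))
    return [(" ".join(tokens), tag) for tokens, tag in groups]

pairs = [("[CLS]", "C"), ("N", "C"), ("L", "H"), ("[SEP]", "C")]

buggy = group_by_tag(pairs)  # "[CLS]" and "N" end up in one group
fixed = group_by_tag([p for p in pairs if p[0] not in SPECIAL])

print(buggy)  # [('[CLS] N', 'C'), ('L', 'H'), ('[SEP]', 'C')]
print(fixed)  # [('N', 'C'), ('L', 'H')]
```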