Add Polyglot-NER Dataset #641
Hi @joeddav, thanks for adding this! (I did a long webarchive.org session to actually find that dataset a while ago.) One question: should we manually correct the labeling scheme to (at least) IOB1? That would mean "LOC" is converted to "I-LOC". IOB1 is not explicitly mentioned in the paper, but it is used in the documentation: https://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
@stefan-it I went back and forth on this. My biggest problem with it is that once you are in IOB, there is the expectation that the beginning of new entities is marked with a
If we just prepend But I could go either way if someone has a strong opinion.
Indeed, I'm not sure we can convert them to IOB because of this issue. I'm fine with keeping it like that.
I'll do a release later today, hopefully we can include this dataset in the release :) Let me know if you need help with the dummy data.
@lhoestq cool thanks, I think I've got it right now – just zipped them wrong. I'm running tests locally now and then will push.
@lhoestq set to merge?
@joeddav I'm fine with keeping the original labeling scheme :)
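For readers following the labeling discussion above, here is a minimal sketch of why a mechanical conversion to IOB1 loses information. It is not code from this PR: `to_iob1_naive` and `spans_from_iob1` are hypothetical helpers, and the example tags are made up. In IOB1, a `B-` prefix is only needed when an entity immediately follows another entity of the same type, and bare tags such as "PER" carry no such boundary information.

```python
from typing import List, Tuple


def to_iob1_naive(tags: List[str]) -> List[str]:
    """Prepend "I-" to every non-"O" tag (hypothetical helper, not part of the PR)."""
    return [tag if tag == "O" else f"I-{tag}" for tag in tags]


def spans_from_iob1(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Decode IOB1 tags into (entity_type, start, end_exclusive) spans."""
    spans: List[Tuple[str, int, int]] = []
    start = None
    current = None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if tag == "O":
            # Outside any entity: close the open span, if there is one.
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        elif prefix == "B" or label != current:
            # Explicit boundary, or a change of entity type: start a new span.
            if current is not None:
                spans.append((current, start, i))
            start, current = i, label
        # Otherwise this is an "I-" tag continuing the current span.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Made-up example: "Alice" and "Bob" may well be two different people.
tokens = ["Alice", "Bob", "visited", "Paris"]
tags = ["PER", "PER", "O", "LOC"]

converted = to_iob1_naive(tags)
spans = spans_from_iob1(converted)
print(converted)  # ['I-PER', 'I-PER', 'O', 'I-LOC']
print(spans)      # [('PER', 0, 2), ('LOC', 3, 4)]
```

After the naive conversion, the two adjacent PER tokens decode as a single two-token entity. Whether the original annotation meant one entity or two cannot be recovered from the bare tags, which is consistent with the decision in this thread to keep the original labeling scheme.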
Adds the Polyglot-NER dataset with named entity tags for 40 languages. I include separate configs for each language as well as a combined config which lumps them all together.