Add Polyglot-NER Dataset #641

Merged
merged 8 commits into from
Sep 20, 2020

Conversation

joeddav
Contributor

@joeddav joeddav commented Sep 18, 2020

Adds the Polyglot-NER dataset with named entity tags for 40 languages. I include separate configs for each language as well as a combined config which lumps them all together.

@lhoestq lhoestq mentioned this pull request Sep 18, 2020
@stefan-it
Contributor

Hi @joeddav, thanks for adding this! (A while ago I spent a long webarchive.org session actually tracking that dataset down.)

One question: should we manually correct the labeling scheme to (at least) IOB1?

That means "LOC" would be converted to "I-LOC". IOB1 is not explicitly mentioned in the paper, but it is used in the documentation:

https://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html

@joeddav
Contributor Author

joeddav commented Sep 18, 2020

@stefan-it I went back and forth on this. My biggest problem with it is that once you are in IOB, there is the expectation that the beginning of a new entity is marked with a B- (at least in the case of two back-to-back entities):

Today    O
Alice    I-PER
Bob      B-PER
and      O
I        O 
ate      O
lasagna  O

If we just prepend I- to everything, Bob would be incorrectly tagged I-PER, meaning "Alice Bob" would be read as a single entity. The current format is bad, but it at least makes clear that it does not contain that information.

But I could go either way if someone has a strong opinion.
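
To make the ambiguity concrete, here is a minimal sketch (not code from this PR). The helper names `naive_to_iob1` and `iob1_spans` are hypothetical, and the unprefixed `PER` tags are an assumption about how the raw labels look. It decodes the correctly tagged sentence and the naively prefixed one, and shows that the second merges the two people into one entity.

```python
def naive_to_iob1(tags):
    """Prefix every non-O tag with "I-" (the naive conversion under discussion)."""
    return [t if t == "O" else f"I-{t}" for t in tags]


def iob1_spans(tokens, tags):
    """Decode IOB1 tags into a list of (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # A new span starts on B-, or when the entity type changes.
        starts_new = tag != "O" and (prefix == "B" or etype != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            current_type = etype
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Today", "Alice", "Bob", "and", "I", "ate", "lasagna"]
raw_tags = ["O", "PER", "PER", "O", "O", "O", "O"]        # assumed unprefixed labels
gold_iob1 = ["O", "I-PER", "B-PER", "O", "O", "O", "O"]   # the tagging shown above

gold_spans = iob1_spans(tokens, gold_iob1)
naive_spans = iob1_spans(tokens, naive_to_iob1(raw_tags))
print(gold_spans)   # two separate PER entities
print(naive_spans)  # one merged PER entity
```

The raw tags simply do not record where one entity ends and the next begins, so no prefixing rule can recover the `B-PER` on Bob.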

@lhoestq
Member

lhoestq commented Sep 18, 2020

Indeed, I'm not sure we can convert them to IOB because of this issue. I'm fine with keeping it like that.

@lhoestq
Member

lhoestq commented Sep 18, 2020

I'll do a release later today; hopefully we can include this dataset in it :)

Let me know if you need help with the dummy data

@joeddav
Contributor Author

joeddav commented Sep 18, 2020

@lhoestq cool thanks, I think I've got it right now – just zipped them wrong. I'm running tests locally now and then will push.

@joeddav
Contributor Author

joeddav commented Sep 19, 2020

@lhoestq set to merge?

@stefan-it
Contributor

@joeddav I'm fine with keeping the original labeling scheme :)

@joeddav joeddav merged commit 07839de into huggingface:master Sep 20, 2020