Missing Entities in TopRes19th Dataset #2

Closed
stefan-it opened this issue Mar 15, 2022 · 2 comments

@stefan-it

Hi,

while reviewing the addition of the HIPE-2022 dataset to Flair, we found that some of the listed entities do not exist in the actual dataset.

These entities are: ALIEN, OTHER, FICTION.

Could you please clarify what happened to these entities? Will they be added later, or will they appear in the final test dataset?

Many thanks,

Stefan

@stefan-it stefan-it changed the title TopRes19th dataset Missing Entities in TopRes19th Dataset Mar 15, 2022
@simon-clematide
Contributor

Hi Stefan,
you are right, these entities were removed. We chose to do so because they are so scarce in the training data. We first thought about keeping OTHER, but then decided to keep it simple for HIPE 2022, given that all these datasets are already a bit nightmarish. Here are the stats that one can generate from the published data in WebAnno TSV format:

annotated_tsv $ cat *.tsv | grep -vP '^#'  | cut -f 5| grep -Po '[A-Z]+' |sort |uniq -c | sort -rn
   3470 LOC
    891 BUILDING
    406 STREET
      5 OTHER
      1 FICTION

For the currently private data, the distribution is similar, and we will not add additional entity types.
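
For anyone who wants to reproduce these counts without the shell pipeline, here is a minimal Python sketch that roughly mirrors the command above: it skips lines starting with `#`, takes the fifth tab-separated field, and counts every run of uppercase letters in it. The function name `count_entity_labels` and the sample rows are made up for illustration and are not taken from the dataset. Unlike `cut`, the sketch ignores lines with fewer than five fields.

```python
import re
import sys
from collections import Counter

# Same pattern as `grep -Po '[A-Z]+'`: every run of uppercase ASCII letters.
LABEL_RE = re.compile(r"[A-Z]+")


def count_entity_labels(lines, column=4):
    """Count uppercase label names in one tab-separated column.

    `column` is zero-based, so the default 4 corresponds to `cut -f 5`.
    """
    counts = Counter()
    for line in lines:
        # Equivalent of `grep -vP '^#'`: drop comment/header lines.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Assumption: rows with fewer than five fields (e.g. blank
        # sentence separators) carry no label and are skipped.
        if len(fields) <= column:
            continue
        counts.update(LABEL_RE.findall(fields[column]))
    return counts


if __name__ == "__main__":
    # Usage: python count_labels.py annotated_tsv/*.tsv
    total = Counter()
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            total.update(count_entity_labels(handle))
    # Print in descending order of frequency, like `sort -rn`.
    for label, n in total.most_common():
        print(f"{n:7d} {label}")
```

Because the pattern only matches uppercase runs, a suffixed label such as `LOC[1]` is counted as `LOC`, and a placeholder such as `_` is counted as nothing, which matches the behaviour of the `grep -Po` step.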

@stefan-it
Author

Hi @simon-clematide,

thanks for the explanation! c673e29 mentions it, so I'm closing here 🤗
