Question about alias map #12

Closed
hitercs opened this issue Dec 16, 2020 · 4 comments

@hitercs

hitercs commented Dec 16, 2020

Hi,

Thanks for your great work! Super cool.

I have one question about the alias map generation.
As I have seen in the alias2qids_kore50.json and alias2qids_rss500.json files, the same alias may have different candidate sets in different examples.
For example, in alias2qids_kore50.json, the candidate lists of "david_0" and "david_1" are not identical.

In detail:
In "david_1" but not in "david_0":
{'Q5240660', 'Q5236763', 'Q5240530', 'Q16079082', 'Q27827705', 'Q17318723', 'Q18632066', 'Q5238957', 'Q768479', 'Q3017915', 'Q5239424', 'Q20684456', 'Q5234065', 'Q5234667', 'Q5230766', 'Q10264386', 'Q672856', 'Q1174097', 'Q5240118', 'Q583264'}

In "david_0" but not in "david_1":
{'Q312649', 'Q1173922', 'Q19668637', 'Q5236091', 'Q178517', 'Q2420499', 'Q353983', 'Q24248231', 'Q5239917', 'Q336640', 'Q5241350', 'Q184903', 'Q338628', 'Q2071', 'Q1175688', 'Q41564', 'Q1173934', 'Q214601', 'Q5236705', 'Q1177021'}
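For reference, here is roughly how I computed those differences, assuming each alias maps to a list of [QID, score] pairs as shown below:

```python
import json

# Load the contextual candidate map (alias -> list of [QID, score] pairs).
with open("alias2qids_kore50.json") as f:
    alias2qids = json.load(f)

# Collect just the QIDs for the two "david" aliases.
david_0 = {qid for qid, score in alias2qids["david_0"]}
david_1 = {qid for qid, score in alias2qids["david_1"]}

print("In david_1 but not david_0:", david_1 - david_0)
print("In david_0 but not david_1:", david_0 - david_1)
```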

So I am wondering how the candidates are generated. Are they context dependent?

By the way, could you please explain the score (e.g. ["Q8016", 5947]) associated with each candidate entity? What does it mean and how is it calculated?

Thanks a lot.

@lorr1 lorr1 added the question Further information is requested label Dec 17, 2020
@lorr1 lorr1 self-assigned this Dec 17, 2020
@lorr1
Contributor

lorr1 commented Dec 17, 2020

Hello! Great question.

Yes, we do have a contextual candidate generator that we used for kore50 and rss500. It takes into account contextual similarity between an entity's Wikipedia page and the sentence itself, so because the sentences are different, the candidate lists are different.

The score for a candidate is based on a few features that we used for this contextual generation: the similarity between the mention and the entity, the overall entity popularity, and the similarity between the sentence and the entity's Wikipedia page. We only use this score for filtering the lists.
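If it helps, here is a minimal, self-contained sketch of that kind of score-based filtering. The feature functions and weights are placeholders for illustration, not the exact ones behind the released candidate lists:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    qid: str
    title: str
    popularity: float   # e.g. a normalized Wikipedia occurrence count
    page_text: str      # text of the entity's Wikipedia page

def token_overlap(a: str, b: str) -> float:
    """Crude bag-of-words Jaccard similarity, standing in for the real features."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def score_candidate(mention: str, sentence: str, ent: Entity,
                    w_mention: float = 1.0, w_pop: float = 1.0, w_ctx: float = 1.0) -> float:
    # Three terms mirroring the features above: mention/entity similarity,
    # overall entity popularity, and sentence/Wikipedia-page similarity.
    return (w_mention * token_overlap(mention, ent.title)
            + w_pop * ent.popularity
            + w_ctx * token_overlap(sentence, ent.page_text))

def filter_candidates(mention: str, sentence: str, cands: list, k: int = 30) -> list:
    # Keep only the top-k candidates by combined score.
    return sorted(cands, key=lambda e: score_candidate(mention, sentence, e), reverse=True)[:k]
```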

@hitercs
Author

hitercs commented Dec 18, 2020

@lorr1 Thanks for your answer! How about the score in the data/wiki_entity_data/entity_mappings/alias2qids.json file? Is it generated by averaging the scores of the same alias-entity pairs over Wikipedia anchor texts using the same contextual candidate generator?
Thanks.

@lorr1
Contributor

lorr1 commented Dec 19, 2020

That one is used just for training, so it is not contextual. We could certainly make it contextual (and are exploring these ideas!), but we didn't use a contextual generator for training. That score is based on an entity's overall occurrence count in Wikipedia, so it's a ranking based on entity popularity. Note that it is not conditioned on a specific alias; it's just overall entity popularity. We found this was necessary when incorporating aliases from Wikidata that may never have been seen in Wikipedia yet are still valid aliases.
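Conceptually, building that training-time map looks something like the sketch below. The inputs, counts, and variable names are made up for illustration; the real pipeline does more cleaning and filtering:

```python
import json
from collections import Counter, defaultdict

# Hypothetical inputs: anchor-text mentions mined from Wikipedia as (alias, QID) pairs,
# plus extra aliases pulled from Wikidata that may never appear as anchors.
wiki_anchor_mentions = [("david", "Q8016"), ("david", "Q8016"), ("david", "Q41564")]
wikidata_aliases = {"david": ["Q8016", "Q41564", "Q178517"]}

# Overall entity popularity: how often each QID is linked anywhere in Wikipedia.
entity_counts = Counter(qid for _, qid in wiki_anchor_mentions)

alias2qids = defaultdict(dict)
for alias, qids in wikidata_aliases.items():
    for qid in qids:
        # The score is the entity's global count, not conditioned on this alias,
        # so Wikidata-only aliases still get a sensible ranking.
        alias2qids[alias][qid] = entity_counts[qid]

# Store as alias -> list of [QID, score], sorted by popularity.
alias2qids = {a: sorted(([q, c] for q, c in m.items()), key=lambda x: -x[1])
              for a, m in alias2qids.items()}
print(json.dumps(alias2qids, indent=2))
```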

@hitercs
Author

hitercs commented Dec 19, 2020

Great, got it, all understood. Thanks!

@hitercs hitercs closed this as completed Dec 19, 2020
lorr1 added a commit that referenced this issue Mar 25, 2022