Add information content (IC) #40

Closed
goodmami opened this issue Nov 5, 2020 · 7 comments
Labels
enhancement New feature or request
Milestone

Comments

@goodmami
Owner

goodmami commented Nov 5, 2020

Three of the similarity measures require information content (IC) to work. The IC files shipped with NLTK's wordnet data are keyed by synset offsets, so those will need to be mapped somehow to something that this module uses.

@fcbond
Collaborator

fcbond commented Nov 5, 2020

May I suggest ILI numbers?

@goodmami
Owner Author

goodmami commented Nov 5, 2020

I'm not yet sure how IC works, but I'm happy to use that if it's sufficient. With ILIs instead of synset identifiers, I could imagine that two wordnets for the same language (PWN vs. English Wordnet, perhaps Italian or Chinese as well) could use the same IC for some corpus.

@arademaker

arademaker commented Nov 5, 2020

What is IC here? May I suggest considering the glosstag version that I am completing: https://github.com/own-pt/glosstag

I would be happy to discuss the best format for releasing it. The most up-to-date branch is AR.

@goodmami
Owner Author

goodmami commented Jan 21, 2021

An IC file has data like this:

6484n 87

Here, 6484n is synset offset 6484 with part of speech n in PWN 3.0, and 87 is the value associated with that synset, computed from occurrences of corpus words that match the synset. We can map the synsets to ILIs, which makes the data more portable across English wordnet versions and also usable for other languages, but note that these values came from English corpora. This matters for two reasons: we would expect a different distribution of ILIs across corpora in different languages, and the values themselves would be computed differently because they depend on how many senses each word has (each occurrence increments the value by 1/n, where n is the number of senses for the word). Strictly speaking, the numbers can change across English wordnet versions, too, but we wouldn't expect such a drastic change.
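
To make the format and the 1/n counting concrete, here is a minimal Python sketch. It assumes every data line looks like the example above (an offset immediately followed by a one-letter pos, then a numeric value), and the function names and the toy sense inventory are made up for illustration. It only mirrors the description in this comment and leaves out everything else a real counting tool would do.

```python
from collections import defaultdict


def parse_ic_line(line):
    """Split a line such as '6484n 87' into (offset, pos, value).

    Assumes the first field is the offset followed by a one-letter pos
    and the second field is the numeric value.
    """
    fields = line.split()
    key, value = fields[0], float(fields[1])
    return int(key[:-1]), key[-1], value


def accumulate_counts(tokens, senses_of):
    """Add 1/n to each sense of a word, where n is its number of senses.

    `senses_of` is a hypothetical mapping from a word to its synset keys;
    words without any sense are skipped.
    """
    counts = defaultdict(float)
    for word in tokens:
        senses = senses_of.get(word, [])
        for synset in senses:
            counts[synset] += 1 / len(senses)
    return dict(counts)
```

With a toy inventory where "bank" has two senses and "river" has one, two occurrences of "bank" add 0.5 to each of its senses per occurrence, so a wordnet that splits "bank" into more senses would give each of them a smaller share for the same corpus.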

Once this data is mapped and distributed, the next questions are how to work with it in Wn:

  • how to index it
  • how to download/install it
  • how to load it
  • how to use it

@goodmami
Owner Author

@arademaker Would your glosstag data serve as the source data to generate information content files? Or, if you envision some other use, perhaps open another issue, as that data seems different from the standard information content files.

@goodmami goodmami added this to the v0.8.0 milestone Jun 8, 2021
@goodmami
Owner Author

@fcbond In the implementation to be released, I use synset IDs for the internal mapping, with an ID-mapping function that derives those synset IDs from the old offset+pos encoding. ILIs wouldn't work well because not all synsets have ILIs, whereas every synset can receive information content weights. I also think we should discourage (or at least warn against) reusing information content between different wordnets.
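
As a rough illustration of what such an ID-mapping function could look like, here is a standalone Python sketch. It is not the released implementation: the function names, the `example-` prefix, and the zero-padded ID template are all assumptions chosen for the example, and a real wordnet may format its synset IDs differently.

```python
def make_synset_id(offset, pos, template="example-{offset:08d}-{pos}"):
    """Build a synset ID from the old offset+pos encoding.

    The template is a placeholder; substitute whatever pattern the
    target wordnet actually uses for its synset IDs.
    """
    return template.format(offset=offset, pos=pos)


def remap_weights(weights, template="example-{offset:08d}-{pos}"):
    """Re-key {(offset, pos): value} weights by synset ID."""
    return {
        make_synset_id(offset, pos, template): value
        for (offset, pos), value in weights.items()
    }
```

Because the template is a parameter, the same weights can be re-keyed for a wordnet with a different ID pattern, although the values themselves still reflect the wordnet they were counted against.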

@fcbond
Collaborator

fcbond commented Jun 29, 2021 via email


3 participants