Add information content (IC) #40

goodmami · 2020-11-05T03:48:45Z

Three of the similarity measures require information content to work. The IC files shipped with NLTK's wordnet data are keyed by synset offsets, so those offsets will need to be mapped somehow to the identifiers that this module uses.

fcbond · 2020-11-05T03:51:52Z

May I suggest ili numbers?

goodmami · 2020-11-05T03:57:00Z

I'm not yet sure how IC works, but I'm happy to use that if it's sufficient. With ILIs instead of synset identifiers, I could imagine that two wordnets for the same language (PWN vs. English Wordnet; Italian or Chinese, perhaps) could share the same IC for some corpus.

arademaker · 2020-11-05T17:23:09Z

What is IC here? May I suggest considering the glosstag version that I am completing: https://github.com/own-pt/glosstag

I would be happy to discuss the best format for releasing it. The most up-to-date branch is AR.

goodmami · 2021-01-21T09:29:53Z

An IC file has data like this:

6484n 87

Here 6484n is synset offset 6484 with pos n in PWN 3.0, and 87 is the value associated with that synset, computed from occurrences of words in a corpus matching synsets. We can map the synsets to ILIs, which makes them more portable across English wordnet versions and also makes them usable for other languages, but it should be noted that these values came from English corpora. This matters because not only would we expect a different distribution of ILIs across corpora in different languages, but the values would also be computed differently, since they depend on how many senses each word has (each occurrence increments the value by 1/n, where n is the number of senses for the word). Strictly speaking, the numbers can change across English wordnet versions too, but we wouldn't expect such a drastic change.
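To make the 1/n weighting concrete, here is a minimal sketch of how such counts might be accumulated. It is not the code that produced the NLTK files: the function name `accumulate_counts`, the `senses_of` lookup table, and the toy data are all made up for illustration, and the sketch leaves out anything else a real IC computation may do (smoothing, propagating counts up the hierarchy, and so on).

```python
from collections import defaultdict


def accumulate_counts(tokens, senses_of):
    """Spread each word occurrence evenly over that word's synsets.

    `senses_of` is a hypothetical mapping from a word form to the list
    of synset identifiers it can belong to.
    """
    counts = defaultdict(float)
    for word in tokens:
        synsets = senses_of.get(word, [])
        if not synsets:
            continue  # word is not in the wordnet, so it adds nothing
        share = 1.0 / len(synsets)  # the 1/n increment described above
        for synset in synsets:
            counts[synset] += share
    return dict(counts)


# Toy data, invented for this example only.
senses_of = {"bank": ["bank-1", "bank-2"], "river": ["river-1"]}
tokens = ["bank", "river", "bank", "unknown"]
counts = accumulate_counts(tokens, senses_of)
```

If a later wordnet version gave "bank" a third sense, each occurrence would add 1/3 instead of 1/2 to every sense, which is why the values are tied to the sense inventory they were computed against.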

Once this data is mapped and distributed, the next questions are how to work with it in Wn:

how to index it
how to download/install it
how to load it
how to use it

goodmami · 2021-01-21T09:35:03Z

@arademaker Would your glosstag data serve as the source data to generate information content files? Or if you envision some other use, perhaps open another issue as that data seems different from the standard information content files.

goodmami · 2021-06-24T04:44:35Z

@fcbond in the implementation to be released, I use the synset IDs for the internal mapping, and I use an ID-mapping function to get those synset IDs from the old offset+pos encoding. ILIs wouldn't be good because not all synsets have ILIs, but all can receive information content weights, and also I think we should discourage (or at least warn against) reusing information content between different wordnets.
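As a rough illustration of the offset+pos mapping described above (not the released implementation), the sketch below parses lines of the form `6484n 87` and rewrites each key as a synset ID. The helper names and the ID template `example-{offset:08d}-{pos}` are assumptions made for this example; the actual ID scheme depends on the wordnet being used.

```python
import re

# Matches lines like "6484n 87": offset digits, a pos letter, then a value.
IC_LINE = re.compile(r"^(\d+)([a-z])\s+(\d+(?:\.\d+)?)")


def parse_ic_line(line):
    """Return (offset, pos, value), or None if the line does not match."""
    match = IC_LINE.match(line.strip())
    if match is None:
        return None
    offset, pos, value = match.groups()
    return int(offset), pos, float(value)


def offset_pos_to_id(offset, pos, template="example-{offset:08d}-{pos}"):
    """Hypothetical ID-mapping function; the template is an assumption."""
    return template.format(offset=offset, pos=pos)


def load_ic(lines, template="example-{offset:08d}-{pos}"):
    """Build a {synset_id: weight} mapping, skipping unparseable lines."""
    weights = {}
    for line in lines:
        parsed = parse_ic_line(line)
        if parsed is None:
            continue
        offset, pos, value = parsed
        weights[offset_pos_to_id(offset, pos, template)] = value
    return weights


weights = load_ic(["6484n 87", "not an IC line"])
```

Keying the weights by synset ID rather than ILI means every synset can receive a weight, including those that have no ILI.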

fcbond · 2021-06-29T01:17:44Z

I agree that that makes sense. In fact, even for different versions of the same wordnet, the number of senses may change, which will affect the calculations, ...

goodmami added the "enhancement" (New feature or request) label Nov 5, 2020

goodmami mentioned this issue Nov 5, 2020

Notice: please pin wn dependency alvations/pywsd#62

Closed

goodmami mentioned this issue Apr 7, 2021

Most common word sense #111

Closed

goodmami mentioned this issue Jun 5, 2021

Resnik, Lin similarity ? #120

Closed

goodmami added this to the v0.8.0 milestone Jun 8, 2021

goodmami closed this as completed in a4aa995 Jul 6, 2021
