Add information content (IC) #40
Comments
May I suggest ILI numbers?
I'm not yet sure how IC works, but I'm happy to use that if it's sufficient. With ILIs instead of synset identifiers, I could imagine that two wordnets for the same language (PWN vs. English Wordnet, or perhaps ones for Italian or Chinese) could use the same IC for some corpus.
What is IC here? May I suggest considering the glosstag version that I am completing: https://github.com/own-pt/glosstag. I would be happy to discuss the best format for releasing it. The most up-to-date branch is AR.
An IC file has data like this:
Where 6484n is a synset offset (with a part-of-speech suffix). Once this data is mapped and distributed, the next questions are how to work with it in Wn:
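As a rough sketch of reading the offset-keyed format described above: this assumes each data line is `<offset><pos> <weight>` with an optional trailing `ROOT` marker, and that the first line is a version header, which is how I recall NLTK's files being laid out. The sample lines are illustrative only (the weights are made up and the header value is a placeholder), and `parse_ic` is a hypothetical helper, not part of Wn or the NLTK.

```python
from collections import defaultdict

# Illustrative data only: weights are invented, header value is a placeholder.
SAMPLE = """\
wnver::placeholder
1740n 100.0 ROOT
6484n 2.5
1740v 40.0 ROOT
"""


def parse_ic(text):
    """Return {pos: {offset: weight}} parsed from IC-file text (hypothetical helper)."""
    ic = defaultdict(dict)
    lines = text.splitlines()
    for line in lines[1:]:  # assumption: the first line is a version header
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        key, weight = fields[0], float(fields[1])
        # The key is the numeric offset followed by a one-letter part of speech.
        offset, pos = int(key[:-1]), key[-1]
        ic[pos][offset] = weight
    return ic


ic = parse_ic(SAMPLE)
print(ic["n"][6484])
```

Keeping the weights grouped by part of speech means the same offset can appear for nouns and verbs without colliding.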
@arademaker Would your glosstag data serve as the source data to generate information content files? Or if you envision some other use, perhaps open another issue as that data seems different from the standard information content files.
@fcbond in the implementation to be released, I use the synset IDs for the internal mapping, and I use an ID-mapping function to get those synset IDs from the old offset+pos encoding. ILIs wouldn't be good because not all synsets have ILIs, but all can receive information content weights, and also I think we should discourage (or at least warn against) reusing information content between different wordnets.
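To illustrate what such an ID-mapping function might look like, here is a minimal sketch. It assumes synset IDs follow a `<prefix>-<8-digit offset>-<pos>` pattern; the template, the `example` prefix, and the `make_id_mapper` name are all hypothetical and not taken from the released implementation.

```python
def make_id_mapper(template="{prefix}-{offset:08d}-{pos}", prefix="example"):
    """Build a function mapping 'offset+pos' keys to synset IDs.

    The template and prefix are assumptions; a real lexicon may use
    a different ID scheme.
    """

    def to_synset_id(key):
        # Split a key such as '6484n' into its offset and part of speech.
        offset, pos = int(key[:-1]), key[-1]
        return template.format(prefix=prefix, offset=offset, pos=pos)

    return to_synset_id


to_id = make_id_mapper()
print(to_id("6484n"))  # example-00006484-n
```

Passing the mapper as a parameter, rather than hard-coding one scheme, would let each wordnet supply its own ID format.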
I agree that that makes sense. In fact, even for different versions of the same wordnet, the number of senses may change, which will affect the calculations, ...
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Three of the similarity measures require information content to work. The IC files shipped with NLTK's wordnet data are keyed by synset offsets, so those will need to be mapped somehow to whatever identifiers this module uses.
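For context, here is a small sketch of how IC feeds into such measures. The issue does not name the three measures; this assumes they are Resnik, Jiang-Conrath, and Lin, using the definitions I know from the NLTK. The counts are toy values and the function names are illustrative, not this module's API.

```python
import math


def information_content(count, total):
    """IC = -log(p), where p is the relative frequency of a synset."""
    return -math.log(count / total)


def resnik(ic_lcs):
    """Resnik similarity: the IC of the least common subsumer (LCS)."""
    return ic_lcs


def jiang_conrath(ic_a, ic_b, ic_lcs):
    """Jiang-Conrath similarity: inverse of the IC-based distance."""
    distance = ic_a + ic_b - 2 * ic_lcs
    return math.inf if distance == 0 else 1 / distance


def lin(ic_a, ic_b, ic_lcs):
    """Lin similarity: twice the LCS's IC over the summed ICs.

    Undefined (division by zero) when both ICs are 0.
    """
    return 2 * ic_lcs / (ic_a + ic_b)


# Toy counts (assumptions): two synsets and their least common subsumer.
TOTAL = 1000.0
ic_a = information_content(10.0, TOTAL)
ic_b = information_content(20.0, TOTAL)
ic_lcs = information_content(100.0, TOTAL)

print(resnik(ic_lcs), jiang_conrath(ic_a, ic_b, ic_lcs), lin(ic_a, ic_b, ic_lcs))
```

All three depend on per-synset weights, which is why the offset-keyed data has to be mapped to this module's synsets before the measures can be used.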