
Fix WordNet 3.0 gloss inconsistencies #160

Open
genericallyterrible opened this issue Sep 10, 2021 · 5 comments
@genericallyterrible

@fcbond, @stevenbird There are several consistency issues with the gloss portions of WordNet 3.0 that make parsing difficult. Would it be possible for us to manually fix these issues without breaking word associations, as happened with the problems currently facing the update to WordNet 3.1?

@fcbond
Contributor

fcbond commented Sep 11, 2021 via email

@goodmami

Would it be possible for us to manually fix these issues without breaking word associations [...]

Replying specifically to this: it is incredibly difficult to alter WNDB data without breaking things. First, the synset IDs are byte offsets into the data files, so any modified gloss would have to occupy exactly the same number of bytes as before. Second, we're not allowed to change the Princeton WordNet data and still call it Princeton WordNet (it would have to be called the "NLTK Wordnet of English" or something).
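The byte-offset constraint can be shown with a toy sketch. The records below are synthetic, hand-written lines in the general shape of a WNDB data file, not real WordNet data; the point is only that a record's ID must equal its byte offset, so any length change ripples through every later offset:

```python
# Toy sketch of the WNDB byte-offset constraint (synthetic records,
# not real WordNet data). In WNDB files such as data.noun, a synset's
# ID is its byte offset in the file, so an edit that changes a gloss's
# length invalidates every pointer to all later synsets.

records = [
    b"00000000 03 n 01 entity 0 000 | that which exists\n",
    b"00000050 03 n 01 thing 0 001 @ 00000000 n 0000 | a separate item\n",
]

# The second record's ID claims it lives at byte offset 50.
claimed_offset = int(records[1].split()[0])
actual_offset = len(records[0])
assert claimed_offset == actual_offset  # consistent before editing

# Lengthen the first gloss by a few bytes...
edited = records[0].replace(b"that which exists", b"something that exists")

# ...and the second record no longer sits at its claimed offset, so any
# "@ 00000050"-style pointer elsewhere in the database would now dangle.
print(actual_offset, len(edited))  # 50 vs 54
```

Any fix to a gloss would therefore have to be length-preserving, or else every offset-based ID in the database would need to be rewritten.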

the problems currently facing the update to WordNet 3.1?

That issue was closed 2 years ago, which suggests to me that there are no plans to add WordNet 3.1 to the NLTK. There was an attempt at adding next-generation wordnet support to, or alongside, the NLTK (see https://github.com/nltk/wordnet), and it included WordNet 3.1 data as an option. Development stalled, however, so I took over the effort (and package name on PyPI) with an entirely new module, which Francis has linked above.

@stevenbird
Member

@goodmami, thanks for the update. This sounds like a more sustainable option. How easily could a user of the NLTK wordnet package port their code to use your package? Does it include the similarity metrics?

@fcbond
Contributor

fcbond commented Sep 14, 2021 via email

@goodmami

Thanks, @fcbond!

@stevenbird, Wn has the similarity metrics, information content (it even reads the wordnet_ic files from nltk_data), Morphy, etc. Some absent features that may be desired are looking things up by sense keys (e.g., eat%2:34:02::; workaround) or the NLTK's shorthand synset identifiers (feed.v.06). If you wish to discuss a plan for deprecating the NLTK's wordnet module in favor of Wn, we should open separate issues to track the necessary changes to the code, data, documentation, and book.
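For readers unfamiliar with sense keys: a key such as eat%2:34:02:: is a plain colon-delimited string, so the lookup workaround mostly amounts to decomposing it. A minimal sketch, following the field layout documented in Princeton's senseidx(5WN) man page (the function name here is made up for illustration):

```python
# Sketch: decompose a WordNet sense key. Per senseidx(5WN), the format is
#   lemma%ss_type:lex_filenum:lex_id:head_word:head_id
# where head_word/head_id are non-empty only for adjective satellites.
SS_TYPES = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(key: str) -> dict:
    lemma, lex_sense = key.split("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[ss_type],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }

print(parse_sense_key("eat%2:34:02::"))
# {'lemma': 'eat', 'pos': 'v', 'lex_filenum': 34, 'lex_id': 2,
#  'head_word': None, 'head_id': None}
```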

Back to the current issue: in the modern WN-LMF format for wordnets, Definition and Example elements are structurally separate, having been split from WNDB's combined "gloss" line in the format-conversion process. That process, however, may not account for the inconsistencies noted by @genericallyterrible, who did a thorough analysis in nltk/nltk#2527. So as not to let that effort go to waste, it might be good to compare it with the WNDB-to-LMF converter. The relevant code is here.
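To make the structural split concrete, here is a minimal sketch of reading Definition and Example from a WN-LMF-style synset. The XML fragment is hand-written in the style of WN-LMF for illustration, not real exported data, and the synset ID and texts are invented:

```python
# Sketch: in WN-LMF, Definition and Example are separate elements,
# unlike WNDB's single combined gloss line. The fragment below is a
# hand-written illustration, not real exported wordnet data.
import xml.etree.ElementTree as ET

lmf_fragment = """
<Synset id="ewn-00000000-n">
  <Definition>a device that conveys people or goods up and down</Definition>
  <Example>"the elevator was out of order"</Example>
  <Example>"she took the lift to the third floor"</Example>
</Synset>
"""

synset = ET.fromstring(lmf_fragment)
definition = synset.findtext("Definition")
examples = [e.text for e in synset.findall("Example")]
print(definition)
print(examples)
```

A converter from WNDB has to decide where the gloss's definition ends and its quoted examples begin, which is exactly where the inconsistencies analyzed in nltk/nltk#2527 matter.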


4 participants