Support OMW 1.4 #2899

Merged
merged 13 commits into from Dec 8, 2021

Conversation

Contributor

@ekaf ekaf commented Nov 29, 2021

This PR adapts the multilingual functions in wordnet.py to use the new OMW-data 1.4 (nltk/nltk_data#171), the recent release of the Open Multilingual Wordnet.

The directory structure of the new nltk_data/corpora/omw package has a slightly different layout, where each folder name indicates the provenance of any number of wordnets included in the corresponding folder.

For English and Italian, OMW now includes wordnets from two different provenances, so the lang parameter must encode the provenance whenever more than one wordnet exists for the same language (for example, 'ita' and 'ita_iwn').

In addition to lemmas, some wordnets in OMW 1.4 now include definitions (def) and examples (exe).

This PR supports both the new and the old omw formats.
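
As a rough illustration of how a lang identifier can carry the provenance, the sketch below gives the first wordnet seen for a language the bare language code and appends the provenance to any later one. The function name `assign_lang_ids`, the input pairs, and the provenance folder names are assumptions made for illustration (only the resulting identifier `ita_iwn` appears in this thread); this is not the code in wordnet.py.

```python
def assign_lang_ids(wordnets):
    """Map (provenance, language) pairs to unique lang identifiers.

    The first wordnet seen for a language keeps the bare language code;
    later ones get the provenance appended, e.g. 'ita_iwn'.
    """
    lang_ids = {}
    for provenance, language in wordnets:
        lang_id = language
        if lang_id in lang_ids:
            # Another wordnet already claimed this language code,
            # so disambiguate by appending the provenance.
            lang_id = f"{language}_{provenance}"
        lang_ids[lang_id] = provenance
    return lang_ids


# Hypothetical input: the provenance folder names are illustrative only.
example = [("jpn", "jpn"), ("ita", "ita"), ("iwn", "ita")]
print(assign_lang_ids(example))
```

With this scheme, callers that only know one wordnet per language keep using the plain code, and the longer identifier is needed only where a second wordnet exists.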

Contributor Author

@ekaf ekaf commented Nov 29, 2021

Example use, adapted from #2423 (comment)

import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.1

print(wn.langs())

dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'qcn', 'dan', 'ell', 'eng_eng', 'fas', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])

for ss in wn.synsets('犬', lang='jpn'):
    print(ss)
    for lg in ["jpn","ita_iwn"]:
        print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")
        print(f"{lg} definition:{ss.definition(lang=lg)}")
    print()

Synset('dog.n.01')
jpn lemmas:[Lemma('dog.n.01.ドッグ'), Lemma('dog.n.01.イヌ'), Lemma('dog.n.01.洋犬'), Lemma('dog.n.01.犬'), Lemma('dog.n.01.飼い犬'), Lemma('dog.n.01.飼犬')]
jpn definition:['有史以前から人間に家畜化されて来た(おそらく普通のオオカミを先祖とする)イヌ属の動物', '多数の品種がある']
ita_iwn lemmas:[Lemma('dog.n.01.cane')]
ita_iwn definition:['animale domestico molto comune, diffuso in tutto il mondo, usato per la caccia, la difesa, nella pastorizia, e come animale da compagnia']

Synset('spy.n.01')
jpn lemmas:[Lemma('spy.n.01.スパイ'), Lemma('spy.n.01.いぬ'), Lemma('spy.n.01.回し者'), Lemma('spy.n.01.回者'), Lemma('spy.n.01.密偵'), Lemma('spy.n.01.工作員'), Lemma('spy.n.01.廻し者'), Lemma('spy.n.01.廻者'), Lemma('spy.n.01.探り'), Lemma('spy.n.01.探'), Lemma('spy.n.01.犬'), Lemma('spy.n.01.秘密捜査員'), Lemma('spy.n.01.まわし者'), Lemma('spy.n.01.諜報員'), Lemma('spy.n.01.諜者'), Lemma('spy.n.01.間者'), Lemma('spy.n.01.間諜'), Lemma('spy.n.01.隠密')]
jpn definition:['敵の情報を得るために国家に雇われた、または競合他社の企業秘密を得るために会社に雇われた秘密諜報部員']
ita_iwn lemmas:[Lemma('spy.n.01.agente_segreto'), Lemma('spy.n.01.emissario'), Lemma('spy.n.01.spia')]
ita_iwn definition:['chi esercita lo spionaggio']

Contributor Author

@ekaf ekaf commented Dec 4, 2021

import nltk
from nltk.corpus import wordnet as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.0

for lg in sorted(wn.langs()):
    print(f"{lg}: {len(list(wn.words(lang=lg)))} words in {len(list(wn.all_synsets(lang=lg)))} synsets")

als: 5988 words in 4675 synsets
arb: 17785 words in 9916 synsets
bul: 6720 words in 4959 synsets
cat: 46531 words in 45826 synsets
cmn: 61533 words in 42300 synsets
dan: 4468 words in 4476 synsets
ell: 18225 words in 18049 synsets
eng: 147306 words in 117659 synsets
eus: 26240 words in 29413 synsets
fin: 129839 words in 116763 synsets
fra: 55351 words in 59091 synsets
glg: 23124 words in 19311 synsets
heb: 5325 words in 5448 synsets
hrv: 29008 words in 23115 synsets
ind: 36954 words in 38085 synsets
isl: 11504 words in 4951 synsets
ita: 41855 words in 35001 synsets
ita_iwn: 19221 words in 15563 synsets
jpn: 91964 words in 57184 synsets
lit: 11395 words in 9462 synsets
nld: 43077 words in 30177 synsets
nno: 3387 words in 3671 synsets
nob: 4186 words in 4455 synsets
pol: 45387 words in 33826 synsets
por: 54071 words in 43895 synsets
ron: 49987 words in 56026 synsets
slk: 29150 words in 18507 synsets
slv: 40230 words in 42583 synsets
spa: 36681 words in 38512 synsets
swe: 5824 words in 6796 synsets
tha: 82504 words in 73350 synsets
zsm: 33932 words in 36911 synsets

Contributor Author

@ekaf ekaf commented Dec 4, 2021

After the latest commit, pytest succeeds on windows but fails on mac and ubuntu.
The error doesn't seem related to this PR:

E LookupError:
E **********************************************************************
E Resource inaugural not found.
E Please use the NLTK Downloader to obtain the resource:

Member

@tomaarsen tomaarsen commented Dec 5, 2021

@ekaf
This was caused by an issue that has since been solved through nltk/nltk_data#174. However, the CI has cached the (broken) nltk_data. Rerunning the CI with a new CACHE_VERSION secret should force the CI to fetch nltk_data afresh, but it seems that it does not. I'll try pushing an empty commit to force the CI to restart. If all is well, that should work.

Member

@tomaarsen tomaarsen commented Dec 5, 2021

Annoyingly, this does not seem to be working: The broken cache of nltk_data is still being used. I can't really tell why. The CACHE_VERSION secret was changed, and I added that to the cache key solely for allowing me to reset the cache as suggested in this SO link: https://stackoverflow.com/questions/63521430/clear-cache-in-github-actions.

This is getting a bit frustrating.

Contributor Author

@ekaf ekaf commented Dec 6, 2021

@tomaarsen, there seem to be workarounds at r-lib/actions#86

Contributor Author

@ekaf ekaf commented Dec 7, 2021

@tomaarsen, here is a commit that looks like it worked: Robinlovelace/geocompr@9189efb

Contributor Author

@ekaf ekaf commented Dec 7, 2021

@tomaarsen: prefixing the key with "new-" in .github/workflows/ci.yaml actually cleared the cache. This also needs to be done in a second place in the file, so that the new cache is used instead of the old one. One possible explanation for why changing only the secret didn't work is that this variable is interpreted as empty (I'm just guessing...).
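
To make that guess concrete, here is a minimal sketch of a cache key built by string interpolation. The key layout and the helper names `resolve_secret` and `build_cache_key` are hypothetical and not copied from ci.yaml; the sketch only shows that if the variable resolves to an empty string in every run, changing its value cannot change the key, whereas a literal prefix always does.

```python
def resolve_secret(value, available):
    """Return the secret's value, or an empty string when it is unavailable."""
    return value if available else ""


def build_cache_key(prefix, cache_version, data_hash):
    """Concatenate the key parts, as template interpolation would."""
    # Hypothetical key layout, chosen for illustration only.
    return f"{prefix}nltk-data-{cache_version}-{data_hash}"


# If the secret resolves to "" in every run, changing its value has no effect.
before = build_cache_key("", resolve_secret("1", available=False), "abc")
after = build_cache_key("", resolve_secret("2", available=False), "abc")
print(before == after)  # True: the stale cache entry is still matched

# A literal prefix changes the key regardless of the secret.
prefixed = build_cache_key("new-", resolve_secret("2", available=False), "abc")
print(prefixed == before)  # False: a fresh cache entry is created
```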

Contributor Author

@ekaf ekaf commented Dec 8, 2021

@tomaarsen, the changes to .github/workflows/ci.yaml don't belong in this PR, since they solve a completely different problem. So maybe that part should be split out into another PR about clearing cached dependencies. However, since I don't control the ${{ secrets.CACHE_VERSION }} variable, I feel that you would be better equipped to handle this.

On the other hand, the update to wordnet.py is urgently needed in order to fix the new issue #2905 (comment), which arises because the new OMW package was merged into nltk_data without the present PR also being merged.

Please let me know if there is anything I can do about this.

Member

@tomaarsen tomaarsen left a comment

I've scheduled some time this morning to look at this PR. I've reverted to using the "normal" key, and it seems like the cache has been refreshed by now. I've also created a helper method for Synset.definition() and Synset.examples(), as the code for these was near identical.
Beyond that, I had to update some doctests which were failing due to the nltk_data changes.

If these tests are failing for you locally, then either:

  • Your omw nltk_data is outdated, or
  • You've updated your omw nltk_data, but the old files were not removed. Deleting the omw folder within nltk_data and re-downloading will solve this. Alternatively, you can delete the entire nltk_data and redownload it all.
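
The second remedy can be sketched in Python as follows. This is a minimal sketch under assumptions: the helper name `remove_stale_omw` is hypothetical, the data directory is passed in explicitly, and removing a leftover `omw.zip` archive next to the folder is an extra precaution that is not mentioned above.

```python
import shutil
from pathlib import Path


def remove_stale_omw(nltk_data_dir):
    """Delete corpora/omw (and corpora/omw.zip, if any) under nltk_data_dir.

    Returns the list of paths that were actually removed.
    """
    corpora = Path(nltk_data_dir) / "corpora"
    removed = []

    folder = corpora / "omw"
    if folder.is_dir():
        # Remove the whole folder so that no old files survive the update.
        shutil.rmtree(folder)
        removed.append(folder)

    archive = corpora / "omw.zip"  # assumption: a downloaded archive may remain
    if archive.is_file():
        archive.unlink()
        removed.append(archive)

    return removed


# Example call (the path is a placeholder for your own nltk_data location):
# remove_stale_omw(Path.home() / "nltk_data")
```

After the old files are gone, re-download the package with the NLTK Downloader so that only the new layout is present.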

This PR is ready for merging as far as I can tell.

The problem is that nltk_data cannot be pinned to an older version. People can't say "Oh, my NLTK is locked to 3.2.5, so I'll use the nltk_data that works with that version". Because of these changes, no NLTK version works as expected, with the exception of this PR.
It is a priority that we merge this PR, and publish a new version.

In part due to this PR and its consequences, I believe it's time to release 3.7.0 rather than 3.6.6. After all, the nltk_data changes essentially deprecate all currently released NLTK versions, I'm afraid.

Contributor Author

@ekaf ekaf commented Dec 8, 2021

@tomaarsen , I'm sorry for all the trouble you have with this PR.
I have now tested your changes to wordnet.doctest with "tox -e py39", and everything was fine, except for one failure:


096     >>> len(inaugural.words())
Expected:
    152901
Got:   
    149797

Member

@tomaarsen tomaarsen commented Dec 8, 2021

@ekaf This is likely a consequence of having an outdated inaugural. python -m nltk.downloader --force inaugural ought to help. If that does not help, then it might be because inaugural was temporarily broken, meaning that you might have unintended files in your local version of nltk_data.

Contributor Author

@ekaf ekaf commented Dec 8, 2021

@tomaarsen yes, you are right, with the new inaugural package all tests now succeed.
congratulations :)

Member

@tomaarsen tomaarsen commented Dec 8, 2021

Glad to hear!

I'll merge this, so people with issues like #2905 at least have a solution that isn't just using this PR. Thanks for these changes, and thanks for bearing with me while we've been having these cache issues.

@tomaarsen tomaarsen merged commit 8ed8b70 into nltk:develop Dec 8, 2021
16 checks passed
Contributor Author

@ekaf ekaf commented Dec 8, 2021

Definitions and examples also work with Albanian ('als'):

import nltk
from nltk.corpus import wordnet2021 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 2021

ss = wn.synset('school.n.02')
lg='als'
print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")

als lemmas:[Lemma('school.n.02.mësonjëtore'), Lemma('school.n.02.shkollë')]

print(f"{lg} definition:{ss.definition(lang=lg)}")

als definition:['institucion arsimor ku mëson dhe edukohet në mënyrë të organizuar brezi i ri; një institucion i tillë i specializuar; ndërtesa e këtij institucioni']

print(f"{lg} examples:{ss.examples(lang=lg)}")

als examples:['Shkolla është ndërtuar më 1932', 'Ai shkon në shkoll çdo ditë']

@ekaf ekaf deleted the omw14 branch Dec 8, 2021