Add multilingual wordnet #9

Closed
stevenbird opened this issue Oct 28, 2013 · 22 comments
@stevenbird
Member

@francisbond is contributing the Open Multilingual Wordnet to NLTK (http://www.casta-net.jp/~kuribayashi/multi/).

We need to settle on a short name to use: multiwordnet?

@ghost ghost assigned stevenbird Oct 28, 2013
@fcbond
Contributor

fcbond commented Nov 5, 2013

There is an Italian project called 'MultiWordNet', so I would like to avoid just 'multiwordnet'. How about omw?

@stevenbird
Member Author

OK. We're often writing "from nltk.corpus import wordnet as wn", and so wn has gained some currency as an abbreviation for WordNet.

We could have omwn. But in a world where openness is the unmarked case, we could have mwn.

Do either of these appeal or would you still prefer omw?

@fcbond
Contributor

fcbond commented Nov 6, 2013

G'day,


I also like to think of openness as the default, but 'mwn' is still a bit
close to MultiWordNet. I guess omwn is ok, although I have a slight
preference for 'omw'. 'wngrid' is another possibility: this is the name
chosen by the Global WordNet Association, and we are now the current
implementation.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@stevenbird
Member Author

OK, omw it is then, thanks.

@stevenbird
Member Author

The list of languages in the supplied omw corpus is as follows. I think fre is spurious (a copy of fra) and we seem to be missing ind even though it is mentioned in the documentation.

als cmn eng fin fre ita mcr nor por
arb dan fas fra heb jpn msa pol tha

@fcbond would you please advise.
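One quick way to check whether fre really is a copy of fra (a sketch, assuming the corpus is unpacked under ~/nltk_data/corpora/omw as in the directory listing later in this thread):

import filecmp, os

base = os.path.expanduser('~/nltk_data/corpora/omw')
# report which files differ between the two language directories
filecmp.dircmp(os.path.join(base, 'fre'), os.path.join(base, 'fra')).report()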

@stevenbird stevenbird reopened this May 3, 2014
@fcbond
Contributor

fcbond commented May 4, 2014

The current list is as follows:

langs = ("eng", "ind", "zsm", "jpn", "tha",
         "cmn", "qcn",
         "fas", "arb", "heb", "ita", "por",
         "nob", "nno", "dan", "swe",
         "fra", "fin", "ell",
         "glg", "cat", "spa", "eus",
         "als", "pol", "slv")

We use qcn for traditional Chinese (and the slightly differently designed NTU, Taiwan Chinese Wordnet).

We will try to upload a new omw.zip sometime today.

# build a nested table: t[language][display-language] -> language name
from collections import defaultdict as dd

t = dd(lambda: dd(unicode))

t['eng']['eng'] = 'English'
t['eng']['ind'] = 'Inggeris'
t['eng']['zsm'] = 'Inggeris'
t['ind']['eng'] = 'Indonesian'
t['ind']['ind'] = 'Bahasa Indonesia'
t['ind']['zsm'] = 'Bahasa Indonesia'
t['zsm']['eng'] = 'Malaysian'
t['zsm']['ind'] = 'Bahasa Malaysia'
t['zsm']['zsm'] = 'Bahasa Malaysia'
t['msa']['eng'] = 'Malay'

t["swe"]["eng"] = "Swedish";
t["ell"]["eng"] = "Greek";
t["cmn"]["eng"] = "Chinese (simplified)";
t["qcn"]["eng"] = "Chinese (traditional)";
t['eng']['cmn'] = u'英语'
t['cmn']['cmn'] = u'汉语'
t['qcn']['cmn'] = u'漢語'
t['cmn']['qcn'] = u'汉语'
t['qcn']['qcn'] = u'漢語'
t['jpn']['cmn'] = u'日语'
t['jpn']['qcn'] = u'日语'

t['als']['eng'] = 'Albanian'
t['arb']['eng'] = 'Arabic'
t['cat']['eng'] = 'Catalan'
t['dan']['eng'] = 'Danish'
t['eus']['eng'] = 'Basque'
t['fas']['eng'] = 'Farsi'
t['fin']['eng'] = 'Finnish'
t['fra']['eng'] = 'French'
t['glg']['eng'] = 'Galician'
t['heb']['eng'] = 'Hebrew'
t['ita']['eng'] = 'Italian'
t['jpn']['eng'] = 'Japanese'
t['mkd']['eng'] = 'Macedonian'
t['nno']['eng'] = 'Nynorsk'
t['nob']['eng'] = u'Bokmål'
t['pol']['eng'] = 'Polish'
t['por']['eng'] = 'Portuguese'
t['slv']['eng'] = 'Slovene'
t['spa']['eng'] = 'Spanish'
t['tha']['eng'] = 'Thai'
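Once built, the table gives the name of one language as written in another, e.g. (values from the assignments above):

t['jpn']['eng']   # -> 'Japanese'
t['eng']['ind']   # -> 'Inggeris'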

@franquattri

Hi, I've got the same problem that somebody posted on Quora some months ago:
"I can call:
from nltk.corpus import sinica_treebank

but when I call
from nltk.corpus import omw
the result is: cannot import name omw
No module named omw."

I checked the downloader and the omw is installed. I am using Python 2.7.
Other modules work fine.
Any clues? Thanks in advance.

@franquattri

One just needed to read the NLTK cookbook more carefully. You don't need to import an 'omw' module; you can access the data directly by simply importing wordnet (wn). More under: http://www.nltk.org/howto/wordnet.html
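For example, a minimal sketch (the synset and language codes here are just illustrative):

from nltk.corpus import wordnet as wn

wn.langs()                                 # ISO 639-3 codes of the installed OMW languages
wn.synset('dog.n.01').lemma_names('jpn')   # Japanese lemmas for an English synset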

@alvations
Contributor

A user reported missing spanish lemmas from OMW: http://stackoverflow.com/questions/26474731/missing-spanish-wordnet-from-nltk/26494099#26494099

@DarrenCook

@franquattri It would be useful if the howto showed full installation instructions. On Ubuntu 14.04, with the data URL fixed (http://askubuntu.com/a/527408/93794), I have wordnet and omw installed (I see them under ~/nltk_data/corpora), but when I work through http://www.nltk.org/howto/wordnet.html many of the examples fail; in particular wn.langs() fails with "AttributeError: 'WordNetCorpusReader' object has no attribute 'langs'".
Is that manual for a specific version?

@franquattri

Hi Darren, the manual has been updated for NLTK 3.0, but it should work fine with previous NLTK versions too. I'm working on Windows with Python 2.7 and IPython (which I also suggest for Unicode matters). Both attempts work for me:

from nltk.corpus import wordnet as wn
wn.langs()

and

from nltk.corpus import wordnet as wn
sorted(wn.langs())  # as shown at http://www.nltk.org/howto/wordnet.html

Can you be more specific about the examples that fail?

@alvations
Contributor

@DarrenCook, there are discrepancies between the API, the documentation and the nltk_data, but I'm sure the OMW team will fix them and the documentation will follow shortly.

Please note that Catalan seems to be missing from wn.langs() although it's in the MCR.

>>> import nltk
>>> nltk.__version__
'3.0.0'
>>> nltk.download('omw')
[nltk_data] Downloading package omw to /home/alvas/nltk_data...
[nltk_data]   Package omw is already up-to-date!
True

>>> from nltk.corpus import wordnet as wn
>>> wn.langs()
[u'als', u'arb', u'cmn', u'dan', u'eng', u'fas', u'fin', u'fra', u'fre', u'heb', u'ita', u'jpn', u'cat', u'eus', u'glg', u'spa', u'ind', u'zsm', u'nno', u'nob', u'pol', u'por', u'tha']
>>> exit()
alvas@ubi:~$ cd ~/nltk_data/corpora/omw/
alvas@ubi:~/nltk_data/corpora/omw$ ls
als  cmn  eng  fin  fre  ita  mcr  nor  por     tha
arb  dan  fas  fra  heb  jpn  msa  pol  README

alvas@ubi:~/nltk_data/corpora/omw$ cd mcr/
alvas@ubi:~/nltk_data/corpora/omw/mcr$ ls
LICENSE     wn-data-cat.tab  wn-data-glg.tab  wn-data-spa.tab.gz
mcr2tab.py  wn-data-eus.tab  wn-data-spa.tab

@DarrenCook

nltk.__version__
'2.0b9'

Is that too old?

(apt-get install python-nltk tells me "python-nltk is already the newest version.")

Working through the examples, the first one that fails is "print(wn.synset('dog.n.01').definition())", which says "TypeError: 'str' object is not callable". The three commands before that worked fine.

@alvations
Contributor

Using pip install -U nltk would update to 3.0.0. apt-get is still holding the older version.

With regards to accessing synsets from the wordnet API in NLTK, i think the major change would be nltk/nltk@ba8ab7e
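In short, NLTK 2.x exposed synset fields as plain attributes while NLTK 3.0 makes them methods, which is exactly what the TypeError above points at (a sketch of the change, not the full diff):

from nltk.corpus import wordnet as wn

wn.synset('dog.n.01').definition     # NLTK 2.x: a string attribute; calling it raises TypeError
wn.synset('dog.n.01').definition()   # NLTK 3.0: a method call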

Possibly you'll find errors from nltk.download() too if you're using the apt-get version of NLTK; see http://askubuntu.com/questions/527388/python-nltk-on-ubuntu-12-04-lts-nltk-downloadbrown-results-in-html-error-40

See also:
Change Log: https://github.com/nltk/nltk/blob/develop/ChangeLog
API Changes: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0

@franquattri

@DarrenCook are you sure you have installed NLTK correctly? You can take a look here: http://www.nltk.org/install.html

To find out which NLTK version you have:
import nltk
nltk.__version__

To update NLTK / modules (on Windows): Command Prompt > python -m pip install --upgrade SomePackage

Are you using the WN version that comes with NLTK (WN 3.0) or the newest release (i.e. have you imported it into NLTK)? There might be some issues for that reason as well.

@DarrenCook

Thanks @alvations and Francesca for your help. These two commands got everything working:

sudo apt-get install python-pip
sudo pip install -U nltk

@franquattri I think I may have downloaded the latest wordnet, while having the 2.0b9 of nltk installed, so maybe that was the issue.

@franquattri

Hi,
does anybody know of multilingual framenets (apart from the English FrameNet) that can be searched with NLTK?

@bryant1410

This is already done, isn't it?

@stevenbird
Member Author

Thanks @bryant1410. Yes, this is resolved.

@nicoleljc1227

I downloaded cow from http://globalwordnet.org/wordnets-in-the-world/ to process Chinese. How can I use cow in Python?
For example, after from nltk.corpus import wordnet as wn, how can I use cow?

@fcbond
Contributor

fcbond commented Apr 13, 2017 via email

@tvrbanec

Can we use wn.synsets('dog')[0].lemmas(lang='jpn') with more than one language, i.e. wn.synsets('dog')[0].lemmas(lang='jpn, ita')?
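For reference: lang takes a single ISO 639-3 code, so one way to cover several languages is to loop over them (a sketch):

from nltk.corpus import wordnet as wn

synset = wn.synsets('dog')[0]
for lang in ('jpn', 'ita'):
    # print the lemmas of the same synset in each language
    print(lang, synset.lemmas(lang=lang))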
