
Import the original Wordnets #2

Open · moreymat opened this issue Mar 24, 2014 · 18 comments
@moreymat (Owner)

Each original wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is valuable information that we want to import into the graph database.

As each wordnet is distributed in its own format, we need one import function per wordnet.
The OMW team had the same need: they provide one script per wordnet that retrieves the aligned data from the original files.

The idea is to transform each OMW import script into a function, expand each function to import more information (including structure), and wrap all the functions in a module.
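
A rough sketch of what that module could look like (all names here are hypothetical, nothing below exists in the repo yet):

# wordnets.py -- sketch: one import function per original wordnet.
# The real functions would be adapted from the per-language scripts
# published on the OMW page.

def import_wolf(path):
    """Parse the WOLF distribution; yield (synset, relation, value) triples."""
    raise NotImplementedError  # to be adapted from the OMW fre2tab script

def import_thai(path):
    raise NotImplementedError  # to be adapted from the OMW Thai script

# One entry per supported wordnet, keyed by ISO 639-3 language code.
IMPORTERS = {"fra": import_wolf, "tha": import_thai}

def import_wordnet(lang, path):
    """Dispatch to the language-specific importer."""
    return IMPORTERS[lang](path)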

@moreymat (Owner, Author)

A first step would be to try and do this on the WOLF and see how it goes.

@rhin0cer0s (Collaborator)

So do we have to build another form of .tab file, which would likely look like this?

ID-TYPE \t LEMMA \t word \t synonym#synonym#... \t hyponym#hyponym#... \t etc.

And our parser should be able to read it.

Or do we do it in two steps (see the sketch below):

  • parse and inject the actual dictionaries
  • parse and build relations between words already in the db
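
A minimal sketch of a reader for the extended .tab line proposed above (the column layout is just the one suggested in this comment, not an agreed format):

# Sketch: parse one line of the proposed extended .tab format:
# ID-TYPE \t LEMMA \t word \t synonym#... \t hyponym#... \t ...
def parse_extended_tab_line(line):
    fields = line.rstrip("\n").split("\t")
    entry = {"id": fields[0], "lemma": fields[1], "word": fields[2]}
    # remaining columns are '#'-separated relation lists
    for name, col in (("synonyms", 3), ("hyponyms", 4)):
        entry[name] = fields[col].split("#") if len(fields) > col and fields[col] else []
    return entry

# Example:
# parse_extended_tab_line(u"00002098-a\tcapable\tcapable\table#fit\t")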

@moreymat (Owner, Author)

I am sorry, I was not very clear about this issue in the first place.

TL;DR: this issue belongs to a future milestone.

Long version:
We can build multilingual graphs in three different ways:

  1. get the nodes from OMW (tab files) and the edges from the Princeton Wordnet (via nltk),
  2. get the nodes from OMW and the edges from the original Wordnets,
  3. get the nodes and structure from the original Wordnets.

Solution (1) is our top priority at the moment: it should be quite cheap and enables us to look into data quickly.

This issue is about solutions (2) and (3), which are the next steps.
These solutions are more costly but they produce a much richer graph than solution (1).
The extra cost comes from having to parse the original Wordnet files.
The scripts provided on the OMW page for each language already do that: they parse the original files, extract the lemmas, do some cleaning to ensure compatibility and output the result to .tab files.
The idea for (2) and (3) is to expand these scripts to retrieve more than just the lemmas (e.g. relations), do some cleaning, and output the result to the db.
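
For reference, a minimal reader for the .tab files these scripts output; this assumes the common OMW layout (a '#' metadata header line, then tab-separated entries with the synset id first and the lemma in the last column), which should be checked against an actual file:

# Sketch: load (synset, lemma) pairs from an OMW .tab file.
import codecs

def read_tab(path):
    pairs = []
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            if line.startswith("#"):  # skip the metadata header
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[-1]))
    return pairs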

I will set up milestones to make the roadmap clearer :)

moreymat added this to the 0.3 milestone Mar 28, 2014
@fcbond commented Mar 29, 2014

G'day,

For (2) and (3): in the OMW we are strongly encouraging wordnet projects to output wordnet-LMF, so we can then have just one parser to read them all. Also, for some of the files I had to do some hand-cleaning, as the original files were not easy to parse.

In practice there will still be some issues:
(i) dormant projects (like Hebrew and Albanian)
--- we can make the LMF for them (I already do)
(ii) there is more than one schema in use for wordnet-LMF
--- I hope we can encourage standardization by showing the benefit
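
To make the "one parser" idea concrete, here is a minimal ElementTree sketch of reading lemma/synset pairs from a Wordnet-LMF file; the element and attribute names (LexicalEntry, Lemma/writtenForm, Sense/synset) follow the Wordnet-LMF schema, but the schema variants mentioned above may differ in the details:

# Sketch: one generic Wordnet-LMF reader.
import xml.etree.ElementTree as ET

def lmf_pairs(path):
    """Yield (synset_id, written_form) pairs from a Wordnet-LMF file."""
    tree = ET.parse(path)
    for entry in tree.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        form = lemma.get("writtenForm")
        for sense in entry.iter("Sense"):
            yield sense.get("synset"), form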

Francis

rhin0cer0s modified the milestones: 0.1, 0.3 Apr 4, 2014
@moreymat (Owner, Author) commented Apr 9, 2014

G'day @fcbond

We could produce wordnet-LMF as part of the conversion+import process, if:

  1. using, adapting or writing conversion scripts is not too much work,
  2. the wordnet-LMF files can be efficiently processed for batch insertion into the database.

Could you provide conversion scripts to test this approach on one or two wordnets?
I would gladly accept any pull request :-)

Mathieu

@fcbond commented Apr 13, 2014

G'day,
> We could produce wordnet-LMF as part of the conversion+import process, if:
>
>   1. using, adapting or writing conversion scripts is not too much work,

I attach the script I currently use to output LMF :-).

>   2. the wordnet-LMF files can be efficiently processed for batch insertion
>      into the database.
>
> Could you provide conversion scripts to test this approach on one or two wordnets?
> I would gladly accept any pull request :-)

The Thai input is currently done from LMF, although not very generally:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
#
# Extract synset-word pairs from the Thai Wordnet (LMF).

import codecs
import re

wnname = "Thai"
wnlang = "tha"
wnurl = "http://th.asianwordnet.org/"
wnlicense = "wordnet"

### header
outfile = "wn-data-%s.tab" % wnlang
o = codecs.open(outfile, "w", "utf-8")
o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))

### Data is in the file tha-wn-1.0-lmf.xml;
### exploit the fact that the synset id is the same as the wn3.0 offset.
f = codecs.open("tha-wn-1.0-lmf.xml", "r", "utf-8")

synset = str()
lemma = str()
for l in f:
    # remember the written form of the current lexical entry
    m = re.search(r'<Lemma writtenForm="([^"]*)" part', l)
    if m:
        lemma = m.group(1).strip()
    # emit one (synset, lemma) pair for each sense line
    m = re.search(r'synset="tha-07-([^"]*)"', l)
    if m:
        synset = m.group(1)
        o.write("%s\t%s\n" % (synset, lemma))
        ## print "%s\t%s\n" % (synset, lemma)
f.close()
o.close()


Sorry to mail it, I am not so used to git yet :-)

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day,

Thanks for the information. There is no attached file though :-)

The students have started to include relations from WOLF, and have thus extended fre2tab.
Is it okay to have this extended version of your script in our github project? If so, how do you want your authorship to be acknowledged?
Options include adding you to the list of authors of each extended script, adding you to the global list of contributors to the project, explicit mentions of OMW as the basis, any combination or variant of these, or anything else you see fit.

Have a nice Sunday,

Mathieu

@fcbond commented Apr 13, 2014

G'day,

> Is it okay to have this extended version of your script in our github
> project? If so, how do you want your authorship to be acknowledged?

Please (i) add me to the global list of contributors to the project. You
already link to the OMW page in the Readme, which is enough.

> Have a nice Sunday,

You too.


Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day Francis,

Adrien and Christophe noticed that you were now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU.

Could you tell us how the LMF, lemon and tab files you provide compare content-wise?
From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

Is one of these formats (we are thinking LMF) complete and mature enough that we can use your files as our only source of information to build the whole graph?
FWIW, we could then even stop depending on NLTK.

Mathieu

moreymat mentioned this issue Apr 18, 2014
@fcbond commented Apr 18, 2014

G'day,

> Could you tell us how the LMF, lemon and tab files you provide compare content-wise?
> From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

That's right. LEMON is just the TAB files in very verbose XML :-). The assumption is that the ontology (wordnet) is separate.
LMF should be complete (although I don't guarantee it).

> Is one of these formats (we are thinking LMF) complete and mature enough that we can use your files as our only source of information to build the whole graph?

In theory LMF should be; in practice I generally add information to the database first, and then to the LMF.

> FWIW, we could then even stop depending on NLTK.

I think it is worth trying with LMF, which we hope to be the format of the future.

Wordnet-LMF (and LEMON) don't have anywhere to record frequency counts (the idea is that they come from a corpus), although in practice they are useful :-).

Wait just a little though, as I seem to have lost the English and Japanese definitions in my move to svn (although we gained Greek).

Yours,

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

@fcbond thanks a lot. It seems we can give it a try.

@zorgulle @rhin0cer0s could you provide a rough estimate of how much work it would be to use the LMF XML files instead?
If it is reasonable, we might try and do this before 0.1.

@fcbond commented Apr 19, 2014

G'day,

I have (finally) restored the English definitions and examples, so it should be good to go. There are also definitions for Albanian, Greek and Japanese :-).


Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day Francis,

This is great, thank you!
I just noticed the OMW-LMF files provide exactly the information we wanted for milestone 0.1: aligned lemmas + relations from Princeton Wordnet.
We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)
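
(For the relations part, a matching sketch to pull synset-to-synset edges out of the same LMF files; this assumes Synset elements with an 'id' attribute nesting SynsetRelation elements with 'target' and 'relType' attributes, which should be verified against the distributed OMW-LMF files:)

# Sketch: extract synset relations from a Wordnet-LMF file.
import xml.etree.ElementTree as ET

def lmf_relations(path):
    """Yield (source_synset, rel_type, target_synset) triples."""
    tree = ET.parse(path)
    for synset in tree.iter("Synset"):
        source = synset.get("id")
        for rel in synset.iter("SynsetRelation"):
            yield source, rel.get("relType"), rel.get("target")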

@fcbond commented Apr 20, 2014

G'day,

> We will still have to scrape the original wordnets to retrieve their own
> structures for milestone 0.3, unless you plan to do that as well? :-)

Not in the very near future. The next priority for me is adding confidence
scores (manually verified or not) and corpus frequencies.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@rhin0cer0s (Collaborator)

Hi @fcbond, and thank you for your involvement!

@moreymat
We built a small parser over the weekend (we still have to fix a few things before pushing it), so LMF 'support' is nearly done.

@moreymat (Owner, Author)

Splendid! It would be great if we could release 0.1 this week.

@zorgulle (Collaborator)

Hello,
The LMF parser works: we can import words and relations. We tested it with English and French. We still have the index key length issue; we are working on it and it should be solved soon.

@moreymat (Owner, Author)

OK great, I am looking forward to it.
