
Import the original Wordnets #2

Open · moreymat opened this issue Mar 24, 2014 · 18 comments
@moreymat (Owner)

Each original wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is valuable information that we want to import into the graph database.

As each wordnet is distributed in its own format, we need one import function per wordnet.
The OMW team had the same need: they provide one script per wordnet that retrieves the aligned data from the original files.

The idea is to transform each OMW import script into a function, expand each function to import more information (including structure), and wrap all the functions in a module.
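
A rough sketch of what that module could look like (all names here are hypothetical, nothing below exists in the repo yet):

# wordnets.py -- sketch: one import function per original wordnet.
# The real functions would be adapted from the per-language scripts
# published on the OMW page.

def import_wolf(path):
    """Parse the WOLF distribution; yield (synset, relation, value) triples."""
    raise NotImplementedError  # to be adapted from the OMW fre2tab script

def import_thai(path):
    raise NotImplementedError  # to be adapted from the OMW Thai script

# One entry per supported wordnet, keyed by ISO 639-3 language code.
IMPORTERS = {"fra": import_wolf, "tha": import_thai}

def import_wordnet(lang, path):
    """Dispatch to the language-specific importer."""
    return IMPORTERS[lang](path)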

@moreymat (Owner, Author)

A first step would be to try and do this on the WOLF and see how it goes.

@rhin0cer0s (Collaborator)

So do we have to build another form of .tab file, which would likely look like this?

ID-TYPE \t LEMMA \t word \t synonym#synonym#... \t hyponym#hyponym#... \t etc.

And our parser should be able to read it.

Or do we do it in two steps (see the sketch below):

  • parse and inject the actual dictionaries
  • parse and build relations between words already in the db
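
A minimal sketch of a reader for the extended .tab line proposed above (the column layout is just the one suggested in this comment, not an agreed format):

# Sketch: parse one line of the proposed extended .tab format:
# ID-TYPE \t LEMMA \t word \t synonym#... \t hyponym#... \t ...
def parse_extended_tab_line(line):
    fields = line.rstrip("\n").split("\t")
    entry = {"id": fields[0], "lemma": fields[1], "word": fields[2]}
    # remaining columns are '#'-separated relation lists
    for name, col in (("synonyms", 3), ("hyponyms", 4)):
        entry[name] = fields[col].split("#") if len(fields) > col and fields[col] else []
    return entry

# Example:
# parse_extended_tab_line(u"00002098-a\tcapable\tcapable\table#fit\t")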

@moreymat (Owner, Author)

I am sorry, I was not very clear about this issue in the first place.

TL;DR: this issue belongs to a future milestone.

Long version:
We can build multilingual graphs in three different ways:

  1. get the nodes from OMW (tab files) and the edges from the Princeton Wordnet (via nltk),
  2. get the nodes from OMW and the edges from the original Wordnets,
  3. get the nodes and structure from the original Wordnets.

Solution (1) is our top priority at the moment: it should be quite cheap and enables us to look into data quickly.

This issue is about solutions (2) and (3), which are the next steps.
These solutions are more costly but they produce a much richer graph than solution (1).
The extra cost comes from having to parse the original Wordnet files.
The scripts provided on the OMW page for each language already do that: they parse the original files, extract the lemmas, do some cleaning to ensure compatibility and output the result to .tab files.
The idea for (2) and (3) is to expand these scripts to retrieve more than just the lemmas (e.g. relations), do some cleaning, and output the result to the db.
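
For reference, a minimal reader for the .tab files these scripts output; this assumes the common OMW layout (a '#' metadata header line, then tab-separated entries with the synset id first and the lemma in the last column), which should be checked against an actual file:

# Sketch: load (synset, lemma) pairs from an OMW .tab file.
import codecs

def read_tab(path):
    pairs = []
    with codecs.open(path, "r", "utf-8") as f:
        for line in f:
            if line.startswith("#"):  # skip the metadata header
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                pairs.append((fields[0], fields[-1]))
    return pairs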

I will set up milestones to make the roadmap clearer :)

moreymat added this to the 0.3 milestone Mar 28, 2014
@fcbond commented Mar 29, 2014

G'day,

For (2) and (3): in the OMW we are strongly encouraging wordnet projects to output wordnet-LMF, so we can then have just one parser to read them all. Also, for some of the files I had to do some hand-cleaning, as the original files were not easy to parse.

In practice there will still be some issues:
(i) dormant projects (like Hebrew and Albanian)
--- we can make the LMF for them (I already do)
(ii) there is more than one schema in use for wordnet-LMF
--- I hope we can encourage standardization by showing the benefit
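
To make the "one parser" idea concrete, here is a minimal ElementTree sketch of reading lemma/synset pairs from a Wordnet-LMF file; the element and attribute names (LexicalEntry, Lemma/writtenForm, Sense/synset) follow the Wordnet-LMF schema, but the schema variants mentioned above may differ in the details:

# Sketch: one generic Wordnet-LMF reader.
import xml.etree.ElementTree as ET

def lmf_pairs(path):
    """Yield (synset_id, written_form) pairs from a Wordnet-LMF file."""
    tree = ET.parse(path)
    for entry in tree.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        form = lemma.get("writtenForm")
        for sense in entry.iter("Sense"):
            yield sense.get("synset"), form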

Francis

rhin0cer0s modified the milestones: 0.1, 0.3 Apr 4, 2014
@moreymat (Owner, Author) commented Apr 9, 2014

G'day @fcbond

We could produce wordnet-LMF as part of the conversion+import process, if:

  1. using, adapting or writing conversion scripts is not too much work,
  2. the wordnet-LMF files can be efficiently processed for batch insertion into the database.

Could you provide conversion scripts to test this approach on one or two wordnets?
I would gladly accept any pull request :-)

Mathieu

@fcbond commented Apr 13, 2014

G'day,
> We could produce wordnet-LMF as part of the conversion+import process, if:
>
>   1. using, adapting or writing conversion scripts is not too much work,

I attach the script I currently use to output LMF :-).

>   2. the wordnet-LMF files can be efficiently processed for batch insertion
>      into the database.
>
> Could you provide conversion scripts to test this approach on one or two wordnets?
> I would gladly accept any pull request :-)

The Thai input is currently done from LMF, although not very generally:

#!/usr/bin/python
# -*- encoding: utf-8 -*-
#
# Extract synset-word pairs from the Thai Wordnet (LMF).

import codecs
import re

wnname = "Thai"
wnlang = "tha"
wnurl = "http://th.asianwordnet.org/"
wnlicense = "wordnet"

### header
outfile = "wn-data-%s.tab" % wnlang
o = codecs.open(outfile, "w", "utf-8")
o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))

### Data is in the file tha-wn-1.0-lmf.xml;
### exploit the fact that the synset id is the same as the wn3.0 offset.
f = codecs.open("tha-wn-1.0-lmf.xml", "r", "utf-8")

synset = str()
lemma = str()
for l in f:
    # remember the written form of the current lexical entry
    m = re.search(r'<Lemma writtenForm="([^"]*)" part', l)
    if m:
        lemma = m.group(1).strip()
    # emit one (synset, lemma) pair for each sense line
    m = re.search(r'synset="tha-07-([^"]*)"', l)
    if m:
        synset = m.group(1)
        o.write("%s\t%s\n" % (synset, lemma))
        ## print "%s\t%s\n" % (synset, lemma)
f.close()
o.close()


Sorry to mail it, I am not so used to git yet :-)

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day,

Thanks for the information. There is no attached file though :-)

The students have started to include relations from WOLF, and have thus extended fre2tab.
Is it okay to have this extended version of your script in our github project? If so, how do you want your authorship to be acknowledged?
Options include adding you to the list of authors of each extended script, adding you to the global list of contributors to the project, explicit mentions of OMW as the basis, any combination or variant of these, or anything else you see fit.

Have a nice Sunday,

Mathieu

@fcbond commented Apr 13, 2014

G'day,

> Is it okay to have this extended version of your script in our github
> project? If so, how do you want your authorship to be acknowledged?

Please (i) add me to the global list of contributors to the project. You
already link to the OMW page in the Readme, which is enough.

> Have a nice Sunday,

You too.


Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day Francis,

Adrien and Christophe noticed that you were now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU.

Could you tell us how the LMF, lemon and tab files you provide compare content-wise?
From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

Is one of these formats (we are thinking LMF) complete and mature enough that we can use your files as our only source of information to build the whole graph?
FWIW, we could then even stop depending on NLTK.

Mathieu

moreymat mentioned this issue Apr 18, 2014
@fcbond commented Apr 18, 2014

G'day,

> Could you tell us how the LMF, lemon and tab files you provide compare content-wise?
> From a quick look at the English files, the LMF file contains relations between synsets, whereas the lemon file does not.

That's right. LEMON is just the TAB files in very verbose XML :-). The assumption is that the ontology (wordnet) is separate.
LMF should be complete (although I don't guarantee it).

> Is one of these formats (we are thinking LMF) complete and mature enough that we can use your files as our only source of information to build the whole graph?

In theory LMF should be; in practice I generally add information to the database first, and then to the LMF.

> FWIW, we could then even stop depending on NLTK.

I think it is worth trying with LMF, which we hope to be the format of the future.

Wordnet-LMF (and LEMON) don't have anywhere to record frequency counts (the idea is that they come from a corpus), although in practice they are useful :-).

Wait just a little though, as I seem to have lost the English and Japanese definitions in my move to svn (although we gained Greek).

Yours,

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

@fcbond thanks a lot. It seems we can give it a try.

@zorgulle @rhin0cer0s could you provide a rough estimate of how much work it would be to use the LMF XML files instead?
If it is reasonable, we might try and do this before 0.1.

@fcbond commented Apr 19, 2014

G'day,

I have (finally) restored the English definitions and examples, so it should be good to go. There are also definitions for Albanian, Greek and Japanese :-).


Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@moreymat (Owner, Author)

G'day Francis,

This is great, thank you!
I just noticed the OMW-LMF files provide exactly the information we wanted for milestone 0.1: aligned lemmas + relations from Princeton Wordnet.
We will still have to scrape the original wordnets to retrieve their own structures for milestone 0.3, unless you plan to do that as well? :-)
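
(For the relations part, a matching sketch to pull synset-to-synset edges out of the same LMF files; this assumes Synset elements with an 'id' attribute nesting SynsetRelation elements with 'target' and 'relType' attributes, which should be verified against the distributed OMW-LMF files:)

# Sketch: extract synset relations from a Wordnet-LMF file.
import xml.etree.ElementTree as ET

def lmf_relations(path):
    """Yield (source_synset, rel_type, target_synset) triples."""
    tree = ET.parse(path)
    for synset in tree.iter("Synset"):
        source = synset.get("id")
        for rel in synset.iter("SynsetRelation"):
            yield source, rel.get("relType"), rel.get("target")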

@fcbond commented Apr 20, 2014

G'day,

> We will still have to scrape the original wordnets to retrieve their own
> structures for milestone 0.3, unless you plan to do that as well? :-)

Not in the very near future. The next priority for me is adding confidence
scores (manually verified or not) and corpus frequencies.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/
Division of Linguistics and Multilingual Studies
Nanyang Technological University

@rhin0cer0s (Collaborator)

Hi @fcbond, and thank you for your involvement!

@moreymat
We built a small parser over the weekend (we still have to fix a few things before pushing it), so LMF 'support' is nearly done.

@moreymat (Owner, Author)

Splendid! It would be great if we could release 0.1 this week.

@zorgulle (Collaborator)

Hello,
The LMF parser works: we can import words and relations. We tested it with English and French. We still have the index key length issue; we are working on it and it should be solved soon.

@moreymat (Owner, Author)

OK great, I am looking forward to it.
