Import the original Wordnets #2
A first step would be to try and do this on the WOLF and see how it goes.
So do we have to build another form of .tab file, which would likely go like this: `ID-TYPE \t LEMMA \t word \t synonym#synonym#... \t hyponym#hyponym#... \t etc.`, and our parser should be able to read it? Or do we do it in several steps:
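To make the proposal concrete, here is a minimal sketch of a parser for the extended .tab layout suggested above. The field order (ID-TYPE, lemma, word, `#`-separated synonyms, `#`-separated hyponyms) and the function name are assumptions taken from this comment, not an agreed format:

```python
# Hypothetical parser for the extended .tab format proposed above.
# Assumed columns: ID-TYPE, lemma, word, synonyms ('#'-separated),
# hyponyms ('#'-separated). Further relation columns would follow the
# same pattern.

def parse_extended_tab_line(line):
    """Parse one tab-separated record into a dict."""
    id_type, lemma, word, synonyms, hyponyms = line.rstrip("\n").split("\t")[:5]
    return {
        "id": id_type,
        "lemma": lemma,
        "word": word,
        "synonyms": synonyms.split("#") if synonyms else [],
        "hyponyms": hyponyms.split("#") if hyponyms else [],
    }

record = parse_extended_tab_line(
    "02084442-n\tdog\tchien\tcanis#toutou\t02085374-n#02085936-n"
)
print(record["synonyms"])  # -> ['canis', 'toutou']
```

The alternative (the multi-step conversion) would replace this single parser with one converter per relation type.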
I am sorry I was not very clear about this issue in the first place. TL;DR: this issue belongs to a future milestone. Long version:
Solution (1) is our top priority at the moment: it should be quite cheap and lets us look into the data quickly. This issue is about solutions (2) and (3), which are the next steps. I will set up milestones to make the roadmap clearer :)
G'day, for (2) and (3): in the OMW we are strongly encouraging wordnet projects to output Wordnet-LMF, so we can then have just one parser to input them all. Also, for some of the files I had to do some hand-cleaning, as the original file was not easy to parse. In practice there will still be some issues. Francis
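The "one parser for Wordnet-LMF" idea can be sketched as follows. The element and attribute names (`LexicalEntry`, `Lemma`/`writtenForm`, `Sense`/`synset`) follow common Wordnet-LMF usage, but the exact schema of any given wordnet's file is an assumption here:

```python
# Minimal sketch of a single Wordnet-LMF reader. The tag and attribute
# names below are assumptions based on typical Wordnet-LMF files, not a
# guarantee about any particular wordnet's distribution.
import xml.etree.ElementTree as ET

SAMPLE_LMF = """<LexicalResource>
  <Lexicon language="tha">
    <LexicalEntry id="w1">
      <Lemma writtenForm="สุนัข" partOfSpeech="n"/>
      <Sense id="w1_s1" synset="tha-02084442-n"/>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>"""

def synset_word_pairs(xml_text):
    """Yield (synset id, written form) pairs from a Wordnet-LMF document."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("LexicalEntry"):
        form = entry.find("Lemma").get("writtenForm")
        for sense in entry.iter("Sense"):
            yield sense.get("synset"), form

print(list(synset_word_pairs(SAMPLE_LMF)))
```

With every project emitting LMF, this one function (extended with relations, definitions, etc.) would replace the per-wordnet scrapers.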
G'day @fcbond We could produce Wordnet-LMF as part of the conversion+import process, if:
Could you provide conversion scripts so we can test this approach on one or two wordnets?
Mathieu
G'day,
I attach the script I currently use to output LMF :-).
The Thai input is currently done from LMF, although not very generally:

```python
#!/usr/share/python
# -*- encoding: utf-8 -*-
# Extract synset-word pairs from the Persian Wordnet
import sys

wnname = "Thai"
# header
outfile = "wn-data-%s.tab" % wnlang
o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))
# Data is in the file tha-wn-1.0-lmf.xml
# exploit the fact that the synset is the same as wn3.0 offset
f = codecs.open("tha-wn-1.0-lmf.xml", "r", "utf-8")
sysnset = str()
```
Sorry to mail it, I am not so used to git yet :-)
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
G'day,
Thanks for the information. There is no attached file though :-)
The students have started to include relations from the WOLF, thus included and
Have a nice Sunday,
Mathieu
G'day,
Please (i) add me to the global list of contributors to the project. You
You too.
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
G'day Francis,
Adrien and Christophe noticed that you are now distributing XML files (in the LMF and lemon formats) on the OMW website at NTU. Could you tell us how the LMF, lemon and tab files you provide compare content-wise? Is one of these formats (we're thinking LMF) complete and mature enough that we can use your files as our only source of information to build the whole graph?
Mathieu
G'day,
> Is one of these formats (thinking LMF) complete and mature enough so that
In theory LMF should be, in practice I generally add information to the
I think it is worth trying with LMF, which we hope to be the format of the
Wordnet-LMF (and LEMON) don't have anywhere to record frequency counts (the
Wait just a little though, as I seem to have lost the English and Japanese
Yours,
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
@fcbond thanks a lot. It seems we can give it a try. @zorgulle @rhin0cer0s could you provide a rough estimate of how much work it would take to use the LMF XML files instead?
G'day,
I have (finally) restored the English definitions and examples, so it should
On Fri, Apr 18, 2014 at 6:59 PM, Mathieu Morey notifications@github.com wrote:
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
G'day Francis,
This is great, thank you!
G'day,
Not in the very near future. The next priority for me is adding confidence
Francis Bond
http://www3.ntu.edu.sg/home/fcbond/
Splendid! It would be great if we could release 0.1 this week.
Hello,
OK great, I am looking forward to this.
Each original Wordnet, for example the WOLF (Wordnet Libre du Français), contains its own language-specific structure.
This structure is very valuable information that we want to import into the graph database.
As each Wordnet is distributed in its own format, we need one import function per Wordnet.
The OMW team had the same need.
They provide one script per Wordnet that retrieves the aligned data from the original files.
The idea is to transform each of the OMW import scripts into a function, expand each function to import more information (including structure), and wrap all the functions in a module.
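A minimal sketch of that module layout, assuming one importer per wordnet registered under a common interface (the function names, registry keys, and return shape are illustrative assumptions, not the project's actual API):

```python
# Sketch of the proposed module: one import function per wordnet,
# wrapped behind a single dispatch entry point. Names are hypothetical.

def import_wolf(path):
    """Import the WOLF (Wordnet Libre du Français) from its native format."""
    # ... parse the WOLF-specific file and return synsets + relations ...
    return {"wordnet": "wolf", "path": path}

def import_thai(path):
    """Import the Thai wordnet from its LMF distribution."""
    # ... parse tha-wn-1.0-lmf.xml ...
    return {"wordnet": "thai", "path": path}

# Registry wrapping all per-wordnet importers in one module-level map.
IMPORTERS = {
    "wolf": import_wolf,
    "tha": import_thai,
}

def import_wordnet(name, path):
    """Dispatch to the right importer for a given wordnet."""
    return IMPORTERS[name](path)

print(import_wordnet("wolf", "wolf-1.0.xml")["wordnet"])  # -> wolf
```

Each OMW script, once turned into such a function, could be expanded independently to pull in its wordnet's language-specific structure.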