ItalianLexicalResources
This page provides information about the resources available for Italian language processing.
Currently the EXCITEMENT platform provides functionality that uses the Italian version of WordNet, available by request from FBK. Unlike previous releases, the current version is compatible with the Princeton WordNet's Java API. Using the resource is therefore similar to using the Princeton WordNet: download the resource to a local directory, and provide the access path in the configuration file. In EditDistanceEDA's configuration file, this looks as follows:
<subsection name="wordnet">
<!-- path of the WordNet files -->
<property name="path">/tmp/wnita</property>
</subsection>
In the configuration file you must indicate that the platform should use the "wordnet" section you just defined, by adding the following line to the components section:
<property name="instances">wordnet</property>
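As a quick sanity check before running the platform, the two settings above can be read back from a configuration fragment. The following Python sketch is only an illustration and not part of the EOP: the enclosing `<section>` element, its name `example-component`, the helper `wordnet_path` and the comma-separated reading of `instances` are assumptions made for this example; only the `<subsection>` and `<property>` lines come from this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper: the <section> element and its name are assumptions;
# the subsection and property lines are the ones shown on this page.
CONFIG_FRAGMENT = """
<section name="example-component">
    <property name="instances">wordnet</property>
    <subsection name="wordnet">
        <!-- path of the WordNet files -->
        <property name="path">/tmp/wnita</property>
    </subsection>
</section>
"""


def wordnet_path(xml_text):
    """Return the configured WordNet path, or None if the 'wordnet'
    instance is not enabled or has no path property."""
    root = ET.fromstring(xml_text)

    # The component must list "wordnet" among its instances
    # (treating the value as comma-separated is an assumption).
    instances = root.find("./property[@name='instances']")
    names = []
    if instances is not None and instances.text:
        names = [name.strip() for name in instances.text.split(",")]
    if "wordnet" not in names:
        return None

    # The matching subsection must define the path to the WordNet files.
    prop = root.find("./subsection[@name='wordnet']/property[@name='path']")
    if prop is None or not prop.text:
        return None
    return prop.text.strip()


if __name__ == "__main__":
    print(wordnet_path(CONFIG_FRAGMENT))
```

A returned `None` would point to a missing `instances` line or a missing `path` property, the two things this section asks you to configure.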
Further information about configuration files can be found here.
The EOP package corresponding to using this resource is:
eu.excitementproject.eop.core.component.lexicalknowledge.wordnet
Entailment rules for Italian nouns were extracted from a corpus of Italian Wikipedia articles using the Wikipedia lexical miner, as described in Shnarch et al., 2009. The rule extraction relies on linking nouns based on various indicators -- redirects, hyperlinks, categories, parentheses in the title, inference from term definition, category network, article text (in particular the first sentence, considered as a definition). The rules are scored based on the indicators used to identify them.
Currently, this resource is distributed as an archived MySQL database. It contains approximately 7 million rules. To use it, download the archive. After unpacking the resource, install it in MySQL by running the command:
> mysql -u username -p < resource
Then add a section to the configuration file that describes how to access the resource. The name of the database is "wikilexresita". Apart from the database name, the values of dbconnection (host and port), dbuser and dbpasswd should reflect your installation of MySQL:
<subsection name="wikipedia">
<!-- connection to the Wikipedia data base -->
<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikilexresita</property>
<property name="dbuser">username</property>
<property name="dbpasswd">password</property>
</subsection>
To make the platform use this resource, insert the following line in the chosen component of the configuration file:
<property name="instances">wikipedia</property>
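Since the `dbconnection` value combines host, port and database name in one string, it can help to split it apart and confirm that each piece matches your MySQL installation. The Python sketch below is a hypothetical helper (`parse_mysql_jdbc_url` is not an EOP function) that handles only the simple `jdbc:mysql://host:port/database` form shown above; falling back to port 3306, MySQL's default, when no port is given is a choice made for this example.

```python
from urllib.parse import urlparse


def parse_mysql_jdbc_url(jdbc_url):
    """Split a simple jdbc:mysql://host:port/database URL into its parts."""
    prefix = "jdbc:"
    if not jdbc_url.startswith(prefix):
        raise ValueError("not a JDBC URL: " + jdbc_url)

    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlparse(jdbc_url[len(prefix):])
    if parts.scheme != "mysql":
        raise ValueError("not a MySQL JDBC URL: " + jdbc_url)

    return {
        "host": parts.hostname,
        "port": parts.port or 3306,  # MySQL's default port if none is given
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # The URL from the subsection above; replace the host with your own.
    print(parse_mysql_jdbc_url("jdbc:mysql://nathrezim:3306/wikilexresita"))
```

The `database` field is the one part that should stay the same across installations; the host and port are the parts to adapt.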
Further information about configuration files can be found here.
The EOP package corresponding to using this resource is:
eu.excitementproject.eop.core.component.lexicalknowledge.wikipedia
The resources described here were generated with the distsim package. They provide similarity scores between words computed based on the words' distributional representation built from a parsed corpus. Read more about the resource generation process.
The resources described in this section were built from a corpus of Italian Wikipedia pages, parsed with TextPro. At the time of writing this documentation, the resources described below were built from 350M out of the 1G corpus of Italian Wikipedia pages of 2013/09/08.
The resources are available for download from the Artifactory repository -- Italian Redis DBs. The archive contains 7 Redis database files, which contain rules as directed similarity scores between pairs of words:
Model | File | Nr. of rules | File size |
---|---|---|---|
BAP | similarity-l2r.rdb | 443374 | 13M |
BAP | similarity-r2l.rdb | 443121 | 13M |
DIRT | similarity-l2r.rdb | 1700 | 4k |
LIN proximity | similarity-l2r.rdb | 454964 | 13M |
LIN proximity | similarity-r2l.rdb | 454964 | 13M |
LIN dependency | similarity-l2r.rdb | 443374 | 13M |
LIN dependency | similarity-r2l.rdb | 443374 | 13M |
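To illustrate what "directed" means here, the following Python sketch uses a small in-memory dictionary in place of the Redis files. It makes no claim about how the `.rdb` files are actually laid out: the word pairs, the scores and the helper names are all invented for this example. The only point is that the score for a pair (left, right) is stored independently of the score for (right, left).

```python
# Invented toy rules: (left word, right word) -> similarity score.
# Neither the word pairs nor the scores come from the actual Redis files.
TOY_RULES = {
    ("cane", "animale"): 0.40,
    ("animale", "cane"): 0.10,
    ("gatto", "animale"): 0.35,
}


def directed_score(rules, left, right):
    """Score of the directed rule left -> right, or 0.0 if there is none."""
    return rules.get((left, right), 0.0)


def right_hand_sides(rules, left):
    """All rules starting from `left`, best score first."""
    matches = [(r, score) for (l, r), score in rules.items() if l == left]
    return sorted(matches, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # The two directions of the same word pair can have different scores.
    print(directed_score(TOY_RULES, "cane", "animale"))
    print(directed_score(TOY_RULES, "animale", "cane"))
    print(right_hand_sides(TOY_RULES, "cane"))
```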
The Italian paraphrase table is similar to a translation phrase table, except that the source and target languages are the same. It was obtained from a bilingual English-Italian translation table by translating back and forth (so to speak) and applying filters to remove as much noise as possible. Here is a sample from the table, which also contains paraphrase probabilities and word-level alignments:
phrase 1 | phrase 2 | probabilities | alignment |
---|---|---|---|
Oriente | dell' est | 8.3403675825e-05 3.26660488188e-05 | 0-0 0-1 (2) |
Oriente | orientale | 0.0289970464107 0.1030637784109 | 0-0 (3) |
Oriente | di Levante | 0.00066489375 2.6212694103e-05 | 0-1 (1) |
Oriente | Est | 0.0717464906149 0.03115844430799 | 0-0 (3) |
merito all' evoluzione | sull' andamento | 6.6540063843e-08 2.2462629514e-06 | 0-0 1-0 2-1 (1) |
merito all' evoluzione | sui progressi realizzati | 3.4636629312e-07 6.2507666868e-07 | 0-0 1-0 2-1 2-2 (1) |
The Italian paraphrase table is currently used by the Alignment EDA (P1EDA).
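The sample rows above can be split into their four fields with a few lines of Python. This is a minimal sketch that assumes the pipe-separated layout shown on this page, which may differ from the format of the distributed file. The `ParaphraseEntry` type and `parse_row` function are hypothetical names. The alignment pairs appear to be source-target token indices, and the trailing number in parentheses is kept as a plain integer because its meaning is not stated here.

```python
import re
from typing import NamedTuple


class ParaphraseEntry(NamedTuple):
    source: str            # phrase 1
    target: str            # phrase 2
    probabilities: tuple   # the two paraphrase probabilities
    alignment: list        # (source index, target index) pairs
    trailing_count: int    # number in parentheses; meaning not documented here


# First sample row from the table above.
ROW = "Oriente | dell' est | 8.3403675825e-05 3.26660488188e-05 | 0-0 0-1 (2) |"


def parse_row(row):
    """Parse one pipe-separated row in the layout shown on this page."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    if len(cells) != 4:
        raise ValueError("expected 4 fields, got %d" % len(cells))
    source, target, probs, align = cells

    probabilities = tuple(float(p) for p in probs.split())

    # Separate the "i-j" pairs from the trailing "(n)".
    match = re.fullmatch(r"(.*?)\s*\((\d+)\)", align)
    if match is None:
        raise ValueError("unexpected alignment field: " + align)
    pairs = [tuple(int(i) for i in pair.split("-"))
             for pair in match.group(1).split()]

    return ParaphraseEntry(source, target, probabilities, pairs,
                           int(match.group(2)))


if __name__ == "__main__":
    print(parse_row(ROW))
```

For the first row, the parsed alignment links the single source token "Oriente" to both target tokens of "dell' est".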