
ItalianLexicalResources

vnastase edited this page Feb 11, 2015 · 2 revisions

This page provides information about the resources available for Italian language processing.


Italian WordNet

Currently the EXCITEMENT platform provides functionality that uses the Italian version of WordNet, available on request from FBK. Unlike previous releases, the version now provided is compatible with the Princeton WordNet Java API. Using the resource is therefore similar to using the Princeton WordNet: download the resource to a local directory and provide its path in the configuration file. In the EditDistanceEDA configuration file, this looks as follows:

<subsection name="wordnet">
	<!-- path of the WordNet files -->
	<property name="path">/tmp/wnita</property>
</subsection>

In the configuration file you must indicate that the platform should use the "wordnet" section you just defined, by adding the following line to the components section:

<property name="instances">wordnet</property>

Further information about configuration files can be found here.
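A wrong path is an easy mistake to make here, so it can help to check the value before starting the platform. The following is a minimal sketch in Python, not part of EOP: it reads a configuration fragment shaped like the one above and reports whether the directory it names exists. The function name `wordnet_path` and the inline fragment are hypothetical; a real configuration file contains further sections.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the "wordnet" subsection shown above.
CONFIG_FRAGMENT = """
<subsection name="wordnet">
    <!-- path of the WordNet files -->
    <property name="path">/tmp/wnita</property>
</subsection>
"""


def wordnet_path(xml_text):
    """Return the "path" property of the "wordnet" subsection, or None."""
    root = ET.fromstring(xml_text.strip())
    # iter() also visits the root element itself, so this works whether the
    # subsection is the root of the fragment or nested inside other sections.
    for sub in root.iter("subsection"):
        if sub.get("name") == "wordnet":
            for prop in sub.iter("property"):
                if prop.get("name") == "path":
                    return (prop.text or "").strip()
    return None


path = wordnet_path(CONFIG_FRAGMENT)
status = "exists" if path and os.path.isdir(path) else "is missing"
print(f"WordNet directory {path} {status}")
```

To check a real file, read its contents and pass them to `wordnet_path` in place of the inline fragment.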

The EOP package corresponding to using this resource is:

eu.excitementproject.eop.core.component.lexicalknowledge.wordnet

Inference rules from Italian Wikipedia

Entailment rules for Italian nouns were extracted from a corpus of Italian Wikipedia articles using the Wikipedia lexical miner, as described in Shnarch et al., 2009. The rule extraction links nouns based on various indicators: redirects, hyperlinks, categories, parentheses in the title, inference from the term definition, the category network, and the article text (in particular the first sentence, which is treated as a definition). The rules are scored based on the indicators used to identify them.

Currently, this resource is distributed as an archived MySQL database containing approximately 7 million rules. To use it, download the archive. After unpacking the resource, install it in MySQL by running the command:

> mysql -u username -p < resource

Then add a section to the configuration file that describes how to access the resource. The name of the database is "wikilexresita". Apart from the database name, the dbconnection, dbuser and dbpasswd values should reflect your MySQL installation:

<subsection name="wikipedia">
	<!-- connection to the Wikipedia data base -->
	<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikilexresita</property>
	<property name="dbuser">username</property>
	<property name="dbpasswd">password</property>
</subsection>

To tell the platform to use this resource, insert the following line in the chosen component of the configuration file:

<property name="instances">wikipedia</property>

Further information about configuration files can be found here.
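The dbconnection value packs the host, port and database name into one JDBC URL, and only the host and port normally need to change for a local installation. The sketch below, in Python and not part of EOP, splits such a URL into its parts so they can be compared with your MySQL setup. The function name `split_jdbc_mysql_url` is hypothetical, and the fallback to port 3306 (the MySQL default) when no port is given is an assumption of this sketch.

```python
from urllib.parse import urlparse


def split_jdbc_mysql_url(url):
    """Split a jdbc:mysql:// URL into host, port and database name."""
    prefix = "jdbc:"
    if not url.startswith(prefix + "mysql://"):
        raise ValueError("not a jdbc:mysql:// URL: " + url)
    # Dropping the "jdbc:" prefix leaves an ordinary mysql://host:port/db URL.
    parsed = urlparse(url[len(prefix):])
    return {
        "host": parsed.hostname,
        "port": parsed.port or 3306,  # assumption: fall back to MySQL's default port
        "database": parsed.path.lstrip("/"),
    }


parts = split_jdbc_mysql_url("jdbc:mysql://nathrezim:3306/wikilexresita")
print(parts)
```

For the URL in the example configuration, the database part should stay as it is, while the host and port should point to your own MySQL server.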

The EOP package corresponding to using this resource is:

eu.excitementproject.eop.core.component.lexicalknowledge.wikipedia

Similarity scores from distributional representations

The resources described here were generated with the distsim package. They provide similarity scores between words computed based on the words' distributional representation built from a parsed corpus. Read more about the resource generation process.

Corpus

The resources described in this section were built from a corpus of Italian Wikipedia pages dated 2013/09/08, parsed with TextPro. At the time of writing this documentation, 350M of the 1G corpus had been used to build them.

Resources

The resources are available for download from the Artifactory repository -- Italian Redis DBs. The archive contains 7 Redis database files, which contain rules as directed similarity scores between pairs of words:

| Model | File | Nr. of rules | File size |
|---|---|---|---|
| BAP | similarity-l2r.rdb | 443374 | 13M |
| BAP | similarity-r2l.rdb | 443121 | 13M |
| DIRT | similarity-l2r.rdb | 1700 | 4k |
| LIN proximity | similarity-l2r.rdb | 454964 | 13M |
| LIN proximity | similarity-r2l.rdb | 454964 | 13M |
| LIN dependency | similarity-l2r.rdb | 443374 | 13M |
| LIN dependency | similarity-r2l.rdb | 443374 | 13M |

Italian Paraphrase Table

The Italian paraphrase table is similar to a translation phrase table, except that the source and target languages are the same. It was obtained from a bilingual English-Italian translation table by translating back and forth (so to speak) and applying filters to reduce noise. Here is a sample from the table, which also contains paraphrase probabilities and word-level alignments:

| phrase 1 | phrase 2 | probabilities | alignment |
|---|---|---|---|
| Oriente | dell' est | 8.3403675825e-05 3.26660488188e-05 | 0-0 0-1 (2) |
| Oriente | orientale | 0.0289970464107 0.1030637784109 | 0-0 (3) |
| Oriente | di Levante | 0.00066489375 2.6212694103e-05 | 0-1 (1) |
| Oriente | Est | 0.0717464906149 0.03115844430799 | 0-0 (3) |
| merito all' evoluzione | sull' andamento | 6.6540063843e-08 2.2462629514e-06 | 0-0 1-0 2-1 (1) |
| merito all' evoluzione | sui progressi realizzati | 3.4636629312e-07 6.2507666868e-07 | 0-0 1-0 2-1 2-2 (1) |
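The alignment column can be read as pairs of word positions. The Python sketch below expands such a string into aligned word pairs, assuming that each `i-j` link connects the word at zero-based position `i` in phrase 1 to the word at position `j` in phrase 2. This reading is an assumption that matches the sample rows, not a documented specification. The function name `aligned_pairs` is hypothetical, and the parenthesised number that closes each alignment entry is not interpreted here.

```python
def aligned_pairs(phrase1, phrase2, alignment):
    """Expand an alignment string such as "0-0 1-0 2-1" into word pairs."""
    source = phrase1.split()
    target = phrase2.split()
    pairs = []
    for link in alignment.split():
        # Assumption: "i-j" links word i of phrase 1 to word j of phrase 2.
        i, j = (int(index) for index in link.split("-"))
        pairs.append((source[i], target[j]))
    return pairs


for src_word, tgt_word in aligned_pairs(
    "merito all' evoluzione", "sull' andamento", "0-0 1-0 2-1"
):
    print(src_word, "<->", tgt_word)
```

Under this reading, a one-to-many entry such as `0-0 0-1` means that the single word of phrase 1 corresponds to both words of phrase 2.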

The Italian paraphrase table is currently used by the Alignment EDA (P1EDA).
