ItalianLexicalResources

This page provides information about the resources available for Italian language processing.

Table of Contents

Italian WordNet
Inference rules from Italian Wikipedia
Similarity scores from distributional representations
	Corpus
	Resources
Italian Paraphrase Table

Italian WordNet

The EXCITEMENT platform currently provides functionality that uses the Italian version of WordNet, available by request from FBK. Unlike previous releases, the version we now provide is compatible with the Princeton WordNet's Java API. Using the resource is therefore similar to using the Princeton WordNet: download the resource to a local directory, and provide the access path in the configuration file. In EditDistanceEDA's configuration file, the entry looks like this:

<subsection name="wordnet">
	<!-- path of the WordNet files -->
	<property name="path">/tmp/wnita</property>
</subsection>

In the configuration file you must indicate that the platform should use the "wordnet" section you just defined, by adding the following line to the components section:

<property name="instances">wordnet</property>

Further information about configuration files can be found here.

The EOP package that provides access to this resource is:

eu.excitementproject.eop.core.component.lexicalknowledge.wordnet
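
Before starting the platform, it can help to confirm that the path given in the "wordnet" subsection points to an existing directory. The following is a minimal Python sketch of such a check, not part of EOP: the helper name wordnet_path is hypothetical, and the fragment is parsed on its own rather than read from a complete configuration file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Fragment copied from the configuration example in this section.
SNIPPET = """
<subsection name="wordnet">
    <!-- path of the WordNet files -->
    <property name="path">/tmp/wnita</property>
</subsection>
"""


def wordnet_path(xml_text: str) -> str:
    """Return the value of the path property of a wordnet subsection."""
    root = ET.fromstring(xml_text)
    if root.tag != "subsection" or root.get("name") != "wordnet":
        raise ValueError("expected a subsection named 'wordnet'")
    for prop in root.findall("property"):
        if prop.get("name") == "path":
            return (prop.text or "").strip()
    raise ValueError("no property named 'path' found")


if __name__ == "__main__":
    path = wordnet_path(SNIPPET)
    # Report whether the configured directory is present on this machine.
    status = "exists" if Path(path).is_dir() else "is missing"
    print(f"{path} {status}")
```

The check only covers the path value; it does not verify that the directory actually contains the Italian WordNet files.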

Inference rules from Italian Wikipedia

Entailment rules for Italian nouns were extracted from a corpus of Italian Wikipedia articles using the Wikipedia lexical miner, as described in Shnarch et al., 2009. Rule extraction links nouns based on several indicators: redirects, hyperlinks, categories, parentheses in the title, inference from the term definition, the category network, and the article text (in particular the first sentence, which is treated as a definition). The rules are scored according to the indicators used to identify them.

Currently, this resource is distributed as an archived MySQL database. It contains approximately 7 million rules. To use it, download the archive. After unpacking the resource, install it in MySQL by running the command:

> mysql -u username -p < resource

Then add a section to the configuration file that describes how to access the resource. The name of the database is "wikilexresita". Apart from the database name, the dbconnection, dbuser and dbpasswd values should reflect your MySQL installation:

<subsection name="wikipedia">
	<!-- connection to the Wikipedia data base -->
	<property name="dbconnection">jdbc:mysql://nathrezim:3306/wikilexresita</property>
	<property name="dbuser">username</property>
	<property name="dbpasswd">password</property>
</subsection>

To tell the platform to use this resource, insert the following line in the chosen component in the configuration file:

<property name="instances">wikipedia</property>

Further information about configuration files can be found here.

The EOP package that provides access to this resource is:

eu.excitementproject.eop.core.component.lexicalknowledge.wikipedia
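
A mistyped database name in the dbconnection value is easy to miss. The following minimal Python sketch, which is not part of EOP, splits a JDBC MySQL URL into host, port and database name so that the name can be compared with the one imported into MySQL. The function name parse_jdbc_mysql is hypothetical; the fallback port 3306 is the MySQL default.

```python
from urllib.parse import urlsplit


def parse_jdbc_mysql(url: str) -> dict:
    """Split a jdbc:mysql:// URL into host, port and database name."""
    prefix = "jdbc:"
    if not url.startswith(prefix + "mysql://"):
        raise ValueError("expected a URL starting with jdbc:mysql://")
    # Drop the "jdbc:" prefix so the rest parses like an ordinary URL.
    parts = urlsplit(url[len(prefix):])
    return {
        "host": parts.hostname,
        "port": parts.port or 3306,  # MySQL default when no port is given
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Value taken from the configuration example in this section.
    info = parse_jdbc_mysql("jdbc:mysql://nathrezim:3306/wikilexresita")
    print(info)
```

Comparing info["database"] with the database name created by the import step catches a mismatch before the platform tries to connect.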

Similarity scores from distributional representations

The resources described here were generated with the distsim package. They provide similarity scores between words computed based on the words' distributional representation built from a parsed corpus. Read more about the resource generation process.

Corpus

The resources described in this section were built from a corpus of Italian Wikipedia pages, parsed with TextPro. At the time of writing, they were built from 350M of the 1G corpus of Italian Wikipedia pages dated 2013/09/08.

Resources

The resources are available for download from the Artifactory repository -- Italian Redis DBs. The archive contains 7 Redis database files, which contain rules as directed similarity scores between pairs of words:

Model	File	Number of rules	File size
BAP	similarity-l2r.rdb	443374	13M
BAP	similarity-r2l.rdb	443121	13M
DIRT	similarity-l2r.rdb	1700	4k
LIN proximity	similarity-l2r.rdb	454964	13M
LIN proximity	similarity-r2l.rdb	454964	13M
LIN dependency	similarity-l2r.rdb	443374	13M
LIN dependency	similarity-r2l.rdb	443374	13M

Italian Paraphrase Table

The Italian paraphrase table is similar to a translation phrase table, except that the source and target languages are the same. It was obtained from a bilingual English-Italian translation table by translating from Italian to English and back, and applying filters to remove as much noise as possible. Here is a sample from the table, which also contains paraphrase probabilities and word-level alignments:

phrase 1	phrase 2	probabilities	alignment
Oriente	dell' est	8.3403675825e-05 3.26660488188e-05	0-0 0-1 (2)
Oriente	orientale	0.0289970464107 0.1030637784109	0-0 (3)
Oriente	di Levante	0.00066489375 2.6212694103e-05	0-1 (1)
Oriente	Est	0.0717464906149 0.03115844430799	0-0 (3)
merito all' evoluzione	sull' andamento	6.6540063843e-08 2.2462629514e-06	0-0 1-0 2-1 (1)
merito all' evoluzione	sui progressi realizzati	3.4636629312e-07 6.2507666868e-07	0-0 1-0 2-1 2-2 (1)

The Italian paraphrase table is currently used by the Alignment EDA (P1EDA).
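
To illustrate the structure of the sample rows above, here is a minimal Python sketch that parses one row into its parts. It is not the loader used by P1EDA. It assumes the four columns are tab-separated exactly as in the sample; the distributed file may use a different layout. The names ParaphraseEntry and parse_row are hypothetical, and the number in parentheses at the end of the alignment column is kept as an uninterpreted value because its meaning is not stated here.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class ParaphraseEntry(NamedTuple):
    phrase1: str
    phrase2: str
    probabilities: Tuple[float, ...]
    alignment: List[Tuple[int, int]]  # (index in phrase 1, index in phrase 2)
    trailing_value: Optional[int]     # number in parentheses, left uninterpreted


def parse_row(row: str) -> ParaphraseEntry:
    """Parse one tab-separated row of the sample paraphrase table."""
    fields = row.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    phrase1, phrase2, probs, align = fields
    probabilities = tuple(float(p) for p in probs.split())
    # Alignment points look like "0-0 0-1": pairs of word positions.
    pairs = [(int(a), int(b)) for a, b in re.findall(r"(\d+)-(\d+)", align)]
    match = re.search(r"\((\d+)\)", align)
    trailing = int(match.group(1)) if match else None
    return ParaphraseEntry(phrase1, phrase2, probabilities, pairs, trailing)


if __name__ == "__main__":
    # First row of the sample above.
    row = "Oriente\tdell' est\t8.3403675825e-05 3.26660488188e-05\t0-0 0-1 (2)"
    print(parse_row(row))
```

In this row both alignment points start at position 0, so the single word "Oriente" is aligned to both words of "dell' est".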
