Skip to content

JIGSAW is a knowledge-based Word Sense Disambiguation system that attempts to disambiguate all words in a text by exploiting WordNet senses. The main assumption is that a specific strategy for each Part-Of-Speech is better than a single strategy.

License

Notifications You must be signed in to change notification settings

pippokill/JIGSAW

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

JIGSAW, a knowledge-based algorithm for Word Sense Disambiguation

Author/developer: Pierpaolo Basile <pierpaolo.basile@gmail.com>

If you use this software in writing scientific papers, or you use this
software in any other medium serving scientists or students (e.g. web-sites,
CD-ROMs) please include the following citation.

1) for ENGLISH Word Sense Disambiguation:

P. Basile, M. Degemmis, A. Gentile, P. Lops, and G. Semeraro. UNIBA:
JIGSAW Algorithm for Word Sense Disambiguation. In Proceedings of the 4th ACL
2007 International Workshop on Semantic Evaluations (SemEval-2007), pages 398–401.
Association for Computational Linguistics (ACL), 2007.

@InProceedings{basile-EtAl:2007:SemEval-2007,
  author    = {Basile, Pierpaolo  and  de Gemmis, Marco  and  Gentile, Anna Lisa  and  Lops, Pasquale  and  Semeraro, Giovanni},
  title     = {UNIBA: JIGSAW algorithm for Word Sense Disambiguation},
  booktitle = {Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)},
  month     = {June},
  year      = {2007},
  address   = {Prague, Czech Republic},
  publisher = {Association for Computational Linguistics},
  pages     = {398--401},
  url       = {http://www.aclweb.org/anthology/S/S07/S07-1088}
}

The paper is available on line: http://aclweb.org/anthology-new/S/S07/S07-1088.pdf

2) for ITALIAN Word Sense Disambiguation:

P. Basile and G. Semeraro. JIGSAW: An algorithm for word sense disambiguation.
Intelligenza Artificiale, 4(2):53–54, 2007

@article{EVALITA2007,
  author    = {Pierpaolo Basile and
               Giovanni Semeraro},
  title     = {{JIGSAW}: An Algorithm for Word Sense Disambiguation},
  journal   = {Intelligenza Artificiale},
  volume    = {4},
  number    = {2},
  year      = {2007},
  pages     = {53-54},
}

-------------------------------------------------------------------------------------------

JIGSAW is a knowledge-based Word Sense Disambiguation (WSD) system that attempts to disambiguate all words in a text by exploiting WordNet senses.
The main assumption is that a specific strategy for each Part-Of-Speech (POS) is better than a single strategy.

In order to run JIGSAW, you must execute the script jigsaw.sh in the ./bin directory. 
The script requires the following parameters:

-cf <configuration file>, the configuration file is available in /resources/jigsaw.properties

-i <input file>, the file which contains the text that will be disambiguated

-o <output file>, the file in which you want to save the output. This parameter is optional, the standard output is used if you omit it

-m tokenized|tagged, the format of the input text. tokenized: one token per line. tagged: one token per line with pos-tag. This parameter 
is optional, non-tokenized text is used if you omit it

*CONFIGURATION FILE

This file contains options about JIGSAW. One option per line. The option is in the following format <option>=<value>. Information about options are reported
into the configuration file: ./resources/jigsaw.properties

*INPUT FILE

You can use three different formats:

-non-tokenized text

-tokenized text: one token per line

-tagged: one token per line with pos-tag in the following format: word.postag. The postag is a character which indicates the part-of-speech of the word.
	 You can use five types of pos-tag: n for nouns, v for verbs, a for adjectives, r for adverbs and o for other part-of speech tags.

*OUTPUT FILE

The output reports one token per line. Each token is followed by the stem, part-of-speech, lemma and the WSD output.

*WORDNET

JIGSAW relies on WordNet as knowledge base. You can download WordNet from http://wordnet.princeton.edu/wordnet/download/.
In order to use WordNet, you must configure the file ./resources/wn_file_properties.xml. In particular you must change the following line:

<param name="dictionary_path" value="/home/user/WordNet/3.0/dict/"/>

Change the attribute value and set the directory in which the WordNet dictionary is installed.

*MULTIWORDNET

JIGSAW for Italian disambiguation relies on MultiWordNet as knowledge base. You can obtain MultiWordNet from http://multiwordnet.fbk.eu/

MultiWordNet is provided as SQL data. You must modify resource/jigsawIT.properties file with your database information. Moreover, you must add
to the classpath the JDBC driver.

JIGSAW needs some NLP steps to work, in particular lemmatization and pos-tagging. The lemmatization implemented in JIGSAW relies on the morph-it resource 
that you can download from http://dev.sslmit.unibo.it/linguistics/morph-it.php . You must modify the property nlp.morph-it in the resource/jigsawIT.properties file.
The pos-tagging is implemented via OpenNLP API, but you must provide the model data.
You can find information about how to build pos-tag-model for OpenNLP here: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.postagger 
WE CANNOT PROVIDE POS-TAG MODEL DUE TO LICENSE RESTRICTIONS. The resource/jigsawIT.properties contains the property nlp.posTagModel where you must specify the path of
the pos-tag model.
IMPORTANT: the pos-tag model must rely on a specific tags set, in particular tags related to nouns should start with the character 'S', verbs with 'V', adverbs with 'B' and
adjectives with 'A'. Otherwise, you must modify the class jigsaw.wn.Tag2MWN to re-map your pos-tags set.


*EXAMPLES

The test directory contains some input/output files in different formats. Some command lines are reported here:

./jigsaw.sh -cf ../resources/jigsaw.properties -i ../test/non-tokenized -o ../test/non-tokenized.jigsaw

./jigsaw.sh -cf ../resources/jigsaw.properties -i ../test/tokenized -m tokenized -o ../test/tokenized.jigsaw

./jigsaw.sh -cf ../resources/jigsaw.properties -i ../test/tagged -m tagged -o ../test/tagged.jigsaw

In order to run JIGSAW on Italian documents, you can use the jigsaw_it.sh script.

About

JIGSAW is a knowledge-based Word Sense Disambiguation system that attempts to disambiguate all words in a text by exploiting WordNet senses. The main assumption is that a specific strategy for each Part-Of-Speech is better than a single strategy.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published