This wiki is meant to provide additional information and how-tos for certain tasks imperfectly described in the Talismane documentation. Much more detailed information about Talismane can also be found in Assaf Urieli's PhD thesis. Little by little, this wiki will become the primary source for Talismane documentation.
Talismane is a statistical transition-based dependency parser for natural languages, written in Java. Many aspects of Talismane's behaviour can be tuned via the available configuration parameters. Furthermore, Talismane is based on an open, modular architecture, enabling a more advanced user to easily replace and/or extend the various modules, and, if required, to explore and modify the source code. It is distributed under an AGPL open-source license in order to encourage its non-commercial redistribution and adaptation.
Talismane stands for "Traitement Automatique des Langues par Inférence Statistique Moyennant l'Annotation de Nombreux Exemples" in French, or "Tool for the Analysis of Language, Inferring Statistical Models from the Annotation of Numerous Examples" in English.
Talismane should be considered as a framework which could potentially be adapted to any natural language. Currently, language packs are available for French, English and Occitan.
Talismane is a statistical toolset and makes heavy use of a probabilistic classifier (currently either Linear SVM, Maximum Entropy or Perceptrons). Linguistic knowledge is incorporated into the system via the selection of features and rules specific to the language being processed.
The portability offered by Java enables Talismane to function on most operating systems, including Linux, Unix, MacOS and Windows.
Talismane consists of four main modules which transform a raw unannotated text into a series of syntax dependency trees. It also contains a number of pre-processing and post-processing filters to manage and transform input and output. In sequence, these modules are:
Each of the modules in the processing chain can be used independently if desired.
Talismane also contains some additional modules that are in a more experimental stage. These include:
- language detection: supervised language detection based on a training corpus containing examples in the various languages.
Quick Start
To analyse a text file in French, download the latest Talismane release at: https://github.com/urieli/talismane/releases
You need three files:
Unzip the file talismane-distribution-X.X.X-bin.zip, but not frenchLanguagePack-X.X.X.zip. Then copy the other two files into the folder where talismane-distribution-X.X.X-bin.zip was unzipped.
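The setup step above could be scripted roughly as follows. This is a minimal sketch, not part of the Talismane distribution. The function name `setup_talismane` and the target folder name are hypothetical, and the name of the configuration file is assumed from the `-Dconfig.file` argument used in the commands on this page. It also assumes that all three files sit in the current directory and that the distribution archive has no top-level folder of its own.

```shell
#!/bin/sh
# Hypothetical helper: prepares a Talismane working folder from the three downloads.
# Assumes all three files are in the current directory and share one version number.
setup_talismane() {
    version="$1"
    dist="talismane-distribution-${version}-bin.zip"
    pack="frenchLanguagePack-${version}.zip"
    conf="talismane-fr-${version}.conf"   # assumed name, taken from -Dconfig.file

    # Stop early if any of the three files is missing.
    for f in "$dist" "$pack" "$conf"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done

    # Unzip only the distribution; the language pack stays zipped.
    unzip -q "$dist" -d "talismane-${version}" || return 1

    # Copy the other two files next to the unzipped distribution.
    cp "$pack" "$conf" "talismane-${version}/"
}
```

Calling `setup_talismane X.X.X` with the real version number in place of `X.X.X` would then leave everything in a folder named `talismane-X.X.X`.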
The command for syntax parsing is then:
java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar --analyse --sessionId=fr --encoding=UTF8 --inFile=data/frTest.txt --outFile=data/frTest.tal
If you want to stop at pos-tagging, you can use:
java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar --analyse --endModule=posTagger --sessionId=fr --encoding=UTF8 --inFile=data/frTest.txt --outFile=data/frTest.tal
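The two commands above differ only in the `--endModule` option, so a small wrapper can build either one. The sketch below is illustrative only: the function name `talismane_cmd` is hypothetical, it uses only the options shown on this page, and it prints the command rather than running it so that it can be checked first. It assumes file paths without spaces.

```shell
#!/bin/sh
# Hypothetical wrapper: prints the Talismane command for a given version and files.
# An optional fourth argument is passed on as --endModule (e.g. posTagger).
talismane_cmd() {
    version="$1"
    in_file="$2"
    out_file="$3"
    end_module="${4:-}"

    cmd="java -Xmx1G -Dconfig.file=talismane-fr-${version}.conf"
    cmd="$cmd -jar talismane-core-${version}.jar --analyse"

    # Only add --endModule when the caller asked to stop early.
    if [ -n "$end_module" ]; then
        cmd="$cmd --endModule=${end_module}"
    fi

    cmd="$cmd --sessionId=fr --encoding=UTF8 --inFile=${in_file} --outFile=${out_file}"
    printf '%s\n' "$cmd"
}

# Print the full parsing command, then the one that stops at pos-tagging.
talismane_cmd X.X.X data/frTest.txt data/frTest.tal
talismane_cmd X.X.X data/frTest.txt data/frTest.tal posTagger
```

Once the printed command looks right, it can be pasted into the terminal from the folder that contains the jar and the configuration file.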
To see the full list of command-line options, type:
java -jar talismane-core-X.X.X.jar --help
The full configuration options available in the configuration file (indicated by the -Dconfig.file switch) can be found at:
These are the default values, which can be overridden in your configuration file, as shown in the standard French and English configuration files. Note that the sessionId passed in the command line should correspond to the content of the configuration file: if the configuration file is structured as
sessionId should be "fr".