Home

Assaf Urieli edited this page Dec 12, 2017 · 27 revisions

This wiki is meant to provide additional information and how-tos for certain tasks imperfectly described in the Talismane documentation. Much more detailed information about Talismane can also be found in Assaf Urieli's PhD thesis. Little by little, this wiki will become the primary source for Talismane documentation.

Introduction

Talismane is a statistical transition-based dependency parser for natural languages, written in Java. Many aspects of Talismane's behaviour can be tuned via the available configuration parameters. Furthermore, Talismane is based on an open, modular architecture, enabling a more advanced user to easily replace and/or extend the various modules, and, if required, to explore and modify the source code. It is distributed under an AGPL open-source license in order to encourage its non-commercial redistribution and adaptation.

Talismane stands for "Traitement Automatique des Langues par Inférence Statistique Moyennant l'Annotation de Nombreux Exemples" in French, or "Tool for the Analysis of Language, Inferring Statistical Models from the Annotation of Numerous Examples" in English.

Talismane should be considered as a framework which could potentially be adapted to any natural language. Currently, language packs are available for French, English and Occitan.

Talismane is a statistical toolset and makes heavy use of a probabilistic classifier (currently either Linear SVM, Maximum Entropy or Perceptrons). Linguistic knowledge is incorporated into the system via the selection of features and rules specific to the language being processed.

The portability offered by Java enables Talismane to function on most operating systems, including Linux, Unix, MacOS and Windows.

Talismane consists of four main modules which transform a raw unannotated text into a series of syntax dependency trees. It also contains a number of pre-processing and post-processing filters to manage and transform input and output. In sequence, these modules are:

Each of the modules in the processing chain can be used independently if desired.

Talismane Processing Chain

Talismane also contains some additional modules that are in a more experimental stage. These include:

  • language detection: supervised language detection based on a training corpus containing examples in the various languages.

Quick Start

To analyse a text file in French, download the latest Talismane release at: https://github.com/urieli/talismane/releases

You need three files:

  • talismane-distribution-X.X.X-bin.zip
  • talismane-fr-X.X.X.conf
  • frenchLanguagePack-X.X.X.zip
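Before going further, it can help to confirm that all three downloads are in place. The following is a minimal sketch, not part of Talismane: the function name `missing_talismane_files` and its two arguments (release version and download folder) are hypothetical, and only the three file names come from the list above.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Talismane: prints the name of each
# required download that is absent from the given folder.
#   $1 = release version (the X.X.X part of the file names)
#   $2 = folder where the release files were downloaded
missing_talismane_files() {
  version="$1"
  dir="$2"
  for f in \
    "talismane-distribution-$version-bin.zip" \
    "talismane-fr-$version.conf" \
    "frenchLanguagePack-$version.zip"
  do
    # Print the file name only when the file is not there.
    [ -f "$dir/$f" ] || printf '%s\n' "$f"
  done
}
```

Calling `missing_talismane_files X.X.X .` with the real version number in place of X.X.X prints nothing when all three files are present in the current folder.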

Unzip the file talismane-distribution-X.X.X-bin.zip, but leave frenchLanguagePack-X.X.X.zip zipped. Then copy the other two files (talismane-fr-X.X.X.conf and frenchLanguagePack-X.X.X.zip) into the folder where talismane-distribution-X.X.X-bin.zip was unzipped.

The command for syntax parsing is then:

java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar --analyse --sessionId=fr --encoding=UTF8 --inFile=data/frTest.txt --outFile=data/frTest.tal

If you want to stop at pos-tagging, you can use:

java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar --analyse --endModule=posTagger --sessionId=fr --encoding=UTF8 --inFile=data/frTest.txt --outFile=data/frTest.tal
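The two commands above differ only in the optional --endModule switch, so they can be assembled from a few parameters. The sketch below assumes a hypothetical wrapper function `talismane_cmd` (not part of Talismane) that only prints the command line. Every switch it emits is copied from the commands above, and the file names must not contain spaces.

```shell
#!/bin/sh
# Hypothetical wrapper, not shipped with Talismane: prints the analysis
# command for a given release version, input file and output file.
#   $1 = release version (the X.X.X part of the file names)
#   $2 = input text file
#   $3 = output file
#   $4 = optional end module, e.g. posTagger (omit for full syntax parsing)
talismane_cmd() {
  version="$1"
  in_file="$2"
  out_file="$3"
  end_module="${4:-}"
  cmd="java -Xmx1G -Dconfig.file=talismane-fr-$version.conf"
  cmd="$cmd -jar talismane-core-$version.jar --analyse"
  # Add --endModule only when the caller asked to stop early.
  if [ -n "$end_module" ]; then
    cmd="$cmd --endModule=$end_module"
  fi
  cmd="$cmd --sessionId=fr --encoding=UTF8"
  cmd="$cmd --inFile=$in_file --outFile=$out_file"
  printf '%s\n' "$cmd"
}
```

For instance, `talismane_cmd X.X.X data/frTest.txt data/frTest.tal posTagger` prints the pos-tagging command shown above. Printing the command rather than running it makes it easy to inspect before execution.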

To see the full list of command-line options, type:

java -jar talismane-core-X.X.X.jar --help

The full configuration options available in the configuration file (indicated by the -Dconfig.file switch) can be found at:

These are the default values, which can be overridden in your configuration file, as shown in the standard French and English configuration files. Note that the sessionId passed on the command line should correspond to the content of the configuration file: if the configuration file is structured as talismane.core.fr, the sessionId should be "fr".

Topics