This project provides an importer to support the TigerXML format and the ISOTiger format for the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper). A detailed description of that importer can be found in section Tiger2Importer.
Pepper is a pluggable framework to convert a variety of linguistic formats (like TigerXML, the EXMARaLDA format, PAULA etc.) into each other. Furthermore, Pepper uses Salt (see https://github.com/korpling/salt), the graph-based meta model for linguistic data, which acts as an intermediate model to reduce the number of mappings to be implemented. That means converting data from a format A to a format B consists of two steps: first the data is mapped from format A to Salt, and second from Salt to format B. This detour reduces the number of Pepper modules needed to handle n formats from n²-n (in the case of direct mappings) to 2n.
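The module counts mentioned above can be checked with a few lines of Python. This is only an arithmetic illustration; the function names are made up for this sketch and are not part of Pepper.

```python
def direct_mappings(n: int) -> int:
    """Modules needed when each of n formats is mapped directly to every other format."""
    return n * n - n


def salt_mappings(n: int) -> int:
    """Modules needed with Salt as intermediate model: one importer and one exporter per format."""
    return 2 * n


for n in (3, 5, 10):
    print(n, direct_mappings(n), salt_mappings(n))
```

For three formats both approaches need six modules; the detour via Salt pays off for every additional format.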
In Pepper there are three different types of modules:
- importers (to map a format A to a Salt model)
- manipulators (to map a Salt model to a Salt model, e.g. to add additional annotations, to rename things, to merge data etc.)
- exporters (to map a Salt model to a format B).
For a simple Pepper workflow you need at least one importer and one exporter.
Since the module provided here is a plugin for Pepper, you need an instance of the Pepper framework. If you do not already have a running Pepper instance, click on the link below and download the latest stable version (not a SNAPSHOT):
Note: Pepper is a Java-based program, therefore you need to have at least Java 7 (JRE or JDK) on your system. You can download Java from https://www.oracle.com/java/index.html or http://openjdk.java.net/ .
If this Pepper module is not yet contained in your Pepper distribution, you can easily install it. Just open a command line and enter one of the following program calls:
Then type in the command is and the path from which to install the module:
pepper> update de.hu_berlin.german.korpling.saltnpepper::pepperModules-pepperModules-TigerModules::https://korpling.german.hu-berlin.de/maven2/
To use this module in your Pepper workflow, put the following lines into the workflow description file. Note the fixed order of XML elements in the workflow description file: <importer/>, <manipulator/>, <exporter/>. The Tiger2Importer is an importer module, which can be addressed in one of the following ways. A detailed description of the Pepper workflow can be found on the Pepper project site.
a) Identify the module by name
<importer name="Tiger2Importer" path="PATH_TO_CORPUS"/>
b) Identify the module by formats
<importer formatName="tigerXML" formatVersion="1.0" path="PATH_TO_CORPUS"/>
<importer formatName="tiger2" formatVersion="2.0.5" path="PATH_TO_CORPUS"/>
c) Use properties
<importer name="Tiger2Importer" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</importer>
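If the workflow description is generated by a script, the importer element with its properties can be built programmatically. The following is a minimal sketch using Python's standard library; the helper name importer_element is hypothetical, and PROPERTY_NAME and PROPERTY_VALUE are the same placeholders as above. It produces only the importer element, not a complete workflow file.

```python
import xml.etree.ElementTree as ET


def importer_element(path, properties=None):
    """Build an <importer> element for the Tiger2Importer (hypothetical helper)."""
    importer = ET.Element("importer", {"name": "Tiger2Importer", "path": path})
    for key, value in (properties or {}).items():
        # Each property becomes a <property key="...">value</property> child.
        prop = ET.SubElement(importer, "property", {"key": key})
        prop.text = value
    return importer


element = importer_element("PATH_TO_CORPUS", {"PROPERTY_NAME": "PROPERTY_VALUE"})
print(ET.tostring(element, encoding="unicode"))
```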
Since this Pepper module is under a free license, please feel free to fork it from GitHub and improve the module. If you think that others can benefit from your improvements, don't hesitate to make a pull request, so that your changes can be merged. If you have found any bugs, or have a feature request, please open an issue on GitHub. If you need any help, please write an e-mail to firstname.lastname@example.org .
This project has been funded by the department of corpus linguistics and morphology of the Humboldt-Universität zu Berlin, the Institut national de recherche en informatique et en automatique (INRIA) and the Sonderforschungsbereich 632.
Copyright 2009 Humboldt-Universität zu Berlin, INRIA.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The Tiger2Importer is able to import data coming from the TigerXML format as well as from the ISOTiger format. Therefore, the mapping described here only covers the mapping between the ISOTiger-api and Salt.
The mapping of the document structure of a Document in ISOTiger to an SDocument in Salt is very straightforward.
Metadata in the ISOTiger-model are all fields of the object Meta, for instance name, author and date. Each of these fields is mapped to its own metadata object in Salt, called SMetaAnnotation. The name of the metadata field in the ISOTiger-model is mapped to the field SMetadata.sName and its value is mapped to SMetadata.sValue. All SMetaAnnotation objects are added to the SDocument object representing the Corpus object in the ISOTiger-model.
Metadata in TigerXML (version 1) is restricted to the built-in types appearing in the document <head> element, e.g.:
<head>
  <meta>
    <name>my_doc_name</name>
    <author>Amir</author>
    <date>2016-12-31</date>
    <description>My corpus, see http://mycorpus.com/</description>
    <format>TigerXML</format>
    <history/>
  </meta>
Note especially that the <name> tag in TigerXML determines the name of the document in the imported Salt model, and will subsequently determine the name of exported output files.
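To illustrate which name/value pairs result from such a header, the following sketch reads the <meta> block with Python's standard library. It is not the importer's implementation; the function name meta_annotations is hypothetical, and the closing </head> tag is added here only so that the sample is well-formed.

```python
import xml.etree.ElementTree as ET

# Sample header taken from the example above; </head> added for well-formedness.
HEAD = """<head>
  <meta>
    <name>my_doc_name</name>
    <author>Amir</author>
    <date>2016-12-31</date>
    <description>My corpus, see http://mycorpus.com/</description>
    <format>TigerXML</format>
    <history/>
  </meta>
</head>"""


def meta_annotations(head_xml):
    """Return the metadata fields as a dict of name -> value (hypothetical helper)."""
    meta = ET.fromstring(head_xml).find("meta")
    if meta is None:
        return {}
    # Empty elements such as <history/> have no text and yield an empty string.
    return {child.tag: (child.text or "").strip() for child in meta}


annotations = meta_annotations(HEAD)
print(annotations["name"])
```

The value stored under name is the one that would become the document name.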
text, token and terminal
A terminal node (Terminal) is mapped to an SToken node. The text it covers is mapped to an STextualDS object. During the mapping, only one STextualDS object is created for the entire document. Neither the TigerXML format nor the ISOTiger format allows the primary text to be recreated, since only tokens are kept, but no information about separators such as whitespace. Therefore the importer provides a property to customize the separator between tokens. The default separator is the blank character. Imagine two terminals covering the texts "a" and "sample": the default mapping will produce the sText value "a sample".
non-terminal
A non-terminal node (NonTerminal) is mapped to an SStructure node.
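The way a primary text is assembled from terminals and a separator can be sketched as follows. This is an illustration in Python, not the importer's code; the function build_text and the representation of tokens as (start, end) offsets are assumptions made for this example.

```python
def build_text(terminals, separator=" "):
    """Join terminal texts into one primary text and return the token offsets."""
    tokens = []
    offset = 0
    for index, word in enumerate(terminals):
        if index > 0:
            # The separator is only placed between two tokens.
            offset += len(separator)
        start = offset
        offset += len(word)
        tokens.append((start, offset))
    return separator.join(terminals), tokens


text, tokens = build_text(["a", "sample"])
print(repr(text), tokens)
```

With the default separator the two terminals yield the text "a sample"; with an empty separator the tokens would directly follow each other.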
The decision to which class of SRelation an edge is mapped is rule-based and depends on the class of the source or the target node of the edge:
- when Edge.source is an SToken object, then the Edge is mapped to an SPointingRelation object
- when Edge.source is an SSpan object and Edge.target is an SToken object, then the Edge is mapped to an SSpanningRelation object
- when Edge.source is an SStructure object, then the Edge is mapped to an SDominanceRelation object
- otherwise, the Edge is mapped to an SPointingRelation object
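The rules above can be written down as a small decision function. The sketch below uses plain strings as stand-ins for the Salt classes and is only meant to make the order of the rules explicit; the function name relation_class is hypothetical.

```python
def relation_class(source_kind, target_kind):
    """Return the name of the SRelation class for an edge, given the node kinds."""
    if source_kind == "SToken":
        return "SPointingRelation"
    if source_kind == "SSpan" and target_kind == "SToken":
        return "SSpanningRelation"
    if source_kind == "SStructure":
        return "SDominanceRelation"
    # Fallback for every other combination.
    return "SPointingRelation"


print(relation_class("SStructure", "SToken"))
```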
Annotations in general (represented by an Annotation object in the ISOTiger-api) are mapped to an SAnnotation object, where Annotation.name is mapped to SAnnotation.sName and Annotation.value is mapped to SAnnotation.sValue. An Annotation object can belong to either a Terminal, a NonTerminal or an Edge object and is therefore attached to the corresponding SNode or SRelation object in Salt. To adapt the mapping by renaming annotations, you can use the corresponding property.
In the default case, Segment objects are ignored and not mapped to Salt. To change this behavior, you can use the corresponding property.
The table contains an overview of all usable properties to customize the behavior of this Pepper module. The following section contains a brief description of each single property and describes the resulting differences in the mapping to the Salt model.
properties to customize importer behavior
|Name of property|Type of property|optional/ mandatory|default value|
This flag determines whether an SSpan object shall be created for each segment. Must be mappable to a Boolean value.
Property to determine which Edge type shall be mapped to which kind of SRelation. This is just a prefix of the real property, which has a suffix specifying the Edge type, for instance map.dep or map.prim.
Determines the separator between terminal nodes. The default separator is ' '.
Gives a renaming table for the sType of an SRelation. The syntax for defining such a table is 'OLDNAME=NEWNAME (,OLDNAME=NEWNAME)*'; for instance, the property value 'prim=edge, sec=secedge' will rename all sType values from 'prim' to 'edge' and from 'sec' to 'secedge'.
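How a value in this renaming syntax can be read is sketched below. This is an illustration in Python, not the parser used by the module; the function names parse_renaming_table and rename are hypothetical, and tolerating whitespace around entries is an assumption based on the example value.

```python
def parse_renaming_table(value):
    """Parse 'OLDNAME=NEWNAME (,OLDNAME=NEWNAME)*' into a dict."""
    table = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        old, sep, new = entry.partition("=")
        if not sep:
            raise ValueError("missing '=' in entry: " + entry)
        table[old.strip()] = new.strip()
    return table


def rename(name, table):
    """Return the new name, or the unchanged name if the table has no entry for it."""
    return table.get(name, name)


table = parse_renaming_table("prim=edge, sec=secedge")
print(rename("prim", table), rename("sec", table), rename("dep", table))
```

Names without an entry in the table are left unchanged in this sketch.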
Reverses the direction of edges having the given types, so that the source node becomes the target node and the target node becomes the source node. This is useful when secondary edges are mapped to dominance edges and the annotation scheme would introduce cycles. By inverting the edges, cycles are avoided. The value must be a list of type names, separated by commas.
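The effect of reversing edges of selected types can be illustrated with a short sketch. Edges are modelled here as (source, target, type) tuples, which is an assumption for this example only; the function name reverse_edges and the node names are hypothetical.

```python
def reverse_edges(edges, types_value):
    """Swap source and target of all edges whose type is in the comma-separated list."""
    types = {t.strip() for t in types_value.split(",") if t.strip()}
    return [
        (target, source, edge_type) if edge_type in types else (source, target, edge_type)
        for source, target, edge_type in edges
    ]


edges = [("s1", "t1", "prim"), ("t1", "s1", "sec")]
print(reverse_edges(edges, "sec"))
```

In the sample, the 'sec' edge pointing back from t1 to s1 is turned around, so both edges run from s1 to t1 and no longer form a cycle.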
Gives a renaming table for the name of an annotation, or more specifically, for the value the sName of the SAnnotation object shall get. The syntax for defining such a table is 'OLDNAME=NEWNAME (,OLDNAME=NEWNAME)*'; for instance, the property value 'label=func' will rename all sName values from 'label' to 'func'.