This project provides an im- and an exporter to support the TreeTagger format in the linguistic converter framework Pepper (see http://corpus-tools.org/pepper/). The TreeTagger is a natural language processing tool, to annotate text with part-of-speech and lemma annotations. A detailed description of the importer can be found in section TreeTagg…
Pull request Compare This branch is 1 commit behind develop.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.settings
gh-site/img
src
.classpath
.gitignore
.project
LICENSE
NOTICE
README.md
build.properties
nb-configuration.xml
pom.xml

README.md

SaltNPepper project

pepperModules-TreetaggerModules

This project provides an im- and an exporter to support the TreeTagger format in the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper). The TreeTagger is a natural language processing tool, to annotate text with part-of-speech and lemma annotations. A detailed description of the importer can be found in section TreeTaggerImporter and a description for the exporter can be found TreeTaggerExporter.

Pepper is a pluggable framework to convert a variety of linguistic formats (like TigerXML, the EXMARaLDA format, PAULA etc.) into each other. Furthermore Pepper uses Salt (see https://github.com/korpling/salt), the graph-based meta model for linguistic data, which acts as an intermediate model to reduce the number of mappings to be implemented. That means converting data from a format A to format B consists of two steps. First the data is mapped from format A to Salt and second from Salt to format B. This detour reduces the number of Pepper modules from n2-n (in the case of a direct mapping) to 2n to handle a number of n formats.

n:n mappings via SaltNPepper

In Pepper there are three different types of modules:

  • importers (to map a format A to a Salt model)
  • manipulators (to map a Salt model to a Salt model, e.g. to add additional annotations, to rename things to merge data etc.)
  • exporters (to map a Salt model to a format B).

For a simple Pepper workflow you need at least one importer and one exporter.

Requirements

Since the here provided module is a plugin for Pepper, you need an instance of the Pepper framework. If you do not already have a running Pepper instance, click on the link below and download the latest stable version (not a SNAPSHOT):

Note: Pepper is a Java based program, therefore you need to have at least Java 7 (JRE or JDK) on your system. You can download Java from https://www.oracle.com/java/index.html or http://openjdk.java.net/ .

Install module

If this Pepper module is not yet contained in your Pepper distribution, you can easily install it. Just open a command line and enter one of the following program calls:

Windows

pepperStart.bat 

Linux/Unix

bash pepperStart.sh 

Then type in command is and the path from where to install the module:

pepper> update de.hu_berlin.german.korpling.saltnpepper::pepperModules-${project.artifactId}::https://korpling.german.hu-berlin.de/maven2/

Usage

To use this module in your Pepper workflow, put the following lines into the workflow description file. Note the fixed order of xml elements in the workflow description file: <importer/>, <manipulator/>, <exporter>. The TreetaggerImporter is an importer module, which can be addressed by one of the following alternatives. A detailed description of the Pepper workflow can be found on the Pepper project site.

a) Identify the module by name

<importer name="TreetaggerImporter" path="PATH_TO_CORPUS"/>

or

<exporter name="TreetaggerExporter" path="PATH_TO_CORPUS"/>

b) Identify the module by formats

<importer formatName="treetagger" formatVersion="1.0" path="PATH_TO_CORPUS"/>

or

<exporter formatName="treetagger" formatVersion="1.0" path="PATH_TO_CORPUS"/>

c) Use properties

<importer name="TreetaggerImporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</importer>

or

<exporter name="TreetaggerExporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</exporter>

#TreetaggerImporter Input Data

An input file for the TreetaggerImporter is a tab separated file, having the ending 'tt', 'TreeTagger' or 'tab'. The text overlapped by the current token and its annotations are separated by one tab (columns). Each token description gets its own row. The first column is mandatory and contains the token´s form. Any further column is optional and can be declared in the properties file. Each column is required to have a distinct name. By default, i.e. when there are no column declarations in the properties file, the second and third column are considered the part-of-speech annotation and the lemma annotation respectively. Note that this default is overridden if there is any declaration of columns in the properties file.

It is possible to use SGML tags to mark spans of tokens. Invalid SGML elements (e.g. missing opening or closing tag) will be ignored.

A whole TreeTagger document may be marked by a surrounding SGML element, which is required to have a certain name. This name is definable in the properties file and defaults to "meta".

The expected input file encoding defaults to "UTF-8" and is also definable in the properties file. Any input file´s name is required to end on ".tab" or ".tt".

<meta someDocumentAttribute="someDocumentValue">
    <someSpan someSpanAttribute="someSpanValue">
        TOKEN_1 POS_1 LEMMA_1
        TOKEN_2 POS_2 LEMMA_2
    </someSpan>
    <someMoreSpan someMoreSpanAttribute="someMoreSpanValue">
        TOKEN_3 POS_3 LEMMA_3
        TOKEN_4 POS_4 LEMMA_4
        TOKEN_5 POS_5 LEMMA_5
    </someMoreSpan>
</meta>

Span annotations may also be used to express pointing relations between spans. In this case, some attribute of each span must encode a unique identifier (id in the example below), and another attribute may be used to refer to this attribute to create the pointing relation (head in the example). A further attribute may specify an edge annotation (func in the example):

I read

See the properties containing PointingRelation below for relevant configuration flags.

Creating TreeTagger Representation

The file´s content is converted to a TreeTagger document containing a list of tokens and spans. If there is a document marking SGML element in the input file, an annotation for each of it´s attribute-value-pairs is added to the TreeTagger document. For all other SGML elements, spans are created and annotations according to the element´s attribute-value-pairs are added. For each data row, a token is created. The token´s text attribute is set to the first column´s content. For each additional column, an annotation is created and added to the token. There are three different types of annotations: The POSAnnotation, used for part-of-speech annotations, the LemmaAnnotation, used for lemma annotations, and the AnyAnnotation, used for all user-defined annotations.

If the default settings for columns are used, a POSAnnotation for the second column´s content and a LemmaAnnotation for the third column´s content are created and added to the token.

If user-defined settings for columns are used, a POSAnnotation will be created for a column named "pos", a LemmaAnnotation will be created for a column named "lemma", and an AnyAnnotation will be created for each other column.

Note

All names have to be distinct, and that a token can only have one POSAnnotation and one LemmaAnnotation, but any number of AnyAnnotations.

Mapping to Salt

Document Annotations

When converting a TreeTagger document to Salt, a SDocument and it´s proper SDocumentGraph will be created. If there is a meta tag for the whole document, all it´s attributes will be added to the SDocument as SMetaAnnotations.

Tokens

Each token will be mapped to an SToken, which is added to the SDocumentGraph. If a token has a POSAnnotation, the SToken will get a corresponding SPOSAnnotation. If a token has a LemmaAnnotation, the SToken will get a corresponding SLemmaAnnotation.

All the token´s forms, separated by space characters, will be contained in the SText attribute of one STextualDS, which also is also added to the SDocumentGraph.

Spans

Each TreeTagger span is mapped to a SSpan, which is added to the SDocumentGraph. The SSpan´s name is set to the TreeTagger span´s name. The annotations on the TreeTagger span are mapped to SAnnotations and added to the SSpan. For each span, a SSpanningRelation between the span and all contained tokens is created. The SSpanningRelation is added to the SDocumentGraph as well.

There are two switches concerning the annotations on the TreeTagger spans. The one of them concerns spans without any annotations and will add a SAnnotation, having the span´s name as name and as value, to the SSpan. The other one will do the same, but applies to all spans, regardless of the presence of annotations.

Additionally, selected span attributes may be used to define annotated pointing relations with an id annotation, target annotation and optional annotation label.

Properties

The following table contains an overview of all usable properties to customize the behavior of this pepper module. The following section contains a brief description to each single property and describes the resulting differences in the mapping to the Salt model.

Note that if any comma separated properties are used to generate multiple types of pointing relations (e.g. dependencies and coreference annotations in the same document), then all pointing relation properties that can carry comma separated values must contain exactly the same number of comma separated values.

properties to customize importer behavior
Name of property Type of property optional/ mandatory default value
treetagger.input.metaTag String optional --
treetagger.input.annotateUnannotatedSpans Boolean optional false
treetagger.input.annotateAllSpansWithSpanName Boolean optional --
treetagger.input.separatorAfterToken String optional " "
treetagger.input.prefixElementToAttributes Boolean optional false
treetagger.input.prefixElementSeparator String optional _
columnNames comma separated Strings optional tok, pos, lemma
treetagger.input.replaceTokens String optional --
treetagger.input.replacementsInAnnos Boolean optional true
treetagger.input.makePointingRelations Boolean optional false
treetagger.input.pointingRelationTargetAnnotation comma separated Strings optional head
treetagger.input.pointingRelationIDAnnotation comma separated Strings optional id
treetagger.input.pointingRelationNamespace comma separated Strings optional dep
treetagger.input.pointingRelationType comma separated Strings optional dep
treetagger.input.invertPointingRelations comma separated booleans optional true
treetagger.input.pointingRelationEdgeAnnotation comma separated Strings optional func
treetagger.input.pointingRelationSuppressID Boolean optional true
treetagger.input.pointingRelationSuppressTarget Boolean optional true
treetagger.input.pointingRelationSuppressLabel Boolean optional true
treetagger.input.pointingRelationUseHash Boolean optional true
treetagger.input.spanAnnotationNamespace String optional --
treetagger.input.separateSpanAnnotations comma separated Strings optional --

treetagger.input.metaTag

States the meta tag used to mark the TreeTagger document in the input file(s).

treetagger.input.annotateUnannotatedSpans

If set true, this switch will cause the module to annotate all spans without attributes with their name as attribute and value, i.e. <MyTag> will be treated as <MyTag mytag="mytag">

treetagger.input.annotateAllSpansWithSpanName

If set true, this switch will cause the module to annotate all spans with their name as attribute and value, i.e. <MyTag attribute="value"> will be treated as <MyTag attribute="value" mytag="mytag">

treetagger.input.separatorAfterToken

Determines the separator which should be artificially added after a token, when mapping TreeTagger token to STextualDS in Salt. The default separator is a whitespace given by the character sequence \" \".

Note

The separator sequence, must be surrounded by double quotes. To shut of the adding of a separator, just this property value to "".

treetagger.input.prefixElementToAttributes

Choose whether to prefix the element name to all span annotation attributes. For example, if set to true, then for a span , the 'when' annotation becomes date_when (the default separator is _ and can be configured).

treetagger.input.prefixElementSeparator

The string to use to separate element and attribute name when using the prefixElementToAttributes option. By default this is '_' ,so that for a span , the 'when' annotation becomes date_when="2016".

columnNames

This property allows to change the default columns of the TreeTagger format, which is: token, part-of-speech annotation, lemma annotation.

This	DT	this
means	VVZ	mean
the	DT	the
experimenter	NN	experimenter
does	VVZ	do
n't	RB	n't
know	VV	know

(excerpt comes from the GUM corpus, see: https://corpling.uis.georgetown.edu/gum/) You can determine an unbound number of columns and name each column. Imagine the following file, where 1st column is the token itself, the second is the part-of-speech annotation, the third is the lemma annotation, the fourth stands for claws and the third for token function:

This	DT	this	DT0	nsubj
means	VVZ	mean	VVZ	root
the	DT	the	AT0	det
experimenter	NN	experimenter	NN1	nsubj
does	VVZ	do	VDZ	aux
n't	RB	n't	XX0	neg
know	VV	know	VVI	ccomp

(excerpt comes from the GUM corpus, see: https://corpling.uis.georgetown.edu/gum/)

The corresponding customization properties would look like this:

<property key="columnNames">pos, lemma, claws, tok_func</property>

Note: the first column is always the tokenization.

treetagger.input.replaceTokens

Specify values to find and replace in tokens. This value is a comma separated list of mappings: "REPLACED_STRING" : "REPLACEMENT" (, "REPLACED_STRING" : "REPLACEMENT")*

This property can be helpful including XML escapes in TT tokens. For example, if we have tokens like &amp; but we would like them to be imported as &, we can use:

<property key="treetagger.input.replaceTokens">"&amp;amp;":"&amp;"</property>

treetagger.input.replacementsInAnnos

If true, make token replacement patterns apply to annotations as well. This means that a lemma like &amp; could be made to work in the same way as with treetagger.input.replaceTokens.

treetagger.input.spanAnnotationNamespace

Namespace to use for span annotations. Default null (if not set, defaults to built-in model behavior, e.g. default_ns in ANNISExporter).

treetagger.input.makePointingRelations

If true, pointing relations will be considered for import based on the other pointing relation import settings below. Default: FALSE.

treetagger.input.pointingRelationTargetAnnotation

The span annotation marking the pointing relation target (or source if inverting). If creating multiple types of pointing relations, use comma separated values (e.g. head,target). Default head.

treetagger.input.pointingRelationIDAnnotation

The span annotation marking the span ID to create pointing relations to (or from if inverting). If creating multiple types of pointing relations, use comma separated values (e.g. id,xml:id). Default id.

treetagger.input.pointingRelationNamespace

A namespace or Salt Layer name given to edges and their annotations. If creating multiple types of pointing relations, use comma separated values (e.g. dep,coref). Default dep.

treetagger.input.pointingRelationType

The edge type to assign to pointing relations. If creating multiple types of pointing relations, use comma separated values (e.g. dep,coref). Default dep.

treetagger.input.invertPointingRelations

If true, pointing relations are inverted (head -> id instead of id -> head). Since the most common application of pointing relations in TreeTagger format is dependency trees, this is set to TRUE by default. If creating multiple types of pointing relations, use comma separated values (e.g. TRUE,FALSE).

treetagger.input.pointingRelationEdgeAnnotation

A span annotation name marking an edge annotation. If creating multiple types of pointing relations, use comma separated values (e.g. func,type). Default func.

treetagger.input.pointingRelationSuppressID

If true, the span annotation marking IDs is not imported as a span annotation, and is only used to determine pointing relation source/target. Default: TRUE.

treetagger.input.pointingRelationSuppressTarget

If true, the span annotation marking targets is not imported as a span annotation, and is only used to determine pointing relation source/target. Default: TRUE.

treetagger.input.pointingRelationSuppressLabel

If true, the span annotation marking edge label values is not imported as a span annotation, and is only used to determine pointing relation annotations. Default: TRUE.

treetagger.input.pointingRelationUseHash

If true, pointing relation targets starting with # will match IDs without the # (href style syntax); may be left on even if no hashtags are used. Default: TRUE.

treetagger.input.separateSpanAnnotations

Optional comma separated list of span attribute names for which new spans will be created, resulting in separate nodes in the graph, each with their own annotation. Default: null.

#TreetaggerExporter Mapping to TreeTagger format

Document Annotations

When converting from Salt to TreeTagger, a TreeTagger document will be created, and all SMetaAnnotations of the SDocument will be mapped to Annotations of that document.

Tokens

Each SToken will be mapped to a TreeTagger token. The token´s text comes from the STextualDS of the SDocumentGraph. If there is a SPOSAnnotation for the SToken, or if there is a SAnnotation named "pos", "part-of-speech" or "partofspeech" (all case insensitive), it will be mapped to the POSAannotation of the token. If there is a SLemmaAnnotation for the SToken, or if there is a SAnnotation named "lemma", "lemmatisation", "lemmatization" or "lemmata" (all case insensitive), it will be mapped to the LemmaAnnotation of the token. All other SAnnotations will be mapped to an "AnyAnnotation" of the TreeTagger token.

Spans

The SSpans from all SSpanningRelations in the SDocumentGraph will be mapped to spans of the TreeTagger document. There is a switch in the properties file for the processing of SSpans with generic names ("sSpan", followed by a number). If this switch is set "true", these names will be replaced by the name of the first annotation found on the span.

Output file

An output file from the TreetaggerExporter always contains a SGML element marking the whole document. The tag for the element defaults to "meta". This tag is definable in the properties file. All the annotations on the TreeTagger document will be added to the SGML element as attribute-valuepairs.

The output file contains one tab separated row per token. The first column contains the token´s form. If there is a POSAnnotation for the token, the second column contains it´s value, else it remains empty. If there is a LemmaAnnotation for the token, the third column contains it´s value, else it remains empty. The output of AnyAnnotations can be set in the properties file. If it is set "true", a column for each distinctly named AnyAnnotation will appears in the output file, sorted alphabetically by the AnyAnnotations´ names. Note that these names do not appear in the output file. However, the names and the order of the columns will be logged on the info-level of the conversion process.

All Spans will appear as SGML elements in the output file. The SGML element´s name is the Spans name, and all it´s annotations will be added to the element as attribute-value-pairs. In the properties file, the renaming of generically named Spans

The default encoding for output files is "UTF-8". This also is definable in the properties file. All output files´ names end on ".tt".

Properties

The following table contains an overview of all usable properties to customize the behavior of this pepper module. The following section contains a brief description to each single property and describes the resulting differences in the mapping to the salt model.

properties to customize exporter behavior
Name of property Type of property optional/ mandatory default value
treetagger.fileExtension String optional utf-8
treetagger.output.metaTag String optional --
treetagger.output.exportAnyAnnotation Boolean optional false
treetagger.output.replaceGenericSpanNames Boolean optional false
treetagger.output.flatten Boolean optional false

treetagger.fileExtension

This property determines the ending of TreeTagger files, which are exported. The default value is '.tt'.

treetagger.output.metaTag

Sets the meta tag used to mark the TreeTagger document in the output file(s).

treetagger.output.exportAnyAnnotation

If set true, each AnyAnnotation of tokens will appear in the output file.

treetagger.output.replaceGenericSpanNames

If set true, generic span names like sSpan123 will be replaced with the first annotation of the span found. If the span has no annotations, the generic name will not be replaced.

treetagger.output.flatten

If set true, the output directory structure is flat: all documents are put in the output root directory.

Contribute

Since this Pepper module is under a free license, please feel free to fork it from github and improve the module. If you even think that others can benefit from your improvements, don't hesitate to make a pull request, so that your changes can be merged. If you have found any bugs, or have some feature request, please open an issue on github. If you need any help, please write an e-mail to saltnpepper@lists.hu-berlin.de .

Funders

This project has been funded by the department of corpus linguistics and morphology of the Humboldt-Universität zu Berlin, the Institut national de recherche en informatique et en automatique (INRIA) and the Sonderforschungsbereich 632.

License

Copyright 2009 Humboldt-Universität zu Berlin, INRIA.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.