This project provides an im- and an exporter to support the Penn Treebank Format (PTB) for the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper).

korpling/pepperModules-PTBModules

pepperModules-PTBModules

This project provides an importer and an exporter for the Penn Treebank format (PTB) for the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper). The importer is described in detail in the section PTBImporter, the exporter in the section PTBExporter.

Pepper is a pluggable framework for converting a variety of linguistic formats (such as TigerXML, the EXMARaLDA format, PAULA etc.) into each other. Pepper uses Salt (see https://github.com/korpling/salt), a graph-based meta model for linguistic data, as an intermediate model to reduce the number of mappings that have to be implemented. Converting data from a format A to a format B therefore consists of two steps: first the data is mapped from format A to Salt, and then from Salt to format B. For n formats, this detour reduces the number of required Pepper modules from n² - n (for direct mappings) to 2n.
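
The saving can be checked with a few lines of arithmetic. The following is only an illustrative sketch; the function names are made up for this example and are not part of Pepper or Salt.

```python
def direct_mappings(n):
    """Converters needed when each of n formats maps directly to every other format."""
    return n * n - n


def mappings_via_salt(n):
    """Modules needed when each of n formats has one importer and one exporter."""
    return 2 * n


# Compare both approaches for a few corpus-format counts.
for n in (3, 5, 10):
    print(n, direct_mappings(n), mappings_via_salt(n))
```

For three formats both approaches need six modules; the intermediate model pays off for every additional format (for ten formats, 90 direct converters versus 20 modules).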

n:n mappings via SaltNPepper

In Pepper there are three different types of modules:

  • importers (to map a format A to a Salt model)
  • manipulators (to map a Salt model to a Salt model, e.g. to add additional annotations, to rename things, to merge data etc.)
  • exporters (to map a Salt model to a format B).

For a simple Pepper workflow you need at least one importer and one exporter.

Requirements

Since this module is a plugin for Pepper, you need an instance of the Pepper framework. If you do not already have a running Pepper instance, download the latest stable version (not a SNAPSHOT) from the Pepper project site (see https://u.hu-berlin.de/saltnpepper).

Note: Pepper is a Java-based program, so you need at least Java 7 (JRE or JDK) on your system. You can download Java from https://www.oracle.com/java/index.html or http://openjdk.java.net/ .

Install module

If this Pepper module is not yet contained in your Pepper distribution, you can easily install it. Just open a command line and enter one of the following program calls:

Windows

pepperStart.bat 

Linux/Unix

bash pepperStart.sh 

Then enter the update command together with the module coordinates and the location from which to install the module:

pepper> update de.hu_berlin.german.korpling.saltnpepper::pepperModules-PTBModules::https://korpling.german.hu-berlin.de/maven2/

Usage

To use this module in your Pepper workflow, put the following lines into the workflow description file. Note the fixed order of XML elements in the workflow description file: <importer/>, <manipulator/>, <exporter/>. The PTBImporter is an importer module and the PTBExporter is an exporter module; each can be addressed in one of the following ways. A detailed description of the Pepper workflow can be found on the Pepper project site.

a) Identify the module by name

<importer name="PTBImporter" path="PATH_TO_CORPUS"/>

or

<exporter name="PTBExporter" path="PATH_TO_CORPUS"/>

b) Identify the module by formats

<importer formatName="PTB" formatVersion="1.0" path="PATH_TO_CORPUS"/>

or

<exporter formatName="PTB" formatVersion="1.0" path="PATH_TO_CORPUS"/>

c) Use properties

<importer name="PTBImporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</importer>

or

<exporter name="PTBExporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</exporter>
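
If workflow entries are generated by a script, the element structure shown above can be produced with Python's standard library. This is a sketch only: the helper module_element is hypothetical and not part of Pepper, and the property name handleSlashTokens is taken from the property tables of this README.

```python
import xml.etree.ElementTree as ET


def module_element(kind, name, path, properties=None):
    """Build an <importer> or <exporter> element with optional <property> children."""
    if kind not in ("importer", "exporter"):
        raise ValueError("kind must be 'importer' or 'exporter'")
    element = ET.Element(kind, {"name": name, "path": path})
    for key, value in (properties or {}).items():
        # Each property becomes <property key="KEY">VALUE</property>.
        prop = ET.SubElement(element, "property", {"key": key})
        prop.text = str(value)
    return element


importer = module_element("importer", "PTBImporter", "PATH_TO_CORPUS",
                          {"handleSlashTokens": "false"})
print(ET.tostring(importer, encoding="unicode"))
```

The printed element still has to be placed inside a complete workflow description file, whose overall layout is described on the Pepper project site.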

Contribute

Since this Pepper module is published under a free license, feel free to fork it on GitHub and improve it. If you think that others can benefit from your improvements, don't hesitate to open a pull request so that your changes can be merged. If you have found a bug or have a feature request, please open an issue on GitHub. If you need any help, please write an e-mail to saltnpepper@lists.hu-berlin.de .

Funders

This project has been funded by the department of corpus linguistics and morphology of the Humboldt-Universität zu Berlin and the Sonderforschungsbereich 632.

License

Copyright 2014 Humboldt-Universität zu Berlin.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

PTBImporter

This module imports text files in the PTB format into a Salt corpus.

Each PTB format text file is mapped to a single Salt document. Multiple files in a folder are interpreted as a corpus or subcorpus containing those documents. A folder hierarchy can be used to represent a corpus with multiple subcorpora. The documents in each subcorpus are text files with one of the expected extensions .txt, .ptb or .mrg. A single PTB text document for import has the following form:

(S  
      (PP (IN In)  
        (NP (JJ American) (NN romance) )) 
      (, ,)  
      (NP-SBJ-2 (RB almost) (NN nothing) ) 
      (VP (VBZ rates)  
        (S  
          (NP-SBJ (-NONE- *-2) ) 
          (ADJP-PRD  
            (ADJP (JJR higher) ) 
            (PP (IN than)  
              (SBAR-NOM  
                (WHNP-1 (WP what) ) 
                (S  
                  (NP-SBJ (DT the) (NN movie) (NNS men) ) 
                  (VP (VB have)  
                    (VP (VBN called)  
                      (S  
                        (NP-SBJ (-NONE- *T*-1) ) 
                        (`` ``)  
                        (S-NOM-PRD  
                          (NP-SBJ (-NONE- *) ) 
                          (VP (NN meeting)  
                            (NP (JJ cute) ))) 
                        ('' '') ))))))))))));

Each opening bracket represents a deeper level of embedding in the tree hierarchy. Terminal nodes (i.e. tokens/word forms) are found in the innermost brackets, aligned with the closing bracket to the right. Their part-of-speech tag is aligned to the left, e.g. in the token "(VB have)", which stands for the word "have" with the part-of-speech tag "VB".

Sentences can be indented as in the example above for easier human readability, but a one-line-per-sentence format is also supported: white space between tree nodes is completely ignored. The beginning of a sentence is recognized wherever an opening bracket is found, and the end of the sentence is detected when an equal number of opening and closing brackets has been found. Multiple sentences in one line are not supported.

In some PTB-style corpora, node 'functions' are designated after a separator in the node name, and these may optionally be interpreted as edge labels. For example, in "(NP-SBJ" above, the segment "-SBJ" signifies that this node is the subject of the sentence. This part of the node label can optionally be removed and used as an edge annotation for the incoming edge (see below).

Some corpora use a different notation for tokens, where they are not bracketed and the word form precedes the part-of-speech tag, from which it is separated by a slash (a.k.a. "atis-style"). For example:

(NP two/CD friends/NNS )

This PTB 'dialect' is also supported, and support for such tokens can be switched on or off using the properties below.
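
The bracket-counting idea described above can be sketched in a few lines of Python. This is not the module's actual implementation (which is written for Pepper in Java); the function names split_sentences and parse are made up for illustration, and unlike the importer this sketch does not care whether several sentences share a line.

```python
import re


def split_sentences(text):
    """Split PTB text into sentences by counting opening and closing brackets."""
    sentences = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a sentence begins at an opening bracket on depth 0
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:  # as many closing as opening brackets: sentence ends
                sentences.append(text[start:i + 1])
    return sentences


def parse(sentence):
    """Turn one bracketed sentence into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", sentence)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening bracket
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # word form or slash token
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return node()


text = "(S (NP (DT the) (NN movie)) (VP (VBZ rates)))\n(NP two/CD friends/NNS )"
for sentence in split_sentences(text):
    print(parse(sentence))
```

In this sketch a bracketed token such as (VB have) becomes the tuple ("VB", ["have"]), while slash tokens are kept as plain strings for later processing.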

Properties

The table contains an overview of all usable properties to customize the behaviour of this Pepper module. The following section contains a description of each property and describes the resulting differences in the mapping to the Salt model.

| Name of property | Type of property | optional/mandatory | default value |
|---|---|---|---|
| nodeNamespace | String | optional | ptb |
| posName | String | optional | pos |
| catName | String | optional | cat |
| edgeType | String | optional | edge |
| edgeAnnoSeparator | String | optional | - |
| edgeAnnoNamespace | String | optional | ptb |
| edgeAnnoName | String | optional | func |
| importEdgeAnnos | Boolean | optional | true |
| handleSlashTokens | Boolean | optional | true |

nodeNamespace

nodeNamespace=ptb Determines the name of the Salt layer assigned to tree nodes on import.

posName

posName=pos Name of the part-of-speech annotation for PTB tokens, e.g. 'pos'.

catName

catName=cat Name of category annotation for PTB non-terminal nodes, e.g. 'cat'.

edgeType

edgeType=edge Name of edge type for PTB dominance edges, e.g. 'edge'.

edgeAnnoSeparator

edgeAnnoSeparator=- Separator character for edge labels following node annotation, e.g. the '-' in (NP-subj (....

edgeAnnoNamespace

edgeAnnoNamespace=ptb Namespace for PTB edge annotations (represented within a node label after a separator), e.g. 'ptb'.

edgeAnnoName

edgeAnnoName=func Name of the annotation on PTB dominance edges, e.g. 'func'.

importEdgeAnnos

importEdgeAnnos=true Boolean, whether to look for edge annotations after a separator.
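
How the edge-annotation properties interact can be illustrated with a small sketch. It is not the importer's code: the function split_node_label is hypothetical, and splitting at the first separator (so that "NP-SBJ-2" yields the function "SBJ-2") is an assumption made here, since this README does not specify how labels with several separators are handled.

```python
def split_node_label(label, separator="-", import_edge_annos=True):
    """Return (category, function) for a non-terminal label such as 'NP-SBJ'."""
    if not import_edge_annos or not separator:
        return label, None
    # Search from index 1 so that a label starting with the separator
    # is not split at position 0.
    pos = label.find(separator, 1)
    if pos == -1:
        return label, None
    return label[:pos], label[pos + len(separator):]


print(split_node_label("NP-SBJ"))                           # category and function
print(split_node_label("S"))                                # no function present
print(split_node_label("NP-SBJ", import_edge_annos=False))  # label kept whole
```

In terms of the properties above, the first element would be stored under the annotation named by catName and the second as an edge annotation named by edgeAnnoName in the namespace edgeAnnoNamespace.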

handleSlashTokens

handleSlashTokens=true Boolean, whether to handle Penn atis-style tokens, which are not bracketed and separate the word form from the pos tag with a slash, e.g.: (NP two/CD friends/NNS ).
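
A minimal sketch of reading such slash tokens follows. The helper names are invented for this example, and splitting at the last slash (so that a word form containing a slash stays intact) is an assumption, not documented behaviour of the importer.

```python
def parse_slash_token(token):
    """Split 'friends/NNS' into (word, pos) at the last slash; None if not a slash token."""
    word, sep, pos = token.rpartition("/")
    if not sep or not word or not pos:
        return None
    return word, pos


def tokens_in_flat_node(node_text):
    """Read a flat node such as '(NP two/CD friends/NNS )' into (category, tokens)."""
    items = node_text.strip()[1:-1].split()
    category, raw_tokens = items[0], items[1:]
    return category, [parse_slash_token(t) for t in raw_tokens]


print(tokens_in_flat_node("(NP two/CD friends/NNS )"))
```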

PTBExporter

This module exports text files in the PTB format from a Salt representation.

Each Salt document is mapped to a single PTB text file. Multiple documents in subcorpora are interpreted as a folder structure with multiple text files in each subcorpus folder. A single PTB text document generated by the exporter might look like this:

(S  
      (PP (IN In)  
        (NP (JJ American) (NN romance) )) 
      (, ,)  
      (NP-SBJ-2 (RB almost) (NN nothing) ) 
      (VP (VBZ rates)  
        (S  
          (NP-SBJ (-NONE- *-2) ) 
          (ADJP-PRD  
            (ADJP (JJR higher) ) 
            (PP (IN than)  
              (SBAR-NOM  
                (WHNP-1 (WP what) ) 
                (S  
                  (NP-SBJ (DT the) (NN movie) (NNS men) ) 
                  (VP (VB have)  
                    (VP (VBN called)  
                      (S  
                        (NP-SBJ (-NONE- *T*-1) ) 
                        (`` ``)  
                        (S-NOM-PRD  
                          (NP-SBJ (-NONE- *) ) 
                          (VP (NN meeting)  
                            (NP (JJ cute) ))) 
                        ('' '') ))))))))))));

The Salt document graph is searched for tokens and the hierarchical dominance relations above them. The tree of dominance relations is realized as a nested bracket structure, as shown above. Tokens (i.e. word forms) are found in the innermost brackets, aligned with the closing bracket to the right. Their part-of-speech tag is aligned to the left, e.g. in the token "(VB have)", which stands for the word "have" with the part-of-speech tag "VB".

In some PTB-style corpora, node 'functions' are designated after a separator in the node name, and these may optionally be interpreted as edge labels. For example, in "(NP-SBJ" above, the segment "-SBJ" signifies that this node is the subject of the sentence. This part of the node label can be generated from a specified edge annotation on the incoming edge, using a specified separator (the '-' in this example; see properties below).

Some corpora use a different notation for tokens, where they are not bracketed and the word form precedes the part-of-speech, which is separated by a slash (a.k.a. "atis-style"). For example:

(NP two/CD friends/NNS )

This PTB 'dialect' is also supported, and generation of such tokens can be switched on or off using the properties below.

Properties

The table contains an overview of all usable properties to customize the behaviour of this Pepper module. The following section contains a description of each property and describes the resulting differences in the mapping from the Salt model.

| Name of property | Type of property | optional/mandatory | default value |
|---|---|---|---|
| nodeNamespace | String | optional | ptb |
| posName | String | optional | pos |
| catName | String | optional | cat |
| edgeType | String | optional | edge |
| edgeAnnoSeparator | String | optional | - |
| edgeAnnoNamespace | String | optional | ptb |
| edgeAnnoName | String | optional | func |
| exportEdgeAnnos | Boolean | optional | true |
| handleSlashTokens | Boolean | optional | false |

nodeNamespace

nodeNamespace=ptb Name of namespace for nodes to export and their annotations, e.g. 'ptb'. Only nodes within this layer name in Salt will be exported.

posName

posName=pos Name of the part-of-speech annotation for tokens, e.g. 'pos'. Only this annotation name will be used to generate pos labels for the tokens.

catName

catName=cat Name of category annotation for non-terminal nodes to be exported, e.g. 'cat'. Only this annotation name will be used to generate non-terminal node labels.

edgeType

edgeType=edge Name of edge type for dominance edges to be exported, e.g. 'edge'. Only this edge type will be taken to generate the PTB bracket structure.

edgeAnnoSeparator

edgeAnnoSeparator=- Separator character for edge labels following node annotation, e.g. the '-' in (NP-subj (....

edgeAnnoNamespace

edgeAnnoNamespace=ptb Namespace for edge annotations to be exported (represented within a node label after a separator), e.g. 'ptb'.

edgeAnnoName

edgeAnnoName=func Name of dominance edge annotation name to be exported, e.g. 'func'.

exportEdgeAnnos

exportEdgeAnnos=true Boolean, whether to output edge annotations after a separator.

handleSlashTokens

handleSlashTokens=false Boolean, whether to create Penn atis-style tokens, which are not bracketed and separate the word form from the pos tag with a slash, e.g.: (NP two/CD friends/NNS ).
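
The effect of the exporter properties on the generated brackets can be sketched as follows. This is an illustration under simplifying assumptions, not the exporter's implementation: the Token and Node classes and the render function are hypothetical stand-ins for the Salt graph, and the output is written on one line without indentation.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    pos: str
    word: str


class Node(NamedTuple):
    cat: str
    func: Optional[str]  # annotation on the incoming dominance edge, if any
    children: list


def render(node, separator="-", export_edge_annos=True, handle_slash_tokens=False):
    """Render a tree of Node/Token objects as a PTB bracket string."""
    if isinstance(node, Token):
        if handle_slash_tokens:
            return f"{node.word}/{node.pos}"  # atis-style token
        return f"({node.pos} {node.word})"    # bracketed token
    label = node.cat
    if export_edge_annos and node.func:
        label = f"{node.cat}{separator}{node.func}"
    parts = [render(child, separator, export_edge_annos, handle_slash_tokens)
             for child in node.children]
    return "(" + label + " " + " ".join(parts) + ")"


tree = Node("NP", "SBJ", [Token("DT", "the"), Token("NN", "movie"), Token("NNS", "men")])
print(render(tree))
print(render(tree, export_edge_annos=False))
print(render(tree, handle_slash_tokens=True))
```

The three calls correspond to the default settings, exportEdgeAnnos=false and handleSlashTokens=true respectively.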
