pepperModules-PTBModules

This project provides an importer and an exporter to support the Penn Treebank Format (PTB) for the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper). A detailed description of the importer can be found in the section PTBImporter, and one for the exporter in the section PTBExporter.

Pepper is a pluggable framework to convert a variety of linguistic formats (like TigerXML, the EXMARaLDA format, PAULA etc.) into each other. Furthermore, Pepper uses Salt (see https://github.com/korpling/salt), the graph-based meta model for linguistic data, which acts as an intermediate model to reduce the number of mappings to be implemented. That means converting data from a format A to a format B consists of two steps: first the data is mapped from format A to Salt, and second from Salt to format B. This detour reduces the number of Pepper modules needed to handle n formats from n²-n (in the case of a direct mapping) to 2n.
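The module counts can be checked with a few lines of arithmetic. The following sketch is purely illustrative; the function names are hypothetical and not part of Pepper.

```python
def direct_mappings(n: int) -> int:
    """Converters needed if every format is mapped directly to every other format."""
    return n * n - n


def salt_mappings(n: int) -> int:
    """Modules needed if every format has one importer and one exporter via Salt."""
    return 2 * n


if __name__ == "__main__":
    for n in (3, 5, 10):
        # prints the number of formats, then both module counts
        print(n, direct_mappings(n), salt_mappings(n))
```

For three formats both approaches need six modules; for any larger number of formats the intermediate model needs fewer.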

In Pepper there are three different types of modules:

importers (to map a format A to a Salt model)
manipulators (to map a Salt model to a Salt model, e.g. to add additional annotations, to rename things, to merge data etc.)
exporters (to map a Salt model to a format B).

For a simple Pepper workflow you need at least one importer and one exporter.

Requirements

Since this module is a plugin for Pepper, you need an instance of the Pepper framework. If you do not already have a running Pepper instance, download the latest stable version (not a SNAPSHOT) from the Pepper project site (see https://u.hu-berlin.de/saltnpepper).

Note: Pepper is a Java-based program, so you need at least Java 7 (JRE or JDK) on your system. You can download Java from https://www.oracle.com/java/index.html or http://openjdk.java.net/.

Install module

If this Pepper module is not yet contained in your Pepper distribution, you can easily install it. Just open a command line and enter one of the following program calls:

Windows

pepperStart.bat

Linux/Unix

bash pepperStart.sh

Then type in the command "is" and the path from which to install the module:

pepper> update de.hu_berlin.german.korpling.saltnpepper::pepperModules-pepperModules-PTBModules::https://korpling.german.hu-berlin.de/maven2/

Usage

To use this module in your Pepper workflow, put the following lines into the workflow description file. Note the fixed order of XML elements in the workflow description file: <importer/>, <manipulator/>, <exporter/>. The PTBImporter is an importer module and the PTBExporter is an exporter module; each can be addressed by one of the following alternatives. A detailed description of the Pepper workflow can be found on the Pepper project site.

a) Identify the module by name

<importer name="PTBImporter" path="PATH_TO_CORPUS"/>

or

<exporter name="PTBExporter" path="PATH_TO_CORPUS"/>

b) Identify the module by formats

<importer formatName="PTB" formatVersion="1.0" path="PATH_TO_CORPUS"/>

or

<exporter formatName="PTB" formatVersion="1.0" path="PATH_TO_CORPUS"/>

c) Use properties

<importer name="PTBImporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</importer>

or

<exporter name="PTBExporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</exporter>
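If workflow files are generated by a script, the importer and exporter elements shown above can be assembled programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper name `module_element` is hypothetical, and only element and attribute names that appear in the examples above are used. The enclosing root element of the workflow file is not covered here; see the Pepper documentation for it.

```python
import xml.etree.ElementTree as ET


def module_element(tag, name, path, properties=None):
    """Build an <importer/> or <exporter/> element with optional <property> children."""
    element = ET.Element(tag, {"name": name, "path": path})
    for key, value in (properties or {}).items():
        prop = ET.SubElement(element, "property", {"key": key})
        prop.text = str(value)
    return element


# handleSlashTokens is one of the importer properties described below.
importer = module_element("importer", "PTBImporter", "PATH_TO_CORPUS",
                          {"handleSlashTokens": "false"})
exporter = module_element("exporter", "PTBExporter", "PATH_TO_CORPUS")

print(ET.tostring(importer, encoding="unicode"))
print(ET.tostring(exporter, encoding="unicode"))
```

When writing the file, keep the fixed element order noted above: importers first, then manipulators, then exporters.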

Contribute

Since this Pepper module is under a free license, please feel free to fork it from GitHub and improve the module. If you think that others can benefit from your improvements, don't hesitate to make a pull request so that your changes can be merged. If you have found any bugs or have a feature request, please open an issue on GitHub. If you need any help, please write an e-mail to saltnpepper@lists.hu-berlin.de.

Funders

This project has been funded by the department of corpus linguistics and morphology of the Humboldt-Universität zu Berlin and the Sonderforschungsbereich 632.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

PTBImporter

This module imports text files in the PTB format into a Salt corpus.

Each PTB format text file is mapped to a single Salt document. Multiple files in a folder are interpreted as a corpus or subcorpus containing those documents. It is possible to have a folder hierarchy corresponding to a corpus with multiple subcorpora. The documents in each subcorpus are text files with one of the expected extensions .txt, .ptb or .mrg. A single PTB text document for import has the following form:

(S  
      (PP (IN In)  
        (NP (JJ American) (NN romance) )) 
      (, ,)  
      (NP-SBJ-2 (RB almost) (NN nothing) ) 
      (VP (VBZ rates)  
        (S  
          (NP-SBJ (-NONE- *-2) ) 
          (ADJP-PRD  
            (ADJP (JJR higher) ) 
            (PP (IN than)  
              (SBAR-NOM  
                (WHNP-1 (WP what) ) 
                (S  
                  (NP-SBJ (DT the) (NN movie) (NNS men) ) 
                  (VP (VB have)  
                    (VP (VBN called)  
                      (S  
                        (NP-SBJ (-NONE- *T*-1) ) 
                        (`` ``)  
                        (S-NOM-PRD  
                          (NP-SBJ (-NONE- *) ) 
                          (VP (NN meeting)  
                            (NP (JJ cute) ))) 
                        ('' '') ))))))))))));

Each opening bracket represents deeper embedding in the tree hierarchy. Terminal nodes (i.e. tokens/word forms) are found in the innermost brackets, aligned with the closing bracket to the right. Their part-of-speech tag is aligned to the left, e.g. in the token "(VB have)", which stands for the word "have" with the part-of-speech "VB".

Sentences can be indented as in the example above for easier human readability, but a one-line-per-sentence format is also supported: white space between tree nodes is completely ignored. A sentence beginning is recognized wherever an opening bracket is found, and the end of the sentence is detected when an equal number of opening and closing brackets has been found. Multiple sentences in one line are not supported.

In some PTB style corpora, node 'functions' are designated after a separator in the node name, and these may optionally be interpreted as edge labels. For example, in "(NP-SBJ" above, the segment "-SBJ" signifies that this node is the subject of the sentence. This part of the node label can optionally be removed and used as an edge annotation for the incoming edge (see the properties below).

Some corpora have a different notation for tokens, where they are not bracketed and the word form precedes the part-of-speech, which is separated by a slash (a.k.a. "atis-style"). For example:

(NP two/CD friends/NNS )

This PTB 'dialect' is also supported, and support for such tokens can be switched on or off using the properties below.
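The bracket-counting rule for sentence boundaries described above can be sketched in a few lines. This illustrates the idea only and is not the importer's actual implementation: the function name `split_sentences` is hypothetical, the sketch assumes that brackets never occur as literal word forms, and unlike the importer it does not care how sentences are distributed over lines.

```python
def split_sentences(text):
    """Split PTB text into sentences by balancing opening and closing brackets."""
    sentences = []
    depth = 0
    start = None
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index  # an opening bracket at depth 0 begins a sentence
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # opening and closing brackets are balanced: the sentence ends here
                chunk = text[start:index + 1]
                sentences.append(" ".join(chunk.split()))  # normalise white space
    return sentences


sample = """(S (NP (DT the) (NN movie) )
   (VP (VBZ rates) ))
(NP two/CD friends/NNS )"""

for sentence in split_sentences(sample):
    print(sentence)
```

Because white space between nodes carries no meaning, the sketch collapses indentation and line breaks to single spaces before returning each sentence.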

Properties

The following table gives an overview of all properties that can be used to customize the behaviour of this Pepper module. The sections after it describe each property and the resulting differences in the mapping to the Salt model.

| Name of property | Type of property | optional/mandatory | default value |
|---|---|---|---|
| nodeNamespace | String | optional | ptb |
| posName | String | optional | pos |
| catName | String | optional | cat |
| edgeType | String | optional | edge |
| edgeAnnoSeparator | String | optional | - |
| edgeAnnoNamespace | String | optional | ptb |
| edgeAnnoName | String | optional | func |
| importEdgeAnnos | Boolean | optional | true |
| handleSlashTokens | Boolean | optional | true |

nodeNamespace

nodeNamespace=ptb Determines the name of the Salt layer assigned to tree nodes on import.

posName

posName=pos Name of the part-of-speech annotation for PTB tokens, e.g. 'pos'.

catName

catName=cat Name of category annotation for PTB non-terminal nodes, e.g. 'cat'.

edgeType

edgeType=edge Name of edge type for PTB dominance edges, e.g. 'edge'.

edgeAnnoSeparator

edgeAnnoSeparator=- Separator character for edge labels following node annotation, e.g. the '-' in (NP-subj (....
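To make the role of the separator concrete, the following sketch splits a node label into a category and a function part. It is an illustration under stated assumptions, not the importer's code: the function name `split_label` is hypothetical, and splitting at the first occurrence of the separator is an assumption, since this document does not say how labels with several separators (such as "NP-SBJ-2") are treated.

```python
def split_label(label, separator="-"):
    """Split a node label such as 'NP-SBJ' into (category, function)."""
    # Assumption: the label is split at the first occurrence of the separator.
    category, found, function = label.partition(separator)
    if not found:
        return label, None  # no separator, so there is no function part
    return category, function


for label in ("NP-SBJ", "ADJP-PRD", "VP"):
    print(label, split_label(label))
```

With the default property values, the function part would become an edge annotation named 'func' in the namespace 'ptb' on the incoming edge, and the category part would remain as the 'cat' annotation of the node.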

edgeAnnoNamespace

edgeAnnoNamespace=ptb Namespace for PTB edge annotations (represented within a node label after a separator), e.g. 'ptb'.

edgeAnnoName

edgeAnnoName=func Name of the annotation on PTB dominance edges, e.g. 'func'.

importEdgeAnnos

importEdgeAnnos=true Boolean, whether to look for edge annotations after a separator.

handleSlashTokens

handleSlashTokens=true Boolean, whether to handle Penn atis-style tokens, which are not bracketed and separate the pos tag from the word form with a slash, e.g.: (NP two/CD friends/NNS ).
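As an illustration of what handling atis-style tokens involves, the sketch below parses a flat phrase with slash tokens. The function names are hypothetical, and splitting at the last slash (so that word forms containing a slash stay intact) is an assumption of this sketch, not documented behaviour of the importer.

```python
def parse_slash_token(token):
    """Split an atis-style token such as 'two/CD' into (word form, pos tag)."""
    # Assumption: the pos tag follows the last slash in the token.
    word, slash, pos = token.rpartition("/")
    if not slash:
        raise ValueError(f"not a slash token: {token!r}")
    return word, pos


def parse_flat_phrase(phrase):
    """Parse a flat phrase like '(NP two/CD friends/NNS )' into (label, tokens)."""
    parts = phrase.strip().strip("()").split()
    label, tokens = parts[0], parts[1:]
    return label, [parse_slash_token(token) for token in tokens]


print(parse_flat_phrase("(NP two/CD friends/NNS )"))
```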

PTBExporter

This module exports text files in the PTB format from a Salt representation.

Each Salt document is mapped to a single PTB text file. Multiple documents in subcorpora are interpreted as a folder structure with multiple text files in each subcorpus folder. A single PTB text document generated by the exporter might look like this:

(S  
      (PP (IN In)  
        (NP (JJ American) (NN romance) )) 
      (, ,)  
      (NP-SBJ-2 (RB almost) (NN nothing) ) 
      (VP (VBZ rates)  
        (S  
          (NP-SBJ (-NONE- *-2) ) 
          (ADJP-PRD  
            (ADJP (JJR higher) ) 
            (PP (IN than)  
              (SBAR-NOM  
                (WHNP-1 (WP what) ) 
                (S  
                  (NP-SBJ (DT the) (NN movie) (NNS men) ) 
                  (VP (VB have)  
                    (VP (VBN called)  
                      (S  
                        (NP-SBJ (-NONE- *T*-1) ) 
                        (`` ``)  
                        (S-NOM-PRD  
                          (NP-SBJ (-NONE- *) ) 
                          (VP (NN meeting)  
                            (NP (JJ cute) ))) 
                        ('' '') ))))))))))));

The Salt document graph is searched for tokens and the hierarchical dominance relations above them. The tree of dominance relations is realized as a nested bracket structure, as shown above. Tokens (i.e. word forms) are found in the innermost brackets, aligned with the closing bracket to the right. Their part-of-speech tag is aligned to the left, e.g. in the token "(VB have)", which stands for the word "have" with the part-of-speech "VB".

In some PTB style corpora, node 'functions' are designated after a separator in the node name, and these may optionally be interpreted as edge labels. For example, in "(NP-SBJ" above, the segment "-SBJ" signifies that this node is the subject of the sentence. This part of the node label can be generated from a specified edge annotation for the incoming edge with a specified separator (the '-' in this example; see the properties below).
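The way a dominance tree turns into nested brackets, including the function suffix, can be sketched as a short recursive function. This is a simplified illustration, not the exporter's implementation: the tuple-based tree representation and the name `to_brackets` are hypothetical, and the white space layout produced by the exporter may differ.

```python
def to_brackets(node, separator="-"):
    """Render a nested tree as a PTB bracket string.

    Tokens are (pos tag, word form) pairs; non-terminals are
    (category, function or None, list of children) triples.
    """
    if isinstance(node[-1], str):
        pos, word = node  # token: innermost bracket
        return f"({pos} {word})"
    category, function, children = node
    # The function comes from the annotation on the incoming edge, if any.
    label = category if function is None else f"{category}{separator}{function}"
    inner = " ".join(to_brackets(child, separator) for child in children)
    return f"({label} {inner})"


tree = ("S", None, [
    ("NP", "SBJ", [("DT", "the"), ("NN", "movie")]),
    ("VP", None, [("VBZ", "rates")]),
])
print(to_brackets(tree))
# (S (NP-SBJ (DT the) (NN movie)) (VP (VBZ rates)))
```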

Some corpora use a different notation for tokens, where they are not bracketed and the word form precedes the part-of-speech, which is separated by a slash (a.k.a. "atis-style"). For example:

(NP two/CD friends/NNS )

This PTB 'dialect' is also supported, and generation of such tokens can be switched on or off using the properties below.

Properties

The following table gives an overview of all properties that can be used to customize the behaviour of this Pepper module. The sections after it describe each property and the resulting differences in the mapping from the Salt model.

| Name of property | Type of property | optional/mandatory | default value |
|---|---|---|---|
| nodeNamespace | String | optional | ptb |
| posName | String | optional | pos |
| catName | String | optional | cat |
| edgeType | String | optional | edge |
| edgeAnnoSeparator | String | optional | - |
| edgeAnnoNamespace | String | optional | ptb |
| edgeAnnoName | String | optional | func |
| exportEdgeAnnos | Boolean | optional | true |
| handleSlashTokens | Boolean | optional | false |

nodeNamespace

nodeNamespace=ptb Name of the namespace for nodes to export and their annotations, e.g. 'ptb'. Only nodes within the Salt layer of this name will be exported.

posName

posName=pos Name of the part-of-speech annotation for tokens, e.g. 'pos'. Only this annotation name will be used to generate pos labels for the tokens.

catName

catName=cat Name of the category annotation for non-terminal nodes to be exported, e.g. 'cat'. Only this annotation name will be used to generate non-terminal node labels.

edgeType

edgeType=edge Name of edge type for dominance edges to be exported, e.g. 'edge'. Only this edge type will be taken to generate the PTB bracket structure.

edgeAnnoSeparator

edgeAnnoSeparator=- Separator character for edge labels following node annotation, e.g. the '-' in (NP-subj (....

edgeAnnoNamespace

edgeAnnoNamespace=ptb Namespace for edge annotations to be exported (represented within a node label after a separator), e.g. 'ptb'.

edgeAnnoName

edgeAnnoName=func Name of the dominance edge annotation to be exported, e.g. 'func'.

exportEdgeAnnos

exportEdgeAnnos=true Boolean, whether to output edge annotations after a separator.

handleSlashTokens

handleSlashTokens=false Boolean, whether to create Penn atis-style tokens, which are not bracketed and separate the pos tag from the word form with a slash, e.g.: (NP two/CD friends/NNS ).
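The difference between the two token notations on export can be illustrated as follows. The function name `format_token` and its parameter are hypothetical and serve only to show the two output shapes; they are not the exporter's API.

```python
def format_token(word, pos, handle_slash_tokens=False):
    """Render a token in bracketed form or, optionally, in atis-style slash form."""
    if handle_slash_tokens:
        return f"{word}/{pos}"
    return f"({pos} {word})"


tokens = [("two", "CD"), ("friends", "NNS")]
for flag in (False, True):
    rendered = " ".join(format_token(word, pos, flag) for word, pos in tokens)
    print(f"(NP {rendered} )")
```

The first printed line uses the default bracketed tokens; the second uses the atis-style notation.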
