This project provides an importer for the linguistic converter framework Pepper (see http://corpus-tools.org/pepper/) to support the EXCEL format.

SaltNPepper project

pepperModules-SpreadsheetModules

This project provides an importer to support the Excel format in the linguistic converter framework Pepper (see https://u.hu-berlin.de/saltnpepper). A detailed description of the importer can be found in the section SpreadsheetImporter.

Pepper is a pluggable framework to convert a variety of linguistic formats (like TigerXML, the EXMARaLDA format, PAULA etc.) into each other. Furthermore, Pepper uses Salt (see https://github.com/korpling/salt), the graph-based meta model for linguistic data, which acts as an intermediate model to reduce the number of mappings to be implemented. That means converting data from a format A to a format B consists of two steps: first the data is mapped from format A to Salt, and second from Salt to format B. This detour reduces the number of Pepper modules needed to handle n formats from n² - n (in the case of direct mappings) to 2n.
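To make the module count concrete, here is a small Python sketch of the arithmetic from the paragraph above. The function names are made up for this illustration and are not part of Pepper.

```python
def direct_mappings(n: int) -> int:
    """Converters needed if each of n formats maps directly to every other format."""
    return n * n - n


def mappings_via_salt(n: int) -> int:
    """Converters needed with Salt as intermediate model: one importer and one exporter per format."""
    return 2 * n


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(n, direct_mappings(n), mappings_via_salt(n))
```

For three formats both approaches need six modules; the saving starts with the fourth format and grows quadratically from there.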

n:n mappings via SaltNPepper

In Pepper there are three different types of modules:

  • importers (to map a format A to a Salt model)
  • manipulators (to map a Salt model to a Salt model, e.g. to add additional annotations, to rename things, to merge data etc.)
  • exporters (to map a Salt model to a format B).

For a simple Pepper workflow you need at least one importer and one exporter.

Requirements

Since this module is a plugin for Pepper, you need an instance of the Pepper framework. If you do not already have a running Pepper instance, download the latest stable version (not a SNAPSHOT) from the Pepper website (http://corpus-tools.org/pepper/).

Note: Pepper is a Java-based program, so you need at least Java 7 (JRE or JDK) on your system. You can download Java from https://www.oracle.com/java/index.html or http://openjdk.java.net/ .

Install module

If this Pepper module is not yet contained in your Pepper distribution, you can easily install it. Just open a command line and enter one of the following program calls:

Windows

pepperStart.bat 

Linux/Unix

bash pepperStart.sh 

Then type in the command and the path from which to install the module:

pepper> update org.corpus-tools::pepperModules-SpreadsheetModules::https://korpling.german.hu-berlin.de/maven2/

Usage

To use this module in your Pepper workflow, put the following lines into the workflow description file. Note the fixed order of XML elements in the workflow description file: <importer/>, <manipulator/>, <exporter/>. The SpreadsheetImporter is an importer module, which can be addressed by one of the following alternatives. A detailed description of the Pepper workflow can be found on the Pepper project site.

a) Identify the module by name

<importer name="SpreadsheetImporter" path="PATH_TO_CORPUS"/>

b) Identify the module by formats

<importer formatName="xls" formatVersion="1.0" path="PATH_TO_CORPUS"/>

or

<importer formatName="xlsx" formatVersion="1.0" path="PATH_TO_CORPUS"/>

c) Use properties

<importer name="SpreadsheetImporter" path="PATH_TO_CORPUS">
  <property key="PROPERTY_NAME">PROPERTY_VALUE</property>
</importer>

SpreadsheetImporter

This section explains the mapping of a Spreadsheet model to a Salt model. Since there are some conceptual differences between both models, we need to bridge the Spreadsheet model to the graph-based Salt model.

Corpus information and meta data

While a file such as an Excel file can contain many different sheets, by default the first sheet is interpreted as the sheet that holds the corpus information, and the second sheet is interpreted as the sheet that holds the meta data of the given document. You can change those settings with the properties 'corpusSheet', 'metaSheet' and 'metaAnnotation' (see Properties for further information).

Primary text, tokenization and the timeline

Since in Salt tokens are the anchor of all higher structures and annotations, we need to identify which column of a corpus sheet represents the primary text and its tokenization. This is handled by the property "primText" (see Properties for further information). Each column whose name matches a value you give to "primText" is used to create a primary text in Salt. By default (if you don't use the property "primText"), a column named "tok" is interpreted as the primary text. Imagine the following Spreadsheet data:

| tok | anno1 | anno2 |
|---|---|---|
| This | a11 | a21 |
| is | a12 | a22 |
| a | a13 | a23 |
| sample | a14 | a24 |
| text | a15 | a25 |
| . | a16 | a26 |

Without additional property settings, this sample contains one primary text, "This is a sample text .", since the default primary text column is named "tok". For each cell in such a column, a token is created in Salt. That means for our sample, the Salt model contains exactly 6 tokens.
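The default case can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's implementation: the function name `build_primary_text` and the row representation are assumptions, and joining the cells with a single blank is inferred from the sample text above.

```python
# The sample sheet, one dict per spreadsheet row (hypothetical representation).
rows = [
    {"tok": "This", "anno1": "a11", "anno2": "a21"},
    {"tok": "is", "anno1": "a12", "anno2": "a22"},
    {"tok": "a", "anno1": "a13", "anno2": "a23"},
    {"tok": "sample", "anno1": "a14", "anno2": "a24"},
    {"tok": "text", "anno1": "a15", "anno2": "a25"},
    {"tok": ".", "anno1": "a16", "anno2": "a26"},
]


def build_primary_text(rows, column="tok"):
    """Join the non-empty cells of one column and record one token span per cell."""
    parts = []
    tokens = []  # (start, end) character offsets into the primary text
    position = 0
    for row in rows:
        cell = row.get(column, "")
        if not cell:
            continue  # assumption: empty cells create no token
        if parts:
            position += 1  # account for the blank between two cells
        tokens.append((position, position + len(cell)))
        parts.append(cell)
        position += len(cell)
    return " ".join(parts), tokens


if __name__ == "__main__":
    text, tokens = build_primary_text(rows)
    print(text)
    print(len(tokens), "tokens")
```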

Furthermore imagine the following Spreadsheet data:

| prim1 | primNorm | anno1 |
|---|---|---|
| This | This | a1 |
| is | is | a2 |
| an | an | a3 |
| ex- | example | a4 |
| ample | | a5 |
| . | . | a6 |

With the property "primText" set to "prim1, primNorm", this sample contains two primary texts: "This is an ex- ample ." and "This is an example .". For each cell in both of these tiers, a token is created in Salt. That means for this sample, the Salt model contains exactly 11 tokens: 6 tokens connected to the first primary text "prim1" and 5 tokens connected to the second primary text "primNorm". To bring the tokens of both primary texts into a relation, the line number of the Spreadsheet model is mapped to a timeline in the Salt model.
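The following Python sketch illustrates how row numbers can serve as a shared timeline for both primary texts. It is not the module's code. It assumes that the cell below "example" is part of a merged cell (represented here as `None`), which is what makes "example" a single token spanning two rows; the function name `tokens_on_timeline` is made up for this example.

```python
header = ["prim1", "primNorm", "anno1"]
rows = [
    ["This", "This", "a1"],
    ["is", "is", "a2"],
    ["an", "an", "a3"],
    ["ex-", "example", "a4"],
    ["ample", None, "a5"],  # None: covered by the merged "example" cell above (assumption)
    [".", ".", "a6"],
]


def tokens_on_timeline(header, rows, column):
    """Return (text, start, end) triples; start and end are row-based timeline points."""
    index = header.index(column)
    tokens = []
    for line, row in enumerate(rows):
        cell = row[index]
        if cell is None:
            if tokens:
                # Continuation of a merged cell: stretch the previous token by one row.
                text, start, _ = tokens[-1]
                tokens[-1] = (text, start, line + 1)
            continue
        tokens.append((cell, line, line + 1))
    return tokens


if __name__ == "__main__":
    for column in ("prim1", "primNorm"):
        print(column, tokens_on_timeline(header, rows, column))
```

In this representation "ex-" covers timeline points 3 to 4 and "ample" covers 4 to 5, while "example" covers 3 to 5, so the two tokenizations can be aligned through the timeline.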

Annotations

Like tokens, annotations are modeled as columns in a spreadsheet, as shown in the following sample:

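| prim1 | primNorm | anno1 | anno2 |
|---|---|---|---|
| This | This | a11 | a21 |
| is | is | a12 | a22 |
| an | an | a13 | a23 |
| ex- | example | a14 | a24 |
| ample | | a15 | |
| . | . | a16 | a25 |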

Here 'anno1' and 'anno2' are annotation tiers, and 'prim1' and 'primNorm' are primary text tiers. If your corpus contains more than one primary text tier, as in the sample, it is not clear which primary text a given annotation relates to. Thus you need to specify for each annotation tier which primary text tier it relates to. This can be done in the annotation itself, by writing the primary text tier in square brackets after the annotation name, as in the following sample:

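| prim1 | primNorm | anno1[prim1] | anno2[primNorm] |
|---|---|---|---|
| This | This | a11 | a21 |
| is | is | a12 | a22 |
| an | an | a13 | a23 |
| ex- | example | a14 | a24 |
| ample | | a15 | |
| . | . | a16 | a25 |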

If your annotations do not contain those specifications, you can add them with the property 'annoPrimRel' without changing your original files (see Properties for further information). Please note that each annotation tier that is not related to a primary text will be ignored in the conversion process.
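As an illustration of the bracket convention, the Python sketch below splits column headers such as `anno1[prim1]` into an annotation name and a primary text tier and collects the tiers that would be ignored. The regular expression, the function name `split_header`, and the extra column `anno3` are assumptions for this example; the module's own parsing may differ in details such as whitespace handling.

```python
import re

# name, optionally followed by [primary text tier]
HEADER_PATTERN = re.compile(
    r"^\s*(?P<name>[^\[\]]+?)\s*(?:\[(?P<prim>[^\[\]]+)\])?\s*$"
)


def split_header(header):
    """Split 'anno1[prim1]' into ('anno1', 'prim1'); the tier is None if absent."""
    match = HEADER_PATTERN.match(header)
    if match is None:
        raise ValueError(f"unexpected column header: {header!r}")
    prim = match.group("prim")
    return match.group("name"), prim.strip() if prim else None


# 'anno3' is a hypothetical column without a primary text specification.
headers = ["prim1", "primNorm", "anno1[prim1]", "anno2[primNorm]", "anno3"]
primary_tiers = {"prim1", "primNorm"}

relations = {}  # annotation name -> primary text tier
ignored = []    # annotation tiers without a related primary text
for header in headers:
    name, prim = split_header(header)
    if name in primary_tiers:
        continue
    if prim is None:
        ignored.append(name)
    else:
        relations[name] = prim

if __name__ == "__main__":
    print(relations)
    print(ignored)
```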

Meta Annotations

The module currently supports meta annotations of a document only in a specific way. It is assumed that the first column of the sheet that holds the meta data contains the meta annotation names, while the second column holds the respective meta annotation values. All other columns will be ignored by the module.
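The expected layout of the meta data sheet can be illustrated with a short Python sketch. The sample rows and the function name `read_meta_rows` are invented for this example, and skipping entries that lack a value is an assumption; the module is only documented to print a warning in that case.

```python
def read_meta_rows(rows):
    """Read name/value pairs from the first two columns; ignore all other columns."""
    meta = {}
    without_value = []  # names that would trigger a warning
    for row in rows:
        name = (row[0] or "").strip() if len(row) > 0 else ""
        value = (row[1] or "").strip() if len(row) > 1 else ""
        if not name:
            continue
        if not value:
            without_value.append(name)  # assumption: entry is skipped
            continue
        meta[name] = value
    return meta, without_value


# Hypothetical content of a meta data sheet.
sample_rows = [
    ["author", "Jane Doe", "this third column is ignored"],
    ["year", "1850"],
    ["genre", ""],
]

if __name__ == "__main__":
    meta, without_value = read_meta_rows(sample_rows)
    print(meta)
    print("missing values for:", without_value)
```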

Properties

The table contains an overview of all usable properties to customize the behavior of this Pepper module.

| Name of property | Type of property | optional/mandatory | default value |
|---|---|---|---|
| corpusSheet | String | optional | [first sheet] |
| primText | primaryTier1, primaryTier2, ... | optional | tok |
| metaSheet | String | optional | [second sheet] |
| shortAnnoPrimRel | primaryText1={tier1, tier2, tier3}, primaryText2={tier4}, ... | optional | null |
| annoPrimRel | anno1=anno1[primaryTier1], anno2=anno2[primaryTier1], ... | optional | null |
| setLayer | categoryName={tier1, tier2, tier3}, categoryName2={tier4}, ... | optional | null |
| metaAnnotation | Boolean | optional | true |
| includeEmptyPrimCells | Boolean | optional | false |
| addOrderRelation | Boolean | optional | true |
| parseNamespace | Boolean | optional | false |

corpusSheet

With the property corpusSheet you can define the sheet that holds the actual corpus information. If you do not set this property, the first sheet will always be interpreted as the sheet that holds the primary text.

primText

With the property primText you can define the name of the column(s) that hold the primary text. This can either be a single column name:

primText="TIER_NAME"

or a comma-separated enumeration of column names:

primText="TIER_NAME1, TIER_NAME2, ..."

Please make sure that the given tier names are represented in the corpus. If you do not use this property, a column named 'tok' will be considered as the one that holds the primary text.

metaSheet

With the property metaSheet, you can define the sheet that holds the meta information of the document. If you do not set this property, the second sheet will always be interpreted as the sheet that holds the meta data of the document. Please note that the first column of this sheet must hold the meta annotation names, while the second column must contain the respective meta annotation values. If the sheet contains a meta annotation without a respective value, a warning message will be printed.

metaSheet="SHEET_NAME"

annoPrimRel

In multi-level corpora you need to specify which annotation tier refers to which primary text tier. To do so, either the annotation tier name in the spreadsheet itself is followed by the name of the primary text tier in square brackets (in this case you don't need this property), or you set this specification with the property annoPrimRel. A possible key-value set could be:

annoPrimRel="ANNOTATION_NAME=ANNOTATION_NAME[PRIMARY_TEXT1],ANNOTATION_NAME2=ANNOTATION_NAME2[PRIMARY_TEXT2],..."
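To show how such a value decomposes, here is a hedged Python sketch that turns an annoPrimRel string into a mapping from column name to primary text tier. The function name `parse_anno_prim_rel` and the exact whitespace tolerance are assumptions; the sketch also assumes that tier names contain no commas, equals signs or square brackets.

```python
import re

# COLUMN=NAME[PRIMARY_TEXT], with optional blanks around each part
ENTRY_PATTERN = re.compile(
    r"^\s*([^=\[\]]+?)\s*=\s*([^=\[\]]+?)\s*\[\s*([^\[\]]+?)\s*\]\s*$"
)


def parse_anno_prim_rel(value):
    """Map each annotation column to the primary text tier named in brackets."""
    relations = {}
    for entry in value.split(","):
        if not entry.strip():
            continue
        match = ENTRY_PATTERN.match(entry)
        if match is None:
            raise ValueError(f"malformed annoPrimRel entry: {entry!r}")
        column, _annotation_name, primary_text = match.groups()
        relations[column] = primary_text
    return relations


if __name__ == "__main__":
    print(parse_anno_prim_rel("anno1=anno1[prim1], anno2=anno2[primNorm]"))
```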

shortAnnoPrimRel

This property provides a shorter way to specify which annotation refers to which primary text. You write the primary text, followed by its annotations in curly braces:

shortAnnoPrimRel="PRIMARY_TEXT_NAME={ANNOTATION_NAME1,ANNOTATION_NAME2,...}"
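The grouped notation cannot simply be split at commas, because commas also separate the annotations inside the braces. The Python sketch below shows one way to read it and to invert it into the same annotation-to-primary-text mapping that annoPrimRel expresses. The function name and the regular expression are assumptions for illustration, not the module's parser.

```python
import re

# PRIMARY_TEXT={ANNOTATION1, ANNOTATION2, ...}
GROUP_PATTERN = re.compile(r"([^=,{}\s][^=,{}]*?)\s*=\s*\{([^{}]*)\}")


def parse_short_anno_prim_rel(value):
    """Return a dict mapping each annotation tier to its primary text tier."""
    relations = {}
    for match in GROUP_PATTERN.finditer(value):
        primary_text = match.group(1)
        for tier in match.group(2).split(","):
            tier = tier.strip()
            if tier:
                relations[tier] = primary_text
    return relations


if __name__ == "__main__":
    print(parse_short_anno_prim_rel("prim1={anno1}, primNorm={anno2}"))
```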

setLayer

Sometimes it is desirable to add linguistic categories to your corpus, e.g. to get a better overview. For this purpose you can use the property setLayer. Imagine the following sample:

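| prim1 | primNorm | anno1 | anno2 |
|---|---|---|---|
| This | This | a11 | a21 |
| is | is | a12 | a22 |
| an | an | a13 | a23 |
| ex- | example | a14 | a24 |
| ample | | a15 | |
| . | . | a16 | a25 |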

After mapping with the properties:

primText="prim1, primNorm"
annoPrimRel="anno1=anno1[prim1], anno2=anno2[primNorm]"
setLayer="transcription={prim1, primNorm}, morphology={anno1, anno2}"

prim1 and primNorm will be interpreted as primary texts, whereas anno1 is an annotation of prim1 and anno2 is an annotation of primNorm. Here we grouped the annotations of the tiers prim1 and primNorm into one SLayer object named transcription, and we grouped the annotations of the tiers anno1 and anno2 into another SLayer object named morphology.
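The combined effect of the three properties on the sample can be summarized per column, as in the following Python sketch. The property values are written out as already-parsed Python structures, and the function name `describe_columns` is hypothetical; the sketch only restates the outcome described above and does not reproduce how the module builds SLayer objects.

```python
# Parsed forms of the three property values from the sample (written by hand).
prim_text = ["prim1", "primNorm"]
anno_prim_rel = {"anno1": "prim1", "anno2": "primNorm"}
set_layer = {
    "transcription": ["prim1", "primNorm"],
    "morphology": ["anno1", "anno2"],
}


def describe_columns(prim_text, anno_prim_rel, set_layer):
    """Return column -> (role, related primary text or None, layer name or None)."""
    layer_of = {
        column: layer for layer, columns in set_layer.items() for column in columns
    }
    summary = {}
    for column in prim_text:
        summary[column] = ("primary text", None, layer_of.get(column))
    for column, primary in anno_prim_rel.items():
        summary[column] = ("annotation", primary, layer_of.get(column))
    return summary


if __name__ == "__main__":
    for column, info in describe_columns(prim_text, anno_prim_rel, set_layer).items():
        print(column, info)
```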

metaAnnotation

If you don't have any meta annotations for your documents, or they don't match the structure needed by the module, you can disable the search for meta annotations by using the property metaAnnotation:

metaAnnotation=false

includeEmptyPrimCells

If the primary text tier of your corpus contains empty cells, you need to set the property includeEmptyPrimCells to true. By default this property is set to false.

addOrderRelation

If your corpus contains more than one primary text tier, you need order relations between the tokens of your corpus; therefore the default value of the property addOrderRelation is true. If your corpus contains only one primary text tier, the order of the tokens is set automatically and you can set this property to false. If your corpus contains only one, empty primary text tier, you need to set the property addOrderRelation to false.

parseNamespace

If true, the part of the column name before '::' is interpreted as the namespace of the annotation (instead of being a part of the name itself).
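A minimal Python sketch of this behavior is shown below. The function name `split_namespace` and the sample column name `salt::pos` are made up for illustration, and splitting at the first occurrence of '::' is an assumption.

```python
def split_namespace(column_name, parse_namespace=False):
    """Return (namespace, name); namespace is None unless parsing is enabled."""
    if parse_namespace and "::" in column_name:
        namespace, name = column_name.split("::", 1)  # assumption: first '::' wins
        return namespace, name
    return None, column_name


if __name__ == "__main__":
    print(split_namespace("salt::pos", parse_namespace=True))
    print(split_namespace("salt::pos", parse_namespace=False))
```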

Contribute

Since this Pepper module is under a free license, please feel free to fork it from GitHub and improve the module. If you think that others can benefit from your improvements, don't hesitate to make a pull request, so that your changes can be merged. If you have found any bugs, or have a feature request, please open an issue on GitHub. If you need any help, please write an e-mail to saltnpepper@lists.hu-berlin.de .

Funders

This project has been funded by the department of corpus linguistics and morphology of the Humboldt-Universität zu Berlin.

License

Copyright 2016 Humboldt-Universität zu Berlin, INRIA.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.