Skip to content

lueck/htcf

Repository files navigation

License GPLv3

htfc - Haskell library and executables for generating and parsing files in Text Corpus Format (TCF)

The Text Corpus Format (TCF) is an XML data exchange format that has been developed within the WebLicht architecture to facilitate efficient interoperability of the WebLicht services. htcf is a library for parsing and writing TCF data in the haskell programming language and also a configurable tokenizer for XML input.

htcf provides the xml2tcf commandline program that takes an XML input file, e.g. in TEI P5 format, and generates TCF layers from it. xml2tcf has a neat feature: It does not only provide a start and end character offsets of the tokens in relation to the text layer, but also in relation to the source file form the input. These offsets are provided in all layers, where it makes sense, e.g. in the tokens layer and in the text structure layer. This makes it possible to interrelate the TCF data to semantic annotations that were made to the input file (manually) in a standoff manner.

htcf also provides a taxi for setting TCF data over to CSV, JSON or raw haskell data: the tcflayer, tcftokens and tcffreq commandline programs. They let you specify the output format and which layer to get out of the file. They are useful for preparing bulk inserts into a database. While tcflayer reads a single layer, tcftokens collects information about tokens from all layers. tcffreq calculates the absolute frequencies of tokens or lemmas in a tcf file.

Layers currently supported


| Layer          | read (library) | write (library) | xml2tcf (exec) | tcflayer (exec) |
|----------------+----------------+-----------------+----------------+-----------------|
| text           | yes            | yes             | yes            | yes             |
| tokens         | yes            | yes             | yes            | yes             |
| sentences      | yes            | yes             | no             | yes             |
| POStags        | yes            | yes             | no             | yes             |
| lemmas         | yes            | yes             | no             | yes             |
| text structure | no             | yes             | yes            | no              |

Roadmap: read text structure

Installation

stack, the haskell build tool, is required for installation. After cloning htcf from github, cd into the working directory and run stack like follows.

$ cd <path-to-htcf>
$ stack setup
$ stack build

Have fun with

$ stack exec -- xml2tcf [options] <INPUT.XML>
$ stack exec -- tcflayer [options] <INPUT.TCF>

Running these programs with the -h option gives you a help message.

If you like, install the binaries within your system path by calling the stack install htcf in the htcf working directory.

Tests

Some QuickCheck tests only make sense with real world XML input. They need a TEI file from Deutsches Textarchiv which is not included in this repository due to license conditions. But you can download Kants Was ist Aufklaerung? and put it into the doc/examples directory. Then run the tests.

$ stack test

xml2tcf


xml2tcf - generate TCF from XML input.

Usage: xml2tcf [-c|--config CONFIGFILE] [-a|--abbrevs ABBREVFILE]
               [-S|--no-structure] [-o|--output OUTFILE] INFILE
  xml2tcf generates a TCF file from XML input. TCF is the Text Corpus Format
  defined for WebLicht.

Available options:
  -h,--help                Show this help text
  -c,--config CONFIGFILE   Specify a config file. Defaults to config.xml in the
                           working directory.
  -a,--abbrevs ABBREVFILE  Specify a abbreviations file. The file is expected to
                           be plain text with one abbreviation per line. Dots
                           shoult not be in there. Defaults to abbrevs.txt in
                           the working directory.
  -S,--no-structure        Do not output structure layer.
  -o,--output OUTFILE      Output file. If left, the TCF is printed to stdout.

htcf provides a config.xml as a reasonable config for parsing a TEI file and abbrevs.txt as a starting point for abbreviations, used to configure the tokenizer.

Here is an example of the character offsets xml2tcf generates:


    <token>
      ...
      <token id="w37b" start="5036" end="5045" srcStart="18726" srcEnd="18759">Aufklärung</token>
      <token id="w37c" start="5047" end="5049" srcStart="18766" srcEnd="18775">iſt</token>
      <token id="w37d" start="5051" end="5053" srcStart="18777" srcEnd="18779">der</token>
      <token id="w37e" start="5055" end="5061" srcStart="18781" srcEnd="18787">Ausgang</token>
      <token id="w37f" start="5063" end="5065" srcStart="18789" srcEnd="18791">des</token>
      <token id="w380" start="5067" end="5074" srcStart="18793" srcEnd="18814">Menſchen</token>
	  ...
    </token>
	<textstructure>
	  ...
	  <textspan type="p" namespace="http://www.tei-c.org/ns/1.0" start="w37b" end="w3c8" textStart="5035" textEnd="5534" srcStart="18699" srcEnd="19556"/>
      <textspan type="hi" namespace="http://www.tei-c.org/ns/1.0" start="w37b" end="w37b" textStart="5036" textEnd="5037" srcStart="18703" srcEnd="18731"/>
      <textspan type="hi" namespace="http://www.tei-c.org/ns/1.0" start="w37b" end="w37b" textStart="5037" textEnd="5046" srcStart="18732" srcEnd="18764"/>
	  ...
    </textstructure>

xml2tcf does not generate a source layer where the opening tags are escaped to &lt;. This escaping isn't a bijective mapping from the xml source to the string of the source layer. So this layer is worth nothing in regard to the character offsets generated by xml2tcf. Only a link to the source file and a md5 hash of it would make sense here.

Caveat

When reading TCF data with tcflayer, tcftokens or tcffreq, it automatically parses the IDs of tokens, sentences etc. into integer counts. For that purpose the common prefix of the tokens is stripped and the base of the resting numerical part is calculated from the length of the "alphabet" of digits in the numerical parts. This works for prefixed IDs encoded in binary, octet, decimal, hexadecimal etc. format. But it only works as long as the count of tokens is not lesser than the base of their IDs. So you might run into problems with short texts. And there is an additional constraint: The alphabet must utilize contigous characters out of 0-9 first and a-z then, in ascending order.

License

GPL V3

About

Haskell library for the Text Corput Format

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published