TreeTagger docker image - Annotate text with POS tags and lemma information

leodido/treetagger.docker

treetagger.docker


This repository contains Docker images for building and shipping ready-to-use TreeTagger instances.

You no longer need to install TreeTagger manually on your system.

What it is

TreeTagger is a tool for annotating text with part-of-speech (POS) tags and lemma information.

TreeTagger consists of two programs:

  1. train-tree-tagger

    Creates a parameter file from a lexicon and a hand-tagged corpus.

  2. tree-tagger

    Annotates the text with part-of-speech tags, given a parameter file and a text file as arguments.

This image contains:

  • training program and tagger executables

  • program for tokenization (i.e., separate-punctuation)

  • shell scripts (shortcuts) which simplify tagging and chunking:

    e.g., tree-tagger-italian, tree-tagger-german, tagger-chunker-english, ...

  • parameter files, chunker parameter files, and abbreviations files

  • documentation and language tagset references

See for yourself:

$ docker run -i -t leodido/treetagger ls /usr/local

The official TreeTagger page provides further documentation.

Installation

Pull the image directly from the Docker index.

$ docker pull leodido/treetagger

Usage

Tagging

Suppose you want to (tokenize and) tag an Italian text.

The script to use is tree-tagger-italian.

It expects UTF-8 encoded input files as arguments. If no files are specified, it reads input from stdin.

$ echo 'Proviamo semplicemente a eseguire un test di prova.' | docker run --rm -i leodido/treetagger tree-tagger-italian

Outputs:

Proviamo	    VER:pres	provare
semplicemente	ADV	        semplicemente
a	            PRE	        a
eseguire	    VER:infi	eseguire
un	            DET:indef	un
test	        NOM	        test
di	            PRE	        di
prova	        NOM	        prova
.	            SENT	    .
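
The script also accepts input files as arguments, but a file on the host is only visible to the container if its directory is mounted. The following is a minimal sketch of one way to do that, not part of the image itself: `tag_file` is a hypothetical helper name, and `/data` is an arbitrary mount point chosen for illustration.

```shell
# Sketch: tag a local UTF-8 file by mounting its directory into the container.
# Assumptions: tag_file is a hypothetical helper, /data is an arbitrary
# mount point, and the language has a tree-tagger-<language> script.
tag_file() {
  lang=$1   # e.g. italian, german, portuguese
  file=$2   # path to a UTF-8 encoded text file on the host
  # docker -v needs an absolute host path, so resolve the directory first
  dir=$(cd "$(dirname "$file")" && pwd) || return 1
  base=$(basename "$file")
  # Mount the directory read-only and pass the in-container path to the script
  docker run --rm -v "$dir":/data:ro leodido/treetagger \
    "tree-tagger-$lang" "/data/$base"
}

# Usage (requires Docker and the pulled image):
# tag_file italian ./testo.txt
```

Mounting read-only (`:ro`) keeps the container from modifying the host directory; the tagged output still goes to stdout, so it can be redirected to a file on the host.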

Now, try with some Portuguese.

$ echo 'Qual é o seu nome?' | docker run --rm -i leodido/treetagger tree-tagger-portuguese

Results:

Qual	PT0	qual
é	VMI	ser
o	DA0	o
seu	DP3	seu
nome	NCMS	nome
?	Fit	?

Need fine-grained tags?

$ echo 'Qual é o seu nome?' | docker run --rm -i leodido/treetagger tree-tagger-portuguese-finegrained

Results:

Qual	PT0CS000	qual
é	VMIP3S0	ser
o	DA0MS0	o
seu	DP3MSS	seu
nome	NCMS000	nome
?	Fit	?

And so on for other supported languages.
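
The tagger output shown above has three columns (token, tag, lemma), so it is easy to post-process with standard tools. As a rough sketch, assuming the columns are separated by single tab characters, the lemma column can be extracted with `awk`. The name `lemmas` is a hypothetical helper introduced here for illustration.

```shell
# Sketch: keep only the lemma column of the tagger output.
# Assumption: each token line has exactly three tab-separated fields
# (token, tag, lemma). Lines with a different shape are skipped.
lemmas() {
  awk -F '\t' 'NF == 3 { print $3 }'
}

# Usage (requires Docker and the pulled image):
# echo 'Qual é o seu nome?' \
#   | docker run --rm -i leodido/treetagger tree-tagger-portuguese \
#   | lemmas
```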

Chunking

Suppose you want to tokenize, tag and annotate a German text with nominal and verbal chunks.

$ echo 'Das ist ein Test.' | docker run -i leodido/treetagger tagger-chunker-german

Outputs:

<NC>
Das	    PDS	    die
</NC>
<VC>
ist	    VAFIN	sein
</VC>
<NC>
ein	    ART	    eine
Test	NN	    Test
</NC>
.	    $.	    .
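
The chunker output can be condensed to one line per chunk. The sketch below assumes the layout of the sample above: each chunk opens and closes with a tag on its own line, chunks are not nested, and the token is the first tab-separated field. Nested chunks would need extra handling. `chunks` is a hypothetical helper name.

```shell
# Sketch: print each chunk as "TYPE: token token ...".
# Assumptions: flat (non-nested) chunks delimited by <XX> ... </XX> lines,
# with the token in the first tab-separated field of every other line.
chunks() {
  awk -F '\t' '
    # Closing tag: emit the collected chunk and reset the state
    /^<\/[A-Z]+>$/ { print type ": " words; type = ""; words = ""; next }
    # Opening tag: remember the chunk type (text between < and >)
    /^<[A-Z]+>$/   { type = substr($0, 2, length($0) - 2); next }
    # Token line inside a chunk: append the token
    type != ""     { words = (words == "" ? $1 : words " " $1) }
  '
}

# Usage (requires Docker and the pulled image):
# echo 'Das ist ein Test.' \
#   | docker run --rm -i leodido/treetagger tagger-chunker-german \
#   | chunks
```

Tokens outside any chunk, such as the final period in the sample, are dropped by this sketch.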

Supported languages

17 languages are supported: bulgarian, dutch, english, estonian, finnish, french, galician, german, italian, latin, portuguese, polish, russian, slovak, spanish, swahili, mongolian (only parameter file provided, no scripts).

Some of them also have alternative parameter files.

Todos

  • Add support for Chinese and spoken French.

Credits

  • Helmut Schmid, University of Stuttgart, Germany - TreeTagger.

Last update: 28/05/2015

