This repository contains Docker images for building and shipping ready-to-use TreeTagger instances.
You will not have to install TreeTagger manually on your system again.
TreeTagger is a tool for annotating text with part-of-speech (POS) tags and lemma information.
TreeTagger consists of two programs:
- train-tree-tagger: creates a parameter file from a lexicon and a hand-tagged corpus.
- tree-tagger: annotates the text with part-of-speech tags, given a parameter file and a text file as arguments.
This image contains:
- training program and tagger executables
- program for tokenization (i.e., separate-punctuation)
- shell scripts (shortcuts) that simplify tagging and chunking, e.g., tree-tagger-italian, tree-tagger-german, tagger-chunker-english, ...
- parameter files, chunker parameter files, and abbreviation files
- documentation and language tagset references
See them for yourself:
$ docker run -i -t leodido/treetagger ls /usr/local
See the official TreeTagger page for further documentation.
Pull this image directly from the Docker index.
$ docker pull leodido/treetagger
Suppose you want to (tokenize and) tag an Italian text.
The script to use is tree-tagger-italian.
It expects UTF-8 encoded input files as arguments. If no files are specified, it reads input from stdin.
$ echo 'Proviamo semplicemente a eseguire un test di prova.' | docker run --rm -i leodido/treetagger tree-tagger-italian
Outputs:
Proviamo VER:pres provare
semplicemente ADV semplicemente
a PRE a
eseguire VER:infi eseguire
un DET:indef un
test NOM test
di PRE di
prova NOM prova
. SENT .
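Each output line holds a token, its POS tag, and its lemma, so standard text tools can post-process it. The following is a minimal sketch that extracts the lemma column with awk. It assumes the columns are separated by whitespace and that no token contains a space; the variable names are illustrative, and the sample text is the output shown above rather than a live docker call.

```shell
# Sample tagger output copied from the Italian example (token, tag, lemma).
tagged='Proviamo VER:pres provare
semplicemente ADV semplicemente
a PRE a
eseguire VER:infi eseguire
un DET:indef un
test NOM test
di PRE di
prova NOM prova
. SENT .'

# Keep only the third column (the lemma) and join the lines with spaces.
# Assumption: whitespace-separated columns, no spaces inside tokens.
lemmas=$(printf '%s\n' "$tagged" | awk '{ print $3 }' | paste -sd' ' -)

echo "$lemmas"
# prints: provare semplicemente a eseguire un test di prova .
```

The same awk filter could presumably be appended to the docker run pipeline shown above to lemmatize text in one step.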
Now, try with some Portuguese.
$ echo 'Qual é o seu nome?' | docker run --rm -i leodido/treetagger tree-tagger-portuguese
Results:
Qual PT0 qual
é VMI ser
o DA0 o
seu DP3 seu
nome NCMS nome
? Fit ?
Fine-grained?
$ echo 'Qual é o seu nome?' | docker run --rm -i leodido/treetagger tree-tagger-portuguese-finegrained
Results:
Qual PT0CS000 qual
é VMIP3S0 ser
o DA0MS0 o
seu DP3MSS seu
nome NCMS000 nome
? Fit ?
And so on for other supported languages.
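To see what the fine-grained tagset adds, the two Portuguese outputs above can be lined up token by token. This is only a sketch: it reuses the sample outputs as literal strings, assumes both runs produce the same tokenization, and uses illustrative variable names.

```shell
# Coarse and fine-grained outputs copied from the Portuguese examples.
coarse='Qual PT0 qual
é VMI ser
o DA0 o
seu DP3 seu
nome NCMS nome
? Fit ?'

fine='Qual PT0CS000 qual
é VMIP3S0 ser
o DA0MS0 o
seu DP3MSS seu
nome NCMS000 nome
? Fit ?'

# Feed both outputs to awk, separated by a marker line, and print
# "token coarse-tag fine-tag" for each position.
# Assumption: both outputs contain the same tokens in the same order.
merged=$( { printf '%s\n' "$coarse"; echo '---'; printf '%s\n' "$fine"; } | awk '
  $0 == "---" { second = 1; n = 0; next }            # switch to the second output
  !second     { tok[++n] = $1; tag[n] = $2; next }   # remember coarse tags
              { n++; printf "%s %s %s\n", tok[n], tag[n], $2 }
')

echo "$merged"
```

In this sample, each fine-grained tag starts with the corresponding coarse tag and appends further features.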
Suppose you want to tokenize, tag and annotate a German text with nominal and verbal chunks.
$ echo 'Das ist ein Test.' | docker run -i leodido/treetagger tagger-chunker-german
Outputs:
<NC>
Das PDS die
</NC>
<VC>
ist VAFIN sein
</VC>
<NC>
ein ART eine
Test NN Test
</NC>
. $. .
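The chunker wraps tokens in <NC> and <VC> markers, which makes it straightforward to collect whole chunks. Below is a minimal sketch that prints the tokens of each nominal chunk on one line. It assumes every chunk marker sits on its own line, as in the output above, and that chunks are not nested; the variable names are illustrative.

```shell
# Chunker output copied from the German example.
chunked='<NC>
Das PDS die
</NC>
<VC>
ist VAFIN sein
</VC>
<NC>
ein ART eine
Test NN Test
</NC>
. $. .'

# Collect the first column (the token) of every line inside <NC> ... </NC>.
# Assumption: markers appear alone on a line and chunks are not nested.
noun_chunks=$(printf '%s\n' "$chunked" | awk '
  $0 == "<NC>"  { inside = 1; phrase = ""; next }    # chunk opens
  $0 == "</NC>" { inside = 0; print phrase; next }   # chunk closes: emit it
  inside        { phrase = (phrase == "" ? $1 : phrase " " $1) }
')

echo "$noun_chunks"
# prints:
# Das
# ein Test
```

Replacing NC with VC in the two patterns would collect verbal chunks instead.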
17 languages are supported: bulgarian, dutch, english, estonian, finnish, french, galician, german, italian, latin, portuguese, polish, russian, slovak, spanish, swahili, mongolian (only parameter file provided, no scripts).
Some of them also have alternative parameter files.
- Add support for Chinese and spoken French.
- Helmut Schmid, University of Stuttgart, Germany - TreeTagger.
Last update: 28/05/2015