[very much UNDER DEVELOPMENT]
This repository holds the code for "Midas" (Middle Dutch Annotation System), a tagger-lemmatizer for Middle Dutch. Although Midas was originally developed for medieval Dutch, it is largely language-independent and can be applied to other (historical) languages such as medieval Latin or Old French. Midas provides functionality for tokenization, part-of-speech tagging and lemmatization, with a heavy bias towards languages that show a considerable amount of orthographic variation in spelling and spacing. Midas is written in pure Python (>= py2.7 or py3) and has been tested on UNIX-like systems. Via keras and theano, Midas makes heavy use of neural networks for its language modeling; training the tagger-lemmatizer can be accelerated by running Midas on the GPU instead of the CPU.
All input files should be encoded in UTF-8. Midas expects annotated training data to have the following three-column format:
@ begin_of_text.txt
ambrosius N(prop) ambrosius
ende Conj(coord) en
iacob N(prop) jacob
van Adp() van
uitri N(prop) vitry
ende Conj(coord) en
isidorus N(prop) isidorus
dar~bi PronAdv(dem) daarbij
nomic V(fin,pres,lex)+Pron(pers,1,sing) noemen+ik
iv Pron(pers,2,plu) gij
dese Pron(dem) deze
bi Adp() bij
namen N(sing) naam
A normal line should contain the original token, the part-of-speech tag and the lemma, separated by tabs. The beginning of a new document can be encoded as "@ begin_of_text.txt". Empty lines (\n\n) can be used to indicate utterance boundaries, e.g. to mark verse endings in medieval poetry. If consecutive tokens in the original input had to be concatenated to assign a lemma to them (e.g. dar~bi in the example above), the concatenation can be marked using a tilde and, if needed, a tokenizer can be trained to learn and reproduce this behaviour. Due to cliticization phenomena, composite tags are sometimes assigned to words (e.g. nomic, tagged V(fin,pres,lex)+Pron(pers,1,sing) with lemma noemen+ik); Midas treats these as atomic tags. Midas is agnostic with respect to the specific tag or lemma set used: any system can be used as long as it is consistent.
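To check that a training file follows this layout before handing it to Midas, a small reader along the following lines may help. This is only a sketch and not part of Midas: the function name `parse_annotated` and the returned structure are hypothetical, and it assumes that a document marker is a line starting with "@" that contains no tab.

```python
def parse_annotated(text):
    """Parse three-column training data (hypothetical helper, not Midas code).

    Returns a list of (document_name, utterances) pairs, where each
    utterance is a list of (token, pos_tag, lemma) tuples.
    """
    documents = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@") and "\t" not in line:
            # Document marker such as "@ begin_of_text.txt"
            documents.append((line[1:].strip(), [[]]))
        elif not line:
            # Empty line: utterance boundary (ignore repeated empty lines)
            if documents and documents[-1][1][-1]:
                documents[-1][1].append([])
        else:
            if not documents:
                # Data before any marker goes into an unnamed document
                documents.append((None, [[]]))
            # Raises ValueError unless the line has exactly three fields
            token, pos_tag, lemma = line.split("\t")
            documents[-1][1][-1].append((token, pos_tag, lemma))
    # Drop empty utterances left over from trailing empty lines
    return [(name, [u for u in utts if u]) for name, utts in documents]


sample = (
    "@ begin_of_text.txt\n"
    "ambrosius\tN(prop)\tambrosius\n"
    "ende\tConj(coord)\ten\n"
    "\n"
    "dar~bi\tPronAdv(dem)\tdaarbij\n"
    "nomic\tV(fin,pres,lex)+Pron(pers,1,sing)\tnoemen+ik\n"
)
for name, utterances in parse_annotated(sample):
    print(name, len(utterances))
```

Composite tags and lemmas are kept as single strings here, in line with treating them as atomic. Replacing the tilde in a token with a space (e.g. `token.replace("~", " ")`) would give back the original spacing, as in "dar bi".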
With respect to unannotated data (used for pretraining), Midas simply expects UTF-8-encoded files that respect the original spacing between tokens and use empty lines to mark boundaries between utterances:
ambrosius ende iacob van uitri
ende isidorus dar bi
nomic iv dese bi namen
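As an illustration of this layout, the sketch below groups unannotated text into utterances. It is not Midas code: `read_utterances` is a hypothetical name, and it assumes that tokens are whitespace-separated and that consecutive non-empty lines belong to the same utterance until an empty line is reached.

```python
def read_utterances(text):
    """Split unannotated text into utterances (hypothetical helper).

    Returns a list of utterances, each a list of whitespace-separated
    tokens; empty lines act as utterance boundaries.
    """
    utterances, current = [], []
    for line in text.splitlines():
        tokens = line.split()
        if tokens:
            # Non-empty line: continue the current utterance
            current.extend(tokens)
        elif current:
            # Empty line: close the current utterance
            utterances.append(current)
            current = []
    if current:
        # Close the last utterance if the text does not end in an empty line
        utterances.append(current)
    return utterances


raw = "ambrosius ende iacob van uitri\n\nende isidorus dar bi\nnomic iv dese bi namen\n"
for utterance in read_utterances(raw):
    print(len(utterance), " ".join(utterance))
```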
Midas can be used in the following modes: "tag", "test" and "train". Its configuration and hyperparameters can be set using a standard config file. Previously trained models can be saved via pickling and reused for tagging or testing. Run Midas from the command line:
>>> python midas.py train config.txt my_model
>>> python midas.py tag config.txt my_model
>>> python midas.py test config.txt my_model
To enable GPU acceleration, prefix the command with theano flags, for example:
>>> THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python midas.py train config.txt my_model
Midas mainly depends on scikit-learn and keras (and thus theano). If you want to use theano's support for GPU acceleration (which is highly recommended for larger data sets), you will have to install Nvidia’s CUDA properly.