Center for Sprogteknologi, Copenhagen University


Lemmatiser that uses affix rules (an affix can be a prefix, infix, suffix or circumfix). The rules are obtained by supervised learning from a list of full form/lemma pairs.

Updated Oct 19, 2016
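
To illustrate the general idea of rule-based lemmatisation, here is a minimal sketch that handles suffix rules only. The rule table, the function name `lemmatise` and the longest-match strategy are assumptions made for illustration; they are not the rule format or the algorithm used by the lemmatiser itself.

```python
# Hypothetical suffix rules: (suffix of the full form, suffix of the lemma).
# A real rule set would be learned from a full form/lemma list.
RULES = [("ies", "y"), ("es", "e"), ("s", ""), ("ing", "")]


def lemmatise(word, rules=RULES):
    """Apply the rule with the longest matching suffix; return the word unchanged if none matches."""
    best = None
    for suffix, replacement in rules:
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, replacement)
    if best is None:
        return word
    suffix, replacement = best
    # Strip the matched suffix and append the lemma suffix.
    return word[: len(word) - len(suffix)] + replacement


print(lemmatise("ponies"))   # pony
print(lemmatise("walking"))  # walk
```

Preferring the longest matching suffix lets a specific rule such as "ies" override a more general one such as "s".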

Servlet that computes candidate tool workflows given input file(s) and the user's requirements regarding the output. Afterwards, runs a workflow selected by the user from the list of candidates.

Updated Oct 18, 2016


Using supervised learning, create a set of affix rules for use by the CSTlemma lemmatiser.

Updated Oct 3, 2016

OpenCV-based plugin for the Anvil annotation software that tracks faces and creates annotations when velocity or acceleration thresholds are exceeded.

Updated Aug 25, 2016


Modernized version of Eric Brill's part-of-speech tagger.

Updated May 17, 2016


Reads an RTF or flat text file and outputs the text, one line per sentence and optionally tokenized.

Updated May 3, 2016


Converts UTF-16 (BE/LE), UTF-32 (BE/LE) and ISO-8859-N to UTF-8. Removes the BOM and surrogate pairs from UTF-8: a code point between U+D800 and U+DBFF followed by a code point between U+DC00 and U+DFFF is converted to one valid code point above U+FFFF.

Updated Oct 14, 2015
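
The surrogate-pair repair described above can be sketched as follows. This is a minimal Python illustration, not the tool's own code; the function name `repair_utf8` is hypothetical, and leaving unpaired surrogates untouched is an assumption.

```python
BOM = "\ufeff"


def repair_utf8(data: bytes) -> bytes:
    """Strip a leading BOM and merge UTF-8-encoded surrogate pairs into single code points."""
    # 'surrogatepass' lets Python decode the 3-byte encodings of surrogate halves.
    text = data.decode("utf-8", errors="surrogatepass")
    if text.startswith(BOM):
        text = text[1:]
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Standard UTF-16 pair arithmetic: the result is above U+FFFF.
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])
        i += 1
    # Assumption: unpaired surrogates are passed through unchanged.
    return "".join(out).encode("utf-8", errors="surrogatepass")


# U+D83D followed by U+DE00, each encoded as three bytes, preceded by a BOM.
broken = b"\xef\xbb\xbf" + "\ud83d\ude00".encode("utf-8", errors="surrogatepass")
print(repair_utf8(broken) == "\U0001F600".encode("utf-8"))  # True
```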


Functions for upper/lower casing, for testing whether a character is a letter, and for conversion between the Unicode encodings UTF-8 and UTF-16.

Updated Aug 31, 2015


Simple implementation of a hash map using separate chaining. The table allocates more buckets if the load factor exceeds 100% and frees buckets if the load factor falls below 20%.

Updated Jul 3, 2015
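
The resizing policy described above can be sketched in a few lines. This is a minimal Python illustration, not the repository's code: the class name `ChainedHashMap`, the minimum of 8 buckets, and the choice to double or halve the bucket count are all assumptions.

```python
class ChainedHashMap:
    MIN_BUCKETS = 8  # assumed lower bound on the table size

    def __init__(self):
        self._buckets = [[] for _ in range(self.MIN_BUCKETS)]
        self._size = 0

    def bucket_count(self):
        return len(self._buckets)

    def _chain(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def _resize(self, n):
        # Rehash every entry into a fresh table with n buckets.
        old = self._buckets
        self._buckets = [[] for _ in range(n)]
        for chain in old:
            for key, value in chain:
                self._chain(key).append((key, value))

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)
                return
        chain.append((key, value))
        self._size += 1
        if self._size > len(self._buckets):  # load factor above 100%
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._chain(key):
            if k == key:
                return v
        return default

    def remove(self, key):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self._size -= 1
                n = len(self._buckets)
                if n > self.MIN_BUCKETS and self._size < 0.2 * n:  # load factor below 20%
                    self._resize(max(self.MIN_BUCKETS, n // 2))
                return True
        return False
```

Keeping the shrink threshold (20%) well below the load factor that results from growing (about 50% after doubling) avoids repeated resizing when the number of entries hovers around a boundary.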


Parses SGML, HTML and XML in a forgiving way.

Updated Aug 9, 2014