This repository provides NLP resources for Norwegian, based on the Norwegian Dependency Treebank (NDT). It provides a data set split (training/dev/test) of the treebank as well as PoS tagger models and syntactic parser models trained on the training data in the treebank.
Optimized PoS tag set
Hohle (2016) proposes a tag set optimized for syntactic dependency parsing of Norwegian, henceforth referred to as the optimized tag set. This tag set is based on the original tag set of NDT, with the addition of 20 PoS tags providing more fine-grained morphosyntactic information.
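As a rough illustration of how a coarse tag can be refined with morphological features, here is a minimal Python sketch. The function `map_tag`, the `selected` table, the example tags and features, and the `|`-joined label format are all assumptions made for illustration; they do not reproduce the actual optimized tag set or the labels it uses.

```python
def map_tag(pos, feats, selected):
    """Extend a coarse PoS tag with the features selected for that tag.

    pos:      coarse PoS tag of the token
    feats:    morphological features of the token, in treebank order
    selected: dict mapping a coarse tag to the set of features to keep
              (hypothetical; not the feature selection from Hohle (2016))
    """
    keep = selected.get(pos, set())
    kept = [f for f in feats if f in keep]
    # Tags without any selected feature are left unchanged.
    return "|".join([pos] + kept)


# Illustrative input only: refine nouns by type, ignore all other features.
selected = {"subst": {"appell", "prop"}}
print(map_tag("subst", ["appell", "mask", "ub", "ent"], selected))
print(map_tag("verb", ["pres"], selected))
```

With this table, the noun is relabelled with its type while the verb keeps its original tag.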
Data set split
This repository provides a data set split (training/dev/test) of NDT. The split follows the commonly used 80-10-10 scheme: 80% of the data is used for training, 10% for testing during development, and 10% is held out for final evaluation. In creating the split, care was taken to preserve contiguous texts and to keep the split balanced in terms of genre.
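The idea of an 80-10-10 split that never cuts a text in two and stays balanced across genres can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used to produce the files in this repository: the function `split_documents` and its input format (a list of `(genre, sentences)` pairs) are hypothetical.

```python
def split_documents(docs, train_pct=80, dev_pct=10):
    """Assign whole documents to training/dev/test, genre by genre.

    docs: list of (genre, sentences) pairs in corpus order, where
          sentences is the list of sentences of one contiguous text.
    """
    splits = {"training": [], "dev": [], "test": []}

    # Group documents by genre so each genre is split separately.
    by_genre = {}
    for genre, sentences in docs:
        by_genre.setdefault(genre, []).append(sentences)

    for genre_docs in by_genre.values():
        total = sum(len(doc) for doc in genre_docs)
        seen = 0
        for doc in genre_docs:
            # Decide by the share of the genre already assigned, so a
            # document always lands in exactly one part. Integer
            # arithmetic avoids floating-point edge cases at the limits.
            if seen * 100 < total * train_pct:
                name = "training"
            elif seen * 100 < total * (train_pct + dev_pct):
                name = "dev"
            else:
                name = "test"
            splits[name].extend(doc)
            seen += len(doc)
    return splits
```

Because whole documents are assigned, the resulting proportions only approximate 80-10-10 when documents differ in length.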
Using the original tag set
training.conll contains the training data.
dev.conll contains the development data.
test.conll contains the test data.
Using the optimized tag set
training-optimized.conll contains the training data.
dev-optimized.conll contains the development data.
test-optimized.conll contains the test data.
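The data files can be read with a few lines of Python. The sketch below assumes the files use the tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences; check this against the files before relying on the column indices. The helper `read_conll` and the sample rows are made up for illustration.

```python
from collections import Counter


def read_conll(lines):
    """Yield sentences as lists of token rows (lists of column strings)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line.split("\t"))
        elif sentence:
            # A blank line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Made-up two-sentence sample in the assumed column layout.
sample = [
    "1\tDet\tdet\tpron\tpron\t_\t2\tSUBJ\t_\t_\n",
    "2\tregner\tregne\tverb\tverb\t_\t0\tFINV\t_\t_\n",
    "\n",
    "1\tJa\tja\tinterj\tinterj\t_\t0\tINTERJ\t_\t_\n",
]
sentences = list(read_conll(sample))

# Count PoS tags, assuming they sit in the fifth column (index 4).
tag_counts = Counter(row[4] for sent in sentences for row in sent)
```

To read one of the files instead of the sample, pass an open file handle, for example `read_conll(open("training.conll", encoding="utf-8"))`.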
PoS tagger models and syntactic parser models
svmtool-tagger-model contains the model files for use with the SVMTool tagger, using the original tag set.
svmtool-optimized-tagger-model contains the model files for use with the SVMTool tagger, using the optimized tag set.
mate-parser-model contains the model file for use with the Mate parser, using the original tag set.
mate-optimized-parser-model contains the model file for use with the Mate parser, using the optimized tag set.
In the evaluation of PoS taggers and syntactic dependency parsers in Hohle (2016), I found that SVMTool was the best tagger and Mate the best parser on NDT.
generate_split.py generates a data set split (training/dev/test) of the treebank, provided a path to the original treebank files.
map_tagset.py maps the tag set of the treebank by introducing supplied morphological features present in the treebank.
tagging_error_analysis.py performs error analysis in terms of precision, recall and F score.
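For readers unfamiliar with per-tag error analysis, the following sketch shows how precision, recall and F score can be computed for each tag from aligned gold and predicted tag sequences. It is an independent illustration, not the code of tagging_error_analysis.py; the function `per_tag_scores` and the example tags are hypothetical.

```python
from collections import Counter


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f_score)} for aligned tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted where it should not have been
            fn[g] += 1  # g was missed

    scores = {}
    for tag in set(gold) | set(predicted):
        p_den = tp[tag] + fp[tag]
        r_den = tp[tag] + fn[tag]
        precision = tp[tag] / p_den if p_den else 0.0
        recall = tp[tag] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0
        scores[tag] = (precision, recall, f_score)
    return scores


# Made-up example: one noun is mistagged as an adjective.
gold = ["subst", "verb", "subst", "adj"]
predicted = ["subst", "verb", "adj", "adj"]
scores = per_tag_scores(gold, predicted)
```

Here the F score is the balanced harmonic mean of precision and recall (F1); tags that never occur in one of the two sequences get a score of 0.0 rather than raising a division error.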
Please cite the following paper if you use the data sets in academic works:
Hohle, P., Velldal, E., Øvrelid, L. (2017). Optimizing a PoS Tagset for Norwegian Dependency Parsing. In Proceedings of the 21st Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 142-151). Gothenburg, Sweden.
Please cite the following thesis if you use the models or scripts in academic works:
Hohle, P. (2016). Optimizing a PoS Tag Set for Norwegian Dependency Parsing (Master's thesis). University of Oslo, Oslo, Norway.