OpenKorPos is a Korean part-of-speech tagging corpus. It is a free, open alternative to the Sejong corpus and Modu corpus.
For background on this work, please refer to our paper.
Building the corpus requires Python 3.9+, Click, and Ninja. You can install all dependencies using the provided requirements.txt.
pip install -r requirements.txt
To build the corpus, first generate the corresponding Ninja build files, then run the build.
python openkorpos.py ningen base
ninja
You can also include all quarantined (flagged) sentences in the generated corpus.
python openkorpos.py ningen --flagged base
The build artifacts are written to the build directory.
Each file is in JSON Lines format (one JSON value per line), encoded in UTF-8.
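Since the artifacts are JSON Lines files, they can be read one record at a time with the Python standard library alone. The following is a minimal sketch, not part of the project: the helper name read_jsonl and the file name build/example.jsonl are hypothetical, and no assumption is made about the fields inside each record.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def read_jsonl(path: Path) -> Iterator[Any]:
    """Yield one decoded JSON value per non-empty line of a UTF-8 file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical file name; substitute an actual artifact from the build directory.
    artifact = Path("build") / "example.jsonl"
    if artifact.exists():
        for index, record in enumerate(read_jsonl(artifact)):
            print(record)
            if index >= 2:  # show only the first three records
                break
```

Reading lazily with a generator avoids loading a whole file into memory, which may matter for larger artifacts.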
If you need to cite this work before its BibTeX entry is available in the ACL Anthology, please use the following:
@inproceedings{Moon:LREC2022,
title = "OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation",
author = "Moon, Sangwhan and
Cho, Won Ik and
Han, Hye Joo and
Okazaki, Naoaki and
Kim, Nam Soo",
booktitle = "Proceedings of the 13th Language Resources and Evaluation Conference (LREC)",
month = jun,
year = "2022",
address = "Marseille",
publisher = "European Language Resources Association",
}