UPC - University of Ulsan Open Parallel Corpora
UPC (University of Ulsan Open Parallel Corpora) is a set of parallel corpora with Korean word-sense annotation and morphological analysis, built at the NLP Lab, University of Ulsan, Republic of Korea (http://nlplab.ulsan.ac.kr).
UKren is a large-scale Korean-English parallel corpus. Its details are as follows.
. UKren: Korean-English Parallel Corpus.
. UKren_WS_Ann: Korean-English Parallel Corpus with word-sense annotation for Korean by UTagger
. Total sentences: 969,194 pairs
. Average sentence length
  . English: 13.0
  . Korean: 10.2
  . Korean with Word-sense annotation: 16.2
. Total tokens
  . English: 12,291,207
  . Korean: 9,918,960
  . Korean with Word-sense annotation: 15,691,059
. Vocabulary size
  . English: 347,658
  . Korean: 816,273
  . Korean with Word-sense annotation: 132,754
UKrvi is a large-scale Korean-Vietnamese parallel corpus. Its details are as follows.
. UKrvi: Korean-Vietnamese Parallel Corpus.
. UKrvi_WS_Ann: Korean-Vietnamese Parallel Corpus with word-sense annotation for Korean by UTagger
. Total sentences: 412,317 pairs
. Average sentence length
  . Vietnamese: 11.6
  . Korean: 12.0
  . Korean with Word-sense disambiguation and Morphological analysis: 20.1
. Total tokens
  . Vietnamese: 5,958,096
  . Korean: 4,782,063
  . Korean with Word-sense disambiguation and Morphological analysis: 8,287,635
. Vocabulary size
  . Vietnamese: 40,090
  . Korean: 39,748
  . Korean with Word-sense disambiguation: 68,719
. The Korean word-sense disambiguation and morphological analysis were performed by UTagger (http://nlplab.ulsan.ac.kr/doku.php?id=utagger), which consists of the following processes:
  . Sense-code tagging (a sense code, which represents a particular sense of a word, is defined in the Standard Korean Language Dictionary)

x_original_sample.txt and x_WSD_MA_sample.txt are sample files with 5,000 sentence pairs. To use the full corpus, please contact us by e-mail: haivv279@gmail.com
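
Statistics of the kind listed above (sentence count, token count, vocabulary size, average sentence length) could be recomputed roughly as in the sketch below. This is only an illustration, not the procedure used to build UPC: it assumes one sentence per line and whitespace-separated tokens, neither of which is stated in this README, and the function name corpus_stats is hypothetical. Numbers obtained this way may differ from the reported ones if a different tokenization was used.

```python
from collections import Counter
from typing import Iterable


def corpus_stats(sentences: Iterable[str]) -> dict:
    """Count sentences, whitespace tokens, and distinct token types.

    Assumption (not from the README): one sentence per line,
    tokens separated by whitespace.
    """
    n_sentences = 0
    vocab = Counter()
    for line in sentences:
        tokens = line.split()
        if not tokens:  # skip blank lines
            continue
        n_sentences += 1
        vocab.update(tokens)
    n_tokens = sum(vocab.values())
    avg = round(n_tokens / n_sentences, 1) if n_sentences else 0.0
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "vocabulary": len(vocab),
        "avg_length": avg,
    }


# Tiny in-memory example. For a real file, pass an open file handle instead,
# e.g. open(path, encoding="utf-8"), under the same one-sentence-per-line assumption.
sample = ["I go to school .", "I go home ."]
stats = corpus_stats(sample)
print(stats)
```

The function accepts any iterable of lines, so the same code works on a list of strings or on a file object without loading the whole corpus into memory.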