haivv/UPC

UPC - University of Ulsan Open Parallel Corpora

UPC (University of Ulsan Open Parallel Corpora), with Korean word-sense annotation and morphological analysis, was built at the NLP Lab, University of Ulsan, Republic of Korea (http://nlplab.ulsan.ac.kr).

UKren is a large-scale Korean-English parallel corpus. Details are as follows.

. UKren: Korean-English Parallel Corpus.

. UKren_WS_Ann: Korean-English Parallel Corpus with word-sense annotation of the Korean side by UTagger

. Total sentences: 969,194 pairs	
	
. Average sentence length

	. English: 13.0
	
	. Korean: 10.2
	
	. Korean with Word-sense annotation: 16.2
	
. Total tokens

	. English: 12,291,207
	
	. Korean:  9,918,960
	
	. Korean with Word-sense annotation: 15,691,059
	

. Vocabulary size

	. English: 347,658
	
	. Korean:  816,273
	
	. Korean with Word-sense annotation: 132,754

UKrvi is a large-scale Korean-Vietnamese parallel corpus. Details are as follows.

. UKrvi: Korean-Vietnamese Parallel Corpus.

. UKrvi_WS_Ann: Korean-Vietnamese Parallel Corpus with word-sense annotation of the Korean side by UTagger

. Total sentences: 412,317 pairs	

. Average sentence length

	. Vietnamese: 11.6

	. Korean: 12.0

	. Korean with Word-sense disambiguation and Morphological analysis: 20.1

. Total tokens

	. Vietnamese: 5,958,096

	. Korean: 4,782,063

	. Korean with Word-sense disambiguation and Morphological analysis: 8,287,635


. Vocabulary size

	. Vietnamese: 40,090

	. Korean:  39,748

	. Korean with Word-sense disambiguation: 68,719

. The Korean word-sense disambiguation and morphological analysis were performed by UTagger (http://nlplab.ulsan.ac.kr/doku.php?id=utagger), which consists of the following processes:

	. Sense-code tagging (a sense code, defined in the Standard Korean Language Dictionary, identifies a specific sense of a word)

x_original_sample.txt and x_WSD_MA_sample.txt are the sample files with 5,000 sentence pairs. To use the full corpus, please contact us by e-mail: haivv279@gmail.com
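The kinds of statistics listed for each corpus (sentence count, token count, vocabulary size, average sentence length) can be recomputed on one side of a sample file with a short script. The following is a minimal sketch, not the method used to produce the figures in this README. It assumes one sentence per line and whitespace-separated tokens, neither of which is confirmed here, and the function name `corpus_stats` is hypothetical.

```python
from collections import Counter


def corpus_stats(lines):
    """Return sentence count, token count, vocabulary size and average length.

    Assumption (not confirmed by the README): one sentence per line,
    tokens separated by whitespace.
    """
    vocab = Counter()
    n_sentences = 0
    n_tokens = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        n_sentences += 1
        n_tokens += len(tokens)
        vocab.update(tokens)
    # Average tokens per sentence; 0.0 for empty input to avoid dividing by zero.
    avg = n_tokens / n_sentences if n_sentences else 0.0
    return {
        "sentences": n_sentences,
        "tokens": n_tokens,
        "vocabulary": len(vocab),
        "avg_length": round(avg, 1),
    }
```

To apply it to a file, open the file with `encoding="utf-8"` and pass the file object to `corpus_stats`. It iterates line by line, so the whole file is never loaded into memory at once.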
