lovit/kowikitext

Ko-wikitext

Wikitext format Korean corpus

These are wikitext-format text files built from the Korean Wikipedia dump data.

Corpus size

  • train : 14528095 lines (868230 articles)
  • dev : 76788 lines (4385 articles)
  • test : 70774 lines (4386 articles)

To fetch the data, run the script below. The three corpus files (train / dev / test) are downloaded to ./data/

python fetch.py
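As a rough way to check a downloaded file against the corpus sizes listed above, the sketch below counts lines and articles in wikitext-format text. It assumes the WikiText convention in which each article starts with a top-level heading line such as ` = Title = ` and subsections use ` = = Section = = `. This convention is an assumption and is not stated in this README, and the helper name `count_lines_and_articles` is hypothetical.

```python
import re

# Assumption: a top-level article heading looks like " = Title = ",
# while subsection headings look like " = = Section = = ".
ARTICLE_HEADING = re.compile(r"^\s*=\s+[^=\s].*?\s+=\s*$")


def count_lines_and_articles(lines):
    """Return (number of lines, number of top-level article headings)."""
    n_lines = 0
    n_articles = 0
    for line in lines:
        n_lines += 1
        if ARTICLE_HEADING.match(line):
            n_articles += 1
    return n_lines, n_articles


# Tiny in-memory sample in the assumed format.
sample = [
    " = 서울 = \n",
    " \n",
    " 서울은 대한민국의 수도이다 . \n",
    " = = 역사 = = \n",
    " 본문 \n",
    " = 부산 = \n",
    " 부산은 항구 도시이다 . \n",
]
print(count_lines_and_articles(sample))
```

To run it on real data, pass an open file handle (for example one opened with `encoding="utf-8"`) for a file under ./data/. The exact file names depend on what fetch.py downloads and are not listed here.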

This corpus is licensed under CC-BY-SA 3.0, the same license as kowiki. For details, visit https://www.creativecommons.org/licenses/by-sa/3.0/

Fetch and load using Korpora

Korpora is an archive of Korean corpora, implemented in Python. The fetch / load functions for this corpus will be provided through the Korpora project.

from Korpora import Korpora

kowikitext = Korpora.load('kowikitext')

# or, to only download the corpus files
Korpora.fetch('kowikitext')

License

CC-BY-SA 3.0, the same license as the kowiki dump dataset.