Apply the English translation of modu_written.md #156

Merged
merged 9 commits into from
Nov 11, 2020
47 changes: 46 additions & 1 deletion en-docs/corpuslist/modu_written.md
@@ -4,4 +4,49 @@ sort: 19

# Modu: Written

TBD
Modu: Written is a dataset released by the National Institute of Korean Language.

Review comment (Member): I have reviewed the content up to this point. Thank you for the edits, @hank110.

Review comment (Member): I opened a related issue at #165. I am leaving this comment as an index for later revisions.

Data specification is as follows.

- author: National Institute of Korean Language
- repository: [https://corpus.korean.go.kr](https://corpus.korean.go.kr)
- references: [document](https://rlkujwkk7.toastcdn.net/NIKL_WRITTEN(v1.0).pdf)

Review comment (Member): Entering the URL of the Modu corpus guide directly returns "access denied"; the link works only after you have downloaded the guide at least once directly from the National Institute of Korean Language website. Since the guide is a public document, how about mirroring just the guide file in korpora? @ratsgo

- size:
    - train: 20,188 examples

```warning
Due to the licensing terms of the Modu corpus, `Korpora` does not provide a download function for this corpus; it only provides a load function.
If you wish to use this corpus, please complete the authentication process required by the National Institute of Korean Language and download the corpus manually.
```

You can load the corpus from your Python console as follows.

```python
from Korpora import Korpora
corpus = Korpora.load("modu_written")
```

```warning
The code assumes that the corpus has already been unzipped into the `NIKL_WRITTEN` directory within `~/Korpora` (`~/Korpora/NIKL_WRITTEN`).
If the root directory is not `~/Korpora`, pass the `root_dir=custom_path` argument to the `load` method.
```
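
Because the corpus must be downloaded and unzipped manually, it can help to check the expected directory before calling `load`. The following is a minimal sketch that uses only the Python standard library; the helper names `expected_corpus_dir` and `corpus_is_unzipped` are hypothetical and are not part of `Korpora`.

```python
from pathlib import Path


def expected_corpus_dir(root_dir="~/Korpora", name="NIKL_WRITTEN"):
    """Return the directory where the unzipped corpus is expected to be.

    Hypothetical helper: it only mirrors the path convention described
    in the warning above (`<root_dir>/NIKL_WRITTEN`).
    """
    return Path(root_dir).expanduser() / name


def corpus_is_unzipped(root_dir="~/Korpora", name="NIKL_WRITTEN"):
    """Return True if the expected directory exists and is not empty."""
    path = expected_corpus_dir(root_dir, name)
    return path.is_dir() and any(path.iterdir())


if __name__ == "__main__":
    if corpus_is_unzipped():
        print("Corpus directory found:", expected_corpus_dir())
    else:
        print("Please unzip the corpus into", expected_corpus_dir())
```

This check only confirms that the directory exists and contains at least one entry; it does not validate the files inside it.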

You can also load the corpus as follows.
The output of this code is identical to that of the previous code.

```python
from Korpora import ModuWrittenKorpus
corpus = ModuWrittenKorpus()
```

```warning
The code assumes that the corpus has already been unzipped into `~/Korpora/NIKL_WRITTEN` under the current user's home directory.
If the corpus is in another directory, pass the `root_dir_or_paths=custom_path` argument when instantiating `ModuWrittenKorpus`.
```

Either of the previous examples loads the corpus into the variable `corpus`.
`train` refers to the training dataset of the corpus; you can check its first training instance as follows.

```
>>> corpus.train[0]
01범보다 무서운 곶감
```
113 changes: 112 additions & 1 deletion en-docs/corpuslist/namuwikitext.md
@@ -4,4 +4,115 @@ sort: 8

# NamuWikiText

TBD
NamuWikiText is a dataset released by lovit@github. It provides the contents of Namuwiki in a plain-text format.
Data specification is as follows.

- author: lovit@github
- repository: [https://github.com/lovit/namuwikitext](https://github.com/lovit/namuwikitext)
- size:
    - train: 31,235,096 lines (500,104 docs, 4.6G)
    - dev: 153,605 lines (2,525 docs, 23M)
    - test: 160,233 lines (2,527 docs, 24M)

Data structure is as follows:

|Attributes|Property|
|---|---|
|text|a body of a section|
|pair|a title of a section|


## 1. Using in Python

You can download and load the corpus from a Python console.

### Downloading the corpus

You can download the NamuWikiText corpus into your local directory with the following Python code.

```python
from Korpora import Korpora
Korpora.fetch("namuwikitext")
```

```note
By default, the corpus is downloaded to the Korpora directory within the user's home directory (`~/Korpora`). If you wish to download the corpus to another directory,
pass the `root_dir=custom_path` argument to the fetch method.
```

```tip
When the fetch method is executed with `force_download=True` argument, it ignores the existing corpus in the local directory and re-downloads the corpus. The default value of `force_download` is `False`.
```


### Loading the corpus

You can load the NamuWikiText corpus from your Python console with the following code.
If the corpus does not exist in the local directory, it is downloaded first.

```python
from Korpora import Korpora
corpus = Korpora.load("namuwikitext")
```

You can also load the corpus as follows.
The output of this code is identical to that of the previous code.

```python
from Korpora import NamuwikiTextKorpus
corpus = NamuwikiTextKorpus()
```

Either of the previous examples loads the corpus into the variable `corpus`.
`train` refers to the training dataset of the corpus; you can check its first training instance as follows.

```
>>> corpus.train[0]
SentencePair(text='상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...', pair=' = 아스날 FC/2010-11 시즌 =')
>>> corpus.train[0].text
상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...
>>> corpus.train[0].pair
= 아스날 FC/2010-11 시즌 =
```
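
As the output above shows, the `pair` attribute keeps the `=` markers around the section title. If you need the bare title, a minimal sketch such as the following may help. The `SentencePair` namedtuple here is only a stand-in that models the two attributes shown above so that the example runs without the corpus, and `clean_title` is a hypothetical helper, not part of `Korpora`.

```python
from collections import namedtuple

# Stand-in for the object returned by corpus.train[0]; it models only the
# `text` and `pair` attributes shown in the output above.
SentencePair = namedtuple("SentencePair", ["text", "pair"])


def clean_title(pair):
    """Strip surrounding whitespace and '=' markers from a section title.

    Hypothetical helper: '=' characters inside the title are left untouched.
    """
    return pair.strip().strip("=").strip()


example = SentencePair(
    text="상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...",
    pair=" = 아스날 FC/2010-11 시즌 =",
)
print(clean_title(example.pair))
```

With a loaded corpus, the same helper could be applied as `clean_title(corpus.train[0].pair)`.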

`dev` and `test` refer to the validation and test datasets of the corpus, respectively. The first instance of each can be accessed as follows.

```
>>> corpus.dev[0]
SentencePair(text='상위 항목: 축구 관련 인물, 외국인 선수/역대 프로축구\n...', pair=' = 소말리아(축구선수) =')
>>> corpus.test[0]
SentencePair(text='', pair=' = 덴덴타운 =')
```

By executing the `get_all_texts` method, you can access all texts (bodies of sections) within the corpus.

```
>>> corpus.get_all_texts()
['상위 문서: 아스날 FC\n2009-10 시즌 2011-12 시즌\n2010 -11 시즌...', ... ]
```

By executing the `get_all_pairs` method, you can access all pairs (titles of sections) within the corpus.

```
>>> corpus.get_all_pairs()
['= 아스날 FC/2010-11 시즌 =', ... ]
```
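
Some sections have an empty body, as `corpus.test[0]` above shows. The following minimal sketch pairs each title with its body and drops the empty ones. It assumes that the lists returned by `get_all_texts` and `get_all_pairs` are aligned by index, which the examples above suggest but do not state explicitly; `non_empty_sections` is a hypothetical helper, and the sample lists stand in for the real method outputs.

```python
def non_empty_sections(texts, pairs):
    """Return (title, body) tuples for sections whose body is not blank.

    Assumes texts[i] and pairs[i] describe the same section.
    """
    if len(texts) != len(pairs):
        raise ValueError("texts and pairs must have the same length")
    return [(title, body) for body, title in zip(texts, pairs) if body.strip()]


# Sample values copied from the dev[0] and test[0] outputs above; with a
# loaded corpus you would pass corpus.get_all_texts() and
# corpus.get_all_pairs() instead.
sample_texts = ["상위 항목: 축구 관련 인물, 외국인 선수/역대 프로축구\n...", ""]
sample_pairs = [" = 소말리아(축구선수) =", " = 덴덴타운 ="]

for title, body in non_empty_sections(sample_texts, sample_pairs):
    print(title.strip(), "->", len(body), "characters")
```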


## 2. Using in a terminal

You can download the corpus directly without starting a Python console.
To do so, use the following command.

```bash
korpora fetch --corpus namuwikitext
```

```note
By default, the corpus is downloaded to the Korpora directory within the user's home directory (`~/Korpora`). If you wish to download the corpus to another directory,
add the `--root_dir custom_path` argument to the fetch command.
```

```tip
If you add `--force_download` argument when executing the fetch command in the terminal, it ignores the existing corpus in the local directory and re-downloads the corpus.
```
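
If you want to run the fetch command from a script, you can assemble the argument list first. The following is a minimal sketch that only builds the command from the `--corpus`, `--root_dir`, and `--force_download` options described above; `build_fetch_command` is a hypothetical helper, and the sketch does not execute anything.

```python
import shlex


def build_fetch_command(corpus, root_dir=None, force_download=False):
    """Build the argument list for the `korpora fetch` command.

    Hypothetical helper: only the options documented above are supported.
    """
    command = ["korpora", "fetch", "--corpus", corpus]
    if root_dir is not None:
        command += ["--root_dir", str(root_dir)]
    if force_download:
        command.append("--force_download")
    return command


if __name__ == "__main__":
    # Print the command as a shell-quoted string for inspection.
    print(shlex.join(build_fetch_command("namuwikitext", force_download=True)))
```

The returned list can be passed to `subprocess.run` once the `korpora` command is installed and available on your `PATH`.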
114 changes: 113 additions & 1 deletion en-docs/corpuslist/naver_changwon_ner.md
@@ -4,4 +4,116 @@ sort: 9

# NAVER x Changwon NER

TBD
NAVER x Changwon NER is a named entity recognition (NER) dataset released by NAVER and Changwon National University. It provides sentences whose words are annotated with entity tags.
Data specification is as follows.

- author: Naver + Changwon National University
- repository: [https://github.com/naver/nlp-challenge/tree/master/missions/ner](https://github.com/naver/nlp-challenge/tree/master/missions/ner)
- reference: [http://air.changwon.ac.kr/?page_id=10](http://air.changwon.ac.kr/?page_id=10)
- size:
    - train: 90,000 examples

Data structure is as follows:

|Attributes|Property|
|---|---|
|text|a string of space delimited words|
|words|a word sequence|
|tags|a sequence of entity tags of words|


## 1. Using in Python

You can download and load the corpus from a Python console.

### Downloading the corpus

You can download the NAVER x Changwon NER corpus into your local directory with the following Python code.

```python
from Korpora import Korpora
Korpora.fetch("naver_changwon_ner")
```

```note
By default, the corpus is downloaded to the Korpora directory within the user's home directory (`~/Korpora`). If you wish to download the corpus to another directory,
pass the `root_dir=custom_path` argument to the fetch method.
```

```tip
When the fetch method is executed with `force_download=True` argument, it ignores the existing corpus in the local directory and re-downloads the corpus. The default value of `force_download` is `False`.
```


### Loading the corpus

You can load the NAVER x Changwon NER corpus from your Python console with the following code.
If the corpus does not exist in the local directory, it is downloaded first.

```python
from Korpora import Korpora
corpus = Korpora.load("naver_changwon_ner")
```

You can also load the corpus as follows.
The output of this code is identical to that of the previous code.

```python
from Korpora import NaverChangwonNERKorpus
corpus = NaverChangwonNERKorpus()
```

Either of the previous examples loads the corpus into the variable `corpus`.
`train` refers to the training dataset of the NAVER x Changwon NER corpus; you can check its first training instance as follows.

```
>>> corpus.train[0]
WordTag(text='비토리오 양일 만에 영사관 감호 용퇴, 항룡 압력설 의심만 가율 ', words=['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율'], tags=['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-'])
>>> corpus.train[0].text
비토리오 양일 만에 영사관 감호 용퇴, 항룡 압력설 의심만 가율
>>> corpus.train[0].words
['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율']
>>> corpus.train[0].tags
['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-']
```
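
Because `words` and `tags` have the same length, you can align them to collect the tagged entities. The following is a minimal sketch with a hypothetical helper, `extract_entities`. It treats a tag ending in `_B` as the start of an entity and `-` as a non-entity word, as in the output above. The `_I` suffix for continuation words is an assumption, since the sample above contains only `_B` and `-` tags.

```python
def extract_entities(words, tags):
    """Collect (entity_text, entity_type) tuples from aligned words and tags.

    Assumes '<TYPE>_B' starts an entity, '<TYPE>_I' (an assumption, not
    shown in the sample) continues it, and any other tag ends it.
    """
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        label, _, position = tag.rpartition("_")
        if position == "B" and label:
            # A new entity starts; store the previous one if there is any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], label
        elif position == "I" and current_words and label == current_type:
            current_words.append(word)
        else:
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Values copied from corpus.train[0] above.
words = ['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율']
tags = ['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-']
print(extract_entities(words, tags))
# [('비토리오', 'PER'), ('양일', 'DAT'), ('영사관', 'ORG'), ('감호', 'CVL')]
```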

By executing the `get_all_words` method, you can access all words (word sequences) within NAVER x Changwon NER corpus.

```
>>> corpus.get_all_words()
[['비토리오', '양일', '만에', '영사관', '감호', '용퇴,', '항룡', '압력설', '의심만', '가율'], ... ]
```

By executing the `get_all_tags` method, you can access all tags (a sequence of entity tags of words) within the corpus.

```
>>> corpus.get_all_tags()
[['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-'], ... ]
```
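
The nested list returned by `get_all_tags` can be flattened to see how often each tag occurs. The following is a minimal sketch using the standard library; `count_tags` is a hypothetical helper, and the sample list stands in for the real method output.

```python
from collections import Counter
from itertools import chain


def count_tags(all_tags):
    """Count tag occurrences across a list of tag sequences."""
    return Counter(chain.from_iterable(all_tags))


# One tag sequence copied from the output above; with a loaded corpus you
# would pass corpus.get_all_tags() instead.
sample_tags = [['PER_B', 'DAT_B', '-', 'ORG_B', 'CVL_B', '-', '-', '-', '-', '-']]
tag_counts = count_tags(sample_tags)
print(tag_counts.most_common(1))
# [('-', 6)]
```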

By executing the `get_all_texts` method, you can access all texts (a string of space delimited words) within the corpus.

```
>>> corpus.get_all_texts()
['비토리오 양일 만에 영사관 감호 용퇴, 항룡 압력설 의심만 가율 ', ... ]
```



## 2. Using in a terminal

You can download the corpus directly without starting a Python console.
To do so, use the following command.

```bash
korpora fetch --corpus naver_changwon_ner
```

```note
By default, the corpus is downloaded to the Korpora directory within the user's home directory (`~/Korpora`). If you wish to download the corpus to another directory,
add the `--root_dir custom_path` argument to the fetch command.
```

```tip
If you add `--force_download` argument when executing the fetch command in the terminal, it ignores the existing corpus in the local directory and re-downloads the corpus.
```