## Pretokenize
* Korean: Mecab
* Japanese: Juman++
* English: Moses
* Chinese: Stanford

In [1]:
from konlpy.tag import Mecab
tokenizer = Mecab()

In [2]:
text = '안녕하세요, 저는 번역을 하기가 존나 싫습니다. 진짜로 존나 싫습니다.'

In [3]:
' '.join(tokenizer.morphs(text))

'안녕 하 세요 , 저 는 번역 을 하 기 가 존나 싫 습니다 . 진짜로 존나 싫 습니다 .'

In [4]:
txt_file = open('JPC4.3/traindev/ko-ja/train.ko', encoding='utf-8').read().split('\n')

In [5]:
# The last element is an empty string (the file ends with a newline).
len(txt_file[:-1])

1000000
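The extra empty element appears because `str.split('\n')` keeps an empty string after a final newline, whereas `str.splitlines()` does not. The snippet below is only an illustration on a made-up string (not the corpus). Note that `splitlines()` also breaks on a few other separators such as `\r` and `\u2028`, so its line count could differ on unusual input.

```python
# Made-up two-line sample that ends with a newline, like the corpus file.
sample = 'line one\nline two\n'

with_split = sample.split('\n')         # keeps a trailing '' after the last '\n'
with_splitlines = sample.splitlines()   # no trailing empty element

print(len(with_split), len(with_splitlines))
```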

In [6]:
txt_list = txt_file[:-1]

In [7]:
tokenize_list = [' '.join(tokenizer.morphs(line)) for line in txt_file[:-1]]

In [8]:
len(tokenize_list)

1000000

In [9]:
len(txt_list)

1000000

In [10]:
txt_list[1]

'그래서, N 의 함유량은 0.01 ％ 이하로 한정한다.'

In [11]:
tokenize_list[1]

'그래서 , N 의 함유량 은 0 . 01 ％ 이하 로 한정 한다 .'
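Keeping both the raw and the tokenized corpus in memory works for one million lines, but the same step could also be streamed line by line and written to disk. The following is a minimal sketch under stated assumptions: `pretokenize_file` and the file names are hypothetical, and `tokenize` is any callable that returns a list of tokens, such as `tokenizer.morphs` above. A whitespace splitter stands in for it here so the sketch runs without MeCab.

```python
import os
import tempfile


def pretokenize_file(src_path, dst_path, tokenize):
    """Write one space-joined token line per input line; return the line count."""
    count = 0
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            line = line.rstrip('\n')
            dst.write(' '.join(tokenize(line)) + '\n')
            count += 1
    return count


# Demo on a temporary file; str.split is only a stand-in for tokenizer.morphs.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical file name
dst_path = os.path.join(tmp_dir, 'sample.tok')   # hypothetical file name
with open(src_path, 'w', encoding='utf-8') as f:
    f.write('a b,c\nd  e\n')

n_lines = pretokenize_file(src_path, dst_path, str.split)
with open(dst_path, encoding='utf-8') as f:
    tokenized = f.read()
```

Comparing the returned count with the line count of the other side of the parallel corpus is a cheap way to confirm that source and target stay aligned.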

### Japanese tokenization with Juman++

pyknp: a Python module for JUMAN++

Installation guide: https://deepage.net/machine_learning/2017/01/16/juman++.html

* pyknp-0.3 must be installed

In [4]:
from pyknp import Jumanpp
juman = Jumanpp()
result = juman.analysis('チンパンジーがじゃんけんを学習することを発見した')
result_token = [mrph.midasi for mrph in result.mrph_list()]
result_token

['チンパンジー', 'が', 'じゃんけん', 'を', '学習', 'する', 'こと', 'を', '発見', 'した']

In [5]:
ja_file = open('JPC4.3/traindev/ko-ja/train.ja', encoding='utf-8').read().split('\n')

In [6]:
ja_file[1]

'そのため、Nの含有量は0.01%以下に限定する。'

In [7]:
token_ja = juman.analysis(ja_file[1])
token_ja

<pyknp.juman.mlist.MList at 0x10cf6c2e8>

In [10]:
' '.join([mrph.midasi for mrph in token_ja.mrph_list()])

'その ため 、 N の 含有 量 は 0 . 01 % 以下 に 限定 する 。'
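To apply the same `midasi` join to many lines, the pattern above could be wrapped in a small helper. This is a sketch, not part of pyknp: `juman_tokenize_lines` is a hypothetical name, and `analyze` is assumed to behave like `juman.analysis`, returning an object with `mrph_list()`. A stand-in analyzer is used so the snippet runs without JUMAN++ installed.

```python
from types import SimpleNamespace


def juman_tokenize_lines(lines, analyze):
    """Return one space-joined surface-form string per input line."""
    out = []
    for line in lines:
        if not line:
            # Keep empty lines as-is so line counts stay aligned.
            out.append('')
            continue
        result = analyze(line)
        out.append(' '.join(m.midasi for m in result.mrph_list()))
    return out


def fake_analyze(text):
    """Stand-in for juman.analysis: treats every character as one morpheme."""
    mrphs = [SimpleNamespace(midasi=ch) for ch in text]
    return SimpleNamespace(mrph_list=lambda: mrphs)


tokenized = juman_tokenize_lines(['学習する', ''], fake_analyze)
print(tokenized)
```

With the real analyzer, the call would be `juman_tokenize_lines(ja_file[:-1], juman.analysis)`.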

### English tokenization with Moses
NLTK 3.2.5 still included the Moses tokenizer; later releases removed it, so that version is installed here.

In [12]:
eng_file = open('JPC4.3/traindev/en-ja/train.en', encoding='utf-8').read().split('\n')

In [13]:
len(eng_file)

1000001

In [14]:
!pip install nltk==3.2.5


In [17]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [19]:
from nltk.tokenize.moses import MosesTokenizer

In [20]:
mose_tokenizer = MosesTokenizer()

In [23]:
text = 'hi, I hate translation, really motherfucking translation'

In [24]:
mose_tokenizer.tokenize(text)

['hi',
 ',',
 'I',
 'hate',
 'translation',
 ',',
 'really',
 'motherfucking',
 'translation']

In [25]:
eng_file[1]

'The needle 103 is received by the hub 111 .'

In [26]:
# English text is already well segmented by whitespace, so tokens mostly follow the spaces
' '.join(mose_tokenizer.tokenize(eng_file[1]))

'The needle 103 is received by the hub 111 .'
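Since `nltk.tokenize.moses` is missing from newer NLTK releases, the separately maintained `sacremoses` package is a possible alternative. The sketch below assumes `sacremoses` is installed (`pip install sacremoses`). The whitespace fallback exists only so the snippet still runs without the package; it is not equivalent to Moses on text that has not already been tokenized.

```python
try:
    from sacremoses import MosesTokenizer

    _moses = MosesTokenizer(lang='en')

    def moses_tokenize(text):
        """Tokenize with the Moses rules provided by sacremoses."""
        return _moses.tokenize(text)

except ImportError:

    def moses_tokenize(text):
        """Fallback (assumption): plain whitespace split, NOT real Moses."""
        return text.split()


tokens = moses_tokenize('The needle 103 is received by the hub 111 .')
print(' '.join(tokens))
```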

### Chinese tokenization with Stanford Parser

Stanford CoreNLP is written in Java.

Stanza: the Python library from the Stanford NLP Group

In [56]:
conda install -c stanfordnlp stanza


In [3]:
import stanza
# download chinese model
stanza.download('zh') 

2021-06-14 18:13:57 INFO: "zh" is an alias for "zh-hans"
2021-06-14 18:13:57 INFO: Downloading default packages for language: zh-hans (Simplified_Chinese)...
2021-06-14 18:16:36 INFO: Finished downloading models and saved to /Users/kkyuhun/stanza_resources.


In [24]:
# Build Chinese model 
zh_nlp = stanza.Pipeline('zh', processors='tokenize', verbose=False, use_gpu=False)

In [5]:
zh_file = open('JPC4.3/traindev/zh-ja/train.zh', encoding='utf-8').read().split('\n')

In [6]:
zh_file[1]

'搅拌器28具有转子32。'

In [11]:
zh_doc = zh_nlp(zh_file[1])
print(type(zh_doc))

<class 'stanza.models.common.doc.Document'>


In [32]:
for i, sentence in enumerate(zh_doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')

====== Sentence 1 tokens =======
id: (1,)	text: 搅拌
id: (2,)	text: 器
id: (3,)	text: 28
id: (4,)	text: 具有
id: (5,)	text: 转子
id: (6,)	text: 32
id: (7,)	text: 。


In [43]:
print([sentence.text for sentence in zh_doc.sentences])

['搅拌器28具有转子32。']


In [52]:
for i, sentence in enumerate(zh_doc.sentences):
    print(' '.join([token.text for token in sentence.tokens]))

搅拌 器 28 具有 转子 32 。
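The loop above prints one row per sentence, so a corpus line that the pipeline splits into several sentences would produce several output rows. To keep one output line per input line, the tokens of all sentences can be joined into a single string. This is a sketch with a hypothetical helper name; the stand-in objects only mimic the `sentences`, `tokens`, and `text` attributes used above, so it runs without downloading the model.

```python
from types import SimpleNamespace


def doc_to_token_line(doc):
    """Join the tokens of every sentence in a document into one line."""
    return ' '.join(
        token.text for sentence in doc.sentences for token in sentence.tokens
    )


def _fake_sentence(words):
    """Stand-in for a Stanza sentence: only provides .tokens[i].text."""
    return SimpleNamespace(tokens=[SimpleNamespace(text=w) for w in words])


# Stand-in document with two sentences (the second one is made up).
fake_doc = SimpleNamespace(sentences=[
    _fake_sentence(['搅拌', '器', '28', '具有', '转子', '32', '。']),
    _fake_sentence(['好', '。']),
])
line = doc_to_token_line(fake_doc)
print(line)
```

With the pipeline built earlier, the call would be `doc_to_token_line(zh_nlp(zh_file[1]))`.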
